What is the derivation of the derivative of softmax regression (or multinomial logistic regression)?

Consider the training cost for softmax regression (I will use the term multinomial logistic regression):



$$ J(\theta) = - \sum^m_{i=1} \sum^K_{k=1} 1\{y^{(i)} = k\} \log p(y^{(i)} = k \mid x^{(i)} ; \theta) $$



According to the UFLDL tutorial, the derivative of the above function is:



$$ \nabla_{\theta^{(k)}} J(\theta) = -\sum^m_{i=1} \left[ x^{(i)} \left( 1\{y^{(i)} = k\} - p(y^{(i)} = k \mid x^{(i)} ; \theta) \right) \right] $$



However, they didn't include the derivation. Does anyone know how it is derived?



I have tried taking the derivative myself, but even my initial steps seem to disagree with the final form they give.



So I first took the gradient $\nabla_{\theta^{(k)}} J(\theta)$ as they suggested:



$$ \nabla_{\theta^{(k)}} J(\theta) = - \nabla_{\theta^{(k)}} \sum^m_{i=1} \sum^K_{k=1} 1\{y^{(i)} = k\} \log p(y^{(i)} = k \mid x^{(i)} ; \theta) $$



but since we are taking the gradient with respect to $\theta^{(k)}$, only the term that matches this specific $k$ will be non-zero when we take derivatives. Hence:



$$ \nabla_{\theta^{(k)}} J(\theta) = - \sum^m_{i=1} \nabla_{\theta^{(k)}} \log p(y^{(i)} = k \mid x^{(i)} ; \theta) $$



Then, if we proceed, we get:



$$ - \sum^m_{i=1} \frac{1}{p(y^{(i)} = k \mid x^{(i)} ; \theta)} \nabla_{\theta^{(k)}} p(y^{(i)} = k \mid x^{(i)} ; \theta) $$



However, at this point the equation looks very different from what the UFLDL tutorial has, and the indicator function has disappeared completely, which makes me suspect I made a mistake somewhere. On top of that, the final derivative contains a difference, but I don't see any subtraction arising anywhere in my derivation. I suspect a difference might come in when applying the quotient rule, but the disappearing indicator function still worries me. Any ideas?
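As a sanity check, the claimed gradient can at least be compared against a finite-difference approximation. Below is a minimal NumPy sketch of that check (my own illustration, not from the tutorial; I store the parameters as an $n \times K$ matrix `Theta` whose $k$-th column plays the role of $\theta^{(k)}$):

```python
import numpy as np

def softmax(Z):
    # Row-wise softmax, shifted for numerical stability.
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def cost(Theta, X, y):
    # J(theta) = -sum_i sum_k 1{y_i = k} log p(y_i = k | x_i; theta)
    P = softmax(X @ Theta)                     # shape (m, K)
    return -np.log(P[np.arange(len(y)), y]).sum()

def grad(Theta, X, y):
    # Claimed gradient: column k is -sum_i x_i (1{y_i = k} - p(y_i = k | x_i; theta)).
    P = softmax(X @ Theta)
    Y = np.zeros_like(P)
    Y[np.arange(len(y)), y] = 1.0              # one-hot indicators 1{y_i = k}
    return -X.T @ (Y - P)                      # shape (n, K)

rng = np.random.default_rng(0)
m, n, K = 20, 5, 4
X, y = rng.normal(size=(m, n)), rng.integers(0, K, size=m)
Theta = rng.normal(size=(n, K))

# Central finite differences, entry by entry.
G_num = np.zeros_like(Theta)
eps = 1e-6
for j in range(n):
    for k in range(K):
        E = np.zeros_like(Theta); E[j, k] = eps
        G_num[j, k] = (cost(Theta + E, X, y) - cost(Theta - E, X, y)) / (2 * eps)

print(np.abs(grad(Theta, X, y) - G_num).max())  # tiny (on the order of 1e-9 here)
```

The check only confirms the formula numerically, of course; what I am after is the derivation itself.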







  • I think I got the solution. Will write it up later :)
    – Charlie Parker
    Sep 9 '15 at 22:41











  • Well, I'm waiting for the solution. I'm not getting the solution that is given either.
    – lars
    Feb 9 '16 at 18:10










  • @lars will a handwritten one do for you? I'm too lazy to write it up in latex :p (let me see if I find the paper, wrote it a couple of months ago but never uploaded it cuz I didn't want to write it in latex)
    – Charlie Parker
    Feb 9 '16 at 23:16










  • @lars I have written it for you. Hope it helps.
    – lerner
    Jan 3 '17 at 22:55














edited Sep 9 '15 at 22:41

























asked Sep 9 '15 at 17:21









Charlie Parker












1 Answer






I encountered the same problem and tackled it after understanding the algorithm. Here I'd like to explain a bit.



$$ J(\theta) = - \sum^m_{i=1} \sum^K_{k=1} 1\{y^{(i)} = k\} \log p(y^{(i)} = k \mid x^{(i)} ; \theta) $$



Just picture the summand as an $m \times K$ matrix with each entry $(i, k)$ being a probability, each row corresponding to the $i$th sample (or observation $i$) and each column to a category $k$.



Let's expand $p$ and rewrite the equation as:



$$ J(\theta) = - \sum^m_{i=1} \sum^K_{k=1} 1\{y^{(i)}=k\} \log \frac{e^{\theta^T_k x^{(i)}}}{\sum^K_{l=1} e^{\theta^T_l x^{(i)}}} $$



And rearranging gives:

$$ J(\theta) = - \sum^m_{i=1} \sum^K_{k=1} 1\{y^{(i)}=k\} \left[ \log e^{\theta^T_k x^{(i)}} - \log \sum^K_{l=1} e^{\theta^T_l x^{(i)}} \right] $$



The partial derivative of $J$ with respect to $\theta_k$ is (treating $1\{y^{(i)}=k\}$ as a constant):



$$ \nabla_{\theta_k} J(\theta) = - \sum^m_{i=1} \left( 1\{y^{(i)}=k\} \left[ x^{(i)} - \underbrace{\frac{e^{\theta^T_k x^{(i)}}}{\sum^K_{l=1} e^{\theta^T_l x^{(i)}}}}_{p(y^{(i)} = k \mid x^{(i)} ; \theta)} x^{(i)} \right] + 1\{y^{(i)} \neq k\} \left[ - \underbrace{\frac{e^{\theta^T_k x^{(i)}}}{\sum^K_{l=1} e^{\theta^T_l x^{(i)}}}}_{p(y^{(i)} = k \mid x^{(i)} ; \theta)} x^{(i)} \right] \right) $$



Note that when differentiating $\sum^K_{l=1} e^{\theta^T_l x^{(i)}}$ with respect to $\theta_k$, only one term in the sum depends on $\theta_k$, namely $e^{\theta^T_k x^{(i)}}$.
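Spelling that step out (just the chain rule applied to the log-sum-exp term, using the same notation as above):

$$ \nabla_{\theta_k} \log \sum^K_{l=1} e^{\theta^T_l x^{(i)}} = \frac{\nabla_{\theta_k} \sum^K_{l=1} e^{\theta^T_l x^{(i)}}}{\sum^K_{l=1} e^{\theta^T_l x^{(i)}}} = \frac{e^{\theta^T_k x^{(i)}}}{\sum^K_{l=1} e^{\theta^T_l x^{(i)}}}\, x^{(i)} = p(y^{(i)} = k \mid x^{(i)} ; \theta)\, x^{(i)}, $$

which is why the same $p(y^{(i)} = k \mid x^{(i)} ; \theta)\, x^{(i)}$ term shows up in both brackets above, regardless of whether $y^{(i)} = k$.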



Then we substitute $p$ back in and get:



$$ \nabla_{\theta_k} J(\theta) = -\sum^m_{i=1} \left[ x^{(i)} \left( 1\{y^{(i)} = k\} - p(y^{(i)} = k \mid x^{(i)} ; \theta) \right) \right] $$
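To make the final expression concrete, here is a direct NumPy translation for a single class $k$ (my own illustrative sketch; `Theta` is an $n \times K$ matrix whose $k$-th column is $\theta_k$, `X` stacks the $x^{(i)}$ as rows, and `y` holds the integer labels):

```python
import numpy as np

def grad_theta_k(Theta, X, y, k):
    # Computes -sum_i x_i * (1{y_i = k} - p(y_i = k | x_i; theta)) for one class k.
    logits = X @ Theta                           # (m, K) matrix of theta_l^T x_i
    logits -= logits.max(axis=1, keepdims=True)  # stabilize the exponentials
    expl = np.exp(logits)
    p_k = expl[:, k] / expl.sum(axis=1)          # p(y_i = k | x_i; theta) for every i
    indicator = (y == k).astype(float)           # 1{y_i = k}
    return -(X * (indicator - p_k)[:, None]).sum(axis=0)   # shape (n,)
```

Stacking these columns for $k = 1, \dots, K$ gives the full gradient used when running gradient descent on $J(\theta)$.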






        edited Aug 16 at 10:01









        chandresh











        answered Jan 3 '17 at 5:15









        lerner
