What is the derivation of the derivative of softmax regression (or multinomial logistic regression)?

Consider the training cost for softmax regression (I will use the term multinomial logistic regression):



$$ J(\theta) = -\sum_{i=1}^m \sum_{k=1}^K 1\{y^{(i)} = k\} \log p(y^{(i)} = k \mid x^{(i)}; \theta) $$



according to the UFLDL tutorial the derivative of the above function is:



$$ \nabla_{\theta^{(k)}} J(\theta) = -\sum_{i=1}^m \left[ x^{(i)} \left( 1\{y^{(i)} = k\} - p(y^{(i)} = k \mid x^{(i)}; \theta) \right) \right] $$



however, they didn't include the derivation. Does someone know what the derivation is?



I have tried taking the derivative myself, but even my initial steps seem to disagree with the final form they have.



So I first took the gradient $\nabla_{\theta^{(k)}} J(\theta)$ as they suggested:



$$ \nabla_{\theta^{(k)}} J(\theta) = -\nabla_{\theta^{(k)}} \sum_{i=1}^m \sum_{k=1}^K 1\{y^{(i)} = k\} \log p(y^{(i)} = k \mid x^{(i)}; \theta) $$



but since we are taking the gradient with respect to $\theta^{(k)}$, only the term that matches this specific k will be non-zero when we take derivatives. Hence:



$$ \nabla_{\theta^{(k)}} J(\theta) = -\sum_{i=1}^m \nabla_{\theta^{(k)}} \log p(y^{(i)} = k \mid x^{(i)}; \theta) $$



then if we proceed we get:



$$ -\sum_{i=1}^m \frac{1}{p(y^{(i)} = k \mid x^{(i)}; \theta)} \nabla_{\theta^{(k)}} p(y^{(i)} = k \mid x^{(i)}; \theta) $$



However, at this point the equation looks so different from what the UFLDL tutorial has, and the indicator function has disappeared so completely, that I suspect I made a mistake somewhere. On top of that, the final derivative contains a difference, but I don't see any differences/subtractions in my derivation. I suspect a difference might come in when applying the quotient rule, but the disappearance of the indicator function still worries me. Any ideas?







  • I think I got the solution. Will write it up later :)
    – Charlie Parker
    Sep 9 '15 at 22:41











  • Well, I'm waiting for the solution. I'm not getting the solution that is given either.
    – lars
    Feb 9 '16 at 18:10










  • @lars will a handwritten one do for you? I'm too lazy to write it up in latex :p (let me see if I find the paper, wrote it a couple of months ago but never uploaded it cuz I didn't want to write it in latex)
    – Charlie Parker
    Feb 9 '16 at 23:16










  • @lars I have written it for you. Hope it helps.
    – lerner
    Jan 3 '17 at 22:55














asked Sep 9 '15 at 17:21 by Charlie Parker, edited Sep 9 '15 at 22:41











1 Answer
I encountered the same problem and tackled it after understanding the algorithm. Here I'd like to explain a bit.



$$ J(\theta) = -\sum_{i=1}^m \sum_{k=1}^K 1\{y^{(i)} = k\} \log p(y^{(i)} = k \mid x^{(i)}; \theta) $$



Just imagine the probabilities as a 2D matrix whose item (i, k) is $p(y^{(i)} = k \mid x^{(i)}; \theta)$, with each row corresponding to the ith sample (or observation i) and each column to a category (k).
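To make the matrix picture concrete, here is a minimal sketch in plain Python. The function name `softmax_probabilities` and the toy values of `X` and `theta` are made up for illustration, and classes are indexed from 0 rather than 1. Subtracting the largest score before exponentiating is a common numerical-stability trick that is not part of the derivation; it leaves the probabilities unchanged.

```python
import math

def softmax_probabilities(theta, X):
    """Return P with P[i][k] = p(y = k | x_i; theta) under the softmax model.

    theta: list of K parameter vectors, X: list of m feature vectors.
    """
    P = []
    for x in X:
        # Score of each class: the inner product theta_k^T x.
        scores = [sum(t * xj for t, xj in zip(theta_k, x)) for theta_k in theta]
        top = max(scores)  # stability shift, cancels in the ratio below
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        P.append([e / total for e in exps])
    return P

# Hypothetical toy data: m = 3 samples, 2 features, K = 3 classes.
X = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3]]
theta = [[0.1, -0.2], [0.0, 0.4], [-0.3, 0.2]]

P = softmax_probabilities(theta, X)
for row in P:
    # Each row is a probability distribution over the K classes.
    print([round(p, 3) for p in row], "row sum:", round(sum(row), 6))
```

Each row sums to 1, which is what ties the columns together: changing one $\theta_k$ changes every entry in the row, not only the entry in column k.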



Let's expand p using the softmax definition, which turns the equation into this:



$$ J(\theta) = -\sum_{i=1}^m \sum_{k=1}^K 1\{y^{(i)} = k\} \log \frac{e^{\theta_k^T x^{(i)}}}{\sum_{l=1}^K e^{\theta_l^T x^{(i)}}} $$



And rearranging gives:

$$ J(\theta) = -\sum_{i=1}^m \sum_{k=1}^K 1\{y^{(i)} = k\} \left[ \log e^{\theta_k^T x^{(i)}} - \log \sum_{l=1}^K e^{\theta_l^T x^{(i)}} \right] $$



The partial derivative of $J$ with respect to $\theta_k$ is (treat $1\{y^{(i)} = k\}$ as a constant):



$$ \nabla_{\theta_k} J(\theta) = -\sum_{i=1}^m \left( 1\{y^{(i)} = k\} \left[ x^{(i)} - \underbrace{\frac{1}{\sum_{l=1}^K e^{\theta_l^T x^{(i)}}} e^{\theta_k^T x^{(i)}}}_{p(y^{(i)} = k \mid x^{(i)}; \theta)} x^{(i)} \right] + 1\{y^{(i)} \neq k\} \left[ -\underbrace{\frac{1}{\sum_{l=1}^K e^{\theta_l^T x^{(i)}}} e^{\theta_k^T x^{(i)}}}_{p(y^{(i)} = k \mid x^{(i)}; \theta)} x^{(i)} \right] \right) $$



Note that for the kth category only one term of the sum $\sum_{l=1}^K e^{\theta_l^T x^{(i)}}$ depends on $\theta_k$, namely $e^{\theta_k^T x^{(i)}}$, so $\nabla_{\theta_k} \sum_{l=1}^K e^{\theta_l^T x^{(i)}} = e^{\theta_k^T x^{(i)}} x^{(i)}$.
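As a sketch of why the normalizer matters for every sample, the same step can be written out as follows. Here $p_k^{(i)}$ is only a shorthand, introduced for this illustration, for $p(y^{(i)} = k \mid x^{(i)}; \theta)$. Since exactly one indicator equals 1 for each sample, the log-normalizer appears once per sample, whatever its label:

$$
\begin{aligned}
J(\theta) &= -\sum_{i=1}^m \left[ \sum_{j=1}^K 1\{y^{(i)} = j\}\, \theta_j^T x^{(i)} - \log \sum_{l=1}^K e^{\theta_l^T x^{(i)}} \right], \\
\nabla_{\theta_k} \log \sum_{l=1}^K e^{\theta_l^T x^{(i)}} &= \frac{e^{\theta_k^T x^{(i)}}}{\sum_{l=1}^K e^{\theta_l^T x^{(i)}}}\, x^{(i)} = p_k^{(i)}\, x^{(i)}.
\end{aligned}
$$

So a sample with $y^{(i)} \neq k$ still contributes $-p_k^{(i)} x^{(i)}$ through the normalizer, which is why the terms with $j \neq k$ cannot be dropped before differentiating.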



Then we substitute p back in and combine the two indicator cases (exactly one of them equals 1 for each sample) to get:



$$ \nabla_{\theta_k} J(\theta) = -\sum_{i=1}^m \left[ x^{(i)} \left( 1\{y^{(i)} = k\} - p(y^{(i)} = k \mid x^{(i)}; \theta) \right) \right] $$
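One way to gain confidence in this formula is to compare it against a finite-difference approximation of $J$. The sketch below does this in plain Python. All function names and the toy data are hypothetical, classes are indexed from 0, and the step size `eps` is an arbitrary choice.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def softmax(scores):
    top = max(scores)  # stability shift, does not change the result
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cost(theta, X, y):
    """J(theta) = -sum_i log p(y_i | x_i; theta)."""
    J = 0.0
    for x, label in zip(X, y):
        probs = softmax([dot(theta_k, x) for theta_k in theta])
        J -= math.log(probs[label])
    return J

def analytic_grad(theta, X, y, k):
    """Closed form: -sum_i x_i * (1{y_i = k} - p(y_i = k | x_i; theta))."""
    grad = [0.0] * len(theta[k])
    for x, label in zip(X, y):
        p_k = softmax([dot(theta_l, x) for theta_l in theta])[k]
        indicator = 1.0 if label == k else 0.0
        for j, xj in enumerate(x):
            grad[j] -= xj * (indicator - p_k)
    return grad

def numeric_grad(theta, X, y, k, eps=1e-6):
    """Central finite differences on each component of theta_k."""
    grad = []
    for j in range(len(theta[k])):
        plus = [row[:] for row in theta]
        minus = [row[:] for row in theta]
        plus[k][j] += eps
        minus[k][j] -= eps
        grad.append((cost(plus, X, y) - cost(minus, X, y)) / (2 * eps))
    return grad

# Hypothetical toy data: m = 4 samples, 2 features, K = 3 classes.
X = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3], [0.2, 0.7]]
y = [0, 2, 1, 0]
theta = [[0.1, -0.2], [0.0, 0.4], [-0.3, 0.2]]

for k in range(len(theta)):
    print("class", k, analytic_grad(theta, X, y, k), numeric_grad(theta, X, y, k))
```

For each class the two gradients should agree to several decimal places. Note that samples whose label is not k still contribute to the gradient for class k, through the $-p(y^{(i)} = k \mid x^{(i)}; \theta)$ term.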






answered Jan 3 '17 at 5:15 by lerner, edited Aug 16 at 10:01 by chandresh






















             
