What is the derivation of the derivative of softmax regression (or multinomial logistic regression)?

Consider the training cost for softmax regression (I will use the term multinomial logistic regression):



$$ J(\theta) = - \sum^m_{i=1} \sum^K_{k=1} 1\{y^{(i)} = k\} \log p(y^{(i)} = k \mid x^{(i)} ; \theta) $$



According to the UFLDL tutorial, the gradient of the above function is:



$$ \nabla_{\theta^{(k)}} J(\theta) = -\sum^m_{i=1} \left[ x^{(i)} \left( 1\{y^{(i)} = k\} - p(y^{(i)} = k \mid x^{(i)} ; \theta) \right) \right] $$



However, they do not include the derivation. Does anyone know how this is derived?



I have tried taking the derivative myself, but even my initial steps seem to disagree with the final form they give.



So I first took the gradient $\nabla_{\theta^{(k)}} J(\theta)$ as they suggested:



$$ \nabla_{\theta^{(k)}} J(\theta) = - \nabla_{\theta^{(k)}} \sum^m_{i=1} \sum^K_{k=1} 1\{y^{(i)} = k\} \log p(y^{(i)} = k \mid x^{(i)} ; \theta) $$



Since we are taking the gradient with respect to $\theta^{(k)}$, only the term that matches this specific $k$ will be non-zero when we take derivatives. Hence:



$$ \nabla_{\theta^{(k)}} J(\theta) = - \sum^m_{i=1} \nabla_{\theta^{(k)}} \log p(y^{(i)} = k \mid x^{(i)} ; \theta) $$



If we then proceed, we get:



$$ - \sum^m_{i=1} \frac{1}{p(y^{(i)} = k \mid x^{(i)} ; \theta)} \nabla_{\theta^{(k)}} p(y^{(i)} = k \mid x^{(i)} ; \theta) $$



However, at this point the equation looks so different from what the UFLDL tutorial has, and the indicator function has disappeared completely, that I suspect I made a mistake somewhere. On top of that, the final derivative contains a difference, yet I don't see any differences/subtractions anywhere in my derivation. I suspect a difference might appear when applying the quotient rule, but the vanishing indicator function still worries me. Any ideas?
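
(As a sanity check on the tutorial's formula itself, one can compare it against a finite-difference gradient of $J$; the following minimal NumPy sketch does that, and the two agree. The function names and the small random problem are my own, just for illustration.)

```python
import numpy as np

def probs(Theta, X):
    """p(y = k | x^(i); theta) for every sample i and class k."""
    logits = X @ Theta.T                             # shape (m, K)
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    E = np.exp(logits)
    return E / E.sum(axis=1, keepdims=True)

def J(Theta, X, y):
    """The training cost above: minus the summed log-probability of the true classes."""
    P = probs(Theta, X)
    return -np.log(P[np.arange(X.shape[0]), y]).sum()

def grad_k(Theta, X, y, k):
    """The tutorial's formula: -sum_i x^(i) (1{y^(i)=k} - p(y^(i)=k | x^(i); theta))."""
    P = probs(Theta, X)
    return -(((y == k).astype(float) - P[:, k])[:, None] * X).sum(axis=0)

rng = np.random.default_rng(0)
m, n, K = 50, 4, 3
X, y = rng.normal(size=(m, n)), rng.integers(0, K, size=m)
Theta = rng.normal(size=(K, n))

k, eps = 1, 1e-6
numeric = np.zeros(n)
for j in range(n):                                   # central finite differences
    Tp, Tm = Theta.copy(), Theta.copy()
    Tp[k, j] += eps
    Tm[k, j] -= eps
    numeric[j] = (J(Tp, X, y) - J(Tm, X, y)) / (2 * eps)

print(np.allclose(numeric, grad_k(Theta, X, y, k), atol=1e-5))   # prints True
```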







  • I think I got the solution. Will write it up later :)
    – Charlie Parker
    Sep 9 '15 at 22:41











  • Well, I'm waiting for the solution. I'm not getting the solution that is given either.
    – lars
    Feb 9 '16 at 18:10










  • @lars will a handwritten one do for you? I'm too lazy to write it up in latex :p (let me see if I find the paper, wrote it a couple of months ago but never uploaded it cuz I didn't want to write it in latex)
    – Charlie Parker
    Feb 9 '16 at 23:16










  • @lars I have written it for you. Hope it helps.
    – lerner
    Jan 3 '17 at 22:55














asked Sep 9 '15 at 17:21 by Charlie Parker (edited Sep 9 '15 at 22:41)











1 Answer

I encountered the same problem and tackled it after understanding the algorithm. Here I'd like to explain a bit.



$$ J(\theta) = - \sum^m_{i=1} \sum^K_{k=1} 1\{y^{(i)} = k\} \log p(y^{(i)} = k \mid x^{(i)} ; \theta) $$



Just imagine the result as a 2D matrix, with each entry $(i, k)$ being a probability, each row corresponding to the $i$th sample (or observation $i$), and each column corresponding to a category $k$.



Let's expand $p$ and rewrite the equation as:



$$ J(\theta) = - \sum^m_{i=1} \sum^K_{k=1} 1\{y^{(i)}=k\} \log \frac{e^{\theta^T_k x^{(i)}}}{\sum^K_{l=1} e^{\theta^T_l x^{(i)}}} $$



And rearranging gives:

$$ J(\theta) = - \sum^m_{i=1} \sum^K_{k=1} 1\{y^{(i)}=k\} \left[ \log e^{\theta^T_k x^{(i)}} - \log \sum^K_{l=1} e^{\theta^T_l x^{(i)}} \right] $$



The partial derivative of $J$ with respect to $\theta_k$ is (treating $1\{y^{(i)}=k\}$ as a constant):



$$ \nabla_{\theta_k} J(\theta) = - \sum^m_{i=1} \left( 1\{y^{(i)}=k\} \left[ x^{(i)} - \underbrace{\frac{e^{\theta^T_k x^{(i)}}}{\sum^K_{l=1} e^{\theta^T_l x^{(i)}}}}_{p(y^{(i)} = k \mid x^{(i)} ; \theta)} x^{(i)} \right] + 1\{y^{(i)} \neq k\} \left[ - \underbrace{\frac{e^{\theta^T_k x^{(i)}}}{\sum^K_{l=1} e^{\theta^T_l x^{(i)}}}}_{p(y^{(i)} = k \mid x^{(i)} ; \theta)} x^{(i)} \right] \right) $$



Note that when differentiating with respect to $\theta_k$, only one term in $\sum^K_{l=1} e^{\theta^T_l x^{(i)}}$ depends on $\theta_k$, namely $e^{\theta^T_k x^{(i)}}$.
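
Spelling out that chain-rule step explicitly (using nothing beyond the definitions above):

$$ \nabla_{\theta_k} \log \sum^K_{l=1} e^{\theta^T_l x^{(i)}} = \frac{e^{\theta^T_k x^{(i)}}}{\sum^K_{l=1} e^{\theta^T_l x^{(i)}}}\, x^{(i)} = p(y^{(i)} = k \mid x^{(i)} ; \theta)\, x^{(i)} $$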



Then we substitute $p$ back in and get:



$$ \nabla_{\theta_k} J(\theta) = -\sum^m_{i=1} \left[ x^{(i)} \left( 1\{y^{(i)} = k\} - p(y^{(i)} = k \mid x^{(i)} ; \theta) \right) \right] $$
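
As an aside, stacking all $K$ classes at once, this says the full gradient is $-(Y - P)^T X$, where row $i$ of $Y$ is the one-hot indicator $1\{y^{(i)} = k\}$ and row $i$ of $P$ holds $p(y^{(i)} = k \mid x^{(i)} ; \theta)$. A minimal NumPy sketch of that matrix form (variable names are mine, not from the tutorial):

```python
import numpy as np

def softmax_grad(Theta, X, y):
    """Gradient of J; row k is the gradient with respect to theta_k from the formula above."""
    K = Theta.shape[0]
    logits = X @ Theta.T                          # shape (m, K)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)             # P[i, k] = p(y^(i) = k | x^(i); theta)
    Y = np.eye(K)[y]                              # one-hot rows: 1{y^(i) = k}
    return -(Y - P).T @ X                         # shape (K, n)
```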






answered Jan 3 '17 at 5:15 by lerner (edited Aug 16 at 10:01 by chandresh)






















             
