What is the derivation of the derivative of softmax regression (or multinomial logistic regression)?
Consider the training cost for softmax regression (I will use the term multinomial logistic regression):
$$ J(\theta) = - \sum_{i=1}^{m} \sum_{k=1}^{K} 1\{ y^{(i)} = k \} \log p(y^{(i)} = k \mid x^{(i)} ; \theta) $$
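Here $p$ denotes the softmax model's class probability (the same parameterization the answer below expands):
$$ p(y^{(i)} = k \mid x^{(i)} ; \theta) = \frac{e^{\theta^{(k)\top} x^{(i)}}}{\sum_{l=1}^{K} e^{\theta^{(l)\top} x^{(i)}}} $$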
According to the UFLDL tutorial, the gradient of the above function is:
$$ \nabla_{\theta^{(k)}} J(\theta) = -\sum_{i=1}^{m} \left[ x^{(i)} \left( 1\{ y^{(i)} = k \} - p(y^{(i)} = k \mid x^{(i)} ; \theta) \right) \right] $$
However, they don't include the derivation. Does someone know how it is derived?
I have tried taking the derivative myself, but even my initial steps seem to disagree with the final form they give.
So I first took the gradient $\nabla_{\theta^{(k)}} J(\theta)$ as they suggested:
$$ \nabla_{\theta^{(k)}} J(\theta) = - \nabla_{\theta^{(k)}} \sum_{i=1}^{m} \sum_{k=1}^{K} 1\{ y^{(i)} = k \} \log p(y^{(i)} = k \mid x^{(i)} ; \theta) $$
But since we are taking the gradient with respect to $\theta^{(k)}$, only the term that matches this specific $k$ will be non-zero when we take derivatives. Hence:
$$ \nabla_{\theta^{(k)}} J(\theta) = - \sum_{i=1}^{m} \nabla_{\theta^{(k)}} \log p(y^{(i)} = k \mid x^{(i)} ; \theta) $$
If we then proceed, we get:
$$ - \sum_{i=1}^{m} \frac{1}{p(y^{(i)} = k \mid x^{(i)} ; \theta)} \nabla_{\theta^{(k)}} p(y^{(i)} = k \mid x^{(i)} ; \theta) $$
However, at this point the expression looks very different from what the UFLDL tutorial has, and the indicator function has disappeared completely, which makes me suspect I made a mistake somewhere. On top of that, the final gradient contains a difference, yet there are no subtractions anywhere in my derivation. I suspect the difference might appear when applying the quotient rule, but the disappearing indicator still worries me. Any ideas?
multivariable-calculus optimization machine-learning
I think I got the solution. Will write it up later :) – Charlie Parker, Sep 9 '15 at 22:41
Well, I'm waiting for the solution. I'm not getting the solution that is given either. – lars, Feb 9 '16 at 18:10
@lars Will a handwritten one do for you? I'm too lazy to write it up in LaTeX :p (let me see if I can find the paper; I wrote it a couple of months ago but never uploaded it because I didn't want to write it in LaTeX). – Charlie Parker, Feb 9 '16 at 23:16
@lars I have written it up for you. Hope it helps. – lerner, Jan 3 '17 at 22:55
1 Answer
I encountered the same problem and worked it out after understanding the algorithm. Here I'd like to explain a bit.
$$ J(\theta) = - \sum_{i=1}^{m} \sum_{k=1}^{K} 1\{ y^{(i)} = k \} \log p(y^{(i)} = k \mid x^{(i)} ; \theta) $$
Think of the probabilities as an $m \times K$ matrix: entry $(i, k)$ is a probability, row $i$ collects the class probabilities for the $i$th sample (observation $i$), and column $k$ corresponds to category $k$.
Let's expand $p$ and write the cost as:
$$ J(\theta) = - \sum_{i=1}^{m} \sum_{k=1}^{K} 1\{ y^{(i)}=k \} \log \frac{e^{\theta_k^{\top} x^{(i)}}}{\sum_{l=1}^{K} e^{\theta_l^{\top} x^{(i)}}} $$
Rearranging gives:
$$ J(\theta) = - \sum_{i=1}^{m} \sum_{k=1}^{K} 1\{ y^{(i)}=k \} \left[ \log e^{\theta_k^{\top} x^{(i)}} - \log \sum_{l=1}^{K} e^{\theta_l^{\top} x^{(i)}} \right] $$
The partial derivative of $J$ with respect to $\theta_k$ is (treat $1\{ y^{(i)}=k \}$ as a constant):
$$ \nabla_{\theta_k} J(\theta) = - \sum_{i=1}^{m} \left( 1\{ y^{(i)}=k \} \left[ x^{(i)} - \underbrace{\frac{e^{\theta_k^{\top} x^{(i)}}}{\sum_{l=1}^{K} e^{\theta_l^{\top} x^{(i)}}}}_{p(y^{(i)} = k \mid x^{(i)} ; \theta)} x^{(i)} \right] + 1\{ y^{(i)} \neq k \} \left[ - \underbrace{\frac{e^{\theta_k^{\top} x^{(i)}}}{\sum_{l=1}^{K} e^{\theta_l^{\top} x^{(i)}}}}_{p(y^{(i)} = k \mid x^{(i)} ; \theta)} x^{(i)} \right] \right) $$
Note that only one term of $\sum_{l=1}^{K} e^{\theta_l^{\top} x^{(i)}}$ depends on $\theta_k$, namely $e^{\theta_k^{\top} x^{(i)}}$, so $\nabla_{\theta_k} \sum_{l=1}^{K} e^{\theta_l^{\top} x^{(i)}} = e^{\theta_k^{\top} x^{(i)}} x^{(i)}$.
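Since $1\{ y^{(i)}=k \} + 1\{ y^{(i)} \neq k \} = 1$, the underbraced probability term appears for every sample, while the bare $x^{(i)}$ term survives only when $y^{(i)} = k$. Writing the underbraced fraction as $p(y^{(i)} = k \mid x^{(i)} ; \theta)$, each summand therefore collapses to
$$ x^{(i)} \left( 1\{ y^{(i)} = k \} - p(y^{(i)} = k \mid x^{(i)} ; \theta) \right). $$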
Summing over $i$ (with the leading minus sign) then recovers the tutorial's result:
$$ \nabla_{\theta_k} J(\theta) = -\sum_{i=1}^{m} \left[ x^{(i)} \left( 1\{ y^{(i)} = k \} - p(y^{(i)} = k \mid x^{(i)} ; \theta) \right) \right] $$
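As a quick sanity check on this formula (not from the tutorial or the answer above, just a minimal NumPy sketch with made-up dimensions and invented helper names such as `softmax_cost` and `analytic_grad_k`), one can compare the closed-form gradient for a single $\theta_k$ against a centered finite-difference approximation of $J$ on a small random problem:

```python
import numpy as np

rng = np.random.default_rng(0)
m, K, n = 20, 4, 5                      # samples, classes, features (arbitrary)
X = rng.normal(size=(m, n))             # rows are x^(i)
y = rng.integers(0, K, size=m)          # labels in {0, ..., K-1}
Theta = rng.normal(size=(K, n))         # row k is theta_k

def probs(Theta):
    """Softmax probabilities p(y = k | x; theta), shape (m, K)."""
    Z = X @ Theta.T
    Z -= Z.max(axis=1, keepdims=True)   # shift for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def softmax_cost(Theta):
    """J(theta) = -sum_i log p(y^(i) | x^(i); theta)."""
    P = probs(Theta)
    return -np.log(P[np.arange(m), y]).sum()

def analytic_grad_k(Theta, k):
    """The claimed formula: -sum_i x^(i) (1{y^(i)=k} - p(y^(i)=k | x^(i)))."""
    P = probs(Theta)
    indicator = (y == k).astype(float)
    return -((indicator - P[:, k])[:, None] * X).sum(axis=0)

k, eps = 1, 1e-6
num_grad = np.zeros(n)
for j in range(n):                      # centered differences in theta_k's entries
    Tp, Tm = Theta.copy(), Theta.copy()
    Tp[k, j] += eps
    Tm[k, j] -= eps
    num_grad[j] = (softmax_cost(Tp) - softmax_cost(Tm)) / (2 * eps)

print(np.allclose(num_grad, analytic_grad_k(Theta, k), atol=1e-5))  # expect True
```

If the closed-form expression were wrong (for example, if the indicator term were dropped, as in the question's attempt), this check would print False.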