Why do we consider the log-likelihood instead of the likelihood for a Gaussian distribution?

I am reading about the Gaussian distribution in a machine learning book. It states that:




We shall determine values for the unknown parameters $\mu$ and $\sigma^2$ in the Gaussian by maximizing the likelihood function. In practice, it is more convenient to maximize the log of the likelihood function. Because the logarithm is a monotonically increasing function of its argument, maximization of the log of a function is equivalent to maximization of the function itself. Taking the log not only simplifies the subsequent mathematical analysis, but it also helps numerically because the product of a large number of small probabilities can easily underflow the numerical precision of the computer, and this is resolved by computing instead the sum of the log probabilities.




Can anyone give me some intuition behind this, with an example of where the log-likelihood is more convenient than the likelihood? Please give me a practical example.



Thanks in advance!







asked Aug 10 '14 at 11:11 by Kaidul Islam; edited Aug 23 at 10:11 by jojek




















2 Answers

















Accepted answer (38 votes)










1. It is extremely useful, for example, when you want to calculate the joint likelihood for a set of independent and identically distributed points. Assuming that you have your points
$$X=\{x_1,x_2,\ldots,x_N\},$$
the total likelihood is the product of the likelihoods of the individual points, i.e.
$$p(X\mid\Theta)=\prod_{i=1}^{N}p(x_i\mid\Theta),$$
where $\Theta$ are the model parameters: the vector of means $\mu$ and the covariance matrix $\Sigma$. If you use the log-likelihood you end up with a sum instead of a product:
$$\ln p(X\mid\Theta)=\sum_{i=1}^{N}\ln p(x_i\mid\Theta).$$


2. Also, in the case of a Gaussian, it allows you to avoid computation of the exponential:
$$p(x\mid\Theta)=\dfrac{1}{(\sqrt{2\pi})^{d}\sqrt{\det\Sigma}}\,e^{-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)},$$
which becomes
$$\ln p(x\mid\Theta)=-\frac{d}{2}\ln(2\pi)-\frac{1}{2}\ln(\det\Sigma)-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu).$$




3. As you mentioned, $\ln x$ is a monotonically increasing function, thus log-likelihoods have the same relations of order as the likelihoods:
$$p(x\mid\Theta_1)>p(x\mid\Theta_2)\;\Longleftrightarrow\;\ln p(x\mid\Theta_1)>\ln p(x\mid\Theta_2).$$




4. From the standpoint of computational complexity, summation is cheaper than multiplication (although nowadays the difference is negligible). But what is even more important, the likelihoods become very small and you run out of floating-point precision very quickly, yielding an underflow. That is why it is far more convenient to use the logarithm of the likelihood; just try to calculate the likelihood by hand with a pocket calculator and you will see it is almost impossible. The first sketch below illustrates points 1, 2 and 4 numerically.



Additionally, in the classification framework you can simplify calculations even further. The relations of order remain valid if you drop the division by $2$ and the $d\ln(2\pi)$ term, because these are class-independent. Moreover, if the covariance of both classes is the same ($\Sigma_1=\Sigma_2$), then you can also drop the $\ln(\det\Sigma)$ term, as in the second sketch below.
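
A minimal sketch of points 1, 2 and 4, assuming NumPy is available; the dimension, sample size and parameters are arbitrary choices for illustration. It evaluates the Gaussian log-density with the closed form from point 2, then compares the product of the raw densities with the sum of their logarithms:

    import numpy as np

    rng = np.random.default_rng(0)

    d, N = 5, 10_000                      # dimension and number of i.i.d. points (arbitrary)
    mu = np.zeros(d)                      # illustrative parameters
    Sigma = np.eye(d)
    X = rng.multivariate_normal(mu, Sigma, size=N)

    # Point 2: log-density from the closed form
    #   ln p(x | Theta) = -d/2 ln(2 pi) - 1/2 ln(det Sigma) - 1/2 (x-mu)^T Sigma^{-1} (x-mu)
    Sigma_inv = np.linalg.inv(Sigma)
    _, logdet = np.linalg.slogdet(Sigma)                     # ln(det Sigma), computed stably
    diff = X - mu
    quad = np.einsum('ni,ij,nj->n', diff, Sigma_inv, diff)   # quadratic form for each point
    log_p = -0.5 * d * np.log(2 * np.pi) - 0.5 * logdet - 0.5 * quad

    # Points 1 and 4: the product of N densities underflows, the sum of their logs does not
    likelihood = np.prod(np.exp(log_p))    # product of thousands of numbers < 1
    log_likelihood = np.sum(log_p)

    print(likelihood)       # 0.0 -- underflowed in double precision
    print(log_likelihood)   # an ordinary negative number (around -7e4 here)

Continuing with the same import, a sketch of the classification shortcut with a shared covariance (the means, covariance and test point are made up). Dropping the terms that are identical for both classes leaves only the quadratic form, so the comparison gives the same decision as comparing the full log-densities:

    # Hypothetical two-class setup with Sigma_1 = Sigma_2 = Sigma
    mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
    Sigma = np.array([[1.0, 0.3],
                      [0.3, 1.0]])
    Sigma_inv = np.linalg.inv(Sigma)

    def score(x, mu):
        """Class-dependent part of ln p(x | mu, Sigma) after dropping the shared
        -d/2 ln(2 pi) and -1/2 ln(det Sigma) terms."""
        diff = x - mu
        return -0.5 * diff @ Sigma_inv @ diff

    x = np.array([1.5, 0.5])
    predicted = 1 if score(x, mu1) > score(x, mu2) else 2
    print(predicted)        # 2 -- x is closer to mu2 under this covariance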







answered Aug 10 '14 at 12:01 by jojek; edited Jun 11 '17 at 21:44 by Michael Hardy






















• In your second reasoning, you avoid the computation of the exponential but now you have to compute the $\ln$ instead. Just wondering, is $\ln$ always computationally less expensive than the exponential?
– Justin Liang, Jun 8 '16 at 8:09

















Answer (7 votes)













First of all, as stated, the log is monotonically increasing, so maximizing the likelihood is equivalent to maximizing the log-likelihood. Furthermore, one can make use of $\ln(ab) = \ln(a) + \ln(b)$. Many equations simplify significantly because one gets sums where one had products before, and one can then maximize simply by taking derivatives and setting them equal to $0$.
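
As a concrete instance of "take derivatives and set them to zero", here is the standard univariate Gaussian case, sketched as an illustration rather than quoted from this answer. Given i.i.d. observations $x_1,\ldots,x_N$,

$$\ln L(\mu,\sigma^2)=\sum_{i=1}^{N}\ln\!\left(\frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}\right)=-\frac{N}{2}\ln(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i-\mu)^2,$$

and setting the partial derivatives to zero gives

$$\frac{\partial\ln L}{\partial\mu}=\frac{1}{\sigma^2}\sum_{i=1}^{N}(x_i-\mu)=0\;\Rightarrow\;\hat\mu=\frac{1}{N}\sum_{i=1}^{N}x_i,\qquad\frac{\partial\ln L}{\partial\sigma^2}=-\frac{N}{2\sigma^2}+\frac{1}{2\sigma^4}\sum_{i=1}^{N}(x_i-\mu)^2=0\;\Rightarrow\;\hat\sigma^2=\frac{1}{N}\sum_{i=1}^{N}(x_i-\hat\mu)^2.$$

Differentiating the original product form of $L$ directly would be far more awkward.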






answered Aug 10 '14 at 11:19 by hickslebummbumm




















• Look at the example on this page: unc.edu/courses/2007spring/enst/562/001/docs/lectures/… . Maximizing the product would be a horrible task; maximizing the sum, however, is quite doable.
– hickslebummbumm, Aug 10 '14 at 11:21










