Why do we consider the log-likelihood instead of the likelihood for a Gaussian distribution?













I am reading about the Gaussian distribution in a machine learning book. It states:

> We shall determine values for the unknown parameters $\mu$ and $\sigma^2$ in the Gaussian by maximizing the likelihood function. In practice, it is more convenient to maximize the log of the likelihood function. Because the logarithm is a monotonically increasing function of its argument, maximization of the log of a function is equivalent to maximization of the function itself. Taking the log not only simplifies the subsequent mathematical analysis, but it also helps numerically because the product of a large number of small probabilities can easily underflow the numerical precision of the computer, and this is resolved by computing instead the sum of the log probabilities.

Can anyone give me some intuition for this, with an example of where the log-likelihood is more convenient than the likelihood? Please give me a practical example.

Thanks in advance!















asked Aug 10 '14 at 11:11 by Kaidul Islam; edited Aug 23 at 10:11 by jojek




















          2 Answers
1. It is extremely useful, for example, when you want to calculate the joint likelihood for a set of independent and identically distributed points. Assuming that you have your points:
   $$X=\{x_1,x_2,\ldots,x_N\}$$
   The total likelihood is the product of the likelihood for each point, i.e.:
   $$p(X\mid\Theta)=\prod_{i=1}^N p(x_i\mid\Theta)$$
   where $\Theta$ are the model parameters: the vector of means $\mu$ and the covariance matrix $\Sigma$. If you use the log-likelihood, you will end up with a sum instead of a product:
   $$\ln p(X\mid\Theta)=\sum_{i=1}^N\ln p(x_i\mid\Theta)$$


2. Also, in the case of a Gaussian, it allows you to avoid computation of the exponential:

   $$p(x\mid\Theta) = \dfrac{1}{(\sqrt{2\pi})^d\sqrt{\det\Sigma}}\,e^{-\frac12(x-\mu)^T\Sigma^{-1}(x-\mu)}$$

   Which becomes:

   $$\ln p(x\mid\Theta) = -\frac{d}{2}\ln(2\pi)-\frac12\ln(\det\Sigma)-\frac12(x-\mu)^T\Sigma^{-1}(x-\mu)$$




3. Like you mentioned, $\ln x$ is a monotonically increasing function, thus log-likelihoods have the same relations of order as the likelihoods:

   $$p(x\mid\Theta_1)>p(x\mid\Theta_2) \Leftrightarrow \ln p(x\mid\Theta_1)>\ln p(x\mid\Theta_2)$$




4. From the standpoint of computational cost, summing is less expensive than multiplication (although nowadays these are almost equal). More importantly, the likelihood becomes very small as the number of points grows, and you will run out of floating-point precision very quickly, yielding an underflow. That is why it is far more convenient to use the logarithm of the likelihood. Simply try to calculate the likelihood by hand, using a pocket calculator: it is almost impossible.

   Additionally, in the classification framework you can simplify the calculations even further. The relations of order will remain valid if you drop the division by $2$ and the $d\ln(2\pi)$ term. You can do that because these are class independent. Also, if the covariance of both classes is the same ($\Sigma_1=\Sigma_2$), then you can also remove the $\ln(\det\Sigma)$ term.
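The underflow described in point 4 can be reproduced in a few lines. The following is only a minimal sketch under stated assumptions: it uses a univariate Gaussian rather than the multivariate one above, synthetic data drawn with a fixed seed, and the helper names `gaussian_pdf` and `gaussian_logpdf` are made up for this illustration.

```python
import math
import random


def gaussian_pdf(x, mu, sigma):
    """Density of a univariate Gaussian N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def gaussian_logpdf(x, mu, sigma):
    """Log-density written out directly, so no exponential is evaluated."""
    z = (x - mu) / sigma
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * z * z


# Hypothetical data set: 2000 i.i.d. draws from N(0, 1).
random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(2000)]

# Raw likelihood: a product of 2000 factors, each at most about 0.399.
likelihood = 1.0
for x in data:
    likelihood *= gaussian_pdf(x, 0.0, 1.0)

# Log-likelihood: a sum of moderately sized negative numbers.
log_likelihood = sum(gaussian_logpdf(x, 0.0, 1.0) for x in data)

# Log-likelihood of the same data under a worse guess for the mean.
log_likelihood_shifted = sum(gaussian_logpdf(x, 1.0, 1.0) for x in data)

print(likelihood)                               # underflows to 0.0
print(log_likelihood)                           # finite and negative
print(log_likelihood > log_likelihood_shifted)  # models can still be compared
```

Each density value here is below 0.4, so the product of 2000 of them is far smaller than the smallest positive double and is rounded to zero. The raw likelihoods of the two candidate means are then both `0.0` and cannot be ranked, while the log-likelihoods remain ordinary finite numbers that preserve the ordering from point 3.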
















Accepted answer. Answered Aug 10 '14 at 12:01 by jojek; edited Jun 11 '17 at 21:44 by Michael Hardy











• In your second reasoning, you avoid the computation of the exponential, but now you have to compute the $\ln$ instead. Just wondering, is $\ln$ always computationally less expensive than the exponential?
  – Justin Liang
  Jun 8 '16 at 8:09
















First of all, as stated, the log is monotonically increasing, so maximizing the likelihood is equivalent to maximizing the log-likelihood. Furthermore, one can make use of $\ln(ab) = \ln(a) + \ln(b)$. Many equations simplify significantly because one gets sums where one had products before, and one can then maximize simply by taking derivatives and setting them equal to $0$.
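As an illustration of that last step, here is the standard calculation for a univariate Gaussian with i.i.d. samples $x_1,\ldots,x_N$. The univariate setting and the hat notation for the estimates are assumptions made here for brevity; they are not part of the answer above.

$$\ln p(X\mid\mu,\sigma^2) = -\frac{N}{2}\ln(2\pi)-\frac{N}{2}\ln\sigma^2-\frac{1}{2\sigma^2}\sum_{i=1}^N (x_i-\mu)^2$$

Setting the derivative with respect to $\mu$ to zero:

$$\frac{\partial}{\partial\mu}\ln p(X\mid\mu,\sigma^2)=\frac{1}{\sigma^2}\sum_{i=1}^N (x_i-\mu)=0 \quad\Rightarrow\quad \hat\mu=\frac1N\sum_{i=1}^N x_i$$

Setting the derivative with respect to $\sigma^2$ to zero:

$$\frac{\partial}{\partial\sigma^2}\ln p(X\mid\mu,\sigma^2)=-\frac{N}{2\sigma^2}+\frac{1}{2\sigma^4}\sum_{i=1}^N (x_i-\mu)^2=0 \quad\Rightarrow\quad \hat\sigma^2=\frac1N\sum_{i=1}^N (x_i-\hat\mu)^2$$

Differentiating the product $\prod_{i=1}^N p(x_i\mid\mu,\sigma^2)$ directly would require the product rule across all $N$ factors, whereas the sum is differentiated term by term.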

















answered Aug 10 '14 at 11:19 by hickslebummbumm











• Look at the example on this page: unc.edu/courses/2007spring/enst/562/001/docs/lectures/… . Maximizing the product would be a horrible task; maximizing the sum, however, is quite doable.
  – hickslebummbumm
  Aug 10 '14 at 11:21
















