Proving the information inequality using measure theory

The information inequality is the theorem stating that the Kullback-Leibler divergence between two probability distributions is always non-negative. It can be proved easily using Jensen's inequality with the $-\log$ function, but the proofs I have read always have to distinguish whether the probability distributions are defined by continuous or discrete random variables.

I was trying to see whether it is possible not to distinguish cases, and I thought about using measure theory, with Jensen's inequality as stated in Rudin's book:

Let $\mu$ be a probability measure on a $\sigma$-algebra $\mathcal{M}$ in a set $\Omega$. If $f$ is a real integrable function with $a < f(x) < b$ for all $x \in \Omega$, and if $\varphi$ is convex on $]a,b[$, then
$$\varphi\left( \int_\Omega f \, d\mu \right) \le \int_\Omega (\varphi \circ f) \, d\mu.$$

So using this I have
$$KL(p\|q) = \int \log\frac{p(x)}{q(x)}\, dp = \int -\log\frac{q(x)}{p(x)}\, dp \ge -\log \int \frac{q(x)}{p(x)}\, dp,$$
but now the only way I can find to continue is to distinguish whether the distributions are discrete or continuous. I haven't studied measure theory and I only have some basic notions, so I'm not sure how to continue. Also, if my notation is wrong, please tell me. Any help will be appreciated.







asked Aug 17 at 8:43 by utbutnut







Your derivation immediately generalizes to the case where $q$ is absolutely continuous w.r.t. $p$, meaning that $dq(x)=f(x)\,dp(x)$ for some measurable function $f\geq 0$.
– Sangchul Lee, Aug 17 at 9:56
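Following the comment, here is a sketch of how the derivation can be closed without any case distinction, assuming $q$ is absolutely continuous with respect to $p$ and reading $\frac{q(x)}{p(x)}$ as the Radon-Nikodym derivative $f = \frac{dq}{dp}$:
$$
-\log \int \frac{q(x)}{p(x)}\, dp = -\log \int f\, dp = -\log q(\Omega) = -\log 1 = 0,
$$
so $KL(p\|q) \ge 0$ without ever referring to densities with respect to Lebesgue or counting measure.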
























1 Answer






























A rigorous definition of information-theoretic quantities can be found in M. S. Pinsker's book "Information and Information Stability of Random Variables and Processes".



Consider two probability measures $\mu$ and $\nu$ defined on the measurable space $(\Omega,\mathcal{M})$. Let $\{E_i\}$ be a measurable partition of $\Omega$. Then the KL-divergence can be defined as
$$
D(\mu\|\nu)=\sup\sum_i \mu(E_i)\log\frac{\mu(E_i)}{\nu(E_i)},
$$
where the supremum is taken over all such partitions of $\Omega$. Since this definition is built from sums of exactly the same form as the discrete KL-divergence, the non-negativity carries over:
$$
D(\mu\|\nu)\geq 0.
$$
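Concretely, for any fixed partition the discrete Jensen argument from the question gives the bound (a quick sketch, summing only over the cells with $\mu(E_i) > 0$, whose $\mu$-masses add up to $1$):
$$
\sum_i \mu(E_i)\log\frac{\mu(E_i)}{\nu(E_i)}
= \sum_i \mu(E_i)\left(-\log\frac{\nu(E_i)}{\mu(E_i)}\right)
\ge -\log \sum_i \nu(E_i) \ge -\log 1 = 0,
$$
and taking the supremum over partitions preserves the bound.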



There is a theorem by Gelfand, Yaglom and Perez stating that if $D(\mu\|\nu)$ is finite, then $\mu$ is absolutely continuous with respect to $\nu$ and
$$
D(\mu\|\nu)=\int_\Omega \log\frac{d\mu}{d\nu}\,\mathrm{d}\mu,
$$
where $\frac{d\mu}{d\nu}$ is the Radon-Nikodym derivative. If you want, you can define the KL-divergence directly by the above equation.
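As a numerical illustration of how the partition sums relate to the integral formula, here is a short sketch in Python (the Gaussian pair, the grid, and the helper name `partition_kl` are arbitrary choices made for this example, using NumPy/SciPy): it evaluates the partition sums on ever finer partitions of the real line and compares them with the closed-form divergence between two Gaussians.

```python
import numpy as np
from scipy import stats

# Two probability measures on the real line (arbitrary choices for the example):
mu_dist = stats.norm(loc=0.0, scale=1.0)  # mu = N(0, 1)
nu_dist = stats.norm(loc=1.0, scale=2.0)  # nu = N(1, 2^2)

def partition_kl(n_cells, lo=-30.0, hi=30.0):
    """sum_i mu(E_i) log(mu(E_i)/nu(E_i)) over an equal-width partition of R.

    The two unbounded tails are folded into the first and last cells so that
    the cells really do partition the whole real line.
    """
    edges = np.linspace(lo, hi, n_cells + 1)
    edges[0], edges[-1] = -np.inf, np.inf
    mu_cells = np.diff(mu_dist.cdf(edges))
    nu_cells = np.diff(nu_dist.cdf(edges))
    mask = mu_cells > 0  # cells with mu(E_i) = 0 contribute nothing to the sum
    return float(np.sum(mu_cells[mask] * np.log(mu_cells[mask] / nu_cells[mask])))

# Closed form: D(N(m1,s1^2) || N(m2,s2^2)) = log(s2/s1) + (s1^2 + (m1-m2)^2)/(2 s2^2) - 1/2
closed_form = np.log(2.0) + (1.0 + 1.0) / (2 * 4.0) - 0.5

for n in (4, 16, 64, 256):
    # Each value is nonnegative; the nested refinements increase toward the closed form.
    print(n, partition_kl(n))
print("closed form:", closed_form)
```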






answered Aug 17 at 12:22 by Arash (edited Aug 17 at 12:37)





















