Proving the information inequality using measure theory
The information inequality is the theorem stating that the Kullback-Leibler divergence between two probability distributions is always nonnegative. It can be proved easily using Jensen's inequality with the $-\log$ function, but the proofs I have read always distinguish whether the probability distributions come from continuous or discrete random variables.
I was trying to see whether the two cases can be treated uniformly, and I thought about using measure theory, with Jensen's inequality as stated in Rudin's book:
Let $\mu$ be a probability measure on a $\sigma$-algebra $\mathcal{M}$ in a set $\Omega$. If $f$ is a real integrable function with $a < f(x) < b$ for all $x \in \Omega$, and if $\varphi$ is convex on $]a,b[$, then
$$\varphi\left( \int_\Omega f \, d\mu \right) \le \int_\Omega (\varphi \circ f) \, d\mu.$$
Using this I have
$$KL(p\,\|\,q) = \int \log\frac{p(x)}{q(x)} \, dp = \int -\log\frac{q(x)}{p(x)} \, dp \ge -\log \int \frac{q(x)}{p(x)} \, dp,$$
but now the only way I can find to continue is to distinguish whether the distributions are discrete or continuous. I haven't studied measure theory and I only have some basic notions, so I'm not sure how to proceed. Also, if my notation is wrong, please tell me. Any help will be appreciated.
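As a quick numerical sanity check (just an illustration I wrote, not part of the proof), the discrete case of the inequality can be verified for random distributions:

```python
import numpy as np

def kl(p, q):
    """Discrete Kullback-Leibler divergence: sum_i p_i * log(p_i / q_i), natural log."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p_i = 0 contribute nothing to the sum
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Check nonnegativity on many random pairs of distributions on 5 points
rng = np.random.default_rng(0)
for _ in range(1000):
    p = rng.random(5); p /= p.sum()
    q = rng.random(5); q /= q.sum()
    assert kl(p, q) >= -1e-12  # nonnegative up to floating-point rounding
```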
measure-theory inequality information-theory
Your derivation immediately generalizes to the case where $q$ is absolutely continuous w.r.t. $p$, meaning that $dq(x) = f(x)\,dp(x)$ for some measurable function $f \geq 0$.
– Sangchul Lee
Aug 17 at 9:56
asked Aug 17 at 8:43
utbutnut
1 Answer
A rigorous definition of information-theoretic quantities can be found in M. S. Pinsker's book "Information and Information Stability of Random Variables and Processes".
Consider two probability measures $\mu$ and $\nu$ defined on the measurable space $(\Omega, \mathcal{M})$. Let $\{E_i\}$ be a finite measurable partition of $\Omega$. Then the KL-divergence can be defined as
$$
D(\mu\,\|\,\nu) = \sup \sum_i \mu(E_i) \log\frac{\mu(E_i)}{\nu(E_i)},
$$
where the supremum is taken over all such partitions of $\Omega$. Since this definition uses a sum, just like the discrete KL-divergence, nonnegativity follows from the discrete argument:
$$
D(\mu\,\|\,\nu) \geq 0.
$$
There is a theorem of Gelfand, Yaglom, and Perez stating that if $D(\mu\,\|\,\nu)$ is finite, then $\mu$ is absolutely continuous with respect to $\nu$ and
$$
D(\mu\,\|\,\nu) = \int_\Omega \log\frac{d\mu}{d\nu} \, d\mu,
$$
where $\frac{d\mu}{d\nu}$ is the Radon-Nikodym derivative. If you want, you can define the KL-divergence directly by the above equation.
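As a numerical illustration of the supremum definition (my own sketch, not from Pinsker's book): take $\mu$ with density $2x$ on $[0,1]$ and $\nu$ uniform on $[0,1]$, so that bin probabilities are exact ($\mu([a,b]) = b^2 - a^2$). Refining the partition increases the partition sum, which approaches the integral $\int_0^1 2x \log(2x)\,dx = \log 2 - \tfrac{1}{2} \approx 0.1931$:

```python
import numpy as np

def partition_kl(n):
    """Partition sum for the partition of [0,1] into n equal bins,
    with mu([a,b]) = b^2 - a^2 (density 2x) and nu uniform (nu = 1/n per bin)."""
    edges = np.linspace(0.0, 1.0, n + 1)
    mu = edges[1:] ** 2 - edges[:-1] ** 2
    nu = np.diff(edges)
    return float(np.sum(mu * np.log(mu / nu)))

exact = np.log(2) - 0.5  # closed form of the integral above
vals = [partition_kl(n) for n in (2, 8, 32, 128)]
# Nested refinements can only increase the sum (log-sum inequality),
# and the values approach the integral value
assert all(a <= b + 1e-12 for a, b in zip(vals, vals[1:]))
assert abs(vals[-1] - exact) < 1e-3
```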
edited Aug 17 at 12:37
answered Aug 17 at 12:22
Arash