information-theoretic identities

29
Information-Theoretic Identities, Part 1 Prapun Suksompong [email protected] May 6, 2008 Abstract This article reviews some basic results in information theory. It is based on information-theoretic entropy, developed by Shannon in the landmark paper [22]. Contents 1 Mathematical Background and Notation 1 2 Entropy: H 3 2.1 MEPD: Maximum Entropy Probability Distri- butions ...................... 6 2.2 Stochastic Processes and Entropy Rate ..... 8 3 Relative Entropy / Informational Divergence / Kullback Leibler “distance”: D 8 4 Mutual Information: I 11 5 Functions of random variables 13 6 Markov Chain and Markov Strings 14 6.1 Homogeneous Markov Chain .......... 15 7 Independence 15 8 Convexity 15 9 Continuous Random Variables 16 9.1 MEPD ....................... 20 9.2 Stochastic Processes ............... 21 10 General Probability Space 21 11 Typicality and AEP (Asymptotic Equipartition Properties) 22 12 I -measure 26 13 MATLAB 27 1 Mathematical Background and Notation 1.1. Based on continuity arguments, we shall assume that 0 ln 0 = 0, 0 ln 0 q = 0 for q> 0, 0 ln p 0 = for p> 0, and 0 ln 0 0 = 0. 1.2. log x = (log e) (ln x), d dx (x log x) = log ex = log x + log e, and d dx g (x) log g (x)= g 0 (x) log (eg (x)). 1.3. Fundamental Inequality:1 - 1 x ln (x) x - 1 with equality if and only if x = 1. Note that the first inequality follows from the second one via replacing x by 1/x. 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 -5 -4 -3 -2 -1 0 1 2 3 4 5 ln(x) x-1 x 1 1 x Figure 1: Fundamental Inequality For x> 0 and y 0, y - y ln y x - y ln x, with equality if and only if x = y. For x 0, x - x 2 ≤-x ln x 1 - x. For x [0, 1], x - x 2 ≤-(1 - x) ln(1 - x) x. So, for small x, -(1 - x) ln(1 - x) x. 1.4. Figure 2 shows plots of p log p and i(x)= - log p(x) on [0, 1]. The first one has maximum at 1/e. 1.5. Log-sum inequality: For positive numbers a 1 ,a 2 ,... and nonnegative numbers b 1 ,b 2 ,... such that i a i < and 1

Upload: prapun

Post on 12-Nov-2014

270 views

Category:

Documents


0 download

DESCRIPTION

There are many useful identities in information theory, some of which are easy to prove but difficult to come up with in the first place. Therefore, I decided to collect them here. This article can also serve as a starting point for a second course on information theory.

TRANSCRIPT

Page 1: Information-Theoretic Identities

Information-Theoretic Identities, Part 1

Prapun [email protected]

May 6, 2008

Abstract

This article reviews some basic results in information theory.It is based on information-theoretic entropy, developed byShannon in the landmark paper [22].

Contents

1 Mathematical Background and Notation 1

2 Entropy: H 32.1 MEPD: Maximum Entropy Probability Distri-

butions . . . . . . . . . . . . . . . . . . . . . . 62.2 Stochastic Processes and Entropy Rate . . . . . 8

3 Relative Entropy / Informational Divergence /Kullback Leibler “distance”: D 8

4 Mutual Information: I 11

5 Functions of random variables 13

6 Markov Chain and Markov Strings 146.1 Homogeneous Markov Chain . . . . . . . . . . 15

7 Independence 15

8 Convexity 15

9 Continuous Random Variables 169.1 MEPD . . . . . . . . . . . . . . . . . . . . . . . 209.2 Stochastic Processes . . . . . . . . . . . . . . . 21

10 General Probability Space 21

11 Typicality and AEP (Asymptotic EquipartitionProperties) 22

12 I-measure 26

13 MATLAB 27

1 Mathematical Background andNotation

1.1. Based on continuity arguments, we shall assume that0 ln 0 = 0, 0 ln 0

q = 0 for q > 0, 0 ln p0 = ∞ for p > 0, and

0 ln 00 = 0.

1.2. log x = (log e) (lnx),ddx (x log x) = log ex = log x+ log e, andddxg (x) log g (x) = g′ (x) log (eg (x)).

1.3. Fundamental Inequality: 1− 1x ≤ ln (x) ≤ x−1 with

equality if and only if x = 1. Note that the first inequalityfollows from the second one via replacing x by 1/x.

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5-5

-4

-3

-2

-1

0

1

2

3

4

5

ln(x) x-1

x

11x

Proof. To show the second inequality, consider ( ) ln 1f x x x= − + for x > 0. Then,

( ) 1 1f xx

′ = − , and ( ) 0f x′ = iff 1x = . Also, ( ) 2

1 0f xx

′′ = − < . Hence, f is ∩,

and attains its maximum value when 1x = . ( )1f 0= . Therefore, . It is also clear that equality holds iff ln 1 0x x− + ≤ 1x = .

To show the fist inequality, use the second inequality, with x replaced x by 1x

.

• In all definitions of information measures, we adopt the convention that summation is

taken over the corresponding support. • Log-sum inequality:

For positive numbers and nonnegative numbers such that 1 2, ,a a … 1 2, ,b b … ii

a < ∞∑

and 0 , ii

b< <∑ ∞

log logi

iii i

i iii

i

aaa ab b

⎛ ⎞⎜ ⎟⎛ ⎞ ⎛ ⎞ ⎝ ⎠≥⎜ ⎟ ⎜ ⎟ ⎛ ⎞⎝ ⎠⎝ ⎠⎜ ⎟⎝ ⎠

∑∑ ∑

with the convention that log0

ia= ∞ .

Moreover, equality holds if and only if i

i

ab

= constant i∀ .

Note: logx x is convex ∪.

Figure 1: Fundamental Inequality

For x > 0 and y ≥ 0, y− y ln y ≤ x− y lnx, with equalityif and only if x = y.

For x ≥ 0, x− x2 ≤ −x lnx ≤ 1− x.

For x ∈ [0, 1], x − x2 ≤ −(1 − x) ln(1 − x) ≤ x. So, forsmall x, −(1− x) ln(1− x) ≈ x.

1.4. Figure 2 shows plots of p log p and i(x) = − log p(x) on[0, 1]. The first one has maximum at 1/e.

1.5. Log-sum inequality: For positive numbers a1, a2, . . .and nonnegative numbers b1, b2, . . . such that

∑i

ai <∞ and

1

Page 2: Information-Theoretic Identities

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

( )p x

( ) ( )logp x p x−

• ( ) ( ) 1log with equality iff H X x p x≤ ∀ ∈X XX

= (X has a uniform distribution over X ).

Proof ( ) ( ) ( )

( ) ( ) ( )

1log log log

1 11 1

1 1 1 0

x

x

H X E p X E Ep X

E p xp X p X∈

⎡ ⎤− = − − ⎡ ⎤ =⎡ ⎤ ⎢ ⎥⎣ ⎦ ⎣ ⎦

⎢ ⎥⎣ ⎦⎡ ⎤ ⎛

≤ − = ⎜ ⎟⎢ ⎥ ⎜ ⎟⎢ ⎥⎣ ⎦ ⎝

= − = − =

X

X

X XX

X X

X

X X

⎞−

( ) ( )1log 1H X xp x

= ⇔∀XX

= .

Proof. Let ( ) 1q x =X

. Then, x∀ ∈X

( ) ( )( )

( ) ( )log log log1p X p X

D p q H Xq X

= = = − + X

X

E E .

We know that ( ) 0D p q ≥ with equality iff ( ) ( )p x q x= x∀ ∈X , i.e., ( ) 1p x =X

. x∀ ∈X

Proof. Let ( )ip p i= . Set ( ) lni i ii i

G p p p pλ ⎛= + ⎜⎝ ⎠

⎞⎟∑ ∑ . Let

( ) 10 ln i in i

G p p pp p

λ⎛ ⎞∂

= = − +⎜ ⎟∂ ⎝ ⎠+ , then 1

ip eλ −= ; so all ip are the same.

( ) ( ) ( )( ) ( ) ( )1, log log log

p x xi x i x x p x

p x p x= = = = − .

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

1

2

3

4

5

6

7

( )i x

( )p x • Average Mutual information

• A measure of the amount of information that one random variable contains about another random variable. ( ) ( ) ( )( );H X Y H X I X Y= − .

• The reduction in the uncertainty of one random variable due to the knowledge of the other.

• A special case relative entropy.

• Need on average ( ) ( ),H p x y info bits to describe (x,y). If instead,

assume that X and Y are independent, then would need on average ( ) ( ) ( ) ( ) ( ) ( )( ),H p x p y D p x y p x p y+ info bits to describe (x,y).

• In view of relative entropy, it is natural to think of ( );I X Y as a measure of how far X and Y are from being independent.

• Average mutual information

( ) ( )( ) ( )

( )( )

( )( )iff independent

,0 ; log log log

P X Y Q Y XP X YI X Y E E E

p X q Y p X q Y

⎡ ⎤ ⎡ ⎤⎡ ⎤≤ = = =⎢ ⎥ ⎢ ⎥⎢ ⎥

⎢ ⎥ ⎢ ⎥ ⎢ ⎥⎣ ⎦ ⎣ ⎦ ⎣ ⎦.

( ) ( ) ( )( ) ( )

( )( )

( ) ( ) ( )

( ) ( ) ( )( )

,

,; , log

,log ;

,

0 with equality iff and are independent

x y

p x y

p x yI X Y p x y

p x p y

p X YE E i

p X p Y

D p x y p x p y

X Y

∈ ∈

=

⎡ ⎤= = ⎡⎢ ⎥ ⎣ ⎦

⎣ ⎦

=

∑∑X Y

X Y ⎤

Proof

Figure 2: p log p and i(x) = − log p(x)

0 <∑i

bi <∞,

∑i

(ai log

aibi

)≥

(∑i

ai

)log

(∑i

ai

)(∑

i

bi

)with the convention that log ai

0 =∞. Moreover, equality holdsif and only if ai

bi= constant ∀i. In particular,

a1 loga1

b1+ a2 log

a2

b2≥ (a1 + a2) log

a1 + a2

b1 + b2.

The proof follows from rewriting the inequality as∑i

ai log aibi−∑

i

ai log AB , where A =

∑i

ai and B =∑i

bi. Then, combine

the sum and apply ln(x) ≥ 1− 1x .

1.6. The function x ln xy on [0, 1]× [0, 1] is convex ∪ in the

pair (x, y).

This follows directly from log-sum inequality (1.5).

For fixed y, it is a convex ∪ function of x, starting at 0,decreasing to its minimum at −ye < 0, then increasing to− ln y > 0.

For fixed x, it is a decreasing, convex ∪ function of y.

See also Figure 3.

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1-0.5

0

0.5

1

1.5

2

2.5

x

• For fixed y, it is a convex ∪ function of x, starting at 0, decreasing to its minimum

at 0ye

− < , then increasing to ln 0y− > .

That is for [ ]0,1λ ∈ ,

( )( ) ( )( ) ( )1 2 1 21 2 1 2

11 ln ln 1 ln

x x x xx x x xy y y

λ λλ λ λ λ

+ − ⎛ ⎞ ⎛ ⎞+ − ≤ + −⎜ ⎟ ⎜ ⎟

⎝ ⎠ ⎝ ⎠.

Proof. ln ln 1d x xxdx y y

⎛ ⎞= +⎜ ⎟

⎝ ⎠.

2

2

1ln 0d xxdx y x

= > .

• For fixed x, it is a decreasing, convex ∪ function of y.

1y =

0.1y =

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1-0.5

0

0.5

1

1.5

2

2.5

3

y

Proof. ln 0d x xxdy y y

= − < . 2

2 2ln 0d x xxdy y y

= > .

• It is convex ∪ in the pair ( ),x y . That is for [ ]0,1λ ∈ ,

( )( ) ( )( )( )( ) ( )1 2 1 2

1 2 1 21 2 1 2

11 ln ln 1 ln

1x x x xx x x xy y y y

λ λλ λ λ λ

λ λ+ − ⎛ ⎞ ⎛ ⎞

+ − ≤ + −⎜ ⎟ ⎜ ⎟+ − ⎝ ⎠ ⎝ ⎠.

Proof. Apply the log-sum inequality ( ) 1 2 1 21 2 1 2

1 2 1 2

log log loga a a aa a a ab b b b+

+ ≤ ++

.

• In fact, this part already implies the first two.

0.1x =

1x =

Figure 3: The plots of x ln xy .

1.7. For measure space (X,F , µ), suppose non-negative fand g are µ-integrable. Consider A ∈ F and assume g > 0 onA.

Divergence/Gibbs Inequality: Suppose∫Afdµ =

∫Agdµ <∞, then∫

A

f logf

gdµ ≥ 0,

or, equivalently,

−∫A

f log fdµ ≤ −∫A

f log gdµ

with equality if and only if f = g µ−a.e. on A.

Log-sum inequality: Suppose∫Afdµ,

∫Agdµ <∞,

then ∫A

f logf

gdµ ≥

∫A

fdµ

log

∫A

fdµ∫A

gdµ.

In fact, the log can be replaced by any functionh : [0,∞]→ R such that h(y) ≥ c

(1− 1

y

)for some

c > 0.

Pinsker’s inequality: Suppose∫fdµ =

∫gdµ =1

(A = X), then∫f log

f

gdµ ≥ log e

2

(∫|f − g|dµ

)2

. (1)

The three inequalities here include (1.5), (3.3), (9.11), and(3.11) as special cases. The log-sum inequality implies theGibbs inequality. However, it is easier to prove the Gibbsinequality first using the fundamental inequality ln(x) ≥ 1− 1

x .Then, to prove the log-sum inequality, we let α =

∫A

fdµ,

β =∫A

gdµ and normalize f, g to f = f/α, g = g/β. Note that

f and g satisfy the condition for Gibbs inequality. The log-suminequality is proved by substituting f = αf and g = βg intothe LHS.

For Pinsker’s inequality, partition the LHS integral into A =[f ≤ g] and Ac. Applying log-sum inequality to each term givesα ln α

β + (1− α) ln 1−α1−β which we denoted by r(β). Note also

that∫|f − g|dµ = 2 (β − α). Now, r (β) = r(α) +

β∫α

r′ (t)dt

where r(α) = 0 and r′ (t) = t−αt(1−t) ≥ 4 (t− α) for t ∈ [0, 1].

1.8. For n ∈ N = 1, 2, . . ., we define [n] = 1, 2, . . . , n.

We denote random variables by capital letters, e.g., X, andtheir realizations by lower case letters, e.g., x. The probabilitymass function (pmf) of the (discrete) random variable X isdenoted by pX(x). When the subscript is just the capitalized

2

Page 3: Information-Theoretic Identities

version of the argument in the parentheses, we will often writesimply p(x) or px. Similar convention applies to the (cumula-tive) distribution function (cdf) FX(x) and the (probability)density function (pdf) fX(x). The distribution of X will bedenoted by PX or L(X).

1.9. Suppose I is an index set. When Xi’s are randomvariables, we define a random vector XI by XI = (Xi : i ∈ I).Then, for disjoint A,B, XA∪B = (XA, XB). If I = [n], thenwe write XI = Xn

1 .When Ai’s are sets, we define the set AI by the union

∪i∈IAi.

We think of information gain (I) as the removal of uncer-tainty. The quantification of information then necessitates thedevelopment of a way to measure one level of uncertainty (H).

In what followed, although the entropy (H), relative entropy(D), and mutual information (I) are defined in terms of randomvariables, their definitions extend to random vectors in astraightforward manner. Any collection Xn

1 of discrete randomvariables can be thought of as a discrete random variable itself.

2 Entropy: H

We begin with the concept of entropy which is a measure ofuncertainty of a random variable [6, p 13]. Let X be a discreterandom variable which takes values in alphabet X .

The entropy H(X) of a discrete random variable X is afunctional of the distribution of X defined by

H (X) = −∑x∈X

p (x) log p (x) = −E [log p (X)]

≥ 0 with equality iff ∃x ∈ X p (x) = 1

≤ log |X |with equality iff ∀x ∈ X p (x) =1|X |

.

In summary,

0deterministic

≤ H (X) ≤ log |X |uniform

.

The base of the logarithm used in defining H can be chosento be any convenient real number b > 1. If the base of thelogarithm is b, denote the entropy as Hb (X). When usingb = 2, the unit for the entropy is [bit]. When using b = e, theunit is [nat].

Remarks:

The entropy depends only on the (unordered) probabilities(pi) and not the values x. Therefore, sometimes, wewrite H

(PX)

instead of H(X) to emphasize that theentropy is a functional for the probability distribution.

H(X) is 0 if an only if there is no uncertainty, that iswhen one of the possible values is certain to happen.

If the sum in the definition diverges, then the entropyH(X) is infinite.

0 ln 0 = 0, so the x whose p(x) = 0 does not contributeto H(X).

H (X) = log |X | − D(PX ‖U

)where U is the uniform

distribution with the same support.

2.1. Entropy is both shift and scale invariant, that is ∀a 6= 0and ∀b: H(aX + b) = H(X). In fact, for any injective (1-1)function g on X , we have H(g(X)) = H(X).

2.2. H (p (x)) is concave (convex ∩) in p (x). Thatis, ∀λ ∈ [0, 1] and any two p.m.f. p1 (x) , x ∈ X andp2 (x) , x ∈ X, we have

H (p∗) ≥ λH (p1) + λH (p2) ,

where p∗ (x) ≡ λp1 (x) + (1− λ) p2 (x) ∀x ∈ X .

2.3. Asymptotic value of multinomial coefficient [6, Q11.21p 406]: Fix a p.m.f. P = (p1, p2, . . . , pM ). For i = 1, . . . ,m−1, define an,i = bnpic. Set am = n −

∑m−1j=0 bnpjc so that∑m

i=1 ai = n. Then,

1n

limn→∞

log(

n

a1 a2 · · · am

)= limn→∞

1n

logn!

M∏i=1

ai!= H(P ).

(2)

2.4. (Differential Entropy Bound on Discrete Entropy) ForX on a1, a2, . . ., let pi = pX (ai), then

H(X) ≤ 12

log(

2πe(

Var(∗)[X] +112

))where

Var(∗) [X] =

(∑i∈N

i2pi

)−

(∑i∈N

ipi

)2

which is not the variance of X itself but of an integer-valuedrandom variable with the same probabilities (and hence thesame entropy). Moreover, for every permutation σ, we canreplace Var(∗)[X] above by

Var(σ)[X] =

(∑i∈N

i2pσ(i)

)−

(∑i∈N

ipσ(i)

)2

.

2.5. Example:

If X ∼ P (λ), then H(X) ≡ H (P (λ)) = λ − λ log λ +E logX!. Figure 4 plots H (P (λ)) as a function of λ.Note that, by CLT, we may approximate H (P (λ)) byh (N (λ, λ)) when λ is large.

3

Page 4: Information-Theoretic Identities

X ∼ Support set X pX (k) H(X)Uniform Un 1, 2, . . . , n 1

n log n

Bernoulli B(1, p) 0, 1

1− p, k = 0p, k = 1 hb(p)

Binomial B(n, p) 0, 1, . . . , n(nk

)pk (1− p)n−k

Geometric G(p) N ∪ 0 (1− p) pk 11−phb(p) = 1

1−phb(1− p)= EX log

(1 + 1

EX)

+ log(1 + EX)Geometric G′(p) N (1− p)k−1p 1

phb(p)

Poisson P(λ) N ∪ 0 e−λ λk

k! λ log e+ EX log λ+ E logX!

Table 1: Examples of probability mass functions and corresponding discrete entropies. Here, p, β ∈ (0, 1). λ > 0. hb(p) =−p log p− (1− p) log (1− p) is the binary entropy function.

X ∼ fX (x) h(X)Uniform U(a, b) 1

b−a1[a,b] (x) log(b− a) = 12 log 12σ2

X ≈ 1.79 + log2 σX [bits]Exponential E(λ) λe−λx1[0,∞] (x) log e

λ = log eσX ≈ 1.44 + log2 σX [bits]

Shifted Exponential 1µ−s0 e

− x−s0µ−s0 1[s0,∞)(x); log (e(µ− s0))on [s0,∞), mean µ µ > s0

Bounded Exp. αe−αa−e−αb e

−αx1[a,b] (x) log(e1−αa−e1−αb

α

)+ αae

−αa−be−αbe−αa−e−αb log e

Laplacian L(α) α2 e−α|x| log 2e

α = 12 log 2e2σ2

X ≈ 1.94 + log2 σX [bits]

Normal N (µ, σ2) 1σ√

2πe−

12 ( x−µσ )2

12 log

(2πeσ2

)≈ 2.05 + log2 σ [bits]

Gamma Γ (q, λ) λqxq−1e−λx

Γ(q) 1(0,∞)(x) q (log e) + (1− q)ψ (q) + log Γ(q)λ

Pareto Par(α) αx−(α+1)1[1,∞] (x) − log (α) +(

1α + 1

)log e

Par(α, c) = cPar(α) αc

(cx

)α+1 1(c,∞)(x) log(cα

)+(

1α + 1

)log e

Beta β (q1, q2) Γ(q1+q2)Γ(q1)Γ(q2)x

q1−1 (1− x)q2−1 1(0,1) (x)logB (q1, q2)− (q1 − 1) (ψ (q1)− ψ (q1 + q2))− (q2 − 1) (ψ (q2)− ψ (q1 + q2))

Beta prime Γ(q1+q2)Γ(q1)Γ(q2)

xq1−1

(x+1)(q1+q2) 1(0,∞) (x)

Rayleigh 2αxe−αx21[0,∞] (x) log

(1

2√α

)+(1 + γ

2

)log e

Standard Cauchy 1π

11+x2 log (4π)

Cau(α) 1π

αα2+x2 log (4πα)

Cau(α, d) Γ(d)√παΓ(d− 1

2 )1(

1+( xα )2)d

Log Normal eN(µ,σ2) 1σx√

2πe−

12 ( ln x−µ

σ )2

1(0,∞) (x) 12 log

(2πeσ2

)+ µ log e

Table 2: Examples of probability density functions and their entropies. Here, c, α, q, q1, q2, σ, λ are all strictly positive and d > 12 .

γ = −ψ(1) ≈ .5772 is the Euler-constant. ψ (z) = ddz log Γ (z) = (log e) Γ′(z)

Γ(z) is the digamma function. B(q1, q2) = Γ(q1)Γ(q2)Γ(q1+q2) is

the beta function.

4

Page 5: Information-Theoretic Identities

0 2 4 6 8 10 12 14 16 18 20-0.5

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

λ

( )( ),h λ λN

( )( )H λP

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1-0.5

0

0.5

1

1.5

2

λ

Figure 4: H (P (λ)) and its approximation by h (N (λ, λ)) =12 log (2πeλ)

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

p

H(p

)

• Logarithmic Bounds: ( )( ) ( ) ( ) ( )(1ln ln log ln lnln 2

p q e H p p q≤ ≤ )

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

• Power-type bounds: ( )( ) ( ) ( ) ( )( )1

ln 4ln 2 4 log ln 2 4pq e H p pq≤ ≤

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

Entropy for two random variables

• For two random variables X and Y with a joint pmf ( ),p x y and marginal pmf p(x) and p(y).

Figure 5: Binary Entropy Function

2.6. Binary Entropy Function : We define hb(p), h (p) orH(p) to be −p log p− (1− p) log (1− p), whose plot is shownin Figure 5. The concavity of this function can be regardedas an example of 2.2, 8.1. Some properties of hb are:

(a) H(p) = H(1− p).

(b) dhdp (p) = log 1−p

p .

(c) ddxh (g (x)) =

(dgdx (x)

)log 1−g(x)

g(x) .

(d) 2h(p) = p−p (1− p)−(1−p).

(e) 2−h(p) = pp (1− p)1−p.

(f) h(

1b

)= log b− b−1

b log (b− 1).

(g) 1n+12nhb(

rn ) ≤

(nr

)≤ 2nhb(

rn ) [5, p 284-286].

(h) limn→∞

1n log

bαnc∑i=0

(ni

)volume

= limn→∞

1n log

(nbαnc

)surface shell

= hb (α) [6,

Q11.21 p 406]. See also (2).

(i) Quadratic approximation: hb(p) ≈ 4p(1− p)

There are two bounds for H(p):

Logarithmic Bounds:

(ln p) (ln q) ≤ (log e)H (p) ≤ 1ln 2

(ln p) (ln q) .

Power-type bounds:

(ln 2) (4pq) ≤ (log e)H (p) ≤ (ln 2) (4pq)1

ln 4 .

2.7. For two random variables X and Y with a joint p.m.f.p(x, y) and marginal p.m.f. p(x) and p(y), the conditionalentropy is defined as

H (Y |X ) = −E [log p (Y |X )] = −∑x∈X

∑y∈Y

p (x, y) logp (y |x )

=∑x∈X

p (x)H (Y |X = x ),

where

H (Y |X = x ) = −E [ log(p(Y |x))|X = x]

= −∑y∈Y

p (y |x ) logp (y |x ) .

Note that H(Y |X) is a function of p(x, y), not just p(y|x).

Example 2.8. Thinned Poisson: Suppose we have X ∼ P (λ)and conditioned on X = x, we define Y to be a Binomial r.v.with size x and success probability y:

X → s → Y.

Then, Y ∼ P (sλ) with H(Y ) = H (P (sλ)). Moreover,

p (x |y ) = e−λ(1−s) (λ (1− s))x−y

(x− y)!1 [x ≥ y] ;

which is simply a Poisson r.v. shifted by y. Consequently,

H (X |y ) = H (P (λ (1− s))) = H (X |Y ) .

2.9. 0Y=g(X)

≤ H (Y |X ) ≤ H (Y )X,Y independent

.

2.10. The discussion above for entropy and conditional en-tropy is still valid if we replace random variables X and Y inthe discussion above with random vectors. For example, thejoint entropy for random variables X and Y is defined as

H (X,Y ) = −E [log p (X,Y )] = −∑x∈X

∑y∈Y

p (x, y) logp (x, y).

More generally, for a random vector Xn1 ,

H (Xn1 ) = −E [log p (Xn

1 )] =n∑i=1

H(Xi

∣∣Xi−11

)≤

n∑i=1

H (Xi)

Xi’s are independent

.

5

Page 6: Information-Theoretic Identities

2.11. Chain rule:

H (X1, X2, . . . , Xn) =n∑i=1

H (Xi |Xi−1, . . . , X1 ),

or simply

H (Xn1 ) =

n∑i=1

H(Xi

∣∣Xi−11

).

Note that the term in sum when i = 1 is H(X1). Moreover,

H (Xn1 ) =

n∑i=1

H(Xi

∣∣Xni+1

).

In particular, for two variables,

H (X,Y ) = H (X) +H (Y |X ) = H (Y ) +H (X |Y ) .

Chain rule is still true with conditioning:

H (Xn1 |Z ) =

n∑i=1

H(Xi

∣∣Xi−11 , Z

).

In particular,

H (X,Y |Z) = H (X |Z ) +H (Y |X,Z )= H (Y |Z ) +H (X |Y,Z ) .

2.12. Conditioning only reduces entropy:

H (Y |X ) ≤ H (Y ) with equality if and only if X and Yare independent. That is H (p (x) p (y)) = H (p (y))+H (p (x)).

H (X |Y ) ≥ H (X |Y,Z ) with equality if and only if givenY we have X and Z are independent i.e. p (x, z |y ) =p (x |y ) p (z |y ).

2.13. H (X |X ) = 0.

2.14. H (X,Y ) ≥ max H (X) , H (Y |X ) , H (Y ) , H (X |Y ).

2.15. H (X1, X2, . . . , Xn|Y ) ≤n∑i=1

H (Xi|Y ) with equality

if and only if X1, X2, . . . , Xn are independent conditioning on

Y

(p (xn1 |y ) =

n∏i=1

p (xi |y ))

.

2.16. Suppose Xi’s are independent.

Let Y = X1 +X2. Then, H(Y |X1) = H(X2) ≤ H(Y )and H(Y |X2) = H(X1) ≤ H(Y ).

If two finite index sets satisfy J ⊂ I, thenH(∑

j∈J Xj

)≤ H

(∑i∈I Xi

). Also,

H(

1|J|∑j∈J Xj

)≤ H

(1|I|∑i∈I Xi

).

More specifically, H(

1n

n∑i=1

Xi

)is an increasing function

of n.

2.17. Suppose X is a random variable on X . ConsiderA ⊂ X . Let PA = P [X ∈ X]. Then, H(X) ≤ h(PA) + (1 −PA) log(|X | − |A|) + PA log |A|.

2.18. Fano’s Inequality: Suppose U and V be randomvariables on common alphabet set of cardinality M . LetE = 1U 6=V and Pe = P [E = 1] = P [U 6= V ]. Then,

H (U |V ) ≤ h (Pe) + Pe log (M − 1)

with equality if and only if

P [U = u, V = v] = P [V = v]×

PeM−1 , u 6= v

1− Pe, u = v.

Note that if Pe = 0, then H(U |V ) = 0

2.19. Extended Fano’s Inequality: Let UL1 , VL1 ∈ UL =

VL where |U| = |V| = M . Define Pe,` = P [V` 6= U`] and

P e = 1L

L∑=1

Pe,`. Then,

1LH(UL1∣∣V L1 ) ≤ h (P e)+ P e log (M − 1) .

2.20 (Han’s Inequality).

H (Xn1 ) ≤ 1

n− 1

n∑i=1

H(X[n]\i

).

2.21. Independent addition (in an additive group G) increaseentropy in a sublinear way: For two random variables X andZ,

H(X) ≤ H(X ⊕ Z) ≤ H(X) +H(Z)

[4]. To see this, note that conditioning only reduces entropy;therefore,

H(X) = H(X ⊕ Z|Z) ≤ H(X ⊕ Z).

The last item is simply a function of X,Z. We know that

H(g(X,Z)) ≤ H(X,Z) ≤ H(X) +H(Z).

The independence between X and Z is only used in the lastinequality. See also Figure 6.

2.22. Majorization: Given two probability distributions p =(p0 ≥ p1 ≥ · · · ≥ pm) and q = (q0 ≥ q1 ≥ · · · ≥ qm), we saythat p majorizes q if for all k,

∑ki=1 pi ≥

∑ki=1 qi. In which

case, H(p) ≤ H(q). [17]

2.1 MEPD: Maximum Entropy ProbabilityDistributions

Consider fixed (1) countable (finite or infinite) S ⊂ R and(2) functions g1, . . . , gm on S. Let C be a class of probability

6

Page 7: Information-Theoretic Identities

Z X

X Z+

( ), 0H X Z X Z+ = 0

00a

( ), 0H X X Z Z+ = ( ), 0H Z X Z X + =

b b a= −

c d

if X and Z are independent

Figure 6: Information diagram For X, Z, and X + Z.

mass function pX which are supported on S (pX = 0 onSc) and satisfy the moment constraints E [gk(X)] = µk, for1 ≤ k ≤ m. Let PMF p∗ be the MEPD for C. Define PMF q

on S by qx = c0e

m∑k=1

λkgk(x)where c0 =

(∑x∈S

e

m∑k=1

λkgk(x))−1

and λ1, . . . , λm are chosen so that q ∈ C. Note that p∗ and qmay not exist.

(a) If p∗ exists (and ∃p′ ∈ C such that p′ > 0 on S), then qexists and q = p∗.

(b) If q exists, then p∗ exists and p∗ = q.

Note also that c0 > 0.

Example 2.23.

(a) When there is no constraint, if X is finite, then MEPDis uniform with pi = 1

|X | . If X is countable, then MEPDdoes not exist.

(b) If require EX = µ, then MEPD is geometric (or truncatedgeometric) pi = c0β

i. If, in addition, S = N ∪ 0, then

pi = µi

(1+µ)i+1 = 11+µ

1+µ

)iwith corresponding entropy

11− β

hb(β) =1

1− βhb(1− β) (3)

= −µ logµ+ (1 + µ) log(1 + µ) (4)

= µ log(

1 +1µ

)+ log(1 + µ). (5)

Observe that

(i) (4) is similar to the one defining the binary entropyfunction hb in (2.6),

(ii) H(X) is a strictly increasing function of µ, and

(iii) H(X)EX = H(X)

µ is a strictly decreasing function of µ.

See [14] for more examples.

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5

m = 0:0.01:5; h = m.*log2(1+1./m) + log2(1+m); set(plot(m,h,'k',m,h./m,'k-.'),'LineWidth',1.5') axis([0 5 0 5])

50

0.5

1

1.5

2

2.5

3

3.5

4

4.5

5

( )H X

( )H X EX

EX

Figure 7: Entropy of Geometric distribution as a function ofits mean.

2.24. The Poisson distribution P(λ) maximizes entropywithin the class of Bernoulli sums of mean λ. Formallyspeaking, define S(λ) to be the class of Bernoulli sumsSn = X1 + · · · + Xn, for Xi independent Bernoulli, andESn = λ. Then, the entropy of a random variable in S(λ) isdominated by that of the Poisson distribution:

supS∈S(λ)

H (S) = H (P (λ)) .

Note that P(λ) is not in the class S(λ).

2.25. Most of the standard probability distributions canbe characterized as being MEPD. Consider any pmf on S

of the form qX(x) = c0e

m∑k=1

λkgk(x)with finite E [gk(X)] for

k = 1, . . . ,m. qX is the MEPD for the class of pmf p on Sunder the constraints by the E [gk(X)]’s. Hence, to see whichconstraints characterize a pmf as an MEPD, rewrite the pmfin the exponential form.

S Constraints MEPD[n] No constraint Un

N ∪ 0 EX G0(p)N EX G1(p)

0, 1, . . . , n EX,E ln(nX

)B(n, p)

N ∪ 0 EX = λ,E lnX! P(λ)

Table 3: MEPD: Discrete cases. Most of the standard proba-bility distributions can be characterized as being MEPD whenvalues of one or more of some moments are prescribed [14, p359].

7

Page 8: Information-Theoretic Identities

2.2 Stochastic Processes and Entropy Rate

In general, the uncertainty about the values of X(t) on theentire t axis or even on a finite interval, no matter how small,is infinite. However, if X(t) can be expressed in terms of itsvalues on a countable set of points, as in the case for bandlim-ited processes, then a rate of uncertainty can be introduced.It suffices, therefore, to consider only discrete-time processes.[21]

2.26. Consider discrete stationary source (DSS) U (k). LetU be the common alphabet set. By staionaryness, H (Un1 ) =H(Uk+n−1k

). Define

Per letter entropy of an L-block:

HL =H(UL1)

L=H(Uk+L−1k

)L

;

Incremental entropy change:

hL = H(UL∣∣UL−1

1

)= H

(UL1)−H

(UL−1

1

)It is the conditional entropy of the symbol when the pre-ceding ones are known. Note that for stationary Markovchain, hL,markov = H

(UL∣∣UL−1

1

)= H (UL |UL−1 ) =

H (U2 |U1 ) ∀L ≥ 2.

Then,

h1 = H1 = H (U1) = H (Uk).

hL ≤ HL.

hL = LHL − (L− 1)HL−1.

Both hL and HL are non-increasing () function of L,converging to same limit, denoted H.

The entropy rate of a stationary source is defined as

H (U) = H (U`) = HU

= limL→∞

H(UL1)

L= limL→∞

H(UL∣∣UL−1

1

).

Remarks:

Note that H(UL1 ) is an increasing function of L. So, forstationary source, the entropy H(UL1 ) grows (asymptoti-cally) linearly with L at a rate HU .

For stationary Markov chain of order r, HU =H (Ur+1 |Ur1 ) = hr+1.

For stationary Markov chain of order 1, HU =H (U2 |U1 ) = h2 < H (U1) = H (U2). In partic-ular, let Xi be a stationary Markov chain withstationary distribution u and transition matrix P .Then, the entropy rate is H = −

∑ij

uiPij logPij ,

where Pij = P [Next state is j |Current state is i ] =P [X2 = j |X1 = i ] and ui = P [X1 = i]. Also, HU =

∑i

uiH (U2 |U1 = i ) whereH (U2 |U1 = i )’s are computed,

for each i, by the transition probabilities Pij from state i.

When there are more than one communicating class,HU =

∑i

P [classi]H (U2 |U1 , classi).

The statement of convergence of the entropy at time n ofa random process divided by n to a constant limit called theentropy rate of the process is known as the ergodic theoremof information theory or asymptotic equirepartitionproperty (AEP). Its original version proven in the 50’sfor ergodic stationary process with a finite state space, isknown as Shannon-McMillan theorem for the convergence inmean and as Shannon-McMillan-Breiman theorem for the a.s.convergence.

2.27. Weak AEP: For Uk i.i.d. pU (u),

− 1L

log pUL1(UL1) P−→ H (U) .

See also section (11).

2.28. Shannon-McMillan-Breiman theorem [5, Sec15.7]: If the source Uk is stationary and ergodic, then

− 1L

log pUL1(UL1) a.s.−−→ HU .

This is also referred to as the AEP for ergodic stationarysources/process.

3 Relative Entropy / InformationalDivergence / Kullback Leibler“distance”: D

3.1. Let p and q be two pmf’s on a common countablealphabet X . Relative entropy (Kullback Leibler “distance”,cross-entropy, directed divergence, informational divergence,or information discrimination) between p and q (from p to q)is defined as

D (p ‖q ) =∑x∈X

p (x) logp (x)q (x)

= E[log

p (X)q (X)

], (6)

where the distribution of X is p.

Not a true distance since symmetry and triangle inequal-ity fail. However, it can be regarded as a measure ofthe difference between the distributions of two randomvariables. Convergence in D is stronger than convergencein L1 [see (3.11)].

If q is uniform, i.e. q = u = 1|X | , then D (p ‖u ) = log |X |−

H (X).

8

Page 9: Information-Theoretic Identities

By convexity of D, this also shows that H(X) isconcave ∩ w.r.t. p.

Maximizing the entropy subject to any given set ofconstraints is identical to minimizing the relativeentropy from the uniform distribution subject to thesame constraints [13, p 160].

If ∃x such that p(x) > 0 but q(x) = 0 (that is support ofp is not a subset of support of q), then D(p‖q) =∞.

3.2. Although relative entropy is not a proper metric, it isnatural measure of “dissimilarity” in the context of statistics[5, Ch. 12]. Some researchers regard it as an information-theoretic distance measures.

3.3. Divergence Inequality: D (p ‖q ) ≥ 0 with equality ifand only if p = q. This inequality is also known as GibbsInequality. Note that this just means if we have two vectorsu,v with the same length, each have nonnegative elementswhich summed to 1. Then,

∑i

ui log uivi≥ 0.

3.4. D (p ‖q ) is convex ∪ in the pair (p, q). That is if (p1, q1)and (p2, q2) are two pairs of probability mass functions, then

D (λp1 + (1− λ) p2 ‖λq1 + (1− λ) q2 )≤ λD (p1 ‖q1 ) + (1− λ)D (p2 ‖q2 )∀ 0 ≤ λ ≤ 1.

This follows directly from the convexity ∪ of x ln xy .

For fixed p, D (p ‖q ) is convex ∪ functional of q. That is

D (λq1 + (1− λ) q2 ‖p ) ≤ λD (q1 ‖p )+(1− λ)D (q2 ‖p ) .

For fixed q, D (p ‖q ) is convex ∪ functional of p.

3.5. For binary random variables:

D (p ‖q ) = p logp

q+ (1− p) log

1− p1− q

.

Note that D (p ‖q ) is convex ∪ in the pair (p, q); hence it isconvex ∪ in p for fix q and convex ∪ in q for fix p.

3.6. For p = (p1, . . . , pn) and q = (q1, . . . , qn),

D(p‖q) is a continuous function of p1, . . . , pn and of(q1, . . . , qn) [13, p 155];

D(p‖q) is permutationally symmetric, i.e., the value ofthis measure does not change if the outcomes are labeleddifferently if the pairs (p1, q1), . . . , (pn, qn) are permutedamong themselves [13, p 155].

3.7. The conditional relative entropy is defined by

D (p (y |x ) ‖q (y |x ) ) = E[log

p (Y |X )q (Y |X )

]=∑x

p (x)∑y

p (y |x ) logp (y |x )q (y |x )

.

3.8. D (p (x |z ) ‖q (x |z ) ) ≥ 0. In fact, it is E[log p(X|Z )

q(X|Z )

]=∑

zp (z) E

[log p(X|z )

q(X|z )

∣∣∣ z] where E[

log p(X|z )q(X|z )

∣∣∣ z] ≥ 0 ∀z.

3.9. Chain rule for relative entropy:

D (p (x, y) ‖q (x, y) ) = D (p (x) ‖q (x) )+D (p (y |x ) ‖q (y |x ) ) .

3.10. Let p and q be two probability distributions on acommon alphabet X . The variational distance between p andq is defined by

dTV (p, q) =12

∑x∈X|p (x)− q (x)|.

It is the metric induced by the `1 norm. See also 10.5.

3.11. Pinsker’s Inequality [6, lemma 11.6.1 p 370],[5,lemma 12.6.1 p 300]:

D (p ‖q ) ≥ 2 (logK e) d2TV (p, q) ,

where K is the base of the log used to define D. In particular,if we use ln when defining D (p ‖q ), then

D (p ‖q ) ≥ 2d2TV (p, q) .

Pinsker’s Inequality shows that convergence in D is strongerthan convergence in L1. See also 10.6

3.12. Suppose X1, X2, . . . , Xn are independent andY1, Y2, . . . , Yn are independent, then

D(pXn1

∥∥pY n1 ) =n∑i=1

D (pXi ‖pYi ).

Combine with (3.14), we have

D

(p n∑i=1

Xi

∥∥∥∥∥p n∑i=1

Yi

)≤

n∑i=1

D (pXi ‖pYi ).

3.13 (Relative entropy expansion). Let (X,Y ) have jointdistribution pX,Y (x, y) on X × Y with marginals pX(x) andpY (y), respectively. Let qX(x) and qY (y) denote two arbitrarymarginal probability distributions on X and Y, respectively.Then,

D (pX,Y ‖qXqY ) = D (pX ‖qX ) +D (pY ‖qY ) + I (X;Y ) .

More generally, Let Xn1 have joint distribution pXn1 (xn1 ) on

X1 × X2 × · · · × Xn with marginals pXi (xi)’s. Let qXi(xi)’sdenote n arbitrary marginal distributions. Then,

D

(pXn1

∥∥∥∥∥n∏i=1

qX1

)

=n∑i=1

D(pXi

∥∥qXi )+n−1∑i=1

I(Xi;Xn

i+1

)=

n∑i=1

D(pXi

∥∥qXi )+n∑i=1

H (Xi)−H (Xn1 ) .

9

Page 10: Information-Theoretic Identities

Note that the final equation follows easily from

lnp (xn1 )

n∏i=1

qXi (xi)= ln

p (xn1 )n∏i=1

qXi (xi)

n∏i=1

pXi (xi)

n∏i=1

pXi (xi)

=n∑i=1

lnpXi (xi)qXi (xi)

+ log p (xn1 )−n∑i=1

log pXi (xi).

See also (7.3).

3.14 (Data Processing Inequality for relative entropy). LetX1 and X2 be (possibly dependent) random variables on X .Q(y|x) is a channel. Yi is the output of the channel when theinput is Xi. Then,

D (Y1 ‖Y2 ) ≤ D (X1 ‖X2 ) .

In particular,

D (g (X1) ‖g (X2) ) ≤ D (X1 ‖X2 ) .

The inequality follows from applying the log-sum inequalityto pY1 (y) ln pY1 (y)

pY2 (y) where pYi (y) =∑xQ (y |x ) pXi (x).

X ( )g X

Z

0

0

0 0≥

1Y

2X

( )Q y x

( )Q y x

1X

2Y

Figure 8: Data Processing Inequality for relative entropy

3.15. Poisson Approximation for sums of binary ran-dom variables [15, 12]:

Given a random variable X with corresponding pmf pwhose support is inside N ∪ 0, the relative entropyD (p‖P(λ)) is minimized over λ at λ = EX [12, Lemma7.2 p. 131].

In fact,

D (p ‖q ) = λ− EX ln (λ) +∞∑i=0

p (x) ln (p (x)x!).

andd

dλD (p ‖q ) = 1− 1

λEX.

Let X1, X2, . . . , Xn denote n possibly dependent binaryrandom variables with parameters pi = P [Xi = 1]. LetSn =

∑ni=1Xi. Let Λn be a Poisson random variable

with mean λ =n∑i=1

pi. Then,

D(PSn

∥∥PΛn)≤ log e

n∑i=1

p2i +

(n∑i=1

H (Xi)−H (Xn1 )

).

Note that the first term quantifies how small the pi’sare. The second term quantifies the degree of dependence.(See also (7.3).)

Also see (2.24).

3.16 (Han’s inequality for relative entropies). SupposeY1, Y2, . . . are independent. Then,

D (Xn1 ‖Y n1 ) ≥ 1

n− 1

n∑i=1

D(X[n]\i

∥∥Y[n]\i),

or equivalently,

D (Xn1 ‖Y n1 ) ≤

n∑i=1

(D (Xn

1 ‖Y n1 )−D(X[n]\i

∥∥Y[n]\i)).

3.17. Minimum cross-entropy probability distribution(MCEPD) [13, p 158, 160][6, Q12.2, p 421]:

Consider fixed (1) finite S ⊂ R, (2) functions g1, . . . , gmon S, and (3) pmf q on S. Let C be a class of probabilitymass function pX which are supported on S (pX = 0 onSc) and satisfy the moment constraints E [gk(X)] = µk, for1 ≤ k ≤ m. Let pmf p∗ be the probability distribution thatminimizes the relative entropy D( · ‖q) for C. Define PMF p

on S by p = qeλ0e

m∑k=1

λkgkwhere λ0, λ1, . . . , λm are chosen so

that p ∈ C. Note that p∗ and p may not exist.Suppose p exists.

(a) Fix r ∈ C.

∑x∈S r(x) log p(x)

q(x) = D(p‖q); that isfor any Y ∼ r ∈ C,

E[log

p(Y )q(Y )

]= D(p‖q).

D(r‖q)−D(p‖q) = D(r‖p) ≥ 0.

(b) p∗ exists and p∗ = p.

(c) D(p∗‖q) = D(p‖q) = (λ0 +∑mk=1 λkµk) log e.

Example 3.18. Suppose pmf q = P(b) and C is the class ofpmf with mean λ. Then, the MCEPD is P(λ) [13, p 176–177].

3.19. Alternating Minimization Procedure : Given twoconvex sets A and B of PMF p and q, the minimum relativeentropy between this two sets of distributions is defined as

dmin = minp∈A,q∈B

D(p‖q).

Suppose we first take any distribution p(1) in A, and find adistribution q(1) in B that is closet to it. Then fix this q(1)

and find the closet distribution in A. Repeating this process,then the relative entropy converges to dmin [6, p 332]. (Seealso [8].)

10

Page 11: Information-Theoretic Identities

3.20. Let p(x)Q(y|x) be a given joint distribution with cor-responding distributions q(y), P (x, y), and P (x|y).

(a) argminr(y)

D (p(x)Q(y|x)‖p(x)r(y)) = q(y).

(b) argmaxr(x|y)

∑x,y p(x)Q(y|x) log r(x|y)

p(x) = P (x|y).

[6, Lemma 10.8.1 p 333]

3.21. Related quantities

(a) J-divergence

J(p, q) = 12 (D(p‖q) +D(q‖p)).

Average of two K-L distances.

Symmetric.

(b) Resistor average

R(p, q) = D(p‖q)D(q‖p)D(p‖q)+D(q‖p) .

Parallel resistor formula:

1R(p, q)

=1

D(p‖q)+

1D(q‖p)

.

Does not satisfy the triangle inequality.

4 Mutual Information: I

4.1. The mutual information I(X;Y ) between two ran-dom variables X and Y is defined as

I (X;Y ) = E[log

p (X,Y )p (X) q (Y )

](7)

= E[log

P (X |Y )p (X)

](8)

= E[log

Q (Y |X )q (Y )

](9)

=∑x∈X

∑y∈Y

p (x, y) logp (x, y)p (x) q (y)

(10)

= D (p (x, y)‖ p (x) q (y)) (11)= H (X) +H (Y )−H (X,Y ) (12)= H (Y )−H (Y |X ) (13)= H (X)−H (X |Y ) , (14)

where p (x, y) = P [X = x, Y = y] , p (x) = P [X = x] , q (y) =P [Y = y] , P (x |y ) = P [X = x |Y = y ] , and Q (y |x ) =P [Y = y |X = x ].

I(X;Y ) = I(Y ;X).

The mutual information quantifies the reduction in theuncertainty of one random variable due to the knowl-edge of the other. It can be regarded as the informationcontained in one random variable about the other.

The name mutual information and the notation I(X;Y )was introduced by [Fano 1961 Ch 2].

Mutual information is a measure of the amount of infor-mation one random variable contains about another [6, p13]. See (13) and (14).

By (11), mutual information is the (Kullback-Leibler)divergence between the joint and product-of-marginaldistributions. Hence, it is natural to think of I(X;Y ) asa measure of how far X and Y are from being independent.

If we define

I (x) = E[

logQ (Y |x )q (Y )

∣∣∣∣X = x

]= E

[log

P (x|Y )p (x)

∣∣∣∣X = x

]= E [− log q (Y )|X = x]−H (Y |x ) ,

thenI(X;Y ) = E [I(X)] .

4.2. I(X;Y ) ≥ 0 with equality if and only if X and Y areindependent.

If X or Y is deterministic, then I(X;Y ) = 0.

4.3. I (X;X) = H (X). Hence, entropy is the self-information.

4.4. I(X;Y ) ≤ min H(X), H(Y ).

Example 4.5. Consider again the thinned Poisson examplein (2.8). The mutual information is

I (X;Y ) = H (P (λ))−H (P ((1− s)λ)) .

Example 4.6. Binary Channel. Let

X = Y = 0, 1; p (1) = P [X = 1] = p = 1− P [X = 0] = 1− p (0);

T = [P [Y = j |X = i ]] =[

1− a ab 1− b

]=[a ab b

].

p = 1− p and q = 1− q. The distribution vectors of X and Y : p =

[p p

]and

q =[q q

].

Alternative Proof.

( ) ( ) ( ) ( ); ; , ; ; ; 0 0 0I X Y Z I X Y Z I X Z Y I Y Z X= − − = − −

X Y

Z

0 0

0

X

Y

Z • Suppose I is an index set. Define a random vector ( ):I iX X i I= ∈ .

• Suppose we have nonempty disjoint index sets A, B. Then,

( ); 0A BI X X = iff nonempty A A∀ ⊂ nonempty B B∀ ⊂ . ( ); 0BAI X X =

Proof. “⇐” is obvious because A A⊂ and B B⊂ . “⇒” We can write . The above result (*) then gives ( ) ( \; ; ,A B A B B BI X X I X X X= )

( );A BI X X = 0 . Now write ( ) ( )\; ,A B A A A; BI X X I X X X= . Then, (*) gives

( ); 0BAI X X = .

• 1 2, , , nX X X… are independent iff

• . ( ) ( )11

nn

ii

H X H X=

= ∑

• ( )11; 0i

iI X X − = i∀

Proof. ( ) ( ) ( ) ( )

( ) ( )( ) ( )

11 1

1 1 1

1 11 1

1 1

0

;

n n nn i

i ii i i

n ni i

i i ii i

H X H X H X X H X

H X X H X I X X

= = =

− −

= =

= − = −

= − =

∑ ∑ ∑

∑ ∑

i

This happens iff ( )11; 0i

iI X X − = i∀

Alternative Proof. This is obvious from 1 2, , , nX X X… are independent iff i∀ iX and

11iX − are independent.

Example • Binary Channel:

0,1= =X Y ,

( ) ( )1 Pr 1 1 Pr 0 1 0p X p X p= = = = − = = − ,

1Pr

1a a a a

T Y j X ib b b−⎡ ⎤ ⎡

b⎡ ⎤= = = = =⎢ ⎥ ⎢⎣ ⎦ −⎣ ⎦ ⎣

⎤⎥⎦

.

0 0

1 1

a

b

1 a−

1 b−

X Y

Figure 9: Binary Channel

Then,

11

Page 12: Information-Theoretic Identities

P = [P [X = i, Y = j]] =[pa papb pb

].

q = pT =[pa+ pb pa+ pb

].

T = [P [X = j |Y = i ]] =

[pa

pa+pbpb

pa+pbpa

pa+pbpb

pa+pb

].

H (X) = h (p).

H(Y ) = h (pa+ pb).

H (Y |X ) = ph (a) + ph (b) = ph (a) + ph(b).

I (X;Y ) = h(pa+ pb

)−(ph (a) + ph

(b)).

Recall that h is concave ∩.

For binary symmetric channel (BSC), we set a = b = α.

4.7. The conditional mutual information is defined as

I (X;Y |Z ) = H (X |Z )−H (X |Y,Z )

= E[log

p (X,Y |Z )P (X |Z ) p (Y |Z )

]=∑z

p (z) I (X;Y |Z = z ),

where I (X;Y |Z = z ) = E[

log p(X,Y |z )P (X|z )p(Y |z )

∣∣∣ z]. I (X;Y |z ) ≥ 0 with equality if and only if X and Y are

independent given Z = z.

I (X;Y |Z ) ≥ 0 with equality if and only if X and Y areconditionally independent given Z; that is X − Y − Zform a Markov chain.

4.8. Chain rule for information:

I (X1, X2, . . . , Xn;Y ) =n∑i=1

I (Xi;Y |Xi−1, Xi−1, . . . , X1 ),

or simply

I (Xn1 ;Y ) =

n∑i=1

I(Xi;Y

∣∣Xi−11

).

In particular, I (X1, X2;Y ) = I (X1;Y ) + I (X2;Y |X1 ). Sim-ilarly,

I (Xn1 ;Y |Z ) =

n∑i=1

I(Xi;Y

∣∣Xi−11 , Z

).

4.9. Mutual information (conditioned or not) between setsof random variables can not be increased by removing randomvariable(s) from either set:

I (X1, X2;Y |Z) ≥ I (X1;Y |Z) .

See also (5.6). In particular,

I (X1, X2;Y1, Y2) ≥ I (X1;Y1) .

4.10. Conditional v.s. unconditional mutual information:

If X, Y , and Z forms a Markov chain (any order is OK),then I(X;Y ;Z) ≥ 0 and conditioning only reduces mu-tual information: I (X;Y |Z ) ≤ I (X;Y ) , I (X;Z |Y ) ≤I (X;Z) , and I (Y ;Z |X ) ≤ I (Y ;Z).

Furthermore, if, for example, X and Z are not indepen-dent, then I(X;Z) > 0 and I (X;Y ) > I (X;Y |Z ). Inparticular, let X has nonzero entropy, and X = Y = Z,then I (X;Y ) = h (X) > 0 = I (X;Y |Z ).

If any of the two r.v.’s among X, Y, and Z are indepen-dent, then I(X;Y ;Z) ≤ 0 and conditioning only increasesmutual information: I (X;Y ) ≤ I (X;Y |Z ) , I (X;Z) ≤I (X;Z |Y ) , and I (Z;Y ) ≤ I (Z;Y |X ) .

Each case above has one inequality which is easy to see. IfX − Y − Z forms a Markov chain, then, I(X;Z|Y ) = 0. Weknow that I(X;Z) ≥ 0. So, I(X;Z|Y ) ≤ I(X;Z). On theother hand, if X and Z are independent, then I(X;Z) = 0.We know that I(X;Z|Y ) ≥ 0. So, I(X;Z|Y ) ≥ I(X;Z).

4.11 (Additive Triangle Inequality). Let X, Y , Z be threereal- or discrete-valued mutually independent random vari-ables, and let the “+” sign denote real or modulo addition.Then

I(X;X + Z) ≤ I(X;X + Y ) + I(Y ;Y + Z) (15)

[27]. This is similar to triangle inequality if we defined(X,Y ) = I(X;X + Y ), then (15) says

d(X,Z) ≤ d(X,Y ) + d(Y, Z).

4.12. Given processes X = (X1, X2, . . .) and Y =(Y1, Y2, . . .), the information rate between the processesX and Y is given by

I(X;Y ) = limn→∞

1nI (Xn

1 ;Y n1 ) .

4.13 (Generalization of mutual information.). There isn’treally a notion of mutual information common to three randomvariables [6, Q2.25 p 49].

(a) Venn diagrams [1]:

I(X1;X2; · · · ;Xn) = µ∗ (X1 ∩X2 ∩ · · · ∩Xn)

=∑S⊂[n]

(−1)|S|+1H(XS).

See also section 12 on I-measure.

(b) D

(PX

n1

∥∥∥∥ n∏i=1

PXi)

.

4.14. The lautum information [20] is the divergence be-tween the product-of-marginal and joint distributions, i.e.,swapping the arguments in the definition of mutual informa-tion.

12

Page 13: Information-Theoretic Identities

(a) L(X;Y ) = D(pXpY ‖pX,Y )

(b) Lautum (“elegant” in Latin) is the reverse spelling ofmutual.

5 Functions of random variables

The are several occasions where we have to deal with functionsof random variables. In fact, for those who knows I-measure,the diagram in Figure 10 already summarizes almost all iden-tities of our interest.

First, note that ( ) ( ) ( )( )

( ):

X Xf Xx f x y

p y p x p x=

′= ≥∑ if ( )x y′f = .

( )( ) ( ) ( ) ( ) ( )

( )( )

( ) ( )

( ) ( ) ( )( )

( ) ( )( )

( ) ( ) ( )

:

:

:

log

log

log

log

log

f X f Xy

X f Xy x f x y

X f Xy x f x y

X Xy x f x y

X Xx

H f X p y p y

p x p y

p x p y

p x p x

p x p x H X

=

=

=

= −

⎛ ⎞= − ⎜ ⎟⎜ ⎟

⎝ ⎠⎛ ⎞

= − ⎜ ⎟⎜ ⎟⎝ ⎠⎛ ⎞

≤ − ⎜ ⎟⎜ ⎟⎝ ⎠

= − =

∑ ∑

∑ ∑

∑ ∑

• ( )( ) ( ),H X g X H X=

Proof. ( )( ) ( ) ( )( ),H X g X H X H g X X= +0

• ( )( ), 0H g X X Y = , ( )( ); 0I g X Y X = .

X

( )g X

Y

0

0

0≥

Proof. 1) ( )( ) 0H g X X = , ( )( ) ( )( ),H g X X H g X X Y≥ , and ( ) 0H ⋅ ≥ ⇒

( )( ), 0H g X X Y = .

2) ( )( ) ( )( ) ( )( ); ,I g X Y X H g X X H g X X Y= − = −0 0 0=

)

.

Or, can use ( ) (;I Z Y H Z≤ . Hence, ( )( ) ( )( )0 ;I Y g X X H g X X 0≤ ≤ = .

Note that can also prove that ( )( ),H g X X Y 0= and ( )( );I g X Y X = 0 together by

argue that ( )( ) ( )( ) ( )( ), ;H g X X H g X X Y I g X Y X= + 0= . Because both of the summands are nonnegative, they both have to be 0.

• ( )( ) ( ) ( )( ),H Y X g X H Y X H Y g X= ≤

Figure 10: Information diagram for X and g(X)

5.1. I(X; g(X)) = H(g(X)).

5.2. When X is given, we can simply disregard g(X).

H (g (X) |X ) = 0. In fact, ∀x ∈ X , H (g (X) |X = x) =0. That is, given X, g(X) is completely determined andhence has no uncertainty.

H (g (X) |X,Y ) = 0.

I (g (X) ;Y |X ) = I (g (X) ;Y |X,Z ) = 0.

5.3. g(X) has less uncertainty than X. H (g (X)) ≤ H (X)with equality if and only if g is one-to-one (injective). Thatis deterministic function only reduces the entropy. Similarly,H (X |Y ) ≥ H (g (X) |Y ).

5.4. We can “attach” g(X) to X. Suppose f, g, v are deter-ministic function on appropriate domain.

H (X, g (X)) = H (X).

H (Y |X, g (X) ) = H (Y |X ) andH (X, g (X) |Y ) = H (X |Y ).An expanded version is H (X, g (X) |Y, f (Y ) ) =H (X |Y, f (Y ) ) = H (X, g (X) |Y ) = H (X |Y ).

H (X, g (X) , Y ) = H (X,Y ).An expended version isH (X,Y, v (X,Y )) = H (X, g (X) , Y, f (Y )) =H (X, g (X) , Y ) = H (X,Y, f (Y )) = H (X,Y ).

I (X, g (X) ;Y ) = I (X;Y ).An expanded version is I (X, g (X) ;Y, f (Y )) =I (X, g (X) ;Y ) = I (X;Y, f (Y )) = I (X;Y ).

5.5. I (X, f (X,Z) ;Y, g (Y,Z) |Z ) = I (X;Y |Z ).

5.6. Compared to X, g(X) gives us less information aboutY .

H (Y |X, g (X) ) = H (Y |X ) ≤ H (Y |g (X) ).

I (X;Y ) ≥ I (g (X) ;Y ) ≥ I (g(X); f(Y )). Note that thisagrees with the data-processing theorem (6.1 and 6.4)using the chain f (X)−X − Y − g (Y ).

5.7. I (X; g (X) ;Y ) = I (g (X) ;Y ) ≥ 0.

5.8. If X = g1(W ) and W = g2(Y ), then H (X |Y ) ≤H (W |Y ) ≤ H

(W∣∣∣W )

. The statement is also true when

W −X − Y − W forms a Markov chain and X = g1(W ).Remark: If W −X−Y − W forms a Markov chain, we have

H (W |Y ) ≤ H(W∣∣∣W )

, but it is not true in general thatH (X |Y ) ≤ H (W |Y ).

5.9. For the following chain

X −→ g (·) Y=g(X)−−−−−→ Q (· |· ) −→ Z

where g is deterministic and Q is a probabilistic channel,

H (Z |X ) = H (Z |g (X) ) = H (Z |X, g (X) ).

I (Z;X) = I (Z; g (X)) = I (X; g (X) ;Z) ≥ 0.

Again, the diagram in Figure 11 summarizes the above results:

( )( ) ( )( ) ( )( ) ( )( )

( )( )( )( )

,

,

xg x y

xg x y

p g X z g X y p y z p x

p g X z X x

=

=

Φ = = = Φ =

= Φ = =

Hence,

( )( )( )H g X XΦ

( )( )( ) ( )( )( )( )( )( ) ( )( )( )

( )

( )( )( ) ( )( ) ( )( )( )

( )( ) ( )( )( ) ( )( )( )( )

( )( ) ( )( )( ) ( )( ) ( )( )( )( )( ) ( )( )

, log

, log

, log

log ,

log ,

z x

z y xg x y

z y xg x y

z y xg x y

z y

p g X z X x p g X z X x

p g X z X x p g X z X x

p g X z X x p g X z g X y

p g X z g X y p g X z X x

p g X z g X y p g X z g X y

H g X g X

=

=

=

= Φ = = Φ = =

= Φ = = Φ = =

= Φ = = Φ = =

⎛ ⎞⎜ ⎟= Φ = = Φ =⎜ ⎟⎜ ⎟⎝ ⎠

= Φ = = Φ = =

= Φ

∑∑

∑∑ ∑

∑∑ ∑

∑∑ ∑

∑∑

=

• Consider ( ) ( ) ( )Y g XX g Q= Z⎯⎯→ ⋅ ⎯⎯⎯⎯→ ⋅ ⋅ ⎯⎯→ . Then,

X

( )g X

Z

0

0

0 0≥

• ( ) ( )( ) ( )( ),H Z X H Z g X H Z X g X= = .

• ( ) ( )( ) ( )( ); ; ; ;I Z X I Z g X I X g X Z= = 0≥ .

• Let , and , then ( )1X g W= ( )2W g Y= ( ) ( ) ( )ˆH X Y H W Y H W W≤ ≤ .

Figure 11: Information diagram for Markov chain where thefirst transition is a deterministic function

5.10. If ∀y g(x, y) is invertible as a function of x,then H(X|Y ) = H(g(X,Y )|Y ). In fact,∀y H(X|y) =H(g(X,Y )|y).

For example, g(x, y) = x − y or x + y. So, H(X|Y ) =H(X + Y |Y ) = H(X − Y |Y ).

5.11. If T (Y), a deterministic function of Y , is a sufficientstatistics for X. Then, I (X; Y) = I (X;T (Y)).

13

Page 14: Information-Theoretic Identities

5.12. Data Processing Inequality for relative entropy:Let Xi be a random variable on X , and Yi be a random variableon Y. Q (y |x ) is a channel whose input and output are Xi

and Yi, respectively. Then,

D (Y1 ‖Y2 ) ≤ D (X1 ‖X2 ) .

In particular,

D (g (X1) ‖g (X2) ) ≤ D (X1 ‖X2 ) .

Remark: The same channel Q or, in the second case, samedeterministic function g, are applied to X1 and X2. See also(3.14).

6 Markov Chain and Markov Strings

6.1. Suppose X − Y − Z form a Markov chain.

Data processing theorem:

I (X;Y ) ≥ I (X;Z) .

The interpretation is that no clever manipulation of thedata (received data) can improve the inferences that canbe made from the data.

I (Z;Y ) ≥ I (Z;X).

X Y Z

( );I X Z ( );I X Y ( );I X Y Z

Figure 12: Data processing inequality

I (X;Y ) ≥ I (X;Y |Z ); that is the dependence of X andY is decreased (or remain unchanged) by the observationof a ”downstream” random variable Z. In fact, we alsohave I (X;Z |Y ) ≤ I (X;Z) , and I (Y ;Z |X ) ≤ I (Y ;Z)(see 4.10). That is in this case conditioning only reducesmutual information.

6.2. Markov-type relations

(a) Disappearing term: The following statements are equiva-lent:

I (X;Y,Z) = I (X;Y ) I (X;Z |Y ) = 0. X − Y − Z forms a Markov chain.

(b) Disappearing term: The following statements are equiva-lent:

I (X;Y, Z |W ) = I (X;Y |W ) I (X;Z |Y,W ) = 0 X − (Y,W )− Z forms a Markov chain.

(c) Term moved to condition: The following statements areequivalent:

I (X;Y, Z) = I (X;Y |Z ) I (X;Z) = 0 X and Z are independent.

(d) Term moved to condition: The following statements areequivalent:

I (X;Y, Z |W ) = I (X;Y |Z,W ) I (X;Z|W ) = 0 X −W − Z forms a Markov chain.

6.3. Suppose U − X − Y − V form a Markov chain, thenI (X;Y ) ≥ I (U ;V ).

6.4. Consider a (possibly non-homogeneous) Markov chain(Xi).

If k1 ≤ k2 < k3 ≤ k4, then I (Xk1 ;Xk4) ≤ I (Xk2 ;Xk3).(Note that Xk1−Xk2−Xk3−Xk4 is still a Markov chain.)

Conditioning only reduces mutual information when therandom variables involved are from the Markov chain.

6.5. Consider two Markov chains governed by pi(xn0 ) =pi(x0)

∏n−1k=0 pi(xk+1|xk) for i = 1, 2. The relative entropy

D (p1(xn0 )‖pi(xn0 )) is given by

D (p1 (x0) ‖p2 (x0) ) +n−1∑k=0

D (p1 (xk+1 |xk ) ‖p2 (xk+1 |xk ) ).

6.6. Consider two (possibly non-homogeneous) Markov chainswith the same transition probabilities. Let p1 (xn) and p2 (xn)be two p.m.f. on the state space Xn of a Markov chain at timen. (They comes possibly from different initial distributions.)Then,

( )nP i j

( )nP i j

( ) 1 np x

( ) 2 np x

( ) 1 0p x

( ) 2 0p x

( ) 1 1np x −

( ) 2 1np x −

X ( )g X

Y

0

0

0≥

The relative entropy D (p1 (xn) ‖p2 (xn) ) decreases withn:

D (p1 (xn+1) ‖p2 (xn+1) ) ≤ D (p1 (xn) ‖p2 (xn) )

14

Page 15: Information-Theoretic Identities

D (p1 (xI) ‖p2 (xI) ) = D (p1 (xmin I) ‖p2 (xmin I) ) whereI is some index set and pi (xI) is the distribution for therandom vector XI = (Xk : k ∈ I) of chain i.

For homogeneous Markov chain, if the initial distribu-tion p2 (x0) of the second chain is the stationary dis-tribution p, then ∀n and x we have p2 (xn = x) = p (x)and D (p1 (xn) ‖p (xn) ) is a monotonically decreasing non-negative function of n approaching some limit. The limitis actually 0 if the stationary distribution is unique.

Consider homogeneous Markov chain. Let p (xn) be thep.m.f. on the state space at time n. Suppose the stationarydistribution p exists.

Suppose the stationary distribution is non-uniform andwe set the initial distribution to be uniform, then theentropy H (Xn) = H (p (xn)) decreases with n.

Suppose the stationary distribution is uniform, then,H (Xn) = H (p (xn)) = log |X | − D (p (xn) ‖u ) ismonotone increasing.

6.1 Homogeneous Markov Chain

For this subsection, we consider homogeneous Markov chain.

6.7. H(X0|Xn) is non-decreasing with n; that isH(X0|Xn) ≥H(X0|Xn+1) ∀n.

6.8. For a stationary Markov process (Xn),

H(Xn) = H(X1), i.e. is a constant ∀n.

H(Xn

∣∣Xn−11

)= H (Xn |Xn−1 ) = H (X1 |X0 ).

H(Xn|X1) increases with n. That is H (Xn |X1 ) ≥H (Xn−1 |X1 ) .

7 Independence

7.1. I (Z;X,Y ) = 0 if and only if Z and (X,Y ) are inde-pendent. In which case,

I (Z;X,Y ) = I (Z;X) = I (Z;Y ) = I (Z;Y |X) =I (Z;X|Y ) = 0.

I (X;Y |Z ) = I (X;Y ).

I (X;Y ;Z) = 0.

7.2. Suppose we have nonempty disjoint index sets A, B.Then, I (XA;XB) = 0 if and only if ∀ nonempty A ⊂ A and∀ nonempty B ⊂ B we have I (XA;XB) = 0.

7.3. D(PX

n1

∥∥∥∥ n∏i=1

PXi)

=(

n∑i=1

H (Xi))−H (Xn

1 ) =

n−1∑i=1

I(Xi

1;Xi+1

)=n−1∑i=1

I(Xi;Xn

i+1

). Notice that this

function is symmetric w.r.t. its n arguments. It admits anatural interpretation as a measure of how far the Xi arefrom being independent [15].

D

(PX

n1

∥∥∥∥ n∏i=1

PXi)≥

n∑i=1

I (Xi;Y )− I (Xn1 ;Y ).

D

(PX

n1

∥∥∥∥ n∏i=1

PXi)≤(

n∑i=1

H (Xi))−maxiH (Xi) .

7.4. Suppose we have a collection of random variablesX1, X2, . . . , Xn, then the following statements are equivalent:

X1, X2, . . . , Xn are independent.

H (Xn1 ) =

n∑i=1

H (Xi).

I(Xi

1;Xi+1

)= 0 ∀i ∈ [n− 1].

I(Xi;Xn

i+1

)= 0 ∀i ∈ [n− 1].

D

(PX

n1

∥∥∥∥ n∏i=1

PXi)

= 0.

7.5. The following statements are equivalent:

(a) X1, X2, . . . , Xn are mutually independent conditioningon Y (a.s.).

(b) H (X1, X2, . . . , Xn|Y ) =n∑i=1

H (Xi|Y ).

(c) p (xn1 |y ) =n∏i=1

p (xi |y ).

(d) ∀i ∈ [n] \ 1 p(xi∣∣xi−1

1 , y)

= p (xi |y ).

(e) ∀i ∈ [n] I(Xi;X[n]\i |Y

)= 0.

(f) ∀i ∈ [n] Xi and the vector (Xj)[n]\i are independentconditioning on Y .

(g) H (X1, X2, . . . , Xn|Y ) =n∑i=1

H(Xi|X[n]\i, Y

).

(h) D

(PX

n1

∥∥∥∥ n∏i=1

PXi)

=n∑i=1

I (Xi;Y ) − I (Xn1 ;Y ). (See

also (7.3).)

8 Convexity

8.1. H (p (x)) is concave ∩ in p (x).

8.2. H (Y |X ) is a linear function of p(x) for fixedQ(y|x).

8.3. H (Y ) is a concave function of p(x) for fixed Q(y|x).

8.4. I(X;Y ) = H(Y ) − H(Y |X) is a concave function ofp(x) for fixed Q(y|x).

8.5. I(X;Y ) = D (p(x, y)‖p(x)q(y)) is a convex function ofQ(y|x) for fixed p(x).

8.6. D (p ‖q ) is convex ∪ in the pair (p, q).For fixed p, D (p ‖q ) is convex ∪ functional of q. For fixed

q, D (p ‖q ) is convex ∪ functional of p.

15

Page 16: Information-Theoretic Identities

9 Continuous Random Variables

The differential entropy h(X) or h (f) of an absolutelycontinuous random variable X with a density f(x) is definedas

h(X) = −∫Sf(x) log f(x)dx = −E [log f(X)] ,

where S is the support set of the random variable.

It is also known as Boltzmann entropy or Boltzmann’sH-function.

Differential entropy is the “entropy” of a continuous ran-dom variable. It has no fundamental physical meaning,but occurs often enough to have a name [3].

As in the discrete case, the differential entropy dependsonly on the probability density of the random variable,and hence the differential entropy is sometimes writtenas h(f) rather than h(X).

As in every example involving an integral, or even adensity, we should include the statement “if it exists”.It is easy to construct examples of random variables forwhich a density function does not exist or for which theabove integral does not exist.

9.1. Unlike discrete entropy, the differential entropy h(X)

can be negative. For example, consider the uniform dis-tribution on [0, a]. h(X) can even be −∞ [12, lemma1.3];

is not invariant under invertible coordinate transformation[3]. See also 9.5.

9.2. For any one-to-one differentiable g, we have

h(g(X)) = h(X) + E log |g′(X)|.

Interesting special cases are as followed:

Differential entropy is translation invariant:

h(X + c) = h(X).

In fact,h(aX + b) = h(X) + log |a| .

h(eX)

= h (X) + EX

See also 9.5.

9.3. Examples:

(a) Uniform distribution on [a, b]: h(X) = log(b− a). Notethat h(X) < 0 if and only if 0 < b− a < 1.

(b) Triangular-shape pdf with support on [a, b] with heightA = 2

b−a :

h(X) =12− ln(A) = ln(b− a)−

(ln(2)− 1

2

)[nats]

≈ log2(b− a)− 0.28 [bits].

(c) N(m, σ²): h(X) = (1/2) log(2πeσ²) bits. Note that h(X) < 0 if and only if σ < 1/√(2πe) ≈ 0.242; see Figure 13. Let Z = X_1 + X_2, where X_1 and X_2 are independent normal random variables with means µ_i and variances σ_i², i = 1, 2. Then h(Z) = (1/2) log(2πe(σ_1² + σ_2²)).

Figure 13: h(X_N) = (1/2) log(2πeσ²) as a function of σ; the curve crosses zero at σ = 1/√(2πe) ≈ 0.242.

(d) Γ(q, λ):
h(X) = (log e)(q + (1 − q)ψ(q) + ln(Γ(q)/λ)) = h̃(q) + log σ_X,
where ψ(z) = (d/dz) ln Γ(z) = Γ′(z)/Γ(z) is the digamma function and
h̃(q) = (log e)(q + (1 − q)ψ(q) + ln(Γ(q)/√q))
is a strictly increasing function. h̃(1) = log e, which agrees with the exponential case. By the CLT, lim_{q→∞} h̃(q) = (1/2) log 2πe ≈ 2.0471, which agrees with the Gaussian case.

(e) Note that the entropy of any pdf of the form c e^{∑_{k=1}^m λ_k g_k(x)} with support S (e.g. R, [a, b], or [s_0, ∞)) is
h(X) = −log c − ∑_{k=1}^m λ_k E[g_k(X)] log e.
The MEPDs in 9.22 are of this form.

For the bounded exponential on [a, b] (density c e^{−αx}), the entropy is −log c + αµ log e, where simple manipulation shows µα = 1 + c(a e^{−αa} − b e^{−αb}).

More examples can be found in Table 2 and [25].
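The closed forms above are easy to confirm numerically. The short MATLAB sketch below (added here for illustration, not from the original text) integrates −f log_2 f for a uniform, a Gaussian, and an exponential density and compares the results with the formulas; the parameter values are arbitrary.

% Numerical check of the differential entropies in 9.3 (values in bits).
h = @(f,a,b) integral(@(x) -f(x).*log2(f(x)), a, b);   % -int f log2 f

a = 0; b = 3;                                 % uniform on [a,b]
fU = @(x) (1/(b-a))*ones(size(x));
[h(fU,a,b), log2(b-a)]                        % both ~ 1.585

m = 1; s = 2;                                 % N(m, s^2)
fN = @(x) exp(-(x-m).^2/(2*s^2))/sqrt(2*pi*s^2);
[h(fN,-40,40), 0.5*log2(2*pi*exp(1)*s^2)]     % both ~ 3.047

mu = 0.7;                                     % exponential with mean mu
fE = @(x) (1/mu)*exp(-x/mu);
[h(fE,0,80), log2(exp(1)*mu)]                 % both ~ 0.928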

9.4. Relation of Differential Entropy to Discrete Entropy:


Differential entropy is not the limiting case of discrete entropy. Consider a random variable X with density f(x). Suppose we divide the range of X into bins of length ∆, and assume that the density is continuous within the bins. Then, by the mean value theorem, there exists a value x_i within each bin such that f(x_i)∆ = ∫_{i∆}^{(i+1)∆} f(x) dx. Define the quantized (discrete) random variable X^∆ by X^∆ = x_i if i∆ ≤ X < (i + 1)∆.

(a) If the density f(x) of the random variable X is Riemann integrable, then, as ∆ → 0, H(X^∆) + log ∆ → h(X). That is, H(X^∆) ≈ h(X) − log ∆. (A numerical check appears below.)

(b) When ∆ = 2^{−n}, we call X^∆ the n-bit quantization of X. The entropy of an n-bit quantization is approximately h(X) + n.

(c) h(X) + n is the number of bits, on average, required to describe X to n-bit accuracy.

(d) H(X^∆ | Y^∆) ≈ h(X|Y) − log ∆.

[6, Section 8.3]. Another interesting relationship is that of Figure 4, where we approximate the entropy of a Poisson r.v. by the differential entropy of a Gaussian r.v. with the same mean and variance. Note that in this case ∆ = 1 and hence log ∆ = 0.
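To see item (a) above numerically, one can quantize a continuous density into bins of width ∆ and compare H(X^∆) + log ∆ with h(X). The MATLAB sketch below (an added illustration, not from the original) does this for a standard Gaussian; it assumes entropy2 from Section 13 is on the path.

% H(X^Delta) + log2(Delta) -> h(X) as Delta -> 0 (see 9.4(a)).
% Assumes entropy2.m from Section 13 is available on the path.
hX = 0.5*log2(2*pi*exp(1));              % h(X) for a standard Gaussian, ~2.047 bits
for Delta = [1 0.5 0.1 0.01]
    edges = -10:Delta:10;                % bins covering essentially all the mass
    c = 0.5*(1 + erf(edges/sqrt(2)));    % standard Gaussian CDF at the bin edges
    p = diff(c);  p = p/sum(p);          % bin probabilities, renormalized
    [Delta, entropy2(p) + log2(Delta), hX]
end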

The differential entropy of a set X_1, . . . , X_n of random variables with density f(x_1^n) is defined as
h(X_1^n) = −∫ f(x_1^n) log f(x_1^n) dx_1^n.

If X, Y have a joint density function f(x, y), we can define the conditional differential entropy h(X|Y) as
h(X|Y) = −∫ f(x, y) log f(x|y) dx dy = −E[log f_{X|Y}(X|Y)] = h(X, Y) − h(Y) = ∫ f_Y(y) h(X|Y = y) dy,
where h(X|Y = y) = −∫ f_{X|Y}(x|y) log f_{X|Y}(x|y) dx.

9.5. Let X and Y be two random vectors in R^k such that Y = g(X), where g is a one-to-one differentiable transformation. Then
f_Y(y) = (1/|det(dg(x))|) f_X(x), with x = g^{−1}(y),
and hence
h(Y) = h(X) + E[log |det(dg(X))|].
In particular,
h(AX + B) = h(X) + log |det A|.
Note also that, for general g,
h(Y) ≤ h(X) + E[log |det(dg(X))|].

9.6. Examples [9]:

Let X_1^n have a multivariate normal distribution with mean µ and covariance matrix Λ. Then h(X_1^n) = (1/2) log((2πe)^n |Λ|) bits, where |Λ| denotes the determinant of Λ. In particular, if the X_i are independent normal r.v. with the same variance σ², then h(X_1^n) = (n/2) log(2πeσ²).

For any random vector X = X_1^n, we have
E[(X − µ)^T Λ_X^{−1} (X − µ)] = ∫ f_X(x)(x − µ)^T Λ_X^{−1}(x − µ) dx = n.

Exponential family: Suppose f_X(x) = (1/c(θ)) e^{θ^T T(x)}, where the real-valued normalization constant is c(θ) = ∫ e^{θ^T T(x)} dx. Then
h(X) = ln c(θ) − (1/c(θ)) θ^T (∇_θ c(θ)) = ln c(θ) − θ^T (∇_θ ln c(θ)).
In the 1-D case, we have f_X(x) = (1/c(θ)) e^{θ·T(x)} and
h(X) = ln c(θ) − (θ/c(θ)) c′(θ) = ln c(θ) − θ (d/dθ) ln c(θ),
where c(θ) = ∫ e^{θ·T(x)} dx. See also (9.22).

Let Y = (Y_1, . . . , Y_k) = (e^{X_1}, . . . , e^{X_k}). Then h(Y) = h(X) + (log e) ∑_{i=1}^k E X_i. Note that if X is jointly Gaussian, then Y is lognormal.
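As a sanity check of the 1-D exponential-family formula above (an added sketch, not in the original), take the zero-mean Gaussian written as f_X(x) = (1/c(θ)) e^{θx²} with T(x) = x² and θ = −1/(2σ²); then c(θ) = √(π/(−θ)) and the formula reproduces h(X) = (1/2) ln(2πeσ²). The MATLAB lines below evaluate c(θ) and (d/dθ) ln c(θ) numerically (finite difference) instead of symbolically; σ and the step size are arbitrary.

% 1-D exponential family: h(X) = ln c(theta) - theta * d/dtheta ln c(theta).
% Example: T(x) = x^2, theta = -1/(2*sigma^2)  =>  X ~ N(0, sigma^2).
sigma = 1.7;
theta = -1/(2*sigma^2);
c  = @(t) integral(@(x) exp(t*x.^2), -Inf, Inf);    % normalizing constant c(theta)
dt = 1e-6;                                          % finite-difference step
dlnc = (log(c(theta+dt)) - log(c(theta-dt)))/(2*dt);
h_formula = log(c(theta)) - theta*dlnc;             % entropy in nats
h_gauss   = 0.5*log(2*pi*exp(1)*sigma^2);           % closed form for N(0,sigma^2)
[h_formula, h_gauss]                                % agree to ~1e-5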

9.7. h(X|Y) ≤ h(X) with equality if and only if X and Y are independent.

9.8. Chain rule for differential entropy: h(X_1^n) = ∑_{i=1}^n h(X_i | X_1^{i−1}).

9.9. h(X_1^n) ≤ ∑_{i=1}^n h(X_i), with equality if and only if X_1, X_2, . . . , X_n are independent.

9.10 (Hadamard's inequality). |Λ| ≤ ∏_{i=1}^n Λ_ii, with equality iff Λ_ij = 0 for i ≠ j, i.e., with equality iff Λ is a diagonal matrix.


The relative entropy (Kullback–Leibler distance) D(f‖g) between two densities f and g (with respect to the Lebesgue measure m) is defined by
D(f‖g) = ∫ f log (f/g) dm. (16)

It is finite only if the support set of f is contained in the support set of g. (I.e., it is infinite if ∃x_0 with f(x_0) > 0 and g(x_0) = 0.)

D(f‖g) is infinite if for some region R, ∫_R g(x) dx = 0 and ∫_R f(x) dx ≠ 0.

For continuity, assume 0 log(0/0) = 0. Also, for a > 0, a log(a/0) = ∞ and 0 log(0/a) = 0.

D(f_X‖f_Y) = E[log (f_X(X)/f_Y(X))] = −h(X) − E[log f_Y(X)].

Relative entropy usually does not satisfy the symmetry property. In some special cases it can be symmetric, e.g. (23).

9.11. D(f‖g) ≥ 0 with equality if and only if f = g almost everywhere (a.e.).

9.12. Relative entropy is invariant under invertible coordinate transformations such as scale changes and rotations of the coordinate axes.

9.13. Relative entropy and the uniform random variable: Let U be uniformly distributed on a set S. For any X with the same support,
−E log f_U(X) = h(U)
and
D(f_X‖f_U) = h(U) − h(X) ≥ 0.
This is a special case of (24).

9.14. Relative entropy and the exponential random variable: Consider X on [0, ∞). Suppose X_E is exponential with the same mean as X. Then
−E log f_{X_E}(X) = h(X_E)
and
D(f_X‖f_{X_E}) = h(X_E) − h(X) ≥ 0.
This is a special case of (24).

The mutual information I(X;Y) between two random variables with joint density f(x, y) is defined as
I(X;Y) = ∫ f(x, y) log [f(x, y)/(f(x)f(y))] dx dy (17)
= h(X) − h(X|Y) = h(Y) − h(Y|X) (18)
= h(X) + h(Y) − h(X, Y) (19)
= D(f_{X,Y} ‖ f_X f_Y) (20)
= lim_{∆→0} I(X^∆; Y^∆). (21)
Hence, knowing how to find the differential entropy, we can find the mutual information from (19). Also, from (21), the mutual information between two random variables is the limit of the mutual information between their quantized versions. See also 9.4.

9.15. Mutual information is invariant under invertible coordinate transformations; that is, for random vectors X_1 and X_2 and invertible functions g_1 and g_2, we have
I(g_1(X_1); g_2(X_2)) = I(X_1; X_2).
See also 9.12.

9.16. I(X;Y) ≥ 0 with equality if and only if X and Y are independent.

9.17. Gaussian Random Variables and Vectors

(a) Gaussian upper bound for differential entropy: For any random vector X in R^n,
h(X) ≤ (1/2) log((2πe)^n det(Λ_X)),
with equality iff X ∼ N(m, Λ_X) for some m. See also 9.22. Thus, among distributions with the same variance, the normal distribution maximizes the entropy. In particular,
h(X) ≤ (1/2) log(2πeσ_X²).

(b) For any random variable X and Gaussian Z,
−E log f_Z(X) = (1/2) ( log(2πσ_Z²) + (log e) (σ_X² + (EX − EZ)²)/σ_Z² ).

(c) Suppose X_N is Gaussian with the same mean and variance as X. Then
−E log f_{X_N}(X) = −E log f_{X_N}(X_N) = h(X_N) = (1/2) log(2πeσ_X²)
and
h(X_N) − h(X) = D(f_X ‖ f_{X_N}) ≥ 0. (22)
This is a special case of (24).

(d) The relative entropy D(f_X‖f_Y) between n-dimensional random vectors X ∼ N(m_X, Λ_X) and Y ∼ N(m_Y, Λ_Y) is given by
(1/2) log (e^{−n} det Λ_Y / det Λ_X) + (1/2)(log e) (tr(Λ_Y^{−1} Λ_X) + (∆m)^T Λ_Y^{−1} (∆m)),
where ∆m = m_X − m_Y. In 1-D, we have
(1/2) log (e^{−1} σ_Y²/σ_X²) + (1/2)(log e) (σ_X²/σ_Y² + (∆m/σ_Y)²),
or equivalently,
log(σ_Y/σ_X) + (1/2)(log e) (σ_X²/σ_Y² + (∆m/σ_Y)² − 1)
[24, p. 1025]. In addition, when σ_X = σ_Y = σ, the relative entropy is simply
D(f_X‖f_Y) = (1/2)(log e) ((m_X − m_Y)/σ)². (23)
(A numerical check of the 1-D formula appears after this list.)

(e) Monotonic decrease of the non-Gaussianness of the sum of independent random variables [23]: Consider i.i.d. random variables X_1, X_2, . . .. Let S^{(n)} = ∑_{k=1}^n X_k and let S_N^{(n)} be a Gaussian random variable with the same mean and variance as S^{(n)}. Then
D(S^{(n)} ‖ S_N^{(n)}) ≤ D(S^{(n−1)} ‖ S_N^{(n−1)}).

(f) Suppose (X, Y) is a jointly Gaussian random vector with covariance matrix Λ = [Λ_X, Λ_XY; Λ_YX, Λ_Y]. The mutual information between the two jointly Gaussian vectors X and Y is
I(X; Y) = (1/2) log ((det Λ_X)(det Λ_Y) / det Λ).
In particular, for jointly Gaussian random variables X, Y, we have
I(X; Y) = −(1/2) log (1 − (Cov(X, Y)/(σ_X σ_Y))²).
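The 1-D Gaussian relative entropy formula in item (d) above (and the special case (23)) can be checked by direct numerical integration. The MATLAB sketch below is an added illustration, in nats (log base e); the means and variances are arbitrary choices.

% D(f_X || f_Y) for X ~ N(mX,sX^2), Y ~ N(mY,sY^2), in nats (see 9.17(d)).
mX = 1.0; sX = 1.5;  mY = -0.5; sY = 2.0;
fX = @(x) exp(-(x-mX).^2/(2*sX^2))/sqrt(2*pi*sX^2);
fY = @(x) exp(-(x-mY).^2/(2*sY^2))/sqrt(2*pi*sY^2);
D_num = integral(@(x) fX(x).*log(fX(x)./fY(x)), -40, 40);
D_formula = log(sY/sX) + 0.5*(sX^2/sY^2 + ((mX-mY)/sY)^2 - 1);
[D_num, D_formula]        % both ~ 0.350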

9.18. Additive Channel: Suppose Y = X + N. Then h(Y|X) = h(N|X), and thus I(X;Y) = h(Y) − h(N|X), because h is translation invariant. In fact, h(Y|x) is always h(N|x).

Furthermore, if X and N are independent, then h(Y|X) = h(N) and I(X;Y) = h(Y) − h(N). In fact, h(Y|x) is always h(N).

For nonnegative Y (here K denotes the base of the logarithm),
I(X;Y)/EY ≤ [log_K(e EY) − h(N)]/EY ≤ (log_K e) K^{−h(N)},
where the first inequality holds because the exponential Y maximizes h(Y) for fixed EY, and the second because EY = K^{h(N)} maximizes the middle term.

For Y ∈ [s_0, ∞), by the same reasoning,
I(X;Y)/EY ≤ [log_K(e(µ − s_0)) − h(N)]/(µ − s_0) ≤ (log_K e) K^{−h(N)},
but in this case the shifted exponential on [s_0, ∞) maximizes h(Y) for fixed EY = µ, and the second inequality uses µ = s_0 + K^{h(N)}, which maximizes the middle term.

Suppose that for Y ∈ [s_0, ∞) we now want to maximize I(X;Y)/(EY + r) for some r ≥ 0. Then we can use the same technique, but we have to solve numerically for the optimal value µ* of µ. The upper bound in this case is (log_K e)/(µ* − s_0).

9.19. Additive Gaussian Noise [11, 18, 19]: Suppose N is a (proper complex-valued multidimensional) Gaussian noise which is independent of (a complex-valued random vector) X. Here the distribution of X is not required to be Gaussian. Then, for
Y = √SNR X + N,
we have
(d/dSNR) I(X;Y) = E[|X − E[X|Y]|²],
or equivalently, in an expanded form,
(d/dSNR) I(X; √SNR X + N) = E[|X − E[X | √SNR X + N]|²],
where the RHS is the MMSE corresponding to the best estimation of X upon the observation Y for a given signal-to-noise ratio (SNR). Here, the mutual information is in nats.

Furthermore, for a deterministic matrix A, suppose
Y = AX + N.
Then the gradient of the mutual information with respect to the matrix A can be expressed as
∇_A I(X;Y) = A E[(X − E[X|Y])(X − E[X|Y])^H],
where the expectation on the RHS is the covariance matrix of the estimation error vector X − E[X|Y], also known as the MMSE matrix. Here, the complex derivative of a real-valued scalar function f is defined as
df/dx* = (1/2) (∂f/∂Re x + j ∂f/∂Im x),
and the complex gradient matrix is defined as ∇_A f = ∂f/∂A*, where [∂f/∂A*]_ij = ∂f/∂[A*]_ij.
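For a circularly symmetric complex Gaussian input the I-MMSE relation of 9.19 can be verified in closed form: with E|X|² = σ² and proper complex noise of unit variance, I(SNR) = ln(1 + σ² SNR) nats and mmse(SNR) = σ²/(1 + σ² SNR). The MATLAB lines below (an added sketch restricted to this special case, with arbitrary σ² and SNR) compare a finite-difference derivative of I with the MMSE.

% dI/dSNR = mmse(SNR) for Y = sqrt(SNR)*X + N, proper complex Gaussian X and N.
sigma2 = 2;                                  % input variance E|X|^2 (arbitrary)
I    = @(snr) log(1 + sigma2*snr);           % mutual information in nats
mmse = @(snr) sigma2./(1 + sigma2*snr);      % MMSE of estimating X from Y
snr = 0.8;  d = 1e-6;
dIdsnr = (I(snr+d) - I(snr-d))/(2*d);        % numerical derivative of I
[dIdsnr, mmse(snr)]                          % both ~ 0.7692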

9.20. Gaussian Additive Channel: Suppose X and N are independent Gaussian random vectors and Y = X + N. Then
I(X;Y) = (1/2) log (det(Λ_X + Λ_N)/det Λ_N).


In particular, in the one-dimensional case,
I(X;Y) = (1/2) log (1 + σ_X²/σ_N²).

9.21. Let Y = X + Z, where X and Z are independent and X is Gaussian. Then:

among Z with fixed mean and variance, Gaussian Z minimizes I(X;Y);

among Z with fixed EZ², zero-mean Gaussian Z minimizes I(X;Y).

9.1 MEPD

9.22. Maximum Entropy Distributions: Consider fixed (1) a closed set S ⊂ R and (2) measurable functions g_1, . . . , g_m on S. Let C be the class of probability densities f (of a random variable X) which are supported on S (f = 0 on S^c) and satisfy the moment constraints E[g_i(X)] ≡ ∫ f(x) g_i(x) dx = α_i for 1 ≤ i ≤ m. f* is an MEPD for C if ∀f ∈ C, h(f*) ≥ h(f). Define f_λ on S by
f_λ(x) = c_0 e^{∑_{k=1}^m λ_k g_k(x)}, where c_0 = (∫_S e^{∑_{i=1}^m λ_i g_i(x)} dx)^{−1}
and λ_1, . . . , λ_m are chosen so that f_λ ∈ C. Note that f* and f_λ may not exist.

(a) If f* exists (and ∃f ∈ C such that f > 0 on S), then f_λ exists and f_λ = f*.

(b) If f_λ exists, then there exists a unique MEPD f* and f* = f_λ. In this case, ∀f ∈ C,
−E log f_λ(X) = −∫ f(x) log f_λ(x) dx = −∫ f_λ(x) log f_λ(x) dx = h(f_λ)
and
D(f‖f_λ) = h(f_λ) − h(f) ≥ 0. (24)
(See also (9.6).)

c_0 > 0, or equivalently c_0 = e^{λ_0} for some λ_0.

h(f_λ) = −(λ_0 + ∑_{k=1}^m λ_k α_k) log e.

The uniform distribution has maximum entropy among all distributions with bounded support: let S = [a, b], with no other constraints. Then the maximum entropy distribution is the uniform distribution over this range.

The normal distribution is the law with maximum entropy among all distributions with finite variance: if the constraint is on (1) EX² or (2) σ_X², then f* has the same form as a normal distribution, so we just have to find a normal random variable satisfying condition (1) or (2).

The exponential distribution has maximum entropy among all distributions concentrated on the positive half-line and possessing finite expectation: if S = [0, ∞) and EX = µ > 0, then f*(x) = (1/µ) e^{−x/µ} 1_{[0,∞)}(x) (exponential), with corresponding h(X*) = log(eµ). If S = [s_0, ∞) and EX = µ > s_0, then f*(x) = (1/(µ − s_0)) e^{−(x−s_0)/(µ−s_0)} 1_{[s_0,∞)}(x) (shifted exponential), with corresponding h(X*) = log(e(µ − s_0)).
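As a small numerical illustration of the exponential case (an added sketch, not from the original), compare the differential entropy of the exponential density with mean µ against another density on [0, ∞) with the same mean, e.g. a Gamma(2, λ) density with λ = 2/µ; the exponential comes out larger, with h = log(eµ).

% Exponential maximizes h among densities on [0,inf) with a given mean (see 9.22).
mu = 1.5;
h  = @(f) integral(@(x) -f(x).*log2(max(f(x),realmin)), 0, 200);  % entropy in bits
fExp = @(x) (1/mu)*exp(-x/mu);                    % exponential, mean mu
lam  = 2/mu;
fGam = @(x) lam^2*x.*exp(-lam*x);                 % Gamma(2,lam), mean 2/lam = mu
[h(fExp), log2(exp(1)*mu)]                        % exponential: ~2.028 bits
h(fGam)                                           % smaller, ~1.86 bits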

9.23. Most of the standard probability distributions can be characterized as being MEPD. Consider any pdf on S of the form f_X(x) = c_0 e^{∑_{k=1}^m λ_k g_k(x)} with finite E[g_k(X)] for k = 1, . . . , m. Then f_X is the MEPD for the class of pdfs f on S under the constraints given by the E[g_k(X)]'s. Hence, to see which constraints characterize a pdf as an MEPD, rewrite the pdf in exponential form.

S | Constraints | MEPD
(a, b) | no constraint | U(a, b)
[0, ∞) or R | no constraint | N/A
[0, ∞) | EX = µ > 0 | exponential
[s_0, ∞) | EX = µ > s_0 | shifted exp.
[a, b] | EX = µ ∈ [a, b] | truncated exp.
R | EX = µ, Var X = σ² | N(µ, σ²)
[0, ∞) | EX and E ln X | Γ(q, λ)
(0, 1) | E ln X, E ln(1 − X) | β(q_1, q_2)
[0, ∞) | E ln X, E ln(1 + X) | Beta prime
R | E ln(1 + X²) | Cauchy
R | E ln(1 + X²) = 2 ln 2 | Std. Cauchy
(0, ∞) | E ln X = µ, Var ln X = σ² | e^{N(µ,σ²)}
R | E|X| = w | L(1/w)
(c, ∞) | E ln X | Par(α, c)

Table 4: MEPD: Continuous cases. Most of the standard probability distributions can be characterized as being MEPD when values of one or more of the following moments are prescribed: EX, E|X|, EX², E ln X, E ln(1 − X), E ln(1 + X), E ln(1 + X²) [14].

9.24. Let X_1, X_2, . . ., X_n be n independent, symmetric random variables supported on the interval [−1, 1]. The differential entropy of their sum S_n = ∑_{k=1}^n X_k is maximized when X_1, . . . , X_{n−1} are Bernoulli taking on +1 or −1 with equal probability and X_n is uniformly distributed [17]. We list below the properties of this maximizing distribution. Let S*_n be the corresponding maximum differential entropy distribution.

(a) The sum ∑_{k=1}^{n−1} X_k is binomial on {2k − (n − 1) : k = 0, 1, . . . , n − 1}, where the probability at the point 2j − (n − 1) is (n−1 choose j)/2^{n−1}. There is no closed-form expression for the differential entropy of this binomial part.

(b) The maximum differential entropy is the sum of the binomial part and the uniform part. The differential entropy of the uniform part is log_K 2.

(c) For n = 1 and n = 2, S*_n is uniformly distributed on [−1, 1] and [−2, 2], respectively.

9.25. Stationary Markov Chain: Consider the joint distribution f_{X,Y} of (X, Y) on S × S such that the marginal distributions f_X = f_Y ≡ f satisfy the moment constraints ∫ f(x) g_i(x) dx = α_i for 1 ≤ i ≤ m. Note also that h(X) = h(Y) = h(f). Let f_λ be the MEPD for the class of probability densities which are supported on S and satisfy the moment constraints. Define f^{(λ)}_{X,Y}(x, y) = f_λ(x) f_λ(y), with corresponding conditional density f^{(λ)}_{Y|X}(y|x) = f_λ(y). Then,

(a) f_{X,Y} = f^{(λ)}_{X,Y} maximizes h(X, Y) and h(Y|X), with corresponding maximum values 2h(f_λ) and h(f_λ);

(b) D(f_{X,Y} ‖ f^{(λ)}_{X,Y}) = h(f^{(λ)}_{X,Y}) − h(f_{X,Y}) = 2h(f_λ) − h(X, Y);

(c) the conditional relative entropy can be expressed as
D(f_{Y|X} ‖ f^{(λ)}_{Y|X}) ≡ ∫ f(x) ∫ f_{Y|X}(y|x) log [f_{Y|X}(y|x)/f^{(λ)}_{Y|X}(y|x)] dy dx = h(f_λ) − h(Y|X).

So, for a stationary (first-order) Markov chain X_1, X_2, . . . with moment constraint(s), the entropy rate is maximized when the X_i are independent with marginal distribution equal to the corresponding MEPD f_λ. The maximum entropy rate is h(f_λ).

9.26. Minimum relative entropy from a reference distribution [16]: Fix a reference pdf f_0 and a measurable function g. Consider any pdf f such that α = ∫ g(x) f(x) dx exists. Suppose M(β) = ∫ f_0(x) e^{βg(x)} dx exists for β in some interval. Then
D(f ‖ f_0) ≥ αβ − log M(β),
where β is chosen so that α = (d/dβ) log M(β), with equality if and only if
f(x) = f*(x) = c_0 f_0(x) e^{βg(x)},
where c_0 = (M(β))^{−1}. Note that f* is said to generate an exponential family of distributions.

9.2 Stochastic Processes

9.27. Gaussian Processes: The entropy rate of a Gaussian process (X_k) with power spectrum S(ω) is
H((X_k)) = ln √(2πe) + (1/(4π)) ∫_{−π}^{π} ln S(ω) dω
[21, eq. (15-130), p. 568].

9.28. (Inhomogeneous) Poisson Processes: Let Π_i be a Poisson process with rate λ_i(t) on [0, T]. The relative entropy D(Π_1‖Π_2) between the two processes is given by
−(log e) ∫_0^T (λ_1(t) − λ_2(t)) dt + ∫_0^T λ_1(t) log (λ_1(t)/λ_2(t)) dt
[24, p. 1025]. This is the same as D(P_1‖P_2), where
P_i(n, u_1^n) = [e^{−m_i(T)} (m_i(T))^n / n!] × ∏_{k=1}^n [λ_i(u_k)/m_i(T)],
the first factor being the probability of having n points in [0, T] and the second the conditional pdf of the unordered times; here m_i(T) = ∫_0^T λ_i(t) dt.

10 General Probability Space

10.1. Consider probability measures P and Q on a common measurable space (Ω, F).

If P is not absolutely continuous with respect to Q, then D(P‖Q) = ∞.

If P ≪ Q, then the Radon–Nikodym derivative δ = dP/dQ exists and
D(P‖Q) = ∫ log δ dP = ∫ δ log δ dQ.
The quantity log δ (if it exists) is called the entropy density or relative entropy density of P with respect to Q [10, Lemma 5.2.3].

If P and Q are discrete with corresponding pmfs p and q, then dP/dQ = p/q and we have (6). If P and Q are both absolutely continuous with respect to a (σ-finite) measure M (e.g. M = (P + Q)/2), with corresponding densities (Radon–Nikodym derivatives) dP/dM = δ_P and dQ/dM = δ_Q respectively, then dP/dQ = δ_P/δ_Q and
D(P‖Q) = ∫ δ_P log (δ_P/δ_Q) dM.
If M is the Lebesgue measure m, then we have (16).


10.2. D(P‖Q) = sup ∑_{k=1}^n P(A_k) log [P(A_k)/Q(A_k)], where the supremum is taken over all finite partitions {A_1, . . . , A_n} of the space.

10.3. For random variables X and Y on a common probability space, we define
I(X;Y) = D(P_{X,Y} ‖ P_X × P_Y)
and
H(X) = I(X;X).

10.4. More directly, we can define mutual information in terms of finite partitions of the ranges of the random variables [6, p. 251–252]. Let X be the range of a random variable X and P = {X_i} a finite partition of X. The quantization of X by P (denoted [X]_P) is the discrete random variable whose distribution is given by the probabilities P[X ∈ X_i]. For two random variables X and Y with partitions P and Q, the mutual information between X and Y is given in terms of its discrete versions by
I(X;Y) = sup_{P,Q} I([X]_P; [Y]_Q),
where the supremum is over all finite partitions P and Q.

By continuing to refine the partitions P and Q, one finds a monotonically increasing sequence I([X]_P; [Y]_Q) ↗ I.

10.5. Let (X, A) be any measurable space. The total variation distance d_TV between two probability measures P and Q on X is defined to be
d_TV(P, Q) = sup_{A∈A} |P(A) − Q(A)|. (25)
The total variation distance between two random variables X and Y is denoted by d_TV(L(X), L(Y)), where L(X) is the distribution or law of X. We sometimes simply write d_TV(X, Y) with the understanding that it is in fact a function of the marginal distributions and not the joint distribution. If X and Y are discrete random variables, then
d_TV(X, Y) = (1/2) ∑_{k∈X} |p_X(k) − p_Y(k)|.
If X and Y are absolutely continuous random variables with densities f_X(x) and f_Y(y), then
d_TV(X, Y) = (1/2) ∫_{x∈X} |f_X(x) − f_Y(x)| dx.
More generally, if P and Q are both absolutely continuous with respect to some measure M (i.e. P, Q ≪ M), with corresponding densities (Radon–Nikodym derivatives) dP/dM = δ_P and dQ/dM = δ_Q respectively, then
d_TV(P, Q) = (1/2) ∫ |δ_P − δ_Q| dM = 1 − ∫ min(δ_P, δ_Q) dM,
and the supremum in (25) is achieved by the set B = [δ_P > δ_Q].

d_TV is a true metric. In particular, 1) d_TV(µ_1, µ_2) ≥ 0 with equality if and only if µ_1 = µ_2, 2) d_TV(µ_1, µ_2) = d_TV(µ_2, µ_1), and 3) d_TV(µ_1, µ_2) ≤ d_TV(µ_1, ν) + d_TV(ν, µ_2). Furthermore, because µ_i(A) ∈ [0, 1], we have |µ_1(A) − µ_2(A)| ≤ 1 and thus d_TV(µ_1, µ_2) ≤ 1.

10.6 (Pinsker's inequality).
D(P‖Q) ≥ 2(log e) d_TV²(P, Q).
This is exactly (1). In other words, if P and Q are both absolutely continuous with respect to some measure M (i.e. P, Q ≪ M), with corresponding densities (Radon–Nikodym derivatives) dP/dM = δ_P and dQ/dM = δ_Q respectively, then
2 ∫ δ_P log (δ_P/δ_Q) dM ≥ (log e) (∫ |δ_P − δ_Q| dM)².
See [10, Lemma 5.2.8] for a detailed proof.
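A quick numerical check of Pinsker's inequality on a finite alphabet (an added sketch, not from the original text): with the divergence measured in bits, the constant becomes 2 log_2 e. The pmfs below are random and the alphabet size is arbitrary.

% Pinsker: D(P||Q) >= 2*log2(e)*dTV(P,Q)^2, for random pmfs on 8 points.
rng(1);                                   % for reproducibility
for trial = 1:5
    p = rand(1,8);  p = p/sum(p);
    q = rand(1,8);  q = q/sum(q);
    D   = sum(p.*log2(p./q));             % relative entropy in bits
    dTV = 0.5*sum(abs(p-q));              % total variation distance
    [D, 2*log2(exp(1))*dTV^2]             % first column >= second column
end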

11 Typicality and AEP (Asymptotic Equipartition Properties)

The material in this section is based on (1) Chapter 3 and Section 13.6 in [5], and (2) Chapter 5 in [26]. Berger [2] introduced strong typicality, which was further developed into the method of types in the book by Csiszar and Korner [7]. First, we consider discrete random variables.

11.1. Weak Typicality: Consider a sequence {X_k : k ≥ 1}, where the X_k are i.i.d. with distribution p_X(x). The quantity −(1/n) log p(x_1^n) = −(1/n) ∑_{k=1}^n log p(x_k) is called the empirical entropy of the sequence x_1^n. By the weak law of large numbers, −(1/n) log p(X_1^n) → H(X) in probability as n → ∞. That is,
∀ε > 0, lim_{n→∞} P[|−(1/n) log p(X_1^n) − H(X)| < ε] = 1;
(1) ∀ε > 0, for n sufficiently large,
P[|−(1/n) log p(X_1^n) − H(X)| < ε] > 1 − ε.

The weakly typical set A_ε^{(n)}(X) w.r.t. p(x) is the set of sequences x_1^n ∈ X^n such that |−(1/n) log p(x_1^n) − H(X)| ≤ ε, or equivalently, 2^{−n(H(X)+ε)} ≤ p(x_1^n) ≤ 2^{−n(H(X)−ε)}, where ε is an arbitrarily small positive real number. The sequences x_1^n ∈ A_ε^{(n)}(X) are called weakly ε-typical sequences. The following hold ∀ε > 0:

(2) For n sufficiently large,
P[A_ε^{(n)}(X)] = P[X_1^n ∈ A_ε^{(n)}(X)] > 1 − ε;
equivalently,
P[(A_ε^{(n)}(X))^c] = P[X_1^n ∉ A_ε^{(n)}(X)] < ε.

(3) For n sufficiently large,
(1 − ε) 2^{n(H(X)−ε)} ≤ |A_ε^{(n)}(X)| ≤ 2^{n(H(X)+ε)}.
Note that the second inequality in fact holds ∀n ≥ 1.

Remark: This does not say that most of the sequences in X^n are weakly typical. In fact, when X is not uniform, |A_ε^{(n)}(X)|/|X|^n → 0 as n → ∞. (If X is uniform, then every sequence is typical.) Although the size of the weakly typical set may be insignificant compared with the size of the set of all sequences, the former has almost all the probability. The most likely sequence is in general not weakly typical. Roughly speaking, probability-wise, for large n we only have to focus on ≈ 2^{nH(X)} typical sequences, each with probability ≈ 2^{−nH(X)}.
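The concentration of the empirical entropy is easy to see by simulation. The MATLAB sketch below (added; it assumes entropy2 from Section 13 is on the path) draws i.i.d. symbols from a small pmf and reports how often the drawn sequence is weakly ε-typical; the pmf, ε, and the number of trials are arbitrary.

% Empirical entropy -1/n log2 p(X_1^n) concentrates around H(X) (see 11.1).
% Assumes entropy2.m from Section 13 is available on the path.
p = [0.5 0.25 0.125 0.125];      % pmf of X on 4 symbols
H = entropy2(p);                 % H(X) = 1.75 bits
c = cumsum(p);  eps0 = 0.05;  trials = 500;
for n = [10 100 1000]
    hits = 0;
    for t = 1:trials
        u = rand(1,n);
        x = arrayfun(@(v) find(v <= c, 1), u);     % i.i.d. symbols drawn from p
        emp = -sum(log2(p(x)))/n;                  % empirical entropy of the draw
        hits = hits + (abs(emp - H) <= eps0);
    end
    [n, hits/trials]             % fraction of weakly eps0-typical sequences grows with n
end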

11.2. Jointly Weak Typicality: A pair of sequences (x_1^n, y_1^n) ∈ X^n × Y^n is said to be (weakly) ε-typical w.r.t. the distribution P(x, y) if

(a) |−(1/n) log p(x_1^n) − H(X)| < ε,

(b) |−(1/n) log q(y_1^n) − H(Y)| < ε, and

(c) |−(1/n) log P(x_1^n, y_1^n) − H(X, Y)| < ε.

The set A_ε^{(n)}(X, Y) is the collection of all jointly typical sequences with respect to the distribution P(x, y). It is the set of n-sequences with empirical entropies ε-close to the true entropies. Note that if (x_1^n, y_1^n) ∈ A_ε^{(n)}(X, Y), then

2^{−n(H(X,Y)+ε)} < P(x_1^n, y_1^n) < 2^{−n(H(X,Y)−ε)};

x_1^n ∈ A_ε^{(n)}(X), that is, 2^{−n(H(X)+ε)} < p(x_1^n) < 2^{−n(H(X)−ε)};

y_1^n ∈ A_ε^{(n)}(Y), that is, 2^{−n(H(Y)+ε)} < q(y_1^n) < 2^{−n(H(Y)−ε)}.

Suppose that (X_i, Y_i) is drawn i.i.d. ∼ P(x, y). Then:

(1) lim_{n→∞} P[(X_1^n, Y_1^n) ∈ A_ε^{(n)}] = 1. Equivalently, ∀ε′ > 0 ∃N > 0 such that ∀n > N,
0 ≤ P[(X_1^n, Y_1^n) ∉ A_ε^{(n)}] < ε′,
which is equivalent to
1 − ε′ < P[(X_1^n, Y_1^n) ∈ A_ε^{(n)}] ≤ 1.

(2) ∀ε′ > 0 ∃N > 0 such that ∀n > N,
(1 − ε′) 2^{n(H(X,Y)−ε)} < |A_ε^{(n)}| ≤ 2^{n(H(X,Y)+ε)},
where the second inequality is true ∀n ≥ 1.

(3) If X̃_1^n and Ỹ_1^n are independent with the same marginals as P(x_1^n, y_1^n), then, for n sufficiently large,
(1 − ε′) 2^{−n(I(X;Y)+3ε)} ≤ P[(X̃_1^n, Ỹ_1^n) ∈ A_ε^{(n)}] ≤ 2^{−n(I(X;Y)−3ε)},
where the second inequality is true ∀n ≥ 1.

11.3. Weak Typicality (collections of random variables): Let (X_1, X_2, . . . , X_k) denote a finite collection of discrete random variables with some fixed joint distribution p(x_1^k), x_1^k ∈ ∏_{i=1}^k X_i. Let J ⊂ [k]. Suppose the X_J^{(i)} are drawn i.i.d. according to p(x_1^k). We consider the sequence
(x_J^{(i)})_{i=1}^n = (x_J^{(1)}, . . . , x_J^{(n)}) ∈ (∏_{i=1}^k X_i)^n = ∏_{i=1}^k X_i^n.
Note that (x_J^{(i)})_{i=1}^n is in fact a matrix with nk elements. For conciseness, we shall denote it by s.

By the law of large numbers, the empirical entropy −(1/n) log p(S) = −(1/n) log p((X_J^{(i)})_{i=1}^n) → H(X_J).

The set A_ε^{(n)} of ε-typical n-sequences is
A_ε^{(n)}(X_J) = {s : |−(1/n) log p(s_T) − H(X_T)| < ε for all nonempty T ⊂ J},
where s_T denotes the restriction of s to the coordinates in T.

By definition, if J ⊂ K ⊂ [k], then A_ε^{(n)}(X_J) ⊂ A_ε^{(n)}(X_K).

(1) ∀ε > 0 and n large enough, P[A_ε^{(n)}(X_J)] ≥ 1 − ε.

(2) If s ∈ A_ε^{(n)}(X_J), then p(s) ≐ 2^{−n(H(X_J) ∓ ε)}.

(3) ∀ε > 0 and n large enough, |A_ε^{(n)}(X_J)| ≐ 2^{n(H(X_J) ± 2ε)}.

Consider disjoint J_r ⊂ [k] and let s_r = (x_{J_r}^{(i)})_{i=1}^n.

(4) If (s_1, s_2) ∈ A_ε^{(n)}(X_{J_1}, X_{J_2}), then p(s_1|s_2) ≐ 2^{−n(H(X_{J_1}|X_{J_2}) ± 2ε)}.

For any ε > 0, define A_ε^{(n)}(X_{J_1}|s_2) to be the set of sequences s_1 that are jointly ε-typical with a particular s_2 sequence, i.e., such that (s_1, s_2) ∈ A_ε^{(n)}(X_{J_1}, X_{J_2}). If s_2 ∈ A_ε^{(n)}(X_{J_2}), then

(5.1) for sufficiently large n, |A_ε^{(n)}(X_{J_1}|s_2)| ≤ 2^{n(H(X_{J_1}|X_{J_2})+2ε)};

(5.2) (1 − ε) 2^{n(H(X_{J_1}|X_{J_2})−2ε)} ≤ ∑_{s_2} p(s_2) |A_ε^{(n)}(X_{J_1}|s_2)|.

Suppose now that (X_{J_1}^{(i)}, X_{J_2}^{(i)}, X_{J_3}^{(i)}) is drawn i.i.d. according to p(x_{J_1}|x_{J_3}) p(x_{J_2}|x_{J_3}) p(x_{J_3}); that is, X_{J_1}^{(i)} and X_{J_2}^{(i)} are conditionally independent given X_{J_3}^{(i)} but otherwise share the same pairwise marginals as X_{J_1}, X_{J_2}, X_{J_3}. Let s_r = (x_{J_r}^{(i)})_{i=1}^n. Then

(6) P[(S_1, S_2, S_3) ∈ A_ε^{(n)}(X_{J_1}, X_{J_2}, X_{J_3})] ≐ 2^{−n(I(X_{J_1};X_{J_2}|X_{J_3}) ∓ 6ε)}.

11.4. Strong typicality: Suppose X is finite. For a sequence x_1^n ∈ X^n and a ∈ X, define
N(a|x_1^n) = |{k : 1 ≤ k ≤ n, x_k = a}| = ∑_{k=1}^n 1[x_k = a].
Then N(a|x_1^n) is the number of occurrences of the symbol a in the sequence x_1^n. Note that ∀x_1^n ∈ X^n, ∑_{a∈X} N(a|x_1^n) = n. For i.i.d. X_i ∼ p(x), ∀x_1^n ∈ X^n,
p_{X_1^n}(x_1^n) = ∏_{x∈X} p(x)^{N(x|x_1^n)}.

A sequence x_1^n ∈ X^n is said to be δ-strongly typical w.r.t. p(x) if (1) ∀a ∈ X with p(a) > 0, |N(a|x_1^n)/n − p(a)| < δ/|X|, and (2) ∀a ∈ X with p(a) = 0, N(a|x_1^n) = 0.

This implies ∑_a |(1/n)N(a|x_1^n) − p(a)| < δ, which is the typicality condition used in [26, Yeung].

The set of all δ-strongly typical sequences above is called the strongly typical set and is denoted T_δ = T_δ^n(p) = T_δ^n(X) = {x_1^n ∈ X^n : x_1^n is δ-typical}.

Suppose the X_i are drawn i.i.d. ∼ p_X(x).

(1) ∀δ > 0, lim_{n→∞} P[X_1^n ∈ T_δ] = 1 and lim_{n→∞} P[X_1^n ∉ T_δ] = 0. Equivalently, ∀α < 1, for n sufficiently large, P[X_1^n ∈ T_δ] > 1 − α.

Define p_min = min{p(x) > 0 : x ∈ X} and ε_δ = δ|log p_min|. Note that p_min gives the maximum value of |log p(x)| over the support, and ε_δ > 0 can be made arbitrarily small by making δ small enough (lim_{δ→0} ε_δ = 0). Then,

(2) For x_1^n ∈ T_δ(p), we have |−(1/n) log p_{X_1^n}(x_1^n) − H(X)| < ε_δ, which is equivalent to 2^{−n(H(X)+ε_δ)} < p_{X_1^n}(x_1^n) < 2^{−n(H(X)−ε_δ)}. Hence, T_δ^n(X) ⊂ A_{ε_δ}^n(X).

(3) ∀α < 1, for n sufficiently large, (1 − α) 2^{n(H(X)−ε_δ)} ≤ |T_δ(X)| ≤ 2^{n(H(X)+ε_δ)}, where the second inequality holds ∀n ≥ 1.

11.5. Jointly Strong Typicality: Let P(x, y) = p(x)Q(y|x), x ∈ X, y ∈ Y, be the joint pmf over X × Y. Denote the number of occurrences of the point (a, b) in the pair of sequences (x_1^n, y_1^n) by
N(a, b|x_1^n, y_1^n) = |{k : 1 ≤ k ≤ n, (x_k, y_k) = (a, b)}| = ∑_{k=1}^n 1[x_k = a] 1[y_k = b].
Then N(a|x_1^n) = ∑_{b∈Y} N(a, b|x_1^n, y_1^n) and N(b|y_1^n) = ∑_{a∈X} N(a, b|x_1^n, y_1^n).

A pair of sequences (x_1^n, y_1^n) ∈ X^n × Y^n is said to be strongly δ-typical w.r.t. P(x, y) if
∀a ∈ X, ∀b ∈ Y, |N(a, b|x_1^n, y_1^n)/n − P(a, b)| < δ/(|X||Y|).

The set of all strongly typical sequences is called the strongly typical set and is denoted by
T_δ = T_δ^n(pQ) = T_δ^n(X, Y) = {(x_1^n, y_1^n) : (x_1^n, y_1^n) is δ-typical w.r.t. P(x, y)}.

Suppose (X_i, Y_i) is drawn i.i.d. ∼ P(x, y).

(1) lim_{n→∞} P[(X_1^n, Y_1^n) ∉ T_δ^n(X, Y)] = 0. That is, ∀α > 0, for n sufficiently large, P[(X_1^n, Y_1^n) ∈ T_δ^n(X, Y)] > 1 − α.

(2) Suppose (x_1^n, y_1^n) ∈ T_δ^n(X, Y). Then 2^{−n(H(X,Y)+ε_δ)} < P(x_1^n, y_1^n) < 2^{−n(H(X,Y)−ε_δ)}, where ε_δ = δ|log P_min| and P_min = min{P(x, y) > 0 : x ∈ X, y ∈ Y}.

Suppose (x_1^n, y_1^n) ∈ T_δ^n(X, Y). Then x_1^n ∈ T_δ(X) and y_1^n ∈ T_δ(Y); that is, joint typicality ⇒ marginal typicality. This further implies
2^{−n(H(X)+ε_δ)} < p_{X_1^n}(x_1^n) < 2^{−n(H(X)−ε_δ)} and 2^{−n(H(Y)+ε_δ)} < q_{Y_1^n}(y_1^n) < 2^{−n(H(Y)−ε_δ)}.

Consider (x_1^n, y_1^n) ∈ T_δ^n(X, Y) and g : X × Y → R. Then the empirical average satisfies (1/n) ∑_{k=1}^n g(x_k, y_k) = E g(X, Y) ± δ g_max, where g_max = max_{u,v} g(u, v).

(3) ∀α > 0, for n sufficiently large,
(1 − α) 2^{n(H(X,Y)−ε_δ)} ≤ |T_δ(P)| ≤ 2^{n(H(X,Y)+ε_δ)}.

(4) Suppose (X̃_1^n, Ỹ_1^n) ∼ ∏_{i=1}^n p_X(x̃_i) q_Y(ỹ_i); that is, X̃_i and Ỹ_i are independent with the same marginals as X_i and Y_i. Then, ∀α > 0, for n sufficiently large,
(1 − α) 2^{−n(I(X;Y)+3ε_δ)} ≤ P[(X̃_1^n, Ỹ_1^n) ∈ T_δ^n(X, Y)] ≤ 2^{−n(I(X;Y)−3ε_δ)}.


For any x_1^n ∈ T_δ^n(X), define
T_δ^n(Y|X)(x_1^n) = {y_1^n : (x_1^n, y_1^n) ∈ T_δ^n(X, Y)}.

(5) For any x_1^n such that ∃y_1^n with (x_1^n, y_1^n) ∈ T_δ^n(X, Y),
|T_δ^n(Y|X)(x_1^n)| ≐ 2^{n(H(Y|X) ± ε′_δ)},
where ε′_δ → 0 as δ → 0 and n → ∞. Note that x_1^n ∈ T_δ^n(X) combined with the condition of the statement above is equivalent to |T_δ^n(Y|X)(x_1^n)| ≥ 1.

(6) Let the Y_i be drawn i.i.d. ∼ q_Y(y). Then
P[(x_1^n, Y_1^n) ∈ T_δ^n(X, Y)] = P[Y_1^n ∈ T_δ^n(Y|X)(x_1^n)] ≐ 2^{−n(I(X;Y) ∓ ε″_δ)},
where ε″_δ → 0 as δ → 0 and n → ∞.

Now we consider continuous random variables.

11.6. The AEP for continuous random variables: Let (X_i)_{i=1}^n be a sequence of random variables drawn i.i.d. according to the density f(x). Then
−(1/n) log f(X_1^n) → E[−log f(X)] = h(X) in probability.
For ε > 0 and any n, we define the typical set A_ε^{(n)} with respect to f(x) as follows:
A_ε^{(n)} = {x_1^n ∈ S^n : |−(1/n) log f(x_1^n) − h(X)| ≤ ε},
where S is the support set of the random variable X and f(x_1^n) = ∏_{i=1}^n f(x_i). Note that the condition is equivalent to
2^{−n(h(X)+ε)} ≤ f(x_1^n) ≤ 2^{−n(h(X)−ε)}.
We also define the volume Vol(A) of a set A to be Vol(A) = ∫_A dx_1 dx_2 · · · dx_n.

(1) P[A_ε^{(n)}] > 1 − ε for n sufficiently large.

(2) ∀n, Vol(A_ε^{(n)}) ≤ 2^{n(h(X)+ε)}, and Vol(A_ε^{(n)}) ≥ (1 − ε) 2^{n(h(X)−ε)} for n sufficiently large. That is, for n sufficiently large,
(1 − ε) 2^{n(h(X)−ε)} ≤ Vol(A_ε^{(n)}) ≤ 2^{n(h(X)+ε)}.

The set A_ε^{(n)} is the smallest-volume set with probability ≥ 1 − ε, to first order in the exponent. More specifically, for each n = 1, 2, . . ., let B_δ^{(n)} ⊂ S^n be any set with P[B_δ^{(n)}] ≥ 1 − δ, where X_1, . . . , X_n are i.i.d. ∼ f(x). For δ < 1/2 and any δ′ > 0, (1/n) log Vol(B_δ^{(n)}) > h(X) − δ′ for n sufficiently large. Equivalently, for δ < 1/2, Vol(B_δ^{(n)}) ≐ Vol(A_ε^{(n)}) ≐ 2^{nh(X)}.

The notation a_n ≐ b_n means lim_{n→∞} (1/n) log(a_n/b_n) = 0, which implies that a_n and b_n are equal to the first order in the exponent.

The volume of the smallest set that contains most of the probability is approximately 2^{nh(X)}. This is an n-dimensional volume, so the corresponding side length is (2^{nh(X)})^{1/n} = 2^{h(X)}. Differential entropy is then the logarithm of the equivalent side length of the smallest set that contains most of the probability. Hence, low entropy implies that the random variable is confined to a small effective volume, and high entropy indicates that the random variable is widely dispersed.

Remark: Just as the entropy is related to the volume of the typical set, there is a quantity called Fisher information which is related to the surface area of the typical set.

11.7. Jointly Typical Sequences: The set A_ε^{(n)} of jointly typical sequences (x_1^n, y_1^n) with respect to the distribution f_{X,Y}(x, y) is the set of n-sequences with empirical entropies ε-close to the true entropies, i.e., A_ε^{(n)} is the set of (x_1^n, y_1^n) ∈ X^n × Y^n such that

(a) |−(1/n) log f_{X_1^n}(x_1^n) − h(X)| < ε,

(b) |−(1/n) log f_{Y_1^n}(y_1^n) − h(Y)| < ε, and

(c) |−(1/n) log f_{X_1^n,Y_1^n}(x_1^n, y_1^n) − h(X, Y)| < ε.

Note that the following gives an equivalent definition:

(a) 2^{−n(h(X)+ε)} < f_{X_1^n}(x_1^n) < 2^{−n(h(X)−ε)},

(b) 2^{−n(h(Y)+ε)} < f_{Y_1^n}(y_1^n) < 2^{−n(h(Y)−ε)}, and

(c) 2^{−n(h(X,Y)+ε)} < f_{X_1^n,Y_1^n}(x_1^n, y_1^n) < 2^{−n(h(X,Y)−ε)}.

Let (X_1^n, Y_1^n) be sequences of length n drawn i.i.d. according to f_{X_i,Y_i}(x_i, y_i) = f_{X,Y}(x_i, y_i). Then

(1) P[A_ε^{(n)}] = P[(X_1^n, Y_1^n) ∈ A_ε^{(n)}] → 1 as n → ∞.

(2) Vol(A_ε^{(n)}) ≤ 2^{n(h(X,Y)+ε)}, and for sufficiently large n, Vol(A_ε^{(n)}) ≥ (1 − ε) 2^{n(h(X,Y)−ε)}.

(3) If (U_1^n, V_1^n) ∼ f_{X_1^n}(u_1^n) f_{Y_1^n}(v_1^n), i.e., U_1^n and V_1^n are independent with the same marginals as f_{X_1^n,Y_1^n}(x_1^n, y_1^n), then
P[(U_1^n, V_1^n) ∈ A_ε^{(n)}] ≤ 2^{−n(I(X;Y)−3ε)}.
Also, for sufficiently large n,
P[(U_1^n, V_1^n) ∈ A_ε^{(n)}] ≥ (1 − ε) 2^{−n(I(X;Y)+3ε)}.


12 I-measure

In this section, we present a theory that establishes a one-to-one correspondence between Shannon's information measures and set theory. The resulting theorems provide an alternative approach to information-theoretic equalities and inequalities.

Consider n random variables X_1, X_2, . . . , X_n. For any random variable X, let X̃ be a set corresponding to X. Define the universal set Ω to be ⋃_{i∈[n]} X̃_i. The field F_n generated by the sets X̃_1, X̃_2, . . . , X̃_n is the collection of sets which can be obtained by any sequence of the usual set operations (union, intersection, complement, and difference) on X̃_1, X̃_2, . . . , X̃_n. The atoms of F_n are sets of the form ⋂_{i=1}^n Y_i, where Y_i is either X̃_i or X̃_i^c. Note that all atoms in F_n are disjoint. The set A_0 = ⋂_{i∈[n]} X̃_i^c = ∅ is called the empty atom of F_n. All the atoms of F_n other than A_0 are called nonempty atoms. Let A be the set of all nonempty atoms of F_n. Then |A|, the cardinality of A, is equal to 2^n − 1.

Each set in F_n can be expressed uniquely as the union of a subset of the atoms of F_n.

Any signed measure µ on F_n is completely specified by the values of µ on the nonempty atoms of F_n.

12.1. We define the I-measure µ* on F_n by
µ*(X̃_G) = H(X_G) for all nonempty G ⊂ [n],
where X̃_G = ⋃_{i∈G} X̃_i and X_G = (X_i : i ∈ G).

12.2. For all (not necessarily disjoint) subsets G, G′, G′′ of [n]:

(a) µ*(X̃_G ∪ X̃_{G′′}) = µ*(X̃_{G∪G′′}) = H(X_{G∪G′′});

(b) µ*(X̃_G ∩ X̃_{G′} − X̃_{G′′}) = I(X_G; X_{G′}|X_{G′′});

(c) µ*(A_0) = 0.

Note that (b) is the necessary and sufficient condition for µ* to be consistent with all Shannon's information measures, because:

when G and G′ are nonempty, µ*(X̃_G ∩ X̃_{G′} − X̃_{G′′}) = I(X_G; X_{G′}|X_{G′′});

when G′′ = ∅, we have µ*(X̃_G ∩ X̃_{G′}) = I(X_G; X_{G′});

when G′ = G, we have µ*(X̃_G − X̃_{G′′}) = I(X_G; X_G|X_{G′′}) = H(X_G|X_{G′′});

when G′ = G and G′′ = ∅, we have µ*(X̃_G) = I(X_G; X_G) = H(X_G).

In fact, µ* is the unique signed measure on F_n which is consistent with all Shannon's information measures. We then have the substitution of symbols shown in Table 5.

H, I ↔ µ*
, ↔ ∪
; ↔ ∩
| ↔ −

Table 5: Substitution of symbols

Motivated by the substitution of symbols, we will write µ*(X̃_{G_1} ∩ X̃_{G_2} ∩ · · · ∩ X̃_{G_m} − X̃_F) as I(X_{G_1}; X_{G_2}; · · · ; X_{G_m} | X_F).

12.3. If there is no constraint on X_1, X_2, . . . , X_n, then µ* can take any set of nonnegative values on the nonempty atoms of F_n.

12.4. Because of the one-to-one correspondence between Shannon's information measures and set theory, it is valid to use an information diagram, which is a variation of a Venn diagram, to represent the relationships between Shannon's information measures. However, one must be careful: an I-measure µ* can take negative values. Therefore, when we see in an information diagram that A is a subset of B, we cannot conclude from this fact alone that µ*(A) ≤ µ*(B) unless we know from the setup of the problem that µ* is nonnegative. For example, µ* is nonnegative if the random variables involved form a Markov chain.

For a given n, there are ∑_{k=3}^n (n choose k) nonempty atoms that do not correspond to Shannon's information measures and hence can be negative.

For n ≥ 4, it is not possible to display an information diagram perfectly in two dimensions. In general, an information diagram for n random variables needs n − 1 dimensions to be displayed perfectly.

In an information diagram, the universal set Ω is not shown explicitly.

When µ* takes the value zero on an atom A of F_n, we do not need to display A in an information diagram, because A does not contribute to µ*(B) for any set B ∈ F_n containing the atom A.

12.5. Special cases:

When we are given that X and Y are independent, we cannot draw X̃ and Ỹ as disjoint sets, because we cannot guarantee that every atom that is a subset of X̃ ∩ Ỹ has µ* = 0. For example, let X and Y be i.i.d. Bernoulli(0.5) on {0, 1}. Then, conditioned on another random variable Z = (X + Y) mod 2, they are no longer independent; in fact, conditioned on Z, knowing Y completely specifies X and vice versa.

When Y = g(X), we can draw Ỹ as a subset of X̃. That is, any atom V which is a subset of Ỹ \ X̃ = Ỹ ∩ X̃^c satisfies µ*(V) = 0. In fact, let I_1 and I_2 be disjoint index sets. Then, for any set of the form V = Ỹ ∩ ⋂_{i∈I_1} Z̃_i ∩ X̃^c ∩ ⋂_{j∈I_2} Z̃_j^c, we have µ*(V) = 0. In other words,
H(g(X) | X, Z_1, . . . , Z_n) = 0 and I(g(X); V_1; V_2; · · · ; V_m | X, Z_1, . . . , Z_n) = 0.

12.6. For two random variables, µ* is always nonnegative. The information diagram is shown in Figure 14.

Figure 14: Information diagram for n = 2, with atoms H(X|Y), I(X;Y), and H(Y|X) inside the regions H(X) and H(Y).

12.7. For n = 3, µ*(X̃_1 ∩ X̃_2 ∩ X̃_3) = I(X_1; X_2; X_3) can be negative; µ* on the other nonempty atoms is always nonnegative.

Proof sketch. (Uniqueness in 12.2.) We have shown that µ* is consistent with all Shannon's information measures. Conversely, for a signed measure µ to be consistent with all Shannon's information measures, we need µ(X̃_G) = H(X_G) for all nonempty G, which is precisely the definition of µ*.

(Nonnegativity for n = 2.) The three nonempty atoms of F_2 are X̃_1 ∩ X̃_2, X̃_1 − X̃_2, and X̃_2 − X̃_1. The values of µ* on these atoms are I(X_1; X_2), H(X_1|X_2), and H(X_2|X_1), respectively. These quantities are Shannon's information measures and hence are nonnegative by the basic inequalities. µ* of any set in F_2 is a sum of µ* over atoms and hence is always nonnegative.

(Negativity for n = 3.) We give an example with I(X_1; X_2; X_3) < 0. Let X_1, X_2, X_3 be pairwise independent fair bits with X_1 ⊕ X_2 ⊕ X_3 = 0. Then, for distinct i, j, k, x_i = f(x_j, x_k) and H(X_i|X_j, X_k) = 0, so H(X_1, X_2, X_3) = H(X_i, X_j) = 2 for any i ≠ j, while I(X_i; X_j) = 0 for all i ≠ j. Consequently, I(X_1; X_2; X_3) = I(X_1; X_2) − I(X_1; X_2|X_3) = 0 − 1 = −1 < 0.

Figure 15: Information diagram for n = 3, with atoms H(X_1|X_2, X_3), H(X_2|X_1, X_3), H(X_3|X_1, X_2), I(X_1; X_2|X_3), I(X_1; X_3|X_2), I(X_2; X_3|X_1), and I(X_1; X_2; X_3).

12.8. Information diagram for the Markov chain X_1 → X_2 → X_3: here I(X_1; X_3|X_2) = 0, so 0 ≤ I(X_1; X_3) = I(X_1; X_3|X_2) + I(X_1; X_2; X_3) = I(X_1; X_2; X_3), and µ* is nonnegative on every atom.

12.9. For four random variables (or random vectors), the atoms colored in sky-blue in Figure 16 can be negative.

See the example below for a case in which I(X; Y) < I(X; Y|Z).

I(X; Y) ≥ I(f(X); g(Y)). Proof: We first show that I(X; Y) ≥ I(f(X); Y). Expanding I(X, f(X); Y) in two ways gives
I(X, f(X); Y) = I(X; Y) + I(f(X); Y|X) = I(f(X); Y) + I(X; Y|f(X)),
and since I(f(X); Y|X) = 0 and I(X; Y|f(X)) ≥ 0, we conclude I(X; Y) ≥ I(f(X); Y). By symmetry, I(Z; Y) ≥ I(Z; g(Y)). Let Z = f(X) and combine the two inequalities.

Example: Let X and Y be i.i.d. Bernoulli(1/2) and Z = X ⊕ Y = (X + Y) mod 2. Then
(1) H(X) = H(Y) = H(Z) = 1;
(2) any pair of the variables is independent: I(X; Y) = I(Z; Y) = I(X; Z) = 0;
(3) given any two of them, the third is determined: H(X|Y, Z) = H(Y|X, Z) = H(Z|X, Y) = 0;
(4) I(X; Y|Z) = H(X|Z) − H(X|Y, Z) = 1, and by symmetry I(X; Y|Z) = I(Z; Y|X) = I(Z; X|Y) = 1.

Figure 16: Information diagram for n = 4.

Figure 17: The information diagram for the Markov chain X_1 → · · · → X_n.

12.10. For a Markov chain X_1 − X_2 − · · · − X_n, the information diagram can be displayed in two dimensions. One such construction is Figure 17.

The I-measure µ* for a Markov chain X_1 → · · · → X_n is always nonnegative. This facilitates the use of the information diagram, because if B ⊂ B′ in the information diagram, then µ*(B′) ≥ µ*(B).

13 MATLAB

In this section, we provide some MATLAB code for calculating information-theoretic quantities.

13.1. Function entropy2 calculates the entropy H(X) = H(p) (in bits) of a pmf p_X specified by a row vector p whose ith element is the probability of x_i.

function H = entropy2(p)
% ENTROPY2 accepts a probability mass function
% as a row vector and calculates the corresponding
% entropy in bits.
p = p(find(abs(p-1)>1e-8));   % drop an entry equal to 1 (deterministic case)
p = p(find(abs(p)>1e-8));     % drop (near-)zero entries: 0*log 0 = 0
if length(p)==0
    H = 0;
else
    H = -sum(p.*log(p))/log(2);
end

13.2. Function information calculates the mutual information I(X;Y) = I(p, Q), where p is the row vector describing p_X and Q is a matrix defined by Q_ij = P[Y = y_j | X = x_i].


function I = information(p,Q)
% INFORMATION computes I(X;Y) in bits for input pmf p (row vector)
% and transition matrix Q with Q(i,j) = P[Y = y_j | X = x_i].
X = length(p);
q = p*Q;                              % output pmf pY
HY = entropy2(q);                     % H(Y)
temp = [];
for i = 1:X
    temp = [temp entropy2(Q(i,:))];   % H(Y|X = x_i)
end
HYgX = sum(p.*temp);                  % H(Y|X)
I = HY - HYgX;
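For example (an added usage sketch), for a binary symmetric channel with crossover probability 0.1 and a uniform input, information returns 1 − H(0.1) ≈ 0.531 bits:

% Mutual information of a BSC(0.1) with uniform input.
p = [0.5 0.5];                    % input pmf pX
Q = [0.9 0.1; 0.1 0.9];           % Q(i,j) = P[Y = y_j | X = x_i]
I = information(p,Q)              % ~ 0.5310 = 1 - entropy2([0.1 0.9])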

13.3. Function capacity calculates the pmf p* which achieves the capacity C = max_p I(p, Q) using the Blahut–Arimoto algorithm [6, Section 10.8]. Given a DMC with transition probabilities Q(y|x) and any input distribution p_0(x), define a sequence p_r(x), r = 0, 1, . . ., according to the iterative prescription
p_{r+1}(x) = p_r(x) c_r(x) / ∑_x p_r(x) c_r(x),
where
log c_r(x) = ∑_y Q(y|x) log [Q(y|x)/q_r(y)] (26)
and
q_r(y) = ∑_x p_r(x) Q(y|x).
Then
log (∑_x p_r(x) c_r(x)) ≤ C ≤ log (max_x c_r(x)).
Note that (26) is D(P_{Y|X=x} ‖ P_Y) when P_X = p_r.

function ps = capacity(pT,Q,n)
% CAPACITY runs n Blahut-Arimoto iterations starting from the
% input pmf pT (row vector) for the DMC with transition matrix Q.
% n = number of iterations
for k = 1:n
    X = size(Q,1);
    qT = pT*Q;                               % current output pmf q_r
    CT = [];
    for i = 1:X
        sQlq = Q(i,:).*log2(qT);
        temp = -entropy2(Q(i,:))-sum(sQlq);  % log2 c_r(x_i) = D(Q(.|x_i)||q_r)
        CT = [CT 2^(temp)];
    end
    temp = sum(pT.*CT);
    pT = 1/temp*(pT.*CT);                    % p_{r+1}(x) = p_r(x)*c_r(x)/sum
end
ps = pT;
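As a usage sketch (added here), the Blahut–Arimoto iteration applied to the same BSC(0.1) converges to the uniform input, whose mutual information is the capacity 1 − H(0.1) ≈ 0.531 bits:

% Capacity of a BSC(0.1) via Blahut-Arimoto.
Q  = [0.9 0.1; 0.1 0.9];
p0 = [0.9 0.1];                   % an arbitrary starting input pmf
ps = capacity(p0,Q,50)            % converges to [0.5 0.5]
C  = information(ps,Q)            % ~ 0.5310 bits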

Alternatively, the following code uses the MATLAB function fmincon to find p*.

function [ps C] = capacity_fmincon(p,Q)
% CAPACITY_FMINCON accepts an initial input
% probability mass function p = pX and
% transition matrix Q = fY|X, and
% calculates the corresponding capacity.
% Note that p is a column vector.
mi = @(p) -information(p.',Q);     % negative mutual information (to minimize)
sp = size(p);
onep = ones(sp); zerop = zeros(sp);
[ps Cm] = fmincon(mi,p,[],[],onep.',1,zerop);
% The 5th and 6th arguments force the sum of
% elements in p to be 1. The 7th argument forces
% the elements to be nonnegative.
C = -Cm;

13.4. The following script demonstrates how to use the Symbolic Math Toolbox to calculate the mutual information between two continuous random variables.

syms x y
% Define the densities fX and fY|X:
% here X ~ N(0,4) and Y = X + N with N ~ N(0,1).
fX = 1/sqrt(2*pi*4)*exp(-1/2*x^2/4);
fYcX = 1/sqrt(2*pi)*exp(-1/2*(y-x)^2);
% Support for X and Y
rX = [-inf, inf];
rY = [-inf, inf];
% Calculate mutual information I(X;Y) = h(Y) - h(Y|X)
fY = int(fX*fYcX,x,rX(1),rX(2));
hY = -int(fY*log2(fY),y,rY(1),rY(2));
hYcx = -int(fYcX*log2(fYcX),y,rY(1),rY(2));
hYcX = int(fX*hYcx,x,rX(1),rX(2));
IXY = hY-hYcX;
eval(IXY)
% For this example, 9.20 gives I(X;Y) = (1/2)*log2(1+4) ~ 1.161 bits.

References

[1] N. Abramson. Information Theory and Coding. McGraw-Hill, New York, 1963. 1

[2] T. Berger. Multiterminal source coding. In Lecture notes presented at the 1977 CISM Summer School, Udine, Italy, July 18–20, 1977. 11

[3] Richard E. Blahut. Principles and Practice of Information Theory. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1987. 9, 9.1

[4] A. S. Cohen and R. Zamir. Entropy amplification property and the loss for writing on dirty paper. IEEE Transactions on Information Theory, 54(4):1477–1487, April 2008. 2.21

[5] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley Series in Telecommunications. John Wiley & Sons, New York, 1991. 7, 2.28, 3.2, 3.11, 11

[6] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley-Interscience, 2006. 2, 2.3, 8, 3.11, 3.17, 3.19, 3.20, 4.1, 4.13, 9.4, 10.4, 13.3

[7] I. Csiszar and J. Korner. Information Theory: Coding Theorems for Discrete Memoryless Systems. Academic Press, 1981. 11

[8] I. Csiszar and G. Tusnady. Information geometry and alternating minimization procedures. Recent Results in Estimation Theory and Related Topics, 1984. 3.19

[9] G. A. Darbellay and I. Vajda. Entropy expressions for multivariate continuous distributions. IEEE Transactions on Information Theory, 46:709–712, 2000. 9.6

[10] R. M. Gray. Entropy and Information Theory. Springer-Verlag, New York, 1990. 10.1, 10.6

[11] D. Guo, S. Shamai (Shitz), and S. Verdú. Mutual information and minimum mean-square error in Gaussian channels. IEEE Transactions on Information Theory, 51:1261–1282, 2005. 9.19

[12] Oliver Johnson. Information Theory and the Central Limit Theorem. Imperial College Press, 2004. 3.15, 9.1

[13] J. N. Kapur and H. K. Kesavan. Entropy Optimization Principles with Applications. Academic Press, 1992. 3.1, 3.6, 3.17, 3.18

[14] Jagat Narain Kapur. Maximum Entropy Models in Science and Engineering. John Wiley & Sons, New York, 1989. 2.23, 3, 4

[15] I. Kontoyiannis, P. Harremoes, and O. Johnson. Entropy and the law of small numbers. IEEE Transactions on Information Theory, 51:466–472, 2005. 3.15, 7.3

[16] S. Kullback. Information Theory and Statistics. Peter Smith, Gloucester, 1978. 9.26

[17] E. Ordentlich. Maximizing the entropy of a sum of independent bounded random variables. IEEE Transactions on Information Theory, 52(5):2176–2181, May 2006. 2.22, 9.24

[18] D. P. Palomar and S. Verdú. Gradient of mutual information in linear vector Gaussian channels. IEEE Transactions on Information Theory, 52:141–154, 2006. 9.19

[19] D. P. Palomar and S. Verdú. Representation of mutual information via input estimates. IEEE Transactions on Information Theory, 53:453–470, 2007. 9.19

[20] D. P. Palomar and S. Verdú. Lautum information. IEEE Transactions on Information Theory, 54(3):964–975, March 2008. 4.14

[21] Athanasios Papoulis. Probability, Random Variables and Stochastic Processes. McGraw-Hill Companies, 1991. 2.2, 9.27

[22] Claude E. Shannon. A mathematical theory of communication. Bell Syst. Tech. J., 27(3):379–423, July 1948. Continued 27(4):623–656, October 1948. (document)

[23] A. M. Tulino and S. Verdú. Monotonic decrease of the non-Gaussianness of the sum of independent random variables: A simple proof. IEEE Transactions on Information Theory, 52:4295–4297, 2006. 5

[24] S. Verdú. On channel capacity per unit cost. IEEE Transactions on Information Theory, 36(5):1019–1030, September 1990. 4, 9.28

[25] A. C. G. Verdugo Lazo and P. N. Rathie. On the entropy of continuous probability distributions. IEEE Transactions on Information Theory, 24:120–122, 1978. 9.3

[26] Raymond W. Yeung. A First Course in Information Theory. Kluwer Academic Publishers, 2002. 11, 11.4

[27] R. Zamir and U. Erez. A Gaussian input is not too bad. IEEE Transactions on Information Theory, 50(6):1362–1367, June 2004. 4.11
