
Foundations of Probabilities

and Information Theory

for Machine Learning

0.

Random Variables

Some proofs

1.

E[X + Y] = E[X] + E[Y], where X and Y are random variables of the same type (i.e., either discrete or continuous)

The discrete case:

E[X + Y] = \sum_{\omega \in \Omega} (X(\omega) + Y(\omega)) \cdot P(\omega) = \sum_{\omega} X(\omega) \cdot P(\omega) + \sum_{\omega} Y(\omega) \cdot P(\omega) = E[X] + E[Y]

The continuous case:

E[X + Y] = \int_x \int_y (x + y)\, p_{XY}(x, y)\, dy\, dx

= \int_x \int_y x\, p_{XY}(x, y)\, dy\, dx + \int_x \int_y y\, p_{XY}(x, y)\, dy\, dx

= \int_x x \int_y p_{XY}(x, y)\, dy\, dx + \int_y y \int_x p_{XY}(x, y)\, dx\, dy

= \int_x x\, p_X(x)\, dx + \int_y y\, p_Y(y)\, dy = E[X] + E[Y]
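As a quick numerical sanity check (an illustrative aside, not part of the original slides), linearity of expectation can be verified on a small discrete sample space:

```python
import itertools

# Empirical check of E[X + Y] = E[X] + E[Y] on a small discrete sample
# space: Omega = all ordered pairs of two fair six-sided dice (uniform P).
omega = list(itertools.product(range(1, 7), repeat=2))
p = 1 / len(omega)                      # P(omega) = 1/36 for each outcome

E_X = sum(x * p for x, _ in omega)      # first die
E_Y = sum(y * p for _, y in omega)      # second die
E_sum = sum((x + y) * p for x, y in omega)

assert abs(E_sum - (E_X + E_Y)) < 1e-9  # linearity holds
```

Note that no independence was needed: the identity holds for any joint distribution.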

2.

X and Y are independent ⇒ E[XY] = E[X] · E[Y], X and Y being random variables of the same type (i.e., either discrete or continuous)

The discrete case:

E[XY] = \sum_{x \in Val(X)} \sum_{y \in Val(Y)} x y\, P(X = x, Y = y) = \sum_{x \in Val(X)} \sum_{y \in Val(Y)} x y\, P(X = x) \cdot P(Y = y)

= \sum_{x \in Val(X)} x\, P(X = x) \sum_{y \in Val(Y)} y\, P(Y = y)

= \sum_{x \in Val(X)} x\, P(X = x)\, E[Y] = E[X] \cdot E[Y]

The continuous case:

E[XY] = \int_x \int_y x y\, p(X = x, Y = y)\, dy\, dx = \int_x \int_y x y\, p(X = x) \cdot p(Y = y)\, dy\, dx

= \int_x x\, p(X = x) \left( \int_y y\, p(Y = y)\, dy \right) dx = \int_x x\, p(X = x)\, E[Y]\, dx

= E[Y] \cdot \int_x x\, p(X = x)\, dx = E[X] \cdot E[Y]
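A small numerical illustration of this result (my own example, not from the slides), where the joint probability of independent X and Y factorizes as P(x)P(y):

```python
# Numerical check of E[XY] = E[X]·E[Y] for independent X and Y,
# on two small, arbitrarily chosen discrete distributions.
px = {1: 0.2, 2: 0.5, 3: 0.3}   # distribution of X
py = {0: 0.6, 4: 0.4}           # distribution of Y

E_X = sum(x * p for x, p in px.items())
E_Y = sum(y * p for y, p in py.items())
# Under independence the joint probability factorizes: P(x, y) = P(x)·P(y).
E_XY = sum(x * y * px[x] * py[y] for x in px for y in py)

assert abs(E_XY - E_X * E_Y) < 1e-12
```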

3.

Binomial distribution: b(r; n, p) \stackrel{def.}{=} C_n^r\, p^r (1 - p)^{n-r}

Significance: b(r; n, p) is the probability of drawing r heads in n independent flips of a coin having the head probability p.

b(r; n, p) indeed represents a probability distribution:

• b(r; n, p) = C_n^r\, p^r (1 - p)^{n-r} ≥ 0 for all p ∈ [0, 1], n ∈ N and r ∈ {0, 1, ..., n};

• \sum_{r=0}^{n} b(r; n, p) = 1:

(1 - p)^n + C_n^1\, p (1 - p)^{n-1} + \cdots + C_n^{n-1}\, p^{n-1} (1 - p) + p^n = [p + (1 - p)]^n = 1

4.

Binomial distribution: calculating the mean

E[b(r; n, p)] \stackrel{def.}{=} \sum_{r=0}^{n} r \cdot b(r; n, p)

= 1 \cdot C_n^1\, p(1-p)^{n-1} + 2 \cdot C_n^2\, p^2(1-p)^{n-2} + \cdots + (n-1) \cdot C_n^{n-1}\, p^{n-1}(1-p) + n \cdot p^n

= p\left[C_n^1 (1-p)^{n-1} + 2 \cdot C_n^2\, p(1-p)^{n-2} + \cdots + (n-1) \cdot C_n^{n-1}\, p^{n-2}(1-p) + n \cdot p^{n-1}\right]

\stackrel{(1)}{=} np\left[(1-p)^{n-1} + C_{n-1}^1\, p(1-p)^{n-2} + \cdots + C_{n-1}^{n-2}\, p^{n-2}(1-p) + C_{n-1}^{n-1}\, p^{n-1}\right]

= np\,[p + (1-p)]^{n-1} = np

For the equality (1) we used the following property:

k\, C_n^k = k\, \frac{n!}{k!\,(n-k)!} = \frac{n!}{(k-1)!\,(n-k)!} = \frac{n\,(n-1)!}{(k-1)!\,(n-1-(k-1))!} = n\, C_{n-1}^{k-1}, \quad \forall k = 1, \ldots, n.
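Both the identity above and the resulting mean E[X] = np can be cross-checked by direct summation (an illustrative check, not part of the slides; n and p are arbitrary sample values):

```python
from math import comb

# Check the identity k·C(n,k) = n·C(n-1,k-1) and the binomial mean E[X] = n·p.
n, p = 10, 0.3
for k in range(1, n + 1):
    assert k * comb(n, k) == n * comb(n - 1, k - 1)   # exact integer identity

# Mean of Binomial(n, p) by summing r·b(r; n, p) over all r.
mean = sum(r * comb(n, r) * p**r * (1 - p)**(n - r) for r in range(n + 1))
assert abs(mean - n * p) < 1e-12
```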

5.

Binomial distribution: calculating the variance

following www.proofwiki.org/wiki/Variance of Binomial Distribution, which cites "Probability: An Introduction", by Geoffrey Grimmett and Dominic Welsh, Oxford Science Publications, 1986

We will make use of the formula Var[X] = E[X^2] − E^2[X]. By denoting q = 1 − p, it follows:

E[b^2(r; n, p)] \stackrel{def.}{=} \sum_{r=0}^{n} r^2\, C_n^r\, p^r q^{n-r} = \sum_{r=0}^{n} r^2\, \frac{n(n-1)\cdots(n-r+1)}{r!}\, p^r q^{n-r}

= \sum_{r=1}^{n} r\, \frac{n(n-1)\cdots(n-r+1)}{(r-1)!}\, p^r q^{n-r} = \sum_{r=1}^{n} r\, n\, C_{n-1}^{r-1}\, p^r q^{n-r}

= np \sum_{r=1}^{n} r\, C_{n-1}^{r-1}\, p^{r-1} q^{(n-1)-(r-1)}

6.

Binomial distribution: calculating the variance (cont’d)

By denoting j = r − 1 and m = n − 1, we'll get:

E[b^2(r; n, p)] = np \sum_{j=0}^{m} (j + 1)\, C_m^j\, p^j q^{m-j}

= np \left[ \sum_{j=0}^{m} j\, C_m^j\, p^j q^{m-j} + \sum_{j=0}^{m} C_m^j\, p^j q^{m-j} \right]

= np \left[ \sum_{j=1}^{m} j\, \frac{m\cdots(m-j+1)}{j!}\, p^j q^{m-j} + (\underbrace{p + q}_{1})^m \right]

= np \left[ \sum_{j=1}^{m} m\, C_{m-1}^{j-1}\, p^j q^{m-j} + 1 \right]

= np \left[ mp \sum_{j=1}^{m} C_{m-1}^{j-1}\, p^{j-1} q^{(m-1)-(j-1)} + 1 \right]

= np\,[(n-1)p\,(\underbrace{p + q}_{1})^{m-1} + 1] = np\,[(n-1)p + 1] = n^2 p^2 - np^2 + np

Finally,

Var[X] = E[b^2(r; n, p)] - (E[b(r; n, p)])^2 = n^2 p^2 - np^2 + np - n^2 p^2 = np(1 - p)
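The closed form Var[X] = np(1 − p) can be checked by direct summation over the binomial pmf (an illustrative numeric check, not part of the original derivation; n and p are arbitrary sample values):

```python
from math import comb

# Direct-summation check that Var[X] = n·p·(1-p) for X ~ Binomial(n, p).
n, p = 12, 0.4
pmf = [comb(n, r) * p**r * (1 - p)**(n - r) for r in range(n + 1)]

mean = sum(r * q for r, q in enumerate(pmf))
second_moment = sum(r * r * q for r, q in enumerate(pmf))
variance = second_moment - mean**2          # Var[X] = E[X^2] - (E[X])^2

assert abs(variance - n * p * (1 - p)) < 1e-10
```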

7.

Binomial distribution: calculating the variance

Another solution

• it can be shown relatively easily that any random variable following the binomial distribution b(r; n, p) can be viewed as a sum of n independent variables following the Bernoulli distribution of parameter p;^a

• we know (or it can be proven immediately) that the variance of the Bernoulli distribution of parameter p is p(1 − p);

• taking into account the linearity property of variances — Var[X_1 + X_2 + ... + X_n] = Var[X_1] + Var[X_2] + ... + Var[X_n], if X_1, X_2, ..., X_n are independent variables — it follows that Var[X] = np(1 − p).

^a See www.proofwiki.org/wiki/Bernoulli Process as Binomial Distribution, which also cites as its source "Probability: An Introduction" by Geoffrey Grimmett and Dominic Welsh, Oxford Science Publications, 1986.

8.

The Gaussian distribution: p(X = x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}

Calculating the mean: E[N_{\mu,\sigma}(x)] \stackrel{def.}{=} \int_{-\infty}^{\infty} x\, p(x)\, dx = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{\infty} x \cdot e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx

Using the variable transformation v = \frac{x - \mu}{\sigma} will imply x = \sigma v + \mu and dx = \sigma\, dv, so:

E[X] = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{\infty} (\sigma v + \mu)\, e^{-\frac{v^2}{2}}\, (\sigma\, dv) = \frac{1}{\sqrt{2\pi}} \left[ \sigma \int_{-\infty}^{\infty} v\, e^{-\frac{v^2}{2}}\, dv + \mu \int_{-\infty}^{\infty} e^{-\frac{v^2}{2}}\, dv \right]

= \frac{1}{\sqrt{2\pi}} \left[ -\sigma \int_{-\infty}^{\infty} (-v)\, e^{-\frac{v^2}{2}}\, dv + \mu \int_{-\infty}^{\infty} e^{-\frac{v^2}{2}}\, dv \right]

= \frac{1}{\sqrt{2\pi}} \Big[ \underbrace{-\sigma\, e^{-\frac{v^2}{2}} \Big|_{-\infty}^{\infty}}_{=0} + \mu \int_{-\infty}^{\infty} e^{-\frac{v^2}{2}}\, dv \Big]

= \frac{\mu}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{v^2}{2}}\, dv \quad (see the next slide for the computation of this last integral)

= \frac{\mu}{\sqrt{2\pi}}\, \sqrt{2\pi} = \mu

9.

The Gaussian distribution: calculating the mean (Cont’d)

\left( \int_{v=-\infty}^{\infty} e^{-\frac{v^2}{2}}\, dv \right)^2 = \left( \int_{x=-\infty}^{\infty} e^{-\frac{x^2}{2}}\, dx \right) \cdot \left( \int_{y=-\infty}^{\infty} e^{-\frac{y^2}{2}}\, dy \right) = \int_{x=-\infty}^{\infty} \int_{y=-\infty}^{\infty} e^{-\frac{x^2+y^2}{2}}\, dy\, dx = \iint_{\mathbb{R}^2} e^{-\frac{x^2+y^2}{2}}\, dy\, dx

By switching from x, y to polar coordinates r, θ (see the Note below), it follows:

\left( \int_{v=-\infty}^{\infty} e^{-\frac{v^2}{2}}\, dv \right)^2 = \int_{r=0}^{\infty} \int_{\theta=0}^{2\pi} e^{-\frac{r^2}{2}}\, (r\, dr\, d\theta) = \int_{r=0}^{\infty} r\, e^{-\frac{r^2}{2}} \left( \int_{\theta=0}^{2\pi} d\theta \right) dr = \int_{r=0}^{\infty} r\, e^{-\frac{r^2}{2}}\, \theta \Big|_{0}^{2\pi}\, dr

= 2\pi \int_{r=0}^{\infty} r\, e^{-\frac{r^2}{2}}\, dr = 2\pi \left( -e^{-\frac{r^2}{2}} \right) \Big|_{0}^{\infty} = 2\pi\, (0 - (-1)) = 2\pi \;\Rightarrow\; \int_{v=-\infty}^{\infty} e^{-\frac{v^2}{2}}\, dv = \sqrt{2\pi}.

Note: x = r cos θ and y = r sin θ, with r ≥ 0 and θ ∈ [0, 2π). Therefore, x² + y² = r², and the Jacobian determinant is

\frac{\partial(x, y)}{\partial(r, \theta)} = \begin{vmatrix} \partial x/\partial r & \partial x/\partial\theta \\ \partial y/\partial r & \partial y/\partial\theta \end{vmatrix} = \begin{vmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{vmatrix} = r\cos^2\theta + r\sin^2\theta = r \geq 0.

So, dx\, dy = r\, dr\, d\theta.

10.

Bivariate Gaussian

11.

The Gaussian distribution: calculating the variance

We will make use of the formula Var[X] = E[X^2] − E^2[X].

E[X^2] = \int_{-\infty}^{\infty} x^2\, p(x)\, dx = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{\infty} x^2 \cdot e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx

Again, using the transformation v = \frac{x - \mu}{\sigma} will imply x = \sigma v + \mu and dx = \sigma\, dv. Therefore,

E[X^2] = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{\infty} (\sigma v + \mu)^2\, e^{-\frac{v^2}{2}}\, (\sigma\, dv)

= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} (\sigma^2 v^2 + 2\sigma\mu v + \mu^2)\, e^{-\frac{v^2}{2}}\, dv

= \frac{1}{\sqrt{2\pi}} \left[ \sigma^2 \int_{-\infty}^{\infty} v^2\, e^{-\frac{v^2}{2}}\, dv + 2\sigma\mu \int_{-\infty}^{\infty} v\, e^{-\frac{v^2}{2}}\, dv + \mu^2 \int_{-\infty}^{\infty} e^{-\frac{v^2}{2}}\, dv \right]

Note that we have already computed \int_{-\infty}^{\infty} v\, e^{-\frac{v^2}{2}}\, dv = 0 and \int_{-\infty}^{\infty} e^{-\frac{v^2}{2}}\, dv = \sqrt{2\pi}.

12.

The Gaussian distribution: calculating the variance (Cont’d)

Therefore, we only need to compute

\int_{-\infty}^{\infty} v^2\, e^{-\frac{v^2}{2}}\, dv = \int_{-\infty}^{\infty} (-v) \left( -v\, e^{-\frac{v^2}{2}} \right) dv = \int_{-\infty}^{\infty} (-v) \left( e^{-\frac{v^2}{2}} \right)' dv

= (-v)\, e^{-\frac{v^2}{2}} \Big|_{-\infty}^{\infty} - \int_{-\infty}^{\infty} (-1)\, e^{-\frac{v^2}{2}}\, dv = 0 + \int_{-\infty}^{\infty} e^{-\frac{v^2}{2}}\, dv = \sqrt{2\pi}.

Here above we used the fact that

\lim_{v\to\infty} v\, e^{-\frac{v^2}{2}} = \lim_{v\to\infty} \frac{v}{e^{\frac{v^2}{2}}} \stackrel{l'Hopital}{=} \lim_{v\to\infty} \frac{1}{v\, e^{\frac{v^2}{2}}} = 0 = \lim_{v\to-\infty} v\, e^{-\frac{v^2}{2}}

So, E[X^2] = \frac{1}{\sqrt{2\pi}} \left( \sigma^2\sqrt{2\pi} + 2\sigma\mu \cdot 0 + \mu^2\sqrt{2\pi} \right) = \sigma^2 + \mu^2.

And, finally, Var[X] = E[X^2] − (E[X])^2 = (\sigma^2 + \mu^2) − \mu^2 = \sigma^2.
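These two results (mean μ and variance σ²) can be cross-checked by numerical integration of the density (an illustrative sketch, not from the slides; μ and σ are arbitrary sample values):

```python
import math

# Trapezoidal check that the N(mu, sigma^2) density integrates to 1
# and has mean mu and variance sigma^2.
mu, sigma = 1.5, 2.0

def pdf(x):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

a, b, n = mu - 12 * sigma, mu + 12 * sigma, 40_000
h = (b - a) / n
total = mean = second = 0.0
for i in range(n + 1):
    x = a + i * h
    w = h / 2 if i in (0, n) else h       # trapezoidal weights
    total += w * pdf(x)
    mean += w * x * pdf(x)
    second += w * x * x * pdf(x)

var = second - mean ** 2                  # Var[X] = E[X^2] - (E[X])^2
assert abs(total - 1) < 1e-9
assert abs(mean - mu) < 1e-9
assert abs(var - sigma ** 2) < 1e-6
```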

13.

Vectors of random variables.

A property:

The covariance matrix Σ corresponding to such a vector is symmetric and positive semi-definite

Chuong Do, Stanford University, 2008

[adapted by Liviu Ciortuz]

14.

Let X_1, ..., X_n be random variables, with X_i : Ω → R for i = 1, ..., n. The covariance matrix of the vector of random variables X = (X_1, ..., X_n) is a square matrix of size n × n, whose elements are defined as: [Cov(X)]_{ij} \stackrel{def.}{=} Cov(X_i, X_j), for all i, j ∈ {1, ..., n}.

Show that Σ \stackrel{not.}{=} Cov(X) is a symmetric and positive semi-definite matrix, the second property meaning that for any vector z ∈ R^n the inequality z^⊤ Σ z ≥ 0 holds. (The vectors z ∈ R^n are considered column vectors, and the symbol ⊤ denotes the matrix transposition operation.)

15.

Cov(X)_{ij} \stackrel{def.}{=} Cov(X_i, X_j), for all i, j ∈ {1, ..., n}, and

Cov(X_i, X_j) \stackrel{def.}{=} E[(X_i - E[X_i])(X_j - E[X_j])] = E[(X_j - E[X_j])(X_i - E[X_i])] = Cov(X_j, X_i),

therefore Cov(X) is a symmetric matrix.

We will show that z^T \Sigma z \geq 0 for any z ∈ R^n (seen as a column vector):

z^T \Sigma z = \sum_{i=1}^{n} z_i \left( \sum_{j=1}^{n} \Sigma_{ij} z_j \right) = \sum_{i=1}^{n} \sum_{j=1}^{n} z_i\, \Sigma_{ij}\, z_j = \sum_{i=1}^{n} \sum_{j=1}^{n} z_i\, Cov[X_i, X_j]\, z_j

= \sum_{i=1}^{n} \sum_{j=1}^{n} z_i\, E[(X_i - E[X_i])(X_j - E[X_j])]\, z_j = E\left[ \sum_{i=1}^{n} \sum_{j=1}^{n} z_i\, (X_i - E[X_i])(X_j - E[X_j])\, z_j \right]

= E\left[ \left( \sum_{i=1}^{n} (X_i - E[X_i])\, z_i \right) \left( \sum_{j=1}^{n} (X_j - E[X_j])\, z_j \right) \right]

= E[((X - E[X])^T \cdot z)^2] \geq 0
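Both properties can be illustrated empirically on a sample covariance matrix (an aside, not part of the proof; the correlated construction X = (U, U + V, V) is my own example):

```python
import random

# Empirical check that a sample covariance matrix is symmetric and that
# z^T Sigma z >= 0 for random vectors z.
random.seed(0)
n_samples, dim = 5000, 3

# Draw correlated samples: X = (U, U + V, V) for independent U, V.
data = []
for _ in range(n_samples):
    u, v = random.gauss(0, 1), random.gauss(0, 2)
    data.append((u, u + v, v))

means = [sum(col) / n_samples for col in zip(*data)]
cov = [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in data) / n_samples
        for j in range(dim)] for i in range(dim)]

# Symmetry: Cov(X_i, X_j) = Cov(X_j, X_i).
assert all(abs(cov[i][j] - cov[j][i]) < 1e-12 for i in range(dim) for j in range(dim))

# Positive semi-definiteness: z^T Sigma z >= 0 (up to float rounding).
for _ in range(100):
    z = [random.gauss(0, 1) for _ in range(dim)]
    quad = sum(z[i] * cov[i][j] * z[j] for i in range(dim) for j in range(dim))
    assert quad >= -1e-9
```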

16.

Multi-variate Gaussian distributions:

A property:

When the covariance matrix of a multi-variate (d-dimensional) Gaussian

distribution is diagonal, then the p.d.f. (probability density function) of

the respective multi-variate Gaussian is equal to the product of d

independent uni-variate Gaussian densities.

Chuong Do, Stanford University, 2008

[adapted by Liviu Ciortuz]

17.

Let's consider X = [X_1 ... X_d]^T, μ ∈ R^d and Σ ∈ S_+^d, where S_+^d is the set of symmetric positive definite matrices (which implies |Σ| ≠ 0 and (x − μ)^T Σ^{-1} (x − μ) > 0, therefore −\frac{1}{2}(x − μ)^T Σ^{-1}(x − μ) < 0, for any x ∈ R^d, x ≠ μ).

The probability density function of a multi-variate Gaussian distribution of parameters μ and Σ is:

p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right),

Notation: X ∼ N(μ, Σ). Show that when the covariance matrix Σ is diagonal, the p.d.f. (probability density function) of the respective multi-variate Gaussian is equal to the product of d independent uni-variate Gaussian densities.

We will make the proof for d = 2 (generalization to d > 2 will be easy):

x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \qquad \mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} \qquad \Sigma = \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix}

Note: It is easy to show that if Σ ∈ S_+^d is diagonal, the elements on the principal diagonal of Σ are indeed strictly positive. (It is enough to consider z = (1, 0) and respectively z = (0, 1) in the formula for positive-definiteness of Σ.) This is why we wrote these diagonal elements of Σ as σ_1^2 and σ_2^2.

18.

19.

p(x; \mu, \Sigma) = \frac{1}{2\pi \begin{vmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{vmatrix}^{1/2}} \exp\left( -\frac{1}{2} \begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix}^T \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix}^{-1} \begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix} \right)

= \frac{1}{2\pi\, \sigma_1 \sigma_2} \exp\left( -\frac{1}{2} \begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix}^T \begin{bmatrix} 1/\sigma_1^2 & 0 \\ 0 & 1/\sigma_2^2 \end{bmatrix} \begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix} \right)

= \frac{1}{2\pi\, \sigma_1 \sigma_2} \exp\left( -\frac{1}{2} \begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix}^T \begin{bmatrix} (x_1 - \mu_1)/\sigma_1^2 \\ (x_2 - \mu_2)/\sigma_2^2 \end{bmatrix} \right)

= \frac{1}{2\pi\, \sigma_1 \sigma_2} \exp\left( -\frac{1}{2\sigma_1^2}(x_1 - \mu_1)^2 - \frac{1}{2\sigma_2^2}(x_2 - \mu_2)^2 \right)

= p(x_1; \mu_1, \sigma_1^2)\; p(x_2; \mu_2, \sigma_2^2).
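The factorization can be confirmed numerically at a few points (an illustrative check, not from the slides; the parameter values are arbitrary):

```python
import math

# Numeric check (d = 2): with a diagonal covariance, the bivariate Gaussian
# density equals the product of the two univariate Gaussian densities.
mu1, mu2, s1, s2 = 0.5, -1.0, 1.2, 0.7

def uni(x, mu, s):
    # Univariate Gaussian density N(mu, s^2).
    return math.exp(-(x - mu) ** 2 / (2 * s**2)) / (math.sqrt(2 * math.pi) * s)

def bi_diag(x1, x2):
    # Bivariate Gaussian density with diagonal covariance diag(s1^2, s2^2).
    det = (s1**2) * (s2**2)                 # |Sigma| for a diagonal Sigma
    quad = (x1 - mu1) ** 2 / s1**2 + (x2 - mu2) ** 2 / s2**2
    return math.exp(-quad / 2) / (2 * math.pi * math.sqrt(det))

for x1, x2 in [(0, 0), (1.3, -0.2), (-2, 1)]:
    assert abs(bi_diag(x1, x2) - uni(x1, mu1, s1) * uni(x2, mu2, s2)) < 1e-12
```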

20.

Bi-variate Gaussian distributions. A property:

The conditional distributions X_1|X_2 and X_2|X_1 are also Gaussians.

The calculation of their parameters

Duda, Hart and Stork, Pattern Classification, 2001,Appendix A.5.2

[adapted by Liviu Ciortuz]

21.

Let X be a random variable following a bivariate Gaussian distribution of parameters μ (the mean vector) and Σ (the covariance matrix). Thus, μ = (μ_1, μ_2) ∈ R², and Σ ∈ M_{2×2}(R).

By definition, Σ = Cov(X, X), where X \stackrel{not.}{=} (X_1, X_2), so Σ_{ij} = Cov(X_i, X_j) for i, j ∈ {1, 2}. Also, Cov(X_i, X_i) = Var[X_i] \stackrel{not.}{=} σ_i^2 ≥ 0 for i ∈ {1, 2}, while for i ≠ j we have Cov(X_i, X_j) = Cov(X_j, X_i) \stackrel{not.}{=} σ_{12}.

Finally, if we introduce the "correlation coefficient" ρ \stackrel{def.}{=} \frac{\sigma_{12}}{\sigma_1 \sigma_2}, it follows that we can write the covariance matrix as:

\Sigma = \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix}. \quad (2)

22.

Prove that the hypothesis X ∼ N(μ, Σ) implies that the conditional distribution X_2|X_1 is Gaussian, namely

X_2 \,|\, X_1 = x_1 \;\sim\; N(\mu_{2|1}, \sigma_{2|1}^2),

with \mu_{2|1} = \mu_2 + \rho\frac{\sigma_2}{\sigma_1}(x_1 - \mu_1) and \sigma_{2|1}^2 = \sigma_2^2(1 - \rho^2).

Remark: For X_1|X_2, the result is similar: X_1 \,|\, X_2 = x_2 \sim N(\mu_{1|2}, \sigma_{1|2}^2), with \mu_{1|2} = \mu_1 + \rho\frac{\sigma_1}{\sigma_2}(x_2 - \mu_2) and \sigma_{1|2}^2 = \sigma_1^2(1 - \rho^2).

Source:

Pattern Classification, Appendix A.5.2,

Duda, Hart and Stork, 2001

23.

Answer

p_{X_2|X_1}(x_2|x_1) \stackrel{def.}{=} \frac{p_{X_1,X_2}(x_1, x_2)}{p_{X_1}(x_1)}, \quad (3)

where

p_{X_1,X_2}(x_1, x_2) = \frac{1}{(\sqrt{2\pi})^2 \sqrt{|\Sigma|}} \exp\left( -\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu) \right)

and

p_{X_1}(x_1) = \frac{1}{\sqrt{2\pi}\,\sigma_1} \exp\left( -\frac{1}{2\sigma_1^2}(x_1 - \mu_1)^2 \right). \quad (4)

From (2) it follows that |\Sigma| = \sigma_1^2 \sigma_2^2 (1 - \rho^2). In order that |\Sigma| and \Sigma^{-1} be defined, it follows that ρ ∈ (−1, 1). Moreover, since σ_1, σ_2 > 0, we will have \sqrt{|\Sigma|} = \sigma_1 \sigma_2 \sqrt{1 - \rho^2}.

\Sigma^{-1} = \frac{1}{\sigma_1^2 \sigma_2^2 (1 - \rho^2)}\, \Sigma^* = \frac{1}{\sigma_1^2 \sigma_2^2 (1 - \rho^2)} \begin{bmatrix} \sigma_2^2 & -\rho\sigma_1\sigma_2 \\ -\rho\sigma_1\sigma_2 & \sigma_1^2 \end{bmatrix} = \frac{1}{1 - \rho^2} \begin{bmatrix} 1/\sigma_1^2 & -\rho/(\sigma_1\sigma_2) \\ -\rho/(\sigma_1\sigma_2) & 1/\sigma_2^2 \end{bmatrix}

24.

So,

p_{X_1,X_2}(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left( -\frac{1}{2(1-\rho^2)}\, (x_1 - \mu_1,\; x_2 - \mu_2) \begin{bmatrix} 1/\sigma_1^2 & -\rho/(\sigma_1\sigma_2) \\ -\rho/(\sigma_1\sigma_2) & 1/\sigma_2^2 \end{bmatrix} \begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix} \right)

= \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \cdot \exp\left( -\frac{1}{2(1-\rho^2)} \left[ \left(\frac{x_1-\mu_1}{\sigma_1}\right)^2 - 2\rho \left(\frac{x_1-\mu_1}{\sigma_1}\right)\left(\frac{x_2-\mu_2}{\sigma_2}\right) + \left(\frac{x_2-\mu_2}{\sigma_2}\right)^2 \right] \right) \quad (5)

25.

By substituting (4) and (5) in the definition (3), we will get:

p(x_2|x_1) = \frac{p_{X_1,X_2}(x_1, x_2)}{p_{X_1}(x_1)}

= \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \cdot \exp\left( -\frac{1}{2(1-\rho^2)} \left[ \left(\frac{x_1-\mu_1}{\sigma_1}\right)^2 - 2\rho\left(\frac{x_1-\mu_1}{\sigma_1}\right)\left(\frac{x_2-\mu_2}{\sigma_2}\right) + \left(\frac{x_2-\mu_2}{\sigma_2}\right)^2 \right] \right) \cdot \sqrt{2\pi}\,\sigma_1 \exp\left( \frac{1}{2}\left(\frac{x_1-\mu_1}{\sigma_1}\right)^2 \right)

= \frac{1}{\sqrt{2\pi}\,\sigma_2\sqrt{1-\rho^2}} \exp\left[ -\frac{1}{2(1-\rho^2)} \left( \frac{x_2-\mu_2}{\sigma_2} - \rho\,\frac{x_1-\mu_1}{\sigma_1} \right)^2 \right]

= \frac{1}{\sqrt{2\pi}\,\sigma_2\sqrt{1-\rho^2}} \exp\left[ -\frac{1}{2} \left( \frac{x_2 - \left[\mu_2 + \rho\frac{\sigma_2}{\sigma_1}(x_1 - \mu_1)\right]}{\sigma_2\sqrt{1-\rho^2}} \right)^2 \right]

Therefore,

X_2 \,|\, X_1 = x_1 \;\sim\; N(\mu_{2|1}, \sigma_{2|1}^2) \text{ with } \mu_{2|1} \stackrel{not.}{=} \mu_2 + \rho\frac{\sigma_2}{\sigma_1}(x_1 - \mu_1) \text{ and } \sigma_{2|1}^2 \stackrel{not.}{=} \sigma_2^2(1 - \rho^2).

26.

Using the Central Limit Theorem (the i.i.d. version)

to compute the real error of a classifier

CMU, 2008 fall, Eric Xing, HW3, pr. 3.3

27.

Chris recently adopted a new (binary) classifier to filter email spam. He wants to quantitatively evaluate how good the classifier is.

He has a small dataset of 100 emails on hand which, you canassume, are randomly drawn from all emails.

He tests the classifier on the 100 emails and gets 83 classifiedcorrectly, so the error rate on the small dataset is 17%.

However, the number on 100 samples could be either higher orlower than the real error rate just by chance.

With a confidence level of 95%, what is likely to be the range ofthe real error rate? Please write down all important steps.

(Hint: You need some approximation in this problem.)

28.

Notations:

Let X_i, i = 1, ..., n = 100 be defined as: X_i = 1 if the email i was incorrectly classified, and 0 otherwise;

E[X_i] \stackrel{not.}{=} \mu \stackrel{not.}{=} e_{real}; \qquad Var(X_i) \stackrel{not.}{=} \sigma^2

e_{sample} \stackrel{not.}{=} \frac{X_1 + \ldots + X_n}{n} = 0.17

Z_n = \frac{X_1 + \ldots + X_n - n\mu}{\sqrt{n}\,\sigma} \quad (the standardized form of X_1 + \ldots + X_n)

Key insight:

Calculating the real error of the classifier (more exactly, a symmetric interval around the real error p \stackrel{not.}{=} \mu) with a "confidence" of 95% amounts to finding a > 0 such that P(|Z_n| ≤ a) ≥ 0.95.

29.

Calculus:

|Z_n| \leq a \;\Leftrightarrow\; \left| \frac{X_1 + \ldots + X_n - n\mu}{\sqrt{n}\,\sigma} \right| \leq a \;\Leftrightarrow\; \left| \frac{X_1 + \ldots + X_n - n\mu}{n\sigma} \right| \leq \frac{a}{\sqrt{n}}

\Leftrightarrow\; \left| \frac{X_1 + \ldots + X_n - n\mu}{n} \right| \leq \frac{a\sigma}{\sqrt{n}} \;\Leftrightarrow\; \left| \frac{X_1 + \ldots + X_n}{n} - \mu \right| \leq \frac{a\sigma}{\sqrt{n}}

\Leftrightarrow\; |e_{sample} - e_{real}| \leq \frac{a\sigma}{\sqrt{n}} \;\Leftrightarrow\; |e_{real} - e_{sample}| \leq \frac{a\sigma}{\sqrt{n}}

\Leftrightarrow\; -\frac{a\sigma}{\sqrt{n}} \leq e_{real} - e_{sample} \leq \frac{a\sigma}{\sqrt{n}} \;\Leftrightarrow\; e_{sample} - \frac{a\sigma}{\sqrt{n}} \leq e_{real} \leq e_{sample} + \frac{a\sigma}{\sqrt{n}}

\Leftrightarrow\; e_{real} \in \left[ e_{sample} - \frac{a\sigma}{\sqrt{n}},\; e_{sample} + \frac{a\sigma}{\sqrt{n}} \right]

30.

Important facts:

The Central Limit Theorem: Z_n → N(0, 1)

Therefore, P(|Z_n| ≤ a) ≈ P(|X| ≤ a) = Φ(a) − Φ(−a), where X ∼ N(0, 1) and Φ is the cumulative distribution function of N(0, 1).

Calculus:

Φ(−a) + Φ(a) = 1 ⇒ P(|Z_n| ≤ a) = Φ(a) − Φ(−a) = 2Φ(a) − 1

P(|Z_n| ≤ a) = 0.95 ⇔ 2Φ(a) − 1 = 0.95 ⇔ Φ(a) = 0.975 ⇔ a ≅ 1.96 (see the Φ table)

σ² \stackrel{not.}{=} Var_{real} = e_{real}(1 − e_{real}) because the X_i are Bernoulli variables.

Furthermore, we can approximate e_{real} with e_{sample}, because E[e_{sample}] = e_{real} and Var_{sample} = \frac{1}{n} Var_{real} → 0 for n → +∞, cf. CMU, 2011 fall, T. Mitchell, A. Singh, HW2, pr. 1.ab.

Finally:

\frac{a\sigma}{\sqrt{n}} \approx 1.96 \cdot \frac{\sqrt{0.17\,(1 - 0.17)}}{\sqrt{100}} \cong 0.07

|e_{real} − e_{sample}| ≤ 0.07 ⇔ |e_{real} − 0.17| ≤ 0.07 ⇔ −0.07 ≤ e_{real} − 0.17 ≤ 0.07

⇔ e_{real} ∈ [0.10, 0.24]
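The computation above can be packaged as a small helper (the function name `error_interval` is my own, not from the source; 1.96 is the standard normal quantile for a 95% two-sided interval):

```python
import math

# Normal-approximation confidence interval for a classifier's real error
# rate, given its sample error rate on n test points.
def error_interval(e_sample, n, z=1.96):
    half_width = z * math.sqrt(e_sample * (1 - e_sample)) / math.sqrt(n)
    return (e_sample - half_width, e_sample + half_width)

lo, hi = error_interval(0.17, 100)     # the problem's numbers
assert abs(lo - 0.10) < 0.005 and abs(hi - 0.24) < 0.005
```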

31.

Exemplifying

a mixture of categorical distributions;

how to compute its expectation and variance

CMU, 2010 fall, Aarti Singh, HW1, pr. 2.2.1-2

32.

Suppose that I have two six-sided dice, one is fair and the other one is loaded, having:

P(x) = \begin{cases} \dfrac{1}{2} & x = 6 \\[4pt] \dfrac{1}{10} & x \in \{1, 2, 3, 4, 5\} \end{cases}

I will toss a coin to decide which die to roll. If the coin flip is heads I will roll the fair die, otherwise the loaded one. The probability that the coin flip is heads is p ∈ (0, 1).

a. What is the expectation of the die roll (in terms of p)?

b. What is the variance of the die roll (in terms of p)?

33.

Solution:

a.

E[X] = \sum_{i=1}^{6} i \cdot [P(i|fair) \cdot p + P(i|loaded) \cdot (1 - p)]

= \left[ \sum_{i=1}^{6} i \cdot P(i|fair) \right] p + \left[ \sum_{i=1}^{6} i \cdot P(i|loaded) \right] (1 - p)

= \frac{7}{2}\, p + \frac{9}{2}\, (1 - p) = \frac{9}{2} - p

34.

b. Recall that we may write Var(X) = E[X^2] − (E[X])^2, therefore:

E[X^2] = \sum_{i=1}^{6} i^2 \cdot [P(i|fair) \cdot p + P(i|loaded) \cdot (1 - p)]

= \left[ \sum_{i=1}^{6} i^2 \cdot P(i|fair) \right] p + \left[ \sum_{i=1}^{6} i^2 \cdot P(i|loaded) \right] (1 - p)

= \frac{91}{6}\, p + \left( \frac{36}{2} + \frac{55}{10} \right)(1 - p) = \frac{47}{2} - \frac{25}{3}\, p

Combining this with the result of the previous question yields:

Var(X) = E[X^2] - (E[X])^2 = \frac{141}{6} - \frac{50}{6}\, p - \left( \frac{9}{2} - p \right)^2

= \frac{141}{6} - \frac{50}{6}\, p - \left( \frac{81}{4} - 9p + p^2 \right)

= \left( \frac{141}{6} - \frac{81}{4} \right) + \left( 9 - \frac{50}{6} \right) p - p^2

= \frac{13}{4} + \frac{2}{3}\, p - p^2
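Both closed forms can be verified exactly by enumerating the mixture distribution with rational arithmetic (an illustrative check, not from the slides; p = 1/3 is an arbitrary sample value):

```python
from fractions import Fraction as F

# Exact check of the mixture mean and variance by enumeration, using the
# fair and loaded dice from the problem.
p = F(1, 3)                                   # P(heads) -> roll the fair die
fair = {i: F(1, 6) for i in range(1, 7)}
loaded = {i: F(1, 10) for i in range(1, 6)}
loaded[6] = F(1, 2)

mix = {i: p * fair[i] + (1 - p) * loaded[i] for i in range(1, 7)}
mean = sum(i * q for i, q in mix.items())
var = sum(i * i * q for i, q in mix.items()) - mean**2

assert mean == F(9, 2) - p                    # E[X]  = 9/2 - p
assert var == F(13, 4) + F(2, 3) * p - p**2   # Var(X) = 13/4 + 2p/3 - p^2
```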

35.

Elements of Information Theory: Some examples and then some useful proofs

36.

Computing entropies and specific conditional entropies

for discrete random variables

CMU, 2012 spring, R. Rosenfeld, HW2, pr. 2

37.

On the roll of two six-sided fair dice,

a. Calculate the distribution of the sum (S) of the total.

b. The amount of information (or surprise) when seeing the outcome x for a random variable X is defined as \log_2 \frac{1}{P(X = x)} = -\log_2 P(X = x). How surprised are you (in bits) to observe S = 2, S = 11, S = 5, S = 7?

c. Calculate the entropy of S [as the expected value of the random variable −\log_2 P(X = x)].

d. Let's say you throw the dice one by one, and the first die shows 4. What is the entropy of S after this observation? Was any information gained / lost in the process? If so, calculate how much information (in bits) was lost or gained.

38.

a.

S      2     3     4     5     6     7     8     9     10    11    12
P(S)   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

b.

Information(S = 2) = −\log_2(1/36) = \log_2 36 = 2\log_2 6 = 2(1 + \log_2 3) ≅ 5.169925001 bits

Information(S = 11) = −\log_2(2/36) = \log_2 18 = 1 + 2\log_2 3 ≅ 4.169925001 bits

Information(S = 5) = −\log_2(4/36) = \log_2 9 = 2\log_2 3 ≅ 3.169925001 bits

Information(S = 7) = −\log_2(6/36) = \log_2 6 = 1 + \log_2 3 ≅ 2.584962501 bits

39.

c.

H(S) = -\sum_{i=1}^{n} p_i \log_2 p_i

= -\left( 2 \cdot \frac{1}{36}\log_2\frac{1}{36} + 2 \cdot \frac{2}{36}\log_2\frac{2}{36} + 2 \cdot \frac{3}{36}\log_2\frac{3}{36} + 2 \cdot \frac{4}{36}\log_2\frac{4}{36} + 2 \cdot \frac{5}{36}\log_2\frac{5}{36} + \frac{6}{36}\log_2\frac{6}{36} \right)

= \frac{1}{36}\left( 2\log_2 36 + 4\log_2 18 + 6\log_2 12 + 8\log_2 9 + 10\log_2\frac{36}{5} + 6\log_2 6 \right)

= \frac{1}{36}\left( 2\log_2 6^2 + 4\log_2 (6 \cdot 3) + 6\log_2 (6 \cdot 2) + 8\log_2 3^2 + 10\log_2\frac{6^2}{5} + 6\log_2 6 \right)

= \frac{1}{36}\left( 40\log_2 6 + 20\log_2 3 + 6 - 10\log_2 5 \right)

= \frac{1}{36}\left( 60\log_2 3 + 46 - 10\log_2 5 \right) ≅ 3.274401919 bits.

40.

d.

S        2  3  4  5    6    7    8    9    10   11  12
P(S|…)   0  0  0  1/6  1/6  1/6  1/6  1/6  1/6  0   0

H(S | First-die-shows-4) = -6 \cdot \frac{1}{6}\log_2\frac{1}{6} = \log_2 6 ≅ 2.58 bits,

IG(S; First-die-shows-4) = H(S) − H(S | First-die-shows-4) = 3.27 − 2.58 = 0.69 bits.
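These values can be recomputed by direct enumeration over the two dice (an illustrative check, not part of the original solution):

```python
import math
from collections import Counter

# Recompute H(S), H(S | first die = 4) and the information gain.
counts = Counter(a + b for a in range(1, 7) for b in range(1, 7))
p_s = {s: c / 36 for s, c in counts.items()}
H_S = -sum(q * math.log2(q) for q in p_s.values())

# Given the first die shows 4, S is uniform over {5, ..., 10}.
H_cond = math.log2(6)
IG = H_S - H_cond

assert abs(H_S - 3.2744) < 1e-3
assert abs(IG - 0.69) < 1e-2
```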

41.

Computing entropies and mean conditional entropies

for discrete random variables

CMU, 2012 spring, R. Rosenfeld, HW2, pr. 3

42.

A doctor needs to diagnose a person having cold (C). The primary factor he considers in his diagnosis is the outside temperature (T). The random variable C takes two values, yes / no, and the random variable T takes 3 values: sunny, rainy, snowy. The joint distribution of the two variables is given in the following table.

           T = sunny   T = rainy   T = snowy
C = no       0.30        0.20        0.10
C = yes      0.05        0.15        0.20

a. Calculate the marginal probabilities P(C), P(T).

Hint: Use P(X = x) = \sum_{y} P(X = x, Y = y). For example,

P(C = no) = P(C = no, T = sunny) + P(C = no, T = rainy) + P(C = no, T = snowy).

b. Calculate the entropies H(C), H(T).

c. Calculate the mean conditional entropies H(C|T), H(T|C).

43.

a. P_C = (0.6, 0.4) and P_T = (0.35, 0.35, 0.30).

b.

H(C) = 0.6\log_2\frac{5}{3} + 0.4\log_2\frac{5}{2} = \log_2 5 - 0.6\log_2 3 - 0.4 ≅ 0.971 bits

H(T) = 2 \cdot 0.35\log_2\frac{20}{7} + 0.3\log_2\frac{10}{3} = 0.7\,(2 + \log_2 5 - \log_2 7) + 0.3\,(1 + \log_2 5 - \log_2 3)

= 1.7 + \log_2 5 - 0.7\log_2 7 - 0.3\log_2 3 ≅ 1.581 bits.

44.

c.

H(C|T) \stackrel{def.}{=} \sum_{t \in Val(T)} P(T = t) \cdot H(C \mid T = t)

= P(T = sunny) \cdot H(C|T = sunny) + P(T = rainy) \cdot H(C|T = rainy) + P(T = snowy) \cdot H(C|T = snowy)

= 0.35 \cdot H\!\left( \frac{0.30}{0.30 + 0.05}, \frac{0.05}{0.30 + 0.05} \right) + 0.35 \cdot H\!\left( \frac{0.20}{0.20 + 0.15}, \frac{0.15}{0.20 + 0.15} \right) + 0.30 \cdot H\!\left( \frac{0.10}{0.10 + 0.20}, \frac{0.20}{0.20 + 0.10} \right)

= \frac{7}{20} \cdot H\!\left( \frac{6}{7}, \frac{1}{7} \right) + \frac{7}{20} \cdot H\!\left( \frac{4}{7}, \frac{3}{7} \right) + \frac{3}{10} \cdot H\!\left( \frac{1}{3}, \frac{2}{3} \right)

= \frac{7}{20} \left( \frac{6}{7}\log_2\frac{7}{6} + \frac{1}{7}\log_2 7 \right) + \frac{7}{20} \left( \frac{4}{7}\log_2\frac{7}{4} + \frac{3}{7}\log_2\frac{7}{3} \right) + \frac{3}{10} \left( \frac{1}{3}\log_2 3 + \frac{2}{3}\log_2\frac{3}{2} \right)

= \frac{7}{20} \left( \log_2 7 - \frac{6}{7} - \frac{6}{7}\log_2 3 \right) + \frac{7}{20} \left( \log_2 7 - \frac{8}{7} - \frac{3}{7}\log_2 3 \right) + \frac{3}{10} \left( \log_2 3 - \frac{2}{3} \right)

= \frac{7}{10}\log_2 7 - \left( \frac{3}{10} + \frac{4}{10} + \frac{2}{10} \right) - \left( \frac{6}{20} + \frac{3}{20} - \frac{3}{10} \right)\log_2 3 = \frac{7}{10}\log_2 7 - \frac{3}{20}\log_2 3 - \frac{9}{10} ≅ 0.8274 bits.

45.

H(T|C) \stackrel{def.}{=} \sum_{c \in Val(C)} P(C = c) \cdot H(T \mid C = c)

= P(C = no) \cdot H(T|C = no) + P(C = yes) \cdot H(T|C = yes)

= 0.60 \cdot H\!\left( \frac{0.30}{0.30 + 0.20 + 0.10}, \frac{0.20}{0.30 + 0.20 + 0.10}, \frac{0.10}{0.30 + 0.20 + 0.10} \right) + 0.40 \cdot H\!\left( \frac{0.05}{0.05 + 0.15 + 0.20}, \frac{0.15}{0.05 + 0.15 + 0.20}, \frac{0.20}{0.05 + 0.15 + 0.20} \right)

= \frac{3}{5} \cdot H\!\left( \frac{1}{2}, \frac{1}{3}, \frac{1}{6} \right) + \frac{2}{5} \cdot H\!\left( \frac{1}{8}, \frac{3}{8}, \frac{1}{2} \right)

= \frac{3}{5} \left( \frac{1}{2} + \frac{1}{3}\log_2 3 + \frac{1}{6}(1 + \log_2 3) \right) + \frac{2}{5} \left( \frac{1}{8} \cdot 3 + \frac{3}{8}(3 - \log_2 3) + \frac{1}{2} \right)

= \frac{3}{5} \left( \frac{2}{3} + \frac{1}{2}\log_2 3 \right) + \frac{2}{5} \left( 2 - \frac{3}{8}\log_2 3 \right)

= \frac{6}{5} + \frac{3}{20}\log_2 3 ≅ 1.43774 bits.
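All four entropies can be recomputed directly from the joint table (an illustrative cross-check, not part of the original solution):

```python
import math

# Recompute H(C), H(C|T) and H(T|C) from the joint distribution table.
joint = {('no', 'sunny'): 0.30, ('no', 'rainy'): 0.20, ('no', 'snowy'): 0.10,
         ('yes', 'sunny'): 0.05, ('yes', 'rainy'): 0.15, ('yes', 'snowy'): 0.20}

def H(dist):
    return -sum(q * math.log2(q) for q in dist if q > 0)

p_c = {c: sum(q for (cc, t), q in joint.items() if cc == c) for c in ('no', 'yes')}
p_t = {t: sum(q for (c, tt), q in joint.items() if tt == t)
       for t in ('sunny', 'rainy', 'snowy')}

# Mean conditional entropies: weighted sums of the conditional distributions.
H_C_given_T = sum(p_t[t] * H([joint[(c, t)] / p_t[t] for c in p_c]) for t in p_t)
H_T_given_C = sum(p_c[c] * H([joint[(c, t)] / p_c[c] for t in p_t]) for c in p_c)

assert abs(H(p_c.values()) - 0.971) < 1e-3
assert abs(H_C_given_T - 0.8274) < 1e-4
assert abs(H_T_given_C - 1.43774) < 1e-4
```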

46.

Computing the entropy of the exponential distribution

CMU, 2011 spring, R. Rosenfeld,

HW2, pr. 2.c

[Figure: the exponential p.d.f., plotted for θ = 0.5, θ = 1 and θ = 1.5.]

47.

For a continuous probability distribution P, the entropy is defined as follows:

H(P) = \int_{-\infty}^{+\infty} P(x)\,\log_2\frac{1}{P(x)}\, dx

Calculate the entropy of the continuous exponential distribution of parameter λ > 0. The definition of this distribution is the following:

P(x) = \begin{cases} \lambda e^{-\lambda x}, & \text{if } x \geq 0; \\ 0, & \text{if } x < 0. \end{cases}

Hint: If P(x) = 0, you will assume that −P(x) \log_2 P(x) = 0.

48.

Answer

H(P) = \int_{-\infty}^{0} P(x)\log_2\frac{1}{P(x)}\, dx + \int_{0}^{\infty} P(x)\log_2\frac{1}{P(x)}\, dx

\stackrel{def.\, P}{=} \underbrace{\int_{-\infty}^{0} 0 \cdot \log_2 0\, dx}_{0} + \int_{0}^{\infty} \lambda e^{-\lambda x}\log_2\frac{1}{\lambda e^{-\lambda x}}\, dx = \int_{0}^{\infty} \lambda e^{-\lambda x}\log_2\frac{1}{\lambda e^{-\lambda x}}\, dx

\Rightarrow H(P) = \frac{1}{\ln 2}\int_{0}^{\infty} \lambda e^{-\lambda x}\ln\frac{1}{\lambda e^{-\lambda x}}\, dx = \frac{1}{\ln 2}\int_{0}^{\infty} \lambda e^{-\lambda x}\left( \ln\frac{1}{\lambda} + \ln\frac{1}{e^{-\lambda x}} \right) dx

= \frac{1}{\ln 2}\int_{0}^{\infty} \lambda e^{-\lambda x}\left( -\ln\lambda + \ln e^{\lambda x} \right) dx

= \frac{1}{\ln 2}\int_{0}^{\infty} \lambda e^{-\lambda x}\left( -\ln\lambda + \lambda x \right) dx

= \frac{1}{\ln 2}\int_{0}^{\infty} \lambda e^{-\lambda x}(-\ln\lambda)\, dx + \frac{1}{\ln 2}\int_{0}^{\infty} \lambda e^{-\lambda x}\,\lambda x\, dx

= -\frac{\ln\lambda}{\ln 2}\int_{0}^{\infty} \lambda e^{-\lambda x}\, dx + \frac{\lambda}{\ln 2}\int_{0}^{\infty} \lambda e^{-\lambda x}\, x\, dx

= \frac{\ln\lambda}{\ln 2}\int_{0}^{\infty} \left( e^{-\lambda x} \right)' dx - \frac{\lambda}{\ln 2}\int_{0}^{\infty} \left( e^{-\lambda x} \right)' x\, dx

49.

The first integral is solved very easily:

\int_{0}^{\infty} \left( e^{-\lambda x} \right)' dx = e^{-\lambda x}\Big|_{0}^{\infty} = e^{-\infty} - e^0 = 0 - 1 = -1

To solve the second integral, one can use the integration by parts formula:

\int_{0}^{\infty} \left( e^{-\lambda x} \right)' x\, dx = e^{-\lambda x}\, x\Big|_{0}^{\infty} - \int_{0}^{\infty} e^{-\lambda x}\, x'\, dx = e^{-\lambda x}\, x\Big|_{0}^{\infty} - \int_{0}^{\infty} e^{-\lambda x}\, dx

The definite term e^{-\lambda x}\, x\Big|_{0}^{\infty} cannot be computed directly (because of the 0 · ∞ conflict that occurs when x takes the limit value ∞), but it is computed using l'Hopital's rule:

\lim_{x\to\infty} x e^{-\lambda x} = \lim_{x\to\infty} \frac{x}{e^{\lambda x}} = \lim_{x\to\infty} \frac{x'}{(e^{\lambda x})'} = \lim_{x\to\infty} \frac{1}{\lambda e^{\lambda x}} = \frac{1}{\lambda}\lim_{x\to\infty} e^{-\lambda x} = 0,

so

e^{-\lambda x}\, x\Big|_{0}^{\infty} = 0 - 0 = 0.

50.

The integral \int_{0}^{\infty} e^{-\lambda x}\, dx is computed easily:

\int_{0}^{\infty} e^{-\lambda x}\, dx = -\frac{1}{\lambda}\int_{0}^{\infty} \left( e^{-\lambda x} \right)' dx = -\frac{1}{\lambda}\, e^{-\lambda x}\Big|_{0}^{\infty} = -\frac{1}{\lambda}(0 - 1) = \frac{1}{\lambda}

Therefore,

\int_{0}^{\infty} \left( e^{-\lambda x} \right)' x\, dx = 0 - \frac{1}{\lambda} = -\frac{1}{\lambda},

which leads to the final result:

H(P) = \frac{\ln\lambda}{\ln 2}(-1) - \frac{\lambda}{\ln 2}\left( -\frac{1}{\lambda} \right) = -\frac{\ln\lambda}{\ln 2} + \frac{1}{\ln 2} = \frac{1 - \ln\lambda}{\ln 2}.
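The closed form can be checked by numerically integrating −P(x) log₂ P(x) (an illustrative sketch, not from the slides; λ = 1.5 is an arbitrary sample value):

```python
import math

# Numeric check of H(P) = (1 - ln(lambda)) / ln(2) for the exponential
# distribution, via trapezoidal integration of -p(x)·log2(p(x)) over [0, 40].
lam = 1.5

def f(x):
    return lam * math.exp(-lam * x)        # exponential density on x >= 0

a, b, n = 0.0, 40.0, 50_000
h = (b - a) / n
H_num = 0.0
for i in range(n + 1):
    x = a + i * h
    w = h / 2 if i in (0, n) else h        # trapezoidal weights
    H_num += -w * f(x) * math.log2(f(x))

H_closed = (1 - math.log(lam)) / math.log(2)
assert abs(H_num - H_closed) < 1e-4
```

The tail beyond x = 40 is negligible (of order e^{-60}), so truncating the integral there is safe.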

51.

Derivation of entropy definition,

starting from a set of desirable properties

CMU, 2005 fall, T. Mitchell, A. Moore, HW1, pr. 2.2

52.

Remark: The definition we gave for entropy, -\sum_{i=1}^{n} p_i \log p_i, is not very intuitive.

Theorem:

If ψ_n(p_1, ..., p_n) satisfies the following axioms

A0. [LC:] ψ_n(p_1, ..., p_n) ≥ 0 for any n ∈ N* and p_1, ..., p_n, since we view ψ_n as a measure of disorder; also, ψ_1(1) = 0 because in this case there is no disorder;

A1. ψ_n should be continuous in p_i and symmetric in its arguments;

A2. if p_i = 1/n then ψ_n should be a monotonically increasing function of n; (If all events are equally likely, then having more events means being more uncertain.)

A3. if a choice among N events is broken down into successive choices, then the entropy should be the weighted sum of the entropy at each stage;

then ψ_n(p_1, ..., p_n) = -K \sum_i p_i \log p_i, where K is a positive constant.

(As we'll see, K depends however on ψ_s\!\left( \frac{1}{s}, \ldots, \frac{1}{s} \right) for a certain s ∈ N*.)

Remark: We will prove the theorem firstly for uniform distributions (p_i = 1/n) and secondly for the case p_i ∈ Q (only!).

53.

Example for the axiom A3:

Encoding 1

(a,b,c)

1/2 1/3 1/6

b ca

Encoding 2

(a,b,c)

a

1/2 1/2

(b,c)

b c

2/3 1/3

H

(

1

2,1

3,1

6

)

=1

2log 2 +

1

3log 3 +

1

6log 6 =

(

1

2+

1

6

)

log 2 +

(

1

3+

1

6

)

log 3 =2

3+

1

2log 3

H

(

1

2,1

2

)

+1

2H

(

2

3,1

3

)

= 1 +1

2

(

2

3log

3

2+

1

3log 3

)

= 1 +1

2

(

log 3− 2

3

)

=2

3+

1

2log 3

The next 3 slides:

Case 1: p_i = 1/n for i = 1, ..., n; proof steps:

a. A(n) \stackrel{not.}{=} ψ(1/n, 1/n, ..., 1/n) implies

A(s^m) = m\, A(s) for any s, m ∈ N*. (1)

b. If s, m ∈ N* (fixed), s ≠ 1, and t, n ∈ N* such that s^m ≤ t^n ≤ s^{m+1}, then

\left| \frac{m}{n} - \frac{\log t}{\log s} \right| \leq \frac{1}{n}. (2)

c. For s^m ≤ t^n ≤ s^{m+1} as above, due to A2 it follows (immediately)

ψ_{s^m}\!\left( \frac{1}{s^m}, \ldots, \frac{1}{s^m} \right) \leq ψ_{t^n}\!\left( \frac{1}{t^n}, \ldots, \frac{1}{t^n} \right) \leq ψ_{s^{m+1}}\!\left( \frac{1}{s^{m+1}}, \ldots, \frac{1}{s^{m+1}} \right)

i.e. A(s^m) ≤ A(t^n) ≤ A(s^{m+1}). Show that

\left| \frac{m}{n} - \frac{A(t)}{A(s)} \right| \leq \frac{1}{n} for s ≠ 1. (3)

d. Combining (2) + (3) immediately gives

\left| \frac{A(t)}{A(s)} - \frac{\log t}{\log s} \right| \leq \frac{2}{n} for s ≠ 1. (4)

Show that this inequality implies

A(t) = K \log t with K > 0 (due to A2). (5)

55.

Proof

a.

[Figure: a choice among s^m equally likely outcomes organized as an m-level tree, in which each node branches s ways, every branch having probability 1/s.]

Applying the axiom A3 on the tree encoding from above gives:

A(s^m) = A(s) + s \cdot \frac{1}{s}\, A(s) + s^2 \cdot \frac{1}{s^2}\, A(s) + \ldots + s^{m-1} \cdot \frac{1}{s^{m-1}}\, A(s)

= \underbrace{A(s) + A(s) + A(s) + \ldots + A(s)}_{m \text{ times}} = m\, A(s)

56.

Proof (cont'd)

b.

s^m \leq t^n \leq s^{m+1} \Rightarrow m \log s \leq n \log t \leq (m+1)\log s \Rightarrow \frac{m}{n} \leq \frac{\log t}{\log s} \leq \frac{m}{n} + \frac{1}{n} \Rightarrow 0 \leq \frac{\log t}{\log s} - \frac{m}{n} \leq \frac{1}{n} \Rightarrow \left| \frac{\log t}{\log s} - \frac{m}{n} \right| \leq \frac{1}{n}

c.

A(s^m) \leq A(t^n) \leq A(s^{m+1}) \stackrel{(1)}{\Rightarrow} m\, A(s) \leq n\, A(t) \leq (m+1)\, A(s) \stackrel{s \neq 1}{\Rightarrow} \frac{m}{n} \leq \frac{A(t)}{A(s)} \leq \frac{m}{n} + \frac{1}{n} \Rightarrow 0 \leq \frac{A(t)}{A(s)} - \frac{m}{n} \leq \frac{1}{n} \Rightarrow \left| \frac{A(t)}{A(s)} - \frac{m}{n} \right| \leq \frac{1}{n}

d. Consider again s^m ≤ t^n ≤ s^{m+1} with s, t fixed. If m → ∞ then n → ∞ and from \left| \frac{A(t)}{A(s)} - \frac{\log t}{\log s} \right| \leq \frac{2}{n} it follows that \left| \frac{A(t)}{A(s)} - \frac{\log t}{\log s} \right| \to 0.

Therefore \left| \frac{A(t)}{A(s)} - \frac{\log t}{\log s} \right| = 0 and so \frac{A(t)}{A(s)} = \frac{\log t}{\log s}.

Finally, A(t) = \frac{A(s)}{\log s}\,\log t = K \log t, where K = \frac{A(s)}{\log s} > 0 (if s ≠ 1).

57.

Case 2: p_i ∈ Q for i = 1, ..., n

Let's consider a set of N ≥ 2 equiprobable random events, and P = (S_1, S_2, ..., S_k) a partition of this set. Let's denote p_i = |S_i| / N.

[Figure: a two-level tree; level 1 chooses a part S_i with probability |S_i|/N, level 2 chooses an element inside S_i, each with probability 1/|S_i|.]

A "natural" two-step encoding (as shown in the figure) leads to A(N) = ψ_k(p_1, ..., p_k) + \sum_i p_i\, A(|S_i|), based on the axiom A3.

Finally, using the result A(t) = K \log t, gives:

K \log N = ψ_k(p_1, ..., p_k) + K \sum_i p_i \log |S_i|

\Rightarrow ψ_k(p_1, ..., p_k) = K\left[ \log N - \sum_i p_i \log |S_i| \right]

= K\left[ \log N \sum_i p_i - \sum_i p_i \log |S_i| \right] = -K \sum_i p_i \log\frac{|S_i|}{N} = -K \sum_i p_i \log p_i

58.

Entropy, joint entropy,

conditional entropy, information gain:

definitions and immediate properties

CMU, 2005 fall, T. Mitchell, A. Moore, HW1, pr. 2

59.

Definitions

• The entropy of the variable X:

H(X) \stackrel{def.}{=} -\sum_i P(X = x_i)\log P(X = x_i) \stackrel{not.}{=} E_X[-\log P(X)].

• The specific conditional entropy of the variable Y with respect to the value x_k of the variable X:

H(Y \mid X = x_k) \stackrel{def.}{=} -\sum_j P(Y = y_j \mid X = x_k)\log P(Y = y_j \mid X = x_k) \stackrel{not.}{=} E_{Y|X=x_k}[-\log P(Y \mid X = x_k)].

• The mean conditional entropy of the variable Y with respect to the variable X:

H(Y \mid X) \stackrel{def.}{=} \sum_k P(X = x_k)\, H(Y \mid X = x_k) \stackrel{not.}{=} E_X[H(Y \mid X)].

• The joint entropy of the variables X and Y:

H(X, Y) \stackrel{def.}{=} -\sum_i \sum_j P(X = x_i, Y = y_j)\log P(X = x_i, Y = y_j) \stackrel{not.}{=} E_{X,Y}[-\log P(X, Y)].

• The mutual information of the variables X and Y, also called the information gain of the variable X with respect to the variable Y (or vice versa):

MI(X, Y) \stackrel{not.}{=} IG(X, Y) \stackrel{def.}{=} H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)

(Remark: the last equality above holds due to the result at point c below.)

60.

a. H(X) ≥ 0.

H(X) = -\sum_i P(X = x_i)\log P(X = x_i) = \sum_i \underbrace{P(X = x_i)}_{\geq 0}\, \underbrace{\log\frac{1}{P(X = x_i)}}_{\geq 0} \geq 0

Moreover, H(X) = 0 if and only if the variable X is constant:

"⇒" Suppose that H(X) = 0, i.e. \sum_i P(X = x_i)\log\frac{1}{P(X = x_i)} = 0. Due to the fact that each term in this sum is greater than or equal to 0, it follows that H(X) = 0 only if, for ∀i, P(X = x_i) = 0 or \log\frac{1}{P(X = x_i)} = 0, that is, if for ∀i, P(X = x_i) = 0 or P(X = x_i) = 1. But since \sum_i P(X = x_i) = 1, it follows that there is a single value x_1 for X such that P(X = x_1) = 1, and P(X = x) = 0 for any x ≠ x_1. In other words, the discrete random variable X is constant.

"⇐" Suppose that the variable X is constant, which means that X takes a single value x_1, with probability P(X = x_1) = 1. Therefore, H(X) = −1 · log 1 = 0.

61.

b.

H(Y \mid X) = -\sum_i \sum_j P(X = x_i, Y = y_j)\log P(Y = y_j \mid X = x_i)

H(Y \mid X) = \sum_i P(X = x_i)\, H(Y \mid X = x_i)

= \sum_i P(X = x_i) \left[ -\sum_j P(Y = y_j \mid X = x_i)\log P(Y = y_j \mid X = x_i) \right]

= -\sum_i \sum_j \underbrace{P(X = x_i)\, P(Y = y_j \mid X = x_i)}_{= P(X = x_i, Y = y_j)} \log P(Y = y_j \mid X = x_i)

= -\sum_i \sum_j P(X = x_i, Y = y_j)\log P(Y = y_j \mid X = x_i)

62.

c. H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)

H(X, Y) = -\sum_i \sum_j p(x_i, y_j)\log p(x_i, y_j)

= -\sum_i \sum_j p(x_i) \cdot p(y_j \mid x_i)\log[p(x_i) \cdot p(y_j \mid x_i)]

= -\sum_i \sum_j p(x_i) \cdot p(y_j \mid x_i)[\log p(x_i) + \log p(y_j \mid x_i)]

= -\sum_i \sum_j p(x_i) \cdot p(y_j \mid x_i)\log p(x_i) - \sum_i \sum_j p(x_i) \cdot p(y_j \mid x_i)\log p(y_j \mid x_i)

= -\sum_i p(x_i)\log p(x_i) \cdot \underbrace{\sum_j p(y_j \mid x_i)}_{=1} - \sum_i p(x_i)\sum_j p(y_j \mid x_i)\log p(y_j \mid x_i)

= H(X) + \sum_i p(x_i)\, H(Y \mid X = x_i) = H(X) + H(Y \mid X)

63.

More generally (the chain rule):

H(X_1, \ldots, X_n) = H(X_1) + H(X_2 \mid X_1) + \ldots + H(X_n \mid X_1, \ldots, X_{n-1})

H(X_1, \ldots, X_n) = E\left[ \log\frac{1}{p(x_1, \ldots, x_n)} \right] = -E_{p(x_1,\ldots,x_n)}[\log p(x_1, \ldots, x_n)]

= -E_{p(x_1,\ldots,x_n)}[\log p(x_1) + \log p(x_2 \mid x_1) + \ldots + \log p(x_n \mid x_1, \ldots, x_{n-1})]

= -E_{p(x_1)}[\log p(x_1)] - E_{p(x_1,x_2)}[\log p(x_2 \mid x_1)] - \ldots - E_{p(x_1,\ldots,x_n)}[\log p(x_n \mid x_1, \ldots, x_{n-1})]

= H(X_1) + H(X_2 \mid X_1) + \ldots + H(X_n \mid X_1, \ldots, X_{n-1})

64.

An upper bound for the entropy of a discrete distribution

CMU, 2003 fall, T. Mitchell, A. Moore, HW1, pr. 1.1

65.

Let X be a discrete random variable taking n values and following the probability distribution P. By definition, the entropy of X is

H(X) = -\sum_{i=1}^{n} P(X = x_i)\log_2 P(X = x_i).

Show that H(X) ≤ \log_2 n.

Hint: You may use the inequality \ln x \leq x - 1, which holds for any x > 0.

66.

Answer

H(X) = \frac{1}{\ln 2}\left( -\sum_{i=1}^{n} P(X = x_i)\ln P(X = x_i) \right)

Therefore,

H(X) \leq \log_2 n \;\Leftrightarrow\; \frac{1}{\ln 2}\left( -\sum_{i=1}^{n} P(X = x_i)\ln P(X = x_i) \right) \leq \log_2 n

\Leftrightarrow\; -\sum_{i=1}^{n} P(x_i)\ln P(x_i) \leq \ln n

\Leftrightarrow\; \sum_{i=1}^{n} P(x_i)\ln\frac{1}{P(x_i)} - \underbrace{\left( \sum_{i=1}^{n} P(x_i) \right)}_{1}\ln n \leq 0

\Leftrightarrow\; \sum_{i=1}^{n} P(x_i)\ln\frac{1}{P(x_i)} - \sum_{i=1}^{n} P(x_i)\ln n \leq 0

\Leftrightarrow\; \sum_{i=1}^{n} P(x_i)\left( \ln\frac{1}{P(x_i)} - \ln n \right) \leq 0

\Leftrightarrow\; \sum_{i=1}^{n} P(x_i)\ln\frac{1}{n P(x_i)} \leq 0

67.

Applying the inequality \ln x \leq x - 1 for x = \frac{1}{n P(x_i)}, we will have:

\sum_{i=1}^{n} P(x_i)\ln\frac{1}{n P(x_i)} \leq \sum_{i=1}^{n} P(x_i)\left( \frac{1}{n P(x_i)} - 1 \right) = \sum_{i=1}^{n} \frac{1}{n} - \underbrace{\sum_{i=1}^{n} P(x_i)}_{1} = 1 - 1 = 0

Remark: This upper bound is indeed "attained". For example, if a discrete random variable X taking n values follows the uniform distribution, one can immediately verify that H(X) = \log_2 n.
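A quick numeric illustration of the bound and of the equality case (an aside, not part of the original solution; the skewed distribution is my own example):

```python
import math

# Check H(X) <= log2(n), with equality for the uniform distribution.
def entropy(ps):
    return -sum(p * math.log2(p) for p in ps if p > 0)

skewed = [0.5, 0.25, 0.125, 0.125]   # a non-uniform distribution on 4 values
uniform = [0.25] * 4

assert entropy(skewed) <= math.log2(4) + 1e-12        # strictly below the bound
assert abs(entropy(uniform) - math.log2(4)) < 1e-12   # bound attained
```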

68.

Relative entropy a.k.a. the Kullback-Leibler divergence,

and the [relationship to] information gain;

some basic properties

CMU, 2007 fall, C. Guestrin, HW1, pr. 1.2[adapted by Liviu Ciortuz]

69.

The relative entropy — also known as the Kullback-Leibler (KL) divergence — from a distribution p to a distribution q is defined as

KL(p||q) \stackrel{def.}{=} -\sum_{x \in X} p(x)\log\frac{q(x)}{p(x)}

From an information theory perspective, the KL-divergence specifies the number of additional bits required on average to transmit values of X if the values are distributed with respect to p but we encode them assuming the distribution q.

70.

Notes

1. KL is not a distance measure, since it is not symmetric (i.e., in general KL(p||q) ≠ KL(q||p)).

Another measure, which is defined as JSD(p||q) = \frac{1}{2}(KL(p||q) + KL(q||p)), and is called the Jensen-Shannon divergence, is symmetric.

2. The quantity

d(X, Y) \stackrel{def.}{=} H(X, Y) - IG(X, Y) = H(X) + H(Y) - 2\, IG(X, Y) = H(X \mid Y) + H(Y \mid X)

known as the variation of information, is a distance metric, i.e., it is non-negative, symmetric, implies indiscernibility, and satisfies the triangle inequality.
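Both notes can be illustrated on small discrete distributions (my own example values; the symmetrized quantity follows the JSD(p||q) definition given above):

```python
import math

# Illustrating that KL(p||q) >= 0 and is not symmetric.
def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.1, 0.3, 0.6]

assert kl(p, p) == 0                        # identical distributions
assert kl(p, q) > 0 and kl(q, p) > 0        # non-negativity
assert abs(kl(p, q) - kl(q, p)) > 1e-12     # asymmetric in general

# The symmetrized quantity from the note above.
jsd = 0.5 * (kl(p, q) + kl(q, p))
assert jsd > 0
```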

71.

a. Show that KL(p||q) ≥ 0, and KL(p||q) = 0 iff p(x) = q(x) for all x. (More generally, the smaller the KL-divergence, the more similar the two distributions.)

Hint:

To prove this part you can use Jensen's inequality:

If ϕ : R → R is a convex function, then for any t ∈ [0, 1] and any x1, x2 ∈ R we have ϕ(t x1 + (1 − t) x2) ≤ t ϕ(x1) + (1 − t) ϕ(x2). If ϕ is strictly convex, then equality holds only if x1 = x2.

More generally, for any ai ≥ 0, i = 1, ..., n with ∑_i ai ≠ 0, and any xi ∈ R, i = 1, ..., n, we have

ϕ((∑_i ai xi) / (∑_j aj)) ≤ (∑_i ai ϕ(xi)) / (∑_j aj).

If ϕ is strictly convex, then equality holds only if x1 = ... = xn.

Obviously, similar results can be formulated for concave functions.
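The weighted form of Jensen's inequality can be spot-checked for the convex function ϕ = −log2; a minimal sketch (the weights and points below are arbitrary examples with ∑ ai = 1):

```python
import math

# Weighted Jensen: phi(sum a_i x_i) <= sum a_i phi(x_i) for convex phi
# and weights a_i >= 0 summing to 1.  Here phi = -log2.
a = [0.2, 0.5, 0.3]
x = [0.5, 2.0, 4.0]

phi = lambda t: -math.log2(t)
lhs = phi(sum(ai * xi for ai, xi in zip(a, x)))
rhs = sum(ai * phi(xi) for ai, xi in zip(a, x))
print(lhs <= rhs)  # True
```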

72.

Answer

We prove the inequality KL(p||q) ≥ 0 using Jensen's inequality, in whose expression we substitute ϕ with the convex function −log2, ai with p(xi), and xi with q(xi)/p(xi).

(For convenience, in what follows we drop the index of the variable x.) We have:

KL(p || q) def.= −∑_x p(x) log (q(x)/p(x)) ≥(Jensen) −log (∑_x p(x) · q(x)/p(x)) = −log (∑_x q(x)) = −log 1 = 0,

since ∑_x q(x) = 1.

Therefore, KL(p || q) ≥ 0 for any (discrete) distributions p and q.
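The inequality can also be spot-checked on random distribution pairs; a small sketch (the sampling scheme and helper names are illustrative):

```python
import math
import random

def kl(p, q):
    # KL(p||q) in bits; q must be strictly positive here
    return -sum(px * math.log2(qx / px) for px, qx in zip(p, q) if px > 0)

def random_dist(n):
    # random strictly positive weights, normalized to sum to 1
    w = [random.random() + 1e-6 for _ in range(n)]
    s = sum(w)
    return [wi / s for wi in w]

random.seed(42)
ok = all(kl(random_dist(5), random_dist(5)) >= -1e-12 for _ in range(1000))
print(ok)  # True: KL(p||q) >= 0 on every sampled pair
```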

73.

We now show that KL(p||q) = 0 ⇔ p = q.

⇐: The equality p(x) = q(x) implies q(x)/p(x) = 1, hence log (q(x)/p(x)) = 0 for every x, from which KL(p||q) = 0 follows immediately.

⇒: We know that equality holds in Jensen's inequality only when xi = xj for all i and j. In the present case, this condition translates into the fact that the ratio q(x)/p(x) is the same for every value of x. Taking into account that ∑_x p(x) = 1 and ∑_x p(x) · q(x)/p(x) = ∑_x q(x) = 1, it follows that q(x)/p(x) = 1 or, in other words, p(x) = q(x) for every x, which means that the distributions p and q are identical.

74.

b. We can define the information gain as the KL-divergence from the observed joint distribution of X and Y to the product of their observed marginals:

IG(X,Y) def.= KL(pX,Y || (pX pY)) = −∑_x ∑_y pX,Y(x,y) log (pX(x) pY(y) / pX,Y(x,y)) not.= −∑_x ∑_y p(x,y) log (p(x) p(y) / p(x,y))

Prove that this definition of information gain is equivalent to the one given in problem CMU, 2005 fall, T. Mitchell, A. Moore, HW1, pr. 2. That is, show that IG(X,Y) = H[X] − H[X|Y] = H[Y] − H[Y|X], starting from the definition in terms of KL-divergence.

Remark:

It follows that

IG(X,Y) = ∑_y p(y) ∑_x p(x | y) log (p(x | y)/p(x)) = ∑_y p(y) KL(pX|Y || pX) = EY[KL(pX|Y || pX)]

75.

Answer

By making use of the multiplication rule, namely p(x,y) = p(x | y) p(y), we will have:

KL(pXY || (pX pY))

def. KL = −∑_x ∑_y p(x,y) log (p(x) p(y) / p(x,y))

= −∑_x ∑_y p(x,y) log (p(x) p(y) / (p(x | y) p(y)))

= −∑_x ∑_y p(x,y) [log p(x) − log p(x | y)]

= −∑_x ∑_y p(x,y) log p(x) − (−∑_x ∑_y p(x,y) log p(x | y))

= −∑_x log p(x) ∑_y p(x,y) − H[X | Y]    [the inner sum equals p(x)]

= H[X] − H[X | Y] = IG(X,Y)
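The identity IG(X,Y) = H[X] − H[X|Y] = KL(pX,Y || pX pY) can be verified numerically on a small joint distribution; a sketch with an arbitrary 2×2 joint:

```python
import math

# arbitrary joint distribution p(x, y)
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}
px = {x: sum(v for (a, b), v in joint.items() if a == x) for x in (0, 1)}
py = {y: sum(v for (a, b), v in joint.items() if b == y) for y in (0, 1)}

def H(dist):
    # Shannon entropy in bits
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# H[X|Y] = -sum_{x,y} p(x,y) log2 p(x|y), with p(x|y) = p(x,y)/p(y)
H_X_given_Y = -sum(v * math.log2(v / py[y]) for (x, y), v in joint.items())

# IG via the KL form: -sum_{x,y} p(x,y) log2 (p(x) p(y) / p(x,y))
ig_kl = -sum(v * math.log2(px[x] * py[y] / v) for (x, y), v in joint.items())

print(H(px) - H_X_given_Y, ig_kl)  # the two values coincide
```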

76.

c. A direct consequence of parts a. and b. is that IG(X, Y) ≥ 0 (and therefore H(X) ≥ H(X|Y) and H(Y) ≥ H(Y|X)) for any discrete random variables X and Y.

Prove that IG(X, Y) = 0 iff X and Y are independent.

Answer:

This is also an immediate consequence of parts a. and b. already proven:

IG(X, Y) = 0 ⇔(b) KL(pXY || pX pY) = 0 ⇔(a) pXY = pX pY ⇔ X and Y are independent.
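For an independent pair, where the joint factors as p(x,y) = p(x) p(y), the KL form of IG evaluates to 0; a quick sketch (the marginals are arbitrary examples):

```python
import math

px = {0: 0.7, 1: 0.3}
py = {0: 0.4, 1: 0.6}
# independent joint distribution: p(x, y) = p(x) p(y)
joint = {(x, y): px[x] * py[y] for x in px for y in py}

ig = -sum(v * math.log2(px[x] * py[y] / v) for (x, y), v in joint.items())
print(abs(ig) < 1e-12)  # True: every log term is log2(1) = 0
```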

77.

Remark

Putem demonstra inegalitatea IG(X,Y ) ≥ 0 si ın maniera directa, folosind rezultatul dela punctul b. si aplicand inegalitatea lui Jensen ın forma generalizata, cu urmatoarele,,amendamente“:

− ın locul unui singur indice, se vor considera doi indici (asadar ın loc de ai si xi vomavea aij si respectiv xij);

− vom lua ϕ = − log2 iar aij ← p(xi, yj) si xij ←p(xi)p(xj)

p(xi, xj);

− ın fine, vom tine cont ca∑

i

j p(xi, yj) = 1.

Consequently,

IG(X,Y) = ∑_i ∑_j p(xi, yj) log (p(xi, yj) / (p(xi) · p(yj))) = ∑_i ∑_j p(xi, yj) [−log (p(xi) · p(yj) / p(xi, yj))]

≥ −log (∑_i ∑_j p(xi, yj) · p(xi) · p(yj) / p(xi, yj)) = −log (∑_i ∑_j p(xi) · p(yj))

= −log ((∑_i p(xi)) · (∑_j p(yj))) = −log(1 · 1) = −log 1 = 0

In conclusion, IG(X,Y) ≥ 0.

78.

Remark (cont'd)

If X and Y are independent variables, then p(xi, yj) = p(xi) p(yj) for all i and j. Consequently, all the logarithms on the right-hand side of the first equality in the computation above are 0, and IG(X,Y) = 0 follows.

Conversely, assuming that IG(X,Y) = 0, we take into account that the information gain can be expressed in terms of the KL divergence, and we apply a reasoning similar to the one at part a.

It follows that p(xi) p(yj) / p(xi, yj) = 1, hence p(xi) p(yj) = p(xi, yj) for all i and j. This is equivalent to saying that the variables X and Y are independent.

79.

Proving [in a direct manner] that the Information Gain is always positive or 0

(an indirect proof was given at CMU, 2007 fall, Carlos Guestrin, HW1, pr. 1.2)

Liviu Ciortuz, 2017

80.

The definition of the information gain (or: mutual information) of a random variable X with respect to another random variable Y is

IG(X,Y) = H(X) − H(X | Y) = H(Y) − H(Y | X).

At CMU, 2007 fall, Carlos Guestrin, HW1, pr. 1.2 it was shown, for the case in which X and Y are discrete, that IG(X,Y) = KL(PX,Y || PX PY), where KL denotes the relative entropy (or: Kullback-Leibler divergence), PX and PY are the distributions of the variables X and Y, respectively, and PX,Y is the joint distribution of these variables. Also at CMU, 2007 fall, Carlos Guestrin, HW1, pr. 1.2 it was shown that the KL divergence is always non-negative. Consequently, IG(X,Y) ≥ 0 for any X and Y.

In this exercise we ask you to prove the inequality IG(X,Y) ≥ 0 in a direct manner, starting from the first definition given above, without resorting to the Kullback-Leibler divergence.

81.

Hint: You can use the following form of Jensen's inequality:

∑_{i=1}^n ai log xi ≤ log (∑_{i=1}^n ai xi)

where the base of the logarithm is greater than 1, ai ≥ 0 for i = 1, ..., n, and ∑_{i=1}^n ai = 1.

Remark: The advantage in this problem, compared to CMU, 2007 fall, Carlos Guestrin, HW1, pr. 1.2.a, is that here we work with a single distribution (p) rather than with two distributions (p and q). However, the proof here will be more laborious.

Answer

Suppose the values of the variable X are x1, x2, ..., xn, and the values of the variable Y are y1, y2, ..., ym. We have:

IG(X,Y) def.= H(X) − H(X|Y) def.= ∑_{i=1}^n −P(xi) log2 P(xi) − ∑_{j=1}^m P(yj) ∑_{i=1}^n (−P(xi|yj) log2 P(xi|yj))

82.

−IG(X,Y) = ∑_{i=1}^n P(xi) log2 P(xi) − ∑_{j=1}^m P(yj) ∑_{i=1}^n P(xi|yj) log2 P(xi|yj)

[def. of marginal prob.] = ∑_{i=1}^n (∑_{j=1}^m P(xi, yj)) log2 P(xi) − ∑_{j=1}^m P(yj) ∑_{i=1}^n P(xi|yj) log2 P(xi|yj)

[distributivity of ·, +] = ∑_{i=1}^n ∑_{j=1}^m P(xi, yj) log2 P(xi) − ∑_{j=1}^m ∑_{i=1}^n P(yj) P(xi|yj) log2 P(xi|yj)

[def. of cond. prob.] = ∑_{i=1}^n ∑_{j=1}^m P(xi, yj) log2 P(xi) − ∑_{j=1}^m ∑_{i=1}^n P(xi, yj) log2 P(xi|yj)

[distributivity of ·, +] = ∑_{i=1}^n ∑_{j=1}^m P(xi, yj) (log2 P(xi) − log2 P(xi|yj))

[log. property] = ∑_{i=1}^n ∑_{j=1}^m P(xi, yj) log2 (P(xi) / P(xi|yj))

[multiplication rule] = ∑_{i=1}^n ∑_{j=1}^m P(xi|yj) P(yj) log2 (P(xi) / P(xi|yj))

[distributivity of ·, +] = ∑_{j=1}^m P(yj) ∑_{i=1}^n P(xi|yj) log2 (P(xi) / P(xi|yj)), where the factors P(xi|yj) play the role of the weights ai

83.

Since on one hand P(xi|yj) ≥ 0, and on the other hand ∑_{i=1}^n P(xi|yj) = 1 for each value yj of Y, we can apply Jensen's inequality to the second sum in the last expression above — more precisely, for each value of the index j separately — and obtain:

−IG(X,Y) ≤ ∑_{j=1}^m P(yj) log2 (∑_{i=1}^n P(xi|yj) · P(xi) / P(xi|yj)) = ∑_{j=1}^m P(yj) log2 (∑_{i=1}^n P(xi)) = ∑_{j=1}^m P(yj) log2 1 = 0,

since ∑_{i=1}^n P(xi) = 1. Therefore, IG(X,Y) ≥ 0.
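The final expression for −IG(X,Y) can be evaluated directly on a small joint distribution to confirm it is ≤ 0; a sketch with an arbitrary 2×2 joint:

```python
import math

# arbitrary joint distribution P(xi, yj)
joint = {(0, 0): 0.25, (0, 1): 0.15, (1, 0): 0.05, (1, 1): 0.55}
px = {x: sum(v for (a, b), v in joint.items() if a == x) for x in (0, 1)}
py = {y: sum(v for (a, b), v in joint.items() if b == y) for y in (0, 1)}

# -IG(X,Y) = sum_j P(yj) sum_i P(xi|yj) log2(P(xi)/P(xi|yj))
neg_ig = sum(
    py[y] * sum((joint[(x, y)] / py[y]) * math.log2(px[x] / (joint[(x, y)] / py[y]))
                for x in px)
    for y in py
)
print(neg_ig <= 1e-12)  # True: -IG <= 0, i.e. IG(X,Y) >= 0
```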

84.
