
Information Theory Rong Jin

Post on 21-Dec-2015






Page 1: Information Theory Rong Jin. Outline  Information  Entropy  Mutual information  Noisy channel model

Information Theory

Rong Jin

Page 2: Outline

• Information
• Entropy
• Mutual information
• Noisy channel model

Page 3: Information

Information ≠ knowledge

Information: reduction in uncertainty

Example:

1. flip a coin

2. roll a die

#2 is more uncertain than #1. Therefore, more information is provided by the outcome of #2 than by the outcome of #1.

Page 4: Definition of Information

Let E be some event that occurs with probability P(E). If we are told that E has occurred, then we say we have received I(E) = log2(1/P(E)) bits of information.

Example:

• Result of a fair coin flip: log2(2) = 1 bit
• Result of a fair die roll: log2(6) = 2.585 bits
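The definition can be checked numerically. The following is a minimal sketch; the function name `self_information` is an illustrative choice, not something from the slides.

```python
import math

def self_information(p: float) -> float:
    """Bits of information received when an event of probability p occurs."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return math.log2(1.0 / p)

coin_bits = self_information(1 / 2)  # fair coin flip
die_bits = self_information(1 / 6)   # fair die roll
print(f"coin: {coin_bits:.3f} bits, die: {die_bits:.3f} bits")
```

The rarer the event, the larger the value, which matches the idea that the die roll is the more uncertain experiment.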

Page 5: Information is Additive

I(k fair coin tosses) = log2(2^k) = k bits

Example: information conveyed by words

• Random word from a 100,000 word vocabulary: I(word) = log2(100,000) = 16.6 bits
• A 1000 word document from the same source: I(document) = 1000 × 16.6 = 16,600 bits
• A 480x640 pixel, 16-greyscale video picture: I(picture) = 307,200 × log2(16) = 1,228,800 bits

A picture is worth more than a 1000 words!
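The arithmetic on this slide can be reproduced in a few lines. This sketch assumes, as the slide does, that words and pixels are drawn independently and uniformly; the variable names are illustrative.

```python
import math

# k independent fair coin tosses have 2**k equally likely outcomes.
k = 10
coin_toss_bits = math.log2(2 ** k)

word_bits = math.log2(100_000)              # one word, uniform over the vocabulary
document_bits = 1000 * word_bits            # independent words add up
picture_bits = 480 * 640 * math.log2(16)    # independent pixels add up

# How many vocabulary words carry as much information as one picture?
words_per_picture = picture_bits / word_bits
print(round(word_bits, 1), round(document_bits), round(picture_bits))
```

Under these assumptions the picture carries far more information than a 1000 word document, which is the point of the slide's closing line.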


Page 9: Outline

• Information
• Entropy
• Mutual Information
• Cross Entropy and Learning

Page 10: Entropy

A zero-memory information source S is a source that emits symbols from an alphabet {s1, s2,…, sk} with probability {p1, p2,…, pk}, respectively, where the symbols emitted are statistically independent.

What is the average amount of information in observing the output of the source S?

Call this entropy:

$$H(S) = \sum_i p_i I(s_i) = \sum_i p_i \log \frac{1}{p_i} = E_{s \sim P}\left[\log \frac{1}{p(s)}\right]$$
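As a rough illustration, entropy is the probability-weighted average of the per-symbol information. The sketch below uses base-2 logarithms so the result is in bits; the function name `entropy` and the example distributions are assumptions made for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability symbols contribute nothing."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

fair_coin = entropy([0.5, 0.5])
fair_die = entropy([1 / 6] * 6)
certain = entropy([1.0, 0.0])
print(fair_coin, round(fair_die, 3), certain)
```

For a uniform source the entropy equals the information of any single outcome, so the fair coin and fair die reproduce the 1 bit and 2.585 bits from the definition of information.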


Page 12: Explanation of Entropy

$$H(P) = \sum_i p_i \log \frac{1}{p_i}$$

1. Average amount of information provided per symbol

2. Average # of bits needed to communicate each symbol

Page 13: Properties of Entropy

$$H(P) = \sum_i p_i \log \frac{1}{p_i}$$

1. Non-negative: H(P) ≥ 0

2. For any other probability distribution {q1,…,qk}:

$$H(P) = \sum_i p_i \log \frac{1}{p_i} \le \sum_i p_i \log \frac{1}{q_i}$$

3. H(P) ≤ log k, with equality iff pi = 1/k for all i

4. The further P is from uniform, the lower the entropy.
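These properties can be spot-checked numerically. The sketch below is not a proof; the distributions `p`, `q`, and `skewed` are hypothetical examples, and `cross_entropy` is an illustrative name for the right-hand side of property 2.

```python
import math

def entropy(p):
    """H(P) = sum_i p_i log2(1/p_i), in bits."""
    return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """sum_i p_i log2(1/q_i); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]          # hypothetical distribution
q = [0.2, 0.3, 0.5]          # hypothetical "other" distribution
skewed = [0.9, 0.05, 0.05]   # further from uniform than p
uniform = [1 / 3] * 3

h_p = entropy(p)
print(0 <= h_p)                           # property 1
print(h_p <= cross_entropy(p, q))         # property 2
print(h_p <= math.log2(len(p)))           # property 3
print(entropy(skewed) < h_p < entropy(uniform))  # property 4
```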

Page 14: Entropy: k = 2

$$H(P) = p \log \frac{1}{p} + (1-p) \log \frac{1}{1-p}$$

[Figure: plot of H(P) against p, for p from 0 to 1]

• zero information at the edges

• maximum information at 0.5 (1 bit)

• drops off more quickly close to the edges than in the middle
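The three observations about the curve can be confirmed by evaluating the two-symbol entropy at a few points. This is a minimal sketch; `binary_entropy` is an illustrative name.

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy in bits of a two-symbol source with probabilities p and 1 - p."""
    if p in (0.0, 1.0):
        return 0.0  # the limit of p * log2(1/p) as p -> 0 is 0
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

edge_drop = binary_entropy(0.1) - binary_entropy(0.0)    # change over 0.1 near the edge
middle_drop = binary_entropy(0.5) - binary_entropy(0.4)  # change over 0.1 near the middle

for p in (0.0, 0.1, 0.4, 0.5, 0.9, 1.0):
    print(p, round(binary_entropy(p), 3))
```

The change over an interval of width 0.1 is much larger next to the edge than next to the peak, which is what the last bullet describes.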

Page 15: The Entropy of English

• 27 characters (A-Z, space)
• 100,000 words (average 6.5 characters each)

Assuming independence between successive characters:

• Uniform character distribution: log2(27) = 4.75 bits/character
• True character distribution: 4.03 bits/character

Assuming independence between successive words:

• Uniform word distribution: log2(100,000)/6.5 = 2.55 bits/character
• True word distribution: 9.45/6.5 = 1.45 bits/character

True entropy of English is much lower!
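The per-character figures for the uniform cases follow directly from the alphabet and vocabulary sizes. The sketch below recomputes them; the 4.03 and 9.45 values are taken from the slide as given, since they depend on measured frequencies that are not reproduced here.

```python
import math

AVG_WORD_LENGTH = 6.5  # characters per word, from the slide

uniform_char_bits = math.log2(27)                           # bits per character
uniform_word_bits = math.log2(100_000) / AVG_WORD_LENGTH    # bits per character
true_word_bits = 9.45 / AVG_WORD_LENGTH                     # bits per character

print(round(uniform_char_bits, 2), round(uniform_word_bits, 2), round(true_word_bits, 2))
```

Each step down reflects additional structure being exploited: non-uniform frequencies, then word-level rather than character-level units.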


Page 17: Entropy of Two Sources

Temperature T

P(T = hot) = 0.3

P(T = mild) = 0.5

P(T = cold) = 0.2

H(T) = H(0.3, 0.5, 0.2) = 1.485

Humidity M

P(M = low) = 0.6

P(M = high) = 0.4

H(M) = H(0.6, 0.4) = 0.971

Random variables T and M are not independent:

• P(T=t, M=m) ≠ P(T=t)P(M=m)

Page 18: Joint Entropy

• H(T) = 1.485

• H(M) = 0.971

• H(T) + H(M) = 2.456

• Joint entropy, computed over the six entries of the joint probability table:

• H(T, M) = H(0.1, 0.4, 0.1, 0.2, 0.1, 0.1) = 2.322

• H(T, M) ≤ H(T) + H(M)

[Table: Joint Probability P(T, M)]
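The joint probability table itself did not survive in the text. The sketch below assumes one particular assignment of the six joint probabilities to (temperature, humidity) cells; it was chosen because it reproduces the stated marginals P(T) and P(M), and it should be read as an inferred example rather than the original table.

```python
import math

def entropy(probs):
    """Shannon entropy in bits."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# ASSUMED joint table P(T, M), keyed by (temperature, humidity).
joint = {
    ("hot", "low"): 0.1, ("mild", "low"): 0.4, ("cold", "low"): 0.1,
    ("hot", "high"): 0.2, ("mild", "high"): 0.1, ("cold", "high"): 0.1,
}

def marginal(joint, axis):
    """Sum the joint over the other variable (axis 0 = T, axis 1 = M)."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

h_t = entropy(marginal(joint, 0).values())
h_m = entropy(marginal(joint, 1).values())
h_tm = entropy(joint.values())
print(round(h_t, 3), round(h_m, 3), round(h_tm, 3))
```

The joint entropy comes out below H(T) + H(M), which is expected because T and M are not independent.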

Page 19: Conditional Entropy

• H(T|M = low) = 1.252
• H(T|M = high) = 1.5

Average conditional entropy:

$$H(T|M) = \sum_m P(M=m)\, H(T|M=m) = 0.6 \times 1.252 + 0.4 \times 1.5 = 1.351$$

How much is M telling us on average about T?

H(T) – H(T|M) = 1.485 – 1.351 = 0.134 bits

[Table: Conditional Probability P(T|M)]
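The average conditional entropy can be computed by conditioning on each humidity value and weighting by P(M = m). As before, the joint table in this sketch is an assumed reconstruction that matches the stated marginals, not a table recovered from the slides.

```python
import math

def entropy(probs):
    """Shannon entropy in bits."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# ASSUMED joint table P(T, M), keyed by (temperature, humidity).
joint = {
    ("hot", "low"): 0.1, ("mild", "low"): 0.4, ("cold", "low"): 0.1,
    ("hot", "high"): 0.2, ("mild", "high"): 0.1, ("cold", "high"): 0.1,
}

def conditional_entropy(joint):
    """H(T|M) = sum_m P(M=m) * H(T | M=m)."""
    p_m = {}
    for (_, m), p in joint.items():
        p_m[m] = p_m.get(m, 0.0) + p
    total = 0.0
    for m, pm in p_m.items():
        conditional = [p / pm for (_, mm), p in joint.items() if mm == m]
        total += pm * entropy(conditional)
    return total

h_t = entropy([0.3, 0.5, 0.2])
h_t_given_m = conditional_entropy(joint)
information_gain = h_t - h_t_given_m
print(round(h_t_given_m, 3), round(information_gain, 3))
```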

Page 20: Mutual Information

$$\begin{aligned}
I(X;Y) &= H(X) - H(X|Y) \\
&= \sum_x P(x) \log \frac{1}{P(x)} - \sum_{x,y} P(x,y) \log \frac{1}{P(x|y)} \\
&= \sum_{x,y} P(x,y) \log \frac{P(x,y)}{P(x)P(y)}
\end{aligned}$$

Properties:

• Indicates the amount of information one random variable can provide about another
• Symmetric: I(X;Y) = I(Y;X)
• Non-negative
• Zero iff X, Y are independent
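The final sum form of I(X;Y) translates directly into code. This sketch checks symmetry and the independence case; the temperature and humidity joint table is an assumed reconstruction consistent with the marginals given earlier, and the function name is illustrative.

```python
import math

def mutual_information(joint):
    """I(X;Y) = sum_{x,y} P(x,y) log2(P(x,y) / (P(x) P(y))); joint keyed by (x, y)."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    return sum(p * math.log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in joint.items() if p > 0)

# ASSUMED joint table P(T, M), keyed by (temperature, humidity).
joint = {
    ("hot", "low"): 0.1, ("mild", "low"): 0.4, ("cold", "low"): 0.1,
    ("hot", "high"): 0.2, ("mild", "high"): 0.1, ("cold", "high"): 0.1,
}
swapped = {(m, t): p for (t, m), p in joint.items()}

# A joint built as a product of marginals, so the variables are independent.
p_t = {"hot": 0.3, "mild": 0.5, "cold": 0.2}
p_m = {"low": 0.6, "high": 0.4}
independent = {(t, m): pt * pm for t, pt in p_t.items() for m, pm in p_m.items()}

mi = mutual_information(joint)
print(round(mi, 3), round(mutual_information(swapped), 3))
```

The value agrees with H(T) – H(T|M) = 0.134 bits from the conditional entropy example, as the first line of the derivation requires.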

Page 21

[Figure: diagram relating H(X, Y), H(X|Y), H(Y|X), and I(X;Y)]

Page 22: A Distance Measure Between Distributions

Kullback-Leibler distance:

$$KL(P_D \| P_M) = \sum_x P_D(x) \log \frac{P_D(x)}{P_M(x)} = E_{x \sim P_D}\left[\log \frac{P_D(x)}{P_M(x)}\right]$$

Properties of Kullback-Leibler distance:

• Non-negative: KL(PD||PM) ≥ 0, with KL(PD||PM) = 0 iff PD = PM
• Minimizing the KL distance ⇒ PM gets close to PD
• Non-symmetric: KL(PD||PM) ≠ KL(PM||PD)
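A small numerical example makes the non-negativity and asymmetry concrete. The two distributions below are hypothetical, chosen only for illustration, and the sketch assumes the model assigns non-zero probability wherever the data distribution does.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p_d = [0.5, 0.5]  # hypothetical data distribution
p_m = [0.9, 0.1]  # hypothetical model distribution

kl_dm = kl_divergence(p_d, p_m)
kl_md = kl_divergence(p_m, p_d)
print(round(kl_dm, 3), round(kl_md, 3), kl_divergence(p_d, p_d))
```

Because the two directions give different values, the KL distance is not a metric in the strict sense, even though it behaves like a measure of mismatch between the distributions.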

Page 23: Bregman Distance

φ(x) is a convex function.
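The slide's own formula is not present in the text, so the sketch below assumes the standard textbook definition, B(x, y) = φ(x) − φ(y) − ⟨∇φ(y), x − y⟩, for a differentiable convex φ. Under that assumption, choosing φ to be the negative entropy recovers the Kullback-Leibler distance between two probability vectors. All function names and the example vectors are illustrative.

```python
import math

def neg_entropy(p):
    """phi(p) = sum_i p_i log2(p_i), a convex function of p."""
    return sum(pi * math.log2(pi) for pi in p if pi > 0)

def neg_entropy_grad(p):
    """Gradient of phi: log2(p_i) + 1/ln(2); requires every p_i > 0."""
    return [math.log2(pi) + 1 / math.log(2) for pi in p]

def bregman(phi, grad_phi, x, y):
    """Assumed standard form: phi(x) - phi(y) - <grad phi(y), x - y>."""
    inner = sum(g * (xi - yi) for g, xi, yi in zip(grad_phi(y), x, y))
    return phi(x) - phi(y) - inner

def kl_divergence(p, q):
    """KL(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]  # hypothetical distributions
q = [0.2, 0.3, 0.5]

b = bregman(neg_entropy, neg_entropy_grad, p, q)
print(round(b, 6), round(kl_divergence(p, q), 6))
```

The agreement relies on both vectors summing to 1, so the constant term in the gradient cancels.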

Page 24: Compression Algorithm for TC

[Figure: Training Examples, New Document]


Page 26: The Noisy Channel

Prototypical case:

Input → The channel (adds noise) → Output (noisy)
0,1,1,1,0,1,0,1,... → 0,1,1,0,0,1,1,0,...

Model: probability of error (noise). Example:

• p(0|1) = .3
• p(1|1) = .7
• p(1|0) = .4
• p(0|0) = .6

The Task:

• known: the noisy output
• want to know: the input (decoding)

Source coding theorem

Channel coding theorem
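One way to picture the decoding task is to apply Bayes' rule to each received bit. The sketch below uses the channel probabilities from the example, reading p(a|b) as the probability of output a given input b. It additionally assumes that the channel acts on each bit independently and that a prior over input bits is available; neither assumption is stated in the slides, and the function names are illustrative.

```python
# channel[(output, input)] = p(output | input), from the example above.
channel = {(0, 1): 0.3, (1, 1): 0.7, (1, 0): 0.4, (0, 0): 0.6}

def posterior(observed, prior):
    """P(input | observed output) by Bayes' rule; prior maps input bit -> probability."""
    unnormalized = {x: channel[(observed, x)] * prior[x] for x in (0, 1)}
    z = sum(unnormalized.values())
    return {x: v / z for x, v in unnormalized.items()}

def decode(outputs, prior):
    """Pick the most probable input bit for each output bit (memoryless assumption)."""
    decoded = []
    for y in outputs:
        post = posterior(y, prior)
        decoded.append(max((0, 1), key=lambda x: post[x]))
    return decoded

received = [0, 1, 1, 0, 0, 1, 1, 0]
uniform_prior = {0: 0.5, 1: 0.5}  # ASSUMED prior
skewed_prior = {0: 0.8, 1: 0.2}   # ASSUMED prior favouring 0

print(decode(received, uniform_prior))
print(decode(received, skewed_prior))
```

With a uniform prior the best guess is the received bit itself, because each bit is more likely to arrive intact than flipped. A sufficiently skewed prior can override the observation, which is why applications of this model pair the channel with a model of likely inputs.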

Page 27: Noisy Channel Applications

• OCR (straightforward): text → print (adds noise), scan → image
• Handwriting recognition: text → neurons, muscles (“noise”), scan/digitize → image
• Speech recognition (dictation, commands, etc.): text → conversion to acoustic signal (“noise”) → acoustic waves
• Machine Translation: text in target language → translation (“noise”) → source language