Information Theory
Rong Jin
Outline
• Information
• Entropy
• Mutual information
• Noisy channel model
Information
Information ≠ knowledge. Information is a reduction in uncertainty.
Example:
1. flip a coin
2. roll a die
#2 is more uncertain than #1; therefore, more information is provided by the outcome of #2 than by #1.
Definition of Information
Let E be some event that occurs with probability P(E). If we are told that E has occurred, then we say we have received I(E) = log2(1/P(E)) bits of information.
Example:
• Result of a fair coin flip: log2 2 = 1 bit
• Result of a fair die roll: log2 6 = 2.585 bits
Information is Additive
I(k fair coin tosses) = log2(2^k) = k bits
Example: information conveyed by words
• Random word from a 100,000-word vocabulary: I(word) = log2(100,000) = 16.6 bits
• A 1,000-word document from the same source: I(document) = 16,600 bits
• A 480x640-pixel, 16-greyscale picture: I(picture) = 307,200 × log2(16) = 1,228,800 bits
• A picture is worth more than 1,000 words! (1,228,800 / 16.6 ≈ 74,000 words)
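These numbers are easy to verify; here is a minimal Python sketch (the helper name info_bits is ours, not from the slides):

```python
import math

def info_bits(p):
    """Self-information I(E) = log2(1/P(E)) in bits."""
    return math.log2(1.0 / p)

print(info_bits(1/2))               # fair coin flip: 1.0 bit
print(info_bits(1/6))               # fair die roll: ~2.585 bits
print(info_bits(1/100_000))         # random word: ~16.61 bits
print(1000 * info_bits(1/100_000))  # 1,000-word document: ~16,610 bits
print(480 * 640 * math.log2(16))    # 16-greyscale picture: 1,228,800.0 bits
```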
Outline
• Information
• Entropy
• Mutual Information
• Cross Entropy and Learning
Entropy
A zero-memory information source S is a source that emits symbols from an alphabet {s1, s2, …, sk} with probabilities {p1, p2, …, pk}, respectively, where successive symbols are statistically independent.
What is the average amount of information in observing the output of the source S? Call this the entropy:

H(S) = Σi pi I(si) = Σi pi log2(1/pi) = E_{s~P}[log2(1/P(s))]
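A direct Python rendering of this definition (the function name entropy is ours):

```python
import math

def entropy(probs):
    """H(P) = sum_i p_i * log2(1/p_i); zero-probability symbols contribute nothing."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit/symbol
print(entropy([1/6] * 6))    # fair die: ~2.585 bits/symbol
print(entropy([0.9, 0.1]))   # biased coin: ~0.469 bits/symbol
```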
Explanation of Entropy

H(P) = Σi pi log2(1/pi)

1. Average amount of information provided per symbol
2. Average number of bits needed to communicate each symbol
Properties of Entropy
1. Non-negative: H(P) ≥ 0
2. For any other probability distribution {q1, …, qk}: H(P) = Σi pi log2(1/pi) ≤ Σi pi log2(1/qi)
3. H(P) ≤ log2 k, with equality iff pi = 1/k for all i
4. The further P is from uniform, the lower the entropy
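A quick numeric check of properties 2 and 3, on a small example distribution of our own choosing:

```python
import math

def entropy(p):
    return sum(pi * math.log2(1.0 / pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """sum_i p_i * log2(1/q_i): expected code length when coding P with a code built for Q."""
    return sum(pi * math.log2(1.0 / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]
q = [1/3, 1/3, 1/3]
print(entropy(p))                  # ~1.157
print(cross_entropy(p, q))         # log2(3) ~ 1.585 >= H(P)   (property 2)
print(entropy(p) <= math.log2(3))  # True                      (property 3)
print(entropy(q))                  # exactly log2(3): uniform maximizes H
```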
Entropy: k = 2

H(P) = p log2(1/p) + (1 − p) log2(1/(1 − p))
[Plot: binary entropy H(P) as a function of p on [0, 1].]
Notice:
• zero information at the edges
• maximum information at p = 0.5 (1 bit)
• the curve drops off more quickly near the edges than in the middle
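Tabulating the curve's values confirms all three observations:

```python
import math

def binary_entropy(p):
    """H(P) = p*log2(1/p) + (1-p)*log2(1/(1-p)), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(f"H({p:.1f}) = {binary_entropy(p):.3f}")
# 0.000, 0.469, 0.881, 1.000, 0.881, 0.469, 0.000:
# a peak of 1 bit at p = 0.5, and the steepest change near the edges
```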
The Entropy of English
27 characters (A–Z, space); 100,000 words (average 6.5 characters each).
Assuming independence between successive characters:
• Uniform character distribution: log2 27 = 4.75 bits/char
• True character distribution: 4.03 bits/char
Assuming independence between successive words:
• Uniform word distribution: log2(100,000)/6.5 = 2.55 bits/char
• True word distribution: 9.45/6.5 = 1.45 bits/char
The true entropy of English is much lower!
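The arithmetic behind these figures, for checking:

```python
import math

print(math.log2(27))              # ~4.755 bits/char (uniform characters)
print(math.log2(100_000) / 6.5)   # ~2.555 bits/char (uniform words, 6.5 char/word)
print(9.45 / 6.5)                 # ~1.454 bits/char (true word distribution)
```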
Entropy of Two Sources
Temperature T
P(T = hot) = 0.3
P(T = mild) = 0.5
P(T = cold) = 0.2
H(T) = H(0.3, 0.5, 0.2) = 1.485
Humidity M
P(M = low) = 0.6
P(M = high) = 0.4
H(M) = H(0.6, 0.4) = 0.971
Random variables T and M are not independent:
• P(T = t, M = m) ≠ P(T = t) P(M = m)
• H(T) = 1.485
• H(M) = 0.971
• H(T) + H(M) = 2.456
Joint entropy:
• H(T, M) = H(0.1, 0.4, 0.1, 0.2, 0.1, 0.1) = 2.321
• H(T, M) ≤ H(T) + H(M)
Joint Entropy
Joint probability P(T, M):

            M = low   M = high
T = hot       0.1       0.2
T = mild      0.4       0.1
T = cold      0.1       0.1
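Recomputing the three entropies from this joint table, in a short Python sketch:

```python
import math

def entropy(probs):
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Joint probabilities P(T = t, M = m) from the table above
joint = {('hot', 'low'): 0.1, ('hot', 'high'): 0.2,
         ('mild', 'low'): 0.4, ('mild', 'high'): 0.1,
         ('cold', 'low'): 0.1, ('cold', 'high'): 0.1}

print(entropy([0.3, 0.5, 0.2]))   # H(T)    ~ 1.485
print(entropy([0.6, 0.4]))        # H(M)    ~ 0.971
print(entropy(joint.values()))    # H(T, M) ~ 2.322 <= 1.485 + 0.971
```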
Conditional Entropy
H(T | M = low) = 1.252
H(T | M = high) = 1.5
Average conditional entropy:

H(T|M) = Σm P(M = m) H(T | M = m) = 0.6 × 1.252 + 0.4 × 1.5 = 1.351

How much is M telling us on average about T?
H(T) − H(T|M) = 1.485 − 1.351 = 0.134 bits

Conditional probability P(T | M):

            M = low   M = high
T = hot       1/6       0.50
T = mild      2/3       0.25
T = cold      1/6       0.25
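The same numbers, recomputed from the joint table (helper names are ours):

```python
import math

joint = {('hot', 'low'): 0.1, ('hot', 'high'): 0.2,
         ('mild', 'low'): 0.4, ('mild', 'high'): 0.1,
         ('cold', 'low'): 0.1, ('cold', 'high'): 0.1}
p_m = {'low': 0.6, 'high': 0.4}

def h_t_given(m):
    """H(T | M = m) from the conditional distribution P(T | M = m) = P(T, m) / P(m)."""
    cond = [joint[(t, m)] / p_m[m] for t in ('hot', 'mild', 'cold')]
    return sum(p * math.log2(1.0 / p) for p in cond if p > 0)

h_t_given_m = sum(p_m[m] * h_t_given(m) for m in p_m)
print(h_t_given('low'), h_t_given('high'))  # ~1.252, 1.5
print(h_t_given_m)                          # ~1.351
print(1.485 - h_t_given_m)                  # ~0.134 bits of T explained by M
```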
Mutual Information

I(X;Y) = H(X) − H(X|Y)
       = Σx P(x) log2(1/P(x)) − Σx,y P(x, y) log2(1/P(x|y))
       = Σx,y P(x, y) log2( P(x, y) / (P(x) P(y)) )

Properties:
• Indicates the amount of information one random variable provides about another
• Symmetric: I(X;Y) = I(Y;X)
• Non-negative
• Zero iff X and Y are independent
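Applying the last form of the formula to the temperature/humidity example recovers the 0.134 bits from the conditional-entropy slide:

```python
import math

joint = {('hot', 'low'): 0.1, ('hot', 'high'): 0.2,
         ('mild', 'low'): 0.4, ('mild', 'high'): 0.1,
         ('cold', 'low'): 0.1, ('cold', 'high'): 0.1}
p_t = {'hot': 0.3, 'mild': 0.5, 'cold': 0.2}
p_m = {'low': 0.6, 'high': 0.4}

# I(T;M) = sum over t,m of P(t, m) * log2(P(t, m) / (P(t) P(m)))
i_tm = sum(p * math.log2(p / (p_t[t] * p_m[m]))
           for (t, m), p in joint.items() if p > 0)
print(i_tm)   # ~0.134 bits, matching H(T) - H(T|M)
```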
Relationship

[Venn-style diagram: H(X, Y) covers H(X) and H(Y); it decomposes as H(X, Y) = H(X|Y) + I(X;Y) + H(Y|X), with H(X) = H(X|Y) + I(X;Y) and H(Y) = H(Y|X) + I(X;Y).]
A Distance Measure Between Distributions
Kullback-Leibler distance:

KL(PD || PM) = Σx PD(x) log2( PD(x) / PM(x) ) = E_{x~PD}[ log2( PD(x) / PM(x) ) ]

Properties of the Kullback-Leibler distance:
• Non-negative: KL(PD || PM) ≥ 0, and KL(PD || PM) = 0 iff PD = PM
• Minimizing the KL distance brings PM close to PD
• Non-symmetric: KL(PD || PM) ≠ KL(PM || PD)
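A minimal sketch illustrating all three properties on a made-up pair of distributions:

```python
import math

def kl(p, q):
    """KL(P_D || P_M) = sum_x P_D(x) * log2(P_D(x)/P_M(x)); assumes q > 0 wherever p > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p_d = [0.5, 0.4, 0.1]
p_m = [1/3, 1/3, 1/3]
print(kl(p_d, p_m))   # ~0.224 >= 0
print(kl(p_m, p_d))   # ~0.296: different value, so KL is not symmetric
print(kl(p_d, p_d))   # 0.0: zero iff the two distributions coincide
```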
Bregman Distance

B_φ(x, y) = φ(x) − φ(y) − ⟨∇φ(y), x − y⟩, where φ(x) is a convex function.
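A sketch of two standard special cases, assuming numpy (the function name bregman and the test vectors are ours): φ(v) = ||v||² recovers the squared Euclidean distance, and φ(v) = Σ vi log vi (negative entropy) recovers the KL distance (in nats):

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """B_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>, for convex phi."""
    return phi(x) - phi(y) - grad_phi(y) @ (x - y)

x = np.array([0.5, 0.4, 0.1])
y = np.array([1/3, 1/3, 1/3])

# phi(v) = ||v||^2 gives the squared Euclidean distance
print(bregman(lambda v: v @ v, lambda v: 2 * v, x, y))
print(np.sum((x - y) ** 2))            # same value

# phi(v) = sum v_i log v_i gives KL(x || y) in nats
print(bregman(lambda v: np.sum(v * np.log(v)),
              lambda v: np.log(v) + 1, x, y))
print(np.sum(x * np.log(x / y)))       # same value
```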
Compression Algorithm for TC (Text Categorization)
• Compress the training examples of each class: Sports → 109K, Politics → 116K.
• Append the new document to each class's training examples and compress again: Politics + new document → 129K, Sports + new document → 126K.
• Assign the class with the smaller compressed file: 126K < 129K, so Topic: Sports.
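A minimal sketch of this idea with zlib and toy stand-in corpora (the names and corpora are ours; the slide compares absolute compressed sizes, while a common variant scores the size increase instead):

```python
import zlib

def csize(text: str) -> int:
    """Size of the zlib-compressed text, in bytes."""
    return len(zlib.compress(text.encode('utf-8')))

def classify(doc: str, corpora: dict) -> str:
    """Pick the class whose (training examples + new document) compresses smallest.
    Variant: score the increase csize(corpus + doc) - csize(corpus) instead."""
    return min(corpora, key=lambda c: csize(corpora[c] + " " + doc))

corpora = {  # toy stand-ins for the Sports / Politics training archives
    'Sports': "goal match team score win league season player coach " * 200,
    'Politics': "vote election party senate bill policy minister law " * 200,
}
print(classify("the team scored a late goal to win the match", corpora))  # expected: Sports
```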
The Noisy Channel
Prototypical case:
Input → the channel (adds noise) → Output (noisy)
0,1,1,1,0,1,0,1,... → 0,1,1,0,0,1,1,0,...
Model: probability of error (noise). Example:
p(0|1) = 0.3, p(1|1) = 0.7, p(1|0) = 0.4, p(0|0) = 0.6
The task: known: the noisy output; want to know: the input (decoding).
• Source coding theorem
• Channel coding theorem
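A small simulation of this channel with bit-by-bit maximum-likelihood decoding (helper names are ours):

```python
import random

# Channel from the slide: p(1|0) = 0.4, p(0|1) = 0.3
FLIP = {0: 0.4, 1: 0.3}

def channel(bits):
    """Flip each input bit with its error probability."""
    return [b ^ (random.random() < FLIP[b]) for b in bits]

def decode_bit(y):
    """Pick the input x maximizing p(y | x), one bit at a time."""
    likelihood = {0: 0.6 if y == 0 else 0.4,   # p(y | x=0)
                  1: 0.3 if y == 0 else 0.7}   # p(y | x=1)
    return max(likelihood, key=likelihood.get)

sent = [0, 1, 1, 1, 0, 1, 0, 1]
received = channel(sent)
print(received, [decode_bit(y) for y in received])
# With these probabilities, the most likely input is always the received bit itself,
# so recovering the true input reliably needs redundancy (coding), not bitwise guessing.
```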
Noisy Channel Applications
• OCR (straightforward): text → print (adds noise) → scan → image
• Handwriting recognition: text → neurons, muscles ("noise") → scan/digitize → image
• Speech recognition (dictation, commands, etc.): text → conversion to acoustic signal ("noise") → acoustic waves
• Machine Translation: text in target language → translation ("noise") → source language