Information Theory
Rong Jin
Outline
• Information
• Entropy
• Mutual information
• Noisy channel model
Information
Information ≠ knowledge. Information is a reduction in uncertainty.
Example:
1. flip a coin
2. roll a die
#2 is more uncertain than #1; therefore, more information is provided by the outcome of #2 than by #1.
Definition of Information
Let E be some event that occurs with probability P(E). If we are told that E has occurred, then we say we have received I(E) = log2(1/P(E)) bits of information.
Example:
• Result of a fair coin flip: log2 2 = 1 bit
• Result of a fair die roll: log2 6 = 2.585 bits
Information is Additive
I(k fair coin tosses) = log2(2^k) = k bits
Example: information conveyed by words
• Random word from a 100,000-word vocabulary: I(word) = log2(100,000) = 16.6 bits
• A 1,000-word document from the same source: I(document) = 16,600 bits
• A 480x640-pixel, 16-greyscale picture: I(picture) = 307,200 × log2(16) = 1,228,800 bits
• A picture is worth more than 1,000 words! (1,228,800 / 16.6 ≈ 74,000 words)
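These numbers are easy to verify; here is a minimal Python sketch (the helper name info_bits is ours, not from the slides):

```python
import math

def info_bits(p):
    """Self-information I(E) = log2(1/P(E)) in bits."""
    return math.log2(1.0 / p)

print(info_bits(1/2))               # fair coin flip: 1.0 bit
print(info_bits(1/6))               # fair die roll: ~2.585 bits
print(info_bits(1/100_000))         # random word: ~16.61 bits
print(1000 * info_bits(1/100_000))  # 1,000-word document: ~16,610 bits
print(480 * 640 * math.log2(16))    # 16-greyscale picture: 1,228,800.0 bits
```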
Outline
• Information
• Entropy
• Mutual Information
• Cross Entropy and Learning
Entropy
A zero-memory information source S is a source that emits symbols from an alphabet {s1, s2, …, sk} with probabilities {p1, p2, …, pk}, respectively, where successive symbols are statistically independent.
What is the average amount of information in observing the output of the source S? Call this the entropy:

H(S) = Σi pi I(si) = Σi pi log2(1/pi) = E_{s~P}[log2(1/P(s))]
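A direct Python rendering of this definition (the function name entropy is ours):

```python
import math

def entropy(probs):
    """H(P) = sum_i p_i * log2(1/p_i); zero-probability symbols contribute nothing."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit/symbol
print(entropy([1/6] * 6))    # fair die: ~2.585 bits/symbol
print(entropy([0.9, 0.1]))   # biased coin: ~0.469 bits/symbol
```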
Explanation of Entropy

H(P) = Σi pi log2(1/pi)

1. Average amount of information provided per symbol
2. Average number of bits needed to communicate each symbol
Properties of Entropy
1. Non-negative: H(P) ≥ 0
2. For any other probability distribution {q1, …, qk}: H(P) = Σi pi log2(1/pi) ≤ Σi pi log2(1/qi)
3. H(P) ≤ log2 k, with equality iff pi = 1/k for all i
4. The further P is from uniform, the lower the entropy
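A quick numeric check of properties 2 and 3, on a small example distribution of our own choosing:

```python
import math

def entropy(p):
    return sum(pi * math.log2(1.0 / pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """sum_i p_i * log2(1/q_i): expected code length when coding P with a code built for Q."""
    return sum(pi * math.log2(1.0 / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]
q = [1/3, 1/3, 1/3]
print(entropy(p))                  # ~1.157
print(cross_entropy(p, q))         # log2(3) ~ 1.585 >= H(P)   (property 2)
print(entropy(p) <= math.log2(3))  # True                      (property 3)
print(entropy(q))                  # exactly log2(3): uniform maximizes H
```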
Entropy: k = 2

H(P) = p log2(1/p) + (1 − p) log2(1/(1 − p))
[Plot: binary entropy H(P) as a function of p on [0, 1].]
Notice:
• zero information at the edges
• maximum information at p = 0.5 (1 bit)
• the curve drops off more quickly near the edges than in the middle
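Tabulating the curve's values confirms all three observations:

```python
import math

def binary_entropy(p):
    """H(P) = p*log2(1/p) + (1-p)*log2(1/(1-p)), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(f"H({p:.1f}) = {binary_entropy(p):.3f}")
# 0.000, 0.469, 0.881, 1.000, 0.881, 0.469, 0.000:
# a peak of 1 bit at p = 0.5, and the steepest change near the edges
```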
The Entropy of English
27 characters (A–Z, space); 100,000 words (average 6.5 characters each).
Assuming independence between successive characters:
• Uniform character distribution: log2 27 = 4.75 bits/char
• True character distribution: 4.03 bits/char
Assuming independence between successive words:
• Uniform word distribution: log2(100,000)/6.5 = 2.55 bits/char
• True word distribution: 9.45/6.5 = 1.45 bits/char
The true entropy of English is much lower!
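The arithmetic behind these figures, for checking:

```python
import math

print(math.log2(27))              # ~4.755 bits/char (uniform characters)
print(math.log2(100_000) / 6.5)   # ~2.555 bits/char (uniform words, 6.5 char/word)
print(9.45 / 6.5)                 # ~1.454 bits/char (true word distribution)
```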
Entropy of Two Sources
Temperature T
P(T = hot) = 0.3
P(T = mild) = 0.5
P(T = cold) = 0.2
H(T) = H(0.3, 0.5, 0.2) = 1.485
Humidity M
P(M = low) = 0.6
P(M = high) = 0.4
H(M) = H(0.6, 0.4) = 0.971
Random variables T and M are not independent:
• P(T = t, M = m) ≠ P(T = t) P(M = m)
• H(T) = 1.485
• H(M) = 0.971
• H(T) + H(M) = 2.456
Joint entropy:
• H(T, M) = H(0.1, 0.4, 0.1, 0.2, 0.1, 0.1) = 2.321
• H(T, M) ≤ H(T) + H(M)
Joint Entropy
Joint probability P(T, M):

            M = low   M = high
T = hot       0.1       0.2
T = mild      0.4       0.1
T = cold      0.1       0.1
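Recomputing the three entropies from this joint table, in a short Python sketch:

```python
import math

def entropy(probs):
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Joint probabilities P(T = t, M = m) from the table above
joint = {('hot', 'low'): 0.1, ('hot', 'high'): 0.2,
         ('mild', 'low'): 0.4, ('mild', 'high'): 0.1,
         ('cold', 'low'): 0.1, ('cold', 'high'): 0.1}

print(entropy([0.3, 0.5, 0.2]))   # H(T)    ~ 1.485
print(entropy([0.6, 0.4]))        # H(M)    ~ 0.971
print(entropy(joint.values()))    # H(T, M) ~ 2.322 <= 1.485 + 0.971
```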
Conditional Entropy
H(T | M = low) = 1.252
H(T | M = high) = 1.5
Average conditional entropy:

H(T|M) = Σm P(M = m) H(T | M = m) = 0.6 × 1.252 + 0.4 × 1.5 = 1.351

How much is M telling us on average about T?
H(T) − H(T|M) = 1.485 − 1.351 = 0.134 bits

Conditional probability P(T | M):

            M = low   M = high
T = hot       1/6       0.50
T = mild      2/3       0.25
T = cold      1/6       0.25
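The same numbers, recomputed from the joint table (helper names are ours):

```python
import math

joint = {('hot', 'low'): 0.1, ('hot', 'high'): 0.2,
         ('mild', 'low'): 0.4, ('mild', 'high'): 0.1,
         ('cold', 'low'): 0.1, ('cold', 'high'): 0.1}
p_m = {'low': 0.6, 'high': 0.4}

def h_t_given(m):
    """H(T | M = m) from the conditional distribution P(T | M = m) = P(T, m) / P(m)."""
    cond = [joint[(t, m)] / p_m[m] for t in ('hot', 'mild', 'cold')]
    return sum(p * math.log2(1.0 / p) for p in cond if p > 0)

h_t_given_m = sum(p_m[m] * h_t_given(m) for m in p_m)
print(h_t_given('low'), h_t_given('high'))  # ~1.252, 1.5
print(h_t_given_m)                          # ~1.351
print(1.485 - h_t_given_m)                  # ~0.134 bits of T explained by M
```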
Mutual Information

I(X;Y) = H(X) − H(X|Y)
       = Σx P(x) log2(1/P(x)) − Σx,y P(x, y) log2(1/P(x|y))
       = Σx,y P(x, y) log2( P(x, y) / (P(x) P(y)) )

Properties:
• Indicates the amount of information one random variable provides about another
• Symmetric: I(X;Y) = I(Y;X)
• Non-negative
• Zero iff X and Y are independent
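Applying the last form of the formula to the temperature/humidity example recovers the 0.134 bits from the conditional-entropy slide:

```python
import math

joint = {('hot', 'low'): 0.1, ('hot', 'high'): 0.2,
         ('mild', 'low'): 0.4, ('mild', 'high'): 0.1,
         ('cold', 'low'): 0.1, ('cold', 'high'): 0.1}
p_t = {'hot': 0.3, 'mild': 0.5, 'cold': 0.2}
p_m = {'low': 0.6, 'high': 0.4}

# I(T;M) = sum over t,m of P(t, m) * log2(P(t, m) / (P(t) P(m)))
i_tm = sum(p * math.log2(p / (p_t[t] * p_m[m]))
           for (t, m), p in joint.items() if p > 0)
print(i_tm)   # ~0.134 bits, matching H(T) - H(T|M)
```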
Relationship

[Venn-style diagram: H(X, Y) covers H(X) and H(Y); it decomposes as H(X, Y) = H(X|Y) + I(X;Y) + H(Y|X), with H(X) = H(X|Y) + I(X;Y) and H(Y) = H(Y|X) + I(X;Y).]
A Distance Measure Between Distributions
Kullback-Leibler distance:

KL(PD || PM) = Σx PD(x) log2( PD(x) / PM(x) ) = E_{x~PD}[ log2( PD(x) / PM(x) ) ]

Properties of the Kullback-Leibler distance:
• Non-negative: KL(PD || PM) ≥ 0, and KL(PD || PM) = 0 iff PD = PM
• Minimizing the KL distance brings PM close to PD
• Non-symmetric: KL(PD || PM) ≠ KL(PM || PD)
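A minimal sketch illustrating all three properties on a made-up pair of distributions:

```python
import math

def kl(p, q):
    """KL(P_D || P_M) = sum_x P_D(x) * log2(P_D(x)/P_M(x)); assumes q > 0 wherever p > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p_d = [0.5, 0.4, 0.1]
p_m = [1/3, 1/3, 1/3]
print(kl(p_d, p_m))   # ~0.224 >= 0
print(kl(p_m, p_d))   # ~0.296: different value, so KL is not symmetric
print(kl(p_d, p_d))   # 0.0: zero iff the two distributions coincide
```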
Bregman Distance

B_φ(x, y) = φ(x) − φ(y) − ⟨∇φ(y), x − y⟩, where φ(x) is a convex function.
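A sketch of two standard special cases, assuming numpy (the function name bregman and the test vectors are ours): φ(v) = ||v||² recovers the squared Euclidean distance, and φ(v) = Σ vi log vi (negative entropy) recovers the KL distance (in nats):

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """B_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>, for convex phi."""
    return phi(x) - phi(y) - grad_phi(y) @ (x - y)

x = np.array([0.5, 0.4, 0.1])
y = np.array([1/3, 1/3, 1/3])

# phi(v) = ||v||^2 gives the squared Euclidean distance
print(bregman(lambda v: v @ v, lambda v: 2 * v, x, y))
print(np.sum((x - y) ** 2))            # same value

# phi(v) = sum v_i log v_i gives KL(x || y) in nats
print(bregman(lambda v: np.sum(v * np.log(v)),
              lambda v: np.log(v) + 1, x, y))
print(np.sum(x * np.log(x / y)))       # same value
```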
Compression Algorithm for TC (Text Categorization)
• Compress the training examples of each class: Sports → 109K, Politics → 116K.
• Append the new document to each class's training examples and compress again: Politics + new document → 129K, Sports + new document → 126K.
• Assign the class with the smaller compressed file: 126K < 129K, so Topic: Sports.
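A minimal sketch of this idea with zlib and toy stand-in corpora (the names and corpora are ours; the slide compares absolute compressed sizes, while a common variant scores the size increase instead):

```python
import zlib

def csize(text: str) -> int:
    """Size of the zlib-compressed text, in bytes."""
    return len(zlib.compress(text.encode('utf-8')))

def classify(doc: str, corpora: dict) -> str:
    """Pick the class whose (training examples + new document) compresses smallest.
    Variant: score the increase csize(corpus + doc) - csize(corpus) instead."""
    return min(corpora, key=lambda c: csize(corpora[c] + " " + doc))

corpora = {  # toy stand-ins for the Sports / Politics training archives
    'Sports': "goal match team score win league season player coach " * 200,
    'Politics': "vote election party senate bill policy minister law " * 200,
}
print(classify("the team scored a late goal to win the match", corpora))  # expected: Sports
```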
The Noisy Channel
Prototypical case:
Input → the channel (adds noise) → Output (noisy)
0,1,1,1,0,1,0,1,... → 0,1,1,0,0,1,1,0,...
Model: probability of error (noise). Example:
p(0|1) = 0.3, p(1|1) = 0.7, p(1|0) = 0.4, p(0|0) = 0.6
The task: known: the noisy output; want to know: the input (decoding).
• Source coding theorem
• Channel coding theorem
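A small simulation of this channel with bit-by-bit maximum-likelihood decoding (helper names are ours):

```python
import random

# Channel from the slide: p(1|0) = 0.4, p(0|1) = 0.3
FLIP = {0: 0.4, 1: 0.3}

def channel(bits):
    """Flip each input bit with its error probability."""
    return [b ^ (random.random() < FLIP[b]) for b in bits]

def decode_bit(y):
    """Pick the input x maximizing p(y | x), one bit at a time."""
    likelihood = {0: 0.6 if y == 0 else 0.4,   # p(y | x=0)
                  1: 0.3 if y == 0 else 0.7}   # p(y | x=1)
    return max(likelihood, key=likelihood.get)

sent = [0, 1, 1, 1, 0, 1, 0, 1]
received = channel(sent)
print(received, [decode_bit(y) for y in received])
# With these probabilities, the most likely input is always the received bit itself,
# so recovering the true input reliably needs redundancy (coding), not bitwise guessing.
```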
Noisy Channel Applications
• OCR (straightforward): text → print (adds noise) → scan → image
• Handwriting recognition: text → neurons, muscles ("noise") → scan/digitize → image
• Speech recognition (dictation, commands, etc.): text → conversion to acoustic signal ("noise") → acoustic waves
• Machine Translation: text in target language → translation ("noise") → source language