Language and Information
Handout #2
September 21, 2000
Course Information
• Instructor: Dragomir R. Radev ([email protected])
• Office: 305A, West Hall
• Phone: (734) 615-5225
• Office hours: TTh 3-4
• Course page: http://www.si.umich.edu/~radev/760
• Class meets on Thursdays, 5-8 PM in 311 West Hall
Readings
• Textbook:
  – Oakes, Chapter 2, pages 53–76
• Additional readings:
  – M&S, Chapter 7, pages (minus Section 7.4)
  – M&S, Chapter 8, pages (minus Sections 8.3–4)
Information Theory
Entropy
• Let p(x) be the probability mass function of a random variable X, over a discrete set of symbols (or alphabet) X:
p(x) = P(X = x), x ∈ X
• Example: throwing two coins and counting heads and tails
• Entropy (self-information) is the average uncertainty of a single random variable:
H(X) = -Σx∈X p(x) log2 p(x)
Information theoretic measures
• Claude Shannon (information theory): “information = unexpectedness”
• Series of events (messages) with associated probabilities: pi (i = 1 .. n)
• Goal: to measure the information content, H(p1, …, pn) of a particular message
• Simplest case: the messages are words
• When pi is high, the word is expected and thus less informative; a low-probability word carries more information
Properties of information content
• H is a continuous function of the pi
• If all p are equal (pi = 1/n), then H is a monotone increasing function of n
• If a message is broken into two successive messages, the original H is a weighted sum of the resulting values of H
Example
• The only function satisfying all three properties is the entropy function:
H = -Σi pi log2 pi
• Example: p1 = 1/2, p2 = 1/3, p3 = 1/6
Example (cont’d)
H = - (1/2 log2 1/2 + 1/3 log2 1/3 + 1/6 log2 1/6)
= 1/2 log2 2 + 1/3 log2 3 + 1/6 log2 6
= 1/2 + 1.585/3 + 2.585/6
= 1.46
• Alternative formula for H:
H = Σi pi log2 (1/pi)
Another example
• Example:
  – No tickets left: P = 1/2
  – Matinee shows only: P = 1/4
  – Eve. show, undesirable seats: P = 1/8
  – Eve. show, orchestra seats: P = 1/8
Example (cont’d)
H = - (1/2 log 1/2 + 1/4 log 1/4 + 1/8 log 1/8 + 1/8 log 1/8)
H = -[(1/2 × -1) + (1/4 × -2) + (1/8 × -3) + (1/8 × -3)]
H = 1.75 (bits per symbol)
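As a quick check of the two worked examples above, here is a minimal Python sketch of the entropy formula (the function name and layout are mine, not from the handout):

from math import log2

def entropy(probs):
    # H = -sum of p_i * log2(p_i), in bits; zero-probability terms contribute nothing
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([1/2, 1/3, 1/6]))        # ≈ 1.46 bits, matching the earlier example
print(entropy([1/2, 1/4, 1/8, 1/8]))   # 1.75 bits, matching the ticket example above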
Characteristics of Entropy
• When one of the messages has a probability approaching 1, entropy decreases (toward 0).
• When all messages have equal probability, entropy is at its largest.
• Maximum entropy: when pi = 1/n for all i, giving H = log2 n
• Relative entropy: ratio of actual entropy to maximum entropy
• Redundancy: 1 - relative entropy
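For concreteness, a short sketch (mine) computing relative entropy and redundancy for the ticket example above:

from math import log2

probs = [1/2, 1/4, 1/8, 1/8]             # the ticket example above
H = -sum(p * log2(p) for p in probs)     # 1.75 bits
H_max = log2(len(probs))                 # maximum entropy: log2 n = 2 bits
relative_entropy = H / H_max             # 0.875
redundancy = 1 - relative_entropy        # 0.125
print(H, H_max, relative_entropy, redundancy)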
Entropy examples
• Letter frequencies in Simplified Polynesian: P(1/8), T(1/4), K(1/8), A(1/4), I (1/8), U (1/8)
• What is H(P)?
• What is the shortest code that can be designed to describe Simplified Polynesian?
• What is the entropy of a weighted coin? Draw a diagram.
Joint entropy and conditional entropy
• The joint entropy of a pair of discrete random variables X, Y ~ p(x,y) is the amount of information needed on average to specify both their values:
H(X,Y) = -Σx Σy p(x,y) log2 p(x,y)
• The conditional entropy of a discrete random variable Y given another X, with X, Y ~ p(x,y), expresses how much extra information is needed to communicate Y given that the other party knows X:
H(Y|X) = -Σx Σy p(x,y) log2 p(y|x)
Connection between joint and conditional entropies
• There is a chain rule for entropy (note that the products in the chain rules for probabilities have become sums because of the log):
H(X,Y) = H(X) + H(Y|X)
H(X1,…,Xn) = H(X1) + H(X2|X1) + … + H(Xn|X1,…,Xn-1)
Simplified Polynesian revisited
        p       t       k
a       1/16    3/8     1/16    | 1/2
i       1/16    3/16    0       | 1/4
u       0       3/16    1/16    | 1/4
        1/8     3/4     1/8
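A small Python sketch (mine, not part of the handout) that computes the joint, marginal, and conditional entropies from this table; the rightmost column and bottom row above are the marginal probabilities of the vowels and consonants, and the last line illustrates the chain rule from the previous slide.

from math import log2

# joint distribution P(vowel, consonant) from the table above
joint = {("a", "p"): 1/16, ("a", "t"): 3/8,  ("a", "k"): 1/16,
         ("i", "p"): 1/16, ("i", "t"): 3/16, ("i", "k"): 0,
         ("u", "p"): 0,    ("u", "t"): 3/16, ("u", "k"): 1/16}

def H(dist):
    # entropy of a {outcome: probability} distribution, in bits
    return -sum(p * log2(p) for p in dist.values() if p > 0)

# marginal distributions of vowels and consonants
pV = {v: sum(p for (v2, c), p in joint.items() if v2 == v) for v in "aiu"}
pC = {c: sum(p for (v, c2), p in joint.items() if c2 == c) for c in "ptk"}

H_VC = H(joint)              # H(V,C) ≈ 2.44 bits
H_V, H_C = H(pV), H(pC)      # 1.5 bits and ≈ 1.06 bits
H_V_given_C = H_VC - H_C     # chain rule: H(V,C) = H(C) + H(V|C), so ≈ 1.375 bits
print(H_C, H_V, H_VC, H_V_given_C)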
Mutual information
• Mutual information: reduction in uncertainty of one random variable due to knowing about another, or the amount of information one random variable contains about another.
H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
H(X) – H(X|Y) = H(Y) – H(Y|X) = I(X;Y)
Mutual information and entropy
[Figure: the joint entropy H(X,Y) shown as two overlapping circles H(X) and H(Y); the overlap is I(X;Y), and the non-overlapping parts are H(X|Y) and H(Y|X)]
• I(X;Y) is 0 iff two variables are independent
• For two dependent variables, mutual information grows not only with the degree of dependence, but also according to the entropy of the variables
Formulas for I(X;Y)
I(X;Y) = H(X) – H(X|Y) = H(X) + H(Y) – H(X,Y)
I(X;Y) = Σx Σy p(x,y) log2 [ p(x,y) / (p(x) p(y)) ]
Since H(X|X) = 0, note that H(X) = H(X)-H(X|X) = I(X;X)
I(x;y) = log2 [ p(x,y) / (p(x) p(y)) ] : pointwise mutual information
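As an illustration (mine, not from the handout), the Simplified Polynesian table above can be used to compute I(X;Y) and one pointwise mutual information value:

from math import log2

# joint distribution P(vowel, consonant) from the Simplified Polynesian table
joint = {("a", "p"): 1/16, ("a", "t"): 3/8,  ("a", "k"): 1/16,
         ("i", "p"): 1/16, ("i", "t"): 3/16, ("i", "k"): 0,
         ("u", "p"): 0,    ("u", "t"): 3/16, ("u", "k"): 1/16}
px = {v: sum(p for (v2, _), p in joint.items() if v2 == v) for v in "aiu"}
py = {c: sum(p for (_, c2), p in joint.items() if c2 == c) for c in "ptk"}

# I(X;Y) = sum over x,y of p(x,y) log2 [ p(x,y) / (p(x) p(y)) ]
I = sum(p * log2(p / (px[x] * py[y])) for (x, y), p in joint.items() if p > 0)

# pointwise mutual information for one pair, e.g. (i, p)
pmi_ip = log2(joint[("i", "p")] / (px["i"] * py["p"]))
print(I, pmi_ip)    # 0.125 bits and 1.0 bit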
The noisy channel model
W → Encoder → X → Channel p(y|x) → Y → Decoder → Ŵ
W: message from a finite alphabet; X: input to channel; Y: output from channel; Ŵ: attempt to reconstruct the message based on the output
Binary symmetric channel: 0 → 0 and 1 → 1 with probability 1-p; 0 → 1 and 1 → 0 with probability p
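A quick simulation sketch of the binary symmetric channel described above (the function name, the choice p = 0.1, and the random seed are mine):

import random

def bsc(bits, p, rng):
    # each bit passes through unchanged with probability 1-p
    # and is flipped with probability p
    return [b ^ 1 if rng.random() < p else b for b in bits]

rng = random.Random(0)
message = [rng.randint(0, 1) for _ in range(10000)]
received = bsc(message, p=0.1, rng=rng)
errors = sum(m != r for m, r in zip(message, received))
print("empirical flip rate:", errors / len(message))   # close to 0.1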
Statistical NLP as decoding problems
Application                     Input               Output              p(i)                        p(o|i)
Machine translation             L1 word sequences   L2 word sequences   p(L1) in a language model   translation model
Optical character recognition   actual text         text with mistakes  prob. of language text      model of OCR errors
Part-of-speech tagging          POS tag sequences   English words       prob. of POS sequences      p(w|t)
Speech recognition              word sequences      speech signal       prob. of word sequences     acoustic model
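To make the table concrete, here is a toy sketch of noisy-channel decoding (pick the input i that maximizes p(i) · p(o|i)); the candidate words and every probability value below are invented purely for illustration:

# hypothetical language-model probabilities p(i)
prior_p = {"the": 0.05, "toe": 0.0005, "tie": 0.001}

def channel_p(observed, intended):
    # hypothetical OCR error model p(o|i); the numbers are made up
    table = {("tke", "the"): 0.01, ("tke", "toe"): 0.002, ("tke", "tie"): 0.002}
    return table.get((observed, intended), 1e-9)

def decode(observation, candidates):
    # noisy-channel decoding: argmax over candidate inputs of p(i) * p(o|i)
    return max(candidates, key=lambda i: prior_p[i] * channel_p(observation, i))

print(decode("tke", ["the", "toe", "tie"]))   # "the"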
Coding
Compression
• Huffman coding (prefix property)
• Ziv-Lempel codes (better)
• Arithmetic codes (better for images - why?)
Huffman coding
• Developed by David Huffman (1952)
• Average of 5 bits per character
• Based on frequency distributions of symbols
• Algorithm: iteratively build a tree of symbols starting with the two least frequent symbols
Symbol Frequency
A 7
B 4
C 10
D 5
E 2
F 11
G 15
H 3
I 7
J 8
[Figure: Huffman tree built from the frequencies above, with each left branch labeled 0, each right branch labeled 1, and the symbols a–j at the leaves]
Symbol Code
A 0110
B 0010
C 000
D 0011
E 01110
F 010
G 10
H 01111
I 110
J 111
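Below is a sketch of the Huffman construction applied to the frequencies in the table above (the helper names are mine). Because ties between equally frequent nodes can be broken either way, the exact codewords may differ from the code table shown here, but the average code length is the same, about 3.15 bits per symbol for these frequencies.

import heapq
from itertools import count

def huffman_codes(freqs):
    # repeatedly merge the two least frequent nodes into one;
    # the unique counter keeps heap comparisons purely numeric
    tiebreak = count()
    heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node: 0 on the left, 1 on the right
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf: record the codeword
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

freqs = {"a": 7, "b": 4, "c": 10, "d": 5, "e": 2,
         "f": 11, "g": 15, "h": 3, "i": 7, "j": 8}
codes = huffman_codes(freqs)
avg = sum(freqs[s] * len(codes[s]) for s in freqs) / sum(freqs.values())
print(codes)
print("average code length: %.2f bits/symbol" % avg)   # about 3.15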
Exercise
• Consider the bit string: 01101101111000100110001110100111000110101101011101
• Use the Huffman code from the example to decode it.
• Then try inserting, deleting, or flipping a few bits at random locations and decoding again.
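A small decoding sketch (mine) that exploits the prefix property: because no codeword is a prefix of another, symbols can be read off left to right as soon as the accumulated bits match a codeword.

def decode_prefix(bits, codes):
    # invert the symbol -> codeword table and emit a symbol whenever
    # the accumulated bits match a codeword (prefix property)
    inverse = {code: sym for sym, code in codes.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return "".join(out), buf    # non-empty buf means trailing bits did not decode

codes = {"a": "0110", "b": "0010", "c": "000", "d": "0011", "e": "01110",
         "f": "010", "g": "10", "h": "01111", "i": "110", "j": "111"}
print(decode_prefix("1011010", codes))   # ("gig", "") since g=10, i=110, g=10

A single inserted, deleted, or flipped bit will usually throw off every symbol that follows, which is the point of the exercise.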
Ziv-Lempel coding
• Two types - one is known as LZ77 (used in GZIP)
• Code: set of triples <a, b, c>
  – a: how far back in the decoded text to look for the upcoming text segment
  – b: how many characters to copy
  – c: new character to add to complete the segment
<0,0,p>   p
<0,0,e>   pe
<0,0,t>   pet
<2,1,r>   peter
<0,0,_>   peter_
<6,1,i>   peter_pi
<8,2,r>   peter_piper
<6,3,c>   peter_piper_pic
<0,0,k>   peter_piper_pick
<7,1,d>   peter_piper_picked
<7,1,a>   peter_piper_picked_a
<9,2,e>   peter_piper_picked_a_pe
<9,2,_>   peter_piper_picked_a_peck_
<0,0,o>   peter_piper_picked_a_peck_o
<0,0,f>   peter_piper_picked_a_peck_of
<17,5,l>  peter_piper_picked_a_peck_of_pickl
<12,1,d>  peter_piper_picked_a_peck_of_pickled
<16,3,p>  peter_piper_picked_a_peck_of_pickled_pep
<3,2,r>   peter_piper_picked_a_peck_of_pickled_pepper
<0,0,s>   peter_piper_picked_a_peck_of_pickled_peppers
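The decoding procedure behind this trace fits in a few lines of Python (a sketch of the triple scheme above; the function name is mine):

def lz77_decode(triples):
    # <a, b, c>: go back a characters in the text decoded so far,
    # copy b characters, then append the new character c
    text = ""
    for back, length, char in triples:
        for _ in range(length):
            text += text[len(text) - back]
        text += char
    return text

triples = [(0,0,"p"), (0,0,"e"), (0,0,"t"), (2,1,"r"), (0,0,"_"),
           (6,1,"i"), (8,2,"r"), (6,3,"c"), (0,0,"k"), (7,1,"d"),
           (7,1,"a"), (9,2,"e"), (9,2,"_"), (0,0,"o"), (0,0,"f"),
           (17,5,"l"), (12,1,"d"), (16,3,"p"), (3,2,"r"), (0,0,"s")]
print(lz77_decode(triples))   # peter_piper_picked_a_peck_of_pickled_peppers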
No. of code triples   Average text length   No. of code triples   Average text length
1                     1.00                  11                    1.82
2                     1.00                  12                    1.92
3                     1.00                  13                    2.00
4                     1.25                  14                    1.93
5                     1.20                  15                    1.87
6                     1.33                  16                    2.13
7                     1.57                  17                    2.12
8                     1.88                  18                    2.22
9                     1.78                  19                    2.26
10                    1.80                  20                    2.20
(Average text length = characters decoded so far divided by the number of triples used.)
Arithmetic coding
• Uses probabilities
• Achieves about 2.5 bits per character
Symbol        Initial   After a   After ab   After aba   After abac   After abacu   After abacus
a             1/5       2/6       2/7        3/8         3/9          3/10          3/11
b             1/5       1/6       2/7        2/8         2/9          2/10          2/11
c             1/5       1/6       1/7        1/8         2/9          2/10          2/11
s             1/5       1/6       1/7        1/8         1/9          1/10          2/11
u             1/5       1/6       1/7        1/8         1/9          2/10          2/11
Upper bound   1.000     0.200     0.1000     0.076190    0.073809     0.073809      0.073795
Lower bound   0.000     0.000     0.0666     0.066666    0.072619     0.073767      0.073781
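Here is a minimal sketch of the adaptive scheme behind this table, assuming every symbol starts with a count of 1 and counts are incremented after each symbol is coded; the exact interval endpoints depend on the symbol ordering and update convention, so the later digits may differ slightly from the table above.

from fractions import Fraction

def arithmetic_interval(text, alphabet="abcsu"):
    # adaptive model: every symbol starts with count 1; after a symbol is
    # coded, its count is incremented
    counts = {s: 1 for s in alphabet}
    low, high = Fraction(0), Fraction(1)
    for ch in text:
        total = sum(counts.values())
        width = high - low
        cum = 0
        for s in alphabet:                 # find ch's sub-interval [cum, cum + count)
            if s == ch:
                high = low + width * Fraction(cum + counts[s], total)
                low = low + width * Fraction(cum, total)
                break
            cum += counts[s]
        counts[ch] += 1
    return low, high                       # any number in [low, high) identifies text

low, high = arithmetic_interval("abacus")
print(float(low), float(high))             # a narrow interval around 0.0738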
Exercise
• Assuming the alphabet consists of a, b, and c, develop an arithmetic encoding for each of the following strings:
aaa aababa baaabc cabcba bac