
Page 1: Language and Information

(C) 2000, The University of Michigan


Language and Information

Handout #2

September 21, 2000

Page 2: Language and Information


Course Information

• Instructor: Dragomir R. Radev ([email protected])

• Office: 305A, West Hall

• Phone: (734) 615-5225

• Office hours: TTh 3-4

• Course page: http://www.si.umich.edu/~radev/760

• Class meets on Thursdays, 5-8 PM in 311 West Hall

Page 3: Language and Information


Readings

• Textbook:
  – Oakes, Chapter 2, pages 53–76

• Additional readings:
  – M&S, Chapter 7 (minus Section 7.4)
  – M&S, Chapter 8 (minus Sections 8.3–8.4)

Page 4: Language and Information


Information Theory

Page 5: Language and Information


Entropy

• Let p(x) be the probability mass function of a random variable X, over a discrete set of symbols (or alphabet) X:

p(x) = P(X = x), x ∈ X

• Example: throwing two coins and counting heads and tails

• Entropy (self-information) is the average uncertainty of a single random variable:

H(X) = - Σx∈X p(x) log2 p(x)
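Since the following slides evaluate this sum by hand several times, here is a minimal Python sketch (not part of the original handout) of the same definition; the function name entropy is my own:

import math

def entropy(probs):
    # Average uncertainty H = -sum p log2 p of a discrete distribution.
    # Terms with p = 0 contribute nothing (0 log 0 is taken as 0).
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Example from the slide: throwing two fair coins and counting heads.
# P(0 heads) = 1/4, P(1 head) = 1/2, P(2 heads) = 1/4.
print(entropy([0.25, 0.5, 0.25]))   # 1.5 bits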

Page 6: Language and Information


Information theoretic measures

• Claude Shannon (information theory): “information = unexpectedness”

• Series of events (messages) with associated probabilities: pi (i = 1 .. n)

• Goal: to measure the information content, H(p1, …, pn) of a particular message

• Simplest case: the messages are words

• When pi is high, the word is expected and therefore less informative; a low-probability word carries more information

Page 7: Language and Information


Properties of information content

• H is a continuous function of the pi

• If all p are equal (pi = 1/n), then H is a monotone increasing function of n

• If a message is broken into two successive messages, the original H is a weighted sum of the resulting values of H

Page 8: Language and Information


Example

• The only function satisfying all three properties is the entropy function:

H = - Σi pi log2 pi

• Example: p1 = 1/2, p2 = 1/3, p3 = 1/6

Page 9: Language and Information


Example (cont’d)

H = - (1/2 log2 1/2 + 1/3 log2 1/3 + 1/6 log2 1/6)

= 1/2 log2 2 + 1/3 log2 3 + 1/6 log2 6

= 1/2 + 1.585/3 + 2.585/6

= 1.46

• Alternative formula for H:

H = Σi pi log2 (1/pi)
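The arithmetic above can be checked with a short, self-contained sketch (mine, not from the slides); both formulas give the same value:

import math

p = [1/2, 1/3, 1/6]
H = -sum(q * math.log2(q) for q in p)          # definition from the previous slides
H_alt = sum(q * math.log2(1 / q) for q in p)   # alternative formula above
print(round(H, 2), round(H_alt, 2))            # 1.46 1.46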

Page 10: Language and Information


Another example

• Example:– No tickets left: P = 1/2– Matinee shows only: P = 1/4– Eve. show, undesirable seats: P = 1/8– Eve. Show, orchestra seats: P = 1/8

Page 11: Language and Information


Example (cont’d)

H = - (1/2 log2 1/2 + 1/4 log2 1/4 + 1/8 log2 1/8 + 1/8 log2 1/8)

H = - ((1/2 × -1) + (1/4 × -2) + (1/8 × -3) + (1/8 × -3))

H = 1.75 (bits per symbol)

Page 12: Language and Information


Characteristics of Entropy

• When one of the messages has a probability approaching 1, then entropy decreases.

• When all messages have the same probability, entropy increases.

• Maximum entropy: when P = 1/n (H = ??)

• Relative entropy: ratio of actual entropy to maximum entropy

• Redundancy: 1 - relative entropy
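As an illustration of the last three bullets (a sketch, not part of the slides), the ticket example from the previous slides can be run through these definitions; for n equally likely messages the maximum entropy is log2 n:

import math

p = [1/2, 1/4, 1/8, 1/8]                   # ticket example from the previous slides
H = -sum(q * math.log2(q) for q in p)      # 1.75 bits
H_max = math.log2(len(p))                  # 2 bits if all four messages were equally likely
relative = H / H_max                       # relative entropy in the sense of this slide: 0.875
redundancy = 1 - relative                  # 0.125
print(H, H_max, relative, redundancy)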

Page 13: Language and Information


Entropy examples

• Letter frequencies in Simplified Polynesian: P(1/8), T(1/4), K(1/8), A(1/4), I (1/8), U (1/8)

• What is H(P)?

• What is the shortest code that can be designed to describe Simplified Polynesian?

• What is the entropy of a weighted coin? Draw a diagram.
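A sketch for exploring these questions numerically (not part of the slides; it does not replace the requested diagram). It computes the per-letter entropy of Simplified Polynesian and tabulates the weighted-coin entropy H(p) = -p log2 p - (1-p) log2 (1-p):

import math

# Simplified Polynesian letter distribution from the slide.
letters = {'p': 1/8, 't': 1/4, 'k': 1/8, 'a': 1/4, 'i': 1/8, 'u': 1/8}
H = -sum(p * math.log2(p) for p in letters.values())
print(H)                                    # per-letter entropy in bits

def coin_entropy(p):
    # Entropy of a weighted coin with P(heads) = p; 0 log 0 is taken as 0.
    return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(p, round(coin_entropy(p), 3))     # peaks at 1 bit for a fair coin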

Page 14: Language and Information


Joint entropy and conditional entropy

• The joint entropy of a pair of discrete random variables X, Y ~ p(x,y) is the amount of information needed on average to specify both their values:

H(X,Y) = - Σx Σy p(x,y) log2 p(x,y)

• The conditional entropy of a discrete random variable Y given another X, for X, Y ~ p(x,y), expresses how much extra information is needed to communicate Y given that the other party knows X:

H(Y|X) = - Σx Σy p(x,y) log2 p(y|x)

Page 15: Language and Information


Connection between joint and conditional entropies

• There is a chain rule for entropy (note that the products in the chain rules for probabilities have become sums because of the log):

H(X,Y) = H(X) + H(Y|X)

H(X1,…,Xn) = H(X1) + H(X2|X1) + … + H(Xn|X1,…,Xn-1)

Page 16: Language and Information


Simplified Polynesian revisited

           p       t       k
a         1/16    3/8     1/16     1/2
i         1/16    3/16    0        1/4
u         0       3/16    1/16     1/4
          1/8     3/4     1/8

(The rightmost column and the bottom row are the marginal probabilities of the vowel and the consonant.)
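A sketch (not in the original slides) that recomputes the entropies of the previous slides for this joint distribution of vowels and consonants; the variable names V and C are mine:

import math
from fractions import Fraction as F

# Joint distribution p(vowel, consonant) copied from the table above.
joint = {('a', 'p'): F(1, 16), ('a', 't'): F(3, 8),  ('a', 'k'): F(1, 16),
         ('i', 'p'): F(1, 16), ('i', 't'): F(3, 16), ('i', 'k'): F(0),
         ('u', 'p'): F(0),     ('u', 't'): F(3, 16), ('u', 'k'): F(1, 16)}

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Marginal distributions of the vowel V and the consonant C.
p_v, p_c = {}, {}
for (v, c), p in joint.items():
    p_v[v] = p_v.get(v, 0) + p
    p_c[c] = p_c.get(c, 0) + p

H_V, H_C = entropy(p_v.values()), entropy(p_c.values())
H_VC = entropy(joint.values())     # joint entropy H(V,C)
H_V_given_C = H_VC - H_C           # conditional entropy via the chain rule of the previous slide
print(H_V, H_C, H_VC, H_V_given_C)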

Page 17: Language and Information


Mutual information

• Mutual information: reduction in uncertainty of one random variable due to knowing about another, or the amount of information one random variable contains about another.

H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)

H(X) – H(X|Y) = H(Y) – H(Y|X) = I(X;Y)

Page 18: Language and Information


Mutual information and entropy

[Figure: relationship between the entropies; H(X,Y) is split into H(X|Y), I(X;Y), and H(Y|X)]

• I(X;Y) is 0 iff the two variables are independent

• For two dependent variables, mutual information grows not only with the degree of dependence, but also according to the entropy of the variables

Page 19: Language and Information


Formulas for I(X;Y)

I(X;Y) = H(X) – H(X|Y) = H(X) + H(Y) – H(X,Y)

I(X;Y) = Σx Σy p(x,y) log2 [ p(x,y) / (p(x) p(y)) ]

Since H(X|X) = 0, note that H(X) = H(X)-H(X|X) = I(X;X)

Pointwise mutual information: I(x;y) = log2 [ p(x,y) / (p(x) p(y)) ]
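A sketch (not from the slides) of both formulas on a small, made-up 2×2 joint distribution; the numbers are purely illustrative:

import math

# Hypothetical joint distribution p(x, y) over x in {0, 1}, y in {0, 1}.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_x = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1)}
p_y = {y: sum(p for (_, b), p in joint.items() if b == y) for y in (0, 1)}

# Average mutual information I(X;Y).
I = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in joint.items() if p > 0)

# Pointwise mutual information of the single outcome (x, y) = (0, 0).
pmi_00 = math.log2(joint[(0, 0)] / (p_x[0] * p_y[0]))
print(I, pmi_00)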

Page 20: Language and Information


The noisy channel model

[Diagram: W (message from a finite alphabet) → Encoder → X (input to channel) → Channel p(y|x) → Y (output from channel) → Decoder → Ŵ (attempt to reconstruct the message based on the output)]

[Diagram: binary symmetric channel; 0 and 1 each pass through unchanged with probability 1-p and are flipped with probability p]
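A small simulation sketch (mine, not from the slides) of the binary symmetric channel; the binary entropy H(p) it prints is the per-bit uncertainty H(Y|X) that the channel introduces:

import math
import random

def bsc(bits, p, rng):
    # Binary symmetric channel: flip each bit independently with probability p.
    return [b ^ (rng.random() < p) for b in bits]

rng = random.Random(0)
p = 0.1
x = [rng.randint(0, 1) for _ in range(100_000)]          # input to the channel
y = bsc(x, p, rng)                                       # output from the channel
error_rate = sum(a != b for a, b in zip(x, y)) / len(x)
H_p = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
print(error_rate)   # close to p
print(H_p)          # about 0.47 bits of uncertainty per transmitted bit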

Page 21: Language and Information


Statistical NLP as decoding problems

Application | Input | Output | p(i) | p(o|i)
Machine translation | L1 word sequences | L2 word sequences | p(L1) in a language model | Translation model
Optical character recognition | Actual text | Text with mistakes | Prob. of language text | Model of OCR errors
Part-of-speech tagging | POS tag sequences | English words | Prob. of POS sequences | p(w|t)
Speech recognition | Word sequences | Speech signal | Prob. of word sequences | Acoustic model

Page 22: Language and Information


Coding

Page 23: Language and Information


Compression

• Huffman coding (prefix property)

• Ziv-Lempel codes (better)

• arithmetic codes (better for images - why?)

Page 24: Language and Information


Huffman coding

• Developed by David Huffman (1952)

• Average of 5 bits per character

• Based on frequency distributions of symbols

• Algorithm: iteratively build a tree of symbols starting with the two least frequent symbols

Page 25: Language and Information


Symbol Frequency

A 7

B 4

C 10

D 5

E 2

F 11

G 15

H 3

I 7

J 8
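A sketch of the algorithm from the previous slide applied to these frequencies (my own code, using Python's heapq); because of ties it may assign different, but equally good, codewords than the code table on a later slide:

import heapq
from itertools import count

# Symbol frequencies from the table above.
freqs = {'A': 7, 'B': 4, 'C': 10, 'D': 5, 'E': 2,
         'F': 11, 'G': 15, 'H': 3, 'I': 7, 'J': 8}

def huffman_codes(freqs):
    # Iteratively merge the two least frequent nodes into one tree.
    tiebreak = count()                 # avoids comparing the code dictionaries on ties
    heap = [(f, next(tiebreak), {s: ''}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, codes1 = heapq.heappop(heap)                     # least frequent subtree
        f2, _, codes2 = heapq.heappop(heap)                     # second least frequent subtree
        merged = {s: '0' + c for s, c in codes1.items()}        # 0 branch
        merged.update({s: '1' + c for s, c in codes2.items()})  # 1 branch
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

codes = huffman_codes(freqs)
total = sum(freqs.values())
avg_len = sum(freqs[s] * len(codes[s]) for s in freqs) / total
print(codes)
print(avg_len)      # expected code length in bits per symbol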

Page 26: Language and Information


[Figure: Huffman tree for the symbols above, with branches labeled 0 and 1 and the ten symbols a–j at the leaves]

Page 27: Language and Information


Symbol Code

A 0110

B 0010

C 000

D 0011

E 01110

F 010

G 10

H 01111

I 110

J 111

Page 28: Language and Information


Exercise

• Consider the bit string: 01101101111000100110001110100111000110101101011101

• Use the Huffman code from the example to decode it.

• Try inserting, deleting, and switching some bits at random locations and try decoding.
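For reference, a prefix-code decoder sketch (mine, not from the slides); because of the prefix property a codeword can be emitted as soon as the buffer matches one, and any bits left over at the end do not form a complete codeword:

# Code table from the previous slide (symbol -> codeword).
code = {'A': '0110', 'B': '0010', 'C': '000', 'D': '0011', 'E': '01110',
        'F': '010', 'G': '10', 'H': '01111', 'I': '110', 'J': '111'}
decode_map = {v: k for k, v in code.items()}

def decode(bits):
    out, buf = [], ''
    for b in bits:
        buf += b
        if buf in decode_map:       # prefix property: a match is always a full codeword
            out.append(decode_map[buf])
            buf = ''
    return ''.join(out) + ('?' if buf else '')   # '?' marks leftover, undecodable bits

print(decode('01101101111000100110001110100111000110101101011101'))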

Page 29: Language and Information


Ziv-Lempel coding

• Two types - one is known as LZ77 (used in GZIP)

• Code: set of triples <a,b,c>
  – a: how far back in the decoded text to look for the upcoming text segment
  – b: how many characters to copy
  – c: new character to add to complete the segment

Page 30: Language and Information


• <0,0,p>   p
• <0,0,e>   pe
• <0,0,t>   pet
• <2,1,r>   peter
• <0,0,_>   peter_
• <6,1,i>   peter_pi
• <8,2,r>   peter_piper
• <6,3,c>   peter_piper_pic
• <0,0,k>   peter_piper_pick
• <7,1,d>   peter_piper_picked
• <7,1,a>   peter_piper_picked_a
• <9,2,e>   peter_piper_picked_a_pe
• <9,2,_>   peter_piper_picked_a_peck_
• <0,0,o>   peter_piper_picked_a_peck_o
• <0,0,f>   peter_piper_picked_a_peck_of
• <17,5,l>  peter_piper_picked_a_peck_of_pickl
• <12,1,d>  peter_piper_picked_a_peck_of_pickled
• <16,3,p>  peter_piper_picked_a_peck_of_pickled_pep
• <3,2,r>   peter_piper_picked_a_peck_of_pickled_pepper
• <0,0,s>   peter_piper_picked_a_peck_of_pickled_peppers
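A decoder sketch (mine, not from the slides) that replays these triples; copying one character at a time lets a copy overlap the text it is still producing:

# LZ77 triples <a, b, c> from the slide: a = how far back, b = how many characters, c = new character.
triples = [(0, 0, 'p'), (0, 0, 'e'), (0, 0, 't'), (2, 1, 'r'), (0, 0, '_'), (6, 1, 'i'),
           (8, 2, 'r'), (6, 3, 'c'), (0, 0, 'k'), (7, 1, 'd'), (7, 1, 'a'), (9, 2, 'e'),
           (9, 2, '_'), (0, 0, 'o'), (0, 0, 'f'), (17, 5, 'l'), (12, 1, 'd'), (16, 3, 'p'),
           (3, 2, 'r'), (0, 0, 's')]

def lz77_decode(triples):
    text = ''
    for back, length, ch in triples:
        start = len(text) - back
        for i in range(length):
            text += text[start + i]   # copy from earlier in the decoded text
        text += ch                    # append the new character
    return text

print(lz77_decode(triples))   # peter_piper_picked_a_peck_of_pickled_peppers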

Page 31: Language and Information


No. of code triples   Avg. text length per triple     No. of code triples   Avg. text length per triple

1 1.00 11 1.82

2 1.00 12 1.92

3 1.00 13 2.00

4 1.25 14 1.93

5 1.20 15 1.87

6 1.33 16 2.13

7 1.57 17 2.12

8 1.88 18 2.22

9 1.78 19 2.26

10 1.80 20 2.20

Page 32: Language and Information


Arithmetic coding

• Uses probabilities

• Achieves about 2.5 bits per character

Page 33: Language and Information


Symbol        Initial   After a   After ab   After aba   After abac   After abacu   After abacus
a             1/5       2/6       2/7        3/8         3/9          3/10          3/11
b             1/5       1/6       2/7        2/8         2/9          2/10          2/11
c             1/5       1/6       1/7        1/8         2/9          2/10          2/11
s             1/5       1/6       1/7        1/8         1/9          1/10          2/11
u             1/5       1/6       1/7        1/8         1/9          2/10          2/11
Upper bound   1.000     0.200     0.1000     0.076190    0.073809     0.073809      0.073795
Lower bound   0.000     0.000     0.0666     0.066666    0.072619     0.073767      0.073781
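A sketch (not from the slides) of how such bounds are obtained with an adaptive model: every symbol starts with count 1, the current interval is split in proportion to the counts (taken here in the order a, b, c, s, u), and the count of the coded symbol is then incremented. The first four columns reproduce the table; the final columns differ in the fifth decimal place, presumably because the slide rounds intermediate bounds:

from fractions import Fraction

symbols = ['a', 'b', 'c', 's', 'u']
counts = {s: 1 for s in symbols}        # adaptive model: every symbol starts with count 1

low, high = Fraction(0), Fraction(1)
for ch in 'abacus':
    total = sum(counts.values())
    cum = 0                             # cumulative count of the symbols ordered before ch
    for s in symbols:
        if s == ch:
            break
        cum += counts[s]
    width = high - low
    high = low + width * Fraction(cum + counts[ch], total)
    low = low + width * Fraction(cum, total)
    counts[ch] += 1                     # update the model after coding the symbol
    print(ch, float(low), float(high))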

Page 34: Language and Information


Exercise

• Assuming the alphabet consists of a, b, and c, develop arithmetic encoding for the following strings:

aaa aababa baaabc cabcba bac