Weighted Parsing, Probabilistic Parsing


Page 1: Weighted Parsing,  Probabilistic Parsing

600.465 - Intro to NLP - J. Eisner 1

Weighted Parsing, Probabilistic Parsing

Page 2: Weighted Parsing,  Probabilistic Parsing

600.465 - Intro to NLP - J. Eisner 2

Our bane: Ambiguity

John saw Mary
Typhoid Mary
Phillips screwdriver Mary
(note how rare rules interact)

I see a bird
is this 4 nouns, parsed like "city park scavenger bird"?
(rare parts of speech, plus systematic ambiguity in noun sequences)

Time flies like an arrow
Fruit flies like a banana
Time reactions like this one
Time reactions like a chemist
(or is it just an NP?)

Page 3: Weighted Parsing,  Probabilistic Parsing

600.465 - Intro to NLP - J. Eisner 3

Our bane: Ambiguity

John saw Mary
Typhoid Mary
Phillips screwdriver Mary
(note how rare rules interact)

I see a bird
is this 4 nouns, parsed like "city park scavenger bird"?
(rare parts of speech, plus systematic ambiguity in noun sequences)

Time | flies like an arrow         (NP VP)
Fruit flies | like a banana        (NP VP)
Time | reactions like this one     (V[stem] NP)
Time reactions | like a chemist    (S PP, or is it just an NP?)

Page 4: Weighted Parsing,  Probabilistic Parsing

600.465 - Intro to NLP - J. Eisner 4

How to solve this combinatorial explosion of ambiguity?

1. First try parsing without any weird rules, throwing them in only if needed.

2. Better: every rule has a weight. A tree's weight is the total weight of all its rules. Pick the overall lightest parse of the sentence.

3. Can we pick the weights automatically? We'll get to this later …

Page 5: Weighted Parsing,  Probabilistic Parsing

Sentence with fencepost positions: 0 time 1 flies 2 like 3 an 4 arrow 5

Chart cells (start, end); so far only the one-word spans are filled:
(0,1): NP 3, Vst 3
(1,2): NP 4, VP 4
(2,3): P 2, V 5
(3,4): Det 1
(4,5): N 8

Grammar (weight  rule):
1  S  -> NP VP      6  S  -> Vst NP     2  S  -> S PP
1  VP -> V NP       2  VP -> VP PP
1  NP -> Det N      2  NP -> NP PP      3  NP -> NP NP
0  PP -> P NP
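To make the arithmetic on the following slides easy to check: each new chart entry's weight is its rule's weight plus the weights of the two child entries it combines. A worked example, using the (start, end) cell notation above:

\[
\mathrm{NP}(0,2) = \underbrace{3}_{\mathrm{NP}\to\mathrm{NP\ NP}} + \underbrace{3}_{\mathrm{NP}(0,1)} + \underbrace{4}_{\mathrm{NP}(1,2)} = 10,
\qquad
\mathrm{S}(0,2) = \underbrace{1}_{\mathrm{S}\to\mathrm{NP\ VP}} + \underbrace{3}_{\mathrm{NP}(0,1)} + \underbrace{4}_{\mathrm{VP}(1,2)} = 8
\]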


Page 7: Weighted Parsing,  Probabilistic Parsing

(Chart and grammar as on the previous slide; new entry: NP 10 in cell (0,2), from 3 NP -> NP NP + NP 3 + NP 4.)

Page 8: Weighted Parsing,  Probabilistic Parsing

(New entry: S 8 in cell (0,2), from 1 S -> NP VP + NP 3 + VP 4.)

Page 9: Weighted Parsing,  Probabilistic Parsing

(New entry: S 13 in cell (0,2), from 6 S -> Vst NP + Vst 3 + NP 4.)


Page 11: Weighted Parsing,  Probabilistic Parsing

(New entry: NP 10 in cell (3,5), from 1 NP -> Det N + Det 1 + N 8.)


Page 13: Weighted Parsing,  Probabilistic Parsing

(New entry: PP 12 in cell (2,5), from 0 PP -> P NP + P 2 + NP 10.)

Page 14: Weighted Parsing,  Probabilistic Parsing

(New entry: VP 16 in cell (2,5), from 1 VP -> V NP + V 5 + NP 10.)


Page 16: Weighted Parsing,  Probabilistic Parsing

(New entry: NP 18 in cell (1,5), from 2 NP -> NP PP + NP 4 + PP 12.)

Page 17: Weighted Parsing,  Probabilistic Parsing

(New entry: S 21 in cell (1,5), from 1 S -> NP VP + NP 4 + VP 16.)

Page 18: Weighted Parsing,  Probabilistic Parsing

(New entry: VP 18 in cell (1,5), from 2 VP -> VP PP + VP 4 + PP 12.)


Page 20: Weighted Parsing,  Probabilistic Parsing

(New entry in the full-sentence cell (0,5): NP 24.)

Page 21: Weighted Parsing,  Probabilistic Parsing

(New entry in cell (0,5): S 22.)

Page 22: Weighted Parsing,  Probabilistic Parsing

(New entry in cell (0,5): S 27.)


Page 24: Weighted Parsing,  Probabilistic Parsing

(New entry in cell (0,5): a second NP 24, built a different way.)

Page 25: Weighted Parsing,  Probabilistic Parsing

(New entry in cell (0,5): a second S 27, built a different way.)

Page 26: Weighted Parsing,  Probabilistic Parsing

(New entry in cell (0,5): a second S 22, built a different way.)

Page 27: Weighted Parsing,  Probabilistic Parsing

(New entry in cell (0,5): a third S 27, built a different way.)

Page 28: Weighted Parsing,  Probabilistic Parsing

Completed chart for "time flies like an arrow". Cells (start, end):
(0,1): NP 3, Vst 3
(0,2): NP 10, S 8, S 13
(0,5): NP 24, S 22, S 27, NP 24, S 27, S 22, S 27
(1,2): NP 4, VP 4
(1,5): NP 18, S 21, VP 18
(2,3): P 2, V 5
(2,5): PP 12, VP 16
(3,4): Det 1
(3,5): NP 10
(4,5): N 8

The seven entries in cell (0,5) come from:
NP 24 = 3 (NP -> NP NP) + 3 + 18       S 22 = 1 (S -> NP VP) + 3 + 18
S 27 = 6 (S -> Vst NP) + 3 + 18        NP 24 = 2 (NP -> NP PP) + 10 + 12
S 27 = 1 (S -> NP VP) + 10 + 16        S 22 = 2 (S -> S PP) + 8 + 12
S 27 = 2 (S -> S PP) + 13 + 12

Take the best S over the whole sentence and follow backpointers …

Page 29: Weighted Parsing,  Probabilistic Parsing

(Backpointer trace, step 1: the best full-sentence S expands as S -> NP VP.)

Page 30: Weighted Parsing,  Probabilistic Parsing

(Backpointer trace: the VP in that parse expands as VP -> VP PP.)

Page 31: Weighted Parsing,  Probabilistic Parsing

(Backpointer trace: the PP expands as PP -> P NP.)

Page 32: Weighted Parsing,  Probabilistic Parsing

(Backpointer trace completes: the inner NP expands as NP -> Det N, recovering the parse S( NP time, VP( VP flies, PP( P like, NP( Det an, N arrow )))).)

Page 33: Weighted Parsing,  Probabilistic Parsing

(Completed chart as before.) Which entries do we need?


Page 35: Weighted Parsing,  Probabilistic Parsing

(Completed chart as before; some entries are not worth keeping …)

Page 36: Weighted Parsing,  Probabilistic Parsing

(… since a worse-than-best entry just breeds worse options higher up.)

Page 37: Weighted Parsing,  Probabilistic Parsing

Keep only best-in-class! (The beaten entries in each cell are "inferior stock".)

Page 38: Weighted Parsing,  Probabilistic Parsing

time 1 flies 2 like 3 an 4 arrow 5

Pruned chart, keeping only the best entry per category in each cell (grammar as before):
(0,1): NP 3, Vst 3
(0,2): NP 10, S 8
(0,5): NP 24, S 22
(1,2): NP 4, VP 4
(1,5): NP 18, S 21, VP 18
(2,3): P 2, V 5
(2,5): PP 12, VP 16
(3,4): Det 1
(3,5): NP 10
(4,5): N 8

Keep only best-in-class! (and its backpointers, so you can recover the best parse)

Page 39: Weighted Parsing,  Probabilistic Parsing

Chart Parsing

phrase(X,I,J) :- rewrite(X,W), word(W,I,J).
phrase(X,I,J) :- rewrite(X,Y,Z), phrase(Y,I,Mid), phrase(Z,Mid,J).
goal :- phrase(start_symbol, 0, sentence_length).

Page 40: Weighted Parsing,  Probabilistic Parsing

Weighted Chart Parsing ("min cost")

phrase(X,I,J) min= rewrite(X,W) + word(W,I,J).
phrase(X,I,J) min= rewrite(X,Y,Z) + phrase(Y,I,Mid) + phrase(Z,Mid,J).
goal min= phrase(start_symbol, 0, sentence_length).
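In conventional notation (writing best(X,i,j) for the value of phrase(X,I,J), our shorthand rather than the slide's), these min= rules compute the familiar min-cost recurrence over spans:

\[
\mathrm{best}(X,i,j) \;=\; \min_{\substack{X \to Y\,Z \\ i < m < j}} \Big[\, w(X \to Y\,Z) + \mathrm{best}(Y,i,m) + \mathrm{best}(Z,m,j) \,\Big],
\qquad
\mathrm{best}(X,i-1,i) \;=\; \min_{X \to w_i} w(X \to w_i)
\]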

Page 41: Weighted Parsing,  Probabilistic Parsing

600.465 - Intro to NLP - J. Eisner 41

Probabilistic Trees

Instead of lightest weight tree, take highest probability tree

Given any tree, your assignment 1 generator would have some probability of producing it!

Just like using n-grams to choose among strings …

What is the probability of this tree?

(Parse tree shown: S -> NP(time) VP( VP(flies) PP( P(like) NP( Det(an) N(arrow) ))).)

Page 42: Weighted Parsing,  Probabilistic Parsing

600.465 - Intro to NLP - J. Eisner 42

Probabilistic Trees

Instead of lightest weight tree, take highest probability tree

Given any tree, your assignment 1 generator would have some probability of producing it!

Just like using n-grams to choose among strings …

What is the probability of this tree?

You rolled a lot of independent dice …

(Same parse tree as before; the quantity of interest is p(tree | S).)

Page 43: Weighted Parsing,  Probabilistic Parsing

600.465 - Intro to NLP - J. Eisner 43

Chain rule: One word at a time

p(time flies like an arrow)
  = p(time)
  * p(flies | time)
  * p(like | time flies)
  * p(an | time flies like)
  * p(arrow | time flies like an)

Page 44: Weighted Parsing,  Probabilistic Parsing

600.465 - Intro to NLP - J. Eisner 44

Chain rule + backoff (to get trigram model)

p(time flies like an arrow)
  ≈ p(time)
  * p(flies | time)
  * p(like | time flies)
  * p(an | flies like)
  * p(arrow | like an)

(each word now conditions on only the two previous words)

Page 45: Weighted Parsing,  Probabilistic Parsing

600.465 - Intro to NLP - J. Eisner 45

Chain rule – written differently

p(time flies like an arrow)
  = p(time)
  * p(time flies | time)
  * p(time flies like | time flies)
  * p(time flies like an | time flies like)
  * p(time flies like an arrow | time flies like an)

Proof: p(x, y | x) = p(x | x) * p(y | x, x) = 1 * p(y | x)

Page 46: Weighted Parsing,  Probabilistic Parsing

600.465 - Intro to NLP - J. Eisner 46

Chain rule + backoff

p(time flies like an arrow)
  = p(time)
  * p(time flies | time)
  * p(time flies like | time flies)
  * p(time flies like an | time flies like)
  * p(time flies like an arrow | time flies like an)

Proof: p(x, y | x) = p(x | x) * p(y | x, x) = 1 * p(y | x)

Page 47: Weighted Parsing,  Probabilistic Parsing

600.465 - Intro to NLP - J. Eisner 47

Chain rule: One node at a time

(Figure: the parse tree for "time flies like an arrow" is generated top-down, one node at a time; each factor conditions on the whole partial tree built so far.)

p(whole tree | S)
  = p(S -> NP VP | S)
  * p(tree with the NP expanded to "time" | tree so far)
  * p(tree with the VP expanded to VP PP | tree so far)
  * p(tree with the lower VP expanded to "flies" | tree so far)
  * …

Page 48: Weighted Parsing,  Probabilistic Parsing

600.465 - Intro to NLP - J. Eisner 48

Chain rule + backoff

(Same node-at-a-time figure as the previous slide, but each factor now backs off: the probability of each expansion conditions only on the nonterminal being expanded, not on the rest of the tree built so far.)

model you used in homework 1! (called a "PCFG")

Page 49: Weighted Parsing,  Probabilistic Parsing

600.465 - Intro to NLP - J. Eisner 49

Simplified notation

(Same parse tree as before.)

p(tree | S) = p(S -> NP VP | S) * p(NP -> time | NP)
            * p(VP -> VP PP | VP)
            * p(VP -> flies | VP) * …

model you used in homework 1! (called a "PCFG")
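In general, a PCFG scores a whole tree as the product of one conditional rule probability per node; the line above is just the first few factors of

\[
p(T) \;=\; \prod_{(X \to \alpha)\, \in\, T} p(X \to \alpha \mid X)
\]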

Page 50: Weighted Parsing,  Probabilistic Parsing

Already have a CKY alg for weights …

(Same parse tree as before.)

w(tree | S) = w(S -> NP VP) + w(NP -> time)
            + w(VP -> VP PP)
            + w(VP -> flies) + …

Just let w(X -> Y Z) = -log p(X -> Y Z | X).
Then the lightest tree has the highest probability.
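The substitution w = -log p works because a monotone decreasing transform turns maximizing a product into minimizing a sum:

\[
\arg\max_{T} \prod_{r \in T} p(r) \;=\; \arg\min_{T} \sum_{r \in T} -\log p(r)
\]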

Page 51: Weighted Parsing,  Probabilistic Parsing

Weighted Chart Parsing ("min cost")

phrase(X,I,J) min= rewrite(X,W) + word(W,I,J).
phrase(X,I,J) min= rewrite(X,Y,Z) + phrase(Y,I,Mid) + phrase(Z,Mid,J).
goal min= phrase(start_symbol, 0, sentence_length).

Page 52: Weighted Parsing,  Probabilistic Parsing

Probabilistic Chart Parsing ("max prob")

phrase(X,I,J) max= rewrite(X,W) * word(W,I,J).
phrase(X,I,J) max= rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J).
goal max= phrase(start_symbol, 0, sentence_length).

Page 53: Weighted Parsing,  Probabilistic Parsing

(Completed chart as before, now read probabilistically: an entry or rule with weight w has probability 2^-w. For example, S 8 is 2^-8, PP 12 is 2^-12, and the rule 2 S -> S PP is 2^-2; multiply to get 2^-22 for the resulting S 22.)
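Checking that multiplication, under the convention that weight w corresponds to probability 2^{-w}:

\[
2^{-8} \times 2^{-12} \times 2^{-2} \;=\; 2^{-(8+12+2)} \;=\; 2^{-22}
\]

so multiplying the probabilities is exactly adding the weights.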

Page 54: Weighted Parsing,  Probabilistic Parsing

(Completed chart as before; likewise S 13 corresponds to probability 2^-13. Need only best-in-class to get the best parse.)

Page 55: Weighted Parsing,  Probabilistic Parsing

600.465 - Intro to NLP - J. Eisner 55

Why probabilities not weights?

We just saw that probabilities are really just a special case of weights …

… but we can estimate them from training data by counting and smoothing! Yay!
Warning: What kind of training corpus do we need (if we want to estimate rule probabilities simply by counting and smoothing)?

Probabilities tell us how likely our best parse actually is:
  Might improve the user interface (e.g., ask for clarification if not sure)
  Might help when learning the rule probabilities (later in the course)
  Should help combination with other systems:
    Text understanding: even if the 3rd-best parse is 40x less probable syntactically, we might still use it if it's > 40x more probable semantically
    Ambiguity-preserving translation: if the top 3 parses are all probable, try to find a translation that would be OK regardless of which is correct

Page 56: Weighted Parsing,  Probabilistic Parsing

600.465 - Intro to NLP - J. Eisner 56

A slightly different task

Been asking: What is the probability of generating a given tree with your homework 1 generator?
  To pick the tree with the highest probability: useful in parsing.

But we could also ask: What is the probability of generating a given string with the generator (i.e., with the -t option turned off)?
  To pick the string with the highest probability: useful in speech recognition, as a substitute for an n-gram model.
  ("Put the file in the folder" vs. "Put the file and the folder")

To get the probability of generating a string, we must add up the probabilities of all trees for the string …
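In symbols, the string probability marginalizes over all of its parse trees, which is what the inside algorithm on the next slides computes:

\[
p(w_1 \cdots w_n) \;=\; \sum_{T \,:\, \mathrm{yield}(T) = w_1 \cdots w_n} p(T)
\]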

Page 57: Weighted Parsing,  Probabilistic Parsing

(Completed chart as before. To get the string probability, we could just add up the parse probabilities of the S entries spanning the whole sentence: 2^-22 + 2^-27 + 2^-27 + 2^-22 + 2^-27. Oops: that's back to finding exponentially many parses.)

Page 58: Weighted Parsing,  Probabilistic Parsing

(Same chart, with some entries rewritten as probabilities: the two S entries in cell (0,2) become 2^-8 and 2^-13, PP 12 becomes 2^-12, and the rule 2 S -> S PP becomes 2^-2. Any more efficient way?)

Page 59: Weighted Parsing,  Probabilistic Parsing

Add as we go … (the "inside algorithm")

(Same chart, but instead of keeping only the best entry per category we add: the two S entries in cell (0,2) are combined into 2^-8 + 2^-13, and the S value in cell (0,5) starts accumulating 2^-22 + 2^-27 + ….)

Page 60: Weighted Parsing,  Probabilistic Parsing

Add as we go … (the "inside algorithm")

(Same chart; cell (0,5) now holds summed probabilities, e.g. the S value is the total over all five S parses, 2^-22 + 2^-27 + 2^-22 + 2^-27 + 2^-27, computed from the already-summed values in the smaller cells.)

Page 61: Weighted Parsing,  Probabilistic Parsing

Probabilistic Chart Parsing ("max prob")

phrase(X,I,J) max= rewrite(X,W) * word(W,I,J).
phrase(X,I,J) max= rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J).
goal max= phrase(start_symbol, 0, sentence_length).

Page 62: Weighted Parsing,  Probabilistic Parsing

The "Inside Algorithm"

phrase(X,I,J) += rewrite(X,W) * word(W,I,J).
phrase(X,I,J) += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J).
goal += phrase(start_symbol, 0, sentence_length).
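Writing β(X,i,j) for the value of phrase(X,I,J) (the usual "inside probability" notation, ours rather than the slide's), the binary += rule computes

\[
\beta(X,i,j) \;=\; \sum_{X \to Y\,Z} \; \sum_{i<m<j} p(X \to Y\,Z \mid X)\; \beta(Y,i,m)\; \beta(Z,m,j)
\]

plus the analogous lexical base case, and goal = β(start_symbol, 0, n) is the total probability of all parses of the sentence.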

Page 63: Weighted Parsing,  Probabilistic Parsing

Bottom-up inference

(Figure: an agenda of pending updates feeds a chart of derived items with current values, matched against the rules of the program, e.g. s(I,K) += np(I,J) * vp(J,K) and pp(I,K) += prep(I,J) * np(J,K).

Worked step: the agenda update np(3,5) += 0.3 raises np(3,5) in the chart from 0.1 to 0.4. If np(3,5) hadn't been in the chart already, we would have added it. We updated np(3,5); what else must therefore change? The query vp(5,K) matches vp(5,7) = 0.7 and vp(5,9) = 0.5 (no more matches to this query), putting s(3,7) += 0.21 and s(3,9) += 0.15 on the agenda; the query prep(I,3) matches prep(2,3) = 1.0, putting pp(2,5) += 0.3 on the agenda.)

Page 64: Weighted Parsing,  Probabilistic Parsing

Parameterization …

rewrite(X,Y,Z)'s value represents the rule probability p(Y Z | X). This could be defined by a formula instead of a number. Simple conditional log-linear model (each rule has 4 features):

urewrite(X,Y,Z) *= exp(weight_xy(X,Y)).
urewrite(X,Y,Z) *= exp(weight_xz(X,Z)).
urewrite(X,Y,Z) *= exp(weight_yz(Y,Z)).
urewrite(X,Same,Same) *= exp(weight_same).
urewrite(X) += urewrite(X,Y,Z).                  % normalizing constant
rewrite(X,Y,Z) = urewrite(X,Y,Z) / urewrite(X).  % normalize

phrase(X,I,J) += rewrite(X,W) * word(W,I,J).
phrase(X,I,J) += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J).
goal += phrase(start_symbol, 0, sentence_length).
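Written as a formula (with θ's standing for the weight_xy, weight_xz, weight_yz, weight_same values above), the normalized rule probability defined by these lines is a conditional log-linear model:

\[
p(Y\,Z \mid X) \;=\;
\frac{\exp\big(\theta_{xy}(X,Y) + \theta_{xz}(X,Z) + \theta_{yz}(Y,Z) + \theta_{\mathrm{same}}\,[\![Y = Z]\!]\big)}
     {\sum_{Y',Z'} \exp\big(\theta_{xy}(X,Y') + \theta_{xz}(X,Z') + \theta_{yz}(Y',Z') + \theta_{\mathrm{same}}\,[\![Y' = Z']\!]\big)}
\]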

Page 65: Weighted Parsing,  Probabilistic Parsing

Parameterization …

rewrite(X,Y,Z)'s value represents the rule probability p(Y Z | X). Simple conditional log-linear model …

What if the program uses the unnormalized probability urewrite instead of rewrite?
Now each parse has an overall unnormalized probability:
  uprob(Parse) = exp(total weight of all features in the parse)
We can still normalize at the end: p(Parse | sentence) = uprob(Parse) / Z, where Z is the sum of uprob(Parse) over all parses: that's just goal!

phrase(X,I,J) += rewrite(X,W) * word(W,I,J).
phrase(X,I,J) += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J).
goal += phrase(start_symbol, 0, sentence_length).

Page 66: Weighted Parsing,  Probabilistic Parsing

Chart Parsing: Recognition algorithm

phrase(X,I,J) :- rewrite(X,W), word(W,I,J).
phrase(X,I,J) :- rewrite(X,Y,Z), phrase(Y,I,Mid), phrase(Z,Mid,J).
goal :- phrase(start_symbol, 0, sentence_length).

Page 67: Weighted Parsing,  Probabilistic Parsing

Chart Parsing: Viterbi algorithm (min-cost)

phrase(X,I,J) min= rewrite(X,W) + word(W,I,J).
phrase(X,I,J) min= rewrite(X,Y,Z) + phrase(Y,I,Mid) + phrase(Z,Mid,J).
goal min= phrase(start_symbol, 0, sentence_length).

Page 68: Weighted Parsing,  Probabilistic Parsing

Chart Parsing: Viterbi algorithm (max-prob)

phrase(X,I,J) max= rewrite(X,W) * word(W,I,J).
phrase(X,I,J) max= rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J).
goal max= phrase(start_symbol, 0, sentence_length).

Page 69: Weighted Parsing,  Probabilistic Parsing


Chart Parsing: Inside algorithm

phrase(X,I,J) += rewrite(X,W) * word(W,I,J).

phrase(X,I,J) += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J).

goal += phrase(start_symbol, 0, sentence_length).

Page 70: Weighted Parsing,  Probabilistic Parsing

Generalization: Semiring-Weighted Chart Parsing

phrase(X,I,J) ⊕= rewrite(X,W) ⊗ word(W,I,J).
phrase(X,I,J) ⊕= rewrite(X,Y,Z) ⊗ phrase(Y,I,Mid) ⊗ phrase(Z,Mid,J).
goal ⊕= phrase(start_symbol, 0, sentence_length).

Page 71: Weighted Parsing,  Probabilistic Parsing

Unweighted CKY: Recognition algorithm

600.465 - Intro to NLP - J. Eisner 71

initialize all entries of chart to false
for i := 1 to n
  for each rule R of the form X -> word[i]
    chart[X,i-1,i] ||= in_grammar(R)
for width := 2 to n
  for start := 0 to n-width
    define end := start + width
    for mid := start+1 to end-1
      for each rule R of the form X -> Y Z
        chart[X,start,end] ||= in_grammar(R) && chart[Y,start,mid] && chart[Z,mid,end]
return chart[ROOT,0,n]

Pay attention to the orange code (the parts that change from variant to variant) …

Page 72: Weighted Parsing,  Probabilistic Parsing

600.465 - Intro to NLP - J. Eisner 72

Weighted CKY: Viterbi algorithm (min-cost)

initialize all entries of chart to ∞
for i := 1 to n
  for each rule R of the form X -> word[i]
    chart[X,i-1,i] min= weight(R)
for width := 2 to n
  for start := 0 to n-width
    define end := start + width
    for mid := start+1 to end-1
      for each rule R of the form X -> Y Z
        chart[X,start,end] min= weight(R) + chart[Y,start,mid] + chart[Z,mid,end]
return chart[ROOT,0,n]

Pay attention to the orange code …

Page 73: Weighted Parsing,  Probabilistic Parsing

600.465 - Intro to NLP - J. Eisner 73

Probabilistic CKY: Inside algorithm

initialize all entries of chart to 0
for i := 1 to n
  for each rule R of the form X -> word[i]
    chart[X,i-1,i] += prob(R)
for width := 2 to n
  for start := 0 to n-width
    define end := start + width
    for mid := start+1 to end-1
      for each rule R of the form X -> Y Z
        chart[X,start,end] += prob(R) * chart[Y,start,mid] * chart[Z,mid,end]
return chart[ROOT,0,n]

Pay attention to the orange code …

Page 74: Weighted Parsing,  Probabilistic Parsing

600.465 - Intro to NLP - J. Eisner 74

Semiring-weighted CKY: General algorithm!

initialize all entries of chart to the semiring's zero
for i := 1 to n
  for each rule R of the form X -> word[i]
    chart[X,i-1,i] ⊕= semiring_weight(R)
for width := 2 to n
  for start := 0 to n-width
    define end := start + width
    for mid := start+1 to end-1
      for each rule R of the form X -> Y Z
        chart[X,start,end] ⊕= semiring_weight(R) ⊗ chart[Y,start,mid] ⊗ chart[Z,mid,end]
return chart[ROOT,0,n]

⊗ is like "and": combines all of several pieces into an X
⊕ is like "or": considers the alternative ways to build the X

Page 75: Weighted Parsing,  Probabilistic Parsing

600.465 - Intro to NLP - J. Eisner 75

Semiring-weighted CKY: General algorithm!

(Same pseudocode as the previous slide.)

Page 76: Weighted Parsing,  Probabilistic Parsing

600.465 - Intro to NLP - J. Eisner 76

Weighted CKY, general version

(Same pseudocode as the previous slide, instantiated by choosing a semiring:)

                      weights         ⊕     ⊗     zero
total prob (inside)   [0,1]           +     ×     0
min weight            [0,∞]           min   +     ∞
recognizer            {true,false}    or    and   false

Page 77: Weighted Parsing,  Probabilistic Parsing

600.465 - Intro to NLP - J. Eisner 77

Other Uses of Semirings

The semiring weight of a constituent, Chart[X,i,k], is a flexible bookkeeping device.

If you want to build up information about larger constituents from smaller ones, design a semiring:
  Probability of the best parse, or its log
  Number of parses
  Total probability of all parses, or its log
  The gradient of the total log-probability with respect to the parameters (use this to optimize the grammar weights)
  The entropy of the probability distribution over parses
  The 5 most probable parses
  Possible translations of the constituent (this is how MT is done!)

We'll see semirings again later with finite-state machines.

Page 78: Weighted Parsing,  Probabilistic Parsing

600.465 - Intro to NLP - J. Eisner 78

Some Weight Semirings

                             weights         ⊕      ⊗     zero    one
total prob (inside)          [0,1]           +      ×     0       1
max prob                     [0,1]           max    ×     0       1
min weight = -log(max prob)  [0,∞]           min    +     ∞       0
log(total prob)              [-∞,0]          log+   +     -∞      0
recognizer                   {true,false}    or     and   false   true

In the log(total prob) semiring, elements are log-probabilities lp, lq:
  lp ⊕ lq = log(exp(lp) + exp(lq)), denoted log+(lp,lq)
  lp ⊗ lq = log(exp(lp) * exp(lq)) = lp + lq
Working in log-probabilities helps prevent underflow.

Page 79: Weighted Parsing,  Probabilistic Parsing

600.465 - Intro to NLP - J. Eisner 79

The Semiring Interface

public interface Semiring<K extends Semiring> {
  public K oplus(K k);   // ⊕
  public K otimes(K k);  // ⊗
  public K zero();       // identity for ⊕
  public K one();        // identity for ⊗
}

public class Minweight implements Semiring<Minweight> {
  protected float f;
  public Minweight(float f) { this.f = f; }
  public float toFloat() { return f; }
  public Minweight oplus(Minweight k) {            // ⊕ = min
    return (f <= k.toFloat()) ? this : k;
  }
  public Minweight otimes(Minweight k) {           // ⊗ = +
    return new Minweight(f + k.toFloat());
  }
  static Minweight ZERO = new Minweight(Float.POSITIVE_INFINITY);
  static Minweight ONE = new Minweight(0);
  public Minweight zero() { return ZERO; }
  public Minweight one() { return ONE; }
}
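For comparison, here is a sketch of the inside algorithm's semiring written against the same interface, following the shape of the Minweight class above; the class name Insideprob and its field are our own illustration, not code from the slides:

// Total-probability ("inside") semiring: ⊕ is +, ⊗ is *, zero is 0, one is 1.
public class Insideprob implements Semiring<Insideprob> {
  protected float p;                               // a (summed) probability
  public Insideprob(float p) { this.p = p; }
  public float toFloat() { return p; }
  public Insideprob oplus(Insideprob k) {          // sum over alternative derivations
    return new Insideprob(p + k.toFloat());
  }
  public Insideprob otimes(Insideprob k) {         // multiply probabilities of sub-parts
    return new Insideprob(p * k.toFloat());
  }
  static Insideprob ZERO = new Insideprob(0);      // identity for oplus
  static Insideprob ONE  = new Insideprob(1);      // identity for otimes
  public Insideprob zero() { return ZERO; }
  public Insideprob one() { return ONE; }
}

Plugged into the generic CKYParser on the next slide, this would return the total probability of all parses (the goal value) instead of the minimum weight.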

Page 80: Weighted Parsing,  Probabilistic Parsing

600.465 - Intro to NLP - J. Eisner 80

A Generic Parser for Any Semiring K

public interface Semiring<K extends Semiring> {
  public K oplus(K k);   // ⊕
  public K otimes(K k);  // ⊗
  public K zero();       // identity for ⊕
  public K one();        // identity for ⊗
}

public class ContextFreeGrammar<K extends Semiring<K>> {
  …  // CFG with rule weights in K
}

public class CKYParser<K extends Semiring<K>> {
  …  // parser for a CFG whose rule weights are in K
  K parse(Vector<String> input) { … }
  // returns "total" weight (using ⊕) of all parses
}

g = new ContextFreeGrammar<Minweight>(…);
p = new CKYParser<Minweight>(g);
minweight = p.parse(input);  // returns min weight of any parse

Page 81: Weighted Parsing,  Probabilistic Parsing

600.465 - Intro to NLP - J. Eisner 81

The Semiring Axioms

public interface Semiring<K extends Semiring> {
  public K oplus(K k);   // ⊕
  public K otimes(K k);  // ⊗
  public K zero();       // identity for ⊕
  public K one();        // identity for ⊗
}

An implementation of Semiring must satisfy the semiring axioms:
  Commutativity of ⊕:  a ⊕ b = b ⊕ a
  Associativity:  (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c),  (a ⊗ b) ⊗ c = a ⊗ (b ⊗ c)
  Distributivity:  a ⊗ (b ⊕ c) = (a ⊗ b) ⊕ (a ⊗ c),  (b ⊕ c) ⊗ a = (b ⊗ a) ⊕ (c ⊗ a)
  Identities:  a ⊕ zero = zero ⊕ a = a,  a ⊗ one = one ⊗ a = a
  Annihilation:  a ⊗ zero = zero
Otherwise the parser won't work correctly. Why not? (Look back at it.)
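As a sanity check, the min-cost (Viterbi) semiring used earlier satisfies the crucial distributivity axiom, which is exactly what lets CKY combine "best of the parts" into "best of the whole":

\[
a + \min(b, c) \;=\; \min(a + b,\; a + c),
\qquad\text{e.g.}\quad 2 + \min(5, 3) \;=\; 5 \;=\; \min(2+5,\; 2+3)
\]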

Page 82: Weighted Parsing,  Probabilistic Parsing

Rule binarization can speed up program

(Diagram: an X over span (I,J) is assembled from a Y over (I,Mid) and a Z over (Mid,J).)

phrase(X,I,J) += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J).

Page 83: Weighted Parsing,  Probabilistic Parsing

Rule binarization can speed up program

phrase(X,I,J) += phrase(Y,I,Mid) * phrase(Z,Mid,J) * rewrite(X,Y,Z).

temp(X\Y,Mid,J) += phrase(Z,Mid,J) * rewrite(X,Y,Z).
phrase(X,I,J)   += phrase(Y,I,Mid) * temp(X\Y,Mid,J).

folding transformation: asymptotic speedup!

(Diagram: instead of combining Y over (I,Mid), Z over (Mid,J), and the rule in one step, first fold Z and the rule into a temporary item X\Y over (Mid,J), then combine Y with that item to build X over (I,J).)

Page 84: Weighted Parsing,  Probabilistic Parsing

Rule binarization can speed up program

phrase(X,I,J) += phrase(Y,I,Mid) * phrase(Z,Mid,J) * rewrite(X,Y,Z).

temp(X\Y,Mid,J) += phrase(Z,Mid,J) * rewrite(X,Y,Z).
phrase(X,I,J)   += phrase(Y,I,Mid) * temp(X\Y,Mid,J).

folding transformation: asymptotic speedup!

The folding just regroups the sum:
  sum over Y, Z, Mid of  phrase(Y,I,Mid) * phrase(Z,Mid,J) * rewrite(X,Y,Z)
  = sum over Y, Mid of   phrase(Y,I,Mid) * [ sum over Z of  phrase(Z,Mid,J) * rewrite(X,Y,Z) ]

The same trick shows up in graphical models, constraint programming, and multi-way database joins.

Page 85: Weighted Parsing,  Probabilistic Parsing


Earley’s algorithm in Dyna

phrase(X,I,J) += rewrite(X,W) * word(W,I,J).

phrase(X,I,J) += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J).

goal += phrase(start_symbol, 0, sentence_length).

need(start_symbol,0) = true.

need(Nonterm,J) :- phrase(_/[Nonterm|_],_,J).

phrase(Nonterm/Needed,I,I) += need(Nonterm,I), rewrite(Nonterm,Needed).

phrase(Nonterm/Needed,I,K) += phrase(Nonterm/[W|Needed],I,J) * word(W,J,K).

phrase(Nonterm/Needed,I,K) += phrase(Nonterm/[X|Needed],I,J) * phrase(X/[],J,K).

goal += phrase(start_symbol/[],0,sentence_length).

magic templates transformation (as noted by Minnen 1996)