600.465 - Intro to NLP - J. Eisner: Structured Prediction with Perceptrons and CRFs ("Time flies like an arrow")



Page 1:

600.465 - Intro to NLP - J. Eisner 1

Structured Prediction with Perceptrons and CRFs

[Figure: four candidate parse trees for "Time flies like an arrow", each assigning a different part-of-speech sequence (e.g., N V P D N; N N V D N; V N P D N; …) and a different tree structure built from NP, VP, PP, and S nodes, ending with "S … ??": which structure should we predict?]

Page 2:

600.465 - Intro to NLP - J. Eisner 2

Structured Prediction with Perceptrons and CRFs

But now, model structures!

Back to conditional log-linear modeling …

Page 5:

[Figure: binary classification examples; each message is shown with the two candidate outputs "good mail" and "spam":
"Reply today to claim your …"
"Wanna get pizza tonight?"
"Thx; consider enlarging the …"
"Enlarge your hidden …"]

Page 6:

600.465 - Intro to NLP - J. Eisner 6

[Figure: many candidate structures rooted in S, built from fragments such as NP VP, NP[+wh] V S/V/NP, VP NP PP P, N VP, Det N, …]

Page 7:

600.465 - Intro to NLP - J. Eisner 7

[Figure: as on the previous slide, many candidate structures rooted in S (NP VP, NP[+wh] V S/V/NP, VP NP PP P, N VP, Det N, …), and likewise many candidate structures rooted in NP (NP VP, NP CP/NP, VP NP NP PP, N VP, Det N, …)]

Page 8:

600.465 - Intro to NLP - J. Eisner 8

[Figure: the sentence "Time flies like an arrow" shown four times, one per candidate analysis]

Page 9:

600.465 - Intro to NLP - J. Eisner 9

[Figure: the sentence "Time flies like an arrow" shown four times, one per candidate analysis]

Page 10:

Structured prediction: the general problem

- Given some input x
  (occasionally empty, e.g., no input is needed for a generative n-gram model or other model of strings, as in randsent)
- Consider a set of candidate outputs y
  - Classifications for x (small number: often just 2)
  - Taggings of x (exponentially many)
  - Parses of x (exponential, even infinite)
  - Translations of x (exponential, even infinite)
  - …
- Want to find the "best" y, given x

600.465 - Intro to NLP - J. Eisner 10
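When the candidate set is tiny, "find the best y" is just brute-force enumeration. A minimal sketch (the labels and score function below are hypothetical placeholders, not from the slides):

```python
def predict(x, candidates, score):
    """Return the candidate output y that maximizes score(x, y).

    Feasible only when `candidates` can be enumerated; structured outputs
    (taggings, parses, translations) need dynamic programming instead.
    """
    return max(candidates, key=lambda y: score(x, y))

# Toy usage: spam classification with just two candidate outputs.
labels = ["good mail", "spam"]
score = lambda x, y: 1.0 if (y == "spam") == ("enlarge" in x.lower()) else 0.0
print(predict("Enlarge your hidden ...", labels, score))  # -> "spam"
```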

Page 11:

11

Remember Weighted CKY … (find the minimum-weight parse)

CKY chart for "time 1 flies 2 like 3 an 4 arrow 5" (cell [i, j] covers the words between positions i and j):

  [0,1] NP 3, Vst 3     [0,2] NP 10, S 8     [0,5] NP 24, S 22
  [1,2] NP 4, VP 4      [1,5] NP 18, S 21, VP 18
  [2,3] P 2, V 5        [2,5] PP 12, VP 16
  [3,4] Det 1           [3,5] NP 10
  [4,5] N 8

Grammar (rule weights):
  1  S  → NP VP
  6  S  → Vst NP
  2  S  → S PP
  1  VP → V NP
  2  VP → VP PP
  1  NP → Det N
  2  NP → NP PP
  3  NP → NP NP
  0  PP → P NP
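To make the algorithm concrete, here is a minimal weighted-CKY sketch (not the course's own code); the rule weights come from the grammar above, the lexical weights are read off the chart's bottom-level entries, and lower total weight is better:

```python
from collections import defaultdict

# Binary rules (parent, left, right) -> weight, from the grammar above.
RULES = {
    ("S", "NP", "VP"): 1, ("S", "Vst", "NP"): 6, ("S", "S", "PP"): 2,
    ("VP", "V", "NP"): 1, ("VP", "VP", "PP"): 2,
    ("NP", "Det", "N"): 1, ("NP", "NP", "PP"): 2, ("NP", "NP", "NP"): 3,
    ("PP", "P", "NP"): 0,
}
# Lexical weights, read off the bottom-level entries of the chart above.
LEXICON = {
    "time": {"NP": 3, "Vst": 3}, "flies": {"NP": 4, "VP": 4},
    "like": {"P": 2, "V": 5}, "an": {"Det": 1}, "arrow": {"N": 8},
}

def weighted_cky(words):
    """chart[i][j][X] = weight of the lightest X covering the words between i and j."""
    n = len(words)
    chart = defaultdict(lambda: defaultdict(dict))
    for i, word in enumerate(words):                   # width-1 spans
        chart[i][i + 1] = dict(LEXICON[word])
    for width in range(2, n + 1):                      # wider spans, bottom-up
        for i in range(n - width + 1):
            j = i + width
            cell = chart[i][j]
            for k in range(i + 1, j):                  # split point
                for (X, Y, Z), w in RULES.items():
                    if Y in chart[i][k] and Z in chart[k][j]:
                        total = w + chart[i][k][Y] + chart[k][j][Z]
                        if total < cell.get(X, float("inf")):
                            cell[X] = total
    return chart

chart = weighted_cky("time flies like an arrow".split())
print(chart[0][5]["S"])   # 22, the minimum-weight S covering the whole sentence
```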

Page 12:

12

We used weighted CKY to implement probabilistic CKY for PCFGs.

(Same chart and grammar as on the previous slide.)

Here each weight is a negative log2 probability: e.g., 2^-8, 2^-12, 2^-2 multiply to get 2^-22, since the best S covering the sentence has weight 8 + 12 + 2 = 22.

But is weighted CKY good for anything else??

Page 13:

13

Can set weights to log probs.

[Parse tree: S → NP VP; NP → time; VP → VP PP; VP → flies; PP → P NP; P → like; NP → Det N; Det → an; N → arrow]

w(tree) = w(S → NP VP) + w(NP → time) + w(VP → VP PP) + w(VP → flies) + …

Just let w(X → Y Z) = −log p(X → Y Z | X). Then the lightest tree has the highest probability.

But is weighted CKY good for anything else?? Do the weights have to be probabilities?
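Why the lightest tree is the most probable one, spelled out as a short derivation consistent with the slide's definition of w:

```latex
w(\text{tree}) = \sum_{X \to \alpha \,\in\, \text{tree}} w(X \to \alpha)
              = \sum_{X \to \alpha \,\in\, \text{tree}} -\log p(X \to \alpha \mid X)
              = -\log \prod_{X \to \alpha \,\in\, \text{tree}} p(X \to \alpha \mid X)
              = -\log p(\text{tree})
```

so minimizing total weight is the same as maximizing the tree's probability under the PCFG.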

Page 14:

600.465 - Intro to NLP - J. Eisner 14

Probability is Useful

We love probability distributions! We've learned how to define & use p(…) functions.

- Pick best output text T from a set of candidates
  - speech recognition (HW2); machine translation; OCR; spell correction …
  - maximize p1(T) for some appropriate distribution p1
- Pick best annotation T for a fixed input I
  - text categorization; parsing; part-of-speech tagging …
  - maximize p(T | I); equivalently maximize joint probability p(I, T)
  - often define p(I, T) by noisy channel: p(I, T) = p(T) * p(I | T)
  - speech recognition & other tasks above are cases of this too: we're maximizing an appropriate p1(T) defined by p(T | I)
- Pick best probability distribution (a meta-problem!)
  - really, pick best parameters θ: train HMM, PCFG, n-grams, clusters …
  - maximum likelihood; smoothing; EM if unsupervised (incomplete data)
  - Bayesian smoothing: max p(θ | data) = max p(θ, data) = p(θ) p(data | θ)

summary of half of the course (statistics)

Page 15:

600.465 - Intro to NLP - J. Eisner 15

Probability is Flexible

We love probability distributions! We've learned how to define & use p(…) functions.

- We want p(…) to define probability of linguistic objects
  - Trees of (non)terminals (PCFGs; CKY, Earley, pruning, inside-outside)
  - Sequences of words, tags, morphemes, phonemes (n-grams, FSAs, FSTs; regex compilation, best-paths, forward-backward, collocations)
  - Vectors (decision lists, Gaussians, naïve Bayes; Yarowsky, clustering/k-NN)
- We've also seen some not-so-probabilistic stuff
  - Syntactic features, semantics, morphology, Gold. Could be stochasticized?
  - Methods can be quantitative & data-driven but not fully probabilistic: transformation-based learning, bottom-up clustering, LSA, competitive linking
- But probabilities have wormed their way into most things
- p(…) has to capture our intuitions about the linguistic data

summary of other half of the course (linguistics)

Page 16:

600.465 - Intro to NLP - J. Eisner 16

An Alternative Tradition

- Old AI hacking technique: possible parses (or whatever) have scores. Pick the one with the best score.
- How do you define the score?
  - Completely ad hoc! Throw anything you want into the stew
  - Add a bonus for this, a penalty for that, etc.
- "Learns" over time – as you adjust bonuses and penalties by hand to improve performance.
- Total kludge, but totally flexible too … Can throw in any intuitions you might have

Page 17:

Scoring by Linear Models

- Given some input x
- Consider a set of candidate outputs y
- Define a scoring function score(x, y)
  - Linear function: a sum of feature weights (you pick the features!)
- Choose y that maximizes score(x, y)

score(x, y) = Σ_k θ_k f_k(x, y)

  - k ranges over all features, e.g., k = 5 (numbered features) or k = "see Det Noun" (named features)
  - f_k(x, y): whether (x, y) has feature k (0 or 1), or how many times it fires (≥ 0), or how strongly it fires (a real number)
  - θ_k: weight of feature k (learned or set by hand)

600.465 - Intro to NLP - J. Eisner 17
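A minimal sketch of this scoring function in Python; the feature names and weights below are invented for illustration:

```python
from collections import Counter

# Hypothetical features of a (word sequence, tag sequence) pair; values >= 0.
def features(x, y):
    feats = Counter()
    for word, tag in zip(x, y):
        feats[f"emit {tag} {word}"] += 1      # how many times the feature fires
    for prev, nxt in zip(y, y[1:]):
        feats[f"see {prev} {nxt}"] += 1       # e.g., the named feature k = "see Det Noun"
    return feats

def score(x, y, theta):
    """Linear model: score(x, y) = sum over features k of theta[k] * f_k(x, y)."""
    return sum(theta.get(k, 0.0) * v for k, v in features(x, y).items())

theta = {"see Det Noun": 1.3, "emit Noun arrow": 0.5}   # weights: learned or set by hand
print(score(["an", "arrow"], ["Det", "Noun"], theta))   # 1.3 + 0.5 = 1.8
```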

Page 18:

Linear model notation

- Given some input x
- Consider a set of candidate outputs y
- Define a scoring function score(x, y)
  - Linear function: a sum of feature weights (you pick the features!)
- Choose y that maximizes score(x, y)

score(x, y) = Σ_k θ_k f_k(x, y)   (weights learned or set by hand)

600.465 - Intro to NLP - J. Eisner 18

Page 19:

Finding the best y given x

- Choose y that maximizes score(x, y). But how?
- Easy when there are only a few candidates y (text classification, WSD, …): just try each one in turn!
- Harder for structured prediction: but you now know how!
  - Find the best string, path, or tree … That's what Viterbi-style or Dijkstra-style algorithms are for.
  - That is, use dynamic programming to find the score of the best y. Then follow backpointers to recover the y that achieves that score.

600.465 - Intro to NLP - J. Eisner 19

Page 20:

CKY chart for "time 1 flies 2 like 3 an 4 arrow 5" (some cells list several candidate derivations' scores):

  [0,1] NP 3, Vst 3     [0,2] NP 10, S 8, S 13     [0,5] NP 24, S 22, S 27, NP 24, S 27, S 22, S 27
  [1,2] NP 4, VP 4      [1,5] NP 18, S 21, VP 18
  [2,3] P 2, V 5        [2,5] PP 12, VP 16
  [3,4] Det 1           [3,5] NP 10
  [4,5] N 8

Grammar (rule weights):
  1  S  → NP VP
  6  S  → Vst NP
  2  S  → S PP
  1  VP → V NP
  2  VP → VP PP
  1  NP → Det N
  2  NP → NP PP
  3  NP → NP NP
  0  PP → P NP

Given sentence x, you know how to find the max-score parse y (or min-cost parse)
• Provided that the score of a parse = total score of its rules

[Parse tree: "Time flies like an arrow" tagged N V P D N, with NP, VP, PP, and S nodes]

Page 21:

Given word sequence x, you know how to find the max-score tag sequence y
• Provided that the score of a tagged sentence = total score of its emissions and transitions
• These don't have to be log-probabilities!
• Emission scores assess tag-word compatibility
• Transition scores assess goodness of tag bigrams

Bill directed a cortege of autos through the dunes

[Figure: candidate tags for each word of the sentence (PN, Noun, Verb, Adj, Det, Prep, …), forming a lattice of possible tag sequences …?]
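A compact Viterbi sketch for exactly this setting: arbitrary (not necessarily log-probability) emission and transition scores, with backpointers to recover the best tag sequence. The tags and score tables below are made up, and a start-of-sentence transition is omitted for brevity:

```python
def viterbi(words, tags, emit, trans):
    """Return the max-score tag sequence; score = sum of emission + transition scores."""
    best = {t: emit.get((t, words[0]), 0.0) for t in tags}   # position 0: emissions only
    back = []                                                # backpointers per position
    for w in words[1:]:
        new, ptrs = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[p] + trans.get((p, t), 0.0))
            new[t] = best[prev] + trans.get((prev, t), 0.0) + emit.get((t, w), 0.0)
            ptrs[t] = prev
        best, back = new, back + [ptrs]
    last = max(tags, key=lambda t: best[t])                  # best final tag
    seq = [last]
    for ptrs in reversed(back):                              # follow backpointers
        seq.append(ptrs[seq[-1]])
    return list(reversed(seq))

# Toy usage with hypothetical score tables (bigger is better, as on the slide).
tags = ["Noun", "Verb", "Det"]
emit = {("Det", "an"): 2.0, ("Noun", "arrow"): 2.0, ("Verb", "flies"): 1.0}
trans = {("Det", "Noun"): 1.5, ("Verb", "Det"): 1.0}
print(viterbi(["flies", "an", "arrow"], tags, emit, trans))  # ['Verb', 'Det', 'Noun']
```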

Page 22:

Given upper string x, you know how to find the max-score path that accepts x (or min-cost path)
• Provided that the score of a path = total score of its arcs
• Then choose lower string y from that best path
• (So in effect, score(x, y) is the score of the best path that transduces x to y)

• Q: How do you make sure that the path accepts aaaaaba?
• A: Compose with a straight-line automaton, then find the best path.

Page 23:

When can you efficiently choose best y?

- "Provided that the score of a parse = total score of its rules"
- "Provided that the score of a tagged sentence = total score of its transitions and emissions"
- "Provided that the score of a path = total score of its arcs"

How does this fit with what linear models can do?

e.g., θ_3 = score of VP → VP PP, and f_3(x, y) = # times VP → VP PP appears in y
      θ_8 = score of V → flies, and f_8(x, y) = # times V → flies appears in (x, y)

600.465 - Intro to NLP - J. Eisner 23

Page 24:

When can you efficiently choose best y?

- So it's fine to have one feature for each rule, transition, emission, arc, …
  - E.g., VP → VP PP or V → flies
  - The feature counts the # of occurrences of that substructure in (x, y)
- But is that all? Or can we allow other features and remain efficient?
- Features that count configurations smaller than a rule (or arc)?
  - Backoff feature V… → V… P… ?
  - Backoff feature foo → foo PP (asks whether a PP is "adjoined" to some foo)?
  - Sure. These just add to the overall score of a rule like VP → VP PP that we then use in the parsing algorithm (see the sketch after this slide).

600.465 - Intro to NLP - J. Eisner 24
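A hypothetical illustration of that last point: the weight the parser uses for one rule is itself a small sum of feature weights (the exact-rule feature plus any backoff features that fire), computed once and then plugged into weighted CKY unchanged. The feature names and weights here are invented:

```python
theta = {
    "VP -> VP PP": 0.75,          # exact-rule feature
    "V... -> V... P...": 0.25,    # backoff feature on category prefixes
    "foo -> foo PP": 0.5,         # backoff: "a PP adjoined to some foo"
}

def rule_score(parent, left, right, theta):
    """Score of one rule = exact-rule weight + any backoff feature weights that fire."""
    s = theta.get(f"{parent} -> {left} {right}", 0.0)
    if parent == left:                                     # a category adjoined to itself
        s += theta.get(f"foo -> foo {right}", 0.0)
    s += theta.get(f"{parent[0]}... -> {left[0]}... {right[0]}...", 0.0)
    return s

# This combined score is what weighted CKY would then use for VP -> VP PP.
print(rule_score("VP", "VP", "PP", theta))                 # 0.75 + 0.5 + 0.25 = 1.5
```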

Page 25:

When can you efficiently choose best y?

- So it's fine to have one feature for each rule, transition, emission, arc, …
  - E.g., VP → VP PP or V → flies
  - The feature counts the # of occurrences of that substructure in (x, y)
- But is that all? Or can we allow other features and remain efficient?
- Features that count configurations bigger than a rule (or arc)?
  - E.g., "NP → him" is a good rule when the NP is immediately to the right of a V
  - Would have to change the algorithm
  - Or, enrich the CFG nonterminals with attributes (or split states of the FSM): NP[subject] → him vs. NP[object] → him
  - Now the information about the configuration can be seen locally within one rule

600.465 - Intro to NLP - J. Eisner 25

Page 26:

When can you efficiently choose best y?

- So it's fine to have one feature for each rule, transition, emission, arc, …
  - E.g., VP → VP PP or V → flies
  - The feature counts the # of occurrences of that substructure in (x, y)
- But is that all? Or can we allow other features and remain efficient?
- Features that count configurations bigger than a rule (or arc)?
  - Reasonably easy case: "Does the tree have even depth along the left spine?"
  - Harder case: "Do the left and right children have the same # of words?"
  - Extra-hard case: "Is the # of NPs in the parse a prime number?"

600.465 - Intro to NLP - J. Eisner 26

Page 27:

When can you efficiently choose best y?

- So it's fine to have one feature for each rule, transition, emission, arc, …
  - E.g., VP → VP PP or V → flies
  - The feature counts the # of occurrences of that substructure in (x, y)
- But is that all? Or can we allow other features and remain efficient?
- Features that count configurations bigger than a rule (or arc)?
  - Surprisingly easy case: features that look at a rule in y together with any properties of x! (illustrated below)

600.465 - Intro to NLP - J. Eisner 27
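Why this case is easy: x is fixed during decoding, so any feature that conjoins a rule with properties of x collapses into a per-span weight for that rule, which can be precomputed and handed to the unchanged parsing algorithm. A hypothetical sketch (the feature templates are invented):

```python
# A rule applied over span [i, j) of x with split point k, conjoined with
# arbitrary properties of the fixed sentence x.
def span_rule_features(x, rule, i, j, k):
    parent, left, right = rule
    name = f"{parent} -> {left} {right}"
    return [
        f"rule={name}",
        f"rule={name} & first-word={x[i]}",
        f"rule={name} & word-after-split={x[k]}",
        f"rule={name} & span-length={j - i}",
    ]

x = "time flies like an arrow".split()
print(span_rule_features(x, ("VP", "V", "PP"), 1, 5, 2))
# The total weight of these features becomes the rule's score for this span,
# so weighted CKY itself does not need to change.
```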

Page 28:

Linear model notation

- Choose y that maximizes score(x, y). But how?
- Easy when there are only a few candidates y (text classification, WSD, etc.): just try each one in turn!
- Harder for structured prediction: but you now know how!
  - At least for linear scoring functions with certain kinds of features.
- Generalizing beyond this is an active area!
  - Approximate inference in graphical models, integer linear programming, weighted MAX-SAT, etc. … see 600.325/425 Declarative Methods

600.465 - Intro to NLP - J. Eisner 28

Page 29:

Finding the best y given x

- Given some input x
- Consider a set of candidate outputs y
- Define a scoring function score(x, y)
  - We're talking about linear functions: a sum of feature weights
- Choose y that maximizes score(x, y)
  - Easy when there are only two candidates y (spam classification, binary WSD, etc.): just try both!
  - Hard for structured prediction: but you now know how!
    - At least for linear scoring functions with certain kinds of features.
  - Generalizing beyond this is an active area!
    - Approximate inference in graphical models, integer linear programming, weighted MAX-SAT, etc. … see 600.325/425 Declarative Methods

600.465 - Intro to NLP - J. Eisner 29

Page 30:

600.465 - Intro to NLP - J. Eisner 30

An Alternative Tradition

- Old AI hacking technique: possible parses (or whatever) have scores. Pick the one with the best score.
- How do you define the score?
  - Completely ad hoc! Throw anything you want into the stew
  - Add a bonus for this, a penalty for that, etc.
- "Learns" over time – as you adjust bonuses and penalties by hand to improve performance.
- Total kludge, but totally flexible too … Can throw in any intuitions you might have

[Mock news item: "Exposé at 9: Probabilistic Revolution Not Really a Revolution, Critics Say. Log-probabilities no more than scores in disguise. 'We're just adding stuff up like the old corrupt regime did,' admits spokesperson."]

… really so alternative?

Page 31:

600.465 - Intro to NLP - J. Eisner 31

Nuthin' but adding weights

- n-grams: … + log p(w7 | w5, w6) + log p(w8 | w6, w7) + …
- PCFG: log p(NP VP | S) + log p(Papa | NP) + log p(VP PP | VP) + …
- HMM tagging: … + log p(t7 | t5, t6) + log p(w7 | t7) + …
- Noisy channel: [log p(source)] + [log p(data | source)]
- Cascade of composed FSTs: [log p(A)] + [log p(B | A)] + [log p(C | B)] + …
- Naïve Bayes: log p(Class) + log p(feature1 | Class) + log p(feature2 | Class) + …

Note: Today we'll use +logprob, not −logprob: i.e., bigger weights are better.

Page 32:

600.465 - Intro to NLP - J. Eisner 32

Nuthin' but adding weights

- n-grams: … + log p(w7 | w5, w6) + log p(w8 | w6, w7) + …
- PCFG: log p(NP VP | S) + log p(Papa | NP) + log p(VP PP | VP) + …
  - Can describe any linguistic object as a collection of "features" (here, a tree's "features" are all of its component rules) (a different meaning of "features" from singular/plural/etc.)
  - Weight of the object = total weight of its features
  - Our weights have always been conditional log-probs (≤ 0), but what if we changed that?
- HMM tagging: … + log p(t7 | t5, t6) + log p(w7 | t7) + …
- Noisy channel: [log p(source)] + [log p(data | source)]
- Cascade of FSTs: [log p(A)] + [log p(B | A)] + [log p(C | B)] + …
- Naïve Bayes: log p(Class) + log p(feature1 | Class) + log p(feature2 | Class) + …

Page 33:

600.465 - Intro to NLP - J. Eisner 33

What if our weights were arbitrary real numbers?

Change log p(this | that) to θ(this ; that)

- n-grams: … + log p(w7 | w5, w6) + log p(w8 | w6, w7) + …
- PCFG: log p(NP VP | S) + log p(Papa | NP) + log p(VP PP | VP) + …
- HMM tagging: … + log p(t7 | t5, t6) + log p(w7 | t7) + …
- Noisy channel: [log p(source)] + [log p(data | source)]
- Cascade of FSTs: [log p(A)] + [log p(B | A)] + [log p(C | B)] + …
- Naïve Bayes: log p(Class) + log p(feature1 | Class) + log p(feature2 | Class) + …

Page 34:

600.465 - Intro to NLP - J. Eisner 34

What if our weights were arbitrary real numbers?

Change log p(this | that) to θ(this ; that)

- n-grams: … + θ(w7 ; w5, w6) + θ(w8 ; w6, w7) + …
- PCFG: θ(NP VP ; S) + θ(Papa ; NP) + θ(VP PP ; VP) + …
- HMM tagging: … + θ(t7 ; t5, t6) + θ(w7 ; t7) + …
- Noisy channel: [θ(source)] + [θ(data ; source)]
- Cascade of FSTs: [θ(A)] + [θ(B ; A)] + [θ(C ; B)] + …
- Naïve Bayes: θ(Class) + θ(feature1 ; Class) + θ(feature2 ; Class) + …

In practice, θ is a hash table: it maps from a feature name (a string or object) to a feature weight (a float). E.g., θ(NP VP ; S) = weight of the S → NP VP rule, say −0.1 or +1.3.
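The "θ is a hash table" remark made concrete, as a tiny sketch (the particular feature strings and weights are invented; a missing feature simply contributes weight 0):

```python
from collections import defaultdict

# theta maps feature names to real-valued weights; unseen features get 0.0.
theta = defaultdict(float)
theta["S -> NP VP"] = -0.1       # e.g., the weight of the rule S -> NP VP
theta["see Det Noun"] = 1.3      # a named tag-bigram feature

def total_weight(feature_names):
    """Weight of an object = total weight of the features it contains."""
    return sum(theta[name] for name in feature_names)

print(total_weight(["S -> NP VP", "see Det Noun", "never-seen-feature"]))  # -0.1 + 1.3 + 0.0 ≈ 1.2
```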

Page 35:

600.465 - Intro to NLP - J. Eisner 35

What if our weights were arbitrary real numbers?

Change log p(this | that) to θ(this ; that), or θ(that & this) [prettier name]

- n-grams: … + θ(w5 w6 w7) + θ(w6 w7 w8) + …
- PCFG: θ(S → NP VP) + θ(NP → Papa) + θ(VP → VP PP) + …   [a WCFG]
- HMM tagging: … + θ(t5 t6 t7) + θ(t7 w7) + …
- Noisy channel: [θ(source)] + [θ(source, data)]
- Cascade of FSTs: [θ(A)] + [θ(A, B)] + [θ(B, C)] + …
- Naïve Bayes: θ(Class) + θ(Class, feature1) + θ(Class, feature2) + …   [(multi-class) logistic regression]

In practice, θ is a hash table: it maps from a feature name (a string or object) to a feature weight (a float). E.g., θ(S → NP VP) = weight of the S → NP VP rule, say −0.1 or +1.3.

Page 36:

600.465 - Intro to NLP - J. Eisner 36

What if our weights were arbitrary real numbers?

Change log p(this | that) to θ(that & this)

- n-grams: … + θ(w5 w6 w7) + θ(w6 w7 w8) + …
  - Best string is the one whose trigrams have the highest total weight
- PCFG: θ(S → NP VP) + θ(NP → Papa) + θ(VP → VP PP) + …   [a WCFG]
  - Best parse is the one whose rules have the highest total weight (use CKY/Earley)
- HMM tagging: … + θ(t5 t6 t7) + θ(t7 w7) + …
  - Best tagging has the highest total weight of all transitions and emissions
- Noisy channel: [θ(source)] + [θ(source, data)]
  - To guess source: max (weight of source + weight of source-data match)
- Naïve Bayes: θ(Class) + θ(Class, feature1) + θ(Class, feature2) + …   [(multi-class) logistic regression]
  - Best class maximizes prior weight + weight of compatibility with features

Page 37:

(Same CKY chart and grammar as on Page 20.)

Given sentence x, you know how to find the max-score parse y (or min-cost parse)
• Provided that the score of a parse = a sum over its individual rules
• Each rule score can add up several features of that rule
• But a feature can't look at 2 rules at once (how to solve?)

[Figure: a tree fragment involving "fund", TO "to", NP "projects", and an SBAR "that …", illustrating a configuration that spans more than one rule]

Page 38:

Given upper string x, you know how to find the lower string y such that score(x, y) is highest
• Provided that score(x, y) is a sum of arc scores along the best path that transduces x to y
• Each arc score can add up several features of that arc
• But a feature can't look at 2 arcs at once (how to solve?)

Page 39:

Linear model notation

- Given some input x
- Consider a set of candidate outputs y
- Define a scoring function score(x, y)
  - Linear function: a sum of feature weights (you pick the features!)
- Choose y that maximizes score(x, y)

score(x, y) = Σ_k θ_k f_k(x, y)

  - k ranges over all features, e.g., k = 5 (numbered features) or k = "see Det Noun" (named features)
  - f_k(x, y): whether (x, y) has feature k (0 or 1), or how many times it fires (≥ 0), or how strongly it fires (a real number)
  - θ_k: weight of feature k. To be learned …

600.465 - Intro to NLP - J. Eisner 39

Page 40:

Linear model notation

- Given some input x
- Consider a set of candidate outputs y
- Define a scoring function score(x, y)
  - Linear function: a sum of feature weights (you pick the features!)
- Choose y that maximizes score(x, y)

score(x, y) = Σ_k θ_k f_k(x, y)   (weights to be learned …)

600.465 - Intro to NLP - J. Eisner 40

Page 41:

600.465 - Intro to NLP - J. Eisner 41

Probabilists Rally Behind Paradigm   (a caret inserts "83% of" into the headline)

".2, .4, .6, .8! We're not gonna take your bait!"

1. Can estimate our parameters automatically
   - e.g., log p(t7 | t5, t6) (trigram tag probability)
   - from supervised or unsupervised data
2. Our results are more meaningful
   - Can use probabilities to place bets, quantify risk
   - e.g., how sure are we that this is the correct parse?
3. Our results can be meaningfully combined – modularity!
   - Multiply independent conditional probs – normalized, unlike scores
   - p(English text) * p(English phonemes | English text) * p(Jap. phonemes | English phonemes) * p(Jap. text | Jap. phonemes)
   - p(semantics) * p(syntax | semantics) * p(morphology | syntax) * p(phonology | morphology) * p(sounds | phonology)

Page 42:

600.465 - Intro to NLP - J. Eisner 42

Probabilists Regret Being Bound by Principle

Problem with our course's "principled" approach: all we've had is the chain rule + backoff. But this forced us to make some tough "either-or" decisions.

- p(t7 | t5, t6): do we want to back off to t6 or t5?
- p(S → NP VP | S) with features: do we want to back off from number or gender features first?
- p(spam | message text): which words of the message do we back off from??
- p(Paul Revere wins | weather's clear, ground is dry, jockey getting over sprain, Epitaph also in race, Epitaph was recently bought by Gonzalez, race is on May 17, …)

Page 43:

News Flash! Hope arrives …

- So far: chain rule + backoff = directed graphical model = Bayesian network or Bayes net = locally normalized model
- We do have a good trick to help with this: the conditional log-linear model [look back at the smoothing lecture]
  - Solves the problems on the previous slide!
  - Computationally a bit harder to train: have to compute Z(x) for each condition x (see the sketch below)
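What "compute Z(x) for each condition x" means, as a minimal sketch; here the candidate set for each x is assumed small enough to enumerate, whereas for structured y one would use the inside or forward algorithm to compute Z(x). The scorer and labels are invented:

```python
import math

def p_conditional(x, y, candidates, score):
    """Conditional log-linear model: p(y | x) = exp(score(x, y)) / Z(x)."""
    Z = sum(math.exp(score(x, yp)) for yp in candidates(x))   # one Z per condition x
    return math.exp(score(x, y)) / Z

# Toy usage with a made-up scorer over two candidate labels.
candidates = lambda x: ["good mail", "spam"]
score = lambda x, y: 2.0 if (y == "spam" and "enlarge" in x.lower()) else 0.0
print(p_conditional("Enlarge your hidden ...", "spam", candidates, score))  # ~0.88
```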

Page 44:

Gradient-based training

- Gradually try to adjust θ in a direction that will improve the function we're trying to maximize
- So compute that function's partial derivatives with respect to the feature weights in θ: the gradient.
- General function maximization algorithms include gradient ascent, L-BFGS, simulated annealing …
- Here's how the key part works out (see below):
  - E as in EM … these feature expectations are just what forward-backward computes! Inside-outside too!
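The "key part" referred to above is presumably the standard gradient of the conditional log-linear log-likelihood (observed feature counts minus expected feature counts under the model), which is exactly what the slide's remarks about feature expectations and forward-backward / inside-outside describe:

```latex
\frac{\partial}{\partial \theta_k} \log p(y \mid x)
  \;=\; f_k(x, y) \;-\; \sum_{y'} p(y' \mid x)\, f_k(x, y')
  \;=\; \text{(observed count of feature } k\text{)} - \text{(expected count under the model)}
```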

Page 45:

600.465 - Intro to NLP - J. Eisner 45

Why Bother?

- Gives us probs, not just scores.
  - Can use 'em to bet, or combine w/ other probs.
- We can now learn weights from data!

Page 46:

News Flash! More hope …

- So far: chain rule + backoff = directed graphical model = Bayesian network or Bayes net = locally normalized model
- Also consider: Markov Random Field = undirected graphical model = log-linear model (globally normalized) = exponential model = maximum entropy model = Gibbs distribution

Page 47:

600.465 - Intro to NLP - J. Eisner 47

Maximum Entropy

- Suppose there are 10 classes, A through J. I don't give you any other information.
  - Question: Given message m, what is your guess for p(C | m)?
- Suppose I tell you that 55% of all messages are in class A.
  - Question: Now what is your guess for p(C | m)?
- Suppose I also tell you that 10% of all messages contain Buy, and 80% of these are in class A or C.
  - Question: Now what is your guess for p(C | m), if m contains Buy?
  - OUCH!

Page 48:

600.465 - Intro to NLP - J. Eisner 48

Maximum Entropy

          A      B      C      D      E      F      G      H      I      J
  Buy    .051  .0025  .029   .0025  .0025  .0025  .0025  .0025  .0025  .0025
  Other  .499  .0446  .0446  .0446  .0446  .0446  .0446  .0446  .0446  .0446

Column A sums to 0.55 ("55% of all messages are in class A")

Page 49:

600.465 - Intro to NLP - J. Eisner 49

Maximum Entropy

(Same table as on Page 48.)

- Column A sums to 0.55
- Row Buy sums to 0.1 ("10% of all messages contain Buy")

Page 50:

600.465 - Intro to NLP - J. Eisner 50

Maximum Entropy

(Same table as on Page 48.)

- Column A sums to 0.55
- Row Buy sums to 0.1
- The (Buy, A) and (Buy, C) cells sum to 0.08 ("80% of the 10%")

Given these constraints, fill in the cells "as equally as possible": maximize the entropy (related to cross-entropy, perplexity).

Entropy = −.051 log .051 − .0025 log .0025 − .029 log .029 − … It is largest if the probabilities are evenly distributed.

Page 51:

600.465 - Intro to NLP - J. Eisner 51

Maximum Entropy

(Same table as on Page 48.)

- Column A sums to 0.55
- Row Buy sums to 0.1
- The (Buy, A) and (Buy, C) cells sum to 0.08 ("80% of the 10%")

Given these constraints, fill in the cells "as equally as possible": maximize the entropy.

Now p(Buy, C) = .029 and p(C | Buy) = .29. We got a compromise: p(C | Buy) < p(A | Buy) < .55
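A quick check of those conditional probabilities, read off the table above by dividing joint cells by the Buy row total:

```python
p_buy_and_A, p_buy_and_C, p_buy = 0.051, 0.029, 0.10
print(p_buy_and_C / p_buy)  # ≈ 0.29: p(C | Buy)
print(p_buy_and_A / p_buy)  # ≈ 0.51: p(A | Buy), a compromise below the 0.55 prior for A
```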

Page 52:

600.465 - Intro to NLP - J. Eisner 52

Generalizing to More Features

[Table: the Buy/Other table from Page 48, extended with additional features such as "<$100" vs. "Other", so each message is described by more than one feature]

Page 53:

600.465 - Intro to NLP - J. Eisner 53

What we just did

- For each feature ("contains Buy"), see what fraction of the training data has it
- Many distributions p(c, m) would predict these fractions (including the unsmoothed one where all mass goes to feature combos we've actually seen)
- Of these, pick the distribution that has max entropy

Amazing Theorem: this distribution has the form p(m, c) = (1/Z(θ)) exp Σ_i θ_i f_i(m, c)

- So it is log-linear. In fact it is the same log-linear distribution that maximizes Π_j p(m_j, c_j) as before!
- Gives another motivation for our log-linear approach.

Page 54:

600.465 - Intro to NLP - J. Eisner 54

Overfitting

- If we have too many features, we can choose weights to model the training data perfectly.
- If we have a feature that only appears in spam training, not in ling training, it will get an arbitrarily large weight, pushing p(spam | feature) to 1.
- These behaviors overfit the training data, and will probably do poorly on test data.

Page 55:

600.465 - Intro to NLP - J. Eisner 55

Solutions to Overfitting

1. Throw out rare features.
   - Require every feature to occur > 4 times, and to occur at least once with each output class.
2. Only keep 1000 features.
   - Add one at a time, always greedily picking the one that most improves performance on held-out data.
3. Smooth the observed feature counts.
4. Smooth the weights by using a prior.
   - max p(θ | data) = max p(θ, data) = p(θ) p(data | θ)
   - decree p(θ) to be high when most weights are close to 0
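One common way to "decree p(θ) to be high when most weights are close to 0" (not spelled out on the slide) is an independent zero-mean Gaussian prior on each weight, which turns the training objective into a penalized log-likelihood:

```latex
\log p(\theta \mid \text{data})
  \;=\; \log p(\text{data} \mid \theta) \;-\; \sum_k \frac{\theta_k^2}{2\sigma^2} \;+\; \text{const}
```

so maximizing the posterior is just maximum likelihood with an L2 penalty that shrinks the weights toward 0.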