u.cs.biu.ac.il/~89-680/parsing-algorithms.pdf · title: parsing-algorithms


Parsing Algorithms
Yoav Goldberg

(with slides by Michael Collins, Julia Hockenmaier)

Parsing: recovering the constituents of a sentence.

Why is parsing hard? Ambiguity

Fat people eat candy
(S (NP (Adj Fat) (Nn people)) (VP (Vb eat) (NP (Nn candy))))

Fat people eat accumulates
(S (NP (Nn Fat) (AdjP (Nn people) (Vb eat))) (VP (Vb accumulates)))

11 / 48


Why is parsing hard? Real sentences are long. . .

“Former Beatle Paul McCartney today was ordered to pay nearly $50M to his estranged wife as their bitter divorce battle came to an end.”

“Welcome to our Columbus hotels guide, where you’ll find honest, concise hotel reviews, all discounts, a lowest rate guarantee, and no booking fees.”

12 / 48

Let’s learn how to parse

Context Free Grammars

A context free grammar G = (N, Σ, R, S) where:
- N is a set of non-terminal symbols
- Σ is a set of terminal symbols
- R is a set of rules of the form X → Y1 Y2 · · · Yn, for n ≥ 0, X ∈ N, Yi ∈ (N ∪ Σ)
- S ∈ N is a special start symbol

14 / 48

Context Free Grammars

a simple grammar
N = {S, NP, VP, Adj, Det, Vb, Noun}
Σ = {fruit, flies, like, a, banana, tomato, angry}
S = ‘S’
R =

S → NP VP
NP → Adj Noun
NP → Det Noun
VP → Vb NP
Adj → fruit
Noun → flies
Vb → like
Det → a
Noun → banana
Noun → tomato
Adj → angry

15 / 48
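This toy grammar can be written down directly as data. A minimal sketch in Python (the variable names are mine, not from the slides):

```python
# A CFG as plain data: RULES maps each non-terminal (LHS) to its
# possible right-hand sides; any symbol that never appears as a LHS
# is a terminal symbol.
RULES = {
    "S":    [["NP", "VP"]],
    "NP":   [["Adj", "Noun"], ["Det", "Noun"]],
    "VP":   [["Vb", "NP"]],
    "Adj":  [["fruit"], ["angry"]],
    "Noun": [["flies"], ["banana"], ["tomato"]],
    "Vb":   [["like"]],
    "Det":  [["a"]],
}
START = "S"                    # the start symbol S
NONTERMINALS = set(RULES)      # the set N
TERMINALS = {sym               # the set Sigma
             for rhss in RULES.values()
             for rhs in rhss
             for sym in rhs
             if sym not in RULES}

print(sorted(TERMINALS))
```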

Left-most derivations

Left-most derivation is a sequence of strings s1, · · · , sn where
- s1 = S, the start symbol
- sn ∈ Σ*, meaning sn is only terminal symbols
- Each si for i = 2 · · · n is derived from s(i-1) by picking the left-most non-terminal X in s(i-1) and replacing it by some β, where X → β is a rule in R.

For example: [S], [NP VP], [Adj Noun VP], [fruit Noun VP], [fruit flies VP], [fruit flies Vb NP], [fruit flies like NP], [fruit flies like Det Noun], [fruit flies like a Noun], [fruit flies like a banana]

16 / 48
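The rewrite step above can be sketched in a few lines of Python; `leftmost_step` is a hypothetical helper of mine, not something from the slides:

```python
# Replay a left-most derivation: at each step, rewrite the left-most
# non-terminal in the sentential form using the given rule.
def leftmost_step(sentential, lhs, rhs, nonterminals):
    for i, sym in enumerate(sentential):
        if sym in nonterminals:               # found the left-most non-terminal
            assert sym == lhs, "rule does not match the left-most non-terminal"
            return sentential[:i] + list(rhs) + sentential[i + 1:]
    raise ValueError("no non-terminal left to rewrite")

NT = {"S", "NP", "VP", "Adj", "Noun", "Vb", "Det"}
steps = [
    ("S", ["NP", "VP"]), ("NP", ["Adj", "Noun"]), ("Adj", ["fruit"]),
    ("Noun", ["flies"]), ("VP", ["Vb", "NP"]), ("Vb", ["like"]),
    ("NP", ["Det", "Noun"]), ("Det", ["a"]), ("Noun", ["banana"]),
]
s = ["S"]
for lhs, rhs in steps:
    s = leftmost_step(s, lhs, rhs, NT)
print(" ".join(s))  # fruit flies like a banana
```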

Left-most derivation example

S
NP VP
Adj Noun VP
fruit Noun VP
fruit flies VP
fruit flies Vb NP
fruit flies like NP
fruit flies like Det Noun
fruit flies like a Noun
fruit flies like a banana

Rules used, in order:
S → NP VP
NP → Adj Noun
Adj → fruit
Noun → flies
VP → Vb NP
Vb → like
NP → Det Noun
Det → a
Noun → banana

- The resulting derivation can be written as a tree.
- Many trees can be generated.

17 / 48


Context Free Grammars

a simple grammar
S → NP VP
NP → Adj Noun
NP → Det Noun
VP → Vb NP
Adj → fruit
Noun → flies
Vb → like
Det → a
Noun → banana
Noun → tomato
Adj → angry
. . .

Examples (the same grammar generates many different trees):
(S (NP (Adj Fruit) (Noun Flies)) (VP (Vb like) (NP (Det a) (Noun banana))))
(S (NP (Adj Angry) (Noun Flies)) (VP (Vb like) (NP (Det a) (Noun banana))))
(S (NP (Adj Angry) (Noun Flies)) (VP (Vb like) (NP (Det a) (Noun tomato))))
(S (NP (Adj Angry) (Noun banana)) (VP (Vb like) (NP (Det a) (Noun tomato))))
(S (NP (Det a) (Noun banana)) (VP (Vb like) (NP (Det a) (Noun tomato))))
(S (NP (Det a) (Noun banana)) (VP (Vb like) (NP (Adj angry) (Noun banana))))

18 / 48

Parsing with (P)CFGs

20 / 48

Parsing with CFGs

Let’s assume. . .
- Let’s assume natural language is generated by a CFG.
- . . . and let’s assume we have the grammar.
- Then parsing is easy: given a sentence, find the chain of derivations starting from S that generates it.

Problem: Natural language is NOT generated by a CFG.
Solution: We assume really hard that it is.

Problem: We don’t have the grammar.
Solution: We’ll ask a genius linguist to write it!

Problem: How do we find the chain of derivations?
Solution: With dynamic programming! (soon)

Problem: Real grammar: hundreds of possible derivations per sentence.
Solution: No problem! We’ll choose the best one. (sooner)

21 / 48

Obtaining a Grammar

Let a genius linguist write it
- Hard. Many rules, many complex interactions.
- Genius linguists don’t grow on trees!

An easier way: ask a linguist to grow trees
- Ask a linguist to annotate sentences with tree structure.
- (This need not be a genius; smart is enough.)
- Then extract the rules from the annotated trees.

Treebanks
- English Treebank: 40k sentences, manually annotated with tree structure.
- Hebrew Treebank: about 5k sentences.

22 / 48


Treebank Sentence Example

( (S
    (NP-SBJ
      (NP (NNP Pierre) (NNP Vinken) )
      (, ,)
      (ADJP
        (NP (CD 61) (NNS years) )
        (JJ old) )
      (, ,) )
    (VP (MD will)
      (VP (VB join)
        (NP (DT the) (NN board) )
        (PP-CLR (IN as)
          (NP (DT a) (JJ nonexecutive) (NN director) ))
        (NP-TMP (NNP Nov.) (CD 29) )))
    (. .) ))

23 / 48

Supervised Learning from a Treebank

((fruit/ADJ flies/NN) (like/VB (a/DET banana/NN)))
(time/NN (flies/VB (like/IN (an/DET arrow/NN))))
. . . . . . . . .

24 / 48

Extracting CFG from Trees
- The leaves of the trees define Σ
- The internal nodes of the trees define N
- Add a special S symbol on top of all trees
- Each node and its children is a rule in R

Extracting Rules
(S (NP (Adj Fruit) (Noun Flies)) (VP (Vb like) (NP (Det a) (Noun banana))))

S → NP VP
NP → Adj Noun
Adj → fruit
. . .

25 / 48
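The extraction procedure can be sketched in Python over trees encoded as nested tuples `(label, child1, child2, ...)`, where a leaf is `(label, word)`; this encoding is an assumption of the sketch, not the slides':

```python
# Walk a tree and collect one rule per internal node:
# the node label on the left, its children's labels on the right.
def extract_rules(tree, rules=None):
    if rules is None:
        rules = []
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        rules.append((label, (children[0],)))         # pre-terminal rule
    else:
        rules.append((label, tuple(c[0] for c in children)))
        for c in children:
            extract_rules(c, rules)
    return rules

tree = ("S",
        ("NP", ("Adj", "fruit"), ("Noun", "flies")),
        ("VP", ("Vb", "like"),
               ("NP", ("Det", "a"), ("Noun", "banana"))))
for lhs, rhs in extract_rules(tree):
    print(lhs, "->", " ".join(rhs))
```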


From CFG to PCFG
- English is NOT generated by a CFG ⇒ it’s generated by a PCFG!
- PCFG: probabilistic context free grammar. Just like a CFG, but each rule has an associated probability.
- All probabilities for the same LHS sum to 1.
- Multiplying all the rule probabilities in a derivation gives the probability of the derivation.
- We want the tree with maximum probability.

More Formally

P(tree, sent) = ∏_{(l → r) ∈ deriv(tree)} p(l → r)

tree* = argmax_{tree ∈ trees(sent)} P(tree | sent) = argmax_{tree ∈ trees(sent)} P(tree, sent)

26 / 48


PCFG Example

a simple PCFG
1.0 S → NP VP
0.3 NP → Adj Noun
0.7 NP → Det Noun
1.0 VP → Vb NP
0.2 Adj → fruit
0.2 Noun → flies
1.0 Vb → like
1.0 Det → a
0.4 Noun → banana
0.4 Noun → tomato
0.8 Adj → angry

Example
(S (NP (Adj Fruit) (Noun Flies)) (VP (Vb like) (NP (Det a) (Noun banana))))

1 × 0.3 × 0.2 × 0.7 × 1.0 × 0.2 × 1 × 1 × 0.4 = 0.00336

27 / 48
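The product above is easy to check directly; a small Python sketch using the slide's probabilities:

```python
from math import prod

# q maps each rule to its probability (values from the slide's PCFG).
q = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Adj", "Noun")): 0.3,
    ("NP", ("Det", "Noun")): 0.7,
    ("VP", ("Vb", "NP")): 1.0,
    ("Adj", ("fruit",)): 0.2,
    ("Noun", ("flies",)): 0.2,
    ("Vb", ("like",)): 1.0,
    ("Det", ("a",)): 1.0,
    ("Noun", ("banana",)): 0.4,
}
# Rules used in the derivation of "fruit flies like a banana".
derivation = [
    ("S", ("NP", "VP")), ("NP", ("Adj", "Noun")), ("Adj", ("fruit",)),
    ("Noun", ("flies",)), ("VP", ("Vb", "NP")), ("Vb", ("like",)),
    ("NP", ("Det", "Noun")), ("Det", ("a",)), ("Noun", ("banana",)),
]
p = prod(q[rule] for rule in derivation)
print(round(p, 5))  # 0.00336
```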


Parsing with PCFG
- Parsing with a PCFG is finding the most probable derivation for a given sentence.
- This can be done quite efficiently with dynamic programming (the CKY algorithm).

Obtaining the probabilities
- We estimate them from the Treebank.
- P(LHS → RHS) = count(LHS → RHS) / count(LHS)
- We can also add smoothing and backoff, as before.
- Dealing with unknown words: like in the HMM.

28 / 48
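The estimator is a one-liner over rule counts; the counts below are made up for illustration:

```python
from collections import Counter

# Hypothetical rule counts extracted from a treebank.
counts = Counter({
    ("NP", ("Det", "Noun")): 70,
    ("NP", ("Adj", "Noun")): 30,
    ("VP", ("Vb", "NP")): 50,
})

# Total count of each LHS, summed over all of its expansions.
lhs_totals = Counter()
for (lhs, _rhs), c in counts.items():
    lhs_totals[lhs] += c

# Maximum-likelihood estimate: count(LHS -> RHS) / count(LHS).
probs = {rule: c / lhs_totals[rule[0]] for rule, c in counts.items()}
print(probs[("NP", ("Det", "Noun"))])  # 0.7
```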


The CKY algorithm

29 / 48

The Problem

Input
- Sentence (a list of words); n is the sentence length
- CFG Grammar (with weights on rules); g is the number of non-terminal symbols

Output
- A parse tree / the best parse tree

But. . .
- Exponentially many possible parse trees!

Solution
- Dynamic Programming!

30 / 48

CKY

Cocke (196?), Kasami (1965), Younger (1967)

31 / 48

3 Interesting Problems
- Recognition: Can this string be generated by the grammar?
- Parsing: Show me a possible derivation. . .
- Disambiguation: Show me THE BEST derivation.

CKY can do all of these in polynomial time
- For any CNF grammar

32 / 48


CNF (Chomsky Normal Form)

Definition
A CFG is in CNF form if it only has rules like:
- A → B C
- A → α

where A, B, C are non-terminal symbols and α is a terminal symbol (a word. . . ).
- All terminal symbols are RHS of unary rules
- All non-terminal symbols are RHS of binary rules

CKY can be easily extended to handle also unary rules: A → B

33 / 48

Binarization

Fact
- Any CFG grammar can be converted to CNF form

Specifically for Natural Language grammars
- We already have A → α
- (A → α β is also easy to handle)
- Unary rules (A → B) are OK
- Only problem: S → NP PP VP PP

Binarization
S → NP NP|PP.VP.PP
NP|PP.VP.PP → PP NP.PP|VP.PP
NP.PP|VP.PP → VP NP.PP.VP|PP

34 / 48
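The chaining trick can be sketched as a small function; the intermediate symbols are named after the remaining tail of the rule, though the exact naming scheme here is my own, not the slide's:

```python
# Binarize a long rule A -> Y1 Y2 ... Yn by introducing a fresh
# intermediate symbol for each remaining tail of the right-hand side.
def binarize(lhs, rhs):
    rules = []
    while len(rhs) > 2:
        rest = "|".join(rhs[1:])           # fresh symbol for the tail
        rules.append((lhs, (rhs[0], rest)))
        lhs, rhs = rest, rhs[1:]
    rules.append((lhs, tuple(rhs)))
    return rules

for lhs, rhs in binarize("S", ("NP", "PP", "VP", "PP")):
    print(lhs, "->", " ".join(rhs))
```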


Finally, CKY

Recognition
- Main idea:
  - Build the parse tree from the bottom up
  - Combine built trees to form bigger trees using grammar rules
  - When left with a single tree, verify the root is S
- Exponentially many possible trees. . .
  - Search over all of them in polynomial time using DP
  - Shared structure: smaller trees

35 / 48

Main Idea

If we know:
- wi . . . wj is an NP
- wj+1 . . . wk is a VP

and the grammar has the rule:
- S → NP VP

then we know:
- S can derive wi . . . wk

36 / 48

Data Structure
- (Half a) two-dimensional array (n x n), drawn on its side.
- Each cell: all non-terminals that can derive word i to word j.
- Imagine each cell as a g-dimensional array.

Sue saw her girl with a telescope

37-38 / 48

Filling the table

Sue saw her girl with a telescope

39 / 48

Handling Unary rules?

Sue saw her girl with a telescope

40 / 48

Which Order?

Sue saw her boy with a telescope

41 / 48

Complexity?
- n²g cells to fill
- g²n ways to fill each one

O(g³n³)

42 / 48
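The whole recognizer fits in a few lines; a sketch assuming the grammar is already in CNF, split into binary rules and pre-terminal (lexical) rules:

```python
from collections import defaultdict

# CKY recognition for a CNF grammar: chart[i, j] holds every
# non-terminal that derives words i..j (inclusive, 0-based).
def cky_recognize(words, binary, lexical, start="S"):
    n = len(words)
    chart = defaultdict(set)
    for i, w in enumerate(words):                  # width-1 spans
        for a, alpha in lexical:
            if alpha == w:
                chart[i, i].add(a)
    for width in range(2, n + 1):                  # wider spans, bottom-up
        for i in range(n - width + 1):
            j = i + width - 1
            for k in range(i, j):                  # split point
                for a, (b, c) in binary:
                    if b in chart[i, k] and c in chart[k + 1, j]:
                        chart[i, j].add(a)
    return start in chart[0, n - 1]

binary = [("S", ("NP", "VP")), ("NP", ("Adj", "Noun")),
          ("NP", ("Det", "Noun")), ("VP", ("Vb", "NP"))]
lexical = [("Adj", "fruit"), ("Noun", "flies"), ("Vb", "like"),
           ("Det", "a"), ("Noun", "banana")]
print(cky_recognize("fruit flies like a banana".split(), binary, lexical))
# True
```

The triple loop over spans, split points, and rules is exactly where the O(g³n³) bound comes from.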


A Note on Implementation

Smart implementation can reduce the runtime. The worst case is still O(g³n³), but it helps in practice:
- No need to check all grammar rules A → B C at each location:
  - only those compatible with the B or C of the current split
  - prune binarized symbols which are too long for the current position
  - once you have found one way to derive A, you can break out of the loop
  - order grammar rules from frequent to infrequent
- Need both efficient random access and iteration over possible symbols
  - Keep both a hash and a list, implemented as arrays

43 / 48

Finding a parse
Parsing: we want to actually find a parse tree.

Easy: also keep a possible split point for each NT.

44 / 48

PCFG Parsing and Disambiguation
Disambiguation: we want THE BEST parse tree.

Easy: for each NT, keep the best split point, and its score.

45 / 48

Implementation Tricks #1: sum instead of product

As in the HMM, multiplying probabilities is evil:
- keeping the product of many floating point numbers is dangerous, because the product gets really small
- either the numbers grow at runtime
- or you lose precision (underflowing to 0)
- either way, multiplying floats is expensive

Solution: use a sum of logs instead
- remember: log(p1 × p2) = log(p1) + log(p2)
⇒ use log probabilities instead of probabilities
⇒ add instead of multiply

46 / 48
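The underflow problem and the log-sum fix are easy to see numerically:

```python
import math

# Multiplying many small probabilities underflows to exactly 0.0 ...
probs = [1e-5] * 80
product = 1.0
for p in probs:
    product *= p
print(product)        # 0.0

# ... while the sum of log-probabilities stays comfortably representable.
log_total = sum(math.log(p) for p in probs)
print(log_total)      # about -921.03
```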

The big question

Does this work?

8 / 1

Evaluation

9 / 1

Parsing Evaluation
- Let’s assume we have a parser; how do we know how good it is?
  ⇒ Compare output trees to gold trees.
- But how do we compare trees?
- Credit of 1 if the tree is correct and 0 otherwise is too harsh.
- Represent each tree as a set of labeled spans:
  - NP from word 1 to word 5.
  - VP from word 3 to word 4.
  - S from word 1 to word 23.
  - . . .
- Measure Precision, Recall and F1 over these spans, as in the segmentation case.

10 / 1


Evaluation: Representing Trees as Constituents

(S (NP (DT the) (NN lawyer)) (VP (Vt questioned) (NP (DT the) (NN witness))))

Label  Start Point  End Point
NP     1            2
NP     4            5
VP     3            5
S      1            5

Precision and Recall

Gold standard:
Label  Start Point  End Point
NP     1            2
NP     4            5
NP     4            8
PP     6            8
NP     7            8
VP     3            8
S      1            8

Parse output:
Label  Start Point  End Point
NP     1            2
NP     4            5
PP     6            8
NP     7            8
VP     3            8
S      1            8

- G = number of constituents in gold standard = 7
- P = number in parse output = 6
- C = number correct = 6

Recall = 100% × C/G = 100% × 6/7
Precision = 100% × C/P = 100% × 6/6

(by Mike Collins)
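These numbers can be reproduced with set operations over labeled spans:

```python
# Labeled spans (label, start, end) from the slide's gold and parse trees.
gold = {("NP", 1, 2), ("NP", 4, 5), ("NP", 4, 8), ("PP", 6, 8),
        ("NP", 7, 8), ("VP", 3, 8), ("S", 1, 8)}
pred = {("NP", 1, 2), ("NP", 4, 5), ("PP", 6, 8),
        ("NP", 7, 8), ("VP", 3, 8), ("S", 1, 8)}

correct = len(gold & pred)          # C = 6
recall = correct / len(gold)        # C / G = 6/7
precision = correct / len(pred)     # C / P = 6/6
f1 = 2 * precision * recall / (precision + recall)
print(round(recall, 3), round(precision, 3), round(f1, 3))
```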

Parsing Evaluation
- Is this a good measure?
- Why? Why not?

11 / 1


Parsing Evaluation

How well does the PCFG parser we learned do?

Not very well: about 73% F1 score.

12 / 1


Problems with PCFGs

13 / 1


Weaknesses of Probabilistic Context-Free Grammars

Michael Collins, Columbia University


Weaknesses of PCFGs
- Lack of sensitivity to lexical information
- Lack of sensitivity to structural frequencies

(S (NP (NNP IBM)) (VP (Vt bought) (NP (NNP Lotus))))

p(t) = q(S → NP VP) × q(NP → NNP) × q(NNP → IBM) × q(VP → Vt NP) × q(Vt → bought) × q(NP → NNP) × q(NNP → Lotus)

Another Case of PP Attachment Ambiguity

(a) (S (NP (NNS workers)) (VP (VP (VBD dumped) (NP (NNS sacks))) (PP (IN into) (NP (DT a) (NN bin)))))

(b) (S (NP (NNS workers)) (VP (VBD dumped) (NP (NP (NNS sacks)) (PP (IN into) (NP (DT a) (NN bin))))))

(a) Rules: S → NP VP, NP → NNS, VP → VP PP, VP → VBD NP, NP → NNS, PP → IN NP, NP → DT NN, NNS → workers, VBD → dumped, NNS → sacks, IN → into, DT → a, NN → bin

(b) Rules: S → NP VP, NP → NNS, NP → NP PP, VP → VBD NP, NP → NNS, PP → IN NP, NP → DT NN, NNS → workers, VBD → dumped, NNS → sacks, IN → into, DT → a, NN → bin

If q(NP → NP PP) > q(VP → VP PP) then (b) is more probable, else (a) is more probable. The attachment decision is completely independent of the words.

A Case of Coordination Ambiguity

(a) (NP (NP (NP (NNS dogs)) (PP (IN in) (NP (NNS houses)))) (CC and) (NP (NNS cats)))

(b) (NP (NP (NNS dogs)) (PP (IN in) (NP (NP (NNS houses)) (CC and) (NP (NNS cats)))))

(a) Rules: NP → NP CC NP, NP → NP PP, NP → NNS, PP → IN NP, NP → NNS, NP → NNS, NNS → dogs, IN → in, NNS → houses, CC → and, NNS → cats

(b) Rules: the same list.

Here the two parses have identical rules, and therefore have identical probability under any assignment of PCFG rule probabilities.

Structural Preferences: Close Attachment

(a) (NP (NP (NN)) (PP (IN) (NP (NP (NN)) (PP (IN) (NP (NN))))))

(b) (NP (NP (NP (NN)) (PP (IN) (NP (NN)))) (PP (IN) (NP (NN))))

- Example: president of a company in Africa
- Both parses have the same rules, and therefore receive the same probability under a PCFG
- “Close attachment” (structure (a)) is twice as likely in Wall Street Journal text.

Lexicalized PCFGs

PCFG Problem 1
Lack of sensitivity to lexical information (words)

Solution
- Make the PCFG aware of words (lexicalized PCFG)
- Main Idea: Head Words

14 / 1

Head Words

Each constituent has one word which captures its “essence”.
- (S John saw the young boy with the large hat)
- (VP saw the young boy with the large hat)
- (NP the young boy with the large hat)
- (NP the large hat)
- (PP with the large hat)

- hat is the “semantic head”
- with is the “functional head”
- (it is common to choose the functional head)

15 / 1

Heads in Context-Free Rules

Add annotations specifying the “head” of each rule:

S → NP VP
VP → Vi
VP → Vt NP
VP → VP PP
NP → DT NN
NP → NP PP
PP → IN NP

Vi → sleeps
Vt → saw
NN → man
NN → woman
NN → telescope
DT → the
IN → with
IN → in

More about Heads

- Each context-free rule has one “special” child that is the head of the rule, e.g.,

S → NP VP (VP is the head)
VP → Vt NP (Vt is the head)
NP → DT NN (NN is the head)

- A core idea in syntax (e.g., see X-bar Theory, Head-Driven Phrase Structure Grammar)
- Some intuitions:
  - The central sub-constituent of each rule.
  - The semantic predicate in each rule.

Rules which Recover Heads: An Example for NPs

If the rule contains NN, NNS, or NNP: choose the rightmost NN, NNS, or NNP

Else if the rule contains an NP: choose the leftmost NP

Else if the rule contains a JJ: choose the rightmost JJ

Else if the rule contains a CD: choose the rightmost CD

Else: choose the rightmost child

e.g.,
NP → DT NNP NN
NP → DT NN NNP
NP → NP PP
NP → DT JJ
NP → DT
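The priority list above can be sketched as a small head-finding function. This is a simplified illustration (the function name and tag sets are ours; real head tables, e.g. Collins', have many more clauses):

```python
def np_head(children):
    """Return the index of the head child of an NP rule.
    `children` is the rule's right-hand side as a list of tags."""
    for tags, side in [({"NN", "NNS", "NNP"}, "right"),  # rightmost noun
                       ({"NP"}, "left"),                  # else leftmost NP
                       ({"JJ"}, "right"),                 # else rightmost JJ
                       ({"CD"}, "right")]:                # else rightmost CD
        matches = [i for i, c in enumerate(children) if c in tags]
        if matches:
            return matches[0] if side == "left" else matches[-1]
    return len(children) - 1  # default: rightmost child
```

Running it on the example rules above picks NN in "DT NNP NN", NNP in "DT NN NNP", the first NP in "NP PP", JJ in "DT JJ", and DT in "DT".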

Rules which Recover Heads: An Example for VPs

If the rule contains Vi or Vt: choose the leftmost Vi or Vt

Else if the rule contains a VP: choose the leftmost VP

Else: choose the leftmost child

e.g.,
VP → Vt NP
VP → VP PP

Adding Headwords to Trees

[S [NP [DT the] [NN lawyer]]
   [VP [Vt questioned]
       [NP [DT the] [NN witness]]]]

⇓

[S(questioned)
   [NP(lawyer) [DT(the) the] [NN(lawyer) lawyer]]
   [VP(questioned)
       [Vt(questioned) questioned]
       [NP(witness) [DT(the) the] [NN(witness) witness]]]]

Adding Headwords to Trees (Continued)

[S(questioned)
   [NP(lawyer) [DT(the) the] [NN(lawyer) lawyer]]
   [VP(questioned)
       [Vt(questioned) questioned]
       [NP(witness) [DT(the) the] [NN(witness) witness]]]]

- A constituent receives its headword from its head child.

S → NP VP (S receives headword from VP)
VP → Vt NP (VP receives headword from Vt)
NP → DT NN (NP receives headword from NN)

We can parse a lexicalized grammar in O(n^5) time [how?]

Dependency Representation

[S(questioned)
   [NP(lawyer) [DT(the) the] [NN(lawyer) lawyer]]
   [VP(questioned)
       [Vt(questioned) questioned]
       [NP(witness) [DT(the) the] [NN(witness) witness]]]]

Dependency representation is very common. We will return to it in the future.

18 / 1

Dependency Representation

questioned → lawyer
lawyer → the
questioned → witness
witness → the

Dependency representation is very common. We will return to it in the future.

18 / 1

Dependency Parsing

21 / 1

Evaluation Measures

- UAS (Unlabeled Attachment Score): % of words with the correct head. 90-94 (Eng, WSJ)
- LAS (Labeled Attachment Score): % of words with the correct head and label. 87-92 (Eng, WSJ)
- Root: % of sentences with the correct root. ~90 (Eng, WSJ)
- Exact: % of sentences with the exact correct structure. 40-50 (Eng, WSJ)

22 / 1
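UAS and LAS reduce to per-word comparisons against the gold parse. A minimal sketch (the function name and (head, label) representation are ours):

```python
def attachment_scores(gold, pred):
    """gold, pred: lists of (head, label) per word, aligned by position.
    Returns (UAS, LAS) as fractions in [0, 1]."""
    assert len(gold) == len(pred)
    n = len(gold)
    uas = sum(gh == ph for (gh, _), (ph, _) in zip(gold, pred)) / n
    las = sum(g == p for g, p in zip(gold, pred)) / n
    return uas, las
```

For example, a parse that gets every head right but one label wrong scores UAS = 1.0 and LAS < 1.0.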

Three main approaches to Dependency Parsing

Conversion
- Parse to constituency structure.
- Extract dependencies from the trees.

Global Optimization (Graph based)
- Define a scoring function over <sentence, tree> pairs.
- Search for the best-scoring structure.
- Simpler scoring ⇒ easier search.
- (Similar to how we do tagging, constituency parsing.)

Greedy decoding (Transition based)
- Start with an unparsed sentence.
- Apply locally-optimal actions until the sentence is parsed.

23 / 1

argmax over combinatorial space

while (!done) { do best thing }

Graph-based parsing (Global Search)

24 / 1

Dependency Parsing

Alexander Rush (srush@csail.mit.edu)

NYU CS 3033

Arcs

Dependency parsing is concerned with head-modifier relationships.

Definitions:

- head: the main word in a phrase
- modifier: an auxiliary word in a phrase

Meaning depends on underlying linguistic formalism.

Common to use head → modifier arc notation

* Millions on the coast face freak storm

Input Notation

Input:

- x = (w, t)
- w_1 … w_n: the words of the sentence
- t_1 … t_n: the tags of the sentence
- Special symbol w_0 = *: the pseudo-root

Note: Unlike in CFG parsing, we assume tags are given.

Output Notation

Output:

I set of possible dependency arcs

A = {(h,m) : h 2 {0 . . . n},m 2 {1 . . . n}}

I Y ⇢ {0, 1}|A|; set of all valid dependency parses

I y 2 Y; a valid dependency parse

Example

* Millions/N on/P the/D coast/N face/V freak/A storm/N

- w_0 = *, w_1 = Millions, w_2 = on, w_3 = the, …
- t_0 = *, t_1 = N, t_2 = P, t_3 = D, …
- y(0, 5) = 1, y(5, 1) = 1, y(1, 2) = 1, …

Forbidden Structures

- Each (non-root) word must modify exactly one word.

* Millions on the coast face freak storm

- Arcs must form a tree.

* Millions on the coast face freak storm

- (Projective) Arcs may not cross each other.

* Millions on the coast face freak storm
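The constraints above can be checked mechanically. A sketch (function names are ours; the arcs used in the test come from the scoring example later in these slides):

```python
def is_valid_tree(n, arcs):
    """Check that a set of (head, modifier) arcs over words 1..n (0 = root)
    forms a dependency tree: every word has exactly one head, and every
    word reaches the root by following head pointers (so no cycles)."""
    heads = {}
    for h, m in arcs:
        if m in heads:          # a word with two heads is forbidden
            return False
        heads[m] = h
    if set(heads) != set(range(1, n + 1)):
        return False            # some word has no head
    for m in range(1, n + 1):   # follow heads upward; a cycle never hits 0
        seen, cur = set(), m
        while cur != 0:
            if cur in seen:
                return False
            seen.add(cur)
            cur = heads[cur]
    return True

def is_projective(n, arcs):
    """Projectivity: no two arcs may cross."""
    spans = [(min(h, m), max(h, m)) for h, m in arcs]
    for a, b in spans:
        for c, d in spans:
            if a < c < b < d:
                return False
    return True
```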

Main Idea

- Define a scoring function g(y; x, θ)
- This function will tell us, for every x (sentence) and y (tree) pair, how good the pair is.
- θ are the parameters, or weights (we called them w before)
- For example: g(y; x, θ) = Σ_i φ_i(x, y) θ_i = φ(x, y) · θ
- (a linear model)
- Look for the best y for a given sentence: argmax_y g(y; x, θ)

25 / 1

What is a good dependency parse?

y* = argmax_{y ∈ Y} g(y; x, θ)

Method:

- Define features for this problem.
- Learn parameters θ from corpus data.
- Maximize the objective to find the best parse y*.

First-order Scoring Function

The scoring function g(y; x, θ) is the sum of first-order arc scores.

* Millions on the coast face freak storm

g(y; x, θ) = score(coast → the)
           + score(on → coast)
           + score(Millions → on)
           + score(face → Millions)
           + score(face → storm)
           + score(storm → freak)
           + score(* → face)
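Because the score decomposes over arcs, for a toy sentence we can even find the argmax tree by brute-force enumeration. This is purely illustrative (exponential in sentence length; function names are ours; real parsers use the MST or dynamic-programming algorithms covered later):

```python
from itertools import product

def tree_score(arcs, score):
    """First-order model: g(y; x, theta) is the sum of the arc scores."""
    return sum(score.get((h, m), 0.0) for h, m in arcs)

def brute_force_parse(n, score):
    """Enumerate every head assignment for words 1..n (0 = root) and
    keep the best valid tree. For illustration on tiny sentences only."""
    best, best_arcs = float("-inf"), None
    for heads in product(range(n + 1), repeat=n):
        if any(h == m + 1 for m, h in enumerate(heads)):
            continue                      # no self-loops
        ok = True                         # every word must reach the root
        for m in range(1, n + 1):
            seen, cur = set(), m
            while cur != 0 and cur not in seen:
                seen.add(cur)
                cur = heads[cur - 1]
            ok = ok and cur == 0
        if not ok:
            continue
        arcs = {(h, m + 1) for m, h in enumerate(heads)}
        s = tree_score(arcs, score)
        if s > best:
            best, best_arcs = s, arcs
    return best, best_arcs
```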

Conditional Model (e.g. CRF)

Define:

score(w_h → w_m) = φ(x, ⟨h, m⟩) · θ

where:

- φ(x, ⟨h, m⟩) : X × A → {0, 1}^p: a feature function
- θ ∈ R^p: a parameter vector (assume given)
- p: the number of features
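Since φ is a binary vector, the dot product reduces to summing the weights of the features that fire. A minimal sketch (representing φ sparsely as a set of feature names; the names are illustrative):

```python
def arc_score(firing_features, theta):
    """score(w_h -> w_m) = phi(x, <h,m>) . theta, with the binary vector
    phi represented sparsely as the set of feature names with value 1."""
    return sum(theta.get(f, 0.0) for f in firing_features)
```

Features absent from θ contribute 0, which is how a sparse parameter dictionary stands in for a p-dimensional vector.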

Feature-based Discriminative Model

Features

- Features are critical for dependency parsing performance.
- Specified as a vector of indicators.

φ_NAME(⟨t, w⟩, ⟨h, m⟩) = 1 if t_m = u, 0 otherwise

- Each feature has a corresponding real-valued weight.

θ_NAME = 9.23

Features: Tags

∀u ∈ T:    φ_TAG:M:u(⟨t, w⟩, ⟨h, m⟩) = 1 if t_m = u, 0 otherwise

∀u ∈ T:    φ_TAG:H:u(⟨t, w⟩, ⟨h, m⟩) = 1 if t_h = u, 0 otherwise

∀u, v ∈ T: φ_TAG:H:M:u:v(⟨t, w⟩, ⟨h, m⟩) = 1 if t_h = u and t_m = v, 0 otherwise

* Millions/N on/P the/D coast/N face/V freak/A storm/N

Features: Words

∀u ∈ W:    φ_WORD:M:u(⟨t, w⟩, ⟨h, m⟩) = 1 if w_m = u, 0 otherwise

∀u ∈ W:    φ_WORD:H:u(⟨t, w⟩, ⟨h, m⟩) = 1 if w_h = u, 0 otherwise

∀u, v ∈ W: φ_WORD:H:M:u:v(⟨t, w⟩, ⟨h, m⟩) = 1 if w_h = u and w_m = v, 0 otherwise

* Millions/N on/P the/D coast/N face/V freak/A storm/N

Features: Context Tags

∀u ∈ T^4: φ_CON:-1:-1:u(⟨t, w⟩, ⟨h, m⟩) = 1 if t_{h-1} = u_1 and t_h = u_2 and t_{m-1} = u_3 and t_m = u_4, 0 otherwise

∀u ∈ T^4: φ_CON:+1:-1:u(⟨t, w⟩, ⟨h, m⟩) = 1 if t_{h+1} = u_1 and t_h = u_2 and t_{m-1} = u_3 and t_m = u_4, 0 otherwise

* Millions/N on/P the/D coast/N face/V freak/A storm/N

Features: Between Tags

∀u ∈ T: φ_BET:u(⟨t, w⟩, ⟨h, m⟩) = 1 if t_i = u for some i between h and m, 0 otherwise

* Millions/N on/P the/D coast/N face/V freak/A storm/N

Features: Direction

φ_RIGHT(⟨t, w⟩, ⟨h, m⟩) = 1 if h > m, 0 otherwise

φ_LEFT(⟨t, w⟩, ⟨h, m⟩) = 1 if h < m, 0 otherwise

* Millions/N on/P the/D coast/N face/V freak/A storm/N

Features: Length

φ_LEN:2(⟨t, w⟩, ⟨h, m⟩) = 1 if |h − m| > 2, 0 otherwise

φ_LEN:5(⟨t, w⟩, ⟨h, m⟩) = 1 if |h − m| > 5, 0 otherwise

φ_LEN:10(⟨t, w⟩, ⟨h, m⟩) = 1 if |h − m| > 10, 0 otherwise

* Millions/N on/P the/D coast/N face/V freak/A storm/N
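Putting the tag, word, direction and length templates together, a first-order feature extractor can be sketched as follows (feature names follow the slides loosely and are illustrative; a real extractor adds context and backoff templates too):

```python
def arc_features(words, tags, h, m):
    """Indicator features that fire for the arc h -> m in sentence
    (words, tags), with position 0 the pseudo-root."""
    feats = [f"TAG:H:{tags[h]}", f"TAG:M:{tags[m]}",
             f"TAG:H:M:{tags[h]}:{tags[m]}",
             f"WORD:H:{words[h]}", f"WORD:M:{words[m]}",
             f"WORD:H:M:{words[h]}:{words[m]}",
             "RIGHT" if h > m else "LEFT"]   # direction, as defined above
    for k in (2, 5, 10):                      # binned arc length
        if abs(h - m) > k:
            feats.append(f"LEN:{k}")
    return feats
```

On the running example, the arc face → Millions fires TAG:H:V, TAG:M:N, the lexical features for face/Millions, RIGHT, and LEN:2 (the arc spans 4 words).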

Features: Backoffs and Combinations

- Additionally include backoff features.

∀u ∈ T^3: φ_CON:-1:u(⟨t, w⟩, ⟨h, m⟩) = 1 if t_{h-1} = u_1 and t_h = u_2 and t_m = u_3, 0 otherwise

- As well as combination features.

∀u ∈ W: φ_LEN:2:DIR:LEFT:TAG:M:u(⟨t, w⟩, ⟨h, m⟩) = 1 if all of the named conditions hold, 0 otherwise

First-Order Results

Model                    Accuracy
NoPOSContextBetween      86.0
NoEdge                   87.3
NoAttachmentOrDistance   88.1
NoBiLex                  90.6
Full                     90.7

From McDonald (2006)

What’s left

- Define features for this problem.
- Learn parameters θ from corpus data.
- Maximize the objective to find the best parse y*.

Downside: Higher-order models make inference more difficult.

y* = argmax_{y ∈ Y} g(y; x, θ)

Parsing

Goal: Finding the best parse.

y* = argmax_{y ∈ Y} g(y; x, θ)

Graph Algorithms

Algorithm 2: Use graph algorithms for parsing.

Find the maximum directed spanning tree.

- Chu-Liu-Edmonds Algorithm: O(n^3)
- Tarjan's Extension: O(n^2)

Maximum Directed Spanning Tree Algorithm

Issues with MST Algorithm

- Allows non-projective parses.

* Millions on the coast face freak storm

- Good for some languages.
- Cannot incorporate higher-order parts.
- The problem becomes NP-hard.

Dynamic Programming for Parsing

Algorithm 3: Use a specialized dynamic programming algorithm.

- The Eisner algorithm (1996) for bilexical parsing.
- Uses the split-head trick: handle left and right dependencies separately.

Dependency Parsing New Example

* As McGwire neared , fans went wild

Base Case

* As McGwire neared , fans went wild

Dependency Parsing Algorithm - First-order Model

[Figure: Eisner chart items. A complete item over (h, r) combined with a complete item over (r + 1, m) yields an incomplete item over (h, m); an incomplete item over (h, m) combined with a complete item over (m, e) yields a complete item over (h, e).]

Parsing

* As McGwire neared , fans went wild

Algorithm Key

- L: left-facing item
- R: right-facing item
- C: completed item (triangle)
- I: incomplete item (trapezoid)

Algorithm

Initialize:
for i in 0 … n:
    π[C, L, i, i] = 0
    π[C, R, i, i] = 0
    π[I, L, i, i] = 0
    π[I, R, i, i] = 0

Inner loop:
for k in 1 … n:
    for s in 0 … n:
        t = k + s
        if t > n then break
        π[I, L, s, t] = max_{r ∈ s…t−1} π[C, R, s, r] + π[C, L, r+1, t] + score(t → s)
        π[I, R, s, t] = max_{r ∈ s…t−1} π[C, R, s, r] + π[C, L, r+1, t] + score(s → t)
        π[C, L, s, t] = max_{r ∈ s…t−1} π[C, L, s, r] + π[I, L, r, t]
        π[C, R, s, t] = max_{r ∈ s+1…t} π[I, R, s, r] + π[C, R, r, t]

return π[C, R, 0, n]
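The recurrences above translate directly into a runnable sketch. This is our own rendering, assuming a `score(h, m)` function giving the arc score for head h → modifier m; incomplete items add the score of the arc they introduce:

```python
def eisner(score, n):
    """First-order projective parsing (Eisner, 1996). Positions are 0..n
    with 0 the root. Returns the score of the best projective tree."""
    NEG = float("-inf")

    def chart():  # chart[dir][s][t]; dir 0 = left-facing, 1 = right-facing
        return [[[0.0 if s == t else NEG for t in range(n + 1)]
                 for s in range(n + 1)] for _ in range(2)]

    C, I = chart(), chart()               # complete and incomplete items
    for k in range(1, n + 1):             # span length
        for s in range(0, n + 1 - k):
            t = s + k
            # incomplete: join two complete halves and add one arc
            best = max(C[1][s][r] + C[0][r + 1][t] for r in range(s, t))
            I[0][s][t] = best + score(t, s)   # left arc: t heads s
            I[1][s][t] = best + score(s, t)   # right arc: s heads t
            # complete: absorb a finished half-constituent
            C[0][s][t] = max(C[0][s][r] + I[0][r][t] for r in range(s, t))
            C[1][s][t] = max(I[1][s][r] + C[1][r][t]
                             for r in range(s + 1, t + 1))
    return C[1][0][n]
```

Each of the O(n^2) cells takes an O(n) max over split points, giving the O(n^3) total mentioned in these slides.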

Graph-based parsing algorithm

- Begin with a tagged sentence (can use a POS-tagger)
- Extract a set of “parts”
  - For a first-order model, each part is an (h, m) pair (O(n^2) parts)
  - For a second-order model, each part is an (h, m1, m2) tuple (O(n^3) parts)
- Calculate a score for each part (using feature-extractor φ and parameters θ)
- Find a valid parse tree that is composed of the best parts:
  - using Chu-Liu-Edmonds (for first-order non-projective) (O(n^2))
  - using a dynamic-programming algorithm (for first- and second-order projective) (O(n^3))

Does this remind you of anything?

26 / 1

Training - setting values for ✓

27 / 1

Note: we need values such that g(y; x, θ) of the gold tree y is larger than g(y′; x, θ) for all other trees y′.

28 / 1

Perceptron Sketch: Part 1

- (x_1, y_1) … (x_n, y_n): training data
- Gold features: Σ_{a ∈ A : y(a) = 1} φ(x_i, a)

Idea: Increase the value (in θ) of gold features.

Perceptron Sketch: Part 2

- Best-scoring structure: z_i = argmax_{z ∈ Y} g(z; x, θ)
- Best-scoring structure features: Σ_{a ∈ A : z(a) = 1} φ(x_i, a)

Idea: Decrease the value (in θ) of wrong best-scoring features.

Perceptron Algorithm

θ ← 0
for t = 1 … T, i = 1 … n:
    z_i = argmax_{y ∈ Y} g(y; x_i, θ)
    gold ← Σ_{a ∈ A : y_i(a) = 1} φ(x_i, a)
    best ← Σ_{a ∈ A : z_i(a) = 1} φ(x_i, a)
    θ ← θ + gold − best
return θ
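The structured perceptron loop above can be sketched generically. The `decode` and `features` callables are assumptions of this sketch: they stand in for the argmax inference and the feature extraction Σ φ(x, a) of the pseudocode:

```python
def perceptron(train, decode, features, T=10):
    """Structured perceptron (Collins, 2002), as a sketch.
    train: list of (x, gold_y) pairs.
    decode(x, theta): best-scoring structure for x under theta.
    features(x, y): dict mapping feature name -> count for structure y."""
    theta = {}
    for _ in range(T):
        for x, y in train:
            z = decode(x, theta)
            if z != y:                       # theta + gold - best; a no-op
                gold, best = features(x, y), features(x, z)   # when z == y
                for f, v in gold.items():
                    theta[f] = theta.get(f, 0.0) + v
                for f, v in best.items():
                    theta[f] = theta.get(f, 0.0) - v
    return theta
```

The averaged variant mentioned on the next slide keeps a running sum of θ over all updates and returns its mean.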

Theory

- If possible, the perceptron will separate the correct structures from the incorrect ones.
- That is, it will find a θ that assigns y_i a higher score than any other y ∈ Y for each example.

Practical Training Considerations

- Training requires solving inference many times.
- Often, computing feature values is time-consuming.
- In practice, the averaged perceptron variant is preferred (Collins, 2002).

Conclusion

Method:

- Define features for this problem.
- Learn parameters θ from corpus data.
- Maximize the objective to find the best parse y*.

Structured prediction framework, applicable to many problems.

Transition-based parsing

29 / 1

Transition-based (greedy) parsing

1. Start with an unparsed sentence.
2. Apply locally-optimal actions until the sentence is parsed.
3. Use whatever features you want.
4. Surprisingly accurate.
5. Can be extremely fast.
30 / 1

Intro to Transition-based Dependency Parsing

An abstract machine composed of a stack and a buffer.

The machine is initialized with the words of a sentence.

A set of actions processes the words by moving them from the buffer to the stack, removing them from the stack, or adding links between them.

A specific set of actions defines a transition system.

31 / 1

The Arc-Eager Transition System

- SHIFT: move the first word from the buffer to the stack.
  (pre: buffer not empty.)
- LEFTARC_label: make the first word in the buffer the head of the top of the stack, and pop the stack.
  (pre: stack not empty; top of stack does not have a parent.)
- RIGHTARC_label: make the top of the stack the head of the first word in the buffer, and move the first word in the buffer to the stack.
  (pre: buffer not empty.)
- REDUCE: pop the stack.
  (pre: stack not empty; top of stack has a parent.)

[Figure: step-by-step stack/buffer states over an example word sequence A B C D.]

32 / 1
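The four actions and their preconditions translate into a small state machine. A sketch (the class and method names are ours; the root starts on the stack, and labels are omitted for brevity):

```python
class ArcEager:
    """Arc-eager transition system. Words are 1..n; 0 is the root, which
    starts on the stack. Arcs are (head, modifier) pairs."""
    def __init__(self, n):
        self.stack, self.buffer = [0], list(range(1, n + 1))
        self.arcs, self.head = set(), {}

    def shift(self):                 # pre: buffer not empty
        self.stack.append(self.buffer.pop(0))

    def left_arc(self):              # buffer front becomes head of stack top
        m = self.stack.pop()         # pre: top has no parent, is not root
        assert m != 0 and m not in self.head
        self.head[m] = self.buffer[0]
        self.arcs.add((self.buffer[0], m))

    def right_arc(self):             # stack top becomes head of buffer front
        m = self.buffer.pop(0)       # pre: buffer not empty
        self.head[m] = self.stack[-1]
        self.arcs.add((self.stack[-1], m))
        self.stack.append(m)

    def reduce(self):                # pre: top has a parent
        assert self.stack[-1] in self.head
        self.stack.pop()
```

For a hypothetical two-word sentence "She sleeps", the sequence SHIFT, LEFTARC, RIGHTARC derives the arcs sleeps → She and root → sleeps.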

Parsing Example

* She ate pizza with pleasure

33 / 1

What do we know about the arc-eager transition system?

- Every sequence of actions results in a valid projective structure.
- Every projective tree is derivable by (at least one) sequence of actions.
- Given a tree, we can find a sequence of actions for deriving it (an “oracle”).

We know these things also for the arc-standard, arc-hybrid and other transition systems.

34 / 1

This knowledge is quite powerful

Parsing with an oracle sequence

for sentence, tree pair in corpus:
    sequence ← oracle(sentence, tree)
    configuration ← initialize(sentence)
    while not configuration.IsFinal():
        action ← sequence.next()
        configuration ← configuration.apply(action)
    return configuration.tree

“She ate pizza with pleasure”

SH LEFT SH RIGHT RE RIGHT RIGHT RE RE RE

35 / 1


35 / 1

This knowledge is quite powerful

Parsing with an oracle sequence

placeholderfor sentence,tree pair in corpus do

sequence oracle(sentence, tree)configuration initialize(sentence)while not configuration.IsFinal() do

action sequence.next()configuration configuration.apply(action)

return configuration.tree

“She ate pizza with pleasure”

SH LEFT SH RIGHT RE RIGHT RIGHT RE RE RE

A She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasure

A She ate pizza with pleasure

A She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasure

35 / 1

This knowledge is quite powerful

Parsing with an oracle sequence

placeholderfor sentence,tree pair in corpus do

sequence oracle(sentence, tree)configuration initialize(sentence)while not configuration.IsFinal() do

action sequence.next()configuration configuration.apply(action)

return configuration.tree

“She ate pizza with pleasure”

SH LEFT SH RIGHT RE RIGHT RIGHT RE RE RE

A She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasure

A She ate pizza with pleasure

A She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasure

35 / 1

This knowledge is quite powerful

Parsing with an oracle sequence

placeholderfor sentence,tree pair in corpus do

sequence oracle(sentence, tree)configuration initialize(sentence)while not configuration.IsFinal() do

action sequence.next()configuration configuration.apply(action)

return configuration.tree

“She ate pizza with pleasure”

SH LEFT SH RIGHT RE RIGHT RIGHT RE RE RE

A She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasure

A She ate pizza with pleasure

A She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasure

35 / 1

This knowledge is quite powerful

Parsing with an oracle sequence

placeholderfor sentence,tree pair in corpus do

sequence oracle(sentence, tree)configuration initialize(sentence)while not configuration.IsFinal() do

action sequence.next()configuration configuration.apply(action)

return configuration.tree

“She ate pizza with pleasure”

SH LEFT SH RIGHT RE RIGHT RIGHT RE RE RE

A She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasure

A She ate pizza with pleasure

A She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasure

35 / 1

This knowledge is quite powerful

Parsing with an oracle sequence

placeholderfor sentence,tree pair in corpus do

sequence oracle(sentence, tree)configuration initialize(sentence)while not configuration.IsFinal() do

action sequence.next()configuration configuration.apply(action)

return configuration.tree

“She ate pizza with pleasure”

SH LEFT SH RIGHT RE RIGHT RIGHT RE RE RE

A She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasure

A She ate pizza with pleasure

A She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasure

35 / 1

This knowledge is quite powerful

Parsing with an oracle sequence

placeholderfor sentence,tree pair in corpus do

sequence oracle(sentence, tree)configuration initialize(sentence)while not configuration.IsFinal() do

action sequence.next()configuration configuration.apply(action)

return configuration.tree

“She ate pizza with pleasure”

SH LEFT SH RIGHT RE RIGHT RIGHT RE RE RE

A She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasure

A She ate pizza with pleasure

A She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasure

35 / 1

This knowledge is quite powerful

Parsing with an oracle sequence

placeholderfor sentence,tree pair in corpus do

sequence oracle(sentence, tree)configuration initialize(sentence)while not configuration.IsFinal() do

action sequence.next()configuration configuration.apply(action)

return configuration.tree

“She ate pizza with pleasure”

SH LEFT SH RIGHT RE RIGHT RIGHT RE RE RE

A She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasure

A She ate pizza with pleasure

A She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasure

35 / 1

This knowledge is quite powerful

Parsing with an oracle sequence

placeholderfor sentence,tree pair in corpus do

sequence oracle(sentence, tree)configuration initialize(sentence)while not configuration.IsFinal() do

action sequence.next()configuration configuration.apply(action)

return configuration.tree

“She ate pizza with pleasure”

SH LEFT SH RIGHT RE RIGHT RIGHT RE RE RE

A She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasure

A She ate pizza with pleasure

A She ate pizza with pleasureA She ate pizza with pleasure

35 / 1

This knowledge is quite powerful

Parsing with an oracle sequence

placeholderfor sentence,tree pair in corpus do

sequence oracle(sentence, tree)configuration initialize(sentence)while not configuration.IsFinal() do

action sequence.next()configuration configuration.apply(action)

return configuration.tree

“She ate pizza with pleasure”

SH LEFT SH RIGHT RE RIGHT RIGHT RE RE RE

A She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasure

A She ate pizza with pleasure

A She ate pizza with pleasureA She ate pizza with pleasure

35 / 1

This knowledge is quite powerful

Parsing with an oracle sequence

placeholderfor sentence,tree pair in corpus do

sequence oracle(sentence, tree)configuration initialize(sentence)while not configuration.IsFinal() do

action sequence.next()configuration configuration.apply(action)

return configuration.tree

“She ate pizza with pleasure”

SH LEFT SH RIGHT RE RIGHT RIGHT RE RE RE

A She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasure

A She ate pizza with pleasure

A She ate pizza with pleasure

35 / 1

This knowledge is quite powerful

Parsing with an oracle sequence

placeholderfor sentence,tree pair in corpus do

sequence oracle(sentence, tree)configuration initialize(sentence)while not configuration.IsFinal() do

action sequence.next()configuration configuration.apply(action)

return configuration.tree

“She ate pizza with pleasure”

SH LEFT SH RIGHT RE RIGHT RIGHT RE RE RE

A She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasure

A She ate pizza with pleasure

A She ate pizza with pleasure

35 / 1

This knowledge is quite powerful

Parsing with an oracle sequence

placeholderfor sentence,tree pair in corpus do

sequence oracle(sentence, tree)configuration initialize(sentence)while not configuration.IsFinal() do

action sequence.next()configuration configuration.apply(action)

return configuration.tree

“She ate pizza with pleasure”

SH LEFT SH RIGHT RE RIGHT RIGHT RE RE RE

A She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasure

A She ate pizza with pleasure35 / 1

This knowledge is quite powerful

Parsing with an oracle sequence

placeholderfor sentence,tree pair in corpus do

sequence oracle(sentence, tree)configuration initialize(sentence)while not configuration.IsFinal() do

action sequence.next()configuration configuration.apply(action)

return configuration.tree

“She ate pizza with pleasure”

SH LEFT SH RIGHT RE RIGHT RIGHT RE RE RE

A She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasure

A She ate pizza with pleasure35 / 1
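The oracle-replay loop can be sketched concretely. This is a minimal illustration assuming an unlabeled arc-eager transition system (SHIFT, LEFT-ARC, RIGHT-ARC, REDUCE); the `Configuration` class and helper names are stand-ins for illustration, not code from an actual parser.

```python
# Minimal sketch: replay an oracle action sequence with an arc-eager
# transition system. Words are tracked by index; arcs are (head, dep).

class Configuration:
    def __init__(self, sentence):
        self.stack = []                           # partially processed words
        self.buffer = list(range(len(sentence)))  # indices still to process
        self.arcs = set()                         # (head, dependent) pairs
        self.sentence = sentence

    def is_final(self):
        return not self.buffer                    # done when the buffer is empty

    def apply(self, action):
        if action == "SH":      # SHIFT: move buffer front onto the stack
            self.stack.append(self.buffer.pop(0))
        elif action == "LEFT":  # LEFT-ARC: stack top depends on buffer front
            self.arcs.add((self.buffer[0], self.stack.pop()))
        elif action == "RIGHT": # RIGHT-ARC: buffer front depends on stack top
            self.arcs.add((self.stack[-1], self.buffer[0]))
            self.stack.append(self.buffer.pop(0))
        elif action == "RE":    # REDUCE: pop a word that already has a head
            self.stack.pop()
        return self

def parse_with_sequence(sentence, actions):
    config = Configuration(sentence)
    for action in actions:
        if config.is_final():
            break
        config.apply(action)
    return config.arcs

words = "She ate pizza with pleasure".split()
arcs = parse_with_sequence(
    words, ["SH", "LEFT", "SH", "RIGHT", "RE", "RIGHT", "RIGHT", "RE", "RE", "RE"])
# arcs: ate→She, ate→pizza, ate→with, with→pleasure
```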

This knowledge is quite powerful

Parsing without an oracle

    start with weight vector w
    configuration ← initialize(sentence)
    while not configuration.IsFinal() do
        action ← predict(w, φ(configuration))
        configuration ← configuration.apply(action)
    return configuration.tree

- summarize the configuration as a feature vector
- predict the action based on the features
- need to learn the correct weights

36 / 48
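What might φ(configuration) look like? Here is an illustrative sketch that summarizes the parser state as a list of sparse binary feature strings. The templates shown (top-of-stack word, front-of-buffer word, their conjunction) are typical but hypothetical; real parsers use many more templates.

```python
# Sketch of a feature function phi: turn the parser state (stack, buffer)
# into a list of string-valued features for a linear classifier.

def phi(stack, buffer, words):
    features = []
    s0 = words[stack[-1]] if stack else "<empty>"   # word on top of the stack
    b0 = words[buffer[0]] if buffer else "<empty>"  # next word in the buffer
    features.append("s0.w=" + s0)
    features.append("b0.w=" + b0)
    features.append("s0.w+b0.w=" + s0 + "|" + b0)   # conjunction feature
    return features

words = "She ate pizza".split()
print(phi([1], [2], words))
# ['s0.w=ate', 'b0.w=pizza', 's0.w+b0.w=ate|pizza']
```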

Learning a parser (batch)

    training_set ← []
    for each (sentence, tree) pair in corpus do
        sequence ← oracle(sentence, tree)
        configuration ← initialize(sentence)
        while not configuration.IsFinal() do
            action ← sequence.next()
            features ← φ(configuration)
            training_set.add(features, action)
            configuration ← configuration.apply(action)
    train a classifier on training_set

Learning a parser (online)

    w ← 0
    for each (sentence, tree) pair in corpus do
        sequence ← oracle(sentence, tree)
        configuration ← initialize(sentence)
        while not configuration.IsFinal() do
            action ← sequence.next()
            features ← φ(configuration)
            predicted ← predict(w, φ(configuration))
            if predicted ≠ action then
                w.update(φ(configuration), action, predicted)
            configuration ← configuration.apply(action)
    return w

37 / 48
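The online learning loop above can be sketched as a perceptron. This assumes `phi` returns a list of feature strings and `oracle` yields gold actions; `initialize`, `oracle`, and `phi` are stand-ins supplied by the caller, not a real library.

```python
# Perceptron-style online learner for a transition-based parser: on each
# mistaken prediction, boost the gold action's feature weights and
# penalize the predicted action's.

from collections import defaultdict

ACTIONS = ["SH", "LEFT", "RIGHT", "RE"]

def predict(w, features):
    # score each action as the sum of its feature weights; take the argmax
    scores = {a: sum(w[(a, f)] for f in features) for a in ACTIONS}
    return max(ACTIONS, key=lambda a: scores[a])

def train_online(corpus, oracle, initialize, phi, epochs=1):
    w = defaultdict(float)                  # w <- 0
    for _ in range(epochs):
        for sentence, tree in corpus:
            sequence = iter(oracle(sentence, tree))
            configuration = initialize(sentence)
            while not configuration.is_final():
                action = next(sequence)
                features = phi(configuration)
                predicted = predict(w, features)
                if predicted != action:     # perceptron update on a mistake
                    for f in features:
                        w[(action, f)] += 1.0
                        w[(predicted, f)] -= 1.0
                configuration = configuration.apply(action)
    return w
```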

Parsing time

    configuration ← initialize(sentence)
    while not configuration.isFinal() do
        action ← predict(w, φ(configuration))
        configuration ← configuration.apply(action)
    return configuration.tree

38 / 48

In short

- Summarize the configuration by a set of features.
- Learn the best action to take at each configuration.
- Hope this generalizes well.

39 / 48

Transition Based Parsing

- A different approach.
- Very common.
- Can be as accurate as first-order graph-based parsing.
  (Higher-order graph-based parsers are still better.)
- Easy to implement.
- Very fast: O(n).
- Can be improved further:
  - Easy-first
  - Dynamic oracle
  - Beam search

41 / 48

Neural Networks

42 / 48

Neural-network (deep learning) based approaches

- Both graph-based and transition-based models benefit from the move to neural networks.
- Same overall approach and algorithms as before, but:
  - Replace the linear classifier with an MLP.
  - Use pre-trained word embeddings.
  - Replace the feature extractor with a Bi-LSTM.
- Now exploring:
  - Semi-supervised learning.
  - Multi-task learning objectives.
  - Out-of-domain parsing.

43 / 48

Neural Networks (deep learning): seq2seq to linearized trees

Use the “sequence to sequence with attention” model used for Machine Translation (details in the DL4Seq course).

Treat parsing as translation from a sentence to a linearized tree:

“Fruit flies like a banana” → NMT (seq2seq+att) → (S (NP Adj Noun NP) (VP Vb (NP Det Noun NP) VP) S)
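The tree linearization used in this setup can be sketched as a small recursive function. This is an illustrative sketch: the `(label, children)` tree representation is an assumption made for the example, and the closing brackets are annotated with their label, as in the slide's output.

```python
# Linearize a bracketed constituency tree into a flat token sequence,
# labeling each closing bracket so the tree can be reconstructed.

def linearize(tree):
    # tree is (label, children) for internal nodes, or a bare string leaf
    if isinstance(tree, str):
        return [tree]
    label, children = tree
    tokens = ["(" + label]
    for child in children:
        tokens += linearize(child)
    tokens.append(label + ")")
    return tokens

tree = ("S", [("NP", ["Adj", "Noun"]),
              ("VP", ["Vb", ("NP", ["Det", "Noun"])])])
print(" ".join(linearize(tree)))
# (S (NP Adj Noun NP) (VP Vb (NP Det Noun NP) VP) S)
```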

Hybrid Approaches

44 / 48

Hybrid approaches

- Different parsers have different strengths.
  ⇒ Combine several parsers.

Stacking

- Run parser A.
- Use the tree from parser A to add features to parser B.

Voting

- Parse the sentence with k different parsers.
- Each parser “votes” on its dependency arcs.
- Run a first-order graph-parser to find the tree with the best arcs according to the votes.

45 / 48
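The vote-counting step can be sketched in a few lines: each of the k parsers proposes a set of (head, dependent) arcs, and the combiner counts the votes per arc. This is an illustration only; a real combiner would then run a first-order MST parser over these scores, which is omitted here.

```python
# Count per-arc votes across the outputs of several parsers.

from collections import Counter

def vote_arcs(parses):
    votes = Counter()
    for arcs in parses:          # one set of (head, dep) arcs per parser
        votes.update(arcs)
    return votes

parser_outputs = [
    {(1, 0), (1, 2), (1, 3)},   # parser A: "with" attaches to the verb
    {(1, 0), (1, 2), (2, 3)},   # parser B: "with" attaches to the noun
    {(1, 0), (1, 2), (1, 3)},   # parser C agrees with A
]
votes = vote_arcs(parser_outputs)
print(votes[(1, 0)], votes[(1, 3)], votes[(2, 3)])
# 3 2 1
```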

Semi-supervised approaches

- We only see very few words (and word-pairs) in the training data.
- If we know (eat, carrot) is a good pair, what do we know about (eat, tomato)?
- Nothing, if the pair is not in our training data!
  ⇒ Use unlabeled data.

Cluster Features

- Represent words as context vectors.
- Define a similarity measure between vectors.
- Use a clustering algorithm to cluster the words.
- We hope that:
  - (eat, drink, devour, . . . ) are in the same cluster.
  - (tomato, carrot, pizza, . . . ) are in the same cluster.
- Use the clusters as additional features for the parser.
- This also works well (better?) for POS-tagging and NER.

46 / 48
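The first two steps above (context vectors and a similarity measure) can be sketched from scratch. This is an illustrative toy: counts over a one-word window stand in for real context vectors, and the clustering step itself is omitted.

```python
# Build context-count vectors for each word and compare words by cosine
# similarity; words with similar contexts should score high.

import math
from collections import Counter, defaultdict

def context_vectors(sentences, window=1):
    vectors = defaultdict(Counter)
    for sent in sentences:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][sent[j]] += 1   # count a neighbouring word
    return vectors

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

sents = [["I", "eat", "pizza"], ["I", "eat", "carrots"], ["I", "devour", "pizza"]]
vecs = context_vectors(sents)
# "eat" and "devour" share contexts, so their similarity is high:
print(cosine(vecs["eat"], vecs["devour"]))
```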

Available Software

There are many parsers available for download, including:

Constituency (PCFG)

- Stanford Parser (can also produce dependencies)
- Berkeley Parser
- Charniak Parser
- Collins Parser

Dependency

- RBGParser, TurboParser (graph-based)
- ZPar (transition + beam)
- ClearNLP (many variants)
- EasyFirst (my own)
- Bist Parser (from the BGU lab; biLSTM, graph + transition)
- spaCy (nice API, super fast!)

47 / 48

Summary

Dependency Parsers

- Conversion from constituency
- Graph-based
- Transition-based
- Hybrid / ensemble
- Semi-supervised (cluster features)

48 / 48
