
Page 1: Data-Driven Dependency Parsing

Data-Driven Dependency Parsing
Kenji Sagae, CSCI-544

Page 2: Data-Driven Dependency Parsing

Background: Natural Language Parsing

• Syntactic analysis
• String to (tree) structure

[Figure: the PARSER maps the input string "He likes fish" to the output tree S( NP(Prn He), VP(V likes, NP(N fish)) ).]


Page 4: Data-Driven Dependency Parsing

[Figure: the same parser diagram.]

• Useful in Natural Language Understanding
  • NL interfaces, conversational agents
• Language technology applications
  • Machine translation, question answering, information extraction
• Scientific study of language
  • Syntax
  • Language processing models

Page 5: Data-Driven Dependency Parsing

[Figure: the same parser diagram, now driven by a grammar.]

GRAMMAR:
S → NP VP
NP → N
NP → NP PP
VP → V NP
VP → V NP PP
VP → VP PP

Not enough coverage, too much ambiguity

Page 6: Data-Driven Dependency Parsing

[Figure: the same parser diagram; instead of a grammar, a TREEBANK of example trees ("The boy runs fast", "Dogs run fast", "Dogs run") is used to train the parser.]

Charniak (1996); Collins (1996); Charniak (1997)


Page 8: Data-Driven Dependency Parsing

[Figure: two representations of "The boy ate the cheese sandwich": a Phrase Structure Tree (Constituent Structure) and a Dependency Structure.]

Page 9: Data-Driven Dependency Parsing

[Figure: the same two structures; in the dependency structure, "ate" is the head of both "boy" and "sandwich".]

Page 10: Data-Driven Dependency Parsing

[Figure: the labeled dependency structure for "The boy ate the cheese sandwich": "boy" is the SUBJ of "ate", "The" is the DET of "boy", "sandwich" is the OBJ of "ate", "cheese" is a MOD of "sandwich", and "the" is the DET of "sandwich". Each link connects a HEAD to a DEPENDENT and carries a LABEL.]

Page 11: Data-Driven Dependency Parsing

Background: Linear Classification with the Perceptron

• Classification: given an input x, predict an output y
• Example: x is a document, y ∈ {Sports, Politics, Science}
  • x is represented as a feature vector f(x)
• To score a class, just add up the feature weights, given in a vector w

• Example: x → f(x) → y
  x: "Wednesday night, when the Lakers play the Mavericks at American Airlines Center, they get to see first hand …"
  f(x): # games: 5, # Lakers: 4, # said: 3, # rebounds: 3, # democrat: 0, # republican: 0, # science: 0
  y: Sports

Page 12: Data-Driven Dependency Parsing

Multiclass Perceptron

• Learn a vector of feature weights w_c for each class c

for each class c: w_c = 0
for N iterations:
    for each training example (x_i, y_i):
        z_i = argmax_z w_z · f(x_i)
        if z_i ≠ y_i:
            w_{z_i} = w_{z_i} − f(x_i)
            w_{y_i} = w_{y_i} + f(x_i)

• Try to classify each example. If a mistake is made, update the weights.
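To make the update rule concrete, here is a minimal Python sketch. The sparse dict feature representation and all function names are my own choices, not from the slides.

```python
from collections import defaultdict

def dot(w, f):
    """Dot product of a weight dict and a sparse feature dict."""
    return sum(w[feat] * val for feat, val in f.items())

def train_multiclass_perceptron(examples, classes, n_iterations=10):
    """examples: list of (f_x, y) pairs, where f_x is a dict of feature counts."""
    w = {c: defaultdict(float) for c in classes}            # w_c = 0 for each class c
    for _ in range(n_iterations):
        for f_x, y in examples:
            z = max(classes, key=lambda c: dot(w[c], f_x))  # z = argmax_z w_z . f(x)
            if z != y:                                      # mistake: update weights
                for feat, val in f_x.items():
                    w[z][feat] -= val                       # w_z = w_z - f(x)
                    w[y][feat] += val                       # w_y = w_y + f(x)
    return w

# Toy usage with the document example from slide 11:
w = train_multiclass_perceptron(
    [({"games": 5, "Lakers": 4, "said": 3, "rebounds": 3}, "Sports")],
    ["Sports", "Politics", "Science"])
```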

Page 13: Data-Driven Dependency Parsing

Shift-Reduce Dependency Parsing

• Two main data structures:
  • Stack S (initially empty)
  • Queue Q (initialized to contain each word in the input sentence)
• Two types of actions:
  • Shift: removes a word from Q, pushes it onto S
  • Reduce: pops two items from S, pushes a new item onto S
    • The new item is a tree that contains the two popped items

• This can be applied to either dependencies (Nivre, 2004) or constituents (Sagae & Lavie, 2005)
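A minimal Python sketch of these two data structures and actions (the Tree class and function names are my own; the slides leave stack items abstract):

```python
from collections import deque

class Tree:
    """A stack item: a head token plus any attached dependents."""
    def __init__(self, word, pos=None):
        self.word, self.pos = word, pos
        self.dependents = []                     # list of (subtree, label) pairs

def shift(stack, queue):
    """SHIFT: remove the next word from Q and push it onto S."""
    word, pos = queue.popleft()
    stack.append(Tree(word, pos))

def reduce_(stack, head_side, label):
    """REDUCE: pop two items from S and push a new item, a tree that
    contains the two popped items, one attached to the other."""
    right, left = stack.pop(), stack.pop()       # top of stack is the right item
    head, dep = (right, left) if head_side == "RIGHT" else (left, right)
    head.dependents.append((dep, label))
    stack.append(head)

# Parsing "He likes fish" with the action sequence shown on slide 16:
stack, queue = [], deque([("He", "PRP"), ("likes", "VBZ"), ("fish", "NN")])
shift(stack, queue); shift(stack, queue)
reduce_(stack, "RIGHT", "SUBJ")                  # He is the SUBJ of likes
shift(stack, queue)
reduce_(stack, "LEFT", "OBJ")                    # fish is the OBJ of likes
```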

Page 14: Data-Driven Dependency Parsing

Shift

[Figure: before SHIFT, the stack holds the partial subtree "Under a proposal…" (with a PMOD arc) and the input string begins "to expand IRAs a …". A shift action removes the next token ("to") from the input list and pushes this new item onto the stack.]

Page 15: Data-Driven Dependency Parsing

Reduce

[Figure: before REDUCE, the top two stack items are "to" and the subtree headed by "expand" (with a PMOD arc); the input begins "IRAs a $2000 …". A REDUCE-RIGHT-VMOD action pops these two items and pushes a single new subtree that joins them with a VMOD arc.]

Page 16: Data-Driven Dependency Parsing

[Figure: parsing "He likes fish". Parser actions: SHIFT, SHIFT, REDUCE-RIGHT-SUBJ, SHIFT, REDUCE-LEFT-OBJ, producing "He" as the SUBJ of "likes" and "fish" as its OBJ.]

Page 17: Data-Driven Dependency Parsing

Choosing Parser Actions

• No grammar, no action table

• Learn to associate stack/queue configurations with appropriate parser actions

• Classifier:
  • Treated as a black box
  • Perceptron, SVM, maximum entropy, memory-based learning, etc.
  • Features: top two items on the stack, next input token, context, lookahead, …
  • Classes: parser actions
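Putting the pieces together, a sketch of the greedy parsing loop. The classifier is the black box above (here assumed to expose a predict method); shift and reduce_ come from the sketch after slide 13, and extract_features is sketched after the next slide.

```python
from collections import deque

def parse(tokens, classifier):
    """Greedy shift-reduce parsing: at each step, ask the classifier which
    action fits the current stack/queue configuration, then apply it."""
    stack, queue = [], deque(tokens)
    while queue or len(stack) > 1:
        action = classifier.predict(extract_features(stack, queue))
        if action == "SHIFT":
            shift(stack, queue)
        else:                                  # e.g. "REDUCE-RIGHT-SUBJ"
            _, head_side, label = action.split("-", 2)
            reduce_(stack, head_side, label)
    return stack[0]                            # the completed dependency tree
```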

Page 18: Data-Driven Dependency Parsing

STACK: [He, likes]    QUEUE: [fish]

Features:
stack(0) = likes    stack(0).POS = VBZ
stack(1) = He       stack(1).POS = PRP
stack(2) = 0        stack(2).POS = 0
queue(0) = fish     queue(0).POS = NN
queue(1) = 0        queue(1).POS = 0
queue(2) = 0        queue(2).POS = 0
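A sketch of a feature extractor that produces the template above, assuming the Tree stack items from the earlier sketch (with .word and .pos) and (word, POS) pairs in the queue. A real system would binarize each name=value pair into an indicator feature before classification.

```python
def extract_features(stack, queue, k=3):
    """Features from the top k stack items and the first k queue tokens;
    '0' marks an empty position, as on the slide."""
    feats, q = {}, list(queue)
    for i in range(k):
        item = stack[-(i + 1)] if i < len(stack) else None
        feats[f"stack({i})"] = item.word if item else "0"
        feats[f"stack({i}).POS"] = item.pos if item else "0"
        tok = q[i] if i < len(q) else None
        feats[f"queue({i})"] = tok[0] if tok else "0"
        feats[f"queue({i}).POS"] = tok[1] if tok else "0"
    return feats
```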

Page 19: Data-Driven Dependency Parsing

[Same stack/queue configuration and features as the previous slide, now with the predicted class shown.]

Class: Reduce-Right-SUBJ

Page 20: Data-Driven Dependency Parsing

[Figure (slides 20–22): applying REDUCE-RIGHT-SUBJ pops "He" and "likes" from the stack and pushes a single new subtree in which "He" is the SUBJ of "likes"; the queue still holds "fish".]

Page 23: Data-Driven Dependency Parsing

Accurate Parsing with Greedy Search

• Experiments: WSJ Penn Treebank
  • 1M words of WSJ text
  • Accuracy: ~90% (unlabeled dependency links)
• Other languages (CoNLL 2006 and 2007 shared tasks)
  • Arabic, Basque, Chinese, Czech, Japanese, Greek, Hungarian, Turkish, …
  • About 75% to 92%

• Good accuracy, fast (linear time), easy to implement!

Page 24: Data-Driven Dependency Parsing

Maximum Spanning Tree Parsing (McDonald et al., 2005)

• A dependency tree is a graph (obviously)
  • Words are vertices, dependency links are edges
• Imagine instead a fully connected weighted graph
  • Each weight is the score for the dependency link
  • Each score is independent of the other dependencies
  • Edge-factored model
• Find the Maximum Spanning Tree
  • Score for the tree is the sum of the scores of its individual dependencies

• How are edge weights determined?
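Before answering that, a decoding sketch for intuition only: brute-force enumeration of every head assignment, keeping the highest-scoring valid tree. Real parsers use the quadratic Chu-Liu/Edmonds algorithm instead of this exponential search; the score-matrix layout is my assumption.

```python
from itertools import product

def is_tree(heads, n):
    """heads[d-1] = head of word d; valid iff every word reaches the root (0)
    without revisiting a node (i.e., no cycles or self-loops)."""
    for d in range(1, n):
        seen, node = set(), d
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True

def best_tree(score):
    """Edge-factored decoding by brute force, for intuition only.
    score[h][d] is the weight of edge h -> d; node 0 is the artificial root.
    Real parsers use Chu-Liu/Edmonds, which runs in quadratic time."""
    n = len(score)
    best, best_heads = float("-inf"), None
    for heads in product(range(n), repeat=n - 1):   # heads[d-1] = head of word d
        if not is_tree(heads, n):
            continue
        total = sum(score[heads[d - 1]][d] for d in range(1, n))  # edge-factored sum
        if total > best:
            best, best_heads = total, heads
    return best_heads
```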

Page 25: Data-Driven Dependency Parsing

[Figure: the sentence "I ate a sandwich" (words 1–4) drawn as graph nodes: 0 (root), 1 (I), 2 (ate), 3 (a), 4 (sandwich).]

Page 26: Data-Driven Dependency Parsing

[Figure: the same nodes, now fully connected; every directed edge carries a score (positive or negative).]

Page 27: Data-Driven Dependency Parsing

[Figure: the same weighted graph; the maximum spanning tree, the highest-scoring set of edges forming a tree, is selected.]

Page 28: Data-Driven Dependency Parsing

Structured Classification

• x is a sentence, G is a dependency tree, f(G) is a vector of features for the entire tree

• Features:
  h(ate):d(sandwich)      hPOS(VBD):dPOS(NN)
  h(ate):d(I)             hPOS(VBD):dPOS(PRP)
  h(sandwich):d(a)        hPOS(NN):dPOS(DT)
  hPOS(VBD)   hPOS(NN)    dPOS(NN)
  dPOS(DT)    dPOS(NN)    dPOS(PRP)
  h(ate)      h(sandwich) d(sandwich)
  … (many more)

• To assign edge weights, we learn a feature weight vector w
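A sketch of a few of these feature templates in code, using the same sparse dict representation as the earlier perceptron sketch; the sentence layout (index 0 as an artificial root token) is my assumption.

```python
def edge_features(sent, h, d):
    """Features for one candidate edge h -> d; sent is a list of (word, POS)
    pairs with an artificial root token at index 0. Only a few of the
    templates from this slide are shown."""
    hw, hp = sent[h]
    dw, dp = sent[d]
    return {
        f"h({hw}):d({dw})": 1.0,        # e.g. h(ate):d(sandwich)
        f"hPOS({hp}):dPOS({dp})": 1.0,  # e.g. hPOS(VBD):dPOS(NN)
        f"h({hw})": 1.0, f"d({dw})": 1.0,
        f"hPOS({hp})": 1.0, f"dPOS({dp})": 1.0,
    }

def edge_score(w, sent, h, d):
    """Edge weight: dot product of the learned weight vector with the
    edge's feature vector."""
    return sum(w.get(feat, 0.0) * val
               for feat, val in edge_features(sent, h, d).items())
```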

Page 29: Data-Driven Dependency Parsing

Structured Perceptron

• Learn a vector of feature weights w

w = 0
for N iterations:
    for each training example (x_i, G_i):
        G′_i = argmax_{G′ ∈ GEN(x_i)} w · f(G′)
        if G′_i ≠ G_i:
            w = w + f(G_i) − f(G′_i)

• The same as before, but to find the argmax we use MST, since each G is a tree (which also contains the corresponding input x). If G′_i is not the right tree, update the feature vector.
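A sketch of this training loop, reusing edge_features, edge_score, and the toy best_tree decoder from the earlier sketches (a real implementation would decode with Chu-Liu/Edmonds):

```python
def train_structured_perceptron(examples, n_iterations=10):
    """examples: list of (sent, gold_heads) pairs, where sent is a list of
    (word, POS) pairs with an artificial root at index 0, and
    gold_heads[d-1] is the head of word d."""
    w = {}
    for _ in range(n_iterations):
        for sent, gold_heads in examples:
            n = len(sent)
            score = [[edge_score(w, sent, h, d) for d in range(n)]
                     for h in range(n)]
            pred_heads = best_tree(score)               # argmax_G' w . f(G') via MST
            if tuple(pred_heads) != tuple(gold_heads):  # mistake: w += f(G) - f(G')
                for d in range(1, n):
                    for feat, val in edge_features(sent, gold_heads[d - 1], d).items():
                        w[feat] = w.get(feat, 0.0) + val
                    for feat, val in edge_features(sent, pred_heads[d - 1], d).items():
                        w[feat] = w.get(feat, 0.0) - val
    return w
```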

Page 30: Data-Driven Dependency Parsing

Question: Are there trees that an MST parser can find, but a shift-reduce parser* can't?
(*the shift-reduce parser as described in slides 13–19)

Page 31: Data-Driven Dependency Parsing

Accurate Parsing with Edge-Factored Models

• The Maximum Spanning Tree algorithm for directed trees (Chu & Liu, 1965; Edmonds, 1967) runs in quadratic time
  • Finds the best out of exponentially many trees
  • Exact inference!
• Edge-factored: each dependency link is considered independently of the others
• Compare to shift-reduce parsing:
  • Greedy inference
  • Rich set of features, including partially built trees

• McDonald and Nivre (2007) show that shift-reduce and MST parsing get similar accuracy, but have different strengths

Page 32: Data-Driven Dependency Parsing


Parser Ensembles

• By using different types of classifiers and algorithms, we get several different parsers

• Ensemble idea: combine the output of several parsers to obtain a single more accurate result

[Figure: the sentence "I like cheese" is fed to Parser A, Parser B, and Parser C; their three output trees are combined into a single tree.]

Page 33: Data-Driven Dependency Parsing

Parser Ensembles with Maximum Spanning Trees (Sagae and Lavie, 2006)

• First, build a graph:
  • Create a node for each word in the input sentence (plus one extra "root" node)
  • Each dependency proposed by any of the parsers is a weighted edge
  • If multiple parsers propose the same dependency, add weight to the corresponding edge
• Then, simply find the MST:
  • Maximizes the votes
  • Structure guaranteed to be a dependency tree
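A sketch of the voting scheme, again reusing the toy best_tree decoder in place of a real MST algorithm; parser outputs are assumed to be head arrays.

```python
def ensemble_parse(parser_outputs, n):
    """parser_outputs: one head array per parser (heads[d-1] = head of word d);
    n = number of nodes including the root. Each parser's vote adds weight
    to the corresponding edge; the MST then maximizes the votes."""
    score = [[0.0] * n for _ in range(n)]
    for heads in parser_outputs:
        for d in range(1, n):
            score[heads[d - 1]][d] += 1.0   # a vote adds weight to that edge
    return best_tree(score)                 # guaranteed to be a dependency tree

# e.g. three parsers on "I ate a sandwich" (nodes 0..4); two attach "a" to
# "sandwich", one to "ate", so the majority edge wins:
# ensemble_parse([(2, 0, 4, 2), (2, 0, 4, 2), (2, 0, 2, 2)], 5) -> (2, 0, 4, 2)
```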

Page 34: Data-Driven Dependency Parsing

[Figure sequence (slides 34–38): the graph for "I ate a sandwich" with nodes 0 (root), 1 (I), 2 (ate), 3 (a), 4 (sandwich); Parser A, Parser B, and Parser C each contribute weighted edges, and the maximum spanning tree of the combined graph is extracted.]

Page 39: Data-Driven Dependency Parsing

MST Parser Ensembles Are Very Accurate

• Highest accuracy in the CoNLL 2007 shared task on multilingual dependency parsing (a parser bake-off with 22 teams)
  • Nilsson et al. (2007); Sagae and Tsujii (2007)
• Improvement depends on the selection of parsers for the ensemble
  • With four parsers with accuracies between 89% and 91%, ensemble accuracy = 92.7%