1
Data-Driven Dependency Parsing
Kenji Sagae, CSCI-544
2
Background: Natural Language Parsing
• Syntactic analysis
• String to (tree) structure
[Diagram: the PARSER maps the input string "He likes fish" to the output tree (S (NP (Prn He)) (VP (V likes) (NP (N fish))))]
3
[Diagram repeated: the PARSER maps "He likes fish" to (S (NP (Prn He)) (VP (V likes) (NP (N fish))))]
4
[Diagram repeated: the PARSER maps "He likes fish" to its parse tree]
• Useful in Natural Language Understanding
  • NL interfaces, conversational agents
• Language technology applications
  • Machine translation, question answering, information extraction
• Scientific study of language
  • Syntax
  • Language processing models
5
[Diagram repeated: the PARSER maps "He likes fish" to its parse tree]
GRAMMAR:
S → NP VP
NP → N
NP → NP PP
VP → V NP
VP → V NP PP
VP → VP PP
…

Not enough coverage, too much ambiguity
6
[Diagram repeated: the PARSER, now with the GRAMMAR above, maps "He likes fish" to its parse tree]
TREEBANK:
(S (NP (Det The) (N boy)) (VP (V runs) (AdvP (Adv fast))))
(S (NP (N Dogs)) (VP (V run) (AdvP (Adv fast))))
(S (NP (N Dogs)) (VP (V run)))
Charniak (1996); Collins (1996); Charniak (1997)
7
[Diagram repeated: the PARSER uses a GRAMMAR induced from the TREEBANK to map "He likes fish" to its parse tree]
8
Phrase Structure Tree (Constituent Structure):
(S (NP (Det The) (N boy)) (VP (V ate) (NP (Det the) (N cheese) (N sandwich))))

Dependency Structure:
The boy ate the cheese sandwich, with each word linked directly to its head word
9
[Diagram repeated: the same phrase structure tree alongside the dependency structure, with arcs such as ate → boy and ate → sandwich]
10
The boy ate the cheese sandwich

Labeled dependency structure (each arc connects a HEAD to a DEPENDENT and carries a LABEL):
ate → boy (SUBJ)
ate → sandwich (OBJ)
boy → The (DET)
sandwich → the (DET)
sandwich → cheese (MOD)
11
Background: Linear Classification with the Perceptron
• Classification: given an input x, predict output y
• Example: x is a document, y ∈ {Sports, Politics, Science}
• x is represented as a feature vector f(x)
• To score a class, just add the feature weights given in a vector w

Example (x → f(x) → y):
x: "Wednesday night, when the Lakers play the Mavericks at American Airlines Center, they get to see first hand …"
f(x): # games: 5, # Lakers: 4, # said: 3, # rebounds: 3, # democrat: 0, # republican: 0, # science: 0
y: Sports
12
Multiclass Perceptron
• Learn a vector of feature weights wc for each class c

  for each class c: wc = 0
  for N iterations:
      for each training example (xi, yi):
          zi = argmaxz wz · f(xi)
          if zi ≠ yi:
              wzi = wzi − f(xi)
              wyi = wyi + f(xi)

• Try to classify each example. If a mistake is made, update the weights.
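As a sketch, the update rule above can be implemented in a few lines of Python. Feature vectors are represented as dicts of counts, in the style of the "# games: 5" features on the previous slide; the function names and toy training examples here are illustrative, not from the original slides.

```python
# Minimal multiclass perceptron following the update rule on this slide.
from collections import defaultdict

def dot(w, f_x):
    # score = sum of weights for the features present in f_x
    return sum(w[feat] * val for feat, val in f_x.items())

def train_perceptron(examples, classes, n_iters=10):
    # one weight vector per class, all initialized to zero
    w = {c: defaultdict(float) for c in classes}
    for _ in range(n_iters):
        for f_x, y in examples:
            z = max(classes, key=lambda c: dot(w[c], f_x))  # argmax_z w_z . f(x)
            if z != y:  # mistake: punish the predicted class, reward the gold class
                for feat, val in f_x.items():
                    w[z][feat] -= val
                    w[y][feat] += val
    return w

def predict(w, f_x):
    return max(w, key=lambda c: dot(w[c], f_x))

w = train_perceptron(
    [({"games": 5, "Lakers": 4}, "Sports"),
     ({"democrat": 3, "republican": 2}, "Politics")],
    classes=["Sports", "Politics", "Science"])
print(predict(w, {"republican": 1}))  # Politics
```

After a mistake on the Politics document, the political-vocabulary features get positive weight for Politics and negative weight for the wrongly predicted class, so a new document with that vocabulary scores highest for Politics.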
13
Shift-Reduce Dependency Parsing
• Two main data structures
  • Stack S (initially empty)
  • Queue Q (initialized to contain each word in the input sentence)
• Two types of actions
  • Shift: removes a word from Q, pushes it onto S
  • Reduce: pops two items from S, pushes a new item onto S
    • The new item is a tree that contains the two popped items
• This can be applied to either dependencies (Nivre, 2004) or constituents (Sagae & Lavie, 2005)
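A minimal sketch of this machinery in Python, for labeled dependencies. The action encoding is an assumption based on the worked example later in the deck: REDUCE-RIGHT-label pops two items and makes the top (right) one the head, REDUCE-LEFT-label makes the lower (left) one the head.

```python
# Shift-reduce dependency parsing: execute a given action sequence.
def parse(words, actions):
    stack, queue = [], list(words)
    arcs = []  # (head, dependent, label) triples
    for action in actions:
        if action == "SHIFT":
            stack.append(queue.pop(0))  # next word moves from Q onto S
        elif action.startswith("REDUCE-RIGHT-"):
            right, left = stack.pop(), stack.pop()
            arcs.append((right, left, action.rsplit("-", 1)[1]))  # right is head
            stack.append(right)
        elif action.startswith("REDUCE-LEFT-"):
            right, left = stack.pop(), stack.pop()
            arcs.append((left, right, action.rsplit("-", 1)[1]))  # left is head
            stack.append(left)
    return arcs

arcs = parse(["He", "likes", "fish"],
             ["SHIFT", "SHIFT", "REDUCE-RIGHT-SUBJ", "SHIFT", "REDUCE-LEFT-OBJ"])
# arcs: [("likes", "He", "SUBJ"), ("likes", "fish", "OBJ")]
```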
14
Shift

[Diagram: before SHIFT, the stack ends in "Under a proposal …" (with a PMOD arc) and the input string begins "to expand IRAs a …". A shift action removes the next token from the input list and pushes this new item onto the stack.]
15
Reduce

[Diagram: before REDUCE, the top two stack items are "Under a proposal …" (with a PMOD arc) and "to expand", and the input begins "IRAs a $2000 …". A REDUCE-RIGHT-VMOD action pops these two items and pushes a new item combining them under a VMOD arc.]
16
STACK  QUEUE

He likes fish

Parser actions: SHIFT, SHIFT, REDUCE-RIGHT-SUBJ, SHIFT, REDUCE-LEFT-OBJ

[Diagram: applying the actions yields the arcs likes → He (SUBJ) and likes → fish (OBJ)]
17
Choosing Parser Actions
• No grammar, no action table
• Learn to associate stack/queue configurations with appropriate parser actions
• Classifier
  • Treated as a black box
  • Perceptron, SVM, maximum entropy, memory-based learning, etc.
  • Features: top two items on the stack, next input token, context, lookahead, …
  • Classes: parser actions
18
STACK: He likes    QUEUE: fish

Features:
stack(0) = likes    stack(0).POS = VBZ
stack(1) = He       stack(1).POS = PRP
stack(2) = 0        stack(2).POS = 0
queue(0) = fish     queue(0).POS = NN
queue(1) = 0        queue(1).POS = 0
queue(2) = 0        queue(2).POS = 0
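A sketch of how these features might be extracted from a parser configuration. The (word, POS) tuple representation and the function name are illustrative assumptions; "0" stands for a missing position, as on the slide.

```python
# Extract classifier features from the top of the stack and front of the queue.
def features(stack, queue, depth=3):
    feats = {}
    for i in range(depth):
        # stack(0) is the top of the stack, so index from the end
        word, pos = stack[-1 - i] if i < len(stack) else ("0", "0")
        feats[f"stack({i})"] = word
        feats[f"stack({i}).POS"] = pos
        word, pos = queue[i] if i < len(queue) else ("0", "0")
        feats[f"queue({i})"] = word
        feats[f"queue({i}).POS"] = pos
    return feats

f = features([("He", "PRP"), ("likes", "VBZ")], [("fish", "NN")])
# f["stack(0)"] == "likes", f["stack(1).POS"] == "PRP", f["queue(0)"] == "fish"
```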
19
STACK: He likes    QUEUE: fish
(same features as above)

Class: Reduce-Right-SUBJ
20
[Diagram, repeated over three slides: given these features, the classifier predicts the class Reduce-Right-SUBJ; the parser pops He and likes from the stack, adds the arc likes → He (SUBJ), and pushes the combined item back onto the stack, with fish still in the queue.]
23
Accurate Parsing with Greedy Search
• Experiments:
  • WSJ Penn Treebank (1M words of WSJ text)
  • Accuracy: ~90% (unlabeled dependency links)
• Other languages (CoNLL 2006 and 2007 shared tasks)
  • Arabic, Basque, Chinese, Czech, Japanese, Greek, Hungarian, Turkish, …
  • About 75% to 92%
• Good accuracy, fast (linear time), easy to implement!
24
Maximum Spanning Tree Parsing (McDonald et al., 2005)
• A dependency tree is a graph (obviously)
  • Words are vertices, dependency links are edges
• Imagine instead a fully connected weighted directed graph
  • Each weight is the score for the dependency link
  • Each score is independent of the other dependencies: an edge-factored model
• Find the Maximum Spanning Tree
  • The score for the tree is the sum of the scores of its individual dependencies
• How are edge weights determined?
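For intuition, here is a brute-force sketch of edge-factored MST decoding for short sentences: score every head assignment, keep those that form a tree rooted at node 0, and return the highest-scoring one. Real parsers use the Chu-Liu/Edmonds algorithm instead, and the scores below are invented for illustration.

```python
# Brute-force maximum spanning tree over a dense directed score table.
from itertools import product

def is_tree(heads):
    # heads[d] is the head of word d (words are 1-indexed, 0 = root);
    # valid iff every word reaches the root with no cycles
    for i in heads:
        seen, node = set(), i
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node]
    return True

def mst(n, score):
    # score[(h, d)] = weight of the edge from head h to dependent d
    best, best_heads = float("-inf"), None
    for assignment in product(range(n + 1), repeat=n):
        heads = {d: h for d, h in enumerate(assignment, start=1)}
        if any(h == d for d, h in heads.items()) or not is_tree(heads):
            continue
        total = sum(score[(h, d)] for d, h in heads.items())  # edge-factored sum
        if total > best:
            best, best_heads = total, heads
    return best_heads

# "I ate a sandwich": 1=I, 2=ate, 3=a, 4=sandwich; illustrative scores
score = {(h, d): -10 for h in range(5) for d in range(1, 5)}
score.update({(0, 2): 13, (2, 1): 12, (2, 4): 57, (4, 3): 81})
print(mst(4, score))  # {1: 2, 2: 0, 3: 4, 4: 2}
```

The highest-scoring tree attaches ate to the root, I and sandwich to ate, and a to sandwich, exactly because those four edge scores dominate every alternative.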
25
[Diagram: nodes 0 (root), 1 (I), 2 (ate), 3 (a), 4 (sandwich) for the sentence "I ate a sandwich"]
26
[Diagram: the fully connected graph over the same nodes, with a score on every directed edge (weights such as −11, 2, 12, −8, 81, −20, 57, −29, 3, 3, −33, 95, 13)]
27
[Diagram: the same weighted graph with one edge score changed (95 → −15), which changes the maximum spanning tree]
28
Structured Classification
• x is a sentence, G is a dependency tree, f(G) is a vector of features for the entire tree
• Features:
  h(ate):d(sandwich)    hPOS(VBD):dPOS(NN)
  h(ate):d(I)           hPOS(VBD):dPOS(PRP)
  h(sandwich):d(a)      hPOS(NN):dPOS(DT)
  hPOS(VBD)   hPOS(NN)   dPOS(NN)
  dPOS(DT)    dPOS(NN)   dPOS(PRP)
  h(ate)   h(sandwich)   d(sandwich)
  … (many more)
• To assign edge weights, we learn a feature weight vector w
29
Structured Perceptron
• Learn a vector of feature weights w

  w = 0
  for N iterations:
      for each training example (xi, Gi):
          G′i = argmaxG′ ∈ GEN(xi) w · f(G′)
          if G′i ≠ Gi:
              w = w + f(Gi) − f(G′i)

• The same as before, but to find the argmax we use MST, since each G is a tree (which also contains the corresponding input x). If G′i is not the right tree, we update the feature vector.
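A runnable sketch of this training loop, with GEN enumerated by brute force over all dependency trees of a short sentence (as a stand-in for MST decoding) and edge-factored head:dependent word-pair features. The training sentence and its gold tree are invented for illustration.

```python
# Structured perceptron with brute-force decoding over all trees.
from collections import Counter
from itertools import product

def all_trees(n):
    # yield every head assignment over words 1..n (head 0 = root) that
    # forms a tree: no self-loops, every word reaches the root, no cycles
    for assignment in product(range(n + 1), repeat=n):
        heads = {d: h for d, h in enumerate(assignment, start=1)}
        if any(h == d for d, h in heads.items()):
            continue
        ok = True
        for i in heads:
            seen, node = set(), i
            while node != 0:
                if node in seen:
                    ok = False
                    break
                seen.add(node)
                node = heads[node]
        if ok:
            yield heads

def f(words, heads):
    # edge-factored features: one "h(head):d(dependent)" count per arc
    return Counter(f"h({'ROOT' if h == 0 else words[h-1]}):d({words[d-1]})"
                   for d, h in heads.items())

def train(examples, n_iters=5):
    w = Counter()
    for _ in range(n_iters):
        for words, gold in examples:
            # argmax over GEN(x): the highest-scoring tree under w
            pred = max(all_trees(len(words)),
                       key=lambda t: sum(w[k] * v for k, v in f(words, t).items()))
            if pred != gold:
                w.update(f(words, gold))    # w = w + f(G_i)
                w.subtract(f(words, pred))  # w = w - f(G'_i)
    return w

gold = {1: 2, 2: 0, 3: 4, 4: 2}  # ate is the root; I, sandwich attach to ate; a to sandwich
w = train([(["I", "ate", "a", "sandwich"], gold)])
```

After one mistake-driven update, the gold tree's arcs carry positive weight and the wrongly predicted arcs carry negative weight, so decoding with the learned w recovers the gold tree.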
30
Question: are there trees that an MST parser can find, but a shift-reduce parser* can't? (*the shift-reduce parser as described on slides 13-19)
31
Accurate Parsing with Edge-Factored Models
• The Maximum Spanning Tree algorithm for directed trees (Chu & Liu, 1965; Edmonds, 1967) runs in quadratic time
  • Finds the best out of exponentially many trees: exact inference!
  • Edge-factored: each dependency link is considered independently of the others
• Compare to shift-reduce parsing
  • Greedy inference
  • Rich set of features, including partially built trees
• McDonald and Nivre (2007) show that shift-reduce and MST parsing achieve similar accuracy, but have different strengths
32
Parser Ensembles
• By using different types of classifiers and algorithms, we get several different parsers
• Ensemble idea: combine the output of several parsers to obtain a single, more accurate result

[Diagram: the sentence "I like cheese" is fed to Parser A, Parser B, and Parser C; their three dependency trees are combined into a single output tree]
33
Parser Ensembles with Maximum Spanning Trees (Sagae and Lavie, 2006)
• First, build a graph
  • Create a node for each word in the input sentence (plus one extra "root" node)
  • Each dependency proposed by any of the parsers is a weighted edge
  • If multiple parsers propose the same dependency, add weight to the corresponding edge
• Then, simply find the MST
  • Maximizes the votes
  • The structure is guaranteed to be a dependency tree
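The vote-counting step can be sketched as follows. The three parser outputs are invented for illustration; the resulting vote-weighted graph would then be handed to an MST decoder like the one sketched on the earlier slides.

```python
# Build the ensemble graph: one vote per parser per proposed dependency edge.
from collections import Counter

def vote_graph(parser_outputs):
    # each output maps dependent -> head (0 = root)
    votes = Counter()
    for heads in parser_outputs:
        for dep, head in heads.items():
            votes[(head, dep)] += 1  # same edge from several parsers accumulates weight
    return votes

# "I ate a sandwich": parsers A and B agree; parser C attaches "a" differently
parser_a = {1: 2, 2: 0, 3: 4, 4: 2}
parser_b = {1: 2, 2: 0, 3: 4, 4: 2}
parser_c = {1: 2, 2: 0, 3: 2, 4: 2}
votes = vote_graph([parser_a, parser_b, parser_c])
# votes[(4, 3)] == 2 while votes[(2, 3)] == 1, so the MST keeps the majority edge
```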
34
[Diagram: nodes 0 (root), 1 (I), 2 (ate), 3 (a), 4 (sandwich) for "I ate a sandwich"]
35
[Diagram repeated: candidate dependency edges are added to the graph]
36
[Diagram: edges proposed by Parser A, Parser B, and Parser C, drawn over the same nodes; shared proposals accumulate weight]
37
[Diagram repeated: the vote-weighted graph combining all three parsers' edges]
38
[Diagram repeated: the maximum spanning tree of the vote-weighted graph is the final ensemble parse]
39
MST Parser Ensembles Are Very Accurate
• Highest accuracy in the CoNLL 2007 shared task on multilingual dependency parsing (a parser bake-off with 22 teams)
  • Nilsson et al. (2007); Sagae and Tsujii (2007)
• Improvement depends on the selection of parsers for the ensemble
  • With four parsers with accuracy between 89% and 91%, ensemble accuracy = 92.7%