Transition-based Dependency Parsing with Selectional Branching

Presented at the 4th Workshop on Statistical Parsing of Morphologically Rich Languages, October 18th, 2013

Jinho D. Choi, University of Massachusetts Amherst


DESCRIPTION

We present a novel approach, called selectional branching, which uses confidence estimates to decide when to employ a beam, providing the accuracy of beam search at speeds close to those of a greedy transition-based dependency parser. Selectional branching is guaranteed to perform fewer transitions than beam search yet parses just as accurately. We also present a new transition-based dependency parsing algorithm that runs in O(n) time for projective parsing and in expected linear time for non-projective parsing. With the standard setup, our parser shows an unlabeled attachment score of 92.96% and a parsing speed of 9 milliseconds per sentence, which is faster and more accurate than the current state-of-the-art transition-based parser that uses beam search.

TRANSCRIPT

Page 1: Transition-based Dependency Parsing with Selectional Branching

Transition-based Dependency Parsing with Selectional Branching

Presented at the 4th Workshop on Statistical Parsing of Morphologically Rich Languages

October 18th, 2013

Jinho D. Choi, University of Massachusetts Amherst

Page 2: Transition-based Dependency Parsing with Selectional Branching

Greedy vs. Non-greedy Parsing

• Greedy parsing

- Considers only one head for each token.

- Generates one parse tree per sentence.

- e.g., transition-based parsing (2 ms / sentence).

• Non-greedy parsing

- Considers multiple heads for each token.

- Generates multiple parse trees per sentence.

- e.g., transition-based parsing with beam search, graph-based parsing, linear programming, dual decomposition (≥ 93% accuracy).


Page 3: Transition-based Dependency Parsing with Selectional Branching

Motivation

• How often do we need non-greedy parsing?

- Our greedy parser performs as accurately as our non-greedy parser about 64% of the time.

- This gap narrows further when the parsers are evaluated on non-benchmark data (e.g., tweets, chats, blogs).

• Many applications are time-sensitive.

- Some applications need at least one complete parse tree ready within a limited time period (e.g., search, dialog, Q/A).

• Hard sentences are hard for any parser!

- Considering more heads does not always guarantee more accurate parse results.


Page 4: Transition-based Dependency Parsing with Selectional Branching

Transition-based Parsing

• Transition-based dependency parsing (greedy)

- Considers one transition for each parsing state (see the sketch below).

[Diagram, built up incrementally across Pages 4-12: from a parsing state S, the parser scores the possible transitions t1, ..., tL, applies the highest-scoring transition t′ to reach the next state S, and repeats until a terminal state T is reached.]

What if t′ is not the correct transition?
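To make the loop above concrete, here is a minimal sketch of greedy transition-based decoding. The helpers `transitions`, `score`, `apply_transition`, and `is_terminal` are hypothetical stand-ins for a concrete transition system and classifier, not part of the paper.

```python
def greedy_parse(state, transitions, score, apply_transition, is_terminal):
    """Greedy decoding: at each parsing state, commit to the single best transition."""
    while not is_terminal(state):
        # Score every candidate transition t1, ..., tL for the current state.
        t_best = max(transitions(state), key=lambda t: score(state, t))
        # Apply t' and move on; there is no way to recover if t' was wrong.
        state = apply_transition(state, t_best)
    return state  # the terminal state T holds the finished parse tree
```

This is exactly why the slide's closing question matters: one wrong t′ is permanent, which motivates beam search and, later, selectional branching.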

Page 14: Transition-based Dependency Parsing with Selectional Branching

Transition-based Parsing

• Transition-based dependency parsing with beam search

- Considers the b highest-scoring transitions for each block of parsing states (see the sketch below).

[Diagram, built up incrementally across Pages 14-23: from each of the b states S1, ..., Sb in the beam, the parser scores the possible transitions t1, ..., tL, keeps the b highest-scoring successor states as the new beam, and repeats until b terminal states T1, ..., Tb are reached.]
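For contrast, a minimal beam-search sketch under the same hypothetical helpers as the greedy sketch; sequence scores are accumulated so that the b best partial sequences survive each step.

```python
import heapq

def beam_parse(init_state, transitions, score, apply_transition, is_terminal, b):
    """Beam-search decoding: keep the b highest-scoring parsing states per step."""
    beam = [(0.0, init_state)]  # (cumulative score, state)
    while not all(is_terminal(s) for _, s in beam):
        candidates = []
        for total, s in beam:
            if is_terminal(s):
                candidates.append((total, s))  # finished sequences stay in the beam
                continue
            for t in transitions(s):
                candidates.append((total + score(s, t), apply_transition(s, t)))
        beam = heapq.nlargest(b, candidates, key=lambda c: c[0])  # prune to b states
    return max(beam, key=lambda c: c[0])[1]  # best of the b terminal states
```

Note the cost: every step expands up to b × L candidates whether the sentence needs the extra hypotheses or not, which is the issue the next slide raises.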

Page 24: Transition-based Dependency Parsing with Selectional Branching

Selectional Branching

• Issues with beam search

- Generates a fixed number of parse trees no matter how easy or hard the input sentence is.

- Is it possible to dynamically adjust the beam size for each individual sentence?

• Selectional branching (see the sketch after this list)

- The one-best transition sequence is found by a greedy parser.

- k-best state-transition pairs are collected for each low-confidence transition used to generate the one-best sequence.

- New transition sequences are generated from the b−1 highest-scoring state-transition pairs in the collection.

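A sketch of the whole procedure described in the bullets above, with the same hypothetical helpers as the earlier sketches. `k_best` plays the role of the margin-based classifier Ck introduced on Page 48 (best transition first, and returning more than one transition only at low-confidence states); `pair_score` and `seq_score` are illustrative scoring functions, not the paper's.

```python
def selectional_branching(init_state, k_best, apply_transition, is_terminal,
                          pair_score, seq_score, b):
    """Run the greedy parser once, remember the runner-up transitions at
    low-confidence states, then branch only from the best b - 1 of them."""
    # 1. One-best sequence, collecting alternatives in the list lam (lambda).
    lam, state = [], init_state
    while not is_terminal(state):
        preds = k_best(state)          # k-best transitions within the margin
        for t_alt in preds[1:]:        # len(preds) > 1 means low confidence
            lam.append((state, t_alt))
        state = apply_transition(state, preds[0])
    sequences = [state]

    # 2. Branch greedily from the b - 1 highest-scoring (state, transition) pairs.
    lam.sort(key=pair_score, reverse=True)
    for s, t in lam[:b - 1]:
        branch = apply_transition(s, t)
        while not is_terminal(branch):
            branch = apply_transition(branch, k_best(branch)[0])
        sequences.append(branch)

    # 3. Build the tree from the highest-scoring completed sequence.
    return max(sequences, key=seq_score)
```

For this to work, parsing states must be persistent: branching from a stored state must not be affected by transitions applied later, which is implicit in the paper's description as well.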

Page 25: Transition-based Dependency Parsing with Selectional Branching

Selectional Branching

[Diagram, built up incrementally across Pages 25-40: the greedy parser generates the one-best sequence S1 → S2 → ... → Sn → T, scoring the transitions ti1, ..., tiL at each state Si and applying the best one (t′11 at S1, t′21 at S2, and so on). Whenever the applied transition is low-confidence, the runner-up pairs for that state, e.g., (S1, t′12), ..., (S1, t′1k) and (S2, t′22), ..., (S2, t′2k), are added to a list λ.]

Pick the b−1 pairs with the highest scores.

For our experiments, k = 2 is used.

Page 41: Transition-based Dependency Parsing with Selectional Branching

Selectional Branching

[Diagram, built up incrementally across Pages 41-47: with λ = {(S1, t′12), (S2, t′22), (S3, t′32)}, each branched sequence starts by applying its stored transition to its stored state: t′12 to S1 yields S2 → ... → Sa → T, t′22 to S2 yields S3 → ... → Sb → T, and t′32 to S3 yields S4 → ... → Sc → T.]

Carries on parsing states from the one-best sequence.

Guaranteed to generate no more trees than beam search, and fewer when |λ| + 1 < b.

Page 48: Transition-based Dependency Parsing with Selectional Branching

Low Confidence Transition

• Let C1 be a classifier that finds the highest-scoring transition given the parsing state x.

• Let Ck be a classifier that finds the k highest-scoring transitions given the parsing state x and the margin m.

• The highest-scoring transition C1(x) is low-confidence if |Ck(x, m)| > 1.

[Paper excerpt shown on the slide:]

... higher parsing accuracy than the current state-of-the-art transition-based parser using beam search, and performs about 3 times faster.

3.2 Branching strategy

Figure 2 shows an overview of our branching strategy. sij represents a parsing state, where i is the index of the current transition sequence and j is the index of the current parsing state (e.g., s12 represents the 2nd parsing state in the 1st transition sequence). pkj represents the k'th best prediction (in our case, a predicted transition) given s1j (e.g., p21 is the 2nd-best prediction given s11).

[Figure 2: An overview of our branching strategy, showing the states s11, ..., sdt, the predictions p11, p12, p21, p22, ..., and the sequences T1, T2, T3, ..., Td. Each sequence Ti>1 branches from T1.]

Initially, the one-best sequence T1 = [s11, ..., s1t] is generated by a greedy parser. While generating T1, the parser adds tuples (s1j, p2j), ..., (s1j, pkj) to a list λ for each low-confidence prediction p1j given s1j.⁴ Then, new transition sequences are generated by using the b highest-scoring predictions in λ, where b is the beam size. If |λ| < b, all predictions in λ are used. The same greedy parser is used to generate these new sequences, although it now starts with s1j instead of an initial parsing state, applies pkj to s1j, and performs further transitions. Once all transition sequences are generated, a parse tree is built from the sequence with the highest score.

For our experiments, we set k = 2, which gave noticeably more accurate results than k = 1. We also experimented with k > 2, which did not show significant improvement over k = 2. Note that assigning a greater k may increase |λ| but not the total number of transition sequences generated, which is restricted by the beam size, b. Since each sequence Ti>1 branches from T1, selectional branching performs fewer transitions than beam search: at least d(d−1)/2 transitions are inherited from T1, where d = min(b, |λ| + 1); thus, it performs that many fewer transitions than beam search (see the lower left triangle in Figure 2). Furthermore, selectional branching generates d sequences, where d is proportional to the number of low-confidence predictions made by T1. To sum up, selectional branching generates the same number of transition sequences as beam search or fewer, and each sequence Ti>1 performs fewer transitions than T1; thus, it is generally faster than beam search given the same beam size.

⁴ λ is initially empty, which is hidden in Figure 2.
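As a quick worked instance of the bound above (my arithmetic, not an example from the paper):

```python
def min_inherited_transitions(b, lam_size):
    """Lower bound on transitions that branched sequences inherit from T1:
    d(d - 1) / 2 with d = min(b, |lambda| + 1)."""
    d = min(b, lam_size + 1)
    return d * (d - 1) // 2

# With beam size b = 16 and 7 low-confidence predictions (|lambda| = 7),
# d = min(16, 8) = 8, so at least 8 * 7 / 2 = 28 transitions are inherited
# from T1 rather than being recomputed as they would be under beam search.
assert min_inherited_transitions(16, 7) == 28
```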

3.3 Finding low confidence predictions

For each parsing state sij, a prediction is made by generating a feature vector xij ∈ X, feeding it into a classifier C1 that uses a feature map Φ(x, y) and a weight vector w to measure a score for each label y ∈ Y, and choosing the label with the highest score. When there is a tie between labels with the highest score, the first one is chosen. This can be expressed as a logistic regression:

C1(x) = argmax_{y ∈ Y} f(x, y)

f(x, y) = exp(w · Φ(x, y)) / Σ_{y′ ∈ Y} exp(w · Φ(x, y′))

To find low-confidence predictions, we use the margins (score differences) between the best prediction and the other predictions. If all margins are greater than a threshold, the best prediction is considered highly confident; otherwise, it is not. Given this analogy, the k-best predictions can be found as follows (m ≥ 0 is a margin threshold):

Ck(x, m) = K-argmax_{y ∈ Y} f(x, y)  s.t.  f(x, C1(x)) − f(x, y) ≤ m

'K-argmax' returns a set of k′ labels whose margins to C1(x) are smaller than any other label's margin to C1(x) and no greater than m, where k′ ≤ k. When m = 0, it returns only the set of highest-scoring labels, including C1(x). When m = ∞, it returns the set of all labels. Given this, a prediction is considered not confident if |Ck(x, m)| > 1.
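A direct transcription of C1, f, and Ck into code (a sketch: the feature map `phi`, the weight vector `w`, and the label set are illustrative placeholders, not the paper's implementation):

```python
import math

def softmax_scores(w, phi, x, labels):
    """f(x, y) = exp(w . phi(x, y)) / sum over y' of exp(w . phi(x, y'))."""
    raw = {y: sum(wi * vi for wi, vi in zip(w, phi(x, y))) for y in labels}
    z = sum(math.exp(s) for s in raw.values())
    return {y: math.exp(s) / z for y, s in raw.items()}

def k_argmax(scores, k, m):
    """Ck(x, m): up to k labels whose margin to the best label is at most m."""
    best = max(scores.values())
    within = [y for y, s in scores.items() if best - s <= m]
    within.sort(key=lambda y: scores[y], reverse=True)  # C1(x) comes first
    return within[:k]

def is_low_confidence(scores, k, m):
    """The best prediction is low-confidence if more than one label survives."""
    return len(k_argmax(scores, k, m)) > 1
```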

3.4 Finding the best transition sequence

Let Pi be a list of all predictions that lead to generating a transition sequence Ti. The predictions in Pi are either inherited from T1 or made specifically for Ti. In Figure 2, P3 consists of p11 as its first prediction, p22 as its second prediction, and ...


Page 49: Transition-based Dependency Parsing with Selectional Branching

Experiments

• Parsing algorithm (Choi & McCallum, 2013)

- Hybrid between Nivre’s arc-eager and list-based algorithms.

- Projective parsing: O(n).

- Non-projective parsing: expected linear time.

• Features

- Rich non-local features from Zhang & Nivre, 2011.

- For languages with coarse-grained POS tags, feature templates using fine-grained POS tags are replicated.

- For languages with morphological features, morphologies of σ[0] and β[0] are used as unigram features (illustrated in the sketch below).

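As an illustration of the last bullet, a hypothetical extraction of morphological unigram features for σ[0] (top of the stack) and β[0] (front of the buffer); the token representation and feature-string format are invented for this sketch.

```python
from dataclasses import dataclass

@dataclass
class Token:
    form: str
    morph: tuple  # e.g., ("Case=Nom", "Num=Sing"); format invented for the sketch

def morph_unigram_features(stack, buffer):
    """Unigram features from the morphologies of sigma[0] (top of the stack)
    and beta[0] (front of the buffer)."""
    features = []
    if stack:
        features += [f"s0.morph={m}" for m in stack[-1].morph]
    if buffer:
        features += [f"b0.morph={m}" for m in buffer[0].morph]
    return features
```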

Page 50: Transition-based Dependency Parsing with Selectional Branching

Number of Transitions

• # of transitions performed with respect to beam sizes.

[Paper excerpt shown on the slide:]

... first bootstrapping, and the range 10-14 shows results of 5 iterations during the second bootstrapping. Thus, the number of bootstrap iterations is 2, where each bootstrapping takes a different number of AdaGrad iterations. Using an Intel Xeon 2.57GHz machine, it takes less than 40 minutes to train the entire Penn Treebank, which includes times for IO, feature extraction, and bootstrapping.

[Figure 5: The total number of transitions performed during decoding with respect to beam sizes on the English development set. x-axis: beam size (1, 2, 4, 8, 16, 32, 64, 80); y-axis: number of transitions (0 to 1,200,000).]

Figure 5 shows the total number of transitions performed during decoding with respect to beam sizes on the English development set (1,700 sentences, 40,117 tokens). With selectional branching, the number of transitions grows logarithmically as the beam size increases, whereas it would have grown linearly if beam search were used. We also checked how often the one-best sequence is chosen as the final sequence during decoding. Out of 1,700 sentences, the one-best sequences are chosen for 1,095 sentences. This implies that about 64% of the time, our greedy parser performs as accurately as our non-greedy parser using selectional branching.

For the other languages, we use the same values as English for α, ρ, m, and b; only the AdaGrad and bootstrap iterations are tuned on the development sets of the other languages.

4.4 Projective parsing experiments

Before parsing, POS tags were assigned to the training set by using 20-way jackknifing. For the automatic generation of POS tags, we used the domain-specific model of Choi and Palmer (2012a)'s tagger, which gave 97.5% accuracy on the English evaluation set (0.2% higher than Collins (2002)'s tagger).

Table 4 shows a comparison between past and current state-of-the-art parsers and our approach. The first block shows results from transition-based dependency parsers using beam search. The second block shows results from other kinds of parsing approaches (e.g., graph-based parsing, ensemble parsing, linear programming, dual decomposition). The third block shows results from parsers using external data. The last block shows results from our approach. The Time column shows how many seconds per sentence each parser takes.⁷

Approach                        UAS    LAS    Time
Zhang and Clark (2008)          92.1
Huang and Sagae (2010)          92.1          0.04
Zhang and Nivre (2011)          92.9   91.8   0.03
Bohnet and Nivre (2012)         93.38  92.44  0.4
McDonald et al. (2005)          90.9
McDonald and Pereira (2006)     91.5
Sagae and Lavie (2006)          92.7
Koo and Collins (2010)          93.04
Zhang and McDonald (2012)       93.06  91.86
Martins et al. (2010)           93.26
Rush et al. (2010)              93.8
Koo et al. (2008)               93.16
Carreras et al. (2008)          93.54
Bohnet and Nivre (2012)         93.67  92.68
Suzuki et al. (2009)            93.79
bt = 80, bd = 80, m = 0.88      92.96  91.93  0.009
bt = 80, bd = 64, m = 0.88      92.96  91.93  0.009
bt = 80, bd = 32, m = 0.88      92.96  91.94  0.009
bt = 80, bd = 16, m = 0.88      92.96  91.94  0.008
bt = 80, bd = 8, m = 0.88       92.89  91.87  0.006
bt = 80, bd = 4, m = 0.88       92.76  91.76  0.004
bt = 80, bd = 2, m = 0.88       92.56  91.54  0.003
bt = 80, bd = 1, m = 0.88       92.26  91.25  0.002
bt = 1, bd = 1, m = 0.88        92.06  91.05  0.002

Table 4: Parsing accuracies and speeds on the English evaluation set, excluding tokens containing only punctuation. bt and bd indicate the beam sizes used during training and decoding, respectively. UAS: unlabeled attachment score, LAS: labeled attachment score, Time: seconds per sentence.

For evaluation, we use the model trained with b = 80 and m = 0.88, which is the best setting found during development. Our parser shows higher accuracy than Zhang and Nivre (2011), which is the current state-of-the-art transition-based parser that uses beam search. Bohnet and Nivre (2012)'s transition-based system jointly performs POS tagging and dependency parsing, which shows higher accuracy than ours. Our parser gives comparable accuracy to Koo and Collins (2010), which is a 3rd-order graph-based parsing approach. In terms of speed, our parser outperforms all other transition-based parsers; it takes about 9 milliseconds per sentence.

⁷ Dhillon et al. (2012) and Rush and Petrov (2012) have also shown good results on this data, but they are excluded from our comparison because they use different kinds of constituent-to-dependency conversion methods.

Page 51: Transition-based Dependency Parsing with Selectional Branching

Projective Parsing

• The benchmark setup using WSJ.

Approach           UAS    LAS    Time

bt = 80, bd = 80 92.96 91.93 0.009

bt = 80, bd = 64 92.96 91.93 0.009

bt = 80, bd = 32 92.96 91.94 0.009

bt = 80, bd = 16 92.96 91.94 0.008

bt = 80, bd = 8 92.89 91.87 0.006

bt = 80, bd = 4 92.76 91.76 0.004

bt = 80, bd = 2 92.56 91.54 0.003

bt = 80, bd = 1 92.26 91.25 0.002

bt = 1, bd = 1 92.06 91.05 0.002


Page 52: Transition-based Dependency Parsing with Selectional Branching

Projective Parsing

• The benchmark setup using WSJ.

Approach                     UAS    LAS    Time

bt = 80, bd = 80 92.96 91.93 0.009

Zhang & Clark, 2008 92.1

Huang & Sagae, 2010 92.1 0.04

Zhang & Nivre, 2011 92.9 91.8 0.03

Bohnet & Nivre, 2012 93.38 92.44 0.4

McDonald et al., 2005 90.9

McDonald & Pereira, 2006 91.5

Sagae & Lavie, 2006 92.7

Koo & Collins, 2010 93.04

Zhang & McDonald, 2012 93.06 91.86

Martins et al., 2010 93.26

Rush et al., 2010 93.8


Page 53: Transition-based Dependency Parsing with Selectional Branching

Non-projective Parsing

• CoNLL-X shared task data.

Approach                   Danish        Dutch         Slovene       Swedish
                           LAS    UAS    LAS    UAS    LAS    UAS    LAS    UAS

bt = 80, bd = 80 87.27 91.36 82.45 85.33 77.46 84.65 86.80 91.36

bt = 80, bd = 1 86.75 91.04 80.75 83.59 75.66 83.29 86.32 91.12

Nivre et al., 2006 84.77 89.80 78.59 81.35 70.30 78.72 84.58 89.50

McDonald et al., 2006 84.79 90.58 79.19 83.57 73.44 83.17 82.55 88.93

Nivre, 2009 84.2 - - - 75.2 - - -

F.-Gonz. & G.-Rodr., 2012 85.17 90.10 - - - - 83.55 89.30

Nivre & McDonald, 2008 86.67 - 81.63 - 75.94 - 84.66 -

Martins et al., 2010 - 91.50 - 84.91 - 85.53 - 89.80


Page 54: Transition-based Dependency Parsing with Selectional Branching

SPMRL 2013 Shared Task

• Baseline results provided by ClearNLP.

Language     5K                      Full
             LAS    UAS    LS        LAS    UAS    LS
Arabic       81.72  84.46  93.41     84.19  86.48  94.43
Basque       78.01  84.62  82.71     79.16  85.32  83.63
French       73.39  85.30  81.42     74.51  86.41  82.00
German       82.58  85.36  90.49     86.73  88.80  92.95
Hebrew       75.09  81.74  82.84     -      -      -
Hungarian    81.98  86.09  88.26     82.68  86.56  88.80
Korean       76.28  80.39  87.32     83.55  86.82  92.39
Polish       80.64  88.49  86.47     81.12  89.24  86.59
Swedish      80.96  86.48  85.10     -      -      -


Page 55: Transition-based Dependency Parsing with Selectional Branching

Conclusion

• Selectional branching

- Uses confidence estimates to decide when to employ a beam.

- Shows accuracy comparable to traditional beam search.

- Runs faster than any other non-greedy parsing approach.

• ClearNLP

- Provides several NLP tools, including a morphological analyzer, dependency parser, semantic role labeler, etc.

- Webpage: clearnlp.com.
