Transition-based Dependency Parsing with Selectional Branching

Presented at the 4th Workshop on Statistical Parsing of Morphologically Rich Languages, October 18th, 2013

Jinho D. Choi, University of Massachusetts Amherst


DESCRIPTION

We present a novel approach, called selectional branching, which uses confidence estimates to decide when to employ a beam, providing the accuracy of beam search at speeds close to those of a greedy transition-based dependency parser. Selectional branching is guaranteed to perform fewer transitions than beam search yet parses just as accurately. We also present a new transition-based dependency parsing algorithm that runs in O(n) time for projective parsing and in expected linear time for non-projective parsing. With the standard setup, our parser shows an unlabeled attachment score of 92.96% and a parsing speed of 9 milliseconds per sentence, which is faster and more accurate than the current state-of-the-art transition-based parser that uses beam search.

TRANSCRIPT

Page 1: Transition-based Dependency Parsing with Selectional Branching

Transition-based Dependency Parsing with Selectional Branching

Presented at the 4th Workshop on Statistical Parsing of Morphologically Rich Languages

October 18th, 2013

Jinho D. Choi, University of Massachusetts Amherst

Page 2: Transition-based Dependency Parsing with Selectional Branching

Greedy vs. Non-greedy Parsing

• Greedy parsing

- Considers only one head for each token.

- Generates one parse tree per sentence.

- e.g., transition-based parsing (2 ms / sentence).

• Non-greedy parsing

- Considers multiple heads for each token.

- Generates multiple parse trees per sentence.

- e.g., transition-based parsing with beam search, graph-based parsing, linear programming, dual decomposition (≥ 93% accuracy).


Page 3: Transition-based Dependency Parsing with Selectional Branching

Motivation

• How often do we need non-greedy parsing?

- Our greedy parser performs as accurately as our non-greedy parser about 64% of the time.

- This gap narrows further when the parsers are evaluated on non-benchmark data (e.g., tweets, chats, blogs).

• Many applications are time-sensitive.

- Some applications need at least one complete parse tree ready within a limited time period (e.g., search, dialog, Q/A).

• Hard sentences are hard for any parser!

- Considering more heads does not always guarantee more accurate parse results.


Page 4: Transition-based Dependency Parsing with Selectional Branching

Transition-based Parsing

• Transition-based dependency parsing (greedy)

- Considers one transition for each parsing state (see the sketch below).

[Diagram, built up incrementally across Pages 4-12: from a parsing state S, the parser scores the possible transitions t1, ..., tL, applies the highest-scoring transition t′ to reach the next state S, and repeats until a terminal state T is reached.]

What if t′ is not the correct transition?
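To make the loop above concrete, here is a minimal sketch of greedy transition-based decoding. The helpers `transitions`, `score`, `apply_transition`, and `is_terminal` are hypothetical stand-ins for a concrete transition system and classifier, not part of the paper.

```python
def greedy_parse(state, transitions, score, apply_transition, is_terminal):
    """Greedy decoding: at each parsing state, commit to the single best transition."""
    while not is_terminal(state):
        # Score every candidate transition t1, ..., tL for the current state.
        t_best = max(transitions(state), key=lambda t: score(state, t))
        # Apply t' and move on; there is no way to recover if t' was wrong.
        state = apply_transition(state, t_best)
    return state  # the terminal state T holds the finished parse tree
```

This is exactly why the slide's closing question matters: one wrong t′ is permanent, which motivates beam search and, later, selectional branching.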

Page 14: Transition-based Dependency Parsing with Selectional Branching

Transition-based Parsing

• Transition-based dependency parsing with beam search

- Considers the b highest-scoring transitions for each block of parsing states (see the sketch below).

[Diagram, built up incrementally across Pages 14-23: from each of the b states S1, ..., Sb in the beam, the parser scores the possible transitions t1, ..., tL, keeps the b highest-scoring successor states as the new beam, and repeats until b terminal states T1, ..., Tb are reached.]
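For contrast, a minimal beam-search sketch under the same hypothetical helpers as the greedy sketch; sequence scores are accumulated so that the b best partial sequences survive each step.

```python
import heapq

def beam_parse(init_state, transitions, score, apply_transition, is_terminal, b):
    """Beam-search decoding: keep the b highest-scoring parsing states per step."""
    beam = [(0.0, init_state)]  # (cumulative score, state)
    while not all(is_terminal(s) for _, s in beam):
        candidates = []
        for total, s in beam:
            if is_terminal(s):
                candidates.append((total, s))  # finished sequences stay in the beam
                continue
            for t in transitions(s):
                candidates.append((total + score(s, t), apply_transition(s, t)))
        beam = heapq.nlargest(b, candidates, key=lambda c: c[0])  # prune to b states
    return max(beam, key=lambda c: c[0])[1]  # best of the b terminal states
```

Note the cost: every step expands up to b × L candidates whether the sentence needs the extra hypotheses or not, which is the issue the next slide raises.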

Page 24: Transition-based Dependency Parsing with Selectional Branching

Selectional Branching

• Issues with beam search

- Generates a fixed number of parse trees no matter how easy or hard the input sentence is.

- Is it possible to dynamically adjust the beam size for each individual sentence?

• Selectional branching (see the sketch after this list)

- The one-best transition sequence is found by a greedy parser.

- k-best state-transition pairs are collected for each low-confidence transition used to generate the one-best sequence.

- New transition sequences are generated from the b−1 highest-scoring state-transition pairs in the collection.

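A sketch of the whole procedure described in the bullets above, with the same hypothetical helpers as the earlier sketches. `k_best` plays the role of the margin-based classifier Ck introduced on Page 48 (best transition first, and returning more than one transition only at low-confidence states); `pair_score` and `seq_score` are illustrative scoring functions, not the paper's.

```python
def selectional_branching(init_state, k_best, apply_transition, is_terminal,
                          pair_score, seq_score, b):
    """Run the greedy parser once, remember the runner-up transitions at
    low-confidence states, then branch only from the best b - 1 of them."""
    # 1. One-best sequence, collecting alternatives in the list lam (lambda).
    lam, state = [], init_state
    while not is_terminal(state):
        preds = k_best(state)          # k-best transitions within the margin
        for t_alt in preds[1:]:        # len(preds) > 1 means low confidence
            lam.append((state, t_alt))
        state = apply_transition(state, preds[0])
    sequences = [state]

    # 2. Branch greedily from the b - 1 highest-scoring (state, transition) pairs.
    lam.sort(key=pair_score, reverse=True)
    for s, t in lam[:b - 1]:
        branch = apply_transition(s, t)
        while not is_terminal(branch):
            branch = apply_transition(branch, k_best(branch)[0])
        sequences.append(branch)

    # 3. Build the tree from the highest-scoring completed sequence.
    return max(sequences, key=seq_score)
```

For this to work, parsing states must be persistent: branching from a stored state must not be affected by transitions applied later, which is implicit in the paper's description as well.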

Page 25: Transition-based Dependency Parsing with Selectional Branching

Selectional Branching

[Diagram, built up incrementally across Pages 25-40: the greedy parser generates the one-best sequence S1 → S2 → ... → Sn → T, scoring the transitions ti1, ..., tiL at each state Si and applying the best one (t′11 at S1, t′21 at S2, and so on). Whenever the applied transition is low-confidence, the runner-up pairs for that state, e.g., (S1, t′12), ..., (S1, t′1k) and (S2, t′22), ..., (S2, t′2k), are added to a list λ.]

Pick the b−1 pairs with the highest scores.

For our experiments, k = 2 is used.

Page 41: Transition-based Dependency Parsing with Selectional Branching

Selectional Branching

[Diagram, built up incrementally across Pages 41-47: with λ = {(S1, t′12), (S2, t′22), (S3, t′32)}, each branched sequence starts by applying its stored transition to its stored state: t′12 to S1 yields S2 → ... → Sa → T, t′22 to S2 yields S3 → ... → Sb → T, and t′32 to S3 yields S4 → ... → Sc → T.]

Carries on parsing states from the one-best sequence.

Guaranteed to generate no more trees than beam search, and fewer when |λ| + 1 < b.

Page 48: Transition-based Dependency Parsing with Selectional Branching

Low Confidence Transition

• Let C1 be a classifier that finds the highest-scoring transition given the parsing state x.

• Let Ck be a classifier that finds the k highest-scoring transitions given the parsing state x and the margin m.

• The highest-scoring transition C1(x) is low-confidence if |Ck(x, m)| > 1.

[Paper excerpt shown on the slide:]

... higher parsing accuracy than the current state-of-the-art transition-based parser using beam search, and performs about 3 times faster.

3.2 Branching strategy

Figure 2 shows an overview of our branching strategy. sij represents a parsing state, where i is the index of the current transition sequence and j is the index of the current parsing state (e.g., s12 represents the 2nd parsing state in the 1st transition sequence). pkj represents the k'th best prediction (in our case, a predicted transition) given s1j (e.g., p21 is the 2nd-best prediction given s11).

[Figure 2: An overview of our branching strategy, showing the states s11, ..., sdt, the predictions p11, p12, p21, p22, ..., and the sequences T1, T2, T3, ..., Td. Each sequence Ti>1 branches from T1.]

Initially, the one-best sequence T1 = [s11, ..., s1t] is generated by a greedy parser. While generating T1, the parser adds tuples (s1j, p2j), ..., (s1j, pkj) to a list λ for each low-confidence prediction p1j given s1j.⁴ Then, new transition sequences are generated by using the b highest-scoring predictions in λ, where b is the beam size. If |λ| < b, all predictions in λ are used. The same greedy parser is used to generate these new sequences, although it now starts with s1j instead of an initial parsing state, applies pkj to s1j, and performs further transitions. Once all transition sequences are generated, a parse tree is built from the sequence with the highest score.

For our experiments, we set k = 2, which gave noticeably more accurate results than k = 1. We also experimented with k > 2, which did not show significant improvement over k = 2. Note that assigning a greater k may increase |λ| but not the total number of transition sequences generated, which is restricted by the beam size, b. Since each sequence Ti>1 branches from T1, selectional branching performs fewer transitions than beam search: at least d(d−1)/2 transitions are inherited from T1, where d = min(b, |λ| + 1); thus, it performs that many fewer transitions than beam search (see the lower left triangle in Figure 2). Furthermore, selectional branching generates d sequences, where d is proportional to the number of low-confidence predictions made by T1. To sum up, selectional branching generates the same number of transition sequences as beam search or fewer, and each sequence Ti>1 performs fewer transitions than T1; thus, it is generally faster than beam search given the same beam size.

⁴ λ is initially empty, which is hidden in Figure 2.
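As a quick worked instance of the bound above (my arithmetic, not an example from the paper):

```python
def min_inherited_transitions(b, lam_size):
    """Lower bound on transitions that branched sequences inherit from T1:
    d(d - 1) / 2 with d = min(b, |lambda| + 1)."""
    d = min(b, lam_size + 1)
    return d * (d - 1) // 2

# With beam size b = 16 and 7 low-confidence predictions (|lambda| = 7),
# d = min(16, 8) = 8, so at least 8 * 7 / 2 = 28 transitions are inherited
# from T1 rather than being recomputed as they would be under beam search.
assert min_inherited_transitions(16, 7) == 28
```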

3.3 Finding low confidence predictions

For each parsing state sij, a prediction is made by generating a feature vector xij ∈ X, feeding it into a classifier C1 that uses a feature map Φ(x, y) and a weight vector w to measure a score for each label y ∈ Y, and choosing the label with the highest score. When there is a tie between labels with the highest score, the first one is chosen. This can be expressed as a logistic regression:

C1(x) = argmax_{y ∈ Y} f(x, y)

f(x, y) = exp(w · Φ(x, y)) / Σ_{y′ ∈ Y} exp(w · Φ(x, y′))

To find low-confidence predictions, we use the margins (score differences) between the best prediction and the other predictions. If all margins are greater than a threshold, the best prediction is considered highly confident; otherwise, it is not. Given this analogy, the k-best predictions can be found as follows (m ≥ 0 is a margin threshold):

Ck(x, m) = K-argmax_{y ∈ Y} f(x, y)  s.t.  f(x, C1(x)) − f(x, y) ≤ m

'K-argmax' returns a set of k′ labels whose margins to C1(x) are smaller than any other label's margin to C1(x) and no greater than m, where k′ ≤ k. When m = 0, it returns only the set of highest-scoring labels, including C1(x). When m = ∞, it returns the set of all labels. Given this, a prediction is considered not confident if |Ck(x, m)| > 1.
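A direct transcription of C1, f, and Ck into code (a sketch: the feature map `phi`, the weight vector `w`, and the label set are illustrative placeholders, not the paper's implementation):

```python
import math

def softmax_scores(w, phi, x, labels):
    """f(x, y) = exp(w . phi(x, y)) / sum over y' of exp(w . phi(x, y'))."""
    raw = {y: sum(wi * vi for wi, vi in zip(w, phi(x, y))) for y in labels}
    z = sum(math.exp(s) for s in raw.values())
    return {y: math.exp(s) / z for y, s in raw.items()}

def k_argmax(scores, k, m):
    """Ck(x, m): up to k labels whose margin to the best label is at most m."""
    best = max(scores.values())
    within = [y for y, s in scores.items() if best - s <= m]
    within.sort(key=lambda y: scores[y], reverse=True)  # C1(x) comes first
    return within[:k]

def is_low_confidence(scores, k, m):
    """The best prediction is low-confidence if more than one label survives."""
    return len(k_argmax(scores, k, m)) > 1
```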

3.4 Finding the best transition sequence

Let Pi be a list of all predictions that lead to generating a transition sequence Ti. The predictions in Pi are either inherited from T1 or made specifically for Ti. In Figure 2, P3 consists of p11 as its first prediction, p22 as its second prediction, and ...


Page 49: Transition-based Dependency Parsing with Selectional Branching

Experiments

• Parsing algorithm (Choi & McCallum, 2013)

- Hybrid between Nivre’s arc-eager and list-based algorithms.

- Projective parsing: O(n).

- Non-projective parsing: expected linear time.

• Features

- Rich non-local features from Zhang & Nivre, 2011.

- For languages with coarse-grained POS tags, feature templates using fine-grained POS tags are replicated.

- For languages with morphological features, morphologies of σ[0] and β[0] are used as unigram features (illustrated in the sketch below).

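As an illustration of the last bullet, a hypothetical extraction of morphological unigram features for σ[0] (top of the stack) and β[0] (front of the buffer); the token representation and feature-string format are invented for this sketch.

```python
from dataclasses import dataclass

@dataclass
class Token:
    form: str
    morph: tuple  # e.g., ("Case=Nom", "Num=Sing"); format invented for the sketch

def morph_unigram_features(stack, buffer):
    """Unigram features from the morphologies of sigma[0] (top of the stack)
    and beta[0] (front of the buffer)."""
    features = []
    if stack:
        features += [f"s0.morph={m}" for m in stack[-1].morph]
    if buffer:
        features += [f"b0.morph={m}" for m in buffer[0].morph]
    return features
```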

Page 50: Transition-based Dependency Parsing with Selectional Branching

Number of Transitions

• # of transitions performed with respect to beam sizes.

[Paper excerpt shown on the slide:]

... first bootstrapping, and the range 10-14 shows results of 5 iterations during the second bootstrapping. Thus, the number of bootstrap iterations is 2, where each bootstrapping takes a different number of AdaGrad iterations. Using an Intel Xeon 2.57GHz machine, it takes less than 40 minutes to train the entire Penn Treebank, which includes times for IO, feature extraction, and bootstrapping.

[Figure 5: The total number of transitions performed during decoding with respect to beam sizes on the English development set. x-axis: beam size (1, 2, 4, 8, 16, 32, 64, 80); y-axis: number of transitions (0 to 1,200,000).]

Figure 5 shows the total number of transitions performed during decoding with respect to beam sizes on the English development set (1,700 sentences, 40,117 tokens). With selectional branching, the number of transitions grows logarithmically as the beam size increases, whereas it would have grown linearly if beam search were used. We also checked how often the one-best sequence is chosen as the final sequence during decoding. Out of 1,700 sentences, the one-best sequences are chosen for 1,095 sentences. This implies that about 64% of the time, our greedy parser performs as accurately as our non-greedy parser using selectional branching.

For the other languages, we use the same values as English for α, ρ, m, and b; only the AdaGrad and bootstrap iterations are tuned on the development sets of the other languages.

4.4 Projective parsing experiments

Before parsing, POS tags were assigned to the training set by using 20-way jackknifing. For the automatic generation of POS tags, we used the domain-specific model of Choi and Palmer (2012a)'s tagger, which gave 97.5% accuracy on the English evaluation set (0.2% higher than Collins (2002)'s tagger).

Table 4 shows a comparison between past and current state-of-the-art parsers and our approach. The first block shows results from transition-based dependency parsers using beam search. The second block shows results from other kinds of parsing approaches (e.g., graph-based parsing, ensemble parsing, linear programming, dual decomposition). The third block shows results from parsers using external data. The last block shows results from our approach. The Time column shows how many seconds per sentence each parser takes.⁷

Approach                        UAS    LAS    Time
Zhang and Clark (2008)          92.1
Huang and Sagae (2010)          92.1          0.04
Zhang and Nivre (2011)          92.9   91.8   0.03
Bohnet and Nivre (2012)         93.38  92.44  0.4
McDonald et al. (2005)          90.9
McDonald and Pereira (2006)     91.5
Sagae and Lavie (2006)          92.7
Koo and Collins (2010)          93.04
Zhang and McDonald (2012)       93.06  91.86
Martins et al. (2010)           93.26
Rush et al. (2010)              93.8
Koo et al. (2008)               93.16
Carreras et al. (2008)          93.54
Bohnet and Nivre (2012)         93.67  92.68
Suzuki et al. (2009)            93.79
bt = 80, bd = 80, m = 0.88      92.96  91.93  0.009
bt = 80, bd = 64, m = 0.88      92.96  91.93  0.009
bt = 80, bd = 32, m = 0.88      92.96  91.94  0.009
bt = 80, bd = 16, m = 0.88      92.96  91.94  0.008
bt = 80, bd = 8, m = 0.88       92.89  91.87  0.006
bt = 80, bd = 4, m = 0.88       92.76  91.76  0.004
bt = 80, bd = 2, m = 0.88       92.56  91.54  0.003
bt = 80, bd = 1, m = 0.88       92.26  91.25  0.002
bt = 1, bd = 1, m = 0.88        92.06  91.05  0.002

Table 4: Parsing accuracies and speeds on the English evaluation set, excluding tokens containing only punctuation. bt and bd indicate the beam sizes used during training and decoding, respectively. UAS: unlabeled attachment score, LAS: labeled attachment score, Time: seconds per sentence.

For evaluation, we use the model trained with b = 80 and m = 0.88, which is the best setting found during development. Our parser shows higher accuracy than Zhang and Nivre (2011), which is the current state-of-the-art transition-based parser that uses beam search. Bohnet and Nivre (2012)'s transition-based system jointly performs POS tagging and dependency parsing, which shows higher accuracy than ours. Our parser gives comparable accuracy to Koo and Collins (2010), which is a 3rd-order graph-based parsing approach. In terms of speed, our parser outperforms all other transition-based parsers; it takes about 9 milliseconds per sentence.

⁷ Dhillon et al. (2012) and Rush and Petrov (2012) have also shown good results on this data, but they are excluded from our comparison because they use different kinds of constituent-to-dependency conversion methods.

Page 51: Transition-based Dependency Parsing with Selectional Branching

Projective Parsing

• The benchmark setup using WSJ.

Approach           UAS    LAS    Time

bt = 80, bd = 80 92.96 91.93 0.009

bt = 80, bd = 64 92.96 91.93 0.009

bt = 80, bd = 32 92.96 91.94 0.009

bt = 80, bd = 16 92.96 91.94 0.008

bt = 80, bd = 8 92.89 91.87 0.006

bt = 80, bd = 4 92.76 91.76 0.004

bt = 80, bd = 2 92.56 91.54 0.003

bt = 80, bd = 1 92.26 91.25 0.002

bt = 1, bd = 1 92.06 91.05 0.002


Page 52: Transition-based Dependency Parsing with Selectional Branching

Projective Parsing

• The benchmark setup using WSJ.

Approach                     UAS    LAS    Time

bt = 80, bd = 80 92.96 91.93 0.009

Zhang & Clark, 2008 92.1

Huang & Sagae, 2010 92.1 0.04

Zhang & Nivre, 2011 92.9 91.8 0.03

Bohnet & Nivre, 2012 93.38 92.44 0.4

McDonald et al., 2005 90.9

McDonald & Pereira, 2006 91.5

Sagae & Lavie, 2006 92.7

Koo & Collins, 2010 93.04

Zhang & McDonald, 2012 93.06 91.86

Martins et al., 2010 93.26

Rush et al., 2010 93.8


Page 53: Transition-based Dependency Parsing with Selectional Branching

Non-projective Parsing

• CoNLL-X shared task data.

Approach                   Danish        Dutch         Slovene       Swedish
                           LAS    UAS    LAS    UAS    LAS    UAS    LAS    UAS

bt = 80, bd = 80 87.27 91.36 82.45 85.33 77.46 84.65 86.80 91.36

bt = 80, bd = 1 86.75 91.04 80.75 83.59 75.66 83.29 86.32 91.12

Nivre et al., 2006 84.77 89.80 78.59 81.35 70.30 78.72 84.58 89.50

McDonald et al., 2006 84.79 90.58 79.19 83.57 73.44 83.17 82.55 88.93

Nivre, 2009 84.2 - - - 75.2 - - -

F.-Gonz. & G.-Rodr., 2012 85.17 90.10 - - - - 83.55 89.30

Nivre & McDonald, 2008 86.67 - 81.63 - 75.94 - 84.66 -

Martins et al., 2010 - 91.50 - 84.91 - 85.53 - 89.80


Page 54: Transition-based Dependency Parsing with Selectional Branching

SPMRL 2013 Shared Task

• Baseline results provided by ClearNLP.

Language     5K                      Full
             LAS    UAS    LS        LAS    UAS    LS
Arabic       81.72  84.46  93.41     84.19  86.48  94.43
Basque       78.01  84.62  82.71     79.16  85.32  83.63
French       73.39  85.30  81.42     74.51  86.41  82.00
German       82.58  85.36  90.49     86.73  88.80  92.95
Hebrew       75.09  81.74  82.84     -      -      -
Hungarian    81.98  86.09  88.26     82.68  86.56  88.80
Korean       76.28  80.39  87.32     83.55  86.82  92.39
Polish       80.64  88.49  86.47     81.12  89.24  86.59
Swedish      80.96  86.48  85.10     -      -      -


Page 55: Transition-based Dependency Parsing with Selectional Branching

Conclusion

• Selectional branching

- Uses confidence estimates to decide when to employ a beam.

- Shows accuracy comparable to traditional beam search.

- Runs faster than any other non-greedy parsing approach.

• ClearNLP

- Provides several NLP tools, including a morphological analyzer, dependency parser, semantic role labeler, etc.

- Webpage: clearnlp.com.
