
arXiv:2110.06865v1 [cs.CL] 13 Oct 2021

Semantic Role Labeling as Dependency Parsing: Exploring Latent Tree Structures Inside Arguments

Yu Zhang1, Qingrong Xia1, Shilin Zhou1, Yong Jiang2, Zhenghua Li1, Guohong Fu1, Min Zhang1

1 Institute of Artificial Intelligence, School of Computer Science and Technology, Soochow University, Suzhou, China; 2 DAMO Academy, Alibaba Group, China

{yzhang.cs,slzhou.cs}@outlook.com; [email protected]; [email protected]; {zhli13,ghfu,minzhang}@suda.edu.cn

Abstract

Semantic role labeling is a fundamental yet challenging task in the NLP community. Recent works on SRL mainly fall into two lines: 1) BIO-based and 2) span-based. Despite their effectiveness, they share some intrinsic drawbacks of not explicitly considering internal argument structures, which may potentially hinder the model's expressiveness. To remedy this, we propose to reduce SRL to a dependency parsing task and regard the flat argument spans as latent subtrees. In particular, we equip our formulation with a novel span-constrained TreeCRF model to make tree structures span-aware, and further extend it to the second-order case. Experiments on the CoNLL05 and CoNLL12 benchmarks reveal that our methods outperform all previous works and achieve state-of-the-art results.

1 Introduction

Semantic role labeling (SRL) is a fundamental yet challenging task in the NLP community, involving predicate and argument identification as well as semantic role classification. As SRL can provide informative linguistic representations, it has been widely adopted in downstream tasks like question answering (Berant et al., 2013; Yih et al., 2016), information extraction (Christensen et al., 2010; Lin et al., 2017), and machine translation (Liu and Gildea, 2010; Bazrafshan and Gildea, 2013).

Recent works on SRL mainly fall into two lines: 1) BIO-based and 2) span-based. The former views SRL as a sequence labeling task (Zhou and Xu, 2015; Strubell et al., 2018; Shi and Lin, 2019): for each predicate, each token is tagged with a label carrying a BIO prefix indicating whether it is at the Beginning, Inside, or Outside of an argument. The latter (He et al., 2018a; Ouchi et al., 2018), in contrast, directly models all predicate-argument pairs in a unified graph.

Despite their effectiveness, both methods have drawbacks that limit their expressiveness.

Figure 1: An argument example (below) and its related subtree structure (above, with dependency labels VMOD, AMOD, NMOD, and PMOD) for the predicate "take".

First, framing predicate-argument structures as a BIO-tagging scheme is less effective as it lacks explicit modeling of span-level representations, so that long adjacencies of argument phrases can be ignored (Cohn and Blunsom, 2005; Jie and Lu, 2019; Zhou et al., 2020c; Xu et al., 2021). Second, for a sentence with $n$ words, the span-based method needs to consider $n^3$ candidate predicate-argument pairs, thus severely suffering from data sparsity. To mitigate this issue, the span-based method relies on heavy pruning to reduce the search space, which potentially impairs performance.

Meanwhile, both formulations share some common flaws in terms of explicit modeling of internal argument structures. Internal structures appear to be beneficial to SRL for two reasons. First, the headword of a phrasal argument is very instructive for localizing the semantic role with respect to the selected predicate (Johansson and Nugues, 2008c; Punyakanok et al., 2008). In fact, many linguistic phenomena in SRL like wh-extraction (e.g., who did what to whom) can be transparently reflected by the dependency link from the predicate to the headword of its associated argument (Johansson and Nugues, 2008a; Surdeanu et al., 2008; Hajic et al., 2009). Taking Fig. 1 as an example, the semantic role "A2" of the argument "out of the market" for the predicate "take" can be mirrored in the labeled dependency "take $\xrightarrow{\text{A2}}$ out". Second, constituent-based arguments are highly correlated with the subtrees governed by their headwords (Hacioglu, 2004; Choi and Palmer, 2010).



For instance, the span boundary of the argument in Fig. 1 can be properly retrieved from the related dependency subtree. Thus, considering subtree structures is intuitively very helpful for argument identification. However, to the best of our knowledge, very few works have made such explorations, hindered by the fact that span-style SRL has no predetermined internal argument structures.

Motivated by the above observations, in this work we propose to model flat arguments as latent dependency subtrees. In this way, we can naturally reduce SRL to a dependency parsing task: we view each SRL graph as dependency trees in which the exact structure of each argument is not realized. Specifically, we make use of TreeCRF (Eisner, 2000; Zhang et al., 2020) to learn the partially-observed trees and marginalize the latent structures out during training, as it provides a viable way to conduct probabilistic modeling of tree structures. While the canonical TreeCRF enumerates all possible trees, in our setting we have to impose many span constraints on subtrees so as to correctly reflect the argument boundaries. To accommodate this, we further design a novel span-constrained TreeCRF adapted to our learning procedure, which explicitly forbids invalid edges across different arguments and the existence of multi-head subtrees (Nivre et al., 2014; Zhang et al., 2021a).

There are further advantages to our reduction. Once we have done the SRL-to-Tree conversion, we can easily conduct global inference over tree structures (Eisner, 1996; McDonald et al., 2005) in polynomial time, which has been shown to often lead to improved results and more meaningful predictions (Toutanova et al., 2008; Täckström et al., 2015; FitzGerald et al., 2015) compared to local unconstrained methods. On the other hand, by drawing on experience from the parsing literature, we can further extend our method to well-studied high-order variants (McDonald and Pereira, 2006; Zhang et al., 2020) without any obstacles. We note that our formulation does not require predicates or syntax trees to be pre-specified, and is thus fully end-to-end. Our contributions can be summarized as follows:1

• Aware of the importance of internal argument structures, we propose to reduce SRL to a dependency parsing task and to model arguments as latent subtrees.

1 Our code is available at https://github.com/yzhangcs/crfsrl.

• We propose a novel span-constrained TreeCRF to learn the converted dependency trees, and further extend it to the second-order case.

• Experiments on the CoNLL05 and CoNLL12 benchmarks reveal that our proposed methods outperform previous works by a large margin and achieve state-of-the-art results.

2 SRL-to-Tree Conversion

We begin by reviewing the SRL task, and then introduce rules for translating SRL structures into dependency trees. The inverse conversion (Tree→SRL) is detailed in § 3.2.

Task definition   Formally, given an input sentence $x$ with $n$ words $x_1, \ldots, x_n$, we wish to predict a semantic graph $g$. $g$ can have more than one predicate, and each predicate-argument relation $(p, r_{i,j}) \in g$ is composed of a predicate word $p \in x$ as well as a labeled argument $r_{i,j}$ that governs a consecutive word span $x_i, \ldots, x_j$ with a semantic role $r \in \mathcal{R}$, where $\mathcal{R}$ is the space of all roles. As depicted in Fig. 2a, arguments belonging to the same predicate must be non-overlapping.
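To make the notation above concrete, the following minimal Python sketch (our own illustration; the class and field names are not taken from the released code) represents a sentence together with its predicate-argument pairs $(p, r_{i,j})$, instantiated for the example later shown in Fig. 2a.

```python
from dataclasses import dataclass, field

@dataclass
class Argument:
    role: str    # semantic role r, e.g. "A0", "A1", "AM-TMP"
    start: int   # i: 1-based index of the first word of the span
    end: int     # j: 1-based index of the last word of the span

@dataclass
class SRLGraph:
    words: list                                    # x_1, ..., x_n
    # predicate position -> its arguments; arguments of one predicate never overlap
    arguments: dict = field(default_factory=dict)

# the sentence of Fig. 2a, with one predicate ("want") and two arguments
g = SRLGraph(words="They want to do more .".split(),
             arguments={2: [Argument("A0", 1, 1), Argument("A1", 3, 5)]})
```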

Span-constrained trees   Our goal is to obtain syntactic trees that are strictly "compatible" with each predicate-argument pair $(p, r_{i,j}) \in g$. Given $x$, we define a directed acyclic dependency tree $t$ by assigning a head $h \in \{x_0, x_1, \ldots, x_n\}$ together with a relation label $r \in \mathcal{R}$ to each modifier $m \in x$, where a dummy word $x_0$ is attached before $x$ as the root node. We denote the subtree governing $x_i, \ldots, x_j$ and headed by $h$ as $h_{i,j}$.2 Intuitively, we say the transformed tree $t$ is "compatible" with $g$ if we can perfectly associate each $(p, r_{i,j})$ with a labeled arc "$p \xrightarrow{r} h$" and its corresponding subtree $h_{i,j}$. This implies two span constraints: 1) the scope of the subtree must strictly correspond to the argument span $r_{i,j}$; 2) $r_{i,j}$ contains no more than one headword, since otherwise we could further split $h_{i,j}$ into two or more separate subtrees due to the nature of tree projectivity.

Conversion procedure   It would be ideal if we could establish a one-to-one correspondence between $g$ and a single $t$. However, this is often not the case when there is more than one predicate in a sentence, as cycles may be introduced if one predicate lies inside an argument of another.

2 In this work, we assume all dependency trees are projective, i.e., without any crossing arcs. This property allows us to associate the subtree $h_{i,j}$ with its continuous argument span.


Figure 2: Illustration of our SRL→Tree conversion (Fig. 2a and Fig. 2b) and its inverse Tree→SRL process (Fig. 2c and Fig. 2d) over the sentence "They want to do more .": (a) Original structure: the predicate and its arguments (A0, A1) are located in the upper half-plane of the sentence. (b) Training: convert the predicate-argument structure to a dependency tree with (dotted) latent annotations. (c) Decoding: realize a tree rooted at the predicate with the arc labeled as "PRD"; (dashed) "O" arcs are discarded. (d) Recovery: collapse all (dashed) subtrees governed by the predicate into flat argument spans.

Moreover, we cannot simply read off dependencies from semantic structures, since the internal structure of each argument $r_{i,j}$ is absent. Taking these considerations into account, we first decompose $g$ by predicates, and then apply the following rules to construct dependency trees:

1. For predicate $p$, we first link the root to $p$, forming a labeled dependency $x_0 \xrightarrow{\text{PRD}} p$. We view each non-predicate word as a special case, with its dependency labeled as "O".

2. For each argument $r_{i,j}$ of $p$, all words covered by $r_{i,j}$ can be candidate headwords. For head $h$, we first link $p$ to $h$ and then assign the role $r$ to the arc, forming "$p \xrightarrow{r} h$"; we take the other words $\{x_i, \ldots, x_j\} \setminus h$ as descendants of $h$.

3. For a non-argument span $(i, j)$, we view it as a special argument of $p$ with label "O" for distinction from semantic roles, yielding a pair $(p, \text{O}_{i,j})$. The construction process then resembles that for argument spans, except that we do not limit the number of heads inside $(i, j)$.

By the above procedure, we can successfully convert $g$ into a dependency forest $\mathcal{T}(g)$ of partially-observed trees, each rooted at the corresponding predicate. We note that we do not assign any label to dependencies inside $h_{i,j}$, since they have no realized correspondences to SRL structures. Fig. 2b gives a conversion example.
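As a concrete illustration of rules 1-3, the helper below enumerates, for one predicate, the set of arcs that a compatible tree may use. It is a simplified sketch we wrote for exposition (not the paper's implementation), and it only encodes the first span constraint of § 2, i.e., that no arc may cross an argument boundary; the one-headword-per-argument constraint is left to the algorithm of § 3.

```python
def compatible_arcs(n, predicate, arguments):
    """Return the set of (head, modifier) arcs permitted by the conversion rules
    for a single predicate. `arguments` is a list of (role, i, j) spans with
    1-based word indices; index 0 denotes the dummy root x_0."""
    span_id = [None] * (n + 1)                 # which span (if any) each word belongs to
    for k, (_, i, j) in enumerate(arguments):
        for m in range(i, j + 1):
            span_id[m] = k
    # rule 3: maximal runs of the remaining non-predicate words form "O" spans
    next_id, m = len(arguments), 1
    while m <= n:
        if m != predicate and span_id[m] is None:
            while m <= n and m != predicate and span_id[m] is None:
                span_id[m] = next_id
                m += 1
            next_id += 1
        else:
            m += 1
    arcs = {(0, predicate)}                    # rule 1: the root attaches only to the predicate
    for mod in range(1, n + 1):
        if mod == predicate:
            continue
        arcs.add((predicate, mod))             # rules 2/3: the predicate may head words of any span
        for head in range(1, n + 1):
            if head not in (mod, predicate) and span_id[head] == span_id[mod]:
                arcs.add((head, mod))          # latent arcs must stay inside a single span
    return arcs

# Fig. 2 example: 6 words, predicate "want" (2), arguments A0 = (1, 1) and A1 = (3, 5)
arcs = compatible_arcs(6, 2, [("A0", 1, 1), ("A1", 3, 5)])
```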

3 SRL as Dependency Parsing

Now we elaborate on our approach of reducing SRL to dependency parsing. Following the basic settings of Dozat and Manning (2017), our model consists of a contextualized encoder, a scoring module, and a span-aware TreeCRF to learn the converted partial dependency trees.

3.1 Neural Parameterization

Given the sentence $x = x_0, x_1, \ldots, x_n$, we first obtain the hidden representation of each token $x_i$ via a deep contextualized encoder.

$$\mathbf{h}_0, \mathbf{h}_1, \ldots, \mathbf{h}_n = \mathrm{Encoder}(x_0, x_1, \ldots, x_n)$$

In this work, we experiment with two alternative encoders, i.e., BiLSTMs (Gal and Ghahramani, 2016) and pretrained language models (PLMs) (Devlin et al., 2019). We leave more setting details to Appendix A.

(Second-order) Tree parameterization   Dozat and Manning (2017) decompose $t$ into two separate parts $y$ and $r$, where $y$ is a skeletal tree and $r$ is the related strictly-ordered label sequence. We write the head of the modifier $m$ as $y_m$. For each head-modifier pair $h \to m \in y$, we compute a score using two MLPs followed by a Biaffine layer:

$$\mathbf{r}_h; \mathbf{r}_m = \mathrm{MLP}^{h/m}(\mathbf{h}_h; \mathbf{h}_m)$$
$$s(h \to m) = \mathrm{Biaffine}(\mathbf{r}_h, \mathbf{r}_m) \qquad (1)$$

The score of the dependency $h \to m$ with label $r \in \mathcal{R}$ is calculated in an analogous manner; we use two extra MLPs and $|\mathcal{R}|$ biaffine layers to obtain all label scores.

We further enhance the first-order biaffine parser with adjacent-sibling information (McDonald and Pereira, 2006). Following Wang et al. (2019) and Zhang et al. (2020), we employ three extra MLPs as well as a Triaffine structure for second-order subtree scoring:

$$\mathbf{r}_h; \mathbf{r}_m; \mathbf{r}_s = \mathrm{MLP}^{h'/m'/s}(\mathbf{h}_h; \mathbf{h}_m; \mathbf{h}_s)$$
$$s(h \to s, m) = \mathrm{Triaffine}(\mathbf{r}_h, \mathbf{r}_m, \mathbf{r}_s) \qquad (2)$$

where $s$ and $m$ are two adjacent modifiers of $h$, and $s$ lies between $h$ and $m$.
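Analogously, the second-order score of Eq. (2) can be sketched with three MLPs and a rank-3 tensor contraction; again this is a hedged illustration (the hidden size follows the 100-dimensional setting mentioned in Appendix A) rather than the exact released code.

```python
import torch
import torch.nn as nn

class TriaffineScorer(nn.Module):
    """A sketch of Eq. (2): three MLPs feeding a triaffine (rank-3) tensor product."""
    def __init__(self, hidden_size, mlp_size=100):
        super().__init__()
        self.mlp_h = nn.Sequential(nn.Linear(hidden_size, mlp_size), nn.ReLU())
        self.mlp_s = nn.Sequential(nn.Linear(hidden_size, mlp_size), nn.ReLU())
        self.mlp_m = nn.Sequential(nn.Linear(hidden_size, mlp_size), nn.ReLU())
        self.weight = nn.Parameter(torch.zeros(mlp_size, mlp_size, mlp_size))

    def forward(self, h):                                   # h: [batch, seq_len, hidden_size]
        rh, rs, rm = self.mlp_h(h), self.mlp_s(h), self.mlp_m(h)
        # scores[b, h, s, m] = s(h -> s, m) for an adjacent-sibling subtree
        return torch.einsum('bhx,xyz,bsy,bmz->bhsm', rh, self.weight, rs, rm)
```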

Under the first-order factorization (McDonald et al., 2005), the score of $y$ becomes

$$s(x, y) = \sum_{h \to m \in y} s(h \to m) \qquad (3)$$

For the second-order case, we further incorporate adjacent-sibling subtree scores into tree scoring:

$$s(x, y) = \sum_{h \to m} s(h \to m) + \sum_{h \to s, m} s(h \to s, m) \qquad (4)$$

Then we parameterize the probability of tree $y$ and its label sequence $r$ as

$$P(y \mid x) = \frac{\exp(s(x, y))}{Z(x) \equiv \sum_{y' \in \mathcal{Y}(x)} \exp(s(x, y'))}$$
$$P(r \mid x, y) = \prod_{h \xrightarrow{r} m \in t} P(r \mid x, h \to m) \qquad (5)$$

where $\mathcal{Y}(x)$ is the set of all possible legal unlabeled trees, and $Z(x)$ is known as the partition function. Each label $r$ is independent of the tree $y$ and the other labels, thus $P(r \mid x, h \to m)$ is locally normalized over all $r' \in \mathcal{R}$. Finally, we define the probability of the labeled tree $t$ as the product of the probabilities of its two sub-components:

$$P(t \mid x) = P(y \mid x) \cdot P(r \mid x, y) \qquad (6)$$

3.2 Training and Inference

Training objective   During training, we seek to maximize the probability of the SRL graph, $P(g \mid x)$, which is approximated as the probability of the converted dependency forest, $P(\mathcal{T}(g) \mid x)$. We further decompose $P(\mathcal{T}(g) \mid x)$ by predicates:

$$P(\mathcal{T}(g) \mid x) = \prod_{p} P(\mathcal{T}(g, p) \mid x) \qquad (7)$$

Figure 3: Deduction rules (Pereira and Warren, 1983) for our span-constrained Inside/Eisner algorithm (R-COMB and R-LINK) and its second-order extension (McDonald and Pereira, 2006) (COMB and R-LINK2). Our modified rule constraints are highlighted in green. Only R-rules are shown; the symmetric L-rules are omitted for brevity.

where the probability for each predicate $p$ can be further expanded as

$$P(\mathcal{T}(g, p) \mid x) = \sum_{(y, r) \in \mathcal{T}(g, p)} P(y \mid x) \cdot P(r \mid x, y) = \sum_{y, r} \frac{\exp(s(x, y)) \cdot P(r \mid x, y)}{Z(x)} = \sum_{t} \frac{\exp(s(x, t))}{Z(x)} \qquad (8)$$

The denominator $Z(x)$ can be efficiently computed via the (second-order) Inside algorithm in $O(n^3)$ time complexity (Eisner, 2016; Zhang et al., 2020). As for the numerator, we make a slight change to the formula and define the labeled tree score as

$$s(x, t) = s(x, y) + \log P(r \mid x, y). \qquad (9)$$

In this way, the numerator is exactly the summation of all legal labeled tree scores. It is noteworthy that we do not assign any label to $h \to m \in y$ when $h \notin \{x_0, p\}$; its label probability is thus set to 1 and does not contribute to tree scoring.

Span-constrained TreeCRF   The above formulation differs from the traditional case of partial tree learning (Li et al., 2016) in that we impose the span constraints mentioned in § 2 on the tree space $\mathcal{T}(g, p)$, for which the common Inside algorithm is not adequate. In this work, we propose a span-constrained TreeCRF to accommodate these constraints. We illustrate the deduction rules of our tailored TreeCRF and its second-order extension in Fig. 3. Basically, we prevent an arc $h \to m$ from crossing different spans by prohibiting the merge operation of its related incomplete span $I_{h,m}$ (R-LINK). To prevent multiple headwords in the same argument, inspired by Zhang et al. (2021a), for predicate $p$ we only allow merging the complete span $C_{p,m}$ if $m$ is the endpoint of an argument (R-COMB). For the second-order case, we further prohibit the subtree $p \to s, m$ once $s$ and $m$ are located in the same argument (R-LINK2).
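To ground the discussion, the sketch below gives a plain first-order Inside recursion over projective trees in log space, written by us for exposition (it is not the paper's implementation and omits both the R-COMB endpoint constraint and the second-order extension). The R-LINK-style constraint can be imposed simply by setting the score of any arc that crosses an argument boundary to $-\infty$; the per-predicate numerator of Eq. 8 is then one such constrained run over label-augmented scores, the denominator $Z(x)$ is an unconstrained run, and replacing logsumexp with max (plus backpointers) yields the Eisner decoder used for Eq. 10.

```python
import math

def logsumexp(xs):
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))

def eisner_inside(scores):
    """First-order Inside algorithm over projective trees in log space.
    scores[h][m] is the log-score of the arc h -> m, with index 0 as the dummy root;
    setting scores[h][m] = -inf forbids the arc (how span constraints are injected).
    Returns log Z(x)."""
    n = len(scores) - 1
    NEG = -math.inf
    # I / C: incomplete / complete spans over words i..j;
    # direction 0 means the head is j (left arcs), 1 means the head is i (right arcs)
    I = [[[NEG, NEG] for _ in range(n + 1)] for _ in range(n + 1)]
    C = [[[NEG, NEG] for _ in range(n + 1)] for _ in range(n + 1)]
    for i in range(n + 1):
        C[i][i][0] = C[i][i][1] = 0.0
    for w in range(1, n + 1):
        for i in range(n + 1 - w):
            j = i + w
            # LINK: combine two adjacent complete spans and add an arc between i and j
            inner = [C[i][r][1] + C[r + 1][j][0] for r in range(i, j)]
            I[i][j][0] = logsumexp(inner) + scores[j][i]    # arc j -> i
            I[i][j][1] = logsumexp(inner) + scores[i][j]    # arc i -> j
            # COMB: extend an incomplete span with a complete one
            C[i][j][0] = logsumexp([C[i][r][0] + I[r][j][0] for r in range(i, j)])
            C[i][j][1] = logsumexp([I[i][r][1] + C[r][j][1] for r in range(i + 1, j + 1)])
    return C[0][n][1]
```

Under this view, the per-predicate training loss is simply the difference between two such runs: the constrained log-partition minus the unconstrained $\log Z(x)$.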

Inference   During inference, as shown in Fig. 2c, for each candidate predicate $p$, we use the (second-order) Eisner algorithm to parse an optimal projective tree rooted at $p$:

$$t^* = \operatorname*{arg\,max}_{t' \equiv (y', r'),\ \mathrm{s.t.}\ y'_p = x_0} s(x, t') \qquad (10)$$

The score $s(x, t')$ is calculated by Eq. 3. We first predict the label sequence $r$ locally, as labels are independent of the skeletal tree, and then, for efficiency, only realize the tree whose arc $x_0 \to p$ is labeled PRD. Dependencies with label "O" are discarded.

3.3 Recovery from Dependency Trees

Having gone through the most sophisticated procedure, all that remains is to synthesize the final SRL graph $g'$ from the output parse trees. Our post-recovery method is remarkably simple: 1) we first divide each parsed tree $t'$ into multiple subtrees headed by $p$; 2) then, for each $\{p \xrightarrow{r} h\} \cup h_{i,j}$, we make a recursive top-down traversal starting from the head $h$ and collapse the subtree into a flat span, forming a predicate-argument pair $(p, r_{i,j})$; 3) finally, we collate all arguments into a single graph $g'$. We demonstrate a recovery example in Fig. 2d.
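The recovery step can be made concrete with a short sketch (ours, for illustration; 1-based word indices and projective trees are assumed): each role-labeled child of the predicate is expanded into the yield of its subtree, which gives the flat argument span.

```python
def recover_srl(heads, labels, predicate):
    """Collapse the subtrees governed by `predicate` into flat argument spans.
    heads[m] is the head of word m (heads[0] is unused); labels[m] is the relation
    label of the arc heads[m] -> m. Returns a list of (predicate, role, i, j)."""
    n = len(heads) - 1
    children = [[] for _ in range(n + 1)]
    for m in range(1, n + 1):
        children[heads[m]].append(m)

    def span(h):                        # the contiguous yield of the subtree rooted at h
        lo = hi = h
        stack = [h]
        while stack:
            node = stack.pop()
            lo, hi = min(lo, node), max(hi, node)
            stack.extend(children[node])
        return lo, hi

    graph = []
    for m in children[predicate]:
        if labels[m] != 'O':            # "O" arcs are discarded, as in Fig. 2c
            i, j = span(m)
            graph.append((predicate, labels[m], i, j))
    return graph

# Fig. 2d example: "They want to do more ." with predicate 2 ("want") and "do" chosen
# as the (latent) head of "to do more"
heads  = [0, 2, 0, 4, 2, 4, 2]
labels = [None, 'A0', 'PRD', '_', 'A1', '_', 'O']
assert recover_srl(heads, labels, 2) == [(2, 'A0', 1, 1), (2, 'A1', 3, 5)]
```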

An important observation is that the semantics-compatible trees we obtain as by-products are task-specific and may not be in line with any expert-designed grammar (Marcus et al., 1993). However, when we directly examine our model on the grammar induction task (Klein and Manning, 2004; Gormley et al., 2014), we find surprisingly promising results. More discussions can be found in Appendix B.

4 Experiments

Following previous works, we evaluate our proposed first-order CRF and second-order CRF2O on two SRL benchmarks: CoNLL05 and CoNLL12. For CoNLL05, we adopt the standard data split of Carreras and Màrquez (2005). For CoNLL12, we extract data from OntoNotes (Pradhan et al., 2013) and follow the split of Pradhan et al. (2012). We adopt the official scripts provided by the CoNLL05 shared task3 for evaluation. More experimental setups are given in Appendix A.

4.1 Main results

Table 1 gives our main results on CoNLL05 in-domain WSJ data, out-of-domain Brown data, and CoNLL12 Test data. By default, our model works in an end-to-end fashion, i.e., predicting all predicates and associated arguments simultaneously. For comparisons with previous works, we report the results of using gold predicates as well. We achieve this by only parsing trees rooted at the pre-specified predicates.

We can clearly see that our CRF outperforms previous works by a large margin on all three datasets, especially showing larger gains on the out-of-domain Brown data. The second-order CRF2O further brings 0.4, 0.7, and 0.1 F1 improvements over CRF on the three datasets, respectively. As shown in § 4.2, we attribute the improvements to better performance on global consistency and long-range dependencies.

The results under the setting of w/ gold predicates show similar trends. The bottom lines show the results of utilizing PLMs. Compared with the previous state-of-the-art BIO-based parser (Shi and Lin, 2019), our CRF model enhanced with BERT shows competitive performance without using any pretrained word embeddings or LSTM layers. The results of CRF2O are 88.84, 82.89, and 87.57, respectively. By utilizing RoBERTa, our CRF2O obtains further gains and achieves new state-of-the-art results on all three datasets under the end-to-end setting. As a reference, the recent best-performing syntax-related work (Zhou et al., 2020a) achieves 88.64 and 89.81 respectively on CoNLL05 Test data after using BERT and XLNet (Yang et al., 2019). Our models show very competitive results.

3 https://www.cs.upc.edu/~srlconll


Model                         CoNLL05 WSJ (P/R/F1)    CoNLL05 Brown (P/R/F1)   CoNLL12 Test (P/R/F1)

He et al. (2017)              80.20/82.30/81.20       67.60/69.60/68.50        78.60/75.10/76.80
He et al. (2018a)             81.20/83.90/82.50       69.70/71.90/70.80        79.40/80.10/79.80
Strubell et al. (2018)*       81.77/83.28/82.51       68.58/70.10/69.33        -/-/-
CRF                           83.18/85.38/84.27       70.40/72.97/71.66        79.47/82.80/81.10
CRF2O                         83.26/86.20/84.71       70.70/74.16/72.39        79.27/83.24/81.21
Li et al. (2019) + ELMo       85.20/87.50/86.30       74.70/78.10/76.40        84.90/81.40/83.10
CRF + BERT                    86.53/88.02/87.27       78.77/80.71/79.73        84.08/85.79/84.92
CRF2O + BERT                  86.78/88.59/87.67       79.29/82.10/80.67        84.19/86.21/85.19
CRF + RoBERTa                 87.23/88.81/88.01       80.02/82.22/81.10        84.95/87.00/85.97
CRF2O + RoBERTa               87.43/89.32/88.36       80.14/82.49/81.30        85.09/86.99/86.03

w/ gold predicates
He et al. (2017)              82.00/83.40/82.70       69.70/70.50/70.10        80.20/76.60/78.40
Ouchi et al. (2018)           84.70/82.30/83.50       76.00/70.40/73.10        84.40/81.70/83.00
Tan et al. (2018)             84.50/85.20/84.80       73.50/74.60/74.10        81.90/83.60/82.70
Strubell et al. (2018)*       84.70/84.24/84.47       73.89/72.39/73.13        -/-/-
Zhang et al. (2021b)          85.30/85.17/85.23       74.98/73.85/74.41        83.09/83.71/83.40
CRF                           85.38/85.56/85.47       74.39/73.76/74.07        83.21/83.85/83.53
CRF2O                         85.47/86.40/85.93       74.92/75.00/74.96        83.02/84.31/83.66
Shi and Lin (2019) + BERT     88.60/89.00/88.80       81.90/82.10/82.00        85.90/87.00/86.50
Zhang et al. (2021b) + BERT   87.70/88.15/87.93       81.52/81.36/81.44        86.00/86.84/86.42
CRF + BERT                    88.70/88.32/88.51       82.58/81.47/82.02        87.20/87.38/87.29
CRF2O + BERT                  88.80/88.88/88.84       82.95/82.82/82.89        87.35/87.78/87.57
CRF + RoBERTa                 89.38/89.15/89.27       83.80/82.96/83.38        88.00/88.39/88.19
CRF2O + RoBERTa               89.57/89.69/89.63       83.83/83.60/83.72        88.05/88.59/88.32

Table 1: Results on CoNLL05 WSJ, Brown, and CoNLL12 Test data. All results are averaged over 4 runs with different random seeds. Strubell et al. (2018) incidentally report results with erroneously high precisions (see discussions in their code issues); the results marked by * are obtained by rerunning their released models.

Model    Dev P/R/F1            Dev CM    Test F1    Test CM
BIO      86.80/86.38/86.59     69.24     88.22      71.95
SPAN     87.68/86.75/87.21     68.43     88.44      70.22
CRF      87.71/87.49/87.60     70.99     88.51      72.40
CRF2O    87.71/87.91/87.81     71.99     88.84      73.40

Table 2: Finetuning results on CoNLL05 data under the setting of w/ gold predicates.


4.2 Analysis

To better understand in what aspects our CRF and CRF2O are helpful, we conduct detailed analysis on CoNLL05 Dev data. To this end, we re-implement the BIO-based model (BIO) of Strubell et al. (2018) and the span-based model (SPAN) of He et al. (2018a). For BIO, we follow Zhang et al. (2021b) by further employing a linear-chain CRF (Lafferty et al., 2001) during training and performing Viterbi decoding to satisfy BIO constraints. For fair comparisons, we adopt the same experimental setup: we report the results of finetuning BERT and assume all predicates are given.
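For completeness, here is a minimal sketch of constrained Viterbi decoding for the BIO baseline; it is our own illustration (not the reimplementation evaluated here) and simply masks transitions that would violate the BIO scheme.

```python
import math

def viterbi_bio(emissions, transitions, labels):
    """emissions: [seq_len][n_tags] log-scores; transitions: [n_tags][n_tags] log-scores;
    labels: tag strings such as "O", "B-A0", "I-A0". Returns the best valid tag sequence."""
    n_tags, n = len(labels), len(emissions)

    def allowed(prev, curr):
        c = labels[curr]
        if c.startswith('I-'):
            # an I-X tag must continue a B-X or I-X segment with the same role X
            return labels[prev] in ('B-' + c[2:], 'I-' + c[2:])
        return True

    score = [[-math.inf] * n_tags for _ in range(n)]
    back = [[0] * n_tags for _ in range(n)]
    for t in range(n_tags):
        if not labels[t].startswith('I-'):            # a sequence cannot start inside a span
            score[0][t] = emissions[0][t]
    for i in range(1, n):
        for t in range(n_tags):
            for s in range(n_tags):
                if not allowed(s, t):
                    continue
                cand = score[i - 1][s] + transitions[s][t] + emissions[i][t]
                if cand > score[i][t]:
                    score[i][t], back[i][t] = cand, s
    best = max(range(n_tags), key=lambda t: score[n - 1][t])
    path = [best]
    for i in range(n - 1, 0, -1):
        best = back[i][best]
        path.append(best)
    return [labels[t] for t in reversed(path)]
```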

We first make comprehensive comparisons of the four methods in Table 2. We can observe that under the same settings, our CRF widens the advantage over BIO and SPAN, and CRF2O further improves the performance on Dev data.

Structural consistency   To quantify the benefits of our methods in making global decisions for SRL structures, we report the percentage of completely correct predicates (CM) in Table 2. We find that the BIO model with CRF significantly outperforms SPAN, but still falls short of our CRF by 1.75 points. By explicitly modeling sibling information, CRF2O goes further beyond CRF by 1 point.


Figure 4: F1 scores (%) broken down by argument length (Fig. 4a) and predicate-argument distance (Fig. 4b), over the buckets 0-1, 2-3, 4-7, and ≥8, comparing BIO, SPAN, CRF, and CRF2O.

In terms of the performance broken down by argument length, as shown in Fig. 4a, SPAN lags largely behind BIO for lengths ≥ 8. We guess this is mainly because of their aggressive argument pruning strategy. Correspondingly, CRF and CRF2O demonstrate steady improvements over BIO and SPAN. We attribute this to the superiority of our formulations in modeling subtree structures, thus providing rich inter- and intra-argument dependency interactions.

Long-range dependencies   Fig. 4b shows the results broken down by predicate-argument distance. It is clear that the gaps between BIO and the other methods become larger as the distance increases. This is reasonable since BIO lacks explicit connections for non-adjacent predicate-argument pairs, whereas ours provides a direct word-to-word mapping. SPAN shows competitive results but is still inferior to ours; we speculate this is due to its weakness on ultra-long arguments, as illustrated in Fig. 4a.

4.3 Dependency-based evaluation

Given the above analyses, we conclude that span-style SRL can substantially benefit from our formulation of recovering SRL from induced trees. This naturally opens up the question: do the induced trees learn plausible structures?

Observing that our CRF model can conveniently determine dependencies from predicates to span headwords as by-products of constructing arguments, we conduct dependency-based evaluation on CoNLL09 Test data (Hajic et al., 2009) to measure the quality of the induced dependencies. We obtain the results by making predictions on CoNLL09 Test with the parser trained on CoNLL05, as the two datasets share the same text content.

We set up two baselines: 1) FIRST, which always takes the first word of an argument as its head; 2) LAST, which takes the last word.

Model                 P/R/F1
FIRST                 61.16/63.51/62.31
LAST                  60.61/62.95/61.76
CRF                   72.20/74.98/73.56
CRF + BERT            82.30/84.04/83.16
CRF + BERT (gold)     90.92/92.85/91.88

w/ gold predicates
CRF                   75.28/75.24/75.26
CRF + BERT            84.70/84.39/84.54
CRF + BERT (gold)     93.56/93.22/93.39

Table 3: Results of dependency-based evaluation on CoNLL09 Test data under the w/o and w/ gold predicates settings.

Following Johansson and Nugues (2008b) and Li et al. (2019), we also compare our CRF outputs with the upper bound of utilizing gold syntax trees to determine the headwords of predicted arguments. Since the CoNLL05 data contains only verbal predicates, we discard all nominal predicate-argument structures with the guidance of POS tags starting with N*. Word senses and self-loops are removed as well. Table 3 lists the results.

We can clearly see that FIRST performs significantly better than LAST. This is reasonable since English is often regarded as right-branching, and headwords often precede their dependents. The results of our CRF model substantially outperform the two baselines, i.e., FIRST and LAST, especially after utilizing BERT. It also exhibits promising performance even when compared with the upper bound of utilizing gold syntax. This indicates that the induced dependencies of CRF are highly in line with dependency-based annotations (Johansson and Nugues, 2008a; Surdeanu et al., 2008). This sheds light on extending our work to supervised dependency-based SRL, which we leave for future work. Furthermore, we demonstrate that our learned parser shows very competitive results on grammar induction in comparison with typical methods; we refer to Appendix B for more discussions.

5 Related Works

Span-style SRL   Pioneered by Gildea and Jurafsky (2002), syntax has long been considered indispensable for the span-style SRL task (Punyakanok et al., 2008). With the advent of the neural network era, syntax-agnostic models have made remarkable progress (Zhou and Xu, 2015; Tan et al., 2018; He et al., 2018a), mainly owing to powerful model architectures like BiLSTM (Gal and Ghahramani, 2016) or Transformer (Vaswani et al., 2017). Meanwhile, other researchers have paid attention to syntax tree utilization, including serving as guidance for argument pruning (He et al., 2018b), as input features (Marcheggiani and Titov, 2017; Xia et al., 2019), or as supervision for joint learning (Swayamdipta et al., 2018). However, to our knowledge, very few works have been devoted to mining internal structures of shallow SRL representations. Zhang et al. (2021b) explore headword distributions by attending over words in argument spans. Beyond this, this work proposes to model full argument subtree structures by directly reducing SRL to a dependency parsing task, and finds more competitive results.

Parsing with latent variables   Henderson et al. (2008, 2013) design a latent variable model to deliver syntactic and semantic interactions under the setting of joint learning. In more common situations where gold treebanks may be lacking, Naradowsky et al. (2012) and Gormley et al. (2014) use LBP for the inference of semantic graphs and treat latent trees as global factors (Smith and Eisner, 2008) to provide soft beliefs for reasonable predicate-argument structures. This work differs in that we impose hard constraints on syntax tree structures to conform to the SRL graph, and take only the subtrees attached to predicates as latent variables. The intuition behind latent tree models (Chu et al., 2017; Meila and Jordan, 2000; Kim et al., 2017) is to utilize tree structures to provide rich structural interactions for problems of prohibitively high complexity. This idea is also common in many other NLP tasks like text summarization (Liu and Lapata, 2018), sequence labeling (Zhou et al., 2020c), and AMR parsing (Zhou et al., 2020b).

Reduction to syntactic parsing   Researchers have investigated several ways to recover SRL structures from syntactic trees, due to their highly coupled nature (Palmer et al., 2005). Early efforts by Cohn and Blunsom (2005) map predicate-arguments onto pruned phrase structures and use a CKY-style TreeCRF to learn parameters. Johansson and Nugues (2008a) and Choi and Palmer (2010) investigate retrieving semantic boundaries from dependency outputs. Their devised heuristics rely heavily on the quality of the output trees, leading to inferior results. Our reduction is also inspired by works on other NLP tasks, including named entity recognition (NER) (Yu et al., 2020), nested NER (Fu et al., 2021), semantic dependency parsing (Sun et al., 2017), and EUD parsing (Anderson and Gómez-Rodríguez, 2021). As the most relevant work, Shi et al. (2020) also propose to reduce SRL to syntactic dependency parsing by integrating syntactic-semantic relations into a single dependency tree by means of joint labels. However, their approach shows no improvements, possibly due to the label sparsity problem and high back-and-forth conversion loss. Also, they use gold treebank supervision while ours does not rely on any hand-annotated syntax data.

We prefer to reduce SRL to dependency parsing rather than the other paradigm, i.e., constituency parsing. This is partly due to the fact that dependency structures offer a more transparent bilexical governor-dependent encoding of predicate-argument relations. We also do not take the route of jointly modeling dependencies and phrasal structures with lexicalized trees (Eisner and Satta, 1999; Yang and Tu, 2021), as our approach enjoys a lower time complexity of $O(n^3)$. Certainly, we admit the potential advantages of this kind of modeling and leave it as future work.

6 Conclusions

In this paper, we propose to reduce SRL to a dependency parsing task. Specifically, we view flat phrasal arguments as latent subtrees and design a novel span-constrained TreeCRF model to accommodate the span structures. We also borrow traditional second-order parsing techniques and find further gains. Experimental results show that our proposed methods achieve state-of-the-art performance on both the CoNLL05 and CoNLL12 benchmarks. Extensive analyses reveal the advantages of our modeling of latent subtrees in terms of long-range dependencies and global consistency. The evaluation of induced dependencies validates their agreement with linguistically motivated annotations.

References

Mark Anderson and Carlos Gómez-Rodríguez. 2021. Splitting EUD graphs into trees: A quick and clatty approach. In Proceedings of IWPT, pages 167–174, Online. Association for Computational Linguistics.

Marzieh Bazrafshan and Daniel Gildea. 2013. Semantic roles for string to tree machine translation. In Proceedings of ACL, pages 419–423, Sofia, Bulgaria. Association for Computational Linguistics.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of EMNLP, pages 1533–1544, Seattle, Washington, USA. Association for Computational Linguistics.

Jiong Cai, Yong Jiang, and Kewei Tu. 2017. CRF autoencoder for unsupervised dependency parsing. In Proceedings of EMNLP, pages 1638–1643, Copenhagen, Denmark. Association for Computational Linguistics.

Xavier Carreras and Lluís Màrquez. 2005. Introduction to the CoNLL-2005 shared task: Semantic role labeling. In Proceedings of CoNLL, pages 152–164, Ann Arbor, Michigan. Association for Computational Linguistics.

Jinho Choi and Martha Palmer. 2010. Retrieving correct semantic boundaries in dependency structure. In Proceedings of LAW, pages 91–99, Uppsala, Sweden. Association for Computational Linguistics.

Janara Christensen, Mausam, Stephen Soderland, and Oren Etzioni. 2010. Semantic role labeling for open information extraction. In Proceedings of WS, pages 52–60, Los Angeles, California. Association for Computational Linguistics.

Shanbo Chu, Yong Jiang, and Kewei Tu. 2017. Latent dependency forest models. In Proceedings of AAAI, pages 3733–3739, San Francisco, California, USA.

Trevor Cohn and Philip Blunsom. 2005. Semantic role labelling with tree conditional random fields. In Proceedings of CoNLL, pages 169–172, Ann Arbor, Michigan. Association for Computational Linguistics.

Michael Collins. 2003. Head-driven statistical models for natural language parsing. Computational Linguistics, pages 589–637.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL, pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In Proceedings of ICLR, Toulon, France. OpenReview.net.

Jason Eisner. 2000. Bilexical Grammars and their Cubic-Time Parsing Algorithms, pages 29–61. Springer Netherlands, Dordrecht.

Jason Eisner. 2016. Inside-outside and forward-backward algorithms are just backprop (tutorial paper). In Proceedings of WS, pages 1–17, Austin, TX. Association for Computational Linguistics.

Jason Eisner and Giorgio Satta. 1999. Efficient parsing for bilexical context-free grammars and head automaton grammars. In Proceedings of ACL, pages 457–464, College Park, Maryland, USA. Association for Computational Linguistics.

Jason M. Eisner. 1996. Three new probabilistic models for dependency parsing: An exploration. In Proceedings of COLING, pages 340–345.

Nicholas FitzGerald, Oscar Täckström, Kuzman Ganchev, and Dipanjan Das. 2015. Semantic role labeling with neural network factors. In Proceedings of EMNLP, pages 960–970, Lisbon, Portugal. Association for Computational Linguistics.

Yao Fu, Chuanqi Tan, Mosha Chen, Songfang Huang, and Fei Huang. 2021. Nested named entity recognition with partially-observed TreeCRFs. In Proceedings of AAAI, pages 12839–12847, Online. AAAI Press.

Yarin Gal and Zoubin Ghahramani. 2016. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of ICML, pages 1050–1059, New York, New York, USA. PMLR.

Daniel Gildea and Daniel Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, pages 245–288.

Matthew R. Gormley, Margaret Mitchell, Benjamin Van Durme, and Mark Dredze. 2014. Low-resource semantic role labeling. In Proceedings of ACL, pages 1177–1187, Baltimore, Maryland. Association for Computational Linguistics.

Kadri Hacioglu. 2004. Semantic role labeling using dependency trees. In Proceedings of COLING, pages 1273–1276, Geneva, Switzerland. COLING.

Jan Hajic, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, Maria Antònia Martí, Lluís Màrquez, Adam Meyers, Joakim Nivre, Sebastian Padó, Jan Štepánek, Pavel Stranák, Mihai Surdeanu, Nianwen Xue, and Yi Zhang. 2009. The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. In Proceedings of CoNLL, pages 1–18, Boulder, Colorado. Association for Computational Linguistics.

Wenjuan Han, Yong Jiang, and Kewei Tu. 2017. Dependency grammar induction with neural lexicalization and big training data. In Proceedings of EMNLP, pages 1683–1688, Copenhagen, Denmark. Association for Computational Linguistics.

Luheng He, Kenton Lee, Omer Levy, and Luke Zettlemoyer. 2018a. Jointly predicting predicates and arguments in neural semantic role labeling. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 364–369, Melbourne, Australia. Association for Computational Linguistics.

Luheng He, Kenton Lee, Mike Lewis, and Luke Zettlemoyer. 2017. Deep semantic role labeling: What works and what's next. In Proceedings of ACL, pages 473–483, Vancouver, Canada. Association for Computational Linguistics.

Shexia He, Zuchao Li, Hai Zhao, and Hongxiao Bai. 2018b. Syntax for semantic role labeling, to be, or not to be. In Proceedings of ACL, pages 2061–2071, Melbourne, Australia. Association for Computational Linguistics.

James Henderson, Paola Merlo, Gabriele Musillo, and Ivan Titov. 2008. A latent variable model of synchronous parsing for syntactic and semantic dependencies. In Proceedings of CoNLL, pages 178–182, Manchester, England. Coling 2008 Organizing Committee.

James Henderson, Paola Merlo, Ivan Titov, and Gabriele Musillo. 2013. Multilingual joint parsing of syntactic and semantic dependencies with a latent variable model. CL, pages 949–998.

Yong Jiang, Wenjuan Han, and Kewei Tu. 2016. Unsupervised neural dependency parsing. In Proceedings of EMNLP, pages 763–771, Austin, Texas. Association for Computational Linguistics.

Zhanming Jie and Wei Lu. 2019. Dependency-guided LSTM-CRF for named entity recognition. In Proceedings of EMNLP-IJCNLP, pages 3862–3872, Hong Kong, China. Association for Computational Linguistics.

Richard Johansson and Pierre Nugues. 2008a. Dependency-based semantic role labeling of PropBank. In Proceedings of EMNLP, pages 69–78, Honolulu, Hawaii. Association for Computational Linguistics.

Richard Johansson and Pierre Nugues. 2008b. Dependency-based syntactic–semantic analysis with PropBank and NomBank. In Proceedings of CoNLL, pages 183–187, Manchester, England. Coling 2008 Organizing Committee.

Richard Johansson and Pierre Nugues. 2008c. The effect of syntactic representation on semantic role labeling. In Proceedings of COLING, pages 393–400, Manchester, UK. Coling 2008 Organizing Committee.

Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. 2017. Structured attention networks. In Proceedings of ICLR, Toulon, France. OpenReview.net.

Dan Klein and Christopher Manning. 2004. Corpus-based induction of syntactic structure: Models of dependency and constituency. In Proceedings of ACL, pages 478–485, Barcelona, Spain.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML, pages 282–289, Williams College, Williamstown, MA, USA. Morgan Kaufmann.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of NAACL, pages 2475–2485, San Diego, California. Association for Computational Linguistics.

Phong Le and Willem Zuidema. 2015. Unsupervised dependency parsing: Let's use supervised parsers. In Proceedings of NAACL, pages 651–661, Denver, Colorado. Association for Computational Linguistics.

Zhenghua Li, Min Zhang, Yue Zhang, Zhanyi Liu, Wenliang Chen, Hua Wu, and Haifeng Wang. 2016. Active learning for dependency parsing with partial annotation. In Proceedings of ACL, pages 344–354, Berlin, Germany. Association for Computational Linguistics.

Zuchao Li, Shexia He, Hai Zhao, Yiqing Zhang, Zhuosheng Zhang, Xi Zhou, and Xiang Zhou. 2019. Dependency or span, end-to-end uniform semantic role labeling. In Proceedings of AAAI, pages 6730–6737.

Yankai Lin, Zhiyuan Liu, and Maosong Sun. 2017. Neural relation extraction with multi-lingual attention. In Proceedings of ACL, pages 34–43, Vancouver, Canada. Association for Computational Linguistics.

Ding Liu and Daniel Gildea. 2010. Semantic role features for machine translation. In Proceedings of COLING, pages 716–724, Beijing, China. Coling 2010 Organizing Committee.

Yang Liu and Mirella Lapata. 2018. Learning structured text representations. TACL, pages 63–75.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In Proceedings of ICLR, New Orleans, LA, USA.

Diego Marcheggiani and Ivan Titov. 2017. Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of EMNLP, pages 1506–1515, Copenhagen, Denmark. Association for Computational Linguistics.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. CL, pages 313–330.

Ryan McDonald, Koby Crammer, and Fernando Pereira. 2005. Online large-margin training of dependency parsers. In Proceedings of ACL, pages 91–98, Ann Arbor, Michigan. Association for Computational Linguistics.

Ryan McDonald and Fernando Pereira. 2006. Online learning of approximate dependency parsing algorithms. In Proceedings of EACL, pages 81–88, Trento, Italy. Association for Computational Linguistics.

Marina Meila and Michael I. Jordan. 2000. Learning with mixtures of trees. Journal of Machine Learning Research, pages 1–48.

Jason Naradowsky, Sebastian Riedel, and David Smith. 2012. Improving NLP through marginalization of hidden syntactic structure. In Proceedings of EMNLP, pages 810–820, Jeju Island, Korea. Association for Computational Linguistics.

Joakim Nivre, Yoav Goldberg, and Ryan McDonald. 2014. Squibs: Constrained arc-eager dependency parsing. CL, pages 249–257.

Hiroki Ouchi, Hiroyuki Shindo, and Yuji Matsumoto. 2018. A span selection model for semantic role labeling. In Proceedings of EMNLP, pages 1630–1642, Brussels, Belgium. Association for Computational Linguistics.

Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The Proposition Bank: An annotated corpus of semantic roles. CL, pages 71–106.

Fernando C. N. Pereira and David H. D. Warren. 1983. Parsing as deduction. In Proceedings of ACL, pages 137–144, Cambridge, Massachusetts, USA. Association for Computational Linguistics.

Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Hwee Tou Ng, Anders Björkelund, Olga Uryupina, Yuchen Zhang, and Zhi Zhong. 2013. Towards robust linguistic analysis using OntoNotes. In Proceedings of CoNLL-WS, pages 143–152, Sofia, Bulgaria. Association for Computational Linguistics.

Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. CoNLL-2012 shared task: Modeling multilingual unrestricted coreference in OntoNotes. In Proceedings of CoNLL-WS, pages 1–40, Jeju Island, Korea. Association for Computational Linguistics.

Vasin Punyakanok, Dan Roth, and Wen-tau Yih. 2008. The importance of syntactic parsing and inference in semantic role labeling. Computational Linguistics, pages 257–287.

Yikang Shen, Yi Tay, Che Zheng, Dara Bahri, Donald Metzler, and Aaron Courville. 2021. StructFormer: Joint unsupervised induction of dependency and constituency structure from masked language modeling. In Proceedings of ACL-IJCNLP, pages 7196–7209, Online. Association for Computational Linguistics.

Peng Shi and Jimmy Lin. 2019. Simple BERT models for relation extraction and semantic role labeling.

Tianze Shi, Igor Malioutov, and Ozan Irsoy. 2020. Semantic role labeling as syntactic dependency parsing. In Proceedings of EMNLP, pages 7551–7571, Online. Association for Computational Linguistics.

David Smith and Jason Eisner. 2008. Dependency parsing by belief propagation. In Proceedings of EMNLP, pages 145–156, Honolulu, Hawaii. Association for Computational Linguistics.

Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-informed self-attention for semantic role labeling. In Proceedings of EMNLP, pages 5027–5038, Brussels, Belgium. Association for Computational Linguistics.

Weiwei Sun, Junjie Cao, and Xiaojun Wan. 2017. Semantic dependency parsing via book embedding. In Proceedings of ACL, pages 828–838, Vancouver, Canada. Association for Computational Linguistics.

Mihai Surdeanu, Richard Johansson, Adam Meyers, Lluís Màrquez, and Joakim Nivre. 2008. The CoNLL 2008 shared task on joint parsing of syntactic and semantic dependencies. In Proceedings of CoNLL, pages 159–177, Manchester, England. Coling 2008 Organizing Committee.

Swabha Swayamdipta, Sam Thomson, Kenton Lee, Luke Zettlemoyer, Chris Dyer, and Noah A. Smith. 2018. Syntactic scaffolds for semantic structures. In Proceedings of EMNLP, pages 3772–3782, Brussels, Belgium. Association for Computational Linguistics.

Oscar Täckström, Kuzman Ganchev, and Dipanjan Das. 2015. Efficient inference and structured learning for semantic role labeling. TACL, pages 29–41.

Zhixing Tan, Mingxuan Wang, Jun Xie, Yidong Chen, and Xiaodong Shi. 2018. Deep semantic role labeling with self-attention.

Kristina Toutanova, Aria Haghighi, and Christopher D. Manning. 2008. A global joint model for semantic role labeling. Computational Linguistics, 34(2):161–191.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of NIPS, pages 5998–6008, Long Beach, California, USA. Neural Information Processing Systems Foundation, Inc.

Xinyu Wang, Jingxian Huang, and Kewei Tu. 2019. Second-order semantic dependency parsing with end-to-end neural networks. In Proceedings of ACL, pages 4609–4618, Florence, Italy. Association for Computational Linguistics.

Qingrong Xia, Zhenghua Li, Min Zhang, Meishan Zhang, Guohong Fu, Rui Wang, and Luo Si. 2019. Syntax-aware neural semantic role labeling. In Proceedings of AAAI, pages 7305–7313. AAAI Press.

Lu Xu, Zhanming Jie, Wei Lu, and Lidong Bing. 2021. Better feature integration for named entity recognition. In Proceedings of NAACL, pages 3457–3469, Online. Association for Computational Linguistics.

Songlin Yang, Yong Jiang, Wenjuan Han, and Kewei Tu. 2020. Second-order unsupervised neural dependency parsing. In Proceedings of COLING, pages 3911–3924, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Songlin Yang and Kewei Tu. 2021. Headed span-based projective dependency parsing.

Songlin Yang, Yanpeng Zhao, and Kewei Tu. 2021. Neural bi-lexicalized PCFG induction. In Proceedings of ACL-IJCNLP, pages 2688–2699, Online. Association for Computational Linguistics.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in NIPS, pages 5753–5763. Curran Associates, Inc.

Wen-tau Yih, Matthew Richardson, Chris Meek, Ming-Wei Chang, and Jina Suh. 2016. The value of semantic parse labeling for knowledge base question answering. In Proceedings of ACL, pages 201–206, Berlin, Germany. Association for Computational Linguistics.

Juntao Yu, Bernd Bohnet, and Massimo Poesio. 2020. Named entity recognition as dependency parsing. In Proceedings of ACL, pages 6470–6476, Online. Association for Computational Linguistics.

Liwen Zhang, Ge Wang, Wenjuan Han, and Kewei Tu. 2021a. Adapting unsupervised syntactic parsing methodology for discourse dependency parsing. In Proceedings of ACL-IJCNLP, pages 5782–5794, Online. Association for Computational Linguistics.

Yu Zhang, Zhenghua Li, and Min Zhang. 2020. Efficient second-order TreeCRF for neural dependency parsing. In Proceedings of ACL, pages 3295–3305, Online. Association for Computational Linguistics.

Zhisong Zhang, Emma Strubell, and Eduard Hovy. 2021b. Comparing span extraction methods for semantic role labeling. In Proceedings of SPNLP, pages 67–77, Online. Association for Computational Linguistics.

Jie Zhou and Wei Xu. 2015. End-to-end learning of semantic role labeling using recurrent neural networks. In Proceedings of ACL-IJCNLP, pages 1127–1137, Beijing, China. Association for Computational Linguistics.

Junru Zhou, Zuchao Li, and Hai Zhao. 2020a. Parsing all: Syntax and semantics, dependencies and spans. In Findings of EMNLP, pages 4438–4449, Online. Association for Computational Linguistics.

Qiji Zhou, Yue Zhang, Donghong Ji, and Hao Tang. 2020b. AMR parsing with latent structural information. In Proceedings of ACL, pages 4306–4319, Online. Association for Computational Linguistics.

Yang Zhou, Yong Jiang, Zechuan Hu, and Kewei Tu. 2020c. Neural latent dependency model for sequence labeling.

Hao Zhu, Yonatan Bisk, and Graham Neubig. 2020. The return of lexical dependencies: Neural lexicalized PCFGs. TACL, pages 647–661.

A Implementation Details

In this work, we set up two alternative model architectures, i.e., LSTM-based and PLM-based. For the LSTM-based model, we directly adopt most settings of Dozat and Manning (2017) with some adaptations. For each token $x_i \in x$, its input vector is the concatenation of three parts,

$$\mathbf{e}_i = \left[\mathbf{e}_i^{word}; \mathbf{e}_i^{lemma}; \mathbf{e}_i^{char}\right]$$

where $\mathbf{e}_i^{word}$ and $\mathbf{e}_i^{lemma}$ are word and lemma embeddings, and $\mathbf{e}_i^{char}$ is the output of a CharLSTM layer (Lample et al., 2016). We set the dimension of lemma and CharLSTM representations to 100 in our setting. We next feed the input embeddings into 3-layer BiLSTMs (Gal and Ghahramani, 2016) to get contextualized representations with dimension 800.

$$\mathbf{h}_0, \mathbf{h}_1, \ldots, \mathbf{h}_n = \mathrm{BiLSTMs}(\mathbf{e}_0, \mathbf{e}_1, \ldots, \mathbf{e}_n)$$

Other dimension settings are kept the same as in the biaffine parser (Dozat and Manning, 2017). Following Zhang et al. (2020), we additionally set the hidden size of the Triaffine layer to 100 for CRF2O. The training process continues for at most 1,000 epochs and is early-stopped if the performance on Dev data does not increase in 100 consecutive epochs.
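A condensed sketch of this LSTM-based encoder is shown below; it is our own paraphrase of the description above (vocabulary sizes, the word-embedding dimension, and padding handling are illustrative), not the released implementation.

```python
import torch
import torch.nn as nn

class LSTMEncoder(nn.Module):
    """Concatenate word, lemma and CharLSTM embeddings, then run 3-layer BiLSTMs (800-dim output)."""
    def __init__(self, n_words, n_lemmas, n_chars,
                 word_dim=100, lemma_dim=100, char_dim=100, hidden=400):
        super().__init__()
        self.word_embed = nn.Embedding(n_words, word_dim)
        self.lemma_embed = nn.Embedding(n_lemmas, lemma_dim)
        self.char_embed = nn.Embedding(n_chars, char_dim)
        self.char_lstm = nn.LSTM(char_dim, char_dim // 2, bidirectional=True, batch_first=True)
        self.lstm = nn.LSTM(word_dim + lemma_dim + char_dim, hidden,
                            num_layers=3, bidirectional=True, batch_first=True)

    def forward(self, words, lemmas, chars):
        # words/lemmas: [batch, seq]; chars: [batch, seq, max_word_len]
        b, s, c = chars.shape
        _, (h_char, _) = self.char_lstm(self.char_embed(chars).view(b * s, c, -1))
        e_char = h_char.transpose(0, 1).reshape(b, s, -1)        # final states of both directions
        e = torch.cat((self.word_embed(words), self.lemma_embed(lemmas), e_char), dim=-1)
        h, _ = self.lstm(e)                                      # [batch, seq, 800]
        return h
```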

For PLM-based models, we directly conduct finetuning on the PLM layers without any cascading LSTMs, for the sake of simplicity. We use "bert-large-cased" for BERT and "roberta-large" for RoBERTa, respectively. We train the model for 10 epochs with a batch size of roughly 1000 tokens and use AdamW (Loshchilov and Hutter, 2019) for parameter optimization. The learning rate is $5 \times 10^{-5}$ for PLMs and $10^{-3}$ for the remaining components. We adopt a warmup strategy for the learning rate in the first epoch, followed by a linear decay in the remaining epochs.
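The optimization recipe above can be sketched as follows (our own illustration; the parameter-name filter "plm" and the step counts are placeholders, and scheduler.step() is assumed to be called after every optimizer step).

```python
import torch

def build_optimizer(model, plm_lr=5e-5, head_lr=1e-3, epochs=10, steps_per_epoch=1000):
    """AdamW with a smaller learning rate for PLM parameters, warmup during the
    first epoch, then linear decay over the remaining epochs."""
    plm_params = [p for n, p in model.named_parameters() if 'plm' in n]
    other_params = [p for n, p in model.named_parameters() if 'plm' not in n]
    optimizer = torch.optim.AdamW([{'params': plm_params, 'lr': plm_lr},
                                   {'params': other_params, 'lr': head_lr}])
    total, warmup = epochs * steps_per_epoch, steps_per_epoch

    def lr_lambda(step):
        if step < warmup:                                         # linear warmup
            return step / max(1, warmup)
        return max(0.0, (total - step) / max(1, total - warmup))  # linear decay

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```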


Figure 5: Relationship between grammar induction and SRL results on WSJ data for the LSTM- and BERT-based models, where the x-axis represents the UAS scores and the y-axis shows the corresponding SRL F1 (%) values.

Rules      Model                               WSJ (UAS)
Stanford   NL-PCFGs (Zhu et al., 2020)         40.5
Stanford   NBL-PCFGs (Yang et al., 2021)       39.1
Stanford   StructFormer (Shen et al., 2021)    46.2
Stanford   CRF                                 48.0
Stanford   CRF + BERT                          65.4

w/ gold POS tags (for reference)
Collins    DMV (Klein and Manning, 2004)       39.4
Collins    MaxEnc (Le and Zuidema, 2015)       65.8
Collins    NDMV (Jiang et al., 2016)           57.6
Collins    CRFAE (Cai et al., 2017)            55.7
Collins    L-NDMV (Han et al., 2017)           59.5
Collins    NDMV2o (Yang et al., 2020)          67.5

Table 4: Grammar induction results of our CRF model under different head-finding rules.

B Grammar Induction

Can we associate the SRL results with the quality of the induced trees? Fig. 5 presents a scatter plot showing the correlations between the results of grammar induction and SRL. Specifically, we obtain the UAS of the induced trees by first picking up all SRL model checkpoints and then parsing 1-best trees using the Eisner algorithm given the scores defined in Eq. 3. We can see that the two results are highly positively correlated, showing a roughly linear relationship.

We show precise grammar induction results in Table 4. The results are not comparable to typical methods like DMV (Klein and Manning, 2004) or CRFAE (Cai et al., 2017), as they use gold POS tags, and we use Stanford Dependencies instead of Collins rules (Collins, 2003). However, our learned task-specific trees perform significantly better than recent works under similar settings.

Another interesting observation is that the gap between the BERT-based model and the LSTM-based model is much larger than that on SRL results. This implies that LSTMs prefer to fit the SRL structures, while BERT can provide a strong inductive bias for syntax induction.