Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1546–1557, July 5–10, 2020. © 2020 Association for Computational Linguistics


Reasoning with Latent Structure Refinement for Document-Level Relation Extraction

Guoshun Nan1*, Zhijiang Guo1*, Ivan Sekulić2† and Wei Lu1

1 StatNLP Research Group, Singapore University of Technology and Design
2 Università della Svizzera italiana

[email protected], [email protected]@usi.ch, [email protected]

Abstract

Document-level relation extraction requires integrating information within and across multiple sentences of a document and capturing complex interactions between inter-sentence entities. However, effective aggregation of relevant information in the document remains a challenging research question. Existing approaches construct static document-level graphs based on syntactic trees, co-references or heuristics from the unstructured text to model the dependencies. Unlike previous methods that may not be able to capture rich non-local interactions for inference, we propose a novel model that empowers the relational reasoning across sentences by automatically inducing the latent document-level graph. We further develop a refinement strategy, which enables the model to incrementally aggregate relevant information for multi-hop reasoning. Specifically, our model achieves an F1 score of 59.05 on a large-scale document-level dataset (DocRED), significantly improving over the previous results, and also yields new state-of-the-art results on the CDR and GDA datasets. Furthermore, extensive analyses show that the model is able to discover more accurate inter-sentence relations.

1 Introduction

Relation extraction aims to detect relations among entities in text and plays a significant role in a variety of natural language processing applications. Early research efforts focus on predicting relations between entities within a sentence (Zeng et al., 2014; Xu et al., 2015a,b). However, valuable relational information between entities, such as biomedical findings, is expressed by multiple mentions across sentence boundaries in real-world scenarios (Peng et al., 2017). Therefore, the scope of extraction in the biomedical domain has recently been expanded to the cross-sentence level (Quirk and Poon, 2017; Gupta et al., 2018; Song et al., 2019).

* Equally Contributed. † Work done during internship at SUTD.

[Figure 1 example text: "Lutsenko is a former minister of internal affairs. He occupied this post in the cabinets of Yulia Tymoshenko. The ministry of internal affairs is the Ukrainian police authority." Subject: Yulia Tymoshenko; Object: Ukrainian; Relation: country of citizenship.]

Figure 1: An example adapted from the DocRED dataset. The example has four entities: Lutsenko, internal affairs, Yulia Tymoshenko and Ukrainian. Here entity Lutsenko has two mentions: Lutsenko and He. Mentions corresponding to the same entity are highlighted with the same color. The solid and dotted lines represent intra- and inter-sentence relations, respectively.

A more challenging, yet practical, extension is document-level relation extraction, where a system needs to comprehend multiple sentences to infer the relations among entities by synthesizing relevant information from the entire document (Jia et al., 2019; Yao et al., 2019). Figure 1 shows an example adapted from the recently proposed document-level dataset DocRED (Yao et al., 2019). In order to infer the inter-sentence relation (i.e., country of citizenship) between Yulia Tymoshenko and Ukrainian, one first has to identify the fact that Lutsenko works with Yulia Tymoshenko. Next, we identify that Lutsenko manages internal affairs, which is a Ukrainian authority. After incrementally connecting the evidence in the document and performing step-by-step reasoning, we are able to infer that Yulia Tymoshenko is also Ukrainian.

Prior efforts show that interactions between mentions of entities facilitate the reasoning process in document-level relation extraction. Thus, Verga et al. (2018) and Jia et al. (2019) leverage Multi-Instance Learning (Riedel et al., 2010; Surdeanu et al., 2012). On the other hand, structural information has been used to perform better reasoning since it models non-local dependencies that are obscure from the surface form alone. Peng et al. (2017) construct a dependency graph to capture interactions among n-ary entities for cross-sentence extraction. Sahu et al. (2019) extend this approach by using co-reference links to connect dependency trees of sentences to construct a document-level graph. Instead, Christopoulou et al. (2019) construct a heterogeneous graph based on a set of heuristics, and then apply an edge-oriented model (Christopoulou et al., 2018) to perform inference.

Unlike previous methods, where a document-level structure is constructed by co-references and rules, our proposed model treats the graph structure as a latent variable and induces it in an end-to-end fashion. Our model is built based on the structured attention (Kim et al., 2017; Liu and Lapata, 2018). Using a variant of the Matrix-Tree Theorem (Tutte, 1984; Koo et al., 2007), our model is able to generate task-specific dependency structures for capturing non-local interactions between entities. We further develop an iterative refinement strategy, which enables our model to dynamically build the latent structure based on the last iteration, allowing the model to incrementally capture the complex interactions for better multi-hop reasoning (Welbl et al., 2018).

Experiments show that our model significantly outperforms the existing approaches on DocRED, a large-scale document-level relation extraction dataset with a large number of entities and relations, and also yields new state-of-the-art results on two popular document-level relation extraction datasets in the biomedical domain. The code and pretrained model are available at https://github.com/nanguoshun/LSR.¹

Our contributions are summarized as follows:

• We construct a document-level graph for inference in an end-to-end fashion without relying on co-references or rules, which may not always yield optimal structures. With the iterative refinement strategy, our model is able to dynamically construct a latent structure for improved information aggregation in the entire document.

• We perform quantitative and qualitative analyses to compare with the state-of-the-art models in various settings. We demonstrate that our model is capable of discovering more accurate inter-sentence relations by utilizing a multi-hop reasoning module.

¹ Our model is implemented in PyTorch (Paszke et al., 2017).

2 Model

In this section, we present our proposed Latent Structure Refinement (LSR) model for the document-level relation extraction task. Our LSR model consists of three components: node constructor, dynamic reasoner, and classifier. The node constructor first encodes each sentence of an input document and outputs contextual representations. Representations that correspond to mentions and to tokens on the shortest dependency path in a sentence are extracted as nodes. The dynamic reasoner is then applied to induce a document-level structure based on the extracted nodes. Representations of nodes are updated based on information propagation on the latent structure, which is iteratively refined. Final representations of nodes are used to calculate classification scores by the classifier.

2.1 Node Constructor

The node constructor encodes sentences in a document into contextual representations and constructs representations of mention nodes, entity nodes and meta dependency path (MDP) nodes, as shown in Figure 2. Here MDP indicates a set of shortest dependency paths for all mentions in a sentence, and tokens in the MDP are extracted as MDP nodes.

2.1.1 Context Encoding

Given a document d, each sentence d_i in it is fed to the context encoder, which outputs the contextualized representations of each word in d_i. The context encoder can be a bidirectional LSTM (BiLSTM) (Schuster and Paliwal, 1997) or BERT (Devlin et al., 2019). Here we use the BiLSTM as an example:

\overleftarrow{h}_{ij} = \mathrm{LSTM}_l(\overleftarrow{h}_{i,j+1}, \gamma_{ij})    (1)

\overrightarrow{h}_{ij} = \mathrm{LSTM}_r(\overrightarrow{h}_{i,j-1}, \gamma_{ij})    (2)

where \overleftarrow{h}_{ij}, \overleftarrow{h}_{i,j+1}, \overrightarrow{h}_{ij} and \overrightarrow{h}_{i,j-1} represent the hidden representations of the j-th, (j+1)-th and (j-1)-th tokens of sentence d_i in the two directions, and \gamma_{ij} denotes the word embedding of the j-th token. The contextual representation of each token in the sentence is h_{ij} = [\overleftarrow{h}_{ij}; \overrightarrow{h}_{ij}], obtained by concatenating the hidden states of the two directions, where h_{ij} \in R^d and d is the dimension.
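To make this concrete, the following is a minimal PyTorch sketch of the BiLSTM context encoder of Equations (1) and (2). The class name, vocabulary size, and the per-direction hidden size of 60 (giving d = 120, matching Table 1) are our own assumptions, not details taken from the released implementation.

import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    # Minimal BiLSTM context encoder (hypothetical names and dimensions).
    def __init__(self, vocab_size, emb_dim=100, hidden_dim=60):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # bidirectional=True yields the forward and backward states of Eqs. (1)-(2)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                            bidirectional=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) word indices of one sentence d_i
        gamma = self.embed(token_ids)   # word embeddings gamma_{ij}
        h, _ = self.lstm(gamma)         # (batch, seq_len, 2 * hidden_dim)
        return h                        # h_{ij} = [left-to-right ; right-to-left]

# usage sketch
encoder = ContextEncoder(vocab_size=30000)
tokens = torch.randint(0, 30000, (1, 12))
reps = encoder(tokens)                  # contextual representations with d = 120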


[Figure 2 sketch: the three example sentences from Figure 1 are fed to the context encoder; mention representations are averaged (AVG) into entity nodes. Legend: Mention Node, Entity Node, MDP Node.]

Figure 2: Overview of the Node Constructor: A context encoder is applied to get the contextualized representations of sentences. The representations of mentions and of words in the meta dependency paths are extracted as mention nodes and MDP nodes. An average pooling is used to construct the entity node from the mention nodes. For example, the entity node Lutsenko is constructed by averaging representations of its mentions Lutsenko and He. All figures best viewed in color.

2.1.2 Node Extraction

We construct three types of nodes for a document-level graph: mention nodes, entity nodes and meta dependency path (MDP) nodes, as shown in Figure 2. Mention nodes correspond to different mentions of entities in each sentence. The representation of an entity node is computed as the average of its mentions. To build a document-level graph, existing approaches use all nodes in the dependency tree of a sentence (Sahu et al., 2019) or one sentence-level node obtained by averaging all token representations of the sentence (Christopoulou et al., 2019). Alternatively, we use tokens on the shortest dependency path between mentions in the sentence. The shortest dependency path has been widely used in sentence-level relation extraction as it is able to effectively make use of relevant information while ignoring irrelevant information (Bunescu and Mooney, 2005; Xu et al., 2015a,b). Unlike sentence-level extraction, where each sentence only has two entities, each sentence here may involve multiple mentions.
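As a small illustration of the node construction described above, the sketch below pools contextual token vectors into mention nodes and averages mention nodes into an entity node (e.g., the entity Lutsenko from its mentions Lutsenko and He). The span indices and tensor shapes are hypothetical.

import torch

def mention_node(h, start, end):
    # average the contextual token vectors h[start:end] of one mention span
    return h[start:end].mean(dim=0)

def entity_node(mention_reps):
    # entity node = average of its mention nodes
    return torch.stack(mention_reps, dim=0).mean(dim=0)

# usage sketch: h is (seq_len, d) for one sentence; the spans are made up
h = torch.randn(20, 120)
m1 = mention_node(h, 0, 1)     # e.g. "Lutsenko"
m2 = mention_node(h, 8, 9)     # e.g. "He" (from another sentence in practice)
e = entity_node([m1, m2])      # entity node for Lutsenko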

2.2 Dynamic Reasoner

The dynamic reasoner has two modules, structure induction and multi-hop reasoning, as shown in Figure 3. The structure induction module is used to learn a latent structure of a document-level graph. The multi-hop reasoning module is used to perform inference on the induced latent structure, where representations of each node will be updated based on the information aggregation scheme. We stack N blocks in order to iteratively refine the latent document-level graph for better reasoning.

2.2.1 Structure Induction

Unlike existing models that use co-reference links (Sahu et al., 2019) or heuristics (Christopoulou et al., 2019) to construct a document-level graph for reasoning, our model treats the graph as a latent variable and induces it in an end-to-end fashion. The structure induction module is built based on the structured attention (Kim et al., 2017; Liu and Lapata, 2018). Inspired by Liu and Lapata (2018), we use a variant of Kirchhoff's Matrix-Tree Theorem (Tutte, 1984; Koo et al., 2007) to induce the latent dependency structure.

[Figure 3 sketch: an input graph is refined N times; in each block, node representations pass through two feed-forward networks (FFN) and a bilinear layer to produce a weighted dependency graph (structure induction), which is then processed by densely connected graph neural networks (multi-hop reasoning).]

Figure 3: Overview of the Dynamic Reasoner. Each block consists of two sub-modules: structure induction and multi-hop reasoning. The first module takes the nodes constructed by the Node Constructor as inputs. Representations of nodes are fed into two feed-forward networks before the bilinear transformation. The latent document-level structure is computed by the Matrix-Tree Theorem. The second module takes the structure as input and updates representations of nodes by using the densely connected graph convolutional networks. We stack N blocks which correspond to N times of refinement. Each iteration outputs the latent structure for inference.

Let u_i denote the contextual representation of the i-th node, where u_i ∈ R^d. We first calculate the pairwise unnormalized attention score s_{ij} between the i-th and the j-th nodes with the node representations u_i and u_j. The score s_{ij} is calculated by two feed-forward neural networks and a bilinear transformation:

s_{ij} = (\tanh(W_p u_i))^{\top} W_b (\tanh(W_c u_j))    (3)

where W_p ∈ R^{d×d} and W_c ∈ R^{d×d} are weights for the two feed-forward neural networks, d is the dimension of the node representations, and tanh is applied as the activation function. W_b ∈ R^{d×d} are the weights for the bilinear transformation. Next we compute the root score s_i^r, which represents the unnormalized probability of the i-th node being selected as the root node of the structure:

s_i^r = W_r u_i    (4)

where W_r ∈ R^{1×d} is the weight for the linear transformation. Following Koo et al. (2007), we calculate the marginal probability of each dependency edge of the document-level graph. For a graph G with n nodes, we first assign non-negative weights P ∈ R^{n×n} to the edges of the graph:

P_{ij} = \begin{cases} 0 & \text{if } i = j \\ \exp(s_{ij}) & \text{otherwise} \end{cases}    (5)

where P_{ij} is the weight of the edge between the i-th and the j-th node. We then define the Laplacian matrix L ∈ R^{n×n} of G in Equation (6), and its variant \bar{L} ∈ R^{n×n} in Equation (7) for further computations (Koo et al., 2007).

L_{ij} = \begin{cases} \sum_{i'=1}^{n} P_{i'j} & \text{if } i = j \\ -P_{ij} & \text{otherwise} \end{cases}    (6)

\bar{L}_{ij} = \begin{cases} \exp(s_j^r) & \text{if } i = 1 \\ L_{ij} & \text{if } i > 1 \end{cases}    (7)

We use A_{ij} to denote the marginal probability of the dependency edge between the i-th and the j-th node. Then, A_{ij} can be derived based on Equation (8), where \delta is the Kronecker delta (Koo et al., 2007).

A_{ij} = (1 - \delta_{1,j}) P_{ij} [\bar{L}^{-1}]_{jj} - (1 - \delta_{i,1}) P_{ij} [\bar{L}^{-1}]_{ji}    (8)

Here, A ∈ R^{n×n} can be interpreted as a weighted adjacency matrix of the document-level entity graph. Finally, we can feed A into the multi-hop reasoning module to update the representations of nodes in the latent structure.
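A minimal PyTorch sketch of the structure induction step in Equations (3)-(8) is given below. It follows the Matrix-Tree formulation of Koo et al. (2007) as used by Liu and Lapata (2018), where the first row of the augmented Laplacian holds the exponentiated root scores and the edge marginals use the diagonal entries of its inverse; the class and variable names are ours, and masking or numerical-stability details may differ from the authors' released code.

import torch
import torch.nn as nn

class StructureInduction(nn.Module):
    # Latent structure induction over n nodes (hypothetical sketch of Eqs. 3-8).
    def __init__(self, d):
        super().__init__()
        self.wp = nn.Linear(d, d)                         # W_p in Eq. (3)
        self.wc = nn.Linear(d, d)                         # W_c in Eq. (3)
        self.wb = nn.Parameter(torch.randn(d, d) * 0.01)  # bilinear W_b
        self.wr = nn.Linear(d, 1)                         # W_r in Eq. (4)

    def forward(self, u):
        # u: (n, d) node representations from the node constructor
        n = u.size(0)
        p = torch.tanh(self.wp(u))            # tanh(W_p u_i)
        c = torch.tanh(self.wc(u))            # tanh(W_c u_j)
        s = p @ self.wb @ c.t()               # pairwise scores s_ij, Eq. (3)
        s_root = self.wr(u).squeeze(-1)       # root scores s_i^r, Eq. (4)

        # Eq. (5): non-negative edge weights with a zero diagonal
        P = torch.exp(s) * (1.0 - torch.eye(n))

        # Eq. (6): Laplacian; Eq. (7): first row replaced by exponentiated root scores
        L = torch.diag(P.sum(dim=0)) - P
        L_bar = L.clone()
        L_bar[0] = torch.exp(s_root)

        # Eq. (8): marginal edge probabilities (Koo et al., 2007)
        L_inv = torch.inverse(L_bar)
        mask_j = torch.ones(n)
        mask_j[0] = 0.0                       # (1 - delta_{1,j})
        mask_i = torch.ones(n)
        mask_i[0] = 0.0                       # (1 - delta_{i,1})
        A = P * L_inv.diagonal().unsqueeze(0) * mask_j.unsqueeze(0) \
            - P * L_inv.t() * mask_i.unsqueeze(1)
        return A                              # (n, n) weighted adjacency matrix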

2.2.2 Multi-hop Reasoning

Graph neural networks have been widely used in different tasks to perform multi-hop reasoning (Song et al., 2018a; Yang et al., 2019; Tu et al., 2019; Lin et al., 2019), as they are able to effectively collect relevant evidence based on an information aggregation scheme. Specifically, our model is based on graph convolutional networks (GCNs) (Kipf and Welling, 2017) to perform reasoning.

Formally, given a graph G with n nodes, which can be represented with an n × n adjacency matrix A induced by the previous structure induction module, the convolution computation for node i at the l-th layer, which takes the representation u_i^{l-1} from the previous layer as input and outputs the updated representation u_i^l, can be defined as:

u_i^l = \sigma\Big( \sum_{j=1}^{n} A_{ij} W^l u_j^{l-1} + b^l \Big)    (9)

where W^l and b^l are the weight matrix and bias vector for the l-th layer, respectively, and \sigma is the ReLU (Nair and Hinton, 2010) activation function. u_i^0 ∈ R^d is the initial contextual representation of the i-th node constructed by the node constructor.

Following Guo et al. (2019b), we add dense connections to the GCNs in order to capture more structural information on a large document-level graph. With the help of dense connections, we are able to train a deeper model, allowing richer local and non-local information to be captured for learning a better graph representation. The computation at each graph convolution layer is similar to Equation (9).
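The graph convolution of Equation (9), extended with dense connections in the spirit of Guo et al. (2019b), can be sketched as follows; the number of layers and the concatenation scheme are assumptions rather than the authors' exact configuration.

import torch
import torch.nn as nn

class DenseGCN(nn.Module):
    # Densely connected GCN reasoning over the induced adjacency A (sketch).
    def __init__(self, d, num_layers=3):
        super().__init__()
        # layer l sees the input concatenated with all previous layer outputs
        self.layers = nn.ModuleList(
            nn.Linear(d * (l + 1), d) for l in range(num_layers))

    def forward(self, u, A):
        # u: (n, d) node representations, A: (n, n) adjacency from Eq. (8)
        outputs = [u]
        for layer in self.layers:
            x = torch.cat(outputs, dim=-1)    # dense connections
            h = torch.relu(A @ layer(x))      # Eq. (9): weighted aggregation + ReLU
            outputs.append(h)
        return outputs[-1]                    # updated node representations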

2.2.3 Iterative Refinement

Though structured attention (Kim et al., 2017; Liu and Lapata, 2018) is able to automatically induce a latent structure, recent research efforts show that the induced structure is relatively shallow and may not be able to model the complex dependencies for document-level input (Liu et al., 2019b; Ferracane et al., 2019). Unlike previous work (Liu and Lapata, 2018) that only induces the latent structure once, we repeatedly refine the document-level graph based on the updated representations, allowing the model to infer a more informative structure that goes beyond simple parent-child relations.

As shown in Figure 3, we stack N blocks of the dynamic reasoner in order to induce the document-level structure N times. Intuitively, the reasoner induces a shallow structure at early iterations since the information propagates mostly between neighboring nodes. As the structure gets more refined by interactions with richer non-local information, the induction module is able to generate a more informative structure.
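Putting the two sub-modules together, the N refinement blocks can be wired up roughly as below, reusing the StructureInduction and DenseGCN sketches above. The default of two blocks only echoes Table 1; the wiring is a paraphrase of Figure 3, not the released code.

import torch.nn as nn

class DynamicReasoner(nn.Module):
    # N refinement blocks: re-induce the latent graph, then propagate (sketch).
    def __init__(self, d, num_blocks=2):
        super().__init__()
        self.inductions = nn.ModuleList(
            StructureInduction(d) for _ in range(num_blocks))
        self.reasoners = nn.ModuleList(
            DenseGCN(d) for _ in range(num_blocks))

    def forward(self, nodes):
        # nodes: (n, d) mention, entity and MDP node representations
        for induce, gcn in zip(self.inductions, self.reasoners):
            A = induce(nodes)       # refine the latent structure from the current nodes
            nodes = gcn(nodes, A)   # multi-hop reasoning over the refined structure
        return nodes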

2.3 Classifier

After N times of refinement, we obtain representations of all the nodes. Following Yao et al. (2019), for each entity pair (e_i, e_j), we use a bilinear function to compute the probability for each relation type r as:

P(r \mid e_i, e_j) = \sigma(e_i^{\top} W_e e_j + b_e)_r    (10)

where W_e ∈ R^{d×k×d} and b_e ∈ R^k are trainable weights and bias, with k being the number of relation categories, \sigma is the sigmoid function, and the subscript r on the right-hand side of the equation refers to the relation type.
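Equation (10) is a standard bilinear scorer over entity pairs; a possible PyTorch rendering (class and dimension names ours) is shown below. nn.Bilinear stores its weight as a k × d × d tensor, which plays the role of W_e.

import torch
import torch.nn as nn

class PairClassifier(nn.Module):
    # Bilinear relation scorer of Eq. (10): one sigmoid probability per relation type.
    def __init__(self, d, num_relations):
        super().__init__()
        self.bilinear = nn.Bilinear(d, d, num_relations)   # e_i^T W_e e_j + b_e

    def forward(self, e_i, e_j):
        # e_i, e_j: (batch, d) head and tail entity node representations
        return torch.sigmoid(self.bilinear(e_i, e_j))      # (batch, k) probabilities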

3 Experiments

3.1 Data

We evaluate our model on DocRED (Yao et al., 2019), the largest human-annotated dataset for document-level relation extraction, and on two other popular document-level relation extraction datasets in the biomedical domain, Chemical-Disease Reactions (CDR) (Li et al., 2016a) and Gene-Disease Associations (GDA) (Wu et al., 2019). DocRED contains 3,053 documents for training, 1,000 for development and 1,000 for test, with 132,375 entities and 56,354 relational facts in total. CDR consists of 500 training instances, 500 development instances, and 500 testing instances. GDA contains 29,192 documents for training and 1,000 for test. We follow Christopoulou et al. (2019) and split the training set of GDA 80/20 for training and development.

With more than 40% of the relational facts requiring reading and reasoning over multiple sentences, DocRED significantly differs from previous sentence-level datasets (Doddington et al., 2004; Hendrickx et al., 2009; Zhang et al., 2018). Unlike existing document-level datasets (Li et al., 2016a; Quirk and Poon, 2017; Peng et al., 2017; Verga et al., 2018; Jia et al., 2019) that are in the specific biomedical domain considering only the drug-gene-disease relation, DocRED covers a broad range of categories with 96 relation types.

Batch size: 20
Learning rate: 0.001
Optimizer: Adam
Hidden size: 120
Induction block number: 2
GCN dropout: 0.3

Table 1: Hyper-parameters of LSR.

3.2 Setup

We use spaCy² to get the meta dependency paths of sentences in a document. Following Yao et al. (2019) and Wang et al. (2019), we use the GloVe (Pennington et al., 2014) embedding with a BiLSTM, and uncased BERT-Base (Devlin et al., 2019), as the context encoders. All hyper-parameters are tuned on the development set. We list some of the important hyper-parameters in Table 1.

² https://spacy.io/

Following Yao et al. (2019), we use F1 and Ign F1 as the evaluation metrics. Ign F1 denotes the F1 score excluding relational facts shared by the training and dev/test sets. F1 scores for intra- and inter-sentence entity pairs are also reported. Evaluation on the test set is done through CodaLab³.

³ https://competitions.codalab.org/competitions/20717

3.3 Main Results

We compare our proposed LSR with the following three types of competitive models on the DocRED dataset, and show the main results in Table 2.

• Sequence-based Models. These models leverage different neural architectures to encode sentences in the document, including convolutional neural networks (CNN) (Zeng et al., 2014), LSTM, bidirectional LSTM (BiLSTM) (Cai et al., 2016) and attention-based LSTM (ContextAware) (Sorokin and Gurevych, 2017).

• Graph-based Models. These models construct task-specific graphs for inference. GCNN (Sahu et al., 2019) constructs a document-level graph by co-reference links, and then applies relational GCNs for reasoning. EoG (Christopoulou et al., 2019) is the state-of-the-art document-level relation extraction model in the biomedical domain; it first uses heuristics to construct the graph, then leverages an edge-oriented model to perform inference. GCNN and EoG are based on static structures. GAT (Velickovic et al., 2018) is able to learn a weighted graph structure based on a local attention mechanism. AGGCN (Guo et al., 2019a) is the state-of-the-art sentence-level relation extraction model, which constructs the latent structure by self-attention. These two models are able to dynamically construct task-specific structures.

• BERT-based Models. These models fine-tune BERT (Devlin et al., 2019) for DocRED. Specifically, Two-Phase BERT (Wang et al., 2019) is the best reported model. It is a pipeline model which predicts whether a relation exists between an entity pair in the first phase and predicts the type of the relation in the second phase.

Model                                Dev Ign F1   Dev F1   Dev Intra-F1   Dev Inter-F1   Test Ign F1   Test F1
CNN (Yao et al., 2019)               41.58        43.45    51.87*         37.58*         40.33         42.26
LSTM (Yao et al., 2019)              48.44        50.68    56.57*         41.47*         47.71         50.07
BiLSTM (Yao et al., 2019)            48.87        50.94    57.05*         43.49*         48.78         51.06
ContextAware (Yao et al., 2019)      48.94        51.09    56.74*         42.26*         48.40         50.70
GCNN ¨ (Sahu et al., 2019)           46.22        51.52    57.78*         44.11*         49.59         51.62
EoG ¨ (Christopoulou et al., 2019)   45.94        52.15    58.90*         44.60*         49.48         51.82
GAT ¨ (Velickovic et al., 2018)      45.17        51.44    58.14*         43.94*         47.36         49.51
AGGCN ¨ (Guo et al., 2019a)          46.29        52.47    58.76*         45.45*         48.89         51.45
GloVe+LSR                            48.82        55.17    60.83*         48.35*         52.15         54.18
BERT (Wang et al., 2019)             -            54.16    61.61*         47.15*         -             53.20
Two-Phase BERT (Wang et al., 2019)   -            54.42    61.80*         47.28*         -             53.92
BERT+LSR                             52.43        59.00    65.26*         52.05*         56.97         59.05

Table 2: Main results on the development and the test set of DocRED. Models with ¨ are adapted to DocRED based on their open implementations. Results with * are computed based on re-trained models, as we need to evaluate F1 in both the intra- and inter-sentence settings, which are not given in the original papers.

As shown in Table 2, LSR with GloVe achieves 54.18 F1 on the test set, which is the new state-of-the-art result for models with GloVe. In particular, our model consistently outperforms sequence-based models by a significant margin. For example, LSR improves upon the best sequence-based model BiLSTM by 3.1 points in terms of F1. This suggests that models which directly encode the entire document are unable to capture the inter-sentence relations present in documents.

Under the same setting, our model consistently outperforms graph-based models based on static graphs or attention mechanisms. Compared with EoG, our LSR model achieves 3.0 and 2.4 points higher F1 on the development and test sets, respectively. We also have similar observations for the GCNN model, which shows that a static document-level graph may not be able to capture the complex interactions in a document. The dynamic latent structure induced by LSR captures richer non-local dependencies. Moreover, LSR also outperforms GAT and AGGCN. This empirically shows that, compared to the models that use local attention and self-attention (Velickovic et al., 2018; Guo et al., 2019a), LSR can induce more informative document-level structures for better reasoning. Our LSR model also shows its superiority in terms of Ign F1.

In addition, LSR with GloVe obtains better results than the two BERT-based models. This empirically shows that our model is able to capture long-range dependencies even without using powerful context encoders. Following Wang et al. (2019), we leverage BERT as the context encoder. As shown in Table 2, our LSR model with BERT achieves a 59.05 F1 score on DocRED, which is a new state-of-the-art result. As of the ACL deadline on the 9th of December 2019, we held the first position on the CodaLab scoreboard under the alias diskorak.

3.4 Intra- and Inter-sentence Performance

In this subsection, we analyze intra- and inter-sentence performance on the development set. An entity pair requires inter-sentence reasoning if the two entities from the same document have no mentions in the same sentence. In DocRED's development set, about 45% of entity pairs require information aggregation over multiple sentences.

Under the same setting, our LSR model outperforms all other models in both the intra- and inter-sentence settings. The differences in F1 scores between LSR and the other models in the inter-sentence setting tend to be larger than the differences in the intra-sentence setting. These results demonstrate that the majority of LSR's superiority comes from the inter-sentence relational facts, suggesting that the latent structure induced by our model is indeed capable of synthesizing information across multiple sentences of a document.

Model                           F1     Intra-F1   Inter-F1
Gu et al. (2017)                61.3   57.2       11.7
Nguyen and Verspoor (2018)      62.3   -          -
Verga et al. (2018)             62.1   -          -
Sahu et al. (2019)              58.6   -          -
Christopoulou et al. (2019)     63.6   68.2       50.9
LSR                             61.2   66.2       50.3
LSR w/o MDP Nodes               64.8   68.9       53.1
========================================================
Peng et al. (2016)              63.1   -          -
Li et al. (2016b)               67.3   58.9       -
Panyam et al. (2018)            60.3   65.1       45.7
Zheng et al. (2018)             61.5   -          -

Table 3: Results on the test set of the CDR dataset. The methods below the double line take advantage of additional training data and/or incorporate external tools.

Furthermore, LSR with GloVe also proves better in the inter-sentence setting compared with the two BERT-based (Wang et al., 2019) models, indicating the latent structure's superiority in resolving long-range dependencies across the whole document compared with the BERT encoder.

3.5 Results on the Biomedical Datasets

Table 3 depicts the comparisons with state-of-the-art models on the CDR dataset. Gu et al. (2017), Nguyen and Verspoor (2018) and Verga et al. (2018) leverage sequence-based models, using convolutional neural networks and self-attention networks as the encoders. Sahu et al. (2019) and Christopoulou et al. (2019) use graph-based models. As shown in Table 3, our LSR performs worse than the state-of-the-art models. It is challenging for an off-the-shelf parser to produce high-quality dependency trees in the biomedical domain, and we observe that the MDP nodes extracted by the spaCy parser from the CDR dataset contain much less informative context compared with the nodes from DocRED. Here we introduce a simplified LSR model, indicated as "LSR w/o MDP Nodes", which removes the MDP nodes and builds a fully-connected graph using all tokens of a document. "LSR w/o MDP Nodes" consistently outperforms sequence-based and graph-based models, indicating the effectiveness of the latent structure. Moreover, the simplified LSR outperforms most of the models with external resources, except for Li et al. (2016b), which leverages co-training with additional unlabeled training data. We believe such a setting would also benefit our LSR model.

Model                                F1     Intra-F1   Inter-F1
NoInf (Christopoulou et al., 2019)   74.6   79.1       49.3
Full (Christopoulou et al., 2019)    80.8   84.1       54.7
EoG (Christopoulou et al., 2019)     81.5   85.2       50.0
LSR                                  79.6   83.1       49.6
LSR w/o MDP Nodes                    82.2   85.4       51.1

Table 4: Results on the test set of the GDA dataset.

Figure 4: Intra- and inter-sentence F1 for different graph structures in QAGCN, EoG, AGGCN and LSR. The number of refinements ranges from 1 to 4.

Table 4 shows the results on the distantly supervised GDA dataset. Here "Full" indicates the EoG model with a fully connected graph as input, while "NoInf" is a variant of the EoG model without the inference component (Christopoulou et al., 2018). The simplified LSR model achieves a new state-of-the-art result on GDA. The "Full" model (Christopoulou et al., 2019) yields a higher F1 score in the inter-sentence setting while having a relatively low score in the intra-sentence setting. This is likely because the model neglects the differences between relations expressed within a sentence and across sentences.

3.6 Model Analysis

In this subsection, we use the development set of DocRED to demonstrate the effectiveness of the latent structure and refinements.

3.6.1 Does Latent Structure Matter?

We investigate the extent to which the latent structures, which are induced and iteratively refined by the proposed dynamic reasoner, help to improve the overall performance. We experiment with the three different structures defined below. For fair comparisons, we use the same GCN model to perform multi-hop reasoning for all these structures.

Rule-based Structure: We use the rule-based structure in EoG (Christopoulou et al., 2019). We also adapt the rules from De Cao et al. (2019) for multi-hop question answering, i.e., each mention node is connected to its entity node and to the same mention nodes across sentences, while mention nodes and MDP nodes that reside in the same sentence are fully connected. The model is termed QAGCN.

[Figure 5 example document (abridged): "[1] Lark Force was an Australian Army formation established in March 1941 during World War II for service in New Britain and New Ireland. ... [4] Most of Lark Force was captured by the Imperial Japanese Army after Rabaul and Kavieng were captured in January 1942. [5] The officers of Lark Force were transported to Japan, however the NCOs and men were unfortunately torpedoed by the USS Sturgeon while being transported aboard the Montevideo Maru." Head: Japan; Tail: World War II; Relation: participant of. The figure visualizes the attention scores of the mention World War II over the other mentions for ContextAware, AGGCN (three heads) and LSR across two refinement steps, together with the relations predicted by each model (e.g., P17, P137, P607, P1344, NIL) and the ground truth.]

Figure 5: Case study of an example from the development set of DocRED. We visualize the reasoning process for predicting the relation of the entity pair ⟨Japan, World War II⟩ by LSR and AGGCN in two refinement steps, using the attention scores of the mention World War II in each step. We scale all attention scores by 1000 to illustrate them more clearly. Some sentences are omitted due to space limitations.

Attention-based Structure: This structure is induced by AGGCN (Guo et al., 2019a) with multi-head attention (Vaswani et al., 2017). We extend the model from the sentence level to the document level.

We explore multiple settings of these models with different block numbers ranging from 1 to 4, where a block is composed of a graph construction component and a densely connected GCN component. As shown in Figure 4, LSR outperforms QAGCN, EoG and AGGCN in terms of overall F1. This empirically confirms our hypothesis that the latent structure induced by LSR is able to capture a more informative context for the entire document.

3.6.2 Does Refinement Matter?

As shown in Figure 4, our LSR yields the best performance in the second refinement, outperforming the first induction by 0.72% in terms of overall F1. This indicates that the proposed LSR is able to induce more accurate structures by iterative refinement. However, too many iterations may lead to an F1 drop due to over-fitting.

3.7 Ablation Study

Table 5 shows F1 scores of the full LSR model and of variants with different components turned off one at a time. We observe that most of the components contribute to the main model, as the performance deteriorates when any of them is missing. The most significant difference is visible in the structure induction module: removing it leads to a 3.26-point drop in F1 score. This result indicates that the latent structure plays a key role in the overall performance.

Model                      F1      Intra-F1   Inter-F1
Full model                 55.17   60.83      48.35
- 1 Refinement             54.42   60.46      47.67
- 2 Structure Induction    51.91   58.08      45.04
- 1 Multi-hop Reasoning    54.49   59.75      47.49
- 2 Multi-hop Reasoning    54.24   60.58      47.15
- MDP nodes                54.20   60.54      47.12

Table 5: Ablation study of LSR on DocRED.

3.8 Case Study

In Figure 5, we present a case study to analyze why the latent structure induced by our proposed LSR performs better than the structures learned by AGGCN. We use the entity World War II to illustrate the reasoning process, and our goal here is to predict the relation of the entity pair ⟨Japan, World War II⟩. As shown in Figure 5, in the first refinement of LSR, World War II interacts with several local mentions with higher attention scores, e.g., 0.43 for the mention Lark Force, which will be used as a bridge between the mentions Japan and World War II. In the second refinement, the attention scores of several non-local mentions, such as Japan and Imperial Japanese Army, significantly increase from 0.09 to 0.41 and from 0.17 to 0.37, respectively, indicating that information is propagated globally at this step. With such intra- and inter-sentence structures, the relation of the entity pair ⟨Japan, World War II⟩ can be predicted as "participant of", which is denoted by P1344. Compared with LSR, the attention scores learned by AGGCN are much more balanced, indicating that the model may not be able to construct an informative structure for inference; e.g., the highest score is 0.27 in the second head, and most of the scores are near 0.11.

We also depict the predicted relations of ContextAware, AGGCN and LSR in the graph on the right side of Figure 5. Interested readers can refer to Yao et al. (2019) for the definitions of relations such as P607, P17, etc. The LSR model proves capable of filling in the missing relation for ⟨Japan, World War II⟩, which requires reasoning across sentences. However, LSR also attends to the mention New Ireland with a high score, thus failing to predict that the entity pair ⟨New Ireland, World War II⟩ actually has no relation (NIL type).

4 Related Work

Document-level relation extraction. Early efforts focus on predicting relations between entities within a single sentence by modeling interactions in the input sequence (Zeng et al., 2014; Wang et al., 2016; Zhou et al., 2016; Zhang et al., 2017; Guo et al., 2020) or the corresponding dependency tree (Xu et al., 2015a,b; Liu et al., 2015; Miwa and Bansal, 2016; Zhang et al., 2018). These approaches do not consider interactions across mentions and ignore relations expressed across sentence boundaries. Recent work begins to explore cross-sentence extraction (Quirk and Poon, 2017; Peng et al., 2017; Gupta et al., 2018; Song et al., 2018c, 2019). Instead of using discourse structure understanding techniques (Liu et al., 2019a; Lei et al., 2017, 2018), these approaches leverage the dependency graph to capture inter-sentence interactions, and their scope is still limited to several sentences. More recently, the extraction scope has been expanded to the entire document (Verga et al., 2018; Jia et al., 2019; Sahu et al., 2019; Christopoulou et al., 2019) in the biomedical domain by only considering a few relations among chemicals. Unlike previous work, we focus on document-level relation extraction datasets (Yao et al., 2019; Li et al., 2016a; Wu et al., 2019) from different domains with a large number of relations and entities, which require understanding a document and performing multi-hop reasoning.

Structure-based relational reasoning. Structural information has been widely used for relational reasoning in various NLP applications, including question answering (Dhingra et al., 2018; De Cao et al., 2019; Song et al., 2018a) and relation extraction (Sahu et al., 2019; Christopoulou et al., 2019). Song et al. (2018a) and De Cao et al. (2019) leverage co-reference information and a set of rules to construct document-level entity graphs. GCNs (Kipf and Welling, 2017) or GRNs (Song et al., 2018b) are applied to perform reasoning for multi-hop question answering (Welbl et al., 2018). Sahu et al. (2019) also utilize co-reference links to construct the dependency graph and use labelled-edge GCNs (Marcheggiani and Titov, 2017) for document-level relation extraction. Instead of using GNNs, Christopoulou et al. (2019) use the edge-oriented model (Christopoulou et al., 2018) for logical inference based on a heterogeneous graph constructed by heuristics. Unlike previous approaches that use syntactic trees, co-references or heuristics, our LSR model treats the document-level structure as a latent variable and induces it in an iteratively refined fashion, allowing the model to dynamically construct the graph for better relational reasoning.

5 Conclusion

We introduce a novel latent structure refinement (LSR) model for better reasoning in the document-level relation extraction task. Unlike previous approaches that rely on syntactic trees, co-references or heuristics, LSR dynamically learns a document-level structure and makes predictions in an end-to-end fashion. There are multiple avenues for future work. One possible direction is to extend the scope of structure induction to the construction of nodes without relying on an external parser.

Acknowledgments

We would like to thank the anonymous reviewers for their thoughtful and constructive comments. This research is supported by the Ministry of Education, Singapore, under its Academic Research Fund (AcRF) Tier 2 Programme (MOE AcRF Tier 2 Award No: MOE2017-T2-1-156). Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not reflect the views of the Ministry of Education, Singapore.


References

Razvan C. Bunescu and Raymond J. Mooney. 2005. A shortest path dependency kernel for relation extraction. In Proc. of EMNLP.

Rui Cai, Xiaodong Zhang, and Houfeng Wang. 2016. Bidirectional recurrent convolutional neural network for relation classification. In Proc. of ACL.

Fenia Christopoulou, Makoto Miwa, and Sophia Ananiadou. 2018. A walk-based model on entity graphs for relation extraction. In Proc. of ACL.

Fenia Christopoulou, Makoto Miwa, and Sophia Ananiadou. 2019. Connecting the dots: Document-level neural relation extraction with edge-oriented graphs. In Proc. of EMNLP.

Nicola De Cao, Wilker Aziz, and Ivan Titov. 2019. Question answering by reasoning across documents with graph convolutional networks. In Proc. of NAACL-HLT.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. of NAACL-HLT.

Bhuwan Dhingra, Qiao Jin, Zhilin Yang, William Cohen, and Ruslan Salakhutdinov. 2018. Neural models for reasoning over multiple mentions using coreference. In Proc. of NAACL-HLT.

George R. Doddington, Alexis Mitchell, Mark A. Przybocki, Lance A. Ramshaw, Stephanie Strassel, and Ralph M. Weischedel. 2004. The automatic content extraction (ACE) program - tasks, data, and evaluation. In Proc. of LREC.

Elisa Ferracane, Greg Durrett, Junyi Jessy Li, and Katrin Erk. 2019. Evaluating discourse in structured text representations. In Proc. of ACL.

Jinghang Gu, Fuqing Sun, Longhua Qian, and Guodong Zhou. 2017. Chemical-induced disease relation extraction via convolutional neural network. Database: The Journal of Biological Databases and Curation, 2017.

Zhijiang Guo, Guoshun Nan, Wei Lu, and Shay B. Cohen. 2020. Learning latent forests for medical relation extraction. In Proc. of IJCAI.

Zhijiang Guo, Yan Zhang, and Wei Lu. 2019a. Attention guided graph convolutional networks for relation extraction. In Proc. of ACL.

Zhijiang Guo, Yan Zhang, Zhiyang Teng, and Wei Lu. 2019b. Densely connected graph convolutional networks for graph-to-sequence learning. Transactions of the Association for Computational Linguistics, 7:297–312.

Pankaj Gupta, Subburam Rajaram, Hinrich Schütze, Bernt Andrassy, and Thomas A. Runkler. 2018. Neural relation extraction within and across sentence boundaries. In Proc. of AAAI.

Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2009. SemEval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proc. of NAACL-HLT.

Robin Jia, Cliff Wong, and Hoifung Poon. 2019. Document-level n-ary relation extraction with multiscale representation learning. In Proc. of NAACL-HLT.

Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. 2017. Structured attention networks. In Proc. of ICLR.

Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In Proc. of ICLR.

Terry Koo, Amir Globerson, Xavier Carreras, and Michael Collins. 2007. Structured prediction models via the matrix-tree theorem. In Proc. of EMNLP-CoNLL.

Wenqiang Lei, Xuancong Wang, Meichun Liu, Ilija Ilievski, Xiangnan He, and Min-Yen Kan. 2017. SWIM: A simple word interaction model for implicit discourse relation recognition. In Proc. of IJCAI.

Wenqiang Lei, Yuanxin Xiang, Yuwei Wang, Qian Zhong, Meichun Liu, and Min-Yen Kan. 2018. Linguistic properties matter for implicit discourse relation recognition: Combining semantic interaction, topic continuity and attribution. In Proc. of AAAI.

Jiao Li, Yueping Sun, Robin J. Johnson, Daniela Sciaky, Chih-Hsuan Wei, Robert Leaman, Allan Peter Davis, Carolyn J. Mattingly, Thomas C. Wiegers, and Zhiyong Lu. 2016a. BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database: The Journal of Biological Databases and Curation, 2016.

Zhiheng Li, Zhihao Yang, Hongfei Lin, Jian Wang, Yingyi Gui, Yin Zhang, and Lei Wang. 2016b. CIDExtractor: A chemical-induced disease relation extraction system for biomedical literature. In Proc. of BIBM.

Bill Yuchen Lin, Xinyue Chen, Jamin Chen, and Xiang Ren. 2019. KagNet: Knowledge-aware graph networks for commonsense reasoning. In Proc. of EMNLP.

Jiangming Liu, Shay B. Cohen, and Mirella Lapata. 2019a. Discourse representation parsing for sentences and documents. In Proc. of ACL.

Yang Liu and Mirella Lapata. 2018. Learning structured text representations. Transactions of the Association for Computational Linguistics, 6:63–75.

Yang Liu, Ivan Titov, and Mirella Lapata. 2019b. Single document summarization as tree induction. In Proc. of NAACL-HLT.


Yang Liu, Furu Wei, Sujian Li, Heng Ji, Ming Zhou, and Houfeng Wang. 2015. A dependency-based neural network for relation classification. In Proc. of ACL.

Diego Marcheggiani and Ivan Titov. 2017. Encoding sentences with graph convolutional networks for semantic role labeling. In Proc. of EMNLP.

Makoto Miwa and Mohit Bansal. 2016. End-to-end relation extraction using LSTMs on sequences and tree structures. In Proc. of ACL.

Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proc. of ICML.

Dat Quoc Nguyen and Karin M. Verspoor. 2018. Convolutional neural networks for chemical-disease relation extraction are improved with character-based word embeddings. In Proc. of BioNLP.

Nagesh C Panyam, Karin Verspoor, Trevor Cohn, and Kotagiri Ramamohanarao. 2018. Exploiting graph kernels for high performance biomedical relation extraction. Journal of Biomedical Semantics, 9(1):7.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch.

Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, and Wen-tau Yih. 2017. Cross-sentence n-ary relation extraction with graph LSTMs. Transactions of the Association for Computational Linguistics, 5:101–115.

Yifan Peng, Chih-Hsuan Wei, and Zhiyong Lu. 2016. Improving chemical disease relation extraction with rich features and weakly labeled data. Journal of Cheminformatics, 8.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proc. of EMNLP.

Chris Quirk and Hoifung Poon. 2017. Distant supervision for relation extraction beyond the sentence boundary. In Proc. of EACL.

Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. In Proc. of ECML/PKDD.

Sunil Kumar Sahu, Fenia Christopoulou, Makoto Miwa, and Sophia Ananiadou. 2019. Inter-sentence relation extraction with document-level graph convolutional neural network. In Proc. of ACL.

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Trans. Signal Processing, 45:2673–2681.

Linfeng Song, Zhiguo Wang, Mo Yu, Yue Zhang, Radu Florian, and Daniel Gildea. 2018a. Exploring graph-structured passage representation for multi-hop reading comprehension with graph neural networks. arXiv preprint arXiv:1809.02040.

Linfeng Song, Yue Zhang, Daniel Gildea, Mo Yu, Zhiguo Wang, and Jinsong Su. 2019. Leveraging dependency forest for neural medical relation extraction. In Proc. of EMNLP-IJCNLP.

Linfeng Song, Yue Zhang, Zhiguo Wang, and Daniel Gildea. 2018b. A graph-to-sequence model for AMR-to-text generation. In Proc. of ACL.

Linfeng Song, Yue Zhang, Zhiguo Wang, and Daniel Gildea. 2018c. N-ary relation extraction using graph state LSTM. In Proc. of EMNLP.

Daniil Sorokin and Iryna Gurevych. 2017. Context-aware representations for knowledge base relation extraction. In Proc. of EMNLP.

Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning. 2012. Multi-instance multi-label learning for relation extraction. In Proc. of EMNLP-CoNLL.

Ming Tu, Guangtao Wang, Jing Huang, Yun Tang, Xiaodong He, and Bowen Zhou. 2019. Multi-hop reading comprehension across multiple documents by reasoning over heterogeneous graphs. In Proc. of ACL.

William Thomas Tutte. 1984. Graph theory. Clarendon Press.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proc. of NeurIPS.

Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph attention networks. In Proc. of ICLR.

Patrick Verga, Emma Strubell, and Andrew McCallum. 2018. Simultaneously self-attending to all mentions for full-abstract biological relation extraction. In Proc. of NAACL-HLT.

Hong Wang, Christfried Focke, Rob Sylvester, Nilesh Mishra, and William Wang. 2019. Fine-tune BERT for DocRED with two-step process. arXiv preprint arXiv:1909.11898.

Linlin Wang, Zhu Cao, Gerard de Melo, and Zhiyuan Liu. 2016. Relation classification via multi-level attention CNNs. In Proc. of ACL.

Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics, 6:287–302.


Ye Wu, Ruibang Luo, Henry C.M. Leung, Hing-Fung Ting, and Tak-Wah Lam. 2019. RENET: A deep learning approach for extracting gene-disease associations from literature. In Proc. of RECOMB.

Kun Xu, Yansong Feng, Songfang Huang, and Dongyan Zhao. 2015a. Semantic relation classification via convolutional neural networks with simple negative sampling. In Proc. of EMNLP.

Yan Xu, Lili Mou, Ge Li, Yunchuan Chen, Hao Peng, and Zhi Jin. 2015b. Classifying relations via long short term memory networks along shortest dependency paths. In Proc. of EMNLP.

Hsiu-Wei Yang, Yanyan Zou, Peng Shi, Wei Lu, Jimmy Lin, and Xu Sun. 2019. Aligning cross-lingual entities with multi-aspect information. In Proc. of EMNLP.

Yuan Yao, Deming Ye, Peng Li, Xu Han, Yankai Lin, Zhenghao Liu, Zhiyuan Liu, Lixin Huang, Jie Zhou, and Maosong Sun. 2019. DocRED: A large-scale document-level relation extraction dataset. In Proc. of ACL.

Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jian Zhao. 2014. Relation classification via convolutional deep neural network. In Proc. of COLING.

Yuhao Zhang, Peng Qi, and Christopher D. Manning. 2018. Graph convolution over pruned dependency trees improves relation extraction. In Proc. of EMNLP.

Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Manning. 2017. Position-aware attention and supervised data improve slot filling. In Proc. of EMNLP.

Wei Zheng, Hongfei Lin, Zhiheng Li, Xiaoxia Liu, Zhengguang Li, Bo Xu, Yijia Zhang, Zhihao Yang, and Jian Wang. 2018. An effective neural model extracting document level chemical-induced disease relations from biomedical literature. Journal of Biomedical Informatics, 83:1–9.

Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi, Bingchen Li, Hongwei Hao, and Bo Xu. 2016. Attention-based bidirectional long short-term memory networks for relation classification. In Proc. of ACL.