

Proceedings of NAACL-HLT 2019, pages 2810–2819, Minneapolis, Minnesota, June 2 - June 7, 2019. ©2019 Association for Computational Linguistics


Distant Supervision Relation Extraction with Intra-Bag and Inter-Bag Attentions

Zhi-Xiu Ye
University of Science and Technology of China
[email protected]

Zhen-Hua Ling
University of Science and Technology of China
[email protected]

Abstract

This paper presents a neural relation extraction method to deal with the noisy training data generated by distant supervision. Previous studies mainly focus on sentence-level de-noising by designing neural networks with intra-bag attentions. In this paper, both intra-bag and inter-bag attentions are considered in order to deal with the noise at the sentence level and the bag level respectively. First, relation-aware bag representations are calculated by weighting sentence embeddings using intra-bag attentions. Here, each possible relation is utilized as the query for attention calculation, instead of only the target relation as in conventional methods. Furthermore, the representation of a group of bags in the training set which share the same relation label is calculated by weighting bag representations using a similarity-based inter-bag attention module. Finally, a bag group is utilized as a training sample when building our relation extractor. Experimental results on the New York Times dataset demonstrate the effectiveness of our proposed intra-bag and inter-bag attention modules. Our method also achieves better relation extraction accuracy than state-of-the-art methods on this dataset.¹

1 Introduction

Relation extraction is a fundamental task in natural language processing (NLP), which aims to extract semantic relations between entities. For example, the sentence "[Barack Obama]e1 was born in [Hawaii]e2" expresses the relation BornIn between the entity pair Barack Obama and Hawaii.

Conventional relation extraction methods, such as (Zelenko et al., 2002; Culotta and Sorensen, 2004; Mooney and Bunescu, 2006), adopted supervised training and suffered from the lack of large-scale manually labeled data.

¹The code is available at https://github.com/ZhixiuYe/Intra-Bag-and-Inter-Bag-Attentions.

Bag B1
  S1. Barack Obama was born in the United States.  — correct: Yes
  S2. Barack Obama was the 44th president of the United States.  — correct: No

Bag B2
  S3. Kyle Busch, a Las Vegas resident who ran second to Johnson last year, finished third, followed by Kasey Kahne, Jeff Gordon and mark martin.  — correct: No
  S4. Hendrick drivers finished in three of the top four spots at Las Vegas, including Kyle Busch in second and ...  — correct: No

Table 1: Examples of sentences with relation place of birth annotated by distant supervision, where "Yes" and "No" indicate whether or not each sentence actually expresses this relation.

To address this issue, the distant supervision method (Mintz et al., 2009) was proposed, which generated the data for training relation extraction models automatically. The distant supervision assumption says that if two entities participate in a relation, all sentences that mention these two entities express that relation. It is inevitable that there exists noise in the data labeled by distant supervision. For example, the precision of aligning the relations in Freebase to the New York Times corpus was only about 70% (Riedel et al., 2010).

Thus, the relation extraction method proposed in (Riedel et al., 2010) argued that the distant supervision assumption was too strong and relaxed it to the expressed-at-least-once assumption. This assumption says that if two entities participate in a relation, at least one sentence that mentions these two entities might express that relation. An example is shown by sentences S1 and S2 in Table 1. This relation extraction method first divided the training data given by distant supervision into bags, where each bag was a set of sentences containing the same entity pair. Then, bag representations were derived by weighting the sentences within each bag.


It was expected that the weights of the sentences with incorrect labels were reduced and the bag representations were calculated mainly using the sentences with correct labels. Finally, bags were utilized as the samples for training relation extraction models instead of sentences.

In recent years, many relation extraction methods using neural networks with attention mechanisms (Lin et al., 2016; Ji et al., 2017; Jat et al., 2018) have been proposed to alleviate the influence of noisy training data under the expressed-at-least-once assumption. However, these methods still have two deficiencies. First, only the target relation of each bag is used to calculate the attention weights for deriving bag representations from sentence embeddings at the training stage. Here we argue that the bag representations should be calculated in a relation-aware way. For example, the bag B1 in Table 1 contains two sentences, S1 and S2. When this bag is classified to relation BornIn, the sentence S1 should have a higher weight than S2; but when it is classified to relation PresidentOf, the weight of S2 should be higher. Second, the expressed-at-least-once assumption ignores the noisy bag problem, i.e., the case in which all sentences in one bag are incorrectly labeled. An example is shown by bag B2 in Table 1.

In order to deal with these two deficiencies of previous methods, this paper proposes a neural network with multi-level attentions for distant supervision relation extraction. At the instance/sentence level, i.e., the intra-bag level, all possible relations are employed as queries to calculate the relation-aware bag representations, instead of only using the target relation of each bag. To address the noisy bag problem, a bag group is adopted as a training sample instead of a single bag. Here, a bag group is composed of bags in the training set which share the same relation label. The representation of a bag group is calculated by weighting bag representations using a similarity-based inter-bag attention module.

The contributions of this paper are threefold. First, an improved intra-bag attention mechanism is proposed to derive relation-aware bag representations for relation extraction. Second, an inter-bag attention module is introduced to deal with the noisy bag problem which is ignored by the expressed-at-least-once assumption. Third, our methods achieve better extraction accuracy than state-of-the-art models on the widely used New York Times (NYT) dataset (Riedel et al., 2010).

2 Related Work

Some previous work (Zelenko et al., 2002; Mooney and Bunescu, 2006) treated relation extraction as a supervised learning task and designed hand-crafted features to train kernel-based models. Due to the lack of large-scale manually labeled data for supervised training, the distant supervision approach (Mintz et al., 2009) was proposed, which aligned raw texts toward knowledge bases automatically to generate relation labels for entity pairs. However, this approach suffered from the issue of noisy labels. Therefore, some subsequent studies (Riedel et al., 2010; Hoffmann et al., 2011; Surdeanu et al., 2012) considered distant supervision relation extraction as a multi-instance learning problem, which extracts relations from a bag of sentences instead of a single sentence.

With the development of deep learning techniques (LeCun et al., 2015), many neural-network-based models have been developed for distant supervision relation extraction. Zeng et al. (2015) proposed piecewise convolutional neural networks (PCNNs) to model sentence representations and chose the most reliable sentence as the bag representation. Lin et al. (2016) employed PCNNs as sentence encoders and proposed an intra-bag attention mechanism to compute the bag representation via a weighted sum of all sentence representations in the bag. Ji et al. (2017) adopted a similar attention strategy and combined entity descriptions to calculate the weights. Liu et al. (2017) proposed a soft-label method to reduce the influence of noisy instances. All these methods represented a bag with a weighted sum of sentence embeddings, and calculated the probability of the bag being classified into each relation using the same bag representation at the training stage. In our proposed method, intra-bag attentions are computed in a relation-aware way, which means that different bag representations are utilized to calculate the probabilities for different relation types. Besides, these existing methods focused on intra-bag attentions and ignored the noisy bag problem.

Some data filtering strategies for robust distant supervision relation extraction have also been proposed. Feng et al. (2018) and Qin et al. (2018b) both employed reinforcement learning to train an instance selector and to filter out the samples with wrong labels; their rewards were calculated from the prediction probabilities and from the performance change of the relation classifier, respectively. Qin et al. (2018a) designed an adversarial learning process to build a sentence-level generator via policy-gradient-based reinforcement learning. These methods were proposed to filter out the noisy data at the sentence level and also failed to deal with the noisy bag problem explicitly.


[Figure 1 here. Modules shown: input sentences x^i_j → sentence encoder (SE) → sentence representations s^i_j → intra-bag attention → bag representations B^1, ..., B^n → inter-bag attention → representation G of the bag group.]

Figure 1: The framework of our proposed neural network with intra-bag and inter-bag attentions for relation extraction.


3 Methodology

In this section, we introduce a neural network with intra-bag and inter-bag attentions for distant supervision relation extraction. Let $g = \{b_1, b_2, ..., b_n\}$ denote a group of bags which have the same relation label given by distant supervision, where $n$ is the number of bags within this group. Let $b_i = \{x^i_1, x^i_2, ..., x^i_{m_i}\}$ denote all sentences in bag $b_i$, where $m_i$ is the number of sentences in bag $b_i$. Let $x^i_j = \{w^i_{j1}, w^i_{j2}, ..., w^i_{j l_{ij}}\}$ denote the $j$-th sentence in the $i$-th bag, where $l_{ij}$ is its length (i.e., the number of words). The framework of our model is shown in Fig. 1, which has three main modules.

• Sentence Encoder: Given a sentence $x^i_j$ and the positions of the two entities within this sentence, CNNs or PCNNs (Zeng et al., 2015) are adopted to derive the sentence representation $s^i_j$.

• Intra-Bag Attention: Given the sentence representations of all sentences within a bag $b_i$ and a relation embedding matrix $R$, attention weight vectors $\alpha^i_k$ and bag representations $b^i_k$ are calculated for all relations, where $k$ is the relation index.

• Inter-Bag Attention: Given the representations of all bags within the group $g$, a weight matrix $\beta$ is further calculated via a similarity-based attention mechanism to obtain the representation of the bag group.

More details of these three modules will be introduced in the following subsections.

3.1 Sentence Encoder

3.1.1 Word Representation

Each word $w^i_{jk}$ within the sentence $x^i_j$ is first mapped into a $d_w$-dimensional word embedding $e^i_{jk}$. To describe the position information of the two entities, the position features (PFs) proposed in (Zeng et al., 2014) are also adopted in our work. For each word, the PFs describe the relative distances between the current word and the two entities, and are further mapped into two vectors $p^i_{jk}$ and $q^i_{jk}$ of $d_p$ dimensions. Finally, these three vectors are concatenated to get the word representation $w^i_{jk} = [e^i_{jk}; p^i_{jk}; q^i_{jk}]$ of $d_w + 2d_p$ dimensions.
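
To make the word representation concrete, the sketch below (our own illustration, not the authors' released code) builds the input matrix of one sentence by concatenating a word embedding with two position-feature embeddings; the clipping range of ±30 follows Table 2, and all class and variable names are hypothetical.

```python
import torch
import torch.nn as nn

class WordRepresentation(nn.Module):
    """Concatenate a word embedding with two position-feature embeddings."""
    def __init__(self, vocab_size, d_w=50, d_p=5, max_dist=30):
        super().__init__()
        self.max_dist = max_dist
        self.word_emb = nn.Embedding(vocab_size, d_w)
        # relative distances are clipped to [-max_dist, max_dist] and shifted to [0, 2*max_dist]
        self.pos_emb1 = nn.Embedding(2 * max_dist + 1, d_p)
        self.pos_emb2 = nn.Embedding(2 * max_dist + 1, d_p)

    def forward(self, word_ids, ent1_pos, ent2_pos):
        # word_ids: (seq_len,) token indices; ent1_pos, ent2_pos: entity positions (ints)
        idx = torch.arange(word_ids.size(0))
        rel1 = (idx - ent1_pos).clamp(-self.max_dist, self.max_dist) + self.max_dist
        rel2 = (idx - ent2_pos).clamp(-self.max_dist, self.max_dist) + self.max_dist
        # (seq_len, d_w + 2*d_p), i.e. [e; p; q] for every word
        return torch.cat([self.word_emb(word_ids),
                          self.pos_emb1(rel1),
                          self.pos_emb2(rel2)], dim=-1)
```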

3.1.2 Piecewise CNN

For sentence $x^i_j$, the matrix of word representations $W^i_j \in \mathbb{R}^{l_{ij} \times (d_w+2d_p)}$ is first input into a CNN with $d_c$ filters. Then, piecewise max pooling (Zeng et al., 2015) is employed to extract features from the three segments of the CNN outputs, where the segment boundaries are determined by the positions of the two entities. Finally, the sentence representation $s^i_j \in \mathbb{R}^{3d_c}$ is obtained.
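
A minimal sketch of the piecewise CNN encoder follows, again as our own illustration: a 1-D convolution with $d_c$ filters runs over the word representations, the outputs are split into three segments at the two entity positions, and each segment is max-pooled, giving a $3d_c$-dimensional sentence vector. Padding and segment-boundary details may differ from the released implementation.

```python
import torch
import torch.nn as nn

class PCNNEncoder(nn.Module):
    """Piecewise CNN sentence encoder (Zeng et al., 2015), simplified sketch."""
    def __init__(self, d_in, d_c=230, window=3):
        super().__init__()
        self.conv = nn.Conv1d(d_in, d_c, kernel_size=window, padding=window // 2)

    def forward(self, W, ent1_pos, ent2_pos):
        # W: (seq_len, d_in) word representations of one sentence
        h = self.conv(W.t().unsqueeze(0)).squeeze(0).t()     # (seq_len, d_c)
        p1, p2 = sorted((ent1_pos, ent2_pos))
        segments = [h[:p1 + 1], h[p1 + 1:p2 + 1], h[p2 + 1:]]
        # max-pool each segment; empty segments fall back to zeros
        pooled = [seg.max(dim=0).values if seg.size(0) > 0
                  else h.new_zeros(h.size(1)) for seg in segments]
        return torch.cat(pooled)                              # (3 * d_c,)
```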

3.2 Intra-Bag Attention

Let $S^i \in \mathbb{R}^{m_i \times 3d_c}$ represent the representations of all sentences within bag $b_i$, and let $R \in \mathbb{R}^{h \times 3d_c}$ denote a relation embedding matrix, where $h$ is the number of relations.

Different from conventional methods (Lin et al., 2016; Ji et al., 2017), where a unified bag representation was derived for relation classification, our method calculates bag representations $b^i_k$ for bag $b_i$ on the condition of all possible relations as

$$ b^i_k = \sum_{j=1}^{m_i} \alpha^i_{kj} s^i_j, \quad (1) $$

where $k \in \{1, 2, ..., h\}$ is the relation index and $\alpha^i_{kj}$ is the attention weight between the $k$-th relation and the $j$-th sentence in bag $b_i$. $\alpha^i_{kj}$ is further defined as

$$ \alpha^i_{kj} = \frac{\exp(e^i_{kj})}{\sum_{j'=1}^{m_i} \exp(e^i_{kj'})}, \quad (2) $$

where $e^i_{kj}$ is the matching degree between the $k$-th relation query and the $j$-th sentence in bag $b_i$. In our implementation, a simple dot product between vectors is adopted to calculate the matching degree as

$$ e^i_{kj} = r_k s^{i\top}_j, \quad (3) $$

where $r_k$ is the $k$-th row of the relation embedding matrix $R$.²

Finally, the representations of bag $b_i$ compose the matrix $B^i \in \mathbb{R}^{h \times 3d_c}$ in Fig. 1, where each row corresponds to a possible relation type of this bag.

²We also tried $e^i_{kj} = r_k A s^{i\top}_j$, where $A$ was a diagonal matrix, in our experiments and achieved similar performance.
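
The relation-aware intra-bag attention of Eqs. (1)-(3) reduces to a few lines of matrix algebra; the following sketch is our own restatement of those equations, not the authors' released code.

```python
import torch

def intra_bag_attention(S, R):
    """Relation-aware bag representations, Eqs. (1)-(3).

    S: (m_i, 3*d_c) sentence representations of bag b_i
    R: (h, 3*d_c)   relation embedding matrix
    returns B_i: (h, 3*d_c), one bag representation per relation
    """
    e = R @ S.t()                       # (h, m_i), e[k, j] = r_k . s_j      (Eq. 3)
    alpha = torch.softmax(e, dim=1)     # normalize over the sentences       (Eq. 2)
    return alpha @ S                    # weighted sums of sentence vectors  (Eq. 1)
```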

3.3 Inter-Bag Attention

In order to deal with the noisy bag problem, a similarity-based inter-bag attention module is designed to reduce the weights of noisy bags dynamically. Intuitively, if two bags $b_{i_1}$ and $b_{i_2}$ are both labeled as relation $k$, their representations $b^{i_1}_k$ and $b^{i_2}_k$ should be close to each other. Given a group of bags with the same relation label, we assign higher weights to those bags which are close to the other bags in this group. As a result, the representation of the bag group $g$ can be formulated as

$$ g_k = \sum_{i=1}^{n} \beta_{ik} b^i_k, \quad (4) $$

where $g_k$ is the $k$-th row of the matrix $G \in \mathbb{R}^{h \times 3d_c}$ in Fig. 1, $k$ is the relation index, and the weights $\beta_{ik}$ compose the attention weight matrix $\beta \in \mathbb{R}^{n \times h}$. Each $\beta_{ik}$ is defined as

$$ \beta_{ik} = \frac{\exp(\gamma_{ik})}{\sum_{i'=1}^{n} \exp(\gamma_{i'k})}, \quad (5) $$

where $\gamma_{ik}$ describes the confidence of labeling bag $b_i$ with the $k$-th relation.

Inspired by the self-attention algorithm (Vaswani et al., 2017), which calculates the attention weights for a group of vectors using the vectors themselves, we calculate the weights of bags according to their own representations. Mathematically, $\gamma_{ik}$ is defined as

$$ \gamma_{ik} = \sum_{i'=1,...,n;\, i' \neq i} \mathrm{similarity}(b^i_k, b^{i'}_k), \quad (6) $$

where the function similarity is a simple dot product in our implementation:

$$ \mathrm{similarity}(b^i_k, b^{i'}_k) = b^i_k b^{i'\top}_k. \quad (7) $$

Also, in order to prevent the influence of vector length, all bag representations $b^i_k$ are normalized to unit length as $b^i_k = b^i_k / ||b^i_k||_2$ before calculating Eqs. (4)-(7).
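
Under the same notation, Eqs. (4)-(7) amount to a self-attention over the $n$ bag representations of a group, computed independently for every relation. The sketch below is ours, with the unit-length normalization applied as described above.

```python
import torch
import torch.nn.functional as F

def inter_bag_attention(B):
    """Group representation from bag representations, Eqs. (4)-(7).

    B: (n, h, 3*d_c) stacked bag representations B^1, ..., B^n of one group
    returns G: (h, 3*d_c) group representation, beta: (n, h) attention weights
    """
    Bn = F.normalize(B, p=2, dim=-1)                 # unit-length bag vectors
    # sim[i, i', k] = b_k^i . b_k^{i'}
    sim = torch.einsum('ikd,jkd->ijk', Bn, Bn)       # (n, n, h)
    gamma = sim.sum(dim=1) - sim.diagonal(dim1=0, dim2=1).t()  # exclude i' = i   (Eq. 6)
    beta = torch.softmax(gamma, dim=0)               # normalize over the bags     (Eq. 5)
    G = torch.einsum('ik,ikd->kd', beta, Bn)         # weighted sum per relation   (Eq. 4)
    return G, beta
```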

Then, the score $o_k$ of classifying bag group $g$ into relation $k$ is calculated from $g_k$ and the relation embedding $r_k$ as

$$ o_k = r_k g^\top_k + d_k, \quad (8) $$

where $d_k$ is a bias term. Finally, a softmax function is employed to obtain the probability that the bag group $g$ is classified into the $k$-th relation as

$$ p(k|g) = \frac{\exp(o_k)}{\sum_{k'=1}^{h} \exp(o_{k'})}. \quad (9) $$

It should be noticed that the same relation embedding matrix $R$ is used for calculating Eq. (3) and Eq. (8). Similar to Lin et al. (2016), the dropout strategy (Srivastava et al., 2014) is applied to the bag representations $B^i$ to prevent overfitting.
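
The classification layer of Eqs. (8)-(9) reuses the relation embedding matrix $R$, pairing the $k$-th row of $G$ with the $k$-th relation only; a sketch under the same assumptions:

```python
import torch

def classify_group(G, R, d):
    """Scores and probabilities for a bag group, Eqs. (8)-(9).

    G: (h, 3*d_c) group representation, R: (h, 3*d_c) relation embeddings,
    d: (h,) bias terms.
    """
    o = (R * G).sum(dim=-1) + d        # (h,), o_k = r_k . g_k + d_k   (Eq. 8)
    return torch.softmax(o, dim=0)     # p(k | g)                      (Eq. 9)
```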


3.4 Implementation Details

3.4.1 Data Packing

First of all, all sentences in the training set that contain the same two entities are accumulated into one bag. Then, we tie up every $n$ bags that share the same relation label into a group. It should be noticed that a bag group is one training sample in our method. Therefore, the model can also be trained in mini-batch mode by packing multiple bag groups into one batch.
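
As plain Python, the packing step might look like the sketch below. The corpus record fields (head, tail, relation, sentence) are hypothetical names used only for illustration, and dropping the last incomplete group is a simplification.

```python
from collections import defaultdict
from itertools import islice

def pack_data(examples, n=5):
    """Group sentences into bags (by entity pair) and bags into groups (by relation label).

    examples: iterable of dicts with keys 'head', 'tail', 'relation', 'sentence'.
    returns: list of (relation, [bag_1, ..., bag_n]) training samples.
    """
    bags = defaultdict(list)            # (head, tail) -> sentences of that entity pair
    labels = {}                         # (head, tail) -> distant-supervision relation
    for ex in examples:
        pair = (ex['head'], ex['tail'])
        bags[pair].append(ex['sentence'])
        labels[pair] = ex['relation']   # multi-label pairs would need extra handling

    by_relation = defaultdict(list)     # relation -> list of bags
    for pair, sentences in bags.items():
        by_relation[labels[pair]].append(sentences)

    groups = []
    for rel, rel_bags in by_relation.items():
        it = iter(rel_bags)
        while True:
            group = list(islice(it, n))
            if len(group) < n:
                break                   # drop (or re-pad) the last incomplete group
            groups.append((rel, group))
    return groups
```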

3.4.2 Objective Function and Optimization

In our implementation, the objective function is defined as

$$ J(\theta) = -\sum_{(g,k) \in T} \log p(k|g; \theta), \quad (10) $$

where $T$ is the set of all training samples and $\theta$ is the set of model parameters, including the word embedding matrix, the position feature embedding matrices, the CNN weight matrix and the relation embedding matrix. The model parameters are estimated by minimizing the objective function $J(\theta)$ through mini-batch stochastic gradient descent (SGD).
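
Assuming the modules sketched above are wired into a single model that maps a bag group to $p(k|g)$, one SGD step on Eq. (10) could look as follows; the gradient-clipping value follows Table 2, and the function itself is our own illustration.

```python
import torch

def train_step(model, optimizer, batch):
    """One SGD step on a mini-batch of bag groups, minimizing Eq. (10).

    batch: list of (group, label) pairs, where `group` is whatever the
    (hypothetical) model.forward expects and `label` is the relation index.
    """
    optimizer.zero_grad()
    loss = 0.0
    for group, label in batch:
        probs = model(group)                       # p(k | g; theta), shape (h,)
        loss = loss - torch.log(probs[label])      # -log p(k | g; theta)
    loss = loss / len(batch)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)   # gradient clip from Table 2
    optimizer.step()
    return loss.item()
```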

3.4.3 Training and Test

As introduced above, at the training phase of our proposed method, $n$ bags which have the same relation label are accumulated into one bag group, and the weighted sum of bag representations is calculated to obtain the representation $G$ of the bag group. Due to the fact that the label of each bag is unknown at the test stage, each single bag is treated as a bag group (i.e., $n=1$) when processing the test set.

Also, similar to (Qin et al., 2018b), we only apply inter-bag attentions to positive samples, i.e., the bags whose relation label is not NA (NoRelation). The reason is that the representations of the bags that express no relation are always diverse, and it is difficult to calculate suitable weights for them.

3.4.4 Pre-training Strategy

In our implementation, a pre-training strategy is adopted. We first train the model with only intra-bag attentions until convergence. Then, the inter-bag attention module is added and the model parameters are further updated until convergence again. Preliminary experimental results showed that this strategy can lead to better model performance than considering inter-bag attentions from the very beginning.

Component          Parameter               Value
word embedding     dimension               50
position feature   max relative distance   ±30
                   dimension               5
CNN                window size             3
                   filter number           230
dropout            dropout rate            0.5
optimization       strategy                SGD
                   learning rate           0.1
                   batch size N_p          50
                   batch size N_t          10
                   group size n            5
                   gradient clip           5.0

Table 2: Hyper-parameters of the models built in our experiments.


4 Experiment

4.1 Dataset and Evaluation Metrics

The New York Times (NYT) dataset was adopted in our experiments. This dataset was first released by (Riedel et al., 2010) and has been widely used by previous research on distant supervision relation extraction (Liu et al., 2017; Jat et al., 2018; Qin et al., 2018a,b). This dataset was generated by aligning Freebase with the New York Times (NYT) corpus automatically. There were 52 actual relations and a special relation NA which indicated that there was no relation between two entities.

Following previous studies (Mintz et al., 2009; Liu et al., 2017), we evaluated our models on the held-out test set of the NYT dataset. Precision-recall (PR) curves, area-under-curve (AUC) values and Precision@N (P@N) values (Lin et al., 2016) were adopted as evaluation metrics in our experiments. All of the numerical results given by our experiments are the mean values over 10 repeated training runs, and the PR curves were randomly selected from these repetitions because there was no significant visual difference among them.
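
For reference, both metrics can be computed from per-candidate scores as in the sketch below; this is our own helper operating on non-NA (entity pair, relation) candidates, not the official evaluation script, so details of candidate construction may differ.

```python
import numpy as np

def pr_metrics(scores, labels, top_n=(100, 200, 300)):
    """scores: (num_candidates,) predicted probabilities for non-NA relation candidates
    labels: (num_candidates,) 1 if the candidate relation is a true fact, else 0
    returns: AUC of the PR curve and a dict of P@N values."""
    order = np.argsort(-scores)                    # rank candidates by score, highest first
    hits = labels[order]
    tp = np.cumsum(hits)
    precision = tp / np.arange(1, len(hits) + 1)
    recall = tp / max(labels.sum(), 1)
    auc = np.trapz(precision, recall)              # area under the PR curve
    p_at_n = {n: hits[:n].mean() for n in top_n}   # precision among the top-N predictions
    return auc, p_at_n
```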

4.2 Training Details and Hyperparameters

All of the hyperparameters used in our experiments are listed in Table 2. Most of them followed the hyperparameter settings in (Lin et al., 2016). The 50-dimensional word embeddings released by (Lin et al., 2016)³ were also adopted for initialization. The vocabulary contained the words which appeared more than 100 times in the NYT corpus.

³https://github.com/thunlp/NRE.


[Figure 2 here: precision (0.4-1.0) plotted against recall (0.0-0.5) for CNN+ATT_BL, CNN+ATT_BL+BAG_ATT, CNN+ATT_RA and CNN+ATT_RA+BAG_ATT.]

Figure 2: PR curves of different models using CNN sentence encoders.

Two different batch sizes, $N_p$ and $N_t$, were used for pre-training and training respectively. In our experiments, a grid search on the training set was employed to determine the optimal values of $n$, $N_p$ and $N_t$ among $n \in \{3, 4, ..., 10\}$, $N_p \in \{10, 20, 50, 100, 200\}$ and $N_t \in \{5, 10, 20, 50\}$. Note that increasing the bag group size $n$ may boost the effect of inter-bag attentions but leads to fewer training samples; the effect of inter-bag attentions is lost when $n=1$. For optimization, we employed mini-batch SGD with an initial learning rate of 0.1, which was decayed to one tenth every 100,000 steps. The pre-trained model with only intra-bag attentions converged within 300,000 steps in our experiments. Thus, the initial learning rate for training the model with inter-bag attentions was set to 0.001.

4.3 Overall performance

Eight models were implemented for comparison. The names of these models are listed in Table 3, where CNN and PCNN denote using CNNs or piecewise CNNs in sentence encoders respectively, ATT_BL means the baseline intra-bag attention method proposed by (Lin et al., 2016), ATT_RA means our proposed relation-aware intra-bag attention method, and BAG_ATT means our proposed inter-bag attention method. At the training stage of the ATT_BL method, the relation query vector for attention weight calculation was fixed as the embedding vector associated with the distant supervision label for each bag. At the test stage, all relation query vectors were applied to calculate the posterior probabilities of relations respectively, and the relation with the highest probability was chosen as the classification result (Lin et al., 2016).

[Figure 3 here: precision (0.4-1.0) plotted against recall (0.0-0.5) for PCNN+ATT_BL, PCNN+ATT_BL+BAG_ATT, PCNN+ATT_RA and PCNN+ATT_RA+BAG_ATT.]

Figure 3: PR curves of different models using PCNN sentence encoders.

No.  Model                    AUC
1    CNN+ATT_BL               0.376 ± 0.003
2    CNN+ATT_BL+BAG_ATT       0.388 ± 0.002
3    CNN+ATT_RA               0.398 ± 0.004
4    CNN+ATT_RA+BAG_ATT       0.407 ± 0.004
5    PCNN+ATT_BL              0.388 ± 0.004
6    PCNN+ATT_BL+BAG_ATT      0.403 ± 0.002
7    PCNN+ATT_RA              0.403 ± 0.003
8    PCNN+ATT_RA+BAG_ATT      0.422 ± 0.004

Table 3: AUC values of different models.

The means and standard deviations of the AUC values given by the whole PR curves of these models are shown in Table 3 for a quantitative comparison. Following (Lin et al., 2016), we also plotted the PR curves of these models in Fig. 2 and 3 with recall smaller than 0.5 for a visualized comparison.

From Table 3, Fig. 2 and Fig. 3, we have the following observations. (1) Similar to the results of previous work (Zeng et al., 2015), PCNNs worked better than CNNs as sentence encoders. (2) When using either CNN or PCNN sentence encoders, ATT_RA outperformed ATT_BL. This can be attributed to the fact that the ATT_BL method only considered the target relation when deriving bag representations at training time, while the ATT_RA method calculated intra-bag attention weights using all relation embeddings as queries, which improved the flexibility of bag representations. (3) For both sentence encoders and both intra-bag attention methods, the models with BAG_ATT always achieved better performance than the ones without BAG_ATT. This result verified the effectiveness of our proposed inter-bag attention method for distant supervision relation extraction. (4) The best AUC performance was achieved by combining PCNN sentence encoders with the intra-bag and inter-bag attentions proposed in this paper.


# of test sentences            one                        two                        all
P@N (%)                   100   200   300  mean      100   200   300  mean      100   200   300  mean
(Lin et al., 2016)        73.3  69.2  60.8  67.8     77.2  71.6  66.1  71.6     76.2  73.1  67.4  72.2
(Liu et al., 2017)        84.0  75.5  68.3  75.9     86.0  77.0  73.3  78.8     87.0  84.5  77.0  82.8
CNN+ATT_BL                74.2  68.9  65.3  69.5     77.8  71.5  68.1  72.5     79.2  74.9  70.3  74.8
CNN+ATT_RA                76.8  72.7  67.9  72.5     79.6  73.9  70.7  74.7     81.4  76.3  72.5  76.8
CNN+ATT_BL+BAG_ATT        78.6  74.2  69.7  74.2     82.4  76.2  72.1  76.9     83.0  78.0  74.0  78.3
CNN+ATT_RA+BAG_ATT        79.8  75.3  71.0  75.4     83.2  76.5  72.1  77.3     87.2  78.7  74.9  80.3
PCNN+ATT_BL               78.6  73.5  68.1  73.4     77.8  75.1  70.3  74.4     80.8  77.5  72.3  76.9
PCNN+ATT_RA               79.4  73.9  69.6  74.3     82.2  77.6  72.4  77.4     84.2  79.9  73.0  79.0
PCNN+ATT_BL+BAG_ATT       85.2  78.2  71.3  78.2     84.8  80.0  74.3  79.7     88.8  83.7  77.4  83.9
PCNN+ATT_RA+BAG_ATT       86.8  77.6  73.9  79.4     91.2  79.2  75.4  81.9     91.8  84.0  78.7  84.8

Table 4: P@N values of the entity pairs with different numbers of test sentences ("one", "two" and "all" test sentences randomly selected per entity pair).

[Figure 4 here: precision (0.4-1.0) plotted against recall (0.0-0.5) for PCNN+ATT_RA+BAG_ATT, Liu et al. 2017, Lin et al. 2016, MIMLRE, MultiR and Mintz.]

Figure 4: PR curves of several models in previous work and our best model.


4.4 Comparison with previous work

4.4.1 PR curves

The PR curves of several models in previous work and our best model PCNN+ATT_RA+BAG_ATT are compared in Fig. 4, where Mintz (Mintz et al., 2009), MultiR (Hoffmann et al., 2011) and MIMLRE (Surdeanu et al., 2012) are conventional feature-based methods, and (Lin et al., 2016) and (Liu et al., 2017) are PCNN-based ones.⁴ For a fair comparison with (Lin et al., 2016) and (Liu et al., 2017), we also plotted the curves with only the top 2000 points. We can see that our model achieved better PR performance than all the other models.

⁴All of these curve data are from https://github.com/tyliupku/soft-label-RE.

Model               CNN     PCNN
ATT_BL†             0.219   0.253
ATT_BL+RL           0.229   0.261
ATT_BL+DSGAN        0.226   0.264
ATT_BL‡             0.242   0.271
ATT_RA              0.254   0.297
ATT_BL+BAG_ATT      0.253   0.285
ATT_RA+BAG_ATT      0.262   0.311

Table 5: AUC values of previous work and our models, where ATT_BL+DSGAN and ATT_BL+RL are two models proposed in (Qin et al., 2018a) and (Qin et al., 2018b) respectively; † indicates the baseline result reported in (Qin et al., 2018a,b) and ‡ indicates the baseline result given by our implementation.

4.4.2 AUC values

ATT_BL+DSGAN (Qin et al., 2018a) and ATT_BL+RL (Qin et al., 2018b) are two recent studies on distant supervision relation extraction with reinforcement learning for data filtering, which reported the AUC values of PR curves composed of the top 2000 points. Table 5 compares the AUC values reported in these two papers with the results of our proposed models. We can see that introducing the proposed ATT_RA and BAG_ATT methods into the baseline models achieved larger improvements than using the methods proposed in (Qin et al., 2018a,b).

4.5 Effects of Intra-Bag Attentions

Following (Lin et al., 2016), we evaluated our models on the entity pairs with more than one training sentence. One, two and all sentences for each test entity pair were randomly selected to construct three new test sets. The P@100, P@200 and P@300 values and their means given by our proposed models on these three test sets are reported in Table 4, together with the best results of (Lin et al., 2016) and (Liu et al., 2017). Here, P@N means the precision of the relation classification results with the top N highest probabilities in the test set.


Bag B1 (inter-bag weight 0.48)
  - "[Panama City Beach]e2, too, has a glut of condos, but the area was one of only two in [Florida]e1 where sales rose in march, compared with a year earlier."  — correct: Yes; intra-bag weight 0.71
  - "Like much of [Florida]e1, [Panama City Beach]e2 has been hurt by the downturn in the real estate market."  — correct: Yes; intra-bag weight 0.29

Bag B2 (inter-bag weight 0.13)
  - "Among the major rivers that overflowed were the Housatonic, Still, Saugatuck, Norwalk, Quinnipiac, Farmington, [Naugatuck]e1, Mill, Rooster and [Connecticut]e2."  — correct: No; intra-bag weight 1.00

Bag B3 (inter-bag weight 0.39)
  - "..., the army chose a prominent location in [Virginia]e1, at the foot of the Arlington memorial bridge, directly across the [Potomac River]e2 from the Lincoln memorial."  — correct: No; intra-bag weight 0.13
  - "..., none of those stars carried the giants the way barber did at Fedex field, across the [Potomac River]e1 from [Virginia]e2, where he grew up as a redskins fan."  — correct: Yes; intra-bag weight 0.87

Table 6: A test set example of relation /location/location/contains from the NYT corpus.

sentence number    mean ± std
1                  0.163 ± 0.029
2                  0.187 ± 0.033
3                  0.210 ± 0.034
4                  0.212 ± 0.037
≥ 5                0.256 ± 0.043

Table 7: The distributions of inter-bag attention weights for the bags with different numbers of sentences.


We can see that our proposed methods achieved higher P@N values than previous work. Furthermore, no matter whether PCNN or BAG_ATT was adopted, the ATT_RA method outperformed the ATT_BL method on the test set with only one sentence for each entity pair. Note that the decoding procedures of ATT_BL and ATT_RA are equivalent when there is only one sentence in a bag. Therefore, the improvements from ATT_BL to ATT_RA can be attributed to the fact that ATT_RA calculated intra-bag attention weights in a relation-aware way at the training stage.

4.6 Distributions of Inter-Bag Attention Weights

We divided the training set into 5 parts according to the number of sentences in each bag. For each bag, the inter-bag attention weights given by the PCNN+ATT_RA+BAG_ATT model were recorded. Then, the mean and standard deviation of the inter-bag attention weights for each part of the training set were calculated; they are shown in Table 7. From this table, we can see that the bags with smaller numbers of training sentences were usually assigned lower inter-bag attention weights. This result is consistent with the finding in (Qin et al., 2018b) that the entity pairs with fewer training sentences were more likely to have incorrect relation labels.

4.7 Case Study

A test set example of relation /location/location/contains is shown in Table 6. The bag group contained 3 bags, which consisted of 2, 1, and 2 sentences respectively. We calculated the intra-bag and inter-bag attentions for this bag group using our PCNN+ATT_RA+BAG_ATT model, and the weights for the target relation are also shown in Table 6.

In this example, the second bag was a noisy bag because the only sentence in this bag didn't express the relation /location/location/contains between the two entities Naugatuck and Connecticut. In conventional methods, these three bags were treated equally for model training. After introducing the inter-bag attention mechanism, the weight of this noisy bag was reduced significantly, as shown in the last column of Table 6.

5 Conclusion

In this paper, we have proposed a neural network with intra-bag and inter-bag attentions to cope with the noisy sentence and noisy bag problems in distant supervision relation extraction. First, relation-aware bag representations are calculated as a weighted sum of sentence embeddings, where the noisy sentences are expected to have smaller weights. Further, an inter-bag attention module is designed to deal with the noisy bag problem by calculating bag-level attention weights dynamically during model training. Experimental results on the New York Times dataset show that our models achieved significant and consistent improvements compared with the models using only conventional intra-bag attentions. Dealing with the multi-label problem of relation extraction and integrating external knowledge into our model will be the tasks of our future work.

Acknowledgments

We thank the anonymous reviewers for their valuable comments.

References

Aron Culotta and Jeffrey Sorensen. 2004. Dependency tree kernels for relation extraction. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04).

Jun Feng, Minlie Huang, Li Zhao, Yang Yang, and Xiaoyan Zhu. 2018. Reinforcement learning for relation classification from noisy data. In Proceedings of AAAI.

Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S. Weld. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 541–550. Association for Computational Linguistics.

Sharmistha Jat, Siddhesh Khandelwal, and Partha Talukdar. 2018. Improving distantly supervised relation extraction using word and entity based attention. arXiv preprint arXiv:1804.06987.

Guoliang Ji, Kang Liu, Shizhu He, Jun Zhao, et al. 2017. Distant supervision for relation extraction with sentence-level attention and entity descriptions. In AAAI, pages 3060–3066.

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature, 521(7553):436.

Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2016. Neural relation extraction with selective attention over instances. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2124–2133. Association for Computational Linguistics.

Tianyu Liu, Kexiang Wang, Baobao Chang, and Zhifang Sui. 2017. A soft-label method for noise-tolerant distantly supervised relation extraction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1790–1795. Association for Computational Linguistics.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, pages 1003–1011. Association for Computational Linguistics.

Raymond J. Mooney and Razvan C. Bunescu. 2006. Subsequence kernels for relation extraction. In Advances in Neural Information Processing Systems, pages 171–178.

Pengda Qin, Weiran Xu, and William Yang Wang. 2018a. DSGAN: Generative adversarial training for distant supervision relation extraction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 496–505. Association for Computational Linguistics.

Pengda Qin, Weiran Xu, and William Yang Wang. 2018b. Robust distant supervision relation extraction via deep reinforcement learning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2137–2147. Association for Computational Linguistics.

Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 148–163. Springer.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.

Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning. 2012. Multi-instance multi-label learning for relation extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 455–465. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella. 2002. Kernel methods for relation extraction. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002).

Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant supervision for relation extraction via piecewise convolutional neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1753–1762. Association for Computational Linguistics.

Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. 2014. Relation classification via convolutional deep neural network. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 2335–2344. Dublin City University and Association for Computational Linguistics.