


Improving Entity Linking by Modeling Latent Relations between Mentions
Phong Le¹ and Ivan Titov¹,²

¹University of Edinburgh   ²University of Amsterdam

Introduction

• Intuition: Relations between entities in a document help to link these entities to a knowledge base

• Previous work: Preprocess text to produce relations (e.g., using a co-reference system) [CR13]

Our work

We treat relations between entities as latent variables and induce them in such a way as to help entity linking

World Cup 1966 was held in England… England won… The final saw England beat West Germany.

[Figure: candidate entities for each mention — "World Cup": FIFA_World_Cup, FIBA_Basketball_World_Cup, …; "West Germany": West_Germany, Germany_national_football_team, Germany_national_basketball_team, …; "England": England, England_national_football_team, England_national_basketball_team, …]

If "World Cup" is FIFA_World_Cup, then "England" is a football team

Our General Model: A CRF

We assign each mention $m_i$ an entity $e_i$. Let $D = \{m_1, \dots, m_n\}$ be a document and $E = \{e_1, \dots, e_n\}$.

• Our CRF:

$$q(E|D) \propto \exp\Bigg\{\underbrace{\sum_{i=1}^{n}\Psi(e_i,m_i)}_{\text{local score}} + \underbrace{\sum_{i\neq j}\sum_{k}\underbrace{\alpha_{ijk}}_{\substack{\text{how likely relation }k\\ \text{holds for the pair}}}\underbrace{\Phi_k(e_i,e_j,D)}_{\text{pair-wise score}}}_{\text{compatibility score}}\Bigg\}$$

[Figure: factor graph fragment — the pair $(e_i, m_i)$, $(e_j, m_j)$ is connected by the factors $\alpha_{ij1}\Phi_1(e_i,e_j,D)$, $\alpha_{ij2}\Phi_2(e_i,e_j,D)$, and $\alpha_{ij3}\Phi_3(e_i,e_j,D)$.]

• Pair-wise score: $\Phi_k(e_i, e_j, D) = e_i^T R_k\, e_j$, where $R_k$ is a diagonal relation-embedding matrix and $e_i$, $e_j$ are entity embeddings.
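As a concrete sketch (random placeholder vectors stand in for the learned entity embeddings and relation diagonals; the dimensions are illustrative), the pair-wise score with a diagonal $R_k$ reduces to an elementwise-weighted dot product:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 300, 3                        # embedding dimension and number of relations (illustrative)

e_i = rng.standard_normal(d)         # placeholder embedding for entity e_i
e_j = rng.standard_normal(d)         # placeholder embedding for entity e_j
R = rng.standard_normal((K, d))      # R[k] holds the diagonal of the matrix R_k

def pairwise_score(e_i, e_j, r_k):
    # Phi_k(e_i, e_j, D) = e_i^T R_k e_j with diagonal R_k,
    # i.e. sum_d e_i[d] * r_k[d] * e_j[d]
    return float(np.sum(e_i * r_k * e_j))

phi = [pairwise_score(e_i, e_j, R[k]) for k in range(K)]
```

Storing only the diagonal keeps each relation at $O(d)$ parameters instead of $O(d^2)$ for a full bilinear matrix.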

Relation weights: Two versions

• Rel-norm (Relation-wise normalization):

$$\alpha_{ijk} = \frac{\exp\{f^T(m_i)\,D_k\,f(m_j)\}}{\sum_{k'}\exp\{f^T(m_i)\,D_{k'}\,f(m_j)\}}$$

normalize over relations: $\alpha_{ij1} + \alpha_{ij2} + \alpha_{ij3} = 1$

Intuitively, $\alpha_{ijk}$ is the probability of assigning the $k$-th relation to the mention pair $(m_i, m_j)$.
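A minimal sketch of rel-norm for one mention pair (the mention representations $f(m_i)$ and the diagonals of the matrices $D_k$ are learned in the model; here they are random placeholders):

```python
import numpy as np

def rel_norm_alpha(f_i, f_j, D):
    """alpha_ijk for one mention pair: softmax over the K relations.

    f_i, f_j : (d,) mention representations f(m_i), f(m_j)
    D        : (K, d) diagonals of the mapping matrices D_k
    """
    logits = D @ (f_i * f_j)        # logits[k] = f(m_i)^T D_k f(m_j) for diagonal D_k
    logits -= logits.max()          # subtract max for numerical stability
    w = np.exp(logits)
    return w / w.sum()              # alpha_ij1 + ... + alpha_ijK = 1

rng = np.random.default_rng(1)
alpha = rel_norm_alpha(rng.standard_normal(8), rng.standard_normal(8),
                       rng.standard_normal((3, 8)))
```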

• Ment-norm (Mention-wise normalization):

$$\alpha_{ijk} = \frac{\exp\{f^T(m_i)\,D_k\,f(m_j)\}}{\sum_{j'}\exp\{f^T(m_i)\,D_k\,f(m_{j'})\}}$$

normalize over mentions (shown here for relation $k=2$): $\alpha_{i12} + \alpha_{i22} + \dots + \alpha_{ij2} + \dots + \alpha_{in2} = 1$

Similar to multi-head attention [VSP+17].
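Under the same placeholder setup, ment-norm changes only the normalization axis: for each mention $m_i$ and relation $k$, the softmax runs over the mentions $m_j$, one attention head per relation. (The full model additionally excludes $j = i$ and adds a padding mention; this sketch omits both.)

```python
import numpy as np

def ment_norm_alpha(F, D):
    """alpha[i, j, k]: for each mention i and relation k, softmax over mentions j.

    F : (n, d) mention representations f(m_1), ..., f(m_n)
    D : (K, d) diagonals of the matrices D_k
    """
    # scores[i, j, k] = f(m_i)^T D_k f(m_j)
    scores = np.einsum('id,kd,jd->ijk', F, D, F)
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=1, keepdims=True)          # normalize over mentions j (axis 1)

rng = np.random.default_rng(2)
alpha = ment_norm_alpha(rng.standard_normal((5, 8)), rng.standard_normal((3, 8)))
```

Compare with rel-norm, where the same scores would be normalized over axis 2 (the relations) instead.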

Estimation and Training

• Using Loopy Belief Propagation (LBP) [GH17]:

$$\hat{q}_i(e_i|D) \approx \max_{e_1,\dots,e_{i-1},e_{i+1},\dots,e_n} q(E|D)$$
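A rough, self-contained sketch of this approximation with max-product LBP (log-space messages with damping). The pairwise table `phi` below stands for the already-weighted $\sum_k \alpha_{ijk}\Phi_k(e_i,e_j,D)$, and all numbers, iteration counts, and the damping factor are placeholders, not the paper's settings:

```python
import numpy as np

def lbp_max_marginals(psi, phi, n_iter=10, damp=0.5):
    """Approximate max-marginals for a fully connected pairwise CRF.

    psi : (n, C) local scores Psi(e_i, m_i), with C candidates per mention
    phi : (n, n, C, C) pairwise scores; phi[i, j, a, b] scores (e_i = a, e_j = b)
    Returns q_hat of shape (n, C); each row is a normalized belief.
    """
    n, C = psi.shape
    msg = np.zeros((n, n, C))                  # msg[j, i]: log message from j to i
    for _ in range(n_iter):
        new = np.zeros_like(msg)
        for j in range(n):
            for i in range(n):
                if i == j:
                    continue
                # belief at j, excluding what i previously sent to j
                b_j = psi[j] + msg[:, j].sum(axis=0) - msg[i, j]
                # maximize over e_j for each value of e_i (max-product in log space)
                new[j, i] = (b_j[:, None] + phi[j, i]).max(axis=0)
                new[j, i] -= new[j, i].max()   # keep messages bounded
        msg = damp * new + (1.0 - damp) * msg  # damping for stability on loopy graphs
    beliefs = psi + msg.sum(axis=0)
    beliefs -= beliefs.max(axis=1, keepdims=True)
    q = np.exp(beliefs)
    return q / q.sum(axis=1, keepdims=True)

rng = np.random.default_rng(3)
n, C = 4, 3
q = lbp_max_marginals(rng.standard_normal((n, C)), 0.1 * rng.standard_normal((n, n, C, C)))
```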

The final score for each mention is

$$\rho_i(e) = g\big(\hat{q}_i(e|D),\, \hat{p}(e|m_i)\big)$$

where $g$ is a 2-layer neural network and $\hat{p}$ is mention-entity hyperlink count statistics from Wikipedia, a large Web corpus, and YAGO.

• We minimize the following ranking loss:

$$L(\theta) = \sum_{D\in\mathcal{D}}\sum_{m_i\in D}\sum_{e\in C_i} h(m_i, e), \qquad h(m_i, e) = \max\big(0,\, \gamma - \rho_i(e_i^*) + \rho_i(e)\big)$$
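The loss for one document can be sketched directly from the formula (the scores and the margin $\gamma$ below are illustrative placeholders; $e_i^*$ is the gold entity):

```python
import numpy as np

def ranking_loss(rho, gold, gamma=0.1):
    """Margin ranking loss for one document.

    rho   : (n, C) final scores rho_i(e) over each mention's candidate set C_i
    gold  : (n,) index of the correct entity e_i* in each candidate set
    gamma : margin hyperparameter (value here is arbitrary)
    """
    n = rho.shape[0]
    gold_scores = rho[np.arange(n), gold]                 # rho_i(e_i*)
    # h(m_i, e) = max(0, gamma - rho_i(e_i*) + rho_i(e)), summed over e in C_i
    # (the gold candidate itself contributes the constant gamma)
    h = np.maximum(0.0, gamma - gold_scores[:, None] + rho)
    return float(h.sum())

rho = np.array([[2.0, 0.5, 0.1],
                [0.3, 1.2, 1.1]])
loss = ranking_loss(rho, np.array([0, 1]))
```

The loss is zero-gradient for any wrong candidate scored at least $\gamma$ below the gold entity, so training focuses on candidates that violate the margin.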

References

[CR13] Xiao Cheng and Dan Roth, Relational inference for wikification, EMNLP, 2013.

[GH17] Octavian-Eugen Ganea and Thomas Hofmann, Deep joint entity disambiguation with local neural attention, EMNLP, 2017.

[VSP+17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin, Attention is all you need, NIPS, 2017.

Experiments

F1 micro score. Mean and 95% confidence interval of 5 runs.

• In-domain:

…91% F1 on the dev set,⁵ we reduced the learning rate from $10^{-4}$ to $10^{-5}$. We then stopped the training when F1 was not improved after 20 epochs. We did the same for ment-norm except that the learning rate was changed at 91.5% F1.

Note that all the hyper-parameters except K and the turning point for early stopping were set to the values used by Ganea and Hofmann (2017). Systematic tuning is expensive, though it may have further increased the results of our models.

4.2 Results

Methods                      Aida-B
Chisholm and Hachey (2015)   88.7
Guo and Barbosa (2016)       89.0
Globerson et al. (2016)      91.0
Yamada et al. (2016)         91.5
Ganea and Hofmann (2017)     92.22 ± 0.14
rel-norm                     92.41 ± 0.19
ment-norm                    93.07 ± 0.27
ment-norm (K = 1)            92.89 ± 0.21
ment-norm (no pad)           92.37 ± 0.26

Table 1: F1 scores on AIDA-B (test set).

Table 1 shows micro F1 scores on AIDA-B for the SOTA methods and ours, all of which use the Wikipedia and YAGO mention-entity index. To our knowledge, ours are the only models that (unsupervisedly) induce and employ more than one relation on this dataset. The others use only one relation, coreference, which is given by simple heuristics or supervised third-party resolvers. All four of our models outperform any previous method, with ment-norm achieving the best result, 0.85% higher than that of Ganea and Hofmann (2017).

Table 2 shows micro F1 scores on five out-domain test sets. Besides ours, only Cheng and Roth (2013) employs several mention relations. Ment-norm achieves the highest F1 scores on MSNBC and ACE2004. On average, ment-norm's F1 score is 0.3% higher than that of Ganea and Hofmann (2017), but 0.2% lower than Guo and Barbosa (2016)'s. It is worth noting that Guo and Barbosa (2016) performs exceptionally well on WIKI, but substantially worse than ment-norm on all other datasets. Our other three models, however, have lower average F1 scores compared to the best previous model.

The experimental results show that ment-norm outperforms rel-norm, and that mention padding plays an important role.

⁵We chose the highest F1 that rel-norm always achieved without the learning rate reduction.

4.3 Analysis

Mono-relational vs. multi-relational

For rel-norm, the mono-relational version (i.e., Ganea and Hofmann (2017)) is outperformed by the multi-relational one on AIDA-CoNLL, but performs significantly better on all five out-domain datasets. This implies that multi-relational rel-norm does not generalize well across domains.

For ment-norm, the mono-relational version performs worse than the multi-relational one on all test sets except AQUAINT. We speculate that this is due to multi-relational ment-norm being less sensitive to prediction errors. Since it can rely on multiple factors more easily, a single mistake in assignment is unlikely to have a large influence on its predictions.

[Figure: bar chart comparing LBP and oracle F1 for G&H, rel-norm, ment-norm (K=1), and ment-norm; y-axis from 92 to 94.5.]

Figure 4: F1 on AIDA-B when using LBP and the oracle. G&H is Ganea and Hofmann (2017).

In order to examine learned relations in a more transparent setting, we consider an idealistic scenario where the imperfection of LBP, as well as mistakes in predicting other entities, are taken out of the equation using an oracle. This oracle, when we make a prediction for mention $m_i$, will tell us the correct entity $e_j^*$ for every other mention $m_j$, $j \neq i$. We also used AIDA-A (development set) for selecting the numbers of relations for rel-norm and ment-norm. They are set to 6 and 3, respectively. Figure 4 shows the micro F1 scores.

Surprisingly, the performance of oracle rel-norm is close to that of oracle ment-norm, although without using the oracle the difference was substantial. This suggests that rel-norm is more sensitive to prediction errors than ment-norm. Ganea and Hofmann (2017), even with the help of the oracle, can only perform slightly better than LBP (i.e., non-oracle) ment-norm.


Rel-norm is more sensitive to prediction errors.

• Out-of-domain:

Methods                     MSNBC       AQUAINT     ACE2004     CWEB        WIKI        Avg
Milne and Witten (2008)     78          85          81          64.1        81.7        77.96
Hoffart et al. (2011)       79          56          80          58.6        63          67.32
Ratinov et al. (2011)       75          83          82          56.2        67.2        72.68
Cheng and Roth (2013)       90          90          86          67.5        73.4        81.38
Guo and Barbosa (2016)      92          87          88          77          84.5        85.7
Ganea and Hofmann (2017)    93.7 ± 0.1  88.5 ± 0.4  88.5 ± 0.3  77.9 ± 0.1  77.5 ± 0.1  85.22
rel-norm                    92.2 ± 0.3  86.7 ± 0.7  87.9 ± 0.3  75.2 ± 0.5  76.4 ± 0.3  83.67
ment-norm                   93.9 ± 0.2  88.3 ± 0.6  89.9 ± 0.8  77.5 ± 0.1  78.0 ± 0.1  85.51
ment-norm (K = 1)           93.2 ± 0.3  88.4 ± 0.4  88.9 ± 1.0  77.0 ± 0.2  77.2 ± 0.1  84.94
ment-norm (no pad)          93.6 ± 0.3  87.8 ± 0.5  90.0 ± 0.3  77.0 ± 0.2  77.3 ± 0.3  85.13

Table 2: F1 scores on five out-domain test sets. Underlined scores show cases where the corresponding model outperforms the baseline.

This suggests that its global coherence scoring component is indeed too simplistic. Also note that both multi-relational oracle models substantially outperform the two mono-relational oracle models. This shows the benefit of using more than one relation, and the potential of achieving higher accuracy with more accurate inference methods.

Relations

In this section we qualitatively examine relations that the models learned by looking at the probabilities $\alpha_{ijk}$. See Figure 5 for an example. In that example we focus on the mention "Liege" in the sentence at the top and study which mentions are related to it under two versions of our model: rel-norm (leftmost column) and ment-norm (rightmost column).

For rel-norm it is difficult to interpret the meaning of the relations. It seems that the first relation dominates the other two, with very high weights for most of the mentions. Nevertheless, the fact that rel-norm outperforms the baseline suggests that those learned relations encode some useful information.

For ment-norm, the first relation is similar to coreference: the relation prefers those mentions that potentially refer to the same entity (and/or have semantically similar mentions): see Figure 5 (left, third column). The second and third relations behave differently from the first relation, as they prefer mentions having more distant meanings, and are complementary to the first relation. They assign large weights to (1) "Belgium" and (2) "Brussels" but small weights to (4) and (6) "Liege". The two relations look similar in this example; however, they are not identical in general. See a histogram of bucketed values of their weights in Figure 5 (right): their $\alpha$ have quite different distributions.

Complexity

The complexity of rel-norm and ment-norm is linear in K, so in principle our models should be considerably more expensive than Ganea and Hofmann (2017). However, our models converge much faster than their relation-agnostic model: on average, ours needs 120 epochs, compared to their 1250 epochs. We believe that the structural bias helps the model to capture necessary regularities more easily. In terms of wall-clock time, our model requires just under 1.5 hours to train, which is ten times faster than the relation-agnostic model (Ganea and Hofmann, 2017). In addition, the difference in testing time is negligible when using a GPU.

5 Conclusion and Future work

We have shown the benefits of using relations in NEL. Our models consider relations as latent variables, and thus do not require any extra supervision. Representation learning was used to learn relation embeddings, eliminating the need for extensive feature engineering. The experimental results show that our best model achieves the best reported F1 on AIDA-CoNLL, with an improvement of 0.85% F1 over the best previous results.

Conceptually, modeling multiple relations is substantially different from simply modeling coherence (as in Ganea and Hofmann (2017)). In this way we also hope it will lead to interesting follow-up work, as individual relations can be informed by injecting prior knowledge (e.g., by training jointly with relation extraction models).

In future work, we would like to use syntactic and discourse structures (e.g., syntactic dependency paths between mentions) to encourage the models to discover a richer set of relations. We also would like to combine ment-norm and rel-norm.

• Analysis: $\alpha$ weights under rel-norm and ment-norm for the mention "Liege" in "on Friday, Liege police said in", against the context mentions:

(1) missing teenagers in Belgium .
(2) UNK BRUSSELS UNK
(3) UNK Belgian police said on
(4) , " a Liege police official told
(5) police official told Reuters .
(6) eastern town of Liege on Thursday ,
(7) home village of UNK .
(8) link with the Marc Dutroux case , the
(9) which has rocked Belgium in the past

Figure 5: (Left) Examples of $\alpha$. The first and third columns show $\alpha_{ijk}$ for oracle rel-norm and oracle ment-norm, respectively. (Right) Histograms of $\alpha_{\bullet k}$ for $k = 2, 3$, corresponding to the second and third relations from oracle ment-norm. Only $\alpha > 0.25$ (i.e., high attentions) are shown.

Besides, we would like to examine whether the induced latent relations could be helpful for relation extraction.

Acknowledgments

We would like to thank the anonymous reviewers for their suggestions and comments. The project was supported by the European Research Council (ERC StG BroadSem 678254), the Dutch National Science Foundation (NWO VIDI 639.022.518), and an Amazon Web Services (AWS) grant.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Razvan Bunescu and Marius Pasca. 2006. Using encyclopedic knowledge for named entity disambiguation. In 11th Conference of the European Chapter of the Association for Computational Linguistics.

Xiao Cheng and Dan Roth. 2013. Relational inference for wikification. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1787–1796, Seattle, Washington, USA. Association for Computational Linguistics.

Andrew Chisholm and Ben Hachey. 2015. Entity disambiguation with web links. Transactions of the Association of Computational Linguistics, 3:145–156.

Greg Durrett and Dan Klein. 2014. A joint model for entity analysis: Coreference, typing, and linking. Transactions of the Association for Computational Linguistics, 2:477–490.

Evgeniy Gabrilovich, Michael Ringgaard, and Amarnag Subramanya. 2013. FACC1: Freebase annotation of ClueWeb corpora.

Octavian-Eugen Ganea, Marina Ganea, Aurelien Lucchi, Carsten Eickhoff, and Thomas Hofmann. 2016. Probabilistic bag-of-hyperlinks model for entity linking. In Proceedings of the 25th International Conference on World Wide Web, pages 927–938. International World Wide Web Conferences Steering Committee.

Octavian-Eugen Ganea and Thomas Hofmann. 2017. Deep joint entity disambiguation with local neural attention. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2609–2619. Association for Computational Linguistics.

Amir Globerson, Nevena Lazic, Soumen Chakrabarti, Amarnag Subramanya, Michael Ringaard, and Fernando Pereira. 2016. Collective entity resolution with multi-focal attention. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 621–631. Association for Computational Linguistics.

Zhaochen Guo and Denilson Barbosa. 2016. Robust named entity disambiguation with random walks. Semantic Web, (Preprint).

Zhengyan He, Shujie Liu, Mu Li, Ming Zhou, Longkai Zhang, and Houfeng Wang. 2013. Learning entity representation for entity disambiguation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 30–34, Sofia, Bulgaria. Association for Computational Linguistics.

Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Furstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. 2011. Robust disambiguation of named entities in text. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 782–792. Association for Computational Linguistics.

Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S. Weld. 2011.

• Hard to interpret relations induced by rel-norm
• Ment-norm: relation 1 is similar to coreference. Relation 2 and relation 3 complement relation 1, but they are quite different.

Conclusions

• Inducing multiple relations between entities is beneficial for entity linking.
• Our system does not use any supervision for relations and uses a minimal amount of feature engineering.

• Future work: Injecting linguistic knowledge (discourse and syntax).

Source code

https://github.com/lephong/mulrel-nel