
SemLinker system for KBP2013: A disambiguation algorithm based on mutual relations of semantic annotations inside a document.

Eric Charton, Polytechnique Montreal, [email protected]
Marie-Jean Meurs, Concordia University, [email protected]
Ludovic Jean-Louis, Polytechnique Montreal, [email protected]
Michel Gagnon, Polytechnique Montreal, [email protected]

Abstract

In this paper, we present the SemLinker system used by the polymtl team in the English entity linking track of TAC-KBP 2013. To improve the disambiguation process, SemLinker re-uses and enriches the entity links provided by a generic annotation engine. The linking is done through a re-ranking process on the candidate links associated with a given named entity. This process relies on the mutual relations between all the named entities in the document.

1 Introduction

The current definition of the TAC KBP Entity Linking (EL) task is to link named entities (NEs) found at specific positions in a document collection to entities in a reference KB, or to new NEs discovered in the collection. According to this definition, a simple way of achieving the task could be hypothesized: an existing annotation engine compliant with Wikipedia URIs could annotate a document, and then, if it exists, one could retrieve a link at a given position. In fact, such a simplification of the EL task is not yet possible because of the poor quality of annotations provided by existing engines. As shown for instance in (Gangemi, 2013), even if some of the publicly released annotation engines obtain good results on mention detection and semantic disambiguation, very few of them seem precise enough to challenge the performance of the best systems specifically trained for the KBP EL task. However, there is still an interest in improving the annotation quality of such generic annotation engines. The main idea of our participation in the 2013 edition of KBP is to use an existing annotation engine, and apply a complementary algorithm to improve its accuracy. The improvement is achieved through a method for computing semantic relatedness between candidate entities to link, using knowledge from the hypertext provided by the internal links of the Wikipedia encyclopedia.

The paper is organized as follows. The main components of our approach are introduced in Section 2. We then present the architecture of our system for the TAC-KBP 2013 Entity Linking task, together with additional software and resource components, in Section 3. A detailed description of resource usage is presented in Section 4. The modules of our system are described in Section 5. Experiments and results in the context of the TAC-KBP 2013 EL task are reported in Section 6. We then discuss our results, and conclude in Section 7.

2 Approach

The two major novelties in the SemLinker system reside in the query pre-processing, and in the improvement of candidate annotations provided by a generic annotation engine compliant with Wikipedia URIs. By generic annotation engine, we mean an annotator able to provide Wikipedia or DBpedia URIs as links, and which is not specifically trained or built for the KBP task. A previous attempt to involve a generic, non-Knowledge-Based supervised annotator was conducted by (Mendes et al., 2011) with the Spotlight annotator. In this attempt, Spotlight was used with limited modifications to fit the specifics of KBP tasks. Our architecture proposition is similar but involves complementary methods to improve the disambiguation process of annotations provided by the generic annotation engine.

The basic principle of SemLinker is to re-use and improve candidate annotations provided by the Wikimeta (Charton and Gagnon, 2012) annotation engine. Using the whole-document annotations provided by the Wikimeta engine, SemLinker re-ranks the candidate links assigned to a given NE mention, according to their mutual semantic relations with the other NE mentions annotated in the document.

In past years, numerous methods have been proposed to exploit the semantic relatedness between concepts to improve a disambiguation process. Milne and Witten (2008) used a machine learning method to identify significant terms within unstructured text, and to enrich them with links to the appropriate Wikipedia articles. The proposed algorithm balances the prior probability of a sense with its relatedness to the surrounding context. The commonness of a sense is defined by the number of times it is used as a destination in Wikipedia. A different approach proposed by (Hoffart et al., 2011) uses graphs to compute relatedness. The method builds a weighted graph of mentions and candidate entities, and computes a dense subgraph that approximates the best joint mention-entity mapping. (Yazdani and Popescu-Belis, 2013) present a method for computing semantic relatedness between words or texts using knowledge from Wikipedia. A network of concepts is built by filtering the encyclopedia articles. Two types of weighted links between concepts are considered: one is based on hyperlinks between the texts of the articles, and the other one is based on the lexical similarity between them.

While many approaches use Wikipedia to find and compute the semantic relatedness of words, or to improve a named entity disambiguation (NED) task (Strube and Ponzetto, 2006; Bunescu and Pasca, 2006), these propositions rarely involve the semantic relations between the annotations of a complete document, which can be discovered from Wikipedia knowledge. In KBP 2012, (Zhang et al., 2012) proposed to calculate a Semantic Relatedness of Candidate Entity value, then used it as a feature in a Support Vector Machine classifier for disambiguation. Our model uses a similar approach to calculate a semantic relatedness value, but this value is used in the context of a graph exploration rather than in a machine learning based algorithm.

3 System architecture

In our work context, a query consists of a surface form, an anchor document, and the position of the surface form in the document. The pipeline processes a query as follows (a sketch in Python follows the list):

1. Reformulate the query if necessary.

2. Annotate each NE in the document with a ranked set of Wikipedia URIs (the candidates) using an external annotator compliant with Wikipedia URIs. NE category labels (PERS, ORG...) and Part Of Speech (POS) tags are also provided.

3. Re-rank candidates for each NE in the document using all the annotation layers.

4. Extract the best NE link in the document using the query definition and query expansion.

5. Once all the queries have been processed, cluster their associated NEs with no corresponding KB entries (NILs), and convert the Wikipedia URI format to TAC-KBP node identifiers.
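A minimal sketch of this pipeline (all function and attribute names here are ours, not SemLinker's actual API; each callable stands for one of the modules described in Section 5):

    def process_queries(queries, reformulate, annotate, rerank, extract_link, cluster_nils):
        # Hypothetical pipeline skeleton; the five callables are assumptions.
        answers = {}
        for query in queries:
            query = reformulate(query)                    # 1. query reformulation
            doc = annotate(query.document)                # 2. URIs, NE labels, POS tags
            doc = rerank(doc)                             # 3. mutual disambiguation re-ranking
            answers[query.id] = extract_link(doc, query)  # 4. best link for the query mention
        return cluster_nils(answers)                      # 5. NIL clustering, KB node conversion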

Figure 1 shows the SemLinker workflow and architecture. Four main modules, described in Section 5, are dedicated to these tasks:

1. Query Reformulation module

2. Mutual Disambiguation module

3. Link Extraction module

4. Clustering module

These modules rely on various resources introduced in Section 4.

4 System resources

The SemLinker system makes use of external software: Wikimeta1 as annotator, Lucene-Search for Wiki2 as spell checker, and corpus resources such as NLGbAse3 and a Wikipedia dump. These resources are involved in various aspects of the system, such as its mention correction and reformulation process, or the entity linking annotation process.

1 http://www.wikimeta.org
2 https://www.mediawiki.org/wiki/Extension:Lucene-search, v2.1
3 http://www.nlgbase.org


Figure 1: SemLinker workflow. [The diagram shows the pipeline: the query goes through query reformulation (backed by NLGbAse and the Wiki Lucene spell checker), Wikimeta annotation producing an annotated document object (Wikipedia links, named entities, part of speech), named entity correction processes and coreference resolution, mutual disambiguation based re-ranking of the Wikipedia links, link extraction with query expansion, and NIL clustering, yielding KB nodes and NIL answers.]

4.1 Wikipedia dump and related software

A version of the Wikipedia dump was indexed with the Lucene-Search engine, and used as an internal link and category resource for the Mutual Disambiguation Algorithm. The Wikipedia dump was downloaded in July 2013. It was also utilized to train the Lucene-Search for Media-Wiki software version 2.1.3, used as a spelling correction resource for the SemLinker Query Reformulation module.

4.2 NLGbAse lexical resource

NLGbAse is a multilingual linguistic resource composed of metadata, built from the Wikipedia encyclopedic content. The structure of Wikipedia, and the sequential process to build metadata like those in NLGbAse, have been described in (Bunescu and Pasca, 2006). The process is applied in (Charton and Torres-Moreno, 2010). For each document in Wikipedia, NLGbAse provides a set of metadata composed of three elements: (i) a set of surface forms, (ii) all the words contained in the document, where a TF.IDF weight (Salton and Buckley, 1988) is assigned to each word, and (iii) a NE tag obtained through a classification process.

The set of surface forms is obtained by collecting every Wikipedia internal link that points to an encyclopedic document. Internal links can be redirection links (specific Wikipedia pages that only point to another page and express another way of spelling), interwiki links (links directing to the same document in another language edition) and, finally, disambiguation pages (pages that summarize, for a unique surface form, all the possibly related Wikipedia documents). For instance, the surface form set for the NE Paris (France)4 contains 39 elements (e.g. Ville Lumiere, Ville de Paris, Paname, Capitale de la France, Departement de Paris...). In our resource, the surface forms are collected from six linguistic editions of Wikipedia (Polish, English, German, Italian, Spanish and French). We use such cross-lingual resources because, in some cases, a surface form may appear in a single language edition of Wikipedia but should be usable in other languages. For instance, the surface forms Renault-Dacia or RNUR, related to a European car maker, are not available in the English Wikipedia but can be collected from the Polish Wikipedia edition.
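For illustration, an NLGbAse metadata entry can be pictured as follows (field names and sample values are ours; the actual NLGbAse format is not specified in this paper):

    from dataclasses import dataclass, field

    @dataclass
    class NLGbAseEntry:
        uri: str                                          # Wikipedia URI of the concept
        surface_forms: set = field(default_factory=set)   # (i) aliases from internal links
        tfidf_words: dict = field(default_factory=dict)   # (ii) word -> TF.IDF weight
        ne_tag: str = ""                                  # (iii) NE tag from classification

    # Illustrative, partial entry for the NE Paris (France)
    paris = NLGbAseEntry(
        uri="http://en.wikipedia.org/wiki/Paris",
        surface_forms={"Paris", "Ville Lumiere", "Ville de Paris", "Paname"},
        tfidf_words={"france": 0.9, "capital": 0.7, "seine": 0.4},
        ne_tag="LOC.ADMI",
    )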

NLGbAse is involved in SemLinker as a resource for spelling correction in the Query Reformulation module. It is also used as the annotation and disambiguation resource by the Wikimeta annotation engine. For this reason, NLGbAse is also utilized to build a correspondence table between Wikipedia URIs and the KB nodes (see Section 4.4).

4.3 Wikimeta annotation engine

Wikimeta is an annotation engine able to provide, for a given document, various levels of annotations. Wikimeta relies on a part-of-speech tagger, a NE recognition module, a semantic annotation module, and NLGbAse, used as a disambiguation interface to establish a link between annotations and the semantic web. The engine is accessible through a REST API. It is free for academic use, and its components have been described in (Charton and Gagnon, 2012).

The Wikimeta annotation engine workflow is described below.

4 http://www.nlgbase.org/perl/display.pl?query=Paris&search=FR


Figure 2: In this example, multiple surface forms for a company name are collected from the Polish edition of Wikipedia.

Word    POS   NE         Semantic Link
Laura   NNP   PERS.HUM   NORDF
Colby   NNP   PERS.HUM
in      IN    UNK
Milan   NNP   LOC.ADMI   http://dbpedia.org/data/Milan.rdf

Table 1: Sample annotation from the Wikimeta annotation engine. The special semantic annotation NORDF is used when no RDF link is available.

First, NE mentions are detected using the surface forms and a machine-learning algorithm. To label NEs as Pers, Org, Loc, Prod, Time and Date, Wikimeta uses a Conditional Random Field tagger (Bechet and Charton, 2010) to produce an initial layer of NEs. Then, Wikimeta reinforces the candidate detection process with surface forms from NLGbAse used as regular expressions to detect NEs. Finally, NEs identified in both layers are merged.

For each NE annotated in the document, the surface form is utilized to select a group of candidate Wikipedia URIs from the NLGbAse resource. A cosine similarity is calculated between the context of the NE in the document and, for each potential candidate available in NLGbAse, a bag of words weighted by TF.IDF. The candidate links are then ranked according to their cosine scores.
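A minimal sketch of this ranking step, assuming sparse {word: weight} dictionaries for the NE context and the NLGbAse candidates (the exact vector construction inside Wikimeta is not detailed in this paper):

    import math

    def cosine(a, b):
        # Cosine similarity between two sparse word-weight vectors.
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def rank_candidates(context_vector, candidates):
        # candidates: list of (uri, tfidf_vector) pairs collected from NLGbAse.
        scored = sorted(candidates, key=lambda c: cosine(context_vector, c[1]), reverse=True)
        return [uri for uri, _ in scored]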

One advantage of this separation between the NE labeling process and the linking annotation process is the capacity of the system to detect all NEs in the document, even if no link is available for them. For instance, Milton, from query EL ENG 00309 in the TAC-KBP 2012 test set, is annotated as PERS by Wikimeta with its correct complete boundaries - Christian Milton - even if no link is provided by the engine. This property is very useful for managing NIL mentions as it allows, for example, collecting extended query information - the first name Christian in the previous sample - and this information can be used to improve the NIL clustering process (this aspect is detailed in Section 5.4).

The Wikimeta disambiguation engine is based on the most recent version of NLGbAse, itself generated from a recent Wikipedia dump (January 2013). It is thus able to annotate over 3 million entities, far beyond the 800k present in the current KB resource. This enables our system to correctly identify many queries considered as NILs in the KBP evaluation framework. We use this complementary disambiguation capability (also used in former systems (Cucerzan, 2012; Radford et al., 2012; Graus et al., 2012)) to improve the SemLinker Clustering module effectiveness on NIL entities (see Section 5.4).

In the context of the SemLinker system, Wikimeta is utilized to annotate documents with POS labels, with NEs (including finding the boundaries of NE mentions, and attributing a NE label to them), and to provide for each NE a list of ranked Wikipedia URI links. A sample annotation is reported in Table 1. The Wikimeta annotator used for TAC-KBP 2013 was a version of the system installed in an experimental environment. It was modified to provide n candidate links (instead of 3 in the public version) for each detected mention.

4.4 Correspondence between Wikipedia URIs and KB nodes

Since Wikimeta annotations are Wikipedia URIs, we need a correspondence table between KB nodes and Wikimeta outputs. This table is generated using NLGbAse, since this resource is the Wikimeta reference knowledge base. The table is built with an algorithm that matches exact correspondences between KB node names and NLGbAse surface forms. When a surface form of an NLGbAse entry exactly matches a KB node name, the Wikipedia URI stored in the NLGbAse entry is defined as the matching relation for the KB node. About 0.3% (less than 2.5k) of the KB nodes have no correspondence. This is mostly due to pages deleted since 2008, which still exist in the 2008 dump used for building the KB, but are not present in the recent Wikipedia XML dumps utilized to build NLGbAse. Less than 0.1% of those missing annotations have an influence on the final process. This is measured from the KB nodes required in queries of the 2012 TAC-KBP test corpus that have no matching entries in the correspondence table. We consider that this distance between our correspondence table and the KB node table does not have a notable influence on the final results.
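A sketch of this exact-match construction, assuming the entry layout pictured in Section 4.2 (data access is schematic):

    def build_correspondence_table(kb_node_names, nlgbase_entries):
        # Index every surface form, then keep only the KB node names that
        # match one exactly; ties keep the first URI encountered.
        by_surface_form = {}
        for entry in nlgbase_entries:
            for sf in entry.surface_forms:
                by_surface_form.setdefault(sf, entry.uri)
        return {name: by_surface_form[name]
                for name in kb_node_names if name in by_surface_form}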

5 SemLinker Modules

We now describe the modules involved in our pipeline to tackle the TAC-KBP EL task. The pipeline starts with the Query Reformulation module, and ends with the Clustering process of all NEs.

5.1 Query reformulation module

In EL systems, the availability of an exhaustive resource of candidate surface forms is of critical interest for detecting mentions. The higher the coverage of this resource, the more candidate entities are detected. Recent TAC-KBP evaluation campaigns have been engineered to emphasize the surface form matching problem: the evaluation framework of the EL task makes increasing use of noisy and misspelled mentions that must be linked. In the 2012 and 2013 TAC-KBP evaluation corpora, we identified three main cases of mentions to annotate for which no surface form exists in Wikipedia-based resources:

• Case 1: An abbreviation that refers to a NE does not exist in any Wikipedia redirection, disambiguation or interwiki page.

For example, a NE is denoted by an abbreviation like JGL, which stands for Joseph Gordon-Levitt5.

• Case 2: An abbreviation that refers to a NE exists in Wikipedia, but it is redirected to another entity.

An example is given by the IPI6 surface form, which refers to the Intellectual Property Institute. An IPI disambiguation page exists in Wikipedia, and allows collecting several full names for this abbreviation (see Figure 3), but does not refer to the Intellectual Property Institute page.

• Case 3: A mention is misspelled or provided under an uncommon or unconventional surface form, and exists in Wikipedia under a slightly different lexical description.

This is the case of the query Bagdahd, which should refer to the Bagdad page in Wikipedia7.

These three cases cannot be handled by state-of-the-art approaches based on Wikipedia-derived content only, since the surface forms collected from the encyclopedia do not match the ones expressing NEs in the document. The algorithm we propose empowers EL systems to handle such cases, and thus improves their performance.

5.1.1 Mention Correction Algorithm

We propose a Mention Correction algorithm involving two strategies to improve the surface form coverage of EL systems. The first strategy consists in automatically adding supplementary surface forms, generated by heuristics, to an existing resource of surface forms. The second strategy involves the introduction of a lexical correction step in the surface form detection process. Let us consider a candidate mention that we want to link to a KB entry. To find a set of candidate entries in the KB according to this surface form, the proposed algorithm runs as follows (a sketch follows the steps):

• Step 1: The candidate mention is submitted to the Improved Surface Form Detection algorithm. If matching surface form candidates are returned, the algorithm proceeds to step 3; else to step 2.

5 Example from query EL13 ENG 0319 of KBP 2013.
6 Example from query EL13 ENG 1604 of KBP 2013.
7 Example from query EL13 ENG 1872 of KBP 2013.


Figure 3: All the abbreviated surface forms of a given entity are not necessarily present in Wikipedia. Generating some complementary matchings (here for the IPI abbreviation) improves the detection capabilities of an Entity Linking system. [The figure shows the sentence "The IPI has also just published a new tract.", the matching entities collected from the Wikipedia IPI disambiguation page (International Prognostic Index, International Press Institute, Intelligent Peripheral Interface, ...), and the matching entities generated from the abbreviation IPI (Intellectual Property Institute, Irish Peace Institute, Independent Photo Imager, ...), all submitted to the disambiguation engine.]

• Step 2: If no candidate is provided by the Improved Surface Form Detection algorithm, the candidate mention is submitted to the Surface Form Correction algorithm.

  - If this module returns suggestions of alternative surface forms, step 1 is repeated using those suggestions to collect candidates.

  - Else the algorithm returns no suggestion and exits.

• Step 3: Rewrite the query in the document if needed, and proceed to annotation.
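The control flow of these steps can be sketched as follows (detect and suggest stand for the two components described next; both signatures are assumptions):

    def correct_mention(mention, detect, suggest):
        # detect: Improved Surface Form Detection -> list of candidate entries
        # suggest: Surface Form Correction -> list of accepted rewritings
        candidates = detect(mention)              # Step 1
        if candidates:
            return mention, candidates            # Step 3: annotate as-is
        for alternative in suggest(mention):      # Step 2
            candidates = detect(alternative)      # Step 1 repeated on the suggestion
            if candidates:
                return alternative, candidates    # Step 3: rewrite the query, then annotate
        return mention, []                        # no suggestion: exit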

We describe below the two components of this algorithm, the Improved Surface Form Detection algorithm and the Surface Form Correction algorithm.

Improved Surface Form Detection algorithm

The original surface form resource used in this study is NLGbAse. Currently, this resource contains about 3 million English metadata entries, each of them describing a unique concept. The set of surface forms is obtained by collecting every Wikipedia internal link that points to an encyclopedic document. They allow matching numerous alternate spelling suggestions, including misspelled ones (usually found in redirection pages of Wikipedia). However, this resource does not cover all the cases encountered in the TAC-KBP corpus. Hence, additional surface forms are automatically generated with various heuristics like the following ones (a sketch follows the list):

• Automatic generation of abbreviations: for example, for a surface form like Joseph Gordon-Levitt, a heuristic generates the surface form JGL from the capital letters of the sequence.

• Automatic generation of alternative surface forms (adding "s" at the end of forms, re-ordering n-grams).
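A sketch of the abbreviation heuristic (our reading of the capital-letter rule):

    def abbreviate(surface_form):
        # Keep the capital letters of the sequence, e.g. Joseph Gordon-Levitt -> JGL.
        return "".join(ch for ch in surface_form if ch.isupper())

    assert abbreviate("Joseph Gordon-Levitt") == "JGL"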

We finally obtained 4 million additional generated surface forms, for a total of 14 million, each of them related to one or more Wikipedia documents. According to step 1 of the algorithm described in Section 5.1.1, for a given word sequence in a text, this resource is first used to check if a matching surface form exists. If it exists, all related candidates are collected, and the disambiguation process is applied to rank them. If no match is found, the Surface Form Correction algorithm is invoked.

Surface Form Correction algorithm

The Surface Form Correction algorithm makes use of a database of potential spelling errors built from the Wikipedia dump, and a set of rules used to validate the suggested corrections. The database engine generates a set of variations for each existing surface form. The Lucene-Wiki software is used to generate this database8, which includes about 1 billion entries.

According to step 2 of the Query Reformulation module, the Surface Form Correction algorithm is called when the Improved Surface Form Detection algorithm did not provide any candidate. If the database engine suggests a rewriting, the resulting refined surface form is submitted to a set of selection rules intended to check if the suggestion is relevant. The rules described below are sequentially applied (a sketch follows the rules):

• Rule A: m common word(s).

Let us suppose m = 1, and the original surface form is "hitlery clinton"9: if the system suggests "Hillary Rodham Clinton", the rule selects the suggested refined surface form because it has at least one word in common with the original one.

• Rule B: Lexical distance of n letter(s).

If n = 1, the original surface form is "Michicgan"10, and the system suggests "Michigan", the rule selects the suggested form.

• Rule C: Edit distance between mention and suggestion.

The rule verifies that the original and corrected mentions start with the same character, and calculates their Levenshtein distance l. The suggested form is accepted if l ≤ t, where t is a threshold value.
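A sketch of the three rules applied sequentially (the thresholds m, n and t are illustrative; the paper does not give their values):

    def levenshtein(a, b):
        # Standard dynamic-programming edit distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def accept_suggestion(original, suggestion, m=1, n=1, t=3):
        # Rule A: at least m common word(s).
        if len(set(original.lower().split()) & set(suggestion.lower().split())) >= m:
            return True
        # Rule B: lexical distance of at most n letter(s).
        if levenshtein(original.lower(), suggestion.lower()) <= n:
            return True
        # Rule C: same first character and Levenshtein distance l <= t.
        return (original[:1].lower() == suggestion[:1].lower()
                and levenshtein(original.lower(), suggestion.lower()) <= t)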

If the suggested refined form is accepted, it is submitted to the Improved Surface Form Detection algorithm to verify that the reformulated form exists in the surface form resource.

5.1.2 Experiments on query reformulation

The algorithm was tested with the TAC-KBP 2013 English queries11, according to the evaluation protocol of the TAC-KBP task evaluation framework. The queries consist of a surface form, an anchor document, and the position of the surface form in the document.

8 https://www.mediawiki.org/wiki/Extension:Lucene-search
9 Example from query EL13 ENG 0895 of KBP 2013.
10 Example from query EL13 ENG 1624 of KBP 2013.
11 http://www.nist.gov/tac/2013/KBP/data.html

Category                     refSF B3+F1   QR B3+F1
Overall                      0.574         0.596
KB (in KB)                   0.494         0.535
NIL (not in KB)              0.665         0.662
NW (news doc)                0.645         0.649
WEB (web doc)                0.579         0.592
DF (forum doc)               0.454         0.508
PER (person)                 0.695         0.708
ORG (organization)           0.604         0.607
GPE (geopolitical entity)    0.440         0.486

Table 2: System performance on the test corpus of the TAC-KBP 2013 task with the reference surface form resource (refSF) and Query Reformulation (QR).

We submitted the queries to a version of our system with Surface Form Correction disabled (refSF system in Table 2), and then enabled (QR system in Table 2). With the QR system, when a refined surface form is proposed, each of its occurrences in the document is rewritten according to the new form, prior to being submitted to the semantic annotation engine. The Mention Correction algorithm improves the performance for KB link detection (KB line of Table 2), and does not reduce the performance for NILs (surface forms with no matching KB link). We can conclude that the selection rules applied in the Surface Form Correction algorithm accurately reject most of the wrong corrections of surface forms. The improvement obtained on the noisiest documents - the DF documents of the TAC-KBP task, which are web forum transcripts - also shows that the Query Reformulation module is efficient on noisy text content.

5.2 Mutual Disambiguation module

The Mutual Disambiguation module of SemLinker consists of three main steps. First, it collects a set of ranked candidate Wikipedia URIs for each NE candidate in a given document using an annotator; Wikimeta is utilized for TAC-KBP 2013, but it could be any generic annotator capable of giving a list of ranked Wikipedia URIs for each NE in a document. Then, it applies simple correction processes to improve the precision of the first-rank URIs. Finally, it runs a Mutual Disambiguation Process (MDP) to globally improve all the annotations in the document.

Figure 4: Example of semantic relatedness captured by the Mutual Disambiguation module in SemLinker. [The figure shows a document ("IBM has 12 research laboratories worldwide. In 1952, Thomas J. Watson, Jr., became president of the company."), the candidate annotations International Brotherhood of Magicians and International Business Machines for the mention IBM, and Thomas Watson, Jr for the mention Thomas J. Watson; the direct semantic relations of the IBM candidates with Thomas Watson, Jr (10 versus 0), and their common semantic relations ({IBM 7070, Software, History of IBM, ...} versus none).]

The extraction of an accurate link at a specific position, according to a given TAC-KBP query, is a link extraction process occurring after the URI annotation of NEs in the whole document. Our goal here is to use all the semantic content of an annotated document to locally improve the precision of each annotation in this document. The mutual disambiguation process relies on the graph of all the relations (internal links, categories) between the Wikipedia content related to the document annotations, since this graph contains helpful semantic information.

A basic example of the semantic relatedness that should be captured is presented in Figure 4, and explained hereafter. Let us consider the mention IBM in a given document. Candidate NE annotations for this mention can be International Business Machines or International Brotherhood of Magicians. But if the IBM mention co-occurs with a Thomas Watson, Jr mention in the document, there will probably be more links between the International Business Machines and Thomas Watson, Jr related Wikipedia pages than between the International Brotherhood of Magicians and Thomas Watson, Jr related Wikipedia pages. The purpose of the MDP is to capture this semantic relatedness information, contained in the graph of links extracted from the Wikipedia pages related to each candidate annotation.

We now explain the complete annotation and disambiguation process.

5.2.1 Annotation process

The document D is submitted to the annotation engine, which provides:
- for all the words: POS tags;
- for each NE found in the document: the NE label, the span of the NE, and links - a maximum of 15 ranked Wikipedia URIs12.

12 The Wikimeta annotation engine is a specific version of the public one, installed locally, and modified to provide 15 ranked candidate Wikipedia URIs for each annotation (3 are provided in the public version).

The annotation process results in an annotation object called A_D. The A_D array has one line per NE spotted in the document D to annotate. A_D has |I| lines, with I the index set associated with the document NEs. A_D has 19 columns c_j, j ∈ {1, ..., 19}, defined as follows:

• c_1 to c_15 store the Wikipedia URIs associated with the NE, ordered by decreasing likelihood,

• c_16 stores the offset of the NE in the document,

• c_17 stores the surface form of the NE,

• c_18 stores the NE label (PERS, GPE, ...),

• c_19 stores the POS tags.

These Wikipedia URIs are the candidate URI annotations for each NE in the document. They will be converted into KB nodes at the end of the pipeline process. When no Wikipedia URI is found by the annotation engine for the NE i at rank k and greater (k ∈ {1, ..., 15}), null URIs are declared in A_D[i][k], A_D[i][k+1], ..., A_D[i][15].

The URIs stored in columns c_1 to c_15 of the annotation object will be re-ranked to improve the annotation precision. The re-ranking phase is performed by the Mutual Disambiguation Process (MDP), prepared by two Correction Processes (CPs). A sketch of the annotation object layout follows.
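The annotation object can be pictured as a list of rows (a simplification we introduce; the real storage layout is not specified in this paper):

    def make_ad_row(uris, offset, surface_form, ne_label, pos_tags):
        # c1..c15: candidate URIs padded with null; c16..c19: offset,
        # surface form, NE label, POS tags (0-indexed here).
        return (list(uris) + [None] * 15)[:15] + [offset, surface_form, ne_label, pos_tags]

    row = make_ad_row(["http://en.wikipedia.org/wiki/Paris"], 128, "Paris", "LOC.ADMI", "NNP")
    assert row[15] == 128 and len(row) == 19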

5.2.2 Correction Processes

CPs are applied before the MDP, using all the information stored in A_D. The idea is to use simple methods to locally correct some wrong annotations before applying the disambiguation process. CPs are intended to normalize the first-rank URI annotations by: 1) assigning the most probable NE label to NEs with similar surface forms; 2) using a co-reference algorithm based on NE labels to assign the same first-rank Wikipedia URI, the most probable one, to all the NEs of a co-reference chain.

Named entity correction process

The NE correction process is applied according to the list of NEs in the annotation object. For two given NEs i and i′ reported in A_D, if:

[ surface form A_D[i][17] is longer than A_D[i′][17] ]
AND
[ A_D[i′][17] ⊂ A_D[i][17] ]

then the NE label A_D[i][18] is assigned to A_D[i′][18].

For example, if in the document the NE Kennedy received the GPE label, and the NE John F Kennedy received the PERS label, the Kennedy GPE label is replaced by PERS.
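A sketch of this label propagation, using a list of {'sf': surface form, 'ne': NE label} records as an assumed simplification of A_D:

    def propagate_ne_labels(annotations):
        # If a shorter surface form is contained in a longer one, the longer
        # form's NE label overrides the shorter form's label.
        for longer in annotations:
            for shorter in annotations:
                if (longer is not shorter
                        and len(longer["sf"]) > len(shorter["sf"])
                        and shorter["sf"] in longer["sf"]):
                    shorter["ne"] = longer["ne"]   # e.g. Kennedy: GPE -> PERS

    rows = [{"sf": "John F Kennedy", "ne": "PERS"}, {"sf": "Kennedy", "ne": "GPE"}]
    propagate_ne_labels(rows)
    assert rows[1]["ne"] == "PERS"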

Co-reference correction process

The co-reference detector is derived from (Charton et al., 2010). The co-reference detection is conducted using the information provided by the annotation object. Among the targeted NEs labeled as ORG, PERS, or GPE that have been detected, the ones that co-refer are identified and clustered by logical rules based on POS tags, words and NE labels. When a co-reference chain is detected, the most frequent URI associated with the longest NE surface form is attributed at first rank to all members of the chain.

5.2.3 Mutual Disambiguation Process

In the MDP, for each candidate Wikipedia URI annotation, all the internal links and categories contained in the source Wikipedia document related to this URI are collected. This information is used to calculate a weight for each of the n candidate URI annotations of each mention (with n = 15 in the current version of SemLinker). For a given NE, this weight is expected to measure the mutual relations of a candidate annotation with all the other candidate annotations of NEs in the document.

Algorithm

Let us consider an annotation object A_D, obtained as explained in Section 5.2.1, for a document D of the KBP corpus. For all i ∈ I, k ∈ {1, ..., 15}, we build the set S_i^k, composed of the Wikipedia URIs and categories contained in the source Wikipedia document related to A_D[i][k].

Scoring: For all i, i′ ∈ I, k ∈ {1, ..., 15}, we want to calculate the weight of mutual relations between the candidate A_D[i][k] and all the candidates A_D[i′][1] with i ≠ i′. The calculation combines two scores:
- the direct semantic relation (dsr) score for A_D[i][k] sums up the number of occurrences of A_D[i][k] in S_{i′}^1 for all i′ ∈ I − {i};
- the common semantic relation (csr) score for A_D[i][k] sums up the number of common URIs between S_i^k and S_{i′}^1 for all i′ ∈ I − {i}.

Figure 4 shows an example of direct semantic and common semantic relations.

We assumed the dsr score to be much more semantically significant than the csr score, and translated this assumption into the weight calculation by introducing two correction parameters α and β, used in the final scoring calculation. In the current version of SemLinker, they have been experimentally set to α = 10 and β = 2.

Re-ranking: For all i ∈ I, for each set {A_D[i][k], k ∈ {1, ..., 15}}, the re-ranking process is conducted according to the following steps (a sketch follows the list):

1. ∀k ∈ {1, ..., 15}, calculate dsr_score(A_D[i][k])

2. ∀k ∈ {1, ..., 15}, calculate csr_score(A_D[i][k])

3. calculate mutual_relation_score(A_D[i][k]) = α · dsr_score(A_D[i][k]) + β · csr_score(A_D[i][k])

4. re-rank {A_D[i][k], k ∈ {1, ..., 15}} by decreasing order of mutual_relation_score.
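A sketch of the whole MDP, assuming candidates[i] is the ranked candidate URI list of NE i and related(uri) returns the set S of URIs and categories of the corresponding Wikipedia page:

    def mdp_rerank(candidates, related, alpha=10, beta=2):
        # S[i][k]: relation set of the k-th candidate of NE i.
        S = [[related(uri) for uri in cands] for cands in candidates]
        reranked = []
        for i, cands in enumerate(candidates):
            # First-rank relation sets of all the other NEs in the document.
            others = [S[j][0] for j in range(len(candidates)) if j != i and candidates[j]]
            scored = []
            for k, uri in enumerate(cands):
                dsr = sum(uri in s for s in others)          # direct semantic relations
                csr = sum(len(S[i][k] & s) for s in others)  # common semantic relations
                scored.append((alpha * dsr + beta * csr, uri))
            reranked.append([uri for _, uri in sorted(scored, key=lambda x: -x[0])])
        return reranked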

5.3 Link Extraction module

The system is now ready to extract the URI associated with the TAC-KBP query. At this step, we assume that the majority of first-rank Wikipedia URI annotations are accurate, but some errors remain. To improve the robustness of the link extraction process, our system makes use of a query expansion mechanism as described in Section 5.1. The original query or its expanded version is then submitted to heuristics that select the best final Wikipedia URI proposition.

Its final results are very similar to those presented, for example, in (Cucerzan, 2012) or (Tamang et al., 2012). However, it differs in that it can be viewed as a passive mechanism while, for example, the CUNY Blender system uses active pattern matching to expand a GPE name that starts with only [City name] to [City name][State name]. Since the first annotation step provided by Wikimeta is a NE detection made by a CRF classifier, the returned annotated document provides the exact boundaries of the NE surface forms for person or location names without further processing. In the link extraction process, this allows the system to automatically extend a query according to the longest surface form of the NE found at the query position.

For example, in query EL ENG 00954, where the entity to link is Kennedy in the sentence Rep. Patrick Kennedy (D-RI) has announced that ..., the Wikimeta CRF automatically finds the complete surface form Patrick Kennedy as a PERS. Therefore, the system only needs to collect the complete surface form related to the position of Kennedy in A_D to passively obtain the expanded query Patrick Kennedy.

Two cases occur in the link extraction process: 1) an expanded query is available; 2) no query expansion is available. To expand a query, the system evaluates whether a NE surface form larger than the original query has been annotated by Wikimeta at the query position. If the query is an abbreviation, rules are applied to extract the corresponding full name, if it exists elsewhere in the document (e.g. the full name corresponding to the abbreviation expressed in parentheses, just after the first occurrence of the abbreviation). Finally, the resulting expanded query is used to filter the candidate list from A_D, and to select the best candidate according to the following rules (a simplified sketch follows):

• PerfectMatch: all the Wikipedia URIs for the same mention are identical (or NILs).

• BestKeyAtPos: the majority of the Wikipedia URIs in the filtered list, using the NE label as a complementary filter, are identical (or NILs) with the one at the mention position.

• BestKey: the majority of the Wikipedia URIs for the same mention of more than one word are identical (or NILs).

• KeyAtPos: if none of the previous heuristics gives an answer, select the Wikipedia URI (or NIL) at the mention position.

If no Wikipedia URI is provided after application of all these rules, the NIL value is assumed, but the expanded query is saved for later use in the NIL clustering process.
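A simplified sketch of these selection rules (the NE-label filtering of BestKeyAtPos and the multi-word restriction of BestKey are omitted; occurrence_uris collects the first-rank URIs of all occurrences of the expanded mention in A_D):

    from collections import Counter

    def select_link(occurrence_uris, uri_at_position):
        # Returns a Wikipedia URI or None (NIL).
        if not occurrence_uris:
            return uri_at_position
        distinct = set(occurrence_uris)
        if len(distinct) == 1:                      # PerfectMatch: all identical (or NILs)
            return distinct.pop()
        top, freq = Counter(occurrence_uris).most_common(1)[0]
        if freq > len(occurrence_uris) / 2:         # BestKeyAtPos / BestKey: a majority exists
            return top
        return uri_at_position                      # KeyAtPos: fall back to the query position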

5.4 Clustering module

Before the final clustering process, entity links provided by our system are Wikipedia URIs; they do not take KB nodes into account. As the annotation engine uses a recent edition of Wikipedia, many NIL entities also receive a Wikipedia URI in our system. The clustering process for annotated entities is implicit, and requires no specific action: all the queries sharing the same Wikipedia URI link are assumed to belong to the same cluster.

Clustering is still needed for entities that have not received any Wikipedia URI. The clustering process for those remaining entities follows two steps (sketched after the list):

1. make use of the extended queries collected during the link extraction process to attach remaining queries to existing clusters, or create new ones when they share a similar surface form of more than one word;

2. remaining entities are clustered as singletons.

Finally, KB node assignment is done using the correspondence table between Wikipedia URIs and KB nodes described in Section 4.4. Remaining clusters receive a NIL reference.
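A sketch of this two-step NIL clustering (the cluster identifier scheme and the multi-word test are our reading of the description above):

    def cluster_nils(nil_queries):
        # nil_queries: list of (query_id, expanded_surface_form) pairs.
        clusters, by_form, next_id = {}, {}, 0
        for qid, form in nil_queries:
            key = form.lower() if len(form.split()) > 1 else None
            if key in by_form:                       # step 1: shared multi-word form
                clusters[qid] = by_form[key]
            else:
                clusters[qid] = "NIL%04d" % next_id  # step 1/2: new cluster or singleton
                next_id += 1
                if key is not None:
                    by_form[key] = clusters[qid]
        return clusters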

6 Experiments

The system was developed using the TAC-KBP 2012 queries13. On this resource, used for training, our system obtained the results presented in Table 3. Compared to the TAC-KBP 2012 official results in Table 4, our system obtains near state-of-the-art performance.

On the test corpus of TAC-KBP 2013, the score of our system is higher than the median of the global results, and also higher than the median score of the no-web-no-wiki category.

13 tac 2012 kbp english evaluation entity linking query.xml


Category                     global median B3+F1   SemLinker B3+F1   no-web-no-wiki median B3+F1
Overall                      0.588                 0.596             0.540
KB (in KB)                   0.535                 0.535             0.488
NIL (not in KB)              0.611                 0.662             0.662
NW (news doc)                0.664                 0.649             0.573
WEB (web doc)                0.522                 0.592             0.481
DF (forum doc)               0.469                 0.508             0.448
PER (person)                 0.610                 0.708             0.550
ORG (organization)           0.542                 0.607             0.510
GPE (geopolitical entity)    0.543                 0.486             0.485

Table 5: TAC-KBP 2013 SemLinker results and median results of the global and no-web-no-wiki categories.

Category   B3+P    B3+R    B3+F1
Overall    0.695   0.696   0.695
NIL        0.786   0.759   0.772
KB         0.635   0.639   0.637

Table 3: SemLinker results on the TAC-KBP 2012 test corpus used as development resource.

Rank (B3+F1)   1       2       3
Overall        0.730   0.699   0.689
NIL            0.789   0.781   0.765
KB             0.685   0.653   0.620

Table 4: Official results of the three best systems on the TAC-KBP 2012 test corpus.

The median of global results is calculated on the best runs of the 26 teams; the median of the no-web-no-wiki category is calculated on the best runs of 9 teams. The results are presented in Table 5. We observe that our system obtains a good level of performance on various types of documents, and on the PERS and ORG NE categories, but under-performs on GPE NEs. After investigation, it appears that this low level of performance on this specific category of NEs is due to a heuristic applied to GPEs in the link extraction process (presented in Section 5.3). This GPE heuristic, which works correctly on the development corpus, reduces the performance on the test corpus. This can be considered a development error; solving it with a minimal correction should subsequently improve the global performance of SemLinker.

7 Conclusion

We have presented SemLinker, a general experimental platform dedicated to the exploration of the entity linking task within the TAC-KBP evaluation framework. This platform can easily be adapted to various external annotation tools compliant with the Wikipedia URI annotation format. We plan to implement complementary annotation software to compare them in this experimental context. Our system introduces two novel techniques to improve entity linking: a Query Reformulation module, and a Mutual Disambiguation algorithm based on semantic relatedness in a document. Using a generic annotation engine - an annotator able to provide Wikipedia URIs as links and not specifically trained or built for the TAC-KBP task - this proposition obtained an encouraging median B3+F1 score in the general task without any supervised training on the KB content or Wiki-text. Our system also performs better than the median score of the non-supervised systems (no Wiki-text). The source code of our system is publicly available. We plan to upgrade it in the coming months to compare the performance of various annotation tools in the same context.

Reproducibility

All the experiments presented in this paper are fully reproducible on NIST TAC-KBP data using the SemLinker software14.

14 Open source and publicly released at https://code.google.com/p/semlinker/


Acknowledgments

This research was supported as part of Dr Eric Charton's Mitacs Elevate Grant, sponsored by 3CE. The participation of Dr Marie-Jean Meurs in this work was supported by the Genozymes Project, funded by Genome Canada and Genome Quebec. The authors would like to thank Wikimeta Technologies Inc., which provided support and computing resources for this research.

References

Frederic Bechet and Eric Charton. 2010. Unsupervised knowledge acquisition for extracting named entities from speech. In Proceedings of ICASSP 2010, Dallas.

Razvan C. Bunescu and Marius Pasca. 2006. Using encyclopedic knowledge for named entity disambiguation. In Proceedings of EACL, volume 6.

Eric Charton and Michel Gagnon. 2012. A disambiguation resource extracted from Wikipedia for semantic annotation. In Proceedings of LREC 2012.

Eric Charton and J.M. Torres-Moreno. 2010. NLGbAse: a free linguistic resource for Natural Language Processing systems. In Proceedings of LREC 2010, Malta.

Eric Charton, Michel Gagnon, and Benoit Ozell. 2010. Poly-co: an unsupervised co-reference detection system. In INLG 2010-GREC, Dublin. ACL SIGGEN.

Silviu Cucerzan. 2012. The MSR System for Entity Linking at TAC 2012. In Proceedings of the Text Analysis Conference 2012.

Aldo Gangemi. 2013. A Comparison of Knowledge Extraction Tools for the Semantic Web. In Proceedings of ESWC 2013, Montpellier, France, May 26-30, 2013.

David Graus, Tom Kenter, Marc Bron, Edgar Meij, and Maarten de Rijke. 2012. Context-Based Entity Linking: University of Amsterdam at TAC 2012. In Proceedings of the Text Analysis Conference 2012.

Johannes Hoffart, Mohamed A. Yosef, and Ilaria Bordino. 2011. Robust disambiguation of named entities in text. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 782-792.

Pablo N. Mendes, Joachim Daiber, Max Jakob, and Christian Bizer. 2011. Evaluating DBpedia Spotlight for the TAC-KBP Entity Linking Task. In Proceedings of the Text Analysis Conference - TAC KBP 2011, pages 1-4.

David Milne and Ian H. Witten. 2008. Learning to link with Wikipedia. In Proceedings of the 17th ACM Conference on Information and Knowledge Management.

Will Radford, Will Cannings, Andrew Naoum, Joel Nothman, Glen Pink, Daniel Tse, and James R. Curran. 2012. (Almost) Total Recall. In Proceedings of the Text Analysis Conference 2012.

Gerard Salton and Christopher Buckley. 1988. Term-weighting approaches in automatic text retrieval. Information Processing & Management.

Michael Strube and Simone Paolo Ponzetto. 2006. WikiRelate! Computing Semantic Relatedness Using Wikipedia. In Proceedings of AAAI 2006.

Suzanne Tamang, Zheng Chen, and Heng Ji. 2012. CUNY BLENDER TAC-KBP2012 Entity Linking System and Slot Filling Validation System. In Proceedings of the Text Analysis Conference 2012.

Majid Yazdani and Andrei Popescu-Belis. 2013. Computing text semantic relatedness using the contents and links of a hypertext encyclopedia. Artificial Intelligence.

Tao Zhang, Kang Liu, and Jun Zhao. 2012. The NLPRIR Entity Linking System at TAC 2012. In Proceedings of the Text Analysis Conference 2012.