combining similarities and regression for entity linking

DESCRIPTION

An outline of the UC3M participation in the TAC-KBP Entity Linking task in 2010. Joint work with Juan Perea and Paloma Martínez. This presentation was given at the Structural Biology Group at CNIO in June 2012 and includes some introductory slides.

TRANSCRIPT

Page 1: Combining Similarities and Regression for Entity Linking

César de Pablo Sanchez

Stat rosa pristina nomine, nomine nuda tenemus

Page 2: Combining Similarities and Regression for Entity Linking

1. Task definition: KBP and EL

2. System description

3. Results

4. Conclusions


Page 3: Combining Similarities and Regression for Entity Linking

Overview of previous work

Page 4: Combining Similarities and Regression for Entity Linking

Drug-Drug Interactions

Relation extraction

Anaphora resolution

Page 5: Combining Similarities and Regression for Entity Linking

OPINATOR – Opinion Mining

Sentiment-loaded dictionaries

Sentiment classification

Opinion summarization

Search/Navigation

Page 6: Combining Similarities and Regression for Entity Linking

Knowledge acquisition

List candidates for the Greek elections in June.


Page 8: Combining Similarities and Regression for Entity Linking

Knowledge acquisition

List candidates for the Greek elections in June.

What party does Tsipras represent?

How old is he?

What does Syriza mean?


Page 10: Combining Similarities and Regression for Entity Linking

Knowledge acquisition

List candidates for the Greek elections in June.

What party does Tsipras represent?

How old is he?

What does Syriza mean?

How old is Samaras?


Page 12: Combining Similarities and Regression for Entity Linking

1. Task definition: KBP and EL

2. System description

3. Results

4. Conclusions


Page 13: Combining Similarities and Regression for Entity Linking

Knowledge Base Population

César de Pablo, Juan Perea, Paloma Martínez


Page 14: Combining Similarities and Regression for Entity Linking

Knowledge Base Population


Page 15: Combining Similarities and Regression for Entity Linking

Knowledge Base Population

Knowledge Base – from a Wikipedia dump (2008):
● Title, name, type, id
● wiki text
● several facts as [name, value] pairs

Document collection:
● 1.3 million English newswire documents, published between 1994 and 2008
● 488,240 webpages
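To make the data shapes concrete, here is a minimal sketch of one KB entry as a Python record; the field names are illustrative, not the official TAC-KBP schema:

    from dataclasses import dataclass, field

    @dataclass
    class KBEntry:
        """One entry of the Wikipedia-derived (2008) knowledge base."""
        entity_id: str        # e.g. "E0421510"
        title: str            # Wikipedia article title
        name: str             # canonical entity name
        entity_type: str      # e.g. PER / ORG / GPE
        wiki_text: str        # article text, used later for context similarity
        facts: list = field(default_factory=list)  # [name, value] pairs

    rba = KBEntry("E0421510", "Reserve Bank of Australia",
                  "Reserve Bank of Australia", "ORG",
                  "The Reserve Bank of Australia is ...",
                  [["headquarters", "Sydney"]])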

Page 16: Combining Similarities and Regression for Entity Linking

IE = KBP?

QA = KBP?

Page 17: Combining Similarities and Regression for Entity Linking

IE = KBP?
● Accurate extraction of facts – not annotation
● Learn facts from a corpus – repetition is not important, but it helps confidence
● Asserting wrong information is bad
● Scalability
● Provenance

QA = KBP?

Page 18: Combining Similarities and Regression for Entity Linking

IE = KBP?
● Accurate extraction of facts – not annotation
● Learn facts from a corpus – repetition is not important, but it helps confidence
● Asserting wrong information is bad
● Scalability
● Provenance

QA = KBP?
● Slots are fixed but targets change
● Leverage knowledge from the KB
● Global resolution – ground information to the KB
● Avoid contradiction
● Detect novel info

Page 19: Combining Similarities and Regression for Entity Linking

Tasks at TAC-KBP

● Entity Linking – grounding entity mentions in document to KB entries

● Slot Filling – Learning attributes about target entities

Task 1: Slot Filling

Task 2: Entity Linking


Page 22: Combining Similarities and Regression for Entity Linking


Entity Linking: Example

<query id="EL006455">
  <name>Reserve Bank</name>
  <docid>eng-NG-31-100316-11150589</docid>
  <entity>E0700143</entity>
</query>

<query id="EL06472">
  <name>Reserve Bank</name>
  <docid>eng-NG-31-142262-10040510</docid>
  <entity>E0421510</entity>
</query>

For a name string and a document, determine which entity in the KB, if any, is being referred to by the name string.

Page 23: Combining Similarities and Regression for Entity Linking


Entity Linking: Example

<query id="EL006455">
  <name>Reserve Bank</name>
  <docid>eng-NG-31-100316-11150589</docid>
  <entity>E0700143</entity>
</query>

<query id="EL06472">
  <name>Reserve Bank</name>
  <docid>eng-NG-31-142262-10040510</docid>
  <entity>E0421510</entity>
</query>

KB entries:
E0421510: Reserve Bank of Australia
E0700143: Reserve Bank of India
…

NIL

For a name string and a document, determine which entity in the KB, if any, is being referred to by the name string.
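The queries above are plain XML, so a few lines of standard-library Python are enough to read them; a minimal sketch (the file name is illustrative):

    import xml.etree.ElementTree as ET

    def read_queries(path):
        """Yield (query_id, name_string, document_id) for each EL query."""
        root = ET.parse(path).getroot()
        for q in root.iter("query"):
            yield q.get("id"), q.findtext("name"), q.findtext("docid")

    for qid, name, docid in read_queries("el_queries.xml"):
        print(qid, name, docid)  # e.g. EL006455 Reserve Bank eng-NG-31-100316-11150589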

Page 24: Combining Similarities and Regression for Entity Linking


Focus on confusable entities:
● Ambiguous names: Reserve Bank, Alan Jackson, Fonda

Entity Linking: Challenges

Page 25: Combining Similarities and Regression for Entity Linking


Focus on confusable entities:
● Ambiguous names
● Multiple name variants: Saddam Hussain, Saddam Hussein

Entity Linking: Challenges

Page 26: Combining Similarities and Regression for Entity Linking


Focus on confusable entities:
● Ambiguous names
● Multiple name variants
● Acronym expansion: CDC, AZ

Entity Linking: Challenges

Page 27: Combining Similarities and Regression for Entity Linking


Focus on confusable entities:
● Ambiguous names
● Multiple name variants
● Acronym expansion
● Variety of cases: Centre for Disease Control, European Centre for Disease Control, AZ, Arizona, Astra Zeneca

Entity Linking: Challenges

Page 28: Combining Similarities and Regression for Entity Linking


Focus on confusable entities:
● Ambiguous names
● Multiple name variants
● Acronym expansion
● Variety of cases

● Pilot task – entity linking without text support
● Identify missing entities – then cluster (2011)

Entity Linking: Challenges

Page 29: Combining Similarities and Regression for Entity Linking

Name mention – document pairs
● Accuracy (micro) = num correct / num queries
● Accuracy (macro) = grouped by entity (2009)

Entity Linking: Evaluation

queries  NIL   set         genre       % NIL
3904     2229  eval 2009   news        0.571
1500     426   train 2010  web         0.284
2250     1230  eval 2010   news + web  0.547
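A minimal sketch of the two scores in Python, assuming dicts of per-query gold and predicted KB ids ("NIL" included); micro averages over queries, macro (the 2009 variant) averages per gold entity:

    from collections import defaultdict

    def micro_accuracy(gold, pred):
        """gold, pred: dicts mapping query_id -> KB id or 'NIL'."""
        return sum(pred[q] == g for q, g in gold.items()) / len(gold)

    def macro_accuracy(gold, pred):
        """Average the per-entity accuracies (queries grouped by gold entity)."""
        groups = defaultdict(list)
        for q, g in gold.items():
            groups[g].append(pred[q] == g)
        return sum(sum(v) / len(v) for v in groups.values()) / len(groups)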

Page 30: Combining Similarities and Regression for Entity Linking


● Supervised architecture
● Use similarities between objects or parts of them – avoid a wide feature vector

1) Candidate Entity Retrieval
2) Candidate Filtering
3) Validation (NIL classification)

uc3m EL system
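Read as pseudocode, the three stages chain as below; retrieve, rank and validate are placeholders for the components described on the following slides:

    def link_entity(query_name, document, retrieve, rank, validate):
        """Sketch of the three-stage uc3m EL pipeline."""
        candidates = retrieve(query_name)                # 1) candidate entity retrieval
        scored = rank(query_name, document, candidates)  # 2) candidate filtering/ranking
        best, confidence = max(scored, key=lambda s: s[1])
        # 3) validation: keep the best candidate or answer NIL
        return best if validate(query_name, document, best, confidence) else "NIL"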

Page 31: Combining Similarities and Regression for Entity Linking

uc3m EL system

Page 32: Combining Similarities and Regression for Entity Linking


● Each KB article is indexed with Lucene, using several indexes and fields:
  ● ALIAS – names plus aliases extracted from wiki slots: alias, abbreviation, website, etc.
  ● NER – named entities extracted from text: <id, ne, text>
  ● KB – entity slots <id, [(slot_name, slot_value)]>
  ● WIKIPEDIA – anchorList, category, redirect, outlinks, inlinks

● Each EL query is transformed into several Lucene queries – the result is a [KB name, score] list

1) Candidate Retrieval

Page 33: Combining Similarities and Regression for Entity Linking


● EL query: [Michael Jordan, eng-NG-31-100316-11150589]
● Lucene queries:
  ● name=Michael AND name=Jordan
  ● alias=Michael AND alias=Jordan
  ● abbr=Michael AND abbr=Jordan
● For each query:
  ● [EL0989789, Michael Jordan, 25.00]
  ● [EL6565356, Michael B. Jordan, 25.00]
  ● [EL6565356, Michael I. Jordan, 25.00]
  ● [EL6565356, Michael-Hakim Jordan, 25.00]
  ● [EL6565356, Jordan, 20.00]

1) Candidate Retrieval
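The per-field queries above are plain boolean strings, so generating them from a name mention is mechanical; a sketch (note that standard Lucene query syntax writes field:value rather than the field=value shorthand used on the slide):

    def lucene_queries(name, fields=("name", "alias", "abbr")):
        """Expand one EL name string into one boolean query per index field."""
        tokens = name.split()
        return [" AND ".join(f"{f}:{t}" for t in tokens) for f in fields]

    print(lucene_queries("Michael Jordan"))
    # ['name:Michael AND name:Jordan',
    #  'alias:Michael AND alias:Jordan',
    #  'abbr:Michael AND abbr:Jordan']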

Page 34: Combining Similarities and Regression for Entity Linking


● Classification problem: decide whether (EL query + text, KB name + wiki text) is a good match
● In fact, rank by prediction confidence
● Use similarity scores as features – normalized and unnormalized
● Use a cost-sensitive classifier
● Best results: model trees with linear regression leaves

2) Candidate Filtering
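A sketch of the filtering stage in scikit-learn terms; model trees with linear-regression leaves (as used in the system) have no direct scikit-learn equivalent, so a gradient-boosted regressor stands in, and the training matrix here is a random placeholder:

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    # Placeholder training data: one row of similarity features per
    # (query, candidate) pair; target 1.0 for the correct candidate, else 0.0.
    X_train = np.random.rand(200, 6)
    y_train = (np.random.rand(200) > 0.5).astype(float)
    model = GradientBoostingRegressor().fit(X_train, y_train)

    def rank(candidates, feature_rows):
        """Order candidates by predicted match confidence, best first."""
        scores = model.predict(feature_rows)
        return sorted(zip(candidates, scores), key=lambda cs: -cs[1])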

Page 35: Combining Similarities and Regression for Entity Linking


● Index-based scores: sim(EL query, KB entry), directly from the initial retrieval

● Context-similarity scores: sim(document, wikitext) or sim(document, slots)

● Name-similarity scores: sim(EL query, KB entry) – more expensive: equal, QcontainsE, EcontainsQ, Jaro, Jaro-Winkler, SLIM (based on SecondString)

Features
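The cheap name comparisons are easy to reproduce; equal and the two containment tests come straight from the slide, while difflib's ratio is a standard-library stand-in for the Jaro/Jaro-Winkler/SLIM scores that SecondString provides:

    from difflib import SequenceMatcher

    def name_features(query_name, entry_name):
        """String-similarity features for a (query, KB entry) name pair."""
        q, e = query_name.lower(), entry_name.lower()
        return {
            "equal": float(q == e),
            "QcontainsE": float(e in q),
            "EcontainsQ": float(q in e),
            "edit_ratio": SequenceMatcher(None, q, e).ratio(),  # Jaro-Winkler stand-in
        }

    print(name_features("Reserve Bank", "Reserve Bank of Australia"))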

Page 36: Combining Similarities and Regression for Entity Linking


● Classification – is the selected candidate good enough, or NIL?
● Positive examples – correct candidate examples
● Negative examples – top-ranked entities for queries that do not have a link in the KB
● Balanced dataset
● Best classifier: logistic regression

3) Validation
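The slide names logistic regression over a balanced set of top candidates, which in scikit-learn terms looks roughly like this (the feature vectors here are random placeholders):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # One row per query's top-ranked candidate; label 1 = correct link, 0 = NIL.
    # Negatives are top candidates of queries whose gold answer is NIL,
    # subsampled so both classes are balanced.
    X = np.random.rand(200, 4)            # placeholder feature vectors
    y = np.random.randint(0, 2, 200)      # placeholder labels
    nil_clf = LogisticRegression().fit(X, y)

    def validate(feature_row):
        """Return True to keep the candidate, False to answer NIL."""
        return bool(nil_clf.predict([feature_row])[0])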

Page 37: Combining Similarities and Regression for Entity Linking


● Influence of domain?

EL results – main

queries  type  news  web   news+web  Highest  Median
750      ORG   0.69  0.67  0.67      0.85     0.68
749      GPE   0.52  0.53  0.51      0.80     0.60
751      PER   0.82  0.76  0.85      0.96     0.85
2250     ALL   0.67  0.65  0.68      0.87     0.69


Page 39: Combining Similarities and Regression for Entity Linking


● GPE are particularly difficult (see the results table above)


Page 40: Combining Similarities and Regression for Entity Linking




EL results – NIL vs. non-NIL

queries  type   news  web   news+web  Highest  Median
2250     ALL    0.67  0.65  0.68      0.87     0.69
1020     noNIL  0.51  0.59  0.49
1230     NIL    0.81  0.70  0.82


Page 42: Combining Similarities and Regression for Entity Linking


● Including name similarity scores helped

EL results – pilot (w/o text)

queries  type   news (main)  news  +n-sim NIL  +n-sim all
2250     ALL    0.67         0.58  0.66        0.70
1020     noNIL  0.51         0.35  0.40        0.47
1230     NIL    0.81         0.77  0.88        0.88


Page 44: Combining Similarities and Regression for Entity Linking


● Prior on link probability/popularity (Stanford-UBC 2009, LCC 2010, Microsoft 2011)
● Learning-to-rank algorithms: ListNet (CUNY 2011)
● Expand queries: acronym expansion / coreference (NUS 2011)
● Unsupervised system – entity co-occurrence + PageRank (WebTLab 2010)
● Inductive EL – first cluster, then link (LCC 2011)
● Collective entity linking (Microsoft 2011)

EL systems comparison

Page 45: Combining Similarities and Regression for Entity Linking


● Supervised EL system
● Influence of training size – beware of the training data distribution
● Consider name similarities even for reranking
● Improve initial candidate retrieval
● Perform collective entity linking
● Efficiency?

Conclusion

Page 46: Combining Similarities and Regression for Entity Linking

Related tasks

● Cluster documents mentioning entities
● Entity coreference – document and cross-document
● Add missing links between Wikipedia pages
● Link entities to matching Wikipedia articles