Post on 25-Feb-2016
Gerhard Weikum, Max Planck Institute for Informatics
http://www.mpi-inf.mpg.de/~weikum/
Knowledge Harvesting from Text and Web Sources
Part 3: Knowledge Linking
Quiz Time
3-2
How many days do you need to visit all Shangri-La places on this planet?
Answer: 365
Source: geonames.org
Linked Data: RDF Triples on the Web
http://richard.cyganiak.de/2007/10/lod/lod-datasets_2011-09-19_colored.png
30 billion triples, 500 million links
3-4
Linked RDF Triples on the Web
[Figure: RDF graph linking several sources.
Nodes: dbpedia.org/resource/Ennio_Morricone, dbpedia.org/resource/Rome, rdf.freebase.com/ns/en.rome, data.nytimes.com/51688803696189142301, geonames.org/5134301/city_of_rome (Coord: N 43° 12' 46'' W 75° 27' 20''), yago/wordnet:Actor109765278, yago/wordnet:Artist109812338, yago/wikicategory:ItalianComposer, imdb.com/name/nm0910607/, imdb.com/title/tt0361748/.
Edge labels: owl:sameAs, dbpprop:citizenOf, rdf:type, rdf:subclassOf, prop:actedIn, prop:composedMusicFor.]
3-5
Linked RDF Triples on the Web
[Figure: the RDF graph around dbpedia.org/resource/Ennio_Morricone and dbpedia.org/resource/Rome again, now with rdf.freebase.com/ns/en.rome_ny as the Freebase node next to geonames.org/5134301/city_of_rome (Coord: N 43° 12' 46'' W 75° 27' 20'') and data.nytimes.com/51688803696189142301.]
Referential data quality?
Hand-crafted sameAs links?
Generated sameAs links?
3-6
Web Page in Standard HTML
http://schema.org/
Jane Doe
<img src="janedoe.jpg" />
Professor
20341 Whitworth Institute
405 Whitworth
Seattle WA 98052
(425) 123-4567
<a href="mailto:jane-doe@xyz.edu">jane-doe@xyz.edu</a>
Jane's home page:
<a href="http://www.janedoe.com">janedoe.com</a>
Graduate students:
<a href="http://www.xyz.edu/students/alicejones.html">Alice Jones</a>
<a href="http://www.xyz.edu/students/bobsmith.html">Bob Smith</a>
3-13
Web Page in HTML with Microdata
http://schema.org/
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Jane Doe</span>
  <img src="janedoe.jpg" itemprop="image" />
  <span itemprop="jobTitle">Professor</span>
  <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
    <span itemprop="streetAddress">20341 Whitworth Institute 405 N. Whitworth</span>
    <span itemprop="addressLocality">Seattle</span>,
    <span itemprop="addressRegion">WA</span>
    <span itemprop="postalCode">98052</span>
  </div>
  <span itemprop="telephone">(425) 123-4567</span>
  <a href="mailto:jane-doe@xyz.edu" itemprop="email">jane-doe@xyz.edu</a>
  Jane's home page:
  <a href="http://www.janedoe.com" itemprop="url">janedoe.com</a>
  Graduate students:
  <a href="http://www.xyz.edu/students/alicejones.html" itemprop="colleague">Alice Jones</a>
  <a href="http://www.xyz.edu/students/bobsmith.html" itemprop="colleague">Bob Smith</a>
</div>
3-14
Web-of-Data vs. Web-of-Contents
3-15
Critical for knowledge linkage: entity name ambiguity
more structured data combined with text boosted by knowledge harvesting methods
Embedding RDFa in Web Contents
May 2, 2011
Maestro Morricone will perform on the stage of the Smetana Hall to conduct the Czech National Symphony Orchestra and Choir. The concert will feature both Classical compositions and soundtracks such as the Ecstasy of Gold. In programme two concerts for July 14th and 15th.
<html … May 2, 2011
<div typeof=event:music>
  <span id="Maestro_Morricone">Maestro Morricone
    <a rel="sameAs" resource="dbpedia…/Ennio_Morricone"/></span>
  …
  <span property="event:location">Smetana Hall</span>
  …
  <span property="rdf:type" resource="yago:performance">The concert</span> will feature …
  <span property="event:date" content="14-07-2011"></span>July 1
</div>
RDF data and Web contents need to be interconnected.
RDFa & microformats provide the mechanism.
Need ways of creating more embedded RDF triples!
3-16
Outline
• Motivation
• Entity-Name Disambiguation
• Mapping Questions into Queries
• Entity Linkage
• Wrap-up
3-17
Named-Entity Disambiguation
Harry fought with you know who. He defeats the dark lord.
Three NLP tasks:
1) named-entity detection: segment & label by HMM or CRF (e.g. Stanford NER tagger)
2) co-reference resolution: link to preceding NP (trained classifier over linguistic features)
3) named-entity disambiguation: map each mention (name) to canonical entity (entry in KB)
Candidate entities: Harry Potter, Dirty Harry, Prince Harry of England, Lord Voldemort, The Who (band)
3-18
Named Entity Disambiguation
Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.
Mentions (surface names) and their candidate entities (meanings) in the KB:
• Sergio means Sergio_Leone
• Sergio means Serge_Gainsbourg
• Ennio means Ennio_Antonelli
• Ennio means Ennio_Morricone
• Eli means Eli_(bible)
• Eli means ExtremeLightInfrastructure
• Eli means Eli_Wallach
• Ecstasy means Ecstasy_(drug)
• Ecstasy means Ecstasy_of_Gold
• trilogy means Star_Wars_Trilogy
• trilogy means Lord_of_the_Rings
• trilogy means Dollars_Trilogy
• …
[Figure: mentions on the left, KB entities on the right: Eli (bible), Eli Wallach, Ecstasy of Gold, Ecstasy (drug), Dollars Trilogy, Lord of the Rings, Star Wars Trilogy, Benny Andersson, Benny Goodman. Which entity does each mention denote?]
3-19
Mention-Entity Graph
Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.
Candidate entities: Eli (bible), Eli Wallach; Ecstasy of Gold, Ecstasy (drug); Dollars Trilogy, Lord of the Rings, Star Wars
KB+Stats: weighted undirected graph with two types of nodes
Popularity(m,e):
• freq(e|m)
• length(e)
• #links(e)
Similarity(m,e):
• cos/Dice/KL (context(m), context(e))
• bag-of-words or language model: words, bigrams, phrases
3-20
Mention-Entity Graph
Same sentence, candidate entities, popularity weights, and similarity weights as before. Goal: a joint mapping of all mentions to entities.
3-21
Mention-Entity Graph
Same sentence and candidate entities; in addition to Popularity(m,e) with freq(e|m), length(e), #links(e) and Similarity(m,e), the graph gets entity-entity weights.
Coherence(e,e'):
• dist(types)
• overlap(links)
• overlap(anchor words)
3-22
Mention-Entity Graph
Coherence(e,e') by dist(types). Types of three candidate entities:
• Eli Wallach: American Jews, film actors, artists, Academy Award winners
• Ecstasy of Gold: Metallica songs, Ennio Morricone songs, artifacts, soundtrack music
• Dollars Trilogy: spaghetti westerns, film trilogies, movies, artifacts
3-23
Mention-Entity Graph
Coherence(e,e') by overlap(links). Links of three candidate entities:
• Eli Wallach: http://.../wiki/Dollars_Trilogy, http://.../wiki/The_Good,_the_Bad,_the_Ugly, http://.../wiki/Clint_Eastwood, http://.../wiki/Honorary_Academy_Award
• Ecstasy of Gold: http://.../wiki/The_Good,_the_Bad,_the_Ugly, http://.../wiki/Metallica, http://.../wiki/Bellagio_(casino), http://.../wiki/Ennio_Morricone
• Dollars Trilogy: http://.../wiki/Sergio_Leone, http://.../wiki/The_Good,_the_Bad,_the_Ugly, http://.../wiki/For_a_Few_Dollars_More, http://.../wiki/Ennio_Morricone
3-24
Mention-Entity Graph
Coherence(e,e') by overlap(anchor words). Anchor words of three candidate entities:
• Eli Wallach: The Magnificent Seven, The Good, the Bad, and the Ugly, Clint Eastwood, University of Texas at Austin
• Ecstasy of Gold: Metallica on Morricone tribute, Bellagio water fountain show, Yo-Yo Ma, Ennio Morricone composition
• Dollars Trilogy: For a Few Dollars More, The Good, the Bad, and the Ugly, Man with No Name trilogy, soundtrack by Ennio Morricone
3-25
Different Approaches
Combine Popularity, Similarity, and Coherence Features (Cucerzan: EMNLP'07, Milne/Witten: CIKM'08):
• for sim(context(m), context(e)): consider surrounding mentions and their candidate entities
• use their types, links, anchors as features of context(m)
• set m-e edge weights accordingly
• use greedy methods for solution
Collective Learning with Prob. Factor Graphs (Chakrabarti et al.: KDD'09):
• model P[m|e] by similarity and P[e1|e2] by coherence
• consider likelihood of P[m1 … mk | e1 … ek]
• factorize by all m-e pairs and e1-e2 pairs
• use hill-climbing, LP, etc. for solution
3-26
Joint Mapping
• Build mention-entity graph or joint-inference factor graph from knowledge and statistics in KB
• Compute high-likelihood mapping (ML or MAP) or dense subgraph such that each m is connected to exactly one e (or at most one e)
[Figure: mention-entity graph with weighted mention-entity and entity-entity edges.]
3-27
Mention-Entity Popularity Weights
Collect hyperlink anchor-text / link-target pairs from
• Wikipedia redirects
• Wikipedia links between articles
• Interwiki links between Wikipedia editions
• Web links pointing to Wikipedia articles
• …
Build statistics to estimate P[entity | name]
Need dictionary with entities' names:
• full names: Arnold Alois Schwarzenegger, Los Angeles, Microsoft Corporation
• short names: Arnold, Arnie, Mr. Schwarzenegger, New York, Microsoft, …
• nicknames & aliases: Terminator, City of Angels, Evil Empire, …
• acronyms: LA, UCLA, MS, MSFT
• role names: the Austrian action hero, Californian governor, the CEO of MS, …
• … plus gender info (useful for resolving pronouns in context): Bill and Melinda met at MS. They fell in love and he kissed her.
[Milne/Witten 2008, Spitkovsky/Chang 2012]
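The estimate of P[entity | name] from anchor-text / link-target pairs can be sketched as a relative-frequency count. This is a minimal illustration, not the implementation of the cited systems; the function name `build_prior` and the toy pairs are assumptions made up for the example.

```python
from collections import Counter, defaultdict


def build_prior(anchor_pairs):
    """Estimate P[entity | name] as a relative frequency over
    (anchor text, link target) pairs. Names are lower-cased."""
    counts = defaultdict(Counter)
    for name, entity in anchor_pairs:
        counts[name.lower()][entity] += 1
    prior = {}
    for name, ctr in counts.items():
        total = sum(ctr.values())
        prior[name] = {entity: c / total for entity, c in ctr.items()}
    return prior


# Toy anchor-text / link-target pairs (invented for illustration).
pairs = [
    ("LA", "Los_Angeles"), ("LA", "Los_Angeles"), ("LA", "Los_Angeles"),
    ("LA", "Louisiana"),
    ("Terminator", "Arnold_Schwarzenegger"),
    ("Terminator", "The_Terminator_(film)"),
]
prior = build_prior(pairs)
```

With these toy counts, `prior["la"]` gives Los_Angeles three quarters of the probability mass and Louisiana the rest.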
3-28
Mention-Entity Similarity Edges
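Precompute characteristic keyphrases q for each entity e: anchor texts or noun phrases in e's page with high PMI:
$$\mathrm{weight}(q,e) = \log \frac{\mathrm{freq}(q,e)}{\mathrm{freq}(q)\,\mathrm{freq}(e)}$$
Match keyphrase q of candidate e in context of mention m, combining the extent of partial matches with the weight of matched words:
$$\mathrm{score}(q \mid e) \sim \frac{\#\,\text{matching words}}{\text{length of cover}(q)} \left( \frac{\sum_{w \in \mathrm{cover}(q)} \mathrm{weight}(w \mid e)}{\sum_{w \in q} \mathrm{weight}(w \mid e)} \right)^{1+\gamma}$$
Compute overall similarity of context(m) and candidate e:
$$\mathrm{score}(e \mid m) \sim \sum_{q \,\in\, \mathrm{keyphrases}(e)\ \text{in}\ \mathrm{context}(m)} \mathrm{score}(q \mid e) \cdot \mathrm{dist}(\mathrm{cover}(q), m)$$
Example keyphrase: "Metallica tribute to Ennio Morricone"
Example context: The Ecstasy piece was covered by Metallica on the Morricone tribute album.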
3-29
Entity-Entity Coherence Edges
Precompute overlap of incoming links for entities e1 and e2:
$$\text{mw-coh}(e_1,e_2) \sim 1 - \frac{\log \max(|in(e_1)|, |in(e_2)|) - \log |in(e_1) \cap in(e_2)|}{\log |E| - \log \min(|in(e_1)|, |in(e_2)|)}$$
Alternatively compute overlap of anchor texts for e1 and e2:
$$\text{ngram-coh}(e_1,e_2) \sim \frac{|ngrams(e_1) \cap ngrams(e_2)|}{|ngrams(e_1) \cup ngrams(e_2)|}$$
or overlap of keyphrases, or similarity of bag-of-words, or …
Optionally combine with type distance of e1 and e2 (e.g., Jaccard index for type instances)
For special types of e1 and e2 (locations, people, etc.) use spatial or temporal distance
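The two coherence measures above can be sketched in a few lines. This is an illustration under assumptions: in-link sets and n-gram sets are given as Python sets, in-link sets are subsets of the entity collection E, and entities without shared in-links get coherence 0. The function names are hypothetical.

```python
import math


def mw_coherence(in1, in2, num_entities):
    """Link-overlap coherence: 1 - (log max - log overlap) / (log |E| - log min)."""
    common = in1 & in2
    if not common:
        return 0.0  # assumption: no shared in-links means no coherence
    bigger = max(len(in1), len(in2))
    smaller = min(len(in1), len(in2))
    numerator = math.log(bigger) - math.log(len(common))
    denominator = math.log(num_entities) - math.log(smaller)
    if denominator == 0:
        return 1.0  # both in-link sets cover the whole collection
    return max(0.0, 1.0 - numerator / denominator)


def ngram_coherence(ngrams1, ngrams2):
    """Jaccard overlap of two n-gram sets."""
    union = ngrams1 | ngrams2
    return len(ngrams1 & ngrams2) / len(union) if union else 0.0


# Toy in-link sets over a collection of 100 entities (invented numbers).
coh = mw_coherence({1, 2, 3, 4}, {3, 4, 5, 6}, num_entities=100)
```

Clamping at 0 is a design choice of this sketch so that the value can be used directly as a non-negative edge weight.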
3-30
Coherence Graph Algorithm
• Compute dense subgraph to maximize min weighted degree among entity nodes such that each m is connected to exactly one e (or at most one e)
• Greedy approximation: iteratively remove weakest entity and its edges
• Keep alternative solutions, then use local/randomized search
[J. Hoffart et al.: EMNLP'11]
[Figure: mention-entity graph with edge weights and the weighted degree of each entity node.]
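The greedy approximation can be sketched as follows. This is a simplified illustration and not the published AIDA algorithm: it removes the entity with the smallest weighted degree as long as this does not leave a mention without candidates, and it omits keeping alternative solutions and the final local/randomized search. The function name and all weights are invented.

```python
def greedy_disambiguate(me_weight, ee_weight):
    """me_weight: {(mention, entity): weight}
    ee_weight: {frozenset({e1, e2}): weight}
    Returns one entity per mention."""
    cands = {}
    for mention, entity in me_weight:
        cands.setdefault(mention, set()).add(entity)
    alive = {e for es in cands.values() for e in es}

    def degree(e):
        # weighted degree: mention-entity edges plus edges to surviving entities
        d = sum(w for (_, x), w in me_weight.items() if x == e)
        d += sum(w for pair, w in ee_weight.items() if e in pair and pair <= alive)
        return d

    while True:
        # an entity may go only if no mention would lose its last candidate
        removable = [e for e in alive
                     if all(len(es) > 1 for es in cands.values() if e in es)]
        if not removable:
            break
        weakest = min(removable, key=lambda e: (degree(e), e))
        alive.discard(weakest)
        for es in cands.values():
            es.discard(weakest)
    return {m: max(es, key=lambda e: (me_weight[(m, e)], e))
            for m, es in cands.items()}


me = {("Eli", "Eli_(bible)"): 0.5, ("Eli", "Eli_Wallach"): 0.3,
      ("Ecstasy", "Ecstasy_(drug)"): 0.6, ("Ecstasy", "Ecstasy_of_Gold"): 0.3,
      ("trilogy", "Star_Wars"): 0.5, ("trilogy", "Dollars_Trilogy"): 0.3}
ee = {frozenset({"Eli_Wallach", "Ecstasy_of_Gold"}): 0.8,
      frozenset({"Eli_Wallach", "Dollars_Trilogy"}): 0.9,
      frozenset({"Ecstasy_of_Gold", "Dollars_Trilogy"}): 0.9,
      frozenset({"Ecstasy_(drug)", "Star_Wars"}): 0.1}
mapping = greedy_disambiguate(me, ee)
```

In this toy graph the more popular candidates (Eli_(bible), Ecstasy_(drug), Star_Wars) are removed because the coherent group around the Dollars Trilogy has much larger weighted degrees.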
3-31
3-34
Alternative: Random Walks
• for each mention run random walks with restart (like personalized PR with jumps to start mention(s))
• rank candidate entities by stationary visiting probability
• very efficient, decent accuracy
[Figure: mention-entity graph with edge weights and the transition probabilities derived from them.]
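A random walk with restart can be approximated by power iteration. The sketch below is an assumption-laden illustration: the graph is a dictionary of weighted adjacency lists in which every neighbor is also a key, the restart probability 0.15 is an arbitrary choice, and the toy weights are invented.

```python
def random_walk_with_restart(graph, start, restart=0.15, iters=100):
    """Approximate the stationary visiting probabilities of a walk that
    follows edges proportionally to their weight and jumps back to
    `start` with probability `restart`."""
    prob = {n: 1.0 if n == start else 0.0 for n in graph}
    for _ in range(iters):
        nxt = {n: 0.0 for n in graph}
        for node, p in prob.items():
            total = sum(graph[node].values())
            if total == 0:
                nxt[start] += (1 - restart) * p  # dead end: jump to start
                continue
            for neighbor, w in graph[node].items():
                nxt[neighbor] += (1 - restart) * p * w / total
        nxt[start] += restart  # restart mass (probabilities sum to 1)
        prob = nxt
    return prob


graph = {
    "m:Eli": {"Eli_(bible)": 1.0, "Eli_Wallach": 3.0},
    "Eli_(bible)": {"m:Eli": 1.0},
    "Eli_Wallach": {"m:Eli": 3.0},
}
visits = random_walk_with_restart(graph, start="m:Eli")
ranking = sorted((n for n in visits if not n.startswith("m:")),
                 key=visits.get, reverse=True)
```

Candidates are then ranked by their visiting probability; in this toy graph Eli_Wallach comes first because its edge weight is three times larger.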
3-35
AIDA: Accurate Online Disambiguation
http://www.mpi-inf.mpg.de/yago-naga/aida/
[Screenshots of the AIDA online demo, slides 3-36 to 3-43, including "AIDA: Very Difficult Example".]
Some NED Online Tools
• J. Hoffart et al.: EMNLP 2011, VLDB 2011
  https://d5gate.ag5.mpi-sb.mpg.de/webaida/
• P. Ferragina, U. Scaiella: CIKM 2010
  http://tagme.di.unipi.it/
• R. Isele, C. Bizer: VLDB 2012
  http://spotlight.dbpedia.org/demo/index.html
• Reuters Open Calais
  http://viewer.opencalais.com/
• S. Kulkarni, A. Singh, G. Ramakrishnan, S. Chakrabarti: KDD 2009
  http://www.cse.iitb.ac.in/soumen/doc/CSAW/
• D. Milne, I. Witten: CIKM 2008
  http://wikipedia-miner.cms.waikato.ac.nz/demos/annotate/
• perhaps more
Some use Stanford NER tagger for detecting mentions:
http://nlp.stanford.edu/software/CRF-NER.shtml
3-44
NED: Experimental Evaluation
Benchmark:
• Extended CoNLL 2003 dataset: 1400 newswire articles
• originally annotated with mention markup (NER), now with NED mappings to Yago and Freebase
• difficult texts:
  … Australia beats India … → Australian_Cricket_Team
  … White House talks to Kremlin … → President_of_the_USA
  … EDS made a contract with … → HP_Enterprise_Services
Results:
• Best: AIDA method with prior+sim+coh + robustness test
• 82% precision @100% recall, 87% mean average precision
• Comparison to other methods: see paper
J. Hoffart et al.: Robust Disambiguation of Named Entities in Text, EMNLP 2011
http://www.mpi-inf.mpg.de/yago-naga/aida/
3-45
Ongoing Research & Remaining Challenges
• More efficient graph algorithms (multicore, etc.)
• Short and difficult texts:
  • tweets, headlines, etc.
  • fictional texts: novels, song lyrics, etc.
  • incoherent texts
• Disambiguation beyond entity names:
  • coreferences: pronouns, paraphrases, etc.
  • common nouns, verbal phrases (general WSD)
• Leverage deep-parsing structures, leverage semantic types
  Example: Page played Kashmir on his Gibson (dependency labels in the figure: subj, obj, mod)
• Allow mentions of unknown entities, mapped to null
• Structured Web data: tables and lists
3-46
General Word Sense Disambiguation
Which song writers covered ballads written by the Stones?
Candidate senses shown in the figure:
• {songwriter, composer}
• {cover, perform}
• {cover, report, treat}
• {cover, help out}
3-47
Handling Out-of-Wikipedia Entities
Cave composed haunting songs like Hallelujah, O Children, and the Weeping Song.
Candidate entities:
• wikipedia.org/Nick_Cave
• wikipedia.org/Good_Luck_Cave
• last.fm/Nick_Cave/Hallelujah
• wikipedia/Hallelujah_(L_Cohen)
• wikipedia/Hallelujah_Chorus
• last.fm/Nick_Cave/O_Children
• wikipedia/Children_(2011 film)
• last.fm/Nick_Cave/Weeping_Song
• wikipedia.org/Weeping_(song)
3-48
Handling Out-of-Wikipedia Entities
[Figure: the sentence "Cave composed haunting songs like Hallelujah, O Children, and the Weeping Song." with its last.fm and Wikipedia candidate entities, each annotated with keyphrases.]
Keyphrase sets shown in the figure:
• Gunung Mulu National Park, Sarawak Chamber, largest underground chamber
• eerie violin, Bad Seeds, No More Shall We Part
• Bad Seeds, No More Shall We Part, Murder Songs
• Leonard Cohen, Rufus Wainwright, Shrek and Fiona
• Nick Cave & Bad Seeds, Harry Potter 7 movie, haunting choir
• Nick Cave, Murder Songs, P.J. Harvey, Nick and Blixa duet
• Messiah oratorio, George Frideric Handel
• Dan Heymann, apartheid system
• South Korean film
3-49
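Handling Out-of-Wikipedia Entities
• Characterize all entities (and mentions) by sets of keyphrases
• Entity coherence then becomes keyphrase overlap; no need for href link data
• For each mention add a "self" candidate: out-of-KB entity with keyphrases computed by Web search
Efficient comparison of two keyphrase sets: two-stage hashing, using min-hash sketches and LSH
For entities e, f with phrase weights $\varphi_e(p)$, $\varphi_f(q)$:
$$\mathrm{KORE}(e,f) = \frac{\sum_{p \in e,\, q \in f} \mathrm{PO}(p,q)^2 \cdot \min(\varphi_e(p), \varphi_f(q))}{\sum_{p \in e} \varphi_e(p) + \sum_{q \in f} \varphi_f(q)}$$
For phrases p, q with word weights $\gamma_p(w)$, $\gamma_q(w)$:
$$\mathrm{PO}(p,q) = \frac{\sum_{w \in p \cap q} \min(\gamma_p(w), \gamma_q(w))}{\sum_{w \in p \cup q} \max(\gamma_p(w), \gamma_q(w))}$$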
[J. Hoffart et al.: CIKM‘12]
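The keyphrase-overlap relatedness can be computed directly from its definition. The sketch below is an exact, quadratic-time illustration without the min-hash / LSH speed-up; it assumes each phrase is a dictionary from word to weight and each entity is a list of (phrase weight, phrase) pairs. These data layouts and the toy weights are assumptions for the example.

```python
def phrase_overlap(p, q):
    """Weighted Jaccard overlap of two phrases given as {word: weight}."""
    shared = p.keys() & q.keys()
    union = p.keys() | q.keys()
    num = sum(min(p[w], q[w]) for w in shared)
    den = sum(max(p.get(w, 0.0), q.get(w, 0.0)) for w in union)
    return num / den if den else 0.0


def kore(e, f):
    """Keyphrase-overlap relatedness of two entities, each given as a
    list of (phrase_weight, {word: weight}) pairs."""
    num = sum(phrase_overlap(p, q) ** 2 * min(wp, wq)
              for wp, p in e for wq, q in f)
    den = sum(wp for wp, _ in e) + sum(wq for wq, _ in f)
    return num / den if den else 0.0


# Toy keyphrases with unit weights (invented for illustration).
song = [(1.0, {"bad": 1.0, "seeds": 1.0})]
artist = [(1.0, {"bad": 1.0, "seeds": 1.0}),
          (1.0, {"murder": 1.0, "songs": 1.0})]
relatedness = kore(song, artist)
```

Squaring the phrase overlap penalizes weak partial matches, so only phrase pairs that share most of their words contribute noticeably.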
3-50
Variants of NED at Web Scale
• How to run this on big batch of 1 million input texts?
  partition inputs across distributed machines, organize dictionary appropriately, …, exploit cross-document contexts
• How to deal with inputs from different time epochs?
  consider time-dependent contexts, map to entities of proper epoch (e.g. harvested from Wikipedia history)
• How to handle Web-scale inputs (100 million pages) restricted to a set of interesting entities? (e.g. tracking politicians and companies)
Tools can map short text onto entities in a few seconds
3-51
3-52
Word Sense Disambiguation for Question-to-Query Translation
Question: "Who played in Casablanca and was married to a writer born in Rome?"
Translation with WSD into SPARQL:
Select ?p Where {
  ?p type person .
  ?p actedIn Casablanca_(film) .
  ?p isMarriedTo ?w .
  ?w type writer .
  ?w bornIn Rome . }
[Figure: Question, translation with WSD, SPARQL, KB, Answer; query variables ?p and ?w.]
DEANNA in a Nutshell
QA system DEANNA [M. Yahya et al.: EMNLP'12]
www.mpi-inf.mpg.de/yago-naga/deanna/
[Figure: DEANNA takes a question and produces a SPARQL query that is evaluated on the KB to return answers. Pipeline stages: Phrase detection, Phrase mapping, Dependency detection, Joint disambiguation, Query generation.]
3-54
3-57
DEANNA Components
[Figure: DEANNA pipeline from question to SPARQL query over the KB to answers: Phrase detection, Phrase mapping, Dependency detection, Joint disambiguation, Query generation; components numbered 1 to 4.]
3-58
Phrase Detection
Detected phrases: Who; played; played in; Casablanca; married; married to; was married to; a writer
Concepts (entities & classes): dictionary-based
Relations: mainly use Reverb [Fader et al.: EMNLP'11]: V | VP | VW*P
  … was/VBD married/VBN to/TO a/DT …
Concept | Phrase
Casablanca | Casablanca
Casablanca | Casablanca, Morocco
Casablanca_(film) | Casablanca the film
Casablanca_(film) | Casablanca
… | …
3-59
3-60
Phrase Mapping
Phrases: Casablanca; played; played in
Candidate semantic items: e:White_House, e:Casablanca, e:Casablanca_(film), e:Played_(film), r:actedIn, r:hasMusicalRole
Concepts (entities & classes): dictionary-based
Concept | Phrase
Casablanca | Casablanca
Casablanca | Casablanca, Morocco
Casablanca_(film) | Casablanca the film
Casablanca_(film) | Casablanca
Played_(film) | Played
Relations: dictionary-based
Relation | Phrase
actedIn | acted in
actedIn | played in
hasMusicalRole | plays
hasMusicalRole | mastered
3-61
3-62
Dependency Detection
Look for specific patterns in dependency graph [de Marneffe et al.: LREC'06]
Example dependency path: writer -partmod-> born -prep-> in -pobj-> Rome
Resulting question unit q1 with phrases: a writer; was born; born; Rome
Candidate semantic items: c:writer, r:bornInPlace, r:bornOnDate, e:Max_Born, e:Born_(film), e:Sydne_Rome, e:Rome
3-63
Disambiguation Graph
q-nodes: q1, q2, q3
Phrase nodes: Who; played; played in; Casablanca; married; married to; was married to; a writer; was born; born; Rome
Semantic nodes: c:person, e:Played_(film), r:actedIn, r:hasMusicalRole, e:White_House, e:Casablanca, e:Casablanca_(film), e:Married_(series), c:married_person, r:isMarriedTo, c:writer, r:bornInPlace, r:bornOnDate, e:Max_Born, e:Born_(film), e:Sydne_Rome, e:Rome
3-64
3-65
Joint Disambiguation - ILP
• ILP: Integer Linear Programming
• maximize $\alpha \sum_{i,j} w_{i,j} Y_{i,j} + \beta \sum_{k,l} v_{k,l} Z_{k,l} + \dots$
• subject to: no token in multiple phrases, triples observe type constraints, …
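To make the objective concrete, the sketch below maximizes the two weighted sums by exhaustive search instead of an ILP solver, which only works for tiny inputs. It enforces a single constraint (each phrase maps to at most one semantic node) and ignores the token and type constraints. The function name, weights, and coherence values are invented for illustration.

```python
from itertools import product


def best_assignment(sim, coh, alpha=1.0, beta=1.0):
    """sim: {phrase: {semantic_node: similarity weight}}
    coh: {frozenset({node1, node2}): coherence weight}
    Tries every assignment of phrases to one node or to None."""
    phrases = list(sim)
    options = [[None] + list(sim[p]) for p in phrases]
    best, best_score = None, float("-inf")
    for choice in product(*options):
        picked = [(p, s) for p, s in zip(phrases, choice) if s is not None]
        nodes = [s for _, s in picked]
        score = alpha * sum(sim[p][s] for p, s in picked)
        score += beta * sum(coh.get(frozenset((a, b)), 0.0)
                            for i, a in enumerate(nodes) for b in nodes[i + 1:])
        if score > best_score:
            best, best_score = dict(picked), score
    return best, best_score


sim = {"Casablanca": {"e:Casablanca": 0.6, "e:Casablanca_(film)": 0.4},
       "played in": {"r:actedIn": 0.5, "r:hasMusicalRole": 0.5}}
coh = {frozenset({"e:Casablanca_(film)", "r:actedIn"}): 1.0}
assignment, score = best_assignment(sim, coh)
```

Here the coherence between the film and actedIn outweighs the higher similarity weight of the city, which is the effect joint disambiguation is meant to capture.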
3-66
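Joint Disambiguation – Objective
$\alpha \sum_{i,j} w_{i,j} Y_{i,j} + \beta \sum_{k,l} v_{k,l} Z_{k,l}$
[Figure: q-node q1 connected to phrase nodes (a writer, was born, born, Rome); similarity edges with priors connect phrase nodes to semantic nodes (c:writer, r:bornInPlace, r:bornOnDate, e:Max_Born, e:Born_(film), e:Sydne_Rome, e:Rome); coherence edges connect semantic nodes.]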
3-67
3-68
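Joint Disambiguation – Constraints
A phrase node can be assigned to only one semantic node.
Example: phrase node a = Casablanca with semantic nodes 1 = e:White_House, 2 = e:Casablanca, 3 = e:Casablanca_(film) and decision variables Ya,1, Ya,2, Ya,3, of which at most one can be set to 1.
Objective: $\alpha \sum_{i,j} w_{i,j} Y_{i,j} + \beta \sum_{k,l} v_{k,l} Z_{k,l}$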
3-69
Joint Disambiguation – Constraints
Classes translate to type-constrained variables.
Every semantic triple should have a class to join & project!
person actedIn Casablanca_(film)
▼
?x type person . ?x actedIn Casablanca_(film)
[Figure: q-node q1 with phrase nodes (a writer, was born, Rome) and semantic nodes (c:writer, e:The_Writer (magazine), r:bornInPlace, r:bornOnDate, e:Sydne_Rome, e:Rome).]
3-70
3-71
Structured Query Generation
SELECT ?p WHERE {
  ?w type writer .
  ?w bornIn Rome .
  ?p type person .
  ?p actedIn Casablanca_(film) .
  ?p isMarriedTo ?w }
[Figure: disambiguation graph after joint disambiguation, with q-nodes q1, q2, q3, the phrase nodes Who, played in, Casablanca, was married to, a writer, was born, Rome, and the selected semantic nodes c:person, r:actedIn, e:Casablanca_(film), r:isMarriedTo, c:writer, r:bornIn, e:Rome.]
3-72
3-73
Entity Linkage for the Web of Data
[Figure: the RDF graph from the motivation part, linking dbpedia.org/resource/Ennio_Morricone, dbpedia.org/resource/Rome, rdf.freebase.com/ns/en.rome_ny, data.nytimes.com/51688803696189142301, geonames.org/5134301/city_of_rome (Coord: N 43° 12' 46'' W 75° 27' 20''), the YAGO classes, and the imdb.com entries.]
sameAs links? Where? How?
30 billion triples, 500 million links
3-74
Record Linkage (Entity Resolution)
Example records (record 1, record 2, record 3, …, record N), each with authors, affiliation, title, and venue:
• Susan B. Davidson, Peter Buneman, Yi Chen | University of Pennsylvania | Issues in … | Int. Conf. on Very Large Data Bases
• O.P. Buneman, S. Davison, Y. Chen | U Penn | Issues in … | VLDB Conf.
• P. Baumann, S. Davidson, Cheng Y. | Penn State | Issues in … | PVLDB
• Y. Davidson, S. Chen | Penn Station | Issues in … | XLDB Conference
• … Sean Penn …
Halbert L. Dunn: Record Linkage. American Journal of Public Health, 1946.
H.B. Newcombe et al.: Automatic Linkage of Vital Records. Science, 1959.
I.P. Fellegi, A.B. Sunter: A Theory of Record Linkage. J. of American Statistical Soc., 1969.
Find equivalence classes of entities, and records, based on:
• similarity of values (edit distance, n-gram overlap, etc.)
• joint agreement of linkage
Methods: similarity joins, grouping/clustering, collective learning, etc.; often domain-specific customization (similarity measures etc.)
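Two of the value-similarity measures named above, edit distance and n-gram overlap, can be sketched as follows. Character bigrams and lower-casing are arbitrary choices made for this illustration; real systems tune such measures per domain.

```python
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def ngrams(s, n=2):
    """Set of character n-grams of the lower-cased string."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_overlap(a, b, n=2):
    """Jaccard overlap of the character n-gram sets of a and b."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0


close = ngram_overlap("S. Davidson", "S. Davison")
far = ngram_overlap("S. Davidson", "Sean Penn")
```

On the example records, "S. Davidson" and "S. Davison" are one edit apart and share most bigrams, while "S. Davidson" and "Sean Penn" share none.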
3-75
Entity Linkage via Markov Logic
prob. / uncertain rules:
sameTitle(x,y) ∧ sameAuths(x,y) ∧ sameVenue(x,y) ⇒ sameAs(x,y)
sameTitle(x,y) ∧ sameAuths(x,y) ∧ sameAffil(x,y) ⇒ sameAs(x,y)
overlapAuths(x,y) ∧ sameAffil(x,y) ⇒ sameAuths(x,y)
sameAs(rec1.auth1, rec2.auth1) [0.2]
sameAs(rec1.auth1, rec2.auth2) [0.9]
…
• specify in Markov Logic or as factor graph
• generate MRF (or …) and solve by MCMC (or …)
(Singla/Domingos: ICDM'06, Hall/Sutton/McCallum: KDD'08)
3-76
sameAs-Link Test across Sources
[Figure: entity ei in LOD source 1 and entity ej in LOD source 2, each with its neighborhood; is there a sameAs link?]
similarity: sim(ei, ej)
neighborhoods: N(ei), N(ej)
coherence: coh(x, y) for x ∈ N(ei), y ∈ N(ej)
sameAs(ei, ej) ⇐ sim(ei, ej) ≥ … ∧ Σx,y coh(x,y) ≥ …
record linkage problem
3-77
sameAs-Link Generation across Sources
[Figure: entities ei, ej, ek in LOD sources 1, 2, 3 with candidate sameAs links between each pair.]
3-78
sameAs-Link Generation across Sources
[Figure: entities ei, ej, ek in LOD sources 1, 2, 3 with candidate sameAs links.]
sim(ei, ej): likelihood of being equivalent, mapped to [-1,1]
coh(x, y): likelihood of being mentioned together, mapped to [-1,1]
0-1 decision variables: Xij … Xjk … Xik …
objective function:
$$\max \sum_{ij} \Big( X_{ij}\, sim(e_i,e_j) + X_{ij} \sum_{x \in N_i,\, y \in N_j} coh(x,y) \Big) + \sum_{jk} (\dots) + \sum_{ik} (\dots)$$
constraints:
$$\sum_j X_{ij} \le 1 \quad \text{for all } i, \quad \dots$$
$$(1 - X_{ij}) + (1 - X_{jk}) \ge (1 - X_{ik}) \quad \text{for all } i, j, k, \quad \dots$$
• Joint Mapping
• ILP model or prob. factor graph or …
• Use your favorite solver
• How? At Web scale???
3-79
Similarity Flooding
Graph with record / entity pairs as nodes (sameAs candidates) and edges connecting related pairs:
R(x,y) and S(u,w) and sameAs candidates (x,u), (y,w) ⇒ edge between (x,u) and (y,w)
• Node weights: belief strength in sameAs(x,u)
• Edge weights: degree of relatedness
Iterate until convergence:
• propagate node weights to neighbors
• new node weight is linear combination of inputs
Related to belief propagation algorithms, label propagation, etc.
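The iteration can be sketched as a fixpoint computation over the pair graph. The update rule used here (initial weight plus current weight plus weighted neighbor beliefs, then normalization by the maximum) is one common variant and an assumption of this sketch; the node names and weights are invented.

```python
def similarity_flooding(init, edges, iters=50, tol=1e-9):
    """init: {pair_node: initial belief in sameAs}
    edges: {pair_node: {neighbor_pair_node: relatedness weight}}
    Returns the normalized node weights after convergence or `iters` rounds."""
    sigma = dict(init)
    for _ in range(iters):
        nxt = {}
        for node in sigma:
            incoming = sum(w * sigma[nb]
                           for nb, w in edges.get(node, {}).items())
            nxt[node] = init[node] + sigma[node] + incoming
        top = max(nxt.values())
        if top <= 0:
            return nxt
        nxt = {n: v / top for n, v in nxt.items()}  # normalize to [0, 1]
        converged = max(abs(nxt[n] - sigma[n]) for n in sigma) < tol
        sigma = nxt
        if converged:
            break
    return sigma


author_pair = ("S. Davidson", "S. Davison")
venue_pair = ("VLDB Conf.", "PVLDB")
other_pair = ("U Penn", "Penn Station")
init = {author_pair: 0.8, venue_pair: 0.5, other_pair: 0.5}
edges = {author_pair: {venue_pair: 1.0}, venue_pair: {author_pair: 1.0}}
beliefs = similarity_flooding(init, edges)
```

The venue pair starts with the same belief as the unrelated pair but ends higher, because it receives support from the connected author pair.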
3-80
Blocking of Match Candidates
Avoid computing O(n²) similarities between records / entities
• Group potentially matching records
• Run more accurate & more expensive method per group, at risk of missing some matches
• Iterative Blocking: distribute found matches to other blocks, then repeat per-block runs
• Multi-type Joint Resolution: blocks of different record types (author, venue, etc.) propagate matches to other types, then repeat runs
Example records:
ID | Name | Zip | Email
1 | John Doe | 49305 | jdoe@yahoo
2 | John Doe | 94305 | jdoe@gmail
3 | Jon Foe | 94305 | jdoe@yahoo
4 | Jane Foe | 12345 | jane@msn
5 | Jane Fog | 12345 | jane@msn
Group by zip code: {1,4,5} and {2,3} ⇒ sameAs(4,5), sameAs(2,3)
Group by 1st char of last name: {1,2} and {3,4,5} ⇒ sameAs(1,2), sameAs(4,5)
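Blocking on the five example records can be sketched as grouping by a key and comparing only pairs within a block. Note one difference that this sketch assumes: grouping by the exact zip code puts record 1 (zip 49305) into a block of its own, whereas the slide shows it together with records 4 and 5; the candidate pairs that lead to sameAs(2,3) and sameAs(4,5) are the same. The helper name `candidate_pairs` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

records = {
    1: ("John Doe", "49305", "jdoe@yahoo"),
    2: ("John Doe", "94305", "jdoe@gmail"),
    3: ("Jon Foe", "94305", "jdoe@yahoo"),
    4: ("Jane Foe", "12345", "jane@msn"),
    5: ("Jane Fog", "12345", "jane@msn"),
}


def candidate_pairs(records, key):
    """Group record ids by a blocking key and return all pairs of ids
    that share a block; only these pairs are compared in detail."""
    blocks = defaultdict(list)
    for rid, rec in records.items():
        blocks[key(rec)].append(rid)
    return {pair for ids in blocks.values()
            for pair in combinations(sorted(ids), 2)}


by_zip = candidate_pairs(records, key=lambda r: r[1])
by_initial = candidate_pairs(records, key=lambda r: r[0].split()[-1][0])
all_candidates = by_zip | by_initial
```

The two blockings together yield 5 candidate pairs instead of the 10 pairs of a full pairwise comparison.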
3-81
Iterative Blocking for Joint Resolution with Multiple Entity Types [Whang et al. 2012]
[Figure: blocks of Publications, Authors, and Venues after round 1 and after round 2.]
heuristics for constructing efficient execution plans, exploiting "influence graph"
3-82
RiMOM Method
Risk Minimization Based Ontology Matching Method for joint matching of concepts (entities, classes) & properties (relations)
[Juanzi Li et al.: TKDE'09]
Strategies using variety of matching criteria:
• Linguistic-based:
  • edit distance
  • context vector distance
  • …
• Structure-based:
  • similarity flooding
  • …
keg.cs.tsinghua.edu.cn/project/RiMOM/
3-83
COMA++ Framework [E. Rahm et al.]
• Joint schema alignment and entity matching
• Comprehensive architecture with many plug-ins for customizing to specific application
• Blocked matchers parallelizable on Map-Reduce platform
dbs.uni-leipzig.de/Research/coma.html/
3-84
PARIS Method [F. Suchanek et al. 2012]
webdam.inria.fr/paris/
Probabilistic Alignment of Relations, Instances, and Schema: joint reasoning on sameEntity, sameRelation, sameClass with direct probabilistic assessment
• P[literal1 ≡ literal2] = … same constant value
• P[r1 ⊆ r2] = … sub-relation
• P[e1 ≡ e2] = … same entity
• P[c1 ⊆ c2] = … sub-class
Iterate through probabilistic equations; empirically converges to fixpoint
Matching entities of DBpedia with YAGO: 90% precision, 73% recall, after 4 iterations, 5 h run-time
3-85
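PARIS Method [F. Suchanek et al. 2012]
webdam.inria.fr/paris/
P[literal1 ≡ literal2] = … based on similarity and co-occurrence
Same entity, if relations were already aligned:
$$P[x \equiv y] = 1 - \prod_{r(x,u),\, r(y,w)} \big(1 - fun(r^{-1}) \cdot P[u \equiv w]\big)$$
considering negative evidence:
$$P[x \equiv y] = \Big(1 - \prod_{r(x,u),\, r(y,w)} \big(1 - fun(r^{-1}) \cdot P[u \equiv w]\big)\Big) \cdot \prod_{r(x,u)} \Big(1 - fun(r) \cdot \prod_{r(y,w)} \big(1 - P[u \equiv w]\big)\Big)$$
where fun(r) is the degree to which r is a function:
$$fun(r) = \frac{\#x: \exists y: r(x,y)}{\#x,y: r(x,y)}$$
Example: P[Shangri-La ≡ Zhongdian] = … fun(bornIn⁻¹) · P[Jet Li ≡ Li Lianjie] …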
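The functionality fun(r) and the positive-evidence part of the same-entity equation can be sketched on a toy example. Assumptions of this sketch: relations are already aligned by name, facts are sets of (subject, object) pairs, the functionality is computed on the first KB only, and all facts and probabilities are invented.

```python
def functionality(facts):
    """fun(r): number of distinct subjects divided by number of facts."""
    pairs = set(facts)
    subjects = {x for x, _ in pairs}
    return len(subjects) / len(pairs) if pairs else 0.0


def inverse(facts):
    """Facts of the inverse relation r^-1."""
    return {(y, x) for x, y in facts}


def same_entity_prob(x, y, kb1, kb2, p_equal):
    """P[x = y] = 1 - prod over r(x,u) in kb1 and r(y,w) in kb2 of
    (1 - fun(r^-1) * P[u = w]); positive evidence only."""
    prod = 1.0
    for r, facts1 in kb1.items():
        fun_inv = functionality(inverse(facts1))
        for a, u in facts1:
            if a != x:
                continue
            for b, w in kb2.get(r, set()):
                if b == y:
                    prod *= 1.0 - fun_inv * p_equal.get((u, w), 0.0)
    return 1.0 - prod


kb1 = {"bornIn": {("Jet_Li", "Beijing"), ("Zhang_Ziyi", "Beijing")}}
kb2 = {"bornIn": {("Li_Lianjie", "Peking")}}
p_equal = {("Beijing", "Peking"): 0.9}  # current belief that the cities match
p = same_entity_prob("Jet_Li", "Li_Lianjie", kb1, kb2, p_equal)
```

In the toy KB, bornIn is a function (fun = 1.0) but its inverse is not (fun = 0.5, two people born in the same city), so a shared birthplace is only moderate evidence that two people are the same.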
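PARIS Method [F. Suchanek et al. 2012]
webdam.inria.fr/paris/
Sub-relation P[s ⊆ r], if entities were already resolved:
$$P[s \subseteq r] = \frac{\#x,u: s(x,u) \wedge r(x,u)}{\#x,u: s(x,u)}$$
with same-entity probabilities:
$$P[s \subseteq r] = \frac{\sum_{s(x,u)} \Big(1 - \prod_{r(y,w)} \big(1 - P[x \equiv y]\, P[u \equiv w]\big)\Big)}{\sum_{s(x,u)} \Big(1 - \prod_{y,w} \big(1 - P[x \equiv y]\, P[u \equiv w]\big)\Big)}$$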
3-87
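PARIS Method [F. Suchanek et al. 2012]
webdam.inria.fr/paris/
Same entity revisited, with sub-relation probability:
$$P[x \equiv y] = 1 - \prod_{s(x,u),\, r(y,w)} \big(1 - P[s \supseteq r] \cdot fun(s^{-1}) \cdot P[u \equiv w]\big) \cdot \big(1 - P[s \subseteq r] \cdot fun(r^{-1}) \cdot P[u \equiv w]\big)$$
considering negative evidence, multiplied by:
$$\prod_{s(x,u),\, r} \Big(1 - P[s \supseteq r] \cdot fun(s) \cdot \prod_{r(y,w)} \big(1 - P[u \equiv w]\big)\Big) \cdot \Big(1 - P[s \subseteq r] \cdot fun(r) \cdot \prod_{r(y,w)} \big(1 - P[u \equiv w]\big)\Big)$$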
3-88
PARIS Method [F. Suchanek et al. 2012]
webdam.inria.fr/paris/
Sub-class P[c ⊆ d], if entities were already resolved:
$$P[c \subseteq d] = \frac{\#x: type(x,c) \wedge type(x,d)}{\#x: type(x,c)}$$
with same-entity probabilities:
$$P[c \subseteq d] = \frac{\sum_{x:\, type(x,c)} \Big(1 - \prod_{y:\, type(y,d)} \big(1 - P[x \equiv y]\big)\Big)}{\#x: type(x,c)}$$
3-89
Partitioned MLN Method [V. Rastogi et al. 2011]
• Use Markov Logic Network for entity resolution
• Partition MLN with replication of nodes so that each node has its neighborhood in the same partition
Repeat
• local computation: run MLN inference via MCMC on each partition (in parallel)
• message passing: exchange beliefs (on sameAs) among partitions with overlapping node sets
Until convergence
R1: sim(x,y) ⇒ sameAuthor(x,y)
R2: sim(x,y) ∧ coAuthor(x,a) ∧ coAuthor(y,b) ∧ sameAuthor(a,b) ⇒ sameAuthor(x,y)
3-90
LINDA: Linked Data Alignment at Scale [C. Böhm et al. 2012]
• uses context sim and joint inference to process sameAs matrix with transitivity and other constraints
• alternates between setting sameAs and recomputing sim
• puts promising candidate pairs in priority queue
• queue is partitioned and processing parallelized
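The alternation between accepting sameAs pairs and recomputing similarities can be sketched on a single machine as follows. This is a hedged approximation, not the LINDA algorithm itself: context similarity is reduced to a fixed `boost` added to dependent pairs, transitivity is tracked with a union-find structure, and `threshold`, `dependents`, and the entity names are assumptions made for the example.

```python
import heapq


class UnionFind:
    """Tracks sameAs equivalence classes so that transitivity comes for free."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)


def align(candidates, dependents, threshold=0.5, boost=0.2):
    """Greedy sameAs assignment driven by a priority queue.

    candidates: dict (e1, e2) -> initial similarity.
    dependents: dict (e1, e2) -> pairs whose context similarity rises
                once (e1, e2) is accepted (assumed to be precomputed).
    """
    sim = dict(candidates)
    heap = [(-score, pair) for pair, score in sim.items()]
    heapq.heapify(heap)
    classes = UnionFind()
    accepted = []
    while heap:
        neg_score, pair = heapq.heappop(heap)
        score = -neg_score
        if score != sim[pair]:
            continue  # stale queue entry, a newer score was pushed
        if score < threshold:
            break  # all remaining current entries are below the threshold
        a, b = pair
        if classes.find(a) == classes.find(b):
            continue  # already implied by transitivity
        classes.union(a, b)  # accept: set sameAs
        accepted.append(pair)
        for dep in dependents.get(pair, ()):  # update: recompute sim
            if dep in sim:
                sim[dep] = min(1.0, sim[dep] + boost)
                heapq.heappush(heap, (-sim[dep], dep))
    return accepted, classes


candidates = {
    ("dbp:Rome", "yago:Rome"): 0.9,
    ("yago:Rome", "geo:Rome"): 0.8,
    ("dbp:Rome", "geo:Rome"): 0.7,
    ("dbp:Italy", "yago:Italy"): 0.4,
}
dependents = {("dbp:Rome", "yago:Rome"): [("dbp:Italy", "yago:Italy")]}

accepted, classes = align(candidates, dependents)
print(accepted)
```

The pair (dbp:Rome, geo:Rome) is never accepted explicitly because it already follows from the two earlier links, and the Italy pair only passes the threshold after its context similarity has been raised by the accepted Rome link.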
[Figure: LINDA architecture. The input entity graph G is read and distributed as G-part 1 … G-part n over Node 1 … Node n; the input queue Q of candidate pairs (ei, ej, y) is distributed as Q-part 1 … Q-part n; accepted pairs are recorded in the result matrix X over entities e1 … em. Steps: (1) accept, (2) notify, (3) update, (4) register queue updates.]
Experiment with BTC+ dataset:
• 3 Bio. quads
• 345 Mio. triples
• 95 Mio. URIs
Result after 30 h run-time:
• 12.3 Mio. sameAs
• 66% precision
• > 80% for DBpedia-YAGO
3-91
Cross-Lingual Linking
Source: Z. Wang et al.: WWW'12
+ simpler than monolingual: natural equivalences, interwiki links
- harder than monolingual: different terminologies & structures
Z. Wang et al. WWW'12: factor-graph learning, 200,000 sameAs
T. Nguyen et al. VLDB'12: sim features & LSI, infobox mappings
en.wikipedia.org: 3.5 Mio. articles
baike.baidu.com: 4 Mio. articles
3-92
Challenges Remaining
Entity linkage is at the heart of semantic data integration!
More than 50 years of research, still some way to go!
• Highly related entities with ambiguous names: George W. Bush (jun.) vs. George H.W. Bush (sen.)
• Out-of-Wikipedia entities with sparse context
• Enterprise data (perhaps combined with Web2.0 data)
• Entities with very noisy context (in social media)
• Records with complex DB / XML / OWL schemas
Benchmarks:
• OAEI Ontology Alignment & Instance Matching: oaei.ontologymatching.org
• TAC KBP Entity Linking: www.nist.gov/tac/2012/KBP/
• TREC Knowledge Base Acceleration: trec-kba.org
3-93
TREC Task: Knowledge Base Acceleration
http://trec-kba.org
Goal: assist Wikipedia / KB editors
• recommend key citations as evidence of truth
• recommend infobox structure and categories
• recommend entity links and external links
3-94
Outline
...
Motivation
Entity-Name Disambiguation
Mapping Questions into Queries
Entity Linkage
Wrap-up
3-96
Take-Home Lessons
Web of Linked Data is great
• 100's of KB's with 30 Bio. triples and 500 Mio. links
• mostly reference data, dynamic maintenance is bottleneck
• connection with Web of Contents needs improvement
Entity detection and disambiguation is key
• for creating sameAs links in text (RDFa, microformats)
• for machine reading, semantic authoring, knowledge base acceleration, …
Linking entities across KB's is advancing
• Integrated methods for aligning entities, classes and relations
NED methods come close to human quality
• combine popularity, similarity, and coherence
• extend towards general WSD (e.g. for QA)
3-97
Open Problems and Grand Challenges
Automatic and continuously maintained sameAs links
• for Web of Linked Data with high accuracy & coverage
Combine algorithms and crowdsourcing for NED & ER
• with active learning, minimizing human effort or cost/accuracy
Robust disambiguation of entities, relations and classes
• Relevant for question answering & question-to-query translation
• Key building block for KB building and maintenance
Entity name disambiguation in difficult situations
• Short and noisy texts about long-tail entities in social media
3-98
End of Part 3
Questions?
3-99
Recommended Readings: Disambiguation
• J. Hoffart, M. A. Yosef, I. Bordino, et al.: Robust Disambiguation of Named Entities in Text. EMNLP 2011
• J. Hoffart et al.: KORE: Keyphrase Overlap Relatedness for Entity Disambiguation. CIKM 2012
• R.C. Bunescu, M. Pasca: Using Encyclopedic Knowledge for Named Entity Disambiguation. EACL 2006
• S. Cucerzan: Large-Scale Named Entity Disambiguation Based on Wikipedia Data. EMNLP 2007
• D.N. Milne, I.H. Witten: Learning to link with Wikipedia. CIKM 2008
• S. Kulkarni et al.: Collective annotation of Wikipedia entities in web text. KDD 2009
• G. Limaye et al.: Annotating and Searching Web Tables Using Entities, Types and Relationships. PVLDB 2010
• A. Rahman, V. Ng: Coreference Resolution with World Knowledge. ACL 2011
• L. Ratinov et al.: Local and Global Algorithms for Disambiguation to Wikipedia. ACL 2011
• M. Dredze et al.: Entity Disambiguation for Knowledge Base Population. COLING 2010
• P. Ferragina, U. Scaiella: TAGME: on-the-fly annotation of short text fragments. CIKM 2010
• X. Han, L. Sun, J. Zhao: Collective entity linking in web text: a graph-based method. SIGIR 2011
• M. Tsagkias, M. de Rijke, W. Weerkamp: Linking Online News and Social Media. WSDM 2011
• J. Du et al.: Towards High-Quality Semantic Entity Detection over Online Forums. SocInfo 2011
• V.I. Spitkovsky, A.X. Chang: A Cross-Lingual Dictionary for English Wikipedia Concepts. LREC 2012
• J.R. Finkel, T. Grenager, C. Manning: Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. ACL 2005
• V. Ng: Supervised Noun Phrase Coreference Research: The First Fifteen Years. ACL 2010
• S. Singh, A. Subramanya, F.C.N. Pereira, A. McCallum: Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models. ACL 2011
• T. Lin et al.: No Noun Phrase Left Behind: Detecting and Typing Unlinkable Entities. EMNLP 2012
• A. Rahman, V. Ng: Inducing Fine-Grained Semantic Classes via Hierarchical Classification. COLING 2010
• X. Ling, D.S. Weld: Fine-Grained Entity Recognition. AAAI 2012
• R. Navigli: Word sense disambiguation: A survey. ACM Comput. Surv. 41(2), 2009
• M. Yahya et al.: Natural Language Questions for the Web of Data. EMNLP 2012
• S. Shekarpour: Automatically Transforming Keyword Queries to SPARQL on Large-Scale KBs. ISWC 2011
3-100
Recommended Readings: Linked Data and Entity Linkage
• T. Heath, C. Bizer: Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool, 2011
• A. Hogan et al.: An empirical survey of Linked Data conformance. J. Web Sem. 14, 2012
• H. Glaser, A. Jaffri, I.C. Millard: Managing Co-Reference on the Semantic Web. LDOW 2009
• J. Volz, C. Bizer, M. Gaedke, G. Kobilarov: Discovering and Maintaining Links on the Web of Data. ISWC 2009
• F. Naumann, M. Herschel: An Introduction to Duplicate Detection. Morgan & Claypool, 2010
• H. Köpcke et al.: Learning-Based Approaches for Matching Web Data Entities. IEEE Internet Computing 2010
• H. Köpcke et al.: Evaluation of entity resolution approaches on real-world match problems. PVLDB 2010
• S. Melnik, H. Garcia-Molina, E. Rahm: Similarity Flooding: A Versatile Graph Matching Algorithm and its Application to Schema Matching. ICDE 2002
• S. Chaudhuri, V. Ganti, R. Motwani: Robust Identification of Fuzzy Duplicates. ICDE 2005
• S.E. Whang et al.: Entity Resolution with Iterative Blocking. SIGMOD 2009
• S.E. Whang, H. Garcia-Molina: Joint Entity Resolution. ICDE 2012
• L. Kolb, A. Thor, E. Rahm: Load Balancing for MapReduce-based Entity Resolution. ICDE 2012
• J. Li, J. Tang, Y. Li, Q. Luo: RiMOM: A dynamic multistrategy ontology alignment framework. TKDE 21(8), 2009
• P. Singla, P. Domingos: Entity Resolution with Markov Logic. ICDM 2006
• I. Bhattacharya, L. Getoor: Collective Entity Resolution in Relational Data. TKDD 1(1), 2007
• R. Hall, C.A. Sutton, A. McCallum: Unsupervised deduplication using cross-field dependencies. KDD 2008
• V. Rastogi, N. Dalvi, M. Garofalakis: Large-Scale Collective Entity Matching. PVLDB 2011
• F. Suchanek et al.: PARIS: Probabilistic Alignment of Relations, Instances, and Schema. PVLDB 2012
• Z. Wang, J. Li, Z. Wang, J. Tang: Cross-lingual knowledge linking across wiki knowledge bases. WWW 2012
• T. Nguyen et al.: Multilingual Schema Matching for Wikipedia Infoboxes. PVLDB 2012
• A. Hogan et al.: Scalable and distributed methods for entity matching. J. Web Sem. 10, 2012
• C. Böhm et al.: LINDA: Distributed Web-of-Data-Scale Entity Matching. CIKM 2012
• J. Wang, T. Kraska, M. Franklin, J. Feng: CrowdER: Crowdsourcing Entity Resolution. PVLDB 2012
3-101
Knowledge Harvesting: Overall Take-Home Lessons
KB's are great opportunity in the big-data era: revive old AI vision, make it real & large-scale!
challenging, but high pay-off
Strong success story on entities and classes
Many opportunities remaining:
• temporal knowledge, spatial, visual, commonsense
• vertical domains: health, music, travel, …
Good progress on relational facts
Methods for open-domain relation discovery
Search and ranking:
• Combine facts (SPO triples) with witness text
• Extend SPARQL, LM's for ranking, UI unclear
Entity linking:
• From names in text to entities in KB
• sameAs between entities in different KB's / DB's
1-102
Knowledge Harvesting: Research Opportunities & Challenges
Explore & exploit synergies between semantic, statistical, & social Web methods: statistical evidence + logical consistency + wisdom of the crowd!
For DB / AI / IR / NLP / Web researchers:
• efficiency & scalability
• consistency constraints & reasoning
• search and ranking
• deep linguistic patterns & statistics
• text (& speech) disambiguation
• killer app for uncertain data management
• knowledge-base life-cycle
and more
1-103
de: vielen Dank
en: thank you
fr: Merci beaucoup
es: muchas gracias
cmn: 非常谢谢你
ru: Большое спасибо
tib: ཐུགས་རྗེ་ཆེ་།
yue: 唔該
wu: 谢谢侬
expression of gratitude
dai: ขอบคุณ
3-104