Ralf Schenkel
joint work with Fabian Suchanek and Gjergji Kasneci
YAWN: A Semantically Annotated Wikipedia XML Corpus
8 March 2007, BTW 2007, Aachen
Results for "Konferenz Aachen" ("conference Aachen"):
• NRW KULTURsekretariat (relevance: 5.9%)
• Pfadfinderinnenschaft Sankt Georg (relevance: 5.7%)
• Konferenz der deutschsprachigen Mathematikfachschaften (relevance: 5.2%)
• Leonard Monheim (relevance: 5.1%)
• Andreas Kruse (relevance: 4.9%)
• Holzbau (relevance: 4.9%)
• Wolfgang Seifen (relevance: 4.9%)
• Feldpost der Belgier in Deutschland nach dem Ersten Weltkrieg 1918–1935 (relevance: 4.1%)
• Konferenz der Informatikfachschaften (relevance: 4.0%)
• UNESCO-Club (relevance: 3.7%)
• Kaiser/Riegraf-Gruppe (Heilbronn) (relevance: 3.7%)
• Niederländische Annexionspläne nach dem Zweiten Weltkrieg (relevance: 3.6%)
Find a page of a conference that is related to Aachen.
Limit query to certain classes of result pages
Source for Classes: WordNet Thesaurus
[Figure: excerpt of the WordNet concept hierarchy; node labels follow]
ROOT
entity, group
thing, living_thing
person
entertainer, scientist
physicist, biologist, musician, actor
meeting
conference, congress
minority
More than 81,000 concepts
Mapping Pages to Concepts
city
Automatic mapping with high quality
Architecture
Wikipedia Pages (Wiki Markup)
HTML
TopX Search Engine
Concept Mapper
Wikipedia Pages (Annotated XML)
Wikipedia Pages (XML)
Concept Mapping (1): Categories
Manually added category information in most pages
Example: Albert Einstein
• Excellent_articles
• 1879_births
• Physics
• Swiss_physicists
Technically: exclude admin categories, shallow parsing of category labels, stemming, mapping heuristics
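The four steps named above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the mapper from the talk: the admin-category list, the toy concept vocabulary (standing in for WordNet), the preposition list, and the plural-stripping "stemmer" are all hypothetical.

```python
from typing import Optional

# Hypothetical list of administrative categories to exclude (assumption).
ADMIN_CATEGORIES = {"Excellent_articles"}
# Toy stand-in for the WordNet concept vocabulary (assumption).
CONCEPTS = {"physicist", "scientist", "person", "city", "conference"}
# Words that end the head phrase in the shallow parse (assumption).
PREPOSITIONS = {"of", "in", "by", "from"}


def naive_stem(word: str) -> str:
    """Strip a trailing plural 's'; a placeholder for a real stemmer."""
    word = word.lower()
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def map_category(category: str) -> Optional[str]:
    """Return a concept for a category label, or None if nothing matches."""
    if category in ADMIN_CATEGORIES:
        return None
    # Shallow parse: keep the words before the first preposition and
    # try them from right to left, so the head noun is checked first.
    head = []
    for token in category.split("_"):
        if token.lower() in PREPOSITIONS:
            break
        head.append(token)
    for token in reversed(head):
        stem = naive_stem(token)
        if stem in CONCEPTS:
            return stem
    return None


categories = ["Excellent_articles", "1879_births", "Physics", "Swiss_physicists"]
mapping = {c: map_category(c) for c in categories}
```

With this toy vocabulary only Swiss_physicists yields a concept (physicist); the other three categories are excluded or left unmapped.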
Concept Mapping (2): Regular Structure
Regular structures (lists, tables, …) often indicate similar concepts
Example: List of people
• Albert Einstein
• Max Planck
• Niels Bohr
• Werner Heisenberg
Technically: grouping of similar XPath expressions, find coherent annotations, frequency & confidence thresholds
physicist
• /article[1]/…/list[3]/item[1]/link[1]
• /article[1]/…/list[3]/item[2]/link[1]
• /article[1]/…/list[3]/item[3]/link[1]
• /article[1]/…/list[3]/item[4]/link[1]
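One possible reading of "grouping of similar XPath expressions" with frequency and confidence thresholds is sketched below. The paths are shortened, hypothetical ones (they do not reproduce the elided part of the paths on the slide), and the function names, threshold values, and the definition of confidence as "share of the group carrying the dominant concept" are assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict


def generalize(xpath: str) -> str:
    """Replace the position of each 'item' step with a wildcard."""
    return re.sub(r"item\[\d+\]", "item[*]", xpath)


def propagate(link_paths, known, min_support=3, min_confidence=0.7):
    """Infer concepts for unannotated link targets in coherent groups.

    link_paths: {xpath of a link: title of the linked page}
    known:      {page title: concept already assigned, e.g. via categories}
    """
    groups = defaultdict(list)
    for xpath, title in link_paths.items():
        groups[generalize(xpath)].append(title)

    inferred = {}
    for titles in groups.values():
        counts = Counter(known[t] for t in titles if t in known)
        if not counts:
            continue
        concept, support = counts.most_common(1)[0]
        confidence = support / len(titles)
        # Only propagate when the dominant concept is frequent enough
        # and covers a large enough share of the group.
        if support >= min_support and confidence >= min_confidence:
            for title in titles:
                if title not in known:
                    inferred[title] = concept
    return inferred


# Hypothetical, shortened paths for a list of four links.
link_paths = {
    "/article[1]/list[1]/item[1]/link[1]": "Albert Einstein",
    "/article[1]/list[1]/item[2]/link[1]": "Max Planck",
    "/article[1]/list[1]/item[3]/link[1]": "Niels Bohr",
    "/article[1]/list[1]/item[4]/link[1]": "Werner Heisenberg",
}
known = {
    "Albert Einstein": "physicist",
    "Max Planck": "physicist",
    "Niels Bohr": "physicist",
}
inferred = propagate(link_paths, known)
```

Here three of the four list entries are already known physicists, so the fourth entry inherits the same concept.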
Concept Mapping (3): Outlier Detection
Sometimes conflicting annotations of the same page:
[Figure: WordNet excerpt; node labels follow]
ROOT
entity
thing, living_thing
person
ruler
artifact
instrument
ruler
king
Kings_of_Spain, European_rulers
?
Solution: Compatibility matrix for high-level concepts
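A compatibility matrix over high-level concepts could be applied along the following lines. The slide does not give the matrix or the decision rule, so everything here is an assumption: the candidate labels, their high-level ancestors, the matrix entries, and the majority rule used to pick the dominant high-level concept.

```python
from collections import Counter

# Hypothetical mapping from candidate concepts to a high-level ancestor.
HIGH_LEVEL = {
    "king": "person",
    "monarch": "person",
    "ruler (instrument)": "artifact",
}

# Hypothetical symmetric compatibility matrix for high-level concepts.
COMPATIBLE = {
    ("person", "person"): True,
    ("artifact", "artifact"): True,
    ("person", "artifact"): False,
    ("artifact", "person"): False,
}


def drop_outliers(candidates):
    """Keep candidates compatible with the most frequent high-level concept.

    Ties are broken by first occurrence, which is an arbitrary choice here.
    """
    counts = Counter(HIGH_LEVEL[c] for c in candidates)
    dominant, _ = counts.most_common(1)[0]
    return [c for c in candidates if COMPATIBLE[(HIGH_LEVEL[c], dominant)]]


kept = drop_outliers(["king", "monarch", "ruler (instrument)"])
```

In this toy case the measuring-instrument sense of "ruler" is dropped because it conflicts with the two person-level annotations.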
YAWN: Annotated XML
• Add concept tag(s) to articles
<city source="categories" confidence="1.0"> <article>…</article> </city>
• Add concept tag(s) to outgoing links
…<city source="lists" confidence="0.9"> <link target="…">Saarbrücken</link> </city>
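To make the annotation format concrete, here is a small sketch that produces both kinds of wrapper with Python's standard xml.etree.ElementTree. The helper names, the sample article, and the link target value are hypothetical; only the tag and attribute layout follows the slide.

```python
import xml.etree.ElementTree as ET


def make_wrapper(concept, source, confidence):
    """Create an empty concept element such as <city source=... confidence=...>."""
    return ET.Element(concept, {"source": source, "confidence": f"{confidence:.1f}"})


def wrap_article(article, concept, source, confidence):
    """Return a new concept element that contains the whole article."""
    wrapper = make_wrapper(concept, source, confidence)
    wrapper.append(article)
    return wrapper


def wrap_child(parent, child, concept, source, confidence):
    """Replace child inside parent by a concept element containing child."""
    index = list(parent).index(child)
    wrapper = make_wrapper(concept, source, confidence)
    # Text following the link must stay after the wrapper, not inside it.
    wrapper.tail, child.tail = child.tail, None
    parent.remove(child)
    wrapper.append(child)
    parent.insert(index, wrapper)
    return wrapper


# Hypothetical sample documents.
article = ET.fromstring("<article><title>Aachen</title></article>")
annotated = wrap_article(article, "city", "categories", 1.0)

paragraph = ET.fromstring(
    '<p>See <link target="Saarbrücken">Saarbrücken</link> for details.</p>'
)
wrap_child(paragraph, paragraph[0], "city", "lists", 0.9)
xml_text = ET.tostring(paragraph, encoding="unicode")
```

The confidence is stored as a string attribute, so a consumer has to parse it back into a number before comparing it with a threshold.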
Querying YAWN
Map concept queries to XPath expressions
• "conferences in Aachen": //conference[contains(.,"Aachen")]
• "scientists who won a nobel prize": //scientist[contains(.,"Nobel prize")]
• "musicians who performed a song where 'space' occurs in the title": //musician[contains(//song,"space")]
Not for end users! Needs good user interface
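The first query can be tried on a toy annotated corpus as sketched below. The talk runs such queries on the TopX search engine; this sketch instead emulates //conference[contains(.,"Aachen")] in plain Python, because the XPath subset in xml.etree.ElementTree has no contains() function. The corpus content and the function name are assumptions.

```python
import xml.etree.ElementTree as ET


def concept_query(root, concept, keyword):
    """Emulate //concept[contains(., keyword)] on an annotated tree.

    The XPath string value of an element is the concatenation of all its
    descendant text nodes, which "".join(el.itertext()) reproduces.
    """
    return [
        element
        for element in root.iter(concept)
        if keyword in "".join(element.itertext())
    ]


# Hypothetical mini corpus in the annotated format.
corpus = ET.fromstring(
    "<corpus>"
    "<conference><article>BTW 2007 takes place in Aachen.</article></conference>"
    "<city><article>Aachen is a city in Germany.</article></city>"
    "<conference><article>A meeting held elsewhere.</article></conference>"
    "</corpus>"
)
hits = concept_query(corpus, "conference", "Aachen")
```

Only the first element qualifies: the city page mentions Aachen but has the wrong concept tag, and the second conference page does not mention Aachen.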
Left Overs and Summary
• XML Conversion
• Templates
• Preliminary evaluation
See paper
Automated detection and annotation of concepts is useful for retrieval.
The Future: YAGO [WWW'07]
[Figure: knowledge graph excerpt; node and edge labels follow]
city
area
state
Aachen, NRW
is_a, is_a
instance_of, instance_of
located_in
Querying the knowledge representation
Thank you!