
Ralf Schenkel

joint work with Fabian Suchanek and Gjergji Kasneci

YAWN: A Semantically Annotated Wikipedia XML Corpus


8 March 2007, BTW 2007, Aachen

Results for "Konferenz Aachen" (German for "conference Aachen"):

• NRW KULTURsekretariat (relevance: 5.9%)
• Pfadfinderinnenschaft Sankt Georg (relevance: 5.7%)
• Konferenz der deutschsprachigen Mathematikfachschaften (relevance: 5.2%)
• Leonard Monheim (relevance: 5.1%)
• Andreas Kruse (relevance: 4.9%)
• Holzbau (relevance: 4.9%)
• Wolfgang Seifen (relevance: 4.9%)
• Feldpost der Belgier in Deutschland nach dem Ersten Weltkrieg 1918–1935 (relevance: 4.1%)
• Konferenz der Informatikfachschaften (relevance: 4.0%)
• UNESCO-Club (relevance: 3.7%)
• Kaiser/Riegraf-Gruppe (Heilbronn) (relevance: 3.7%)
• Niederländische Annexionspläne nach dem Zweiten Weltkrieg (relevance: 3.6%)

Find a page of a conference that is related to Aachen.

Limit the query to certain classes of result pages.


Source for Classes: WordNet Thesaurus

ROOT
  entity
    thing
    living_thing
      person
        entertainer
          musician
          actor
        scientist
          physicist
          biologist
  group
    meeting
      conference
      congress
    minority

More than 81000 concepts
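A class hierarchy like this one lets a query for a general class (for example "meeting") also match pages annotated with a more specific class (for example "conference"). The following is a minimal sketch of that subclass test. The child-to-parent edges are one reading of the slide's diagram, the single-parent table is a simplification (WordNet concepts can have several hypernyms), and the names `PARENT`, `ancestors`, and `is_a` are hypothetical.

```python
# Hypothetical child -> parent edges for the excerpt shown on the slide.
# Assumption: each concept has exactly one parent, which is a simplification.
PARENT = {
    "entity": "ROOT", "group": "ROOT",
    "thing": "entity", "living_thing": "entity",
    "person": "living_thing",
    "entertainer": "person", "scientist": "person",
    "musician": "entertainer", "actor": "entertainer",
    "physicist": "scientist", "biologist": "scientist",
    "meeting": "group", "minority": "group",
    "conference": "meeting", "congress": "meeting",
}


def ancestors(concept):
    """Return the chain of ancestors from the direct parent up to ROOT."""
    chain = []
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain


def is_a(concept, ancestor):
    """True if `concept` equals `ancestor` or lies below it in the hierarchy."""
    return concept == ancestor or ancestor in ancestors(concept)
```

With this table, `is_a("conference", "meeting")` holds while `is_a("physicist", "group")` does not, which is the kind of test needed to limit a query to certain classes of result pages.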


Mapping Pages to Concepts

city

Automatic mapping with high quality


Architecture

Wikipedia Pages (Wiki Markup)

HTML

TopX Search Engine

Concept Mapper

Wikipedia Pages (Annotated XML)

Wikipedia Pages (XML)


Concept Mapping (1): Categories

Manually added category information in most pages

Example: Albert Einstein
• Excellent_articles
• 1879_births
• Physics
• Swiss_physicists

Technically: exclude admin categories, shallow parsing of category labels, stemming, mapping heuristics
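The steps named above (drop admin categories, shallow-parse the label, stem, map to a concept) could be sketched roughly as follows. This is not the authors' implementation: the admin list, the preposition list, the plural-stripping stemmer, and the tiny concept vocabulary `CONCEPTS` are all hypothetical stand-ins chosen for illustration.

```python
# Hypothetical stand-ins; a real system would use the full WordNet vocabulary
# and a proper stemmer instead of these toy definitions.
CONCEPTS = {"physicist", "scientist", "person", "city", "conference", "king"}
ADMIN_CATEGORIES = {"Excellent_articles"}            # assumed admin category
PREPOSITIONS = {"of", "in", "from", "by", "at"}


def head_noun(category):
    """Crude shallow parse: cut at the first preposition, keep the last word."""
    words = category.lower().split("_")
    for i, word in enumerate(words):
        if word in PREPOSITIONS and i > 0:
            words = words[:i]
            break
    return words[-1]


def stem(word):
    """Naive plural stripping, standing in for a real stemmer."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def map_categories(categories):
    """Return the concepts that the category labels of one page map to."""
    concepts = []
    for category in categories:
        if category in ADMIN_CATEGORIES:
            continue
        candidate = stem(head_noun(category))
        if candidate in CONCEPTS:
            concepts.append(candidate)
    return concepts


einstein = ["Excellent_articles", "1879_births", "Physics", "Swiss_physicists"]
print(map_categories(einstein))
```

For the Albert Einstein example, only Swiss_physicists survives: Excellent_articles is skipped as an admin category, and the stemmed heads of 1879_births and Physics are not in the toy vocabulary.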


Concept Mapping (2): Regular Structure

Regular structures (lists, tables, …) often indicate similar concepts.

Example: List of people

• Albert Einstein
• Max Planck
• Niels Bohr
• Werner Heisenberg

Technically: grouping of similar XPath expressions, finding coherent annotations, frequency & confidence thresholds

physicist


Similar XPath expressions for the entries of the list:

• /article[1]/…/list[3]/item[1]/link[1]
• /article[1]/…/list[3]/item[2]/link[1]
• /article[1]/…/list[3]/item[3]/link[1]
• /article[1]/…/list[3]/item[4]/link[1]
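One way to picture the grouping step: links whose XPath expressions differ only in the position of the list item fall into one group, and if enough already-annotated links in that group agree on a concept, the concept is propagated to the remaining links. The sketch below assumes this reading. The function names and the threshold values `min_support` and `min_confidence` are hypothetical, and the paths are copied from the slide with their elided middle part left as-is.

```python
import re
from collections import Counter, defaultdict


def generalize(xpath):
    """Replace the position of every item step so that siblings share a pattern."""
    return re.sub(r"item\[\d+\]", "item[*]", xpath)


def propagate(links, min_support=3, min_confidence=0.7):
    """links: list of (xpath, concept or None). Returns inferred {xpath: concept}.

    Assumed rule: within a group of similar paths, the most frequent known
    concept is assigned to unannotated links if it occurs at least
    `min_support` times and covers at least `min_confidence` of the known ones.
    """
    groups = defaultdict(list)
    for xpath, concept in links:
        groups[generalize(xpath)].append((xpath, concept))

    inferred = {}
    for members in groups.values():
        known = Counter(c for _, c in members if c is not None)
        if not known:
            continue
        concept, count = known.most_common(1)[0]
        if count >= min_support and count / sum(known.values()) >= min_confidence:
            for xpath, existing in members:
                if existing is None:
                    inferred[xpath] = concept
    return inferred


links = [
    ("/article[1]/…/list[3]/item[1]/link[1]", "physicist"),  # Albert Einstein
    ("/article[1]/…/list[3]/item[2]/link[1]", "physicist"),  # Max Planck
    ("/article[1]/…/list[3]/item[3]/link[1]", None),         # not yet annotated
    ("/article[1]/…/list[3]/item[4]/link[1]", "physicist"),  # Werner Heisenberg
]
print(propagate(links))
```

Here the three annotated entries agree, so the third entry inherits the annotation physicist. Lower thresholds annotate more links at the cost of more wrong annotations.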


Concept Mapping (3): Outlier Detection

Sometimes conflicting annotations of the same page:

ROOT
  entity
    thing
      artifact
        instrument
          ruler
    living_thing
      person
        ruler
          king

Categories of the page: Kings_of_Spain, European_rulers. Which "ruler" is meant?

Solution: compatibility matrix for high-level concepts
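A compatibility matrix over high-level concepts could be used along the following lines. The slide does not say how conflicts are resolved, so the majority vote in this sketch is an assumption, as are the matrix entries and the name `remove_outliers`. Each candidate annotation is paired with the high-level concept it falls under.

```python
from collections import Counter

# Hypothetical compatibility matrix for two high-level concepts.
COMPATIBLE = {
    ("person", "person"): True,
    ("artifact", "artifact"): True,
    ("person", "artifact"): False,
    ("artifact", "person"): False,
}


def remove_outliers(candidates):
    """Drop annotations whose high-level concept conflicts with the majority.

    candidates: list of (concept, high_level_concept) pairs for one page.
    Assumption: the most frequent high-level concept is taken as the reference.
    """
    if not candidates:
        return []
    majority = Counter(top for _, top in candidates).most_common(1)[0][0]
    return [
        (concept, top)
        for concept, top in candidates
        if COMPATIBLE.get((top, majority), False)
    ]


candidates = [
    ("king", "person"),      # from Kings_of_Spain
    ("ruler", "person"),     # from European_rulers, reading 1
    ("ruler", "artifact"),   # from European_rulers, reading 2 (measuring stick)
]
print(remove_outliers(candidates))
```

In this example the artifact reading of "ruler" is discarded because it is incompatible with the person readings that dominate the page.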


YAWN: Annotated XML

• Add concept tag(s) to articles
  <city source="categories" confidence="1.0"> <article>…</article> </city>

• Add concept tag(s) to outgoing links
  … <city source="lists" confidence="0.9"> <link target="…">Saarbrücken</link> </city>
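Wrapping an element in a concept tag, as in the two examples above, can be sketched with Python's standard `xml.etree.ElementTree` module. The helper name `annotate` and the `<name>` child element are hypothetical; only the concept tag and the `source` and `confidence` attributes come from the slide.

```python
import xml.etree.ElementTree as ET


def annotate(element, concept, source, confidence):
    """Return a new element named after the concept that wraps `element`."""
    wrapper = ET.Element(
        concept,
        {"source": source, "confidence": f"{confidence:.1f}"},
    )
    wrapper.append(element)
    return wrapper


# Hypothetical, heavily shortened article element.
article = ET.fromstring("<article><name>Aachen</name></article>")
annotated = annotate(article, "city", "categories", 1.0)
print(ET.tostring(annotated, encoding="unicode"))
```

Putting the concept tag around the element, rather than adding an attribute to it, is what lets a structural query address the concept by name.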


Querying YAWN

Map concept queries to XPath expressions

• "conferences in Aachen": //conference[contains(.,"Aachen")]

• "scientists who won a Nobel Prize": //scientist[contains(.,"Nobel prize")]

• "musicians who performed a song where 'space' occurs in the title": //musician[contains(//song,"space")]

Not for end users! Needs a good user interface.
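To illustrate what the first query does, the sketch below emulates `//conference[contains(.,"Aachen")]` in plain Python. The XPath subset of the standard `xml.etree.ElementTree` module has no `contains()` function, so the text test is written by hand. The talk itself uses the TopX search engine, not this code, and the sample document with its `corpus`, `title`, and `p` elements is hypothetical.

```python
import xml.etree.ElementTree as ET


def concept_query(root, concept, keyword):
    """Emulate //concept[contains(., keyword)]: elements with the given tag
    whose complete text content contains the keyword."""
    return [
        element
        for element in root.iter(concept)
        if keyword in "".join(element.itertext())
    ]


# Hypothetical two-page corpus in the annotated format.
corpus = ET.fromstring(
    "<corpus>"
    "<conference source='categories' confidence='1.0'>"
    "<article><title>BTW 2007</title><p>Held in Aachen.</p></article>"
    "</conference>"
    "<city source='categories' confidence='1.0'>"
    "<article><title>Aachen</title></article>"
    "</city>"
    "</corpus>"
)

hits = concept_query(corpus, "conference", "Aachen")
titles = [hit.findtext("article/title") for hit in hits]
print(titles)
```

The page about the city itself also mentions Aachen, but it is not returned because its concept tag is city, not conference. A nested condition such as the song query would need the inner test to run relative to each musician element, which this single-level helper does not cover.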


Leftovers and Summary

• XML Conversion

• Templates

• Preliminary evaluation

See paper

Automated detection and annotation of concepts is useful for retrieval.


The Future: YAGO [WWW'07]

• Aachen instance_of city
• NRW instance_of state
• city is_a area
• state is_a area
• Aachen located_in NRW

Querying the knowledge representation
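Querying such a representation can be pictured as pattern matching over subject-predicate-object triples. The sketch below assumes the edges of the diagram read as listed in `TRIPLES`, and the helper names `match` and `cities_in` are hypothetical. It is not YAGO's query interface.

```python
# Assumed reading of the diagram as (subject, predicate, object) triples.
TRIPLES = [
    ("Aachen", "instance_of", "city"),
    ("NRW", "instance_of", "state"),
    ("city", "is_a", "area"),
    ("state", "is_a", "area"),
    ("Aachen", "located_in", "NRW"),
]


def match(pattern, triples=TRIPLES):
    """Return all triples that fit the pattern; None acts as a wildcard."""
    return [
        triple
        for triple in triples
        if all(p is None or p == value for p, value in zip(pattern, triple))
    ]


def cities_in(region):
    """Entities that are instances of city and located in the given region."""
    located = {s for s, _, _ in match((None, "located_in", region))}
    cities = {s for s, _, _ in match((None, "instance_of", "city"))}
    return sorted(located & cities)


print(cities_in("NRW"))
```

Unlike the annotated XML, the relation between Aachen and NRW is explicit here, so the query no longer depends on the two names occurring in the same page text.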


Thank you!