yago ontology

YAGO Ontology, Dragana Dudic

Upload: dragana-dudic

Post on 27-Nov-2014


TRANSCRIPT

Page 1: YAGO Ontology

YAGO Ontology

Dragana Dudic

Page 2: YAGO Ontology

Introduction

• Acronym for Yet Another Great Ontology
• One of the largest free public ontologies

Ontology     Entities    Facts
KnowItAll    n/a         29,835
SUMO         20,000      70,000
WordNet      117,659     821,492
Cyc          300,000     3,000,000
TextRunner   n/a         7,800,000
YAGO         2,000,000   20,000,000
DBpedia      3,500,000   672,000,000

• YAGO has three versions. The first is known as YAGO (founded in 2007), the second is T-YAGO (founded in early 2010) and the third is YAGO2 (founded in late 2010)
• Part of the YAGO-NAGA project at the Max Planck Institute for Informatics in Germany

Page 3: YAGO Ontology

YAGO-NAGA project

• The YAGO-NAGA project started in 2006
• The goal is to build a large, searchable and highly accurate knowledge base in a machine-processible representation
• The YAGO-NAGA project consists of 10 sub-projects: YAGO, NAGA, SOFIE, PROSPERA, LEILA, RDF-3X, ANGIE, UWN, K2, Javatools

Page 4: YAGO Ontology

YAGO-NAGA sub-projects

• YAGO is a huge ontology
• NAGA is a semantic search engine
• SOFIE is a system for automated ontology extraction
• PROSPERA is an extension of SOFIE for large-scale IE (information extraction)
• LEILA is a predecessor of SOFIE. It is a system that can extract pairs of a relation from a set of HTML documents

Page 5: YAGO Ontology

YAGO-NAGA sub-projects (2)

• RDF-3X is an RDF storage and retrieval system
• ANGIE is a knowledge system for interactive exploration
• UWN is a multilingual version of WordNet
• K2 is used for gathering and ranking photos
• Javatools is a suite of Java classes for a variety of small tasks
• More about the YAGO-NAGA project: http://www.mpi-inf.mpg.de/yago-naga

Page 6: YAGO Ontology

YAGO introduction

• YAGO is a light-weight and extensible ontology with high coverage and quality
• YAGO builds on entities and relations
• Currently contains more than 2 million entities, about 20 million facts, 92 relations and about 250 thousand classes
• Number of entities in YAGO

Individuals (without words and literals)   1,941,578
People                                     615,924
Locations                                  303,372
Institutions/companies                     30,508
Movies                                     39,851

Page 7: YAGO Ontology

YAGO introduction (2)

• Largest relations in YAGO

Relation        Facts
means           5,347,523
type            4,505,603
inLanguage      3,563,111
isCalled        2,185,860
familyNameOf    569,410
givenNameOf     568,852
bornOnDate      441,274
subClassOf      249,463

• YAGO has been automatically derived from Wikipedia and WordNet (a large lexical database of English)

Page 8: YAGO Ontology

YAGO introduction (3)

• YAGO has a precision of 95%
• Accuracy of some YAGO relations

Relation        Accuracy
subClassOf      97.7% ± 1.6%
type            94.5% ± 2.3%
familyNameOf    97.8% ± 1.8%
givenNameOf     97.6% ± 2.1%
bornInYear      93.1% ± 3.7%
diedInYear      98.7% ± 1.3%

Page 9: YAGO Ontology

YAGO model - structure

• YAGO is based on a very expressive data model, the YAGO model (OWL-Full is undecidable, OWL-Lite and OWL-DL cannot express relations between facts, RDFS has only primitive semantics)
• The YAGO model slightly extends RDFS
• All objects are entities, and two entities can stand in a relation
• The example below states that Tim Berners-Lee has the website http://www.w3.org/People/Berners-Lee

Tim_Berners-Lee hasWebsite http://www.w3.org/People/Berners-Lee

Page 10: YAGO Ontology

YAGO model – structure (2)

• Numbers, dates and strings are represented as entities
  Tim_Berners-Lee bornOnDate 1955-06-08
• It is possible to express that a certain word refers to a certain entity
  "Berners-Lee" means Tim_Berners-Lee (computer scientist)
  or
  "Berners-Lee" means 13926_Berners-Lee (asteroid)
• Similar entities are grouped into classes
  Tim_Berners-Lee type computer_scientist
• Classes are arranged in a taxonomic hierarchy
  computer_scientist subClassOf scientist

Page 11: YAGO Ontology

YAGO model – structure (3)

• Relations are also entities, so it is possible to represent transitivity and other properties of relations
• The example below states that subClassOf is a transitive relation
  subClassOf type transitiveRelation
• A triple (entity, relation, entity) is called a fact. The entities of a fact are called its arguments. Each fact has a fact identifier
  id0: Tim_Berners-Lee type computer_scientist
  id0 foundIn http://en.wikipedia.org/wiki/Tim_Berners-Lee
• n-ary relations
  Tim_Berners-Lee bornIn London foundIn http://en.wikipedia.org/wiki/Tim_Berners-Lee
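
The fact-identifier idea can be sketched in a few lines of Python. This is only an illustration, not YAGO's actual storage format: the class name FactStore, its methods and the "id0", "id1", ... numbering scheme are assumptions made for the example.

```python
class FactStore:
    """Toy store in which every fact gets an identifier that can itself be an argument."""

    def __init__(self):
        self._facts = {}   # fact identifier -> (subject, relation, object)
        self._ids = {}     # (subject, relation, object) -> fact identifier

    def add(self, subject, relation, obj):
        """Add a fact and return its identifier; adding the same triple twice reuses the id."""
        triple = (subject, relation, obj)
        if triple not in self._ids:
            fact_id = "id%d" % len(self._facts)   # hypothetical numbering scheme
            self._facts[fact_id] = triple
            self._ids[triple] = fact_id
        return self._ids[triple]

    def about(self, fact_id):
        """Return all facts whose first argument is the given fact identifier."""
        return [t for t in self._facts.values() if t[0] == fact_id]


store = FactStore()
born = store.add("Tim_Berners-Lee", "bornIn", "London")
# The n-ary statement becomes a second fact about the first one.
store.add(born, "foundIn", "http://en.wikipedia.org/wiki/Tim_Berners-Lee")
```

Because the identifier is an ordinary entity, a statement with more than two arguments reduces to a chain of binary facts.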

Page 12: YAGO Ontology

YAGO model - semantics

• The model defines a set of relations. It must contain at least: type, subClassOf, domain, range, subRelationOf
• A set of common entities (entities that are neither facts nor relations)
• A set of fact identifiers
• A set of possible facts

Page 13: YAGO Ontology

YAGO model – semantics (2)

• The YAGO model defines a rewrite system
• → reduces one set of facts to another set of facts
• Instead of →, a shorthand notation is used to say that if a set of facts contains the facts f1, …, fn, then the rewrite rule adds f to this set
• The rewrite system contains many rules, which can be found at http://suchanek.name/work/publications/jws2008.pdf
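
A rewrite system of this kind can be sketched as a loop that keeps adding derived facts until nothing new appears. The two rules below (transitivity, and type inheritance along subClassOf) are assumptions chosen for illustration; the full rule set is in the paper linked above. The fact "scientist subClassOf person" is hypothetical example data.

```python
def saturate(facts):
    """Apply the two illustrative rules until no new fact can be added (a fixpoint)."""
    facts = set(facts)
    while True:
        new = set()
        transitive = {s for (s, r, o) in facts
                      if r == "type" and o == "transitiveRelation"}
        for (x, r, y) in facts:
            # Assumed rule 1: (x r y), (y r z) and r transitive add (x r z).
            if r in transitive:
                for (y2, r2, z) in facts:
                    if r2 == r and y2 == y:
                        new.add((x, r, z))
            # Assumed rule 2: (x type c1), (c1 subClassOf c2) add (x type c2).
            if r == "type":
                for (c1, r2, c2) in facts:
                    if r2 == "subClassOf" and c1 == y:
                        new.add((x, "type", c2))
        if new <= facts:
            return facts
        facts |= new


base = {
    ("subClassOf", "type", "transitiveRelation"),
    ("computer_scientist", "subClassOf", "scientist"),
    ("scientist", "subClassOf", "person"),          # hypothetical fact
    ("Tim_Berners-Lee", "type", "computer_scientist"),
}
closed = saturate(base)
```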

Page 14: YAGO Ontology

The type relation

• Individuals for the YAGO ontology are taken from Wikipedia
• Every page title is a candidate to become an individual
• The category system in Wikipedia is used to establish the class of each individual
• Only conceptual categories are candidates for classes. The Noun Group Parser from LEILA is used to distinguish these categories from others
• The category name Fellows of the Royal Society of Arts is broken into a head and a post-modifier. If the head is a plural word, the category is conceptual
• The Pling-Stemmer from LEILA is used to reliably identify and stem plural words

Page 15: YAGO Ontology

The type relation (2)

• Wikipedia articles may contain an infobox
• Additionally, the type of infobox can provide information for the type relation
  Tim_Berners-Lee type person
• Currently, there are 220 infobox types for people
• The list of infobox types for people can be found at http://en.wikipedia.org/wiki/Category:People_infobox_templates

Page 16: YAGO Ontology

The subClassOf relation

• Almost every synset of WordNet becomes a class of YAGO (proper names known to WordNet are excluded)
• Each leaf category in Wikipedia becomes a class of YAGO
• The subClassOf hierarchy is taken from the hyponymy relation of WordNet
  Fellows of the Royal Society of Arts subClassOf colleague
• A special algorithm is used to connect the lower classes extracted from Wikipedia with the higher classes extracted from WordNet

Page 17: YAGO Ontology

The subClassOf relation (2)

Function wiki2wordnet(c)
Input: Wikipedia category name c
Output: WordNet synset
    head = headCompound(c)
    pre = preModifier(c)
    post = postModifier(c)
    head = stem(head)
    If there is a WordNet synset s for pre + head
        return s
    If there are WordNet synsets s1, ..., sn for head
    (ordered by their frequency for head)
        return s1
    fail
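
The function above can be turned into a runnable Python sketch under strong simplifications. Everything that replaces the real components is an assumption: the preposition list and split_category stand in for the noun group parser, stem is a naive substitute for the Pling-Stemmer, and TOY_SYNSETS is invented data standing in for WordNet (the synset names are hypothetical).

```python
PREPOSITIONS = {"of", "in", "from", "by", "for", "at", "with"}

# Toy stand-in for WordNet: word -> synsets ordered by frequency (hypothetical data).
TOY_SYNSETS = {
    "scientist": ["scientist"],
    "computer scientist": ["computer_scientist"],
    "fellow": ["fellow_man", "fellow_member"],
}


def split_category(name):
    """Split a category name into the words before the first preposition and the rest."""
    words = name.lower().split()
    for i, word in enumerate(words):
        if word in PREPOSITIONS:
            return words[:i], words[i:]
    return words, []


def stem(word):
    """Naive plural stemmer: strips a trailing 's' only."""
    return word[:-1] if word.endswith("s") else word


def wiki2wordnet(category, synsets=TOY_SYNSETS):
    """Return a synset name for the category, or None where the pseudocode fails."""
    front, _post = split_category(category)
    if not front:
        return None
    head = stem(front[-1])
    pre = " ".join(front[:-1])
    if pre:
        candidates = synsets.get(pre + " " + head)
        if candidates:
            return candidates[0]
    candidates = synsets.get(head)
    if candidates:
        return candidates[0]          # most frequent sense of the head
    return None
```

With this toy data, "Fellows of the Royal Society of Arts" resolves to the most frequent sense of "fellow", which is not the intended one; cases like this are why some mappings need manual correction.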

Page 18: YAGO Ontology

The subClassOf relation (3)

• About a dozen exceptions are corrected manually

http://wordnetweb.princeton.edu/perl/webwn

Page 19: YAGO Ontology

The means relation

• First, a class for each synset known to WordNet is introduced, and then the means relation is established between each word of the synset and the corresponding class
  colleague means fellow
• For each Wikipedia redirect, a corresponding means fact is introduced
  "Berners-Lee, Tim" means Tim_Berners-Lee
• givenNameOf and familyNameOf are sub-relations of means, and they are used on individuals that are persons
  "Tim" givenNameOf Tim_Berners-Lee
  "Berners-Lee" familyNameOf Tim_Berners-Lee
• The Name Parser from LEILA is used for the identification and decomposition of person names

Page 20: YAGO Ontology

Other relations

• Wikipedia infoboxes are a rich source of facts
  birth_date → bornOnDate
  birth_place → bornIn

Page 21: YAGO Ontology

Other relations (2)

• Relational Wikipedia categories are very useful because not every article has an infobox
• Simple heuristics are designed to exploit the category names
• Each heuristic is a pair of a regular expression and a relation
  (.*)laureates hasWonPrize
  Tim_Berners-Lee hasWonPrize Japan Prize
• Several heuristics can map to the same relation
  (.*)winners hasWonPrize
• For now, there are about 100 heuristics (74 of them with a precision above 95%)
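
A minimal sketch of how such (regular expression, relation) pairs could be applied to the categories of an article. The two patterns come from the slide; the function name, the data layout and the whitespace trimming are assumptions for the example.

```python
import re

# (regular expression, relation) pairs as on the slide.
HEURISTICS = [
    (re.compile(r"(.*)laureates"), "hasWonPrize"),
    (re.compile(r"(.*)winners"), "hasWonPrize"),
]


def facts_from_categories(entity, categories):
    """Generate (entity, relation, value) facts from the category names of an article."""
    facts = []
    for category in categories:
        for pattern, relation in HEURISTICS:
            match = pattern.fullmatch(category)
            if match:
                # Trim the space left between the captured name and the keyword.
                facts.append((entity, relation, match.group(1).strip()))
    return facts


found = facts_from_categories(
    "Tim_Berners-Lee", ["Japan Prize laureates", "1955 births"]
)
```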

Page 22: YAGO Ontology

Other relations (3)

• Precision of some of YAGO's heuristics:

Relation            Precision
hasExpenses         100.0% ± 0.0%
hasInflation        100.0% ± 0.0%
hasLaborForce       97.7% ± 0.0%
during              97.5% ± 1.838%
participatedIn      96.9% ± 3.056%
establishedInYear   96.8% ± 3.157%
createdOn           96.8% ± 3.157%
hasSuccessor        94.9% ± 4.804%
discovered          91.0% ± 5.702%

Page 23: YAGO Ontology

Other relations (4)

• Wikipedia has special categories that indicate the article name in other languages
  Tim Berners-Lee sr:Тим Бернерс-Ли
  Tim_Berners-Lee isCalled "Тим Бернерс-Ли" inLanguage Serbian

Page 24: YAGO Ontology

YAGO quality control

• Canonicalization makes each fact and each entity reference unique
• Type Checking eliminates individuals that do not have a class and facts that do not respect the domain and range constraints of their relation
• More about quality control: http://suchanek.name/work/publications/jws2008.pdf
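
The type-checking step can be illustrated with a simplified Python sketch. It assumes that domain and range constraints are given as plain dictionaries and it ignores the subClassOf hierarchy, which a real check would have to follow; the function name and the example facts are hypothetical.

```python
def type_check(facts, domains, ranges):
    """Keep type facts, and other facts whose arguments satisfy domain and range."""
    classes_of = {}
    for subject, relation, obj in facts:
        if relation == "type":
            classes_of.setdefault(subject, set()).add(obj)

    kept = []
    for subject, relation, obj in facts:
        if relation == "type":
            kept.append((subject, relation, obj))
            continue
        subject_classes = classes_of.get(subject)
        if not subject_classes:
            continue                      # individual without a class
        if relation in domains and domains[relation] not in subject_classes:
            continue                      # domain constraint violated
        if relation in ranges and ranges[relation] not in classes_of.get(obj, set()):
            continue                      # range constraint violated
        kept.append((subject, relation, obj))
    return kept


facts = [
    ("Tim_Berners-Lee", "type", "person"),
    ("London", "type", "city"),
    ("Tim_Berners-Lee", "bornIn", "London"),
    ("London", "bornIn", "Tim_Berners-Lee"),   # violates domain and range
    ("Nobody", "bornIn", "London"),            # subject has no class
]
checked = type_check(facts, {"bornIn": "person"}, {"bornIn": "city"})
```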

Page 25: YAGO Ontology

T-YAGO introduction

• YAGO is static, but the world is dynamic. New facts arise while some facts change over time
• T-YAGO (Timely YAGO) extends YAGO with temporal aspects
  Bruce_Willis isMarriedTo Demi_Moore
  Bruce_Willis isMarriedTo Emma_Heming
• A temporal fact is a relation with an associated validity time
• Temporal facts are particularly useful in the sports domain
• There is no other work on automatically constructing ontologies with specific consideration of temporal facts

Page 26: YAGO Ontology

Temporal facts

• Extracted from Wikipedia infoboxes, categories and lists in articles
• Most temporal facts cannot be represented directly: facts are limited to binary relations, while temporal facts have more than two arguments
• To support temporal facts, an n-ary fact is decomposed into a primary fact and several associated facts
• Dates are denoted in the standard format YYYY-MM-DD (ISO 8601)
• If only the year is known, dates are written in the form YYYY-##-##
• A temporal fact may be valid at a time point or within a time interval

Page 27: YAGO Ontology

Relation on

• The relation on is used to describe the validity time of temporal facts that are valid at a time point
  id0: Bruce_Willis hasWonPrize Emmy_Award
  id0 on 2000-##-##

Page 28: YAGO Ontology

Relations since and until

• The relations since and until are used for temporal facts that are valid during a time period
  id1: Bruce_Willis isMarriedTo Demi_Moore
  id1 since 1987-##-##
  id1 until 2000-##-##
• Sometimes it is impossible to extract accurate time points; in that case the earliest and latest possible time points are used
• Sometimes only the beginning or the end of a fact's validity is known, but not both; in that case the earliest and latest possible time interval is used

Page 29: YAGO Ontology

Relations since and until (2)

• For example, let's say that Bruce Willis and Demi Moore got married in March 1987 and divorced in 2000
  id1 since [1987-03-01, 1987-03-31]
  id1 until [2000-01-01, 2000-12-31]
• Regular expression matching is used for temporal fact extraction
• For higher coverage, additional elements in Wikipedia articles are analyzed (lists of awards, honors, medals)
• T-YAGO contains 300,000 temporal facts (mostly about sportsmen)
• All these temporal facts have been integrated into YAGO
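
The step from an incomplete date to the earliest and latest possible time point can be sketched as follows. The function name is hypothetical, and the month-only form YYYY-MM-## is an assumption extrapolated from the YYYY-##-## notation shown earlier.

```python
import calendar
from datetime import date


def to_interval(stamp):
    """Expand 'YYYY-MM-DD' with optional '##' wildcards into (earliest, latest) dates."""
    year_s, month_s, day_s = stamp.split("-")
    year = int(year_s)
    if month_s == "##":
        return date(year, 1, 1), date(year, 12, 31)
    month = int(month_s)
    if day_s == "##":
        last_day = calendar.monthrange(year, month)[1]   # number of days in the month
        return date(year, month, 1), date(year, month, last_day)
    exact = date(year, month, int(day_s))
    return exact, exact


since = to_interval("1987-03-##")   # married some time in March 1987
until = to_interval("2000-##-##")   # divorced some time in 2000
```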

Page 30: YAGO Ontology

YAGO2 introduction

• YAGO2 extends YAGO (with T-YAGO) with spatial aspects
• YAGO2 knows not only that a fact is true, but also when and where it was true
• The geographical location is a crucial property not just of physical entities such as countries, mountains, or rivers, but also of organization headquarters, or events such as battles, fairs, or people's births
• Spatial types and facts are gathered from Wikipedia and GeoNames (a geographical database)

Page 31: YAGO Ontology

YAGO2 introduction (2)

• YAGO2 has a precision of 95%
• Accuracy of some YAGO2 relations

Relation        Accuracy
hasLatitude     97.7% ± 2.3%
since           97.6% ± 2.3%
isMarriedTo     97.3% ± 2.7%
until           96.9% ± 3.0%
hasLongitude    96.4% ± 3.6%
happenedIn      95.9% ± 3.8%
locatedIn       95.9% ± 3.8%

Page 32: YAGO Ontology

YAGO2 introduction (3)

• YAGO2 contains more than 80 million facts for more than 10 million entities with GeoNames, and 33 million facts for 2.6 million entities without GeoNames
• Number of entities in YAGO2

People          882,534
Organizations   240,047
Locations       695,712 (7,569,708 incl. GeoNames)
Events          212,236
Other           631,065

• Extraction rules and temporal aspects are improved

Page 33: YAGO Ontology

Rules in YAGO2

• In YAGO, much of the extraction was done by hard-wired rules in the source code (this design does not allow easy extension)
• In YAGO2, extraction is done by declarative rules. There are 4 different types of rules
  – Factual rules
  – Implication rules
  – Replacement rules
  – Extraction rules

Page 34: YAGO Ontology

Factual rules

• Factual rules are simply additional facts, each expressed as a YAGO fact
• Factual rules are declarative translations of all the manually defined exceptions and facts that the previous YAGO code contained (definitions of all relations, their domains and ranges, and the definitions of the classes)
• The factual rules also add 3 new classes to the taxonomy:
  – yagoLegalActor (combines legal actors such as organizations and people)
  – yagoLegalActorGeo (the union of yagoLegalActor and geopolitical entities)
  – yagoGeoEntity (groups geographical locations)

Page 35: YAGO Ontology

Factual rules (2)

• 60 words are included together with their preferred meaning (for cases where the primary meaning as defined in WordNet is not suitable)
• For example, in WordNet the primary meaning of 'fellow' is a boy or a man. To use it in the sense of member, we would add the factual rule:
  "fellow" hasPreferredMeaning wordnet_fellow_09935990

Page 36: YAGO Ontology

Implication rules

• Implication rules say that if certain facts appear in YAGO2, then another fact shall be added
• An implication rule is also expressed as a YAGO fact
• The subject of the fact states the premise of the implication, and the object of the fact holds the conclusion, both as strings
• For example, if a relation is a sub-property of another relation, then all instances of the first relation are also instances of the second relation
  "$1 $2 $3; $2 subpropertyOf $4;" implies "$1 $4 $3"

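How an implication rule given as two strings might be evaluated can be sketched with a small pattern matcher. This is an illustration only: the helper names match and apply_rule are hypothetical, facts are assumed to be plain 3-tuples of strings without spaces, and the rule is applied once rather than to a fixpoint.

```python
def match(template, fact, binding):
    """Try to extend `binding` so that `template` equals `fact`; return None on failure."""
    binding = dict(binding)
    for term, value in zip(template, fact):
        if term.startswith("$"):
            # A variable: bind it, or check it against its earlier binding.
            if binding.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return binding


def apply_rule(premise, conclusion, facts):
    """Return the facts derived by one application of the rule `premise implies conclusion`."""
    templates = [tuple(part.split()) for part in premise.split(";") if part.strip()]
    bindings = [{}]
    for template in templates:
        extended = []
        for binding in bindings:
            for fact in facts:
                result = match(template, fact, binding)
                if result is not None:
                    extended.append(result)
        bindings = extended
    head = conclusion.split()
    return {tuple(b.get(term, term) for term in head) for b in bindings}


known = {
    ("Tim", "givenNameOf", "Tim_Berners-Lee"),
    ("givenNameOf", "subpropertyOf", "means"),
}
derived = apply_rule("$1 $2 $3; $2 subpropertyOf $4;", "$1 $4 $3", known)
```
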
Page 37: YAGO Ontology

Replacement rules

• Replacement rules say that if a part of the source text matches a specified regular expression, it should be replaced by a certain string
• Replacement rules are used for cleaning up HTML tags, interpreting micro-formats, normalizing numbers, and eliminating administrative Wikipedia categories and articles that we do not want to process (by replacing them with the empty string)
• Arguments are strings
  "\{\{USA\}\}" replace "[[United States]]"
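
In Python, a replacement rule maps naturally onto re.sub. The first rule below is the one from the slide; the second one, which deletes an administrative category, is a made-up example of replacing by the empty string, and the function name is hypothetical.

```python
import re

REPLACEMENT_RULES = [
    (r"\{\{USA\}\}", "[[United States]]"),        # rule from the slide
    (r"\[\[Category:Articles .*?\]\]", ""),       # hypothetical administrative category
]


def apply_replacements(text, rules=REPLACEMENT_RULES):
    """Apply every (regular expression, replacement string) rule in order."""
    for pattern, replacement in rules:
        # The lambda keeps the replacement literal, so backslashes in it are not interpreted.
        text = re.sub(pattern, lambda _match, r=replacement: r, text)
    return text


cleaned = apply_replacements("born in {{USA}} [[Category:Articles needing cleanup]]")
```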

Page 38: YAGO Ontology

Extraction rules

• Extraction rules say that if a part of the source text matches a specified regular expression, a sequence of facts shall be generated
• These rules apply to patterns found in the Wikipedia infoboxes, Wikipedia categories, article titles, headings, links, and references
  "\[\[Category:(.+) births\]\]" pattern "$0 bornOnDate Date($1)"
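
The rule above can be mimicked with a regular expression and string substitution. The sketch assumes that $0 stands for the entity of the article being processed and that $1 is the first capture group; Date(...) is left as an uninterpreted marker, and the function name is hypothetical.

```python
import re

RULE_REGEX = re.compile(r"\[\[Category:(.+) births\]\]")
RULE_TEMPLATE = "$0 bornOnDate Date($1)"


def extract(article_title, source_text):
    """Generate one fact string per match of the rule in the article source."""
    facts = []
    for match in RULE_REGEX.finditer(source_text):
        fact = RULE_TEMPLATE.replace("$0", article_title)   # assumed meaning of $0
        fact = fact.replace("$1", match.group(1))
        facts.append(fact)
    return facts


source = "Some text\n[[Category:1955 births]]\n[[Category:Living people]]"
extracted = extract("Tim_Berners-Lee", source)
```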

Page 39: YAGO Ontology

Temporal facts in YAGO2

• Temporal facts from T-YAGO are extended with the extraction time of facts
• It is possible to include or exclude facts from certain points in time
  id0: Bruce_Willis hasWonPrize Emmy_Award
  id1: id0 on 2000-##-##
  id1 extractedOn 2009-06-05

Page 40: YAGO Ontology

Temporal entities

• Entities are assigned a time span to denote their existence in time
• There are four temporal entity types:
  – People: bornOnDate and diedOnDate
  – Groups (music bands, football clubs, universities, companies): createdOnDate and destroyedOnDate
  – Artifacts (buildings, paintings, books, songs, albums): createdOnDate and destroyedOnDate
  – Events (wars, sports competitions, named epochs): startedOnDate and endedOnDate

Page 41: YAGO Ontology

Spatial information

• All physical objects have a location in space (countries, cities, mountains, rivers…)
• yagoGeoEntity is a class that groups together all entities with a permanent physical location on Earth (geo-entities)
• The subclasses of yagoGeoEntity are: location, body of water, geological formation, real property, facility, excavation, structure, track, way and land
• The position of a geo-entity can be described by geographical coordinates, consisting of latitude and longitude (geo-coordinates)

Page 42: YAGO Ontology

Spatial information (2)

• YAGO2 only knows about coordinates, not polygons, so even locations that have a physical extent are represented by a single geo-coordinate pair:
  – for a settlement like a city, it represents the center
  – for military and industrial establishments, it represents the main gate
  – for administrative districts, it represents the head office

Page 43: YAGO Ontology

Extracting geo-entities

• The main source for extracting geo-entities is Wikipedia
• Wikipedia contains a large number of cities, regions, mountains, rivers and lakes, and many of them have associated geographical coordinates
• For the others, GeoNames is used, which contains data on more than 7 million locations, information on location hierarchies and alternate names for each location

Page 44: YAGO Ontology

Extracting geo-entities (2)

• To avoid duplication of entities, the following procedure is used:

1. If the Wikipedia entity has the type yagoGeoEntity and shares its name with exactly one entity in GeoNames, they are matched
2. If the Wikipedia entity has the type yagoGeoEntity and shares its name with more than one entity in GeoNames, and there are coordinates for the Wikipedia entity, it is matched to the geographically closest GeoNames entity, provided that the distance does not exceed 5 km
3. All the unmatched GeoNames entities are added as new individual entities to YAGO, together with all the facts about them given in GeoNames

Belgrade in Wikipedia: 44°49′14″N 20°27′44″E
Belgrade in GeoNames: 44°48′14″N 20°27′54″E
The distance is less than 2 km (1.866 km)
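
The distance test in step 2 can be sketched with the haversine formula. The slides do not say which distance formula is used, so haversine with a mean Earth radius of 6371 km is an assumption, as are the function names and the candidate data layout.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0   # mean Earth radius (assumed)


def dms(degrees, minutes, seconds):
    """Convert degrees, minutes, seconds to decimal degrees."""
    return degrees + minutes / 60 + seconds / 3600


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def match_geo_entity(wiki_coords, candidates, max_km=5.0):
    """candidates: list of (geonames_id, (lat, lon)) sharing the name. Returns an id or None."""
    if len(candidates) == 1:
        return candidates[0][0]                       # step 1: unique name
    if not candidates or wiki_coords is None:
        return None
    best_id, best_coords = min(
        candidates, key=lambda c: haversine_km(*wiki_coords, *c[1])
    )
    if haversine_km(*wiki_coords, *best_coords) <= max_km:
        return best_id                                # step 2: closest within the limit
    return None


belgrade_wiki = (dms(44, 49, 14), dms(20, 27, 44))
belgrade_geonames = (dms(44, 48, 14), dms(20, 27, 54))
distance = haversine_km(*belgrade_wiki, *belgrade_geonames)
```

For the Belgrade coordinates this gives about 1.87 km, in line with the figure quoted above.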

Page 45: YAGO Ontology

4545

Extracting geo-entities (3)Extracting geo-entities (3)

• Each entity needs to be typed

• GeoNames assigns a class to each location, and these classes are used as types

• To avoid duplication of classes, they need to be matched with existing classes

• Matching works as follows:

1. For every class from GeoNames, the set of WordNet classes from YAGO that have the same name as the GeoNames class is identified

2. If there are no such classes, shallow noun-phrase parsing of the GeoNames class name is performed to determine the head noun, and YAGO2 is searched for classes that carry the head noun as their name

3. From the resulting YAGO2 classes, those that are not subclasses of yagoGeoEntity are removed

Page 46: YAGO Ontology


Extracting geo-entities (4)

4. If only a single class remains, it is returned as the matching class

5. If more than one class remains, the glosses that describe the GeoNames class and the candidate YAGO classes are compared. The glosses are tokenized, and the Jaccard similarity of the resulting bags of words is calculated between the GeoNames class gloss and each candidate's gloss. The class with the highest overlap is returned as the best match

6. If there is no overlap between the glosses at all, the YAGO2 class that is most often denoted by the name of the GeoNames class is returned; this information is taken from WordNet, which sorts the senses of each word in order of most common use

• GeoNames classes that were matched are added to YAGO2 as subclasses of the matched class; unmatched classes are added as subclasses of yagoGeoEntity
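Steps 4 to 6 of the class matching can be sketched as follows. This is a minimal illustration, not the YAGO2 implementation: the tokenizer is a simple regular expression, the bags of words are treated as sets, and the candidate list is assumed to be already filtered to subclasses of yagoGeoEntity and ordered by WordNet sense frequency. The class names and glosses in the usage example are made up.

```python
import re


def tokenize(gloss):
    """Lowercase a gloss and split it into a set of word tokens (assumed tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", gloss.lower()))


def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pick_class(geonames_gloss, candidates):
    """Choose the best YAGO class for a GeoNames class.

    candidates: list of (class_name, gloss), most common WordNet sense first.
    Returns a class name, or None if there are no candidates.
    """
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0][0]  # step 4: a single class remains
    target = tokenize(geonames_gloss)
    scored = [(jaccard(target, tokenize(gloss)), name) for name, gloss in candidates]
    best_score = max(score for score, _ in scored)
    if best_score == 0.0:
        return candidates[0][0]  # step 6: fall back to the most common sense
    # step 5: highest gloss overlap wins (first such candidate on ties)
    return next(name for score, name in scored if score == best_score)


# Hypothetical candidates for a GeoNames class named "bank"
candidates = [
    ("bank_institution", "a financial institution that accepts deposits"),
    ("bank_slope", "sloping land beside a body of water"),
]
print(pick_class("land beside a river or lake", candidates))
```

Returning None for an empty candidate list corresponds to the unmatched case, where the GeoNames class is added directly under yagoGeoEntity.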

Page 47: YAGO Ontology


Spatial entities

• Many entities are associated with a location

• There are three spatial entity types:

– Events that took place at a specific location, such as battles or sports competitions; the relation happenedIn holds the place where the event happened.

– Groups or organizations that have a venue, such as the headquarters of a company or the campus of a university. The location for such entities is given by the isLocatedIn relation.

– Artifacts that are physically located somewhere, like the Mona Lisa in the Louvre, where the location is again given by isLocatedIn.

• isLocatedIn and happenedIn are defined as sub-properties of the relation placedIn
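The effect of the sub-property declaration can be illustrated with a small sketch: every isLocatedIn or happenedIn fact also implies a placedIn fact. The triple representation and the function name are assumptions made for illustration, not YAGO2's storage format.

```python
# Sub-property hierarchy as described on the slide
SUB_PROPERTY_OF = {
    "isLocatedIn": "placedIn",
    "happenedIn": "placedIn",
}


def with_inferred(facts):
    """Return the facts plus every fact implied by the sub-property hierarchy."""
    inferred = set(facts)
    for subject, relation, obj in facts:
        parent = SUB_PROPERTY_OF.get(relation)
        while parent is not None:  # walk up the hierarchy, if it has several levels
            inferred.add((subject, parent, obj))
            parent = SUB_PROPERTY_OF.get(parent)
    return inferred


facts = {("Mona_Lisa", "isLocatedIn", "Louvre")}
for fact in sorted(with_inferred(facts)):
    print(fact)
```

A query over placedIn then covers events, groups, and artifacts without naming each specific relation.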

Page 48: YAGO Ontology


Spatial facts

• Some facts also have a spatial dimension

id0: Bruce_Willis hasWonPrize Emmy_Award

id0 in Los_Angeles

• There are three cases in which an ontologically meaningful location can be deduced:

– Permanent relations describe properties of entities that are immutable

– Space-bound relations cover facts that occur in a place that is indicated by their subject or object

– Tandem relations are relations where one relation determines the location of the other
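The id0 example above relies on facts having identifiers, so that a second fact can say where the first one holds. A minimal sketch of that idea follows; the Fact type, the identifier id1, and the lookup function are hypothetical, and only the two facts themselves come from the slide.

```python
from typing import NamedTuple, Optional


class Fact(NamedTuple):
    fact_id: str
    subject: str   # an entity, or the id of another fact
    relation: str
    obj: str


facts = [
    Fact("id0", "Bruce_Willis", "hasWonPrize", "Emmy_Award"),
    Fact("id1", "id0", "in", "Los_Angeles"),  # "id1" is an assumed identifier
]


def location_of(fact_id: str, all_facts: list) -> Optional[str]:
    """Return the location attached to a fact via the 'in' relation, if any."""
    for fact in all_facts:
        if fact.subject == fact_id and fact.relation == "in":
            return fact.obj
    return None


print(location_of("id0", facts))
```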

Page 49: YAGO Ontology


Thanks for your time

• Any questions (hopefully in Serbian)?