Unsupervised Learning of an
Extensive and Usable Taxonomy
for DBpedia
Presented by Claus Stadler
Vienna, 17th September 2015
Marco Fossati, Dimitris Kontokostas, and Jens Lehmann
Outline
I. Motivation
II. Research Contribution
III. Problem & Solution
IV. Approach
V. Results & Evaluation
VI. Advantages & Drawbacks
VII. Conclusion & Future Work
{1} Motivation
DBpedia
General-purpose Knowledge Base
Central hub for Linked Data
DBpedia ontology (DBPO)
Heterogeneous granularity: Organisation, Band, SambaSchool???
Lack of coverage: 2.8 M typed resources out of 4.9 M
Wikipedia Category System
Chaotic: cycles
Too high granularity: "Radio Stations in Traverse City, Michigan"
{2} Research Contributions
Exhaustive type coverage
Focus on usability
Replication across language chapters
Unsupervised approach
{3} Problem & Solution
Problem
Untyped resources in DBpedia (coverage)
Total entries = 4.9 million
Typed entries = 2.8 million
Unbalanced DBPO
Solution
Wikipedia Category System
Almost complete knowledge backbone
Identify a prominent subset
Learn a type taxonomy
{4} Approach
1. Leaf Node Extraction
2. Prominent Node Discovery
3. Type Taxonomy Generation (T-Box)
4. Pages Type Assignment (A-Box)
Stage 1: Leaf Node Extraction
INPUT = cyclic graph; OUTPUT = tree
Bottom-up approach: from leaves to the root
Extract categories linked to actual articles only
Set of categories with no sub-categories = Leaf Nodes Set
Example categories: Inuit_deities, Ugandan_monarchies, Inuit_goddesses
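A minimal Python sketch of the leaf test described on this slide. It assumes the category graph is available as two dictionaries (category to sub-categories, category to article pages); these structures, the function name `extract_leaf_nodes`, the toy hierarchy, and the page names are all hypothetical and only illustrate the idea.

```python
def extract_leaf_nodes(subcategories, articles):
    """Return categories that link to at least one article and have no sub-categories."""
    leaves = set()
    for category, pages in articles.items():
        has_children = bool(subcategories.get(category))
        if pages and not has_children:
            leaves.add(category)
    return leaves


# Toy data: category names come from the slides; the hierarchy and pages are made up.
subcategories = {
    "Inuit_deities": ["Inuit_goddesses"],
    "Inuit_goddesses": [],
    "Ugandan_monarchies": [],
    "Empty_category": [],
}
articles = {
    "Inuit_deities": ["Page_A"],
    "Inuit_goddesses": ["Page_B"],
    "Ugandan_monarchies": ["Page_C"],
    "Empty_category": [],  # no article pages, so it is skipped
}

leaf_nodes = extract_leaf_nodes(subcategories, articles)
print(sorted(leaf_nodes))
```

In this toy graph, Inuit_deities is excluded because it has a sub-category, and Empty_category is excluded because it links to no article.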
Stage 2: Prominent Node Discovery
(A) Leaf Graph Traversal
(B) Natural Language Processing for is-a relations
(C) Interlanguage Links Weight
Stage 2A: Leaf Graph Traversal
INPUT = leaf nodes set
For each leaf L:
  Get parents
  For each parent P:
    Are all its children leaves?
      YES: P is a prominent node
      NO: L is a prominent node
Example categories: Inuit_goddesses, Inuit_deities, Ugandan_monarchies
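The traversal above can be sketched in Python as follows. This is one possible reading of the slide, under the assumption that a leaf is kept as prominent only when none of its parents qualifies. The dictionaries, the function name `discover_prominent_nodes`, and the categories `Monarchies_by_country` and `African_monarchies` are hypothetical.

```python
def discover_prominent_nodes(leaves, parents, children):
    """Promote a parent whose children are all leaves; otherwise keep the leaf itself."""
    prominent = set()
    for leaf in leaves:
        promoted = False
        for parent in parents.get(leaf, []):
            if all(child in leaves for child in children.get(parent, [])):
                prominent.add(parent)
                promoted = True
        # Assumption: the leaf is prominent only if no parent was promoted.
        if not promoted:
            prominent.add(leaf)
    return prominent


leaves = {"Inuit_goddesses", "Ugandan_monarchies"}
parents = {
    "Inuit_goddesses": ["Inuit_deities"],
    "Ugandan_monarchies": ["Monarchies_by_country"],  # hypothetical parent
}
children = {
    "Inuit_deities": ["Inuit_goddesses"],
    # The second child is not a leaf, so this parent does not qualify.
    "Monarchies_by_country": ["Ugandan_monarchies", "African_monarchies"],
}

prominent = discover_prominent_nodes(leaves, parents, children)
print(sorted(prominent))
```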
Stage 2B: NLP for is-a relations
Category = Noun Phrase (NP)
HEAD extraction
Shallow syntactic parsing
Is the HEAD plural?
YES: class candidate;
Depluralize
Examples: Deity, Monarchy
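A rough Python sketch of the head extraction and depluralization step. It replaces the shallow syntactic parser with a naive heuristic (the head is the last token before the first preposition) and uses simple suffix rules for plurals; both are assumptions made for illustration, not the parser used by the authors, and all function names are hypothetical.

```python
PREPOSITIONS = {"in", "of", "from", "by", "for", "at", "on", "with"}


def extract_head(category):
    """Naive head finder: last token before the first preposition."""
    tokens = [t for t in category.replace(",", "").split("_") if t]
    head_tokens = []
    for token in tokens:
        if token.lower() in PREPOSITIONS:
            break
        head_tokens.append(token)
    return head_tokens[-1] if head_tokens else None


def is_plural(word):
    """Crude plural test based on the final letters only."""
    lower = word.lower()
    return len(lower) > 3 and lower.endswith("s") and not lower.endswith("ss")


def depluralize(word):
    """Crude singularization using three suffix rules."""
    lower = word.lower()
    if lower.endswith("ies"):
        return word[:-3] + "y"
    if lower.endswith(("sses", "xes", "ches", "shes")):
        return word[:-2]
    return word[:-1]


def class_candidate(category):
    """Return a singular class label if the head is plural, else None."""
    head = extract_head(category)
    if head is None or not is_plural(head):
        return None
    return depluralize(head).capitalize()


for name in ["Inuit_deities", "Ugandan_monarchies", "Elvis_Presley"]:
    print(name, "->", class_candidate(name))
```

Suffix rules like these misfire on irregular plurals and on singular nouns ending in "s", which is one reason a real parser is preferable.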
Stage 2C: Interlanguage Links Weight
The more interlanguage links a category has, the more it is used across language editions
Prune categories with interlanguage links < Threshold
Threshold = 3
Example categories: Inuit_deities, Ugandan_monarchies
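The pruning rule can be sketched as a simple filter. The threshold of 3 is the value from the slide; the link counts, the category `Obscure_local_category`, and the function name are made up for illustration.

```python
THRESHOLD = 3  # value stated on the slide


def prune_by_interlanguage_links(link_counts, threshold=THRESHOLD):
    """Keep categories whose interlanguage link count is at least the threshold."""
    return {cat for cat, count in link_counts.items() if count >= threshold}


# Made-up counts, for illustration only.
link_counts = {
    "Inuit_deities": 5,
    "Ugandan_monarchies": 3,
    "Obscure_local_category": 1,
}

kept = prune_by_interlanguage_links(link_counts)
print(sorted(kept))
```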
Stage 3: T-Box
Cycle removal
Breadth-first, top-down
Short paths
Instance pruning
Example: Monarchy < owl#Thing
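One way to read the "cycle removal", "breadth-first, top-down" and "short paths" bullets is a breadth-first traversal from the root that keeps only the first (shortest) path to each class and drops any edge leading back to an already visited class. The sketch below follows that assumption; instance pruning is not modeled, and the toy edges are made up.

```python
from collections import deque


def build_taxonomy(root, children):
    """Breadth-first, top-down: each class keeps the parent on its shortest path."""
    parent_of = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in parent_of:  # skips back edges (cycles) and longer paths
                parent_of[child] = node
                queue.append(child)
    return parent_of


def path_to_root(node, parent_of):
    """Follow parent links from a class up to the root."""
    path = [node]
    while parent_of[path[-1]] is not None:
        path.append(parent_of[path[-1]])
    return path


# Toy graph with one cycle (Band -> Organisation) and one longer path to Monarchy.
children = {
    "owl#Thing": ["Monarchy", "Organisation"],
    "Organisation": ["Band", "Monarchy"],
    "Band": ["Organisation"],
}

parent_of = build_taxonomy("owl#Thing", children)
print(" < ".join(path_to_root("Monarchy", parent_of)))
```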
Stage 4: A-Box
INPUT = prominent nodes heads
For each prominent node head H:
  Extract the category set with head = H
  Extract the page set for each category
  For each page P:
    Is it an article page?
      YES: < P, instance-of, H >
      NO: Repeat until it is
Example: < Bengal_Sultanate, instance-of, Monarchy >
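The assignment loop can be sketched in Python as below. The sketch assumes that a non-article member of a category is a sub-category to descend into, which is one interpretation of "Repeat until it is". The data structures, the `Category:` prefix convention, the function name, and the intermediate categories are hypothetical.

```python
def assign_types(head_to_categories, category_members, is_article):
    """Emit (page, 'instance-of', head) for every article reachable from the head's categories."""
    assertions = set()
    for head, categories in head_to_categories.items():
        stack = list(categories)
        seen = set()
        while stack:
            category = stack.pop()
            if category in seen:  # guard against cycles in the category graph
                continue
            seen.add(category)
            for member in category_members.get(category, []):
                if is_article(member):
                    assertions.add((member, "instance-of", head))
                else:
                    stack.append(member)  # sub-category: keep descending
    return assertions


# Toy data: only Monarchy and Bengal_Sultanate come from the slide.
head_to_categories = {
    "Monarchy": ["Category:Ugandan_monarchies", "Category:Hypothetical_monarchies"],
}
category_members = {
    "Category:Ugandan_monarchies": ["Page_C"],
    "Category:Hypothetical_monarchies": ["Category:Hypothetical_sultanates"],
    "Category:Hypothetical_sultanates": ["Bengal_Sultanate"],
}

assertions = assign_types(
    head_to_categories,
    category_members,
    is_article=lambda page: not page.startswith("Category:"),
)
print(sorted(assertions))
```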
{5} Results & Evaluation
Data
T-Box (classes)  A-Box (assertions)  Typed resources
1,902            10,729,507          4,260,530
Coverage
System Ratio
DBPO 0.513
MENTA 0.537
SDType 0.147
YAGO 0.673
WiBi 0.794
DBTax 0.994
T-Box Evaluation: Settings
50 random classes
10 evaluators (peers)
Resource namespaces hidden to avoid bias
T-Box Evaluation: Questions (1/2)
• “Is this a class or an instance?” Restaurant VS Puella_Magi_Madoka_Magica (movie)
• “Can this class be broken down into more than one class?” Mountain VS Musical_groups_from_Gothenburg
• “Is this a valid class hierarchy path?” wikicategory_Golden_Bear_winners < yagoLegalActorGeo < owl#Thing
T-Box Evaluation: Questions (2/2)
• “Is this hierarchy too specific?” (too many levels) Porter_County,_Indiana < Chicago_metropolitan_area < Metropolitan_areas_of_Illinois < Populated_places_in_Illinois < owl#Thing
• “Is this hierarchy too broad?” (very few levels) Gonorynchiforme (fish family) < owl:Thing
T-Box Evaluation: Results
Resource   Classes  Non-breakable classes  Valid paths  Not too specific paths  Not too broad paths
DBPO       0.66     0.67                   0.89         0.97                    0.84
YAGO       0.90     0.38                   0.81         0.55                    0.93
WiBi       0.75     0.38                   0.73         0.41                    0.85
Wikidata   0.19     0.48                   0.85         0.66                    0.88
Wikipedia  0.81     0.29                   0.66         0.77                    0.78
DBTax      0.65     0.76                   0.77         0.98                    0.40
A-Box Evaluation: Settings
Crowdsourced to the layman
Evaluation set: 500 random entities with no type in DBpedia
5 judgments per entity
Prevent a worker from answering the same question twice
A-Box Evaluation: Test Questions
Automatically discard untrusted judgments
Untrusted worker: < 80% correct test questions
Subjective task: missed test questions
They affect # of untrusted judgments
The class label may be ambiguous
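The trust rule can be illustrated with a small filter. The 80% cut-off is the one stated on the slide; the record layout, worker identifiers, accuracies, and function name are hypothetical and do not reflect the crowdsourcing platform's actual data model.

```python
MIN_ACCURACY = 0.8  # workers below this share of correct test questions are untrusted


def trusted_judgments(judgments, test_accuracy, min_accuracy=MIN_ACCURACY):
    """Discard judgments from workers whose test-question accuracy is below the cut-off."""
    return [
        j for j in judgments
        if test_accuracy.get(j["worker"], 0.0) >= min_accuracy
    ]


# Made-up workers and judgments, for illustration only.
test_accuracy = {"w1": 0.95, "w2": 0.60, "w3": 0.80}
judgments = [
    {"worker": "w1", "entity": "E1", "answer": "yes"},
    {"worker": "w2", "entity": "E1", "answer": "no"},
    {"worker": "w3", "entity": "E2", "answer": "yes"},
]

kept_judgments = trusted_judgments(judgments, test_accuracy)
print(len(judgments) - len(kept_judgments), "untrusted judgment(s) discarded")
```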
A-Box Evaluation: Results
Resource  Precision  Recall  F1     Untrusted judgments
Wikidata  0.808      0.982   0.886  1,847
MENTA     0.793      0.589   0.675  1,093
SDType    0.924      0.098   0.178  1,723
YAGO      0.461      0.727   0.565  1,358
WiBi      0.858      0.597   0.704  2,075
DBTax     0.744      1       0.853  518
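As a consistency check, the F1 column can be recomputed from the reported precision and recall using the standard harmonic-mean formula. The sketch below assumes that this is how F1 was obtained; because the inputs are already rounded to three decimals, small differences in the last digit are expected.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as listed in the A-Box results.
reported = {
    "Wikidata": (0.808, 0.982, 0.886),
    "MENTA": (0.793, 0.589, 0.675),
    "SDType": (0.924, 0.098, 0.178),
    "YAGO": (0.461, 0.727, 0.565),
    "WiBi": (0.858, 0.597, 0.704),
    "DBTax": (0.744, 1.0, 0.853),
}

max_gap = max(abs(f1_score(p, r) - f1) for p, r, f1 in reported.values())
for name, (p, r, f1) in reported.items():
    print(f"{name}: recomputed {f1_score(p, r):.3f}, reported {f1:.3f}")
```

The recomputed values agree with the reported ones to within about 0.001.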
{6} Advantages & Drawbacks
Advantages
Exhaustive coverage (almost 100%)
Type coverage comparison
Recall in A-Box evaluation
Intuitive
Crowdsourced (the layman) A-Box evaluation
Least # of untrusted judgments
Drawbacks
Short hierarchy paths
Cycle removal
Instance pruning
Relatively low precision
NLP may still yield "weird" is-a relations
"Elvis Presley is a Burial"
{7} Conclusion & Future Work
Conclusion
Significant type coverage leap
Intuitive for end users
Balance between DBPO (too generic) and YAGO (too specific)
Integrated in the latest DBpedia release
Future Work
Merge the T-Box into mappings.dbpedia.org for curation
Word Sense Disambiguation for homonymous classes
Multilingual deployment (currently English and Italian)
DBTax in action at the Italian DBpedia chapter
http://it.dbpedia.org/resource/Elvis_Presley
Thanks for your attention!
Download DBTax at:
http://downloads.dbpedia.org/current/core-i18n/en/
Browse the Italian DBTax at: http://it.dbpedia.org/sparql
Contact the first author at: [email protected]