
Page 1: Unsupervised Learning of an Extensive and Usable Taxonomy for DBpedia

Unsupervised Learning of an Extensive and Usable Taxonomy for DBpedia

Presented by Claus Stadler

Vienna, 17th September 2015

Marco Fossati, Dimitris Kontokostas, and Jens Lehmann


Outline

I. Motivation

II. Research Contributions

III. Problem & Solution

IV. Approach

V. Results & Evaluation

VI. Advantages & Drawbacks

VII. Conclusion & Future Work


{1} Motivation


DBpedia

General-purpose Knowledge Base

Central hub for Linked Data


DBpedia ontology (DBPO)

Heterogeneous granularity: Organisation, Band, SambaSchool???

Lack of coverage: 2.8 M typed resources out of 4.9 M

Wikipedia Category System

Chaotic: cycles

Too high granularity: "Radio Stations in Traverse City, Michigan"


{2} Research Contributions


Exhaustive type coverage

Focus on usability

Replication across language chapters

Unsupervised approach


{3} Problem & Solution


Problem

Untyped resources in DBpedia (coverage)

Total entries = 4.9 million

Typed entries = 2.8 million

Unbalanced DBPO


Solution

Wikipedia Category System

Almost complete knowledge backbone

Identify a prominent subset

Learn a type taxonomy


{4} Approach


1. Leaf Node Extraction

2. Prominent Node Discovery

3. Type Taxonomy Generation (T-Box)

4. Pages Type Assignment (A-Box)


Stage 1: Leaf Node Extraction

INPUT = cyclic graph; OUTPUT = tree

Bottom-up approach: from leaves to the root

Extract categories linked to actual articles only

Leaf nodes set = set of categories with no sub-categories

Figure: Inuit_deities, Ugandan_monarchies, Inuit_goddesses
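The leaf selection described on this slide can be sketched as follows. This is a minimal illustration, not the DBTax code: the two dictionaries `subcats` (category to sub-categories) and `articles` (category to article pages), the function name, and all page names are assumptions made for the example.

```python
def extract_leaves(subcats, articles):
    """Return categories that hold at least one article and have no sub-categories."""
    return {
        cat
        for cat, pages in articles.items()
        if pages and not subcats.get(cat)
    }


# Toy data (hypothetical); category names are borrowed from the slide.
subcats = {
    "Inuit_deities": {"Inuit_goddesses"},
    "Inuit_goddesses": set(),
    "Ugandan_monarchies": set(),
    "Hypothetical_empty_category": set(),
}
articles = {
    "Inuit_deities": {"Page_A"},
    "Inuit_goddesses": {"Page_B"},
    "Ugandan_monarchies": {"Page_C"},
    "Hypothetical_empty_category": set(),  # no articles, so it is skipped
}

leaves = extract_leaves(subcats, articles)
print(sorted(leaves))
```

In this toy graph, Inuit_deities is excluded because it has a sub-category, and the empty category is excluded because no article links to it.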


Stage 2: Prominent Node Discovery

(A) Leaf Graph Traversal

(B) Natural Language Processing for is-a relations

(C) Interlanguage Links Weight


Stage 2A: Leaf Graph Traversal

INPUT = leaf nodes set

For each leaf L:

    Get parents;

    For each parent P:

        Are all its children leaves?

        YES: P is a prominent node

        NO: L is a prominent node

Figure: Inuit_goddesses, Inuit_deities, Ugandan_monarchies
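The traversal above can be written as a short function. This is a sketch under assumptions: the `parents` and `children` dictionaries, the function name, and the `Hypothetical_*` categories are invented for illustration. The slide does not say what happens to a leaf with no parents, so the sketch leaves such leaves out.

```python
def prominent_nodes(leaves, parents, children):
    """For each leaf, promote the parent if all of the parent's children are
    leaves; otherwise the leaf itself becomes the prominent node."""
    prominent = set()
    for leaf in leaves:
        for parent in parents.get(leaf, ()):
            if all(child in leaves for child in children.get(parent, ())):
                prominent.add(parent)
            else:
                prominent.add(leaf)
    return prominent


# Toy graph (hypothetical), reusing the category names from the slide.
leaves = {"Inuit_goddesses", "Ugandan_monarchies"}
parents = {
    "Inuit_goddesses": {"Inuit_deities"},
    "Ugandan_monarchies": {"Hypothetical_parent"},
}
children = {
    "Inuit_deities": {"Inuit_goddesses"},
    "Hypothetical_parent": {"Ugandan_monarchies", "Hypothetical_non_leaf"},
}

result = prominent_nodes(leaves, parents, children)
print(sorted(result))
```

Here Inuit_deities is promoted because its only child is a leaf, while Ugandan_monarchies stays prominent itself because its parent also has a non-leaf child.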


Stage 2B: NLP for is-a relations

Category = Noun Phrase (NP)

HEAD extraction

Shallow syntactic parsing

Is the HEAD plural?

YES: class candidate; depluralize

Figure: Deity, Monarchy
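A rough sketch of the head extraction and depluralization idea follows. It swaps the shallow syntactic parser mentioned on the slide for a deliberately naive heuristic: the head is taken to be the token before the first preposition, and plurals are detected by suffix. The preposition list, the suffix rules, and all function names are assumptions; a real pipeline would rely on a proper parser and lemmatizer.

```python
# Naive stand-in for shallow parsing (assumption, not the authors' tooling).
PREPOSITIONS = {"in", "of", "from", "by", "at", "on", "for", "with"}


def head_of(category):
    """Return the token just before the first preposition, else the last token."""
    tokens = category.replace(",", "").split("_")
    for i, token in enumerate(tokens):
        if i > 0 and token.lower() in PREPOSITIONS:
            return tokens[i - 1]
    return tokens[-1]


def depluralize(noun):
    """Return a singular form for regular English plurals, or None if not plural."""
    lower = noun.lower()
    if lower.endswith("ies") and len(lower) > 3:
        return noun[:-3] + "y"
    if lower.endswith(("sses", "xes", "zes", "ches", "shes")):
        return noun[:-2]
    if lower.endswith("s") and not lower.endswith("ss"):
        return noun[:-1]
    return None


def class_candidate(category):
    """Return a class label if the category head is plural, else None."""
    singular = depluralize(head_of(category))
    return singular.capitalize() if singular else None


for name in ("Inuit_deities", "Ugandan_monarchies",
             "Musical_groups_from_Gothenburg", "History_of_France"):
    print(name, "->", class_candidate(name))
```

The suffix rules mislabel irregular nouns and singular words ending in "s", which is one way "weird" is-a relations can slip through.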


Stage 2C: Interlanguage Links Weight

The more interlanguage links a category has, the more it is used across language editions

Prune categories whose number of interlanguage links is below the threshold

Threshold = 3

Figure: Inuit_deities, Ugandan_monarchies
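The pruning rule amounts to a simple filter. In the sketch below only the threshold value of 3 comes from the slide; the link counts, the function name, and the `Hypothetical_rare_category` entry are made up for illustration.

```python
THRESHOLD = 3  # value stated on the slide


def prune_by_interlanguage_links(link_counts, threshold=THRESHOLD):
    """Keep only categories with at least `threshold` interlanguage links."""
    return {cat for cat, count in link_counts.items() if count >= threshold}


# Hypothetical counts, not real Wikipedia figures.
link_counts = {
    "Inuit_deities": 5,
    "Ugandan_monarchies": 3,
    "Hypothetical_rare_category": 1,
}

kept = prune_by_interlanguage_links(link_counts)
print(sorted(kept))
```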


Stage 3: T-Box

Cycle removal

Breadth-first, top-down

Short paths

Instance pruning

Example: Monarchy < owl#Thing
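One way to read "cycle removal, breadth-first, top-down" is a breadth-first traversal from the root in which each class keeps only the first parent that reaches it. That drops back-edges and yields the shortest path to the root. The sketch below illustrates that reading only; the `subclasses` dictionary, the function names, and the `Hypothetical_*` classes are assumptions, and instance pruning is not shown.

```python
from collections import deque


def build_tree(root, subclasses):
    """Breadth-first, top-down pass: map each reachable class to one parent.

    A class already reached is never re-attached, so cycles disappear and
    every class keeps a shortest path to the root.
    """
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in sorted(subclasses.get(node, ())):
            if child not in parent:
                parent[child] = node
                queue.append(child)
    return parent


def path_to_root(node, parent):
    """Return the hierarchy path from `node` up to the root."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


# Toy graph (hypothetical) with a cycle: Hypothetical_A points back to Monarchy.
subclasses = {
    "owl#Thing": {"Monarchy"},
    "Monarchy": {"Hypothetical_A"},
    "Hypothetical_A": {"Monarchy", "Hypothetical_B"},
    "Hypothetical_B": set(),
}

tree = build_tree("owl#Thing", subclasses)
print(" < ".join(path_to_root("Hypothetical_B", tree)))
```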


Stage 4: A-Box

INPUT = prominent nodes heads

For each prominent node head H:

    Extract the category set with head = H

    Extract the page set for each category;

    For each page P:

        Is it an article page?

        YES: < P, instance-of, H >

        NO: Repeat until it is

Example: < Bengal_Sultanate, instance-of, Monarchy >
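The assignment loop can be sketched as below. The sketch interprets "repeat until it is" as descending into sub-category pages until article pages are reached, which is an assumption. The `Category:` prefix test, the dictionaries, the function names, and the `Hypothetical_*` entries are likewise illustrative.

```python
def assign_types(head_to_categories, members, is_article):
    """Emit (page, 'instance-of', head) triples for every article reachable
    from the categories that share a given head."""
    triples = set()
    for head, categories in head_to_categories.items():
        stack = list(categories)
        seen = set()
        while stack:
            category = stack.pop()
            if category in seen:
                continue  # guard against cycles in the category graph
            seen.add(category)
            for page in members.get(category, ()):
                if is_article(page):
                    triples.add((page, "instance-of", head))
                else:
                    stack.append(page)  # a sub-category: keep descending
    return triples


def is_article(page):
    # Assumption: non-article pages carry a "Category:" prefix.
    return not page.startswith("Category:")


# Toy data (hypothetical), built around the example from the slide.
head_to_categories = {"Monarchy": {"Category:Hypothetical_monarchies"}}
members = {
    "Category:Hypothetical_monarchies": {"Bengal_Sultanate", "Category:Hypothetical_sub"},
    "Category:Hypothetical_sub": {"Hypothetical_article"},
}

triples = assign_types(head_to_categories, members, is_article)
for triple in sorted(triples):
    print(triple)
```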


{5} Results & Evaluation


Data

T-Box (classes): 1,902

A-Box (assertions): 10,729,507

Typed resources: 4,260,530


Coverage

System   Ratio
DBPO     0.513
MENTA    0.537
SDType   0.147
YAGO     0.673
WiBi     0.794
DBTax    0.994


T-Box Evaluation: Settings

50 random classes

10 evaluators (peers)

Resource namespaces hidden to avoid bias


T-Box Evaluation: Questions (1/2)

• “Is this a class or an instance?” Restaurant VS Puella_Magi_Madoka_Magica (movie)

• “Can this class be broken down into more than one class?” Mountain VS Musical_groups_from_Gothenburg

• “Is this a valid class hierarchy path?” wikicategory_Golden_Bear_winners < yagoLegalActorGeo < owl#Thing


T-Box Evaluation: Questions (2/2)

• “Is this hierarchy too specific?” (too many levels) Porter_County,_Indiana < Chicago_metropolitan_area < Metropolitan_areas_of_Illinois < Populated_places_in_Illinois < owl#Thing

• “Is this hierarchy too broad?” (very few levels) Gonorynchiforme (fish family) < owl:Thing


T-Box Evaluation: Results

Resource   Classes  Non-breakable classes  Valid paths  Not too specific paths  Not too broad paths
DBPO       0.66     0.67                   0.89         0.97                    0.84
YAGO       0.90     0.38                   0.81         0.55                    0.93
WiBi       0.75     0.38                   0.73         0.41                    0.85
Wikidata   0.19     0.48                   0.85         0.66                    0.88
Wikipedia  0.81     0.29                   0.66         0.77                    0.78
DBTax      0.65     0.76                   0.77         0.98                    0.40


A-Box Evaluation: Settings

Crowdsourced to the layman

Evaluation set: 500 random entities with no type in DBpedia

5 judgments per entity

Prevent a worker from answering the same question twice


A-Box Evaluation: Test Questions

Automatically discard untrusted judgments

Untrusted worker: < 80% correct test questions

Subjective task: missed test questions

They affect # of untrusted judgments

The class label may be ambiguous
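The trust filter described above can be sketched in a few lines. Only the 80% cut-off comes from the slide; the worker identifiers, accuracy figures, judgment tuples, and function name are hypothetical, and the crowdsourcing platform's own mechanism may differ.

```python
def trusted_judgments(judgments, test_accuracy, min_accuracy=0.80):
    """Drop judgments from workers who answered fewer than 80% of the
    test questions correctly."""
    return [
        (worker, entity, answer)
        for worker, entity, answer in judgments
        if test_accuracy.get(worker, 0.0) >= min_accuracy
    ]


# Hypothetical workers and judgments.
test_accuracy = {"worker_a": 0.95, "worker_b": 0.60, "worker_c": 0.80}
judgments = [
    ("worker_a", "Entity_1", True),
    ("worker_b", "Entity_1", False),  # discarded: worker_b is untrusted
    ("worker_c", "Entity_2", True),
]

kept = trusted_judgments(judgments, test_accuracy)
untrusted_count = len(judgments) - len(kept)
print(kept, untrusted_count)
```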


A-Box Evaluation: Results

Resource  Precision  Recall  F1     Untrusted judgments
Wikidata  0.808      0.982   0.886  1,847
MENTA     0.793      0.589   0.675  1,093
SDType    0.924      0.098   0.178  1,723
YAGO      0.461      0.727   0.565  1,358
WiBi      0.858      0.597   0.704  2,075
DBTax     0.744      1       0.853  518
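As a sanity check on the table, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The snippet below recomputes it from the rounded precision and recall values in the table. The `rows` dictionary simply restates those figures; the variable and function names are illustrative.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as listed in the results table.
rows = {
    "Wikidata": (0.808, 0.982, 0.886),
    "MENTA": (0.793, 0.589, 0.675),
    "SDType": (0.924, 0.098, 0.178),
    "YAGO": (0.461, 0.727, 0.565),
    "WiBi": (0.858, 0.597, 0.704),
    "DBTax": (0.744, 1.0, 0.853),
}

# Largest absolute gap between the recomputed and the reported F1.
max_gap = max(abs(f1(p, r) - reported) for p, r, reported in rows.values())

for name, (p, r, reported) in rows.items():
    print(f"{name}: recomputed {f1(p, r):.4f}, reported {reported}")
```

The recomputed values agree with the reported ones to within about 0.001, a gap small enough to come from rounding of the inputs.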


{6} Advantages & Drawbacks


Advantages

Exhaustive coverage (almost 100%)

Type coverage comparison

Recall in A-Box evaluation

Intuitive

Crowdsourced (the layman) A-Box evaluation

Fewest untrusted judgments


Drawbacks

Short hierarchy paths

Cycle removal

Instance pruning

Relatively low precision

NLP may still yield "weird" is-a relations

"Elvis Presley is a Burial" !!!


{7} Conclusion & Future Work

Conclusion

Significant type coverage leap

Intuitive for end users

Balance between DBPO (too generic) and YAGO (too specific)

Integrated in the latest DBpedia release


Future Work

Merge the T-Box into mappings.dbpedia.org for curation

Word Sense Disambiguation for homonymous classes

Multilingual deployment (currently English and Italian)


DBTax in action at the Italian DBpedia chapter

http://it.dbpedia.org/resource/Elvis_Presley

Thanks for your attention!

Download DBTax at: http://downloads.dbpedia.org/current/core-i18n/en/

Browse the Italian DBTax at: http://it.dbpedia.org/sparql

Contact the first author at: [email protected]