Type inference through the analysis of Wikipedia links
![Page 1: Type inference through the analysis of Wikipedia links](https://reader033.vdocuments.net/reader033/viewer/2022042614/554ea254b4c905fb7c8b4777/html5/thumbnails/1.jpg)
stlab.istc.cnr.it
Type inference through the analysis of Wikipedia links
Andrea Giovanni Nuzzolese
Aldo Gangemi
Valentina Presutti
Paolo Ciancarini
16 April 2012 - Lyon, France - LDOW 2012
![Page 2: Type inference through the analysis of Wikipedia links](https://reader033.vdocuments.net/reader033/viewer/2022042614/554ea254b4c905fb7c8b4777/html5/thumbnails/2.jpg)
Outline
• Motivations
• Materials
• Applied methods
• Results
• Conclusions
![Page 3: Type inference through the analysis of Wikipedia links](https://reader033.vdocuments.net/reader033/viewer/2022042614/554ea254b4c905fb7c8b4777/html5/thumbnails/3.jpg)
Motivations
✦ Only a subset of the DBpedia resources is typed with the DBpedia ontology (DBPO).
✦ The typing procedure is top-down.
✦ Is the DBPO complete with respect to the DBpedia domain?
✦ How good and homogeneous is the granularity of DBPO types?
Resources used in wikilink relations: 15,944,381
Resources having a DBPO type: 1,518,697
![Page 4: Type inference through the analysis of Wikipedia links](https://reader033.vdocuments.net/reader033/viewer/2022042614/554ea254b4c905fb7c8b4777/html5/thumbnails/4.jpg)
Materials
DBpedia 3.6
DBpedia ontology: 272 classes

| Dataset | # of triples |
| --- | --- |
| wikilink triples | 107,892,317 |
| infobox mapping-based “data” triples | 9,357,273 |
| rdfs:label triples | 7,972,225 |
| rdf:type triples | 6,173,940 |
| infobox mapping-based “object” triples | 4,251,239 |

Wikilink triples: 107,892,317
Wikilink triples with typed subject/object: 16,745,830
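The counts on the Motivations and Materials slides can be turned into coverage shares with a few lines of arithmetic. The sketch below only divides the figures quoted on the slides; the constant and function names are illustrative and not part of the original deck.

```python
# Counts copied from the Motivations and Materials slides (DBpedia 3.6).
RESOURCES_IN_WIKILINKS = 15_944_381
RESOURCES_WITH_DBPO_TYPE = 1_518_697
WIKILINK_TRIPLES = 107_892_317
WIKILINK_TRIPLES_TYPED = 16_745_830  # "typed subject/object" on the slide


def share(part: int, whole: int) -> float:
    """Return part / whole as a fraction between 0 and 1."""
    return part / whole


typed_resource_share = share(RESOURCES_WITH_DBPO_TYPE, RESOURCES_IN_WIKILINKS)
typed_triple_share = share(WIKILINK_TRIPLES_TYPED, WIKILINK_TRIPLES)

print(f"resources with a DBPO type: {typed_resource_share:.1%}")
print(f"wikilink triples with typed subject/object: {typed_triple_share:.1%}")
```

Under these figures, roughly one resource in ten that appears in a wikilink relation carries a DBPO type, and roughly 15.5% of the wikilink triples are typed.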
![Page 5: Type inference through the analysis of Wikipedia links](https://reader033.vdocuments.net/reader033/viewer/2022042614/554ea254b4c905fb7c8b4777/html5/thumbnails/5.jpg)
What we did
• Wikilinks of a DBpedia resource convey knowledge that can be used for classifying it.
• Classification methods
✦ Inductive learning: k-Nearest Neighbor algorithm
✦ Abductive classification based on EKPs [1] and homotypes used as background knowledge
• The methods were applied to:
Sample of untyped resources: 1,000
Resources used in wikilink relations: 15,944,381
Resources having a DBPO type: 1,518,697
[1] A. G. Nuzzolese, A. Gangemi, V. Presutti, and P. Ciancarini. Encyclopedic Knowledge Patterns from Wikipedia Links. In L. Aroyo, N. Noy, and C. Welty, editors, Proceedings of the 10th International Semantic Web Conference (ISWC2011), pages 520-536. Springer, 2011.
![Page 6: Type inference through the analysis of Wikipedia links](https://reader033.vdocuments.net/reader033/viewer/2022042614/554ea254b4c905fb7c8b4777/html5/thumbnails/6.jpg)
Inductive classification
• We designed two inductive classification experiments based on the k-NN algorithm:
✦ on 272 features, i.e., all the classes in the DBPO
✦ on 27 features, i.e., the top-level classes in the DBPO hierarchy
• For each experiment we built a labeled feature space model as the training set, using a randomly sampled 20% of the typed resources
✦ the algorithms were tested on the remaining 80% of the typed resources
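The experimental setup above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the slides do not state the distance measure or the value of k, so the Hamming distance over binary feature vectors and k=3 are assumptions, and the function names are hypothetical.

```python
import random
from collections import Counter


def hamming(a, b):
    """Number of positions where two equal-length feature vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train, vector, k=3):
    """Majority label among the k training examples closest to `vector`.

    `train` is a list of (feature_vector, label) pairs.
    The distance (Hamming) and k are assumptions, not taken from the slides.
    """
    nearest = sorted(train, key=lambda item: hamming(item[0], vector))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def split(examples, train_fraction=0.2, seed=0):
    """Randomly split examples: 20% for training, the rest for testing."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def precision(train, test, k=3):
    """Fraction of test examples whose predicted label equals the true label."""
    correct = sum(knn_predict(train, vec, k) == label for vec, label in test)
    return correct / len(test) if test else 0.0
```

With the 272-feature experiment each vector would have 272 positions; with the top-level experiment, 27.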
![Page 7: Type inference through the analysis of Wikipedia links](https://reader033.vdocuments.net/reader033/viewer/2022042614/554ea254b4c905fb7c8b4777/html5/thumbnails/7.jpg)
Building the training set for the k-Nearest Neighbor algorithm
[Figure: dbpedia:Steve_Jobs is connected by dbpo:wikiPageWikiLink to dbpedia:Apple_Inc., dbpedia:NeXT, dbpedia:Cupertino,_California and dbpedia:Forbes.]
Feature columns: Mammal, Scientist, Company, Drug, City, Magazine, ..., Class
Row: dbpedia:Steve_Jobs
![Page 8: Type inference through the analysis of Wikipedia links](https://reader033.vdocuments.net/reader033/viewer/2022042614/554ea254b4c905fb7c8b4777/html5/thumbnails/8.jpg)
Building the training set for the k-Nearest Neighbor algorithm
[Figure: dbpedia:Steve_Jobs is connected by dbpo:wikiPageWikiLink to dbpedia:Apple_Inc., dbpedia:NeXT, dbpedia:Cupertino,_California and dbpedia:Forbes. The linked resources have rdf:type dbpo:Organisation, dbpo:Magazine and dbpo:City; dbpedia:Steve_Jobs has rdf:type dbpo:Person.]
Feature columns: Mammal, Scientist, Company, Drug, City, Magazine, ..., Class
Row: dbpedia:Steve_Jobs, Class = dbpo:Person
![Page 9: Type inference through the analysis of Wikipedia links](https://reader033.vdocuments.net/reader033/viewer/2022042614/554ea254b4c905fb7c8b4777/html5/thumbnails/9.jpg)
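Building the training set for the k-Nearest Neighbor algorithm
[Figure: combining dbpo:wikiPageWikiLink and rdf:type, dbpedia:Steve_Jobs kp:linksTo dbpo:Organisation, dbpo:Magazine and dbpo:City.]

| Resource | Mammal | Scientist | Company | Drug | City | Magazine | Class |
| --- | --- | --- | --- | --- | --- | --- | --- |
| dbpedia:Steve_Jobs | 0 | 0 | 1 | 0 | 1 | 1 | dbpo:Person |
| ... | ... | ... | ... | ... | ... | ... | ... |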
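The construction shown on these slides can be sketched as follows: a resource gets a 1 for every feature type that at least one of its wikilink targets has, and a 0 otherwise. This is an illustrative sketch; the function name, the dictionary layout and the type assigned to each linked resource are assumptions chosen to reproduce the row on the slide.

```python
FEATURES = ["Mammal", "Scientist", "Company", "Drug", "City", "Magazine"]


def feature_vector(resource, wikilinks, types, features=FEATURES):
    """Binary vector: 1 if the resource links to something of that type.

    wikilinks: dict mapping a resource to the resources it links to.
    types: dict mapping a resource to the set of its types.
    """
    linked_types = set()
    for target in wikilinks.get(resource, []):
        linked_types.update(types.get(target, ()))
    return [1 if feature in linked_types else 0 for feature in features]


# Toy data modelled on the Steve_Jobs example (type assignments are assumed).
wikilinks = {
    "Steve_Jobs": ["Apple_Inc.", "NeXT", "Cupertino,_California", "Forbes"],
}
types = {
    "Apple_Inc.": {"Company"},
    "NeXT": {"Company"},
    "Cupertino,_California": {"City"},
    "Forbes": {"Magazine"},
}

row = feature_vector("Steve_Jobs", wikilinks, types)
print(row)  # [0, 0, 1, 0, 1, 1]
```

For a typed resource the known type (here dbpo:Person) is appended as the class label of the training row.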
![Page 10: Type inference through the analysis of Wikipedia links](https://reader033.vdocuments.net/reader033/viewer/2022042614/554ea254b4c905fb7c8b4777/html5/thumbnails/10.jpg)
![Page 11: Type inference through the analysis of Wikipedia links](https://reader033.vdocuments.net/reader033/viewer/2022042614/554ea254b4c905fb7c8b4777/html5/thumbnails/11.jpg)
Building the training set for the k-Nearest Neighbor algorithm
[Figure: dbpedia:Steve_Jobs kp:linksTo dbpo:Organisation, dbpo:Magazine and dbpo:City, derived from dbpo:wikiPageWikiLink and rdf:type.]
✦ Precision using all DBPO types as features: 31.65%
✦ Precision using the top-level of DBPO as features: 40.27%
![Page 12: Type inference through the analysis of Wikipedia links](https://reader033.vdocuments.net/reader033/viewer/2022042614/554ea254b4c905fb7c8b4777/html5/thumbnails/12.jpg)
Abductive classification with EKPs
• EKPs (Encyclopedic Knowledge Patterns)
✦ An EKP of a certain entity type is a small vocabulary that captures the core types used for describing that entity type, as it emerges from the Wikipedia crowds
Visit aemoo.org for an exploratory tool based on EKPs
![Page 13: Type inference through the analysis of Wikipedia links](https://reader033.vdocuments.net/reader033/viewer/2022042614/554ea254b4c905fb7c8b4777/html5/thumbnails/13.jpg)
How can we infer the type of “Galileo Galilei”?
http://www.aemoo.org
![Page 14: Type inference through the analysis of Wikipedia links](https://reader033.vdocuments.net/reader033/viewer/2022042614/554ea254b4c905fb7c8b4777/html5/thumbnails/14.jpg)
How can we infer the type of “Galileo Galilei”?
We know its path types
http://www.aemoo.org
![Page 15: Type inference through the analysis of Wikipedia links](https://reader033.vdocuments.net/reader033/viewer/2022042614/554ea254b4c905fb7c8b4777/html5/thumbnails/15.jpg)
We have 231 EKPs.
We compare the path types involving “Galileo Galilei” as subject with the EKPs in order to identify the most similar one, which is the "Scientist" EKP.
http://www.aemoo.org
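The comparison step can be sketched as a nearest-pattern lookup. The slides do not say which similarity measure is used, so the Jaccard overlap below is an assumption, and the contents of the two toy EKPs are invented for illustration only.

```python
def jaccard(a, b):
    """Jaccard overlap between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar_ekp(entity_path_types, ekps):
    """Name of the EKP whose type vocabulary best overlaps the entity's path types.

    entity_path_types: set of types reached by the entity's wikilinks.
    ekps: dict mapping an EKP name to its set of core types.
    """
    return max(ekps, key=lambda name: jaccard(entity_path_types, ekps[name]))


# Hypothetical EKP vocabularies, not taken from the real set of 231 EKPs.
ekps = {
    "Scientist": {"Scientist", "University", "Award", "City"},
    "Band": {"MusicalArtist", "Album", "Single", "City"},
}
galileo_path_types = {"Scientist", "University", "City", "Country"}

print(most_similar_ekp(galileo_path_types, ekps))  # Scientist
```

The name of the winning EKP is then taken as the inferred type of the entity.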
![Page 16: Type inference through the analysis of Wikipedia links](https://reader033.vdocuments.net/reader033/viewer/2022042614/554ea254b4c905fb7c8b4777/html5/thumbnails/16.jpg)
The inferred type for the resource “Galileo Galilei” is the class “Scientist”
http://www.aemoo.org
![Page 17: Type inference through the analysis of Wikipedia links](https://reader033.vdocuments.net/reader033/viewer/2022042614/554ea254b4c905fb7c8b4777/html5/thumbnails/17.jpg)
Distinctive weakness of some EKPs
✦ The distinctive weakness seems due to wide overlaps among some EKPs
✦ Systematic ambiguity of the 4 largest classes
✦ Precision and recall on all DBPO types: both 44.4%
✦ Precision and recall on the top-level of DBPO hierarchy: 36.5% and 79.5%
![Page 18: Type inference through the analysis of Wikipedia links](https://reader033.vdocuments.net/reader033/viewer/2022042614/554ea254b4c905fb7c8b4777/html5/thumbnails/18.jpg)
Homotype-based abductive classification
• Homotypes are wikilinks that have the same type on both the subject and the object of the triple
• We have observed that the homotype is usually the most frequent wikilink type (or among the top 3)
• Given an untyped entity, we hypothesize that the most frequent type involved in its ingoing/outgoing wikilinks reveals its homotype, and hence indicates its type
[Figure: dbpedia:Immanuel_Kant dbpo:wikiPageWikiLink dbpedia:Plato; both resources have rdf:type dbpo:Philosopher.]
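The homotype hypothesis amounts to a frequency count over the types of an entity's wikilink neighbours. The sketch below is an illustration under that reading; the function name, the data layout and all toy triples other than the Kant/Plato link are assumptions.

```python
from collections import Counter


def infer_by_homotype(resource, triples, types):
    """Most frequent type among the resources linked to or from `resource`.

    triples: iterable of (subject, object) wikilink pairs.
    types: dict mapping a resource to the set of its types.
    Returns None when no neighbour is typed.
    """
    counts = Counter()
    for subject, obj in triples:
        if subject == resource:
            neighbour = obj        # outgoing wikilink
        elif obj == resource:
            neighbour = subject    # ingoing wikilink
        else:
            continue
        counts.update(types.get(neighbour, ()))
    return counts.most_common(1)[0][0] if counts else None


# Toy data: Immanuel_Kant is treated as untyped here.
triples = [
    ("Immanuel_Kant", "Plato"),
    ("Immanuel_Kant", "Aristotle"),
    ("Immanuel_Kant", "Prussia"),
    ("David_Hume", "Immanuel_Kant"),
]
types = {
    "Plato": {"Philosopher"},
    "Aristotle": {"Philosopher"},
    "David_Hume": {"Philosopher"},
    "Prussia": {"Country"},
}

print(infer_by_homotype("Immanuel_Kant", triples, types))  # Philosopher
```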
![Page 19: Type inference through the analysis of Wikipedia links](https://reader033.vdocuments.net/reader033/viewer/2022042614/554ea254b4c905fb7c8b4777/html5/thumbnails/19.jpg)
Homotype-based abductive classification
![Page 20: Type inference through the analysis of Wikipedia links](https://reader033.vdocuments.net/reader033/viewer/2022042614/554ea254b4c905fb7c8b4777/html5/thumbnails/20.jpg)
![Page 21: Type inference through the analysis of Wikipedia links](https://reader033.vdocuments.net/reader033/viewer/2022042614/554ea254b4c905fb7c8b4777/html5/thumbnails/21.jpg)
Results on classifying already typed resources
![Page 22: Type inference through the analysis of Wikipedia links](https://reader033.vdocuments.net/reader033/viewer/2022042614/554ea254b4c905fb7c8b4777/html5/thumbnails/22.jpg)
Results on untyped resources
• Results on a sample of 1,000 untyped resources are much less satisfactory
With EKPs
With Homotypes
![Page 23: Type inference through the analysis of Wikipedia links](https://reader033.vdocuments.net/reader033/viewer/2022042614/554ea254b4c905fb7c8b4777/html5/thumbnails/23.jpg)
Why? [1]
• Typed entities: 2:3 typed wikilinks ratio
• Untyped entities: 1:3 typed wikilinks ratio
• Link structure for untyped entities is not rich enough
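One way to read the ratios above is as the share of an entity's wikilinks whose target carries a DBPO type. The slide does not define the ratio precisely, so this reading, the function name and the toy data are all assumptions.

```python
def typed_wikilink_ratio(resource, wikilinks, typed):
    """Share of a resource's wikilink targets that have a type.

    wikilinks: dict mapping a resource to the resources it links to.
    typed: set of resources that carry at least one type.
    """
    targets = wikilinks.get(resource, [])
    if not targets:
        return 0.0
    return sum(1 for target in targets if target in typed) / len(targets)


# Toy example: two of three targets are typed, i.e. a 2:3 ratio.
wikilinks = {"A": ["X", "Y", "Z"]}
typed = {"X", "Y"}
print(typed_wikilink_ratio("A", wikilinks, typed))
```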
![Page 24: Type inference through the analysis of Wikipedia links](https://reader033.vdocuments.net/reader033/viewer/2022042614/554ea254b4c905fb7c8b4777/html5/thumbnails/24.jpg)
Why? [2]
• DBPO does not provide a complete set of classes for correctly typing DBpedia resources
Examples (resource: type):
dbpedia:Counterattack: Plan
dbpedia:Computer_Science: ScientificDiscipline
dbpedia:List_of_FIFA_World_Cup_finals: Collection
dbpedia:Eros(concept): Concept
dbpedia:Gentlemen’s_agreement: Agreement
![Page 25: Type inference through the analysis of Wikipedia links](https://reader033.vdocuments.net/reader033/viewer/2022042614/554ea254b4c905fb7c8b4777/html5/thumbnails/25.jpg)
Conclusions
• We have investigated different approaches for typing DBpedia resources based on the data set of wikilinks
• Results are acceptable on the test set, but extensive untypedness in output links and poor DBPO coverage severely compromise automatic typing for untyped resources
• We have analyzed possible causes deriving from some bias in DBpedia
![Page 26: Type inference through the analysis of Wikipedia links](https://reader033.vdocuments.net/reader033/viewer/2022042614/554ea254b4c905fb7c8b4777/html5/thumbnails/26.jpg)
Future work
• YAGO could be helpful, but
✦ there is a lack of mapping between YAGO and DBPO
✦ it has larger coverage and only an overlap with DBPO
✦ the granularity of its categories is finer, and not easily reusable, because the top level is very large
![Page 27: Type inference through the analysis of Wikipedia links](https://reader033.vdocuments.net/reader033/viewer/2022042614/554ea254b4c905fb7c8b4777/html5/thumbnails/27.jpg)
Thank you
Andrea Nuzzolese
STLab, ISTC-CNR
&
Dipartimento di Scienze dell’Informazione, University of Bologna
Italy
![Page 28: Type inference through the analysis of Wikipedia links](https://reader033.vdocuments.net/reader033/viewer/2022042614/554ea254b4c905fb7c8b4777/html5/thumbnails/28.jpg)
Path popularity
nRes(dbpo:MusicalArtist): number of resources having type dbpo:MusicalArtist
PREFIX dbpo: <http://dbpedia.org/ontology/>
[Figure: Chad_Smith, John_Lennon, Michael_Jackson, Paul_McCartney, Jackie_Jackson and Anthony_Kiedis each have rdf:type dbpo:MusicalArtist.]
![Page 29: Type inference through the analysis of Wikipedia links](https://reader033.vdocuments.net/reader033/viewer/2022042614/554ea254b4c905fb7c8b4777/html5/thumbnails/29.jpg)
Path popularity
nSubjectRes(Pi,j): number of distinct resources that participate in a path as subjects
PREFIX dbpo: <http://dbpedia.org/ontology/>
Path Pi,j: Si = dbpo:MusicalArtist, dbpo:wikiPageWikiLink, Oj = dbpo:Band
[Figure: resources shown are Dave_Grohl, John_Lennon, Michael_Jackson, Paul_McCartney and Jackie_Jackson, linked to Foo Fighters, Nirvana, Beatles and Jackson_5.]
![Page 30: Type inference through the analysis of Wikipedia links](https://reader033.vdocuments.net/reader033/viewer/2022042614/554ea254b4c905fb7c8b4777/html5/thumbnails/30.jpg)
Path popularity
$$\mathrm{pathPopularity}(P_{i,j}, S_i) = \frac{\mathrm{nSubjectRes}(P_{i,j})}{\mathrm{nRes}(S_i)}$$
[Figure: resources shown are Dave_Grohl, John_Lennon, Michael_Jackson, Paul_McCartney, Jackie_Jackson, Madonna, Prince, Charlie_Parker and Keith_Jarrett, together with Foo Fighters, Nirvana, Beatles and Jackson_5.]
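The three quantities on the path popularity slides can be computed directly from typed wikilink pairs. The sketch below follows the definitions given on the slides; the function names, the data layout and the specific links between artists and bands in the toy data are assumptions made for illustration.

```python
def n_res(type_name, types):
    """nRes(S): number of resources having the given type."""
    return sum(1 for ts in types.values() if type_name in ts)


def n_subject_res(subject_type, object_type, wikilinks, types):
    """nSubjectRes(P): distinct subjects of type S linking to an object of type O."""
    subjects = {
        s for s, o in wikilinks
        if subject_type in types.get(s, ()) and object_type in types.get(o, ())
    }
    return len(subjects)


def path_popularity(subject_type, object_type, wikilinks, types):
    """pathPopularity(P, S) = nSubjectRes(P) / nRes(S)."""
    total = n_res(subject_type, types)
    if total == 0:
        return 0.0
    return n_subject_res(subject_type, object_type, wikilinks, types) / total


artists = ["Dave_Grohl", "John_Lennon", "Michael_Jackson", "Paul_McCartney",
           "Jackie_Jackson", "Madonna", "Prince", "Charlie_Parker", "Keith_Jarrett"]
bands = ["Foo_Fighters", "Nirvana", "Beatles", "Jackson_5"]
types = {name: {"MusicalArtist"} for name in artists}
types.update({name: {"Band"} for name in bands})

# Assumed wikilinks between the resources named on the slides.
wikilinks = [
    ("Dave_Grohl", "Foo_Fighters"), ("Dave_Grohl", "Nirvana"),
    ("John_Lennon", "Beatles"), ("Paul_McCartney", "Beatles"),
    ("Michael_Jackson", "Jackson_5"), ("Jackie_Jackson", "Jackson_5"),
]

popularity = path_popularity("MusicalArtist", "Band", wikilinks, types)
print(popularity)
```

In this toy data five of the nine musical artists link to a band, so the popularity of the MusicalArtist to Band path is 5/9.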