Capturing emerging relations between schema ontologies on the Web of Data
Andriy Nikolov, Enrico Motta
Public linked data
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
• “Linking” in the Linked Data cloud:
– References to instance URIs described in external sources
– Special case: identity links between equivalent resources
Motivation
• Schema heterogeneity is an obstacle both for creating and for utilising these links
– Extracting information on the same topic from different repositories
– Discovering equivalence links between individuals
• Motivation for our work: discovering instance-level links
– How to choose the repositories to connect a new one to?
– Which subsets of repositories contain co-referring instances?
[Diagram: topics (TV programs, movies, pieces of music) and candidate repositories (LinkedMDB, DBPedia, Freebase, MusicBrainz), with an open question mark]
Schema-level interlinks
Data-level
Schema-level?
Matching approaches
• “Top-down”
– Analyzing schema ontologies and generating alignments (manually or automatically)
– UMBEL
  • Using CYC as a “backbone”
  • Mapping commonly used schema ontologies
• “Bottom-up”
– Inferring schema mappings based on instance-level information
Our approach
• Constructing a large-scale network of schema mappings
– Applying a light-weight instance-based matcher
• Analysing the resulting network
– What does it tell us about the use of ontologies?
Motivating factors
• Potential use case scenarios
– Discovering relevant sources for connection
– Discovering relevant subsets of comparable instances
• Tolerance to the quality of mappings
– A mapping between “strongly overlapping” classes is still useful even if there is no strict equivalence/subsumption
Instance-based matching
• Use of instance-based matching
– Some implicit schema-level assumptions cannot be captured using only schema-level evidence
• Interpretation mismatches
– dbpedia:Actor = professional actor (film or stage)
– movie:actor = anybody who participated in a movie
• Class interpretation “as used” vs “as designed”
– FOAF: foaf:Person = any person
– DBLP: foaf:Person = computer scientist

Class          Repository  Richard Nixon  David Garrick
dbpedia:Actor  DBPedia     -              +
movie:actor    LinkedMDB   +              -
Instance set overlaps
• Declared association (LinkedMDB, DBPedia, MusicBrainz)
– music:artist/a16…9fdf == dbpedia:Ennio_Morricone == movie:music_contributor/2490
– Classes linked to these instances by is_a: movie:music_contributor, dbpedia:Artist, mo:MusicArtist
• Co-typing (DBPedia)
– dbpedia:Ennio_Morricone is_a dbpedia:Artist
– dbpedia:Ennio_Morricone is_a yago:ItalianComposers
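
The two kinds of overlap can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `cotyping_overlaps` and `association_overlaps` are hypothetical, and the class assignments for the LinkedMDB and MusicBrainz instances in the toy data are assumed for the sake of the example.

```python
from collections import defaultdict
from itertools import combinations


def cotyping_overlaps(types):
    """Count instances shared by each class pair through explicit typing.

    `types` maps an instance URI to the set of classes it is typed with.
    """
    overlap = defaultdict(int)
    for classes in types.values():
        for a, b in combinations(sorted(classes), 2):
            overlap[(a, b)] += 1
    return dict(overlap)


def association_overlaps(types, same_as):
    """Count class pairs connected through declared identity links."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in same_as:
        parent[find(a)] = find(b)

    # Group instances into clusters of co-referring URIs.
    clusters = defaultdict(list)
    for inst in types:
        clusters[find(inst)].append(inst)

    overlap = defaultdict(int)
    for members in clusters.values():
        pairs = set()
        # Only pair classes that come from two different linked instances.
        for i, j in combinations(members, 2):
            for a in types[i]:
                for b in types[j]:
                    if a != b:
                        pairs.add(tuple(sorted((a, b))))
        for pair in pairs:
            overlap[pair] += 1  # one shared individual per cluster
    return dict(overlap)


# Toy data based on the slide; typing of the non-DBPedia instances is assumed.
types = {
    "dbpedia:Ennio_Morricone": {"dbpedia:Artist", "yago:ItalianComposers"},
    "movie:music_contributor/2490": {"movie:music_contributor"},
    "music:artist/a16…9fdf": {"mo:MusicArtist"},
}
same_as = [
    ("music:artist/a16…9fdf", "dbpedia:Ennio_Morricone"),
    ("dbpedia:Ennio_Morricone", "movie:music_contributor/2490"),
]

cotyping = cotyping_overlaps(types)
association = association_overlaps(types, same_as)
```

Here the co-typing count only pairs dbpedia:Artist with yago:ItalianComposers, while the association count pairs classes across the three repositories.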
Dataset
• Billion Triple Challenge 2009
– about 1.14 billion triples
– contains
  • core LOD repositories (DBPedia, Freebase, Geonames, Musicbrainz, LinkedMDB, …)
  • smaller semantic datasets retrieved by search servers (Falcon-S, Sindice)
– ≈3.6M co-typing-based overlapping pairs of classes
– ≈1M association-based pairs
Inferring mappings
• Classification task
– Classes A, B: is there a mapping?
– Boolean classification
  • type of mappings assigned based on comparing sizes of instance sets
• Features
– namespaces of the class URIs of A and B
– size of the overlap between A and B
– sizes of the instance sets of A and B
– ratio of the overlapping subset to the complete instance set
– direct/indirect: whether classes have instances explicitly declared to be equivalent
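
A feature vector of this kind could be assembled roughly as below. This is a sketch under stated assumptions: `PairFeatures`, `pair_features`, and `mapping_type` are hypothetical names, class URIs are assumed to be in prefix form, and the slides do not give the rule for assigning the mapping type, so the size comparison with a 10% tolerance is an illustrative guess.

```python
from dataclasses import dataclass


@dataclass
class PairFeatures:
    ns_a: str        # namespace of class A
    ns_b: str        # namespace of class B
    overlap: int     # number of shared instances
    size_a: int      # instance set size of A
    size_b: int      # instance set size of B
    ratio_a: float   # overlap / size_a
    ratio_b: float   # overlap / size_b
    direct: bool     # instances explicitly declared equivalent?


def namespace(class_uri):
    """Return the prefix part of a prefixed class name such as 'foaf:Person'."""
    return class_uri.split(":", 1)[0]


def pair_features(class_a, class_b, instances_a, instances_b, direct):
    overlap = len(instances_a & instances_b)
    size_a, size_b = len(instances_a), len(instances_b)
    return PairFeatures(
        ns_a=namespace(class_a),
        ns_b=namespace(class_b),
        overlap=overlap,
        size_a=size_a,
        size_b=size_b,
        ratio_a=overlap / size_a if size_a else 0.0,
        ratio_b=overlap / size_b if size_b else 0.0,
        direct=direct,
    )


def mapping_type(features, tolerance=0.1):
    """Assumed rule: label an accepted mapping by comparing set sizes."""
    larger = max(features.size_a, features.size_b)
    if abs(features.size_a - features.size_b) <= tolerance * larger:
        return "equivalence"
    if features.size_a < features.size_b:
        return "A subsumed by B"
    return "B subsumed by A"


# Toy instance sets; the identifiers are placeholders, not real data.
features = pair_features(
    "dbpedia:Actor",
    "movie:actor",
    {"i1", "i2", "i3", "i4"},
    {"i3", "i4", "i5", "i6", "i7", "i8"},
    direct=False,
)
kind = mapping_type(features)
```

The Boolean classifier would decide whether a mapping exists at all; `mapping_type` would only be consulted for pairs it accepts.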
Test
Mapping set        Algorithm  Precision  Recall  F1
Association-based  J48        0.939      0.689   0.795
Co-typing-based    J48        0.952      0.944   0.948
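
As a quick consistency check, the F1 column can be recomputed from the precision and recall columns with the standard harmonic-mean formula. The helper name `f1_score` is ours; the numbers are the ones reported in the table.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as listed in the table above.
reported = {
    "Association-based": (0.939, 0.689, 0.795),
    "Co-typing-based": (0.952, 0.944, 0.948),
}

recomputed = {
    name: round(f1_score(p, r), 3) for name, (p, r, _) in reported.items()
}
```

Both recomputed values agree with the reported F1 to three decimal places.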
• Training
– Training set: 6000 overlapping pairs of classes
– Test: 10-fold cross-validation
• Applying
– 2 networks of class mappings
Property                    Association-based  Co-typing-based
Nodes                       20365              35578
Edges                       82422              67620
Max. connections/node       5301               18137
Node with max. connections  geonames:Feature   foaf:Person
Avg. connections/node       8.09               3.80
Distribution law            power              power
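
The reported averages are consistent with the usual identity for an undirected graph, average degree = 2 × edges / nodes. The slides do not state how the average was computed, so the check below is an observation rather than the authors' definition.

```python
def average_degree(nodes, edges):
    """Average number of connections per node in an undirected graph."""
    return 2 * edges / nodes


# (nodes, edges) for the two class-mapping networks in the table above.
networks = {
    "Association-based": (20365, 82422),
    "Co-typing-based": (35578, 67620),
}

averages = {
    name: round(average_degree(n, e), 2) for name, (n, e) in networks.items()
}
```

Rounded to two decimals this gives 8.09 and 3.80, matching the "Avg. connections/node" row.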
Observations: class mappings
• Association-based network: classes involved in the largest number of mappings
– High-level classes representing concepts covered in many repositories
– … and describing categories with very fine-grained class decomposition
– Usually also the most populated ones
[Plot, logarithmic axes: instance set size vs. number of mappings per class. Labelled classes: geonames:Feature, freebase:people.person, yago:PhysicalEntity, linkedmdb:film, umbel:Person. “Under-linked” ones? akt:Person, akt:ArticleReference, …]
Observations: class mappings
• Co-typing-based network: classes involved in the largest number of mappings
– Popular classes reused in many repositories
– … or in DBPedia
– … and describing categories with fine-grained class decomposition
– Usually also the most populated ones
[Plot, logarithmic axes: instance set size vs. number of mappings per class. Labelled classes: foaf:Person, umbel:Person, dbpedia:Person, dbpedia:FootballPlayer, wordnet:Person, dbpedia:Album, sioc:WikiArticle, geonames:Feature, …]
Links between ontologies
Property                    Association-based  Co-typing-based
Nodes                       52                 743
Edges                       172                1352
Max. connections/node       29                 504
Node with max. connections  YAGO               FOAF
Avg. connections/node       3.96               1.85
Connected components        5                  35
• Aggregated network: connections between ontologies
– Mapping-based links between ontologies
– At least 1 mapping between corresponding classes must exist
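
The aggregation step could look roughly like this. It is a minimal sketch: the function names are hypothetical, the ontology of a class is assumed to be its prefix, and the class mappings in the toy list are made up for illustration rather than taken from the reported results.

```python
from collections import defaultdict


def ontology_of(class_uri):
    """Assume the ontology is identified by the prefix of the class name."""
    return class_uri.split(":", 1)[0]


def aggregate(class_mappings):
    """Link two ontologies if at least one mapping connects their classes."""
    edges = set()
    for a, b in class_mappings:
        oa, ob = ontology_of(a), ontology_of(b)
        if oa != ob:
            edges.add(frozenset((oa, ob)))
    return edges


def connected_components(edges):
    """Return the connected components of the ontology-level network."""
    adjacency = defaultdict(set)
    for edge in edges:
        a, b = tuple(edge)
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(adjacency[node] - seen)
        components.append(component)
    return components


# Hypothetical class-level mappings, for illustration only.
mappings = [
    ("foaf:Person", "dbpedia:Person"),
    ("foaf:Person", "umbel:Person"),
    ("dbpedia:Person", "umbel:Person"),
    ("dbpedia:Actor", "movie:actor"),
    ("geonames:Feature", "yago:PhysicalEntity"),
]

ontology_edges = aggregate(mappings)
components = connected_components(ontology_edges)
```

On this toy input the network has five ontology-level edges and two connected components; ontologies without any mapping never appear as nodes.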
Association-based network
Generic:
- YAGO
- Freebase
- UMBEL
- OpenCYC
- DBPedia
Domain-specific
• Main factor: topic coverage
• Popularity for linking is not reflected
– Data-level: DBPedia has more connections than Freebase
– Schema-level: no substantial difference
– Effect of exploiting composed links
Co-typing-based network
• Main factor:
– Popularity for reuse
• FOAF and WordNet:
– the most popular
• DBPedia, YAGO, OpenCYC, UMBEL
– Reused for DBPedia instances
Outcomes
• Possible usage scenarios for mappings
– Selecting suitable sources to connect
  • “LinkedMDB contains more movies than DBPedia – more likely to cover all my instances”
– Selecting an ontology to reuse to structure new instances
  • Which sources use this ontology? Do I want my data to be integrated with them?
– Other data-driven tasks
  • E.g., exploratory search
• Generic challenges
– How to take into account task requirements in ontology matching?
  • Recall vs precision, fuzzy vs exact
– How to capture changes in the data?
  • BTC 2009 is almost obsolete by now
Limitations and future work
• Limitations
– Light-weight matcher can lead to lower quality mappings
  • OK for our scenario but not others
– Pre-existing instance-level mappings are not always available
• Future work
– Combining with schema-based ontology matching techniques
– Taking into account properties and complex correspondences
Questions?
Thanks for your attention
Disjoint but overlapping
• Spurious owl:sameAs link
– dbpedia:Hippocrates (Hippocrates) = bookmashup:9004095748 (Hippocratic Lives and Legends (Studies in Ancient Medicine, Vol 4))
• Spurious rdf:type assignment
– dbpedia:Celtic_Frost (band) defined as Person in DBPedia (fixed in the current version of DBPedia)
• Modelling assumptions
– dbpedia:Masada describes both the geographical place and the battle
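
Cases like these could be surfaced by checking pairs of classes that are expected to be disjoint for shared instances. The sketch below is only an illustration: `disjoint_overlaps` is a hypothetical helper, and the class names and disjointness pairs in the toy data are assumptions, not statements taken from the DBPedia ontology.

```python
def disjoint_overlaps(instances, disjoint_pairs):
    """Report shared instances for class pairs expected to be disjoint.

    `instances` maps a class to its set of instance URIs.
    """
    report = {}
    for a, b in disjoint_pairs:
        shared = instances.get(a, set()) & instances.get(b, set())
        if shared:
            report[(a, b)] = sorted(shared)
    return report


# Toy data; class names and disjointness declarations are assumed.
instances = {
    "dbpedia:Person": {"dbpedia:Hippocrates", "dbpedia:Celtic_Frost"},
    "dbpedia:Band": {"dbpedia:Celtic_Frost"},
    "dbpedia:Place": {"dbpedia:Masada"},
}
disjoint_pairs = [
    ("dbpedia:Person", "dbpedia:Band"),
    ("dbpedia:Person", "dbpedia:Place"),
]

suspicious = disjoint_overlaps(instances, disjoint_pairs)
```

Each reported pair would still need manual inspection to tell a spurious link or type assignment from a genuine modelling choice.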