Capturing emerging relations between schema ontologies on the Web of Data
Andriy Nikolov, Enrico Motta
Public linked data
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
• “Linking” in the Linked Data cloud:
  – References to instance URIs described in external sources
  – Special case: identity links between equivalent resources
Motivation
• Schema heterogeneity is an obstacle both for creating and for utilising these links
  – Extracting information on the same topic from different repositories
  – Discovering equivalence links between individuals
• Motivation for our work: discovering instance-level links
  – How to choose the repositories to connect a new one to?
  – Which subsets of repositories contain co-referring instances?
[Diagram: topics (TV programs, movies, pieces of music) and repositories (LinkedMDB, DBPedia, Freebase, MusicBrainz, ?)]
Schema-level interlinks
[Diagram: data-level links vs. schema-level links?]
Matching approaches
• “Top-down”
  – Analyzing schema ontologies and generating alignments (manually or automatically)
  – UMBEL
    • Using CYC as a “backbone”
    • Mapping commonly used schema ontologies
• “Bottom-up”
  – Inferring schema mappings based on instance-level information
Our approach
• Constructing a large-scale network of schema mappings
  – Applying a light-weight instance-based matcher
• Analysing the resulting network
  – What does it tell us about the use of ontologies?
Motivating factors
• Potential use case scenarios
  – Discovering relevant sources for connection
  – Discovering relevant subsets of comparable instances
• Tolerance to the quality of mappings
  – A mapping between “strongly overlapping” classes is still useful even if there is no strict equivalence/subsumption
Instance-based matching
• Use of instance-based matching
  – Some implicit schema-level assumptions cannot be captured using only schema-level evidence
• Interpretation mismatches
  – dbpedia:Actor = professional actor (film or stage)
  – movie:actor = anybody who participated in a movie
• Class interpretation “as used” vs “as designed”
  – FOAF: foaf:Person = any person
  – DBLP: foaf:Person = computer scientist

Class          Repository  Richard Nixon  David Garrick
dbpedia:Actor  DBPedia     -              +
movie:Actor    LinkedMDB   +              -
Instance set overlaps
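• Declared association: instances from different repositories linked as equivalent
  – LinkedMDB: movie:music_contributor/2490 is_a movie:music_contributor
  – DBPedia: dbpedia:Ennio_Morricone is_a dbpedia:Artist
  – MusicBrainz: music:artist/a16…9fdf is_a mo:MusicArtist
  – movie:music_contributor/2490 == dbpedia:Ennio_Morricone == music:artist/a16…9fdf
• Co-typing: one instance assigned to several classes
  – DBPedia: dbpedia:Ennio_Morricone is_a dbpedia:Artist
  – DBPedia: dbpedia:Ennio_Morricone is_a yago:ItalianComposers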
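The two kinds of overlap can be illustrated with a minimal Python sketch. It is not the matcher used in this work: the function names, the dictionary-based input format and the toy data are assumptions for illustration, and composed (transitive) equivalence links are not followed.

```python
from collections import defaultdict
from itertools import combinations

def cotyping_overlaps(types):
    """Count, for each pair of classes, the instances typed with both.

    types: dict mapping an instance URI to the set of its classes.
    """
    overlap = defaultdict(int)
    for classes in types.values():
        for a, b in combinations(sorted(classes), 2):
            overlap[(a, b)] += 1
    return dict(overlap)

def association_overlaps(types, same_as):
    """Count, for each pair of classes, the declared equivalence links
    that connect an instance of one class to an instance of the other.

    same_as: iterable of (instance1, instance2) pairs, each listed once.
    """
    overlap = defaultdict(int)
    for x, y in same_as:
        for a in types.get(x, ()):
            for b in types.get(y, ()):
                if a != b:
                    overlap[tuple(sorted((a, b)))] += 1
    return dict(overlap)

# Toy data based on the Ennio Morricone example
types = {
    "dbpedia:Ennio_Morricone": {"dbpedia:Artist", "yago:ItalianComposers"},
    "movie:music_contributor/2490": {"movie:music_contributor"},
}
same_as = [("movie:music_contributor/2490", "dbpedia:Ennio_Morricone")]

print(cotyping_overlaps(types))
print(association_overlaps(types, same_as))
```

In this sketch a single equivalence link contributes to every pair of classes its two endpoints belong to, which is why one link yields overlaps with both dbpedia:Artist and yago:ItalianComposers.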
Dataset
• Billion Triple Challenge 2009
  – about 1.14 billion triples
  – contains
    • core LOD repositories (DBPedia, Freebase, Geonames, Musicbrainz, LinkedMDB,…)
    • smaller semantic datasets retrieved by search servers (Falcon-S, Sindice)
  – ≈3.6M co-typing-based overlapping pairs of classes
  – ≈1M association-based pairs
Inferring mappings
• Classification task
  – Classes A, B: is there a mapping?
  – Boolean classification
    • type of mappings assigned based on comparing sizes of instance sets
• Features
  – namespaces of the URIs of classes A and B
  – size of the overlap between the instance sets of A and B
  – sizes of the instance sets of A and B
  – ratio of the overlapping subset to the complete instance set
  – direct/indirect: whether classes have instances explicitly declared to be equivalent
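The feature list above could be computed for one pair of classes roughly as follows. This is a sketch under stated assumptions: class URIs are given in prefix:local form, both instance sets use shared identifiers (equivalent instances already merged), and the feature names and the helper `pair_features` are hypothetical, not taken from the original system.

```python
def namespace(uri):
    """Return the prefix of a prefixed name such as 'foaf:Person'.

    Assumption: URIs are written in prefix:local form.
    """
    return uri.split(":", 1)[0]

def pair_features(class_a, class_b, instances_a, instances_b, direct):
    """Build the feature dictionary for a candidate mapping between two classes.

    instances_a, instances_b: sets of instance identifiers.
    direct: True if the classes have instances explicitly declared equivalent.
    """
    overlap = len(instances_a & instances_b)
    return {
        "ns_a": namespace(class_a),
        "ns_b": namespace(class_b),
        "overlap": overlap,
        "size_a": len(instances_a),
        "size_b": len(instances_b),
        # Share of each instance set that falls into the overlap
        "ratio_a": overlap / len(instances_a) if instances_a else 0.0,
        "ratio_b": overlap / len(instances_b) if instances_b else 0.0,
        "direct": direct,
    }

features = pair_features(
    "dbpedia:Actor", "movie:actor",
    {"i1", "i2", "i3", "i4"}, {"i3", "i4", "i5"},
    direct=True,
)
print(features)
```

A dictionary of this kind would then be turned into one row of the training data for the Boolean classifier.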
Test
Mapping set        Algorithm  Precision  Recall  F1
Association-based  J48        0.939      0.689   0.795
Co-typing-based    J48        0.952      0.944   0.948
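As a quick consistency check, the F1 values in the table follow from the reported precision and recall, assuming the standard definition of F1 as their harmonic mean:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.939, 0.689), 3))  # association-based: 0.795
print(round(f1_score(0.952, 0.944), 3))  # co-typing-based: 0.948
```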
• Training
  – Training set: 6000 overlapping pairs of classes
  – Test: 10-fold cross-validation
• Applying
  – 2 networks of class mappings
Property                    Association-based  Co-typing-based
Nodes                       20365              35578
Edges                       82422              67620
Max. connections/node       5301               18137
Node with max. connections  geonames:Feature   foaf:Person
Avg. connections/node       8.09               3.80
Distribution law            power              power
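Statistics of this kind can be derived from a list of class mappings treated as undirected edges. The sketch below is illustrative only: the function name and the toy edge list are assumptions. The averages reported in the table are consistent with 2 × edges / nodes (2 × 82422 / 20365 ≈ 8.09 and 2 × 67620 / 35578 ≈ 3.80).

```python
from collections import Counter

def network_stats(edges):
    """Compute basic statistics of an undirected mapping network.

    edges: iterable of (class_a, class_b) pairs; duplicates and
    self-loops are ignored.
    """
    unique = {tuple(sorted(e)) for e in edges if e[0] != e[1]}
    degree = Counter()
    for a, b in unique:
        degree[a] += 1
        degree[b] += 1
    if not degree:
        return {"nodes": 0, "edges": 0, "max_connections": 0,
                "hub": None, "avg_connections": 0.0}
    hub, max_connections = degree.most_common(1)[0]
    return {
        "nodes": len(degree),
        "edges": len(unique),
        "max_connections": max_connections,
        "hub": hub,
        # Each edge contributes to the degree of two nodes
        "avg_connections": 2 * len(unique) / len(degree),
    }

# Toy edge list, not taken from the actual network
edges = [
    ("foaf:Person", "dbpedia:Person"),
    ("foaf:Person", "umbel:Person"),
    ("foaf:Person", "wordnet:Person"),
    ("dbpedia:Person", "umbel:Person"),
]
stats = network_stats(edges)
print(stats)
```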
Observations: class mappings
• Association-based network: classes involved in the largest number of mappings
  – High-level classes representing concepts covered in many repositories
  – … and describing categories with very fine-grained class decomposition
  – Usually also the most populated ones
[Plot: number of mappings per class vs. instance set size, logarithmic axes]
Labelled classes: geonames:Feature, freebase:people.person, yago:PhysicalEntity, linkedmdb:film, umbel:Person
akt:Person, akt:ArticleReference, …: “under-linked” ones?
[Plot: number of mappings per class vs. instance set size, logarithmic axes]
Observations: class mappings
• Co-typing-based network: classes involved in the largest number of mappings
  – Popular classes reused in many repositories
  – … or in DBPedia
  – … and describing categories with fine-grained class decomposition
  – Usually also the most populated ones
Labelled classes: foaf:Person, umbel:Person, dbpedia:Person, dbpedia:FootballPlayer, wordnet:Person, dbpedia:Album
sioc:WikiArticle, geonames:Feature, …
Links between ontologies
Property                    Association-based  Co-typing-based
Nodes                       52                 743
Edges                       172                1352
Max. connections/node       29                 504
Node with max. connections  YAGO               FOAF
Avg. connections/node       3.96               1.85
Connected components        5                  35
• Aggregated network: connections between ontologies
  – Mapping-based links between ontologies
  – At least 1 mapping between corresponding classes must exist
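The aggregation step can be sketched as follows, assuming that an ontology is identified by the namespace prefix of its class URIs and that mappings within a single namespace are ignored. The function names and the toy mappings (including yago:Person) are hypothetical.

```python
from collections import defaultdict

def namespace(uri):
    """Return the prefix of a prefixed name such as 'foaf:Person'."""
    return uri.split(":", 1)[0]

def ontology_links(class_mappings):
    """Link two ontologies if at least one mapping exists between their classes."""
    links = set()
    for a, b in class_mappings:
        ns_a, ns_b = namespace(a), namespace(b)
        if ns_a != ns_b:
            links.add(tuple(sorted((ns_a, ns_b))))
    return links

def connected_components(links):
    """Group ontologies into connected components with a depth-first traversal."""
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, components = set(), []
    for start in neighbours:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(component)
    return components

# Toy class mappings, for illustration only
mappings = [
    ("dbpedia:Artist", "yago:ItalianComposers"),
    ("dbpedia:Person", "yago:Person"),
    ("dbpedia:Person", "foaf:Person"),
    ("mo:MusicArtist", "movie:music_contributor"),
]
links = ontology_links(mappings)
components = connected_components(links)
print(sorted(links))
print(len(components))
```

Several class mappings between the same two ontologies collapse into a single ontology-level link, so the aggregated network is much smaller than the class-level one.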
Association-based network
[Network diagram with two groups of ontologies: generic (YAGO, Freebase, UMBEL, OpenCYC, DBPedia) and domain-specific]
• Main factor: topic coverage
• Popularity for linking is not reflected
  – Data-level: DBPedia has more connections than Freebase
  – Schema-level: no substantial difference
  – Effect of exploiting composed links
Co-typing-based network
• Main factor:
  – Popularity for reuse
• FOAF and WordNet:
  – the most popular
• DBPedia, YAGO, OpenCYC, UMBEL
  – Reused for DBPedia instances
Outcomes
• Possible usage scenarios for mappings
  – Selecting suitable sources to connect
    • “LinkedMDB contains more movies than DBPedia – more likely to cover all my instances”
  – Selecting an ontology to reuse to structure new instances
    • Which sources use this ontology? Do I want my data to be integrated with them?
  – Other data-driven tasks
    • E.g., exploratory search
• Generic challenges
  – How to take into account task requirements in ontology matching?
    • Recall vs precision, fuzzy vs exact
  – How to capture changes in the data?
    • BTC 2009 is almost obsolete by now
Limitations and future work
• Limitations
  – Light-weight matcher can lead to lower quality mappings
    • OK for our scenario but not others
  – Pre-existing instance-level mappings are not always available
• Future work
  – Combining with schema-based ontology matching techniques
  – Taking into account properties and complex correspondences
Questions?
Thanks for your attention
Disjoint but overlapping
• Spurious owl:sameAs link
  – dbpedia:Hippocrates (Hippocrates) = bookmashup:9004095748 (Hippocratic Lives and Legends (Studies in Ancient Medicine, Vol 4))
• Spurious rdf:type assignment
  – dbpedia:Celtic_Frost (band) defined as Person in DBPedia (fixed in the current version of DBPedia)
• Modelling assumptions
  – dbpedia:Masada describes both the geographical place and the battle