A Scalable Approach to Learn Semantic Models of Structured Sources
DESCRIPTION
Semantic models of data sources describe the meaning of the data in terms of the concepts and relationships defined by a domain ontology. Building such models is an important step toward integrating data from different sources, where we need to provide the user with a unified view of underlying sources. In this paper, we present a scalable approach to automatically learn semantic models of a structured data source by exploiting the knowledge of previously modeled sources. Our evaluation shows that the approach generates expressive semantic models with minimal user input, and it is scalable to large ontologies and data sources with many attributes.
TRANSCRIPT
A Scalable Approach to Learn Semantic Models of Structured Sources
Mohsen Taheriyan
Craig Knoblock
Pedro Szekely
Jose Luis Ambite
8th IEEE International Conference on Semantic Computing
2
Motivation
How to express the intended meaning of data?
Explicit semantics are missing from many structured sources
creator? actor? rightsHolder?
artwork? movie? referenced entity?
3
Map the Source to the Domain Ontology
EDM: Europeana Data Model
SKOS: Simple Knowledge Organization System
FOAF: Friend of a Friend
AAC: American Art Collaborative
ElementsGr2: RDA Group 2 Elements
ORE: Open Archives Initiative Object Reuse and Exchange
DCTerms: Dublin Core Metadata Terms
Data Source: artworks in the Indianapolis Museum of Art
Domain ontologies
Semantic Model: a mapping from the source to the concepts and relationships defined by the domain ontologies
4
Semantic Model
[Figure: semantic model graph of the source]
Classes: aac:CulturalHeritageObject, edm:WebResource, skos:Concept, aac:Person, edm:EuropeanaAggregation
Edge labels: dcterms:title, edm:aggregatedCHO, skos:prefLabel, ElementsGr2:nameOfThePerson, rdf:type, edm:hasResource, dcterms:creator, edm:hasType, dcterms:description
Key ingredient to automate source discovery, data integration, and publishing RDF triples
5
Problem: How to automatically learn a semantic model for a source
6
Main Idea
Sources in the same domain often have similar data
Exploit knowledge of known semantic models to hypothesize a semantic model for a new source
7
Previous Approach (ISWC 2013)
Input
• Sample data from new source (S)
• Domain Ontologies (O)
• Known semantic models
1. Learn semantic types for attributes(S)
2. Construct Graph G=(V,E)
3. Generate mappings between attributes(S) and V
4. Generate and rank semantic models
Output
• A ranked set of semantic models for S
8
Limitations
• Considers only one semantic type (label) for each attribute
• Does not scale to sources with a large number of attributes
9
Contributions
• Consider K candidate semantic types per attribute
• A beam search algorithm to score and prune the mappings
10
Example
New source: Indianapolis Museum of Art
S(title, label, image, type, artist)
Domain ontologies: EDM, SKOS, FOAF, AAC, ElementsGr2, ORE, DCTerms
Known Semantic Models:
S1: Dallas Museum
S1(title, creationDate, name, birthDate, deathDate, type)
S2: The Metropolitan Museum of Art
S2(name, copyright, materials, dimensions, imageUrl)
Goal: Semantic model for source S
[Figure: Semantic model of S1]
[Figure: Semantic model of S2]
11
Approach
Input
• Sample data from new source (S)
• Domain Ontologies (O)
• Known semantic models
1. Learn semantic types for attributes(S)
2. Construct Graph G=(V,E)
3. Generate mappings between attributes(S) and V
4. Generate and rank semantic models
Output
• A ranked set of semantic models for S
12
Learn Semantic Types
• A CRF-based machine learning technique to learn semantic types for each attribute from its data
• Semantic Type
  – Ontology Class: <class_uri>
  – Data Property + Domain Class: <class_uri, property_uri>
• Pick top K semantic types according to their confidence values
New source: S(title, label, image, type, artist)
title  <aac:CulturalHeritageObject, dcterms:title>  0.19
       <aac:CulturalHeritageObject, rdfs:label>  0.08
label  <aac:CulturalHeritageObject, dcterms:description>  0.7
       <aac:Person, ElementsGr2:note>  0.03
image  <edm:WebResource>  0.58
       <foaf:Document>  0.41
type   <skos:Concept, skos:prefLabel>  0.82
       <skos:Concept, rdfs:label>  0.15
name   <foaf:Person, foaf:name>  0.27
       <aac:Person, ElementsGr2:nameOfThePerson>  0.19
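The "pick top K" step can be sketched in Python as follows. This is only an illustration: the CRF labeler that produces the confidences is not shown, and the names SemanticType and top_k are hypothetical, not taken from the authors' implementation. The confidence values are the ones listed on the slide.

```python
from typing import NamedTuple, Optional

class SemanticType(NamedTuple):
    """A candidate semantic type (hypothetical container for this sketch)."""
    class_uri: str
    property_uri: Optional[str]  # None when the type is an ontology class only
    confidence: float

def top_k(candidates, k):
    """Return the k candidates with the highest confidence, best first."""
    return sorted(candidates, key=lambda t: t.confidence, reverse=True)[:k]

# Candidates for two attributes, with the confidences shown on the slide.
title_candidates = [
    SemanticType("aac:CulturalHeritageObject", "rdfs:label", 0.08),
    SemanticType("aac:CulturalHeritageObject", "dcterms:title", 0.19),
]
image_candidates = [
    SemanticType("foaf:Document", None, 0.41),
    SemanticType("edm:WebResource", None, 0.58),
]

best_title = top_k(title_candidates, 1)
best_images = top_k(image_candidates, 2)
```

With k = 1 only the best-scoring type survives, which is the restriction of the previous approach; a larger k keeps lower-ranked types available for later steps.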
13
Approach
14
Build Graph G: Add Known Models
15
Build Graph G: Add Semantic Types
16
Build Graph G: Expand with Paths from Ontologies
17
Approach
18
Map Source Attributes to the Graph
New source: S(title, label, image, type, artist)
title  <aac:CulturalHeritageObject, dcterms:title>
       <aac:CulturalHeritageObject, rdfs:label>
label  <aac:CulturalHeritageObject, dcterms:description>
       <aac:Person, ElementsGr2:note>
image  <edm:WebResource>
       <foaf:Document>
type   <skos:Concept, skos:prefLabel>
       <skos:Concept, rdfs:label>
name   <foaf:Person, foaf:name>
       <aac:Person, ElementsGr2:nameOfThePerson>
19
Map Source Attributes to the Graph
20
Map Source Attributes to the Graph
21
Scalability Issue
• Multiple mappings from attributes(S) to nodes of G
  – Each attribute has more than one semantic type
  – Multiple matches for each semantic type
• Not feasible to generate all possible mappings
  – The size of the graph may be large
  – The source may have many attributes
• Exponential in the number of attributes
  – N attributes with M matches each give M^N mappings
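To make the growth concrete, the short Python sketch below enumerates every complete mapping for a toy input and compares the count with M^N. The node identifiers and the helper count_mappings are hypothetical and exist only for this illustration.

```python
from itertools import product
from math import prod

def count_mappings(matches_per_attribute):
    """Number of complete mappings: the product of the per-attribute match counts."""
    return prod(len(matches) for matches in matches_per_attribute)

# Hypothetical toy input: N = 4 attributes, each with M = 3 matching nodes in G.
matches_per_attribute = [[f"a{i}_n{j}" for j in range(3)] for i in range(4)]

# Exhaustive enumeration picks one matching node per attribute.
all_mappings = list(product(*matches_per_attribute))
total = count_mappings(matches_per_attribute)
```

Here the enumeration yields 3^4 = 81 mappings; adding one more attribute with 3 matches triples that number, which is why exhaustive generation stops being practical as sources grow.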
22
Prune the Mappings
• Score the partial mappings after mapping each attribute
  – Coherence: number of nodes in a mapping that belong to the same component
  – Confidence: average confidence of the semantic types in the mapping m
  – Score = arithmetic mean of coherence and confidence
• Beam Search
  – Keep only top K mappings after mapping each attribute
• Number of mappings will not exceed a fixed size after mapping each attribute
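A minimal Python sketch of this pruning loop is given below, under stated assumptions. The score is taken as the mean of coherence and confidence, which is what the worked examples on the scoring slides compute. The coherence function here is a simplification (the share of matched nodes that fall in the largest common component), and the Match type, the component labels S1 and S2, and the node names are hypothetical. The slides count graph nodes differently, so the numbers are not meant to reproduce theirs.

```python
from collections import Counter
from typing import NamedTuple

class Match(NamedTuple):
    """One candidate graph node for an attribute (hypothetical structure)."""
    node: str          # id of a node in G
    component: str     # component (e.g. known model) the node belongs to
    confidence: float  # confidence of the semantic type behind the match

def coherence(mapping):
    """Assumed definition: share of nodes in the largest common component."""
    counts = Counter(m.component for m in mapping)
    return max(counts.values()) / len(mapping)

def confidence(mapping):
    """Average confidence of the semantic types used in the mapping."""
    return sum(m.confidence for m in mapping) / len(mapping)

def score(mapping):
    """Arithmetic mean of coherence and confidence."""
    return (coherence(mapping) + confidence(mapping)) / 2

def beam_search(candidates_per_attribute, beam_width):
    """Map one attribute at a time, keeping only the best partial mappings."""
    beam = [()]
    for candidates in candidates_per_attribute:
        expanded = [partial + (c,) for partial in beam for c in candidates]
        expanded.sort(key=score, reverse=True)
        beam = expanded[:beam_width]
    return beam

# Toy input: two attributes with two candidate matches each.
candidates = [
    [Match("title@S1", "S1", 0.19), Match("label@S2", "S2", 0.08)],
    [Match("name@S2", "S2", 0.27), Match("name@S1", "S1", 0.19)],
]
beam = beam_search(candidates, beam_width=2)
best = beam[0]
```

In this toy input the best mapping pairs title@S1 with name@S1, although name@S2 has the higher confidence, because keeping both nodes in one component raises coherence. The beam never holds more than beam_width partial mappings between attributes.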
23
Score the Mappings
New source: S(title, label, image, type, artist)
title  <aac:CulturalHeritageObject, dcterms:title>, 0.19
       <aac:CulturalHeritageObject, rdfs:label>
label  <aac:CulturalHeritageObject, dcterms:description>, 0.7
       <aac:Person, ElementsGr2:note>
image  <edm:WebResource>, 0.58
       <foaf:Document>
type   <skos:Concept, skos:prefLabel>, 0.82
       <skos:Concept, rdfs:label>
name   <foaf:Person, foaf:name>, 0.27
       <aac:Person, ElementsGr2:nameOfThePerson>
Coherence: 4/9 = 0.44
Confidence: 0.56
Score: 0.5
Example Mapping 2
24
Score the Mappings
New source: S(title, label, image, type, artist)
title  <aac:CulturalHeritageObject, dcterms:title>, 0.19
       <aac:CulturalHeritageObject, rdfs:label>
label  <aac:CulturalHeritageObject, dcterms:description>, 0.7
       <aac:Person, ElementsGr2:note>
image  <edm:WebResource>, 0.58
       <foaf:Document>
type   <skos:Concept, skos:prefLabel>, 0.82
       <skos:Concept, rdfs:label>
name   <foaf:Person, foaf:name>
       <aac:Person, ElementsGr2:nameOfThePerson>, 0.19
Coherence: 6/9 = 0.66
Confidence: 0.55
Score: 0.605
Example Mapping 1
This mapping gets a higher score even though it uses the 2nd-ranked semantic type for artist
25
Approach
26
Generate Semantic Models
• Select top M mappings
• Compute a Steiner tree for each mapping
  – A minimal tree connecting nodes of mapping
• Each tree is a candidate model
• Rank candidate models (Steiner trees)
  – Cost
  – Score of the corresponding mapping
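The slides do not say which algorithm computes the Steiner tree. As an illustration only, the Python sketch below uses a simple shortest-path heuristic: start from one mapped node and repeatedly attach the nearest remaining mapped node through its cheapest path. The result is an approximation, not a guaranteed minimum. The graph is treated as undirected, and the edge weights, including the direct edge of weight 3, are hypothetical.

```python
import heapq

def nearest_terminal_path(graph, tree_nodes, targets):
    """Multi-source Dijkstra from the current tree to the closest target node."""
    dist = {n: 0.0 for n in tree_nodes}
    prev = {}
    heap = [(0.0, n) for n in tree_nodes]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        if u in targets:
            path = [u]
            while path[-1] in prev:
                path.append(prev[path[-1]])
            return d, path[::-1]  # path runs from the tree to the target
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    raise ValueError("terminals are not connected")

def approximate_steiner_tree(graph, terminals):
    """Grow a tree that connects all terminals; returns (edges, total cost)."""
    terminals = list(terminals)
    tree_nodes = {terminals[0]}
    remaining = set(terminals[1:])
    edges, total = set(), 0.0
    while remaining:
        cost, path = nearest_terminal_path(graph, tree_nodes, remaining)
        edges.update(frozenset(pair) for pair in zip(path, path[1:]))
        tree_nodes.update(path)
        remaining.discard(path[-1])
        total += cost
    return edges, total

def add_edge(graph, u, v, weight):
    """Insert an undirected weighted edge."""
    graph.setdefault(u, {})[v] = weight
    graph.setdefault(v, {})[u] = weight

# Toy graph built from class names on the slides; all weights are hypothetical.
g = {}
add_edge(g, "aac:CulturalHeritageObject", "aac:Person", 1.0)
add_edge(g, "aac:CulturalHeritageObject", "skos:Concept", 1.0)
add_edge(g, "edm:EuropeanaAggregation", "aac:CulturalHeritageObject", 1.0)
add_edge(g, "edm:EuropeanaAggregation", "edm:WebResource", 1.0)
add_edge(g, "aac:CulturalHeritageObject", "edm:WebResource", 3.0)

mapped_nodes = ["aac:CulturalHeritageObject", "aac:Person",
                "edm:WebResource", "skos:Concept"]
tree_edges, tree_cost = approximate_steiner_tree(g, mapped_nodes)
```

In this toy graph the tree reaches edm:WebResource through edm:EuropeanaAggregation, a node that no attribute maps to, because that route is cheaper than the direct edge. Nodes pulled in this way are what make the candidate model more than the mapped nodes alone.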
27
Steiner Tree
28
Evaluation
• Dataset
  – 29 museum data sources
  – 332 attributes (average 11 per source)
• Domain ontologies
  – EDM, SKOS, FOAF, ORE, ElementsGr2, AAC, DCTerms
  – 119 classes, 351 properties
• Compute precision and recall between learned models and correct models
  – Precision: how many of the learned relationships are correct?
  – Recall: how many of the correct relationships are learned?
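The two questions above correspond to precision and recall over sets of relationships. A small Python sketch follows, assuming each relationship is represented as a (subject class, property, object class) triple. The triples are illustrative readings of the example model, not evaluation data from the paper, and the function name precision_recall is hypothetical.

```python
def precision_recall(learned, correct):
    """Precision and recall of learned relationships against the correct ones."""
    learned, correct = set(learned), set(correct)
    hits = len(learned & correct)
    precision = hits / len(learned) if learned else 0.0
    recall = hits / len(correct) if correct else 0.0
    return precision, recall

# Illustrative relationships only (assumed for this sketch).
correct_model = {
    ("aac:CulturalHeritageObject", "dcterms:creator", "aac:Person"),
    ("edm:EuropeanaAggregation", "edm:aggregatedCHO", "aac:CulturalHeritageObject"),
    ("edm:EuropeanaAggregation", "edm:hasResource", "edm:WebResource"),
    ("aac:CulturalHeritageObject", "edm:hasType", "skos:Concept"),
}
learned_model = {
    ("edm:EuropeanaAggregation", "edm:aggregatedCHO", "aac:CulturalHeritageObject"),
    ("aac:CulturalHeritageObject", "edm:hasType", "skos:Concept"),
    ("aac:CulturalHeritageObject", "dcterms:creator", "foaf:Person"),
}

precision, recall = precision_recall(learned_model, correct_model)
```

Two of the three learned relationships are correct, so precision is 2/3; two of the four correct relationships were found, so recall is 0.5.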
29
Quality
k = 1: correct semantic type learned for only 62% of attributes
k = 4: correct semantic type was among the 4 learned types for 87% of attributes
[Figure: precision and recall (y-axis, 0.3 to 1.0) versus number of known semantic models (x-axis, 0 to 28), with series for k=1, k=4, and correct types]
Performance
The previous approach was not able to learn semantic models for sources with more than 4 attributes within the timeout of 1 hour
Example: S16, with only 5 attributes, has 16,633,298 mappings (29*29*29*31*22)
[Figure: time (y-axis, 0 to 60) versus number of attributes (x-axis, 0 to 30) for the previous approach and the new approach (Kbeam = 100)]
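The S16 count can be checked directly, and set against a rough bound on the work a beam of width 100 would do. The Python sketch below assumes that the five factors are the numbers of matches per attribute, and that every kept partial mapping is extended with every match of the next attribute. The helper beam_scored_upper_bound is hypothetical and models counting only, not measured running time.

```python
from math import prod

# Presumed per-attribute match counts for S16, taken from the product on the slide.
matches_s16 = [29, 29, 29, 31, 22]

# Exhaustive generation considers every combination.
exhaustive = prod(matches_s16)

def beam_scored_upper_bound(matches, beam_width):
    """Upper bound on partial mappings scored when at most beam_width are kept."""
    kept, scored = 1, 0
    for m in matches:
        expanded = kept * m
        scored += expanded
        kept = min(expanded, beam_width)
    return scored

beam_bound = beam_scored_upper_bound(matches_s16, beam_width=100)
```

Under these assumptions the exhaustive count is 16,633,298, while the beam scores at most 9,070 partial mappings. The bound grows linearly with the number of attributes once the beam is full.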
31
Related Work
• Schema matching & schema mapping
  – iMAP [Dhamankar et al., 2004], Clio [Fagin et al., 2009]
• Mapping databases and spreadsheets to ontologies
  – Mapping languages: D2R [Bizer, 2003], D2RQ [Bizer and Seaborne, 2004], R2RML [Das et al., 2012]
  – Tools: RDOTE [Vavliakis et al., 2010], RDF123 [Han et al., 2008], XLWrap [Langegger and Wöß, 2009]
  – String similarity between column names and ontology terms [Polfliet and Ichise, 2010]
• Understand semantics of Web tables
  – Use column headers and cell values to find the labels and relations from a database of labels and relations populated from the Web [Wang et al., 2012] [Limaye et al., 2010] [Venetis et al., 2011]
• Exploit Linked Open Data (LOD)
  – Link the values to the entities in LOD to find the types of the values and their relationships [Muñoz et al., 2013] [Mulwad et al., 2013]
• Learn Semantic Definitions of Online Information Sources [Carman, Knoblock, 2007]
  – Learns LAV rules from known sources
  – Only learns descriptions that are conjunctive combinations of known descriptions
32
Future Work
• Scalability regarding number of the known models
  – Create a more compact graph
  – Consolidate overlapping segments of known models
• Leverage Linked Open Data (LOD)
  – Exploit the relationships between instances
  – Improve the accuracy of the learned relations
• Integrate the new approach in Karma
  – http://www.isi.edu/integration/karma
  – @KarmaSemWeb