Graph-based Ontology Analysis in the Linked Open Data
Lihua Zhao, Ryutaro Ichise
September 5, 2012, I-Semantics2012, Graz, Austria
Introduction Related Work Our Approach Experiments Comparison with Previous Work Conclusion and Future Work
Outline
Introduction
Related Work
Our Approach
  Graph Pattern Extraction
  <Predicate, Object> Collection
  Related Classes and Predicates Grouping
  Integration for All Graph Patterns
  Manual Revision
Experiments
  Experimental Data
  Graph Patterns of Linked Instances
  Class-level Analysis
  Predicate-level Analysis
Comparison with Previous Work
Conclusion and Future Work
Lihua Zhao, Ryutaro Ichise | Graph-based Ontology Analysis in the Linked Open Data | 2
Introduction
Linked Open Data (LOD)
  295 data sets, 31 billion RDF triples (as of Sep. 2011).
  Interlinked instances (owl:sameAs).
[Figure: LOD cloud diagram of interlinked data sets, as of September 2011. Legend (domains): Media, Geographic, Publications, User-generated content, Government, Cross-domain, Life sciences.]
Challenging Problems
It is infeasible to understand the ontology schemas of all linked data sets.
Ontology heterogeneity problem
  Heterogeneous ontology classes
    DBpedia: http://dbpedia.org/ontology/Country
    Geonames: http://www.geonames.org/ontology#A.PCLI
    LinkedMDB: http://data.linkedmdb.org/resource/movie/country
  Heterogeneous ontology predicates
    http://dbpedia.org/property/populationTotal
    http://dbpedia.org/property/population
Inspecting large ontologies is time-consuming and infeasible
  Misuse of classes and predicates.
  DBpedia: 320 classes and thousands of predicates.
Solution for the Problems
Automatically or semi-automatically integrate different ontologies by analyzing interlinked instances.
Semi-automatic ontology integration
  Reduces ontology heterogeneity.
  Identifies important ontology classes and predicates that link instances.
  Yields a simple integrated ontology that is easy to understand.
  Simplifies queries over various data sets.
Related Work
Find useful attributes from frequent graph patterns. [Le, et al., 2010]
  Only for geographic data.
Analysis of basic predicates of the SameAs network, Pay-Level-Domain network, and Class-Level Similarity network. [Ding, et al., 2010]
  Only frequent types are considered when analyzing how data are connected.
A debugging method for mapping lightweight ontologies. [Meilicke, et al., 2008]
  Limited to expressive lightweight ontologies.
Construct an intermediate-layer ontology from geospatial, zoology, and genetics data resources. [Parundekar, et al., 2010]
  Only for specific domains, and considers only the class level.
Construct an integrated mid-ontology from DBpedia, Geonames, and NYTimes. [Zhao, et al., 2011]
  Needs a hub data set and considers only the predicate level.
Our Approach
Step 1: Graph Pattern Extraction
Graph Pattern Extraction
Extract graph patterns from interlinked instances to discover related ontology classes and predicates.
SameAs Graph SG = (V, E, I), where V is a set of labels of data sets, E ⊆ V × V, and I is a set of URIs of the interlinked instances.
Example: SG_Austria = (V, E, I)
  V = {D, G, N, M}
  E = {(D,G), (D,N), (G,N), (G,M)}
  I = {db:Austria, geo:2782113, nyt:66221058161318373601, mdb-country:AT}
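A minimal Python sketch of how a SameAs Graph might be assembled from owl:sameAs links. It assumes the links arrive as pairs of prefixed URIs and that the prefix-to-label map `LABELS` is known in advance; the function names and the connected-component strategy are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict

# Assumed mapping from URI prefix to data set label (D, G, N, M as in the example).
LABELS = {"db": "D", "geo": "G", "nyt": "N", "mdb-country": "M"}

def label_of(uri: str) -> str:
    """Return the data set label for a prefixed URI such as 'db:Austria'."""
    return LABELS[uri.split(":", 1)[0]]

def same_as_graphs(links):
    """Group owl:sameAs links into SameAs Graphs SG = (V, E, I).

    Each connected component of the undirected link graph yields one SG.
    """
    adjacency = defaultdict(set)
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen, graphs = set(), []
    for start in adjacency:
        if start in seen:
            continue
        # Depth-first search collects the interlinked instances I.
        stack, instances = [start], set()
        while stack:
            node = stack.pop()
            if node in instances:
                continue
            instances.add(node)
            stack.extend(adjacency[node] - instances)
        seen |= instances
        # E keeps the direction in which each link was given.
        edges = {(label_of(a), label_of(b)) for a, b in links if a in instances}
        vertices = {label_of(uri) for uri in instances}
        graphs.append((vertices, edges, instances))
    return graphs

links = [
    ("db:Austria", "geo:2782113"),
    ("db:Austria", "nyt:66221058161318373601"),
    ("geo:2782113", "nyt:66221058161318373601"),
    ("geo:2782113", "mdb-country:AT"),
]
sg_austria = same_as_graphs(links)[0]
```

Under these assumptions, the pair (V, E) of each SameAs Graph can serve as its graph pattern, so counting identical (V, E) pairs would give pattern frequencies.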
Step 2: <Predicate, Object> Collection
<Predicate, Object> Collection
An instance is described by a collection of <subject, predicate, object> triples (instance URI → subject, property → predicate, class → object).
The <predicate, object> (PO) pairs form the content of a SameAs Graph.
Classify PO pairs into five types
  Class: rdf:type and skos:inScheme.
  Date: XMLSchema:date, gYear, gMonthDay, etc.
  Number: XMLSchema:integer, int, float, double, etc.
  URI: starts with “http://” and XMLSchema:anyURI.
  String: XMLSchema:string and others.
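The five-way classification can be sketched in Python as below. The function name `classify_po`, the `datatype` argument (the local name of the object's XML Schema datatype, if any), and the order of the checks are assumptions made for illustration; the sketch follows only the rules listed above and covers only the datatypes named there.

```python
# Predicates whose objects are treated as classes.
CLASS_PREDICATES = {"rdf:type", "skos:inScheme"}
# Only the datatypes named on the slide; extend these sets as needed.
DATE_TYPES = {"date", "gYear", "gMonthDay"}
NUMBER_TYPES = {"integer", "int", "float", "double"}

def classify_po(predicate, obj, datatype=None):
    """Return 'Class', 'Date', 'Number', 'URI', or 'String' for a PO pair."""
    if predicate in CLASS_PREDICATES:
        return "Class"
    if datatype in DATE_TYPES:
        return "Date"
    if datatype in NUMBER_TYPES:
        return "Number"
    if datatype == "anyURI" or obj.startswith("http://"):
        return "URI"
    # Plain strings and anything not matched above fall back to String.
    return "String"

examples = [
    ("rdf:type", "db-onto:Country", None),
    ("db-prop:populationEstimate", "8356707", "integer"),
    ("db-onto:wikiPageExternalLink", "http://www.austria.mu/", None),
    ("rdfs:label", "Austria", "string"),
]
po_types = [classify_po(p, o, d) for p, o, d in examples]
```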
An Example of Collected PO pairs
Table: PO pairs and types for SG_Austria
  Predicate                      Object                       Type
  rdf:type                       owl:Thing                    Class
  rdf:type                       db-onto:Place                Class
  rdf:type                       db-onto:PopulatedPlace       Class
  rdf:type                       db-onto:Country              Class
  rdfs:label                     “Austria”@en                 String
  db-onto:wikiPageExternalLink   http://www.austria.mu/       URI
  db-prop:populationEstimate     8356707                      Number
  ...                            ...                          ...
  geo-onto:name                  Austria                      String
  geo-onto:alternateName         “Austria”@en                 String
  geo-onto:alternateName         “Republic of Austria”@en     String
  geo-onto:featureClass          geo-onto:A                   Class
  geo-onto:featureCode           geo-onto:A.PCLI              Class
  geo-onto:population            8205000                      Number
  ...                            ...                          ...
  rdf:type                       mdb:country                  Class
  mdb:country_name               Austria                      String
  ...                            ...                          ...
  skos:inScheme                  nyt:nytd_geo                 Class
  skos:prefLabel                 “Austria”@en                 String
  nyt-prop:first_use             2004-10-04                   Date
Step 3: Related Classes and Predicates Grouping
Related Classes Grouping
Group related classes from each SameAs Graph by tracking the subsumption relations rdfs:subClassOf and skos:inScheme.
<C1 rdfs:subClassOf C2> or <C1 skos:inScheme C2> means that the concept of class C1 is more specific than the concept of class C2.
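One plausible way to use the subsumption relations is to keep, for each data set, only the most specific classes found in a SameAs Graph before grouping them. The Python sketch below does this under stated assumptions: the `broader` map (class to its direct, more general classes) and the function name are hypothetical, and the slide does not say that the authors track subsumption in exactly this way.

```python
def most_specific(classes, broader):
    """Return the classes that are not an ancestor of another class in `classes`.

    `broader` maps a class to the classes it is directly subsumed by.
    """
    def ancestors(cls):
        found, stack = set(), list(broader.get(cls, ()))
        while stack:
            parent = stack.pop()
            if parent not in found:
                found.add(parent)
                stack.extend(broader.get(parent, ()))
        return found

    covered = set()
    for cls in classes:
        covered |= ancestors(cls)
    return {cls for cls in classes if cls not in covered}

# Assumed subsumption chain for the DBpedia classes of the Austria example.
broader = {
    "db-onto:Country": ["db-onto:PopulatedPlace"],
    "db-onto:PopulatedPlace": ["db-onto:Place"],
    "db-onto:Place": ["owl:Thing"],
}
db_classes = {"owl:Thing", "db-onto:Place", "db-onto:PopulatedPlace", "db-onto:Country"}
specific = most_specific(db_classes, broader)
```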
Related Predicates Grouping
Perform pairwise comparison on <predicate, object> (PO) pairs to find related predicates (properties).
Discover related predicates using different methods for the types Date, URI, Number, and String.
  Date, URI: exact matching.
  Number, String: exact matching + similarity matching.
Exact matching on PO pairs to create initial sets of PO pairs.
  If $O_{PO_i} = O_{PO_j}$ or $P_{PO_i} = P_{PO_j}$ $\Rightarrow$ $S_k \leftarrow PO_i, PO_j$
  $O_{PO}$: the object of PO.
  $P_{PO}$: the predicate of PO.
  $S$: initial set of PO pairs.
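The exact-matching rule can be sketched with a union-find structure, as below. This is an illustrative assumption about how the initial sets might be built: it applies the pairwise rule transitively (pairs sharing a predicate or an object end up in the same set), and objects are compared as plain strings without language tags.

```python
def exact_match_sets(po_pairs):
    """Group (predicate, object) pairs that share a predicate or an object."""
    parent = list(range(len(po_pairs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen = {}
    for idx, (pred, obj) in enumerate(po_pairs):
        # Tag keys so that a predicate never collides with an equal object string.
        for key in (("P", pred), ("O", obj)):
            if key in first_seen:
                parent[find(idx)] = find(first_seen[key])
            else:
                first_seen[key] = idx

    groups = {}
    for idx, po in enumerate(po_pairs):
        groups.setdefault(find(idx), []).append(po)
    return list(groups.values())

pairs = [
    ("rdfs:label", "Austria"),
    ("geo-onto:name", "Austria"),
    ("skos:prefLabel", "Austria"),
    ("db-prop:populationEstimate", "8356707"),
    ("geo-onto:population", "8205000"),
]
initial_sets = exact_match_sets(pairs)
```

In this example the three label-like predicates share the object "Austria" and form one initial set, while the two population predicates stay in separate sets until similarity matching is applied.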
Related Predicates Grouping
Similarity matching on PO pairs of type Number and String.
Similarity between $PO_i$ and $PO_j$:
  $Sim(PO_i, PO_j) = \frac{ObjSim(PO_i, PO_j) + PreSim(PO_i, PO_j)}{2}$
Merge similar initial sets $S_i$ and $S_j$ if $Sim(PO_i, PO_j) \geq \theta$, where $PO_i \in S_i$, $PO_j \in S_j$.
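The merge step can be sketched as follows. The similarity function is passed in as a parameter; the toy `toy_obj_sim` and `toy_pre_sim` below are stand-ins (the predicate similarity in particular is a crude keyword check, not a WordNet measure), the predicate `ex:area` is hypothetical, and the threshold 0.8 is an assumed value, since the slide does not state the value of θ.

```python
def merge_similar_sets(sets, sim, theta):
    """Merge initial sets whenever some cross pair reaches sim >= theta."""
    merged = [list(s) for s in sets]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if any(sim(a, b) >= theta for a in merged[i] for b in merged[j]):
                    merged[i].extend(merged.pop(j))
                    changed = True
                    break
            if changed:
                break  # restart the scan after every merge
    return merged

def toy_obj_sim(a, b):
    """Numeric object similarity: 1 - |x - y| / (x + y), for positive values."""
    x, y = float(a[1]), float(b[1])
    return 1.0 - abs(x - y) / (x + y)

def toy_pre_sim(a, b):
    """Stand-in for the predicate similarity (not WordNet-based)."""
    return 1.0 if "population" in a[0].lower() and "population" in b[0].lower() else 0.0

def sim(a, b):
    """Sim = (ObjSim + PreSim) / 2."""
    return (toy_obj_sim(a, b) + toy_pre_sim(a, b)) / 2

initial = [
    [("db-prop:populationEstimate", "8356707")],
    [("geo-onto:population", "8205000")],
    [("ex:area", "83858")],  # hypothetical predicate and value
]
merged_sets = merge_similar_sets(initial, sim, theta=0.8)
```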
Related Predicates Grouping
Similarity of objects between two PO pairs.
  $ObjSim(PO_i, PO_j) = \begin{cases} 1 - \frac{|O_{PO_i} - O_{PO_j}|}{O_{PO_i} + O_{PO_j}} & \text{if } O_{PO} \text{ is Number} \\ StrSim(O_{PO_i}, O_{PO_j}) & \text{if } O_{PO} \text{ is String} \end{cases}$
  $O_{PO}$: the object of PO.
  $StrSim(O_{PO_i}, O_{PO_j})$: the average of three string-based similarity values: Jaro-Winkler, Levenshtein distance, and n-gram.
Similarity of predicates between $PO_i$ and $PO_j$.
  $PreSim(PO_i, PO_j) = WNSim(T_{PO_i}, T_{PO_j})$
  $T_{PO}$: the pre-processed terms of the predicates in PO.
  $WNSim(T_{PO_i}, T_{PO_j})$: the average of the nine applied WordNet-based similarity values.
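A small Python sketch of the object similarity. The Number case follows the formula directly. For the String case, this sketch averages only two measures, a normalized Levenshtein similarity and a bigram Dice coefficient; Jaro-Winkler is omitted for brevity, so `str_sim` is a simplified stand-in for StrSim, and the normalizations chosen here are assumptions.

```python
def obj_sim_number(a: float, b: float) -> float:
    """Number case: 1 - |a - b| / (a + b), assuming non-negative values."""
    if a == b:
        return 1.0  # also avoids 0/0 when both values are zero
    return 1.0 - abs(a - b) / (a + b)

def levenshtein_sim(s: str, t: str) -> float:
    """1 - edit distance / length of the longer string."""
    if not s and not t:
        return 1.0
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return 1.0 - prev[-1] / max(len(s), len(t))

def ngram_sim(s: str, t: str, n: int = 2) -> float:
    """Dice coefficient over the sets of character n-grams."""
    a = {s[i:i + n] for i in range(len(s) - n + 1)}
    b = {t[i:i + n] for i in range(len(t) - n + 1)}
    if not a or not b:
        return 1.0 if s == t else 0.0
    return 2 * len(a & b) / (len(a) + len(b))

def str_sim(s: str, t: str) -> float:
    """Average of the two string measures above (Jaro-Winkler omitted)."""
    return (levenshtein_sim(s, t) + ngram_sim(s, t)) / 2

population_sim = obj_sim_number(8356707, 8205000)
name_sim = str_sim("Austria", "Republic of Austria")
```

For the two population values in the Austria example, the Number case gives 1 - 151707 / 16561707, which is roughly 0.991.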
Step 4: Integration for All Graph Patterns
Integration for All Graph Patterns
Groups of related classes and predicates are independent for each graph pattern. Hence, we integrate them across all graph patterns to construct an integrated ontology.
Select terms for the integrated ontology.
  ex-onto:ClassTerm: select one concept from a set of classes.
  ex-prop:propTerm: select one concept from a set of predicates.
Construct relations.
  ex-prop:hasMemberClasses: link sets of classes with ex-onto:ClassTerm.
  ex-prop:hasMemberDataTypes: link sets of predicates with ex-prop:propTerm.
Construct an integrated ontology.
  Sets of related classes and predicates.
  Selected terms: ClassTerm and propTerm.
  Constructed relations.
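As a rough illustration of the term selection and relation construction for classes, the Python sketch below emits plain triples. The slide does not say how a term is chosen from a set; picking the most frequent local name (ignoring case) is purely an assumption here, as are the function names.

```python
from collections import Counter

def local_name(curie: str) -> str:
    """Return the part after the prefix, e.g. 'Country' for 'db-onto:Country'."""
    return curie.split(":", 1)[1]

def select_term(members):
    """Assumed heuristic: the most frequent local name, compared case-insensitively."""
    counts = Counter(local_name(m).lower() for m in members)
    best = max(counts.values())
    for m in members:  # the first member with a top-count name wins
        if counts[local_name(m).lower()] == best:
            return local_name(m)

def class_group_triples(members):
    """Link a selected ex-onto term to its member classes."""
    term = "ex-onto:" + select_term(members)
    return [(term, "ex-prop:hasMemberClasses", m) for m in members]

country_group = ["db-onto:Country", "geo-onto:A.PCLI", "mdb:country", "nyt:nytd_geo"]
triples = class_group_triples(country_group)
```

The same shape would apply to predicate groups, with an ex-prop term and ex-prop:hasMemberDataTypes as the linking relation.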
Step 5: Manual Revision
Manual Revision
Minor revision of the automatically constructed ontology.
Modify incorrect terms
  Not all the terms for classes and predicates are properly selected.
Add domain information
  About 40% of the predicate sets lack rdfs:domain information.
Modify incorrectly grouped classes and predicates
  We cannot guarantee 100% accuracy.
Experiments
Analyze the characteristics of linked instances with the integrated ontology constructed with our approach.
Experimental Data
Graph Patterns of Linked Instances
Class-level Analysis
Predicate-level Analysis
Experimental Data
DBpedia: cross-domain, 3.5 million things, 8.9 million URIs.
Geonames: geographical domain, 7 million URIs.
NYTimes: media domain, 10,467 subject news.
LinkedMDB: media domain, 0.5 million entities.
Graph Patterns of Linked Instances
13 graph patterns
Frequent graph patterns: GP1, GP2, GP3
N,G,D: GP4, GP5, GP7, GP8
N,M,D: GP6
M,G,D: GP9
M,D,N,G: GP10, GP11, GP12, GP13
Class-level Analysis
Successfully integrated related classes from the extracted graph patterns.
Characteristics of graph patterns
  Class Type                          Graph Pattern
  Actor                               GP2, GP6
  Person (Athlete, Politician, etc.)  GP3
  Organization/Agent                  GP1, GP3, GP8
  Film                                GP2
  City/Settlement                     GP1, GP4, GP5, GP7, GP8
  Country                             GP9, GP10, GP11, GP12, GP13
  Place (Mountain, River, etc.)       GP1, GP3, GP7
Integrated 97 classes into 48 groups
  Example: ex-onto:Country
    db-onto:Country, geo-onto:A.PCLI, mdb:country, nyt:nytd_geo
Class-level Analysis
Discover missing class information
  Example: db:Shingo_Katori
    db:Shingo_Katori rdf:type dbpedia-owl:MusicalArtist.
    mdb-actor:27092 owl:sameAs db:Shingo_Katori.
    Therefore, db:Shingo_Katori rdf:type db-onto:Actor.
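The inference in this example can be sketched in Python as below. Several inputs are assumptions made for illustration: that mdb-actor:27092 has the type mdb:actor, that an integrated class group contains both db-onto:Actor and mdb:actor, and that owl:sameAs links are looked up symmetrically. The function name is hypothetical.

```python
def suggest_missing_classes(instance, types, same_as, class_groups, prefix):
    """Suggest classes with the given prefix that linked instances imply."""
    suggestions = set()
    for other in same_as.get(instance, ()):
        for cls in types.get(other, ()):
            for group in class_groups:
                if cls in group:
                    # Propose the group members that belong to the target data set.
                    suggestions |= {c for c in group if c.startswith(prefix)}
    return suggestions - set(types.get(instance, ()))

types = {
    "db:Shingo_Katori": {"dbpedia-owl:MusicalArtist"},
    "mdb-actor:27092": {"mdb:actor"},  # assumed type, not stated on the slide
}
same_as = {"db:Shingo_Katori": {"mdb-actor:27092"}}  # link treated as symmetric
class_groups = [{"db-onto:Actor", "mdb:actor"}]  # assumed integrated class group

missing = suggest_missing_classes(
    "db:Shingo_Katori", types, same_as, class_groups, prefix="db-onto:"
)
```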
Main classes of each data set.
  NYTimes: person, organization, and place.
  LinkedMDB: movie, actor, and country.
  Geonames: A (country, administrative region), P (city, settlement), T (mountain), S (building, school), and H (lake, river).
  DBpedia: person (artist, politician, athlete), organization (company, educational institute, sports team), work (film), and place (populated place, natural place, architectural structure).
Predicate-level Analysis
Integrated 367 predicates into 38 groups
  Example: ex-prop:birthDate
  Predicate              Number of Instances
  db-onto:birthDate      287,327
  db-prop:datebirth      1,675
  db-prop:dateofbirth    87,364
  db-prop:dateOfBirth    163,876
  db-prop:born           34,832
  db-prop:birthdate      70,630
  db-prop:birthDate      101,121
Recommend standard predicates
  <db-onto:birthDate, rdfs:domain, db-onto:Person>
  “db-onto:birthDate” has the highest frequency of usage.
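Choosing the recommended predicate by usage frequency can be illustrated with a few lines of Python, using the instance counts from the table. The function name `recommend_standard` is hypothetical, and the sketch assumes that frequency alone decides the recommendation.

```python
# Number of instances per predicate in the ex-prop:birthDate group.
usage = {
    "db-onto:birthDate": 287_327,
    "db-prop:datebirth": 1_675,
    "db-prop:dateofbirth": 87_364,
    "db-prop:dateOfBirth": 163_876,
    "db-prop:born": 34_832,
    "db-prop:birthdate": 70_630,
    "db-prop:birthDate": 101_121,
}

def recommend_standard(usage_counts):
    """Return the predicate used by the largest number of instances."""
    return max(usage_counts, key=usage_counts.get)

recommended = recommend_standard(usage)
```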
Comparison with Previous Work
Compare our ontology integration approach with the mid-ontology approach [Zhao, et al., JIST2011].
  Mid-Ontology approach                                        Our approach
  A hub data set for data collection.                          No hub data set.
  String-based similarity measures for all types of objects.   Different similarity measures for different types of objects.
  105 predicates in 22 groups.                                 367 predicates in 38 groups.
  No classes.                                                  97 classes in 48 groups.
Conclusion and Future Work
Conclusion
  Integrate heterogeneous ontologies from various data sets.
  Identify the characteristics of graph patterns using the integrated ontology classes.
  Recommend standard predicates using the integrated ontology predicates.
  Reduce the heterogeneity of ontologies.
  Construct an integrated ontology without learning the entire ontology schema.
Future Work
  Use more data sets in the LOD cloud.
  Apply the MapReduce method to address the scalability and ontology heterogeneity problems.
Questions?
Lihua Zhao, [email protected]
Ryutaro Ichise, [email protected]