How to integrate Linked Data into your application

Post on 08-May-2015




Slides presented by Christian Becker at the Semantic Technology & Business Conference, San Francisco, June 2012. More details at:


<ul><li>1. SEMANTIC TECHNOLOGY &amp; BUSINESS CONFERENCE | SAN FRANCISCO, JUNE 5, 2012. HOW TO INTEGRATE LINKED DATA INTO YOUR APPLICATION. LDIF Team: Andreas Schultz, Freie Universität Berlin; Andrea Matteini, mes|semantics; Robert Isele, Freie Universität Berlin; Pablo N. Mendes, Freie Universität Berlin; Christian Becker, mes|semantics; Christian Bizer, Freie Universität Berlin. With contributions by: Hannes Mühleisen, Freie Universität Berlin; William Smith, Vulcan Inc.</li></ul>
<p>2. WHAT IS LINKED DATA? Raw data (RDF). Accessible on the web. Data can link to other data sources. [Figure: things in data sources A to E connected by data links.] Benefits: ease of access and re-use; enables discovery. One API for all data sources?</p>
<p>3. LINKING OPEN DATA CLOUD. [Figure: the Linking Open Data cloud diagram, as of September 2011. Legend categories: Media, Geographic, Publications, User-generated content, Government, Cross-domain, Life sciences.]</p>
<p>4. TYPES OF LINKED DATA: open, public data (LOD Cloud); linked enterprise data; commercial Linked Data (very soon?). ... AND WHAT YOU CAN DO WITH THEM: provide interfaces on top of them; augment your website; integrate them into your application logic; create specialized data marts.</p>
<p>5. AUGMENT YOUR WEBSITE: BBC. BBC online properties make intensive use of data from Wikipedia and MusicBrainz.</p>
<p>6. DATA MARTS: NEUROWIKI. NeuroWiki creates views for genes, drugs and diseases data from four RDF data sources. It provides navigation and composition tools for accessing and mining the data.</p>
<p>7. APPLICATION LOGIC: IBM WATSON. IBM Watson makes use of Linked Data sources such as DBpedia.</p>
<p>8. 4 STEPS TO LINKED DATA INTEGRATION</p>
<p>9. 
STEP #1: ACCESS LINKED DATA. Linked Data is published via HTTP, SPARQL endpoints, and RDF dumps.</p>
<table>
<tr><th></th><th colspan="3">Access Methods</th><th colspan="4">Decision Factors</th></tr>
<tr><th>Architecture</th><th>HTTP dereferencing</th><th>Dump import</th><th>SPARQL</th><th>Recency</th><th>Speed / Scalability</th><th>Reliability</th><th>Complexity</th></tr>
<tr><td>On-the-fly dereferencing</td><td>X</td><td></td><td></td><td>High</td><td>Low</td><td>Low</td><td>High</td></tr>
<tr><td>Query federation</td><td></td><td></td><td>X</td><td>High</td><td>Decreases exponentially as new sources are added</td><td>Low</td><td>Moderate with SPARQL 1.1 SERVICE clause</td></tr>
<tr><td>Crawling and caching</td><td>X</td><td>X</td><td>X</td><td>Depends</td><td>High</td><td>High</td><td>High</td></tr>
</table>
<p>Adapted from: Linked Data: Evolving the Web into a Global Data Space (Heath/Bizer 2011). Live access allows quick prototyping and limited production use. As data sets grow in size and more data sources are added, a crawling/caching architecture often becomes necessary.</p>
<p>10. STEP #1: ACCESS LINKED DATA. Implementations. On-the-fly dereferencing: LDspider, SQUIN, Semantic Web Client library. Query federation: SPARQL 1.1 SERVICE clause. Crawling and caching: triplestore import script; public caches (e.g. Sindice, OpenLink LOD endpoint); LDIF.</p>
<p>11. STEP #2: NORMALIZE VOCABULARIES. Data sources that overlap in content use a wide range of vocabularies. Over 60 % of all LOD sources use proprietary vocabularies. It's up to the data consumer to normalize the vocabularies. Enterprise: need to translate between internal and external vocabularies. [Figure: tag cloud of the most widely used vocabularies in the LOD cloud (08/10/2011), including dc, foaf, skos, geo, akt, bibo, mo, vcard, sioc and dbpedia. Source: FU Berlin / DERI.]</p>
<p>12. STEP #2: NORMALIZE VOCABULARIES. Approaches to Schema Mapping. Hand-crafting queries against individual sources: no different than an API.</p>
<pre>
OPTIONAL { ?ow fb:location.location.containedby [ ot:preferredLabel ?city_fb_con ] } .
OPTIONAL { ?ow dbp-prop:location ?loc .
           ?loc rdf:type umbel-sc:City ;
                ot:preferredLabel ?city_db_loc }
OPTIONAL { ?ow dbp-ont:city [ ot:preferredLabel ?city_db_cit ] }
</pre>
<p>Ontology Representation Languages: OWL, RDFS. Rules: SWRL, RIF. Query Languages: SPARQL CONSTRUCT clause; TopQuadrant SPARQLMotion; Mosto; R2R (part of LDIF).</p>
<p>13. STEP #2: NORMALIZE VOCABULARIES. Using SPARQL.</p>
<p>Rename a class:</p>
<pre>
CONSTRUCT { ?s a mo:MusicArtist }
WHERE { ?s a dbpedia-owl:MusicalArtist }
</pre>
<p>Value transformation:</p>
<pre>
CONSTRUCT { ?s movie:runtime ?runtimeInMinutes . }
WHERE {
  ?s dbpedia-owl:runtime ?runtime .
  BIND(?runtime * 60 AS ?runtimeInMinutes)
}
</pre>
<p>Create URI from literal:</p>
<pre>
CONSTRUCT {
  ?s diseasome:omim ?omimuri .
  ?omimuri dc:identifier ?identifier .
}
WHERE {
  ?s dbpedia-owl:omim ?omim .
  BIND(IRI(concat(, ?omim)) AS ?omimuri)
  BIND(concat("omim:", ?omim) AS ?identifier)
}
</pre>
<p>Slide credits: Andreas Schultz</p>
<p>14. STEP #3: RESOLVE IDENTIFIERS. Data sources that overlap in content use different identifiers for the same real-world entity. Most LOD sources only provide owl:sameAs links to one other data source. It's up to the data consumer to generate additional links. Enterprise: need to link both internal and external resources.</p>
<table>
<caption>Number of linked data sets per source (08/10/2011). Source: FU Berlin / DERI.</caption>
<tr><th>Linked data sets</th><th>Sources</th></tr>
<tr><td>1</td><td>98</td></tr>
<tr><td>2</td><td>62</td></tr>
<tr><td>3</td><td>38</td></tr>
<tr><td>4</td><td>19</td></tr>
<tr><td>5</td><td>5</td></tr>
<tr><td>6 - 10</td><td>17</td></tr>
<tr><td>&gt; 10</td><td>27</td></tr>
</table>
<p>15. STEP #3: RESOLVE IDENTIFIERS. Approaches to Identity Resolution. Improvised or manual merging. Rule-based approaches: SILK (part of LDIF); LIMES. [Figure: an entity labelled "Union Square" is compared with "Union Sq., New York", "Union Sq., Seattle" and "Union Sq., San Francisco" by label and geographic coordinates, and is matched to "Union Sq., San Francisco".]</p>
<p>16. STEP #4: FILTER DATA. Data sources that overlap in content provide data that is conflicting and of varying quality. Data sources have different knowledge levels, views or intents, and
wrong, biased, inconsistent or outdated information. Approaches: import data into distinct Named Graphs and query them separately using the SPARQL GRAPH clause; Sieve (part of LDIF).</p>
<p>17. LDIF: LINKED DATA INTEGRATION FRAMEWORK. Integrates Linked Data from multiple sources into a clean, local target representation while keeping track of data provenance. 1. Collect data: managed download and update. 2. Translate data into a single target vocabulary. 3. Resolve identifier aliases into local target URIs. 4. Cleanse data, resolving the conflicting values (NEW). 5. Output. Follows the Crawling and Caching Architecture Pattern. Open source (Apache License, Version 2.0). Collaboration between Freie Universität Berlin and mes|semantics.</p>
<p>18. LDIF PIPELINE (1 Collect data, 2 Translate data, 3 Resolve identities, 4 Cleanse data, 5 Output). Step 1, Collect data. Supported data sources: RDF dumps (all common formats); SPARQL endpoints; crawling Linked Data via HTTP.</p>
<p>19. LDIF PIPELINE. Step 2, Translate data. Sources use a wide range of different RDF vocabularies. [Figure: R2R maps dbpedia-owl:City, schema:Place and fb:location.citytown to local:City.] Simple mappings using OWL / RDFS statements (x rdfs:subClassOf y). Complex mappings with SPARQL expressivity. Built-in transformation function library (XPath).</p>
<p>20. LDIF PIPELINE. Step 3, Resolve identities. Sources use different identifiers for the same entity. [Figure: Silk compares "Union Square" with "Union Sq., New York", "Union Sq., Seattle" and "Union Sq., San Francisco" by label and geographic coordinates, and links it to "Union Sq., San Francisco".] Automated link creation based on Link Specifications. Supports various comparators and transformations (string similarity, basic arithmetics, time, geographical distance).</p>
<p>21. LDIF PIPELINE. Step 4, Cleanse data. Sources provide different values for the same property. [Figure: one source states that the San Francisco population is 0.7M, another that it is 0.8M; Sieve outputs 0.8M.] 1. Quality Assessment: assign quality scores to Named Graphs (by time, by source preference, thresholds). 2. Data Fusion: resolve conflicting property values (according to quality scores, frequency, averages).</p>
<p>22. LDIF PIPELINE. Step 5, Output. Output options: N-Quads; N-Triples; SPARQL Update Stream. Provenance tracking using Named Graphs.</p>
<p>23. LDIF ARCHITECTURE. [Figure: layered architecture. Application Layer: application code accessing the data via SPARQL or an RDF API. Data Access, Integration and Storage Layer: LDIF, consisting of a Web Data Access Module, a Data Translation Module, an Identity Resolution Module and a Data Quality and Fusion Module, which produce Integrated Web Data. Web of Data: accessed via HTTP. Publication Layer: RDFa, LD Wrappers and RDF/XML on top of Database A, Database B and a CMS.]</p>
<p>24. VERSIONS. In-memory: fast, but scalability limited by local RAM. RDF Store (TDB): stores intermediate results in a Jena TDB RDF store; can process more data than In-memory but doesn't scale. Cluster (Hadoop): scales by parallelizing work across multiple machines using Hadoop; can process a virtually unlimited amount of data; ready for Amazon Elastic MapReduce.</p>
<p>25. BENCHMARKS: KEGG GENES VS. UNIPROT (CLUSTER). [Chart: results for 300M triples and 3.6B triples.]</p>
<p>26. Q&amp;A</p>
<p>27. THANKS! Early adopters wanted! Website: Google Group: Supported in part by Vulcan Inc. as part of its Project Halo, and by the EU FP7 project LOD2 - Creating Knowledge out of Interlinked Data (Grant No. 257943). Slide credits: Andrea Matteini, Robert Isele, Andreas Schultz </p>
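The cleansing step described on slides 16 and 21 (quality assessment of Named Graphs, then data fusion) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Sieve's implementation or API: the graph names, the update dates, the linear recency score, and the helpers `recency_score` and `fuse` are all hypothetical, and quads are plain tuples rather than parsed N-Quads. Only the two population values come from the slide.

```python
from collections import defaultdict
from datetime import date

# Hypothetical quads: (subject, predicate, value, named graph).
# The two population values mirror the 0.7M / 0.8M example on slide 21.
QUADS = [
    ("ex:San_Francisco", "ex:population", 700_000, "graph:sourceA"),
    ("ex:San_Francisco", "ex:population", 800_000, "graph:sourceB"),
]

# Hypothetical provenance: when each named graph was last updated.
LAST_UPDATED = {
    "graph:sourceA": date(2010, 1, 1),
    "graph:sourceB": date(2012, 1, 1),
}


def recency_score(updated, today, max_age_days=3650):
    """Quality assessment by time: 1.0 for a graph updated today, falling
    linearly to 0.0 at max_age_days (an assumed scoring function)."""
    age_days = (today - updated).days
    return max(0.0, 1.0 - age_days / max_age_days)


def fuse(quads, scores):
    """Data fusion: for each (subject, predicate) pair, keep the value
    from the named graph with the highest quality score."""
    candidates = defaultdict(list)
    for subject, predicate, value, graph in quads:
        candidates[(subject, predicate)].append((scores[graph], value))
    return {
        key: max(pairs, key=lambda pair: pair[0])[1]
        for key, pairs in candidates.items()
    }


TODAY = date(2012, 6, 5)
SCORES = {graph: recency_score(d, TODAY) for graph, d in LAST_UPDATED.items()}
FUSED = fuse(QUADS, SCORES)
print(FUSED)
```

With these assumed dates the more recently updated graph scores higher, so the fused population is 800,000, in line with the 0.8M outcome shown on the slide. The other fusion policies the slide names (frequency, averages) would replace the `max` selection with a vote or a mean.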