How to Integrate Linked Data into Your Application

Post on 08-May-2015

Category:

Technology

DESCRIPTION

Slides presented by Christian Becker at the Semantic Technology & Business Conference, San Francisco, June 2012. More details at: http://ldif.wbsg.de

TRANSCRIPT

  • 1. SEMANTIC TECHNOLOGY & BUSINESS CONFERENCE | SAN FRANCISCO, JUNE 5, 2012. HOW TO INTEGRATE LINKED DATA INTO YOUR APPLICATION. LDIF Team: Andreas Schultz, Freie Universität Berlin; Andrea Matteini, mes|semantics; Robert Isele, Freie Universität Berlin; Pablo N. Mendes, Freie Universität Berlin; Christian Becker, mes|semantics; Christian Bizer, Freie Universität Berlin. With contributions by: Hannes Mühleisen, Freie Universität Berlin; William Smith, Vulcan Inc.

  • 2. | WHAT IS LINKED DATA?
      • Raw data (RDF)
      • Accessible on the web
      • Data can link to other data sources
      • [Diagram: things in data sources A to E, connected by data links]
      • Benefits: ease of access and re-use; enables discovery
      • One API for all data sources?

  • 3. | LINKING OPEN DATA CLOUD
      • [Diagram: the Linking Open Data cloud; legend categories include Media, Geographic, Publications, User-generated content, Government, Cross-domain, and Life sciences]
      • http://lod-cloud.net (as of September 2011)

  • 4. | TYPES OF LINKED DATA ... AND WHAT YOU CAN DO WITH THEM
      • Types (one labelled "VERY SOON?"): Open, Public Data (LOD Cloud); Linked Enterprise Data; Commercial Linked Data
      • Provide interfaces on top of them
      • Augment your website
      • Integrate them into your application logic
      • Create specialized data marts

  • 5. | AUGMENT YOUR WEBSITE: BBC
      • BBC online properties make intensive use of data from Wikipedia and MusicBrainz

  • 6. | DATA MARTS: NEUROWIKI
      • NeuroWiki creates views for genes, drugs and diseases data from four RDF data sources
      • Provides navigation and composition tools for accessing and mining the data

  • 7. | APPLICATION LOGIC: IBM WATSON
      • IBM Watson makes use of Linked Data sources such as DBpedia
      • Photo: http://www.flickr.com/photos/ibm_media/

  • 8. | 4 STEPS TO LINKED DATA INTEGRATION

  • 9. | STEP #1: ACCESS LINKED DATA
      • Linked Data is published via HTTP, SPARQL endpoints, RDF dumps
      • Architectures compared by access method (HTTP dereferencing, dump import, SPARQL) and decision factors (recency, speed / scalability, reliability, complexity):
          • On-the-fly dereferencing: access via HTTP dereferencing; recency high; speed / scalability low; reliability low; complexity high
          • Query federation: access via SPARQL; recency high; speed / scalability low; reliability decreases exponentially as new sources are added; complexity moderate with SPARQL 1.1 SERVICE clause
          • Crawling and caching: access via HTTP dereferencing, dump import, and SPARQL; recency depends; speed / scalability high; reliability high; complexity high
      • Adapted from: Linked Data: Evolving the Web into a Global Data Space (Heath/Bizer 2011)
      • Live access allows quick prototyping and limited production use
      • As data sets grow in size and more data sources are added, a crawling/caching architecture often becomes necessary

  • 10. | STEP #1: ACCESS LINKED DATA
      • Implementations:
          • On-the-fly dereferencing: LDspider, SQUIN, Semantic Web Client library
          • Query federation: SPARQL 1.1 SERVICE clause
          • Crawling and caching: triplestore import script; public caches (e.g. Sindice, OpenLink LOD endpoint); LDIF
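The trade-off in Step #1 between live access and a crawling/caching architecture can be illustrated with a minimal Python sketch. It assumes plain HTTP dereferencing with content negotiation. The names `build_request`, `fetch_live` and `CachingFetcher`, the `Accept` header value, and the in-memory dictionary cache are illustrative assumptions, not part of LDIF or of any tool named in the slides.

```python
from typing import Callable, Dict
from urllib.request import Request, urlopen

# Assumed preference order for RDF serializations (illustrative only).
RDF_ACCEPT = "text/turtle, application/rdf+xml;q=0.9"


def build_request(uri: str) -> Request:
    """Build an HTTP GET that asks for RDF via content negotiation."""
    return Request(uri, headers={"Accept": RDF_ACCEPT})


def fetch_live(uri: str) -> bytes:
    """On-the-fly dereferencing: one HTTP round trip per lookup."""
    with urlopen(build_request(uri), timeout=10) as response:
        return response.read()


class CachingFetcher:
    """Crawling and caching in miniature: fetch once, then serve local copies."""

    def __init__(self, fetch: Callable[[str], bytes] = fetch_live) -> None:
        self._fetch = fetch
        self._cache: Dict[str, bytes] = {}
        self.remote_calls = 0  # counts lookups that had to go to the source

    def get(self, uri: str) -> bytes:
        if uri not in self._cache:
            self.remote_calls += 1
            self._cache[uri] = self._fetch(uri)
        return self._cache[uri]
```

Because the fetch function is injected, the same cache could sit in front of a dump reader or a stub in tests. A production setup would also need expiry and persistent storage, which this sketch leaves out.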
  • 11. | STEP #2: NORMALIZE VOCABULARIES
      • Data sources that overlap in content use a wide range of vocabularies
      • [Tag cloud: most widely used vocabularies in the LOD cloud (08/10/2011), including foaf, dc, skos, geo, sioc, vcard, bibo, akt, and xhtml]
      • Over 60 % of all LOD sources use proprietary vocabularies
      • It's up to the data consumer to normalize the vocabularies
      • Enterprise: Need to translate between internal and external vocabularies
      • Source: FU Berlin / DERI; http://www4.wiwiss.fu-berlin.de/lodcloud/state/

  • 12. | STEP #2: NORMALIZE VOCABULARIES
      • Approaches to Schema Mapping:
      • Hand-crafting queries against individual sources (no different than an API):

            OPTIONAL { ?ow fb:location.location.containedby [ ot:preferredLabel ?city_fb_con ] } .
            OPTIONAL { ?ow dbp-prop:location ?loc .
                       ?loc rdf:type umbel-sc:City ;
                            ot:preferredLabel ?city_db_loc }
            OPTIONAL { ?ow dbp-ont:city [ ot:preferredLabel ?city_db_cit ] }

        Source: http://www.readwriteweb.com/archives/the_modigliani_test_for_linked_data.php
      • Ontology Representation Languages: OWL, RDFS
      • Rules: SWRL, RIF
      • Query Languages: SPARQL CONSTRUCT clause; TopQuadrant SPARQLMotion; Mosto; R2R (part of LDIF)

  • 13. | STEP #2: NORMALIZE VOCABULARIES
      • Using SPARQL:
      • Rename a class:

            CONSTRUCT { ?s a mo:MusicArtist }
            WHERE { ?s a dbpedia-owl:MusicalArtist }

      • Value transformation:

            CONSTRUCT { ?s movie:runtime ?runtimeInMinutes . }
            WHERE {
              ?s dbpedia-owl:runtime ?runtime .
              # dbpedia-owl:runtime is given in seconds
              BIND(?runtime / 60 AS ?runtimeInMinutes)
            }

      • Create URI from literal:

            CONSTRUCT {
              ?s diseasome:omim ?omimuri .
              ?omimuri dc:identifier ?identifier .
            }
            WHERE {
              ?s dbpedia-owl:omim ?omim .
              BIND(IRI(concat("http://bio2rdf.org/omim:", ?omim)) AS ?omimuri)
              BIND(concat("omim:", ?omim) AS ?identifier)
            }

      • Slide credits: Andreas Schultz

  • 14. | STEP #3: RESOLVE IDENTIFIERS. Data sources that overlap in content use different identifiers for the same real-world entity.
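To make the identifier problem concrete, here is a minimal Python sketch that collapses aliases connected by owl:sameAs links into one canonical URI per entity and rewrites the remaining triples accordingly. It is an illustration under stated assumptions, not how Silk, LIMES or LDIF work internally: triples are plain string tuples, the function names `build_canonical_map` and `rewrite` are hypothetical, and the lexicographically smallest URI in each cluster is arbitrarily chosen as the canonical one.

```python
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def build_canonical_map(triples: Iterable[Triple]) -> Dict[str, str]:
    """Cluster URIs linked by owl:sameAs (union-find) and map each to one root."""
    parent: Dict[str, str] = {}

    def find(uri: str) -> str:
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for s, p, o in triples:
        if p == SAME_AS:
            root_s, root_o = find(s), find(o)
            if root_s != root_o:
                # Assumption: the smaller URI string becomes the canonical one.
                keep, drop = sorted((root_s, root_o))
                parent[drop] = keep
    return {uri: find(uri) for uri in list(parent)}


def rewrite(triples: Iterable[Triple], canonical: Dict[str, str]) -> List[Triple]:
    """Replace aliases by their canonical URI and drop the sameAs links."""
    result: List[Triple] = []
    for s, p, o in triples:
        if p == SAME_AS:
            continue
        result.append((canonical.get(s, s), p, canonical.get(o, o)))
    return result
```

After rewriting, statements that were spread over several aliases share one subject, so they can be queried together. Where no sameAs links exist, the links themselves have to be generated first, which is the task of the identity resolution step.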
      • Number of linked data sets per source (08/10/2011): 1 linked data set: 98 sources; 2: 62; 3: 38; 4: 19; 5: 5; 6 - 10: 17; > 10: 27
      • Most LOD sources only provide owl:sameAs links to one other data source
      • It's up to the data consumer to generate additional links
      • Enterprise: Need to link both internal and external resources
      • Source: FU Berlin / DERI; http://www4.wiwiss.fu-berlin.de/lodcloud/state/

  • 15. | STEP #3: RESOLVE IDENTIFIERS
      • Approaches to Identity Resolution:
          • Improvised or manual merging
          • Rule-based approaches: SILK (part of LDIF); LIMES
      • [Diagram: records for Union Sq., New York; Union Sq., Seattle; and Union Sq., San Francisco, matched by name ("Union Sq." = "Union Square") and by geographic coordinates]

  • 16. | STEP #4: FILTER DATA
      • Data sources that overlap in content provide data that is conflicting and of varying quality
      • Data sources have...
          • ... different knowledge levels, views or intents
          • ... wrong, biased, inconsistent or outdated information
      • Approaches:
          • Import data into distinct Named Graphs; query them separately using the SPARQL GRAPH clause
          • Sieve (part of LDIF)

  • 17. | LDIF: LINKED DATA INTEGRATION FRAMEWORK
      • Integrates Linked Data from multiple sources into a clean, local target representation while keeping track of data provenance
          • 1 Collect data: Managed download and update
          • 2 Translate data into a single target vocabulary
          • 3 Resolve identifier aliases into local target URIs
          • 4 (NEW) Cleanse data, resolving conflicting values
          • 5 Output
      • Follows the Crawling and Caching Architecture Pattern
      • Open source (Apache License, Version 2.0)
      • Collaboration between Freie Universität Berlin and mes|semantics

  • 18. | LDIF PIPELINE
      • 1 Collect data; 2 Translate data
      • Supported data sources: RDF dumps (all common formats); SPARQL Endpoint