Exposing caGrid Data Services as Linked Data

Post on 16-Jan-2016

TRANSCRIPT

  • Exposing caGrid Data Services as Linked Data

    Joshua Phillips, Alejandra Gonzalez-Beltran, Jyoti Pathak

    October 22, 2009

  • Basic Premise

    It is both useful and practical to expose caBIG data sets as Linked Data.

  • What is Linked Data?

    Linked Data: a set of principles and best practices for publishing data on the Web, aligned with WWW architecture. Web of Data vs. Web of Documents.
    Semantic Web: a Web of machine-interpretable data.

  • caBIG+Linked Data is Useful

    Immediately accessible through the Web
    Interlinking with other data sets creates additional value
    Enables powerful technologies: linked data browsers, semantic search engines, logic-based reasoners
    Discovery of new patterns in data

  • caBIG+Linked Data is Practical

    Aligns with caBIG goals: an open, federated data sharing network; precise semantic definitions to enable interoperability
    Semantic infrastructure reuse: reuse existing processes and tools; minimize barriers (cost) to data providers
    Linked Data is gaining momentum: network effect -> increased value, stable technology

  • Outline

    Background
    Linked Data & caBIG
    Preliminary Work
    Discussion

  • Linked Data Background

    Title of a design note by Tim Berners-Lee
    Best practices for publishing data on the Web
    Web of data using HTTP, URIs, and typed (RDF) links (vs. hypertext links)
    Figure: Linked Data on the Web using HTTP, URIs, and typed RDF links.

  • Linked Data Principles

    Use URIs as names for things.
    Use HTTP URIs so that people can look up those names.
    When someone looks up a URI, provide useful RDF information.
    Include RDF statements that link to other URIs so that they can discover related things.
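The four principles can be illustrated with a toy in-memory "web of data". This is only a sketch: the dictionary stands in for HTTP lookups, and every URI and property name below is a made-up placeholder rather than a real caBIG identifier.

```python
# Toy "web of data": looking up a URI returns the triples published about it.
# Every URI and property name here is a hypothetical placeholder.
EXPERIMENT = "http://example.org/caarray/experiment/1"
SPECIMEN = "http://example.org/catissue/specimen/42"
PROTOCOL = "http://example.org/coppa/protocol/7"
VOCAB = "http://example.org/vocab#"

WEB_OF_DATA = {
    EXPERIMENT: [(EXPERIMENT, VOCAB + "usesSpecimen", SPECIMEN)],
    SPECIMEN: [(SPECIMEN, VOCAB + "collectedUnder", PROTOCOL)],
    PROTOCOL: [(PROTOCOL, VOCAB + "title", "Example protocol")],
}


def look_up(uri):
    """Stand-in for an HTTP GET that returns RDF about the named thing."""
    return WEB_OF_DATA.get(uri, [])


def is_http_uri(value):
    """Only HTTP(S) URIs can be looked up; plain literals cannot."""
    return value.startswith(("http://", "https://"))


def follow_links(start):
    """Visit every resource reachable from start by following object links."""
    visited, frontier = [], [start]
    while frontier:
        uri = frontier.pop(0)
        if uri in visited:
            continue
        visited.append(uri)
        for _subject, _predicate, obj in look_up(uri):
            if is_http_uri(obj) and obj not in visited:
                frontier.append(obj)
    return visited
```

Calling `follow_links(EXPERIMENT)` walks from the experiment to the specimen to the protocol, which mirrors the kind of cross-dataset navigation the principles are meant to enable.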

  • Example: Navigating caBIG Web

    caArray -> caTissue -> COPPA

    https://array.nci.nih.gov/caarray/project/abate-00279

  • More about Linked Data

    De-referencing URIs: non-information vs. information resources; hash URIs vs. slash URIs; 303 redirects & content negotiation
    Vocabulary of Interlinked Datasets (voiD): What is the subject of the dataset? Where is the SPARQL endpoint? What are its relationships with other datasets?
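The 303-redirect pattern for slash URIs can be sketched as a pure function. The `/resource/`, `/data/`, and `/page/` path layout is an assumption chosen for illustration, and the Accept handling is deliberately simplified (no q-values or wildcards), so treat this as a sketch rather than a complete content-negotiation implementation.

```python
def respond(path, accept):
    """Return (status, headers) for a request under the 303-redirect pattern.

    Assumed layout (hypothetical):
      /resource/X  names a thing (non-information resource)
      /data/X      RDF document about X (information resource)
      /page/X      HTML document about X (information resource)
    """
    if path.startswith("/resource/"):
        local = path[len("/resource/"):]
        # A thing cannot be sent over HTTP, so answer 303 See Other and
        # point at a document describing it, chosen from the Accept header.
        wants_rdf = "application/rdf+xml" in accept
        target = ("/data/" if wants_rdf else "/page/") + local
        return 303, {"Location": target, "Vary": "Accept"}
    if path.startswith("/data/"):
        return 200, {"Content-Type": "application/rdf+xml"}
    if path.startswith("/page/"):
        return 200, {"Content-Type": "text/html"}
    return 404, {}
```

A Linked Data client asking for `application/rdf+xml` is redirected to the RDF document, while a browser asking for `text/html` is redirected to the human-readable page; both start from the same URI for the thing itself.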

  • Who is providing Linked Data now?

    Publish existing open license datasets as Linked Data on the Web
    Interlink things between different data sources
    Size of dataset: approx. 7.8 billion triples
    Number of links: approx. 143 million

  • Life Sciences Contributors

    HCLS Linked Open Drug Data: won the 2009 Triplification Challenge
    Bio2RDF: 40 biology-, gene-, and medical-related datasets (2.3 billion triples altogether)
    Many more: http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/DataSets

  • Many are getting involved

    These are examples of different organizations that are publishing their data as Linked Data.

  • caBIG+Linked Data: How?

    Need RDF vocabularies for describing caBIG data sets.
    Need a consistent approach to naming things with URIs.
    Need to RDFize caBIG datasets.
    Must minimize technology barriers to data providers.

  • RDF Vocabularies

    UML-OWL Generator [1]: OWL representations of information models; retains semantics of original model; includes NCIt concepts; provides schemata for data

    1. McCusker et al. Semantic web data warehousing for caGrid. BMC Bioinformatics 2009, 10(Suppl 10):S2

  • Naming caBIG resources with URIs

    caGrid Identifier Framework
    Addresses issues of change.
    Based on Persistent URLs (PURLs).
    Provides HTTP URI Naming Authority, Prefix Authority, and resolution scheme.
    Consistent with Linked Data principles.
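The idea behind PURL-style naming can be sketched as follows. This is not the caGrid Identifier Framework API; the naming authority URL, prefix, and function names are all hypothetical, and the in-memory table stands in for the HTTP redirect a real resolver would issue.

```python
from urllib.parse import quote

NAMING_AUTHORITY = "http://purl.example.org"  # hypothetical naming authority
PREFIX = "catissue"                           # hypothetical prefix

# Resolution table kept by the naming authority: identifier -> current location.
_locations = {}


def mint(local_id):
    """Build a stable HTTP URI from authority, prefix, and local id."""
    return f"{NAMING_AUTHORITY}/{PREFIX}/{quote(str(local_id), safe='')}"


def register(identifier, current_url):
    """Record (or update) where the identified resource currently lives."""
    _locations[identifier] = current_url


def resolve(identifier):
    """Return the current location, as a resolver would via an HTTP redirect."""
    return _locations[identifier]


# The identifier stays fixed even when the data moves to another server.
specimen_id = mint("specimen 42")
register(specimen_id, "http://old-host.example.org/specimen/42")
register(specimen_id, "http://new-host.example.org/specimen/42")
```

The design point is the indirection: links held by other data sets keep using the minted URI, and only the resolution table changes when a service is relocated.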

  • RDFizing caBIG Data

    Alternative approaches: static RDF files; native triplestore; adapter over API (e.g. Data Service API); adapter over relational DB

  • Recommended Approach

    Adapter over relational DB
    Factors: caCORE SDK generated services use a relational database backend; potentially large data sets; potentially high frequency of change.

  • Initial Process

    UML-OWL Generator -> OWL information model
    D2R Service -> generate RDF-relational mapping
    Use caCORE SDK XMI to modify the D2R mapping to use the OWL classes representing the information model.
    D2R Server to expose Linked Data & SPARQL HTTP interfaces

  • Preliminary Work

    Exposing the caTissue Suite 1.1 data set as Linked Data.
    Goal: to validate high-level use cases with an important, real caBIG data set.

  • caTissue Suite 1.1 UML

  • caTissue Suite 1.1 OWL

  • D2R Mapping

    Maps relational model to RDF classes
    Mapping expressed in RDF (N3 syntax)
    ClassMap: maps tables to RDFS or OWL classes
    PropertyBridge: maps columns to simple RDF properties (i.e. OWL DatatypeProperty); maps FKs to RDF links (OWL ObjectProperty)
    Both tables and columns can be mapped to existing RDFS or OWL vocabularies.

  • D2R Mapping

    @prefix vocab: ... .

    map:catissue_tissue_specimen a d2rq:ClassMap;
        d2rq:dataStorage map:database;
        d2rq:uriPattern ...
        d2rq:class vocab:TissueSpecimen;
        ...

    map:catissue_specimen_event_param_SPECIMEN_ID a d2rq:PropertyBridge;
        d2rq:belongsToClassMap map:catissue_specimen_event_param;
        d2rq:property vocab:SpecimenEventParameters_specimen_AbstractSpecimen;
        d2rq:refersToClassMap map:catissue_abstract_specimen;
        d2rq:join "catissue_specimen_event_param.SPECIMEN_ID => catissue_abstract_specimen.IDENTIFIER";
        ...
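To make the ClassMap and PropertyBridge semantics concrete, the sketch below applies a mapping of this shape to a few in-memory rows. It is not D2R Server code. The URI patterns, the vocabulary namespace, the class names, and the `IDENTIFIER` column on the event table are assumptions, since the slide elides them; only the `@@table.column@@` placeholder style and the join expression follow the D2RQ mapping shown above.

```python
import re

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
VOCAB = "http://example.org/vocab#"  # hypothetical namespace

# Hypothetical ClassMaps: one URI pattern and one class per table.
CLASS_MAPS = {
    "catissue_abstract_specimen": {
        "uri_pattern": "specimen/@@catissue_abstract_specimen.IDENTIFIER@@",
        "class": VOCAB + "AbstractSpecimen",
    },
    "catissue_specimen_event_param": {
        "uri_pattern": "event/@@catissue_specimen_event_param.IDENTIFIER@@",
        "class": VOCAB + "SpecimenEventParameters",
    },
}

# PropertyBridge mirroring the slide: a foreign key becomes an RDF link.
BRIDGE = {
    "belongs_to": "catissue_specimen_event_param",
    "refers_to": "catissue_abstract_specimen",
    "property": VOCAB + "SpecimenEventParameters_specimen_AbstractSpecimen",
    "join": "catissue_specimen_event_param.SPECIMEN_ID => "
            "catissue_abstract_specimen.IDENTIFIER",
}


def expand_uri_pattern(pattern, row):
    """Replace each @@table.column@@ placeholder with the row's value."""
    return re.sub(r"@@([^@]+)@@", lambda m: str(row[m.group(1)]), pattern)


def class_triples(table, rows):
    """Emit one rdf:type triple per row of a mapped table."""
    cmap = CLASS_MAPS[table]
    return [(expand_uri_pattern(cmap["uri_pattern"], r), RDF_TYPE, cmap["class"])
            for r in rows]


def bridge_triples(bridge, src_rows, dst_rows):
    """Emit one link per (source, target) row pair satisfying the join."""
    fk_col, pk_col = (side.strip() for side in bridge["join"].split("=>"))
    src = CLASS_MAPS[bridge["belongs_to"]]["uri_pattern"]
    dst = CLASS_MAPS[bridge["refers_to"]]["uri_pattern"]
    return [(expand_uri_pattern(src, s), bridge["property"], expand_uri_pattern(dst, d))
            for s in src_rows for d in dst_rows if s[fk_col] == d[pk_col]]


# Sample rows keyed by "table.column" (made-up data).
specimens = [{"catissue_abstract_specimen.IDENTIFIER": 42},
             {"catissue_abstract_specimen.IDENTIFIER": 43}]
events = [{"catissue_specimen_event_param.IDENTIFIER": 7,
           "catissue_specimen_event_param.SPECIMEN_ID": 42}]
```

With these rows, `bridge_triples(BRIDGE, events, specimens)` yields a single link from `event/7` to `specimen/42`. A real adapter would translate the mapping into SQL rather than scan rows in memory.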

  • Linked Data & SPARQL Interface

    Linked Data navigation
    OWL class defined in information model
    SPARQL query interface

  • Linked Data & SPARQL Interface

    Mapped role defined in OWL information model
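A query against the SPARQL HTTP interface could be prepared along these lines. The endpoint address and the `vocab:` namespace are assumptions to adjust for a real deployment; the property name is taken from the mapping slide. The sketch only builds the request URL (the SPARQL protocol passes the query text in a `query` parameter) and does not contact a server.

```python
from urllib.parse import urlencode

# Assumed endpoint address; replace with the address of the actual server.
ENDPOINT = "http://localhost:2020/sparql"

# The vocab: namespace is a hypothetical placeholder.
QUERY = """
PREFIX vocab: <http://example.org/vocab#>
SELECT ?event ?specimen WHERE {
  ?event vocab:SpecimenEventParameters_specimen_AbstractSpecimen ?specimen .
}
LIMIT 10
""".strip()


def sparql_get_url(endpoint, query):
    """Build a SPARQL protocol GET URL with the query URL-encoded."""
    return endpoint + "?" + urlencode({"query": query})


url = sparql_get_url(ENDPOINT, QUERY)
```

The resulting `url` can be fetched with any HTTP client, typically with an `Accept` header such as `application/sparql-results+json` to select the result format.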

  • Prototype Evaluation

    Very preliminary work
    Relatively straightforward to map tables and columns to OWL classes and properties.
    Similar functionality to caCORE GetXML or GetHTML HTTP APIs.
    Challenges: interpreting alternative O-R mapping strategies for OO inheritance; D2R Server performance

  • Related Work in caBIG

    Prostate Cancer Information System (PCIS)
    A prototype system (Fox Chase Cancer Center)
    Developed Prostate Cancer Ontology (PCO)
    Applies PCO to integrate two database systems: Tumor Registry and Prostate Cancer Database
    A web-based ontology query formulation

    Hua Min, Frank J. Manion, Elizabeth Goralczyk, Yu-Ning Wong, Eric Ross, J. Robert Beck. Integration of Prostate Cancer Clinical Data Using an Ontology. JBI 2009.

  • PCIS Ontology Mapping

  • Other RDFizer Tools

    W3C RDB2RDF Incubator Group: surveyed all existing approaches; developed comparison framework; recommends using Rule Interchange Format (RIF); no existing tools support RIF

    http://www.w3.org/2005/Incubator/rdb2rdf/

  • Performance Analyses

    Benchmarking tests exist: Lehigh University Benchmark (LUBM); University Ontology Benchmark (UOBM); Berlin SPARQL Benchmark (BSBM)

    http://esw.w3.org/topic/RdfStoreBenchmarking

  • Berlin SPARQL Benchmark

    Relational still faster than RDF & SPARQL

    http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/index.html

  • Discussion

    Limitations of mappings
    Link discovery and maintenance
    Performance enhancements/optimizations are needed
    RDB2RDF tools show promise
    Need to exploit research experience in RDBMSs, esp. in query translation

  • References

    Heath et al. How to Publish Linked Data on the Web. 8th Intl. Semantic Web Conference, 2008. Tutorial.
    Berners-Lee. Linked Data. http://www.w3.org/DesignIssues/LinkedData.html
    W3C RDB2RDF Incubator Group. A Survey of Current Approaches for Mapping of Relational Databases to RDF. Incubator Group Report. http://www.w3.org/2005/Incubator/rdb2rdf/
    Alasdair J. G. Gray et al. Can RDB2RDF Tools Feasibly Expose Large Science Archives for Data Integration? In ESWC 2009, volume 5554 of LNCS, pp. 491-505. Springer.
    Bizer et al. Linked Data - The Story So Far.

    Linked Data is the title of a W3C design note published by Tim Berners-Lee in 2006, in which he outlined best practices for effectively publishing data on the Web using HTTP, URIs, and typed RDF links. These ideas have since been elaborated by several groups. The ultimate goal is to enable data sets to be combined and reused in unexpected ways.

    https://array.nci.nih.gov/caarray/project/abate-00279
