  • Exposing caGrid Data Services as Linked Data

    Joshua Phillips, Alejandra Gonzalez-Beltran, Jyoti Pathak

    October 22, 2009

  • Basic Premise

    It is both useful and practical to expose caBIG data sets as Linked Data.

  • What is Linked Data?

    Linked Data: a set of principles and best practices for publishing data on the Web.
      Aligned with WWW architecture
      Web of Data vs. Web of Documents
    Semantic Web: a Web of machine-interpretable data

  • caBIG+Linked Data is Useful

    Immediately accessible through the Web
    Interlinking with other data sets creates additional value
    Enables powerful technologies
      linked data browsers
      semantic search engines
      logic-based reasoners
    Discovery of new patterns in data

  • caBIG+Linked Data is Practical

    Aligns with caBIG goals
      Open, federated data sharing network
      Precise semantic definitions to enable interoperability
    Semantic Infrastructure Reuse
    Reuse existing processes and tools
    Minimize barriers (cost) to data providers
    Linked Data is gaining momentum
      Network effect -> increased value, stable technology

  • Outline

    Background
    Linked Data & caBIG
    Preliminary Work
    Discussion

  • Linked Data Background

    Title of a design note by Tim Berners-Lee
    Best practices for publishing data on the Web
    Web of data using HTTP, URIs, and typed (RDF) links (vs. hypertext links)

  • Linked Data Principles

    1. Use URIs as names for things.
    2. Use HTTP URIs so that people can look up those names.
    3. When someone looks up a URI, provide useful RDF information.
    4. Include RDF statements that link to other URIs so that they can discover related things.
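
The four principles can be illustrated with a small in-memory sketch. It is not a real client: a dictionary stands in for HTTP lookups, and every URI and property name below is hypothetical (none is an actual caBIG identifier). The data loosely mirrors a caArray -> caTissue -> COPPA chain.

```python
# Hypothetical URIs naming three things (principles 1 and 2).
SAMPLE = "http://example.org/caarray/sample/1"
SPECIMEN = "http://example.org/catissue/specimen/42"
PERSON = "http://example.org/coppa/person/7"

# A dictionary stands in for the Web: URI -> RDF statements served for it.
WEB_OF_DATA = {
    SAMPLE: [(SAMPLE, "http://example.org/vocab/derivedFrom", SPECIMEN)],
    SPECIMEN: [(SPECIMEN, "http://example.org/vocab/collectedBy", PERSON)],
    PERSON: [(PERSON, "http://example.org/vocab/name", "Jane Doe")],  # literal
}


def look_up(uri):
    """Principle 3: looking up a URI yields RDF statements about it."""
    return WEB_OF_DATA.get(uri, [])


def follow_links(start, max_hops=10):
    """Principle 4: follow object URIs to discover related things."""
    path = [start]
    current = start
    for _hop in range(max_hops):
        # Keep only objects that are HTTP URIs we have not visited yet.
        next_uris = [
            obj
            for (_subj, _pred, obj) in look_up(current)
            if obj.startswith("http://") and obj not in path
        ]
        if not next_uris:
            break
        current = next_uris[0]
        path.append(current)
    return path


if __name__ == "__main__":
    for uri in follow_links(SAMPLE):
        print(uri)
```

A real Linked Data client would replace `look_up` with an HTTP GET that asks for an RDF representation.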

  • Example: Navigating caBIG Web

    caArray -> caTissue -> COPPA


  • More about Linked Data

    De-referencing URIs
      Non-information vs. information resources
      Hash URIs vs. slash URIs
      303 redirects & content negotiation
    Vocabulary of Interlinked Datasets (voiD)
      What is the subject of the dataset?
      Where is the SPARQL endpoint?
      Relationships with other datasets
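
The difference between hash URIs and slash URIs with 303 redirects can be sketched as a simulation of the HTTP exchange. No network traffic is involved. The `/resource/`, `/data/`, and `/page/` path layout is an assumption made for illustration, and the Accept-header check is deliberately crude.

```python
from urllib.parse import urldefrag


def dereference(uri, accept="application/rdf+xml"):
    """Return the simulated (method, url, status) steps for looking up `uri`."""
    document, fragment = urldefrag(uri)
    if fragment:
        # Hash URI: the client strips the fragment before the request, so the
        # server simply returns the document that describes the thing.
        return [("GET", document, 200)]
    if "/resource/" not in uri:
        # Already an information resource (a document): served directly.
        return [("GET", uri, 200)]
    # Slash URI naming a non-information resource: answer 303 See Other and
    # pick the target document by content negotiation (assumed path layout).
    wants_rdf = "rdf" in accept or "turtle" in accept
    target = uri.replace("/resource/", "/data/" if wants_rdf else "/page/", 1)
    return [("GET", uri, 303), ("GET", target, 200)]


if __name__ == "__main__":
    print(dereference("http://example.org/vocab#TissueSpecimen"))
    print(dereference("http://example.org/resource/specimen/42"))
    print(dereference("http://example.org/resource/specimen/42", accept="text/html"))
```

The hash form costs one request but always returns the whole document. The slash form costs two requests but lets the server describe each thing separately.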

  • Who is providing Linked Data now?

    Publish existing open license datasets as Linked Data on the Web

    Interlink things between different data sources

    Size of dataset: approx. 7.8 billion triples

    Number of links: approx. 143 million

  • Life Sciences Contributors

    HCLS Linked Open Drug Data
      Won 2009 Triplification Challenge
    Bio2RDF
      40 biology-, gene- and medical-related datasets (altogether 2.3 billion triples)
    Many more


  • Many are getting involved

    These are examples of different organizations that are publishing their data as Linked Data.

  • caBIG+Linked Data: How?

    Need RDF vocabularies for describing caBIG data sets.

    Need consistent approach to naming things with URIs.

    Need to RDFize caBIG datasets.

    Must minimize technology barriers to data providers.


  • RDF Vocabularies

    UML-OWL Generator [1]
      OWL representations of information models
      Retains semantics of original model
      Includes NCIt concepts
      Provides schemata for data

    1. McCusker et al. Semantic web data warehousing for caGrid. BMC Bioinformatics 2009, 10(Suppl 10):S2

  • Naming caBIG resources with URIs

    caGrid Identifier Framework
      Addresses issues of change.
      Based on Persistent URLs (PURLs)
      Provides HTTP URI Naming Authority, Prefix Authority, and resolution scheme.
      Consistent with Linked Data principles.
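
The idea behind PURL-style naming is that the identifier stays fixed while the location it resolves to may change. The sketch below shows that idea only. It is not the caGrid Identifier Framework API: the class, its methods, and all URLs are hypothetical.

```python
class NamingAuthority:
    """Toy naming authority: mints stable HTTP identifiers and resolves them."""

    def __init__(self, base_uri):
        self.base_uri = base_uri.rstrip("/")
        self._targets = {}  # identifier -> current location of the data

    def mint(self, prefix, local_id, target_url):
        """Create a persistent identifier under a prefix and record its target."""
        identifier = f"{self.base_uri}/{prefix}/{local_id}"
        self._targets[identifier] = target_url
        return identifier

    def move(self, identifier, new_target_url):
        """The data moved: update the target, keep the identifier unchanged."""
        if identifier not in self._targets:
            raise KeyError(identifier)
        self._targets[identifier] = new_target_url

    def resolve(self, identifier):
        """Return the current location (an HTTP server would redirect here)."""
        return self._targets[identifier]


if __name__ == "__main__":
    authority = NamingAuthority("http://purl.example.org/")
    specimen = authority.mint("catissue", "specimen-42", "http://host-a.example.org/s/42")
    authority.move(specimen, "http://host-b.example.org/s/42")
    print(specimen, "->", authority.resolve(specimen))
```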

  • RDFizing caBIG Data

    Alternative approaches
      Static RDF files
      Native triplestore
      Adapter over API (e.g. Data Service API)
      Adapter over relational DB

  • Recommended Approach

    Adapter over relational DB
    Factors:
      caCORE SDK generated services use relational database backend.
      Potentially large data sets.
      Potential high frequency of change.

  • Initial Process

    1. UML-OWL Generator -> OWL info. model

    2. D2R Service -> Generate RDF-relational mapping

    3. Use caCORE SDK XMI to modify D2R mapping to use OWL classes representing information model.

    4. D2R Server to expose Linked Data & SPARQL HTTP interfaces
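
Step 3 might look roughly like the sketch below. It assumes the generated mapping names each class after its table (`vocab:<table>`) and that a table-to-class lookup has already been extracted from the caCORE SDK XMI. Here a plain dictionary stands in for that lookup, and the function name is hypothetical.

```python
import re

# Matches lines such as:  d2rq:class vocab:catissue_tissue_specimen;
CLASS_LINE = re.compile(r"(d2rq:class\s+)vocab:(\w+)(\s*;)")


def use_owl_classes(mapping_text, table_to_class):
    """Replace table-named classes with OWL class names from the lookup."""

    def swap(match):
        owl_class = table_to_class.get(match.group(2))
        if owl_class is None:
            return match.group(0)  # no entry for this table: leave untouched
        return f"{match.group(1)}vocab:{owl_class}{match.group(3)}"

    return CLASS_LINE.sub(swap, mapping_text)


if __name__ == "__main__":
    generated = (
        "map:catissue_tissue_specimen a d2rq:ClassMap;\n"
        "    d2rq:dataStorage map:database;\n"
        "    d2rq:class vocab:catissue_tissue_specimen;\n"
    )
    lookup = {"catissue_tissue_specimen": "TissueSpecimen"}  # stand-in for XMI data
    print(use_owl_classes(generated, lookup))
```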

  • Preliminary Work

    Exposing caTissueSuite 1.1 data set as Linked Data.

    Goal: To validate high-level use cases with important, real caBIG data set.

  • caTissue Suite 1.1 UML

  • caTissue Suite 1.1 OWL

  • D2R Mapping

    Maps relational model to RDF classes
    Mapping expressed in RDF (N3 syntax)
    ClassMap
      Maps tables to RDFS or OWL classes
    PropertyBridge
      Maps columns to simple RDF properties (i.e. OWL DatatypeProperty)
      Maps FKs to RDF links (OWL ObjectProperty)
    Both tables and columns can be mapped to existing RDFS or OWL vocabularies.

  • D2R Mapping

    @prefix vocab: .


    map:catissue_tissue_specimen a d2rq:ClassMap;
        d2rq:dataStorage map:database;
        d2rq:uriPattern ...
        d2rq:class vocab:TissueSpecimen;
        .

    map:catissue_specimen_event_param_SPECIMEN_ID a d2rq:PropertyBridge;
        d2rq:belongsToClassMap map:catissue_specimen_event_param;
        d2rq:property vocab:SpecimenEventParameters_specimen_AbstractSpecimen;
        d2rq:refersToClassMap map:catissue_abstract_specimen;
        d2rq:join "catissue_specimen_event_param.SPECIMEN_ID => catissue_abstract_specimen.IDENTIFIER";
        .
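
To make the effect of such a mapping concrete, the sketch below turns rows from an in-memory SQLite database into triples by hand. It is not how D2R Server is implemented. The table schemas are heavily simplified, and the base URI, vocabulary namespace, and URI pattern are assumptions, since the source elides them.

```python
import sqlite3

BASE = "http://example.org/catissue/"  # hypothetical base URI
VOCAB = "http://example.org/vocab#"    # hypothetical vocabulary namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def uri(table, identifier):
    """Assumed URI pattern: one resource per row, named by table and key."""
    return f"{BASE}{table}/{identifier}"


def class_map_triples(conn):
    """ClassMap: each tissue specimen row becomes a typed resource."""
    rows = conn.execute(
        "SELECT IDENTIFIER FROM catissue_tissue_specimen ORDER BY IDENTIFIER")
    return [(uri("catissue_tissue_specimen", row[0]), RDF_TYPE,
             VOCAB + "TissueSpecimen") for row in rows]


def property_bridge_triples(conn):
    """PropertyBridge with a join: a foreign key becomes an RDF link."""
    rows = conn.execute(
        "SELECT e.IDENTIFIER, s.IDENTIFIER "
        "FROM catissue_specimen_event_param AS e "
        "JOIN catissue_abstract_specimen AS s ON e.SPECIMEN_ID = s.IDENTIFIER "
        "ORDER BY e.IDENTIFIER")
    prop = VOCAB + "SpecimenEventParameters_specimen_AbstractSpecimen"
    return [(uri("catissue_specimen_event_param", event), prop,
             uri("catissue_abstract_specimen", specimen))
            for event, specimen in rows]


def demo_database():
    """Build a tiny database with made-up rows for the three tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE catissue_abstract_specimen (IDENTIFIER INTEGER PRIMARY KEY);
        CREATE TABLE catissue_tissue_specimen (IDENTIFIER INTEGER PRIMARY KEY);
        CREATE TABLE catissue_specimen_event_param (
            IDENTIFIER INTEGER PRIMARY KEY, SPECIMEN_ID INTEGER);
        INSERT INTO catissue_abstract_specimen VALUES (42);
        INSERT INTO catissue_tissue_specimen VALUES (42);
        INSERT INTO catissue_specimen_event_param VALUES (7, 42);
    """)
    return conn


if __name__ == "__main__":
    db = demo_database()
    for triple in class_map_triples(db) + property_bridge_triples(db):
        print(triple)
```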


  • Linked Data & SPARQL Interface

    Linked Data navigation

    OWL class defined in Information Model

    SPARQL Query Interface

  • Linked Data & SPARQL Interface

    Mapped role defined in OWL Information Model
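
A client could query such a SPARQL HTTP interface by sending the query text as the `query` parameter of a GET request. The sketch below only builds the query and the request URL and sends nothing. The endpoint address and the class URI are assumptions, not values taken from the prototype.

```python
from urllib.parse import urlencode

ENDPOINT = "http://localhost:2020/sparql"  # assumed endpoint address


def select_instances_query(class_uri, limit=10):
    """Build a SPARQL query listing instances of one OWL class."""
    return f"SELECT ?s WHERE {{ ?s a <{class_uri}> }} LIMIT {int(limit)}"


def sparql_get_url(endpoint, query):
    """Encode the query as the `query` parameter of a GET request URL."""
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    query = select_instances_query("http://example.org/vocab#TissueSpecimen")
    print(query)
    print(sparql_get_url(ENDPOINT, query))
```

Fetching the resulting URL with any HTTP client, with an Accept header for the desired result format, would complete the request.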

  • Prototype Evaluation

    Very preliminary work
    Relatively straightforward to map tables and columns to OWL classes and properties.
    Similar functionality to caCORE GetXML or GetHTML HTTP APIs.
    Challenges
      Interpreting alternative O-R mapping strategies for OO inheritance.
      D2R Server performance

  • Related Work in caBIG

    Prostate Cancer Information System (PCIS)
      A prototype system (Fox Chase Cancer Center)
      Developed Prostate Cancer Ontology (PCO)
      Apply PCO to integrate two database systems
        Tumor Registry
        Prostate Cancer Database
      A web-based ontology query formulation

    Hua Min, Frank J. Manion, Elizabeth Goralczyk, Yu-Ning Wong, Eric Ross, J. Robert Beck. Integration of Prostate Cancer Clinical Data Using an Ontology (JBI 2009)

  • PCIS Ontology Mapping

  • Other RDFizer Tools

    W3C RDB2RDF Incubator Group
      Surveyed all existing approaches
      Developed comparison framework
      Recommends using Rule Interchange Format (RIF)
      No existing tools support RIF


  • Performance Analyses

    Benchmarking tests exist
      Lehigh University Benchmark (LUBM)
      University Ontology Benchmark (UOBM)
      Berlin SPARQL Benchmark (BSBM)


  • Berlin SPARQL Benchmark

    Relational still faster than RDF & SPARQL


  • Discussion

    Limitations of mappings
    Link discovery and maintenance
    Performance
      Enhancements/optimizations are needed
      RDB2RDF tools show promise
      Need to exploit research experience in RDBMSs, esp. in query translations

  • References

    Heath et al. How to Publish Linked Data on the Web. 8th Intl. Semantic Web Conference, 2008. Tutorial.

    Berners-Lee. Linked Data. http://www.w3.org/DesignIssues/LinkedData.html

    W3C RDB2RDF Incubator Group. A Survey of Current Approaches for Mapping of Relational Databases to RDF. Report. http://www.w3.org/2005/Incubator/rdb2rdf/

    Alasdair J. G. Gray et al. Can RDB2RDF Tools Feasibily Expose Large Science Archives for Data Integration? In ESWC 2009, volume 5554 of LNCS, pp 491-505. Springer.

    Bizer et al. Linked Data - The Story So Far.

