exposing cagrid data services as linked data
Post on 16-Jan-2016
37 views
Embed Size (px)
DESCRIPTION
Exposing caGrid Data Services as Linked Data. Joshua Phillips Alejandra Gonzalez-Beltran Jyoti Pathak October 22, 2009. Basic Premise. It is both useful and practical to expose caBIG data sets as Linked Data. What is Linked Data?. Linked Data - PowerPoint PPT PresentationTRANSCRIPT
Exposing caGrid Data Services as Linked Data
Joshua PhillipsAlejandra Gonzalez-BeltranJyoti Pathak
October 22, 2009
Basic PremiseIt is both useful and practical to expose caBIG data sets as Linked Data.
What is Linked Data?Linked DataSet of principles/best practices for publishing data on the Web.Aligned with WWW architectureWeb of Data vs. Web of DocumentsSemantic WebWeb of machine interpretable data
caBIG+Linked Data is UsefulImmediately accessible through WebInterlinking with other data sets creates additional valueEnables powerful technologieslinked data browserssemantic search engineslogic-based reasonersDiscovery of new patterns in data
caBIG+Linked Data is PracticalAligns with caBIG goalsOpen, federated data sharing networkPrecise semantic definitions to enable interoperabilitySemantic Infrastructure ReuseReuse existing processes and toolsMinimize barriers (cost) to data providersLinked Data is gaining momentumNetwork effect -> increased value, stable technology
OutlineBackgroundLinked Data & caBIGPreliminary WorkDiscussion
Linked Data BackgroundTitle of a design note by Tim Berners-LeeBest practices for publishing data on the Web.Web of data using HTTP, URIs, and typed (RDF) links (vs. hypertext links).This graphic illustrates linked data background on the web using http, URI's and typed RDF links.
Linked Data PrinciplesUse URIs as names for thingsUse HTTP URIs so that people can look up for those namesWhen someone looks up a URI, provide useful RDF informationInclude RDF statements that link to other URIs so that they can discover related things
Example: Navigating caBIG WebcaArray -> caTissue -> COPPA
https://array.nci.nih.gov/caarray/project/abate-00279
Example: Navigating caBIG WebcaArray -> caTissue -> COPPAhttps://array.nci.nih.gov/caarray/project/abate-00279
Example: Navigating caBIG WebcaArray -> caTissue -> COPPAhttps://array.nci.nih.gov/caarray/project/abate-00279
Example: Navigating caBIG WebcaArray -> caTissue -> COPPA
Example: Navigating caBIG WebcaArray -> caTissue -> COPPA
Example: Navigating caBIG WebcaArray -> caTissue -> COPPA
Example: Navigating caBIG WebcaArray -> caTissue -> COPPA
More about Linked DataDe-referencing URIsNon-information vs. information resourcesHash URIs vs. slashes303 redirects & content negotiationVocabulary of Interlinked Datasets (voiD)What is the subject of the dataset?Where is the SPARQL endpoint?Relationships with other datasets.
Who is providing Linked Data now?Publish existing open license datasets as Linked Data on the WebInterlink things between different data sourcesSize of dataset: approx. 7.8 billion triplesNumber of links: approx. 143 million
Life Sciences ContributorsHCLS Linked Open Drug DataWon 2009 Triplification ChallengeBio2RDF40 biology-, gene- and medical-related datasets (altogether 2.3 billion triples)Many morehttp://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/DataSets
Many are getting involvedThese are examples of different organizations that are publishing their data as Linked Data.
caBIG+Linked Data: How?Need RDF vocabularies for describing caBIG data sets.Need consistent approach to naming things with URIs.Need to RDFize caBIG datasets.Must minimize technology barriers to data providers.
RDF VocabulariesUML-OWL Generator [1]OWL representations of information modelsRetains semantics of original modelIncludes NCIt conceptsProvides schemata for data
1. McCusker et al. Semantic web data warehousing for caGrid. BMC Bioinformatics 2009, 10(Suppl 10):S2
RDF Vocabularies
Naming caBIG resources with URIscaGrid Identifier FrameworkAddresses issues of change.Based on Persistent URLs (PURLs)Provides HTTP URI Naming Authority, Prefix Authority, and resolution scheme.Consistent with Linked Data principles.
Naming caBIG resources with URIs
RDFizing caBIG DataAlternative approachesStatic RDF filesNative triplestoreAdapter over API (e.g. Data Service API)Adapter over relational DB
Recommended ApproachAdapter over relational DBFactors:caCORE SDK generated services use relational database backend.Potentially large data sets.Potential high frequency of change.
Initial ProcessUML-OWL Generator -> OWL info. modelD2R Service -> Generate RDF-relational mappingUse caCORE SDK XMI to modify D2R mapping to use OWL classes representing information model.D2R Server to exposing Linked Data & SPARQL HTTP Interfaces
Preliminary WorkExposing caTissueSuite 1.1 data set as Linked Data.Goal: To validate high-level use cases with important, real caBIG data set.
caTissue Suite 1.1 UML
caTissue Suite 1.1 OWL
caTissue Suite 1.1 OWL
caTissue Suite 1.1 OWL
D2R MappingMaps relational model to RDF classesMapping expressed in RDF (N3 syntax)ClassMapMaps tables to RDFS or OWL classesPropertyBridgeMaps columns to simple RDF properties (i.e. OWL DatatypeProperty)Maps FKs to RDF links (OWL ObjectProperty)Both tables and columns can be mapped to existing RDFS or OWL vocabularies.
D2R Mapping@prefix vocab: ....map:catissue_tissue_specimen a d2rq:ClassMap; d2rq:dataStorage map:database; d2rq:uriPattern ... d2rq:class vocab:TissueSpecimen; ...
map:catissue_specimen_event_param_SPECIMEN_ID a d2rq:PropertyBridge; d2rq:belongsToClassMap map:catissue_specimen_event_param; d2rq:property vocab:SpecimenEventParameters_specimen_AbstractSpecimen; d2rq:refersToClassMap map:catissue_abstract_specimen; d2rq:join "catissue_specimen_event_param.SPECIMEN_ID => catissue_abstract_specimen.IDENTIFIER";...
Linked Data & SPARQL InterfaceLinked Data navigationOWL Class defined inInformation ModelSPARQL QueryInterface
Linked Data & SPARQL InterfaceMapped role defined in OWLInformation Model
Prototype EvaluationVery preliminary workRelatively straightforward to map tables and columns to OWL classes and properties.Similar functionality to caCORE GetXML or GetHTML HTTP APIs.ChallengesInterpreting alternative O-R mapping strategies for OO Inheritance.D2R server Performance
Related Work in caBIGProstate Cancer Information System (PCIS)A prototype system (Fox Chase Cancer Center)Developed Prostate Cancer Ontology (PCO)Apply PCO to integrate two database systemsTumor RegistryProstate Cancer DatabaseA web-based ontology query formulation
Hua Min, Frank J. Manion, Elizabeth Goralczyk, Yu-Ning Wong, Eric Ross, J. Robert Beck Integration of Prostate Cancer Clinical Data Using an Ontology (JBI 2009)
PCIS Ontology Mapping
Other RDFizer ToolsW3C RDB2RDF Incubator GroupSurveyed all existing approachesDeveloped comparison frameworkRecommends using Rule Interchange Framework (RIF)No existing tools support RIF
http://www.w3.org/2005/Incubator/rdb2rdf/
Performance AnalysesBenchmarking Tests ExistLehigh University Benchmark (LUBM)Ontology Benchmark (UOBM)Berlin SPARQL Benchmark (BSBM)
http://esw.w3.org/topic/RdfStoreBenchmarking
Berlin SPARQL BenchmarkRelational still faster than RDF & SPARQLhttp://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/index.html
DiscussionLimitations of mappingsLink discovery and maintenancePerformance Enhancements/optimizations are neededRDB2RDF tools show promiseNeed to exploit research experience in RDBMSs, esp. in query translations
ReferencesHeath et al. How to Publish Linked Data on the Web. 8th Intl. Semantic Web Conference, 2008. Tutorial.Berners-Lee. Linked Data. http://www.w3.org/DesignIssues/LinkedData.htmlA Survey of Current Approaches for Mapping of Relational Databases to RDF W3C RDB2RDF Incubator Group Report http://www.w3.org/2005/Incubator/rdb2rdf/Alasdair J. G. Gray et al. Can RDB2RDF Tools Feasibily Expose Large Science Archives for Data Integration? In ESWC 2009, volume 5554 of LNCS, pp 491-505. Springer.Bizer et al. Linked Data the story so far.
**Linked Data is the title of a W3C design note published by Tim Berners-Lee in 2006, in which he outlined some best practices for effectively publishing data on the Web, using HTTP, URIs, and typed RDF links. These ideas have since been elaborated several groups. The ultimate goals is to enable data sets to be combined and re-used in unexpected ways.*https://array.nci.nih.gov/caarray/project/abate-00279****