Page 1: Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data


Building a network of interoperable and independently produced linked and open biomedical data

Michel Dumontier, Ph.D.
Associate Professor of Medicine (Biomedical Informatics)
Stanford University

@micheldumontier::ACS:23-08-16
An invited talk in support of the 2016 Herman Skolnik Awardees

Page 2: Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data


My research aims to develop computational methods for biomedical knowledge discovery

We develop tools and methods to represent, store, publish, integrate, query, and reuse biomedical data, software, and ontologies

Page 3: Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data

Reuse needs to be considered firmly in the context of discovery and reproducibility

Page 4: Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data

"Most published research findings are false"
- John Ioannidis, Stanford University

Page 5: Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data

Reproducible discovery

1. Data Science Tools and Methods
– Infrastructure: To identify, annotate, link, integrate, search for and query data and services
– Tools: To identify and uncover support for known or novel associations

2. Community Standards to contribute to and interrogate a massive, decentralized network of interconnected data and software

Page 6: Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data


FAIR: Findable, Accessible, Interoperable, Re-usable

Page 7: Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data


FAIR: Findable, Accessible, Interoperable, Re-usable

Findable
– Globally unique identifiers for datasets and the data they contain
– Rich set of descriptors to search and filter with
– Indexed and searchable

Accessible
– Identifiers can be used to retrieve representations using standard protocols (e.g. HTTP)
– Metadata is always available

Interoperable
– Data represented with formal knowledge representations
– Include links to other datasets/vocabularies

Reusable
– Licensing, Provenance, Community standards

Page 8: Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data


The Semantic Web is the new global web of knowledge

standards for publishing, sharing and querying facts, expert knowledge and services

scalable approach for the discovery of independently formulated and distributed knowledge

Page 9: Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data


Linked Data offers a solid foundation for FAIR data

• Entities (people, proteins, pathways, etc) are identified using globally unique identifiers (URIs)

• Entity descriptions are represented with a standardized language (RDF)

• Data can be retrieved using a universal protocol (HTTP)

• Entities (concepts, data, resources) can be linked together to increase interoperability
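
The four points above can be illustrated with a minimal sketch that serializes two statements about one entity as N-Triples, using only the Python standard library. The `example.org` URIs, the label text, and the helper names (`iri`, `literal`, `triple`) are assumptions made for illustration; only the two RDFS property URIs are real vocabulary terms.

```python
# Real RDFS vocabulary terms.
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
RDFS_SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"


def iri(value: str) -> str:
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(text: str) -> str:
    """Quote a plain string literal, escaping backslashes, quotes and newlines."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def triple(subject: str, predicate: str, obj: str) -> str:
    """Format one subject-predicate-object statement as an N-Triples line."""
    return f"{subject} {predicate} {obj} ."


# Hypothetical entity and link target (not real identifiers).
protein = "http://example.org/protein/P1"
pathway = "http://example.org/pathway/W1"

lines = [
    # The entity is named by a globally unique URI and described in RDF.
    triple(iri(protein), iri(RDFS_LABEL), literal("Example protein")),
    # A link to another entity is what makes the data "linked".
    triple(iri(protein), iri(RDFS_SEE_ALSO), iri(pathway)),
]

print("\n".join(lines))
```

Because both the subject and the link target are HTTP URIs, a client could in principle dereference either one to retrieve its description.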

Page 10: Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data


Linked Data for the Life Sciences


Bio2RDF is an open source project to unify the representation and interlinking of biological data using RDF.

chemicals/drugs/formulations
genomes/genes/proteins, domains
interactions, complexes & pathways
animal models and phenotypes
disease, genetic markers, treatments
terminologies & publications

• 11B+ interlinked statements from 35 biomedical datasets and 400+ ontologies

• dataset description, provenance & statistics

• A growing interoperable ecosystem with the EBI, NCBI, DBCLS, NCBO, OpenPHACTS, and commercial tool providers

Page 11: Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data


Bio2RDF normalizes identifiers, formats, links, and access
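
As a rough sketch of what identifier normalization can look like, the function below maps a (prefix, identifier) pair onto the `http://bio2rdf.org/namespace:identifier` URI pattern that appears elsewhere in these slides. The synonym table and the function name are hypothetical and are not Bio2RDF's actual registry or code.

```python
# Hypothetical table mapping prefix variants to one preferred namespace.
PREFIX_SYNONYMS = {
    "uniprotkb": "uniprot",
    "geneid": "ncbigene",
    "entrez gene": "ncbigene",
}


def bio2rdf_uri(prefix: str, identifier: str) -> str:
    """Build a Bio2RDF-style URI from a dataset prefix and a local identifier."""
    key = prefix.strip().lower()
    namespace = PREFIX_SYNONYMS.get(key, key)  # fall back to the lowercased prefix
    return f"http://bio2rdf.org/{namespace}:{identifier.strip()}"


print(bio2rdf_uri("UniProtKB", " P12345 "))
print(bio2rdf_uri("DrugBank", "DB00001"))
```

Collapsing prefix variants to one namespace means that two datasets referring to the same record end up with the same URI, which is what allows their statements to be joined.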


Page 12: Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data


Page 13: Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data


Bio2RDF shows how datasets are connected together

Page 14: Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data


Queries can be federated across private and public SPARQL databases

Get all protein catabolic processes (and more specific GO terms) in biomodels

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?go ?label (COUNT(DISTINCT ?x) AS ?reactions)
WHERE {
  SERVICE <http://bioportal.bio2rdf.org/sparql> {
    ?go rdfs:label ?label .
    ?go rdfs:subClassOf+ ?tgo .
    ?tgo rdfs:label ?tlabel .
    FILTER regex(?tlabel, "^protein catabolic process")
  }
  SERVICE <http://biomodels.bio2rdf.org/sparql> {
    ?x <http://bio2rdf.org/biopax_vocabulary:identical-to> ?go .
    ?x a <http://www.biopax.org/release/biopax-level3.owl#BiochemicalReaction> .
  }
}
GROUP BY ?go ?label
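
A query like this is typically submitted to a single endpoint over HTTP, and that endpoint dispatches each SERVICE block itself. The sketch below builds such a request with the Python standard library without sending it. The helper name and the short stand-in query are assumptions, and the endpoint URL is the one shown on the slide, which may no longer be online.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Build (but do not send) a SPARQL Protocol GET request for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


# A short stand-in query; a federated query would be passed the same way.
QUERY = """PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?go ?label WHERE { ?go rdfs:label ?label } LIMIT 5"""

request = build_sparql_request("http://bioportal.bio2rdf.org/sparql", QUERY)
print(request.get_method(), request.full_url)
```

Passing `request` to `urllib.request.urlopen` would execute the query; that step is left out here so the sketch runs without network access.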


Page 15: Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data

Graph-like representation amenable to finding mismatches and discovering new links

W Hu, H Qiu, M Dumontier. Link Analysis of Life Science Linked Data. International Semantic Web Conference (2) 2015: 446-462.

Page 16: Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data


EbolaKB Using Linked Data and Software

Kamdar, Dumontier. An Ebola virus-centered knowledge base. Database. 2015 Jun 8;2015. doi: 10.1093/database/bav049.

Page 17: Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data


Network analysis and discovery

McCusker, McGuinness, Dumontier. In prep.

Page 18: Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data


Can we implement an open version of PREDICT using Linked Data?

AUC 0.91 across all therapeutic indications

Drug-drug similarity
A. Chemical structure similarity
B. Side effect similarity
C. Target sequence similarity
D. Target functional similarity
E. Network distance

Disease-disease similarity
A. Phenotype based
B. Text extracted concepts
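
One way to read this slide is that a candidate drug-disease pair is scored by how similar it is to pairs already known to be associated. The sketch below is a toy version of that idea: it uses Jaccard similarity over feature sets and combines the drug-drug and disease-disease similarities with a geometric mean, taking the best match over the known associations. The combination rule, the feature sets, and all names are assumptions for illustration, not the implementation behind the reported AUC.

```python
from math import sqrt


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two feature sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def association_score(drug, disease, known, drug_features, disease_features):
    """Best geometric-mean similarity between (drug, disease) and any known pair."""
    best = 0.0
    for known_drug, known_disease in known:
        if (known_drug, known_disease) == (drug, disease):
            continue  # do not score a pair against itself
        drug_sim = jaccard(drug_features[drug], drug_features[known_drug])
        disease_sim = jaccard(disease_features[disease], disease_features[known_disease])
        best = max(best, sqrt(drug_sim * disease_sim))
    return best


# Made-up feature sets: side effects for drugs, phenotypes for diseases.
drug_features = {
    "drugA": {"nausea", "headache", "rash"},
    "drugB": {"nausea", "headache"},
    "drugC": {"insomnia"},
}
disease_features = {
    "disease1": {"fever", "cough"},
    "disease2": {"fever", "cough", "fatigue"},
}
known = [("drugB", "disease2")]

print(association_score("drugA", "disease1", known, drug_features, disease_features))
print(association_score("drugC", "disease1", known, drug_features, disease_features))
```

In a fuller version, each pairing of a drug-drug measure (A to E) with a disease-disease measure (A or B) would yield one such score, and the scores would serve as features for a classifier.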

Page 19: Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data


HyQue: Hypothesis Validation

• A platform for knowledge discovery that uses data retrieval coupled with automated reasoning to validate scientific hypotheses

• Leverages semantic technologies to provide access to linked data, ontologies, and semantic web services

• Uses positive and negative findings, captures provenance

• Weighs evidence according to context

• Used to find aging genes in worm, assess cardiotoxicity of tyrosine kinase inhibitors

HyQue: evaluating hypotheses using Semantic Web technologies. J Biomed Semantics. 2011 May 17;2 Suppl 2:S3.

Evaluating scientific hypotheses using the SPARQL Inferencing Notation. Extended Semantic Web Conference (ESWC 2012). Heraklion, Crete. May 27-31, 2012.
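
The idea of weighing positive and negative findings while keeping their provenance can be sketched as follows. This is a toy scoring scheme under stated assumptions and not HyQue's actual rules: the `Evidence` record, the weights, and the normalization are hypothetical, and the findings attached to each source label are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    source: str      # provenance: where the finding came from
    supports: bool   # True for a positive finding, False for a negative one
    weight: float    # context-dependent importance (assumed, non-negative)


def hypothesis_score(evidence: list) -> float:
    """Return a score in [-1, 1]: weighted support minus weighted contradiction."""
    total = sum(item.weight for item in evidence)
    if total == 0:
        return 0.0  # no usable evidence either way
    signed = sum(item.weight if item.supports else -item.weight for item in evidence)
    return signed / total


# Invented findings for one hypothesis.
findings = [
    Evidence(source="sider", supports=True, weight=1.0),
    Evidence(source="chembl", supports=True, weight=0.5),
    Evidence(source="clinicaltrials", supports=False, weight=0.5),
]

print(hypothesis_score(findings))
```

Keeping the `source` on every record means a score can always be traced back to the datasets that produced it.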

Page 20: Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data


What evidence might we gather?

• clinical: Are there cardiotoxic effects associated with the drug?
– Literature (studies) [curated db]
– Product labels (studies) [r3:sider]
– Clinical trials (studies) [r3:clinicaltrials]
– Adverse event reports [r2:pharmgkb/onesides]
– Electronic health records (observations)

• pre-clinical associations:
– genotype-phenotype (null/disease models) [r2:mgi, r2:sgd; r3:wormbase]
– in vitro assays (IC50) [r3:chembl]
– drug targets [r2:drugbank; r2:ctd; r3:stitch]
– drug-gene expression [r3:gxa]
– pathways [r2:kegg; r3:reactome]
– Drug-pathway, disease-pathway enrichments [aberrant pathways]
– Chemical properties [r2:pubchem; r2:drugbank]
– Toxicology [r1:toxkb/cebs]

Page 21: Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data


HyQue

Page 22: Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data


Beyond Bio2RDF

Page 23: Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data


Network of Linked Data (~2007)

Page 24: Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data


Expansion across domains

“Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”

Page 25: Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data


A rapidly growing network of Linked Data

Linking Open Data cloud diagram 2014, by Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/

Page 26: Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data


Page 27: Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data


Page 28: Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data


Page 29: Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data


but the lack of coordination makes Linked Open Data chaotic and unwieldy

Page 30: Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data

There is no shortage of vocabularies, ontologies and community-based standards

Page 31: Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data


68 168

Page 32: Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data


metadatacenter.org

NIH COMMONS

Making it Easier, Possibly Even Pleasant, to Author Interoperable Experimental Metadata

Page 33: Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data


PubChem engaged the community to reuse and extend existing vocabularies

Page 34: Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data

Semanticscience Ontology (SIO)

An effective upper level ontology.
• 1500+ classes
• 207 object properties (inc. inverses)
• 1 datatype property

Page 35: Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data


Chemical Information Ontology (CHEMINF)

• Collaborative ontology

• Distinguishes algorithmic, or procedural information from declarative, or factual information, and renders of particular importance the annotation of provenance to calculated data.

Page 36: Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data


Where are we going?

• Large-scale publishing across biomedical datatypes is possible on the web

• Hubs such as NCBI and EBI now integrate data, but there is a need for global coordination on all datatypes

• Standard vocabularies must be open, freely accessible, and demonstrably reused

• Use of worldwide data integration formats (RDF) and improved linking of data

• Easier to deploy toolkits for providing standards-compliant linked data

Page 37: Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data


Linked Data Platform

Docker

• Data conversion scripts

• Query Editor

• Faceted Browser

• Relation Exploration

• API

• Data and data store

Model Organism Linked Data

MO-LD.org

Page 38: Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data


In Summary

• We use semantic technologies such as ontologies and linked data to make sense of and facilitate access to biomedical data (FAIR)

• The intimate development and use of standards by PubChem and others brings us closer to an interoperability ideal

• Much more work is needed to support (computational) discovery in a reproducible manner.

Page 39: Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data


Acknowledgements

Dumontier Lab
• Amrapali Zaveri
• Mary Panahiazar
• Shima Dastgheib
• Sandeep Ayyar
• Remzi Celebi
• David Odgers
• Wei Hu
• Ruben Verborgh
• Leo Chepelev
• Alison Callahan
• Jose Miguel Toledo Cruz
• Tanya Hiebert
• Beatriz Lujan
+ many more

Collaborators
• Mark Musen
• Nigam Shah
• Robert Hoehndorf
• Janna Hastings
• Christoph Steinbeck
• Egon Willighagen
• Nico Adams
• Colin Batchelor
• David Wild
• Evan Bolton
• Gang Fu
+ many more


Page 40: Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data


[email protected]

Website: http://dumontierlab.com
Presentations: http://slideshare.com/micheldumontier