data science for the win

31
Data Science For The Win! 1 Michel Dumontier, Ph.D. Distinguished Professor of Data Science @micheldumontier::KE@Work:2017-01- 26

Upload: michel-dumontier

Post on 14-Feb-2017

29 views

Category:

Science


1 download

TRANSCRIPT

Page 1: Data Science for the Win

1

Data Science For The Win!

Michel Dumontier, Ph.D.

Distinguished Professor of Data Science

@micheldumontier::KE@Work:2017-01-26

Page 2: Data Science for the Win

2 @micheldumontier::KE@Work:2017-01-26HTTP://XKCD.COM/242/

Page 3: Data Science for the Win

Science!

@micheldumontier::KE@Work:2017-01-263

Page 4: Data Science for the Win

@micheldumontier::KE@Work:2017-01-264

Most published research findings are false.- John Ioannidis, Stanford University

Non-reproducibility of 65–89% in pharmacological studies and 64% in psychological studies.

PLoS Med 2005;2(8): e124. 

Page 5: Data Science for the Win

@micheldumontier::KE@Work:2017-01-265

Science is hard.

Page 6: Data Science for the Win

@micheldumontier::KE@Work:2017-01-266

Statistics aren’t sufficient.

Page 7: Data Science for the Win

@micheldumontier::KE@Work:2017-01-267

Biology is unruly.

Page 8: Data Science for the Win

@micheldumontier::KE@Work:2017-01-268

Medicine is complicated.

Page 9: Data Science for the Win

@micheldumontier::KE@Work:2017-01-269

we need new ways to think about discovery science using more knowledge

while paying attention toreproducibility

Page 10: Data Science for the Win

@micheldumontier::KE@Work:2017-01-26

Our confidence in a finding is strengthened when found in multiple independent datasets of the same type

10

A common rejection module (CRM) for acute rejection across multiple organs identifies novel therapeutics for organ transplantationKhatri et al. JEM. 210 (11): 2205DOI: 10.1084/jem.20122709

Page 11: Data Science for the Win

@micheldumontier::KE@Work:2017-01-2611

Known to be associated through perturbation expts.Mentioned in scientific text togetherInteract with known genes…Use statistical methods to find significant patterns and predictive models.

Our confidence in a findingis strengthened when we can uncover evidence at multiple

levels

A study to examine what support exists for the role of genes in aging in the worm.

Page 12: Data Science for the Win

@micheldumontier::KE@Work:2017-01-2612

How can we automatically find the evidence that support or dispute a scientific hypothesis using the totality of available data, tools and scientific knowledge?

Page 13: Data Science for the Win

@micheldumontier::KE@Work:2017-01-2613

So what do we need to achieve this?

1. Data ScienceInfrastructure to identify, represent, store, transport, retrieve, aggregate, query, mine, analyze data and execute services on demand in a reproducible manner.Methods to discover plausible, supported, prioritized, and experimentally verifiable associations.

2. Communityto build a massive, decentralized network of interconnected and interoperable data and services

Page 14: Data Science for the Win

@micheldumontier::KE@Work:2017-01-2614

FAIR: Findable, Accessible, Interoperable, Re-usable

Applies to all digital resources and their metadata

Page 15: Data Science for the Win

@micheldumontier::KE@Work:2017-01-2615

The Semantic Web is the new global web of knowledge

standards for publishing, sharing and querying facts, expert knowledge and services

scalable approach for the discoveryof independently formulated

and distributed knowledge

Page 16: Data Science for the Win

16

Be FAIR by creating Linked Data

@micheldumontier::KE@Work:2017-01-26Linking Open Data cloud diagram 2014, by Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/"

Page 17: Data Science for the Win

@micheldumontier::KE@Work:2017-01-26

Linked Data for the Life Sciences

17

Bio2RDF is an open source project that uses semantic web technologies to make FAIR biomedical data

chemicals/drugs/formulations, genomes/genes/proteins, domainsInteractions, complexes & pathwaysanimal models and phenotypesDisease, genetic markers, treatmentsTerminologies & publications

• Billions of facts from 35 biomedical datasets• Provenance & statistics• A growing interoperable ecosystem with global

partners: EBI, NCBI, DBCLS, NCBO, OpenPHACTS, and commercial tool providers

Page 18: Data Science for the Win

@micheldumontier::KE@Work:2017-01-2618

Get data in standardized representations using web technology

Page 19: Data Science for the Win

@micheldumontier::KE@Work:2017-01-2619

Discover connections across datasets

Page 20: Data Science for the Win

@micheldumontier::KE@Work:2017-01-2620

Efficiently find and explore data

Page 21: Data Science for the Win

@micheldumontier::KE@Work:2017-01-2621

Examine the facts and their provenance

Page 22: Data Science for the Win

@micheldumontier::KE@Work:2017-01-2622

Develop timely knowledge portals

Kamdar, Dumontier. An Ebola virus-centered knowledge base. Database. 2015 Jun 8;2015. doi: 10.1093/database/bav049.

Page 23: Data Science for the Win

@micheldumontier::KE@Work:2017-01-2623

Apply graph methods to assess find bad links and fill in the gaps

W Hu, H Qiu, M Dumontier. Link Analysis of Life Science Linked Data. International Semantic Web Conference (2) 2015: 446-462.

Page 24: Data Science for the Win

@micheldumontier::KE@Work:2017-01-2624

Find new uses for existing drugs

Finding melanoma drugs through a probabilistic knowledge graph. PeerJ Preprints

using data mining

And validate them against pipelines for drug discovery

Page 25: Data Science for the Win

@micheldumontier::KE@Work:2017-01-26

Find new uses for existing drugs

25

using machine learning

Drug Disease

Chemical Stru

cture

Side Effects

Target Sequences

Target Functions

Molecular Interactions

Phenotypes

Text-mined term

s

know

nne

gativ

e

Page 26: Data Science for the Win

@micheldumontier::KE@Work:2017-01-2626

Examine overlap in “gold standards”

diseasesdrugs

Page 27: Data Science for the Win

@micheldumontier::KE@Work:2017-01-2627

Uncover evidence in a transparent manner

HyQue

Page 28: Data Science for the Win

@micheldumontier::KE@Work:2017-01-26

Key Research Challengesto accelerate discovery science

• Scalable, shared, fault-tolerant, and readily re-deployable frameworks for archiving and providing versioned and maximally FAIR biomedical (meta)data

• Scalable methods for the prospective and retrospective authoring, assessment, and repair of metadata.

• Scalable frameworks for open, transparent, reproducible and recurrent analysis and meta-analysis of FAIR research data.

• Methods to identify reporting biases and knowledge gaps • Scalable and reliable methods for the evaluation of scientific

hypotheses using evidence gathered across scales and sources• Scalable methods for validation of research findings.

28

Page 29: Data Science for the Win

@micheldumontier::KE@Work:2017-01-2629

The mission of the Institute of Data Science is to accelerate scientific discovery, improve clinical care and well being, and to strengthen communities. We will foster a collaborative environment for inter- and multi-disciplinary research and training founded on accurate, reproducible, multi-scale, distributed, and efficient computation.

Institute of Data Science

Page 30: Data Science for the Win

@micheldumontier::KE@Work:2017-01-2630

Institute of Data Science

Tackle impactful research problems through multidisciplinary teams involving students, researchers, partners, and stakeholders inside and out of the University.• Establish a core team of researchers to broadly fulfill the

mission of the Institute.• Highlight and engage the incredible data scientists already

at UM• Work with communities to deploy and evaluate the impact

of the application of our methods.• Train the next generation of data scientists to beeven more

collaborative and interdisciplinary.

Page 31: Data Science for the Win

@micheldumontier::KE@Work:2017-01-2631

[email protected]: http://dumontierlab.com

Presentations: http://slideshare.com/micheldumontier