Bioinformatics in the Era of Open Science and Big Data
Philip E. Bourne, University of California San Diego
SIB Biel/Bienne, 1/28/14

DESCRIPTION

Keynote presentation at the Swiss Institute of Bioinformatics (SIB) annual meeting in Biel, Switzerland on January 28, 2014.

TRANSCRIPT

Page 1: Bioinformatics in the Era of Open Science and Big Data

SIB Biel/Bienne 1

Bioinformatics in the Era of Open Science and Big Data

Philip E. Bourne
University of California San Diego

[email protected]

1/28/14

Page 2: Bioinformatics in the Era of Open Science and Big Data

My Bias

• RCSB PDB/IEDB Database Developer – Views on community, quality, sustainability …
• PLOS Journal Co-founder – Open Science Advocate
• Associate Vice Chancellor for Innovation – Business models, interaction with the private sector, sustainability
• Professor – Mentoring, reward system, value (or not) of research
• Associate Director of NIH for Data Science – ??

Page 3: Bioinformatics in the Era of Open Science and Big Data

The History of Bioinformatics According to Bourne

1980s 1990s 2000s 2010s 2020

Discipline:

Unknown Expt. Driven Emergent Over-sold A Service A Partner A Driver

The Raw Material:

Non-existent Limited/Poor More/Ontologies Big Data/Siloed Open/Integrated

The People:

No name Technicians Industry recognition data scientists Academics

Searls (ed) The Roots in Bioinformatics Series PLOS Comp Biol

Page 4: Bioinformatics in the Era of Open Science and Big Data

We Need to Start By Asking How Are We Using the Data Now!

Only Then Can We Make Rational Decisions About Data – Large or Small

Page 5: Bioinformatics in the Era of Open Science and Big Data

Web Logs etc. Are Not Enough

[Figure: Structure Summary page activity for H1N1 Influenza related structures, Jan. 2008 to Jul. 2010]

1RUZ: 1918 H1 Hemagglutinin

3B7E: Neuraminidase of A/Brevig Mission/1/1918 H1N1 strain in complex with zanamivir

* http://www.cdc.gov/h1n1flu/estimates/April_March_13.htm

[Andreas Prlic]

Page 6: Bioinformatics in the Era of Open Science and Big Data

We Need to Learn from Industries Whose Livelihood Addresses the Question of Use

Page 7: Bioinformatics in the Era of Open Science and Big Data

Next Consider What We Do Every Day

We take actions on digital data increasingly across boundaries

Page 8: Bioinformatics in the Era of Open Science and Big Data

Actions on Data Imply:

• Ensuring data quality and hence trust
• Making data sustainable
• Making data open and accessible
• Making data findable
• Providing suitable metadata and annotation
• Making data queryable
• Making data analyzable
• Presenting data so as to maximize its value
• Rewarding good data practices

Page 9: Bioinformatics in the Era of Open Science and Big Data

Actions on Data Imply:

• Ensuring data quality and hence trust
• Making data sustainable
• Making data open and accessible
• Making data findable
• Providing suitable metadata and annotation
• Making data queryable
• Making data analyzable
• Presenting data so as to maximize its value
• Rewarding good data practices

Page 10: Bioinformatics in the Era of Open Science and Big Data

Boundaries on Data Imply:

• Working across biological scales
• Working across biomedical disciplines
• Working across basic and clinical research and practice
• Working across institutional boundaries
• Working across public and private sectors
• Working across national and international borders
• Working across funding agencies

Page 11: Bioinformatics in the Era of Open Science and Big Data

Boundaries on Data Imply:

• Working across biological scales
• Working across biomedical disciplines
• Working across basic and clinical research and practice
• Working across institutional boundaries
• Working across public and private sectors
• Working across national and international borders
• Working across funding agencies

Page 12: Bioinformatics in the Era of Open Science and Big Data

These Issues Have Been Around Almost As Long As Bioinformatics

The Good News is That “Big Data” Has Brought More Attention to the Problem

Page 13: Bioinformatics in the Era of Open Science and Big Data

What Are Big Data?

• Large datasets from high throughput experiments
• Large numbers of small datasets
• Data which are “ill-formed”
• The why (causality) is replaced by the what
• A signal that a fundamental change is taking place – a tipping point?

Page 14: Bioinformatics in the Era of Open Science and Big Data

That Change is Embodied in: The Digital Enterprise

• Consists of digital assets
• E.g. datasets, papers, software, lab notes
• Each asset is uniquely identified and has provenance, including access control
• E.g. publishing simply involves changing the access control
• Digital assets are interoperable across the enterprise
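The idea of a digital asset on this slide can be illustrated with a small data structure. This is only a sketch under stated assumptions: the class name `DigitalAsset`, the three access levels, and the tuple-based provenance log are hypothetical choices made for illustration and are not taken from the talk. It shows an asset that carries a unique identifier and a provenance trail, and for which "publishing" is nothing more than a change of access control.

```python
from dataclasses import dataclass, field
from enum import Enum
import uuid


class Access(Enum):
    """Hypothetical access levels; the talk does not define a specific scheme."""
    PRIVATE = "private"
    ENTERPRISE = "enterprise"
    PUBLIC = "public"


@dataclass
class DigitalAsset:
    """A dataset, paper, piece of software, or lab note (illustrative only)."""
    kind: str
    title: str
    access: Access = Access.PRIVATE
    # Each asset receives its own unique identifier on creation.
    asset_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    # Provenance is kept as an append-only list of (actor, action) pairs.
    provenance: list = field(default_factory=list)

    def record(self, actor: str, action: str) -> None:
        """Append one provenance entry."""
        self.provenance.append((actor, action))

    def publish(self, actor: str) -> None:
        """Publishing only changes the access control and logs the event."""
        self.access = Access.PUBLIC
        self.record(actor, "published")


paper = DigitalAsset(kind="paper", title="Draft manuscript")
paper.record("jane", "created")
paper.publish("jane")
print(paper.asset_id, paper.access.value, paper.provenance)
```

Because the asset itself never moves or changes form when it is published, the same identifier and provenance remain valid before and after, which is one way to read the slide's point about interoperability across the enterprise.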

Page 15: Bioinformatics in the Era of Open Science and Big Data

The Enterprise Is Almost Anything: Your Lab, Your Institution, the SIB, the NIH…

Page 16: Bioinformatics in the Era of Open Science and Big Data

Consider an Academic Institution As A Digital Enterprise

• Jane scores extremely well in parts of her graduate on-line neurology class. Neurology professors, whose research profiles are on-line and well described, are automatically notified of Jane’s potential based on a computer analysis of her scores against the background interests of the neuroscience professors. Consequently, professor Smith interviews Jane and offers her a research rotation. During the rotation she enters details of her experiments related to understanding a widespread neurodegenerative disease in an on-line laboratory notebook kept in a shared on-line research space – an institutional resource where stakeholders provide metadata, including access rights and provenance beyond that available in a commercial offering. According to Jane’s preferences, the underlying computer system may automatically bring to Jane’s attention Jack, a graduate student in the chemistry department whose notebook reveals he is working on using bacteria for purposes of toxic waste cleanup. Why the connection? They reference the same gene a number of times in their notes, which is of interest to two very different disciplines – neurology and environmental sciences. In the analog academic health center they would never have discovered each other, but thanks to the Digital Enterprise, pooled knowledge can lead to a distinct advantage. The collaboration results in the discovery of a homologous human gene product as a putative target in treating the neurodegenerative disorder. A new chemical entity is developed and patented. Accordingly, by automatically matching details of the innovation with biotech companies worldwide that might have potential interest, a licensee is found. The licensee hires Jack to continue working on the project. Jane joins Joe’s laboratory, and he hires another student using the revenue from the license. The research continues and leads to a federal grant award. 
The students are employed, further research is supported and in time societal benefit arises from the technology.

From “What Big Data Means to Me”, JAMIA 2014

Page 17: Bioinformatics in the Era of Open Science and Big Data

The NIH is Starting to Think About the Digital Enterprise, Witness…

bd2k.nih.gov

Page 18: Bioinformatics in the Era of Open Science and Big Data

What Defines the Digital Enterprise

• Trans-NIH collaboration – change culture
• Long-term NIH strategic planning
• The BD2K Initiative
• A “hub” of data science activities
• International cooperation
• Interagency cooperation
• Data sharing policies

Page 19: Bioinformatics in the Era of Open Science and Big Data

Consider One NIH Scenario

• NIH-Drive
  – Investigator A from the NCI makes frequent reference to the over expression of genes x and y
  – Investigator B from the NHLBI makes frequent reference to the under expression of genes x and y
  – Automatic notification of a potential common interest before publication or database deposition
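The NIH-Drive scenario can be sketched as a simple co-mention check. Everything here is an assumption made for illustration: the function names `gene_counts` and `common_interests`, the `min_mentions` threshold standing in for "frequent reference", the placeholder gene symbols, and the sample notes are all hypothetical, and the talk does not describe how such matching would actually be implemented.

```python
from collections import Counter
from itertools import combinations
import re


def gene_counts(text, genes):
    """Count how often each gene symbol of interest appears in free text."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    wanted = set(genes)
    return Counter(t for t in tokens if t in wanted)


def common_interests(notes, genes, min_mentions=2):
    """Return (investigator, investigator, shared genes) for every pair
    that both mention the same gene at least `min_mentions` times."""
    counts = {who: gene_counts(text, genes) for who, text in notes.items()}
    matches = []
    for a, b in combinations(sorted(counts), 2):
        shared = sorted(
            g for g in genes
            if counts[a][g] >= min_mentions and counts[b][g] >= min_mentions
        )
        if shared:
            matches.append((a, b, shared))
    return matches


# Hypothetical notes; GENEX, GENEY and GENEZ are placeholder symbols.
notes = {
    "investigator_A_NCI": "GENEX is over expressed. GENEX and GENEY again; GENEY over expressed.",
    "investigator_B_NHLBI": "Under expression of GENEX and GENEY. GENEX low, GENEY low.",
    "investigator_C": "Unrelated work on GENEZ.",
}
for a, b, shared in common_interests(notes, ["GENEX", "GENEY", "GENEZ"]):
    print(f"Notify {a} and {b}: shared interest in {', '.join(shared)}")
```

Note that the match is made on the genes alone, not on the direction of the effect, so investigators reporting over expression and under expression of the same genes are still brought together, as in the scenario.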

Page 20: Bioinformatics in the Era of Open Science and Big Data

The NIH Process

An external advisory group provided a valuable blueprint for what should be done

http://acd.od.nih.gov/Data%20and%20Informatics%20Working%20Group%20Report.pdf

Page 21: Bioinformatics in the Era of Open Science and Big Data

Blueprint Recommendations

• Promote central and federated catalogs
  – Establish minimal metadata framework
  – Tools to facilitate data sharing
  – Elaborate on existing data sharing policies
• Support methods and applications
  – Fund all phases of software development
  – Leverage lessons from National Centers
• Training
  – More funding
  – Enhance review of training apps
  – Quantitative component to all awards
• On campus IT strategic plan
  – Catalog of existing tools
  – Informatics laboratory
  – Ditto big data
• Sustainable funding commitment

acd.od.nih.gov/diwg.htm

Page 22: Bioinformatics in the Era of Open Science and Big Data

General Features of NIH Data Science

• Lightweight metadata standards
• Data & software registries
• Expanded policies on data sharing, open source software
• Training programs & reward systems
• Institutional incentives
• Private sector incentives
• Data centers serving community needs

Page 23: Bioinformatics in the Era of Open Science and Big Data

What is Under Way?

• Now:
  – Data centers (under review)
  – Data science training grants (call Q1 14)
  – Pilot data catalog consortium (call out)
  – Genomic Research Data Alliance (being finalized)
  – Piloting “NIH-drive”
• What Is Planned:
  – Extended public-private programs specifically for data science activities
  – Interagency activities
  – International exchange programs
  – Cold Spring Harbor-like training facilities – bi-coastal?
  – Programs for better data descriptions
  – Reward institutions/communities
  – Policies to get clinical trial data into the public domain

Page 24: Bioinformatics in the Era of Open Science and Big Data

The History of Bioinformatics According to PEB

1980s 1990s 2000s 2010s 2020

Discipline:

Unknown Expt. Driven Emergent Over-sold A Service A Partner Driver

The Raw Material:

Non-existent Limited/Poor More/Ontologies Big Data/Siloed Open/Integrated

The People:

No name Technicians Industry recognition data scientists Academics

The Roots in Bioinformatics Series PLOS Comp Biol

Page 25: Bioinformatics in the Era of Open Science and Big Data

Why Will Science Become More Open?

• The public (and hence the politicians) demand it
• It’s the right thing to do
• It’s part of the modern psyche
• The scholarly enterprise is broken and more stakeholders are acknowledging it

Page 26: Bioinformatics in the Era of Open Science and Big Data

Personal Evidence

• I have a paper with 16,000 citations that no one has ever read
• I have papers in PLOS ONE that have more citations than ones in PNAS
• I have data sets I am proud of but no place to put them
• I “can’t” reproduce work from my own lab

Page 27: Bioinformatics in the Era of Open Science and Big Data

Politicians Demand It: G8 open data charter

http://opensource.com/government/13/7/open-data-charter-g8

Page 28: Bioinformatics in the Era of Open Science and Big Data

What Are Some of the Ramifications of Open Science?

Page 29: Bioinformatics in the Era of Open Science and Big Data

Open Science Has The Potential to Deinstitutionalize

Daniel Hulshizer/Associated Press

Page 30: Bioinformatics in the Era of Open Science and Big Data

Open Science Has The Potential to Deinstitutionalize

Daniel Hulshizer/Associated Press

Page 31: Bioinformatics in the Era of Open Science and Big Data

An Example of That Potential: The Story of Meredith

http://fora.tv/2012/04/20/Congress_Unplugged_Phil_Bourne

Page 32: Bioinformatics in the Era of Open Science and Big Data

Open Science Has The Potential to Deinstitutionalize

Daniel Hulshizer/Associated Press

Page 33: Bioinformatics in the Era of Open Science and Big Data

Open Science Has The Potential to Deinstitutionalize

Daniel Hulshizer/Associated Press

Page 34: Bioinformatics in the Era of Open Science and Big Data

There Still Needs to be a Reward System
The Wikipedia Experiment – Topic Pages

Identify areas of Wikipedia that relate to the journal and are missing or are stubs

Develop a Wikipedia page in the sandbox

Have a Topic Page Editor review the page

Publish the copy of record with associated rewards

Release the living version into Wikipedia

Page 35: Bioinformatics in the Era of Open Science and Big Data

One Possible End Product of Open Science

0. Full text of PLoS papers stored in a database
1. A link brings up figures from the paper
2. Clicking the paper figure retrieves data from the PDB which is analyzed
3. A composite view of journal and database content results
4. The composite view has links to pertinent blocks of literature text and back to the PDB

1. User clicks on thumbnail
2. Metadata and a webservices call provide a renderable image that can be annotated
3. Selecting a feature provides a database/literature mashup
4. That leads to new papers

PLoS Comp. Biol. 2005 1(3) e34

Page 36: Bioinformatics in the Era of Open Science and Big Data

Change in the Way we Support the Research Lifecycle

IDEAS – HYPOTHESES – EXPERIMENTS – DATA – ANALYSIS – COMPREHENSION – DISSEMINATION

Authoring Tools
Lab Notebooks
Data Capture
Software Repositories
Analysis Tools
Visualization
Scholarly Communication
Commercial & Public Tools
Git-like Resources
By Discipline
Data Journals
Discipline-Based Metadata Standards
Community Portals
Institutional Repositories
New Reward Systems
Commercial Repositories
Training

Page 37: Bioinformatics in the Era of Open Science and Big Data

Change in the Way we Support the Research Lifecycle

IDEAS – HYPOTHESES – EXPERIMENTS – DATA – ANALYSIS – COMPREHENSION – DISSEMINATION

Authoring Tools
Lab Notebooks
Data Capture
Software Repositories
Analysis Tools
Visualization
Scholarly Communication
Commercial & Public Tools
Git-like Resources
By Discipline
Data Journals
Discipline-Based Metadata Standards
Community Portals
Institutional Repositories
New Reward Systems
Commercial Repositories
Training

Page 38: Bioinformatics in the Era of Open Science and Big Data

automate: workflows, pipeline & service integrative frameworks

pool, share & collaborate web systems

nanopub

semantics & ontologies

machine readable documentation

scientific software engineering

CSSE

[Carole Goble]

Page 39: Bioinformatics in the Era of Open Science and Big Data

Why is This Important to Me Personally?

• My wife is being treated for stage 1 breast cancer
• This highlights for me the disparity between what is happening in the lab and what is happening in the clinic
  – In the lab cancer is a personalized and treatable condition
  – In the clinic we are still equally “poisoning” patients with drugs first introduced 10-20 years ago

Page 40: Bioinformatics in the Era of Open Science and Big Data

http://sagecongress.org/Presentations/Sommer.pdf

[Josh Sommer]

Page 41: Bioinformatics in the Era of Open Science and Big Data

http://sagecongress.org/Presentations/Sommer.pdf

[Josh Sommer]

Page 42: Bioinformatics in the Era of Open Science and Big Data

Most Laboratories

• We are the long tail
• Goodbye to the student is goodbye to the data
• Very few of us have complied (or will comply) with the data management plans we write into grants
• Too much software is unusable

S. Veretnik, J.L. Fink, and P.E. Bourne 2008 Computational Biology Resources Lack Persistence and Usability. PLoS Comp. Biol. 4(7): e1000136

Page 43: Bioinformatics in the Era of Open Science and Big Data

Today’s Research Lifecycle is Digitally Fragmented at Best

• Proof:
  – I can’t immediately reproduce the research in my own laboratory
    • It took an estimated 280 hours for an average user to approximately reproduce the paper
  – Workflows are maturing and becoming helpful
  – Data and software versions and accessibility prevent exact reproducibility

Daniel Garijo et al. 2013 Quantifying Reproducibility in Computational Biology: The Case of the Tuberculosis Drugome. PLOS ONE 8(11): e80278

Page 44: Bioinformatics in the Era of Open Science and Big Data

We Have Some Really Big Problems to Solve – The Commons Can Help

Page 45: Bioinformatics in the Era of Open Science and Big Data

What Really Happens When You Take a Drug?

• Can we predict drug efficacy and toxicity?
• Can we reuse old drugs?
• Can we design personalized medicines?

Page 46: Bioinformatics in the Era of Open Science and Big Data

One Drug, One Gene, One Disease

Bernard M. Nat Rev Drug Disc 8(2009), 959-968

Page 47: Bioinformatics in the Era of Open Science and Big Data

Polypharmacology

• Tykerb – Breast cancer
• Gleevec – Leukemia, GI cancers
• Nexavar – Kidney and liver cancer
• Staurosporine – natural product – alkaloid – many uses, e.g., antifungal, antihypertensive

Collins and Workman 2006 Nature Chemical Biology 2 689-700

Page 48: Bioinformatics in the Era of Open Science and Big Data

Polypharmacology is Not Rare but Common

• Single gene knockouts only affect phenotype in 10-20% of cases
  A.L. Hopkins Nat. Chem. Biol. 2008 4:682-690
• 35% of biologically active compounds bind to two or more targets that do not have similar sequences or global shapes
  Paolini et al. Nat. Biotechnol. 2006 24:805–815

Kaiser et al. Nature 462 (2009) 175-81

Predict side effects
Repurpose drugs

Page 49: Bioinformatics in the Era of Open Science and Big Data

Drug Binding is Dynamic

• Drug effect depends not only on how strong the binding is (binding affinity) but also on how long the drug is “stuck” in the protein (residence time).
• Molecular Dynamics (MD) simulation is powerful but computationally intensive.

~ns: 1 day of simulation

~ms to hours: >10^6 days

D. Huang et al. (2011), PLoS Comp Biol 7(2):e1002002
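The gap between the two timescales on this slide follows from simple arithmetic. The sketch below assumes a throughput of about 1 ns of simulated time per day (read from the slide's "~ns, 1 day" figure) and assumes that wall-clock cost scales linearly with simulated time; both are simplifications, and the function name `days_to_simulate` is hypothetical.

```python
# Assumed throughput: roughly 1 ns of simulated time per day of computing.
NS_PER_DAY = 1.0


def days_to_simulate(seconds, ns_per_day=NS_PER_DAY):
    """Wall-clock days needed to simulate `seconds` of molecular motion,
    assuming cost grows linearly with simulated time."""
    nanoseconds = seconds * 1e9
    return nanoseconds / ns_per_day


for label, seconds in [("1 ns", 1e-9), ("1 us", 1e-6), ("1 ms", 1e-3)]:
    days = days_to_simulate(seconds)
    print(f"{label}: {days:,.0f} days (about {days / 365.25:,.1f} years)")
```

A millisecond is a million nanoseconds, so under these assumptions reaching even the short end of the residence-time range costs on the order of 10^6 days, which matches the order of magnitude quoted on the slide.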

Page 50: Bioinformatics in the Era of Open Science and Big Data

Systems Pharmacology

[Figure labels: Drug molecules; Target binding; Catalytic site; Affect protein function; Enzyme inhibition; Systemic response; Metabolic network; Uptake; Secretion (or biomass components)]

Slide from Roger Chang

Page 51: Bioinformatics in the Era of Open Science and Big Data

Multiscale Modeling of Drug Actions

[Figure labels: Traditional Approach; Systems-based Approach; physiological process]

• Understanding of dynamics and kinetics of protein-ligand interactions
• Knowledge representation and discovery & model integration
• Prediction of molecular interaction network on a genome scale
• Reconstruction, analysis and simulation of biological networks

Slide from Lei Xie

Page 52: Bioinformatics in the Era of Open Science and Big Data

More Generally Any Translational-based Research That Involves Modeling at Multiple Scales

http://sagebase.org/

Page 53: Bioinformatics in the Era of Open Science and Big Data

The History of Bioinformatics According to Bourne

1980s 1990s 2000s 2010s 2020

Discipline:

Unknown Expt. Driven Emergent Over-sold A Service A Partner A Driver

The Raw Material:

Non-existent Limited/Poor More/Ontologies Big Data/Siloed Open/Integrated

The People:

No name Technicians Industry recognition data scientists Academics

The Roots in Bioinformatics Series PLOS Comp Biol

Page 54: Bioinformatics in the Era of Open Science and Big Data

In Summary: By the End of the Decade Biomedical Research Will Be a Truly Digital Enterprise and Computational Scientists Will Be At the Forefront

You Have Much to Look Forward To