Bioinformatics in the Era of Open Science and Big Data
DESCRIPTION
Keynote presentation at the Swiss Institute of Bioinformatics (SIB) annual meeting in Biel, Switzerland on January 28, 2014.
TRANSCRIPT
SIB Biel/Bienne 1
Bioinformatics in the Era of Open Science and Big Data
Philip E. Bourne
University of California San Diego
1/28/14
SIB Biel/Bienne 2
My Bias
• RCSB PDB/IEDB Database Developer – Views on community, quality, sustainability …
• PLOS Journal Co-founder – Open Science Advocate
• Associate Vice Chancellor for Innovation – Business models, interaction with the private sector, sustainability
• Professor – Mentoring, reward system, value (or not) of research
• Associate Director of NIH for Data Science – ??
SIB Biel/Bienne 3
The History of Bioinformatics According to Bourne
1980s 1990s 2000s 2010s 2020
Discipline:
Unknown Expt. Driven Emergent Over-sold A Service A Partner A Driver
The Raw Material:
Non-existent Limited /Poor More/Ontologies Big Data/Siloed Open/Integrated
The People:
No name Technicians Industry recognition data scientists Academics
Searls (ed) The Roots in Bioinformatics Series PLOS Comp Biol
SIB Biel/Bienne 4
We Need to Start By Asking How Are We Using the Data Now!
Only Then Can We Make Rational Decisions About Data – Large or Small
SIB Biel/Bienne 5
Web Logs etc. Are Not Enough
[Figure: Structure Summary page activity for H1N1 influenza-related structures, January 2008 to July 2010. Panels: 1RUZ: 1918 H1 Hemagglutinin; 3B7E: Neuraminidase of A/Brevig Mission/1/1918 H1N1 strain in complex with zanamivir]
* http://www.cdc.gov/h1n1flu/estimates/April_March_13.htm
[Andreas Prlic]
SIB Biel/Bienne 6
We Need to Learn from Industries Whose Livelihood Addresses the Question of Use
SIB Biel/Bienne 7
Next Consider What We Do Every Day
We take actions on digital data increasingly across boundaries
SIB Biel/Bienne 8
Actions on Data Imply:
• Ensuring data quality and hence trust
• Making data sustainable
• Making data open and accessible
• Making data findable
• Providing suitable metadata and annotation
• Making data queryable
• Making data analyzable
• Presenting data so as to maximize its value
• Rewarding good data practices
SIB Biel/Bienne 10
Boundaries on Data Imply:
• Working across biological scales
• Working across biomedical disciplines
• Working across basic and clinical research and practice
• Working across institutional boundaries
• Working across public and private sectors
• Working across national and international borders
• Working across funding agencies
SIB Biel/Bienne 12
These Issues Have Been Around Almost As Long As Bioinformatics
The Good News is That “Big Data” Has Brought More Attention to the Problem
SIB Biel/Bienne 13
What Are Big Data?
• Large datasets from high throughput experiments
• Large numbers of small datasets
• Data which are “ill-formed”
• The why (causality) is replaced by the what
• A signal that a fundamental change is taking place – a tipping point?
SIB Biel/Bienne 14
That Change is Embodied in: The Digital Enterprise
• Consists of digital assets
• E.g. datasets, papers, software, lab notes
• Each asset is uniquely identified and has provenance, including access control
• E.g. publishing simply involves changing the access control
• Digital assets are interoperable across the enterprise
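One way to picture the digital asset described on this slide is the minimal Python sketch below. It is an illustration only: the class name `DigitalAsset`, the three access levels, and the use of a random UUID as the unique identifier are assumptions made for the example, not anything specified in the talk.

```python
from dataclasses import dataclass, field
from enum import Enum
import uuid


class Access(Enum):
    # Hypothetical access levels; the talk only says assets carry access control.
    PRIVATE = "private"
    ENTERPRISE = "enterprise"
    PUBLIC = "public"


@dataclass
class DigitalAsset:
    kind: str                      # e.g. "dataset", "paper", "software", "lab notes"
    title: str
    access: Access = Access.PRIVATE
    # Assumption: a random UUID stands in for whatever identifier scheme is used.
    asset_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    provenance: list = field(default_factory=list)

    def record(self, actor: str, action: str) -> None:
        """Append one provenance entry (who did what)."""
        self.provenance.append((actor, action))

    def publish(self, actor: str) -> None:
        """Publishing is modelled as nothing more than an access-control change."""
        self.access = Access.PUBLIC
        self.record(actor, "access changed to public")


notes = DigitalAsset(kind="lab notes", title="Rotation notebook")
notes.record("jane", "created")
notes.publish("jane")
```

After `publish`, the asset keeps the same identifier and its provenance simply gains one more entry, which is the sense in which publishing "simply involves changing the access control".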
SIB Biel/Bienne 15
The Enterprise Is Almost Anything: Your Lab, Your Institution, the SIB, the NIH…
SIB Biel/Bienne 16
Consider an Academic Institution As A Digital Enterprise
• Jane scores extremely well in parts of her graduate on-line neurology class. Neurology professors, whose research profiles are on-line and well described, are automatically notified of Jane’s potential based on a computer analysis of her scores against the background interests of the neuroscience professors. Consequently, professor Smith interviews Jane and offers her a research rotation. During the rotation she enters details of her experiments related to understanding a widespread neurodegenerative disease in an on-line laboratory notebook kept in a shared on-line research space – an institutional resource where stakeholders provide metadata, including access rights and provenance beyond that available in a commercial offering. According to Jane’s preferences, the underlying computer system may automatically bring to Jane’s attention Jack, a graduate student in the chemistry department whose notebook reveals he is working on using bacteria for purposes of toxic waste cleanup. Why the connection? They reference the same gene a number of times in their notes, which is of interest to two very different disciplines – neurology and environmental sciences. In the analog academic health center they would never have discovered each other, but thanks to the Digital Enterprise, pooled knowledge can lead to a distinct advantage. The collaboration results in the discovery of a homologous human gene product as a putative target in treating the neurodegenerative disorder. A new chemical entity is developed and patented. Accordingly, by automatically matching details of the innovation with biotech companies worldwide that might have potential interest, a licensee is found. The licensee hires Jack to continue working on the project. Jane joins Joe’s laboratory, and he hires another student using the revenue from the license. The research continues and leads to a federal grant award. 
The students are employed, further research is supported and in time societal benefit arises from the technology.
From “What Big Data Means to Me”, JAMIA 2014
SIB Biel/Bienne 17
The NIH is Starting to Think About the Digital Enterprise, Witness…
bd2k.nih.gov
SIB Biel/Bienne 18
What Defines the Digital Enterprise
• Trans-NIH collaboration – change culture
• Long-term NIH strategic planning
• The BD2K Initiative
• A “hub” of data science activities
• International cooperation
• Interagency cooperation
• Data sharing policies
SIB Biel/Bienne 19
Consider One NIH Scenario
• NIH-Drive
  – Investigator A from the NCI makes frequent reference to the overexpression of genes x and y
  – Investigator B from the NHLBI makes frequent reference to the underexpression of genes x and y
  – Automatic notification of a potential common interest before publication or database deposition
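The notification step in this scenario could be sketched roughly as follows. This is a toy illustration, not a description of how NIH-Drive works: the function names, the `min_mentions` threshold, the placeholder gene symbols `GENEX` and `GENEY`, and the sample notes are all invented for the example, and real text mining of gene mentions is considerably harder than token counting.

```python
from collections import Counter
import re


def gene_mentions(text, gene_names):
    """Count how often each gene symbol appears as a whole token in the text."""
    tokens = Counter(re.findall(r"[A-Za-z0-9]+", text.upper()))
    return {g: tokens[g.upper()] for g in gene_names if tokens[g.upper()] > 0}


def common_interests(notes_by_investigator, gene_names, min_mentions=2):
    """Return (investigator, investigator, shared genes) for every pair that
    both mention the same gene at least `min_mentions` times."""
    frequent = {}
    for who, text in notes_by_investigator.items():
        counts = gene_mentions(text, gene_names)
        frequent[who] = {g for g, n in counts.items() if n >= min_mentions}

    names = sorted(frequent)
    matches = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = frequent[a] & frequent[b]
            if shared:
                matches.append((a, b, sorted(shared)))
    return matches


# Hypothetical notes; GENEX and GENEY stand in for "genes x and y".
notes = {
    "investigator_A_NCI": "GENEX overexpressed in tumour samples. "
                          "GENEX and GENEY again elevated. GENEY up.",
    "investigator_B_NHLBI": "GENEX underexpressed. GENEY low; "
                            "GENEX low again, GENEY also reduced.",
}
matches = common_interests(notes, ["GENEX", "GENEY"])
```

Each entry in `matches` is a candidate notification. Note that the match is on the gene alone, so opposite findings (over- versus underexpression) still surface as a common interest.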
SIB Biel/Bienne 20
The NIH Process
An external advisory group provided a valuable blueprint for what should be done
http://acd.od.nih.gov/Data%20and%20Informatics%20Working%20Group%20Report.pdf
SIB Biel/Bienne 21
Blueprint Recommendations
• Promote central and federated catalogs
  – Establish minimal metadata framework
  – Tools to facilitate data sharing
  – Elaborate on existing data sharing policies
• Support methods and applications
  – Fund all phases of software development
  – Leverage lessons from National Centers
• Training
  – More funding
  – Enhance review of training apps
  – Quantitative component to all awards
• On campus IT strategic plan
  – Catalog of existing tools
  – Informatics laboratory
  – Ditto big data
• Sustainable funding commitment
acd.od.nih.gov/diwg.htm
SIB Biel/Bienne 22
General Features of NIH Data Science
• Lightweight metadata standards
• Data & software registries
• Expanded policies on data sharing, open source software
• Training programs & reward systems
• Institutional incentives
• Private sector incentives
• Data centers serving community needs
SIB Biel/Bienne 23
What is Under Way?
• Now:
  – Data centers (under review)
  – Data science training grants (call Q1 14)
  – Pilot data catalog consortium (call out)
  – Genomic Research Data Alliance (being finalized)
  – Piloting “NIH-Drive”
• What Is Planned:
  – Extended public-private programs specifically for data science activities
  – Interagency activities
  – International exchange programs
  – Cold Spring Harbor-like training facilities – bi-coastal?
  – Programs for better data descriptions
  – Reward institutions/communities
  – Policies to get clinical trial data into the public domain
SIB Biel/Bienne 24
The History of Bioinformatics According to PEB
1980s 1990s 2000s 2010s 2020
Discipline:
Unknown Expt. Driven Emergent Over-sold A Service A Partner A Driver
The Raw Material:
Non-existent Limited /Poor More/Ontologies Big Data/Siloed Open/Integrated
The People:
No name Technicians Industry recognition data scientists Academics
The Roots in Bioinformatics Series PLOS Comp Biol
SIB Biel/Bienne 25
Why Will Science Become More Open?
• The public (and hence the politicians) demand it
• It's the right thing to do
• It's part of the modern psyche
• The scholarly enterprise is broken and more stakeholders are acknowledging it
SIB Biel/Bienne 26
Personal Evidence
• I have a paper with 16,000 citations that no one has ever read
• I have papers in PLOS ONE that have more citations than ones in PNAS
• I have data sets I am proud of but no place to put them
• I “can't” reproduce work from my own lab
SIB Biel/Bienne 27
Politicians Demand It: G8 Open Data Charter
http://opensource.com/government/13/7/open-data-charter-g8
SIB Biel/Bienne 28
What Are Some of the Ramifications of Open Science?
SIB Biel/Bienne 29
Open Science Has The Potential to Deinstitutionalize
[Photo: Daniel Hulshizer/Associated Press]
SIB Biel/Bienne 31
An Example of That Potential: The Story of Meredith
http://fora.tv/2012/04/20/Congress_Unplugged_Phil_Bourne
SIB Biel/Bienne 34
There Still Needs to be a Reward System
The Wikipedia Experiment – Topic Pages
Identify areas of Wikipedia that relate to the journal that are missing or stubs
Develop a Wikipedia page in the sandbox
Have a Topic Page Editor review the page
Publish the copy of record with associated rewards
Release the living version into Wikipedia
SIB Biel/Bienne 35
One Possible End Product of Open Science
0. Full text of PLoS papers stored in a database
1. A link brings up figures from the paper
2. Clicking the paper figure retrieves data from the PDB which is analyzed
3. A composite view of journal and database content results
4. The composite view has links to pertinent blocks of literature text and back to the PDB

1. User clicks on thumbnail
2. Metadata and a webservices call provide a renderable image that can be annotated
3. Selecting a feature provides a database/literature mashup
4. That leads to new papers
PLoS Comp. Biol. 2005 1(3) e34
SIB Biel/Bienne 36
Change in the Way we Support the Research Lifecycle
IDEAS – HYPOTHESES – EXPERIMENTS – DATA – ANALYSIS – COMPREHENSION – DISSEMINATION
[Diagram labels: Authoring Tools; Lab Notebooks; Data Capture; Software Repositories; Analysis Tools; Visualization; Scholarly Communication; Commercial & Public Tools; Git-like Resources By Discipline; Data Journals; Discipline-Based Metadata Standards; Community Portals; Institutional Repositories; New Reward Systems; Commercial Repositories; Training]
SIB Biel/Bienne 38
automate: workflows, pipeline & service integrative frameworks
pool, share & collaborate web systems
nanopub
semantics & ontologies
machine readable documentation
scientific software engineering
CSSE
[Carole Goble]
SIB Biel/Bienne 39
Why is This Important to Me Personally?
• My wife is being treated for stage 1 breast cancer
• This highlights for me the disparity between what is happening in the lab and what is happening in the clinic
  – In the lab cancer is a personalized and treatable condition
  – In the clinic we are still equally “poisoning” patients with drugs first introduced 10-20 years ago
SIB Biel/Bienne 40
http://sagecongress.org/Presentations/Sommer.pdf
[Josh Sommer]
SIB Biel/Bienne 42
Most Laboratories
• We are the long tail
• Goodbye to the student is goodbye to the data
• Very few of us have complied (or will comply) with the data management plans we write into grants
• Too much software is unusable
S. Veretnik, J.L. Fink, and P.E. Bourne 2008 Computational Biology Resources Lack Persistence and Usability. PLoS Comp. Biol. 4(7): e1000136
SIB Biel/Bienne 43
Today’s Research Lifecycle is Digitally Fragmented at Best
• Proof:
  – I can't immediately reproduce the research in my own laboratory
    • It took an estimated 280 hours for an average user to approximately reproduce the paper
  – Workflows are maturing and becoming helpful
  – Data and software versions and accessibility prevent exact reproducibility
Daniel Garijo et al. 2013 Quantifying Reproducibility in Computational Biology: The Case of the Tuberculosis Drugome. PLOS ONE 8(11) e80278.
SIB Biel/Bienne 44
We Have Some Really Big Problems to Solve – The Commons Can Help
SIB Biel/Bienne 45
What Really Happens When You Take a Drug?
• Can we predict drug efficacy and toxicity?
• Can we reuse old drugs?
• Can we design personalized medicines?
SIB Biel/Bienne 46
One Drug, One Gene, One Disease
Bernard M. Nat Rev Drug Disc 8(2009), 959-968
SIB Biel/Bienne 47
Polypharmacology
• Tykerb – Breast cancer
• Gleevec – Leukemia, GI cancers
• Nexavar – Kidney and liver cancer
• Staurosporine – natural product – alkaloid – many uses, e.g., antifungal, antihypertensive
Collins and Workman 2006 Nature Chemical Biology 2 689-700
SIB Biel/Bienne 48
A.L. Hopkins Nat. Chem. Biol. 2008 4:682-690
Polypharmacology is Not Rare but Common
• Single gene knockouts only affect phenotype in 10-20% of cases
• 35% of biologically active compounds bind to two or more targets that do not have similar sequences or global shapes
Paolini et al. Nat. Biotechnol. 2006 24:805–815
Kaiser et al. Nature 462 (2009) 175-81
Predict side effects Repurpose drugs
SIB Biel/Bienne 49
Drug Binding is Dynamic
• Drug effect depends not only on how strongly the drug binds (binding affinity) but also on how long the drug is “stuck” in the protein (residence time).
• Molecular Dynamics (MD) simulation is powerful but computationally intensive.
~ns: 1 day of simulation
~ms – hours: >10^6 days of simulation
D. Huang et al. (2011), PLoS Comp Biol 7(2):e1002002
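The timescale gap on this slide follows from simple arithmetic, sketched below in Python. The throughput of roughly one nanosecond of simulated time per day of computing is taken from the slide; actual throughput depends heavily on system size, software, and hardware, so treat the numbers as order-of-magnitude only. The function name is invented for the example.

```python
NS_PER_MS = 1_000_000  # one millisecond is 10^6 nanoseconds


def days_to_simulate(target_ns, ns_per_day=1.0):
    """Wall-clock days needed to simulate `target_ns` nanoseconds,
    assuming a constant throughput of `ns_per_day`."""
    return target_ns / ns_per_day


# Reaching a millisecond residence time at ~1 ns per day:
days = days_to_simulate(NS_PER_MS)
years = days / 365.25
```

At that rate a single millisecond of simulated time takes about a million days, several thousand years, and residence times measured in hours are many orders of magnitude further out.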
SIB Biel/Bienne 50
Systems Pharmacology
[Diagram labels: Drug molecules; Target binding; Affect protein function; Systemic response; Uptake; Secretion (or biomass components); Enzyme inhibition; Metabolic network; Catalytic site]
Slide from Roger Chang
SIB Biel/Bienne 51
Multiscale Modeling of Drug Actions
[Diagram labels: Traditional Approach; Systems-based Approach; physiological process; Understanding of dynamics and kinetics of protein-ligand interactions; Knowledge representation and discovery & model integration; Prediction of molecular interaction network on a genome scale; Reconstruction, analysis and simulation of biological networks]
Slide from Lei Xie
SIB Biel/Bienne 52
More Generally Any Translational-based Research That Involves Modeling at Multiple Scales
http://sagebase.org/
SIB Biel/Bienne 53
The History of Bioinformatics According to Bourne
1980s 1990s 2000s 2010s 2020
Discipline:
Unknown Expt. Driven Emergent Over-sold A Service A Partner A Driver
The Raw Material:
Non-existent Limited /Poor More/Ontologies Big Data/Siloed Open/Integrated
The People:
No name Technicians Industry recognition data scientists Academics
The Roots in Bioinformatics Series PLOS Comp Biol
SIB Biel/Bienne 54
In Summary: By the End of the Decade Biomedical Research Will Be a Truly Digital Enterprise and Computational Scientists Will Be At the Forefront
You Have Much to Look Forward To