big data in pharma - overview and use cases
Big Data Analyses in Pharma: An Overview
Josef Scheiber, PhD, Managing Director
July 2015
Geography
Startup Center in Waldsassen
Main site: Data Analyses and Software Development
Westpark Center, Garmischer Str. in Munich: Scientific Activities, since Jan 1, 2015
Basel/Switzerland: Data Curation and customer-related activities
Prague: 150 km
Munich: 200 km
Berlin: 300 km
Frankfurt: 250 km
BioVariance at a Glance: Get the most out of your complex data
Curate. Integrate.
Analyze. Model.
Visualize. Explore. DECIDE.
Overview
• Background
• Strategy
• Examples
Background
Courtesy: M. Zeinab, slideshare
What do we need out of Big Data?
1. What are the inhibitors of kinase X and the five most similar kinases with IC50 < 1 μM and with MW < 500 from all internal and external data sources?
2. What assay technologies have been used against my kinase? Which cell lines?
3. What other proteins are in the same kinase branch as target X, where there were validated chemical hits from external or internal sources?
4. If I hit a particular kinase, what would the potential side-effect profile look like? Which known inhibitor of this kinase has the best safety profile and the fewest known IC50s?
5. Have I identified other compounds with a bioactivity profile similar to compound X and with the same core substructure?
6. Can we create a phylochemical tree of kinases and for a new kinase target place it into the tree on the basis of activity against a reference panel of compounds?
7. Have I identified all kinases with an x-ray structure (in-house or external) that are in pathway X?
Bridging Chemical and Biological Data: Implications for Pharmaceutical Drug Discovery. JL Jenkins, J Scheiber, D Mikhailov, A Bender, A Schuffenhauer, B Cornett, V Chan, J Kondracki, B Rohde, JW Davies (2012). In: Computational Approaches in Cheminformatics and Bioinformatics. Edited by: A Bender, R Guha. 25-56. John Wiley & Sons, Inc.
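Question 1 is, at its core, a filter over integrated activity records. A minimal sketch in Python, assuming a hypothetical `Activity` record layout and made-up compounds; real data would come from the internal and external sources the slide mentions:

```python
from dataclasses import dataclass


@dataclass
class Activity:
    compound: str
    target: str
    ic50_um: float     # IC50 in micromolar
    mol_weight: float  # molecular weight
    source: str        # "internal" or "external"


def potent_small_inhibitors(records, targets, max_ic50_um=1.0, max_mw=500.0):
    """Return sorted names of compounds that hit any target within both cutoffs."""
    hits = {
        r.compound
        for r in records
        if r.target in targets
        and r.ic50_um < max_ic50_um
        and r.mol_weight < max_mw
    }
    return sorted(hits)


# Made-up records, for illustration only.
records = [
    Activity("cmpd-1", "kinase X", 0.05, 420.0, "internal"),
    Activity("cmpd-2", "kinase X", 3.20, 390.0, "external"),  # too weak
    Activity("cmpd-3", "kinase Y", 0.80, 510.0, "external"),  # too heavy
    Activity("cmpd-4", "kinase Y", 0.30, 455.0, "internal"),
]

hits = potent_small_inhibitors(records, {"kinase X", "kinase Y"})
print(hits)
```

The hard part in practice is not the filter but getting all sources into one consistent record layout, which is what the integration strategy later in the talk addresses.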
ANSWERS
Context matters!
metabolites, drugs
targets, pathways
diseases (phenotypes)
Context matters
RNA, DNA
It's not that simple …
Descriptive: What happened?
Diagnostic: Why did it happen?
Predictive: What will happen?
Prescriptive: How can we make it happen?
Better data for better analytics
Hindsight Insight Foresight
Need for interpretation
[Figure: stacked bar chart (0% to 100%) showing the shares of Experimental Design, Experiment, and Data Analysis in five eras: before molecular biology, molecular biology golden age, genomics age, deep sequencing age, and very soon.]
Big Data?
Volume
Genome Sequencing
Slide adapted from George Church
Cost Reduction - Example
458 Ferrari Spider: $398,000 in 2006, 40 cents now!
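The size of that price drop is easy to work out from the two figures on the slide. A small sketch of the arithmetic (the variable names are illustrative):

```python
# Figures taken from the slide's car analogy.
price_2006_usd = 398_000.0
price_now_usd = 0.40

# How many times cheaper the "car" has become.
fold_reduction = price_2006_usd / price_now_usd
print(f"{fold_reduction:,.0f}-fold cheaper")
```

That is close to a million-fold reduction.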
Much more data for way less money
Challenges for Informatics? 1 genome is roughly 500 GB of data
2011: several hundred exomes
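To get a feeling for the informatics challenge, the per-genome figure can be scaled to cohort sizes. A minimal sketch, assuming the roughly 500 GB per genome quoted on the slide; the cohort sizes are hypothetical:

```python
GB_PER_GENOME = 500  # rough figure from the slide


def storage_tb(n_genomes, gb_per_genome=GB_PER_GENOME):
    """Estimated raw storage in terabytes (decimal: 1 TB = 1000 GB)."""
    return n_genomes * gb_per_genome / 1000


# Hypothetical cohort sizes, for illustration only.
for n in (1, 100, 10_000):
    print(f"{n:>6} genomes: {storage_tb(n):,.1f} TB")
```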
Drug Discovery Pipeline
Target Finding, Lead Finding, Lead Optimization, …, Phase 1, …, Market
Drug candidates Patients
Velocity
• Mutations in tumor
• Resistance mechanisms in patients
• Long-term/short-term AE
• Compliance
• Nutrition and microbiome
• Data from wearables relevant for drugs
For each patient
Variety
• Bioinformatics
• Clinical
• Social network
• E-health
• Also text/patents
A simplified overview: Molecules in Man
Adapted from Gohlke JM, Portier CJ. Environ. Health Perspect. 115:1261-1263 (2007)
A question of complexity: They all interact …
Biology
Chemistry
Physics
Dealing with a very complex environment, i.e. many opportunities
DNA RNA Protein Interactions Clinical parameters Treatment History Tissue anatomy Surgical History Epigenetic Profiles from many patients at different time points
Target Off-targets Metabolites Additional indications Unspecific effects Similar drugs
Adapted from: J. Scheiber; How can we enable drug discovery informatics for personalized healthcare? Expert Opinion on Drug Discovery, 1-6; 2/2011
… individual polypharmacology
Sequences Expression Proteomics Biological networks (but also: Cells, Tissues, Organs)
POPULATION
Veracity
• Chemogenomics data
• Gene expression data
Imputation?
Veracity - Chemogenomics
Adapted from Tanrikulu et al. Missing Value Estimation for Compound-Target Activity Data, J. Mol. Inf
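To make the imputation question concrete: compound-target activity matrices are sparse, and missing cells have to be estimated before many analyses can run. The sketch below uses plain per-target mean imputation as a naive baseline. It is not the method of Tanrikulu et al., and the matrix values are made up.

```python
def impute_column_means(matrix):
    """Fill None cells with the mean of the observed values in the same column."""
    n_cols = len(matrix[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in matrix if row[j] is not None]
        # A column with no observations at all stays None.
        means.append(sum(observed) / len(observed) if observed else None)
    return [
        [means[j] if value is None else value for j, value in enumerate(row)]
        for row in matrix
    ]


# Rows: compounds, columns: targets, values: made-up activity values.
activity = [
    [6.0, None, 5.0],
    [7.0, 8.0, None],
    [None, 6.0, 7.0],
]

filled = impute_column_means(activity)
for row in filled:
    print(row)
```

A baseline like this ignores compound similarity and target relatedness, which is exactly the information a more careful estimation method would try to exploit.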
Veracity - Interactomics
A Proteome-Scale Map of the Human Interactome Network
Rolland, Thomas et al. Cell, Volume 159, Issue 5, 1212-1226
Veracity – Social Media
Strategy
Biological/Pharmacological Understanding
drugs
targets pathways
diseases (phenotypes)
Data integration strategy
a) A central vocabulary/pointer server (information stored: preferred names and synonyms, plus pointers to the data servers, i.e. where to find what)
b) A semantic integration layer with domain-specific terminology and referential data
c) A database for each data type collected, storing only preferred names along with raw measurements
d) Clearly defined APIs for further integration with public data sources and to enable large-scale analyses
Vocabularies needed
• Genes, Drugs, Proteins
• Diseases
• Organisms
• Microbiome species & genes
• Localization & source
• Phenotype
• Metabolite common names
Answering workflow
Vocabulary
Vocabulary server acts as translator, aggregator and locator, i.e. knows where the respective facts can be found
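The translator and locator roles can be sketched as two lookups: synonym to preferred name, and preferred name to the data servers that hold facts about it. This is a minimal in-memory sketch; the class, its method names, and the server names are hypothetical, while the drug synonyms are the ones used in the tweet-mining example of this talk.

```python
class VocabularyServer:
    """Toy vocabulary/pointer server: translates synonyms and locates data."""

    def __init__(self):
        self._preferred = {}  # lowercase name or synonym -> preferred name
        self._locations = {}  # preferred name -> set of data server names

    def register(self, preferred, synonyms=(), servers=()):
        for name in (preferred, *synonyms):
            self._preferred[name.lower()] = preferred
        self._locations.setdefault(preferred, set()).update(servers)

    def translate(self, term):
        """Return the preferred name for a term, or None if unknown."""
        return self._preferred.get(term.lower())

    def locate(self, term):
        """Return the sorted data servers that hold facts about a term."""
        preferred = self.translate(term)
        return sorted(self._locations.get(preferred, ()))


vocab = VocabularyServer()
vocab.register(
    "Adalimumab",
    synonyms=["Humira", "Exemptia", "331731-18-1", "L04AB04"],
    servers=["chemogenomics-db", "clinical-db"],  # hypothetical server names
)

print(vocab.translate("humira"))
print(vocab.locate("L04AB04"))
```

Keeping only preferred names in the per-datatype databases means a synonym needs to be fixed in exactly one place.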
Firmicutes produce alpha-Linolein and thereby cause gut irritation
species
metabolite
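The annotation in this example sentence can be sketched as a dictionary lookup that assigns a vocabulary type to each recognized term. The two-entry vocabulary and the tokenizer are simplifying assumptions for illustration only:

```python
import re

# Hypothetical mini-vocabulary: lowercase term -> vocabulary type.
VOCAB = {
    "firmicutes": "species",
    "alpha-linolein": "metabolite",
}


def tag_terms(sentence, vocab=VOCAB):
    """Return (term, type) pairs for every token found in the vocabulary."""
    # Tokens are runs of letters, optionally joined by hyphens.
    tokens = re.findall(r"[A-Za-z][A-Za-z-]*", sentence)
    return [(tok, vocab[tok.lower()]) for tok in tokens if tok.lower() in vocab]


tags = tag_terms("Firmicutes produce alpha-Linolein and thereby cause gut irritation")
print(tags)
```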
Further
Data of each type is stored in a specific database to enhance performance of large-scale analyses. Expert tools talk to data directly or via web services.
API
API
API
API
End user interface and visualization
Examples
Genome data at scale
Workflow
Identify drug targets (primary and off-targets, from DrugBank)
Call variants on a per-individual basis
Workflow
Analyse mutation rates in the targets and in particular drug binding pockets
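The last workflow step can be sketched as counting how many individuals carry at least one variant inside a binding pocket. The function name, individuals, residue positions, and pocket definition below are all hypothetical; real input would come from per-individual variant calls mapped to protein residues.

```python
def pocket_mutation_rate(variants_by_individual, pocket_residues):
    """Fraction of individuals with at least one variant inside the pocket."""
    if not variants_by_individual:
        return 0.0
    affected = sum(
        1
        for positions in variants_by_individual.values()
        if pocket_residues & set(positions)
    )
    return affected / len(variants_by_individual)


# Made-up variant calls: individual -> mutated residue positions in the target.
variants = {
    "ind-1": [150, 301],
    "ind-2": [],
    "ind-3": [72],
    "ind-4": [88, 150],
}
pocket = {72, 86, 150}  # hypothetical binding-pocket residue numbers

rate = pocket_mutation_rate(variants, pocket)
print(f"{rate:.0%} of individuals carry a pocket variant")
```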
Example: Donepezil / Acetylcholinesterase
• PDB 4EY7
Image extracted from Cheung et al., 2012 [2]
Example: Donepezil / Acetylcholinesterase
Example: Acetylcholinesterase
Integrative Genomics Viewer
Not very successful
Alignment of the 3D structures of mutant number 52 (yellow) and PDB 4EY7 AChE protein (green). The only changed residue is the Y150 (magenta) to H150 (red). The white surface represents the molecular surface of donepezil.
Why is this a bad example?
AChE is a key enzyme in human biology; such enzymes are the most highly conserved, even across species.
Learning: Check target conservation before investing time.
Generating Vocabularies
Vocabulary generation
Extensive mapping of terms from various sources
Vocabulary generation
397211 preferred names
598532 synonyms
102086 identifiers
The chevron diagram shows the number of samples annotated
with names. Already by looking at the numbers you can see that
mapping everything is non-trivial.
A Big Data exercise in itself …
Tweet mining
Mining Twitter for side effects
Needed Drug Name and synonyms:
Adalimumab, Humira, Exemptia, 331731-18-1, L04AB04
MedDRA vocabulary
Many birds tweet lots of noise … BUT …
Tweet counts per drug and side-effect term:
Drug         headache  rash  pain  bleeding  cough
Lipitor             0     1    27         0      0
Lisinopril          0     0     8         0      7
Simvastatin         0     0     0         0      0
Plavix              0     0     0         1      0
Crestor             0     0     0         0      0
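Counts of this kind can be produced by checking every tweet for a drug name together with a side-effect term. The sketch below uses plain case-insensitive substring matching on made-up tweets. It is not the original analysis code, and a real run would also expand drug synonyms and MedDRA terms.

```python
def count_co_mentions(tweets, drugs, effects):
    """Count tweets that mention both a drug and a side-effect term."""
    counts = {(drug, effect): 0 for drug in drugs for effect in effects}
    for tweet in tweets:
        text = tweet.lower()
        for drug in drugs:
            if drug.lower() not in text:
                continue
            for effect in effects:
                if effect.lower() in text:
                    counts[(drug, effect)] += 1
    return counts


# Made-up tweets, for illustration only.
tweets = [
    "Lipitor gives me muscle pain every evening",
    "Anyone else get a cough on lisinopril?",
    "Cheap Lipitor online, no prescription",  # advertising noise, no effect term
]

counts = count_co_mentions(tweets, ["Lipitor", "Lisinopril"], ["pain", "cough"])
for (drug, effect), n in counts.items():
    print(drug, effect, n)
```

Substring matching is deliberately crude: it also matches terms inside longer words, and it cannot tell a patient report from an advertisement.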
Top 200 drugs
- The cutoff is at 1500 tweets, which a few drugs easily surpass (although it is mostly pharmacies advertising).
- Others are not mentioned once (probably a synonym issue, as I restricted the search to English).
- Top drugs are tweeted more often, but e.g. Tarceva (in 2006), at the very bottom, also reaches the top number of tweets (109 on list).
089 – 189 6582 – 80, Garmischer Str. 4/V, 80339 München
[email protected]: 09632 – 9248 325, Konnersreuther Str. 6g, 95652 Waldsassen
Questions?