2014 07 ismb personalized medicine
Atul Butte presentation at ISMB 2014
Big Data in Biomedicine: Translating 300 trillion points of data into new drugs and diagnostics
Atul Butte, MD, PhD
Chief, Division of Systems Medicine, Departments of Pediatrics, Genetics, and, by courtesy, Computer Science, Pathology, and Medicine
Center for Pediatric Bioinformatics, LPCH
Stanford University
abutt[email protected] @atulbutte
@ImmPortDB
Disclosures
• Scientific founder and advisory board membership – Genstruct – NuMedii – Personalis – Carmenta
• Honoraria for talks – Lilly – Pfizer – Siemens – Bristol Myers Squibb – AstraZeneca – Roche – Genentech
• Past or present consultancy – Lilly – Johnson and Johnson – Roche – NuMedii – Genstruct – Tercica – Ecoeos – Ansh Labs – Prevendia – Samsung – Assay Depot – Regeneron – Verinata – Geisinger – Covance
• Corporate Relationships – Northrop Grumman – Aptalis – Thomson Reuters
• Speakers’ bureau – None
• Companies started by students – Carmenta – Serendipity – NuMedii – Stimulomics – NunaHealth – Praedicat – MyTime – Flipora
Big Data in Biomedicine
Nearly 1.4 million microarrays available; doubles every 2-3 years
Butte AJ. Translational Bioinformatics: coming of age. JAMIA, 2008.
127 million substances x 740,000 assays: 1.2 billion points of data within a grid of 100 trillion cells; ~250 million active substances
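As a quick check on the grid arithmetic, the following sketch multiplies the two dimensions quoted on the slide and estimates how sparsely the grid is filled. Only the three input figures come from the slide; the variable names are illustrative.

```python
# Figures quoted on the slide.
substances = 127_000_000
assays = 740_000
measured_points = 1_200_000_000

# Every substance could in principle be run through every assay.
grid_cells = substances * assays

# Share of the substance-by-assay grid that actually holds a measurement.
fill_fraction = measured_points / grid_cells

print(f"grid cells: {grid_cells:.2e}")
print(f"fraction of grid measured: {fill_fraction:.4%}")
```

The product is roughly 94 trillion cells, which the slide rounds to 100 trillion, so only a very small fraction of possible substance-assay combinations has been measured.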
5,178 compounds
· 1,300 off-patent FDA-approved drugs
· 700 bioactive tool compounds
· 2,000+ screening hits (MLPCN and others)
3,712 genes (shRNA + cDNA)
· targets/pathways of FDA-approved drugs (n=900)
· candidate disease genes (n=600)
· community nominations (n=500+)
15 cell types
· Banked primary cell types
· Cancer cell lines
· Primary hTERT immortalized
· Patient derived iPS cells
· 5 community nominated
Protein
Cancer markers
Transplant rejection markers
Preeclampsia: large cause of maternal and fetal death
• Incidence
– 5-8% of all pregnancies in the U.S. and worldwide
– 4.1 million births in the U.S. in 2009
– Up to 300K cases of preeclampsia annually in the U.S.
• Mortality
– Responsible for 18% of all maternal deaths in the U.S.
– Maternal death in 56 out of every 100,000 live births in the U.S.
– Neonatal death in 71 out of every 100,000 live births in the U.S.
• Cost
– $20 billion in direct costs in the U.S. annually
– Average hospital stay of 3.5 days
Linda Liu, Matt Cooper, Bruce Ling
New markers for preeclampsia
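[Figure: marker concentration (ng/ml) in control vs. preeclampsia, plotted against gestational age (weeks). Panels: GA 23-34 weeks and GA > 34 weeks. Groups: control N=16 vs. preeclampsia N=15; control N=16 vs. preeclampsia N=17. p values shown: 3.49 x 10^-4, 1.79 x 10^-5, and 1.92 x 10^-8.]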
March of Dimes Prematurity Research Center at Stanford University School of Medicine
Linda Liu, Bruce Ling
Sequencing Excitement
• 454/Roche, Life Technologies
• Helicos: $30k genome
• Pacific Biosciences: sequence human genome in 15 minutes
• Run times in minutes at a cost of hundreds of dollars
• Complete Genomics: 80 genomes/day
• Ion Torrent and Illumina: ~$1500 per genome
• Oxford: USB stick
Lancet, 375:1525, May 1, 2010.
Credit: Euan Ashley, Russ Altman, Steve Quake, Lancet
• Study published in 2008 in Inflammatory Bowel Disease
• Crohn’s Disease and Ulcerative Colitis
• Investigated 9 loci in 700 Finnish IBD patients
• We record 100+ items – GWAS, non-GWAS papers – Disease, Phenotype – Population, Gender – Alleles and Genotypes – p-value (and confidence) – Odds ratio (and confidence) – Technology, Study design – Genetic model
• Mapped to UMLS concepts
Rong Chen, Optra Systems
• Study published in 2009 in Rheumatology
• Ankylosing spondylitis
• Investigated 8 SNPs in IL23R in 2000 UK case-control patients
• Tables can be rotated
• NLP is hard
What are the alleles for rs1004819?
Alleles for rs1004819 are C and T
~11% of records reported genotypes in the negative strand
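Records that report genotypes on the negative strand have to be complemented before they can be compared with the reference alleles. The sketch below shows one way this reconciliation could work; the function name and logic are illustrative assumptions, not the curation procedure used for VARIMED. Only the reference alleles C and T for rs1004819 come from the slide.

```python
# Watson-Crick complement for each base.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def to_forward_strand(alleles, reference_alleles):
    """Express reported alleles on the reference strand.

    Returns a sorted list of alleles, or None when the reported alleles
    match the reference on neither strand. Hypothetical helper.
    Caveat: A/T and C/G SNPs look identical on both strands, so strand
    cannot be inferred from the alleles alone for those.
    """
    observed = set(alleles)
    reference = set(reference_alleles)
    if observed <= reference:
        return sorted(observed)  # already on the reference strand
    flipped = {COMPLEMENT[a] for a in observed}
    if flipped <= reference:
        return sorted(flipped)  # reported on the negative strand
    return None  # cannot be reconciled; needs manual review


# rs1004819 has alleles C and T; a paper reporting G and A used the other strand.
print(to_forward_strand(["G", "A"], ["C", "T"]))
```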
Number of papers curated: ~19,000
Number of records: ~1.6 million
Distinct SNPs: ~473,000
Diseases and phenotypes: ~7,400
Rong Chen, Anil Patwardhan, Michael Clark, Optra Systems, Personalis
VARIMED: Variants Informing Medicine
Chen R, Davydov EV, Sirota M, Butte AJ. PLoS One. 2010 October: 5(10): e13574.
Diseases and Traits
• Risk factors are associated with an increased likelihood of developing a given disease
– Smoking → chronic obstructive pulmonary disease
• Risk factors are identified for diseases through large scale epidemiological studies, which are resource intensive
• GWAS have identified genetic variants for thousands of diseases and traits
• If traits and diseases share the same associated genetic variants, could the trait be used to suggest risk factors for disease?
Li L, Ruau DJ, Patel CJ, Weber SC, Chen R, Tatonetti NP, Dudley JT, Butte AJ. Science Translational Medicine, 2014, 6(234).
Li Li
Identify significant disease-trait genetic associations and clinically validate using EMR data
[Figure: study workflow. Genetics module: diseases (n=201) and traits (n=249) from VARIMED (p < 1e-8); after filtering on gene counts > 3, diseases (n=69) and traits (n=85). TF-IDF weighting, cosine distance, and random shuffling at q ≤ 0.01 yield disease-trait pairs (n=120): published findings (n=94) and novel predictions (n=26), grouped into disease modules (n=8) and trait modules (n=7). Clinical validation in the EMR cohort places traits relative to the 1st diagnosis (1st dx): risk factors (before dx), diagnostic tests (1st dx), complications (after dx).]
Assessing significance of disease-trait (D-T) pair
• Each gene within an individual disease or trait is weighted by taking into account the frequency of the gene: Term Frequency–Inverse Document Frequency (TF-IDF)
• $\text{tf-idf}(i, j) = \text{tf}(i, j) \times \text{idf}_i = \frac{n_{i,j}}{\sum_k n_{k,j}} \times \log\frac{D}{D_i}$, which adjusts the score tf(i, j) by taking into account the popularity level of gene i.
• e.g., with 154 diseases and traits (D+T), 28 genes in Alzheimer's disease (AD), and 5 genes in ESR, CR1 was in common
• tf-idf(AD) = 1/28 x log10(154/2) = 0.067
• tf-idf(ESR) = 1/5 x log10(154/2) = 0.377
• The D-T distance score was calculated using cosine distance to evaluate similarity between all pairs.
• Genes were randomly sampled across all the traits and the D-T similarity was recalculated; this was repeated 1,000 times, and the q value was generated based on the number of samplings.
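The Alzheimer's disease and ESR example above can be reproduced in a few lines. This is a minimal sketch, assuming the logarithm is base 10 (as the "log(154/2,10)" notation on the slide suggests) and that CR1 appears in 2 of the 154 diseases and traits; the function and parameter names are illustrative.

```python
import math


def tf_idf(gene_count, total_gene_count, n_documents, n_documents_with_gene):
    """TF-IDF weight of one gene within one disease or trait.

    tf  = share of the disease's (or trait's) genes accounted for by this gene
    idf = log10(all diseases and traits / those that contain this gene)
    """
    tf = gene_count / total_gene_count
    idf = math.log10(n_documents / n_documents_with_gene)
    return tf * idf


# CR1 is one of 28 genes for Alzheimer's disease and one of 5 genes for ESR.
ad_score = tf_idf(1, 28, 154, 2)
esr_score = tf_idf(1, 5, 154, 2)
print(round(ad_score, 3), round(esr_score, 3))  # 0.067 0.377
```

The same gene gets a higher weight in ESR than in Alzheimer's disease because ESR has fewer associated genes, so each one carries a larger share of the profile.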
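$\text{cosine similarity}(D, T) = \frac{D \cdot T}{\lVert D \rVert \, \lVert T \rVert} = \frac{\sum_{i=1}^{n} D_i \times T_i}{\sqrt{\sum_{i=1}^{n} D_i^2} \times \sqrt{\sum_{i=1}^{n} T_i^2}} = 0.9274524$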
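The cosine score and the random-sampling step can be sketched as follows. This is an illustration under stated assumptions, not the published implementation: the gene identifiers and weights are made-up toy values, the resampling scheme (reassigning the trait's weights to randomly drawn genes) is one plausible reading of the slide, and the function returns an empirical p-value rather than a multiple-testing-corrected q value.

```python
import math
import random


def cosine_similarity(d, t):
    """Cosine similarity between two gene-weight profiles (dict: gene -> weight)."""
    dot = sum(w * t[g] for g, w in d.items() if g in t)
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_t = math.sqrt(sum(w * w for w in t.values()))
    if norm_d == 0 or norm_t == 0:
        return 0.0
    return dot / (norm_d * norm_t)


def permutation_p_value(d, t, gene_pool, n_permutations=1000, seed=0):
    """Empirical p-value: how often a randomly drawn trait profile scores
    at least as high against the disease as the observed one."""
    rng = random.Random(seed)
    observed = cosine_similarity(d, t)
    weights = list(t.values())
    hits = 0
    for _ in range(n_permutations):
        # Keep the trait's weights but attach them to randomly sampled genes.
        shuffled_genes = rng.sample(gene_pool, len(weights))
        shuffled_t = dict(zip(shuffled_genes, weights))
        if cosine_similarity(d, shuffled_t) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Toy data: hypothetical gene identifiers and TF-IDF weights.
gene_pool = [f"G{i}" for i in range(50)]
disease = {"G0": 0.5, "G1": 0.2, "G2": 0.1}
trait = {"G0": 0.4, "G1": 0.3}

score = cosine_similarity(disease, trait)
p_value = permutation_p_value(disease, trait, gene_pool)
print(f"cosine similarity: {score:.3f}, empirical p-value: {p_value:.4f}")
```

Fixing the random seed keeps the sketch reproducible; a real analysis would also correct for the number of disease-trait pairs tested.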
Categorizations for known D-T pairs and discovery of potential confounders in GWAS studies
[Figure: 93 known pairs arranged by timing of disease progression, in groups of 38, 27, and 28 pairs. Categories shown: risk factor, diagnostic test, consequence. Each panel links trait (T) and disease (D) through shared genetic variants.]
Diagnostic tests where traits occur at the same time as disease onset
Antibody titer: Hepatitis B vaccine response. Png et al, Hum Mol Genet, 2011
Even though this GWAS did not explicitly include participants with the associated autoimmune diseases, our approach inferred known relationships between diseases and traits based on their shared genetic architecture
[Figure: diagnostic test. Trait (T) and disease (D) linked through shared genetic variants.]
Significant genes shared between antibody titer and 16 autoimmune diseases
Disease | Common Genes | Genes Shared | q-value
Alopecia areata | 4 | BTNL2; C6orf10; RDBP; TNXB | <0.001
Ankylosing spondylitis | 2 | BTNL2; LOC100507436 | 0.001
Asthma | 4 | BTNL2; C6orf10; HLA-DPA1; NOTCH4 | <0.001
Biliary liver cirrhosis | 3 | BTNL2; C6orf10; HLA-DPB1 | 0.003
Chronic hepatitis B | 2 | HLA-DPA1; HLA-DPB1 | <0.001
HIV infection | 7 | C6orf10; HLA-C; LOC100507436; NOTCH4; PRRC2A; RDBP; TNXB | <0.001
Membranous nephropathy | 15 | AGPAT1; BAG6; BTNL2; C6orf10; EHMT2; GPANK1; LY6G5B; LY6G6C; NOTCH4; PRRC2A; RDBP; RNF5; SLC44A4; TNXB; ZBTB12 | <0.001
Multiple sclerosis | 7 | AGPAT1; BAG6; BTNL2; C6orf10; EHMT2; NOTCH4; TNXB | <0.001
Neonatal lupus | 3 | BAG6; C6orf10; ZBTB12 | <0.001
Primary biliary cirrhosis | 3 | BTNL2; C6orf10; HLA-DPB1 | 0.005
Rheumatoid arthritis | 20 | AGPAT1; BAG6; BTNL2; C6orf10; EHMT2; GPANK1; HLA-C; HLA-DPA1; HLA-DPB1; LOC100507436; LY6G5B; LY6G6C; LY6G6F; NOTCH4; PRRC2A; RDBP; RNF5; SLC44A4; TNXB; ZBTB12 | <0.001
Systemic lupus erythematosus | 9 | BAG6; BTNL2; C6orf10; GPANK1; HLA-DPB1; NOTCH4; PRRC2A; TNXB; ZBTB12 | <0.001
Systemic sclerosis | 3 | HLA-DPA1; HLA-DPB1; NOTCH4 | <0.001
Type 1 diabetes | 5 | BAG6; BTNL2; C6orf10; HLA-C; HLA-DPB1 | 0.001
Vitiligo | 6 | AGPAT1; BTNL2; NOTCH4; RNF5; SLC44A4; TNXB | <0.001
Wegener's granulomatosis | 2 | HLA-DPA1; HLA-DPB1 | <0.001
Risk factors where traits occur prior to the disease onset and may accompany disease
Trait | Disease | Common Genes | Genes Shared | q-value
Smoking | Chronic obstructive pulmonary disease (COPD) | 3 | AGPHD1; CHRNA3; RAB4B | <0.001
Known clinical study: Smoking is the primary risk factor for COPD, although little is known about the pathogenesis linking smoking and COPD. Pauwels et al, 2001; Vestbo et al, 2012
In GWAS study: Six GWAS studies in VARIMED are related to COPD, and their COPD cohorts all consist of patients who smoke. Cho et al, 2012; Pillai SG, 2010; Wang et al, 2010; Cho et al, 2010; Lambrechts et al, 2010; Pillai SG, 2009
As COPD occurs after smoking, the variants associated with COPD could be influenced by smoking, and the genetic variants for COPD could be unmasked if smoking is excluded as a confounder in GWAS.
[Figure: risk factor. Smoking and COPD linked through shared genetic variants.]
Consequences, where traits occur after the disease onset: alcohol dependence syndrome (ADS)
Trait | Common Genes | Genes Shared | q-value
Alanine aminotransferase (ALT) levels | 1 | C12orf51 | 0.001
Cholesterol levels | 3 | ALDH2; BRAP; C12orf51 | 0.001
HDL cholesterol (HDL-C) levels | 2 | C12orf51; OAS3 | <0.001
Known clinical study: The high HDL criterion was observed with triple frequency in the ADS group, a high cholesterol diet was associated with ADS patients, and ALT levels have been seen to increase with daily alcohol intake in patients who developed ADS. Kahl et al, 2010; Imhof et al, 2001; Gross GA, 1994
In GWAS study: 3 genes for cholesterol levels reported by Kato et al. and 2 genes for ALT and HDL-C reported by Young et al. could be biased by the effect of alcohol, as the authors did not adjust for alcohol intake or control for drinking habits on these genes in their GWAS studies. Kato et al, 2011; Kamatani et al, 2010
GWAS to identify concrete genetic variants for these three clinical measurements should be performed in patients without ADS as a confounder.
[Figure: consequence. ADS linked to ALT and HDL-C through shared genetic variants.]
27 novel pairs
Trait | Disease | Common Genes | Genes Shared | q-value
Mean corpuscular volume | Acute lymphoblastic leukemia | 1 | IKZF1 | 0.001
Mean cell hemoglobin concentration | Alcohol dependence | 1 | ALDH2 | 0.005
Platelet count | Alcohol dependence | 1 | C12orf51 | 0.007
Lung function | Alopecia areata | 1 | AGER | 0.008
Erythrocyte sedimentation rate | Alzheimer's disease | 1 | CR1 | 0.004
Prostate-specific antigen levels | Basal cell carcinoma | 1 | CLPTM1L | 0.004
Eye color | Chronic lymphocytic leukemia | 1 | IRF4 | 0.006
Freckles | Chronic lymphocytic leukemia | 1 | IRF4 | 0.008
Blood pressure | Esophageal cancer | 3 | ALDH2; C12orf51; PLCE1 | 0.009
Factor VII coagulant activity | Esophageal cancer | 1 | ADH4 | 0.008
Serum magnesium levels | Gastric cancer | 3 | MUC1; THBS3; TRIM46 | <0.001
Prostate-specific antigen levels | Glioma | 1 | TERT | 0.005
Alpha linolenic acid levels | Glucose intolerance | 1 | FADS1 | 0.01
Alanine aminotransferase levels | Hypertension | 1 | C12orf51 | 0.003
Serum transferrin levels | Hypertension | 1 | HFE | 0.005
Smoking | Kawasaki disease | 1 | RAB4B | 0.003
Prostate-specific antigen levels | Lung cancer | 2 | CLPTM1L; TERT | 0.001
Homocysteine levels | Melanoma | 1 | C16orf55 | 0.01
Protein C levels | Melanoma | 2 | NCOA6; PIGU | <0.001
Transferrin receptor levels | Metabolic syndrome | 3 | APOA5; BUD13; ZNF259 | <0.001
PR interval | Open-angle glaucoma | 1 | CAV1 | 0.002
PR interval | Restless legs syndrome | 1 | MEIS1 | 0.003
Bone mineral density | Sudden cardiac arrest | 1 | ESR1 | 0.006
Acenocoumarol maintenance dosage | Systemic lupus erythematosus | 2 | ITGAM; ITGAX | 0.004
Platelet count | Testicular cancer | 1 | BAK1 | 0.003
Prostate-specific antigen levels | Testicular cancer | 2 | CLPTM1L; TERT | <0.001
Alkaline phosphatase levels | Venous thromboembolism | 1 | ABO | 0.008
Independent patient cohort validation: clinical data warehouses
• STRIDE: clinical data warehouse with ICD9 diagnosis codes, CPT procedure codes, and lab results on over 1.7 million pediatric and adult patients at Stanford Hospital and Clinics; independent cohort from 1/1/2005 to 7/15/2012
• Collaborations also with Columbia University and Mount Sinai School of Medicine to validate findings
• Time frame for analysis: within one year before the 1st disease diagnosis or within one year after the 1st disease diagnosis
[Figure: timeline around the 1st diagnosis (1st Dx) comparing lab values in the target disease (case) and non-target disease (control) groups, 1 year before and 1 year after.]
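The one-year analysis window around the first diagnosis can be sketched as a simple date filter. This is a hypothetical illustration: the record layout, function name, and lab values are made up and do not reflect the STRIDE schema, and a year is approximated as 365 days.

```python
from datetime import date, timedelta

# Assumption: one year is treated as 365 days.
WINDOW = timedelta(days=365)


def labs_in_window(lab_results, first_dx_date, when):
    """Select lab values within one year before or after the first diagnosis.

    lab_results: list of (date, value) tuples for one patient.
    when: "before" or "after". The diagnosis date itself falls in both windows.
    """
    if when == "before":
        start, end = first_dx_date - WINDOW, first_dx_date
    elif when == "after":
        start, end = first_dx_date, first_dx_date + WINDOW
    else:
        raise ValueError("when must be 'before' or 'after'")
    return [value for day, value in lab_results if start <= day <= end]


# Toy patient record with made-up lab values.
labs = [
    (date(2010, 1, 1), 1.0),
    (date(2010, 6, 1), 2.0),
    (date(2011, 3, 1), 3.0),
    (date(2012, 6, 1), 4.0),
]
first_dx = date(2010, 9, 1)

before = labs_in_window(labs, first_dx, "before")
after = labs_in_window(labs, first_dx, "after")
print(before, after)
```

Lab values selected this way for the target disease group (cases) would then be compared with those from the non-target disease group (controls).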
Serum magnesium levels and gastric cancer
immport.niaid.nih.gov
Digital comparative effectiveness
Find precision subsets
If entry criteria are the same, outcome measures are the same, and studies are comparable, can perform a “meta-trial”
Take Home Points
• Personalized medicine ≥ DNA. Will include other clinical, molecular, and environmental measures.
• We need new investigators who can imagine basic questions to ask of these repositories of clinical and genomic measurements.
• Bioinformatics is not just about building tools. We know our tools; we should use them first. Don’t be afraid to test your ideas.
Funded post-doctoral positions in Translational Bioinformatics
Contact Atul Butte
Collaborators
• Jeff Wiser, Patrick Dunn, Mike Atassi / Northrop Grumman
• Ashley Xia and Quan Chen / NIAID
• Takashi Kadowaki, Momoko Horikoshi, Kazuo Hara, Hiroshi Ohtsu / U Tokyo
• Kyoko Toda, Satoru Yamada, Junichiro Irie / Kitasato Univ and Hospital
• Shiro Maeda / RIKEN
• Alejandro Sweet-Cordero, Julien Sage / Pediatric Oncology
• Mark Davis, C. Garrison Fathman / Immunology
• Russ Altman, Steve Quake / Bioengineering
• Euan Ashley, Joseph Wu, Tom Quertermous / Cardiology
• Mike Snyder, Carlos Bustamante, Anne Brunet / Genetics
• Jay Pasricha / Gastroenterology
• Rob Tibshirani, Brad Efron / Statistics
• Hannah Valantine, Kiran Khush / Cardiology
• Ken Weinberg / Pediatric Stem Cell Therapeutics
• Mark Musen, Nigam Shah / National Center for Biomedical Ontology
• Minnie Sarwal / Nephrology
• David Miklos / Oncology
Support
• Lucile Packard Foundation for Children's Health
• NIH: NIAID, NLM, NIGMS, NCI, NIDDK, NHGRI, NIA, NHLBI, NCATS
• March of Dimes
• Hewlett Packard
• Howard Hughes Medical Institute
• California Institute for Regenerative Medicine
• Luke Evnin and Deann Wright (Scleroderma Research Foundation)
• Clayville Research Fund
• PhRMA Foundation
• Stanford Cancer Center, Bio-X, SPARK
• Tarangini Deshpande
• Alan Krensky, Harvey Cohen
• Hugh O’Brodovich
• Isaac Kohane
Admin and Tech Staff
• Susan Aptekar
• Jen Cory
• Boris Oskotsky