Informatics of Cancer Genomes
Lincoln Stein, Ontario Institute for Cancer Research
AMIA Summit on Translational Genomics, San Francisco, March 2011
The Genome Project’s Challenge
Translating genome sciences into improved healthcare has been more difficult than we expected!
Promises of cancer genome research to patients, health care providers and payers
Risk prediction & prognosis: Germline variants predict risk and help optimize screening programs to identify tumours at earlier, more curable stages;
Diagnosis: Cancer diagnosis will be more precise, allowing optimization of treatment interventions;
New therapies will be developed that target specific alterations in cancer cells, reducing the need for highly toxic, nonspecific chemotherapies.
A few successes – potential far from being realized
Cancer is a Complex Genomic Disease
[Figure panels: Healthy Breast Tissue, Early Cancer, Invasive Cancer]
Cancer is a Genomic Disease
Why Apply Genomics to Cancer?
Every cancer genome is different
Cancers currently treated with a one-size-fits-all strategy (w/ a few exceptions)
Knowledge of genomic changes will inform therapy
Challenges in Understanding Cancer
Tumours are heterogeneous & evolve over time.
Host factors are poorly understood.
Different sets of mutated genes may lead to similar tumours.
Different tumour types may have similar sets of mutated genes.
Deep & broad sequencing necessary.
Whole Genome Sequencing: sequencing platform reagents only
[Chart: cost axis on a log scale from $10,000,000 to $1,000; year axis from 2005 to 2010]
OICR Sequencing/Biocomputing Platform
>7 terabases per month (2000 human genomes)
capacity and growing
5500 cores
185 nodes with 16 GB RAM
221 nodes with 24 GB RAM
32 nodes with 96 GB RAM
5 nodes with 256 GB RAM
2.5PB of online storage
1Gb, 10Gb and fibre connectivity
ABI SOLiD 5500
Illumina GA1
Illumina HiSeq 2000
Pac Bio
Three Parts of Talk
• Part 1 – International Cancer Genome Consortium (ICGC)
• Part 2 – Network Analysis of Cancer Genomes
• Part 3 – Genome Pathways Sequencing (GPS)
Part 1: International Cancer Genome Consortium
Discover and catalog the driver genes in cancer tissues
Rationale for an international consortium
The scope is huge, such that no country can do it all;
Coordinated cancer genome initiatives will reduce duplication of effort for common tumours and ensure complete studies for many less frequent forms of cancer;
Standardization and uniform quality measures across studies will enable the merging of datasets, increasing power to detect additional targets;
The spectrum of many cancers varies across the world for many tumour types;
The ICGC will accelerate the dissemination of genomic and analytical methods across participating sites and the user community.
The Strategy
Identify genomic abnormalities in 50 different major cancer types
Make the data available to the research community & public
Identify genome changes
…GATTATTCCAGGTAT… …GATTATTGCAGGTAT… …GATTATTGCAGGTAT…
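As a minimal sketch of what "identifying genome changes" means for the three fragments above, the snippet below compares equal-length sequences position by position and reports mismatches. Treating the first fragment as the reference is an assumption made for illustration, since the slide does not label the sequences. Real pipelines align reads and model sequencing error rather than comparing strings.

```python
def find_substitutions(reference, sample):
    """Return (position, ref_base, sample_base) for each mismatch (0-based)."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be the same length for this simple comparison")
    return [(i, r, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


# Fragments copied from the slide; which one is the reference is assumed.
reference = "GATTATTCCAGGTAT"
samples = ["GATTATTGCAGGTAT", "GATTATTGCAGGTAT"]

for sample in samples:
    print(find_substitutions(reference, sample))
```

Both samples differ from the assumed reference at the same single position, which is the kind of recurrent change the project sets out to catalogue.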
Data being Collected
• Clinical data – tumor pathology, age, gender, treatment, survival (partly controlled access)
• Germline data – SNPs (controlled access)
• Somatic mutations in tumour
• Copy number variations
• RNA abundance & splicing
• DNA methylation
Big Data
50 tumor types and/or subtypes
A minimum of 500 specimens and control tissues per tumor type
50,000 Human Genome Projects
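The "50,000" figure follows from simple arithmetic if each specimen contributes two genomes, one from the tumour and one from matched control tissue. That pairing is an assumption about how the slide's number was derived:

```python
tumour_types = 50          # from the slide
specimens_per_type = 500   # minimum, from the slide
genomes_per_specimen = 2   # assumption: tumour plus matched control tissue

total_genomes = tumour_types * specimens_per_type * genomes_per_specimen
print(total_genomes)
```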
Federate or Centralize?
[Diagram: Centralized Model vs. Federated Model. Labels shown: Source DB, Staging Area, Mart DB; Site 1, Site 2, Site 3]
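The difference between the two models can be sketched in a few lines. This is an illustration only: the site names, records and function names are hypothetical, and in-memory lists stand in for what would be remote databases. In the centralized version all records are copied into one staging list before querying. In the federated version the query is sent to each site's mart and the answers are merged.

```python
# Hypothetical per-site marts; in a real deployment each would be a remote database.
SITE_MARTS = {
    "site1": [{"gene": "KRAS", "donor": "D1"}, {"gene": "TP53", "donor": "D2"}],
    "site2": [{"gene": "KRAS", "donor": "D3"}],
    "site3": [{"gene": "SMAD4", "donor": "D4"}],
}


def query_mart(records, gene):
    """Run one query against a single mart."""
    return [r for r in records if r["gene"] == gene]


def federated_query(site_marts, gene):
    """Send the query to every site and merge the answers, tagging the source."""
    merged = []
    for site, records in sorted(site_marts.items()):
        for hit in query_mart(records, gene):
            merged.append({**hit, "site": site})
    return merged


def centralized_query(site_marts, gene):
    """Copy all records into one staging list first, then query once."""
    staged = [
        {**record, "site": site}
        for site, records in sorted(site_marts.items())
        for record in records
    ]
    return query_mart(staged, gene)


print(federated_query(SITE_MARTS, "KRAS"))
```

Both functions return the same hits. What differs is where the data lives: the federated version never moves a site's full dataset, which matters when each site holds large or access-restricted data.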
BioMart: A Federated Data Warehouse (EBI, CSHL, OICR)
[Diagram: BioMart instances linked to Reactome, Ensembl and HapMap; tools shown: Bioclipse, Taverna, Galaxy, Cytoscape, BioConductor]
Arek Kasprzyk
Queries Offered
Search for somatic mutations affecting a gene of interest. Optionally filtered by various criteria.
Search for mutated genes affecting a tumor type of interest. Optionally filtered by various clinical criteria.
Search for donors and samples of interest. Optionally filtered by clinical & molecular criteria.
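The three query types share one shape: pick records of a kind, then narrow them with optional filters. The sketch below shows that shape on made-up records. The field names and the `search` helper are hypothetical and are not the portal's actual schema or API.

```python
# Hypothetical records and field names, for illustration only.
MUTATIONS = [
    {"gene": "KRAS", "tumour_type": "pancreatic", "consequence": "missense", "donor": "D1"},
    {"gene": "KRAS", "tumour_type": "colorectal", "consequence": "missense", "donor": "D2"},
    {"gene": "TP53", "tumour_type": "pancreatic", "consequence": "nonsense", "donor": "D1"},
]


def search(records, **criteria):
    """Return records matching every supplied field=value criterion."""
    return [r for r in records if all(r.get(k) == v for k, v in criteria.items())]


# Somatic mutations affecting a gene of interest, with and without a filter.
kras_all = search(MUTATIONS, gene="KRAS")
kras_pancreatic = search(MUTATIONS, gene="KRAS", tumour_type="pancreatic")

# Mutated genes in a tumour type of interest.
pancreatic_genes = sorted({r["gene"] for r in search(MUTATIONS, tumour_type="pancreatic")})

print(len(kras_all), len(kras_pancreatic), pancreatic_genes)
```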
Controlled Data Tier
Potentially identifiable information is in a controlled tier.
Includes germline mutations; NOT somatic mutations.
Apply to Data Access Control Office (DACO) and certify you will not attempt to reidentify patients.
Send list of OpenIDs to allow access.
After review of application, DACO will authorize DCC to accept the OpenIDs indicated for controlled access.
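The two-tier rule can be pictured as an allowlist check. The sketch below is a simplification under stated assumptions: the field names and the example OpenID URLs are placeholders, and the real DCC authorization flow is not described in the talk beyond the steps listed.

```python
# Hypothetical field names; the slide only says germline data is controlled
# and somatic mutations are not.
OPEN_FIELDS = {"somatic_mutations", "copy_number", "expression"}
CONTROLLED_FIELDS = {"germline_variants"}

# OpenIDs that DACO has authorized the DCC to accept (placeholder value).
authorized_openids = {"https://openid.example.org/researcher-a"}


def can_access(openid, field):
    """Open-tier fields are public; controlled-tier fields need an authorized OpenID."""
    if field in OPEN_FIELDS:
        return True
    if field in CONTROLLED_FIELDS:
        return openid in authorized_openids
    raise KeyError(f"unknown field: {field}")


print(can_access(None, "somatic_mutations"))
print(can_access("https://openid.example.org/other", "germline_variants"))
```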
Part 2: Network Analysis of Cancer Genomes
Discover patterns & mechanisms of altered genes in cancer
Why Network Analysis is Useful
• No single mutated gene is necessary & sufficient to cause cancer.
• Typically one or two common mutations (e.g. TP53) plus many rare mutations.
• Network analysis reduces hundreds of mutated genes to fewer than a dozen mutated pathways.
• Can elucidate mechanism of action of drivers.
Reactome Pathway Coverage
Curated Human Data – Version 35
5078 proteins
4166 reactions
3870 complexes
1112 pathways
Expanding Reactome’s Coverage
Curated Pathways Uncurated Information
human PPI
PPI inferred from fly, worm & yeast
PPI from text mining
Gene co-expression
GO annotation on biological processes
Protein domain- domain interactions
CellMap TRED
GeneWays
Annotated Functional Interactions
Naïve Bayes Classifier
Predicted Functional Interactions
Wu et al. (2010) Genome Biology
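The flow above trains a naïve Bayes classifier on annotated functional interactions and uses it to score protein pairs from the uncurated evidence sources. The following is a minimal sketch of that idea, not the published implementation: it treats each evidence source as a yes/no feature, uses Laplace smoothing, and trains on a handful of made-up pairs. The feature names and all training values are assumptions for illustration.

```python
import math

# One binary feature per evidence source listed on the slide (names are illustrative).
FEATURES = [
    "human_ppi", "ortholog_ppi", "text_mining",
    "coexpression", "shared_go_process", "domain_interaction",
]


def train(examples, labels, alpha=1.0):
    """Estimate class priors and per-feature Bernoulli likelihoods with Laplace smoothing."""
    model = {}
    for cls in (0, 1):
        rows = [x for x, y in zip(examples, labels) if y == cls]
        prior = (len(rows) + alpha) / (len(examples) + 2 * alpha)
        likelihood = [
            (sum(row[j] for row in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(len(FEATURES))
        ]
        model[cls] = (prior, likelihood)
    return model


def predict_proba(model, x):
    """Posterior probability that a protein pair is a functional interaction."""
    log_scores = {}
    for cls, (prior, likelihood) in model.items():
        score = math.log(prior)
        for value, p in zip(x, likelihood):
            score += math.log(p if value else 1.0 - p)
        log_scores[cls] = score
    top = max(log_scores.values())
    exp_scores = {cls: math.exp(s - top) for cls, s in log_scores.items()}
    return exp_scores[1] / (exp_scores[0] + exp_scores[1])


# Toy training pairs (made up): 1 = annotated FI from curated pathways, 0 = not.
X = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 1, 1],
    [0, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
]
y = [1, 1, 1, 0, 0, 0]

model = train(X, y)
strong = predict_proba(model, [1, 1, 1, 1, 1, 1])  # every evidence source agrees
weak = predict_proba(model, [0, 0, 0, 0, 0, 0])    # no supporting evidence
print(round(strong, 3), round(weak, 3))
```

A pair supported by every evidence source scores close to 1 and a pair with no support scores close to 0. Pairs scoring above a chosen cutoff would become predicted functional interactions.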
Functional Interaction (FI) Network
10,956 proteins (9,542 genes)
209,988 FIs
5% of network shown here
A Paradigm for Interpreting Gene Lists
Reactome Functional Interaction network
Extract mutated, overexpressed, underexpressed, expanded/deleted genes
Add linker genes
Disease subnetwork
Apply community clustering algorithms
Disease “modules”
Disease gene prediction
Sample classification
Hypothesis generation
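The workflow can be sketched on a toy graph. Two simplifications are made and should be read as assumptions: a "linker" is taken to be any gene outside the list that directly connects at least two genes in it, and connected components stand in for a real community clustering algorithm. The edges and gene names are invented.

```python
from collections import defaultdict


def build_graph(edges):
    """Undirected adjacency sets from a list of (gene, gene) interactions."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def find_linkers(graph, genes_of_interest):
    """Genes outside the list that directly connect at least two genes in it."""
    genes = set(genes_of_interest)
    return {
        node for node, neighbours in graph.items()
        if node not in genes and len(neighbours & genes) >= 2
    }


def induced_subnetwork(graph, nodes):
    """Keep only the chosen nodes and the edges between them."""
    nodes = set(nodes)
    return {n: graph[n] & nodes for n in nodes}


def connected_components(subnetwork):
    """Stand-in for community clustering: group nodes reachable from each other."""
    seen, modules = set(), []
    for start in sorted(subnetwork):
        if start in seen:
            continue
        stack, module = [start], set()
        while stack:
            node = stack.pop()
            if node in module:
                continue
            module.add(node)
            stack.extend(subnetwork[node] - module)
        seen |= module
        modules.append(module)
    return modules


# Toy FI edges (hypothetical, not taken from the Reactome FI network).
edges = [("A", "L1"), ("L1", "B"), ("B", "C"), ("D", "E"), ("E", "X"), ("X", "Y")]
graph = build_graph(edges)
altered = {"A", "B", "C", "D", "E"}

linkers = find_linkers(graph, altered)
subnetwork = induced_subnetwork(graph, altered | linkers)
modules = connected_components(subnetwork)
print(linkers, modules)
```

Here the linker joins a gene that would otherwise sit alone to the rest of its module, which is the reason for adding linkers before clustering.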
OICR Pancreatic Cancer Whole Genome Sequencing
• 5 Patients
– Primary, xenograft, cell line, normal
• Whole genome sequencing
• Somatic mutation calling
• Very conservative – only mutations appearing in primary+xenograft+cell line kept
• 310 genes with non-synonymous (NS) somatic mutations
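The conservative filter amounts to a set intersection across the three tumour-derived samples. The sketch below also subtracts calls seen in the normal sample, which is an assumption about how "somatic" was enforced. The coordinates are invented.

```python
def conservative_somatic_calls(primary, xenograft, cell_line, normal):
    """Keep mutations seen in all three tumour-derived samples and absent from normal."""
    return (set(primary) & set(xenograft) & set(cell_line)) - set(normal)


# Hypothetical calls, written as (chromosome, position, ref, alt).
primary = {("chr12", 100, "C", "T"), ("chr17", 200, "G", "A"), ("chr1", 50, "A", "G")}
xenograft = {("chr12", 100, "C", "T"), ("chr17", 200, "G", "A")}
cell_line = {("chr12", 100, "C", "T"), ("chr17", 200, "G", "A"), ("chr3", 10, "T", "C")}
normal = {("chr17", 200, "G", "A")}

kept = conservative_somatic_calls(primary, xenograft, cell_line, normal)
print(kept)
```

Requiring agreement across all three samples trades sensitivity for specificity: a call present in only one sample is dropped even if it is real.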
OICR Pooled Data from 5 Pancreatic Cancer Genomes
(108 mutated genes in network)
p53 & p38 MAPK signaling
KRAS signaling
Hedgehog signaling
Wnt & Cadherin signaling
Zinc fingers
Olfactory signaling
Transcription
Apoptosis
Syndecan-3-mediated signaling
Hedgehog module – 11 genes
BMP8B = bone morphogenetic protein 8b
[Figure legend: patients PCSI0002, PCSI0005, PCSI0006, PCSI0022, PCSI0024; genes mutated in 2 patients; genes mutated in 3 patients; linker genes]
Comparison to 2008 Johns Hopkins Dataset
SCG2 SLC1A6 SMAD4 SMARCA4 ST6GAL2 TGFBR2 TNR TPO ZNF835 DZ4
Genes mutated in ≥ 3 patients Genes mutated in ≥ 2 patients
p-value = 1.87E-4
Jones et al. Science (2008) 321: 1801
ABLIM2 AHNAK BAI3 CDH10 CDKN2A CTNNA2 DPP6 FMN2 GPR133 LRP1B
MYH2 ODZ4 OVCH1 PCDH15 PCDH18 PIK3CG PPP1R3A PREX2 PXDN RYR2
SEZ6L SLC45A1 TP53 TRPM3 TTN USP20 ZNF443
AGXT2L2 ALDH8A1 ARHGEF7 ARID1A ARSA ATN1 BOC CAND2 CNTNAP2 DCHS2
DMD DOCK2 DUOX2 FAM123C FAT4 FRAXA GAS7 KRAS LGR6 LRRTM4
MLL3 MUC16 NANOS1 NKX2-2 NLRP4 NPY1R OTOF PKD1L2 PODN RBM27
PDE4DIP POTEH RGPD3 TBX20 TPTE2 TUBB2C WASH7P ZNF705A ZNF717 ZNF814
AGAP4 ANKRD36 AQP7 BMP8B CLEC18B FLG FRG2B HERC2 HRNR IKZF2 KRTAP5-10
LYZL2 MST1P9 NBPF1 NBPF10 NBPF14 NBPF8 NCOR1 NF1P4 NOTCH2NL OR2T34 PABPC3
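The slide reports a p-value for the overlap between the two gene lists but does not say which test produced it. One common choice for this kind of comparison is a one-sided hypergeometric test, sketched below. The counts passed in are placeholders chosen only to exercise the function and are not the talk's numbers.

```python
from math import comb


def overlap_p_value(universe, list_a, list_b, overlap):
    """P(X >= overlap) when list_b genes are drawn at random from the universe.

    universe: number of genes that could have been picked
    list_a, list_b: sizes of the two gene lists
    overlap: number of genes observed in both lists
    """
    upper = min(list_a, list_b)
    total = comb(universe, list_b)
    favourable = sum(
        comb(list_a, k) * comb(universe - list_a, list_b - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / total


# Placeholder counts, not taken from the talk.
p = overlap_p_value(universe=100, list_a=10, list_b=10, overlap=5)
print(p)
```

A small value says the two lists share more genes than random draws of the same sizes would be expected to share.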
Hopkins data: 1278 genes mutated in 24 patients
KRAS signaling
p53 signaling
Integrin signaling
Wnt & Cadherin signaling
TGFβ signaling
Heterotrimeric G-protein signaling
Cell cycle
G2/M transition
Lipid metabolism
Hedgehog signaling
Rho GTPase signaling
Muscle contraction
ADAM metallopeptidase with thrombospondin
Transcription
Zinc fingers
Very Similar at Module Level
p53 & p38 MAPK signaling
KRAS signaling
Hedgehog signaling
Wnt & Cadherin signaling
Zinc fingers
Olfactory signaling
Transcription
Apoptosis
Syndecan-3-mediated signaling
Discovering Prognostic Signatures in Cancer Module Datasets
Disease Module Map
Expression analysis of tumours from multiple patients
Principal component analysis on modules
Correlate principal components with clinical parameters
Module-Based Signatures of Breast Cancer Survival
• NEJM: van de Vijver et al. 2002
– 295 Samples, ~12,000 genes
– Event: death
• GSE4922: Ivshina et al. Cancer Res. 2006
– 249 Samples, ~13,000 genes
– Event: recurrence or death
Building the Network
• Built based on the NEJM data set
– 27 modules selected based on size cutoff 7 and average correlation cutoff 0.25.
• Validated using GSE4922.
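The module selection rule can be written down directly. The sketch below assumes "size cutoff 7" means at least 7 genes and "average correlation" means the mean Pearson correlation of expression over all gene pairs in a module. Neither detail is spelled out on the slide, and the expression values are invented.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_correlation(module, expression):
    """Mean Pearson correlation over all gene pairs in the module."""
    pairs = list(combinations(sorted(module), 2))
    return sum(pearson(expression[a], expression[b]) for a, b in pairs) / len(pairs)


def select_modules(modules, expression, min_size=7, min_avg_corr=0.25):
    """Keep modules that pass both the size and the average-correlation cutoff."""
    return [
        m for m in modules
        if len(m) >= min_size and average_correlation(m, expression) >= min_avg_corr
    ]


# Toy expression profiles across five samples (made up).
base = [1.0, 2.0, 3.0, 4.0, 5.0]
coherent = {f"c{i}": [v + i for v in base] for i in range(7)}                    # all move together
mixed = {f"m{i}": (base if i % 2 == 0 else base[::-1]) for i in range(7)}        # half move oppositely
small = {"s0": [1.0, 2.0, 3.0, 4.0, 6.0], "s1": base, "s2": base}                # too few genes

expression = {**coherent, **mixed, **small}
modules = [set(coherent), set(mixed), set(small)]

selected = select_modules(modules, expression)
print(len(selected))
```

Only the coherent seven-gene module passes both cutoffs; the mixed module fails on correlation and the small one on size.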
Summary
• Reactome, coupled with the FI network, can lead to useful insights in analyzing genomic datasets.
• Cytoscape plugin lets anyone perform the paradigmatic workflow of discovering and annotating network modules and finding potential drug targets.
• All data and software are open to the public; no licensing required.
• www.reactome.org
Part 3: Genome Pathways Sequencing (GPS)
• Collaboration between OICR & Princess Margaret Hospital.
• Recruit patients with metastatic breast, colorectal, lung & ovarian cancer who have “failed” standard therapy.
• Sequence ~1000 cancer-related genes.
• Identify “actionable” mutations.
• Route patients to clinical trials of drugs targeting their particular mutated genes.
• Assess clinical & sociological outcomes.
Why are We Doing This?
• Feasibility – can we adapt high-throughput sequencing to the clinical laboratory?
• Performance – can we turn the results around in ~3 weeks?
• Efficacy – Does targeting mutations improve health outcome?
• Sociology – How do clinicians and patients deal with genomic data?
Single Molecule Sequencing
• Pacific Biosciences RS
• Single-molecule sequencing; circular consensus
• >1000 bp reads, ~8x coverage
• 15 min/run
Mutation Consequences Knowledgebase
• ~200 cancer-related genes selected by consensus of local oncologists.
• ~800 being added from knowledgebases at MSKCC, COSMIC and NCI.
• Actionable common mutations being annotated by oncology fellows at PMH.
• Informatics system will generate a draft report; reviewed and revised by expert panel of oncologists.
• Results fed back to knowledgebase.
Actionable Mutations
• Patients with mutations in KRAS, BRAF, & PI3K referred to ongoing clinical trials with targeted inhibitors of those pathways.
– Other actionable mutations to be added as suitable trials become available.
– Does treatment with targeted therapy improve patient outcome?
• Patients with germline variants associated with increased risk of cancer will receive counseling and be offered genetic counseling for potentially affected family members.
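The routing rule described above can be sketched as a small triage function. The gene set comes from the slide. The function name, the action labels and the example germline variant are hypothetical, and a real report would be reviewed by the expert panel rather than generated automatically.

```python
# Genes named on the slide as actionable; more are to be added as trials open.
ACTIONABLE_GENES = {"KRAS", "BRAF", "PI3K"}


def triage(patient_mutations, germline_risk_variants=()):
    """Return the follow-up actions suggested by a patient's sequencing results."""
    actions = []
    hits = sorted(set(patient_mutations) & ACTIONABLE_GENES)
    if hits:
        actions.append(("refer_to_targeted_trial", hits))
    if germline_risk_variants:
        actions.append(("offer_genetic_counseling", sorted(germline_risk_variants)))
    return actions


# Example patient (invented): one actionable somatic hit, one germline risk variant.
print(triage({"KRAS", "TTN"}, germline_risk_variants={"BRCA1"}))
```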
Sociological Questions
• How will patients respond to being told that they will be denied treatment based on the absence of a targeted mutation?
• How will patients respond to learning they carry cancer risk alleles?
• How will patients & clinicians respond to “incidental” findings and nonactionable mutations?
Status of GPS Project
• Oncologists, pathologists, radiologists, clinical trials nurse and other clinical staff recruited.
• Sociologists recruited.
• Clinical & genomics databases built.
• Consequences knowledgebase prototype running.
• Study approved by IRB.
• First patient to be recruited week of March 13, 2011.
Summary
Three OICR Projects
1. ICGC – Discover cancer driver genes.
2. Reactome – Discover how driver genes relate to disease mechanisms & to clinical behavior.
3. GPS – Translate genomics to the bedside.
Tentative first steps towards personalized medicine.
Acknowledgements
• ICGC DCC & Portal
– Arek Kasprzyk
– Junjun Zhang
– Francis Ouellette
• Network Analysis
– Guanming Wu
– Irina Kalatskaya
– Christina Yung
• GPS Project
– Lillian Siu
– John McPherson
– Suzanne Karmel-Reid
• My Boss
– Thomas Hudson
• Funding
– National Institutes of Health
– Ministry of Research & Innovation, Ontario