analyzing genomic data for whole populations
TRANSCRIPT
AWS Government, Education, and Nonprofit Symposium, Washington, DC | June 25-26, 2015
Analyzing Genomic Data for Whole Populations
How AWS Scalable Resources Enable Fast and Cost-Effective Analysis of Large Cohorts
Abhinav Nellore, PhD, Johns Hopkins University
Peter White, PhD, Nationwide Children’s Hospital, The Ohio State University
©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Intel Head in the Cloud Challenge on AWS POC: Population Scale Human Genome Analysis on the Cloud
Peter White, PhD
Director of Biomedical Genomics Core and Molecular Bioinformatics, The Research Institute at Nationwide Children’s Hospital
Assistant Professor of Pediatrics, The Ohio State University
The Human Genome Project
15 Years
$3,000,000,000
The First Printed Human Genome
Wellcome Collection, London
Next-Generation Sequencing in the Biomedical Genomics Core
1.2 terabytes
12 Human Genomes: 3.5 days
Data, Data, DATA…
10,000s of samples
Trillions of base pairs
Molecular diagnostics
Clinical treatment
Clinical outcomes
Translational Bioinformatics
5 billion sequence reactions
Illumina HiSeq 4000
Illumina MiSeq
The Solution: Churchill
The Problem:
• 2 days for raw data
• ~2 weeks for analysis
Parallelization Strategies
[Figure, two panels; y-axis: region size (Mbp), 0 to 250]
Chromosomal: one region per chromosome (1 to 22, X, Y)
Churchill, Balanced Subchromosomal: chromosomal sub-regions distributed across CPUs (1 to 24)
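The contrast between the two panels can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Churchill's actual implementation: the chromosome lengths are made-up round numbers, and `split_into_regions` and `largest_region` are hypothetical helpers introduced only for this example.

```python
import math

# Hypothetical chromosome lengths in Mbp (illustrative round numbers, not real data).
CHROM_LENGTHS_MBP = {"chr1": 250, "chr2": 240, "chr21": 50, "chrY": 60}


def split_into_regions(lengths, n_workers):
    """Cut chromosomes into sub-regions of near-equal size.

    The target size is total length / n_workers; each chromosome is cut
    into the fewest equal pieces that do not exceed the target.
    """
    target = sum(lengths.values()) / n_workers
    regions = []
    for chrom, length in lengths.items():
        pieces = max(1, math.ceil(length / target))
        step = length / pieces
        for i in range(pieces):
            regions.append((chrom, i * step, (i + 1) * step))
    return regions


def largest_region(regions):
    """Size of the largest region, which bounds the slowest unit of work."""
    return max(end - start for _, start, end in regions)


# One region per chromosome versus balanced sub-chromosomal regions.
whole = [(chrom, 0, length) for chrom, length in CHROM_LENGTHS_MBP.items()]
balanced = split_into_regions(CHROM_LENGTHS_MBP, n_workers=8)
print(largest_region(whole), largest_region(balanced))
```

With these toy numbers the largest unit of work shrinks from 250 Mbp to 62.5 Mbp, so no single worker is left processing a whole large chromosome while the others sit idle.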
Churchill is the most efficient pipeline
Pipeline | Utilization
Churchill | 92%
HugeSeq | 46%
GATK-Queue | 30%
[Figure: region size (Mbp), 0 to 250, by chromosome (1 to 22, X, Y)]
Churchill enables <2h genome analysis
Churchill has highest diagnostic effectiveness
Metric | Churchill | GATK-Q | HugeSeq
Validated SNP | 99.9% | 99.2% | 99.2%
% Known (dbSNP) | 94.6% | 93.0% | 92.5%
Error rate | 0.0012% | 0.0019% | 0.0034%
Youden Index | 99.66% | 98.96% | 98.65%
Validation using NIST GIAB
100% reproducible and deterministic regardless of platform or level of parallelism
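For readers unfamiliar with the Youden Index, a minimal sketch of its standard definition (sensitivity + specificity - 1) follows. The slide does not say how the values above were computed, and the counts below are hypothetical, chosen only to show the arithmetic.

```python
def youden_index(tp, fn, tn, fp):
    """Youden's J statistic: sensitivity + specificity - 1."""
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return sensitivity + specificity - 1


# Hypothetical counts for illustration only (not the GIAB validation data).
j = youden_index(tp=990, fn=10, tn=9980, fp=20)
print(f"{j:.2%}")
```

A value near 100% means a caller misses few true variants and reports few false ones.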
Population Scale Genomics: Proof of Concept
• Jan 2008: Project begins, consortium of >400 scientists
• Oct 2011: Phase 1 completed (1,092 individuals)
• Apr 2013: Phase 3 sequencing completed
• Sep 2014: Completion of Phase 3 analysis
26 Populations: 2,504 Individuals
First Challenge: Data Upload
• Data Upload Challenge:
  – Final 1000 Genomes (1KG) data consist of 2,504 exomes and 2,504 whole genomes. Total size of the FASTQ dataset is 70 TB.
  – How do you transfer 70 TB of data quickly for genomic analysis?
• Data Upload Solution:
  – Co-locate cloud infrastructure. 1KG uses Amazon Simple Storage Service (S3) to store all data in the US East Region, so deploy the GenomeNext analysis platform in the same region.
  – Automated data upload. Create genomic samples that reference the S3 object locations of the raw data and pass the locations to the compute clusters.
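A rough calculation shows why moving the data is the thing to avoid. The 70 TB figure is from the slide; the link speeds are assumptions chosen for illustration, and the estimate ignores protocol overhead.

```python
def transfer_days(size_tb, link_gbit_per_s):
    """Days needed to move size_tb terabytes over a fully used link."""
    size_bits = size_tb * 1e12 * 8               # decimal terabytes -> bits
    seconds = size_bits / (link_gbit_per_s * 1e9)
    return seconds / 86_400                      # seconds per day


# 70 TB is from the slide; the link speeds are assumed for illustration.
for gbit in (1, 10):
    print(f"{gbit:>2} Gbit/s: {transfer_days(70, gbit):.1f} days")
```

At a sustained 1 Gbit/s the copy alone would take about six and a half days, which is why the analysis platform is deployed in the region where the data already sit.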
Data Upload Solution
Second Challenge: Scale
• The population scale challenge: complete analysis of 5,008 genomic samples in 7 days
[Diagram labels: Clinical/Research Portal (SaaS); US East Region; 1000 Genomes Project Phase III Data; Automated Upload; Analysis Platform; Automation of AWS Infrastructure; GenomeNext Results; Access Results; Results added to GenomeNext Storage]
Scale Solution
1,000 instances
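The scale figures translate into a required throughput as follows. The sample count, deadline and instance count are from the slides; the assumption that every instance runs for the full seven days is ours and gives only an upper bound on the compute spent per sample.

```python
SAMPLES = 5_008        # from the slide
DAYS = 7               # from the slide
INSTANCES = 1_000      # from the slide

hours = DAYS * 24
samples_per_hour = SAMPLES / hours
# Assumption: all instances run for the full 7 days (an upper bound).
instance_hours_per_sample = INSTANCES * hours / SAMPLES

print(f"{samples_per_hour:.1f} samples/hour")
print(f"<= {instance_hours_per_sample:.1f} instance-hours/sample")
```

That works out to roughly 30 samples finished every hour, with at most about 34 instance-hours available per sample.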
7 DAYS: FASTQ to Results
r² = 0.9978, p-value < 2.2e-16
>75 million variants in common
Rail-RNA: analyzing many RNA sequencing samples with Amazon Elastic MapReduce
@AbhiNellore from JHU at AWS WWPS DC Symposium
Joint with Leo Collado-Torres, Andrew Jaffe, Jamie Morton, Jacob Pritt, José Alquicira-Hernández, Jeff Leek, and Ben Langmead
RNA and gene expression
Crick’s central dogma https://kaiserscience.wordpress.com/biology-the-living-environment/genetics/dna-translation-and-the-genetic-code/
RNA sequencing (RNA-seq)
Zhong Wang, Mark Gerstein & Michael Snyder. RNA-Seq: a revolutionary tool for transcriptomics, Nature Reviews Genetics (2009).
Growth of sequence read archives (ENA)
http://www.ebi.ac.uk/ena/about/statistics
Why care about many samples?
Genomics initiatives are now accumulating large RNA-seq datasets and making them available publicly.
Study | ~# samples
ENCODE | 100
GEUVADIS | 667
Depression Genes & Networks | 922
TCGA | 2k+
GTEx | 10k+
Why reanalyze?
• Disagree with existing analyses
• Repurpose rich, hard-to-obtain data
• Ensure reproducibility
• Add power to experiment
Why use cloud computing?
• Elasticity
  • Pay-as-you-go access to a computer cluster of the desired size
  • Amazon S3 = effectively unlimited storage
• Reproducibility
  • Same software on same hardware
Advantage of Amazon EMR: consistency
• Eventually consistent cloud storage may not give the right answer if read too soon after a write/update/delete
  => Data loss => Compromised reproducibility
• EMRFS imposes consistency with an Amazon DynamoDB table reflecting what S3 should look like
Alignment
Sometimes, we’re lucky, and a read correctly aligns to the reference genome end to end.
chr1: …ATACATCAGACTAGACCGTACCACTCATAGACCTAGACCAGATACAG…
read:        CAGACTAGACCGTACCACTCATAGACCTAGACCAGATAC
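The lucky case can be reproduced with the two sequences from the slide. This is a minimal sketch that uses exact substring search as a stand-in; real aligners use an index of the genome and tolerate mismatches, and `align_end_to_end` is a hypothetical helper name.

```python
# Sequences copied from the slide.
CHR1 = "ATACATCAGACTAGACCGTACCACTCATAGACCTAGACCAGATACAG"
READ = "CAGACTAGACCGTACCACTCATAGACCTAGACCAGATAC"


def align_end_to_end(reference, read):
    """Return the 0-based offset of an exact, gap-free match, or None."""
    offset = reference.find(read)
    return offset if offset >= 0 else None


offset = align_end_to_end(CHR1, READ)
print(CHR1)
print(" " * offset + READ)   # read drawn under its matching position
```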
Spliced alignment
But when introns are overlapped, we divide the read into readlets…
read:     ATACATCAGACTAGACCGTACCACACAGCATGACAGTCATTCGACGTACT
readlets: ATACATCAGACTAGACCGTACCACA
              ATCAGACTAGACCGTACCACACAGC
                  GACTAGACCGTACCACACAGCATGA
                      AGACCGTACCACACAGCATGACAGT
                          CGTACCACACAGCATGACAGTCATT
                              CCACACAGCATGACAGTCATTCGAC
                                  ACAGCATGACAGTCATTCGACGTAC
                                   CAGCATGACAGTCATTCGACGTACT
[Shorter readlets anchored at each end of the read are also shown.]
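Cutting a read into readlets can be sketched as below. The readlet length (25) and stride (4) are inferred from the sequences on the slide and are not claimed to be Rail-RNA's defaults; the shorter readlets at the read ends are left out, and `segment` is a hypothetical helper name.

```python
# Read copied from the slide.
READ = "ATACATCAGACTAGACCGTACCACACAGCATGACAGTCATTCGACGTACT"


def segment(read, length=25, stride=4):
    """Cut a read into overlapping fixed-length readlets.

    Returns (start, readlet) pairs. A final readlet flush with the
    read's end is added if the stride does not land on it exactly.
    """
    if len(read) <= length:
        return [(0, read)]
    last_start = len(read) - length
    starts = list(range(0, last_start + 1, stride))
    if starts[-1] != last_start:
        starts.append(last_start)
    return [(s, read[s:s + length]) for s in starts]


readlets = segment(READ)
for start, readlet in readlets:
    print(" " * start + readlet)   # staircase layout, as in the figure
```

Each readlet is short enough to fall entirely within one exon most of the time, which is what makes it alignable without splicing.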
Spliced alignment
…and align readlets to the genome to infer intron presence. Realignment of reads may be needed.
chr1: …ATACATCAGACTAGACCGTACCACAGTAGTTCATGACCCTCAGCAGCATGACAGTCATTCGACGTACTCGTATCGATACAGTACAGTAGCC… (contains the intron)
read 1: CATCAGACTAGACCGTACCACAGTACAGCATGACAGTCATTCGACGTACTCGT
read 2: TACAGCATGACAGTCATTCGACGTACTCGTATCGATACAGTACAGTAG (needs realignment to find intron)
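The intron spanned by read 1 can be recovered from the slide's sequences with a brute-force search. This is an illustration only and not how Rail-RNA works (it aligns readlets, as described above); `find_intron` is a hypothetical helper that tries every split of the read into a left and a right part.

```python
# Sequences copied from the slide (ellipses dropped).
CHR1 = ("ATACATCAGACTAGACCGTACCACAGTAGTTCATGACCCTCAG"
        "CAGCATGACAGTCATTCGACGTACTCGTATCGATACAGTACAGTAGCC")
READ_1 = "CATCAGACTAGACCGTACCACAGTACAGCATGACAGTCATTCGACGTACTCGT"


def find_intron(reference, read, min_intron=1):
    """Brute-force search for a single intron spanned by `read`.

    Tries every split of the read into a left and right part and looks
    for both parts in the reference with a gap between them.
    Returns (intron_start, intron_end) as a half-open interval, or None.
    """
    for split in range(1, len(read)):
        left, right = read[:split], read[split:]
        pos = reference.find(left)
        while pos != -1:
            left_end = pos + split
            right_pos = reference.find(right, left_end + min_intron)
            if right_pos != -1:
                return left_end, right_pos
            pos = reference.find(left, pos + 1)
    return None


start, end = find_intron(CHR1, READ_1)
print(CHR1[start:end])   # the skipped stretch of chr1
```

With read 2 the same search is not reliable: its left overhang is only two bases (`TA`), which occur at several places in the reference. That is the situation in which realignment against introns found from other reads is needed.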
Rail-RNA analyzes many RNA-seq samples in the cloud with Amazon Elastic MapReduce.
rail-rna go elastic --manifest listOf100RNASeqFASTQs.tab --assembly hg19 --core-instance-count 50 --output s3://my-bucket/my-output
1 or 2 commands produce: alignments, introns, indels, coverage
Download at http://rail.bio
How to scale?
• Software requirements
  • Gets faster with more workers in cluster
  • Works on lots of data
• Solution: MapReduce via [logo]
  • Alternating aggregation and computation steps
  • Aggregation: data are grouped and sorted
  • Computation: each worker operates on subset of data
MapReduce reduces redundant analysis in Rail
1. Aggregate: group unique read sequences
2. Compute: segment unique sequences
3. Aggregate: group unique readlet sequences
4. Compute: align unique readlet sequences
5. Aggregate: group alignments of readlets from same read sequence
6. Compute: search for introns by unique read sequence
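How an aggregation step removes redundant work can be sketched on a single machine. The records and the `expensive_analysis` function are hypothetical stand-ins; in a real MapReduce job the grouping is done by the framework's sort-and-shuffle rather than by a dictionary.

```python
from collections import defaultdict

# Hypothetical (sample, read sequence) records for illustration.
records = [
    ("sample1", "ACGTACGT"),
    ("sample2", "ACGTACGT"),
    ("sample1", "TTGACCAA"),
    ("sample3", "ACGTACGT"),
]


def aggregate(records):
    """Aggregation step: group records by read sequence."""
    groups = defaultdict(list)
    for sample, seq in records:
        groups[seq].append(sample)
    return groups


def expensive_analysis(seq):
    """Stand-in for a costly per-sequence computation such as alignment."""
    return seq.count("G") + seq.count("C")


def compute(groups):
    """Computation step: analyze each unique sequence once, then
    share the result with every sample that contained it."""
    results = []
    for seq, samples in sorted(groups.items()):
        value = expensive_analysis(seq)
        results.extend((sample, seq, value) for sample in samples)
    return results


groups = aggregate(records)
results = compute(groups)
print(len(records), "reads,", len(groups), "unique sequences")
```

The costly function runs once per unique sequence instead of once per read, and the saving grows as more samples share the same sequences.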
Scalability overview: speedup
[Figure: throughput vs. # computers; curves: ideal, in practice]
Scalability: speedup
[Figure: # samples / hour vs. # c3.2xlarge “spot” computers ($0.11/computer/hr), on 28 GEUVADIS samples]
# computers | $/sample
10 | 1.19
20 | 1.46
40 | 1.58
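The cost table can be turned into an approximate speedup. This rests on an assumption that is ours, not the slide's: that cost is proportional to the number of computers times wall-clock time, which ignores billing granularity and any fixed fees.

```python
# $/sample by cluster size, from the slide (28 GEUVADIS samples).
COST_PER_SAMPLE = {10: 1.19, 20: 1.46, 40: 1.58}


def implied_speedup(costs, base, n):
    """Speedup of n computers over `base` computers.

    Assumes cost is proportional to computers x wall-clock time,
    so time is proportional to cost / computers.
    """
    time_base = costs[base] / base
    time_n = costs[n] / n
    return time_base / time_n


for n in (20, 40):
    s = implied_speedup(COST_PER_SAMPLE, 10, n)
    print(f"{n} vs 10 computers: {s:.2f}x speedup, "
          f"{s / (n / 10):.0%} of ideal")
```

Under that assumption, quadrupling the cluster from 10 to 40 computers gives about a threefold speedup, which matches the picture of practice falling somewhat below the ideal line.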
Scalability: improved throughput
[Figure: throughput vs. # samples; curves: naive, Rail-RNA]
Scalability: improved throughput
[Figure: # samples / hour vs. # GEUVADIS samples, on 40 c3.2xlarge computers]
# samples | $/sample
28 | 1.58
56 | 1.26
112 | 0.95
Milestones
• Analyzed GEUVADIS
  • 666 samples in 12 hours on 1920 processing cores
  • ~69 cents per sample
• Next: analyze all human Illumina RNA-seq on SRA (~20k samples)
http://rail.bio
Collaborators: Leo Collado-Torres, Andrew Jaffe, Jamie Morton, Jacob Pritt, José Alquicira-Hernández, Jeff Leek, Ben Langmead
Special thanks to Angel Pizarro
Thank You.
This presentation will be loaded to SlideShare the week following the Symposium.
http://www.slideshare.net/AmazonWebServices