Analyzing Genomic Data for Whole Populations


Upload: amazon-web-services

Post on 14-Aug-2015


TRANSCRIPT

Page 1: Analyzing Genomic Data for Whole Populations

AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015


Analyzing Genomic Data for Whole Populations
How AWS Scalable Resources Enable Fast and Cost-Effective Analysis of Large Cohorts

Abhinav Nellore, PhD
Johns Hopkins University

Peter White, PhD
Nationwide Children’s Hospital
The Ohio State University

©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Page 2: Analyzing Genomic Data for Whole Populations


Intel Head in the Cloud Challenge on AWS POC: Population Scale Human Genome Analysis on the Cloud

Peter White, PhD
Director of Biomedical Genomics Core and Molecular Bioinformatics
The Research Institute at Nationwide Children’s Hospital
Assistant Professor of Pediatrics, The Ohio State University

Page 3: Analyzing Genomic Data for Whole Populations


The Human Genome Project

15 Years
$3,000,000,000

Page 4: Analyzing Genomic Data for Whole Populations


The First Printed Human Genome
Wellcome Collection, London

Page 5: Analyzing Genomic Data for Whole Populations


Next-Generation Sequencing in the Biomedical Genomics Core

1.2 terabytes

12 Human Genomes: 3.5 days

Data, Data, DATA…
10,000s of samples

Trillions of base pairs

Molecular diagnostics
Clinical treatment
Clinical outcomes

Translational Bioinformatics

5 billion sequence reactions

Illumina HiSeq 4000

Illumina MiSeq

Page 6: Analyzing Genomic Data for Whole Populations


Page 7: Analyzing Genomic Data for Whole Populations

The Problem:
• 2 days for raw data
• ~2 weeks for analysis

The Solution: Churchill

Page 8: Analyzing Genomic Data for Whole Populations


Parallelization Strategies

[Figure: region size (Mbp, 0 to 250) for chromosomes 1 to 22, X, and Y. Panel title: Chromosomal]

[Figure: region size (Mbp, 0 to 250) for chromosomal sub-regions 1 to 24 distributed across CPUs. Panel title: Churchill: Balanced Subchromosomal]
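The contrast between the two panels can be sketched in a few lines. This is only an illustration of the balancing idea, not Churchill's actual region-selection algorithm: the function name `balanced_regions`, the `target_size` parameter, and the toy chromosome lengths are all assumptions made for the example.

```python
import math

def balanced_regions(chrom_lengths, target_size):
    """Split each chromosome into near-equal pieces no larger than target_size.

    Hypothetical helper: it only illustrates why sub-chromosomal regions
    give a more even workload than one region per chromosome.
    """
    regions = []
    for chrom, length in chrom_lengths.items():
        pieces = math.ceil(length / target_size)
        base, extra = divmod(length, pieces)
        start = 0
        for i in range(pieces):
            # Spread the remainder over the first `extra` pieces.
            size = base + (1 if i < extra else 0)
            regions.append((chrom, start, start + size))
            start += size
    return regions

# Toy lengths in Mbp (illustrative values, not real chromosome sizes).
toy_lengths = {"chrA": 250, "chrB": 130, "chrC": 50}
regions = balanced_regions(toy_lengths, target_size=60)
for chrom, start, end in regions:
    print(chrom, start, end, end - start)
```

With one region per chromosome the largest unit of work here would be five times the smallest; after splitting, every region is between 43 and 50 Mbp, so no single worker holds up the rest.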

Page 9: Analyzing Genomic Data for Whole Populations


Churchill is the most efficient pipeline

Churchill: 92% utilization
HugeSeq: 46% utilization
GATK-Queue: 30% utilization

[Figure: region size (Mbp, 0 to 250) by chromosome, 1 to 22, X, and Y]

Page 10: Analyzing Genomic Data for Whole Populations


Churchill enables <2h genome analysis

Page 11: Analyzing Genomic Data for Whole Populations


Churchill has highest diagnostic effectiveness

Metric | Churchill | GATK-Q | HugeSeq
Validated SNP | 99.9% | 99.2% | 99.2%
% Known (dbSNP) | 94.6% | 93.0% | 92.5%
Error rate | 0.0012% | 0.0019% | 0.0034%
Youden Index | 99.66% | 98.96% | 98.65%

Validation using NIST GIAB

100% Reproducible and Deterministic regardless of platform or level of parallelism
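For readers unfamiliar with the Youden Index row, the statistic is sensitivity + specificity - 1. The sketch below shows the calculation on made-up confusion-matrix counts; the counts and the function name `youden_index` are assumptions for illustration and are not the values behind the table.

```python
def youden_index(tp, fn, tn, fp):
    """Return Youden's J = sensitivity + specificity - 1."""
    sensitivity = tp / (tp + fn)  # fraction of true variants that were called
    specificity = tn / (tn + fp)  # fraction of non-variant sites left uncalled
    return sensitivity + specificity - 1

# Hypothetical counts, chosen only to make the arithmetic easy to follow.
j = youden_index(tp=90, fn=10, tn=95, fp=5)
print(f"Youden index: {j:.2%}")
```

A value near 100% requires both few missed variants and few false calls, which is why it is a convenient single-number summary next to the error rate.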

Page 12: Analyzing Genomic Data for Whole Populations


Population Scale Genomics: Proof of Concept

Page 13: Analyzing Genomic Data for Whole Populations

Jan 2008: Project begins – consortium of >400 scientists
Oct 2011: Phase 1 completed (1,092 individuals)
Apr 2013: Phase 3 sequencing completed
Sep 2014: Completion of Phase 3 analysis

26 Populations: 2,504 Individuals

Page 14: Analyzing Genomic Data for Whole Populations


First Challenge: Data Upload
• Data Upload Challenge:
– Final 1KG data consist of 2,504 exomes and 2,504 whole genomes. Total size of the FASTQ dataset is 70 TB.
– How do you transfer 70 TB of data quickly for genomic analysis?
• Data Upload Solution:
– Co-locate cloud infrastructure. 1KG uses Amazon Simple Storage Service (S3) to store all data in the US East Region, so deploy the GenomeNext analysis platform in the same region.
– Automated data upload. Create genomic samples that reference the S3 object locations of the raw data and pass the locations to compute clusters.
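The "reference, don't copy" idea in the upload solution can be sketched as a sample record that holds only S3 URIs. Everything named here is hypothetical: the `GenomicSample` class, the bucket `example-1kg-bucket`, the object keys, and the shape of `job_message` are assumptions for illustration, not the GenomeNext platform's data model.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class GenomicSample:
    """A sample that points at raw data in S3 instead of carrying it."""
    sample_id: str
    fastq_uris: Tuple[str, ...]

def make_sample(sample_id, bucket, keys):
    # Only object locations are recorded; no FASTQ bytes are moved here.
    return GenomicSample(sample_id, tuple(f"s3://{bucket}/{key}" for key in keys))

# Hypothetical bucket and keys.
sample = make_sample(
    "SAMPLE_0001",
    "example-1kg-bucket",
    ["raw/SAMPLE_0001_1.fastq.gz", "raw/SAMPLE_0001_2.fastq.gz"],
)

# What would be handed to a compute cluster: a small message of locations.
job_message = {"sample_id": sample.sample_id, "inputs": list(sample.fastq_uris)}
print(job_message)
```

Because the compute cluster runs in the same region as the bucket, each worker can read its inputs directly from S3, and the only thing transferred ahead of time is this small message.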

Page 15: Analyzing Genomic Data for Whole Populations


Data Upload Solution

Page 16: Analyzing Genomic Data for Whole Populations

Data Upload Solution

Page 17: Analyzing Genomic Data for Whole Populations


Second Challenge: Scale
• The population scale challenge: complete analysis of 5,008 genomic samples in 7 days

[Diagram: Clinical/Research Portal (SaaS); US East Region; 1000 Genomes Project Phase III Data; Automated Upload; Analysis Platform; Automation of AWS Infrastructure; GenomeNext Results; Access Results; Results added to GenomeNext Storage]

Page 18: Analyzing Genomic Data for Whole Populations


Scale Solution

1,000 instances

Page 19: Analyzing Genomic Data for Whole Populations


Page 20: Analyzing Genomic Data for Whole Populations


7 DAYS: FASTQ to Results

r² = 0.9978, p-value < 2.2e-16

>75 million variants in common

Page 21: Analyzing Genomic Data for Whole Populations


Page 22: Analyzing Genomic Data for Whole Populations


@AbhiNellore from JHU at AWS WWPS DC Symposium

Joint with Leo Collado-Torres, Andrew Jaffe, Jamie Morton, Jacob Pritt, José Alquicira-Hernández, Jeff Leek, and Ben Langmead

Rail-RNA: analyzing many RNA sequencing samples with Amazon Elastic MapReduce


Page 23: Analyzing Genomic Data for Whole Populations


RNA and gene expression

Crick’s central dogma https://kaiserscience.wordpress.com/biology-the-living-environment/genetics/dna-translation-and-the-genetic-code/


Page 24: Analyzing Genomic Data for Whole Populations


RNA sequencing (RNA-seq)

Zhong Wang, Mark Gerstein & Michael Snyder. RNA-Seq: a revolutionary tool for transcriptomics, Nature Reviews Genetics (2009).


Page 25: Analyzing Genomic Data for Whole Populations


Growth of sequence read archives (ENA)
http://www.ebi.ac.uk/ena/about/statistics


Page 26: Analyzing Genomic Data for Whole Populations


Why care about many samples?
Genomics initiatives are now accumulating large RNA-seq datasets and making them available publicly.

Study | ~# samples
ENCODE | 100
GEUVADIS | 667
Depression Genes & Networks | 922
TCGA | 2k+
GTEx | 10k+

Page 27: Analyzing Genomic Data for Whole Populations


Why reanalyze?

• Disagree with existing analyses

• Repurpose rich, hard-to-obtain data

• Ensure reproducibility

• Add power to experiment

Page 28: Analyzing Genomic Data for Whole Populations


Why use cloud computing?

• Elasticity
– Pay-as-you-go access to computer cluster of desired size
– Amazon S3 = infinite storage
• Reproducibility
– Same software on same hardware

Page 29: Analyzing Genomic Data for Whole Populations


Advantage of Amazon EMR: consistency
• Eventually consistent cloud storage may not give the right answer if it is read too soon after a write, update, or delete
=> Data loss => Compromised reproducibility
• EMRFS imposes consistency with an Amazon DynamoDB table reflecting what S3 should look like

Page 30: Analyzing Genomic Data for Whole Populations


Alignment
Sometimes, we’re lucky, and a read correctly aligns to the reference genome end to end.

chr1: …ATACATCAGACTAGACCGTACCACTCATAGACCTAGACCAGATACAG…
read:        CAGACTAGACCGTACCACTCATAGACCTAGACCAGATAC
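The lucky case can be reproduced with a plain substring search on the slide's two sequences. This is a minimal sketch, assuming exact matching only; real aligners tolerate mismatches and use an index rather than a linear scan, and the function name `align_end_to_end` is made up for the example.

```python
def align_end_to_end(reference, read):
    """Return the 0-based offset of the first exact match, or None."""
    offset = reference.find(read)
    return offset if offset >= 0 else None

# Sequences copied from the slide (ellipses dropped from the reference).
chr1 = "ATACATCAGACTAGACCGTACCACTCATAGACCTAGACCAGATACAG"
read = "CAGACTAGACCGTACCACTCATAGACCTAGACCAGATAC"

offset = align_end_to_end(chr1, read)
print(offset)
```

The whole read is found as one contiguous stretch of the reference, which is exactly what stops being true once a read spans an exon-exon junction.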

Page 31: Analyzing Genomic Data for Whole Populations


Spliced alignment
But when introns are overlapped, we divide the read into readlets…

read: ATACATCAGACTAGACCGTACCACACAGCATGACAGTCATTCGACGTACT

[Figure: readlets shown as overlapping substrings of the read: 25-nt readlets taken every 4 nt, plus shorter readlets anchored at each end of the read]
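The segmentation step can be sketched as a sliding window over the read. The readlet length of 25 and interval of 4 are read off the slide's example; the function name `readlets` is an assumption, and the shorter end-anchored readlets shown on the slide are deliberately left out, so this is a simplification rather than Rail-RNA's implementation.

```python
def readlets(read, length=25, interval=4):
    """Return (offset, sequence) pairs of fixed-length overlapping readlets.

    Simplified sketch: the last window is pinned to the end of the read,
    and shorter end-anchored readlets are not generated.
    """
    if len(read) <= length:
        return [(0, read)]
    last = len(read) - length
    starts = list(range(0, last + 1, interval))
    if starts[-1] != last:
        starts.append(last)  # make sure the end of the read is covered
    return [(start, read[start:start + length]) for start in starts]

# The 50-nt read from the slide.
read = "ATACATCAGACTAGACCGTACCACACAGCATGACAGTCATTCGACGTACT"
pieces = readlets(read)
for offset, sequence in pieces:
    print(" " * offset + sequence)
```

Printing each readlet indented by its offset reproduces the staircase layout of the slide, which makes it easy to see that every position of the read is covered by several readlets.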

Page 32: Analyzing Genomic Data for Whole Populations


Spliced alignment
…and align readlets to the genome to infer intron presence. Realignment of reads may be needed.

chr1 (contains the intron): …ATACATCAGACTAGACCGTACCACAGTAGTTCATGACCCTCAGCAGCATGACAGTCATTCGACGTACTCGTATCGATACAGTACAGTAGCC…

read 1: CATCAGACTAGACCGTACCACAGTACAGCATGACAGTCATTCGACGTACTCGT

read 2 (needs realignment to find intron): TACAGCATGACAGTCATTCGACGTACTCGTATCGATACAGTACAGTAG
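A toy version of intron inference on the slide's sequences: split the read in two, place each half by exact match, and report the gap between them. This is a sketch under strong assumptions (exact matches, a single intron, a GT...AG motif filter, and a hypothetical `min_anchor` of 8 nt on each side); it is not how Rail-RNA searches for introns.

```python
def find_intron(genome, read, min_anchor=8):
    """Return (intron_start, intron_end, intron_sequence) or None.

    Tries every split of the read that leaves at least min_anchor bases
    on each side, and accepts the first split whose halves match the
    genome exactly with a GT...AG gap between them.
    """
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        left_pos = genome.find(left)
        if left_pos < 0:
            continue
        intron_start = left_pos + split
        right_pos = genome.find(right, intron_start + 1)
        if right_pos < 0:
            continue
        intron = genome[intron_start:right_pos]
        if intron.startswith("GT") and intron.endswith("AG"):
            return intron_start, right_pos, intron
    return None

# Sequences copied from the slide (ellipses dropped from the genome).
chr1 = ("ATACATCAGACTAGACCGTACCACAGTAGTTCATGACCCTCAGCAGCATGACAGTCATTCGACG"
        "TACTCGTATCGATACAGTACAGTAGCC")
read1 = "CATCAGACTAGACCGTACCACAGTACAGCATGACAGTCATTCGACGTACTCGT"
read2 = "TACAGCATGACAGTCATTCGACGTACTCGTATCGATACAGTACAGTAG"

print(find_intron(chr1, read1))
print(find_intron(chr1, read2))
```

Read 1 has long anchors on both sides of the junction, so the search succeeds. Read 2 overlaps the upstream exon by only two bases, too few to place on their own, which is why it has to be realigned once the intron is known from other reads.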

Page 33: Analyzing Genomic Data for Whole Populations


Rail-RNA analyzes many RNA-seq samples in the cloud with Amazon Elastic MapReduce.

rail-rna go elastic --manifest listOf100RNASeqFASTQs.tab --assembly hg19 --core-instance-count 50 --output s3://my-bucket/my-output

1 or 2 commands => alignments, introns, indels, coverage

Page 34: Analyzing Genomic Data for Whole Populations


Download at http://rail.bio

Page 35: Analyzing Genomic Data for Whole Populations


How to scale?

• Software requirements
– Gets faster with more workers in cluster
– Works on lots of data
• Solution: MapReduce via
– Alternating aggregation and computation steps
– Aggregation: data are grouped and sorted
– Computation: each worker operates on subset of data

Page 36: Analyzing Genomic Data for Whole Populations


MapReduce reduces redundant analysis in Rail

1. Aggregate: group unique read sequences

2. Compute: segment unique sequences

3. Aggregate: group unique readlet sequences

4. Compute: align unique readlet sequences

5. Aggregate: group alignments of readlets from same read sequence

6. Compute: search for introns by unique read sequence
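One aggregate/compute pair from the list above can be sketched in plain Python to show where the savings come from. The function names, the toy reads, and the stand-in `toy_analysis` are assumptions; a real run would distribute both steps across a cluster rather than use an in-memory dictionary.

```python
from collections import defaultdict

def aggregate_by_sequence(reads):
    """Aggregate step: group (sample, sequence) records by sequence."""
    groups = defaultdict(list)
    for sample, sequence in reads:
        groups[sequence].append(sample)
    return dict(groups)

def compute_once_per_sequence(groups, analyze):
    """Compute step: run the analysis once per unique sequence."""
    return {sequence: analyze(sequence) for sequence in groups}

# Toy input: five reads across three samples, only three distinct sequences.
reads = [("s1", "ACGT"), ("s2", "ACGT"), ("s1", "TTGA"),
         ("s3", "ACGT"), ("s2", "GGCC")]

calls = []
def toy_analysis(sequence):
    calls.append(sequence)  # record how often the expensive step runs
    return len(sequence)    # stand-in for segmentation or alignment

groups = aggregate_by_sequence(reads)
results = compute_once_per_sequence(groups, toy_analysis)
print(len(reads), "reads,", len(calls), "analysis calls")
```

The more samples share identical reads or readlets, the larger the gap between the number of reads and the number of analysis calls, which is one reason cost per sample can fall as samples are added.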

Page 37: Analyzing Genomic Data for Whole Populations


Scalability overview: speedup

[Figure: throughput versus # computers, comparing ideal scaling with scaling in practice]

Page 38: Analyzing Genomic Data for Whole Populations


Scalability: speedup

[Figure: # samples / hour versus # c3.2xlarge “spot” computers ($0.11/computer/hr), on 28 GEUVADIS samples]

# computers | $/sample
10 | 1.19
20 | 1.46
40 | 1.58

Page 39: Analyzing Genomic Data for Whole Populations


Scalability: improved throughput

[Figure: throughput versus # samples, comparing naive scaling with Rail-RNA]

Page 40: Analyzing Genomic Data for Whole Populations


Scalability: improved throughput

[Figure: # samples / hour versus # GEUVADIS samples, on 40 c3.2xlarge computers]

# samples | $/sample
28 | 1.58
56 | 1.26
112 | 0.95

Page 41: Analyzing Genomic Data for Whole Populations


Milestones

• Analyzed GEUVADIS
– 666 samples in 12 hours on 1920 processing cores
– ~69 cents per sample
• Next: analyze all human Illumina RNA-seq on SRA (~20k samples)
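The milestone figures can be turned into a few derived numbers. The inputs are the slide's own (666 samples, 12 hours, 1920 cores, roughly 69 cents per sample); the projection to 20,000 samples is a naive extrapolation at a flat per-sample price and is an assumption of this sketch, not a figure from the talk.

```python
samples = 666
hours = 12
cores = 1920
cost_per_sample = 0.69  # dollars, approximate ("~69 cents")

total_cost = samples * cost_per_sample       # approximate bill for the run
core_hours = cores * hours                   # compute consumed
core_hours_per_sample = core_hours / samples
samples_per_hour = samples / hours           # throughput of the whole cluster

# Naive projection: assumes the per-sample price stays flat at 20k samples.
projected_cost_20k = 20_000 * cost_per_sample

print(f"total cost: ${total_cost:,.2f}")
print(f"core-hours: {core_hours:,} ({core_hours_per_sample:.1f} per sample)")
print(f"throughput: {samples_per_hour} samples/hour")
print(f"flat-rate projection for 20k samples: ${projected_cost_20k:,.0f}")
```

Since the cost table for 40 computers shows the price per sample falling as more samples are analyzed together, a flat-rate projection is more likely an upper estimate than a lower one.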

Page 42: Analyzing Genomic Data for Whole Populations


http://rail.bio

Collaborators
Leo Collado-Torres, Andrew Jaffe, Jamie Morton, Jacob Pritt, José Alquicira-Hernández, Jeff Leek, Ben Langmead

Special thanks to
Angel Pizarro

Page 43: Analyzing Genomic Data for Whole Populations


Thank You.
This presentation will be loaded to SlideShare the week following the Symposium.

http://www.slideshare.net/AmazonWebServices
