analyzing genomic data for whole populations

AWS Government, Education, and Nonprofit Symposium
Washington, DC | June 25-26, 2015


Analyzing Genomic Data for Whole Populations
How AWS Scalable Resources Enable Fast and Cost-Effective Analysis of Large Cohorts

Abhinav Nellore, PhD
Johns Hopkins University

Peter White, PhD
Nationwide Children’s Hospital
The Ohio State University

©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved.



Intel Head in the Cloud Challenge on AWS POC: Population Scale Human Genome Analysis on the Cloud

Peter White, PhD
Director of Biomedical Genomics Core and Molecular Bioinformatics
The Research Institute at Nationwide Children’s Hospital
Assistant Professor of Pediatrics, The Ohio State University


The Human Genome Project

15 Years
$3,000,000,000


The First Printed Human Genome
Wellcome Collection, London


Next-Generation Sequencing in the Biomedical Genomics Core

1.2 terabytes

12 Human Genomes: 3.5 days

Data, Data, DATA…

10,000s of samples

Trillions of base pairs

Molecular diagnostics
Clinical treatment
Clinical outcomes

Translational Bioinformatics

5 billion sequence reactions

Illumina HiSeq 4000

Illumina MiSeq

The Solution: Churchill

The Problem:
• 2 days for raw data
• ~2 weeks for analysis


Parallelization Strategies

Chromosomal
[Figure: Region Size (Mbp), 0 to 250, by Chromosome (1-22, X, Y)]

Churchill: Balanced Subchromosomal
[Figure: Region Size (Mbp), 0 to 250, for chromosomal sub-regions distributed across CPUs (1-24)]
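
The contrast between the two strategies can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Churchill's actual region-selection algorithm: the function names `split_balanced` and `balance` are hypothetical, the chromosome lengths are made-up toy values, and `balance` is a simple load-balance ratio rather than the CPU utilization figures reported in the talk.

```python
def split_balanced(chrom_lengths, n_regions):
    """Split chromosomes into sub-regions of roughly equal size.

    Returns (chrom, start, end) tuples with 0-based, half-open coordinates.
    Regions never cross a chromosome boundary.
    """
    total = sum(chrom_lengths.values())
    target = total / n_regions  # ideal size of one region
    regions = []
    for chrom, length in chrom_lengths.items():
        # Number of pieces for this chromosome, at least one.
        pieces = max(1, round(length / target))
        for i in range(pieces):
            start = i * length // pieces
            end = (i + 1) * length // pieces
            regions.append((chrom, start, end))
    return regions


def balance(sizes):
    """Mean task size divided by the largest task size (1.0 = perfectly even)."""
    return sum(sizes) / (len(sizes) * max(sizes))


# Toy lengths in Mbp (hypothetical values, not real chromosome sizes).
toy_lengths = {"chrA": 240, "chrB": 120, "chrC": 60, "chrD": 60}

by_chromosome = list(toy_lengths.values())          # one task per chromosome
regions = split_balanced(toy_lengths, n_regions=8)  # balanced sub-regions
by_region = [end - start for _, start, end in regions]

print(balance(by_chromosome))
print(balance(by_region))
```

With one task per chromosome, every worker waits for the largest chromosome to finish, so the ratio drops as chromosome sizes diverge; equal-sized sub-regions keep all workers busy for about the same time.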


Churchill is the most efficient pipeline

Churchill: 92% utilization
HugeSeq: 46% utilization
GATK-Queue: 30% utilization


Churchill enables <2h genome analysis


Churchill has highest diagnostic effectiveness

Metric | Churchill | GATK-Q | HugeSeq
Validated SNP | 99.9% | 99.2% | 99.2%
% Known (dbSNP) | 94.6% | 93.0% | 92.5%
Error rate | 0.0012% | 0.0019% | 0.0034%
Youden Index | 99.66% | 98.96% | 98.65%

Validation using NIST GIAB

100% Reproducible and Deterministic regardless of platform or level of parallelism
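
For readers unfamiliar with the Youden Index reported above, it is defined as sensitivity + specificity - 1. The sketch below computes it from confusion-matrix counts; the function name and the counts are hypothetical and are not taken from the GIAB validation.

```python
def youden_index(tp, fn, tn, fp):
    """Youden's J statistic: sensitivity + specificity - 1."""
    sensitivity = tp / (tp + fn)  # fraction of true variants that were called
    specificity = tn / (tn + fp)  # fraction of non-variant sites left uncalled
    return sensitivity + specificity - 1.0


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
j = youden_index(tp=3, fn=1, tn=1, fp=1)
print(j)
```

A value of 1 means every variant is called with no false positives; a value of 0 means the caller does no better than chance.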


Population Scale Genomics: Proof of Concept

Jan 2008: Project begins – consortium of >400 scientists
Oct 2011: Phase 1 completed (1,092 individuals)
Apr 2013: Phase 3 sequencing completed
Sep 2014: Completion of Phase 3 analysis

26 Populations: 2,504 Individuals


First Challenge: Data Upload
• Data Upload Challenge:
– The final 1000 Genomes (1KG) data consist of 2,504 exomes and 2,504 whole genomes. The total size of the FASTQ dataset is 70 TB.
– How do you transfer 70 TB of data quickly for genomic analysis?
• Data Upload Solution:
– Co-locate cloud infrastructure. 1KG uses Amazon Simple Storage Service (S3) to store all data in the US East Region, so deploy the GenomeNext analysis platform in the same region.
– Automated data upload. Create genomic samples that reference the S3 object locations of the raw data and pass those locations to the compute clusters.
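
The idea of samples that reference S3 object locations, rather than copies of the data, can be sketched as follows. This is an illustration only: the bucket name, the key layout (`s3://<bucket>/<sample_id>/<file>`), the sample IDs, and the function name are all assumptions, not the GenomeNext implementation or the real 1KG bucket layout.

```python
from collections import defaultdict
from urllib.parse import urlparse


def samples_from_s3_uris(uris):
    """Group FASTQ object URIs by sample ID without touching the data itself."""
    samples = defaultdict(list)
    for uri in uris:
        parsed = urlparse(uri)
        if parsed.scheme != "s3":
            raise ValueError(f"not an S3 URI: {uri}")
        # Assumed layout: s3://<bucket>/<sample_id>/<file>
        sample_id = parsed.path.lstrip("/").split("/")[0]
        samples[sample_id].append(uri)
    return {sample: sorted(files) for sample, files in samples.items()}


# Hypothetical object locations.
uris = [
    "s3://example-bucket/sampleA/reads_2.fastq.gz",
    "s3://example-bucket/sampleA/reads_1.fastq.gz",
    "s3://example-bucket/sampleB/reads_1.fastq.gz",
]
manifest = samples_from_s3_uris(uris)
for sample, files in manifest.items():
    print(sample, files)
```

Only these short references need to be passed to the compute clusters; each cluster then reads its inputs directly from storage in the same region.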


Data Upload Solution


Second Challenge: Scale
• The population scale challenge: complete analysis of 5,008 genomic samples in 7 days

Clinical/Research Portal (SaaS)

US East Region

1000 Genomes Project Phase III Data

Automated Upload

Analysis Platform

Automation of AWS Infrastructure

GenomeNext Results

Access Results

Results added to GenomeNext Storage


Scale Solution

1,000 instances
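
A quick back-of-the-envelope calculation shows what these figures imply. The required throughput follows directly from the stated challenge (5,008 samples in 7 days). The instance-hours figure additionally assumes that all 1,000 instances ran for the full 7 days, which the slides do not state, so treat it as an upper bound.

```python
samples = 5008
days = 7
instances = 1000

hours = days * 24

# Throughput needed to meet the 7-day target.
required_rate = samples / hours

# Upper bound, ASSUMING every instance ran for the whole period.
instance_hours_per_sample = instances * hours / samples

print(f"{required_rate:.1f} samples/hour")
print(f"{instance_hours_per_sample:.1f} instance-hours/sample (upper bound)")
```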


7 DAYS: FASTQ to Results

r² = 0.9978, p-value < 2.2e-16

>75 million variants in common



@AbhiNellore from JHU at AWS WWPS DC Symposium

Joint with Leo Collado-Torres, Andrew Jaffe, Jamie Morton, Jacob Pritt, José Alquicira-Hernández, Jeff Leek, and Ben Langmead

Rail-RNA: analyzing many RNA sequencing samples with Amazon Elastic MapReduce


RNA and gene expression

Crick’s central dogma https://kaiserscience.wordpress.com/biology-the-living-environment/genetics/dna-translation-and-the-genetic-code/


RNA sequencing (RNA-seq)

Zhong Wang, Mark Gerstein & Michael Snyder. RNA-Seq: a revolutionary tool for transcriptomics, Nature Reviews Genetics (2009).


Growth of sequence read archives (ENA)
http://www.ebi.ac.uk/ena/about/statistics


Why reanalyze?

• Disagree with existing analyses
• Repurpose rich, hard-to-obtain data
• Ensure reproducibility
• Add power to experiment


Why use cloud computing?

• Elasticity
– Pay-as-you-go access to a computer cluster of the desired size
– Amazon S3 = effectively unlimited storage
• Reproducibility
– Same software on same hardware


Advantage of Amazon EMR: consistency
• Eventually consistent cloud storage may not give the right answer if it is read too soon after a write/update/delete
=> Data loss => Compromised reproducibility
• EMRFS imposes consistency with an Amazon DynamoDB table reflecting what S3 should look like
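
The principle can be shown with a toy simulation. Everything below is hypothetical and self-contained: `LaggyStore` stands in for an eventually consistent object store, a plain Python set stands in for the metadata table, and `consistent_list` retries a listing until it agrees with that metadata. It illustrates the idea only and is not how EMRFS is implemented.

```python
class LaggyStore:
    """Toy object store: a new key is hidden from the first `lag` listings after its write."""

    def __init__(self, lag):
        self.lag = lag
        self._pending = {}     # key -> listings remaining before it becomes visible
        self._visible = set()

    def put(self, key):
        self._pending[key] = self.lag

    def list_keys(self):
        for key in list(self._pending):
            if self._pending[key] == 0:
                self._visible.add(key)
                del self._pending[key]
            else:
                self._pending[key] -= 1
        return set(self._visible)


def consistent_list(store, expected_keys, max_retries=5):
    """Retry the listing until it contains every key recorded in the metadata."""
    for _ in range(max_retries + 1):
        listed = store.list_keys()
        if expected_keys <= listed:
            return listed
    raise RuntimeError("listing is still missing keys recorded in the metadata")


store = LaggyStore(lag=2)
metadata = set()  # stand-in for the table of what the store should contain

store.put("part-00000")
metadata.add("part-00000")

naive = store.list_keys()                   # read too soon: the key is missing
checked = consistent_list(store, metadata)  # retried until the key appears
print(naive, checked)
```

A job that trusted the naive listing would silently skip `part-00000`; checking against the metadata turns that silent loss into either a retry or an explicit error.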




Alignment
Sometimes, we’re lucky, and a read correctly aligns to the reference genome end to end.

chr1  …ATACATCAGACTAGACCGTACCACTCATAGACCTAGACCAGATACAG…
read         CAGACTAGACCGTACCACTCATAGACCTAGACCAGATAC
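
The end-to-end case can be reproduced with an exact substring search on the sequences from this slide. This is only a sketch: real aligners use a genome index and tolerate mismatches, whereas `str.find` here stands in for exact matching on a short reference fragment.

```python
# Reference fragment and read copied from the slide.
CHR1 = "ATACATCAGACTAGACCGTACCACTCATAGACCTAGACCAGATACAG"
READ = "CAGACTAGACCGTACCACTCATAGACCTAGACCAGATAC"

# Leftmost 0-based offset of an exact end-to-end match, or -1 if there is none.
offset = CHR1.find(READ)

print(offset)
print(CHR1)
print(" " * offset + READ)  # the read drawn under the reference at its offset
```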




Spliced alignment
But when introns are overlapped, we divide the read into readlets…

read  ATACATCAGACTAGACCGTACCACACAGCATGACAGTCATTCGACGTACT

[Figure: the read segmented into overlapping readlets]
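
Segmenting a read into overlapping readlets can be sketched as a sliding window. The window size of 25 and step of 4 are inferred from the example readlets on this slide and may not match Rail-RNA's settings in general; the shorter readlets shown at the ends of the read are left out for brevity, and the function name is hypothetical.

```python
def readlets(read, size=25, stride=4):
    """Return (offset, sequence) pairs for overlapping fixed-size windows."""
    return [
        (start, read[start:start + size])
        for start in range(0, len(read) - size + 1, stride)
    ]


# Read copied from the slide.
READ = "ATACATCAGACTAGACCGTACCACACAGCATGACAGTCATTCGACGTACT"

segments = readlets(READ)
for start, seq in segments:
    print(" " * start + seq)  # each readlet drawn at its offset in the read
```

Because neighbouring readlets overlap heavily, some readlets usually fall entirely within one exon even when the read as a whole spans an intron.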



Spliced alignment
…and align readlets to the genome to infer intron presence. Realignment of reads may be needed.

chr1 (contains the intron): …ATACATCAGACTAGACCGTACCACAGTAGTTCATGACCCTCAGCAGCATGACAGTCATTCGACGTACTCGTATCGATACAGTACAGTAGCC…
read 1: CATCAGACTAGACCGTACCACAGTACAGCATGACAGTCATTCGACGTACTCGT
read 2 (needs realignment to find intron): TACAGCATGACAGTCATTCGACGTACTCGTATCGATACAGTACAGTAG
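
A rough way to see how readlet alignments reveal an intron: align the first and last readlet of a read, then compare their distance on the genome with their distance in the read. The sketch below uses the sequences from this slide. It makes simplifying assumptions throughout: exact matching via `str.find` stands in for a real aligner, the readlet length of 15 is an illustrative choice, and the function name is hypothetical.

```python
# Sequences copied from the slide.
CHR1 = ("ATACATCAGACTAGACCGTACCACAGTAGTTCATGACCCTCAGCAGCATGACAGTCATTCG"
        "ACGTACTCGTATCGATACAGTACAGTAGCC")
READ_1 = "CATCAGACTAGACCGTACCACAGTACAGCATGACAGTCATTCGACGTACTCGT"
READ_2 = "TACAGCATGACAGTCATTCGACGTACTCGTATCGATACAGTACAGTAG"


def implied_gap(read, reference, readlet_len=15):
    """Extra reference bases between the first and last readlet of a read.

    Returns None if either readlet has no exact match in the reference.
    """
    first = reference.find(read[:readlet_len])
    last = reference.find(read[-readlet_len:])
    if first < 0 or last < 0:
        return None
    distance_in_read = len(read) - readlet_len
    return (last - first) - distance_in_read


print(implied_gap(READ_1, CHR1))  # a positive gap suggests a spliced-out intron
print(implied_gap(READ_2, CHR1))  # None: the first readlet straddles the junction
```

For read 1 the two readlets sit farther apart on chr1 than in the read, which points to an intron between them. For read 2 only a couple of bases lie before the junction, so its first readlet matches nowhere; realigning it against the junction found from read 1 is what recovers the intron.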




Rail-RNA analyzes many RNA-seq samples in the cloud with Amazon Elastic MapReduce.

rail-rna go elastic --manifest listOf100RNASeqFASTQs.tab --assembly hg19 --core-instance-count 50 --output s3://my-bucket/my-output

1 or 2 commands produce alignments, introns, indels, and coverage



Download at http://rail.bio


How to scale?

• Software requirements
– Gets faster with more workers in cluster
– Works on lots of data
• Solution: MapReduce
– Alternating aggregation and computation steps
– Aggregation: data are grouped and sorted
– Computation: each worker operates on subset of data


MapReduce reduces redundant analysis in Rail

1. Aggregate: group unique read sequences
2. Compute: segment unique sequences
3. Aggregate: group unique readlet sequences
4. Compute: align unique readlet sequences
5. Aggregate: group alignments of readlets from same read sequence
6. Compute: search for introns by unique read sequence
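
The way an aggregation step removes redundant work can be sketched in plain Python. This is a single-process illustration, not Rail-RNA's code: the function names are hypothetical, the reads are made up, and `len` stands in for an expensive per-sequence computation such as alignment.

```python
from collections import defaultdict


def group_by_sequence(reads):
    """Aggregate: map each unique sequence to the names of the reads that carry it."""
    groups = defaultdict(list)
    for name, seq in reads:
        groups[seq].append(name)
    return groups


def process_unique(groups, compute):
    """Compute: run `compute` once per unique sequence and share the result."""
    results = {}
    calls = 0
    for seq, names in sorted(groups.items()):
        value = compute(seq)
        calls += 1
        for name in names:
            results[name] = value
    return results, calls


# Made-up reads; two of them share the same sequence.
reads = [("r1", "ACGT"), ("r2", "ACGT"), ("r3", "TTGAC")]

groups = group_by_sequence(reads)
results, calls = process_unique(groups, compute=len)  # `len` is a stand-in
print(results, calls)
```

Three reads need only two computations here. Across many samples, where the same read and readlet sequences recur often, the saving grows with the amount of duplication.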




Scalability overview: speedup

[Figure: throughput vs. # computers; curves: ideal, in practice]



Scalability: speedup

[Figure: # samples / hour vs. # c3.2xlarge “spot” computers ($0.11/computer/hr)]

# computers | $/sample
10 | 1.19
20 | 1.46
40 | 1.58

On 28 GEUVADIS samples
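
One way to read the cost table is as a measure of parallel efficiency. The sketch below assumes that the billed cost is proportional to the number of computers times wall-clock time and that fixed costs are negligible; the slides do not state this, so the derived numbers are estimates. Under that assumption, perfectly linear scaling would keep $/sample constant as computers are added.

```python
# $/sample on 28 GEUVADIS samples, from the table above.
cost_per_sample = {10: 1.19, 20: 1.46, 40: 1.58}

baseline_n = 10
baseline_cost = cost_per_sample[baseline_n]

# Efficiency relative to the 10-computer run (1.0 = perfectly linear scaling).
efficiency = {n: baseline_cost / cost for n, cost in cost_per_sample.items()}

# Implied speedup: wall time is proportional to cost / computers under the assumption.
speedup = {n: (n / baseline_n) * efficiency[n] for n in cost_per_sample}

for n in sorted(cost_per_sample):
    print(n, round(efficiency[n], 2), round(speedup[n], 2))
```

Under these assumptions, doubling from 10 to 20 computers gives roughly a 1.6x speedup, and quadrupling to 40 gives roughly 3x, at efficiencies of about 82% and 75%.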



Scalability: improved throughput

[Figure: throughput vs. # samples; curves: naive, Rail-RNA]



Scalability: improved throughput

[Figure: # samples / hour vs. # GEUVADIS samples]

On 40 c3.2xlarge computers

# samples | $/sample
28 | 1.58
56 | 1.26
112 | 0.95


Milestones

• Analyzed GEUVADIS
– 666 samples in 12 hours on 1920 processing cores
– ~69 cents per sample
• Next: analyze all human Illumina RNA-seq on SRA (~20k samples)


Collaborators

Jacob Pritt, Leo Collado-Torres, Jamie Morton, Jeff Leek, Ben Langmead, Andrew Jaffe, José Alquicira-Hernández

Special thanks to Angel Pizarro

http://rail.bio


Thank You.

This presentation will be loaded to SlideShare the week following the Symposium.

http://www.slideshare.net/AmazonWebServices

