Genomics at Scale | AWS Public Sector Summit 2016


Page 1: Genomics at Scale | AWS Public Sector Summit 2016

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Ben Langmead, PhD, Johns Hopkins
Angel Pizarro, AWS Scientific Computing

June 20, 2016

Genomics at Scale
Using the AWS Cloud for Population-Scale Analysis of Genomics and Life Science Data

Page 2: Genomics at Scale | AWS Public Sector Summit 2016

Agenda

• Overview of Amazon Elastic MapReduce (Amazon EMR)
• Review of Rail-RNA
• More EMR for Science!
• Q&A

Page 3: Genomics at Scale | AWS Public Sector Summit 2016

Challenges with in-house infrastructure

• Fixed cost
• Slow deployment cycle
• Always on
• Static: not scalable
• Outages impact production
• Storage and compute tied together, not self-serve

Page 4: Genomics at Scale | AWS Public Sector Summit 2016

Compute and storage grow together

Tightly coupled

Storage grows along with compute
Compute requirements vary

Page 5: Genomics at Scale | AWS Public Sector Summit 2016

Underutilized or scarce resources

[Chart: provisioned capacity vs. actual load over time. A steady-state baseline with weekly peaks and occasional reprocessing spikes leaves most of the provisioned capacity underutilized between peaks]

Page 6: Genomics at Scale | AWS Public Sector Summit 2016

Amazon EMR

Page 7: Genomics at Scale | AWS Public Sector Summit 2016

Amazon EMR

• Managed Apache Hadoop platform
• MapReduce, Apache Spark, Presto
• Launch a cluster in minutes
• Open source distribution and MapR distribution
• Leverage the elasticity of the cloud
• Baked-in security features
• Pay by the hour and save with Spot

Page 8: Genomics at Scale | AWS Public Sector Summit 2016

Why Amazon EMR?

Easy to use: Launch a cluster in minutes

Low cost: Pay an hourly rate

Elastic: Easily add or remove capacity

Reliable: Spend less time monitoring

Secure: Manage firewalls

Flexible: Customize the cluster

Page 9: Genomics at Scale | AWS Public Sector Summit 2016

Decouple storage and compute

Page 10: Genomics at Scale | AWS Public Sector Summit 2016

Amazon S3 is your persistent data store

• 11 9’s of durability
• $0.03 / GB / month in US East
• Lifecycle policies
• Versioning
• Distributed by default
• EMRFS

Amazon S3

Page 11: Genomics at Scale | AWS Public Sector Summit 2016

Why is Amazon S3 good for Genomics Data?

• No limit on the number of objects
• Object size up to 5 TB
• Pay only for exactly what you use
• Very high bandwidth
• Durable
• Fine-grained and time-bounded security
• Supports versioning and lifecycle policies
• Storage tiers for better cost, based on access patterns

Page 12: Genomics at Scale | AWS Public Sector Summit 2016

Amazon EMR File System (EMRFS)

• Allows you to leverage Amazon S3 as a file system
• Streams data directly from Amazon S3
• Uses HDFS for intermediates
• Better read/write performance and error handling than open source components
• Consistent view: consistency for read after write
• Support for encryption
• Fast listing of objects

Page 13: Genomics at Scale | AWS Public Sector Summit 2016

Going from HDFS to Amazon S3

CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
LOCATION 'samples/pig-apache/input/'

Page 14: Genomics at Scale | AWS Public Sector Summit 2016

Going from HDFS to Amazon S3

CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
LOCATION 's3://elasticmapreduce.samples/pig-apache/input/'
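The same one-line change applies outside Hive. As a minimal sketch (assuming a spark-shell session on EMR where sc is the SparkContext), a Spark job moves from HDFS to S3 by swapping the URI scheme, and EMRFS streams the S3 objects directly:

// Only the URI scheme changes; EMRFS streams the S3 copy directly,
// so no copy-to-HDFS step is needed before processing.
val fromHdfs = sc.textFile("hdfs:///samples/pig-apache/input/")
val fromS3   = sc.textFile("s3://elasticmapreduce.samples/pig-apache/input/")
println(fromS3.count() + " lines read straight from S3")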

Page 15: Genomics at Scale | AWS Public Sector Summit 2016

Decoupled compute and storage in practice

Amazon S3 as the central, persistent data store, feeding:

• Analyst clusters running Zeppelin (m2 and r3 instance types)
• Transient clusters running single data cleanup or ETL jobs at off-peak times (Spot Instances)
• An Amazon Redshift data warehouse
• A machine learning cluster running R (c1 and c3 instance types)

Page 16: Genomics at Scale | AWS Public Sector Summit 2016

http://www.langmead-lab.org

http://rail.bio

Page 17: Genomics at Scale | AWS Public Sector Summit 2016
Page 18: Genomics at Scale | AWS Public Sector Summit 2016
Page 19: Genomics at Scale | AWS Public Sector Summit 2016

[Chart: worldwide sequencing output in terabases, approaching 1 Pbp, growing with an 18-month doubling time]

Page 20: Genomics at Scale | AWS Public Sector Summit 2016

[The same sequencing-growth chart (terabases, 1 Pbp, 18-month doubling time), annotated with "Spot"]

Page 21: Genomics at Scale | AWS Public Sector Summit 2016

[Diagram: DNA vs. RNA]

Page 22: Genomics at Scale | AWS Public Sector Summit 2016
Page 23: Genomics at Scale | AWS Public Sector Summit 2016

rail-rna go elastic \
  --manifest URLsOf500Samples.txt \
  --assembly hg38 \
  --output s3://your-bucket/output_folder \
  --core-instance-type c3.2xlarge \
  --core-instance-count 20

Callouts on the command above:

• Input: --manifest
• Species (reference assembly): --assembly
• Output: --output
• Instance type: --core-instance-type
• Instance count: --core-instance-count

Page 24: Genomics at Scale | AWS Public Sector Summit 2016
Page 25: Genomics at Scale | AWS Public Sector Summit 2016
Page 26: Genomics at Scale | AWS Public Sector Summit 2016

http://docs.rail.bio/dbgap/

NIH has security requirements and recommendations for analyzing “controlled access” genomic data.

These protect the privacy of the research subjects, and are particularly concerned with data where sensitive phenotypes (e.g., disease status) could ultimately be linked to a subject’s identity.

Page 27: Genomics at Scale | AWS Public Sector Summit 2016
Page 28: Genomics at Scale | AWS Public Sector Summit 2016
Page 29: Genomics at Scale | AWS Public Sector Summit 2016

Detailed instructions on how to run your own dbGaP-compliant EMR app: docs.rail.bio/dbgap

Page 30: Genomics at Scale | AWS Public Sector Summit 2016
Page 31: Genomics at Scale | AWS Public Sector Summit 2016
Page 32: Genomics at Scale | AWS Public Sector Summit 2016

Why is horizontal scale important?

Page 33: Genomics at Scale | AWS Public Sector Summit 2016

A variant-calling pipeline

Stages are written separately
Handoff between stages is through files
Everyone has their own “flavor” of pipeline

Page 34: Genomics at Scale | AWS Public Sector Summit 2016

Parallelization in the cloud

Page 35: Genomics at Scale | AWS Public Sector Summit 2016

Lingua franca: file formats

• .bam files define a custom .bai index format
• User-defined attributes
• Typically in coordinate-sorted order

Page 36: Genomics at Scale | AWS Public Sector Summit 2016

Why are we managing file handles and spilling reads to disk inside our bioinformatics methods? (This is taken from the Picard library.)

Where is “The Platform”?

Page 37: Genomics at Scale | AWS Public Sector Summit 2016

Things fall apart when our computation changes

Page 38: Genomics at Scale | AWS Public Sector Summit 2016

Flat files are a blocker to population-scale genomics

Page 39: Genomics at Scale | AWS Public Sector Summit 2016

[Diagram: a Spark DAG of RDDs A through F, linked by map, join, filter, and groupBy transformations across Stages 1, 2, and 3, with cached partitions marked]

• A fast and general engine for large-scale data processing

• Massively parallel

• Uses DAGs instead of map-reduce for execution

• Minimizes I/O by storing data in Resilient Distributed Datasets (RDD) in memory

• Partitioning-aware to avoid network-intensive shuffle
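A minimal sketch of how such a DAG is expressed (assuming a spark-shell session where sc is the SparkContext; the S3 paths and the tab-separated record layout are hypothetical):

val a = sc.textFile("s3://your-bucket/input-a/")              // RDD A
val b = a.map(line => (line.split("\t")(0), line))            // map: key by first field -> B
val c = sc.textFile("s3://your-bucket/input-c/")              // RDD C
val d = c.filter(line => line.nonEmpty)                       // filter -> D
val e = d.map(line => (line.split("\t")(0), 1)).groupByKey()  // groupBy -> E
val f = b.join(e)                                             // join -> F
f.cache()        // keep F's partitions in memory for reuse
f.count()        // nothing executes until this action; Spark plans the whole DAG,
                 // then schedules stages around the shuffles

Because Spark sees the whole graph before running anything, it can pipeline the map and filter steps within a stage and only shuffle at the groupBy and join boundaries.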

Page 40: Genomics at Scale | AWS Public Sector Summit 2016

Bioinformaticians ❤️ probabilistic models

Many bioinformatics methods are just large sums
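For instance, a log-likelihood over millions of independent reads is literally a sum of per-read terms, which maps directly onto Spark's map and reduce. A toy sketch (spark-shell session assumed; the input path and the stand-in scoring model are hypothetical, not part of any real pipeline):

// Stand-in model term: log-probability of a read under a fixed per-base error rate.
def readLogProb(read: String, errorRate: Double = 0.01): Double =
  read.length * math.log(1 - errorRate)

val reads = sc.textFile("s3://your-bucket/sample-reads.txt")
val logLikelihood = reads.map(r => readLogProb(r)).reduce(_ + _)  // one big, parallel sum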

Page 41: Genomics at Scale | AWS Public Sector Summit 2016

Spark + Genomics = ADAM

• Hosted at Berkeley and the AMPLab
• Apache 2 License
• Contributors from both research and commercial organizations
• Core spatial primitives, variant calling
• Avro and Parquet for data models and file formats

Page 42: Genomics at Scale | AWS Public Sector Summit 2016

What is ADAM?

ADAM is a genomics analysis platform with specialized file formats.
Built using Apache Avro, Apache Spark, and Parquet.
GitHub repository: https://github.com/bigdatagenomics/adam
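As a sketch of what that looks like from Spark (method names follow the 2016-era ADAM releases, whose exact API shifted between versions; the path is hypothetical):

// ADAMContext adds genomics load methods onto the SparkContext via implicits.
import org.bdgenomics.adam.rdd.ADAMContext._

val alignments = sc.loadAlignments("s3://your-bucket/sample.alignments.adam")
// alignments wraps an RDD of AlignmentRecord objects backed by Parquet,
// so downstream code gets typed records instead of parsed flat files.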

Page 43: Genomics at Scale | AWS Public Sector Summit 2016

adam-submit

Page 44: Genomics at Scale | AWS Public Sector Summit 2016

Process VCF from 1000 Genomes Public Data Set

• VCF files are located in the public S3 bucket s3.amazonaws.com/1000genomes
• Use vcf2adam to convert a single VCF into multiple ADAM files (gzipped Apache Parquet):

$ adam-submit vcf2adam <vcf file on HDFS> <target HDFS folder>

Example: a single VCF file generates more than 690 gz.parquet files

Page 45: Genomics at Scale | AWS Public Sector Summit 2016

Process VCF files from 1000 Genomes (cont.)

Use Scala to query the genome data in the ADAM Parquet files

…
val gnomeDF = sqlContext.read.parquet("/user/hadoop/adamfiles/part-r-00000.gz.parquet")
gnomeDF.printSchema()
gnomeDF.registerTempTable("gnome")
val gnome_data = sqlContext.sql("select count(*) from gnome")
gnome_data.show()
…
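Since gnomeDF is an ordinary DataFrame, registering a temp table is optional; the same count can be taken directly through the DataFrame API:

val n = gnomeDF.count()   // same result as the SQL count(*) above
println(n + " genotype records in this shard")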

Page 46: Genomics at Scale | AWS Public Sector Summit 2016