Genomics at Scale | AWS Public Sector Summit 2016


Page 1: Genomics at Scale | AWS Public Sector Summit 2016

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Ben Langmead, PhD, Johns Hopkins
Angel Pizarro, AWS Scientific Computing

June 20, 2016

Genomics at Scale
Using the AWS Cloud for Population-Scale Analysis of Genomics and Life Science Data

Page 2: Genomics at Scale | AWS Public Sector Summit 2016

Agenda

• Overview of Amazon Elastic MapReduce (Amazon EMR)
• Review of Rail-RNA
• More EMR for Science!
• Q&A

Page 3: Genomics at Scale | AWS Public Sector Summit 2016

Challenges with in-house infrastructure

• Fixed cost
• Slow deployment cycle
• Always on
• Static: not scalable
• Outages impact production
• Storage and compute tied together, not self-serve

Page 4: Genomics at Scale | AWS Public Sector Summit 2016

Compute and storage grow together

Tightly coupled

Storage grows along with compute
Compute requirements vary

Page 5: Genomics at Scale | AWS Public Sector Summit 2016

Underutilized or scarce resources

[Chart: provisioned capacity vs. actual load over time. A steady-state baseline with weekly peaks and occasional reprocessing spikes leaves most of the provisioned capacity underutilized between peaks]

Page 6: Genomics at Scale | AWS Public Sector Summit 2016

Amazon EMR

Page 7: Genomics at Scale | AWS Public Sector Summit 2016

Amazon EMR

• Managed Apache Hadoop platform
• MapReduce, Apache Spark, Presto
• Launch a cluster in minutes
• Open source distribution and MapR distribution
• Leverage the elasticity of the cloud
• Baked-in security features
• Pay by the hour and save with Spot

Page 8: Genomics at Scale | AWS Public Sector Summit 2016

Why Amazon EMR?

Easy to use: Launch a cluster in minutes

Low cost: Pay an hourly rate

Elastic: Easily add or remove capacity

Reliable: Spend less time monitoring

Secure: Manage firewalls

Flexible: Customize the cluster

Page 9: Genomics at Scale | AWS Public Sector Summit 2016

Decouple storage and compute

Page 10: Genomics at Scale | AWS Public Sector Summit 2016

Amazon S3 is your persistent data store

• 11 9’s of durability
• $0.03 / GB / month in US East
• Lifecycle policies
• Versioning
• Distributed by default
• EMRFS

Amazon S3

Page 11: Genomics at Scale | AWS Public Sector Summit 2016

Why is Amazon S3 good for Genomics Data?

• No limit on the number of objects
• Object size up to 5 TB
• Pay only for exactly what you use
• Very high bandwidth
• Durable
• Fine-grained and time-bounded security
• Supports versioning and lifecycle policies
• Storage tiers for better cost, based on access patterns

Page 12: Genomics at Scale | AWS Public Sector Summit 2016

Amazon EMR File System (EMRFS)

• Allows you to leverage Amazon S3 as a file system
• Streams data directly from Amazon S3
• Uses HDFS for intermediates
• Better read/write performance and error handling than open source components
• Consistent view: consistency for read after write
• Support for encryption
• Fast listing of objects

Page 13: Genomics at Scale | AWS Public Sector Summit 2016

Going from HDFS to Amazon S3

CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
LOCATION 'samples/pig-apache/input/'

Page 14: Genomics at Scale | AWS Public Sector Summit 2016

Going from HDFS to Amazon S3

CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
LOCATION 's3://elasticmapreduce.samples/pig-apache/input/'
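The same one-line change applies outside Hive. As a minimal sketch (assuming a spark-shell session on EMR where sc is the SparkContext), a Spark job moves from HDFS to S3 by swapping the URI scheme, and EMRFS streams the S3 objects directly:

// Only the URI scheme changes; EMRFS streams the S3 copy directly,
// so no copy-to-HDFS step is needed before processing.
val fromHdfs = sc.textFile("hdfs:///samples/pig-apache/input/")
val fromS3   = sc.textFile("s3://elasticmapreduce.samples/pig-apache/input/")
println(fromS3.count() + " lines read straight from S3")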

Page 15: Genomics at Scale | AWS Public Sector Summit 2016

Decoupled compute and storage in practice

Amazon S3 as the central, persistent data store, feeding:

• Analyst clusters running Zeppelin (m2 and r3 instance types)
• Transient clusters running single data cleanup or ETL jobs at off-peak times (Spot Instances)
• An Amazon Redshift data warehouse
• A machine learning cluster running R (c1 and c3 instance types)

Page 16: Genomics at Scale | AWS Public Sector Summit 2016

http://www.langmead-lab.org

http://rail.bio

Page 17: Genomics at Scale | AWS Public Sector Summit 2016
Page 18: Genomics at Scale | AWS Public Sector Summit 2016
Page 19: Genomics at Scale | AWS Public Sector Summit 2016

[Chart: worldwide sequencing output in terabases, approaching 1 Pbp, growing with an 18-month doubling time]

Page 20: Genomics at Scale | AWS Public Sector Summit 2016

[The same sequencing-growth chart (terabases, 1 Pbp, 18-month doubling time), annotated with "Spot"]

Page 21: Genomics at Scale | AWS Public Sector Summit 2016

[Diagram: DNA vs. RNA]

Page 22: Genomics at Scale | AWS Public Sector Summit 2016
Page 23: Genomics at Scale | AWS Public Sector Summit 2016

rail-rna go elastic \
  --manifest URLsOf500Samples.txt \
  --assembly hg38 \
  --output s3://your-bucket/output_folder \
  --core-instance-type c3.2xlarge \
  --core-instance-count 20

Callouts on the command above:

• Input: --manifest
• Species (reference assembly): --assembly
• Output: --output
• Instance type: --core-instance-type
• Instance count: --core-instance-count

Page 24: Genomics at Scale | AWS Public Sector Summit 2016
Page 25: Genomics at Scale | AWS Public Sector Summit 2016
Page 26: Genomics at Scale | AWS Public Sector Summit 2016

http://docs.rail.bio/dbgap/

NIH has security requirements and recommendations for analyzing “controlled access” genomic data.

These protect the privacy of the research subjects, and are particularly concerned with data where sensitive phenotypes (e.g., disease status) could ultimately be linked to a subject’s identity.

Page 27: Genomics at Scale | AWS Public Sector Summit 2016
Page 28: Genomics at Scale | AWS Public Sector Summit 2016
Page 29: Genomics at Scale | AWS Public Sector Summit 2016

Detailed instructions on how to run your own dbGaP-compliant EMR app: docs.rail.bio/dbgap

Page 30: Genomics at Scale | AWS Public Sector Summit 2016
Page 31: Genomics at Scale | AWS Public Sector Summit 2016
Page 32: Genomics at Scale | AWS Public Sector Summit 2016

Why is horizontal scale important?

Page 33: Genomics at Scale | AWS Public Sector Summit 2016

A variant-calling pipeline

Stages are written separately
Handoff between stages is through files
Everyone has their own “flavor” of pipeline

Page 34: Genomics at Scale | AWS Public Sector Summit 2016

Parallelization in the cloud

Page 35: Genomics at Scale | AWS Public Sector Summit 2016

Lingua franca: file formats

• .bam files define a custom .bai index format
• User-defined attributes
• Typically in coordinate-sorted order

Page 36: Genomics at Scale | AWS Public Sector Summit 2016

Why are we managing file handles and spilling reads to disk inside our bioinformatics methods? (This is taken from the Picard library.)

Where is “The Platform”?

Page 37: Genomics at Scale | AWS Public Sector Summit 2016

Things fall apart when our computation changes

Page 38: Genomics at Scale | AWS Public Sector Summit 2016

Flat files are a blocker to population-scale genomics

Page 39: Genomics at Scale | AWS Public Sector Summit 2016

[Diagram: a Spark DAG of RDDs A through F, linked by map, join, filter, and groupBy transformations across Stages 1, 2, and 3, with cached partitions marked]

• A fast and general engine for large-scale data processing

• Massively parallel

• Uses DAGs instead of map-reduce for execution

• Minimizes I/O by storing data in Resilient Distributed Datasets (RDD) in memory

• Partitioning-aware to avoid network-intensive shuffle
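A minimal sketch of how such a DAG is expressed (assuming a spark-shell session where sc is the SparkContext; the S3 paths and the tab-separated record layout are hypothetical):

val a = sc.textFile("s3://your-bucket/input-a/")              // RDD A
val b = a.map(line => (line.split("\t")(0), line))            // map: key by first field -> B
val c = sc.textFile("s3://your-bucket/input-c/")              // RDD C
val d = c.filter(line => line.nonEmpty)                       // filter -> D
val e = d.map(line => (line.split("\t")(0), 1)).groupByKey()  // groupBy -> E
val f = b.join(e)                                             // join -> F
f.cache()        // keep F's partitions in memory for reuse
f.count()        // nothing executes until this action; Spark plans the whole DAG,
                 // then schedules stages around the shuffles

Because Spark sees the whole graph before running anything, it can pipeline the map and filter steps within a stage and only shuffle at the groupBy and join boundaries.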

Page 40: Genomics at Scale | AWS Public Sector Summit 2016

Bioinformaticians ❤️ probabilistic models

Many bioinformatics methods are just large sums
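For instance, a log-likelihood over millions of independent reads is literally a sum of per-read terms, which maps directly onto Spark's map and reduce. A toy sketch (spark-shell session assumed; the input path and the stand-in scoring model are hypothetical, not part of any real pipeline):

// Stand-in model term: log-probability of a read under a fixed per-base error rate.
def readLogProb(read: String, errorRate: Double = 0.01): Double =
  read.length * math.log(1 - errorRate)

val reads = sc.textFile("s3://your-bucket/sample-reads.txt")
val logLikelihood = reads.map(r => readLogProb(r)).reduce(_ + _)  // one big, parallel sum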

Page 41: Genomics at Scale | AWS Public Sector Summit 2016

Spark + Genomics = ADAM

• Hosted at Berkeley and the AMPLab
• Apache 2 License
• Contributors from both research and commercial organizations
• Core spatial primitives, variant calling
• Avro and Parquet for data models and file formats

Page 42: Genomics at Scale | AWS Public Sector Summit 2016

What is ADAM?

ADAM is a genomics analysis platform with specialized file formats.
Built using Apache Avro, Apache Spark, and Parquet.
GitHub repository: https://github.com/bigdatagenomics/adam
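As a sketch of what that looks like from Spark (method names follow the 2016-era ADAM releases, whose exact API shifted between versions; the path is hypothetical):

// ADAMContext adds genomics load methods onto the SparkContext via implicits.
import org.bdgenomics.adam.rdd.ADAMContext._

val alignments = sc.loadAlignments("s3://your-bucket/sample.alignments.adam")
// alignments wraps an RDD of AlignmentRecord objects backed by Parquet,
// so downstream code gets typed records instead of parsed flat files.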

Page 43: Genomics at Scale | AWS Public Sector Summit 2016

adam-submit

Page 44: Genomics at Scale | AWS Public Sector Summit 2016

Process VCF from 1000 Genomes Public Data Set

• VCF files are located in the public S3 bucket s3.amazonaws.com/1000genomes
• Use vcf2adam to convert a single VCF into multiple ADAM files (gzipped Apache Parquet):

$ adam-submit vcf2adam <vcf file on HDFS> <target HDFS folder>

Example: a single VCF file generates more than 690 gz.parquet files

Page 45: Genomics at Scale | AWS Public Sector Summit 2016

Process VCF files from 1000 Genomes (cont.)

Use Scala to query the genome data in the ADAM Parquet files

…
val gnomeDF = sqlContext.read.parquet("/user/hadoop/adamfiles/part-r-00000.gz.parquet")
gnomeDF.printSchema()
gnomeDF.registerTempTable("gnome")
val gnome_data = sqlContext.sql("select count(*) from gnome")
gnome_data.show()
…
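Since gnomeDF is an ordinary DataFrame, registering a temp table is optional; the same count can be taken directly through the DataFrame API:

val n = gnomeDF.count()   // same result as the SQL count(*) above
println(n + " genotype records in this shard")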

Page 46: Genomics at Scale | AWS Public Sector Summit 2016