genome-scale big data pipelines
Dr. Denis Bauer & Lynn Langit Genomic-scale Data Pipelines

Upload: lynn-langit

Post on 21-Jan-2018

Category:

Science


TRANSCRIPT

Page 1: Genome-scale Big Data Pipelines

Dr. Denis Bauer & Lynn Langit

Genomic-scale Data Pipelines

Page 2: Genome-scale Big Data Pipelines

Denis Bauer, PhD

Oscar Luo, PhD

Rob Dunne, PhD

Piotr Szul

Team

Aidan O’Brien

Laurence Wilson, PhD

Adrian White

Andy Hindmarch

Collaborators

David Levy

News

Software

Dan Andrews

Kaitao Lai, PhD

Arash Bayat

John Hildebrandt

Mia Chapman

Ian Blair

Kelly Williams

Jules Damji

Gaetan Burgio

Lynn Langit

Natalie Twine, PhD

Prabha Pillay

Transformational Bioinformatics | Denis C. Bauer | @allPowerde

Transformational Bioinformatics Team

Page 3: Genome-scale Big Data Pipelines

Big Data in 2025…Petabytes?

[Bar chart, petabytes, axis 0 to 2500: Astronomy 1000; Twitter 17; YouTube 2000]

Page 4: Genome-scale Big Data Pipelines

GENOMIC Big Data in 2025 - Exabytes

[Bar chart, exabytes, axis 0 to 25: Astronomy 1; Twitter 0.17; YouTube 2; Genomic 20]

Page 5: Genome-scale Big Data Pipelines

Genome holds Blueprint for Every Cell

Page 6: Genome-scale Big Data Pipelines

Affects Looks, Disease Risk, and Behavior

Page 7: Genome-scale Big Data Pipelines

VCF Data
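The slide shows VCF data without describing the format, so here is a minimal sketch of how one VCF data line could be turned into per-sample numbers for downstream analysis. The record, the sample names S1 to S3, and the helper names are made up for illustration. Counting alternate alleles (0, 1 or 2) is a common encoding but is an assumption here, not something the slides specify.

```python
# Minimal sketch: parse one VCF data line into per-sample alt-allele counts.
# The header/record below are invented; real files also carry "##" meta lines.
VCF_HEADER = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3"
VCF_LINE = "1\t12345\trs0001\tA\tG\t50\tPASS\t.\tGT\t0/0\t0|1\t1/1"


def alt_allele_count(genotype):
    """Count non-reference alleles in a GT string such as '0/1' or '1|1'."""
    alleles = genotype.replace("|", "/").split("/")
    if "." in alleles:
        return None  # missing call
    return sum(1 for allele in alleles if allele != "0")


def parse_vcf_line(header, line):
    """Return (chrom, pos, ref, alt, {sample: alt-allele count})."""
    names = header.lstrip("#").split("\t")
    record = dict(zip(names, line.split("\t")))
    # FORMAT lists the per-sample subfields; GT is the genotype call.
    gt_index = record["FORMAT"].split(":").index("GT")
    counts = {
        sample: alt_allele_count(record[sample].split(":")[gt_index])
        for sample in names[9:]  # sample columns follow the 9 fixed columns
    }
    return record["CHROM"], int(record["POS"]), record["REF"], record["ALT"], counts


print(parse_vcf_line(VCF_HEADER, VCF_LINE))
```

Each variant becomes one feature with one value per sample, which is the "wide" layout (many variants, comparatively few samples) that the later slides refer to.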

Page 8: Genome-scale Big Data Pipelines

Genomic Research Workflow

https://www.projectmine.com/about/

BigData Focus

Page 9: Genome-scale Big Data Pipelines

Finding the Disease Gene(s)

Spot the letter that is…

• common amongst all affected

• absent in all unaffected*

* oversimplified

[Figure: cases and controls compared across Gene1 and Gene2]
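The "spot the letter" rule can be sketched directly. This follows the slide's own oversimplified version only: one letter shared by every case and seen in no control. The toy sequences and the function name are invented for illustration, and all sequences are assumed to be aligned and of equal length.

```python
def candidate_positions(cases, controls):
    """Return (position, letter) pairs where a single letter is shared by
    all cases and does not occur in any control at that position."""
    hits = []
    for pos in range(len(cases[0])):
        case_letters = {seq[pos] for seq in cases}
        control_letters = {seq[pos] for seq in controls}
        # Exactly one letter among the cases, and no control carries it.
        if len(case_letters) == 1 and not (case_letters & control_letters):
            hits.append((pos, next(iter(case_letters))))
    return hits


# Toy data, invented for this sketch.
cases = ["ACGTTA", "ACGTCA", "ACGTTA"]
controls = ["ACGATA", "ACGACA", "ACGATA"]
print(candidate_positions(cases, controls))  # [(3, 'T')]
```

Real association studies replace this all-or-nothing test with statistical or machine learning models, which is where the random forest slides come in.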

Page 10: Genome-scale Big Data Pipelines

BMC Genomics 2015, 16:1052 PMID: 26651996 (IF=4)

Cited: 4

Page 11: Genome-scale Big Data Pipelines

Why Apache Spark?

Page 12: Genome-scale Big Data Pipelines

Performance – Faster and More Accurate

VariantSpark is the only method to scale to 100% of the genome

[Chart axes: Accuracy (low to high) vs. Speed (low to high)]

Page 13: Genome-scale Big Data Pipelines

Cloud Data Pipeline Pattern

Business Problem

Data Quality

Candidate Technologies

Build/Test MVPs

Assemble Pipeline

Page 14: Genome-scale Big Data Pipelines

Building a Cloud Data Pipeline

Candidate Technologies

• Ingest/Clean

• Analyze/Predict

• Visualize

Build MVPs

• Test

• Iterate

• Learn

Assemble Pipeline

• Combine pieces

• Validate sections

• Test at scale

Page 15: Genome-scale Big Data Pipelines

Building a Cloud Data Pipeline

Spark

• IaaS, PaaS, SaaS Vendors

• AWS, Azure, GCP…

Page 16: Genome-scale Big Data Pipelines

Visualizing Machine Learning Results

Page 17: Genome-scale Big Data Pipelines

Solving Important Questions…Cancer genomics?

Page 18: Genome-scale Big Data Pipelines

DEMO: Who is a Bondi Hipster?

Page 19: Genome-scale Big Data Pipelines

Supervised ML: Wide Random Forests
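The slides do not show how VariantSpark builds its trees, so the following is only a generic sketch of the basic random forest building block on "wide" data (many more variables than samples): at each node, consider a random subset of features and keep the split with the largest Gini impurity decrease. The toy genotype matrix, labels, and function names are invented for illustration.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels, feature_ids):
    """Among feature_ids, return (feature, threshold, gain) with the largest
    Gini decrease; (None, None, 0.0) if no split improves purity."""
    parent = gini(labels)
    best = (None, None, 0.0)
    for f in feature_ids:
        # Every distinct value except the largest is a candidate threshold.
        for threshold in sorted({row[f] for row in rows})[:-1]:
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            gain = parent - weighted
            if gain > best[2]:
                best = (f, threshold, gain)
    return best


# Toy data: 4 samples x 6 variants (alt-allele counts), 1 = case, 0 = control.
rows = [
    [0, 1, 2, 0, 1, 0],
    [1, 1, 1, 1, 2, 0],
    [0, 2, 0, 1, 1, 0],
    [1, 0, 0, 2, 0, 0],
]
labels = [1, 1, 0, 0]

# Searching all features finds variant 2, which separates cases from controls.
print(best_split(rows, labels, range(6)))

# A forest would instead draw a fresh random feature subset at every node.
rng = random.Random(0)
subset = rng.sample(range(6), 3)
print(subset, best_split(rows, labels, subset))
```

With tens of millions of variants, the cost of evaluating candidate splits across features dominates, which is why distributing that work matters.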

Page 20: Genome-scale Big Data Pipelines

Scaling to 50 M variables and 10 K samples

100K trees: 5 – 50 h

AWS: ~$215.50

100K trees: 200 – 2000 h

AWS: ~$8620.00

• Yarn Cluster

• 12 workers

• 16 x Intel CPUs

• Xeon [email protected]

• 128 GB RAM

• Spark 1.6.1

• 128 executors

• 6 GB / executor (0.75 TB total)

• Synthetic dataset

Whole Genome Range

GWAS Range
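The figures on this slide can be cross-checked with a few lines of arithmetic. The hourly rate below is not stated on the slide. It is derived here under the assumption that each AWS cost corresponds to the upper bound of its runtime range.

```python
# Cluster memory: 128 executors at 6 GB each.
EXECUTORS = 128
GB_PER_EXECUTOR = 6
total_memory_tb = EXECUTORS * GB_PER_EXECUTOR / 1024  # 768 GB

# Implied hourly rate, ASSUMING each cost refers to the upper runtime bound.
rate_short_run = 215.50 / 50     # USD per hour for the 5-50 h range
rate_long_run = 8620.00 / 2000   # USD per hour for the 200-2000 h range

print(f"total executor memory: {total_memory_tb} TB")
print(f"implied rate: {rate_short_run:.2f} and {rate_long_run:.2f} USD/h")
```

Both cost figures imply the same rate of about 4.31 USD per hour, so under that assumption the 40-fold cost difference comes entirely from the 40-fold longer runtime.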

Page 21: Genome-scale Big Data Pipelines

Future Directions for VariantSpark RF

Mixed feature types

Unordered Categorical

Continuous

Build Community

Python API

Non-Genomic Demos

Implementation by

Page 22: Genome-scale Big Data Pipelines

Try it out: VariantSpark Notebook

https://docs.databricks.com/spark/latest/training/variant-spark.html

Page 23: Genome-scale Big Data Pipelines

Genome Editing can correct genetic diseases, e.g. hypertrophic cardiomyopathy

“Editing does not work every time, e.g. only 7 in 10 embryos were mutation free.”

Aim: Develop a computational guidance framework to enable edits that work the first time, every time

Ma et al. Nature 2017 *

* Controversy around the paper – stay tuned

Page 24: Genome-scale Big Data Pipelines

Make Process Parallel and Scalable

SPEED

• Each search can be broken down into parallel tasks - each takes seconds

SCALE

• Researchers might want to search the target for one gene or 100,000

Scalability + Agility =
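A minimal sketch of the "break the search into parallel tasks" idea, using local threads as a stand-in for independent cloud function invocations. The slides do not give GT-Scan2's actual search logic. The pattern used here (20 nt followed by an NGG PAM on the forward strand) is the usual SpCas9 convention and is an assumption, as are all function names.

```python
import re
from concurrent.futures import ThreadPoolExecutor

SITE_LEN = 23  # 20-nt protospacer + 3-nt NGG PAM (assumed SpCas9 convention)
# Lookahead so overlapping candidate sites are all reported; forward strand only.
SITE_RE = re.compile(r"(?=([ACGT]{21}GG))")


def scan_chunk(task):
    """One independent task: find candidate sites in a single chunk."""
    offset, chunk = task
    return [(offset + m.start(), m.group(1)) for m in SITE_RE.finditer(chunk)]


def split_with_overlap(sequence, chunk_size):
    """Yield (offset, chunk) pairs. Each chunk is extended by SITE_LEN - 1 so a
    site starting near the chunk end is still complete, and no site is
    reported by two chunks."""
    for start in range(0, len(sequence), chunk_size):
        yield start, sequence[start:start + chunk_size + SITE_LEN - 1]


def find_sites(sequence, chunk_size=50, workers=4):
    """Fan the chunks out to a worker pool and merge the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(scan_chunk, split_with_overlap(sequence, chunk_size)))
    return sorted(site for part in parts for site in part)


demo = "A" * 20 + "TGG" + "C" * 40  # invented sequence with one candidate site
print(find_sites(demo, chunk_size=10))
```

Because every chunk is scanned independently, the same split works whether the request covers one gene or many. Only the number of tasks changes.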

Page 25: Genome-scale Big Data Pipelines

One of the first Serverless Applications in Research

Featured in

Page 26: Genome-scale Big Data Pipelines
Page 27: Genome-scale Big Data Pipelines

X-Ray Tracing Demo of GT-Scan2

• Find performance bottlenecks

• Fix and test

Webapp

Resources (S3, DynamoDB)

Lambda
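The "find performance bottlenecks" step can be illustrated without any AWS dependency. The sketch below is a plain-Python stand-in for per-function tracing, not the AWS X-Ray SDK. The decorator, the two example steps, and their names are hypothetical.

```python
import time
from collections import defaultdict
from functools import wraps

timings = defaultdict(list)  # function name -> list of runtimes in seconds


def traced(func):
    """Record the wall-clock runtime of every call to func."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            timings[func.__name__].append(time.perf_counter() - start)
    return wrapper


@traced
def fast_step():
    return sum(range(1000))


@traced
def slow_step():
    time.sleep(0.05)  # simulated bottleneck
    return "done"


def slowest():
    """Name of the function with the largest total recorded runtime."""
    return max(timings, key=lambda name: sum(timings[name]))


fast_step()
slow_step()
print(slowest())
```

Once the slowest function is known, it can be fixed and the same measurement repeated to confirm the gain, which is the "fix and test" loop on the slide.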

Page 28: Genome-scale Big Data Pipelines

GTScan2 X-Ray Analysis

[Bar chart: runtime (s) per function, axis ticks 25, 50, 75; bars grouped by Type (base, old). Functions: getFastaSequence, createJob, targetScan, offtargetScanStarter, offtargetSearch, targetIntersects, targetTranscriptionIntersects, targetWuScorer, targetSgRNAScorer, OnTargetScorer, genomeCRISPR]

Page 29: Genome-scale Big Data Pipelines

Results – 4x Faster (80% improvement)

2 min

30 sec
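A quick check of the headline numbers, using only the two rounded runtimes shown on the slide. The underlying measurements are not given, so this is arithmetic on the rounded values rather than a reproduction of the result.

```python
before_s = 2 * 60  # 2 min
after_s = 30       # 30 sec

speedup = before_s / after_s        # how many times faster
reduction = 1 - after_s / before_s  # fraction of runtime removed

print(f"{speedup:.0f}x faster, {reduction:.0%} less runtime")
```

With these rounded inputs the speedup is 4x and the runtime reduction is 75%. The 80% on the slide presumably reflects unrounded timings.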

Page 30: Genome-scale Big Data Pipelines

Considering Services for GT-Scan2

• Use AWS Step Functions

• Simplify workflow

• Simplify task timeouts

• Simplify task failures

• Must evaluate costs

• SNS vs. Step Functions
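To make "simplify workflow, timeouts and failures" concrete, here is a hedged sketch of a Step Functions state machine definition in Amazon States Language, built as a Python dictionary. The state names reuse function names from the X-Ray chart, but their order and wiring are assumptions. The ARNs are placeholders, and the timeout and retry values are arbitrary.

```python
import json


def lambda_task(function_name, next_state=None):
    """Build one Task state with a timeout, a retry policy and a failure route."""
    state = {
        "Type": "Task",
        # Placeholder ARN, not a real resource.
        "Resource": f"arn:aws:lambda:REGION:ACCOUNT_ID:function:{function_name}",
        "TimeoutSeconds": 60,
        "Retry": [{
            "ErrorEquals": ["States.Timeout", "States.TaskFailed"],
            "IntervalSeconds": 2,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,
        }],
        "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "JobFailed"}],
    }
    if next_state:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state


definition = {
    "Comment": "Illustrative workflow only; the wiring is assumed",
    "StartAt": "createJob",
    "States": {
        "createJob": lambda_task("createJob", "targetScan"),
        "targetScan": lambda_task("targetScan", "offtargetSearch"),
        "offtargetSearch": lambda_task("offtargetSearch"),
        "JobFailed": {"Type": "Fail", "Error": "JobFailed", "Cause": "A task failed after retries"},
    },
}

print(json.dumps(definition, indent=2))
```

Timeouts and retries are declared per state rather than coded into each function, which is the simplification the slide refers to. The cost comparison with SNS-based chaining still has to be made separately.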

Page 31: Genome-scale Big Data Pipelines

Cloud Data Pipeline Pattern

Problem Data Technologies MVPs Pipeline

Search GTScan2

fastq, bed -> S3, NoSQL

Ingest, ETL, Analyze, Viz

S3, Lambda, Lambda/API Gateway

Serverless

Page 32: Genome-scale Big Data Pipelines

Serverless Pipeline Pattern

Lambda function

1

Lambda function

2

Lambda function

3

buckets with objects DynamoDB

API Gateway Users

Step Functions
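The pattern in this diagram can be sketched as three handlers passing a small event between them. The `(event, context)` signature is the standard shape of a Python Lambda handler. Everything else is a stand-in: plain dictionaries play the roles of the S3 bucket and the DynamoDB table, a local function plays the role of Step Functions, and the handler names and the toy "GG" scan are hypothetical.

```python
import json

BUCKET = {}  # stand-in for S3 objects: key -> content
TABLE = {}   # stand-in for DynamoDB items: job id -> attributes


def create_job(event, context=None):
    """Lambda function 1: store the submitted input and register a job."""
    job_id = f"job-{len(TABLE) + 1}"
    BUCKET[f"{job_id}/input.txt"] = event["sequence"]
    TABLE[job_id] = {"status": "CREATED"}
    return {"job_id": job_id}


def scan_targets(event, context=None):
    """Lambda function 2: read the input object and record toy hits."""
    job_id = event["job_id"]
    sequence = BUCKET[f"{job_id}/input.txt"]
    hits = [i for i in range(len(sequence) - 1) if sequence[i:i + 2] == "GG"]
    TABLE[job_id].update(status="SCANNED", hits=hits)
    return {"job_id": job_id}


def get_results(event, context=None):
    """Lambda function 3: return the job item as an HTTP-style response."""
    item = TABLE[event["job_id"]]
    return {"statusCode": 200, "body": json.dumps(item)}


def run_pipeline(request):
    """Stand-in for the orchestrator: pass each output to the next step."""
    state = create_job(request)
    state = scan_targets(state)
    return get_results(state)


print(run_pipeline({"sequence": "ACGGTGG"}))
```

Each function keeps no state of its own. Everything durable lives in the storage services, which is what lets the functions scale out independently.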

Page 33: Genome-scale Big Data Pipelines

Cloud Data Pipeline Pattern

Problem Data Technologies MVPs Pipeline

Analyze GWAS

vcf -> S3/Spark

Ingest, ETL, Analyze, Viz

S3 -> Databricks DBFS; Apache Spark; Variant-Spark ML; Notebook, SQL, R, Python

Spark Server Cluster

Page 34: Genome-scale Big Data Pipelines

Spark Server Cluster Pipeline Pattern

Jupyter Notebook

Page 35: Genome-scale Big Data Pipelines

Cloud Genomic-Scale Data Pipelines

• Problem #1 – ML on Large Data

• Solution: Spark-server cluster + custom machine learning

• Problem #2 – Burstable Search

• Solution: Serverless pipeline

Page 36: Genome-scale Big Data Pipelines

Genomic-scale Data Pipelines

Dr. Denis Bauer & Lynn Langit