composable, petabyte-scale genomics workflows with docker and luigi

Upload: wadeschulz

Post on 05-Jul-2018


  • 8/16/2019 Composable, Petabyte-Scale Genomics Workflows with Docker and Luigi


    S L I D E 0

    Composable, Petabyte-Scale Genomics Workflows with Docker and Luigi

    Wade L. Schulz, MD, PhD, Henry M. Rinder, MD, Richard Torres, MD, MS, Alexa Siddon, MD

     Resident, Department of Laboratory Medicine, Yale School of Medicine

     Senior Solution Architect, Helix Data Sciences, Yale-New Haven Hospital 


    S L I D E 1

    Notice of Faculty Disclosure

    In accordance with ACCME guidelines, any individual in a position to influence and/or control the content of this ASCP CME activity has disclosed all relevant financial relationships within the past 12 months with commercial interests that provide products and/or services related to the content of this CME activity.

    The individual below has responded that he/she has no relevant financial relationship(s) with commercial interest(s) to disclose:

     Wade Schulz, MD, PhD


    S L I D E 2

    Composable, Petabyte-Scale Genomics Workflows with Docker and Luigi

    • Clinical question and background

    • Open/Big Data

    • Barriers to Large-Scale Genomics (Big Data) Analysis

    • Pipeline Improvements

    • Case Study with Performance Metrics


    S L I D E 3

    Clinical Question – Tumor Heterogeneity 

    Hypothesis: Patients with acute myeloid leukemia (AML) who present with multiple hematopoietic clones have a worse prognosis than patients with a single clone.

    http://commons.wikimedia.org/wiki/File:Treatment_bottleneck.pdf - Lcchong


    S L I D E 4

    Predictions of Clonal Populations

    • In vitro analysis by single-cell sequencing

    • In silico prediction based on clustering of variants by variant allele frequency

    Miller, C. A., White, B. S., Dees, N. D., et al. (2014). SciClone: Inferring Clonal Architecture and Tracking the Spatial and Temporal Patterns of Tumor Evolution. PLoS Computational Biology, 10(8), e1003665. doi:10.1371/journal.pcbi.1003665
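
    The in silico approach above can be illustrated with a toy example. The sketch below groups variants by variant allele frequency (VAF) using a plain 1-D k-means. This is a deliberately simplified stand-in and not the SciClone method cited on the slide. The read counts, the choice of k = 2, and the function names `vaf` and `kmeans_1d` are all hypothetical.

```python
"""Toy 1-D k-means over variant allele frequencies (VAFs)."""


def vaf(alt_reads, total_reads):
    """Fraction of reads at a site that support the variant allele."""
    return alt_reads / total_reads


def kmeans_1d(values, k, iterations=50):
    """Cluster scalar values into k groups; returns (centers, labels)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centers across the sorted data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each variant joins its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Update step: move each center to the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(sum(members) / len(members) if members else centers[c])
        if updated == centers:
            break
        centers = updated
    return centers, labels


# Hypothetical (alt_reads, total_reads) counts for six somatic variants.
counts = [(48, 100), (52, 100), (45, 100), (12, 100), (9, 100), (15, 100)]
vafs = [vaf(alt, total) for alt, total in counts]
centers, labels = kmeans_1d(vafs, k=2)

for value, label in zip(vafs, labels):
    print(f"VAF {value:.2f} -> cluster {label}")
```

    With these made-up counts the variants separate into a high-VAF group and a low-VAF group, which is the kind of signal used to suggest more than one clone. Real tools also have to account for copy number, tumor purity, and sequencing noise, none of which this sketch models.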


    S L I D E 5

    Impact of Molecular Heterogeneity 

    S L I D E 6

    Somatic Mutations in Cancer

    • How do we:

    – Increase our N

    – Increase the number of identified variants

    Grove, C. S., & Vassiliou, G. S. (2014). Acute myeloid leukaemia: a paradigm for the clonal evolution of cancer? Disease Models & Mechanisms, 7(8), 941–951.

     AML, Somatic, Whole Exome

    S L I D E 7

    Open Data

    • Clinical Trials

    • Primary Research Data

    • Government Data

    S L I D E 8

    The Cancer Genome Atlas (TCGA)

    • TCGA: comprehensive and coordinated effort to accelerate our understanding of the molecular basis of cancer through the application of genome analysis technologies

    • cgHub: Genomics data repository, contains >1.4 petabytes of data

    S L I D E 9

     Analysis Architecture

    S L I D E 10

    Barriers to High-Throughput Analysis

    • Manual Workflow/Fault Tolerance

    – Need to process 200 patients

    – Errors require manual restart

    • Bandwidth Throughput

    – Gigabit internet connectivity 

    – More limited (100 Mbit) when throttled

    • Drive Space

    – ~110 GB per WGS BAM file, paired tumor/normal for each patient

    • Processor/Memory Capacity 

    – Downstream applications lack parallelization

    – Unable to run multiple instances due to software design
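
    The drive-space and bandwidth figures above can be combined into a rough estimate of the scale involved. The sketch below assumes decimal units (1 GB = 1000 MB), a fully saturated link with no protocol overhead, and exactly two ~110 GB files for each of the 200 patients. Only the quoted numbers come from the slide; the rest is an idealization.

```python
PATIENTS = 200          # from the slide
BAMS_PER_PATIENT = 2    # paired tumor/normal
GB_PER_BAM = 110        # approximate WGS BAM size from the slide

total_gb = PATIENTS * BAMS_PER_PATIENT * GB_PER_BAM


def transfer_days(size_gb, link_mbit_per_s):
    """Days needed to move size_gb over a saturated link (decimal units)."""
    megabits = size_gb * 1000 * 8          # GB -> MB -> megabits
    seconds = megabits / link_mbit_per_s
    return seconds / 86400                 # seconds per day


gigabit_days = transfer_days(total_gb, 1000)    # gigabit connectivity
throttled_days = transfer_days(total_gb, 100)   # throttled to 100 Mbit

print(f"Total download: {total_gb / 1000:.0f} TB")
print(f"At 1 Gbit/s:   {gigabit_days:.1f} days")
print(f"At 100 Mbit/s: {throttled_days:.1f} days")
```

    Under these assumptions the full cohort is about 44 TB, which takes roughly 4 days to transfer at gigabit speed and roughly 41 days when throttled. That is why downloads need to overlap with processing rather than finish before it starts.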

    S L I D E 11

    Architectural Improvements

    • Workflow Creation

    – Oozie, Luigi

    • Containerization

    – Docker

    S L I D E 12

     Why (Luigi) Workflows?

    • Automation

    – Fault Tolerance

    • Improved Resource Utilization
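
    The fault-tolerance point can be made concrete with a small sketch of the pattern that Luigi-style workflows rely on: each task declares its dependencies and an output, and a task whose output already exists is skipped on restart. The code below is a standard-library imitation of that pattern, not Luigi itself. The method names mirror Luigi's `requires`/`output`/`run` convention, while the class names, file names, and the `build` helper are hypothetical.

```python
import os
import tempfile


class Task:
    """Minimal stand-in for a workflow task (NOT Luigi's real base class)."""

    def __init__(self, workdir, patient_id):
        self.workdir = workdir
        self.patient_id = patient_id

    def requires(self):
        return []

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError

    def complete(self):
        # A task counts as done when its output file exists.
        return os.path.exists(self.output())


class Download(Task):
    def output(self):
        return os.path.join(self.workdir, f"{self.patient_id}.bam")

    def run(self):
        # Placeholder for the real download step.
        with open(self.output(), "w") as handle:
            handle.write("placeholder BAM contents\n")


class CallVariants(Task):
    def requires(self):
        return [Download(self.workdir, self.patient_id)]

    def output(self):
        return os.path.join(self.workdir, f"{self.patient_id}.variants.txt")

    def run(self):
        # Placeholder for the real variant-calling step.
        with open(self.output(), "w") as handle:
            handle.write("placeholder variant calls\n")


def build(task, log):
    """Run dependencies first; skip any task whose output already exists."""
    if task.complete():
        log.append(f"skip {type(task).__name__}")
        return
    for dependency in task.requires():
        build(dependency, log)
    task.run()
    log.append(f"run {type(task).__name__}")


workdir = tempfile.mkdtemp()
first_log, second_log = [], []
build(CallVariants(workdir, "patient-001"), first_log)

# Simulate a failure that lost only the final output, then restart.
os.remove(CallVariants(workdir, "patient-001").output())
build(CallVariants(workdir, "patient-001"), second_log)

print(first_log)
print(second_log)
```

    On the restart only the variant-calling step runs again; the download is skipped because its output is still present. This replaces the manual restart described under the barriers. A production workflow engine would also write outputs atomically so that a half-written file is never mistaken for a finished one, which this sketch does not do.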

    S L I D E 13

    Architectural Improvements

    • Workflow Creation

    – Oozie, Luigi

    • Containerization

    – Docker

    S L I D E 14

    Containerization

    S L I D E 15

     Why Containers?

    • Standardization

    – Software Matched to OS Version

    • Isolation

    – Software Validation

    • Parallelization

    – Processor/Memory Capacity 

    • Clustering

    – Bandwidth Throughput
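
    The parallelization point can be sketched as follows: when a tool cannot run several instances side by side on one host, each instance can be started in its own container and pinned to its own core. The example below only builds the `docker run` command lines and does not execute them. The image name comes from the deck's final slide, and the `--rm`, `--name`, `--cpuset-cpus`, `--memory`, and `-v` flags are standard `docker run` options. The patient IDs, host paths, memory cap, and in-container tool arguments (`TOOL_ARGS`) are assumptions, since the slides do not describe the image's entrypoint.

```python
import shlex

IMAGE = "molecular/somaticsniper"  # image listed on the final slide

# ASSUMPTION: arguments passed through to the tool inside the container.
# The image's real entrypoint and expected arguments are not given in the deck.
TOOL_ARGS = ["-f", "/work/reference.fa", "/work/tumor.bam",
             "/work/normal.bam", "/work/somatic.txt"]


def docker_command(patient_id, cpu_index, data_dir="/data", memory="8g"):
    """Build one `docker run` invocation pinned to a single CPU core."""
    return [
        "docker", "run", "--rm",
        "--name", f"sniper-{patient_id}",
        "--cpuset-cpus", str(cpu_index),           # pin container to one core
        "--memory", memory,                        # cap memory per container
        "-v", f"{data_dir}/{patient_id}:/work",    # hypothetical host path
        IMAGE,
    ] + TOOL_ARGS


CORES = 4  # hypothetical number of cores available on the host
patients = ["patient-001", "patient-002", "patient-003"]  # hypothetical IDs
commands = [docker_command(pid, index % CORES)
            for index, pid in enumerate(patients)]

for command in commands:
    print(" ".join(shlex.quote(part) for part in command))
```

    To actually launch the containers, each command list could be passed to `subprocess.run` from a small thread pool, so that as many instances run at once as the host has cores and memory for.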


    S L I D E 16

    Docker Containers

    [Figure: side-by-side stack diagrams. Virtual Servers: Infrastructure; Hypervisor; OS 1, OS 2, OS 3; Libs for each OS; App 1, App 2A, App 2B. Containers: Infrastructure; Hypervisor; OS/Docker Engine; Libs; App 1, App 2A, App 2B.]


    S L I D E 17

    Updated Analysis Architecture


    S L I D E 18

    Pipeline Performance Comparison


    S L I D E 19

     Virtualized Performance Characteristics


    S L I D E 20

    Pipeline Performance Comparison

    [Figure: pipeline performance comparison chart with broken axes; labeled values: 10 days and 3 days]


    S L I D E 21

    Conclusions

    • Workflow automation can increase throughput

    – Fewer manual steps

    – Continual and immediate data processing

    • Containerization can improve throughput of large, computationally-intensive data sets

    – Applications that do not support parallelization

    – Applications with complex or unsupported dependencies


    S L I D E 22

    Future Directions

    • Can additional throughput improvements be made by clustering?

    – Deployment of containers through Docker Swarm

    – When deployed to our data science cluster, expected that 1 petabyte can be entirely processed in


    Questions?

    • Docker (v1.9.1, docker.com)

    • Luigi (v2.0.1, github.com/spotify/luigi)

    – https://hub.docker.com/r/wadeschulz/luigi

    • GeneTorrent (v3.8.7, cghub.ucsc.edu)

    – https://hub.docker.com/r/molecular/cghub

    – https://hub.docker.com/r/molecular/cgdownload

    • SomaticSniper (v1.0.5.0, gmt.genome.wustl.edu)

    – https://hub.docker.com/r/molecular/somaticsniper

    • http://wadeschulz.com/portfolio/api-2016