Broad Institute Workbench a Cloud-Based Platform for Data Analysis, Management, and Sharing www.genomics.broadinstitute.org [email protected]


Post on 04-Jun-2020


Page 1: Management, and Sharing a Cloud-Based Platform for Data Analysis,genomics.broadinstitute.org/data-sheets/PPT_Workbench... · 2017-12-15 · a Cloud-Based Platform for Data Analysis,

Broad Institute Workbench

a Cloud-Based Platform for Data Analysis, Management, and Sharing

[email protected]


Topics

● Overview

● Scientific Use Scenarios

● Cancer Genome Analysis

● Medical and Population Genetics

● US Precision Medicine Initiative

● Software Overview

● Takeaway and Questions

Overview

David Siedzik

Chief Product Owner

Broad Institute Genomics Data Sciences Platform

A story about genomics data generation


The Widening Gulf


The day we ran out of compute

Genomic Data Generation Increase


Genome Processing at the Broad...


...Produces a lot of data...

● A single human genome is approximately 120 GB

● We are sequencing genomes at a rate of 1 every 12 minutes

● That’s ~1 Petabyte every 2 months

● Earlier this year we started filling up storage and maxing out compute.
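The "~1 Petabyte every 2 months" figure follows from the two numbers above it. A quick back-of-the-envelope check, assuming "2 months" means 60 days and decimal units (1 PB = 1,000,000 GB); both are assumptions, not stated on the slide:

```python
# Back-of-the-envelope check of the data rate quoted on the slide.
GENOME_SIZE_GB = 120       # from the slide: one human genome is ~120 GB
MINUTES_PER_GENOME = 12    # from the slide: one genome every 12 minutes
DAYS = 60                  # assumption: "2 months" taken as 60 days
GB_PER_PB = 1_000_000      # assumption: decimal units (1 PB = 10^6 GB)

# Number of genomes sequenced in the period, then total volume in PB.
genomes = DAYS * 24 * 60 // MINUTES_PER_GENOME
total_pb = genomes * GENOME_SIZE_GB / GB_PER_PB

print(f"{genomes} genomes, about {total_pb:.2f} PB in {DAYS} days")
```

Under these assumptions the total comes out a little under 1 PB, which is consistent with the rounded figure on the slide.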


...And it’s hard to store and use that data locally...

● Too many copies of the same dataset littered across the file system

● Hard to track and control data access

● Local file storage is expensive and difficult to maintain at petabyte scale

● Not enough compute to go around


The Opportunity

These challenges focused us on a bold goal:

To develop systems and tools that not only keep up with Broad’s own needs but also increase the availability of genomic resources across the global scientific community, both in data generation and in the ability to analyze, manage, and, most importantly, share data.

In keeping with our mission, we brought together two core groups (our Sequencing Platform and Data Sciences Platform) to ensure seamless integration of best-in-class sequencing products and open-source software resources for the global bioinformatics community.

Opportunity: Expand Access to Computing At Scale

Broad does a lot of sequencing...

Broad has sequenced and hosts genomes/exomes of > 250K individuals to date.

● Public clouds allow scientists to rent compute time vs. relying upon an institution’s cluster

● Broad uses more than 100 pipelines: how do we make those available to run consistently in our internal and external community?

● Platforms built on cloud infrastructure can reduce the need to hire a bioinformatician and buy hardware

● Computing in the cloud can yield both speed and cost advantages

[Chart labels: 250,000; 65,000]

Opportunity: Science Is Ever More Collaborative

The Exome Aggregation Consortium (ExAC)

The Exome Aggregation Consortium II (ExACII)

GnomAD

Variant frequency across 60K exomes -> tremendous value for clinical interpretation of rare disease genomes

Collaborations (since 2012):

● 1753 sample collections

● 913 distinct projects

● 5679 orders

● 150 PIs at any time


Opportunity: Bring Researchers to the Data

The First Step: NCI Cloud Pilot

● NCI identified the importance of these three opportunities and funded three software platform development pilots

● Each system aimed to simplify TCGA data access in a secure and scalable cloud-based offering that brings the analysis to a single copy of the data, enabling access to protected TCGA data for those with dbGaP approval

● The Broad Institute Workbench framework was initially developed to power Broad’s NCI Cloud Pilot, FireCloud

Broad Institute Workbench - In Concept

A cloud-based data management and analysis platform for scalable, collaborative research

● One copy of the data. Datasets can be stored in one place and shared with collaborators -> eliminating duplication in data sharing

● A community-driven “App Store” for methods and best practices pipelines.

● Infinite compute resource, as needed. Workflows can scale to use significant computing power as needed, which yields a reduction in compute time and cost compared to on-premises computation

● Access by web browser or API. Addresses the needs of computational scientists/software engineers as well as those who are less technical

http://www.firecloud.org


Broad Institute Workbench - In Reality

Cancer Genomics Analysis on FireCloud

Chet Birger

Getz Lab @ Broad Institute


Scientific Goal: Comprehensive Catalogue of Genes Responsible for Cancer Initiation and Progression

● Foundational for cancer diagnostics, therapeutics, clinical trial design, and selection of rational combination therapies for individual patients

● Guides therapeutic development by identifying dysregulated pathways and druggable targets

Catalogue of Cancer Genes


Unbiased identification of genes harbouring somatic genetic variations at a statistically significant rate or pattern in cancer

● MutSig tool suite for SNPs and INDELs

● GISTIC for CNVs

● Bayesian Nonnegative Matrix Factorization for mutational signal discovery

● GSEA, PARADIGM to identify dysregulated pathways

● and a lot more!

Computational Methods


● Requires large international cohorts to achieve goal

● Several large international projects working to compile this catalogue: e.g., TCGA, ICGC and PCAWG to name a few.

● Getz Lab is a key contributor to these projects

○ Tools and analytical pipelines

○ Sequencing (Broad Genomics)

○ Computational analysis and interpretation

International Effort

TCGA Produced Large Amounts of Data

Complete characterization of ~35 adult cancers

● ~20 common cancers at 500 cases each

● ~15 rare cancers at 50-150 cases each

~11,000 cases

~2.5 PB data, originally stored in CGHub and DCC, now in the GDC

The size of the data set and compute capacity required to work on it makes access and analysis difficult for any but the best-resourced institutions.
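The TCGA figures above can be cross-checked with simple arithmetic. This is only a rough sketch: it treats the approximate counts on the slide as exact and assumes decimal units (1 PB = 1,000,000 GB).

```python
# Rough consistency check of the TCGA figures quoted on the slide.
common_cases = 20 * 500                    # ~20 common cancers at 500 cases each
rare_low, rare_high = 15 * 50, 15 * 150    # ~15 rare cancers at 50-150 cases each

low, high = common_cases + rare_low, common_cases + rare_high
print(f"expected cases: {low} to {high}")

# Average footprint per case, assuming decimal units (1 PB = 10^6 GB).
gb_per_case = 2.5 * 1_000_000 / 11_000
print(f"about {gb_per_case:.0f} GB per case")
```

The quoted total of ~11,000 cases falls inside the computed range, and the average works out to a couple of hundred gigabytes per case.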


International Cancer Genome Consortium


Genomic Data Distribution is a Challenge

[Figure labels: 0.8 PB; 5 PB]

What size cohorts do we need?

For 90% power to detect 90% of cancer genes with frequency ≥ 2%, need ~2000 samples per tumor type.

50 tumor types x 2000 = 100,000 pairs

Lawrence et al., Discovery and saturation analysis of cancer genes across 21 tumour types, Nature, January 2014


We need large datasets with genomic and clinical data to obtain sufficient power to detect/learn:

1. Complete catalog of cancer genes and pathways (>2% of patients) (1000s / tumor type)

2. Explain >95% of tumor types and subtypes (1000s / tumor type)

3. Mutational Signatures (100s - 1000s / tumor type)

4. Germline risk alleles (10,000s / tumor type)

5. Biomarkers for response (100s to 1000s / tumor type / drug)

Preparing for a lot more data

Moving to the Cloud

● 2009: Getz Lab began development of Firehose (FireCloud’s on-premises precursor) as a computational platform for TCGA data

● The size of the data sets and computational needs have grown dramatically

● On-premises Firehose not capable of supporting current and future research

● Moving to the cloud with its near limitless storage and elastic compute

● Necessity of migration to cloud coincided with NCI Cancer Genomics Cloud Pilot Project - The Broad Institute Workbench grew from this initial funding


● Virtually all of Getz Lab’s efforts at fully characterizing the somatic variations driving cancer are done in the context of large international projects

● Many smaller projects also collaborative efforts

● Support for collaborative science one of FireCloud’s principal design goals

● Achieved through FireCloud’s workspace-centric design

Supporting Collaborative Science

Medical and Population Genetics

Alisa Manning

Diabetes Research Group

Broad Institute

Large-Scale Statistical Analysis of Whole Genome Sequence Data with Hail and the Broad Institute Workbench

http://hail.is/

Hail is an open-source framework for scalable genetic data analysis...

What is Hail?

● Hail is a scalable, reliable framework and a powerful language for genetic data analysis

● Open source, under active development, widespread adoption at Broad

● Leveraging open-source big-data tools

● Innovating to solve the unique problems of genetics

Hail web site: http://hail.is

Hail code: github.com/broadinstitute/hail

Contact: [email protected]

NHLBI’s Trans-Omics for Precision Medicine (TOPMed)

• N = 18,877 samples (FHS, OOA, JHS)

• Variants called with `vt` (Tan A et al. Bioinformatics. 2015)

• ~193,000,000 variants (passing, biallelic sites)

Hail commands for complex QC and variant annotation (2 - 12 hours depending on the number of cores)

hail read -i file:///mnt/geno/nhlbi.1575.sftp-exchange-area.keep.freeze3a.pass.gtonly.minDP10.genotypes.vds \
    filtervariants expr -c 'va.pass' --keep \
    filtervariants expr -c 'v.contig == "X" || v.contig == "Y" || v.contig == "MT"' --remove \
    …
    variantqc filtervariants expr -c 'va.qc.AC > 0' --keep \
    annotatevariants intervals -r va.isLCF -i file:///mnt/lustre/aganna/LCR.interval_list \
    annotatevariants expr -c 'va.badpHWE = va.annot.pHWE_Amish <= 0.000000001 || va.annot.pHWE_FHS <= 0.000000001 || va.annot.pHWE_JHS <= 0.000000001' \
    ...
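The visible filter steps in the command above can be read as a chain of simple predicates. The following is a plain-Python illustration of that logic only; it is not the Hail API, it skips the elided steps, and the dictionary layout and sample records are invented for the example.

```python
# Plain-Python illustration of the filter logic; this is NOT the Hail API.
# Field names mirror the annotations in the command; the records are invented.
SEX_AND_MT_CONTIGS = {"X", "Y", "MT"}
PHWE_THRESHOLD = 1e-9  # written as 0.000000001 in the command


def keep_variant(variant):
    """Apply the three visible filtervariants steps in order."""
    if not variant["pass"]:                        # 'va.pass' --keep
        return False
    if variant["contig"] in SEX_AND_MT_CONTIGS:    # contig X/Y/MT --remove
        return False
    return variant["AC"] > 0                       # 'va.qc.AC > 0' --keep


def bad_phwe(variant):
    """Flag a variant if any cohort's HWE p-value is at or below the threshold."""
    return any(p <= PHWE_THRESHOLD for p in variant["pHWE"].values())


variants = [
    {"id": "v1", "pass": True, "contig": "1", "AC": 12,
     "pHWE": {"Amish": 0.4, "FHS": 0.2, "JHS": 0.9}},
    {"id": "v2", "pass": True, "contig": "X", "AC": 3,
     "pHWE": {"Amish": 0.5, "FHS": 0.5, "JHS": 0.5}},
    {"id": "v3", "pass": True, "contig": "2", "AC": 5,
     "pHWE": {"Amish": 1e-12, "FHS": 0.3, "JHS": 0.7}},
    {"id": "v4", "pass": False, "contig": "3", "AC": 1,
     "pHWE": {"Amish": 0.1, "FHS": 0.1, "JHS": 0.1}},
]

kept = [v for v in variants if keep_variant(v)]
flagged = [v["id"] for v in kept if bad_phwe(v)]
print([v["id"] for v in kept], flagged)
```

Note that in the command the pHWE step only annotates variants (as va.badpHWE) rather than removing them, which is why the sketch reports flagged variants separately from the kept list.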

Pilot Analysis in Workbench

The genetic architecture of type 2 diabetes. Nature. 2016 Aug 4;536(7614):41–7.

Workbench Schematic

[Schematic: Workspaces (Summary, Data, Analysis, Methods, Monitor); Method Repository; Google Cloud Storage]

Workspaces

The Workspace links methods and analysis to data

Data Model

The Data Model links sample IDs to VCF files, trait files, and other inputs

EPACTs in Workbench

We implemented single-variant association analysis with EPACTs.

Method Customization

Methods allow you to specify the input and output for your pipeline

Method Provenance

Logs, input, and output files are linked to the execution of a method

Method Results

Data model links methods, analysis to results

Scaling Statistical Genetics

We are...

• Implementing our standard analysis pipelines in the Workbench

• Creating new methods for whole genome sequence data

• Developing our pipelines with state of the art computing paradigms

• Looking for broader engagement from you!

http://hail.is/

Hail is an open-source framework for scalable genetic data analysis...

Precision Medicine Initiative

Kristian Cibulskis

Engineering Director

Broad Data Sciences Platform


US Health - Early 1900s

1948: Launch of Framingham Heart Study

Dawber TR, Meadors GF, Moore FEJ: Epidemiological approaches to heart disease: the Framingham Study. Am J Public Health 1951, 41:279-286.

Dawber TR, Kannel WB, Revotskie N, Stokes JI, Kagan A, Gordon T: Some factors associated with the development of coronary heart disease; six years' follow-up experience in the Framingham Study. Am J Public Health 1959, 49:1349-1356.


1961: Early Biomarkers - “Factors of Risk”


Decline in Heart Disease Mortality

Source: CDC Morbidity and Mortality Weekly Report (MMWR)

Framingham for the 21st Century

The Precision Medicine Initiative aims to be a “Framingham for the 21st Century”

It will be the largest medical scientific study in the history of the world


January 2015: State of the Union

September 2015: PMI Working Group report

● 1 million or more participants

● Longitudinal, ability to recontact

● Focus on engagement

● Two methods of enrollment

○ Healthcare provider organizations

○ Direct volunteers

All of Us

Precision Medicine Initiative Cohort Program

now known as

All of Us Research Program

joinallofus.org

Building a Research Foundation For 21st Century Medicine


PMI Organization and Data Flow

Data & Research Center

Biobank


PMI Data & Research Center (DRC)

Mission

● To acquire, organize and provide access to what will be one of the world’s largest and most diverse datasets for precision medicine research

● Provide research support for the scientific data and analysis tools for the program, helping to build a vibrant community of researchers

Data & Research Center (DRC)

Acquire & Organize

Research Support


Data & Research Center (DRC)

Broad Institute Workbench


The Workbench Vision


Plans for launch and beyond

● PMI Cohort Program anticipates 3–4 years to reach one million participants

● Phased implementation as we pilot, iterate, and scale

● Initial releases will focus on data collection and portals

Workbench Deep Dive

Alex Baumann

Product Owner

Broad Institute Workbench

A cloud-based data management and analysis platform for scalable, collaborative research

● One copy of the data. Datasets can be stored in one place and shared with collaborators -> eliminating duplication in data sharing

● A community-driven “App Store” for methods and best practices pipelines

● Infinite compute resource, as needed. Workflows can scale to use significant computing power as needed, which yields a reduction in compute time and cost compared to on-premises computation

● Access by web browser or API. Addresses the needs of computational scientists/software engineers as well as those who are less technical

http://www.firecloud.org


...To the Cloud!

WDL: an open source language for computational biologists to express analytical pipelines

Cromwell: an open source, scalable, robust engine for interpreting and executing a WDL using various backends

Workbench: Several services, all open source

Google Genomics Pipelines API: co-developed by Broad and Google Genomics, a scalable Docker-as-a-Service data scheduler


Introducing Workbench

Within the workbench you can access your workspaces and browse methods


What Are Workspaces?

Datasets are stored and analyses are run within Workspaces, so you can:

● Organize: Workspaces contain datasets and analyses that can be run on these datasets

● Track: Workspaces retain the history of all analyses that have been run to support reproducibility and traceability

● Collaborate: Workspaces can be shared with others as Readers (view-only), Writers (run analyses and modify data), and Owners (modify but also delete and share)
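The three sharing levels above can be pictured as an ordered set of roles. The sketch below is a minimal illustration, assuming the levels are cumulative (an Owner can do everything a Writer can, and so on); the action names are hypothetical and this is not how the platform itself implements access control.

```python
from enum import IntEnum


class Role(IntEnum):
    """Workspace access levels, ordered from least to most privileged."""
    READER = 1   # view-only
    WRITER = 2   # run analyses and modify data
    OWNER = 3    # modify, delete, and share


# Minimum role required for each action (action names are hypothetical).
REQUIRED_ROLE = {
    "view": Role.READER,
    "run_analysis": Role.WRITER,
    "modify_data": Role.WRITER,
    "delete": Role.OWNER,
    "share": Role.OWNER,
}


def can(role, action):
    """Return True if the given role is allowed to perform the action."""
    return role >= REQUIRED_ROLE[action]


print(can(Role.READER, "run_analysis"))
print(can(Role.WRITER, "modify_data"))
print(can(Role.WRITER, "share"))
```

Ordering the roles keeps the check to a single comparison; a role system that is not strictly cumulative would need an explicit set of permissions per role instead.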


Inside a Workspace

The Summary tab supports sharing, accessing the bucket, and various metadata


Workbench Features

● Domain-specific layer above Google Cloud Platform - data accessible and usable outside of Workbench via Google APIs, with other clouds coming soon

● Designed for scalability of data and analyses

● TCGA data (both open and controlled access) is available in Workbench; Broad data delivery will be within workspaces; and we will be hosting other large public cancer data sets (e.g., TARGET, CCLE)

● Broad’s best practice pipelines are going into the methods repo

● Data model supports easily running methods at scale and maintaining organized data files and metadata


Google Cloud Storage

Data is stored within buckets, which are accessible via Google console (or gsutil)


Sharing a Workspace

A workspace can be shared with other users as Owners, Writers and Readers


Workspace Data Model

● Incorporates TCGA data model with Participants, Samples, Pairs, and sets

● Metadata can be constant values, such as the number 50, or data file URLs

● Allows you to organize multiple datasets that all use one copy of the data

● Metadata can be used as inputs to analyses and updated from outputs of analyses

● Outputs can be written back to data model and used in downstream analyses
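The data model described above can be sketched with a small entity type. This is an illustrative structure only: the class, the attribute names, and the bucket paths are assumptions made for the example, not the platform's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """One row in the data model: a participant, sample, pair, or set."""
    entity_type: str
    name: str
    attributes: dict = field(default_factory=dict)


# Attributes can be constants or file URLs (the bucket paths are made up).
sample = Entity("sample", "S1", {
    "coverage": 50,
    "bam": "gs://example-bucket/S1.bam",
})
normal = Entity("sample", "S2", {"bam": "gs://example-bucket/S2.bam"})

# A pair references two samples by name rather than copying their files,
# so several datasets can share one copy of the data.
pair = Entity("pair", "P1", {"case_sample": "S1", "control_sample": "S2"})
sample_set = Entity("sample_set", "cohort_A", {"samples": ["S1", "S2"]})

# Writing an analysis output back makes it available to downstream analyses.
sample.attributes["vcf"] = "gs://example-bucket/outputs/S1.vcf"

print(sorted(sample.attributes))
```

Because pairs and sets hold references instead of files, adding a new cohort means adding a few small records rather than duplicating data in the bucket.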


Organizing Data

The Data tab organizes data files and other metadata around higher level concepts such as participants, samples, pairs of samples, and sets


IGV Integration

The Integrative Genomics Viewer can be used to visualize genomics datasets using the data model and files within buckets


Launching Methods

Analyses can be run upon entities in the data model, gathering all data files and metadata from attributes of entities
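One way to picture "gathering all data files and metadata from attributes of entities" is a mapping from method inputs to entity attributes that is resolved once per entity. The sketch below is hypothetical: the `this.<attribute>` notation, the function name, and the sample records are assumptions for illustration, not a description of the platform's internals.

```python
# Hypothetical sketch: map method inputs to entity attributes, then build
# one job per entity. The "this.<attribute>" notation is an assumption.
def resolve_inputs(input_map, attributes):
    """Replace each 'this.<attr>' reference with the entity's attribute value."""
    resolved = {}
    for input_name, expression in input_map.items():
        if expression.startswith("this."):
            resolved[input_name] = attributes[expression[len("this."):]]
        else:
            resolved[input_name] = expression  # literal constant
    return resolved


samples = {
    "S1": {"bam": "gs://example-bucket/S1.bam", "coverage": 50},
    "S2": {"bam": "gs://example-bucket/S2.bam", "coverage": 30},
}
input_map = {"input_bam": "this.bam", "min_depth": "this.coverage", "mode": "fast"}

# One resolved input set per sample; each could be submitted as its own job.
jobs = {name: resolve_inputs(input_map, attrs) for name, attrs in samples.items()}
print(jobs["S1"])
```

Keeping the mapping separate from the entities means the same method can be launched on one sample or on every member of a set without editing the method itself.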


Monitoring Analyses

Analyses can be viewed as they progress, and all are kept for historical reasons


Methods Repository

● Stores workflows and tasks you can reuse and share publicly or privately

● Method tools are packaged as Docker images to ensure tool portability

● Methods are versioned to support reproducibility and reusability

● Many Broad best practice public methods available and more in the works

● We plan to provide and consume methods via the GA4GH Tool Registry API
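The link between versioning and reproducibility can be shown with a toy registry in which publishing never overwrites an earlier version. This is a minimal in-memory sketch; the class, method names, and image tags are invented for illustration and do not reflect the repository's real API.

```python
class MethodRepository:
    """Toy in-memory registry: every publish creates a new immutable version."""

    def __init__(self):
        self._versions = {}  # method name -> list of (docker_image, workflow_text)

    def publish(self, name, docker_image, workflow_text):
        """Store a new version and return its 1-based version number."""
        self._versions.setdefault(name, []).append((docker_image, workflow_text))
        return len(self._versions[name])

    def get(self, name, version):
        """Fetch an exact version so a past analysis can be re-run unchanged."""
        return self._versions[name][version - 1]


repo = MethodRepository()
v1 = repo.publish("align", "example/aligner:1.0", "workflow text, version 1")
v2 = repo.publish("align", "example/aligner:1.1", "workflow text, version 2")

print(v1, v2)
print(repo.get("align", 1)[0])
```

Pinning both the workflow text and the Docker image tag in each version is what lets an old result be regenerated with the same tools that produced it.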


Supporting Curation

Methods show their version, creator, documentation and other metadata, and we plan to add ratings, comments and other tools to aid in community curation


Architecture

FireCloud API & Web Portal

Workbench Service

Cromwell (WDL Execution)

Methods Repository

gsutil

Google IDs for Authentication

Google Cloud Storage

All Services

Google Compute Engine

CloudSQL for RDBMS

Cloud Monitoring for Operations

Google Genomics Pipelines API

What you need to know

David Siedzik

Chief Product Owner


How to Learn More

● Available now at www.firecloud.org and the APIs are at api.firecloud.org

● Post questions and comments on our forum!

● All of our tools are open source, and we encourage software collaborators and feature requests https://github.com/broadinstitute

● Alexander Baumann will be available to answer your questions or show demos at Meet the Expert (Booth 329) Friday from 1-2 pm

An open, unlimited, and fast future

To the Cloud!

Acknowledgements - The Team

Analysis
Alisa Manning ([email protected])
Cotton Seed
Seung Hoan Choi
Pradeep Natarajan
Maryam Zekavat

Links
FireCloud: www.firecloud.org
Workflow Definition Language (WDL): https://software.broadinstitute.org/wdl/

Broad Institute Data Science and Data Engineering
Genomic Platform Cancer Program

National Cancer Institute
National Institutes of Health

PIs
Gad Getz
Anthony Philippakis
