Transcript
Page 1: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

Science as a Service

Ian Foster, The University of Chicago and Argonne National Laboratory

November 14, 2013

Page 2:

A time of disruptive change

Page 4:

Most labs have limited resources

Heidorn: NSF grants in 2007. Grants under $350,000: 80% of awards, 50% of grant dollars.

[Chart: vertical axis from $1,000 to $1,000,000; horizontal axis from 2000 to 8000]

Page 6:

Automation is required to apply more sophisticated methods to far more data

Outsourcing is needed to achieve economies of scale in the use of automated methods

Page 7:

Building a discovery cloud
• Identify time-consuming activities amenable to automation and outsourcing
• Implement as high-quality, low-touch SaaS
• Leverage IaaS for reliability, economies of scale
• Extract common elements as research automation platform

Bonus question: Sustainability

[Diagram layers: Software as a service; Platform as a service; Infrastructure as a service]

Page 8:

We aspire (initially) to create a great user experience for research data management

What would a “dropbox for science” look like?

Page 9:

• Collect • Move • Sync • Share • Analyze

• Annotate • Publish • Search • Backup • Archive

BIG DATA

Page 10:

[Diagram: storage locations (Registry, Staging Store, Ingest Store, Analysis Store, Community Store, Archive, Mirror) annotated with error messages: “Quota exceeded”, “Expired credentials”, “Network failed. Retry.”, “Permission denied”]

It should be trivial to Collect, Move, Sync, Share, Analyze, Annotate, Publish, Search, Backup, & Archive BIG DATA … but in reality it’s often very challenging

Page 11:

• Collect • Move • Sync • Share • Analyze

• Annotate • Publish • Search • Backup • Archive

BIG DATA

Page 12:

• Collect • Move • Sync • Share • Analyze

• Annotate • Publish • Search • Backup • Archive

BIG DATA

• Move • Sync • Share

Capabilities delivered using Software-as-a-Service (SaaS) model

Page 13:

[Diagram: Data Source, Data Destination]

1. User initiates transfer request
2. Globus Online moves/syncs files
3. Globus Online notifies user
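
The three steps above can be sketched as a fire-and-forget request that the service completes on the user's behalf. This is a minimal in-memory illustration, not the Globus Online API: `TransferService`, `submit`, and the dict-based endpoints are hypothetical names chosen for this sketch, and "sync" is assumed here to mean copying only files whose checksums differ.

```python
import hashlib
from dataclasses import dataclass, field


def checksum(data: bytes) -> str:
    """Content fingerprint used to decide whether a file needs copying."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class TransferService:
    """Hypothetical stand-in for a hosted transfer service (not a real API)."""
    notifications: list = field(default_factory=list)

    def submit(self, user: str, source: dict, destination: dict) -> list:
        # Step 1: the user hands over the request and can walk away.
        copied = []
        # Step 2: the service moves/syncs files, skipping unchanged ones.
        for name, data in source.items():
            existing = destination.get(name)
            if existing is None or checksum(existing) != checksum(data):
                destination[name] = data
                copied.append(name)
        # Step 3: the service notifies the user of the outcome.
        self.notifications.append(
            (user, f"transfer complete: {len(copied)} file(s) copied")
        )
        return copied


source = {"run1.fastq": b"ACGT", "run2.fastq": b"TTGA"}
destination = {"run1.fastq": b"ACGT"}  # this file is already up to date
service = TransferService()
copied = service.submit("alice", source, destination)
print(copied)  # ['run2.fastq']
print(service.notifications)
```

Only the missing file is copied, which is the property that makes repeated sync requests cheap.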

Page 14:

[Diagram: Data Source]

1. User A selects file(s) to share; selects user/group, sets share permissions
2. Globus Online tracks shared files; no need to move files to cloud storage!
3. User B logs in to Globus Online and accesses shared file
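
Sharing in place amounts to recording who may read which path on the owner's endpoint, rather than copying data anywhere. The following is a minimal sketch assuming a simple in-memory permission table; `ShareRegistry` and its methods are hypothetical and do not reflect the actual Globus Online interface.

```python
from dataclasses import dataclass, field


@dataclass
class ShareRegistry:
    """Hypothetical permission table: tracks shares, never stores file data."""
    # Maps (owner, path) -> {grantee: permission string such as "r" or "rw"}
    grants: dict = field(default_factory=dict)

    def share(self, owner: str, path: str, grantee: str, permission: str = "r") -> None:
        # Step 1: User A selects a path, a user or group, and a permission.
        self.grants.setdefault((owner, path), {})[grantee] = permission

    def can_access(self, user: str, owner: str, path: str, mode: str = "r") -> bool:
        # Step 3: User B's access is checked against the recorded grant.
        if user == owner:
            return True
        return mode in self.grants.get((owner, path), {}).get(user, "")


registry = ShareRegistry()
registry.share("userA", "/data/sample.bam", "userB", "r")
print(registry.can_access("userB", "userA", "/data/sample.bam"))       # True
print(registry.can_access("userB", "userA", "/data/sample.bam", "w"))  # False
print(registry.can_access("userC", "userA", "/data/sample.bam"))       # False
```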

Page 15:

Extreme ease of use
• InCommon, OAuth, OpenID, X.509, …
• Credential management
• Group definition and management
• Transfer management and optimization
• Reliability via transfer retries
• Web interface, REST API, command line
• One-click “Globus Connect” install
• 5-minute Globus Connect Multi User install
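
“Reliability via transfer retries” can be illustrated with a generic retry loop using exponential backoff. This is a sketch of the general technique under assumed parameters (`max_attempts`, `base_delay`); the slides do not describe how Globus Online actually schedules its retries.

```python
import time


class TransientError(Exception):
    """Raised for failures worth retrying, e.g. a dropped connection."""


def with_retries(operation, max_attempts=5, base_delay=0.01, sleep=time.sleep):
    """Call operation() until it succeeds or attempts are exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise
            # Double the wait after each failure: base, 2*base, 4*base, ...
            sleep(base_delay * 2 ** (attempt - 1))


calls = {"count": 0}


def flaky_transfer():
    # Simulated transfer that fails twice before succeeding.
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("network failed")
    return "done"


result = with_retries(flaky_transfer)
print(result, calls["count"])  # done 3
```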

Page 17:

Early adoption is encouraging
• >12,000 registered users; >150 daily
• >27 PB moved; >1B files
• 10x (or better) performance vs. scp
• 99.9% availability
• Entirely hosted on Amazon

Page 18:

Amazon Web Services used
• Amazon EC2 for hosting Globus services
• Elastic Load Balancing to use multiple Availability Zones for reliability and uptime
• Amazon S3 to store historical state
• Amazon RDS PostgreSQL for active state

Page 19:

K. Heitmann (Argonne) moves 22 TB of cosmology data LANL → ANL at 5 Gb/s
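
As a rough check on the scale involved, the time to move 22 TB at a sustained 5 Gb/s can be worked out directly. The calculation assumes decimal units (1 TB = 10^12 bytes, 1 Gb = 10^9 bits) and ignores protocol overhead, neither of which the slide specifies.

```python
def transfer_time_hours(size_tb: float, rate_gbps: float) -> float:
    """Hours needed to move size_tb terabytes at rate_gbps gigabits per second."""
    size_bits = size_tb * 1e12 * 8   # terabytes -> bytes -> bits (decimal units assumed)
    rate_bits_per_s = rate_gbps * 1e9
    return size_bits / rate_bits_per_s / 3600


hours = transfer_time_hours(22, 5)
print(f"{hours:.1f} hours")  # 9.8 hours
```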

Page 20:

B. Winjum (UCLA) moves 900K-file plasma physics datasets UCLA → NERSC

Page 21:

Dan Kozak (Caltech) replicates 1 PB LIGO astronomy data for resilience

Page 22:

Erin Miller (PNNL) collects data at Advanced Photon Source, renders at PNNL, and views at ANL

Credit: Kerstin Kleese-van Dam

Page 23:

• Collect • Move • Sync • Share • Analyze

• Annotate • Publish • Search • Backup • Archive

BIG DATA

• Move • Sync • Share

Capabilities delivered using Software-as-a-Service (SaaS) model

Page 24:

• Collect • Move • Sync • Share • Analyze

• Annotate • Publish • Search • Backup • Archive

BIG DATA

Page 25:

Globus Online already does a lot

[Diagram components: Globus Toolkit; Sharing Service; Transfer Service; Globus Nexus (Identity, Group, Profile); Globus Online APIs; Globus Connect]

Page 26:

The identity challenge in science
• Research communities often need to
  – Assign identities to their users
  – Manage user profiles
  – Organize users into groups for authorization
• Obstacles to high-quality implementations
  – Complexity of associated security protocols
  – Creation of identity silos
  – Multiple credentials for users
  – Reliability, availability, scalability, security

Page 27:

Nexus provides four key capabilities
• Identity provisioning
  – Create, manage Globus identities
• Identity hub
  – Link with other identities; use to authenticate to services
• Group hub
  – User-managed groups; groups can be used for authorization
• Profile management
  – User-managed attributes; can use in group admission

Key points:
1) Outsource identity, group, profile management
2) REST API for flexible integration
3) Intuitive, customizable Web interfaces
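
The group hub and profile management ideas can be sketched together: a group admits users whose profile attributes satisfy a rule, and membership then drives authorization. All names here (`Group`, `admission_rule`, the `institution` attribute) are hypothetical illustrations and are not taken from the Nexus REST API.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Group:
    """Hypothetical user-managed group with a profile-based admission rule."""
    name: str
    admission_rule: Callable[[dict], bool]
    members: set = field(default_factory=set)

    def request_membership(self, username: str, profile: dict) -> bool:
        # Profile attributes decide admission.
        if self.admission_rule(profile):
            self.members.add(username)
            return True
        return False


def authorize(username: str, group: Group) -> bool:
    """Authorization reduces to a group membership check."""
    return username in group.members


lab = Group("genomics-lab", lambda profile: profile.get("institution") == "uchicago")
print(lab.request_membership("alice", {"institution": "uchicago"}))  # True
print(lab.request_membership("bob", {"institution": "elsewhere"}))   # False
print(authorize("alice", lab), authorize("bob", lab))                # True False
```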

Page 28:

Branded sites
• Open Science Grid
• University of Chicago
• XSEDE
• DOE kBase
• Indiana University
• University of Exeter
• Globus Online
• NERSC
• NIH BIRN

Page 29:

A platform for integration

Page 32:

Data management SaaS (Globus) + Next-gen sequence analysis pipelines (Galaxy) + Cloud IaaS (Amazon) = Flexible, scalable, easy-to-use genomics analysis for all biologists

globus genomics

Page 33:

We are adding capabilities

[Diagram components: Globus Toolkit; Sharing Service; Transfer Service; Globus Nexus (Identity, Group, Profile); Globus Online APIs; Globus Connect]

Page 34:

We are adding capabilities

[Diagram components: Globus Toolkit; Sharing Service; Transfer Service; Dataset Services; Globus Nexus (Identity, Group, Profile); Globus Online APIs; Globus Connect]

Page 35:

We are adding capabilities
• Ingest and publication
  – Imagine a Dropbox that not only replicates, but also extracts metadata, catalogs, converts
• Cataloging
  – Virtual views of data based on user-defined and/or automatically extracted metadata
• Computation
  – Associate computational procedures, orchestrate application, catalog results, record provenance

Page 36:

Next Gen Sequencing Analysis for Everyone – No IT Required

Ravi K Madduri, The University of Chicago and Argonne National Laboratory

November 14, 2013

Page 37:

One slide to get your attention

Page 38:

Outline
• Globus Vision
• Challenges in Sequencing Analysis
  – Big Data Management
  – Analysis at Scale
  – Reproducibility
• Proposed Approach Using Globus Genomics
• Example Collaborations
• Q&A

Page 39:

Globus Vision

Goal: Accelerate discovery and innovation worldwide by providing research IT as a service

Leverage software-as-a-service to:
– provide millions of researchers with unprecedented access to powerful tools for managing Big Data
– reduce research IT costs dramatically via economies of scale

“Civilization advances by extending the number of important operations which we can perform without thinking of them”
– Alfred North Whitehead, 1911

Page 40:

Challenges in Sequencing Analysis

[Diagram: Sequencing Centers, Public Data, Storage, Seq Center, Research Lab, Local Cluster/Cloud; analysis elements: Fastq, Ref Genome, Alignment, Picard, GATK, Variant Calling; manual cycle: Install, Modify, (Re)Run Script]

Data Movement and Access Challenges
• Data is distributed in different locations
• Research labs need access to the data for analysis
• Be able to share data with other researchers/collaborators
• Inefficient ways of data movement
• Data needs to be available on the local and distributed compute resources
• Local clusters, cloud, grid

Manual Data Analysis: How do we analyze this sequence data?
• Manually move the data to the compute node
• Install all the tools required for the analysis
• BWA, Picard, GATK, filtering scripts, etc.
• Shell scripts to sequentially execute the tools
• Manually modify the scripts for any change
• Error prone, difficult to keep track, messy
• Difficult to maintain and transfer the knowledge
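
The manual approach above runs tools in a fixed sequence from shell scripts. One way to make such a pipeline easier to modify is to declare each step with its dependencies and derive the execution order from that declaration. The sketch below uses Python's standard `graphlib` module; the step names follow the slide (alignment, Picard, GATK variant calling), but the step bodies are placeholders and no real tool invocations or flags are shown. The assumed ordering (alignment, then Picard processing, then GATK variant calling) is the conventional one and is not spelled out on the slide.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on (hypothetical pipeline layout).
pipeline = {
    "fastq": set(),
    "ref_genome": set(),
    "align_with_bwa": {"fastq", "ref_genome"},
    "process_with_picard": {"align_with_bwa"},
    "call_variants_with_gatk": {"process_with_picard", "ref_genome"},
}


def run_step(name: str, log: list) -> None:
    # Placeholder: a real runner would launch the tool here.
    log.append(name)


def run_pipeline(steps: dict) -> list:
    """Execute steps in an order that respects every dependency."""
    log = []
    for name in TopologicalSorter(steps).static_order():
        run_step(name, log)
    return log


order = run_pipeline(pipeline)
print(order)
```

Adding a filtering step then means adding one entry to `pipeline` rather than editing a script by hand.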

Page 41:

Globus Genomics

[Diagram: Sequencing Centers, Public Data, Storage, Seq Center, Research Lab, Local Cluster/Cloud]

Data Management
• Globus provides a high-performance, fault-tolerant, secure file transfer service between all data endpoints

Data Analysis
• Galaxy-based workflow management system
• Galaxy Data Libraries
• Globus integrated within Galaxy
• Web-based UI
• Drag-and-drop workflow creation
• Easily modify workflows with new tools

Globus Genomics on Amazon EC2
• Analytical tools are automatically run on the scalable compute resources when possible

Page 42:

Globus Genomics Architecture

Figure 2: Globus Genomics Architecture

Page 43:

Globus Genomics Usage

Page 44:
Page 45:

Globus Genomics
• Computational profiles for various analysis tools
• Resources can be provisioned on-demand with Amazon Web Services cloud-based infrastructure
• GlusterFS as a shared file system between head nodes and compute nodes
• Provisioned I/O on Amazon EBS

Page 46:

Coming soon!
• Integration with Globus Catalog
  – Better data discovery and metadata management
• Integration with Globus Sharing
  – Easy and secure method to share large datasets with collaborators
• Integration with Amazon Glacier for data archiving
• Support for high throughput computational modalities through Apache Mesos
  – MapReduce and MPI clusters
• Dynamic Storage Strategies using Amazon S3 or LVM-based shared file system

Page 47:
Page 48:

Provide more capability for more people at lower cost by building a “Discovery Cloud”

Delivering “Science as a service”

Our vision for a 21st century discovery infrastructure

Page 49:

Thank you to our sponsors

Page 50:

For more information
• More information on Globus Genomics and to sign up: www.globus.org/genomics
• More information on Globus: www.globusonline.org
• Follow us on Twitter: @ianfoster, @madduri, @globusgenomics, @globusonline

Page 51:

Thank you!

Page 52:

Please give us your feedback on this presentation

As a thank you, we will select prize winners daily for completed surveys!

BDT 310

