Post on 01-Sep-2020

Reproducibility in the Clouds - Harnessing the in nube Paradigm

DNA Learning Center

• Hands-on education in molecular biology/bioinformatics for grade 6-12 students and faculty

• 3 dedicated centers (2 more by 2020) and 16 centers licensed/modeled on DNALC

• Approximately 30K visitors annually (more than 2 million visitors to date across all centers)

Transforming science through data-driven discovery

More than 70K users, PBs of data, and hundreds of publications, courses, and discoveries

Funded by the National Science Foundation

• We are your colleagues and collaborators (not a company)!

• $115 million in investment

• Freely available to the community

• Spur national/international collaboration

DBI-0735191, DBI-1265383, DBI-1743442

CyVerse Funding

Some Jargon and Metaphors

Same substance, different properties, coexisting

What do we mean by cloud?

Seeing shapes in clouds

PaaS (Platform-as-a-Service): Deploy software applications on a remote infrastructure

SaaS (Software-as-a-Service): Use software applications on a remote infrastructure

IaaS (Infrastructure-as-a-Service): Deploy applications and provision/orchestrate infrastructure

Why did you come to this session?

Is this being a biologist now?

Reproducible is beautiful

Reproducibility Spectrum

Science, 02 Dec 2011: Vol. 334, Issue 6060, pp. 1226-1227. DOI: 10.1126/science.1213847

Cloud is a great opportunity to think about what reproducibility means to you

4 Things to think about when moving to the cloud

Tip 1: Collect your questions and translate (Computational thinking)

Question Mapping

• What are the questions I can’t answer now?

• What software is available/what power is required?

• What/where are the data?

• Have I optimized with what I have? (code refactoring)

• What architecture (GPUs/high-mem nodes) supports the tools I’ll use?

• What/where are the data? (Co-localization)

Science? Tech?

[Figure: Flexibility/Ease of use plotted against Computational Power]

Tip 2: What resources at what costs?

Combine Cloud, HPC, and Reproducibility

Containers allow you to ship your problems around

[Figure: Tools and dependencies separate vs. tools and dependencies bundled]

• Tools and dependencies separate: Operating System (Version); Platform/Language (R, Python, Linux); Tool 1 (version), Tool 2 (version), Tool 3 (version); each tool's dependencies installed alongside the others

• Tools and dependencies bundled: Operating System (Version); Docker; one Docker Container per tool (Tool 1 and dependencies, Tool 2 and dependencies, Tool 3 and dependencies)
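
The layers in this diagram (operating system, platform/language, versioned tools) are the details a container image pins down. As a minimal sketch, assuming a Python-based analysis, the same details can be recorded from a running environment using only the standard library. The function name `environment_manifest` and the package names passed to it are illustrative assumptions, not something from the talk.

```python
import json
import platform
from importlib import metadata


def environment_manifest(packages):
    """Record OS, language, and tool versions for the current environment.

    `packages` is a hypothetical list of installed distribution names.
    """
    tools = {}
    for name in packages:
        try:
            tools[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed here: record that rather than guessing a version.
            tools[name] = None
    return {
        "operating_system": f"{platform.system()} {platform.release()}",
        "language": f"Python {platform.python_version()}",
        "tools": tools,
    }


if __name__ == "__main__":
    # Example package names only; substitute the tools your analysis uses.
    print(json.dumps(environment_manifest(["pip", "numpy"]), indent=2))
```

Saving a manifest like this next to your results gives a collaborator the information needed to rebuild a matching environment, which is the same job a container image does more completely.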

Assembling resources

• Institutional resources (your own institutional core): Financial Costs None-Low; Time Costs Low-Med; Computational Power Low-Med; Data Concerns Low-Med

• Public Research Infrastructure (CyVerse, JetStream, EOSC, BPA): Financial Costs None; Time Costs Low-Med; Computational Power Med-High; Data Concerns Low

• Commercial Cloud (AWS, GCP, Azure): Financial Costs Med-Inf; Time Costs Varies; Computational Power High-Inf; Data Concerns Med-High

Working with Big Data

• Upload: Discovery Environment, iCommands, Cyberduck

• Metadata: Add, delete, copy; metadata templates; bulk metadata

• Analysis: Discovery Environment, Atmosphere, Agave API, BisQue, DNA Subway

• Sharing: Community Data folders, Data Commons, quick share links

• Publication: Data Commons Repository (DCR), NCBI-SRA

• Discovery: Data Commons Repository (DCR), Elasticsearch

Tip 3: Take advantage of orchestration and workflows

Container Orchestration: Kubernetes (κυβερνήτης)

Workflow managers
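
The core job of a workflow manager is to run each step only after the steps it depends on have finished. The sketch below illustrates that idea with the standard-library `graphlib` module (Python 3.9+). It is not how any particular workflow manager is implemented, and the step names are hypothetical placeholders.

```python
from graphlib import TopologicalSorter


def run_workflow(steps, dependencies):
    """Run each step once, after all of the steps it depends on.

    steps: dict mapping step name -> zero-argument callable
    dependencies: dict mapping step name -> set of prerequisite step names
    Returns the order in which the steps were executed.
    """
    graph = {name: set(dependencies.get(name, ())) for name in steps}
    # static_order() yields prerequisites before the steps that need them
    # and raises CycleError if the dependencies are circular.
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        steps[name]()
    return order


if __name__ == "__main__":
    log = []
    # Hypothetical three-step analysis; real steps would call actual tools.
    steps = {
        "count": lambda: log.append("count"),
        "download": lambda: log.append("download"),
        "align": lambda: log.append("align"),
    }
    dependencies = {"align": {"download"}, "count": {"align"}}
    print(run_workflow(steps, dependencies))  # ['download', 'align', 'count']
```

Real workflow managers add much more on top of this ordering, such as caching, retries, and dispatching steps to containers or clusters.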

Tip 4: Don’t go it alone

Get Trained

[Figure: survey responses to “Does your institution meet this need?” (‘no’ responses)]

CyVerse Example in Practice

Needs for biologists

• Power (computational capacity)

• Usability (simple to get started)

• Usefulness (right set of tools installed, or able to install)

• Reproducibility

• Ready-to-use Platforms

• Foundational Capabilities

• Established CI Components

• Extensible Services

[Figure axes: Ease of Use; Flexibility]

CyVerse Product Stack

CyVerse Atmosphere

Emphasis on serverless and SaaS/PaaS layer

CyVerse VICE: Visual Interactive Computing Environment

RNA-Seq example: GEA Tools in Docker

1. Write a Dockerfile (1-8 hours)

2. Push container to CyVerse (10 min)

3. Create an interface and get button (10 min)

4. Use your tools (10 min)
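
Step 1 is where most of the time goes, so a rough illustration may help. The sketch below generates the text of a small Dockerfile that installs tools at pinned versions. It is not the Dockerfile used for the GEA tools: the base image, the `render_dockerfile` helper, and the package names and versions are all placeholder assumptions.

```python
def render_dockerfile(base_image, apt_packages):
    """Return Dockerfile text that installs apt packages at pinned versions.

    base_image: e.g. "ubuntu:20.04" (an assumption, not from the talk)
    apt_packages: dict mapping package name -> exact version string
    """
    pins = " ".join(
        f"{name}={version}" for name, version in sorted(apt_packages.items())
    )
    lines = [
        f"FROM {base_image}",
        "RUN apt-get update \\",
        f"    && apt-get install -y --no-install-recommends {pins} \\",
        "    && rm -rf /var/lib/apt/lists/*",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder tool names and versions, for illustration only.
    print(render_dockerfile("ubuntu:20.04", {"tool-a": "1.0", "tool-b": "2.3"}))
```

Pinning exact versions in the image is what lets the same analysis run identically later or on someone else's infrastructure.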

Some take home messages:

• Technology problems can be solved by lots of folks – but you and your group know the science

• There isn't a “cloud” but a constellation of technologies and approaches – think about taking the pieces you need

• There are lots of free safe places to learn (public infrastructure)

Parker Antin, Nirav Merchant, Eric Lyons, Matthew Vaughn, David Micklos
