Reproducibility in the Clouds - Harnessing the in nube Paradigm
DNA Learning Center
• Hands-on education in molecular biology/bioinformatics for grade 6-12 students and for faculty
• 3 dedicated centers (2 more by 2020) and 16 centers licensed/modeled on DNALC
• Approximately 30K visitors annually (more than 2 million visitors to date across all centers)
Transforming science through data-driven discovery
More than 70K users, PBs of data, and hundreds of publications, courses, and discoveries
Funded by the National Science Foundation
• We are your colleagues and collaborators (not a company)!
• $115 million in investment
• Freely available to the community
• Spur national/international collaboration
DBI-0735191, DBI-1265383, DBI-1743442
CyVerse Funding
Some Jargon and Metaphors
Same substance, different properties, coexisting
What do we mean by cloud?
Seeing shapes in clouds
PaaS (Platform-as-a-Service): Deploy software applications on a remote infrastructure
SaaS (Software-as-a-Service): Use software applications on a remote infrastructure
IaaS (Infrastructure-as-a-Service): Deploy applications and provision/orchestrate infrastructure
Why did you come to this session?
Is this being a biologist now?
Reproducible is beautiful
Reproducibility Spectrum
Science, 02 Dec 2011: Vol. 334, Issue 6060, pp. 1226-1227. DOI: 10.1126/science.1213847
Cloud is a great opportunity to think about what reproducibility means to you
4 Things to think about when moving to the cloud
Tip 1: Collect your questions and translate (Computational thinking)
Question Mapping
• What are the questions I can’t answer now?
• What software is available/what power is required?
• What/where are the data?
• Have I optimized with what I have? (code refactoring)
• What architecture (GPUs/High-mem nodes) support the tools I’ll use?
• What/where are the data? (Co-localization)
Science? Tech?
[Figure: Flexibility/Ease of use plotted against Computational Power]
Tip 2: What resources at what costs?
Combine Cloud, HPC, and Reproducibility
Containers allow you to ship your problems around
[Figure: two software stacks compared]
Tools and dependencies separate: Operating System (Version); Platform/Language (R, Python, Linux); Tool 1 (version), Tool 2 (version), Tool 3 (version); dependencies listed separately from the tools
Tools and dependencies bundled: Operating System (Version); Docker; three Docker containers, holding Tool 1, Tool 2, and Tool 3, each with its own dependencies
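The layers named in the diagram (operating system, language platform, tool versions) are what a container pins in place. As a minimal sketch that is not part of the original slides, the Python below records those same layers for whatever environment it runs in. The function name `capture_environment` and the example package name are illustrative assumptions.

```python
import json
import platform
from importlib import metadata


def capture_environment(packages):
    """Record the OS, language runtime, and selected package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed here
    return {
        "operating_system": platform.platform(),
        "language": "Python " + platform.python_version(),
        "tools": versions,
    }


if __name__ == "__main__":
    # "pip" is only an example name; substitute the tools you depend on.
    print(json.dumps(capture_environment(["pip"]), indent=2))
```

Saving this record next to your results gives a collaborator a starting point for rebuilding the same stack, whether or not a container is used.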
Assembling resources
• Institutional resources (your own institutional core): Financial Costs None-Low; Time Costs Low-Med; Computational Power Low-Med; Data Concerns Low-Med
• Public Research Infrastructure (CyVerse, JetStream, EOSC, BPA): Financial Costs None; Time Costs Low-Med; Computational Power Med-High; Data Concerns Low
• Commercial Cloud (AWS, GCP, Azure): Financial Costs Med-Inf; Time Costs Varies; Computational Power High-Inf; Data Concerns Med-High
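One way to use the comparison above is to treat it as data and filter it by the constraint that matters most to you. The Python below is a minimal sketch, assuming each row of ratings on the slide belongs to one resource type in the order financial costs, time costs, computational power, data concerns. The key names and the helper `options_where` are illustrative, not from the talk.

```python
# Ratings transcribed from the "Assembling resources" slide (assumed row order).
RESOURCES = {
    "Institutional resources": {
        "financial_costs": "None-Low",
        "time_costs": "Low-Med",
        "computational_power": "Low-Med",
        "data_concerns": "Low-Med",
    },
    "Public Research Infrastructure": {
        "financial_costs": "None",
        "time_costs": "Low-Med",
        "computational_power": "Med-High",
        "data_concerns": "Low",
    },
    "Commercial Cloud": {
        "financial_costs": "Med-Inf",
        "time_costs": "Varies",
        "computational_power": "High-Inf",
        "data_concerns": "Med-High",
    },
}


def options_where(criterion, allowed):
    """Return the resource types whose rating for `criterion` is in `allowed`."""
    return [
        name
        for name, ratings in RESOURCES.items()
        if ratings[criterion] in allowed
    ]


if __name__ == "__main__":
    # Which options fit a project with little or no budget?
    print(options_where("financial_costs", {"None", "None-Low"}))
```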
Working with Big Data
• Publication: Data Commons Repository (DCR), NCBI-SRA
• Sharing: Community Data folders, Data Commons, quick share links
• Analysis: Discovery Environment, Atmosphere, Agave API, BisQue, DNA Subway
• Metadata: Add, delete, copy; metadata templates; bulk metadata
• Upload: Discovery Environment, iCommands, Cyberduck
• Discovery: Data Commons Repository (DCR), Elasticsearch
Tip 3: Take advantage of orchestration and workflows
Container Orchestration: Kubernetes (κυβερνήτης, Greek for "helmsman")
Workflow managers
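At its core, a workflow manager runs each step only after the steps it depends on have finished. The Python below is a minimal sketch of that idea using the standard library's `graphlib` (Python 3.9+). It does not represent any particular workflow manager, and the step names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
PIPELINE = {
    "quality_control": set(),
    "align_reads": {"quality_control"},
    "count_features": {"align_reads"},
    "report": {"quality_control", "count_features"},
}


def run_pipeline(steps, actions):
    """Run every step after its dependencies; return the execution order."""
    order = []
    for step in TopologicalSorter(steps).static_order():
        actions[step]()  # call the function registered for this step
        order.append(step)
    return order


if __name__ == "__main__":
    log = []
    # Placeholder actions that only record that the step ran.
    actions = {name: (lambda n=name: log.append(n)) for name in PIPELINE}
    print(run_pipeline(PIPELINE, actions))
```

Real workflow managers add what this sketch leaves out, such as caching finished steps, running independent steps in parallel, and launching each step inside its own container.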
Tip 4: Don’t go it alone
Get Trained
[Figure: survey responses to "Does your institution meet this need?" ('no' responses)]
CyVerse Example in Practice
Needs for biologists
• Power (computational capacity)
• Usability (simple to get started)
• Usefulness (right set of tools installed, or able to install)
• Reproducibility
[Figure: layered stack; axis labels: Ease of Use, Flexibility]
• Ready to use Platforms
• Foundational Capabilities
• Established CI Components
• Extensible Services
CyVerse Product Stack
CyVerse Atmosphere
Emphasis on serverless and SaaS/PaaS layer
CyVerse VICE: Visual Interactive Computing Environment
RNA-Seq example: GEA Tools in Docker
1. Write a Dockerfile (1-8 hours)
2. Push container to CyVerse (10 min)
3. Create an interface and get button (10 min)
4. Use your tools (10 min)
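Much of the time in step 1 goes into pinning exact versions so the container builds the same way every time. The Python below is a minimal sketch of generating such a Dockerfile from a list of pinned packages. The helper `render_dockerfile`, the base image tag, and the package pins are placeholders chosen for illustration, not the tools used in the talk.

```python
def render_dockerfile(base_image, pinned_packages):
    """Return Dockerfile text that installs each package at an exact version."""
    lines = [f"FROM {base_image}"]
    # Sort by name so the same pins always produce the same file.
    for name, version in sorted(pinned_packages.items()):
        lines.append(f"RUN pip install --no-cache-dir {name}=={version}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder image tag and package versions.
    pins = {"numpy": "1.26.4", "pandas": "2.2.2"}
    print(render_dockerfile("python:3.11-slim", pins))
```

Keeping the pins in one place makes it easy to review what changed when a tool is upgraded and the container is rebuilt.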
Some take home messages:
• Technology problems can be solved by lots of folks – but you and your group know the science
• There isn't a “cloud” but a constellation of technologies and approaches – think about taking the pieces you need
• There are lots of free, safe places to learn (public infrastructure)
Parker Antin, Nirav Merchant, Eric Lyons, Matthew Vaughn, David Micklos