Building a European wide, Bioinformatics jobs execution network Gianmauro Cuccuru - Galaxy Europe Team University of Freiburg

Post on 30-Dec-2020

A Galaxy for all Scientists

• Galaxy is a gateway for transparent & reproducible data analysis
• Easy access (no installation) & sharing of data, tools, analyses, workflows
• Multiple interfaces
  • intuitive web portals for biologists
  • unified API for bioinformaticians
• International: 120+ public instances, 8000+ citations
  • UseGalaxy.eu, UseGalaxy.org.au, UseGalaxy.org ...
• ELIXIR community:
  • WG established in 2015

European Galaxy server

With the UseGalaxy.eu server we provide access to:

● Free compute and storage resources
● More than 2000 different, well-documented and constantly maintained bioinformatics tools
● Reference genomes (incl. most common plant genomes)
● 250 GB per user
● Free registration
● Login with ELIXIR AAI
● Member sites: Freiburg, Erasmus MC, VIB Belgium, Institut Pasteur

https://usegalaxy.eu

Galaxy Interface

New - Events - Statistics

Analyze Data - Workflows - Visualize - Shared Data - Help

Tools with documentation, example input and results, reference

History as digital lab book

Galaxy Workflows for multi-step Analyses

How?

● Extracted from history
● Built manually
● Import a shared workflow

Why?

● Automate your analysis
● Run pipelines
● Re-run the same analysis on different inputs or reproduce results
● Change parameters
● Sub workflows
● Share them

Infrastructure to connect interactive environments

● Jupyter
● RStudio
● Shiny
● Neo4J
● Phinch
● …

Training Infrastructure as a Service

● Queue where only your training’s jobs will run
● Free, register using a Google form
● No Galaxy Maintenance
● No Galaxy Administration
● Official Galaxy Training Materials are guaranteed to work and regularly tested
● See how your students are progressing with our dashboard
● >1500 students in the past year
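A dedicated training queue can be pictured as a routing rule: jobs from users enrolled in a training go to nodes reserved for that training, and everything else goes to the main cluster. The following is a minimal Python sketch of that idea, not the actual UseGalaxy.eu implementation. The `training-` role prefix, the `Job` type, and the pool names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical convention: a user enrolled in a training carries a role
# whose name starts with this prefix (an assumption, not from the slides).
TRAINING_ROLE_PREFIX = "training-"


@dataclass
class Job:
    tool_id: str
    user_roles: list = field(default_factory=list)


def pick_pool(job: Job) -> str:
    """Return the name of the node pool a job should run on."""
    for role in job.user_roles:
        if role.startswith(TRAINING_ROLE_PREFIX):
            # One pool per training, so only that training's jobs land there.
            return "training:" + role[len(TRAINING_ROLE_PREFIX):]
    # Users without a training role share the main cluster.
    return "main"
```

With this rule, a job submitted by a user holding the hypothetical role `training-rnaseq2019` would be sent to the pool `training:rnaseq2019`, while the same tool run by any other user would go to `main`.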

Subdomains - Fostering Communities

Subdomains with own welcome page and tool box

annotation.usegalaxy.eu
cheminformatics.usegalaxy.eu
climate.usegalaxy.eu
clipseq.usegalaxy.eu
ecology.usegalaxy.eu
graphclust.usegalaxy.eu
proteomics.usegalaxy.eu
rna.usegalaxy.eu
imaging.usegalaxy.eu
metabolomics.usegalaxy.eu
metagenomics.usegalaxy.eu
nanopore.usegalaxy.eu
singlecell.usegalaxy.eu
streetscience.usegalaxy.eu
hicexplorer.usegalaxy.eu
humancellatlas.usegalaxy.eu

European Galaxy community at a glance

UseGalaxy.eu since 2018

● 8,000 users (17th September 2019)
● 6 million jobs
● 11,000,000 datasets
● 13,800 workflows
● Training material: 137 contributors
● Annual Galaxy community conference

Galaxy Computational Power

● 2000 CPU cores
● 20 TB RAM
● 1.5 PB storage
● 50 TB data/all users/month
● 130,000 jobs/all users/month
● Cloud infrastructure of de.NBI (German Network for Bioinformatics Infrastructure) to perform analyses of large datasets
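To put these figures in proportion, the short Python calculation below derives a few per-unit averages from them. It is only back-of-the-envelope arithmetic on the numbers quoted above; the 30-day month and the use of decimal units (1 TB = 1000 GB) are assumptions.

```python
# Figures taken from the slide.
CPU_CORES = 2000
RAM_TB = 20
DATA_TB_PER_MONTH = 50
JOBS_PER_MONTH = 130_000

# Assumption: a month is treated as 30 days.
DAYS_PER_MONTH = 30

# Decimal units are assumed: 1 TB = 1000 GB = 1,000,000 MB.
ram_gb_per_core = RAM_TB * 1000 / CPU_CORES
jobs_per_day = JOBS_PER_MONTH / DAYS_PER_MONTH
data_mb_per_job = DATA_TB_PER_MONTH * 1_000_000 / JOBS_PER_MONTH

print(f"RAM per core:  {ram_gb_per_core:.0f} GB")
print(f"Jobs per day:  {jobs_per_day:.0f}")
print(f"Data per job:  {data_mb_per_job:.0f} MB (average)")
```

Under these assumptions the cluster offers about 10 GB of RAM per core, handles roughly 4,300 jobs per day, and moves on average about 385 MB of data per job.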

Central manager Interactive environments (~10 physical nodes)

Main cluster (~100 cloud nodes divided into 8 different classes)

Training cluster (1-9 cloud nodes)

Bundle of tools

● A VM image, named Virtual Galaxy Compute Nodes (VGCN), that provides everything you need to run Galaxy jobs.

● Terraform scripts that take care of deploying the infrastructure onto cloud resources.

Open Infrastructure

● Continuous testing
● Continuous deployment

Open Infrastructure

Join Forces

The most innovative computing centers across Europe are currently interested in sharing their computing power to help carry the UseGalaxy.eu load:

● DE, de.NBI cloud
● IT, ReCaS
● BE, Vlaams Supercomputer Centrum (VSC)
● PT, Tecnico ULisboa
● ES, Barcelona Supercomputing Center (INB-BSC)
● NO, University of Bergen
● CZ, CESNET

A Pulsar network across Europe

To create this network of shared computational resources, we leverage Pulsar, a Task Execution Service (TES)-like service. Pulsar allows a Galaxy server to automatically interact with those remote systems, ensuring job and provenance information are correctly exchanged.

Local + remote clusters

DE01, DE02, IT01, UK01,...

Remote resources examples

FQDN: uk01.pulsar.galaxyproject.eu

Computation details:

● 5-30 nodes: 60 cores, 320 GB RAM each

FQDN: de03.pulsar.galaxyproject.eu

Computation details:

● 8 NVIDIA Tesla T4
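With endpoints of such different shapes, a job has to be matched to a remote site that can actually run it. The Python sketch below illustrates one simple way to do that using the two endpoints above. It is an illustration only, not how Galaxy or Pulsar select destinations: `Endpoint` and `choose_endpoint` are hypothetical names, and values the slide does not state (GPUs on uk01, cores and RAM on de03) are set to 0 as placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Endpoint:
    fqdn: str
    cores_per_node: int   # 0 = not stated in the slides (placeholder)
    ram_gb_per_node: int  # 0 = not stated in the slides (placeholder)
    gpus: int             # 0 = not stated in the slides (placeholder)


ENDPOINTS = [
    Endpoint("uk01.pulsar.galaxyproject.eu", cores_per_node=60, ram_gb_per_node=320, gpus=0),
    Endpoint("de03.pulsar.galaxyproject.eu", cores_per_node=0, ram_gb_per_node=0, gpus=8),
]


def choose_endpoint(cores: int, ram_gb: int, needs_gpu: bool) -> Optional[str]:
    """Return the FQDN of the first endpoint that fits the job, or None."""
    for ep in ENDPOINTS:
        if needs_gpu:
            # GPU jobs only look for an endpoint that has GPUs.
            if ep.gpus > 0:
                return ep.fqdn
        elif cores <= ep.cores_per_node and ram_gb <= ep.ram_gb_per_node:
            # CPU jobs must fit on a single node of the endpoint.
            return ep.fqdn
    # No remote endpoint fits; the caller would keep the job on the local cluster.
    return None
```

In this sketch an 8-core, 64 GB CPU job is matched to uk01, a GPU job to de03, and a job asking for more cores than any single node offers gets `None`.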

Singularity

We regenerated 32,000 Singularity containers and made sure that all best-practice Galaxy tools and workflows are available as Singularity containers.

We are currently moving those containers (7 TB) to CVMFS; they will be part of every standard Pulsar endpoint as soon as the CVMFS snapshot is ready.

Next

● Python bindings
● Late materialization
● 8.6 -> 8.8
● Annex
● HTCondor-CE

Thanks!

https://galaxyproject.eu/freiburg/

Docker-galaxy

Minimum requirements

For a prototype setup, the minimum requirements are:

● Central manager and NFS server, each with 4 cores, 8 GB RAM
● Computational workers, each with 4-8 cores, 16 GB RAM
● >200 GB volume

but the more the better

[Diagram: NFS server, central manager (HTCondor + Pulsar), computational workers]
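The minimum requirements above can be turned into a quick pre-flight check for a planned prototype. The Python sketch below is one hedged way to do so: the `Node` type, the role names, and `check_setup` are hypothetical, and the "8 GB" and "16 GB" figures are assumed to refer to RAM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    role: str  # one of "central-manager", "nfs", "worker" (hypothetical names)
    cores: int
    ram_gb: int


# (minimum cores, minimum RAM in GB) per role, taken from the slide.
MIN_SPECS = {"central-manager": (4, 8), "nfs": (4, 8), "worker": (4, 16)}
MIN_VOLUME_GB = 200  # the slide asks for more than 200 GB


def check_setup(nodes, volume_gb):
    """Return a list of problems; an empty list means the minimum is met."""
    problems = []
    # Every role must be present at least once.
    for role in MIN_SPECS:
        if not any(n.role == role for n in nodes):
            problems.append(f"missing {role} node")
    # Every node must meet the minimum for its role.
    for n in nodes:
        min_cores, min_ram = MIN_SPECS[n.role]
        if n.cores < min_cores or n.ram_gb < min_ram:
            problems.append(
                f"{n.role} has {n.cores} cores / {n.ram_gb} GB, "
                f"needs at least {min_cores} cores / {min_ram} GB"
            )
    # The volume must be strictly larger than 200 GB.
    if volume_gb <= MIN_VOLUME_GB:
        problems.append(f"volume is {volume_gb} GB, needs more than {MIN_VOLUME_GB} GB")
    return problems
```

For example, a setup with one central manager (4 cores, 8 GB), one NFS server (4 cores, 8 GB), one worker (8 cores, 16 GB) and a 250 GB volume yields an empty list, while shrinking the volume to 200 GB yields one reported problem.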