Why computing for genomics research sucks.
BaltiBio 2014-05-27
Example Genomics Tasks
Task | Repetitiveness | “Disk” Input/Output | Memory | Duration
Build 10,000 trees | 10,000x | low | low | short
Trim FASTQ files | 40-400x | high | low | short
One de novo genome assembly | 1 | high | high | long
Many de novo genome assemblies | 20-1000x | high | high | long
Determine which of 10 new tools that promise X can actually do X (once): “genome hacking” | 1 | depends | depends | depends
(“Disk” Input/Output, Memory and Duration are per task.)
Traditional High Performance Computing (HPC)
• Physics? Astronomy? Maths? Chemistry?
• Traditional HPC infrastructures are great at small tasks:
Task | Repetitiveness | “Disk” Input/Output | Memory | Duration
Build 10,000 trees | 10,000x | low | low | short
(“Disk” Input/Output, Memory and Duration are per task.)
• And/or have mechanisms/tools that transform their challenges into many small tasks.
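Transforming one challenge into many small tasks usually amounts to splitting the work into batches that a scheduler can hand to separate cores. A minimal sketch in Python, assuming the 10,000 tree builds are independent of each other; the function name `split_into_batches` and the choice of 48 batches are illustrative assumptions, not part of any particular scheduler:

```python
def split_into_batches(n_tasks, n_batches):
    """Split task indices 0..n_tasks-1 into n_batches contiguous, nearly equal batches."""
    if n_tasks < 0 or n_batches <= 0:
        raise ValueError("n_tasks must be >= 0 and n_batches must be > 0")
    base, extra = divmod(n_tasks, n_batches)
    batches = []
    start = 0
    for i in range(n_batches):
        # The first `extra` batches take one additional task each.
        size = base + (1 if i < extra else 0)
        batches.append(range(start, start + size))
        start += size
    return batches


# Hypothetical example: 10,000 tree builds spread over 48 batches.
batches = split_into_batches(10_000, 48)
print(len(batches), len(batches[0]), len(batches[-1]))
```

Each batch could then be submitted as one queue job. This only works for tasks that do not depend on one another, which is why it suits tree building better than a single genome assembly.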
“We have 9999 cores!” - central IT admin
but they are inadequate
Big Ass Servers
• e.g.: 1.5 TB RAM; 48 cores - SSH into it and do whatever you want.
Task | Repetitiveness | “Disk” Input/Output | Memory | Duration
Build 10,000 trees | 10,000x | low | low | short
Trim FASTQ files | 40-400x | high | low | short
One de novo genome assembly | 1 | high | high | long
Many de novo genome assemblies | 20-1000x | high | high | long
Determine which of 10 new tools that promise X can actually do X (once) | 1 | depends | depends | depends
(“Disk” Input/Output, Memory and Duration are per task.)
Jeremy Leipzig
Additional challenges for biologists
• Datasets continue growing fast!
• Generally:
• We lack computational training.
• Bioinformatics tools suck (badly written, badly tested, hard to install).
So what do we need?
• access to machines of all shapes and sizes
• big and small machines
• direct access via ssh (for hacking & doing things a few times)
• indirect access via queue (for doing things many times)
• fast I/O - cheap archival.
• single login: all files “feel” like they’re in one place
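The list above implies a simple rule of thumb: things done many times go through the queue, things done a few times are run directly over ssh, and memory-hungry tasks need a big machine. A rough sketch of that rule in Python, under stated assumptions: the `Task` class, the function names and the threshold of 10 repetitions are all hypothetical, and the repetition counts use the lower end of the ranges from the task table.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    repetitions: int  # how many times the task is run
    memory: str       # "low" or "high", per task


def suggest_access(task, many_threshold=10):
    """Queue for things done many times, ssh for things done a few times.

    The threshold is an arbitrary assumption for illustration.
    """
    return "queue" if task.repetitions >= many_threshold else "ssh"


def suggest_machine(task):
    """Big machine when per-task memory is high, otherwise a small one."""
    return "big" if task.memory == "high" else "small"


tasks = [
    Task("Build 10,000 trees", 10_000, "low"),
    Task("Trim FASTQ files", 40, "low"),
    Task("One de novo genome assembly", 1, "high"),
    Task("Many de novo genome assemblies", 20, "high"),
]

for task in tasks:
    print(task.name, suggest_access(task), suggest_machine(task))
```

The “genome hacking” row is left out on purpose: its requirements are listed as “depends”, so no fixed rule applies.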
Swiss Institute of Bioinformatics: Vital-IT
So what do we need? In addition:
• easily changeable software & OS versions
Easily changeable OS & software versions
https://www.docker.io
> docker-switch bio-linux7
# do stuff
> docker-switch pacbio-assembly-vm
# do other stuff
> docker-switch antlab-ubuntu
# do more stuff
FAKE (a mock-up: docker-switch is not a real command)
@bmpvieira
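There is no docker-switch command in Docker itself, but something close can be approximated with `docker run`. A minimal sketch in Python that only builds the command lines, assuming that images with the names from the slide exist locally or in a registry, that they contain `bash`, and that data lives in a hypothetical `/data` directory; none of that is confirmed by the slides:

```python
import shlex


def docker_run_command(image, data_dir="/data"):
    """Build a `docker run` command that opens an interactive shell in `image`."""
    return [
        "docker", "run",
        "--rm",                          # remove the container when the shell exits
        "-it",                           # interactive session with a terminal
        "-v", f"{data_dir}:{data_dir}",  # mount the data directory at the same path
        image,
        "bash",                          # assumes the image ships bash
    ]


# Image names are taken from the slide and may not correspond to real images.
for image in ("bio-linux7", "pacbio-assembly-vm", "antlab-ubuntu"):
    print(shlex.join(docker_run_command(image)))
```

Passing the returned list to `subprocess.run(..., check=True)` would start the container. Mounting the same data directory in every container is one way to get closer to the “all files feel like they’re in one place” requirement.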
What if Apple/Google made an idiot-proof cloud computing system for genomics?
• Always on - single place to connect to:
ssh mylab.awskiller.co.uk
• Dropbox-like shared directories & file checksumming.
• Easily switchable OS version / “VM”.
• Automagically & transparently migrates:
  • from small to huge machines (and back) as CPU and RAM demands change.
  • from one physical site (huge dataset) to another
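File checksumming, as wished for above, is what lets you confirm that a file synced or migrated between machines is still intact. A minimal sketch in Python using SHA-256 from the standard library; the function names are illustrative, and the choice of SHA-256 is an assumption, since the slides do not name an algorithm:

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Chunked reading keeps memory use low even for very large files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if both files have identical content according to SHA-256."""
    return sha256_of_file(path_a) == sha256_of_file(path_b)
```

Comparing the digest computed before a transfer with the one computed afterwards catches truncated or corrupted copies without having to compare the files byte by byte across sites.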
Summary
• Broad range of needs:
  • some similar to traditional HPC.
  • some very different!
• Users are naive.
• Tools are experimental.
• Datasets are experimental.
• IT people have difficulty understanding this.
• Do not trust them when they say things will just work!
• A lot of potential to make things not suck.
Evolutionary Genetics group & Queen Mary U London
Bruno Vieira - @bmpvieira
Steve Moss - @gawbul
Anurag Priyam - @yeban
Richard Christie & ITS Research Support team @ Queen Mary U London
Ioannis Xenarios & Vital-IT team @ Swiss Institute of Bioinformatics