ntino cloud biolinux barcelona spain 2012
Cloud BioLinux: Pre-configured Bioinformatics Computing for the Genomics Community
Ntino Krampis
Asst. Professor - Informatics
J. Craig Venter Institute
http://www.jcvi.org/cms/about/bios/kkrampis/
Tuesday, November 6, 12
J. Craig Venter Institute (JCVI)
• Human Microbiome Project (Nelson et al. Science 2010; 328: 994–99)
• NIH funded, launched in 2008, $115 million
• metagenomic sequencing of microbial genomes from the human body
• sequence everything in sample, use informatics to separate genomes
J. Craig Venter Institute
• Global Ocean Survey (first publication, Venter et al. Science 2004; 304: 66-74)
• metagenomic sequencing of microbes from oceans around the world
• Darwin’s route ?
• Numbers: HMP > 2 million new proteins, GOS > 1.2 million
Big Data and sequencing
• JCVI sequencing facility: 454, Solexa, HiSeq, and IonTorrent on the way
• Processed data: size information content
• But... look at SOLiD 3
Source: http://www.politigenomics.com/next-generation-sequencing-informatics
JCVI: sequencing and computing infrastructure
• “big” sequencing needs large-scale informatics
• ~1000 node Grid Engine cluster
• research with Hadoop / MapReduce, and a small private cloud
• 50+ bioinformaticians and software developers
A new paradigm: Low-cost, bench-top sequencers
• GS Junior (454), MiSeq (Illumina)
• complete sequencing of bacterial, viral, fungal genomes
• RNA-seq (gene expression), ChIP-seq (protein interactions), gene variant discovery
• sequencing as a standard technique in basic genetics research - like PCR ?
Will smaller academic labs become the long tail of sequencing ?
[Figure: long-tail curve; axes: amount of sequencing, number of labs]
• head of the curve: “sequencing factories” (JCVI, Broad Inst., Washington Univ., Inst. of Genome Sciences)
• tail of the curve: small academic labs with bench-top sequencers
Sequencers shipped without clusters
• Problem A : sequence analysis requires computational capacity
• genome assembly, BLAST, gene finders - annotation
• Problem B: bioinformatics tools need software engineering expertise
• unix/linux operating systems, maintaining software libraries, compiling source code
???
Each lab builds a cluster ?
• need additional funds to buy the hardware
• funds for personnel to maintain the cluster and software
• duplication of effort across labs
• sub-optimal utilization of the hardware
Centralized bioinformatics services
• Bioinformatic Resource Centers, e.g. GSCID
• bioinformatic services usually coupled with sequencing of a genome
• provide mostly data access to external PIs
• cannot provide support to every lab with a sequencer
Problem A : sequence analysis requires computational capacity
• Amazon Elastic Compute Cloud (EC2), pay-by-the-hour computing
• cloud servers cost $0.085 - $2 per hour
• max capacity 64GB RAM / 8 CPU (can boot hundreds of servers)
750 hours free for new users: aws.amazon.com/free/
free compute for teaching: aws.amazon.com/grants/
World-wide data centers
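The pay-by-the-hour model above lends itself to a quick back-of-the-envelope estimate. The sketch below is illustrative only: the two hourly rates are the ends of the price range quoted on the slide, the free allowance is treated as a flat deduction (an assumption that simplifies how a real free tier works), and the function name `estimate_cost` is hypothetical.

```python
def estimate_cost(hours, rate_per_hour, servers=1, free_hours=0):
    """Estimated bill in USD for `servers` machines running `hours` each.

    `free_hours` is subtracted from the total server-hours (assumption:
    a flat allowance, which simplifies how real free tiers work).
    """
    billable = max(hours * servers - free_hours, 0)
    return round(billable * rate_per_hour, 2)


if __name__ == "__main__":
    # One small server for a 30-day month at the low end of the quoted range.
    print(estimate_cost(hours=24 * 30, rate_per_hour=0.085))
    # Ten large servers for an 8-hour run at the high end of the range.
    print(estimate_cost(hours=8, rate_per_hour=2.0, servers=10))
```

Varying `servers` and `hours` makes the trade-off visible: a short burst on many machines can cost about the same as one machine running for a long time.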
Cloud Computing and Virtualization
• OS, software and data, pre-installed in Virtual Machine (VM)
• cloud provider: hardware and virtualization layer
• VM is a full-featured server in a single file
• VM transfer on private cloud
Credit: VMware Inc.
Problem B: bioinformatics tools need software engineering expertise
• VM with pre-installed software on the cloud
• avoid compiling source code, or other software dependencies
• rent computational capacity, on a pay as you go basis
• run the VM on the closest Amazon data center
Solving Problems A & B : Cloud BioLinux
• Cloud BioLinux: publicly accessible VM on EC2
• 100+ pre-installed bioinformatics tools
• remote desktop for non-command line experts
• you can create a cluster with Cloud BioLinux - CloudMan
Krampis K, Booth T, Chapman B, Tiwari B, Bicak M, Field D, Nelson K. Cloud BioLinux: pre-configured and on-demand bioinformatics computing for the genomics community. BMC Bioinformatics. 2012 Mar 19; 13: 42.
Accessing Cloud BioLinux
http://aws.amazon.com/console
Launch through the EC2 cloud console
Amazon EC2 VM launch wizard
cloudbiolinux.org
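The choices made in the launch wizard can also be scripted. The following is a minimal sketch, not something shown in the talk: it assumes the boto3 library and configured AWS credentials, and the AMI ID, instance type, key pair name and security group name are all placeholders. The helper names `build_launch_request` and `launch` are hypothetical.

```python
def build_launch_request(ami_id, instance_type="m1.large",
                         key_name="my-keypair", security_group="biolinux"):
    """Collect the launch wizard choices as keyword arguments for
    boto3's EC2 `run_instances` call. All defaults are placeholders."""
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "KeyName": key_name,
        "SecurityGroups": [security_group],
        "MinCount": 1,
        "MaxCount": 1,
    }


def launch(request, region="us-east-1"):
    """Start the VM. Requires boto3 and configured AWS credentials."""
    import boto3  # imported lazily so the sketch loads without boto3 installed

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.run_instances(**request)
    return response["Instances"][0]["InstanceId"]


if __name__ == "__main__":
    # "ami-00000000" is a placeholder, not a real Cloud BioLinux image ID.
    request = build_launch_request("ami-00000000")
    print(request)
    # launch(request)  # uncomment once credentials and a real AMI ID are set
```

Keeping the request as a plain dictionary makes it easy to inspect what would be launched before any billable server is started.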
Cloud BioLinux desktop remote connection
tinyurl.com/bootcloud1 tinyurl.com/bootcloud2
Cloud BioLinux desktop
Data exchange on the cloud
VM snapshots
Cloud computing research at JCVI
• open-source cloud platforms, fully compatible with Amazon EC2
• active funding, NIAID viral genomics pipeline on cloud
• end-to-end, sequence to assembly, annotation, visualization via Galaxy
• run on Amazon, private cloud, or desktop
Scriptable Cloud Infrastructures
• Cloud BioLinux VM configuration in plain text
• high-level configuration, software groups
• each group lists individual bioinformatics tools
Fabric framework
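To illustrate the idea of a plain-text configuration organized into software groups, here is a minimal sketch. The group names, the package lists and the helpers `packages_for` and `install_command` are made up for illustration and are not the actual Cloud BioLinux configuration files. The sketch only builds the `apt-get` command that a deployment tool such as Fabric could then run on the VM.

```python
# Hypothetical software groups; the real project keeps its own lists.
SOFTWARE_GROUPS = {
    "alignment": ["bwa", "bowtie"],
    "assembly": ["velvet", "abyss"],
    "utilities": ["samtools", "emboss"],
}


def packages_for(groups, config=SOFTWARE_GROUPS):
    """Expand the selected group names into a sorted, de-duplicated package list."""
    selected = set()
    for group in groups:
        selected.update(config[group])
    return sorted(selected)


def install_command(groups):
    """Build the shell command that would install the selected groups."""
    return "apt-get install -y " + " ".join(packages_for(groups))


if __name__ == "__main__":
    print(install_command(["alignment", "utilities"]))
```

Because the whole selection is data, a lab can drop a group it does not need or add a tool by editing one line, then rebuild the VM.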
Scriptable Cloud Infrastructures
• Python Fabric leverages Linux packages (APT repositories)
• mix and match software from repositories
• share VM configuration as source code
• clone across clouds
Krampis K, Booth T, Chapman B, Tiwari B, Bicak M, Field D, Nelson K. Cloud BioLinux: pre-configured and on-demand bioinformatics computing for the genomics community. BMC Bioinformatics. 2012 Mar 19; 13: 42.
Scalable Data Analysis
• Cloud BioLinux + CloudMan
• dual role: Master / Worker
• the Cloud BioLinux VM has CloudMan scripts that start more copies of itself
• Grid Engine (SGE) cluster
• http://usecloudman.org/
Afgan, E., Chapman, B. et al. (2012). Using Cloud Computing Infrastructure with CloudBioLinux, CloudMan, and Galaxy. Current Protocols in Bioinformatics, 11-9.
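Once a Grid Engine cluster is running, one common way to spread an analysis over the worker nodes is an array job. The sketch below is not from the talk; it only builds the `qsub` command line for such a job. The wrapper script `run_blast.sh`, the job name and the helper `array_job_command` are hypothetical.

```python
import shlex


def array_job_command(script, n_tasks, job_name="blast_chunks"):
    """Return a Grid Engine `qsub` command for an array job of `n_tasks` tasks.

    Each task runs `script` and can read its index from $SGE_TASK_ID.
    """
    if n_tasks < 1:
        raise ValueError("n_tasks must be at least 1")
    args = ["qsub", "-N", job_name, "-cwd", "-t", f"1-{n_tasks}", script]
    return " ".join(shlex.quote(a) for a in args)


if __name__ == "__main__":
    # run_blast.sh is a hypothetical wrapper script written by the user.
    print(array_job_command("run_blast.sh", 20))
```

Splitting the input into as many chunks as there are tasks lets the scheduler fill whatever worker nodes are currently part of the cluster.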
Goodies with Cloud BioLinux
From sequencer to the cloud
credit:basespace.illumina.com
Acknowledgments
• Cloud BioLinux community: Brad Chapman, Enis Afgan, Tim Booth, Mesude Bicak, Dawn Field
• JCVI collaborators: Alex Richter, Ravi Sanka, Andrey Tovchigrechko, Johannes Goll, Karen Nelson, Bill Nierman, JCVI IT support.
• NIAID for funding: Maria Giovanni, Punam Mathur
cloudbiolinux.org
groups.google.com/group/cloudbiolinux
tinyurl.com/cloudboot1
tinyurl.com/cloudboot2
slideshare.com/agbiotec
Thank you !