iMicrobe and iVirus: Extending the iPlant Cyberinfrastructure from Plants to Microbes
DESCRIPTION
Overview of work underway to add applications and computational analysis pipelines to iPlant for metagenomics and microbial ecology.
TRANSCRIPT
Bonnie Hurwitz, PhD Arizona Health Sciences Center
Extending the iPlant Cyberinfrastructure: From Plants to Microbes
The iPlant Collaborative: Community Cyberinfrastructure for Life Science
http://www.iplantcollaborative.org
iVirus and iMicrobe
Joaquin Ruiz, PhD Dean, College of Science Darren Boss Devesh Chourasiya
Funding Staff
Matt Sullivan, PhD
Shane Burgess, PhD Dean, CALS
The iPlant Collaborative
Vision
Enable life science researchers and educators to use and extend cyberinfrastructure to understand and ultimately predict the complexity of biological systems
How iPlant CI Enables Discovery
Challenge: Create an easy-to-use platform powerful enough to handle data-intensive biology
Many bioinformatics tools are “off limits” to those without specialized computational backgrounds.
iPlant is a collaborative virtual organization
The iPlant Collaborative Who makes up iPlant?
The iPlant Collaborative How is iPlant funded?
iPlant Renewed by NSF
• September 2013 begins the next 5-year period
• Scientific Advisory Board
• Focus on genotype-phenotype science
• NSF recommended expansion of scope beyond plants
iPlant collaborates to enable access to the solutions that work best for the community…
The iPlant Collaborative Who does iPlant collaborate with?
How iPlant CI Enables Discovery Overview of resources
End Users
Computational Users
TeraGrid
XSEDE
✓ Storage ✓ Computation ✓ Hosting ✓ Web Services ✓ Scalability
Building a platform that can support diverse and constantly evolving needs.
iPlant Data Store
✓ Initial 100 GB allocation – TB allocations available
✓ Automatic data backup
✓ Easy upload/download and sharing
The resources you need to share and manage data with your lab, colleagues and community
Discovery Environment
Hundreds of bioinformatics Apps in an easy-to-use interface
✓ A platform that can run almost any bioinformatics application
✓ Seamlessly integrated with data and high performance computing
✓ User extensible – add your own applications
Agave API
Fully customize iPlant resources
✓ Science-as-a-service platform
✓ Define your own compute and storage resources (local and iPlant)
✓ Build your own app store of scientific code and workflows
Atmosphere
Cloud computing for the life sciences
✓ Simple: One-click access to more than 100 virtual machine images
✓ Flexible: Fully customize your software setup
✓ Powerful: Integrated with iPlant computing and data resources
DNA Subway
Educational workflows for Genomes, DNA Barcoding, RNA-Seq
✓ Commonly used bioinformatics tools in streamlined workflows
✓ Teach important concepts in biology and bioinformatics
✓ Inquiry-based experiments for novel discovery and publication of data
Bisque
Image analysis, management, and metadata
✓ Secure image storage, analysis, and data management
✓ Integrate existing applications or create new ones
✓ Custom visualization and image handling routines and APIs
Typical End Users
Computational Users
TeraGrid
XSEDE
iMicrobe and iVirus Leverage the iPlant Cyberinfrastructure
Using iPlant for:
✓ Storage ✓ Computation ✓ Analysis ✓ App dev. ✓ Pipeline dev. ✓ Code distrib. ✓ Data Discoverability
What’s Under the Hood? Stampede - High Level Overview
• Base Cluster (Dell/Intel/Mellanox):
– Intel Sandy Bridge processors
– Dell dual-socket nodes w/32GB RAM (2GB/core)
– 6,400 nodes
– 56 Gb/s Mellanox FDR InfiniBand interconnect
– More than 100,000 cores, 2.2 PF peak performance
• Co-Processors:
– Intel Xeon Phi “MIC” Many Integrated Core processors
– Special release of “Knight’s Corner” (61 cores)
– All MIC cards are on site at TACC; more than 6000 installed, final installation ongoing for formal summer acceptance
– 7+ PF peak performance
• Max Total Concurrency:
– exceeds 500,000 cores
– 1.8M threads
• Entered production operations on January 7, 2013
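The base-cluster core count can be cross-checked from the node figures quoted above. The sketch below is only an arithmetic check: the cores-per-node value is derived from 32 GB per node at 2 GB per core and is an inference, not a number stated on the slide.

```python
# Figures quoted on the slide for the Stampede base cluster.
NODES = 6_400
RAM_PER_NODE_GB = 32
RAM_PER_CORE_GB = 2


def cores_per_node(ram_per_node_gb: int, ram_per_core_gb: int) -> int:
    """Infer cores per node from the RAM-per-core ratio (an assumption)."""
    return ram_per_node_gb // ram_per_core_gb


def total_cores(nodes: int, per_node: int) -> int:
    """Total host cores across the base cluster."""
    return nodes * per_node


per_node = cores_per_node(RAM_PER_NODE_GB, RAM_PER_CORE_GB)
base_cores = total_cores(NODES, per_node)
print(f"{per_node} cores/node, {base_cores:,} cores in total")
```

The result is consistent with the slide's "more than 100,000 cores" for the base cluster, before any Xeon Phi co-processor cores are counted.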
iMicrobe/ iVirus: New App Development
June 2013 – May 2014: 13 new Apps, 1 high-throughput analysis pipeline
Forging Ahead with iPlant
• Build a metagenomics toolkit
• Streamline metagenomics workflows
• Enable high-throughput computing
• Provide key datasets for computation
iPlant Data Store
The resources you need to share and manage data with your lab, colleagues and community
Overview of the iPlant Data Store
Some Complications of Big Data
• Difficult/slow transfers
• Expense for storage/backup
• Difficult to share and publish
• Metadata
• Analysis
iPlant Supports the Life Cycle of Data
[Figure: data life cycle stages: Store, Markup, Search, Transfer, Analyze, Visualize, Collaborate, Share. Diagram labels: Data, Algo1, Algo2, Results A, Results B, Pre-Publication, Post-Publication, TeraGrid, XSEDE]
Overview of the iPlant Data Store
Scalable, Reliable, Redundant, High-performance
• Access your data from multiple iPlant services
• Automatic data backup (redundant between University of Arizona and University of Texas)
• Multiple ways to share data with collaborators
• Multi-threaded high speed transfers
• Default 100GB allocation. >1TB allocations available with justification
Overview of the iPlant Data Store
Some important items we won’t see
Source            Destination    Copy Method    Time (seconds)
CD                My Computer    cp             320
Berkeley Server   My Computer    scp            150
External Drive    My Computer    cp             36
USB2.0 Flash      My Computer    cp             30
iDS               My Computer    iget           18
My Computer       My Computer    cp             15
Close to optimum conditions; transfer between Univ. of Arizona and UC Berkeley
100GB: 29m15s (1 GB / 17.5 seconds)
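The per-gigabyte figure follows directly from the 100 GB timing. A minimal check of that arithmetic, assuming 1 GB = 1000 MB for the throughput conversion (the slide does not say which convention was used):

```python
def seconds_per_gb(total_seconds: float, size_gb: float) -> float:
    """Average transfer time per gigabyte."""
    return total_seconds / size_gb


# 100 GB transferred in 29 minutes 15 seconds, as quoted on the slide.
elapsed = 29 * 60 + 15
rate = seconds_per_gb(elapsed, 100)

# Assumption: decimal units (1 GB = 1000 MB).
mb_per_s = 1000 / rate

print(f"{elapsed} s total, {rate:.2f} s per GB, about {mb_per_s:.0f} MB/s")
```

The computed 17.55 s per GB matches the slide's rounded "1 GB / 17.5 seconds".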
Discovery Environment
Hundreds of bioinformatics Apps in an easy-to-use interface
Overview of the iPlant Discovery Environment
Through the Discovery Environment you have:
• High-powered computing
• iPlant data store
• Easy to use interface
• Virtually limitless apps
• Analysis history (provenance)
What can you do in the iPlant DE?
Scalable platform for powerful computing, data, and application resources
• Navigate the components of the DE
• Access and manipulate data
• Start and complete an analysis
• Track your analysis and see your results
Why is iPlant DE Scalable?
Democratize your code
• Rich platform for bioinformatics: ~400 apps (and counting)
• Data co-localized with analysis
• Easy to use interface, with access to support
• Easy to integrate and customize your own tools
Goal: Create a metagenomic assembly.
Task 1: Upload a metagenomic FASTA file to your personal data store
Task 2: Run quality control on your raw sequence reads
Task 3: Find and select an assembly tool (e.g. MetaVelvet)
Task 4: Specify parameters and your input files. Run the assembly App.
Task 5: Monitor the progress of your analysis and save parameters.
Task 6: View your results.
Discovery Environment Example
Sequence Quality Control in the iPlant DE
Genome, Metagenome, and Transcriptome Assembly
Genome and Metagenome Assembly
ALLPATHS-LG
Newbler
SOAPdenovo
Velvet
MetaVelvet
ABySS
SPA
Digital Norm.
IDBA-UD
Transcriptome Assembly
De novo:
Trinity
SOAPdenovo-Trans
Velvet/Oasis
Trans-ABySS
Reference-guided:
Tophat
Cufflinks
Key: In the DE
Where is the sample data?
Where is the Assembly App?
Specify Data and Assembly Parameters
Specify Run Settings
Track Analyses and Results
What about Annotations?
• Annotations are descriptions of features on contigs in a genome / metagenome
– Ab initio gene predictions
– Protein homology (GenBank nr, SIMAP)
– Curated protein resources (COG, KEGG, …)
• Secondary annotations
– InterPro Scan (Pfam, PIR, Prosite, …)
– GO and other ontologies
– Pathway Mapping (KEGG, MetaCyc, EcoCyc)
Genome and Metagenome Assembly
ALLPATHS-LG
Newbler
SOAPdenovo
Velvet
MetaVelvet
ABySS
SPA
Digital Norm.
IDBA-UD
Ab initio Gene Prediction
Glimmer
Prodigal
FragGeneScan
Metagene
MetaGenmark
Transcriptome Assembly
De novo:
Trinity
SOAPdenovo-Trans
Velvet/Oasis
Trans-ABySS
Reference-guided:
Tophat
Cufflinks
Meta-Genome input
Evidence input
Conversion Tools
tophat2gff
cufflinks2gff
Annotation
Primary:
BLAST
k-mer based
Secondary:
InterProScan
InterPro2GO
Visualization
JBrowse
Web-Apollo
Data Commons: Genomes and Metagenomes, Proteins / Genes, Reference Annotations, Metadata (in iRODS)
Key: At TACC, In the DE, Under Development
Assembly & Annotation at iPlant
✓ Storage ✓ Computation ✓ Analysis ✓ Data Access ✓ Code Distr. ✓ Query by metadata
The Louis Pasteur Method: We can’t “see” all bacteria using culture-based approaches
Razumov (1932) “The Great Plate Anomaly.”
Community
Genomics
Isolate
Metagenomics
The Post-Genomic Era: from Pasteur to CSI
Environmental Sample
Extract DNA High throughput sequencing
Assemble reads Gene Prediction
library creation
Making Sense of Metagenomes
Function
Taxonomy Compare to known proteins
Viromes are dominated by the Unknown
Photic, Aphotic
Hurwitz BL & Sullivan MB. The Pacific Ocean Virome (POV). PLoS One. 8: e57355.
[Pie charts]
Bacteria 5%, Eukaryota 1%, Archaea 0%, Viruses 3%, Unknown 91%
Viruses 7%, Bacteria 4%, Eukaryota 1%, Archaea 0%, Unknown 88%
We need new tools!
Phage Function based on Environment
PcPipe: a Vignette in Viral Metagenomics
Assemble, Find Genes, Protein Clusters
Input reads
Input reads
Cluster Genes
BIN
Organizing the Unknown
Yooseph S, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families. PLoS Biol 5(3):e16.
27K High-Confidence Viral Protein Clusters
GOS 50%
POV + GOS 22%
POV 28%
Isolate Phage 1%
2X environmental viral protein clusters
70% of data now included
Hurwitz BL & Sullivan MB. (2013) The Pacific Ocean Virome (POV). PLoS One. 8: e57355.
Ocean Microbial Communities Vary by Environmental Factors
Pacific Ocean Virome: Geographic Region, Location on a Transect, Season, Depth
Hurwitz BL & Sullivan MB. (2013) The Pacific Ocean Virome (POV). PLoS One. 8: e57355.
[Figure: comparison of POV samples, grouped into Aphotic and Photic. Sample IDs: LJ4O, LJ12A, LJ4D, LJ4A, M6O1K, M7O4K, LF26D, LJ12O, LF26O, LF26A, LJ26O, LA26A, LA26O, LA26D, LJ26D, LJ12D, M3MD, GDS, GFS, M4OS, M5OD, LJ4S, LJ12S, LJ26S, LA26S, LF26S, M2MS, M1CS, SFSS, SFDS, SFCS, STCS]
Hurwitz BL, Brum J. and Sullivan MB. Depth Stratified Functional and Taxonomic Niche Specialization in the ‘Core’ and ‘Flexible’ Pacific Ocean Virome. In Review.
Photic vs Photic
Aphotic vs Photic
Aphotic vs Aphotic
Photic vs Aphotic
Protein Clusters group by photic zone
Many PCs shared, Some PCs shared, Few PCs shared
Host Genes that Promote Viral Replication
Niche Defining Photic Core:
• Fe-S cluster biogenesis and function
• DNA/Protein biosynthesis and repair
• Host “wake-up”
• Energy production in photosynthesis
Hurwitz BL, Hallam S., Sullivan MB. (2013) Metabolic Reprogramming by Viruses in the Sunlit and Dark Ocean. Genome Biology, 14, R123.
Hurwitz BL, Brum J. and Sullivan MB. Depth Stratified Functional and Taxonomic Niche Specialization in the ‘Core’ and ‘Flexible’ Pacific Ocean Virome. In Review.
Adaptive for High Pressure Environments
Niche Defining Aphotic Core:
• DNA replication initiation
• DNA repair
• Motility
• Energy production in the TCA cycle
Hurwitz BL, Hallam S., Sullivan MB. (2013) Metabolic Reprogramming by Viruses in the Sunlit and Dark Ocean. Genome Biology, 14, R123.
Hurwitz BL, Brum J. and Sullivan MB. Depth Stratified Functional and Taxonomic Niche Specialization in the ‘Core’ and ‘Flexible’ Pacific Ocean Virome. In Review.
iPlant Discovery Environment: Automated Workflows
PCpipe: creating protein clusters for viral ecology
New.fastq
QC sequences • FASTQ_shrinker
Assembly part 1 • Velveth
Assembly part 2 • Velvetg
Find Genes • MetaGeneMark
New.a.faa
POV PCs
pcpipe part 1 • Cd-hit-2d
pcpipe part 2 • Cd-hit
POV + Novel PCs
Input to Analyses • Blastx to nr • QIIME • Rarefaction
Creating Workflows: Easy as 1-2-3-4
1. Select the Apps
2. Order the Apps
3. Map Outputs to Inputs
4. Run the analysis
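The four steps can be sketched in a few lines of Python. This is a toy model, not the Discovery Environment's implementation: the `App` class, `run_workflow`, and the stand-in app functions are hypothetical names introduced for illustration. Each stand-in app takes one file name and returns the name of its output, which becomes the next app's input.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class App:
    """Hypothetical stand-in for a DE App: one input file, one output file."""
    name: str
    run: Callable[[str], str]


def run_workflow(apps: List[App], first_input: str) -> List[str]:
    """Run apps in order, mapping each app's output to the next app's input."""
    outputs = []
    current = first_input
    for app in apps:
        current = app.run(current)
        outputs.append(current)
    return outputs


# 1. Select the apps (toy functions that only derive an output file name).
qc = App("quality control", lambda path: path + ".qc")
assemble = App("assembly", lambda path: path + ".contigs")
find_genes = App("gene finding", lambda path: path + ".faa")

# 2. Order the apps.
pipeline = [qc, assemble, find_genes]

# 3 and 4. run_workflow maps outputs to inputs, then the analysis runs.
results = run_workflow(pipeline, "New.fastq")
print(results)
```

A linear chain is the simplest case; it is enough to show why the order of the apps and the output-to-input mapping have to be fixed before the run starts.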
Create a New Workflow
Provide Workflow Information
Select the Apps
Add the Apps
Remove an App
Order the Apps
New.a.faa POV PCs
Map Outputs to Inputs
A New Workflow
User’s ORFs
POV PCs
Run the Workflow
Automated workflows cannot use Apps that run on the HPC
Gotchas in the PCpipe Workflow
pcpipe workflow
New.fastq
QC sequences • FASTQ_shrinker
Assembly part 1 • Velveth
Assembly part 2 • Velvetg
Find Genes • MetaGeneMark
New.a.faa
POV PCs
pcpipe part 1 • Cd-hit-2d
pcpipe part 2 • Cd-hit
POV + Novel PCs
Annotation • Protein annotation • Secondary annotation
Foundation API: Runs on XSEDE (HPC), cannot be used in a workflow
Foundation API: Runs on XSEDE
iPlant App iMicrobe adapter
iMicrobe condor node
BLAST vs SIMAP
cd-hit-2d cd-hit extract proteins in novel PCs
SIMAP Annotation
Pipeline Management
Foundation Code
HPC Job distribution
on condor on condor on condor on TACC on condor
Step 1 Step 2 Step 3 Step 4 Step 5
Input 1: User ORFs
Input 2: Existing Protein Clusters
Output 1: ORFs in existing clusters
Output 2: ORFs in new clusters
Output 3: Annotation for new clusters
An Integrated PCPipe
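The inputs and outputs above suggest a two-stage clustering: user ORFs are first matched against existing protein clusters (the role cd-hit-2d plays in the pipeline), and the unmatched ORFs are then clustered among themselves (the role of cd-hit). A minimal Python sketch of that partition follows. It does not call CD-HIT: the `identity` function is a deliberately crude position-by-position comparison standing in for real sequence alignment, and the 0.9 threshold, the toy sequences, and all function names are hypothetical. The second stage processes longer sequences first, in the spirit of CD-HIT's greedy incremental clustering.

```python
from typing import Dict, List, Tuple


def identity(a: str, b: str) -> float:
    """Toy similarity: fraction of matching positions over the shorter sequence."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def map_to_existing(
    orfs: Dict[str, str], reps: Dict[str, str], threshold: float
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Stage 1: assign each ORF to the first existing cluster it matches."""
    assigned: Dict[str, str] = {}
    unassigned: Dict[str, str] = {}
    for orf_id, seq in orfs.items():
        hit = next(
            (cid for cid, rep in reps.items() if identity(seq, rep) >= threshold),
            None,
        )
        if hit is None:
            unassigned[orf_id] = seq
        else:
            assigned[orf_id] = hit
    return assigned, unassigned


def cluster_novel(orfs: Dict[str, str], threshold: float) -> Dict[str, List[str]]:
    """Stage 2: greedy clustering of leftover ORFs, longest sequence first."""
    reps: Dict[str, str] = {}
    clusters: Dict[str, List[str]] = {}
    ordered = sorted(orfs.items(), key=lambda kv: len(kv[1]), reverse=True)
    for orf_id, seq in ordered:
        hit = next(
            (rid for rid, rep in reps.items() if identity(seq, rep) >= threshold),
            None,
        )
        if hit is None:
            reps[orf_id] = seq          # becomes a new cluster representative
            clusters[orf_id] = [orf_id]
        else:
            clusters[hit].append(orf_id)
    return clusters


# Hypothetical toy data: one existing cluster and four user ORFs.
existing = {"PC1": "MKTAYIAKQR"}
user_orfs = {
    "orf1": "MKTAYIAKQR",
    "orf2": "GGGSSSLLLPPP",
    "orf3": "GGGSSSLLLP",
    "orf4": "WWWWHHHH",
}

assigned, leftover = map_to_existing(user_orfs, existing, threshold=0.9)
novel = cluster_novel(leftover, threshold=0.9)
print(assigned)  # ORFs in existing clusters
print(novel)     # ORFs in new clusters
```

Splitting the work this way means only the ORFs that fail to match a known cluster go on to de novo clustering and annotation.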
Existing PCs (POV)
Directory of user-defined ORFs
PCPipe App
Collaborating with iPlant
• Solve computational bottlenecks
• Make tools easier to use
• Share Data
• Provide community input
Collaboration
Questions or Comments?
Bonnie Hurwitz, PhD
PCpipe: Protein Cluster Pipeline
Steps in iPlant DE
(HPC or Cloud)
New.fastq
QC sequences • FASTQ_shrinker
Assembly • Velvet
Find Genes • Prodigal
ORFs
PCs
pcpipe part 1 • Cd-hit-2d
pcpipe part 2 • Cd-hit
PCs + Novel PCs
Gene Annotation • SIMAP • GO • PFAM…