iMicrobe and iVirus: Extending the iPlant Cyberinfrastructure from Plants to Microbes

Bonnie Hurwitz, PhD, Arizona Health Sciences Center

Extending the iPlant Cyberinfrastructure: From Plants to Microbes

The iPlant Collaborative: Community Cyberinfrastructure for Life Science

http://www.iplantcollaborative.org

iVirus and iMicrobe

Joaquin Ruiz, PhD, Dean, College of Science; Darren Boss; Devesh Chourasiya

Funding Staff

Matt Sullivan, PhD

Shane Burgess, PhD Dean, CALS

The iPlant Collaborative

Vision

Enable life science researchers and educators to use and extend cyberinfrastructure to understand and ultimately predict the complexity of biological systems

How iPlant CI Enables Discovery

Challenge: Create an easy-to-use platform powerful enough to handle data-intensive biology

Many bioinformatics tools are “off limits” to those without specialized computational backgrounds.

iPlant is a collaborative virtual organization

The iPlant Collaborative: Who makes up iPlant?

The iPlant Collaborative: How is iPlant funded?

iPlant Renewed by NSF

•  September 2013 begins next 5-year period
•  Scientific Advisory Board
•  Focus on Genotype-Phenotype science
•  NSF recommended expansion of scope beyond plants

iPlant collaborates to enable access to the solutions that work the best for the community…

The iPlant Collaborative: Who does iPlant collaborate with?

How iPlant CI Enables Discovery: Overview of resources

End Users

Computational Users

TeraGrid

XSEDE

✓  Storage
✓  Computation
✓  Hosting
✓  Web Services
✓  Scalability

Building a platform that can support diverse and constantly evolving needs.

iPlant Data Store

✓  Initial 100 GB allocation – TB allocations available

✓  Automatic data backup

✓  Easy upload/download and sharing

The resources you need to share and manage data with your lab, colleagues and community

Discovery Environment: Hundreds of bioinformatics Apps in an easy-to-use interface

✓  A platform that can run almost any bioinformatics application

✓  Seamlessly integrated with data and high performance computing

✓  User extensible – add your own applications

Agave API: Fully customize iPlant resources

✓  Science-as-a-service platform

✓  Define your own compute and storage resources (local and iPlant)

✓  Build your own app store of scientific code and workflows

Atmosphere: Cloud computing for the life sciences

✓  Simple: One-click access to more than 100 virtual machine images

✓  Flexible: Fully customize your software setup

✓  Powerful: Integrated with iPlant computing and data resources

DNA Subway: Educational workflows for Genomes, DNA Barcoding, RNA-Seq

✓  Commonly used bioinformatics tools in streamlined workflows

✓  Teach important concepts in biology and bioinformatics

✓  Inquiry-based experiments for novel discovery and publication of data

Bisque: Image analysis, management, and metadata

✓  Secure image storage, analysis, and data management

✓  Integrate existing applications or create new ones

✓  Custom visualization and image handling routines and APIs

iMicrobe and iVirus Leverage the iPlant Cyberinfrastructure

Typical End Users

Computational Users

TeraGrid

XSEDE

Using iPlant for:

✓  Storage
✓  Computation
✓  Analysis
✓  App dev.
✓  Pipeline dev.
✓  Code distrib.
✓  Data Discoverability

What’s Under the Hood? Stampede: High-Level Overview

•  Base Cluster (Dell/Intel/Mellanox):
   –  Intel Sandy Bridge processors
   –  Dell dual-socket nodes w/32GB RAM (2GB/core)
   –  6,400 nodes
   –  56 Gb/s Mellanox FDR InfiniBand interconnect
   –  More than 100,000 cores, 2.2 PF peak performance

•  Co-Processors:
   –  Intel Xeon Phi “MIC” Many Integrated Core processors
   –  Special release of “Knights Corner” (61 cores)
   –  All MIC cards are on site at TACC; more than 6,000 installed, final installation ongoing for formal summer acceptance
   –  7+ PF peak performance

•  Max Total Concurrency:
   –  exceeds 500,000 cores
   –  1.8M threads

•  Entered production operations on January 7, 2013
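As a rough sanity check on the base-cluster figures above, the core count and peak performance can be reproduced with simple arithmetic. The 16 cores per node follows from 32 GB per node at 2 GB per core; the 2.7 GHz clock and 8 double-precision operations per cycle per core are assumptions that do not appear on the slide.

```python
# Figures taken from the slide.
nodes = 6400
ram_per_node_gb = 32
ram_per_core_gb = 2

cores_per_node = ram_per_node_gb // ram_per_core_gb   # 16 cores per node
total_cores = nodes * cores_per_node                  # "more than 100,000 cores"

# Assumptions (not on the slide): clock speed and floating-point
# operations per cycle per core for the Sandy Bridge processors.
clock_ghz = 2.7
flops_per_cycle = 8

peak_pf = total_cores * clock_ghz * 1e9 * flops_per_cycle / 1e15

print(f"{total_cores} cores")
print(f"{peak_pf:.2f} PF peak")
```

Under these assumptions the estimate lands close to the 2.2 PF quoted for the base cluster.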

iMicrobe/ iVirus: New App Development

June 2013 – May 2014:
•  13 new Apps
•  1 high-throughput analysis pipeline

Forging Ahead with iPlant

•  Build a metagenomics toolkit

•  Streamline metagenomics workflows

•  Enable high-throughput computing

•  Provide key datasets for computation

iPlant Data Store

The resources you need to share and manage data with your lab, colleagues and community

Overview of the iPlant Data Store: Some Complications of Big Data

•  Difficult/slow transfers
•  Expense for storage/backup
•  Difficult to share and publish
•  Metadata
•  Analysis

iPlant Supports the Life Cycle of Data

[Figure: the data life cycle. Stages: Store, Markup, Search, Transfer, Analyze, Visualize, Collaborate, Share. Diagram labels: Data, Algo1, Algo2, Results A, Results B; Pre-Publication, Post-Publication; TeraGrid, XSEDE.]

Overview of the iPlant Data Store: Scalable, Reliable, Redundant, High-performance

•  Access your data from multiple iPlant services

•  Automatic data backup (redundant between University of Arizona and University of Texas)

•  Multiple ways to share data with collaborators

•  Multi-threaded high speed transfers

•  Default 100 GB allocation. >1 TB allocations available with justification

Overview of the iPlant Data Store: Some important items we won’t see

Source            Destination    Copy Method    Time (seconds)
CD                My Computer    cp             320
Berkeley Server   My Computer    scp            150
External Drive    My Computer    cp             36
USB 2.0 Flash     My Computer    cp             30
iDS               My Computer    iget           18
My Computer       My Computer    cp             15

Close to optimum conditions; transfer between Univ. of Arizona and UC Berkeley

100 GB: 29m15s (1 GB / 17.5 seconds)
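The per-gigabyte figure quoted above follows directly from the 100 GB transfer time. A minimal sketch of the arithmetic, assuming decimal units (1 GB = 1000 MB) for the throughput conversion:

```python
# Figures quoted on the slide: 100 GB transferred in 29 min 15 s.
size_gb = 100
elapsed_s = 29 * 60 + 15                  # total transfer time in seconds

seconds_per_gb = elapsed_s / size_gb      # time to move one gigabyte
mb_per_s = size_gb * 1000 / elapsed_s     # assumes 1 GB = 1000 MB

print(f"{seconds_per_gb:.2f} s per GB")
print(f"{mb_per_s:.1f} MB/s")
```

This works out to roughly 17.5 seconds per gigabyte, or a sustained rate of about 57 MB/s under the decimal-unit assumption.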

Discovery Environment

Hundreds of bioinformatics Apps in an easy-to-use interface

Overview of the iPlant Discovery Environment

Through the Discovery Environment you have:

•  High-powered computing

•  iPlant data store

•  Easy to use interface

•  Virtually limitless apps

•  Analysis history (provenance)

What can you do in the iPlant DE?

Scalable platform for powerful computing, data, and application resources

•  Navigate the components of the DE

•  Access and manipulate data

•  Start and complete an analysis

•  Track your analysis and see your results

Why is iPlant DE Scalable?

Democratize your code

•  Rich platform for bioinformatics: ~400 apps (and counting)
•  Data co-localized with analysis
•  Easy to use interface, with access to support
•  Easy to integrate and customize your own tools

Goal: Create a metagenomic assembly.
Task 1: Upload metagenomic fasta file to your personal data store
Task 2: Run quality control on your raw sequence reads
Task 3: Find and select an assembly tool (e.g. MetaVelvet)
Task 4: Specify parameters and your input files. Run the assembly App.
Task 5: Monitor the progress of your analysis and save parameters.
Task 6: View your results.

Discovery Environment Example

Sequence Quality Control in the iPlant DE

Genome, Metagenome, and Transcriptome Assembly

Genome and Metagenome Assembly: ALLPATHS-LG, Newbler, SOAPdenovo, Velvet, MetaVelvet, ABySS, SPA, Digital Norm., IDBA-UD

Transcriptome Assembly
De novo: Trinity, SOAPdenovo-Trans, Velvet/Oasis, Trans-ABySS
Reference-guided: TopHat, Cufflinks

Key: In the DE

Where is the sample data?

Where is the Assembly App?

Specify Data and Assembly Parameters

Specify Run Settings

Track Analyses and Results

What about Annotations?

•  Annotations are descriptions of features on contigs in a genome / metagenome
   –  Ab initio gene predictions
   –  Protein homology (GenBank nr, SIMAP)
   –  Curated protein resources (COG, KEGG, …)

•  Secondary annotations
   –  InterPro Scan (Pfam, PIR, Prosite, …)
   –  GO and other ontologies
   –  Pathway Mapping (KEGG, MetaCyc, EcoCyc)

Assembly & Annotation at iPlant

[Diagram: assembly and annotation tools]

Inputs: Meta-Genome input; Evidence input

Genome and Metagenome Assembly: ALLPATHS-LG, Newbler, SOAPdenovo, Velvet, MetaVelvet, ABySS, SPA, Digital Norm., IDBA-UD

Ab initio Gene Prediction: Glimmer, Prodigal, FragGeneScan, MetaGene, MetaGeneMark

Transcriptome Assembly
De novo: Trinity, SOAPdenovo-Trans, Velvet/Oasis, Trans-ABySS
Reference-guided: TopHat, Cufflinks

Conversion Tools: tophat2gff, cufflinks2gff

Annotation
Primary: BLAST, k-mer based
Secondary: InterProScan, InterPro2GO

Visualization: JBrowse, Web-Apollo

Data Commons: Genomes and Metagenomes; Proteins / Genes; Reference Annotations; Metadata (in iRODS)

Key: At TACC; In the DE; Under Development

✓  Storage
✓  Computation
✓  Analysis
✓  Data Access
✓  Code Distr.
✓  Query by metadata

The Louis Pasteur Method: We can’t “see” all bacteria using culture-based approaches

Razumov (1932) “The Great Plate Anomaly.”

[Diagram labels: Isolate, Community, Genomics, Metagenomics]

The Post-Genomic Era: from Pasteur to CSI

Environmental Sample → Extract DNA → Library creation → High throughput sequencing → Assemble reads → Gene Prediction

Making Sense of Metagenomes

Function

Taxonomy Compare to known proteins

Viromes are dominated by the Unknown

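[Figure: two pie charts of taxonomic assignment, labeled Photic and Aphotic. One chart: Bacteria 5%, Eukaryota 1%, Archaea 0%, Viruses 3%, Unknown 91%. Other chart: Viruses 7%, Bacteria 4%, Eukaryota 1%, Archaea 0%, Unknown 88%.]

Hurwitz BL & Sullivan MB. (2013) The Pacific Ocean Virome (POV). PLoS One. 8: e57355.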

We need new tools!

Phage Function based on Environment

PCpipe: a Vignette in Viral Metagenomics

[Diagram labels: Input reads, Assemble, Find Genes, Cluster Genes, Protein Clusters, BIN]

Organizing the Unknown

Yooseph S, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families. PLoS Biol 5(3):e16.

27K High-Confidence Viral Protein Clusters

[Pie chart: GOS 50%; POV + GOS 22%; POV 28%; Isolate Phage 1%]

2X environmental viral protein clusters

70% of data now included

Hurwitz BL & Sullivan MB. (2013) The Pacific Ocean Virome (POV). PLoS One. 8: e57355.

Ocean Microbial Communities Vary by Environmental Factors

Pacific Ocean Virome:
•  Geographic Region
•  Location on a Transect
•  Season
•  Depth

Hurwitz BL & Sullivan MB. (2013) The Pacific Ocean Virome (POV). PLoS One. 8: e57355.

[Figure: Pacific Ocean Virome samples on both axes, grouped as Aphotic and Photic. Sample labels: LJ4O, LJ12A, LJ4D, LJ4A, M6O1K, M7O4K, LF26D, LJ12O, LF26O, LF26A, LJ26O, LA26A, LA26O, LA26D, LJ26D, LJ12D, M3MD, GDS, GFS, M4OS, M5OD, LJ4S, LJ12S, LJ26S, LA26S, LF26S, M2MS, M1CS, SFSS, SFDS, SFCS, STCS.]

Hurwitz BL, Brum J. and Sullivan MB. Depth Stratified Functional and Taxonomic Niche Specialization in the ‘Core’ and ‘Flexible’ Pacific Ocean Virome. In Review.

Protein Clusters group by photic zone

[Figure panels: Photic vs Photic; Aphotic vs Photic; Aphotic vs Aphotic; Photic vs Aphotic. Legend: Many PCs shared; Some PCs shared; Few PCs shared.]

Niche Defining Photic Core: Host Genes that Promote Viral Replication

•  Fe-S cluster biogenesis and function
•  DNA/Protein biosynthesis and repair
•  Host “wake-up”
•  Energy production in photosynthesis

Hurwitz BL, Hallam S., Sullivan MB. (2013) Metabolic Reprogramming by Viruses in the Sunlit and Dark Ocean. Genome Biology, 14, R123.
Hurwitz BL, Brum J. and Sullivan MB. Depth Stratified Functional and Taxonomic Niche Specialization in the ‘Core’ and ‘Flexible’ Pacific Ocean Virome. In Review.

Niche Defining Aphotic Core: Adaptive for High Pressure Environments

•  DNA replication initiation
•  DNA repair
•  Motility
•  Energy production in the TCA cycle

Hurwitz BL, Hallam S., Sullivan MB. (2013) Metabolic Reprogramming by Viruses in the Sunlit and Dark Ocean. Genome Biology, 14, R123.
Hurwitz BL, Brum J. and Sullivan MB. Depth Stratified Functional and Taxonomic Niche Specialization in the ‘Core’ and ‘Flexible’ Pacific Ocean Virome. In Review.

iPlant Discovery Environment: Automated Workflows

PCpipe: creating protein clusters for viral ecology

[Workflow diagram]
New.fastq
QC sequences: FASTQ_shrinker
Assembly part 1: Velveth
Assembly part 2: Velvetg
Find Genes: MetaGeneMark
New.a.faa
POV PCs
pcpipe part 1: cd-hit-2d
pcpipe part 2: cd-hit
POV + Novel PCs
Input to Analyses: Blastx to nr; QIIME; Rarefaction

Creating Workflows: Easy as 1-2-3-4

1.   Select the Apps
2.   Order the Apps
3.   Map Outputs to Inputs
4.   Run the analysis
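The four steps above can be sketched as a small data structure: an ordered list of Apps in which each input is mapped either to a user-supplied file or to an output of an earlier App. This is a minimal illustration, not the Discovery Environment's implementation; the `Step` class, the `validate` function, and all input and output labels (`hash_dir`, `contigs`, `unmatched`, and so on) are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    app: str                                     # App name as it appears in the workflow
    inputs: dict = field(default_factory=dict)   # input label -> source it is mapped to
    outputs: tuple = ()                          # labels of the files the App produces


def validate(steps, user_files):
    """Return the App order if every input maps to a user file or an earlier output."""
    available = set(user_files)
    for step in steps:
        for label, source in step.inputs.items():
            if source not in available:
                raise ValueError(f"{step.app}: '{label}' is mapped to unavailable '{source}'")
        # Outputs become available to later steps as "<app>.<label>".
        available.update(f"{step.app}.{name}" for name in step.outputs)
    return [step.app for step in steps]


# Steps 1-3: select the Apps, order them, and map outputs to inputs.
# All labels below are hypothetical.
pcpipe = [
    Step("velveth", {"reads": "New.fastq"}, ("hash_dir",)),
    Step("velvetg", {"hash_dir": "velveth.hash_dir"}, ("contigs",)),
    Step("MetaGeneMark", {"contigs": "velvetg.contigs"}, ("proteins",)),
    Step("cd-hit-2d", {"known": "POV_PCs", "query": "MetaGeneMark.proteins"}, ("unmatched",)),
    Step("cd-hit", {"proteins": "cd-hit-2d.unmatched"}, ("clusters",)),
]

# Step 4 would run the Apps in this order; here the sketch only checks the mapping.
order = validate(pcpipe, user_files={"New.fastq", "POV_PCs"})
print(" -> ".join(order))
```

Checking the mapping before running catches ordering mistakes early: placing cd-hit ahead of cd-hit-2d, for example, raises an error because its input has not been produced yet.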

Create a New Workflow

Provide Workflow Information

Select the Apps

Add the Apps

Remove an App

Order the Apps

New.a.faa POV PCs

Map Outputs to Inputs

A New Workflow

User’s ORFs

POV PCs

Run the Workflow

Automated workflows cannot use Apps that run on the HPC


Gotchas in the PCpipe Workflow

[Workflow diagram: pcpipe workflow]
New.fastq
Assembly part 1: Velveth
Assembly part 2: Velvetg
Find Genes: MetaGeneMark
New.a.faa
POV PCs
POV + Novel PCs
Annotation: Protein annotation; Secondary annotation

Foundation API: Runs on XSEDE (HPC); cannot be used in a workflow

Foundation API: Runs on XSEDE

[Diagram: integrated PCPipe architecture]

Components: iPlant App; iMicrobe adapter; iMicrobe condor node

Layers: Pipeline Management; Foundation Code; HPC Job distribution

Steps 1 to 5: cd-hit-2d; cd-hit; extract proteins in novel PCs; BLAST vs SIMAP; SIMAP Annotation

Execution, Steps 1 to 5: on Condor; on Condor; on Condor; on TACC; on Condor

Input 1: User ORFs
Input 2: Existing Protein Clusters

Output 1: ORFs in existing clusters
Output 2: ORFs in new clusters
Output 3: Annotation for new clusters
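Because the steps run partly on Condor and partly on TACC, and automated workflows cannot include Apps that run on the HPC, one way to picture the job distribution is to split the ordered steps into runs of Condor Apps separated by individual HPC jobs. The sketch below is only an illustration under that reading: the function name, the step identifiers, and the choice of which step counts as HPC are hypothetical.

```python
from itertools import groupby


def split_for_automation(steps, runs_on_hpc):
    """Group consecutive non-HPC steps together; keep each HPC step as its own job."""
    segments = []
    for is_hpc, group in groupby(steps, key=lambda app: app in runs_on_hpc):
        apps = list(group)
        if is_hpc:
            segments.extend(("hpc", [app]) for app in apps)
        else:
            segments.append(("workflow", apps))
    return segments


# Hypothetical identifiers for the five steps in the diagram.
steps = ["cd-hit-2d", "cd-hit", "extract-novel-pcs", "blast-vs-simap", "simap-annotation"]

# Assumption: only the BLAST step is dispatched to the HPC system.
segments = split_for_automation(steps, runs_on_hpc={"blast-vs-simap"})

for kind, apps in segments:
    print(kind, apps)
```

With these inputs the five steps fall into three segments: a three-App run, a single HPC job, and a final one-App run.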

An Integrated PCPipe

Inputs: Existing PCs (POV); Directory of user-defined ORFs

PCPipe App

Collaborating with iPlant

•  Solve computational bottlenecks
•  Make tools easier to use
•  Share Data
•  Provide community input

Collaboration

Questions or Comments?

Bonnie Hurwitz, PhD


PCpipe: Protein Cluster Pipeline

Steps in iPlant DE

(HPC or Cloud)

[Workflow diagram]
New.fastq
Assembly: Velvet
Find Genes: Prodigal
ORFs
PCs
PCs + Novel PCs
Gene Annotation: SIMAP, GO, PFAM…
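Once a clustering step has produced protein clusters, a common follow-up is to count cluster sizes. The sketch below parses text in the layout that CD-HIT writes to its `.clstr` files, as I understand that format: a `>Cluster N` header followed by member lines, with `*` marking the representative sequence. The sample records and ORF names are invented for illustration, and nothing here is taken from the PCpipe code itself.

```python
import re
from collections import defaultdict

# Member lines look like: "0<TAB>120aa, >orf_1... *" or "1<TAB>118aa, >orf_7... at 96.61%"
MEMBER = re.compile(r">(\S+?)\.\.\.\s+(\*|at\s)")


def parse_clstr(lines):
    """Map cluster id -> list of (sequence name, is_representative)."""
    clusters = defaultdict(list)
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = line.split()[1]
            continue
        match = MEMBER.search(line)
        if match and current is not None:
            clusters[current].append((match.group(1), match.group(2) == "*"))
    return dict(clusters)


# Invented sample in the assumed .clstr layout.
sample = """>Cluster 0
0\t120aa, >orf_1... *
1\t118aa, >orf_7... at 96.61%
>Cluster 1
0\t95aa, >orf_3... *
"""

clusters = parse_clstr(sample.splitlines())
sizes = {cid: len(members) for cid, members in clusters.items()}
print(sizes)
```

Single-member clusters can then be filtered out or reported separately, depending on how strict a definition of a protein cluster is wanted.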
