bioKepler: A Comprehensive Bioinformatics Scientific Workflow Module for Distributed Analysis of Large-Scale Biological Data
WorDS.sdsc.edu
Ilkay Altintas1, Jianwu Wang2, Daniel Crawl1, Shweta Purawat1
1 San Diego Supercomputer Center, UC San Diego 2 UMBC
A Toolbox with Many Tools
Need expertise to identify which tool to use when and how! Require computation models to schedule and optimize execution!
• Data: search, database access, IO operations, streaming data in real-time…
• Compute: data-parallel patterns, external execution, …
• Network operations
• Provenance and fault tolerance
Workflows are Used in These Diverse Scenarios in Biological Sciences
Data Acquisition / Generation, Data Analysis, Data Publication / Archival
• Sequencers • Sensor networks • Medical imaging
• Often for data reduction • In real-time or offline
• Many forms • Data-intensive • HPC • Local Exploratory
• From analysis to searchable results • Standardization • Auto generation of methods and materials
Workflows foster collaborations!
• Flexibility and synergy • Optimization of resources • Increasing reuse • Standards compliance
CAMERA Example:
Using Scientific Workflows and Related Provenance for Collaborative Metagenomics Research
Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis (CAMERA)
http://camera.calit2.net
CAMERA is a Collaborative Environment
• Data Cart: multiple available; mixed collections of CAMERA data (e.g., projects, samples)
• User Workspace: single workspace with access to all data and results (private and shared)
• Group Workspace: share specified User Workspace data with collaborators
• Data Discovery: GIS and advanced query options
• Data Analysis: workflow-based analysis
Workflows are a Central Part of CAMERA
• CAMERA-supported
– 28 existing workflows
• Workflows under development
– Fragment Recruitment Viewer
– Next Generation Sequencing
– VIROME Pipeline
– Standalone bioinformatics tools
– National Center for Genome Research
– Joint Genome Institute
• User built
– Currently running in a sandbox
– Will be ported to a virtual cloud environment
All can be reached through the CAMERA portal at: http://portal.camera.calit2.net
• Inputs: from local or CAMERA file systems; user-supplied parameters
• Outputs: sharable with a group of users and links to the semantic database
[Figure: metagenomic workflows: QC filter, Duplicate filtering, Taxonomy Binning, BLAST, Assembly, Annotation and Clustering, Comparison, Statistical analysis, and more]
More than 1500 workflow submissions monthly!
CAMERA Portal - Workflows
CAMERA Workflows
RAMMCAP
CAMERA Workflows
CAMERA Workflows
CAMERA Job Status
CAMERA Workflow Results
Pushing the boundaries of existing infrastructure and workflow system capabilities
New Requirements from the User Community
• Increase reuse
– best development practices by the scientific community
– other bio packages
• Increase programmability by end users
– users with various skill levels
– to formulate actual domain specific workflows
• Increase resource utilization
– optimize execution across available computing resources
– in an efficient, transparent and intuitive manner
• Make analysis a part of the end-to-end scientific model from data generation to publication
RAMMCAP – Rapid Clustering and Functional Annotation for Metagenomic Sequences
Annotation features:
• tRNA prediction (tRNAscan)
• rRNA prediction (meta_RNA, BLAST)
• ORF call (ORF_finder, Metagene)
• RPS-BLAST against COG etc.
• HMMER against Pfam / Tigrfam
• Clustering of reads
• Multi-step clustering of ORFs
• GO assignment
• EC number assignment
A number of bioinformatics tools are used in RAMMCAP
Tool: Description
BLAST: Scalable parallel database search with blastn, blastp, tblastn, blastx, tblastx
MegaBLAST: Fast database search with MegaBLAST
Diversity: Diversity analysis for viral metagenome
QC: Quality control for 454 raw reads
CD-HIT-454: Identify artificial duplicates from 454 reads
RAMMCAP: Metagenome annotation
– rRNA, tRNA, ORF prediction
– reads and ORF clustering
– reads and ORF information
– family and function annotation (Pfam, TIGRfam, COG)
– Gene Ontology and Enzyme Classification annotation
– Combined annotation summary
FRV: Fragment Recruitment Viewer
Assembly: Consensus-based meta-assembler for 454 reads
KEGG: Pathway annotation by searching the KEGG database with blastp
RDP binning: Taxonomy binning of rRNA sequences using RDP classifier
BLAST binning: Taxonomy binning by querying ref. rRNA DB using blastn
tRNA: Identification of tRNAs from fragments using tRNA-scan
Meta-RNA: Identification of rRNAs from fragments using HMM
BLAST-RNA: Identification of rRNAs by querying ref. rRNA DB using blastn
ORF_finder: ORF call by six reading frame translation
Metagene: ORF call by Metagene
FragGeneScan: ORF call with FragGeneScan from 454 reads
Pfam: Protein family annotation against Pfam using HMMER
TIGRfam: Protein family annotation against TIGRfam using HMMER
COG: Protein family annotation against NCBI COG using rps-blast
KOG: Protein family annotation against NCBI KOG using rps-blast
PRK: Protein family annotation against NCBI PRK using rps-blast
CD-HIT-EST: Clustering of reads
CD-HIT: Clustering of ORFs
H-CD-HIT: Multiple level clustering of ORFs into ORF family
Original implementation of the annotation workflow in Kepler
A green box is called an 'actor', which performs a task.
This special actor represents an annotation component, such as BLAST search.
Workflow parameters, which can be specified by users in the portal, are passed to workflow components.
Data flow is divided.
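The actor idea on this slide can be sketched in a few lines of Python. This is only an illustration and not Kepler's own (Java) API: the `Actor` class, the `fire` method, the `min_length` parameter and the three toy tasks are all hypothetical names chosen for the sketch. It shows one actor per task, workflow parameters handed to every actor, and a divided data flow where one output feeds two branches.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Actor:
    """Hypothetical stand-in for a workflow actor: a named step that
    receives input data plus the workflow-level parameters."""
    name: str
    task: Callable[[List[str], Dict[str, object]], List[str]]

    def fire(self, data: List[str], params: Dict[str, object]) -> List[str]:
        return self.task(data, params)


def length_filter(seqs, params):
    # Keep sequences at least `min_length` long (illustrative parameter).
    return [s for s in seqs if len(s) >= params["min_length"]]


def count_gc(seqs, params):
    # Toy annotation step: append the number of G/C bases to each sequence.
    return [f"{s}:{sum(base in 'GC' for base in s)}" for s in seqs]


def reverse(seqs, params):
    # Second toy annotation step: reverse each sequence.
    return [s[::-1] for s in seqs]


def run_workflow(seqs, params):
    qc = Actor("qc", length_filter)
    branches = [Actor("gc", count_gc), Actor("rev", reverse)]
    filtered = qc.fire(seqs, params)
    # The data flow is divided: the same filtered data goes to each branch.
    return {b.name: b.fire(filtered, params) for b in branches}


results = run_workflow(["ACGT", "GG", "GCGCA"], {"min_length": 4})
```

In a portal setting the `params` dictionary would be filled from user input, so the workflow definition stays fixed while the values change per submission.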
Each actor was a wrapper to a web service!
Customized web services!
RAMMCAP
[Figure: computational requirements of the RAMMCAP tools (QC, tRNA, cd-hit, hmmer, metagene, blast) on four axes. Data size: KB, MB, GB, TB. CPU time: Second, Hour, Day, Month, Year. Memory: GB, 10GB, 100GB. Parallel: No need, No, Multi threading, MPI, Map Reduce.]
RAMMCAP – Rapid Clustering and Functional Annotation for Metagenomic Sequences
[Figure: two panels showing the RAMMCAP tools (QC, tRNA, cd-hit, hmmer, metagene, blast) on four axes, the second panel labeled NGS. Data size: KB, MB, GB, TB. CPU time: Minute, Hour, Day, Month, Year. Memory: GB, 10GB, 100GB. Parallel: No need, No, Multi threading, MPI, Map Reduce.]
Other cases: RNA-seq / genomic / metagenomic
[Figure: NGS pipeline with the stages Raw reads, Reads QC, HQ reads, Assemble (Velvet, SOAPdenovo, Abyss, Oases, Trinity), Contigs, mapping (BWA, Bowtie, BLAST), Alignments, Further analysis]
[Figure: NGS steps (QC, mapping, assembly) on four axes. Data size: KB, MB, GB, TB. CPU time: Minute, Hour, Day, Month. Memory: GB, 10GB, 100GB. Parallel: No need, No, Multi threading, MPI, Map Reduce.]
bioKepler implementation: Using bioActors instead of wrapper actors
Wrapper Actors
• Need implementation of underlying computational tools
bioActors
• Reusable
• Multiple execution modes
• Built-in parallel execution capabilities
Gateways and other user environments
bioKepler Kepler and Provenance Framework
BioLinux Galaxy Clovr Hadoop
…
CLOUD and OTHER COMPUTING RESOURCES e.g., SGE, Amazon, FutureGrid, XSEDE
www.bioKepler.org
May 22nd, 2014 Scalable Bioinformatics Boot Camp
A coordinated ecosystem of biological and technological packages for bioinformatics!
The bioKepler Approach
• Parallel Computation Framework
– Use Distributed Data-Parallel (DDP) frameworks, e.g., MapReduce, and other parallelization methods to execute subworkflows
• bioActors
– Configurable and reusable higher-order components for bioinformatics and computational biology
• Transparent support for different execution engines and computational environments
• Deployment on diverse environments
Reuse, Programmability, Execution
• Funded by NSF ABI & CI Reuse programs - Altintas (PI) and Li (Co-PI)
• Development of a comprehensive bioinformatics scientific workflow module for distributed analysis of large-scale biological data
Big improvement on usability and programmability by end users!
[Figure: software stack: Galaxy, bioKepler, Kepler (CORE, Distributed Data Parallel, Provenance, Reporting, Run Manager, …), Bio-Linux, CloudBioLinux, …]
Kepler supports
• Workflows
• Other third party programming tools, e.g., R, Matlab, KNIME
• Extensible task and data parallelization
• Service orientation
• Execution on multiple engines, e.g., SDF, SGE, Hadoop
bioKepler’s Conceptual Framework
[Figure: bioKepler conceptual framework diagram. Recoverable labels:]
• Layers: Kepler, bioKepler
• Compute: Amazon EC2, FutureGrid, Sun Grid Engine, Adhoc Network, Triton Resource
• Data: CAMERA, Ensembl, Genbank
• Bioinformatics Tools: Clustering, Mapping, Assembly
• bioActors: BLAST, HMMER, CD-HIT
• Data-Parallel Execution Patterns: Map-Reduce, Master-Slave, All-Pairs
• Provenance: Execution History, Data Lineage
• Reporting: PDF Generation, Report Designer
• Fault-Tolerance: Error Handling, Alternatives
• Run Manager: Tag, Search
• Other labels: Deploy & Execute, Customize & Integrate, Transfer, Director, Scheduler, Execution Engine, Executable Workflow Plan, Bioinformatician, Workflow
bioActors
• Set of steps to execute a bioinformatics tool locally or in an external environment
– Locally executable
– Parallelized external execution
• Customizable by the user based on external packages
– Tools imported from CloudBioLinux
• Tools are evaluated on their computational requirements
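A rough way to picture "one tool, several execution modes" is a small dispatch table. The sketch below is an assumption-laden illustration, not the bioActor implementation: `EXECUTION_CHOICES`, `run_local`, `run_external` and the toy reverse-complement tool are hypothetical, and a child Python interpreter stands in for a real external environment.

```python
import subprocess
import sys

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    # Toy "bioinformatics tool" used by the local execution mode.
    return "".join(COMPLEMENT[base] for base in reversed(seq))


# The same tool written as a standalone script, run in a separate process.
EXTERNAL_SCRIPT = (
    "import sys\n"
    "comp = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}\n"
    "for line in sys.stdin:\n"
    "    s = line.strip()\n"
    "    if s:\n"
    "        print(''.join(comp[b] for b in reversed(s)))\n"
)


def run_local(seqs):
    # Local mode: call the tool in-process.
    return [reverse_complement(s) for s in seqs]


def run_external(seqs):
    # External mode (simulated): pipe the input to another interpreter.
    proc = subprocess.run(
        [sys.executable, "-c", EXTERNAL_SCRIPT],
        input="\n".join(seqs),
        capture_output=True,
        text=True,
        check=True,
    )
    return proc.stdout.split()


# Hypothetical registry of execution choices for this one step.
EXECUTION_CHOICES = {"local": run_local, "external": run_external}


def execute(seqs, choice="local"):
    return EXECUTION_CHOICES[choice](seqs)


local_out = execute(["ACGT", "AAC"], "local")
external_out = execute(["ACGT", "AAC"], "external")
```

The point of the design is that the caller only names a choice; the step's inputs and outputs look the same whichever mode runs it.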
Transparent Execution includes Parallelization Solutions in Distributed Environments
• Traditional parallel programming interfaces
– Examples: MPI and OpenMP
– Hard to implement
– Original sequential tools cannot be reused
• Parallel job execution
– Examples: SGE and Condor
– Original sequential tools can be reused
– Create small jobs by splitting data or tasks
– Hard to achieve data locality for each job
• Data parallel job execution
– Examples: Hadoop and Stratosphere
– Original sequential tools can be reused
– Support customized and automatic data partition and distribution
– Support data locality for each job through special distributed file system, HDFS
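The "create small jobs by splitting data" idea from the list above can be sketched as follows. This is a minimal illustration under stated assumptions: a thread pool stands in for a scheduler such as SGE or Condor (no real job submission happens), and `parse_fasta`, `split_into_jobs`, `sequential_tool` and `run_jobs` are hypothetical helper names. The sequential tool is reused unchanged; only the input is split and the outputs are merged.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) records."""
    records, header, parts = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(parts)))
            header, parts = line[1:], []
        else:
            parts.append(line)
    if header is not None:
        records.append((header, "".join(parts)))
    return records


def split_into_jobs(records, job_size):
    # Split the input into small jobs of at most `job_size` records.
    return [records[i:i + job_size] for i in range(0, len(records), job_size)]


def sequential_tool(records):
    # Stand-in for an unmodified sequential program: GC fraction per record.
    return [
        (name, round(sum(b in "GC" for b in seq) / len(seq), 2) if seq else 0.0)
        for name, seq in records
    ]


def run_jobs(records, job_size=2, workers=2):
    jobs = split_into_jobs(records, job_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in job order, so merging is a simple concat.
        parts = list(pool.map(sequential_tool, jobs))
    return [row for part in parts for row in part]


FASTA = """>r1
ACGT
>r2
GGCC
>r3
AATT
"""
records = parse_fasta(FASTA)
merged = run_jobs(records)
```

Note that nothing here controls where each job's data lives, which is the data locality gap the slide attributes to this approach and that HDFS-backed frameworks address.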
Distributed Data-Parallel bioActors
• Set of steps to execute a bioinformatics tool in DDP environment
• Customized from the ExecutionChoice actor
• Includes:
– Data-parallel patterns, e.g., Map, Reduce, Cross, All-Pairs, etc., to specify data grouping
– I/O to interface with storage
– Data format specifying how to split and join
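How a pattern "specifies data grouping" may be easier to see in code. The sketch below shows one common reading of Map, Reduce and Cross as plain single-machine Python functions; it is an assumption about their semantics rather than the Kepler DDP actors themselves, the function names are hypothetical, and All-Pairs is left out because its exact grouping is not spelled out on the slide.

```python
from collections import defaultdict
from itertools import product


def map_pattern(fn, items):
    # Map: each input element is processed independently.
    return [fn(x) for x in items]


def reduce_pattern(fn, pairs):
    # Reduce: (key, value) pairs are grouped by key, then each group is processed.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: fn(values) for key, values in groups.items()}


def cross_pattern(fn, left, right):
    # Cross: every element of one input is combined with every element of the other.
    return [fn(a, b) for a, b in product(left, right)]


reads = ["ACGT", "AC", "GGTT", "TT"]

# Map each read to a (length, read) pair, then count reads per length.
keyed = map_pattern(lambda s: (len(s), s), reads)
per_length = reduce_pattern(len, keyed)

# Cross short queries against longer references with a substring check.
hits = cross_pattern(lambda q, r: (q, r, q in r), ["AC", "TT"], ["ACGT", "GGTT"])
```

Because each pattern fixes only how inputs are grouped, a framework is free to distribute the groups across machines while the user function stays sequential.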
DDP bioActor Usage Model
[Figure: DDP bioActor usage model. User: Workflow Developer. Recoverable steps:]
1. Search the bioActor Library
2a. Choose Specific (DDP Blast)
2b. Choose Generic (DDP Generic)
2b. Create Sub-Workflow (A1, A2, … An)
3. Add to Workflow (Workflow with DDP Director)
4a. Execute (Results)
4b. Add to Larger Workflow
4c. Save in Library
Status of bioActors
500+ bioActors are listed under the current bioKepler release; ~40 of them are parallelized.
Example bioActors
• Alignment: BLAST, BLAT
• Profile-Sequence Alignment: PSI-BLAST
• Hidden Markov Model: HMMER
• Mapping: Bowtie, BWA, Samtools
• Multiple Alignment: ClustalW, Muscle
• Clustering: CD-HIT, Blastclust
• Gene Prediction: Glimmer, Genescan, Fraggenescan
• tRNA prediction: tRNA-scan, Meta-RNA
• Phylogeny: FastTree, RAxML
Example Workflows
Current Release
• A bioKepler VM executable on Amazon EC2, FutureGrid and SDSC Cloud
– Builds upon CloudBioLinux including Bio-Linux and Galaxy
• A bioActor template that can be customized for different execution choices
– e.g., local vs. Map/Reduce on a specific environment
• Example use cases
Downloadable as a package at: http://www.biokepler.org/releases
Demo and Questions
WorDS Director: Ilkay Altintas, Ph.D.
Email: altintas@sdsc.edu