

• Pegasus is a system for mapping and executing abstract application workflows over a range of execution environments.
• The output is an executable workflow that can be executed over a variety of resources (clouds, XSEDE, OSG, campus grids, clusters, workstations).
• Pegasus can run workflows comprising millions of tasks.
• Pegasus WMS consists of three main components: the Pegasus mapper, Condor DAGMan, and the Condor schedd.
• The mapper maps tasks to execution resources based on information derived from static and/or dynamic sources. Pegasus adds and manages data transfers between the tasks as required.
• DAGMan takes this executable workflow, manages the dependencies between the tasks, and releases them to the Condor schedd for execution.
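The dependency handling described above can be illustrated with a minimal sketch. It is not DAGMan's implementation: the function `release_order`, the task names, and the input format (a dictionary mapping each task to the set of tasks it depends on) are assumptions made for this example, and "releasing" a task is reduced to appending it to a list.

```python
from collections import deque


def release_order(dependencies):
    """Return tasks in an order that respects parent -> child dependencies.

    `dependencies` maps each task name to the set of its parent tasks.
    Every parent must also appear as a key (hypothetical input format).
    """
    # Count unmet parents per task and index children by parent.
    pending = {task: len(parents) for task, parents in dependencies.items()}
    children = {task: [] for task in dependencies}
    for task, parents in dependencies.items():
        for parent in parents:
            children[parent].append(task)

    # Tasks with no unmet parents are ready to be released.
    ready = deque(sorted(t for t, n in pending.items() if n == 0))
    released = []
    while ready:
        task = ready.popleft()
        released.append(task)  # stand-in for handing the task to a scheduler
        for child in sorted(children[task]):
            pending[child] -= 1
            if pending[child] == 0:
                ready.append(child)

    if len(released) != len(dependencies):
        raise ValueError("cycle detected: the workflow is not a DAG")
    return released


# Hypothetical five-task workflow with a fan-out and a fan-in.
workflow = {
    "stage_in": set(),
    "preprocess": {"stage_in"},
    "analyze_a": {"preprocess"},
    "analyze_b": {"preprocess"},
    "stage_out": {"analyze_a", "analyze_b"},
}
order = release_order(workflow)
print(order)
```

A task is released only once all of its parents have been released, which is the property the executable workflow relies on. A real engine would also wait for each job to finish before decrementing its children's counters.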

SSI: Distributed Workflow Management Research and Software in Support of Science

Ewa Deelman, USC Information Sciences Institute
Miron Livny, University of Wisconsin, Madison

Pegasus WMS: http://pegasus.isi.edu

• Workflows are expressed in DAX (Directed Acyclic graph in XML).
• DAXes can be generated using the Java, Perl, or Python APIs.
• Support for higher-level workflow composition tools such as Wings and Triana.
• Integrated with HUBzero.
• Scientists can use application-specific portals such as CGSMD
  • http://portal.nimhgenetics.org
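To make the idea of a workflow expressed as a directed acyclic graph in XML concrete, here is a small sketch that writes jobs and parent/child edges with Python's standard `xml.etree.ElementTree` module. It does not use the Pegasus DAX API, and the element and attribute names (`adag`, `job`, `child`, `parent`, `ref`) are a simplified illustration that should be treated as hypothetical rather than as the official DAX schema.

```python
import xml.etree.ElementTree as ET


def build_workflow_xml(jobs, edges):
    """Serialize a workflow as XML.

    jobs:  {job_id: executable_name}
    edges: [(parent_id, child_id), ...]
    Element names are illustrative only, not the official DAX schema.
    """
    root = ET.Element("adag", {"name": "example"})
    for job_id, executable in jobs.items():
        ET.SubElement(root, "job", {"id": job_id, "name": executable})

    # Group the parents of each child so that every child appears once.
    parents_of = {}
    for parent, child in edges:
        parents_of.setdefault(child, []).append(parent)

    for child, parents in parents_of.items():
        child_el = ET.SubElement(root, "child", {"ref": child})
        for parent in parents:
            ET.SubElement(child_el, "parent", {"ref": parent})

    return ET.tostring(root, encoding="unicode")


# Hypothetical two-step workflow: "findrange" depends on "preprocess".
xml_text = build_workflow_xml(
    jobs={"ID01": "preprocess", "ID02": "findrange"},
    edges=[("ID01", "ID02")],
)
print(xml_text)
```

The abstract description names only jobs and their dependencies. Where each job runs and how its data moves is left to the mapper.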

 

• Nodes in a workflow can be tasks or another workflow (DAX).
• Scales up to the order of millions of tasks.
• Each sub-workflow is mapped when it is ready for execution.
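The point that a sub-workflow is mapped only when it becomes ready can be sketched as lazy expansion. The classes `Task` and `SubWorkflow`, the `plan` callback, and the job names below are assumptions for this example. Dependencies are reduced to list order, and "mapping" is reduced to calling the callback.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class Task:
    name: str


@dataclass
class SubWorkflow:
    name: str
    # Called only when the sub-workflow is reached; stands in for late mapping.
    plan: Callable[[], List["Node"]]


Node = Union[Task, SubWorkflow]


def run(nodes, log):
    """Walk the nodes in order, expanding each sub-workflow on arrival."""
    for node in nodes:
        if isinstance(node, SubWorkflow):
            log.append(f"map {node.name}")
            run(node.plan(), log)  # the inner workflow is planned just in time
        else:
            log.append(f"run {node.name}")


planned = []  # records when the inner plan is actually produced


def plan_full_data():
    planned.append("full_data")
    return [Task("inspiral"), Task("thinca")]


top_level = [Task("datafind"), SubWorkflow("full_data", plan_full_data), Task("plot")]
assert planned == []  # nothing is planned until execution reaches the node

log = []
run(top_level, log)
print(log)
```

Deferring the plan means the inner workflow can be mapped using information that only exists once its predecessors have finished, and the top-level graph stays small.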

• Clustering of small tasks into larger clusters for performance.
• Optimized data transfers and the ability to use different protocols.
• Data reuse when intermediate data products are available
  • workflow-level checkpointing
• Automatic data cleanup
  • reduces data footprint
• Support for workflow- and task-level notifications.
• Workflow progress can be tracked through a database.
• Stores provenance: the data used and produced, and which software was used with what parameters.
• Retries computations in case of failures.
• Monitoring and debugging tools for large-scale workflows.
• Integrates with resource provisioners such as GlideinWMS.
• Support for a shell code generator.
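Two of the features listed above, task clustering and data reuse, can be illustrated with a simplified sketch. The functions `cluster_tasks` and `prune_for_data_reuse` and their input formats are assumptions for this example and are not Pegasus code. In particular, the pruning shown here only drops jobs whose outputs already exist, without cascading to ancestor jobs.

```python
def cluster_tasks(tasks, cluster_size):
    """Group independent tasks into batches of at most `cluster_size`."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return [tasks[i:i + cluster_size] for i in range(0, len(tasks), cluster_size)]


def prune_for_data_reuse(job_outputs, available_files):
    """Keep only jobs that still have to run.

    job_outputs:     {job_name: set of files the job produces}
    available_files: set of files that already exist
    A job is dropped when it has outputs and all of them are available.
    """
    return {
        job: outputs
        for job, outputs in job_outputs.items()
        if not outputs or not outputs <= available_files
    }


# Hypothetical inputs.
batches = cluster_tasks(["t1", "t2", "t3", "t4", "t5"], cluster_size=2)
remaining = prune_for_data_reuse(
    {"extract": {"a.dat"}, "transform": {"b.dat"}, "notify": set()},
    available_files={"a.dat"},
)
print(batches)
print(sorted(remaining))
```

Clustering trades scheduling overhead for coarser units of work, so the batch size is a tuning choice. Pruning lets a restarted workflow skip work whose products are already on disk, which is how data reuse acts as a form of workflow-level checkpointing.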

Download Options
• YUM repository with RPM packages
• APT repository with DEB packages
• Binary packages for various Linux and Mac platforms.

Training Materials
• Quickstart Guide
• Virtual machine-based tutorial.

Prepackaged Application VMs
• RNASeq Pegasus VM available at http://genomics.isi.edu/rnaseq

[Figure: Pegasus WMS architecture. Labels: Users; Scripting Tools; CGSMD Portal; Pegasus GUI; Other Workflow Composition Tools (Xbaya, Wings, Triana); Pegasus WMS (Mapper, Engine, Scheduler); Commercial and Science Clouds; Local Clusters; Distributed Resources (TeraGrid, Open Science Grid); Middleware (GRAM, PBS, LSF, SGE, Condor); Compute; Storage.]

Pegasus Features

Software Availability

Large Scale Hierarchical Workflows

Composing Workflows

Pegasus WMS is funded by the National Science Foundation OCI SDCI and SI2 program grants 0722019 and 1148515.

Applications Using Pegasus

Astronomy and Physics
• Galactic Plane for generating mosaics from the Spitzer Telescope.
• LIGO workflows for detecting gravitational waves.
• Periodogram workflows for detecting extrasolar planets.

Earthquake Sciences
• CyberShake workflows for seismic hazard analysis for the LA Basin.
• Broadband workflows for accurate predictions of ground motions.

Bioinformatics
• Brain span workflows to find where in the brain a gene is expressed.
• Workflows to compute RNA-Seq for generating the Cancer Genome Atlas.
• SIPHT workflows to predict sRNA-encoding genes in bacteria.
• Proteomics workflows for mass-spectrometry-based proteomics.

Others
• http://pegasus.isi.edu/applications

[Figure: LIGO IHOPE hierarchical workflow. Legend: sub-workflow (SubW), compute job. Sub-workflows shown: Datafind, FULL, INJ-1 through INJ-N, CAT1 through CAT N, Plot, Pipedown. Compute jobs shown: Datafind, Template Bank, Inspiral, Trigbank, Thinca, Coire, Inspinj.]