scientific workflow-overview-2012-01-rev-2
DESCRIPTION
Summary of the work I did as part of the SciDAC SDM Center, which wrapped up in 2012.
TRANSCRIPT
Scientific Workflows:
Experience, Advances, and Where We Go From Here
Terence Critchlow
January 2012
PNNL-SA-85033
This talk will provide answers to 4 questions:
Why did I get involved with scientific workflows?
How do scientific workflows help scientists?
What problems did I find when I first started working with scientific workflows?
Can scientific workflows be effectively integrated into the broader scientific process?
I became involved with scientific workflows through the SciDAC SDM Center
The Scientific Discovery through Advanced Computing (SciDAC) program was funded by DOE starting in 2001 with the goal of advancing scientific computing by having CS and domain science teams work together to address science questions using new HPC platforms
Application initiatives were funded in areas such as combustion, fusion, astrophysics, and groundwater
CS and math centers were funded in areas critical to the development of new, scalable capabilities including solvers, AMR, visualization, performance, and data management
Focus was on science, not CS research
The Scientific Data Management (SDM) Center was the focal point for DOE data management activities
Large, multi-institutional collaboration
Led by Arie Shoshani (LBL)
5 Labs and 5 Universities
Funded for 10 years
Project concluded in 2011
The center had 3 research thrusts:
Storage and Efficient Access (Rob Ross – ANL)
Data Mining and Analysis (Nagiza Samatova – NCSU)
Software Process Automation (Terence Critchlow – PNNL)
The goal of the SPA team was to develop and deploy technology that would allow scientists to spend more time on science by reducing the data management overhead
Workflows had filled that niche in business, but in 2001 there was little usage in science applications
As lead for the SPA team, I had both management and research responsibilities
Team of 10-15 spread across NCSU, Univ. of Utah, UC Davis, SDSC, ORNL, and PNNL
Identify relevant technology
Work with science teams to design and deploy solutions
Identify areas requiring additional research
Perform research to improve the existing capabilities for our target customers
Workflow technology was selected because time consuming, repetitive tasks dominate day-to-day computational science activity
By automating mundane tasks, we allow scientists to focus on science, not data management
Needed a general-purpose workflow engine that we could apply to an HPC-centric environment
Act as the orchestrator, coordinating the workflow execution
Allow processing of larger data sets
Support scientific reproducibility
Reduce waste of resources by allowing timely corrective action to be taken
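The orchestration role can be illustrated with a minimal sketch: submit a batch job, poll its status, and take corrective or follow-on action. The helper functions here are simulated stubs, not the actual Kepler implementation.

```python
import time

# Hypothetical stubs standing in for real scheduler interaction
# (a deployed workflow would wrap qsub/sbatch and qstat/squeue).
_STATES = iter(["queued", "running", "done"])

def submit_job(script):
    return 42  # pretend job id returned by the scheduler

def job_status(job_id):
    return next(_STATES)  # each poll advances the simulated job state

def orchestrate(script, poll_interval=0):
    """Submit a batch job, poll until it finishes, then hand off to follow-on steps."""
    job_id = submit_job(script)
    while True:
        status = job_status(job_id)
        if status == "failed":
            raise RuntimeError(f"job {job_id} failed; corrective action needed")
        if status == "done":
            return job_id  # file transfer / analysis steps would start here
        time.sleep(poll_interval)  # delay between status checks

job = orchestrate("run.sh")
```

The failure branch is what enables "timely corrective action": the workflow notices a failed job at the next poll instead of after the allocation expires.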
The SDM Center was one of the founding organizations of the Kepler Consortium
In 2001 there were no widely used scientific workflow engines
Kepler is an open source workflow environment
Based on the Ptolemy II system developed at UC Berkeley
Started with several projects coming together based on a need for a flexible workflow environment
Kepler-project.org
Kepler has become one of the best-known and most widely used scientific workflow engines
This talk focuses on work that I was directly involved in
The SDM Center team performed a lot of work that I managed but don't focus on here, including:
Provenance tracking
Dashboard
Templates
Patterns
Deployed workflows
ITER
CPES
Combustion
https://sdm.lbl.gov/sdmcenter/
My research focused on raising the level of abstraction within scientific workflows
Our first deployed workflow was managing a bioinformatics analysis pipeline (2002)
In collaboration with Matt Coleman (LLNL)
The TSI workflow was the first of our “standard” simulation workflows (2005)
[Workflow diagram: submit batch request at NERSC; check job status (delay while queued or running); identify new complete files; transfer files to HPSS and to Stony Brook, verifying each transfer completed correctly before deleting the source file; extract and get variables, remap coordinates, create chemistry and neutrino variables, derive other variables, and write diagnostic files; generate plots (Tools 1–4), thumbnails, and movies; update web page]
In collaboration with Doug Swesty (Stony Brook)
The workflow can be broken into several general steps:
Job Submission
Job Monitoring
Moving files
Data Analysis
[The workflow diagram is repeated on each of these slides, highlighting the region corresponding to that step]
This translates into a complicated Kepler workflow
Extensive use of nested workflows to compartmentalize steps
160 instances of 18 distinct actors
Over a dozen parameters to control workflow execution
We ended up building several similar simulation science workflows
Fusion science
Combustion
Subsurface science
These all have the same general steps
But there are significant differences in the details
Unfortunately, workflows are not typically portable across machines
User authentication mechanisms depend on machine-specific policies
Job launch and monitoring features depend on the scheduler
File transfer mechanisms depend on available infrastructure
We developed generic actors as the first step in raising the level of abstraction for workflow design
Generic actors encapsulate general functionality into actors that work across platforms and workflows
Improve workflow portability
Simplify creation of new workflows
Form the basis for sharing subworkflows
Reduce the number of actor choices
We identified several capabilities required across simulation workflows
User authentication
Job submission: submit job scheduling request to the batch scheduler
Job monitoring: track status of a job from submitted, to running, to completed
File transfer: move files, potentially between machines at different sites
Developed and deployed actors capable of performing the desired functionality using available infrastructure
Generalized to manage multiple implementations
Parameters and contextual information determine which options to utilize
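The generic-actor idea can be sketched as one logical operation backed by multiple implementations, with parameters selecting among them at run time. The backend names and registry here are illustrative, not the actual SDM Center actor code.

```python
# Registry mapping backend names to transfer implementations.
TRANSFER_BACKENDS = {}

def register(name):
    """Decorator that records an implementation under a backend name."""
    def wrap(fn):
        TRANSFER_BACKENDS[name] = fn
        return fn
    return wrap

@register("cp")
def local_copy(src, dst):
    return f"cp {src} {dst}"            # same-machine copy

@register("scp")
def secure_copy(src, dst):
    return f"scp {src} {dst}"           # cross-machine copy over SSH

@register("gridftp")
def grid_transfer(src, dst):
    return f"globus-url-copy {src} {dst}"  # wide-area bulk transfer

def transfer_file(src, dst, context):
    """One generic 'file transfer' actor; context picks the mechanism."""
    backend = context.get("transfer_method", "cp")
    return TRANSFER_BACKENDS[backend](src, dst)

cmd = transfer_file("out.h5", "archive:/data/out.h5", {"transfer_method": "scp"})
```

The same workflow graph then runs unchanged on machines with different transfer infrastructure; only the context parameter changes.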
Use of generic actors improved workflow effectiveness
Same workflow could be used on all of the DOE leadership-class machines
Significantly less maintenance required
Fewer workflows needed per science team
Each workflow is simpler
Still requires parameters to manage details of execution
Workflow context can be used to reduce the number of explicit parameters
Workflows run in a context that provides certain preferences
Systems
User accounts
Configuration files
Information requested / computed / bound at run time instead of design time
Initial results are promising, but more work is required to determine how effective run-time binding is for workflows
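A minimal sketch of run-time binding, assuming a layered context where user configuration overrides site defaults; the parameter names and the use of `None` to mark design-time gaps are assumptions for illustration.

```python
from collections import ChainMap

# Hypothetical site-level defaults, normally read from configuration files.
SITE_DEFAULTS = {"scratch_dir": "/scratch", "scheduler": "pbs"}

def bind_parameters(workflow_params, user_config):
    """Resolve parameters left open at design time from the run-time context.

    A value of None marks a parameter the designer deferred; it is looked up
    in the user's configuration first, then the site defaults.
    """
    context = ChainMap(user_config, SITE_DEFAULTS)  # user settings win
    resolved = {}
    for name, value in workflow_params.items():
        resolved[name] = context[name] if value is None else value
    return resolved

params = bind_parameters({"scheduler": None, "nodes": 4}, {"scheduler": "slurm"})
```

The designer specifies only what is truly workflow-specific (here, `nodes`); everything environmental is bound when and where the workflow actually runs.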
Scientific workflows still have major adoption challenges to overcome
The correlation between the scientific process and the executable workflow is loose at best
Executable workflows are extremely complex and usually require a dedicated workflow designer to create
The translation from idea to napkin drawing to executable workflow is challenging and lossy
The scientific process is collaborative, fluid, and time sensitive
Important decisions are made in meetings and conversations
Records are distributed and not easily associated with specific tasks
Decisions can be revisited and changed
Science is inherently iterative
Executable workflows document the results of these decisions
But they lack the broader context
Electronic lab notebooks provide some contextual information
But they lack details and external information / links
Provenance provides some associations
Need a way to allow scientists to collect and share information about their experiments
A single location capable of collecting all relevant information about an experiment
Information needs to be related in a meaningful way
Temporal information must be preserved
Working with collaborators at UTEP, we developed a prototype of what this could look like
Our prototype is built on annotating abstract workflows
Design principles:
Workflow construction needs to be a byproduct of information collection
Information should not need to be entered more than once
Annotations should relate to specific steps in the process
The research hierarchy contains the steps in the abstract workflow
Steps are conceptual
At the top level, these outline the major steps in the experiment being performed:
Get data
Create conceptual model
Generate model input
Run simulation
Each step can have sub-steps within it to refine the concept further (nested structure)
Free-form text is associated with each step in the hierarchy
Allows scientists to easily describe a step's purpose
Top level describes entire experiment
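The research hierarchy amounts to a tree of conceptual steps, each carrying free-form annotation text. A minimal sketch, with field names that are assumptions rather than the prototype's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One conceptual step in the research hierarchy."""
    name: str
    description: str = ""                      # free-form text describing the step's purpose
    substeps: list = field(default_factory=list)  # nested structure for refinement

    def add(self, substep):
        self.substeps.append(substep)
        return substep

# Top level describes the entire experiment; children are the major steps.
experiment = Step("Subsurface flow study", "Hypothetical top-level experiment description")
for name in ["Get data", "Create conceptual model", "Generate model input", "Run simulation"]:
    experiment.add(Step(name))

# Any step can be refined further, e.g. sub-steps under "Get data".
experiment.substeps[0].add(Step("Download observation files"))
```

Because annotation is attached per node, describing a step and building the (abstract) workflow structure are the same act, matching the "workflow construction as a byproduct of information collection" principle.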
Decisions are captured under research specs tab
The process view shows the steps as a workflow
Ports are used to identify inputs and outputs
Lines between steps indicate information flow
Steps should, eventually, connect
Steps are connected by linking input and output parameters (ports)
Inputs and outputs are linked
Comment field holds assumptions and constraints from the “other side” of the line
Free-form text makes it easy to input information, but impossible to perform automatic verification
Zooming in on a specific sub-step provides additional information about that step
A new tab provides (sub-)step-specific information
The process view is updated to reflect the sub-steps contained within this step
Note that the inputs and outputs to the workflow come from the higher-level workflow
Eventually, some steps correspond to executable (Kepler) workflows
Prototype expands Kepler infrastructure
Executable workflows are (still) typically created by a dedicated workflow designer
This places the executable workflow in the broader context of the experiment it is supporting
Provenance can be linked into the overall experiment
Annotations are stored in RDF to support export / import
Semantically Interlinked Online Communities (SIOC) format chosen
Supports other tools using these annotations:
Report generation
Search / query
Experiment-level provenance information
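To make the export idea concrete, here is an illustrative sketch that serializes step annotations as RDF in Turtle syntax. The SIOC and Dublin Core namespaces are real, but the choice of `sioc:Item` and `dc:description` for steps is an assumption, not the prototype's actual schema.

```python
def annotations_to_turtle(steps, base="http://example.org/experiment/"):
    """Serialize {step_id: annotation_text} as Turtle triples.

    Each step becomes a resource under a hypothetical base URI, typed as a
    sioc:Item and carrying its free-form annotation as dc:description.
    """
    lines = [
        "@prefix sioc: <http://rdfs.org/sioc/ns#> .",
        "@prefix dc:   <http://purl.org/dc/elements/1.1/> .",
        "",
    ]
    for step_id, text in steps.items():
        lines.append(f"<{base}{step_id}> a sioc:Item ;")
        lines.append(f'    dc:description "{text}" .')
    return "\n".join(lines)

ttl = annotations_to_turtle({"get-data": "Download observation files"})
```

Exporting to a standard RDF vocabulary is what lets independent tools (report generators, query engines, provenance browsers) consume the annotations without knowing the prototype's internals.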
This prototype represents a starting point for answering many interesting questions
How do you effectively link other sources of information into steps in an abstract workflow?
How do you select only the relevant information?
How do you manage provenance and attribution in a distributed environment?
What is the best way to organize this information for people filling a variety of roles?
PIs need a different view than workflow designers or bench scientists
How do you effectively share (subsets of) this information?
How do you implement access controls effectively?
Are workflows the right abstraction for representing the scientific process?
Representing evolution over time is challenging in workflows
Does everything have to correspond to a step?
Is there a way to generate parts of an executable workflow given an abstract definition?
Can we match steps to specific actors?
Could you develop a generic set of wizards or templates?
Conclusions
The SDM Center has been at the forefront of scientific workflow R&D
Workflows have been successfully deployed across a wide variety of scientific domains
Significant advances have been made in making workflow engines more reliable and useful
There remains significant work required to:
Fit workflows within the context of the overall scientific process
Allow scientists to design and implement their own workflows
This work involved many, many people
My team:
George Chin
Chandrika Sivaramakrishnan
Xiaowen Xin (LLNL)
Anand Kulkarni
Anne Ngu (TX State)
Paulo Pinheiro da Silva (UTEP)
Aida Gandara (UTEP)
Other SPA team members:
Ilkay Altintas
Bertram Ludaescher
Mladen Vouk
Claudio Silva
Scott Klasky
Norbert Podhorszki
Dan Crawl
Ayla Khan
Arie Shoshani
Plus other students and researchers who were involved for shorter times