Workflow automation for processing plasma fusion simulation data

Norbert Podhorszki, Bertram Ludäscher
University of California, Davis

Scott A. Klasky
Scientific Computing Group, Oak Ridge National Laboratory

GPSC Center for Plasma Edge Simulation
6/25/07
Works’07 Monterey, CA
Center for Plasma Edge Simulation
• Focus on the edge of the plasma in the tokamak
• Multi-scale, multi-physics simulation
Figures: edge turbulence in NSTX (at 100,000 frames/s); diverted magnetic field
Images plasma physicists adore
Figures: electric potential; parallel flow and particle positions
Monitoring the simulation means…
Multi-physics → many codes
XGC simulation output
• Desired size of simulation (to be run on the petascale machine)
– 100K time steps
– 100 billion particles
– 10 attributes (double precision) per particle = 8 TB data per time step
– Save (and process) 1K-10K time steps
– About a 5-day run on the petascale machine
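The 8 TB figure follows directly from the numbers on the slide. The short check below reproduces it and, taking the 1K-10K saved time steps at face value, estimates the total volume. It assumes 8-byte doubles and decimal units (1 TB = 10^12 bytes); the variable names are illustrative only.

```python
# Back-of-the-envelope check of the XGC output sizes quoted above.
PARTICLES = 100 * 10**9        # 100 billion particles
ATTRIBUTES = 10                # attributes per particle
BYTES_PER_DOUBLE = 8           # double precision
TB = 10**12                    # decimal terabyte (assumption)
PB = 10**15                    # decimal petabyte (assumption)

bytes_per_step = PARTICLES * ATTRIBUTES * BYTES_PER_DOUBLE
tb_per_step = bytes_per_step / TB

# Total volume if 1K or 10K time steps are saved.
saved_steps = (1_000, 10_000)
total_pb = [n * bytes_per_step / PB for n in saved_steps]

print(f"{tb_per_step:.0f} TB per time step")
for n, pb in zip(saved_steps, total_pb):
    print(f"{n} saved steps -> {pb:.0f} PB")
```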
XGC simulation output
• Proprietary binary files (BP)
– 3D variables, separate file for each timestep
• NetCDF files
– 2D variables, all timesteps in one file
• M3D coupling data
– to compute new equilibrium with external code (loose coupling)
– to check linear stability of XGC externally
What to do with that output?
• Proprietary binary files (BP)
– Transfer to end-to-end system using bbcp
– Convert to HDF5 format (with a C program)
– Generate images using AVS/Express (running as service)
– Archive HDF5 files in large chunks to HPSS
• NetCDF files
– Transfer to end-to-end system (updating as new timesteps are written into the files)
– Generate images using grace library
– Archive NetCDF files at the end of simulation
• M3D coupling data
– Transfer to end-to-end system
– Execute M3D: compute new equilibrium
– Transfer back the new equilibrium to XGC
– Execute ELITE: compute growth rate, test linear stability
– Execute M3D-MPP: to study unstable states (ELM crash)
Schematic view of components
Figure labels: ORNL; Cray XT4; Opteron cluster; Command & control site; 40 GB/s; HPSS; Seaborg @ NERSC; Pull data
• Kepler workflow
– to accomplish all these tasks
– 1239 (Java) actors
– 4 levels of hierarchy
– many instances of ProcessFile and FileWatcher composite actors (“workflow templates”)
Figure: sub-workflow sizes annotated as 43 actors (3 levels); 196 actors (4 levels); 30 actors; 206 actors (4 levels); 137 actors; 33 actors; 150 actors; 123 actors; 66 actors; 12 actors; 243 actors (4 levels)
Workflow – Java – remote script – remote program
Figure labels: ls -l; bp2h5; bbcp
Kepler actors for CPES
• Permanent SSH connection to perform tasks on a remote machine
• Generalized actors (sub-workflows) for specified tasks:
– Watch a remote directory for simulation timesteps
– Execute an external command on a remote machine
– Tar and archive data in large chunks to HPSS
– Transfer a remote image file and display on screen
– Control a running SCIRun server remotely
– Job submission and control to various resource managers
• Above actors do logging/checkpointing
– the final workflow can be stopped / restarted
What Kepler features are used in CPES?
• Different computational models
– PN for parallelism and pipeline processing
– DDF for sequential workflow with if-then-else and while loop structures
– SDF for efficient (static schedule) sequential execution of simple sub-workflows
• Stateful actors in stream processing of files
• SSH for remote operations
– keeps the connection alive
• Command-line execution of the workflow
– from a script (at deployment) (no GUI)
– reading workflow parameters from a file
FileWatcher: a data-dependent loop
• SSH Directory Listing Java actor gives new files in a directory (once)
• This is a do-while loop where the termination condition is whether the list contains a specific element (which indicates end of simulation)
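The FileWatcher loop can be sketched outside Kepler as a plain do-while loop. This is a minimal illustration, not the actual composite actor: `watch`, `list_dir`, the `stop` file name and the `step.*.bp` names are all hypothetical, and the remote SSH listing is replaced by a local callable so the example runs on its own.

```python
import time
from typing import Callable, Iterable, Iterator


def watch(list_dir: Callable[[], Iterable[str]],
          stop_name: str = "stop",
          poll_seconds: float = 0.0) -> Iterator[str]:
    """Yield each new file name once; finish after the stop file appears."""
    seen = set()
    while True:                               # do-while: body runs at least once
        listing = list(list_dir())            # one directory listing per pass
        new_files = sorted(f for f in listing if f not in seen)
        seen.update(new_files)
        for name in new_files:
            if name != stop_name:
                yield name                    # each file is reported only once
        if stop_name in listing:              # data-dependent termination test
            break
        time.sleep(poll_seconds)


# Simulated listings standing in for successive remote directory listings.
snapshots = iter([
    ["step.0001.bp"],
    ["step.0001.bp", "step.0002.bp"],
    ["step.0001.bp", "step.0002.bp", "step.0003.bp", "stop"],
])
found = list(watch(lambda: next(snapshots)))
print(found)
```

Files that arrive in the same listing as the stop element are still emitted before the loop ends, so the last timestep is not lost.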
Modeling problem: stopping and finishing
• You finally have working pipelines. Fine.
– How do you stop them?
– How do you let intermediate actors know that they will not receive more tokens?
– How do you perform something “after” the processing?
• We use a special token flowing through the pipelines
– Always the last item in the pipeline.
– Actors are implemented (extra work) to skip this token.
• Stop file created by the simulation
– to stop the “task generator” actors in the workflow (FileWatchers)
– to notify (stateful) actors in the pipeline that they should finalize (Archiver, Stop_AVS/Express)
– to synchronize on two independent pipelines (NetCDF+HDF5 → archive images at the end)
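The stop-token idea can be illustrated with two threads connected by queues. This is a sketch under stated assumptions, not Kepler code: `STOP`, `convert` and `archive` are hypothetical names, a stateless stage forwards the token without processing it, and a stateful stage uses it as the signal to finalize.

```python
import queue
import threading

STOP = object()   # special token, always the last item in a pipeline


def convert(inbox, outbox):
    """Stateless actor: handles data tokens, passes the stop token along."""
    while True:
        token = inbox.get()
        if token is STOP:
            outbox.put(STOP)          # let downstream actors know
            return
        outbox.put(f"{token}.h5")


def archive(inbox, archived, batch_size=2):
    """Stateful actor: batches files and flushes the remainder on stop."""
    batch = []
    while True:
        token = inbox.get()
        if token is STOP:
            if batch:
                archived.append(tuple(batch))   # finalize the partial batch
            return
        batch.append(token)
        if len(batch) == batch_size:
            archived.append(tuple(batch))
            batch = []


q1, q2, archived = queue.Queue(), queue.Queue(), []
workers = [threading.Thread(target=convert, args=(q1, q2)),
           threading.Thread(target=archive, args=(q2, archived))]
for w in workers:
    w.start()
for step in ("t1", "t2", "t3"):
    q1.put(step)
q1.put(STOP)                          # what a FileWatcher would emit last
for w in workers:
    w.join()
print(archived)
```

Synchronizing two independent pipelines would follow the same pattern: a joining stage waits until it has received the stop token on both of its inputs before doing the final work.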
Role of stop file
Figure labels: Stop; Finalize; Wait for stop on both pipelines; Extra work after the end
Problem: how to restart this workflow?
• Kepler has no system-level checkpoint/restart mechanism (yet?)
– seems to be difficult for large Java applications
– not to mention the status of external (and remote) things.
• Pipeline execution
– each actor is processing a different step simultaneously
Our solution: user-level logging/restart
• We record
– the successful operations at each (“heavy”) actor
• Those actors
– are implemented to check, before doing something, whether it has already been done
• When the workflow is restarted
– it starts from the very beginning, but the actors simply skip operations (files, tokens) that have already been done.
• We do not worry about repeating small (control-related) actions within the workflow
– external operations are what matter here
ProcessFile core: check-perform-record
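The check-perform-record pattern can be sketched for a single operation as follows. This is an illustration only: `process_file`, the one-line-per-task log format and the file names are assumptions, not the ProcessFile actor's real interface.

```python
import os
import tempfile


def process_file(task_id, action, log_path):
    """Run `action` for `task_id` unless the log says it already succeeded."""
    # check: read the set of operations recorded as successful
    done = set()
    if os.path.exists(log_path):
        with open(log_path) as log:
            done = {line.strip() for line in log}
    if task_id in done:
        return "skipped"
    # perform: an exception here leaves the log untouched
    action()
    # record: only successful operations reach the log
    with open(log_path, "a") as log:
        log.write(task_id + "\n")
    return "done"


calls = []
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "convert.log")
    first = process_file("step.0001", lambda: calls.append("convert"), log_path)
    # Simulated restart: the same task arrives again, the log survived on disk.
    second = process_file("step.0001", lambda: calls.append("convert"), log_path)
print(first, second, calls)
```

Because the record step comes after the perform step, an operation that is interrupted midway is simply retried after a restart, which is acceptable when the external operation is safe to repeat.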
Problem: failed operations
• What if an operation fails, e.g. one timestep cannot be transferred? Options:
a) trust that downstream actors “fail” silently on missing data
b) notify everybody downstream in the pipeline (to skip)
– mark token as “failed”
c) avoid giving tasks to downstream actors for the erroneous step
• Retrying later and processing that step is important but …
• … keeping up with the simulation on the next steps is even more important
Our approach for failed operations
• ProcessFile and thus the workflow handles failures by discarding tokens related to failed operations from the stream
• Advantage
– actors need not care about failures: an incoming token is a task to be done
• Disadvantage
– rate of token production varies: this can upset Kepler’s model of computation
Discarding tokens on failure
Figure (pipeline trace for timesteps 1, 2, 3): transfer 1 → convert 1 → arch 1; failed 2 (token discarded); transfer 3 → convert 3 → arch 3
After a restart…
Figure (pipeline trace for timesteps 1, 2, 3): timestep 1 is skipped by all three actors; transfer 2 → convert 2 → arch 2; timestep 3 is skipped by all three actors
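The two traces, “Discarding tokens on failure” and “After a restart…”, can be reproduced with a small simulation. It is a sketch only: the `run` function, the in-memory per-actor logs and the sequential (non-pipelined) execution order are simplifying assumptions, and the stage names are taken from the figure labels.

```python
def run(steps, stage_names, logs, fail=frozenset()):
    """One workflow run; returns the trace of what each actor did.

    logs: dict mapping stage -> set of steps recorded as successful.
    fail: set of (stage, step) pairs that fail in this run.
    """
    trace = []
    for step in steps:
        for stage in stage_names:
            if step in logs[stage]:
                trace.append(f"skip {step}")      # already done before restart
                continue
            if (stage, step) in fail:
                trace.append(f"failed {step}")
                break                             # discard token: nothing goes downstream
            logs[stage].add(step)                 # record the successful operation
            trace.append(f"{stage} {step}")
    return trace


stages = ["transfer", "convert", "arch"]
logs = {s: set() for s in stages}

first = run([1, 2, 3], stages, logs, fail={("transfer", 2)})
second = run([1, 2, 3], stages, logs)             # restart from the very beginning
print(first)
print(second)
```

In the first run, timestep 2 never reaches the convert and archive stages; in the second run, only timestep 2 causes real work, while timesteps 1 and 3 are skipped at every stage.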
Future Plans
• Provenance management
– one main reason to use a scientific workflow system, e.g. in bioinformatics workflows
– needed for debugging runs, interpreting results, repeating experiments, generating documentation, comparing runs etc.
– CPES workflow is selected as one use case for the ongoing Kepler provenance work
• New actors in CPES for controlling asynchronous I/O from the petascale computer towards the processing cluster
Thank You
Questions?