
Grid Workflow
Midwest Grid Workshop, Module 6

Goals
Enhance scientific productivity through:
  - Discovery and application of datasets and programs at petabyte scale
  - Enabling use of a worldwide data grid as a scientific workstation

Goals of using grids through scripting
  - Provide an easy on-ramp to the grid
  - Utilize massive resources with simple scripts
  - Leverage multiple grids like a workstation
  - Empower script-writers to empower end users
  - Track and leverage provenance in the science process

Classes of Workflow Systems
  - Earlier-generation business workflow systems
      - Document management, forms processing, etc.
  - Scientific laboratory management systems
      - LIMS, wet lab workflow
  - Application-oriented workflow
      - Kepler, DAGMan, P-Star, VisTrails, Karajan
  - VDS: first-generation Virtual Data System
      - Pegasus, Virtual Data Language
  - Service-oriented workflow systems
      - BPEL, BPDL, Taverna/SCUFL, Triana
  - Pegasus/Wings
      - Pegasus with OWL/RDF workflow specification
  - Swift workflow system
      - Karajan with typed and mapped VDL (SwiftScript)

VDS: The Virtual Data System
  - Introduced the Virtual Data Language (VDL)
      - A location-independent parallel language
  - Several planners
      - Pegasus: main production planner
      - Euryale: experimental just-in-time planner
      - GADU/GNARE: user application planner (D. Sulakhe, Argonne)
  - Provenance
      - Kickstart: application launcher and tracker
      - VDC: virtual data catalog

Virtual Data and Workflows
  - The challenge is managing and organizing the vast computing and storage capabilities provided by Grids
  - Workflow expresses computations in a form that can be readily mapped to Grids
  - Virtual data keeps accurate track of data derivation methods and provenance
  - Grid tools virtualize location and caching of data, and recovery from failures
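To make "keeps accurate track of data derivation methods and provenance" concrete, here is a minimal Python sketch of a derivation catalog. It is not the VDS virtual data catalog: the `Catalog` class and its `record` and `lineage` methods are hypothetical names invented for illustration, and the grep/sort steps on file1, file2 and file3 are simply a convenient two-step example.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Hypothetical in-memory provenance catalog (illustration only)."""

    # Maps a derived dataset name to (transformation, inputs, parameters).
    records: dict = field(default_factory=dict)

    def record(self, output, transformation, inputs, **params):
        """Remember how `output` was derived."""
        self.records[output] = (transformation, tuple(inputs), params)

    def lineage(self, dataset):
        """Return the derivation steps behind `dataset`, earliest first."""
        if dataset not in self.records:
            return []  # raw data: nothing to trace
        transformation, inputs, params = self.records[dataset]
        steps = []
        for name in inputs:
            for step in self.lineage(name):
                if step not in steps:  # skip steps shared by several inputs
                    steps.append(step)
        steps.append((transformation, dataset, params))
        return steps


catalog = Catalog()
catalog.record("file2", "grep", ["file1"])
catalog.record("file3", "sort", ["file2"])

for transformation, dataset, params in catalog.lineage("file3"):
    print(f"{dataset} <- {transformation} {params}")
```

Because every derived dataset carries its recipe, asking for the lineage of file3 walks back through sort and grep to the raw input, which is the information needed to audit a result or re-derive it.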

Virtual Data Origins: The Grid Physics Network
Enhance scientific productivity through:
  - Discovery, application and management of data and processes at all scales
  - Using a worldwide data grid as a scientific workstation
The key to this approach is Virtual Data: creating and managing datasets through workflow recipes and provenance recording.

Virtual Data workflow abstracts Grid details

Example Application: High Energy Physics Data Analysis
[Figure: tree of derived datasets labeled by parameter sets. The root is mass = 200; branches refine it by decay (WW, ZZ, bb), stability (1, 3), event = 8, plot = 1, and LowPt = 20 / HighPt = 10000.]
Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida
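One way to read the parameter labels in this example is as metadata for discovering derived datasets. The Python sketch below assumes exactly that: each label set becomes a plain dictionary, and a hypothetical `discover` function (not part of any Grid toolkit) returns the records that match a query. Only a subset of the label sets from the slide is included.

```python
# Parameter label sets taken from the slide; treating them as
# dataset metadata records is an assumption made for illustration.
datasets = [
    {"mass": 200, "decay": "WW", "stability": 1, "LowPt": 20, "HighPt": 10000},
    {"mass": 200, "decay": "WW", "stability": 1, "event": 8},
    {"mass": 200, "decay": "WW", "stability": 1, "plot": 1},
    {"mass": 200, "decay": "WW", "stability": 3},
    {"mass": 200, "decay": "ZZ"},
    {"mass": 200, "decay": "bb"},
]


def discover(datasets, **query):
    """Return every record whose metadata matches all query parameters."""
    return [
        d for d in datasets
        if all(d.get(key) == value for key, value in query.items())
    ]


hits = discover(datasets, decay="WW", stability=1)
for hit in hits:
    print(hit)
```

Narrowing the query (for example adding `plot=1`) selects a more specific derived product, mirroring how the tree in the figure refines mass = 200 step by step.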

The core essence: Basic data analysis programs

Raw Data: CMS.ECal.2006.0405
    107: 24B707CC AF 01 37 01 00 01 00 01 24655A35 235011.603 061206 V 03 0 +0269
    108: 24B707CD 01 23 01 3F 00 01 00 01 24655A35 235011.603 061206 V 03 0 +0269
    109: 06194161 80 01 38 01 00 01 00 01 03E9DCA9 235142.597 061206 V 03 0 -0723
    110: 06194163 00 01 01 28 32 01 00 01

Parameters: bins = 60, xmin = 40.5, ymin = .003

Data Analysis Program
    Inputs: bins, xmin, ymin, infile

Expressing Workflow in VDL

    TR grep (in a1, out a2) {
      argument stdin = ${a1};
      argument stdout = ${a2};
    }

    TR sort (in a1, out a2) {
      argument stdin = ${a1};
      argument stdout = ${a2};
    }

    DV grep (a1=@{in:file1}, a2=@{out:file2});
    DV sort (a1=@{in:file2}, a2=@{out:file3});

Dataflow: file1 -> grep -> file2 -> sort -> file3

  - TR: Define a function wrapper for an application
  - TR arguments: Define formal arguments for the application
  - DV: Define a call to invoke the application
  - DV arguments: Provide actual argument values for the invocation
  - Connect applications via output-to-input dependencies
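The last point, connecting applications through output-to-input dependencies, can be illustrated outside VDL. The Python sketch below is not how any VDS planner is implemented; it only assumes that each DV statement can be reduced to a transformation name plus the logical files it reads and writes, and uses the standard-library `graphlib` module (Python 3.9+) to derive a valid execution order. The `execution_order` helper is a hypothetical name.

```python
from graphlib import TopologicalSorter

# Each entry mirrors one DV statement from the slide. They are listed
# out of order on purpose: the order is recovered from the files alone.
derivations = [
    {"tr": "sort", "inputs": ["file2"], "outputs": ["file3"]},
    {"tr": "grep", "inputs": ["file1"], "outputs": ["file2"]},
]


def execution_order(derivations):
    """Order derivations so every file is produced before it is consumed."""
    # Map each logical file to the derivation that produces it.
    producer = {}
    for dv in derivations:
        for name in dv["outputs"]:
            producer[name] = dv["tr"]
    # A derivation depends on the producers of its inputs. Inputs with
    # no producer (file1 here) are assumed to exist already.
    graph = {
        dv["tr"]: {producer[name] for name in dv["inputs"] if name in producer}
        for dv in derivations
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(derivations)
print(order)  # ['grep', 'sort']
```

No explicit "run grep before sort" statement is needed: the shared logical file name file2 is what links the two steps, which is the same idea the VDL example relies on.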

ACTIVAL Workflow
Main Workflow Program

    // Declare datasets
    fullBrainData brainFile;
    fullBrainSpecs specFile;
    brainDatasets randBrain;
    brainClusters randCluster;
    brainDatasets dsetReturn;
    brainClusterTable clusterThresholdsTable;
    brainDataset brainResult;
    brainDataset origBrain;

    // Main program executes the entire workflow
    (randCluster, dsetReturn) = brain_cluster(brainFile, specFile);
    clusterThresholdsTable = bricCentralize(randCluster.c);
    brainResult = makebrain(origBrain, clusterThresholdsTable, brainFile, specFile);

Performance example: fMRI workflow
  - 4-stage workflow (subset of AIRSN)
  - 476 jobs,
