Applications and Requirements for Scientific Workflow
Introduction
May 1 2006 NSF
Geoffrey Fox, Indiana University
Major Themes
• What is different now and why
  – Scientific workflow is in realm of possibility now
• What are the application requirements rather than CS requirements
  – Prioritize, identify new issues, what old requirements have been satisfied
• Ground these in scenarios or in application descriptions that lead to these requirements
• Phrase as transformative research – does term “scientific workflow” conjure up the innovative future or perhaps a bureaucratic past?
Applications
• Extreme weather (LEAD)
• Bioinformatics (myGrid, BIRN); high throughput screening
• Virtual Observatory in Astronomy
• Particle Physics
• Generic Data Analysis
• Earthquake Science
• Ocean Data Assimilation
• Note most of following topics come from Computer Science and one needs to identify the driving higher level application requirement
  – Preserve mapping of application requirements to computer science topic
Topics – Application/Component Specific
• [Evangelinos] Support Ocean Data assimilation
  – Matlab, Fortran, Parallel simulations
  – Dataflow standards for “large I/O”
  – Metascheduling
  – Customization of execution parameters (provenance)
• [AGray] Need workflow components supporting powerful data analysis across fields
• [Gil] Support workflows needed in “open access” data accompanying scientific publication
• [Hendler] Support information management as well as computation
Topics – Overarching
• [Ellisman] What do we mean by workflow; the word means different things to different people; should we use different terms; need a better word (distributed scientific method)
• [JMyers, Barga] Categorize workflows and study use; evaluate and compare; identify common patterns
• [Discussion] What has changed? – data deluge is one critical change; is data a curse or a blessing
• [Ellisman] What is the “scientific method” (versus “Google method”) and its implication for workflow
• [Barga] What’s wrong with commercial solutions
• [Laszewski] Support common Grid patterns
• [Fox] Build benchmark set analogous to NAS in parallel computing
• [Fahringer] Include all costs (e.g. Web Service security, SOAP) in performance models
• [Deelman1] Support restructuring and planning for performance optimization
• [JMyers] Manage workflows like content
• [Ackerman, KMyers, Scacchi, Deelman2] Support full people (scientific process) workflow including social and organizational issues
Topics – Desired Qualities
• [Goble] Support users who are often under-resourced
• [Discussion] Multiple classes of users: “power” “common case” “education”; do users know what they want or not?
  – Note industry workflow captures WELL understood business processes
• [Several] Workflows will be re-used and shared
• [Ellisman] Enable reproducible science
• [Livny] Support high quality software
• [Laszewski] Balance between features, performance, and completeness.
• [Goble] Easily assemble workflows, find services and adapt previous workflows
• [Goble] The workflow has to reflect the science not the services invocation interface.
• [Goble] Automated workflow design is unlikely, unpopular, and undesirable as scientists know which services they want
• [Goble] Support all services that users want – whether they have a WSDL interface or not
Topics – Desired Features
• [Several] Workflows should be scalable, fault-tolerant, restartable, adaptive and repeatable; support multi-administration heterogeneous resources
• [Discussion] What do application scientists mean by above qualities?
  – [Livny] Why is size important? Complexity counts
• [Altintas] Support end (instruments) to end (interactive data analysis) science
• [Szalay] Interactive analysis as well as batch
• [Gannon] Workflows triggered by events without user interaction
• [Knoblock] Techniques for rapidly constructing models of new sources or services so that they can be rapidly and correctly integrated.
• [Knoblock] Support for dynamically integrating data across multiple data sources (i.e., databases or web services) that were not designed to work together.
• [Curbera] Support reasoning about correctness and composability
• [Livny] What is meaning of correctness and reproducibility (e.g. random numbers)
• [Gil] Support collections of workflows addressing common scientific questions
• [Discussion] Need to support workflows of heterogeneous workflows of different types; note industry worries about linking intra-enterprise systems across enterprises
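The [Livny] question about reproducibility and random numbers can be illustrated with a minimal sketch. It assumes a hypothetical Monte Carlo step (`monte_carlo_pi` is an illustrative name, not something from the workshop) and shows one common convention: treat the random seed as an explicit input of the step, so that a repeated run with the same inputs gives the same result.

```python
import random


def monte_carlo_pi(samples: int, seed: int) -> float:
    """Estimate pi by sampling points in the unit square (hypothetical step)."""
    # A private generator keeps the step independent of global random state.
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / samples


# Same seed, same inputs: the step is repeatable.
first = monte_carlo_pi(10_000, seed=42)
second = monte_carlo_pi(10_000, seed=42)

# A different seed generally gives a different, equally valid estimate.
other = monte_carlo_pi(10_000, seed=7)
```

Under this convention "reproducible" can mean either bit-identical output for a recorded seed, or statistically consistent output across seeds; which one a workflow system should promise is exactly the open question on the slide.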
Topics – Detailed Technology
• [Laszewski] Extend the workflow language through a set of core libraries such as fault tolerance and checkpointing.
• [Goble] Need a higher level language than BPEL
• [Goble] There will be no one workflow language or workflow system, as there is no one word processor, programming language or operating system.
• [Ellisman, Livny] Role of portals (science gateways) as “common case” user interface versus distributed programming for “power user”
• [Altintas] User interface customizable for different domains
• [Deelman2] Virtual data to capture efficiently past and future actions
• [Curbera] Integrate internet-scale execution (REST) and enterprise service bus (ESB)
• [Discussion] Web 2.0 like Google maps; Industry distinction between interoperability and implementation
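As a rough illustration of the [Laszewski] item on fault tolerance and checkpointing as core libraries, the following is a minimal sketch, not any particular workflow system's API. The function `run_workflow`, the JSON checkpoint file, and the example steps are all assumptions made for illustration.

```python
import json
import os
import tempfile


def run_workflow(steps, checkpoint_path, max_retries=2):
    """Run named steps in order; skip steps already saved in the checkpoint."""
    done = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)
    for name, func in steps:
        if name in done:
            continue  # restart case: reuse the stored result
        for attempt in range(max_retries + 1):
            try:
                done[name] = func(done)
                break
            except Exception:
                if attempt == max_retries:
                    raise  # give up after the last allowed retry
        # Checkpoint after every completed step (a real system would
        # write atomically and store large results elsewhere).
        with open(checkpoint_path, "w") as f:
            json.dump(done, f)
    return done


calls = {"fetch": 0}


def flaky_fetch(results):
    """Hypothetical step that fails once, then succeeds."""
    calls["fetch"] += 1
    if calls["fetch"] == 1:
        raise RuntimeError("transient failure")
    return [1, 2, 3]


def total(results):
    """Hypothetical step that consumes the output of 'fetch'."""
    return sum(results["fetch"])


path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
steps = [("fetch", flaky_fetch), ("total", total)]
results = run_workflow(steps, path)   # retries 'fetch' once
rerun = run_workflow(steps, path)     # restart: nothing is re-executed
```

Keeping retry and checkpoint logic in a library function like this, rather than in the workflow language itself, is one reading of the "core libraries" suggestion.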
Topics – Provenance
• [Freire] Support computational (workflow) steering and provenance generation
• [Goble] Workflows must allow effective management of resultant data and provenance
• [Barga, Moreau] Define provenance of execution in a general way, even though there are multiple paradigms
• [Altintas] Track provenance of workflow design, execution, and intermediate and final results
• [Gannon] Initializations of workflow components are dependent on each other
• [Seth] Design provenance supporting customization of adaptable workflows
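The [Altintas] item on tracking provenance of execution and of intermediate and final results can be sketched as follows. This is a minimal illustration under stated assumptions: the record fields, the helper names `digest` and `run_with_provenance`, and the two example steps are hypothetical and do not follow any provenance standard.

```python
import hashlib
import json
from datetime import datetime, timezone


def digest(value) -> str:
    """Stable SHA-256 digest of a JSON-serializable value."""
    payload = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_with_provenance(name, func, inputs, params, log):
    """Run one step and append a provenance record for it to `log`."""
    output = func(inputs, **params)
    log.append({
        "step": name,
        "params": params,
        "input_digest": digest(inputs),
        "output_digest": digest(output),
        "finished_at": datetime.now(timezone.utc).isoformat(),
    })
    return output


def scale(values, factor):
    """Hypothetical analysis step: multiply every value by a factor."""
    return [v * factor for v in values]


log = []
raw = [1.0, 2.0, 3.0]
scaled = run_with_provenance("scale", scale, raw, {"factor": 2.0}, log)
summed = run_with_provenance("sum", lambda v: sum(v), scaled, {}, log)
```

Because each record stores digests of its inputs and outputs, the intermediate result of `scale` can be linked to the input of `sum` by matching digests, which gives a simple lineage chain from final result back to raw data.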