Scientific workflow management in the VL-e framework
Sub-program 2.5 Department of Computer Science
Universiteit van Amsterdam
Outline
• Background
  – Scientific experiments, workflow and the e-Science framework
• Workflow management in the VL-e framework
  – The approach followed
  – Review of the related work
  – Application use cases and workflow support
• Future work
Scientific experiments & e-Science
Step 1: designing an experiment
Step 2: performing the experiment
Step 3: analyzing the experiment results → success
Complex experiments:
• have complex processes;
• require interdisciplinary expertise;
• require large scale resources.
Grid & high level support
Scientific workflows
Scientific Workflow Management Systems in an e-Science environment
• Functionalities:
  – Automating experiment routines;
  – Rapid prototyping of experimental computing systems;
  – Hiding integration details between resources;
  – Managing the experiment lifecycle.
• Crosses different layers of middleware for managing:
  – Data;
  – Computing;
  – Information;
  – Knowledge.
[Figure: e-Science framework. Labels: Grid infrastructure; generic Grid middleware; data management, computing tasks, information, knowledge; workflow management system (SWMS) with engine, high level workflow services and user support; domain specific applications.]
In the VL-e project the targeted e-science framework is …
VL-e workflow wish list
• Classified in 4 categories:
  – Functionality and capability
  – User interface characteristics
  – Run time capabilities
  – Software engineering aspects
VL-e SIG Workflow meeting, Jan 11th, 2005, 10:00–11:30, H220 (NIKHEF building)
• Present: Belleman, Belloum, Bouwhuis, Breanndán, Kaletas, Konijnenburg, Marshall, Rauwerda, Sterk, Sluiter, Terpstra, Vasunin, Wibisono, Yakali.
• A list of 36 points was established to characterise the ideal workflow for the VL-e
Prioritize the workflow requirements based on the VL-e Applications
• Classified in 4 categories:
  – Application domains model;
  – Engineering;
  – Underlying middleware;
  – Workflow management system: composition / engine (runtime issues) / user support.
• A list of 12 points was established to characterise the practical workflow for VL-e
VL-e sub-program 2.5 in collaboration with SP1.X developers
• SP1.X contributors: Belleman, Klous, Konijnenburg, Marshall, Rauwerda, Sluiter, Terpstra
Application use cases and workflow requirements
Application use cases
– Different rounds: a series of meetings
– Distinguish workflow requirements
Summary
– From the resource perspective:
  • To support legacy tools;
  • To support standard middleware, e.g., web/grid services;
  • To be able to invoke resources from different systems;
  • To provide a rich library of workflow components.
– From the application process perspective:
  • To efficiently manage parallel processes/tasks in an experiment (job farming);
  • To efficiently explore a large parameter space (parameter sweep);
  • To support knowledge based information processing (semantic level data integration).
– From the perspective of using a SWMS:
  • To provide a friendly user interface (preferably a GUI);
  • To support the development of new workflow components (Java, scripts, C++, documentation and support);
  • To be able to execute tasks on distributed resources (clusters or Grid);
  • To be stable at runtime;
  • To be able to interoperate with different workflow management systems.
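The parameter sweep and job farming requirements above can be illustrated with a minimal sketch. It assumes the experiment step is an independent Python function (`run_task` is a hypothetical stand-in, as are the parameter names `alpha` and `steps`) and uses a local thread pool in place of cluster or Grid resources.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def run_task(params):
    """Hypothetical experiment step standing in for a real simulation or analysis job."""
    return params["alpha"] * params["steps"]


def parameter_sweep(grid):
    """Expand a dict of value lists into one parameter dict per combination."""
    names = sorted(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


def farm_jobs(task, jobs, workers=4):
    """Run independent jobs on a worker pool and return the results in job order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(task, jobs))


# Assumed example grid: 2 x 3 = 6 independent jobs.
grid = {"alpha": [0.5, 1.0], "steps": [10, 20, 30]}
jobs = parameter_sweep(grid)
results = farm_jobs(run_task, jobs)
```

The sweep only generates job descriptions and the farm only schedules them, so the local pool could be swapped for a remote submission mechanism without touching the sweep.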
Workflow management in VL-e
• First prototype
  – VLAM-G
  – Shortcomings (GUI, control flow, monitoring etc. + software engineering)
• Approach
  – Collect and analyze application use cases
  – Review the state of the art of workflow systems
  – Propose workflow systems for the PoC environment
  – Be active in use case projects
  – Learn lessons from use cases
  – Propose a new design
Based on the list of 36 items established to characterize the ideal workflow for the VL-e, VLAM-G scored: 13 Yes, 5 supported but in need of reimplementation, 9 No, 2 Partially supported, 6 In progress or planned
Survey of existing workflow systems
http://staff.science.uva.nl/~gvlam//doc/P2/WorkflowSurvey
Participants: Belloum, De Boer, Guevara-Masis, Korkhov, Mirzadeh, Terpstra, van Hooft, Vasunin, Wibisono, Yakali, Zhao.
Survey results
• Based on the survey and the practical tests on the nine workflow systems, we learned:
  – All of the systems are still beta versions (some even alpha) and tend to crash in relatively complex tests.
  – None of the systems supports collaboration, data sharing, or information management.
  – None of the systems enforces best practice or provides support for knowledge capture.
  – Most of the systems are not geared to Grid based systems: they were built to work on a single system, with some features to submit jobs to a remote host (the user is still exposed to some Grid related issues, such as writing RSLs).
  – We had some problems when testing some of the features described in the documentation.
http://staff.science.uva.nl/~gvlam//doc/P2/SWMSRecommendationReport.pdf
Participants: Belloum, De Boer, Korkhov, Terpstra, van Hooft, Vasunin, Wibisono, Zhao.
Recommendation for PoC R1 (part of the short term solution)
http://staff.science.uva.nl/~gvlam//doc/P2/SWMSRecommendationReport.pdf
Participants: Belloum, De Boer, Korkhov, Terpstra, van Hooft, Vasunin, Wibisono, Zhao.
Use cases and small project teams
• Use case project teams
  – Participants from SPs from P1, P2, P3 and P4.
  – Contributions from the workflow team: distinguish reusable components and provide integration solutions.
  – We are also active in project management, such as decomposing the implementation into concrete tasks and tracking the progress.
• Inside SP2.5, we divide the group members over the use cases:
  – SP1.2: Belloum & Korkhov
  – SP1.3: Belloum & De Boer
  – SP1.4: Zhao & Vasunin
  – SP1.5: Zhao & Wibisono
  – SP1.6: Belloum & Paul & De Boer
Collaboration with VL-e Applications
• SP1.2 – AID-Food informatics-IvI
  – WCFS case: searching in “Research Management System” (selected by the VLeIT) (ongoing …)
• SP1.3 – AMC-IvI
  – High-volume data management in the PoC SRB (selected by the VLeIT) (ongoing …)
• SP1.4 – IBED-IvI
  – Run KansK toolbox in workflow environment (Master thesis project) (ongoing …)
Collaboration with VL-e Applications
• SP1.5 – IBU-IvI
  – Histone code: semantic data integration (selected by VLeIT) (ongoing …)
  – Running R scripts on multiple nodes using web service (finished)
  – Running R scripts in workflows (ongoing …)
  – Ridge-O-grammer (ongoing …)
• SP1.6 – AMOLF-IvI
  – SRB metadata update from file header (selected by VLeIT) (ongoing …)
SP1.2: WCFS case: searching in “Research Management System”
[Figure: AID tools. Labels: documents, indexer, index, config, searcher, query formulation, question, list, ontology repositories, interface.]
[Figure: research process. Labels: situation, problem, research question, literature (lit report), lab experiment (in: sample, out: data), analysis (in: data, out: data), answer / conclusion.]
• Much data in scientific research
• But:
  – No reuse: data not available across projects
  – No context: meaning of data not known
  – Experiments not reproducible
  – Only successful experiments traceable
• Wish:
  – Research Management System: manage experimental data for WCFS researchers
SP1.3: High-volume data management in the PoC SRB
• The goal of the use case is to:
  – Facilitate the data management and analysis for the functional MRI studies by using PoC resources:
    • Matrix cluster
    • SRB
• An fMRI pilot is going to be developed as a first step.
SP1.4: Run KansK toolbox in Workflow environment
• To be integrated in a workflow (VLAM)
• The main processes of the toolbox deal with data preparation, evaluation, prediction, and display
• The workflow is about predicting the location of birds
SP1.5: Histone code - semantic data integration
[Figure: semantic data integration steps]
• Model Alignment / Model Extension
• Data Acquisition, e.g. DB connection, API, screen scraper
• Map, e.g. Table -> RDF + model
  – Flat map to RDF
  – RDF to structured RDF
• Assign LSIDs
• Scaling problems
  – Sesame
  – Jena
• Data Import: UCSC tables, RDF repository
• Data Exploration: extract overlapping genome locations
• Knowledge & Data Discovery
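The "flat map to RDF" and "extract overlapping genome locations" steps can be sketched as follows. This is only an illustration under stated assumptions: the LSID-style subject prefix, the vocabulary URL, the column names, and the half-open `(chrom, start, end)` interval convention are all hypothetical, and triples are emitted as plain N-Triples-style strings rather than through Sesame or Jena.

```python
def row_to_triples(row, subject_prefix, predicate_prefix, key="id"):
    """Flat-map one table row to N-Triples-style lines, one triple per non-key column."""
    subject = f"<{subject_prefix}{row[key]}>"
    triples = []
    for column, value in row.items():
        if column == key:
            continue
        # Escape backslashes and double quotes so the literal stays well-formed.
        literal = str(value).replace("\\", "\\\\").replace('"', '\\"')
        triples.append(f'{subject} <{predicate_prefix}{column}> "{literal}" .')
    return triples


def overlaps(a, b):
    """True when two half-open (chrom, start, end) intervals share at least one position."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def overlapping_pairs(left, right):
    """Return every (left, right) pair of overlapping regions (simple quadratic scan)."""
    return [(a, b) for a in left for b in right if overlaps(a, b)]


# Assumed example data, not taken from the UCSC tables.
row = {"id": "r1", "chrom": "chr1", "start": 100, "end": 200}
triples = row_to_triples(row, "urn:lsid:example.org:region:", "http://example.org/vocab#")
genes = [("chr1", 100, 200), ("chr2", 50, 80)]
marks = [("chr1", 150, 250), ("chr1", 200, 300), ("chr2", 10, 40)]
pairs = overlapping_pairs(genes, marks)
```

The quadratic scan is fine for a demonstration; for tables of realistic size, sorting by start position or using an interval index would be the usual next step.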
[Figure: R web services workflow. Labels: read data, normalization, F test, gene data generator, R web services; model, raw data, normalized data, file, V plot, matrix, FDR, gene data. Legend: activity, data; local, Grid.]
SP1.5: Running R scripts in workflows
Steps shared between the SP1.5 side (Frans and Han) and the SP2.5 side (Wibi, Zhiming):
• Define a concrete description
• Provide UML based analysis diagrams
• Have a meeting: decompose the task
• Implement the functionality in the modules (Kepler actor or VLAM module)
• Work together and give the necessary support
• Integrate the modules into a workflow (an integration meeting)
• Refine the modules; refine the workflow
• Final demonstration
SP1.5: Ridge-O-grammer
Identify ridges (regions of increased gene expression)
• Input: Transcriptome map
• Slide Window Median (SWM)
• Slide Window Median Probability (SWMP)
• Histogram of frequencies (HF)
• Histogram of probabilities (HP)
• False Discovery Rate (FDR)
• Output: List of ridges
The outcome of this work is going to be presented at the “Netherlands Bioinformatics Conference”, 24 April 2006
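Two of the steps listed above, the sliding window median and the false discovery rate control, can be sketched in a few lines. The slides do not say how Ridge-O-grammer computes its probabilities or its FDR, so the Benjamini-Hochberg procedure used here is an assumed stand-in, and the expression values and p-values are made-up example data.

```python
from statistics import median


def sliding_window_median(values, window):
    """Median of each full-length window over values kept in genomic order."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [median(values[i:i + window]) for i in range(len(values) - window + 1)]


def benjamini_hochberg(p_values, alpha=0.05):
    """Return one reject flag per p-value using the Benjamini-Hochberg step-up rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its rank-scaled threshold.
        if p_values[i] <= alpha * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for i in order[:cutoff_rank]:
        rejected[i] = True
    return rejected


# Assumed example data: per-gene expression in genomic order, and window p-values.
expression = [1, 2, 9, 8, 7, 2, 1]
swm = sliding_window_median(expression, window=3)
flags = benjamini_hochberg([0.01, 0.04, 0.03, 0.5], alpha=0.05)
```

In this toy input the windows centred on the high-expression stretch get the largest medians, which is the signal a ridge detector would then test for significance.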
Ongoing development activities on the rapid prototyping environment
• Simple file management tools for SRB and GridFTP
• R scripts in the workflow system
• Parameter sharing between workflow components
• Service discovery using a P2P approach
• Parameter sweep and job farming
Future work
• By far the most active and rapidly progressing WMS is Kepler
• Beta version: March 2006.
• Kepler/Ptolemy has two ways of extending the system:
  – Actors
  – Directors
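The split between actors and directors can be illustrated with a small conceptual sketch: actors hold the computation, while a director decides the order in which actors fire. This is not Kepler's or Ptolemy's Java API; the class names `Actor`, `SequentialDirector` and `StagewiseDirector` are hypothetical and exist only to show the separation of concerns.

```python
class Actor:
    """A workflow component: consumes one input token and produces one output token."""

    def __init__(self, name, func):
        self.name = name
        self.func = func

    def fire(self, token):
        return self.func(token)


class SequentialDirector:
    """Schedule: each token passes through the whole actor chain before the next token."""

    def run(self, actors, tokens):
        results = []
        for token in tokens:
            for actor in actors:
                token = actor.fire(token)
            results.append(token)
        return results


class StagewiseDirector:
    """Alternative schedule: every token clears one actor before the next actor starts."""

    def run(self, actors, tokens):
        tokens = list(tokens)
        for actor in actors:
            tokens = [actor.fire(t) for t in tokens]
        return tokens


# The same actor chain runs unchanged under either director.
chain = [Actor("double", lambda x: x * 2), Actor("increment", lambda x: x + 1)]
sequential = SequentialDirector().run(chain, [1, 2, 3])
stagewise = StagewiseDirector().run(chain, [1, 2, 3])
```

Because the actors here are side-effect free, both schedules give the same output; the point is that a new execution model can be added as a director without rewriting any actor.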
Summary
• Survey results showed that the e-science WMS targeted in VL-e
  – Does not exist yet
  – Collaboration with other workflow projects will likely speed up the development process
• Project teams working on application use cases are the only way to progress
• VLAM is still quite useful for rapid prototyping
References
People:
Adam Belloum (SP2.5 leader), Zhiming Zhao, Paul van Hooft (post doc), Adianto Wibisono, Dmitry Vasyunin, Vladimir Korkhov, Frank Terpstra (Ph.D students), Piter de Boer (Programmer)
VL-e Reports:
1. PoC recommendation report;
Publications:
1. Z. Zhao; A. Belloum; H. Yakali; P.M.A. Sloot and L.O. Hertzberger: Dynamic Workflow in a Grid Enabled Problem Solving Environment, in Proceedings of the 5th International Conference on Computer and Information Technology, pp. 339-345. IEEE Computer Society Press, Shanghai, China, September 2005.
2. Z. Zhao; A. Belloum; A. Wibisono; F. Terpstra; P.T. de Boer; P.M.A. Sloot and L.O. Hertzberger: Scientific workflow management: between generality and applicability, in Proceedings of the International Workshop on Grid and Peer-to-Peer based Workflows, pp. 357-364. IEEE Computer Society Press, Melbourne, Australia, September 19th-21st 2005.
3. Z. Zhao; A. Belloum; P.M.A. Sloot and L.O. Hertzberger: Agent technology and scientific workflow management in an e-Science environment, in Proceedings of the 17th IEEE International Conference on Tools with Artificial Intelligence, pp. 19-23. IEEE Computer Society Press, Hong Kong, China, November 14th-16th 2005.
Activity:
1. Int’l workshop on Workflow systems in e-Science, organized by Zhiming Zhao and Adam Belloum, in the context of ICCS06, Reading University, May 28, 2006.
2. Workshop on Workflow systems in e-Science, to be held during the next e-Science conference in Amsterdam, December 2006.