A Framework for Scientific Workflow Reproducibility in the Cloud
Rawaa Qasha, Jacek Cała, Paul Watson
Newcastle University, Newcastle upon Tyne, UK
Email: {r.qasha, jacek.cala, paul.watson}@newcastle.ac.uk
In this paper
• A new framework for repeatability and reproducibility of scientific workflows
• Integrating logical and physical preservation approaches
• Offering workflow and task repositories with version control
• Supporting automatic deployment and image capture of workflows and tasks
Outline
• Background
• Challenges for workflow reproducibility
• Our solution for logical and physical preservation
• Overview of the reproducibility framework
• Experiments and results
• Conclusions
Workflows & Reproducibility
Figure: number of workflows per study, total vs. re-executable
• Study 1*: 92 workflows in total, 18 (~20%) can be re-executed
• Study 2**: 1443 workflows in total, 341 (~24%) can be re-executed
*Zhao et al., “Why workflows break: Understanding and combating decay in Taverna workflows,” 2012
**Mayer et al., “A Quantitative Study on the Re-executability of Publicly Shared Scientific Workflows,” 2015
Challenges for workflow reproducibility
• Insufficiently detailed workflow description
• Insufficient description of the execution environment
• Unavailable execution environments
• Absence of & changes in the external dependencies
• Missing input data
Common reproducibility approaches
Figure: example workflow with four tasks (T1, T2, T3, T4)
Logical preservation
Physical preservation
Using TOSCA for logical preservation
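Figure: TOSCA Service Template for the example workflow
• Each task (T1, T2, T3, T4) is a Node Template, an instance of a Node Type
• Links between tasks are instances of a Relationship Type
• The Service Template groups the Node Templates and their relationships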
Workflow and execution environment description
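The structure on this slide can be illustrated with a minimal Python sketch. It is not TOSCA syntax and not the framework's code: the class names, the type names `example.nodes.Task` and `example.relationships.ConnectedTo`, and the diamond-shaped links between T1 to T4 are all assumptions made for illustration, since the slide does not state how the tasks are connected. The sketch only shows how a description made of node templates and relationships is enough to derive a deployment order.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class NodeTemplate:
    name: str       # task name, e.g. "T1"
    node_type: str  # hypothetical Node Type name


@dataclass(frozen=True)
class RelationshipTemplate:
    source: str  # task that depends on another task
    target: str  # task that must be deployed and run first
    rel_type: str = "example.relationships.ConnectedTo"  # hypothetical


@dataclass
class ServiceTemplate:
    nodes: dict = field(default_factory=dict)
    relationships: list = field(default_factory=list)

    def add_node(self, name, node_type="example.nodes.Task"):
        self.nodes[name] = NodeTemplate(name, node_type)

    def connect(self, source, target):
        # Both ends must already be declared as node templates.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("unknown node template")
        self.relationships.append(RelationshipTemplate(source, target))

    def deployment_order(self):
        # Each node maps to the set of nodes it depends on.
        sorter = TopologicalSorter({name: set() for name in self.nodes})
        for rel in self.relationships:
            sorter.add(rel.source, rel.target)
        return list(sorter.static_order())


def build_example():
    # Assumed topology: T1 feeds T2 and T3, which both feed T4.
    template = ServiceTemplate()
    for task in ("T1", "T2", "T3", "T4"):
        template.add_node(task)
    template.connect("T2", "T1")
    template.connect("T3", "T1")
    template.connect("T4", "T2")
    template.connect("T4", "T3")
    return template


if __name__ == "__main__":
    print(build_example().deployment_order())
```

Under the assumed topology, T1 always comes first and T4 last; the relative order of T2 and T3 is not fixed by the description.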
Using Docker for physical preservation
Preserving execution environment and dependencies, tracking changes
Figure: task deployment and execution with Docker
(a) Initial task deployment & execution: a container is created from the base image; tools & libraries, the task artifact and data are added; image creation then captures the container with its dependencies as a task image
(b) Task deployment & execution with task image: a container is created directly from the task image and supplied with data
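The two deployment paths, (a) initial deployment followed by image capture and (b) deployment from an existing task image, can be sketched as a small Python simulation. It makes no Docker calls: `ImageStore`, `deploy_task` and all image, dependency and artifact names are hypothetical stand-ins chosen for illustration, not the framework's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ImageStore:
    """In-memory stand-in for an image repository (hypothetical)."""
    images: dict = field(default_factory=dict)


def deploy_task(task, store, base_image, dependencies, artifact):
    """Return the list of steps needed to deploy and execute one task."""
    steps = []
    if task in store.images:
        # Path (b): a task image already exists, so reuse it directly.
        steps.append(f"create container from task image '{task}'")
    else:
        # Path (a): start from the base image and provision everything.
        steps.append(f"create container from base image '{base_image}'")
        layers = [base_image]
        for dep in dependencies:
            steps.append(f"install {dep}")
            layers.append(dep)
        steps.append(f"add task artifact {artifact}")
        layers.append(artifact)
        # Capture the provisioned container so later runs can skip setup.
        store.images[task] = tuple(layers)
        steps.append(f"capture container as task image '{task}'")
    steps.append("provide input data and execute task")
    return steps


if __name__ == "__main__":
    store = ImageStore()
    args = ("T1", store, "ubuntu", ["java", "some-lib"], "t1.jar")
    for run in (1, 2):
        print(f"run {run}:")
        for step in deploy_task(*args):
            print("  ", step)
```

In this sketch the first run performs the full provisioning and stores the result; the second run needs only container creation and execution.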
Reproducibility Framework
Figure: architecture of the reproducibility framework
• Core Repository (GitHub): Lifecycle Scripts, Basic Types
• Task/WF Repository (GitHub)
• Images Repository (Docker Hub)
• Workflow Deployment & Enactment Engine (TOSCA Runtime Environment: Cloudify)
• Automated Image Creation
• Target Execution Environment (Docker over local VM, AWS, Azure, GCE, …)
Multi-container deployment
Single container deployment
Timeline of workflow DevOps
Workflow repository
Preserving description, input data, tracking changes and deployment instructions
Experiments and Results
1- Repeatability of a workflow on different clouds
2- Automatic image capture for improved performance
3- Reproducibility in the face of development changes
Conclusions
• Full workflow reproducibility is a long-standing issue
• TOSCA description is used for logical preservation
• Docker images for tasks/workflows support physical preservation
• Change tracking and automatic deployment also contribute to a comprehensive solution to the problem
• Integrating these techniques addresses the majority of the issues related to workflow decay
THANK YOU