Page 1: A Framework for Scientific Workflow Reproducibility in the Cloud · escience-2016.idies.jhu.edu/.../11/Qasha-Rawaa-slides.pdf · 2016. 11. 1. · Rawaa Qasha, Jacek Cała, Paul Watson

A Framework for Scientific Workflow

Reproducibility in the Cloud

Rawaa Qasha, Jacek Cała, Paul Watson Newcastle University, Newcastle upon Tyne, UK

Email: {r.qasha, jacek.cala, paul.watson}@newcastle.ac.uk

In this paper

• A new framework for repeatability and reproducibility of scientific workflows

• Integrating logical and physical preservation approaches

• Offering workflow/task repositories with version control

• Supporting automatic deployment and image capture of workflows and tasks


Outline

• Background

• Challenges for workflow reproducibility

• Our solution for logical and physical preservation

• Overview of the reproducibility framework

• Experiments and results

• Conclusions


Workflows & Reproducibility

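[Figure: bar chart, y-axis "Number of workflows" (0 to 1600), comparing the total number of workflows with the number that can be re-executed. study1*: 92 workflows in total, 18 (~20%) can be re-executed. study2**: 1443 workflows in total, 341 (~24%) can be re-executed.]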

*Zhao et al., "Why workflows break: Understanding and combating decay in Taverna workflows", 2012

**Mayer et al., "A Quantitative Study on the Re-executability of Publicly Shared Scientific Workflows", 2015
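
The approximate percentages on this slide follow directly from the reported counts. A quick check in Python, where the function and variable names are illustrative and not taken from the slides:

```python
# Counts as reported on the slide: (total workflows, re-executable workflows).
studies = {
    "study1 (Zhao et al., 2012)": (92, 18),
    "study2 (Mayer et al., 2015)": (1443, 341),
}


def reexecutable_share(total: int, reexecutable: int) -> float:
    """Return the percentage of workflows that could be re-executed."""
    return 100.0 * reexecutable / total


for name, (total, ok) in studies.items():
    print(f"{name}: {ok}/{total} = {reexecutable_share(total, ok):.1f}%")
```

In both studies roughly one workflow in four or five could be re-executed, which is the motivation for the framework.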

Challenges for workflow reproducibility

• Insufficiently detailed workflow description

• Insufficient description of the execution environment

• Unavailable execution environments

• Absence of, and changes in, external dependencies

• Missing input data


Common reproducibility approaches

• Logical preservation

• Physical preservation

[Figure: example workflow with tasks T1, T2, T3 and T4]

Using TOSCA for logical preservation

[Figure: a workflow with tasks T1, T2, T3 and T4 described in TOSCA. Node Types and Relationship Types are the building blocks; each task is a Node Template (T1, T2, T3, T4), and the Node Templates are grouped in a Service Template.]

Workflow and execution environment description
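
The TOSCA concepts named on this slide can be pictured with a small data model. The sketch below is plain Python, not TOSCA syntax and not the Cloudify API; the class names mirror the slide's terms, while the type name `WorkflowTask` and the dependency edges between T1 to T4 are assumptions made only for illustration.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class NodeType:
    """Reusable definition of a kind of component (e.g. a workflow task)."""
    name: str


@dataclass
class NodeTemplate:
    """A concrete instance of a NodeType within one workflow."""
    name: str
    node_type: NodeType
    # Names of the templates this one depends on (relationship targets).
    depends_on: list[str] = field(default_factory=list)


@dataclass
class ServiceTemplate:
    """The whole workflow: all node templates plus their relationships."""
    name: str
    nodes: dict[str, NodeTemplate] = field(default_factory=dict)

    def add(self, node: NodeTemplate) -> None:
        self.nodes[node.name] = node

    def deployment_order(self) -> list[str]:
        """Return template names with every dependency before its dependents."""
        graph = {n.name: set(n.depends_on) for n in self.nodes.values()}
        return list(TopologicalSorter(graph).static_order())


task = NodeType("WorkflowTask")  # hypothetical type name
wf = ServiceTemplate("example-workflow")
# Assumed edges: T1 feeds T2 and T3, which both feed T4.
wf.add(NodeTemplate("T1", task))
wf.add(NodeTemplate("T2", task, depends_on=["T1"]))
wf.add(NodeTemplate("T3", task, depends_on=["T1"]))
wf.add(NodeTemplate("T4", task, depends_on=["T2", "T3"]))

order = wf.deployment_order()
print(order)  # T1 comes first and T4 last; T2 and T3 may appear in either order
```

Because the description holds both the components and their relationships, a runtime can derive a valid deployment order from it without further instructions, which is what makes the description usable for logical preservation.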


Using Docker for physical preservation

[Figure, panel (a) Initial task deployment & execution: a container is created from the base image, the task artifact, tools & libraries, and data; the container with dependencies is then captured as a task image (image creation).]

[Figure, panel (b) Task deployment & execution with task image: a container is created directly from the task image and data.]

Preserving execution environment and dependencies, tracking changes
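
The two deployment paths in the figure can be sketched as a small planning function. This is a minimal illustration, not the framework's implementation: it only builds `docker run` and `docker commit` command lines as argument lists and does not execute them. The function name, image names, container name, mount point `/data`, and the install and run commands are all hypothetical.

```python
def plan_task_deployment(
    task: str,
    base_image: str,
    task_image: str,
    image_available: bool,
    data_dir: str,
    install_cmd: list[str],
    run_cmd: list[str],
) -> list[list[str]]:
    """Return the docker CLI commands (as argument lists) for one task."""
    # Run the task in a container created from the task image, with the
    # input data mounted from the host.
    run_task = ["docker", "run", "--rm", "-v", f"{data_dir}:/data",
                task_image, *run_cmd]
    if image_available:
        # Path (b): the captured task image already holds all dependencies.
        return [run_task]
    container = f"{task}-setup"
    return [
        # Path (a): start from the base image and install tools and libraries.
        ["docker", "run", "--name", container, base_image, *install_cmd],
        # Capture the container, with its dependencies, as the task image.
        ["docker", "commit", container, task_image],
        run_task,
    ]


common = dict(
    task="t1",
    base_image="ubuntu:14.04",        # hypothetical base image
    task_image="example/t1:1.0",      # hypothetical task image name
    data_dir="/tmp/wf-data",
    install_cmd=["sh", "-c", "apt-get update && apt-get install -y python"],
    run_cmd=["sh", "/task/run.sh"],   # hypothetical entry script
)
initial = plan_task_deployment(image_available=False, **common)
repeat = plan_task_deployment(image_available=True, **common)

for cmd in initial:
    print(" ".join(cmd))
```

On the first deployment the plan has three steps; once the task image exists, later deployments need only the final `docker run`, so they skip dependency installation.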


Reproducibility Framework

[Figure: framework architecture with the following components]

• Core Repository (GitHub): Lifecycle Scripts, Basic Types

• Task/WF Repository (GitHub)

• Images Repository (Docker Hub)

• Workflow Deployment & Enactment Engine (TOSCA Runtime Environment: Cloudify)

• Automated Image Creation

• Target Execution Environment (Docker over local VM, AWS, Azure, GCE, …)


Multi-container deployment


Single container deployment


Timeline of workflow DevOps


Workflow repository

Preserving description, input data, tracking changes and deployment instructions


Experiments and Results


1. Repeatability of a workflow on different clouds


2. Automatic image capture for improved performance


3. Reproducibility in the face of development changes

Conclusions

• Full workflow reproducibility is a long-standing issue

• TOSCA description is used for logical preservation

• Docker images for tasks/workflows support physical preservation

• Change tracking and automatic deployment also contribute to a comprehensive solution to the problem

• Integrating these techniques addresses the majority of the issues related to workflow decay

THANK YOU