Self-healing of operational workflow incidents on distributed computing infrastructures

The 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing – May 13-16 2012

Rafael Ferreira da Silva – [email protected]

Rafael FERREIRA DA SILVA, Tristan GLATARD University of Lyon, CNRS, INSERM, CREATIS

Villeurbanne, France

Frédéric DESPREZ INRIA, University of Lyon, LIP, ENS Lyon

Lyon, France

Context

  Virtual Imaging Platform (VIP)
    Medical imaging science gateway
    Grid of 129 sites (EGI – http://www.egi.eu)
  Significant usage
    Registered users: 192 from 24 countries
    Applications: 18
    Consumed 32 CPU years in 2011


Figure: VIP usage in 2011, showing CPU consumption of VIP and related platforms on EGI (legend: applications, file transfer).

VIP – http://vip.creatis.insa-lyon.fr

Problem and objective

  Problem: costly manual operations
    Rescheduling tasks, restarting services, killing misbehaving experiments or replicating data files
  Objective: automated platform administration
    Autonomous detection of operational incidents
    Perform appropriate set of actions
  Assumptions: online and non-clairvoyant
    Only partial information available
    Decisions must be fast
    Production conditions, no prediction of user activity or workloads


Platform entities

  Several entities
    Multiple level issues
  Highlighted: our target


•  File unavailable •  Executable corrupted …

•  Blocked •  Low efficiency •  File unavailable •  Site misconfigured …

•  Unavailable •  Does not exist … •  Misconfigured

•  Unavailable …

Figure: entities diagram of a scientific gateway, annotated with multiple level issues.

•  Failed to setup •  Execution failed •  Unable to access storage resource …

Illustration on a workflow execution

Figure labels: workflow description + input data, invocations, grid jobs, failed job, replica.

Outline

  General healing process

  Application to blocked activities and other incidents

  Experiments and results on a production grid

  Conclusions


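General MAPE-K loop

Figure: MAPE-K loop with Monitoring, Analysis, Planning, Execution and Knowledge. The loop is triggered by an event (job completion and failures) or a timeout, uses historical info as knowledge, and produces a set of actions.

Roulette wheel selection

  Each incident has levels 1 to 3 and a degree d (e.g. incident 1: degree d = 0.2)
  Incident i is drawn among n incidents with probability

$$p_i = \frac{d_i}{\sum_{j=1}^{n} d_j}$$

  Example wheel shares: 0.15, 0.31, 0.54; incident 2 is selected

Roulette wheel selection based on association rules

Association rules for incident 2:

Rule     Confidence (c)   c × d
1 → 2    0.1              0.09
2 → 2    1.0              0.52
3 → 2    0.9              0.39

  Wheel shares: 0.09, 0.52, 0.39; incident 2 (I2) is selected
  Set of actions: taken for the selected incident

Figure: histogram of the blocked-activity degree (estimation by median); x-axis from 0.0 to 1.0, y-axis frequency.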
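The roulette wheel selection used in the loop can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the platform's implementation: the function names are hypothetical, and the degrees 0.4 and 0.7 are assumed values, chosen only because together with the slide's d = 0.2 they reproduce the wheel shares 0.15, 0.31 and 0.54.

```python
import random


def selection_probabilities(degrees):
    """Normalize incident degrees d_i into wheel shares d_i / sum_j d_j."""
    total = sum(degrees)
    if total <= 0:
        raise ValueError("at least one degree must be positive")
    return [d / total for d in degrees]


def roulette_select(weights, rng=random):
    """Draw one index with probability proportional to its weight."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


# d = 0.2 for incident 1 comes from the slide; 0.4 and 0.7 are ASSUMED values.
degrees = [0.2, 0.4, 0.7]
shares = selection_probabilities(degrees)
print([round(s, 2) for s in shares])  # [0.15, 0.31, 0.54]

# Second draw: the "c x d" column of the association-rule table for incident 2.
rule_weights = [0.09, 0.52, 0.39]
picked = roulette_select(rule_weights, random.Random(0))
print("selected entry:", picked + 1)
```

Because the draw is random, an incident with a low degree can still be selected occasionally; the degree only sets how often that happens.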

Modeling

  Fuzzy Finite State Machine
    State degrees set from metrics
  Association rules identify relations among incident levels
  Considered incidents
    Activity blocked
    Input data error
    Output data error
    Application error
    Site misconfigured

Figure: activity fuzzy finite state machine, with fuzzy states and crisp states.

Activity blocked

  Definition
    An invocation is late compared to the others
    Jobs of an activity are considered bag-of-tasks
  Possible causes
    Longer waiting times
    Lost tasks (e.g. killed by site due to quota violation)
    Resources with poor performance


Figures: invocations completion rate and job flow for a FIELD-II/pasa simulation (workflow-9SIeNv); the completion plot shows completed jobs against time (s).

Activity blocked: degree

  Degree computed from all completed jobs of the activity
  Job phases: setup, inputs download, execution, outputs upload
  Assumption: bag-of-tasks (all jobs have equal durations)
  Median-based estimation (example below)
  Incident degree: job performance w.r.t. median

$$d = \frac{E_i}{M_i + E_i} \in [0, 1]$$

Example:

Phase             Median duration   Real job duration   Estimated job duration
setup             50s               42s (completed)     42s
inputs download   250s              300s (completed)    300s
execution         400s              20s (current)       400s*
outputs upload    15s               ?                   15s
Total             M_i = 715s                            E_i = 757s

*: max(400s, 20s) = 400s
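The median-based estimation and the degree formula can be checked against the numbers of the example (M_i = 715s, E_i = 757s). The sketch below is a hedged illustration: the function names and the `current_index` convention are assumptions, and the estimation rule (real duration for completed phases, the maximum of median and elapsed time for the current phase, the median for phases not yet started) is inferred from the example rather than quoted from the platform code.

```python
def estimated_duration(median_phases, real_phases, current_index):
    """Estimate the total duration E_i of a job from its phase durations."""
    total = 0.0
    for k, med in enumerate(median_phases):
        if k < current_index:        # phase already completed: use real time
            total += real_phases[k]
        elif k == current_index:     # phase in progress: at least the median
            total += max(med, real_phases[k])
        else:                        # phase not started yet: use the median
            total += med
    return total


def blocked_degree(median_phases, real_phases, current_index):
    """d = E_i / (M_i + E_i), which lies in [0, 1]."""
    m_i = sum(median_phases)
    e_i = estimated_duration(median_phases, real_phases, current_index)
    return e_i / (m_i + e_i)


# Phases: setup, inputs download, execution, outputs upload (values from the slide).
medians = [50, 250, 400, 15]
real = [42, 300, 20, None]   # the job is currently in its execution phase
e_i = estimated_duration(medians, real, current_index=2)
d = blocked_degree(medians, real, current_index=2)
print(e_i, round(d, 2))      # 757.0 0.51
```

A job that behaves exactly like the median gets d = 0.5; the degree approaches 1 as the job falls further behind.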

Activity blocked: levels and actions

  Levels: identified from the platform logs
  Actions
    Job replication
    Cancel replicas with bad performance
    Replicate only if all active replicas are running

Figure: replication process for one task.

Figure: histogram of the blocked-activity degree d (estimation by median; x-axis from 0.0 to 1.0, y-axis frequency), split by threshold $\tau_1$:
  Level 1 (d below $\tau_1$): no actions
  Level 2 (d above $\tau_1$): action: replicate jobs
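The replication rule for blocked activities can be written as a small predicate. This is a sketch under assumptions, not the platform's code: the state names (`"queued"`, `"running"`), the inclusive comparison with the threshold, and the requirement that at least one replica is still active are hypothetical choices. The limit of 5 replicas is the value used in the experiments.

```python
ACTIVE_STATES = ("queued", "running")  # hypothetical state names


def should_replicate(degree, tau1, replica_states, max_replicas=5):
    """Decide whether to submit one more replica of a task.

    A replica is added only at level 2 (degree at or above tau1), only if
    every active replica is already running, and only below the limit.
    """
    active = [s for s in replica_states if s in ACTIVE_STATES]
    if degree < tau1:                 # level 1: no actions
        return False
    if not active:                    # assumption: nothing left to replicate
        return False
    if len(active) >= max_replicas:   # replication limit reached
        return False
    return all(s == "running" for s in active)


print(should_replicate(0.6, 0.5, ["running", "running"]))
print(should_replicate(0.6, 0.5, ["running", "queued"]))
```

Waiting until all active replicas are running avoids stacking new replicas behind ones that are still queued, which would add load without shortening the wait.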

Input/output data errors

  Possible causes
    File unavailability or non-existence
  Degree
    Determined from the transfer failure rate
  Levels and actions

Histogram                                 Level     Action
Input failed transfers (unavailability)   Level 1   no actions
                                          Level 2   replicate input files
                                          Level 3   stop activity
Output failed transfers                   Level 1   no actions
                                          Level 2   stop activity

Input levels are separated by thresholds $\tau_1$ and $\tau_2$, output levels by $\tau_1$; both histograms plot frequency against the degree (0.0 to 1.0).
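Turning a degree into a discrete level amounts to comparing it with the thresholds. The sketch below assumes hypothetical threshold values (the real ones are identified from the platform logs), treats each threshold as an inclusive lower bound, and uses one reading of the level and action labels for input data errors (level 1: no actions, level 2: replicate input files, level 3: stop activity).

```python
from bisect import bisect_right


def failure_rate(failed, total):
    """Fraction of failed transfers; 0.0 when nothing was attempted."""
    return failed / total if total else 0.0


def incident_level(degree, thresholds):
    """Map a degree in [0, 1] to a level: 1 below tau_1, 2 from tau_1, and so on."""
    return bisect_right(sorted(thresholds), degree) + 1


# One reading of the slide's labels for input data errors.
INPUT_ACTIONS = {1: None, 2: "replicate input files", 3: "stop activity"}

TAUS = (0.3, 0.7)  # HYPOTHETICAL thresholds tau_1, tau_2

degree = failure_rate(failed=6, total=10)
level = incident_level(degree, TAUS)
print(level, INPUT_ACTIONS[level])  # 2 replicate input files
```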

Application error

  Possible causes
    Corrupted executable, missing dependencies, incompatibility
  Degree
    Determined from the application failure rate

Figure: histogram of application errors (frequency against degree, 0.0 to 1.0), split by threshold $\tau_1$:
  Level 1: no actions
  Level 2: action: stop activity

Site misconfiguration

  Possible cause
    Sites have utmost failure rate due to site-independent issue
  Degree

$$d = \max(\phi_1, \phi_2, \ldots, \phi_k) - \operatorname{median}(\phi_1, \phi_2, \ldots, \phi_k)$$

where $\phi_i$ denotes the failure rate of jobs running on site $i$.

Figure: histogram of input failed transfers (site misconfigured); frequency against degree (0.0 to 1.0), with thresholds $\tau_1$ and $\tau_2$ separating Level 1 (no actions), Level 2 and Level 3.
  action: blacklist site by an exponential backoff delay
  action: replicate files on sites reachable from problematic site
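The site misconfiguration degree compares the worst site with the typical one, and blacklisting uses an exponential backoff delay. The sketch below is illustrative only: the site names and failure rates are invented, and the backoff base and cap are hypothetical parameters, since the slides give no values for them.

```python
from statistics import median


def site_misconfiguration_degree(failure_rates):
    """d = max(phi_1..phi_k) - median(phi_1..phi_k) over per-site failure rates."""
    rates = list(failure_rates.values())
    return max(rates) - median(rates)


def backoff_delay(strikes, base_seconds=60.0, cap_seconds=3600.0):
    """Blacklist delay that doubles with each consecutive strike, up to a cap."""
    if strikes < 1:
        return 0.0
    return min(cap_seconds, base_seconds * 2 ** (strikes - 1))


# INVENTED example data: failure rate of jobs per site.
rates = {"site-a": 0.05, "site-b": 0.10, "site-c": 0.90}
d = site_misconfiguration_degree(rates)
worst_site = max(rates, key=rates.get)
print(worst_site, round(d, 2))                     # site-c 0.8
print([backoff_delay(n) for n in (1, 2, 3, 10)])   # [60.0, 120.0, 240.0, 3600.0]
```

Subtracting the median keeps the degree low when all sites fail at a similar rate, which points to a problem that is not specific to one site.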

Experiments

  Goal: Self-Healing vs No-Healing
    Experiment 1: cope with recoverable errors
    Experiment 2: detect unrecoverable errors
  Metrics
    Makespan of the activity execution
    Resource waste

$$w = \frac{(CPU + data)_{\text{self-healing}}}{(CPU + data)_{\text{no-healing}}} - 1$$

    For w < 0: self-healing consumed fewer resources
    For w > 0: self-healing wasted resources
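The resource waste metric is a relative difference, so it is easy to compute once CPU and data transfer times are known for both runs. In the sketch below the function name and the input values are hypothetical; the numbers were picked only so that the result equals -0.26, the best value reported for the experiments.

```python
def resource_waste(self_healing, no_healing):
    """w = (CPU + data)_self-healing / (CPU + data)_no-healing - 1.

    Each argument is a (cpu_time, data_transfer_time) pair in the same unit.
    """
    consumed_sh = sum(self_healing)
    consumed_nh = sum(no_healing)
    if consumed_nh <= 0:
        raise ValueError("the No-Healing run must have consumed resources")
    return consumed_sh / consumed_nh - 1.0


# HYPOTHETICAL consumption figures, in seconds.
w = resource_waste(self_healing=(700.0, 40.0), no_healing=(900.0, 100.0))
print(round(w, 2))  # -0.26, i.e. self-healing used 26% less
```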

Experiment Conditions

  Software
    Virtual Imaging Platform
    MOTEUR workflow engine
    DIRAC pilot job system
  Infrastructure
    European Grid Infrastructure (EGI): production, shared
    Self-Healing and No-Healing launched simultaneously
  Experiment parameters
    Task and file replication limited to 5
    Failed task resubmission limited to 5


Applications


FIELD-II/pasa

•  Ultrasound imaging simulation
•  122 invocations
•  CPU Time: 15 min
•  ~210 MB
•  Data-intensive

Mean-Shift/hs3

•  Image denoising
•  250 invocations
•  CPU Time: 1 hour
•  ~182 MB
•  CPU-intensive

Image courtesy of ANR project US-Tagging http://www.creatis.insa-lyon.fr/us-tagging/news

O. Bernard, M. Alessandrini

Image courtesy of Ting Li http://www.creatis.insa-lyon.fr

Historical information

  Used for determining incident levels and association rules
  Traces from Virtual Imaging Platform (VIP)
    April to August 2011
    Workload uniformly distributed

Figure: cumulative amount of running activities from April to August 2011.

  1,082 workflow executions
  36 applications (+ versions)
  26 users
  1,838 activities
  92,309 invocations
  123,025 jobs
  641,297 events

Results

  Experiment 1: tests if recoverable errors are detected
    FIELD-II/pasa: speeds up execution up to a factor of 4
    Mean-Shift/hs3: speeds up execution up to a factor of 2.6

Figure: makespan (s) of No-Healing and Self-Healing over 5 repetitions, one panel per application.

Self-Healing process reduced resource consumption up to 26% when compared to the No-Healing execution

Repetition   w (FIELD-II/pasa)   w (Mean-Shift/hs3)
1            –0.10               –0.02
2            –0.15               –0.20
3            –0.09               –0.02
4            0.05                –0.02
5            –0.26               –0.01

Results (2)

  Experiment 1: invocations completion rate for one repetition of FIELD-II/pasa and Mean-Shift/hs3

Figure: completed jobs against time (s) in four panels, with columns labeled Self-Healing and No-Healing:
  FIELD-II/pasa: workflow-j5E0vz, workflow-hbg12c
  Mean-Shift: workflow-zCQzYD, workflow-GhGlW7

Results (3)

  Experiment 2: tests if unrecoverable errors are quickly identified and the execution is stopped

  3 different runs

  No-Healing execution was stopped after 7 hours


Unrecoverable errors were identified earlier by the healing process

Conclusions

  Context
    Autonomous handling of incidents in workflow activities
    No assumptions on resource characteristics and workloads
  Summary of the proposed method
    Implements a generic MAPE-K loop
    Incident degrees computed online
    Degrees quantized into discrete incident levels
    Action sets are performed according to the incident level
  Results
    Obtained in production conditions (EGI)
    Speeds up execution up to a factor of 4
    Reduced resource consumption up to 26%
    Proper detection of unrecoverable errors
  Future work
    Address complete workflow execution


Thank you for your attention. Questions?

ACKNOWLEDGMENTS

  VIP users and project members
  French National Agency for Research (ANR-09-COSI-03)
  European Grid Initiative (EGI)
  France-Grilles
