Self-healing of operational workflow incidents on distributed computing infrastructures
The 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing – May 13-16 2012
Rafael Ferreira da Silva – [email protected]
Rafael FERREIRA DA SILVA, Tristan GLATARD University of Lyon, CNRS, INSERM, CREATIS
Villeurbanne, France
Frédéric DESPREZ INRIA, University of Lyon, LIP, ENS Lyon
Lyon, France
Context
Virtual Imaging Platform (VIP)
Medical imaging science-gateway
Grid of 129 sites (EGI – http://www.egi.eu)
Significant usage
Registered users: 192 from 24 countries
Applications: 18
Consumed 32 CPU years in 2011
VIP usage in 2011: CPU consumption of VIP and related platforms on EGI.
Applications
File transfer
VIP – http://vip.creatis.insa-lyon.fr
Problem and objective
Problem: costly manual operations
Rescheduling tasks, restarting services, killing misbehaving experiments or replicating data files
Objective: automated platform administration
Autonomous detection of operational incidents
Perform the appropriate set of actions
Assumptions: online and non-clairvoyant
Only partial information available
Decisions must be fast
Production conditions: no prediction of user activity or workloads
Platform entities
Several entities, multiple-level issues
Highlighted: our target
[Figure: entities diagram of a scientific gateway, with multiple-level issues listed per entity]
• File unavailable • Executable corrupted …
• Blocked • Low efficiency • File unavailable • Site misconfigured …
• Unavailable • Does not exist … • Misconfigured
• Unavailable …
• Failed to setup • Execution failed • Unable to access storage resource …
Illustration on a workflow execution
[Figure labels: workflow description + input data, invocations, grid jobs, failed job replica]
Outline
General healing process
Application to blocked activities and other incidents
Experiments and results on a production grid
Conclusions
General MAPE-K loop
Monitoring: event (job completion and failures) or timeout
Analysis: incident degrees and levels (level 1, level 2, level 3 for each incident)
Incident 1: degree d = 0.2
Incident 2: degree d = 0.7
Incident 3: degree d = 0.4
Planning: roulette wheel selection, where incident $i$ is selected with probability $d_i / \sum_{j=1}^{n} d_j$
Incident 1: 0.15, Incident 2: 0.54, Incident 3: 0.31
Selected: Incident 2
Association rules for incident 2
Rule     Confidence (c)   c x d
1 → 2    0.1              0.09
2 → 2    1.0              0.52
3 → 2    0.9              0.39
Roulette wheel selection based on association rules (0.09, 0.52, 0.39)
Selected: Incident 2
Execution: set of actions
Knowledge
Historical info
[Figure: histogram of the blocked-activity degree estimated by median; x-axis from 0.0 to 1.0, y-axis: frequency]
Modeling
Fuzzy Finite State Machine
State degrees set from metrics
Association rules identify relations among incident levels
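The roulette wheel step of the loop can be sketched in a few lines of Python. This is only an illustration of degree-proportional selection, not the platform's code: the function names and the dictionary layout are assumptions, and the degrees are the three example values shown on the slide.

```python
import random


def selection_probabilities(degrees):
    """Normalize incident degrees so that they sum to 1 (d_i / sum of d_j)."""
    total = sum(degrees.values())
    if total <= 0:
        raise ValueError("at least one incident must have a positive degree")
    return {incident: d / total for incident, d in degrees.items()}


def roulette_select(degrees, rng=random):
    """Pick one incident with probability proportional to its degree."""
    incidents = list(degrees)
    weights = [degrees[incident] for incident in incidents]
    return rng.choices(incidents, weights=weights, k=1)[0]


# Example degrees taken from the slide (names are hypothetical keys).
example_degrees = {"Incident 1": 0.2, "Incident 2": 0.7, "Incident 3": 0.4}
example_probabilities = selection_probabilities(example_degrees)
```

With the slide's degrees, the normalized probabilities round to 0.15, 0.54 and 0.31, so Incident 2 is the most likely pick but not the only possible one. The same routine could be reused for the second stage by passing rule weights instead of incident degrees.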
Considered incidents
Activity blocked
Input data error
Output data error
Application error
Site misconfigured
[Figure: activity Fuzzy Finite State Machine, showing crisp states and fuzzy states]
Activity blocked
Definition
An invocation is late compared to the others
Jobs of an activity are considered bag-of-tasks
Possible causes
Longer waiting times
Lost tasks (e.g. killed by site due to quota violation)
Resources with poor performance
[Figures: invocations completion rate and job flow for a FIELD-II/pasa simulation (workflow-9SIeNv); x-axis: time (s), y-axis: completed jobs]
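Activity blocked: degree
Degree computed from all completed jobs of the activity
Job phases: setup, inputs download, execution, outputs upload
Assumption: bag-of-tasks (all jobs have equal durations)
Median-based estimation: completed phases keep their real duration, the current phase takes the maximum of its median and its elapsed duration, and the remaining phases take their median
Incident degree: job performance w.r.t. median
$d = \frac{E_i}{M_i + E_i} \in [0,1]$
where $M_i$ is the median duration of job phases and $E_i$ is the estimated job duration
Example
Phase             Median duration   Real job duration   Estimated job duration
setup             50s               42s (completed)     42s
inputs download   250s              300s (completed)    300s
execution         400s              20s (current)       400s*
outputs upload    15s               ?                   15s
Total             M_i = 715s                            E_i = 757s
*: max(400s, 20s) = 400s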
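The slide's worked example can be reproduced with a short Python sketch. It assumes the estimation rule suggested by the example (real durations for completed phases, the larger of median and elapsed time for the current phase, medians for the phases not yet started); the function and key names are hypothetical.

```python
from statistics import median

PHASES = ("setup", "inputs_download", "execution", "outputs_upload")


def median_phase_durations(completed_jobs):
    """Median duration of each phase over the completed jobs of the activity."""
    return {p: median(job[p] for job in completed_jobs) for p in PHASES}


def estimated_duration(medians, real, current_phase):
    """Estimate E_i for a running job from its real and median phase durations."""
    current_index = PHASES.index(current_phase)
    total = 0.0
    for index, phase in enumerate(PHASES):
        if index < current_index:
            total += real[phase]  # phase already completed
        elif index == current_index:
            total += max(medians[phase], real[phase])  # phase in progress
        else:
            total += medians[phase]  # phase not started yet
    return total


def blocked_degree(medians, real, current_phase):
    """Incident degree d = E_i / (M_i + E_i), which lies in [0, 1]."""
    m = sum(medians.values())
    e = estimated_duration(medians, real, current_phase)
    return e / (m + e)


# Values from the slide: medians 50/250/400/15 s, job currently executing.
slide_medians = {"setup": 50, "inputs_download": 250, "execution": 400, "outputs_upload": 15}
slide_real = {"setup": 42, "inputs_download": 300, "execution": 20}
slide_degree = blocked_degree(slide_medians, slide_real, "execution")
```

For the slide's job, M_i = 715 s and E_i = 757 s, giving d = 757 / 1472, roughly 0.51. A job that behaves exactly like the median has d = 0.5, and d approaches 1 as the job falls further behind.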
Activity blocked: levels and actions
Levels: identified from the platform logs
Level 1: no actions
Level 2: action: replicate jobs
Threshold τ1 separates the two levels on the degree axis d
Actions
Job replication
Cancel replicas with bad performance
Replicate only if all active replicas are running
[Figures: replication process for one task; histogram of the degree estimated by median, x-axis: d from 0.0 to 1.0, y-axis: frequency]
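As a rough illustration of how a degree could be turned into a level and then into a replication decision, the sketch below uses a sorted threshold list. The threshold value, the state strings and the boundary handling (a degree equal to a threshold falls in the upper level) are assumptions made for the example; the slides only state that levels come from the platform logs.

```python
from bisect import bisect_right


def incident_level(degree, thresholds):
    """Return the 1-based level: one level higher per threshold reached."""
    return bisect_right(sorted(thresholds), degree) + 1


def should_replicate(level, replica_states):
    """Replicate at level 2 or above, and only if every active replica is running.

    With no active replica at all, `all` is True and replication is allowed.
    """
    return level >= 2 and all(state == "running" for state in replica_states)


# Hypothetical threshold tau_1 = 0.6, chosen only for this example.
TAU_1 = 0.6
```

For instance, a degree of 0.8 with one running replica would trigger a new replica, while the same degree with a replica still queued would not.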
Input/output data errors
Possible causes
File unavailability or non-existence
Degree
Determined from the transfer failure rate
Levels and actions
Input failed transfers (unavailability), thresholds τ1 and τ2
Level 1: no actions
Level 2: action: replicate input files
Level 3: action: stop activity
Output failed transfers, threshold τ1
Level 1: no actions
Level 2: action: stop activity
[Figures: histograms of the input and output failed-transfer degrees; x-axis from 0.0 to 1.0, y-axis: frequency]
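A minimal sketch of the input-error case follows, assuming the degree is simply the fraction of failed transfers and that the three levels map to "no action", "replicate input files" and "stop activity" in that order. The threshold values passed in are hypothetical, since the slides show τ1 and τ2 only graphically.

```python
INPUT_ACTIONS = {
    1: "no action",
    2: "replicate input files",
    3: "stop activity",
}


def transfer_failure_rate(failed, total):
    """Fraction of failed transfers; 0.0 when no transfer was attempted yet."""
    if total == 0:
        return 0.0
    return failed / total


def plan_input_action(failed, total, tau1, tau2):
    """Map the failure rate to a level using two thresholds, then to an action."""
    degree = transfer_failure_rate(failed, total)
    if degree < tau1:
        level = 1
    elif degree < tau2:
        level = 2
    else:
        level = 3
    return level, INPUT_ACTIONS[level]
```

The output-error case would follow the same pattern with a single threshold and "stop activity" as the only action.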
Application error
Possible causes
Corrupted executable, missing dependencies, incompatibility
Degree
Determined from the application failure rate
Levels and actions
Level 1: no actions
Level 2: action: stop activity
Threshold τ1 separates the two levels
[Figure: histogram of the application error degree; x-axis from 0.0 to 1.0, y-axis: frequency]
Site misconfiguration
Possible cause
Sites have utmost failure rate due to site-independent issue
Degree
$d = \max(\phi_1, \phi_2, \ldots, \phi_k) - \mathrm{median}(\phi_1, \phi_2, \ldots, \phi_k)$
where $\phi_i$ denotes the failure rate of jobs running on site $i$
Levels and actions
Level 1 (no actions)
Level 2, Level 3 (thresholds τ1 and τ2)
action: blacklist site by an exponential backoff delay
action: replicate files on sites reachable from problematic site
[Figure: histogram of the degree for input failed transfers, site misconfigured; x-axis from 0.0 to 1.0, y-axis: frequency]
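The degree formula and the backoff action can be illustrated as follows. The per-site failure rates, the base delay and the doubling factor are assumptions picked for the example; the slides only say that the site is blacklisted with an exponential backoff delay.

```python
from statistics import median


def site_misconfiguration_degree(failure_rates):
    """d = max(phi_1..phi_k) - median(phi_1..phi_k) over per-site failure rates."""
    rates = list(failure_rates.values())
    return max(rates) - median(rates)


def worst_site(failure_rates):
    """Site with the highest failure rate, i.e. the blacklisting candidate."""
    return max(failure_rates, key=failure_rates.get)


def backoff_delay(previous_blacklistings, base_delay_s=60.0):
    """Blacklisting delay that doubles each time the site is blacklisted again."""
    return base_delay_s * 2 ** previous_blacklistings


# Hypothetical failure rates for three sites.
example_rates = {"site-a": 0.1, "site-b": 0.2, "site-c": 0.9}
```

A degree close to 0 means that all sites fail at a similar rate, which points away from a single misconfigured site; a degree close to 1 means that one site stands out from the rest.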
Experiments
Goal: Self-Healing vs No-Healing
Experiment 1: cope with recoverable errors
Experiment 2: detect unrecoverable errors
Metrics
Makespan of the activity execution
Resource waste
$w = \frac{(\mathrm{CPU} + \mathrm{data})_{\text{self-healing}}}{(\mathrm{CPU} + \mathrm{data})_{\text{no-healing}}} - 1$
For w < 0: self-healing consumed less resources
For w > 0: self-healing wasted resources
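The resource waste metric is a simple ratio and can be computed as below. The sketch assumes that CPU and data are both expressed in the same unit (for example seconds of CPU time and of data transfer time); the slides do not state the unit, and the input figures in the usage note are invented for illustration.

```python
def resource_waste(cpu_self, data_self, cpu_no, data_no):
    """w = (CPU + data) with self-healing / (CPU + data) without healing - 1."""
    baseline = cpu_no + data_no
    if baseline <= 0:
        raise ValueError("the No-Healing run must have consumed some resources")
    return (cpu_self + data_self) / baseline - 1.0
```

For example, `resource_waste(700, 40, 900, 100)` gives about -0.26, meaning the self-healing run consumed 26% less than the baseline, while a positive value would indicate that replication cost more than it saved.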
Experiment conditions
Software
Virtual Imaging Platform
MOTEUR workflow engine
DIRAC pilot job system
Infrastructure
European Grid Infrastructure (EGI): production, shared
Self-Healing and No-Healing launched simultaneously
Experiment parameters
Task and file replication limited to 5
Failed task resubmission limited to 5
Applications
FIELD-II/pasa
• Ultrasound imaging simulation
• 122 invocations
• CPU Time: 15 min
• ~210 MB
• Data-intensive
Mean-Shift/hs3
• Image denoising
• 250 invocations
• CPU Time: 1 hour
• ~182 MB
• CPU-intensive
Image courtesy of ANR project US-Tagging http://www.creatis.insa-lyon.fr/us-tagging/news (O. Bernard, M. Alessandrini)
Image courtesy of Ting Li http://www.creatis.insa-lyon.fr
Historical information
Used for determining incident levels and association rules
Traces from Virtual Imaging Platform (VIP)
April to August 2011
Workload uniformly distributed
1,082 workflow executions, 36 applications (+ versions), 26 users
1,838 activities, 92,309 invocations, 123,025 jobs, 641,297 events
[Figure: cumulative amount of running activities from April to August 2011]
Results
Experiment 1: tests if recoverable errors are detected
FIELD-II/pasa: speeds up execution up to a factor of 4
Mean-Shift/hs3: speeds up execution up to a factor of 2.6
[Figures: makespan (s) over repetitions 1 to 5, No-Healing vs Self-Healing, one plot per application]
Self-Healing process reduced resource consumption up to 26% when compared to the No-Healing execution
Repetition   w (FIELD-II/pasa)   w (Mean-Shift/hs3)
1            –0.10               –0.02
2            –0.15               –0.20
3            –0.09               –0.02
4            0.05                –0.02
5            –0.26               –0.01
Results (2)
Experiment 1: invocations completion rate for one repetition of FIELD-II/pasa and Mean-Shift/hs3
[Figures: completed jobs vs time (s), Self-Healing and No-Healing; panels: FIELD-II/pasa workflow-j5E0vz, FIELD-II/pasa workflow-hbg12c, Mean-Shift workflow-zCQzYD, Mean-Shift workflow-GhGlW7]
Results (3)
Experiment 2: tests if unrecoverable errors are quickly identified and the execution is stopped
3 different runs
No-Healing execution was stopped after 7 hours
Unrecoverable errors were identified earlier by the healing process
Conclusions
Context
Autonomous handling of incidents in workflow activities
No assumptions on resource characteristics and workloads
Summary of the proposed method
Implements a generic MAPE-K loop
Incident degrees computed online
Degrees quantized into discrete incident levels
Sets of actions are performed according to the incident level
Results
Obtained in production conditions (EGI)
Speeds up execution up to a factor of 4
Reduced resource consumption up to 26%
Proper detection of unrecoverable errors
Future work
Address complete workflow execution
Thank you for your attention. Questions?
ACKNOWLEDGMENTS
VIP users and project members
French National Agency for Research (ANR-09-COSI-03)
European Grid Initiative (EGI)
France-Grilles