Revisiting Failure Detection for Grid Systems
TRANSCRIPT
Xavier Défago
1) School of Information Science, Japan Adv. Inst. of Science & Tech. (JAIST)
2) PRESTO, Japan Science & Tech. Agency (JST)
IFIP WG 10.4 – Summer 2005 meeting – July 2005, Hakone, Japan.
Acknowledgements
• Naohiro HAYASHIBARA
• now at Tokyo Denki University
• Péter URBÁN
• Rami YARED
• Takuya KATAYAMA
• ... and many people through enlightening discussions
Related Projects
• COE program “Trustworthy e-Society”
• PRESTO, JST “Information & Systems”
• Jinzai Yosei “Dependable Internet”
• OBIGrid
• Bioinformatics Grid; RIKEN & AIST
• StarBED Internet Emulator
• OurGrid, PlanetLab.
Grid Systems
• What Grid?
• Data-G, computational-G, domain-G, ..., *-Grid
• What is the/a Grid?
• Structured Internet?
• Loosely coupled global / enterprise network?
• Decentralized distributed OS?
• Key point
• Virtualization of resources, ...
• “Glue” between resources: i.e., distributed system
Grid Systems & Fault-Tolerance
• Needs
• 24/7 operation,
• reliability & availability,
• self-managing, auto-configuration,...
• security, accountability,...
• Current Reality
• ... a LOOOONG way to go!
Failure Detection in Grid
• Failure detection
• ability to detect failed components
• prevents blocking forever
• basic mechanism for fault-tolerance
• Failure detection as service
• E.g., [Stelling et al. 1998], [van Renesse et al. 1998],...
• E.g., NTP for clock synchronization
Failure Detection as Service
• Current situation
• ad hoc detection rather than service
• hardcoded timeouts in programs
• hidden behind heavy abstractions
• “proprietary” mechanisms
• Open challenges (highly opinionated)
• proper abstractions, QoS negotiation
• unattended management
• reduction of overhead, scalability
Simultaneous Independent Requirements
• Large-scale systems
• Many distributed applications running simultaneously
• Different requirements
[Figure: hosts p, q, and r each run several applications (p1–p2, q1–q4, r1–r2), each with its own failure-detection requirements]
Example / Motivation
• Simple case
• “Bag-of-Tasks” computations
• Dispatch tasks
• Wait for results
• Environment
• Partial failures
• Heterogeneous
• Unpredictable comm.
Usage Patterns
• Case 1:
• Cost of a failure varies with time:
• amount of work completed
• available resources
• Case 2:
• Important task
• dispatch it to the machine most likely to stay up
Abstractions
Accrual Failure Detectors
• Accrual failure detection [Hayashibara; PhD 2004]
• 2 roles: monitoring, interpretation
• interpretation → QoS
• ⇒ decoupling
[Figure: a binary FD performs both monitoring and interpretation inside the failure detection service and delivers suspicions to each application's action; an accrual FD delivers a suspicion level and moves interpretation (possibly parametric) into the applications]
• Accrual FD abstraction [Défago et al.; DSN 2005]
• combines different QoS requirements
• properties; relation with failure-detector theory
[Graph: suspicion level sl_qp(t) as a function of time t]
Accrual Failure Detectors
[Graph: applying two thresholds T1(t) and T2(t) to the suspicion level yields different trust/suspect outputs, with detection times D_T1 and D_T2]
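The decoupling can be sketched in Python (class and method names here are illustrative, not the interface from the cited work): the monitoring side exposes only a raw suspicion level, and each application interprets it with its own threshold, so several QoS requirements coexist over a single monitoring stream.

```python
class AccrualFD:
    """Monitoring side: exposes a raw suspicion level, makes no suspect/trust decision."""
    def __init__(self):
        self.level = 0.0

    def suspicion_level(self) -> float:
        return self.level


class ThresholdInterpreter:
    """Application side: turns the shared suspicion level into a binary suspect/trust output."""
    def __init__(self, fd: AccrualFD, threshold: float):
        self.fd = fd
        self.threshold = threshold

    def suspects(self) -> bool:
        return self.fd.suspicion_level() > self.threshold


fd = AccrualFD()
conservative = ThresholdInterpreter(fd, threshold=8.0)  # few mistakes, slower detection
aggressive = ThresholdInterpreter(fd, threshold=1.0)    # fast detection, more mistakes

fd.level = 3.0  # set directly here; a real monitor would update it from heartbeats
print(aggressive.suspects(), conservative.suspects())  # True False
```

The same suspicion level thus satisfies two different requirement sets at once, which a binary FD with one hardcoded timeout cannot do.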
Chen FD as Accrual
• Chen-based adaptation [Chen et al.; TC 2002]
• After the freshness point, the suspicion level increases with time
• Reset when a heartbeat is received
• Safety margin set via the threshold
[Graph: heartbeats from q reset sl_qp(t); after q crashes, the suspicion level grows without bound]
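A minimal sketch of this behavior, assuming the freshness point is simply the last arrival plus the expected heartbeat interval (the actual estimator of Chen et al. is more elaborate; the application's threshold plays the role of the safety margin):

```python
class ChenAccrual:
    """Chen-style FD recast as accrual: suspicion grows past the freshness point.

    Simplified sketch: the freshness point is the last heartbeat arrival plus
    the expected inter-arrival time; any heartbeat resets the suspicion level.
    """
    def __init__(self, expected_interval: float):
        self.expected_interval = expected_interval
        self.last_arrival = None

    def heartbeat(self, now: float) -> None:
        self.last_arrival = now  # reset on every heartbeat

    def suspicion_level(self, now: float) -> float:
        if self.last_arrival is None:
            return 0.0
        freshness_point = self.last_arrival + self.expected_interval
        # zero before the freshness point, then grows linearly with time
        return max(0.0, now - freshness_point)


fd = ChenAccrual(expected_interval=0.1)
fd.heartbeat(now=10.0)
print(fd.suspicion_level(now=10.05))  # 0.0: before the freshness point
print(fd.suspicion_level(now=10.30))  # ~0.2: growing since the freshness point
```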
φ Accrual FD
• φ failure detector [Hayashibara et al.; SRDS 2004]
• Heartbeat based, estimate arrival distribution
[Figure: heartbeat arrivals feed a sampling window used to estimate the arrival distribution; from the last arrival time T_last, the FD outputs φ = −log10 P_later(t_now − T_last), and each application suspects when φ exceeds its own threshold (App. 1 suspects when φ > Φ1, App. 2 when φ > Φ2, App. 3 performs action(φ) directly)]
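The φ computation can be sketched as follows, assuming normally distributed inter-arrival times for simplicity; class and parameter names are illustrative, not those of the SRDS 2004 implementation:

```python
import math
from collections import deque


class PhiAccrual:
    """Sketch of a phi-style accrual FD: estimate the inter-arrival distribution
    over a sliding sampling window; output phi = -log10 P_later(t_now - T_last)."""

    def __init__(self, window: int = 1000):
        self.intervals = deque(maxlen=window)  # sampling window of inter-arrival times
        self.last_arrival = None

    def heartbeat(self, now: float) -> None:
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now: float) -> float:
        if len(self.intervals) < 2:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        var = sum((x - mean) ** 2 for x in self.intervals) / len(self.intervals)
        std = max(math.sqrt(var), 1e-6)  # floor to avoid division by zero
        elapsed = now - self.last_arrival
        # P_later: probability that a heartbeat arrives even later than `elapsed`,
        # i.e. the upper tail of the assumed normal distribution
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        return -math.log10(max(p_later, 1e-300))


fd = PhiAccrual()
for i in range(100):          # heartbeats every 100 ms (no jitter, for determinism)
    fd.heartbeat(now=i * 0.1)
print(fd.phi(now=9.95))   # low phi: still within the usual inter-arrival time
print(fd.phi(now=12.0))   # high phi: heartbeat long overdue
```

Because φ accrues continuously, the same monitor serves both a cautious application (large Φ) and an impatient one (small Φ) without re-tuning.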
QoS of Failure Detectors
QoS of Failure Detectors
• Metrics (when p is faulty):
• detection time
[Graph: the monitored process crashes; the detection time is the delay until the FD output switches from trust to suspect]
QoS of Failure Detectors
• Metrics (accuracy, when p is correct):
• average mistake rate
• query accuracy prob.
• good period duration
[Graph: while the monitored process stays up, each wrongful suspicion in the FD output counts as a mistake]
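These accuracy metrics can be computed directly from a trace of the FD's output while the monitored process is correct; the trace format below (a list of (time, suspects) state changes) is an assumption for illustration:

```python
def mistake_rate(transitions, duration):
    """Average mistake rate while the monitored process is correct:
    wrongful trust->suspect transitions per unit of time."""
    mistakes = sum(1 for _, suspects in transitions if suspects)
    return mistakes / duration


def query_accuracy(transitions, duration):
    """Query accuracy probability: fraction of time a query would find
    the FD (correctly) trusting the process."""
    suspect_time = 0.0
    for (t0, suspects), (t1, _) in zip(transitions, transitions[1:]):
        if suspects:
            suspect_time += t1 - t0
    return 1.0 - suspect_time / duration


# FD output over 100 s of monitoring a correct process: (time, suspects) changes
trace = [(0.0, False), (40.0, True), (41.0, False), (70.0, True), (70.5, False)]
print(mistake_rate(trace, duration=100.0))    # 2 mistakes / 100 s = 0.02
print(query_accuracy(trace, duration=100.0))  # 1 - 1.5/100 = 0.985
```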
Requirements vs. Guarantees
• Application requirements
• required QoS {D, A}: max. detection time, max. mistake rate
• FD QoS
• effective QoS {d, a}: effective detection time, effective mistake rate
[Graph: in the detection-time vs. (in)accuracy plane, the effective QoS {d, a} must satisfy the required {D, A}; the minimum network latency bounds what any FD can achieve]
In a Perfect World
• Ideal
• FD limited only by the minimum network latency
• “acceptable” network/system load
[Graph: required QoS {D, A} in the detection-time vs. (in)accuracy plane, bounded by the minimum network latency]
In a Perfect World
• Perfect FD
• “realistic” detection time
• absolute accuracy (no mistakes)
• (some failure types can be detected perfectly)
[Graph: a perfect failure detector achieves absolute accuracy at a realistic detection time, bounded by the minimum network latency]
In a Less Perfect World
• Unreliable FDs
• “realistic” detection latency
• imperfect accuracy
[Graph: an unreliable failure detector occupies a QoS point {d, a} with nonzero inaccuracy, above the minimum network latency]
Parametric Failure Detector
• Parametric FDs
• parameter value defines the FD's best QoS
• E.g., Chen FD,...
• Tradeoff: accuracy <-> detection latency
[Graph: changing the parameter moves the FD's QoS point from {d, a} to {d', a'} along an accuracy/detection-time tradeoff curve]
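The tradeoff can be illustrated with the simplest parametric FD, a fixed-timeout detector: sweeping the timeout over a jittery heartbeat trace moves the QoS point between fast-but-inaccurate and slow-but-accurate. The data and the QoS accounting below are synthetic and purely illustrative.

```python
import random


def qos_for_timeout(timeout, intervals):
    """For a simple timeout-based FD: a larger timeout means slower detection
    but fewer wrongful suspicions on a trace of actual inter-arrival times."""
    mistakes = sum(1 for gap in intervals if gap > timeout)  # premature suspicions
    detection_time = timeout  # time to suspect after the last heartbeat of a crashed process
    return detection_time, mistakes


random.seed(1)
# 10,000 heartbeat inter-arrival times: nominally 100 ms with Gaussian jitter
intervals = [random.gauss(0.10, 0.03) for _ in range(10_000)]

for timeout in (0.10, 0.15, 0.20):
    d, m = qos_for_timeout(timeout, intervals)
    print(f"timeout={timeout:.2f}s -> detection time {d:.2f}s, {m} mistakes")
```

Each timeout value yields one {d, a} point; the set of points swept out as the parameter varies is exactly the tradeoff curve of the slide.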
QoS Coverage
• Coverage of an FD
• the FD could be tuned to support the application's requirements
• a measure characterizing the FD
[Graph: the coverage of an FD is the region of the detection-time vs. (in)accuracy plane whose requirements it can be tuned to meet]
Dynamic QoS Coverage
• Approximate coverage
• Instantiate several QoS sets
• Find minimal set; minimal change
[Graph: coverage approximated by several instantiated QoS points in the detection-time vs. (in)accuracy plane]
Experimentation
Comparative Analyses
• 3 FD implementations
• Chen FD ; [Chen et al.] (FTCS 2000; TC 2002)
• Bertier FD ; [Bertier et al.] (DSN 2002)
• PHI accrual FD ; [Hayashibara et al.] (SRDS 2004)
• Goal
• “Realistic” executions (e.g., LAN, WAN)
• Identify QoS coverage
Experimentation: LAN
• LAN
• single FastEther hub
• Parameters
• HB interval: 20 ms
• Duration: 5½ hours
• Total HB: 1,000,000
• no loss
[Plot: mistake rate [1/s] (log scale, 10^-5 to 1) vs. detection time [s] (0 to 1.4) for the φ accrual FD, Chen FD, and Bertier FD on the LAN]
Experimentation: WAN
• WAN
• JAIST (JP) – EPFL (CH)
• Parameters
• HB interval: 100 ms
• Duration: 1 week
• Total HB: ~6,000,000
[Plot: mistake rate [1/s] (log scale, 0.001 to 0.4) vs. detection time [s] (0 to 2.5) for the φ accrual FD, Chen FD, and Bertier FD on the WAN]
Experimentation: WAN
[Plots: heartbeat losses over Apr 3–9, 2004 (UTC), and a histogram of loss-burst lengths (up to ~25 lost messages per burst, the most frequent lengths occurring several hundred times); the trace is annotated with the W32/Netsky.T@mm worm]
Wrapping Up
Conclusion
• Ongoing work
• Translucent abstractions
• Improved implementations
• Wider experimentation
• QoS negotiation
• Much work to do...
• Self-configuration
• Low-overhead protocols
• Notification mechanisms
Future Directions
• QoS Coverage
• stricter definition
• gradients (uncertainty)
• QoS negotiation
• dynamic (re-)negotiation
• prob./best-effort negotiation
• fail-safe enforcement
Future Directions
• Other environments
• E.g., wireless, dial-up,...
• Characterize traffic
• metrics
• clustering
• “benchmarking” sets
[Graph: traffic traces characterized along latency and entropy axes]
[Plot: traffic traces over Apr 3–9 (UTC) for wireless, WAN, and LAN environments]