Revisiting Failure Detection for Grid Systems
TRANSCRIPT
Xavier Défago
1) School of Information Science, Japan Adv. Inst. of Science & Tech. (JAIST)
2) PRESTO, Japan Science & Tech. Agency (JST)
IFIP WG 10.4 – Summer 2005 meeting – July 2005, Hakone, Japan.
Acknowledgements
• Naohiro HAYASHIBARA
• now at Tokyo Denki University
• Péter URBÁN
• Rami YARED
• Takuya KATAYAMA
• ... and many people through enlightening discussions
Related Projects
• COE program “Trustworthy e-Society”
• PRESTO, JST “Information & Systems”
• Jinzai Yosei “Dependable Internet”
• OBIGrid
• Bioinformatics Grid; RIKEN & AIST
• StarBED Internet Emulator
• OurGrid, PlanetLab.
Grid Systems
• What Grid?
• Data-G, computational-G, domain-G, ..., *-Grid
• What is the/a Grid?
• Structured Internet?
• Loosely coupled global / enterprise network?
• Decentralized distributed OS?
• Key point
• Virtualization of resources, ...
• “Glue” between resources: i.e., distributed system
Grid Systems & Fault-Tolerance
• Needs
• 24/7 operation,
• reliability & availability,
• self-managing, auto-configuration,...
• security, accountability,...
• Current Reality
• ... a LOOOONG way to go!
Failure Detection in Grid
• Failure detection
• ability to detect failed components
• prevents blocking forever
• basic mechanism for fault-tolerance
• Failure detection as service
• E.g., [Stelling et al. 1998], [van Renesse et al. 1998],...
• E.g., NTP for clock synchronization
Failure Detection as Service
• Current situation
• ad hoc detection rather than service
• hardcoded timeouts in programs
• hidden behind heavy abstractions
• “proprietary” mechanisms
• Open challenges (highly opinionated)
• proper abstractions, QoS negotiation
• unattended management
• reduction of overhead, scalability
Simultaneous Independent Requirements
• Large-scale systems
• Many distributed applications running simultaneously
• Different requirements
[Figure: hosts p, q, and r each run several applications (p1–p2, q1–q4, r1–r2), each with its own failure-detection requirements]
Example / Motivation
• Simple case
• “Bag-of-Tasks” computations
• Dispatch tasks
• Wait for results
• Environment
• Partial failures
• Heterogeneous
• Unpredictable comm.
Usage Patterns
• Case 1:
• Cost of a failure varies with time:
• amount of work completed
• available resources
• Case 2:
• Important task
• dispatch it to the machine most likely to stay up
Abstractions
Accrual Failure Detectors
• Accrual failure detection [Hayashibara; PhD 2004]
• 2 roles: monitoring, interpretation
• interpretation → QoS
• ⇒ decoupling
[Figure: a binary FD performs both monitoring and interpretation inside the failure detection service and delivers suspicions to each application's action; an accrual FD delivers a suspicion level and moves interpretation (possibly parametric) into the applications]
• Accrual FD abstraction [Défago et al.; DSN 2005]
• combines different QoS requirements
• properties; relation with failure-detector theory
[Graph: suspicion level sl_qp(t) as a function of time t]
Accrual Failure Detectors
[Graph: applying two thresholds T1(t) and T2(t) to the suspicion level yields different trust/suspect outputs, with detection times D_T1 and D_T2]
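The decoupling can be sketched in Python (class and method names here are illustrative, not the interface from the cited work): the monitoring side exposes only a raw suspicion level, and each application interprets it with its own threshold, so several QoS requirements coexist over a single monitoring stream.

```python
class AccrualFD:
    """Monitoring side: exposes a raw suspicion level, makes no suspect/trust decision."""
    def __init__(self):
        self.level = 0.0

    def suspicion_level(self) -> float:
        return self.level


class ThresholdInterpreter:
    """Application side: turns the shared suspicion level into a binary suspect/trust output."""
    def __init__(self, fd: AccrualFD, threshold: float):
        self.fd = fd
        self.threshold = threshold

    def suspects(self) -> bool:
        return self.fd.suspicion_level() > self.threshold


fd = AccrualFD()
conservative = ThresholdInterpreter(fd, threshold=8.0)  # few mistakes, slower detection
aggressive = ThresholdInterpreter(fd, threshold=1.0)    # fast detection, more mistakes

fd.level = 3.0  # set directly here; a real monitor would update it from heartbeats
print(aggressive.suspects(), conservative.suspects())  # True False
```

The same suspicion level thus satisfies two different requirement sets at once, which a binary FD with one hardcoded timeout cannot do.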
Chen FD as Accrual
• Chen-based adaptation [Chen et al.; TC 2002]
• After the freshness point, the suspicion level increases with time
• Reset when a heartbeat is received
• Safety margin set via the threshold
[Graph: heartbeats from q reset sl_qp(t); after q crashes, the suspicion level grows without bound]
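A minimal sketch of this behavior, assuming the freshness point is simply the last arrival plus the expected heartbeat interval (the actual estimator of Chen et al. is more elaborate; the application's threshold plays the role of the safety margin):

```python
class ChenAccrual:
    """Chen-style FD recast as accrual: suspicion grows past the freshness point.

    Simplified sketch: the freshness point is the last heartbeat arrival plus
    the expected inter-arrival time; any heartbeat resets the suspicion level.
    """
    def __init__(self, expected_interval: float):
        self.expected_interval = expected_interval
        self.last_arrival = None

    def heartbeat(self, now: float) -> None:
        self.last_arrival = now  # reset on every heartbeat

    def suspicion_level(self, now: float) -> float:
        if self.last_arrival is None:
            return 0.0
        freshness_point = self.last_arrival + self.expected_interval
        # zero before the freshness point, then grows linearly with time
        return max(0.0, now - freshness_point)


fd = ChenAccrual(expected_interval=0.1)
fd.heartbeat(now=10.0)
print(fd.suspicion_level(now=10.05))  # 0.0: before the freshness point
print(fd.suspicion_level(now=10.30))  # ~0.2: growing since the freshness point
```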
φ Accrual FD
• φ failure detector [Hayashibara et al.; SRDS 2004]
• Heartbeat based, estimate arrival distribution
[Figure: heartbeat arrivals feed a sampling window used to estimate the arrival distribution; from the last arrival time T_last, the FD outputs φ = −log10 P_later(t_now − T_last), and each application suspects when φ exceeds its own threshold (App. 1 suspects when φ > Φ1, App. 2 when φ > Φ2, App. 3 performs action(φ) directly)]
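The φ computation can be sketched as follows, assuming normally distributed inter-arrival times for simplicity; class and parameter names are illustrative, not those of the SRDS 2004 implementation:

```python
import math
from collections import deque


class PhiAccrual:
    """Sketch of a phi-style accrual FD: estimate the inter-arrival distribution
    over a sliding sampling window; output phi = -log10 P_later(t_now - T_last)."""

    def __init__(self, window: int = 1000):
        self.intervals = deque(maxlen=window)  # sampling window of inter-arrival times
        self.last_arrival = None

    def heartbeat(self, now: float) -> None:
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now: float) -> float:
        if len(self.intervals) < 2:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        var = sum((x - mean) ** 2 for x in self.intervals) / len(self.intervals)
        std = max(math.sqrt(var), 1e-6)  # floor to avoid division by zero
        elapsed = now - self.last_arrival
        # P_later: probability that a heartbeat arrives even later than `elapsed`,
        # i.e. the upper tail of the assumed normal distribution
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        return -math.log10(max(p_later, 1e-300))


fd = PhiAccrual()
for i in range(100):          # heartbeats every 100 ms (no jitter, for determinism)
    fd.heartbeat(now=i * 0.1)
print(fd.phi(now=9.95))   # low phi: still within the usual inter-arrival time
print(fd.phi(now=12.0))   # high phi: heartbeat long overdue
```

Because φ accrues continuously, the same monitor serves both a cautious application (large Φ) and an impatient one (small Φ) without re-tuning.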
QoS of Failure Detectors
QoS of Failure Detectors
• Metrics (when p is faulty):
• detection time
[Graph: the monitored process crashes; the detection time is the delay until the FD output switches from trust to suspect]
QoS of Failure Detectors
• Metrics (accuracy, when p is correct):
• average mistake rate
• query accuracy prob.
• good period duration
[Graph: while the monitored process stays up, each wrongful suspicion in the FD output counts as a mistake]
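These accuracy metrics can be computed directly from a trace of the FD's output while the monitored process is correct; the trace format below (a list of (time, suspects) state changes) is an assumption for illustration:

```python
def mistake_rate(transitions, duration):
    """Average mistake rate while the monitored process is correct:
    wrongful trust->suspect transitions per unit of time."""
    mistakes = sum(1 for _, suspects in transitions if suspects)
    return mistakes / duration


def query_accuracy(transitions, duration):
    """Query accuracy probability: fraction of time a query would find
    the FD (correctly) trusting the process."""
    suspect_time = 0.0
    for (t0, suspects), (t1, _) in zip(transitions, transitions[1:]):
        if suspects:
            suspect_time += t1 - t0
    return 1.0 - suspect_time / duration


# FD output over 100 s of monitoring a correct process: (time, suspects) changes
trace = [(0.0, False), (40.0, True), (41.0, False), (70.0, True), (70.5, False)]
print(mistake_rate(trace, duration=100.0))    # 2 mistakes / 100 s = 0.02
print(query_accuracy(trace, duration=100.0))  # 1 - 1.5/100 = 0.985
```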
Requirements vs. Guarantees
• Application requirements
• required QoS {D, A}: max. detection time, max. mistake rate
• FD QoS
• effective QoS {d, a}: effective detection time, effective mistake rate
[Graph: in the detection-time vs. (in)accuracy plane, the effective QoS {d, a} must satisfy the required {D, A}; the minimum network latency bounds what any FD can achieve]
In a Perfect World
• Ideal
• FD limited only by the minimum network latency
• “acceptable” network/system load
[Graph: required QoS {D, A} in the detection-time vs. (in)accuracy plane, bounded by the minimum network latency]
In a Perfect World
• Perfect FD
• “realistic” detection time
• absolute accuracy (no mistakes)
• (some failure types can be detected perfectly)
[Graph: a perfect failure detector achieves absolute accuracy at a realistic detection time, bounded by the minimum network latency]
In a Less Perfect World
• Unreliable FDs
• “realistic” detection latency
• imperfect accuracy
[Graph: an unreliable failure detector occupies a QoS point {d, a} with nonzero inaccuracy, above the minimum network latency]
Parametric Failure Detector
• Parametric FDs
• parameter value defines the FD's best QoS
• E.g., Chen FD,...
• Tradeoff: accuracy <-> detection latency
[Graph: changing the parameter moves the FD's QoS point from {d, a} to {d', a'} along an accuracy/detection-time tradeoff curve]
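The tradeoff can be illustrated with the simplest parametric FD, a fixed-timeout detector: sweeping the timeout over a jittery heartbeat trace moves the QoS point between fast-but-inaccurate and slow-but-accurate. The data and the QoS accounting below are synthetic and purely illustrative.

```python
import random


def qos_for_timeout(timeout, intervals):
    """For a simple timeout-based FD: a larger timeout means slower detection
    but fewer wrongful suspicions on a trace of actual inter-arrival times."""
    mistakes = sum(1 for gap in intervals if gap > timeout)  # premature suspicions
    detection_time = timeout  # time to suspect after the last heartbeat of a crashed process
    return detection_time, mistakes


random.seed(1)
# 10,000 heartbeat inter-arrival times: nominally 100 ms with Gaussian jitter
intervals = [random.gauss(0.10, 0.03) for _ in range(10_000)]

for timeout in (0.10, 0.15, 0.20):
    d, m = qos_for_timeout(timeout, intervals)
    print(f"timeout={timeout:.2f}s -> detection time {d:.2f}s, {m} mistakes")
```

Each timeout value yields one {d, a} point; the set of points swept out as the parameter varies is exactly the tradeoff curve of the slide.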
QoS Coverage
• Coverage of an FD
• the FD could be tuned to support the application's requirements
• a measure characterizing the FD
[Graph: the coverage of an FD is the region of the detection-time vs. (in)accuracy plane whose requirements it can be tuned to meet]
Dynamic QoS Coverage
• Approximate coverage
• Instantiate several QoS sets
• Find minimal set; minimal change
[Graph: coverage approximated by several instantiated QoS points in the detection-time vs. (in)accuracy plane]
Experimentation
Comparative Analyses
• 3 FD implementations
• Chen FD ; [Chen et al.] (FTCS 2000; TC 2002)
• Bertier FD ; [Bertier et al.] (DSN 2002)
• PHI accrual FD ; [Hayashibara et al.] (SRDS 2004)
• Goal
• “Realistic” executions (e.g., LAN, WAN)
• Identify QoS coverage
Experimentation: LAN
• LAN
• single FastEther hub
• Parameters
• HB interval: 20 ms
• Duration: 5½ hours
• Total HB: 1,000,000
• no loss
[Plot: mistake rate [1/s] (log scale, 10^-5 to 1) vs. detection time [s] (0 to 1.4) for the φ accrual FD, Chen FD, and Bertier FD on the LAN]
Experimentation: WAN
• WAN
• JAIST (JP) – EPFL (CH)
• Parameters
• HB interval: 100 ms
• Duration: 1 week
• Total HB: ~6,000,000
[Plot: mistake rate [1/s] (log scale, 0.001 to 0.4) vs. detection time [s] (0 to 2.5) for the φ accrual FD, Chen FD, and Bertier FD on the WAN]
Experimentation: WAN
[Plots: heartbeat losses over Apr 3–9, 2004 (UTC), and a histogram of loss-burst lengths (up to ~25 lost messages per burst, the most frequent lengths occurring several hundred times); the trace is annotated with the W32/Netsky.T@mm worm]
Wrapping Up
Conclusion
• Ongoing work
• Translucent abstractions
• Improved implementations
• Wider experimentation
• QoS negotiation
• Much work to do...
• Self-configuration
• Low-overhead protocols
• Notification mechanisms
Future Directions
• QoS Coverage
• stricter definition
• gradients (uncertainty)
• QoS negotiation
• dynamic (re-)negotiation
• prob./best-effort negotiation
• fail-safe enforcement
Future Directions
• Other environments
• E.g., wireless, dial-up,...
• Characterize traffic
• metrics
• clustering
• “benchmarking” sets
[Graph: traffic traces characterized along latency and entropy axes]
[Plot: traffic traces over Apr 3–9 (UTC) for wireless, WAN, and LAN environments]