fault tolerance and recovery


Upload: ivi

Post on 12-Jan-2016


Page 1: Fault Tolerance and Recovery

Fault Tolerance and Recovery

Mostly taken from www.logos.ic.i.u-tokyo.ac.jp/~horita/study/recovery.ppt

Page 2: Fault Tolerance and Recovery

Contents

• Introduction
• Recovery
• Checkpointing
  • Difficulty of checkpointing
  • Synchronous checkpointing / recovery
  • (Asynchronous checkpointing / recovery)

Page 3: Fault Tolerance and Recovery

Introduction

• Long computation in distributed environments
• High failure rate
  • Host failure (a lot of hosts)
  • Network failure
• One failure may disturb the entire computation
  ⇒ Need to start it again from the beginning: high cost
• Why don't we utilize the previous computation?
  ⇒ Recovery

Page 4: Fault Tolerance and Recovery

Recovery is not easy

Suppose that a parallel computation is running on distributed resources…

for (i = 0; i < MAXITER; i++) {
    local_compute();          // compute at each host
    global_state_exchange();  // communicate with neighbors
}


• process states need to be saved periodically

• after a failure, the other processes usually have to be restored to a previous state as well

• saving and restoring state adds overhead
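A minimal single-process sketch of the first bullet, saving state periodically so a restart can resume instead of starting over. The function names, the checkpoint interval `every`, and the `crash_at` switch are illustrative assumptions, not part of the slides; the sum stands in for `local_compute()`, and there is no message exchange, so none of the distributed problems discussed later arise here.

```python
import os
import pickle
import tempfile


def run(max_iter, ckpt_path, every=10, crash_at=None):
    """Sum 0..max_iter-1, saving (next_i, total) every `every` iterations."""
    # Resume from the last checkpoint if one exists, else start fresh.
    if os.path.exists(ckpt_path):
        with open(ckpt_path, "rb") as f:
            start, total = pickle.load(f)
    else:
        start, total = 0, 0
    for i in range(start, max_iter):
        if crash_at is not None and i == crash_at:
            raise RuntimeError("simulated host failure")
        total += i  # stands in for local_compute()
        if (i + 1) % every == 0:
            # Write to a temporary file and rename it, so a crash in the
            # middle of a write never leaves a corrupt checkpoint behind.
            tmp = ckpt_path + ".tmp"
            with open(tmp, "wb") as f:
                pickle.dump((i + 1, total), f)
            os.replace(tmp, ckpt_path)
    return total


def demo():
    path = os.path.join(tempfile.mkdtemp(), "state.ckpt")
    try:
        run(100, path, crash_at=57)  # fails after the checkpoint at i = 50
    except RuntimeError:
        pass
    return run(100, path)  # resumes from the saved state, not from i = 0
```

The work between the last checkpoint and the failure is redone, which is the overhead trade-off: a shorter interval loses less work but writes state more often.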

Page 5: Fault Tolerance and Recovery


Recovery

Page 6: Fault Tolerance and Recovery

Backward/Forward Error Recovery

• Forward-error recovery
  • Only when it is possible to remove errors
  • Enables processes to move forward
  • Ex) Redundancy, voting
• Backward-error recovery
  • General
  • Restores to a previous error-free state
  • Ex) Checkpoint
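As an illustration of forward-error recovery by redundancy and voting, the following sketch masks a wrong result from one replica by taking the majority value. The function name `vote` and the strict-majority rule are assumptions made for this example.

```python
from collections import Counter


def vote(results):
    """Return the value reported by a strict majority of replicas."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        # Without a majority the error cannot be removed, which is the
        # case where forward-error recovery does not apply.
        raise ValueError("no majority: cannot mask the error")
    return value
```

For instance, `vote([42, 42, 7])` yields 42 and the computation moves forward without any rollback.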

Page 7: Fault Tolerance and Recovery

Backward-error recovery

• operation-based approach
  • Record all modifications of a process' state
• state-based approach
  • Record the complete state at a certain point

Recovery
  Forward-error
  Backward-error
    operation-based
    state-based
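The two backward-error approaches can be contrasted in a small sketch. The `Account` class, its undo log, and the `snapshot`/`restore` helpers are hypothetical names chosen for illustration: the first approach logs every modification so it can be undone, the second copies the whole state at one point.

```python
import copy


class Account:
    def __init__(self):
        self.balance = 0
        self.undo_log = []  # one entry per modification

    def deposit(self, amount):
        # Record the old value before modifying the state.
        self.undo_log.append(("balance", self.balance))
        self.balance += amount

    def undo_to(self, log_length):
        """Undo modifications in reverse order until the log has log_length entries."""
        while len(self.undo_log) > log_length:
            field, old = self.undo_log.pop()
            setattr(self, field, old)


def snapshot(acct):
    """Record the complete state at one point."""
    return copy.deepcopy(acct.__dict__)


def restore(acct, snap):
    """Overwrite the current state with a previously recorded snapshot."""
    acct.__dict__.update(copy.deepcopy(snap))
```

The log costs a little on every modification but can roll back to any point; the snapshot costs nothing between checkpoints but can only return to the points where one was taken.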

Page 8: Fault Tolerance and Recovery

State-based approach

• Terminology
  • checkpointing: the process of saving state
  • checkpoint: the recovery point at which checkpointing occurs
  • rolling back: the process of restoring a process to a prior state

Page 9: Fault Tolerance and Recovery


Checkpointing

Page 10: Fault Tolerance and Recovery

Problems of naïve checkpointing

• Orphan messages and the domino effect
  • Orphan message: a message that makes the state inconsistent, because its receipt is recorded but its sending is not
  • Domino effect: a single rollback forces other processes to roll back in turn
• Lost messages
• Livelocks

Page 11: Fault Tolerance and Recovery

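Orphan message and Domino Effect

[Figure: time diagram of processes X, Y, Z with checkpoints x1, x2, x3, y1, y2, and z1, z2. After a rollback, Y has not yet sent a message that X has already received (orphan message); the resulting chain of rollbacks is the domino effect.]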

Suppose Y fails after sending message m and is rolled back to y2. X has received message m from Y, but Y now has no record of sending it, so m is an orphan message.

If Z is rolled back, all three processes must roll back to their very first recovery points, namely x1, y1, and z1.

Page 12: Fault Tolerance and Recovery

Lost messages

[Figure: time diagram of processes X, Y, Z with checkpoints x1, x2, x3, y1, y2, and z1, z2. After a rollback, X has sent a message that Y can never receive (lost message).]

If Y fails after receiving message m, the system is restored to state {x1, y1}, in which message m is lost because process X is already past the point where it sent m.

Page 13: Fault Tolerance and Recovery

Livelocks

[Figure: time diagram of processes X and Y with checkpoints x1 and y1 and messages m1, n1, and later m2, n2.]

Process Y fails before receiving message n1. When Y rolls back to y1, it has no record of sending message m1, so X must roll back to x1. Y then receives message n1, but X has no record of sending it, so Y must roll back again. This situation can repeat indefinitely.

Page 14: Fault Tolerance and Recovery

Consistency of Checkpoint

• Strongly consistent set of checkpoints: no messages penetrating the set
• Consistent set of checkpoints: no messages penetrating the set backward

[Figure: processes with checkpoints x1, y1, z1 and x2, y2, z2; one set of checkpoints is labeled "strongly consistent" and the other "consistent".]

• A consistent set has no orphan messages and no domino effect, but lost messages still need to be dealt with

Page 15: Fault Tolerance and Recovery

Consistent Set of Checkpoints

• Informally, a cut (global state) in the time diagram is consistent if no arrow starts on the right-hand side and ends on the left-hand side of it.
• C1 is consistent, but C2 is inconsistent.

[Figure: time diagram of processes p, q, r, s with cuts C1 and C2.]
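The informal definition can be checked mechanically. In this sketch, a cut maps each process to the local time of its checkpoint, and each message carries the local send and receive times; the `Msg` type and the function names are assumptions made for illustration. An arrow that starts to the right of the cut and ends to its left is a message received before the receiver's checkpoint but sent after the sender's.

```python
from typing import NamedTuple


class Msg(NamedTuple):
    sender: str
    send_time: int   # local time at the sender
    receiver: str
    recv_time: int   # local time at the receiver


def orphans(cut, msgs):
    """Messages whose receipt is inside the cut but whose sending is not."""
    return [m for m in msgs
            if m.recv_time < cut[m.receiver] and m.send_time >= cut[m.sender]]


def in_transit(cut, msgs):
    """Messages sent inside the cut but not yet received (lost on rollback)."""
    return [m for m in msgs
            if m.send_time < cut[m.sender] and m.recv_time >= cut[m.receiver]]


def is_consistent(cut, msgs):
    return not orphans(cut, msgs)


def is_strongly_consistent(cut, msgs):
    return is_consistent(cut, msgs) and not in_transit(cut, msgs)
```

A cut that is consistent but not strongly consistent is exactly the case where lost messages still have to be handled.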

Page 16: Fault Tolerance and Recovery

Useless Checkpoints

• Q_2 is a useless checkpoint
  • Q_2 records the receipt of m1, but not the sending of m2
  • {P_1, Q_2} cannot be consistent (otherwise m1 would become an orphan)
  • Similarly, {P_2, Q_2} cannot be consistent (otherwise m2 would become an orphan)

Page 17: Fault Tolerance and Recovery

Checkpoint/Recovery Algorithm

• Synchronous: with global synchronization at checkpointing
• Asynchronous: without global synchronization at checkpointing

Page 18: Fault Tolerance and Recovery

Preliminary (Assumptions)

• Goal
  • To make a consistent global checkpoint
• Assumptions
  • Communication channels are FIFO
  • No partition of the network
  • End-to-end protocols cope with message loss due to rollback recovery and communication failure
  • No failure during the execution of the algorithm

~ Synchronous Checkpoint ~

Page 19: Fault Tolerance and Recovery

Preliminary (Two types of checkpoint)

• tentative checkpoint
  • a temporary checkpoint
  • a candidate for a permanent checkpoint
• permanent checkpoint
  • a local checkpoint at a process
  • a part of a consistent global checkpoint

Page 20: Fault Tolerance and Recovery

Checkpoint Algorithm

• Algorithm
  1. An initiating process (a single process that invokes this algorithm) takes a tentative checkpoint.
  2. It requests all the processes to take tentative checkpoints.
  3. It waits to hear from all the processes whether they succeeded in taking a tentative checkpoint.
  4. If it learns that all the processes have succeeded, it decides that all tentative checkpoints should be made permanent; otherwise, that they should be discarded.
  5. It informs all the processes of the decision.
  6. The processes that receive the decision act accordingly.
• Supplement
  • Once a process has taken a tentative checkpoint, it must not send messages until it is informed of the initiator's decision.
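The six steps can be sketched as a two-phase exchange, simulated here with direct method calls instead of real messages. The `Process` class, its fields, and the `can_checkpoint` failure switch are hypothetical; the slide's assumption of no failures during the algorithm itself is kept.

```python
class Process:
    def __init__(self, name, can_checkpoint=True):
        self.name = name
        self.can_checkpoint = can_checkpoint  # hypothetical failure switch
        self.state = 0
        self.tentative = None
        self.permanent = None
        self.sending_blocked = False

    def take_tentative(self):
        if not self.can_checkpoint:
            return False
        self.tentative = self.state
        self.sending_blocked = True  # no sends until the decision arrives
        return True

    def on_decision(self, commit):
        if commit and self.tentative is not None:
            self.permanent = self.tentative
        self.tentative = None
        self.sending_blocked = False


def synchronous_checkpoint(initiator, others):
    # Steps 1-3: the initiator checkpoints, then collects every reply.
    replies = [initiator.take_tentative()]
    replies += [p.take_tentative() for p in others]
    # Step 4: commit only if every process succeeded.
    commit = all(replies)
    # Steps 5-6: everyone learns the decision and acts on it.
    for p in [initiator, *others]:
        p.on_decision(commit)
    return commit
```

Because the decision is all-or-nothing, a single failed tentative checkpoint leaves every process at its previous permanent checkpoint.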

Page 21: Fault Tolerance and Recovery

Diagram of Checkpoint Algorithm

[Figure: the initiator takes a tentative checkpoint and sends requests to take a tentative checkpoint; the other processes take tentative checkpoints and reply OK; the initiator decides to commit and the tentative checkpoints become permanent, forming a new consistent global checkpoint after the previous one. One checkpoint in the figure is marked as unnecessary.]

Page 22: Fault Tolerance and Recovery

Optimized Algorithm

• Each message is labeled in order of sending
• Labeling scheme
  • ⊥: smallest label
  • ⊤: largest label
  • last_label_rcvd_X[Y]: the label of the last message that X received from Y after X took its last permanent or tentative checkpoint. If no such message exists, it contains ⊥.
  • first_label_sent_X[Y]: the label of the first message that X sent to Y after X took its last permanent or tentative checkpoint. If no such message exists, it contains ⊥.
  • ckpt_cohort_X: the set of all processes that may have to take checkpoints when X decides to take a checkpoint.

[Figure: processes X and Y with checkpoints x2, x3 and y1, y2.]

• A checkpoint request needs to be sent only to the processes included in ckpt_cohort

Page 23: Fault Tolerance and Recovery

Optimized Algorithm

• ckpt_cohort_X = { Y | last_label_rcvd_X[Y] > ⊥ }
• Y takes a tentative checkpoint only if
  last_label_rcvd_X[Y] >= first_label_sent_Y[X] > ⊥

[Figure: time lines of X and Y marking last_label_rcvd_X[Y] on X and first_label_sent_Y[X] on Y.]
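The cohort definition and the checkpoint condition translate directly into two small predicates. This sketch assumes labels are positive integers so that 0 can stand for ⊥; the function names are illustrative.

```python
BOTTOM = 0  # assumption: real labels start at 1, so 0 can represent ⊥


def ckpt_cohort(last_label_rcvd):
    """last_label_rcvd maps each sender Y to the label X last received from it."""
    return {y for y, label in last_label_rcvd.items() if label > BOTTOM}


def must_checkpoint(last_label_rcvd_x_from_y, first_label_sent_y_to_x):
    """Y must checkpoint if X received a message Y sent after Y's last checkpoint."""
    return last_label_rcvd_x_from_y >= first_label_sent_y_to_x > BOTTOM
```

With these, `must_checkpoint(2, 1)` holds, while `must_checkpoint(2, 0)` does not, because Y has sent nothing to X since its last checkpoint.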

Page 24: Fault Tolerance and Recovery

Optimized Algorithm

• Algorithm
  1. An initiating process takes a tentative checkpoint.
  2. It requests every p ∈ ckpt_cohort to take a tentative checkpoint (this message includes the sender's last_label_rcvd[receiver]).
  3. If a process that receives the request needs to take a checkpoint, it does the same as steps 1 and 2; otherwise, it returns an OK message.
  4. Each requesting process waits to receive OK from every p ∈ ckpt_cohort.
  5. If the initiator learns that all the processes have succeeded, it decides that all tentative checkpoints should be made permanent; otherwise, that they should be discarded.
  6. It informs every p ∈ ckpt_cohort of the decision.
  7. The processes that receive the decision act accordingly.
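Steps 1 to 3 propagate recursively, which the following sketch simulates with function calls in place of messages. It computes only which processes end up taking a tentative checkpoint; the OK replies and the commit decision are omitted. The dictionary layout and the name `tentative_set` are assumptions, and 0 again stands for ⊥.

```python
BOTTOM = 0  # assumption: real labels start at 1, so 0 can represent ⊥


def tentative_set(initiator, last_rcvd, first_sent):
    """Return the processes that take a tentative checkpoint.

    last_rcvd[p][q]:  label of the last message p received from q
    first_sent[p][q]: label of the first message p sent to q
    Both count from the owner's last checkpoint; a missing entry means ⊥.
    """
    taken = set()

    def request(p):
        taken.add(p)  # step 1: p takes a tentative checkpoint
        for q, label in last_rcvd[p].items():
            if label <= BOTTOM or q in taken:
                continue  # q is not in p's cohort, or has already checkpointed
            # Step 2 carries `label`; step 3 is q's local test.
            if label >= first_sent[q].get(p, BOTTOM) > BOTTOM:
                request(q)

    request(initiator)
    return taken
```

Processes outside the transitive cohort are never contacted, which is where the saving over the basic algorithm comes from.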

Page 25: Fault Tolerance and Recovery

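Diagram of Optimized Algorithm

[Figure: time diagram of processes A, B, C, D exchanging messages ab1, ac1, bd1, dc1, dc2, cb1, ba1, ba2, ac2, cb2, cd1, ca2. Each process that receives a checkpoint request evaluates last_label_rcvd_X[Y] >= first_label_sent_Y[X] > ⊥, with ⊥ written as 0. The evaluations shown are 2 >= 1 > 0 and 2 >= 2 > 0, which hold, and 2 >= 0 > 0, which does not. The processes reply OK, the initiator decides to commit, and the tentative checkpoints become permanent. ckpt_cohort_X = { Y | last_label_rcvd_X[Y] > ⊥ }.]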

Page 26: Fault Tolerance and Recovery

Correctness

• A set of permanent checkpoints taken by this algorithm is consistent
  • No process sends messages after taking a tentative checkpoint until the receipt of the decision
  • New checkpoints include no message from the processes that don't take a checkpoint
  • The set of tentative checkpoints is either entirely made permanent or entirely discarded

Page 27: Fault Tolerance and Recovery

Recovery Algorithm

• Labeling scheme
  • ⊥: smallest label
  • ⊤: largest label
  • last_label_rcvd_X[Y]: the label of the last message that X received from Y after X took its last permanent or tentative checkpoint. If no such message exists, it contains ⊥.
  • first_label_sent_X[Y]: the label of the first message that X sent to Y after X took its last permanent or tentative checkpoint. If no such message exists, it contains ⊥.
  • roll_cohort_X: the set of all processes that may have to roll back to the latest checkpoint when process X rolls back.
  • last_label_sent_X[Y]: the label of the last message that X sent to Y before X took its latest permanent checkpoint. If no such message exists, it contains ⊤.

~ Synchronous Recovery ~

Page 28: Fault Tolerance and Recovery

Recovery Algorithm

• roll_cohort_X = { Y | X can send messages to Y }
• Y will restart from the permanent checkpoint only if
  last_label_rcvd_Y[X] > last_label_sent_X[Y]
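The rollback test is a single comparison. In this sketch, labels are assumed to be non-negative integers, 0 stands for the smallest label, and `float("inf")` stands for the largest label; the function name is illustrative.

```python
BOTTOM = 0           # assumption: stands in for the smallest label
TOP = float("inf")   # assumption: stands in for the largest label


def must_roll_back(last_label_rcvd_y_from_x, last_label_sent_x_to_y):
    """Y must roll back if it received a message that X, after rolling back, never sent."""
    return last_label_rcvd_y_from_x > last_label_sent_x_to_y
```

Using the largest label when X sent nothing before its checkpoint makes the comparison fail for every received label, so Y is not forced back on X's account.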

Page 29: Fault Tolerance and Recovery

Recovery Algorithm

• Algorithm
  1. An initiator requests every p ∈ roll_cohort to prepare to roll back (this message includes the sender's last_label_sent[receiver]).
  2. If a process that receives the request needs to roll back, it does the same as step 1; otherwise, it returns an OK message.
  3. Each requesting process waits to receive OK from every p ∈ roll_cohort.
  4. If the initiator learns that every p ∈ roll_cohort has succeeded, it decides to roll back; otherwise, not to roll back.
  5. It informs every p ∈ roll_cohort of the decision.
  6. The processes that receive the decision act accordingly.

Page 30: Fault Tolerance and Recovery

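Diagram of Synchronous Recovery

[Figure: time diagram of processes A, B, C, D exchanging messages ab1, ac1, bd1, dc1, dc2, cb1, ba1, ba2, ac2, cb2. The initiator sends requests to roll back; each receiver evaluates last_label_rcvd_Y[X] > last_label_sent_X[Y] (the evaluations shown are 0 > 1, 2 > 1, and 0 > ⊤) and replies OK; the initiator then decides to roll back. roll_cohort_X = { Y | X can send messages to Y }.]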

Page 31: Fault Tolerance and Recovery

Drawbacks of Synchronous Approach

• Additional messages are exchanged
• Synchronization delay
• An unnecessary extra load on the system if failures rarely occur

Page 32: Fault Tolerance and Recovery

Asynchronous Checkpoint

• Characteristics
  • Each process takes checkpoints independently
  • No guarantee that a set of local checkpoints is consistent
  • A recovery algorithm has to search for a consistent set of checkpoints
• No additional messages
• No synchronization delay
• Lighter load during normal execution

Page 33: Fault Tolerance and Recovery

Preliminary (Assumptions)

• Goal
  • To find the latest consistent set of checkpoints
• Assumptions
  • Communication channels are FIFO
  • Communication channels are reliable
  • The underlying computation is event-driven

~ Asynchronous Checkpoint / Recovery~

Page 34: Fault Tolerance and Recovery

Asynchronous Checkpointing

• Basic idea
  • Save states and messages sent at each event
  • Volatile logging
  • Each processor notes the number of messages sent to others, and received from others
  • Use counters to detect orphan messages

Page 35: Fault Tolerance and Recovery

Preliminary (Two types of log)

• An event is saved in memory on receipt of each message (volatile log)
• The volatile log is periodically flushed to disk (stable log) ⇔ checkpoint
• volatile log
  • quick access
  • lost if the corresponding processor fails
• stable log
  • slow access
  • not lost even if processors fail
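A small sketch of the two logs: events accumulate in memory and are appended to a file every few events. The `EventLog` class, the JSON-lines file format, and the `flush_every` threshold are assumptions for illustration; `crash()` simulates a processor failure by discarding the in-memory part.

```python
import json
import os


class EventLog:
    def __init__(self, path, flush_every=3):
        self.path = path
        self.flush_every = flush_every
        self.volatile = []  # in memory: quick access, lost on failure

    def record(self, event):
        self.volatile.append(event)
        if len(self.volatile) >= self.flush_every:
            self.flush()

    def flush(self):
        """Append the volatile log to the stable log on disk (a checkpoint)."""
        with open(self.path, "a") as f:
            for event in self.volatile:
                f.write(json.dumps(event) + "\n")
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to the device
        self.volatile.clear()

    def crash(self):
        """Simulate a processor failure: everything in memory is lost."""
        self.volatile.clear()

    def stable(self):
        """Read back the events that survived on disk."""
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            return [json.loads(line) for line in f]
```

After a crash, only the events written by the last flush are available, which is why the failed process restarts from its latest checkpoint while the others can keep their latest event.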

Page 36: Fault Tolerance and Recovery

Preliminary (Definition)

• Definition
  • CkPt_i: the checkpoint (stable log) that i rolls back to when a failure occurs
  • RCVD_{i←j}(CkPt_i / e): the number of messages received by processor i from processor j, per the information stored in the checkpoint CkPt_i or event e
  • SENT_{i→j}(CkPt_i / e): the number of messages sent by processor i to processor j, per the information stored in the checkpoint CkPt_i or event e

Page 37: Fault Tolerance and Recovery

Recovery Algorithm

• Algorithm
  1. When a process crashes, it recovers to its latest checkpoint CkPt.
  2. It broadcasts a message announcing that it has failed. The others receive this message and roll back to their latest event.
  3. Each process sends SENT(CkPt) to its neighboring processes.
  4. Each process waits for SENT(CkPt) messages from every neighbor.
  5. On receiving SENT_{j→i}(CkPt_j) from j, if i notices that RCVD_{i←j}(CkPt_i) > SENT_{j→i}(CkPt_j), it rolls back to the latest event e such that RCVD_{i←j}(e) = SENT_{j→i}(CkPt_j).
  6. Repeat steps 3, 4, and 5 N times (N is the number of processes).
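The iteration in steps 3 to 6 can be simulated on logged counters. In this sketch, every process is treated as a neighbor of every other, each logged event stores cumulative `sent` and `rcvd` counts per peer, and index 0 of each history is the initial state with all counters at zero; these representation choices and the name `recover` are assumptions. A process with too many receipts steps back one event at a time until its count no longer exceeds what the sender reports.

```python
def recover(history, start):
    """Return the event index each process ends up at after recovery.

    history[p]: list of events for p, each {"sent": {q: n}, "rcvd": {q: n}}
                with cumulative counters; history[p][0] has all zeros.
    start[p]:   index p restarts from (the latest checkpoint for the failed
                process, the latest event for the others).
    """
    current = dict(start)
    procs = list(history)
    for _ in range(len(procs)):  # step 6: N rounds
        # Steps 3-4: everyone reports how many messages it has sent to
        # each peer, according to the state it is currently restored to.
        sent = {(j, i): history[j][current[j]]["sent"].get(i, 0)
                for j in procs for i in procs if i != j}
        # Step 5: undo events until no receipt lacks a matching send.
        for i in procs:
            for j in procs:
                if i == j:
                    continue
                while history[i][current[i]]["rcvd"].get(j, 0) > sent[(j, i)]:
                    current[i] -= 1
    return current
```

Several rounds are needed because one rollback can turn previously valid receipts at another process into orphans, which only show up when the counters are exchanged again.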

Page 38: Fault Tolerance and Recovery

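Asynchronous Recovery

[Figure: time diagram of processes X, Y, Z with events Ex0 to Ex3, Ey0 to Ey3, and Ez0 to Ez2, checkpoints x1, y1, z1, and counter messages (Y,2), (Y,1), (X,2), (X,0), (Z,0), (Z,1). Each process checks RCVD_{i←j}(CkPt_i) <= SENT_{j→i}(CkPt_j) for each neighbor. The comparisons shown for the pairs X:Y, X:Z, Y:X, Y:Z, Z:X, Z:Y are 3 <= 2, 2 <= 2, 0 <= 0, 1 <= 2, 1 <= 1, 0 <= 0, 2 <= 1, 1 <= 1.]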