
Distributed Systems CS 15-440

Fault Tolerance- Part III

Lecture 15, Oct 26, 2011

Majd F. Sakr, Mohammad Hammoud and Vinay Kolar

1

Today…

Last session Fault Tolerance – Part II

Reliable communication

Today’s session Fault Tolerance – Part III

Reliable communication

Atomicity

Recovery

Announcement: Project 3 has been posted. DR is due on Monday Oct 31

2

Objectives

Discussion on Fault Tolerance

General background on fault tolerance

Process resilience, failure detection and reliable communication

Atomicity and distributed commit protocols

Recovery from failures


Reliable Communication

5

Reliable Communication

Reliable Request-Reply Communication

Reliable Group Communication

Reliable Group Communication

A Basic Reliable-Multicasting Scheme

Scalability in Reliable Multicasting

Atomic Multicast

6

Reliable Group Communication

A Basic Reliable-Multicasting Scheme (Recap)

Scalability in Reliable Multicasting

Atomic Multicast

7

Recap: Reliable Multicasting with Feedback Messages

Figure: a sender multicasts message M25, which it keeps in its history buffer, to four receivers over the network. Three receivers (with Last = 24) return ACK 25; the receiver with Last = 23 reports that it missed message 24, and the sender retransmits it from the history buffer.

8
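A minimal sketch of this ACK-based scheme in Python (illustrative only; the class and method names are made up, not code from the course): the sender keeps each multicast message in its history buffer until every receiver has acknowledged it, and retransmits from the buffer when a receiver reports a miss.

class Sender:
    def __init__(self, receiver_ids):
        self.receivers = set(receiver_ids)
        self.next_seq = 0
        self.history = {}          # seq -> message, kept until fully ACKed
        self.pending_acks = {}     # seq -> receivers that still owe an ACK

    def multicast(self, payload):
        seq = self.next_seq
        self.next_seq += 1
        self.history[seq] = payload
        self.pending_acks[seq] = set(self.receivers)
        return seq                 # in a real system the message would be sent here

    def ack(self, receiver_id, seq):
        """A receiver confirms it got message `seq`."""
        waiting = self.pending_acks.get(seq)
        if waiting is not None:
            waiting.discard(receiver_id)
            if not waiting:        # everyone has it: safe to purge from history
                del self.pending_acks[seq]
                del self.history[seq]

    def nack(self, receiver_id, seq):
        """A receiver reports message `seq` as missing; retransmit from history."""
        return self.history.get(seq)   # None would mean it was purged too early

s = Sender(["r1", "r2", "r3"])
seq = s.multicast("M25")
s.ack("r1", seq); s.ack("r2", seq)
print(s.nack("r3", seq))           # r3 missed it -> "M25" is retransmitted
s.ack("r3", seq)
print(seq in s.history)            # False: fully acknowledged, purged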

Reliable Group Communication

A Basic Reliable-Multicasting Scheme

Scalability in Reliable Multicasting

Atomic Multicast

9

Scalability Issues with a Feedback-Based Scheme

If there are N receivers in a multicasting process, the sender must be prepared to accept at least N ACKs

This might cause a feedback implosion

Instead, we can let a receiver return only a NACK

Limitations:

No hard guarantees can be given that a feedback implosion will not happen

It is not clear for how long the sender should keep a message in its history buffer

10

Nonhierarchical Feedback Control

How can we control the number of NACKs sent back to the sender?

A NACK is sent to all the group members after some random delay

A group member suppresses its own feedback concerning a missing message after receiving a NACK feedback about the same message

11
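A minimal sketch of this feedback-suppression idea, assuming NACKs are multicast to the whole group (names are illustrative): each receiver waits a random delay before sending a NACK for a missing message and suppresses its own NACK if it overhears another member's NACK for the same message first.

import random

class Receiver:
    def __init__(self, name):
        self.name = name
        self.pending = {}          # seq -> time at which our own NACK would be sent

    def detect_missing(self, seq, now):
        # schedule our NACK after a random delay instead of sending immediately
        self.pending[seq] = now + random.uniform(0.0, 1.0)

    def overhear_nack(self, seq):
        # someone else already asked for a retransmission: suppress our NACK
        self.pending.pop(seq, None)

    def due_nacks(self, now):
        due = [seq for seq, t in self.pending.items() if t <= now]
        for seq in due:
            del self.pending[seq]
        return due                 # these NACKs would now be multicast

r1, r2 = Receiver("r1"), Receiver("r2")
r1.detect_missing(24, now=0.0)
r2.detect_missing(24, now=0.0)
# suppose r1's timer fires first and its NACK is multicast; r2 overhears it:
r2.overhear_nack(24)
print(r2.due_nacks(now=2.0))       # [] -> r2's duplicate NACK was suppressed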

Hierarchical Feedback Control

Feedback suppression is basically a nonhierarchical solution

Achieving scalability for very large groups of receivers requires that hierarchical approaches are adopted

The group of receivers is partitioned into a number of subgroups, which are organized into a tree

12


Hierarchical Feedback Control

The subgroup containing the sender S forms the root of the tree

Within a subgroup, any reliable multicasting scheme can be used

Each subgroup appoints a local coordinator C responsible for handling retransmission requests in its subgroup

If C misses a message m, it asks the C of the parent subgroup to retransmit m

13

Figure: a hierarchy of subgroups organized into a tree; the root subgroup contains the sender S, each subgroup has a coordinator C, and the remaining members are receivers R.
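An illustrative sketch of the retransmission path in such a hierarchy (the Coordinator class and its methods are hypothetical): each subgroup coordinator caches messages for its subgroup and, if it misses one, asks its parent coordinator for a retransmission.

class Coordinator:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent       # None for the root subgroup (containing the sender)
        self.cache = {}            # seq -> message held for local retransmissions

    def receive(self, seq, message):
        self.cache[seq] = message

    def fetch(self, seq):
        """Return message `seq`, asking the parent coordinator if we miss it."""
        if seq in self.cache:
            return self.cache[seq]
        if self.parent is None:
            return None            # lost even at the root: nothing to retransmit
        message = self.parent.fetch(seq)
        if message is not None:
            self.cache[seq] = message   # cache it for other members of our subgroup
        return message

root = Coordinator("root")
child = Coordinator("child", parent=root)
root.receive(25, "M25")
print(child.fetch(25))             # "M25", fetched from the parent subgroup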

Reliable Group Communication

A Basic Reliable-Multicasting Scheme

Scalability in Reliable Multicasting

Atomic Multicast

14

Atomic Multicast

P1: What is often needed in a distributed system is the guarantee that a message is delivered to either all processes or to none at all

P2: It is also generally required that all messages are delivered in the same order to all processes

Satisfying P1 and P2 results in an atomic multicast

Atomic multicast:

Ensures that non-faulty processes maintain a consistent view

Forces reconciliation when a process recovers and rejoins the group

15

Virtual Synchrony (1)

A multicast message m is uniquely associated with a list of processes to which it should be delivered

This delivery list corresponds to a group view (G)

There is only one case in which delivery of m is allowed to fail:

When a group-membership-change is the result of the sender of m crashing

In this case, m may either be delivered to all remaining processes, or ignored by each of them

16

Virtual Synchrony (2)

A reliable multicast with this property is said to be virtually synchronous

Figure: the principle of virtual synchronous multicast. Processes P1-P4 communicate by reliable multicast implemented with multiple point-to-point messages. While the view is G = {P1, P2, P3, P4}, P3 crashes in the middle of a multicast; the partial multicast from P3 is discarded and the view changes to G = {P1, P2, P4}. When P3 later rejoins, the view becomes G = {P1, P2, P3, P4} again.

17

Message Ordering

Four different virtually synchronous multicast orderings are distinguished:

1. Unordered multicasts

2. FIFO-ordered multicasts

3. Causally-ordered multicasts

4. Totally-ordered multicasts

18

1. Unordered multicasts

A reliable, unordered multicast is a virtually synchronous multicast in which no guarantees are given concerning the order in which received messages are delivered by different processes

Process P1 | Process P2  | Process P3
Sends m1   | Receives m1 | Receives m2
Sends m2   | Receives m2 | Receives m1

Three communicating processes in the same group

19

2. FIFO-Ordered Multicasts

With FIFO-Ordered multicasts, the communication layer is forced to deliver incoming messages from the same process in the same order as they have been sent

Process P1 | Process P2  | Process P3  | Process P4
Sends m1   | Receives m1 | Receives m3 | Sends m3
Sends m2   | Receives m3 | Receives m1 | Sends m4
           | Receives m2 | Receives m2 |
           | Receives m4 | Receives m4 |

Four processes in the same group with two different senders.

20
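A small sketch of how FIFO ordering can be enforced with per-sender sequence numbers and a hold-back queue (a common implementation idea, not code from the slides): a message that arrives out of order is held back until all earlier messages from the same sender have been delivered.

class FifoDelivery:
    def __init__(self):
        self.expected = {}         # sender -> next sequence number to deliver
        self.holdback = {}         # sender -> {seq: message} that arrived too early

    def receive(self, sender, seq, message):
        """Return the list of messages that become deliverable, in order."""
        self.expected.setdefault(sender, 0)
        self.holdback.setdefault(sender, {})[seq] = message
        deliverable = []
        while self.expected[sender] in self.holdback[sender]:
            nxt = self.expected[sender]
            deliverable.append(self.holdback[sender].pop(nxt))
            self.expected[sender] = nxt + 1
        return deliverable

d = FifoDelivery()
print(d.receive("P1", 1, "m2"))    # [] -> held back, m1 from P1 not seen yet
print(d.receive("P4", 0, "m3"))    # ['m3'] -> different sender, no constraint
print(d.receive("P1", 0, "m1"))    # ['m1', 'm2'] -> delivered in sending order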

3-4. Causally-Ordered and Total-Ordered Multicasts

Causally-ordered multicast preserves potential causality between different messages

If message m1 causally precedes another message m2, regardless of whether they were multicast by the same sender or not, the communication layer at each receiver will always deliver m1 before m2

Total-ordered multicast requires that when messages are delivered, they are delivered in the same order to all group members (regardless of whether message delivery is unordered, FIFO-ordered, or causally-ordered)

21
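Total ordering is often implemented with a sequencer process; the sketch below is one such illustration (an assumed design, not the slides' algorithm): the sequencer stamps every message with a global number, and each receiver delivers messages strictly in that numeric order, so all group members deliver them in the same order.

class Sequencer:
    def __init__(self):
        self.next_seq = 0

    def order(self, message):
        seq = self.next_seq
        self.next_seq += 1
        return seq, message        # (global number, message) is then multicast

class TotalOrderReceiver:
    def __init__(self):
        self.expected = 0
        self.holdback = {}

    def receive(self, seq, message):
        self.holdback[seq] = message
        delivered = []
        while self.expected in self.holdback:
            delivered.append(self.holdback.pop(self.expected))
            self.expected += 1
        return delivered

seq = Sequencer()
a, b = seq.order("m1"), seq.order("m2")
r = TotalOrderReceiver()
print(r.receive(*b))               # [] -> m2 held until m1 arrives
print(r.receive(*a))               # ['m1', 'm2'] -> same order at every receiver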

Virtually Synchronous Reliable Multicasting

A virtually synchronous reliable multicasting that offers total-ordered delivery of messages is what we refer to as atomic multicasting

Multicast               | Basic Message Ordering  | Total-Ordered Delivery?
Reliable multicast      | None                    | No
FIFO multicast          | FIFO-ordered delivery   | No
Causal multicast        | Causal-ordered delivery | No
Atomic multicast        | None                    | Yes
FIFO atomic multicast   | FIFO-ordered delivery   | Yes
Causal atomic multicast | Causal-ordered delivery | Yes

Six different versions of virtually synchronous reliable multicasting

22

Implementing Virtual Synchrony (1)

We will consider a possible implementation of virtual synchrony that appeared in Isis [Birman et al. 1991]

Isis assumes a FIFO-ordered multicast

Isis makes use of TCP, hence, each transmission is guaranteed to succeed

Using TCP does not guarantee that all messages sent to a view G are delivered to all non-faulty processes in G before any view change

23

Implementing Virtual Synchrony (2)

The solution adopted by Isis is to let every process in G keep a message m until it knows for sure that all members in G have received it

If m has been received by all members in G, m is said to be stable

Only stable messages are allowed to be delivered

24
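A rough sketch of this stability rule (an assumed simplification, not Isis code): each process tracks which members of the current view G are known to have a message, and marks the message stable, and therefore deliverable, once all of G has it.

class StabilityTracker:
    def __init__(self, view):
        self.view = set(view)      # current group view G
        self.have = {}             # msg_id -> members known to have received it

    def received_by(self, msg_id, member):
        """Record that `member` has received `msg_id`; return True once stable."""
        holders = self.have.setdefault(msg_id, set())
        holders.add(member)
        return self.view.issubset(holders)   # stable <=> all of G has the message

t = StabilityTracker(view=["P1", "P2", "P3"])
print(t.received_by("m", "P1"))    # False -> m is still unstable, keep it buffered
print(t.received_by("m", "P2"))    # False
print(t.received_by("m", "P3"))    # True  -> m is stable and may be delivered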

Implementing Virtual Synchrony (3)

Figure: a view change in Isis. (a) Process 4 notices that process 7 has crashed and sends a view change. (b) Process 6 sends out all its unstable messages, followed by a flush message. (c) Process 6 installs the new view when it receives a flush message from everyone else.

25

Distributed Commit

The atomic multicasting problem is an example of a more general problem, known as distributed commit

The distributed commit problem requires that an operation be performed either by all members of a process group or by none at all

With reliable multicasting, the operation is the delivery of a message

With distributed transactions, the operation may be the commit of a transaction at a single site that takes part in the transaction

Distributed commit is often established by means of a coordinator and participants

26

One-Phase Commit Protocol

In a simple scheme, a coordinator can tell all participants whether or not to (locally) perform the operation in question

This scheme is referred to as a one-phase commit protocol

The main drawback of the one-phase commit protocol is that if one of the participants cannot actually perform the operation, it has no way to tell the coordinator

In practice, more sophisticated schemes are needed. The most commonly used one is the two-phase commit protocol

27

Two-Phase Commit Protocol

Assuming that no failures occur, the two-phase commit protocol (2PC) consists of the following two phases, each consisting of two steps:

Phase I: Voting Phase

Step 1

• The coordinator sends a VOTE_REQUEST message to all participants.

Step 2

• When a participant receives a VOTE_REQUEST message, it returns either a VOTE_COMMIT message to the coordinator, indicating that it is prepared to locally commit its part of the transaction, or otherwise a VOTE_ABORT message.

28

Two-Phase Commit Protocol

Phase II: Decision Phase

Step 1

• The coordinator collects all votes from the participants.

• If all participants have voted to commit the transaction, then so does the coordinator. In that case, it sends a GLOBAL_COMMIT message to all participants.

• However, if any participant has voted to abort the transaction, the coordinator also decides to abort the transaction and multicasts a GLOBAL_ABORT message.

Step 2

• Each participant that voted to commit waits for the final decision from the coordinator.

• If a participant receives a GLOBAL_COMMIT message, it locally commits the transaction.

• Otherwise, upon receiving a GLOBAL_ABORT message, it locally aborts the transaction.

29

2PC Finite State Machines

Figure: the finite state machines for the coordinator and for a participant in 2PC. Transition labels read "message received / message sent".

Coordinator: INIT → WAIT on Commit / Vote-request; WAIT → COMMIT on Vote-commit / Global-commit; WAIT → ABORT on Vote-abort / Global-abort.

Participant: INIT → WAIT on Vote-request / Vote-commit; INIT → ABORT on Vote-request / Vote-abort; WAIT → COMMIT on Global-commit / ACK; WAIT → ABORT on Global-abort / ACK.

30
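The two state machines above can be written down directly as transition tables. The sketch below (illustrative Python, no networking, timeouts, or logging) maps (state, event) to (new state, message sent); which transition a participant takes on VOTE_REQUEST depends on its local vote.

COORDINATOR_FSM = {
    ("INIT", "Commit"):       ("WAIT",   "VOTE_REQUEST"),
    ("WAIT", "Vote-commit"):  ("COMMIT", "GLOBAL_COMMIT"),
    ("WAIT", "Vote-abort"):   ("ABORT",  "GLOBAL_ABORT"),
}

PARTICIPANT_FSM = {
    # the event encodes both the received VOTE_REQUEST and the local vote
    ("INIT", "Vote-request/commit"): ("WAIT",   "VOTE_COMMIT"),
    ("INIT", "Vote-request/abort"):  ("ABORT",  "VOTE_ABORT"),
    ("WAIT", "Global-commit"):       ("COMMIT", "ACK"),
    ("WAIT", "Global-abort"):        ("ABORT",  "ACK"),
}

def step(fsm, state, event):
    new_state, message = fsm[(state, event)]
    return new_state, message

state, msg = step(COORDINATOR_FSM, "INIT", "Commit")           # WAIT, VOTE_REQUEST
state, msg = step(COORDINATOR_FSM, state, "Vote-commit")       # COMMIT, GLOBAL_COMMIT
print(state, msg)
state, msg = step(PARTICIPANT_FSM, "INIT", "Vote-request/commit")
print(state, msg)                                              # WAIT VOTE_COMMIT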

2PC Algorithm

Actions by coordinator:

write START_2PC to local log;
multicast VOTE_REQUEST to all participants;
while not all votes have been collected {
    wait for any incoming vote;
    if timeout {
        write GLOBAL_ABORT to local log;
        multicast GLOBAL_ABORT to all participants;
        exit;
    }
    record vote;
}
if all participants sent VOTE_COMMIT and coordinator votes COMMIT {
    write GLOBAL_COMMIT to local log;
    multicast GLOBAL_COMMIT to all participants;
} else {
    write GLOBAL_ABORT to local log;
    multicast GLOBAL_ABORT to all participants;
}

31

Two-Phase Commit Protocol

Actions by participants:

write INIT to local log;
wait for VOTE_REQUEST from coordinator;
if timeout {
    write VOTE_ABORT to local log;
    exit;
}
if participant votes COMMIT {
    write VOTE_COMMIT to local log;
    send VOTE_COMMIT to coordinator;
    wait for DECISION from coordinator;
    if timeout {
        multicast DECISION_REQUEST to other participants;
        wait until DECISION is received;    /* remain blocked */
        write DECISION to local log;
    }
    if DECISION == GLOBAL_COMMIT { write GLOBAL_COMMIT to local log; }
    else if DECISION == GLOBAL_ABORT { write GLOBAL_ABORT to local log; }
} else {
    write VOTE_ABORT to local log;
    send VOTE_ABORT to coordinator;
}

32

Two-Phase Commit Protocol

Actions for handling decision requests (executed by a separate thread):

while true {
    wait until any incoming DECISION_REQUEST is received;    /* remain blocked */
    read most recently recorded STATE from the local log;
    if STATE == GLOBAL_COMMIT
        send GLOBAL_COMMIT to requesting participant;
    else if STATE == INIT or STATE == GLOBAL_ABORT
        send GLOBAL_ABORT to requesting participant;
    else
        skip;    /* participant remains blocked */
}

33

Objectives

Discussion on Fault Tolerance

General background on fault tolerance

Process resilience, failure detection and reliable communication

Atomicity and distributed commit protocols

Recovery from failures

Recovery

So far, we have mainly concentrated on algorithms that allow us to tolerate faults

However, once a failure has occurred, it is essential that the process where the failure has happened can recover to a correct state

In what follows we focus on:

What it actually means to recover to a correct state

When and how the state of a distributed system can be recorded and recovered, by means of checkpointing and message logging

35

Recovery

Error Recovery

Checkpointing

Message Logging

36

Recovery

Error Recovery

Checkpointing

Message Logging

37

Error Recovery

Once a failure has occurred, it is essential that the process where the failure has happened can recover to a correct state

Fundamental to fault tolerance is the recovery from an error

The idea of error recovery is to replace an erroneous state with an error-free state

There are essentially two forms of error recovery:

1. Backward recovery

2. Forward recovery

38

1. Backward Recovery (1)

In backward recovery, the main issue is to bring the system from its present erroneous state back to a previously correct state

It is necessary to record the system's state from time to time onto stable storage, and to restore such a recorded state when things go wrong

Figure (stable storage with two drives): (a) stable storage; (b) crash after drive 1 is updated; (c) bad spot on one of the drives.

39
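An illustrative sketch of stable storage using two replicas (here two ordinary files with made-up names): a write updates drive 1 and then drive 2, and recovery copies the surviving good copy over the other, so a crash between the two writes, or a bad spot on one drive, does not lose the block.

DRIVE1, DRIVE2 = "drive1.blk", "drive2.blk"

def stable_write(data: bytes):
    with open(DRIVE1, "wb") as f:
        f.write(data)              # a crash here leaves the old copy on drive 2
    with open(DRIVE2, "wb") as f:
        f.write(data)

def read(path):
    try:
        with open(path, "rb") as f:
            return f.read()
    except OSError:
        return None                # a missing file stands in for a bad spot

def recover():
    d1, d2 = read(DRIVE1), read(DRIVE2)
    if d1 is not None and d1 != d2:
        with open(DRIVE2, "wb") as f:
            f.write(d1)            # drive 1 survived the crash: copy it over
    elif d1 is None and d2 is not None:
        with open(DRIVE1, "wb") as f:
            f.write(d2)            # repair the bad spot on drive 1 from drive 2

stable_write(b"old block")
with open(DRIVE1, "wb") as f:      # simulate a crash after drive 1 is updated:
    f.write(b"new block")          # drive 2 still holds the old block
recover()
print(read(DRIVE2))                # b'new block': both drives consistent again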

1. Backward Recovery (2)

Each time (part of) the system’s present state is recorded, a checkpoint is said to be made

Problems with backward recovery:

Restoring a system or a process to a previous state is generally expensive in terms of performance

Some states can never be rolled back (e.g., typing rm -fr * in UNIX)

40

2. Forward Recovery

In forward recovery, when the system detects that it has made an error, it attempts to bring the system into a correct new state from which it can continue executing

Forward recovery is typically faster than backward recovery, but requires knowing in advance which errors may occur

Some systems make use of both forward and backward recovery for different errors or different parts of one error

41
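A classic concrete example of forward recovery (added here for illustration, not from the slides) is erasure correction: with an XOR parity packet, a receiver that detects one missing packet reconstructs it on the spot and moves forward, instead of rolling back and requesting a retransmission.

from functools import reduce

def parity(packets):
    # XOR the packets byte by byte; all packets are assumed to have equal length
    return bytes(reduce(lambda a, b: a ^ b, group) for group in zip(*packets))

packets = [b"abcd", b"efgh", b"ijkl"]
p = parity(packets)                # the parity packet is sent alongside the data

# packet 1 is lost; reconstruct it from the remaining packets and the parity
received = [packets[0], packets[2], p]
print(parity(received))            # b'efgh' -> the lost packet, recovered forward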

Recovery

Error Recovery

Checkpointing

Message Logging

42

Why Checkpointing?

In a fault-tolerant distributed system, backward recovery requires that the system regularly saves its state onto stable storage

This process is referred to as checkpointing

In particular, checkpointing consists of storing a distributed snapshot of the current application state (i.e., a consistent global state), and later on using it to restart the execution in case of a failure

43

Recovery Line

In a distributed snapshot, if a process P has recorded the receipt of a message, then there should also be a process Q that has recorded the sending of that message

Figure: processes P and Q take local checkpoints over time, starting from the initial state; a message is sent from Q to P. The most recent set of checkpoints in which every recorded receipt has a matching recorded send jointly forms a distributed snapshot, and therefore a recovery line. A set of checkpoints in which P has recorded the receipt of a message whose sending Q has not recorded is not a recovery line. After a failure, the system rolls back to the recovery line.

44
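The recovery-line condition can be checked mechanically. The sketch below (with made-up data) treats each checkpoint as the set of message ids recorded as sent and received, and declares a set of checkpoints consistent when no message is recorded as received without a matching recorded send.

def is_consistent(checkpoints):
    """checkpoints: {process: {"sent": set_of_msg_ids, "received": set_of_msg_ids}}"""
    all_sent = set().union(*(cp["sent"] for cp in checkpoints.values()))
    all_received = set().union(*(cp["received"] for cp in checkpoints.values()))
    return all_received <= all_sent    # no receipt without a recorded send

snapshot_ok = {
    "P": {"sent": set(),  "received": {"m1"}},
    "Q": {"sent": {"m1"}, "received": set()},
}
snapshot_bad = {
    "P": {"sent": set(),  "received": {"m2"}},   # m2's send was never recorded
    "Q": {"sent": set(),  "received": set()},
}
print(is_consistent(snapshot_ok))    # True  -> these checkpoints form a recovery line
print(is_consistent(snapshot_bad))   # False -> not a recovery line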

Checkpointing

Checkpointing can be of two types:

1. Independent Checkpointing: each process simply records its local state from time to time in an uncoordinated fashion

2. Coordinated Checkpointing: all processes synchronize to jointly write their states to local stable storages

45
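A minimal sketch of independent checkpointing (file names and state layout are made up): each process periodically writes its local state to stable storage on its own schedule, and on recovery reloads its most recent checkpoint. With coordinated checkpointing, the processes would synchronize before taking the same step.

import os, pickle

def take_checkpoint(process_id, state, epoch):
    path = f"{process_id}.ckpt.{epoch}"
    with open(path, "wb") as f:
        pickle.dump(state, f)      # stands in for writing to stable storage
    return path

def restore_latest(process_id):
    ckpts = sorted(p for p in os.listdir(".") if p.startswith(f"{process_id}.ckpt."))
    if not ckpts:
        return None                # no checkpoint: restart from the initial state
    with open(ckpts[-1], "rb") as f:
        return pickle.load(f)

take_checkpoint("P", {"balance": 10}, epoch=1)
take_checkpoint("P", {"balance": 25}, epoch=2)
print(restore_latest("P"))         # {'balance': 25}: roll back to the newest checkpoint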

Domino Effect

Independent checkpointing may make it difficult to find a recovery line, potentially leading to a domino effect resulting from cascaded rollbacks

With coordinated checkpointing, the saved state is automatically globally consistent, hence the domino effect is inherently avoided

Figure: after a failure at P, P rolls back to its latest checkpoint, forcing Q to roll back as well; none of the successive checkpoint pairs forms a recovery line, so the rollbacks cascade (the domino effect).

46

Recovery

Error Recovery

Checkpointing

Message Logging

47

Why Message Logging?

Considering that checkpointing is an expensive operation, techniques have been sought to reduce the number of checkpoints, but still enable recovery

An important technique in distributed systems is message logging

The basic idea is that if transmission of messages can be replayed, we can still reach a globally consistent state but without having to restore that state from stable storage

In practice, the combination of having fewer checkpoints and message logging is more efficient than having to take many checkpoints

48

Message Logging

Message logging can be of two types:

1. Sender-based logging: A process can log its messages before sending them off

2. Receiver-based logging: A receiving process can first log an incoming message before delivering it to the application

When a sending or a receiving process crashes, it can restore the most recently checkpointed state, and from there on replay the logged messages (important for non-deterministic behaviors)

49
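A small sketch of sender-based logging combined with checkpointing (the classes and names are illustrative): the sender appends each message to a stable log before sending it, so a receiver that crashes can be rolled back to its last checkpoint and then brought forward again by replaying the logged messages in their original order.

class LoggingSender:
    def __init__(self):
        self.log = []              # stable log of (destination, message)

    def send(self, dest, message):
        self.log.append((dest.name, message))   # log first, then send
        dest.deliver(message)

    def replay(self, dest):
        for name, message in self.log:
            if name == dest.name:
                dest.deliver(message)

class Receiver:
    def __init__(self, name):
        self.name, self.state = name, []
    def deliver(self, message):
        self.state.append(message)
    def crash_and_restart(self):
        self.state = []            # volatile state lost; the checkpoint was the empty state

s, q = LoggingSender(), Receiver("Q")
s.send(q, "m1"); s.send(q, "m2")
q.crash_and_restart()
s.replay(q)
print(q.state)                     # ['m1', 'm2']: state rebuilt by replaying the log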

Replay of Messages and Orphan Processes

Incorrect replay of messages after recovery can lead to orphan processes. This should be avoided

Figure: processes P, Q, and R. P sends message M1 to Q, and M1 is logged; Q then sends M2, which is unlogged, and R sends M3. Q crashes and later recovers: M1 is replayed from the log, but M2 can never be replayed, and M3 becomes an orphan.

50

Objectives

Discussion on Fault Tolerance

General background on fault tolerance

Process resilience, failure detection and reliable communication

Atomicity and distributed commit protocols

Recovery from failures

Next Class

Programming Models-Part I

Thank You!

52