Distributed Systems 2006 Overcoming Failures in a Distributed System* *With material adapted from Ken Birman




Leslie Lamport

  “A distributed system is one in which the failure of a machine you have never heard of can cause your own machine to become unusable”


Plan

  Goals
  Static and Dynamic Membership
  Logical Time
  Distributed Commit


Thought question

  Suppose that a distributed system was built by interconnecting a set of extremely reliable components running on fault-tolerant hardware
– Would such a system be expected to be reliable?
– Perhaps not. The pattern of interaction, the need to match rates of data production and consumption, and other “distributed” factors can all prevent a system from operating correctly!


Example (1)

  The Web’s components are individually reliable
– But the Web can fail by returning inconsistent or stale data, can freeze up or claim that a server is not responding (even when both browser and server are operational), and it can be so slow that we consider it faulty even though it is working
  For stateful systems (the Web is stateless) this issue extends to the joint behavior of sets of programs


Example (2)

  Ariane 5
– June 4, 1996, 40 seconds after takeoff…
– Self-destruction after an abrupt course correction
– “… caused by the complete loss of guidance and attitude information … due to specification and design errors in the software of the inertial reference system”
– Loss of $500 million, but no loss of life
  Where are the distribution aspects?


Our Goal Here

  We want to replicate data and computation
– For availability
– For performance
  … while guaranteeing consistent behavior
  We work towards “virtually synchronous communication”
– The system appears to have no replicated data
– The system appears to have only multi-threaded concurrency


Synchronous and Asynchronous Executions

[Figure: two executions over processes p, q, r: one synchronous, one asynchronous]

  In the synchronous model, processes share a synchronized clock, messages arrive on time, and failures are easily detected
  None of these properties holds in the asynchronous model


Reality: Neither One

  Real distributed systems aren’t synchronous
– Although some can come close
  Nor are they asynchronous
– Software often treats them as asynchronous
– In reality, clocks work well… so in practice we often use time cautiously and can even put limits on message delays
  For our purposes we usually start with an asynchronous model
– Subsequently we enrich it with sources of time when useful


Steps Towards Our Goal

2PC and 3PC: Our first “tools” (lowest layer)

Tracking group membership: We’ll base it on 2PC and 3PC

Fault-tolerant multicast: We’ll use membership

Ordered multicast: We’ll base it on fault-tolerant multicast

Tools for solving practical replication and availability problems: We’ll base them on ordered multicast

Robust Web Services: We’ll build them with these tools


Membership

  Which processes are available in a distributed system?
  Dynamic membership
– Use a group membership protocol to track members
– Performant, but complicated
  Static membership
– Use a static list of potential group members
– Resolve liveness on a per-operation basis
– May be slow, but simpler
  (The approaches may be combined)


Dynamic Membership

  Provides a Group Membership Service (GMS)
– Processes as members
– Processes may join or leave the group and monitor other processes in the group
  (More next time)
  Performance: “80,000 updates per second, 5 members”
– Static membership: “tens of updates per second, 5 members”


Static Membership

  Example
– Static set of potential members
• E.g., {p, q, r, s, t}
– Support replicated data on the members
• E.g., x: integer value
• E.g., x: [t0, v0] → [t21, v17] → [t25, v97], …
– Each process records the version and the value of x
– p reading a value?
• It cannot just look at its own copy – the value may have been updated at other members


Quorum Update and Read

  Simple fix
– Make sure that operations reach a majority of the processes in the system
– Update and read only if supported by a majority of processes
• A read is then sure to see the latest update – just take the value with the largest version
  General fix
– Two basic rules
• A quorum read should intersect any prior quorum write at at least one process
• Likewise, a quorum write should intersect prior quorum writes
– In a group of size N
• Qr + Qw > N
• Qw + Qw > N
  The example again, N = 5
– Qr = 3, Qw = 3
– Other possibilities?
• Note that we want Qw < N for fault tolerance, and thus Qr > 1!
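The two intersection rules can be checked mechanically. A minimal sketch in Python (the function name is ours, not from the slides):

```python
def valid_quorums(n, qr, qw):
    """Check the two quorum-intersection rules for a group of size n.

    qr + qw > n: every read quorum intersects every write quorum,
    so a quorum read always sees the latest committed version.
    qw + qw > n: any two write quorums intersect, so two concurrent
    updates cannot both succeed without touching a common replica.
    """
    return qr + qw > n and qw + qw > n

# The example from the slide: N = 5, Qr = Qw = 3 (majorities).
assert valid_quorums(5, 3, 3)
# Qw < N is wanted for fault tolerance, which forces Qr > 1:
assert not valid_quorums(5, 1, 4)
```

Read-one/write-all (Qr = 1, Qw = 5) also satisfies the rules, but a single crashed replica then blocks every write.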


Update Protocol

  1) p issues RPC-style read requests to one replica after another
– p collects at least Qr replies
– p notes the versions (and values)
  2) p computes the new version of the data
– Larger than the maximum version received
  3) p issues RPCs to Qw members asking them to “prepare”
– Processes reply to p
  4) p checks the number of acknowledgements
– ≥ Qw → “commit”
– < Qw → “abort”
  (Actually a two-phase commit protocol (2PC) is used in steps 3 and 4; more later)
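The four steps can be simulated locally. A toy sketch, assuming each replica is just an in-memory dict with a version and a value (the representation and function name are invented for illustration; a real version would issue RPCs and run 2PC in steps 3 and 4):

```python
def quorum_update(replicas, new_value, qr, qw):
    """Simulate the quorum update protocol against in-memory replicas.

    Each replica is a dict {"version": int, "value": object}.
    """
    # 1) "Read" at least Qr replicas and note their versions.
    read_set = replicas[:qr]
    versions = [r["version"] for r in read_set]
    # 2) Pick a version larger than the maximum version seen.
    new_version = max(versions) + 1
    # 3) Ask Qw replicas to "prepare"; here everyone acknowledges.
    write_set = replicas[:qw]
    acks = len(write_set)
    # 4) Commit only if at least Qw replicas acknowledged.
    if acks < qw:
        return "abort"
    for r in write_set:
        r["version"], r["value"] = new_version, new_value
    return "commit"

group = [{"version": 0, "value": None} for _ in range(5)]
assert quorum_update(group, 97, qr=3, qw=3) == "commit"
assert max(r["version"] for r in group) == 1
```

A subsequent quorum read of any 3 replicas is guaranteed to include at least one replica holding version 1, which is why taking the largest-versioned reply is safe.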


Time

  We were somewhat careful to avoid time in static membership
  In a distributed system we need practical ways to deal with time
– E.g., we may need to agree that update A occurred ‘before’ update B
– Or offer a “lease” on a resource that expires ‘at’ time 10:10:01.50
– Or guarantee that a time-critical event will reach all interested parties ‘within’ 100 ms


But what does Time “Mean”?

  Time on a machine’s local clock
– But was it set accurately?
– And could it drift, e.g., run fast or slow?
– What about faults, like stuck bits?
  Time on a global clock?
– E.g., with a GPS receiver
– Still not accurate enough to determine which events happen before other events
  Or we could try to agree on time


Lamport’s Approach

  Leslie Lamport suggested that we should reduce time to its basics
– We cannot order events according to a global clock
• None is available…
– We can use a logical clock
• Time basically becomes a way of labeling events so that we may ask whether event A happened before event B
• The answer should be consistent with what could have happened with respect to a global clock
– Often this is what matters


Drawing time-line pictures:

[Figure: time line for processes p and q; p sends message m via sndp(m); q later performs rcvq(m) and delivq(m); D is an event at q]


Drawing time-line pictures:

  A, B, C and D are “events”
– Could be anything meaningful to the application
• Microcode, program code, a file write, message handling, …
– So are snd(m), rcv(m) and deliv(m)
  What ordering claims are meaningful?

[Figure: time line with events A, B at p; C, D at q; sndp(m) at p; rcvq(m) and delivq(m) at q]


Drawing time-line pictures:

  A happens before B, and C before D
– “Local ordering” at a single process
– Write A →p B and C →q D

[Figure: the same time line; local orderings A →p B at p and C →q D at q]


Drawing time-line pictures:

  sndp(m) also happens before rcvq(m)
– “Distributed ordering” introduced by a message
– Write sndp(m) →m rcvq(m)

[Figure: the same time line; message ordering sndp(m) →m rcvq(m)]


Drawing time-line pictures:

  A happens before D
– Transitivity: A happens before sndp(m), which happens before rcvq(m), which happens before D

[Figure: the same time line, illustrating the chain A → sndp(m) → rcvq(m) → D]


Drawing time-line pictures:

  B and D are concurrent
– It looks like B happens first, but D has no way to know; no information flowed…

[Figure: the same time line; no chain of events connects B and D]


The Happens-Before Relation

  We’ll say that “A happens-before B”, written A → B, if
• 1) A →p B according to the local ordering, or
• 2) A is a snd and B is a rcv and A →m B, or
• 3) A and B are related under the transitive closure of rules 1) and 2)
  Thus, A → D
  So far, this is just a mathematical notation, not a “systems tool”
• A new event seen by a process happens logically after other events seen by that process
• A message receive happens logically after the message has been sent
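The transitive closure in rule 3 can be computed directly. A toy sketch over explicit edge lists (the event and edge names are invented, loosely following the earlier time line):

```python
def happens_before(a, b, edges):
    """True if a → b under the transitive closure of the given edges.

    edges holds the local orderings (rule 1) and the snd→rcv pairs
    introduced by messages (rule 2).
    """
    frontier, seen = [a], set()
    while frontier:                      # simple graph reachability
        x = frontier.pop()
        for (src, dst) in edges:
            if src == x and dst not in seen:
                seen.add(dst)
                frontier.append(dst)
    return b in seen

# The time line from the earlier slides:
E = {("A", "snd"), ("snd", "B"), ("snd", "rcv"), ("C", "rcv"), ("rcv", "D")}
assert happens_before("A", "D", E)      # A → D by transitivity
assert not happens_before("B", "D", E)  # B and D are concurrent
```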


”Simultaneous” Actions

  There are many situations in which we want to talk about some form of simultaneous event
– Think about updating replicated data
• Perhaps we have multiple conflicting updates
• The need is to ensure that they will happen in the same order at all copies
• This “looks” like a kind of simultaneous action
  We want to know the states of a distributed system that might have occurred at an instant of real time


Temporal distortions

  Things can be complicated because we can’t predict
– Message delays (they vary constantly)
– Execution speeds (often a process shares a machine with many other tasks)
– The timing of external events
  Lamport looked at this question too


Temporal distortions

  What does “now” mean?

[Figure: four process time lines p0–p3 carrying events a–f, with messages exchanged between them]



Temporal distortions

  Timelines can “stretch”…

  … caused by scheduling effects, message delays, message loss…

[Figure: the same four process time lines p0–p3, with one time line stretched]


Temporal distortions

  Timelines can “shrink”

  E.g. something lets a machine speed up

[Figure: the same four process time lines p0–p3, with one time line shrunk]


Temporal distortions

  Cuts represent instants of time
– Viz., subsets of events, one per process
• E.g., {a, c}, {a, rcv(d), f, rcv(e)}
  But not every “cut” makes sense
– Black cuts could occur, but not gray ones

[Figure: the same four process time lines p0–p3, with black (consistent) and gray (inconsistent) cut lines drawn across them]


Temporal distortions

  Red messages cross gray cuts “backwards”
– We need to avoid capturing states in which a message is received but nobody is shown as having sent it
  Consistent cuts
– If rcv(m) is in the cut, then snd(m) (or an earlier event) is in the cut
– snd(m) may be in the cut without rcv(m) being in the cut
• m is then “in the message channel”

[Figure: the same four process time lines p0–p3, with messages crossing the cut lines]
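The consistency condition is easy to state in code. A minimal sketch with invented event names (a cut is just a set of events, a message a (send, receive) pair):

```python
def is_consistent(cut, messages):
    """A cut is consistent if every receive in the cut has its send in
    the cut too; a send without its receive is fine (the message is
    then "in the channel")."""
    return all(snd in cut for (snd, rcv) in messages if rcv in cut)

msgs = [("snd_m", "rcv_m")]
assert is_consistent({"a", "snd_m", "rcv_m"}, msgs)
assert is_consistent({"a", "snd_m"}, msgs)       # m in the channel
assert not is_consistent({"a", "rcv_m"}, msgs)   # received, never sent
```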


Who Cares?

  Suppose
– p holds a lock
– m = “release lock”
– p sends m to q
– snd(m) → rcv(m)
  An inconsistent cut
– Contains rcv(m) but not snd(m)
– It sees both p and q holding the lock


Logical clocks

  A simple tool that can capture parts of the happens-before relation
  First version: uses just a single integer
– Designed for big (64-bit or more) counters
– Each process p maintains LTp, a local counter
– A message m will carry LTm


Rules for managing logical clocks

  When an event happens at a process p, it increments LTp
– Any event that matters to p
– Normally also snd and rcv events (since we want a receive to occur “after” the matching send)
  When p sends m, set
– LTm = LTp
  When q receives m, set
– LTq = max(LTq, LTm) + 1


Time-line with LT annotations

  LT(A) = 1, LT(sndp(m)) = 2, LT(m) = 2
  LT(rcvq(m)) = max(1, 2) + 1 = 3, etc.

[Figure: the earlier time line annotated with logical clocks; LTp advances 0, 1, 2, 3 and LTq advances 0, 1, 3, 4, 5 as events occur]


Logical clocks

  If A happens before B (A → B), then LT(A) < LT(B)
– A → B: there is a chain A = E0 → … → En = B, where each pair is ordered either by →p or by →m
• The LTs along this chain only increase
  But the converse might not be true:
– If LT(A) < LT(B), we can’t be sure that A → B
– This is because processes that don’t communicate still assign timestamps, and hence events will “seem” to have an order


Can we do better?

  One option is to use vector clocks
  Here we treat timestamps as a list
– One counter for each process
  The rules for managing vector times differ from what we did with logical clocks


Vector clocks

  The clock is a vector: e.g., VT(A) = [1, 0]
– We’ll just assign p index 0 and q index 1
– Vector clocks require either agreement on the numbering (static membership), or that the actual process ids be included with the vector
  Rules for managing vector clocks
– When an event happens at p, increment VTp[indexp]
• Normally also increment for snd and rcv events
– When sending a message, set VT(m) = VTp
– When receiving, set VTq = max(VTq, VT(m))
• Where “max” is taken component-wise
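These rules can be sketched directly, counting snd and rcv as events as the first rule suggests (class and method names are ours):

```python
class VectorClock:
    """Vector clock for a process with a fixed index in a static group."""

    def __init__(self, n, index):
        self.vt = [0] * n
        self.index = index

    def local_event(self):            # increment own component
        self.vt[self.index] += 1

    def send(self):                   # snd counts as an event
        self.local_event()
        return list(self.vt)

    def receive(self, vt_m):          # component-wise max; rcv counts too
        self.vt = [max(a, b) for a, b in zip(self.vt, vt_m)]
        self.local_event()

# Replaying the annotated time line: p has index 0, q index 1.
p, q = VectorClock(2, 0), VectorClock(2, 1)
p.local_event()          # A: VT(A) = [1, 0]
vt_m = p.send()          # VT(m) = [2, 0]
q.local_event()          # C: VT(C) = [0, 1]
q.receive(vt_m)          # VT(rcv_q(m)) = [2, 2]
assert vt_m == [2, 0] and q.vt == [2, 2]
```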


Time-line with VT annotations

[Figure: the earlier time line annotated with vector clocks; along p, VTp advances [0,0] → [1,0] → [2,0] → [3,0]; along q, VTq advances [0,0] → [0,1] → [2,2] → [2,3] → [2,4]; the message carries VT(m) = [2,0]]

VT(m) could also be [1,0] if we decide not to increment the clock on a snd event; the decision depends on how the timestamps will be used.


Rules for comparison of VTs

  We’ll say that VTA ≤ VTB if ∀i: VTA[i] ≤ VTB[i]
  And we’ll say that VTA < VTB if
– VTA ≤ VTB but VTA ≠ VTB
– That is, for some i, VTA[i] < VTB[i]
  Examples?
– [2,4] ≤ [2,4]
– [1,3] < [7,3]
– [1,3] is “incomparable” to [3,1]
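The comparison rules are component-wise; a short sketch checking the slide’s examples (function names are ours):

```python
def vt_leq(a, b):
    """VT_A <= VT_B iff every component of a is <= that of b."""
    return all(x <= y for x, y in zip(a, b))

def vt_lt(a, b):                 # a < b: a <= b but a != b
    return vt_leq(a, b) and a != b

def incomparable(a, b):          # neither a <= b nor b <= a
    return not vt_leq(a, b) and not vt_leq(b, a)

# The examples from the slide:
assert vt_leq([2, 4], [2, 4])
assert vt_lt([1, 3], [7, 3])
assert incomparable([1, 3], [3, 1])
```

Incomparable timestamps are exactly how vector clocks report concurrency, which the next slides exploit.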


Time-line with VT annotations

  VT(A) = [1,0], VT(D) = [2,4], so VT(A) < VT(D)
  VT(B) = [3,0], so VT(B) and VT(D) are incomparable

[Figure: the same vector-clock time line, with VT(m) = [2,0]]


Vector time and happens before

  If A → B, then VT(A) < VT(B)
– Write a chain of events from A to B
– Step by step, the vector clocks get larger
  Conversely, if VT(A) < VT(B), then A → B
– Two cases
• If A and B both happen at the same process p, all events seen by p increment its vector clock
• If A happens at p and B at q, we can trace back the path by which q “learned” VT(A)[p], since q only updates its entry for p based on a message received from, say, q′
– If q′ ≠ p, trace further back
  (Otherwise A and B happened concurrently)


Introducing “wall clock time”

  There are several options
– “Extend” a logical clock or vector clock with the clock time and use it to break ties
• This makes meaningful statements like “B and D were concurrent, although B occurred first”
• But unless clocks are closely synchronized, such statements could be erroneous!
– We use a clock-synchronization algorithm to reconcile differences between clocks on the various computers in the network


Synchronizing clocks

  Without help, clocks will often differ by many milliseconds
– The problem is that when a machine downloads the time from a network clock, it can’t be sure what the delay was
– This is because the “uplink” and “downlink” delays are often very different in a network
  Outright failures of clocks are rare…


Synchronizing clocks

  Suppose p synchronizes with time.windows.com and notes that 123 ms elapsed while the protocol was running… what time is it now?

[Figure: p asks time.windows.com “What time is it?”; the reply says 09:23.02921; the round-trip delay was 123 ms]


Synchronizing clocks

  Options?
– p could guess that the delay was evenly split, but this is rarely the case in WAN settings (downlink speeds are higher)
– p could ignore the delay
– p could factor in only the “certain” delay, e.g., if we know that the link takes at least 5 ms in each direction. This works best with GPS time sources!
  In general we can’t do better than the uncertainty in the link delay from the time source down to p
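The midpoint guess and its residual uncertainty follow from the round trip. A sketch under the usual symmetric-delay assumption (parameter names are ours):

```python
def estimate_clock(server_time_ms, round_trip_ms, min_one_way_ms=0.0):
    """Estimate the current server time after a sync exchange.

    The server's reply is somewhere between min_one_way_ms and
    (round_trip_ms - min_one_way_ms) old, so we take the midpoint
    and report the residual uncertainty either side of it.
    """
    estimate = server_time_ms + round_trip_ms / 2.0
    uncertainty = round_trip_ms / 2.0 - min_one_way_ms
    return estimate, uncertainty

# The 123 ms exchange from the previous slide, assuming a known
# 5 ms minimum delay in each direction:
est, err = estimate_clock(0.0, 123.0, 5.0)
assert (est, err) == (61.5, 56.5)
```

The larger the known minimum delay, the tighter the bound, which is why the method pairs well with well-characterized links and GPS sources.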


Consequences?

  In a network of processes, we must assume that clocks are
– Not perfectly synchronized. Even GPS has uncertainty, although it is small
• We say that clocks are “inaccurate” (with respect to real time)
– And clocks can drift during the periods between synchronizations
• The relative drift between clocks is their “precision” (with respect to each other)


Thought question

– We are building an anti-missile system
– Radar tells the interceptor where it should be and what time to get there
– Do we want the radar and interceptor to be as accurate as possible, or as precise as possible?


Thought question

  We want them to agree on the time, but it isn’t important whether they are accurate with respect to “true” time
– “Precision” matters more than “accuracy”
– Although for this, a GPS time source would be the way to go
• It might achieve higher precision than we can with an “internal” synchronization protocol!


Transactions in distributed systems

  A client and database might not run on the same computer
– The two may not fail at the same time
– Also, either could time out waiting for the other in normal situations
  When this happens, we normally abort the transaction
– The exception is a timeout that occurs while the commit is being processed
– If the server fails, one effect of the crash is to break locks, even for read-only access


Transactions in distributed systems

  What if data is on multiple servers?
– In a networked system, transactions run against a single database system
• Indeed, many systems are structured to use just a single operation – a “one-shot” transaction!
– In a true distributed system, we may want one application to talk to multiple databases
  The main issue that arises is that multiple database servers can now be touched by one transaction
  Reasons?
– Data spread around: each server owns a subset
– Some data objects may be replicated on multiple servers, e.g., to load-balance read access for a large client set
– Might do this for high availability
  Solve using the two-phase commit (2PC) protocol!


Two-phase commit in transactions

  Phase 1
– A transaction wishes to commit. Data managers force updates and lock records to the disk (e.g., to the log) and then say they are prepared to commit
  Phase 2
– The transaction manager makes sure all are prepared, then says commit (or abort, if some are not)
– Data managers then make the updates permanent or roll back to the old values, and release locks
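The two phases can be sketched as a toy in-memory simulation (the class and function names are ours; real data managers would force logs to disk and hold real locks):

```python
class DataManager:
    """Toy participant: votes in phase 1, obeys the outcome in phase 2."""

    def __init__(self, vote):
        self.vote, self.state = vote, "idle"

    def prepare(self):               # would force the log and take locks
        self.state = "prepared" if self.vote == "commit" else "aborted"
        return self.vote

    def finish(self, decision):      # make permanent or roll back
        self.state = "committed" if decision == "commit" else "aborted"

def two_phase_commit(managers):
    votes = [dm.prepare() for dm in managers]                 # phase 1
    decision = "commit" if all(v == "commit" for v in votes) else "abort"
    for dm in managers:                                       # phase 2
        dm.finish(decision)
    return decision

assert two_phase_commit([DataManager("commit") for _ in range(3)]) == "commit"
assert two_phase_commit([DataManager("commit"), DataManager("abort")]) == "abort"
```

A single “abort” vote forces everyone to roll back, which is exactly why the prepared state must survive crashes: a prepared participant cannot decide alone.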



As a time-line picture

[Figure: the 2PC initiator asks p, q, r, s, t “Vote?” (Phase 1); all vote “commit”; the initiator sends “Commit!” (Phase 2)]


Missing Stuff

  Eventually we will need to do some form of garbage collection
– The issue is that participants need a memory of the protocol, at least for a while
– But we can delay garbage collection and run it later on behalf of many protocol instances
  This is part of any real implementation, but it is not usually thought of as part of the protocol


Fault tolerance

  We can separate this into three cases
– A group member fails; the initiator remains healthy
– The initiator fails; group members remain healthy
– Both the initiator and a group member fail
  Further separation
– Handling recovery of a failed member
– Recovery after “total” failure of the whole group


Fault tolerance

  Some cases are pretty easy
– E.g., if a member fails before voting, we just treat it as an abort
– If a member fails after voting commit, we assume that when it recovers it will finish up the commit and perform whatever action we requested
  The hard cases involve a crash of the initiator


Initiator fails, members healthy

  When did it fail?
– It could fail before starting the 2PC protocol
• In this case, if the members were expecting the protocol to run, e.g., to terminate a pending transaction on a database, they do a “unilateral abort”
– It could fail after some members are prepared to commit
• Those members need to learn the outcome before they can “finish” the protocol
– It could fail after some members have learned the outcome
• Others may still be in the prepared state


How to handle initiator failures?

  Wait for the initiator to come up again…
– This may hold resources on the members
  Rather
– The initiator should record the decision at a logging server for use after crashes
• If the decision is logged, a process may learn the outcome by examining the log when the initiator fails (a timeout is needed here)
– Also, members can help one another terminate the protocol
• This is needed if a failure happens before the initiator has a chance to log its decision
• A member may repeat phase 1


Problems?

  2PC has a “bad state”
– Suppose that the initiator and a member, p, both fail and we are not using a log
• We may not always want to use a log because of the extra overhead and reliability concerns
– The other members cannot determine whether to commit or abort
• p may have transferred $10M to a bank account; we want to be consistent with that…
– There is a case in which we can’t terminate the protocol!


As a time-line picture

[Figure: the same 2PC time line (Vote?, all vote “commit”, Commit!), with Phase 1 and Phase 2 marked]


Can we do Better?

  3-phase commit (3PC)
– Assumes detectable failures
• We happen to know that real systems can’t detect failures, unless they can unplug the power for a faulty node
– The idea is to add an extra “prepared to commit” stage


3PC

[Figure: the 3PC initiator asks p, q, r, s, t “Vote?” (Phase 1, all vote “commit”), then sends “Prepare to commit” (Phase 2, all say “ok”), then “Commit!” (Phase 3, they commit)]


Why 3PC?

  A “new leader” in the group can deduce the outcome when this protocol is used
  Main insight?
– In 2PC the decision to commit can be known by only the initiator and one other process
• In 3PC nobody can enter the commit state unless all are first in the prepared state
– This makes it possible to determine the state, then push the protocol forward (or back)
  But it does require accurate failure detection
– Commit only if all operational members are in the prepared-to-commit state; abort if all operational members are in the ok-to-commit state
• Failed processes may learn the outcome when they become operational again


Value of 3PC?

  Even with inaccurate failure detection, it greatly reduces the window of vulnerability
– The bad case for 2PC is not so uncommon
• Especially if a group member is the initiator
• In that case one badly timed failure freezes the whole group
– With 3PC in real systems, the troublesome case becomes very unlikely
  But problems remain
– E.g., in a network partition, one half may be prepared to commit while the other half is only ok to commit


State diagram for non-faulty member

[Figure: state machine with states Initial, OK, Prepare, Commit, Abort and Inquire; edges labeled OK?, Prepare, Commit, Abort and “Coord failed”]

– The protocol starts in the Initial state; the initiator sends the “OK to commit?” inquiry
– Responses are collected. If any is an abort, we enter the Abort state; otherwise prepare-to-commit messages are sent out
– A coordinator failure sends a member into the Inquire state, in which someone (anyone) tries to figure out the situation
– The Commit state corresponds to the coordinator sending out the commit messages; we enter it when all members have received them
– From Inquire, we “finish off” the prepare stage if a crash interrupted it, by resending the prepare message (needed in case only some processes saw the coordinator’s message before it crashed)
– We reach Abort from Inquire if some processes were still in the initial “OK to commit?” stage; in that case it is safe to abort, and we do so
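One possible encoding of the member’s state machine as a transition table; the state names follow the diagram, while the event names on the left are our own shorthand for the messages and observations involved:

```python
TRANSITIONS = {
    ("initial", "ok_to_commit?"): "ok",          # answered the inquiry
    ("initial", "abort"): "abort",
    ("ok", "prepare"): "prepare",
    ("ok", "abort"): "abort",                    # someone voted abort
    ("prepare", "commit"): "commit",
    ("initial", "coord_failed"): "inquire",
    ("ok", "coord_failed"): "inquire",
    ("prepare", "coord_failed"): "inquire",
    ("inquire", "someone_prepared"): "prepare",  # finish off the prepare
    ("inquire", "someone_committed"): "commit",
    ("inquire", "all_in_ok_or_initial"): "abort",
}

def step(state, event):
    """Advance a non-faulty member; unknown events leave it in place."""
    return TRANSITIONS.get((state, event), state)

assert step("ok", "coord_failed") == "inquire"
assert step("inquire", "all_in_ok_or_initial") == "abort"
```

Commit and Abort have no outgoing edges: once a member decides, the decision is final, which is what lets a new leader safely drive the remaining members forward or back.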


Summary

  We looked at goals and prerequisites for consistent replication– (Static and) and Dynamic Membership– Logical Time– Distributed Commit