complete 1 distributed systems
TRANSCRIPT
-
8/7/2019 Complete 1 Distributed Systems
1/118
CS60002: Distributed Systems
-
Text Book: Advanced Concepts in Operating Systems by Mukesh Singhal and
Niranjan G. Shivaratri will cover about half the course.
Copies of papers, notes, etc. will cover the rest.
-
What is a distributed system?
A very broad definition:
A set of autonomous processes
communicating among themselves to perform a task
Autonomous: able to act independently
Communication: shared memory or message passing
"Concurrent system" is probably a better term
-
A more restricted definition:
A network of autonomous computers that communicate by message passing to perform some task
A practical distributed system will probably have both:
Computers that communicate by messages
Processes/threads on a computer that communicate by messages or shared memory
-
Advantages
Resource Sharing
Higher Performance
Fault Tolerance
Scalability
-
Why is it hard to design them?
The usual problem of concurrent
systems:
Arbitrary interleaving of actions makes the system hard to verify
Plus:
No globally shared memory (therefore hard to collect global state)
No global clock
Unpredictable communication delays
-
Models for Distributed
Algorithms
Topology: completely connected, ring, tree, etc.
Communication: shared memory/message passing (reliable? delay? FIFO/causal? broadcast/multicast?)
Synchronous/asynchronous
Failure models (fail-stop, crash, omission, Byzantine)
An algorithm needs to specify the model on which it is supposed to work
-
Complexity Measures
Message complexity: no. of messages
Communication complexity/Bit complexity: no. of bits
Time complexity: for synchronous systems, no. of rounds; for asynchronous systems, several different definitions exist
-
Some Fundamental Problems
Ordering events in the absence of a
global clock
Capturing the global state
Mutual exclusion
Leader election
Clock synchronization
Termination detection
Constructing spanning trees
Agreement protocols
-
Ordering of Events and
Logical Clocks
-
a → b implies a is a potential cause of b
Causal ordering: potential dependencies
The Happened Before relationship causally orders events
If a → b, then a causally affects b
If a ↛ b and b ↛ a, then a and b are concurrent (a || b)
-
Points to note:
if a → b, then C(a) < C(b)
→ is an irreflexive partial order
Total ordering possible by arbitrarily ordering concurrent events by process numbers
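The clock C in the points above is Lamport's logical clock. A minimal sketch of the standard update rules, with the tie-breaking total order; the class and function names here are mine, not the lecture's:

```python
# Lamport logical clock sketch (illustrative).
class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        # increment before each local event
        self.time += 1
        return self.time

    def send(self):
        # a send is an event; its timestamp travels with the message
        self.time += 1
        return self.time

    def receive(self, msg_time):
        # jump past the sender's timestamp, then count the receive event
        self.time = max(self.time, msg_time) + 1
        return self.time

# Total order: break ties between equal timestamps by process number.
def total_order_key(timestamp, pid):
    return (timestamp, pid)
```

Note that `receive` is what makes C(a) < C(b) hold whenever a → b across processes.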
-
Limitation of Lamport's Clock
a → b implies C(a) < C(b)
BUT
C(a) < C(b) doesn't imply a → b !!
So not a true clock !!
-
Solution: Vector Clocks
Ci is a vector of size n (no. of processes)
C(a) is similarly a vector of size n
Update rules:
Ci[i]++ for every event at process i
If a is the send of message m from i to j with vector timestamp tm, then on receive of m:
Cj[k] = max(Cj[k], tm[k]) for all k
-
For events a and b with vector
timestamps ta and tb,
ta = tb iff for all i, ta[i] = tb[i]
ta ≠ tb iff for some i, ta[i] ≠ tb[i]
ta ≤ tb iff for all i, ta[i] ≤ tb[i]
ta < tb iff (ta ≤ tb and ta ≠ tb)
ta || tb iff (ta ≮ tb and tb ≮ ta)
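The update rules and the comparisons above fit in a few lines. A sketch, assuming the common convention that a receive both merges the incoming timestamp and counts as an event (function names are mine):

```python
# Vector clock sketch for n processes (illustrative).
def new_clock(n):
    return [0] * n

def local_event(clock, i):
    clock[i] += 1                     # Ci[i]++ on every event at process i

def on_receive(clock, i, tm):
    # merge: Cj[k] = max(Cj[k], tm[k]) for all k, then count the receive event
    for k in range(len(clock)):
        clock[k] = max(clock[k], tm[k])
    clock[i] += 1

def leq(ta, tb):
    # ta <= tb iff for all i, ta[i] <= tb[i]
    return all(x <= y for x, y in zip(ta, tb))

def lt(ta, tb):
    return leq(ta, tb) and ta != tb

def concurrent(ta, tb):
    return not lt(ta, tb) and not lt(tb, ta)
```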
-
a → b iff ta < tb
Events a and b are causally related iff ta < tb or tb < ta, else they are concurrent
Note that this is still not a total order
-
Birman-Schiper-Stephenson
Protocol
To broadcast m from process i, increment Ci[i], and timestamp m with VTm = Ci
When j ≠ i receives m, j delays delivery of m until:
Cj[i] = VTm[i] − 1 and
Cj[k] ≥ VTm[k] for all k ≠ i
Delayed messages are queued at j, sorted by vector time. Concurrent messages are sorted by receive time.
When m is delivered at j, Cj is updated according to the vector clock rule.
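The delivery condition above is a simple predicate over the receiver's clock and the message timestamp. A sketch, assuming clocks count only broadcasts (the function name is mine):

```python
# Birman-Schiper-Stephenson delivery test (sketch).
def can_deliver(Cj, VTm, sender):
    # Deliver m from `sender` only if it is the next undelivered broadcast
    # from that sender (Cj[sender] == VTm[sender] - 1), and j has already
    # delivered every broadcast the sender had seen when it sent m.
    if Cj[sender] != VTm[sender] - 1:
        return False
    return all(Cj[k] >= VTm[k] for k in range(len(Cj)) if k != sender)
```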
-
Problem of Vector Clock
Message size increases since each message needs to be tagged with the vector
Size can be reduced in some cases by only sending values that have changed
-
Some notations:
LSi : local state of process i
send(mij) : send event of message mij from process i to process j
rec(mij) : similar, receive instead of send
time(x) : time at which state x was recorded
time(send(m)) : time at which send(m) occurred
-
send(mij) ∈ LSi iff time(send(mij)) < time(LSi)
rec(mij) ∈ LSj iff time(rec(mij)) < time(LSj)
transit(LSi, LSj) = { mij | send(mij) ∈ LSi and rec(mij) ∉ LSj }
inconsistent(LSi, LSj) = { mij | send(mij) ∉ LSi and rec(mij) ∈ LSj }
-
Chandy-Lamport's Algorithm
Uses special marker messages.
One process acts as initiator, and starts the state collection by following the marker sending rule below.
Marker sending rule for process P:
P records its state; then for each outgoing channel C from P on which a marker has not been sent already, P sends a marker along C before any further message is sent on C
-
When Q receives a marker along a channel C:
If Q has not recorded its state, then Q records the state of C as empty; Q then follows the marker sending rule
If Q has already recorded its state, it records the state of C as the sequence of messages received along C after Q's state was recorded and before Q received the marker along C
-
Lai and Yang's Algorithm
Similar to Chandy-Lamport's, but does not require FIFO
Boolean value X at each node; False indicates the state is not recorded yet, True indicates recorded
Value of X piggybacked with every application message
Value of X distinguishes pre-snapshot and post-snapshot messages, similar to the marker
-
Mutual Exclusion
-
Mutual Exclusion
very well-understood in shared memory systems
Requirements:
at most one process in critical section (safety)
if more than one requesting process, someone enters (liveness)
a requesting process enters within a finite time (no starvation)
requests are granted in order (fairness)
-
Some Complexity Measures
No. of messages per critical section entry
Synchronization delay
Response time
Throughput
-
Some points to note:
Purpose of REPLY messages from node i to j is to ensure that j knows of all requests of i prior to sending the REPLY (and therefore, possibly any request of i with timestamp lower than j's request)
Requires FIFO channels
3(n − 1) messages per critical section invocation
Synchronization delay = max. message transmission time
Requests are granted in order of increasing timestamps
-
Ricart-Agrawala Algorithm
Improvement over Lamport's
Main idea:
node j need not send a REPLY to node i if j has a request with timestamp lower than the request of i (since i cannot enter before j anyway in this case)
Does not require FIFO
2(n − 1) messages per critical section invocation
Synchronization delay = max. message transmission time
Requests granted in order of increasing timestamps
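The "main idea" above is a single comparison on receipt of a REQUEST. A sketch, assuming requests are ordered as (Lamport timestamp, process id) pairs to break ties (the function name is mine):

```python
# Ricart-Agrawala reply rule (sketch).
def should_defer(my_request, incoming):
    # my_request: this node's outstanding request as (timestamp, pid),
    # or None if it is not requesting. Defer the REPLY iff our own
    # request is ordered before the incoming one.
    return my_request is not None and my_request < incoming
```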
-
Maekawa's Algorithm
Permission obtained from only a subset of other processes, called the Request Set (or Quorum)
Separate Request Set Ri for each process i
Requirements:
for all i, j: Ri ∩ Rj ≠ ∅
for all i: i ∈ Ri
for all i: |Ri| = K, for some K
any node i is contained in exactly D Request Sets, for some D
K = D = sqrt(N) for Maekawa's algorithm
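Maekawa's exact K = D = sqrt(N) sets come from finite projective planes; a simpler construction that still satisfies the pairwise-intersection requirement arranges the N processes in a sqrt(N) × sqrt(N) grid and takes Ri = i's row plus i's column, giving |Ri| = 2·sqrt(N) − 1. A sketch, assuming N is a perfect square (function name is mine):

```python
import math

# Grid-based request sets: any two sets intersect because row i always
# meets column j at one grid cell. (Illustrative; not Maekawa's optimal
# projective-plane construction.)
def grid_quorums(n):
    k = math.isqrt(n)
    assert k * k == n, "this sketch assumes N is a perfect square"
    quorums = []
    for i in range(n):
        row, col = divmod(i, k)
        ri = {row * k + c for c in range(k)} | {r * k + col for r in range(k)}
        quorums.append(ri)
    return quorums
```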
-
A simple version
To request critical section:
i sends a REQUEST message to all processes in Ri
On receiving a REQUEST message:
send a REPLY message if no REPLY message has been sent since the last RELEASE message was received. Update status to indicate that a REPLY has been sent. Otherwise, queue up the REQUEST
To enter critical section:
i enters the critical section after receiving REPLY from all nodes in Ri
-
Message complexity: 3*sqrt(N)
Synchronization delay = 2 * (max. message transmission time)
Major problem: DEADLOCK possible
Needs three more types of messages (FAILED, INQUIRE, YIELD) to handle deadlock. Message complexity can then be 5*sqrt(N)
How to build the request sets?
-
Token-based Algorithms
Single token circulates; enter CS when the token is present
No FIFO required
Mutual exclusion obvious
Algorithms differ in how to find and get the token
Use sequence numbers rather than timestamps to differentiate between old and current requests
-
Suzuki-Kasami Algorithm
Broadcast a request for the token
Process with the token sends it to the requestor if it does not need it
Issues:
distinguishing current vs. outdated requests
determining sites with pending requests
deciding which site to give the token to
-
The token:
a FIFO queue Q of requesting processes
LN[1..n], where LN[j] is the sequence number of the request that j executed most recently
The request message:
REQUEST(i, k): request message from node i for its kth critical section execution
Other data structures:
RNi[1..n] for each node i, where RNi[j] is the largest sequence number received so far by i in a REQUEST message from j
-
To enter critical section:
enter CS if the token is present
To release critical section:
set LN[i] = RNi[i]
for every node j not in Q (in the token), add node j to Q if RNi[j] = LN[j] + 1
if Q is non-empty after the above, delete the first node from Q and send the token to that node
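The release step above, as a sketch with the token modelled as a dict carrying Q and LN (representation and function name are mine):

```python
from collections import deque

# Suzuki-Kasami release step (sketch). `RN` is the releasing node i's
# local request-number array.
def release_cs(i, token, RN):
    token["LN"][i] = RN[i]               # i's latest request is now satisfied
    for j in range(len(RN)):
        # j has an outstanding request iff its newest request number is
        # exactly one past the last one the token satisfied for j
        if j not in token["Q"] and RN[j] == token["LN"][j] + 1:
            token["Q"].append(j)
    if token["Q"]:
        return token["Q"].popleft()      # node to send the token to
    return None                          # nobody waiting: keep the token
```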
-
Points to note:
No. of messages: 0 if the node holds the token already, n otherwise
Synchronization delay: 0 (node has the token) or max. message delay (token is elsewhere)
No starvation
-
Raymond's Algorithm
Forms a (logical) directed tree with the token-holder as root
Each node has a variable Holder that points to its parent on the path to the root. The root's Holder variable points to itself
Each node i has a FIFO request queue Qi
-
To request critical section:
send REQUEST to the parent on the tree, provided i does not hold the token currently and Qi is empty. Then place the request in Qi
When a non-root node j receives a request from i:
place the request in Qj
send REQUEST to the parent if no previous REQUEST has been sent
-
When the root receives a REQUEST:
send the token to the requesting node
set the Holder variable to point to that node
When a node receives the token:
delete the first entry from the queue
send the token to that node
set the Holder variable to point to that node
if the queue is non-empty, send a REQUEST message to the parent (node pointed at by the Holder variable)
-
To execute critical section:
enter if the token is received and own entry is at the top of the queue; delete the entry from the queue
To release critical section:
if the queue is non-empty, delete the first entry from the queue, send the token to that node and make the Holder variable point to that node
if the queue is still non-empty, send a REQUEST message to the parent (node pointed at by the Holder variable)
-
Points to note:
Avg. message complexity O(log n)
Sync. delay (T log n)/2, where T = max. message delay
-
Leader Election
-
Leader Election in Rings
Models:
Synchronous or Asynchronous
Anonymous (no unique ids) or Non-anonymous (unique ids)
Uniform (no knowledge of n, the number of processes) or Non-uniform (knows n)
Known Impossibility Result:
There is no synchronous, non-uniform leader election protocol for anonymous rings
Implications??
-
Election in Asynchronous Rings
LeLann-Chang-Roberts Algorithm
send own id to the node on the left
if an id is received from the right, forward it to the left node only if the received id is greater than own id, else ignore
if own id is received, declare itself leader
works on unidirectional rings
message complexity = O(n^2)
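The rules above can be checked with a small round-based simulation of the unidirectional ring (a sketch; real processes would run asynchronously, and the function name is mine):

```python
# LeLann-Chang-Roberts simulation (sketch). `ids` lists the unique ids
# around the ring; process i's "left" neighbour is (i + 1) % n.
def lcr_leader(ids):
    n = len(ids)
    pending = [[ids[i]] for i in range(n)]   # ids each process will forward
    while True:
        nxt = [[] for _ in range(n)]
        for i in range(n):
            for v in pending[i]:
                j = (i + 1) % n
                if v == ids[j]:
                    return v                 # own id came back: leader
                if v > ids[j]:
                    nxt[j].append(v)         # forward larger ids, drop smaller
        pending = nxt
```

Only the maximum id survives a full trip around the ring, which is why its owner wins.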
-
Hirschberg-Sinclair Algorithm
operates in phases, requires a bidirectional ring
In the kth phase, send own id to 2^k processes on both sides of yourself (directly send only to the next process, with the id and k in the message)
if an id is received, forward it if the received id is greater than own id, else ignore
the last process in the chain sends a reply to the originator if its own id is less than the received id
replies are always forwarded
A process goes to the (k+1)th phase only if it receives a reply from both sides in the kth phase
A process receiving its own id declares itself leader
-
Leader Election in Arbitrary Networks
FloodMax:
synchronous, round-based
at each round, each process sends the max. id seen so far (not necessarily its own) to all its neighbors
after diameter no. of rounds, if max. id seen = own id, declares itself leader
Complexity = O(d.m), where d = diameter of the network, m = no. of edges
does not extend trivially to the asynchronous model
Variations: build different types of spanning trees with no pre-specified roots; the root chosen at the end is the leader (e.g., the DFS spanning tree algorithm we covered earlier)
-
Clock Synchronization
-
Clock Synchronization
Multiple machines with physical clocks. How can we keep them more or less synchronized?
Internal vs. external synchronization
Perfect synchronization not possible because of communication delays
Even synchronization within a bound cannot be guaranteed with certainty because of the unpredictability of communication delays. But still useful!! Ex.: Kerberos, GPS
-
Resynchronization
Periodic resynchronization needed to offset skew
If two clocks with max. drift rate ρ are drifting in opposite directions, max. skew after time t is 2ρt
If the application requires that clock skew < δ, then the resynchronization period
r < δ/(2ρ)
Usually ρ and δ are known
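The bound above as arithmetic (the function name is mine): two clocks drifting apart at rate ρ each accumulate skew 2ρt, so keeping skew below δ requires resynchronizing within δ/(2ρ).

```python
# Max resynchronization period for skew bound delta and drift rate rho
# (seconds of drift per second of real time).
def max_resync_period(delta, rho):
    return delta / (2 * rho)
```

For example, with δ = 10 ms and ρ = 10⁻⁶ (one microsecond of drift per second), clocks must be resynchronized at least every 5000 seconds.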
-
Handling message delay: try to estimate the time the message carrying the time server's time took to reach the sender
measure the round trip time and halve it
make multiple measurements of round trip time, discard too-high values, take the average of the rest
make multiple measurements and take the minimum
use knowledge of processing time at the server if known
Handling fast clocks:
do not set the clock backwards; slow it down over a period of time to bring it in tune with the server's clock
-
Berkeley Algorithm
Centralized as in Cristian's, but the time server is active
time server asks for the time of the other m/cs at periodic intervals
time server averages the times and sends the new time to the m/cs
m/cs set their time (advancing immediately or slowing down gradually) to the new time
Estimation of transmission delay as before
-
Measurement of time
Astronomical:
traditionally used
based on earth's rotation around its axis and around the sun
solar day: interval between two consecutive transits of the sun
solar second: 1/86,400 of a solar day
period of earth's rotation varies, so the solar second is not stable
mean solar second: average the length of a large no. of solar days, then divide by 86,400
-
UTC time is broadcast from different sources around the world, e.g.:
National Institute of Standards & Technology (NIST) runs radio stations, the most famous being WWV; anyone with a proper receiver can tune in
United States Naval Observatory (USNO) supplies time to all defense sources, among others
National Physical Laboratory in the UK
GPS satellites
Many others
-
Reliability ensured by redundant servers
Communication by multicast (usually within LAN servers), symmetric (usually among multiple geographically close servers), or client-server (to higher stratum servers)
Complex algorithms to combine and filter times
Sync. possible to within tens of milliseconds for most machines
But just a best-effort service, no guarantees
See RFC 1305 and www.eecis.udel.edu/~ntp/ for more details
-
Huang's Algorithm
One controlling agent, has weight 1 initially
All other processes are idle initially and have weight 0
Computation starts when the controlling agent sends a computation message to a process
An idle process becomes active on receiving a computation message
B(DW): computation message with weight DW. Can be sent only by the controlling agent or an active process
C(DW): control message with weight DW, sent by active processes to the controlling agent when they are about to become idle
-
Building Spanning Trees
-
Building Spanning Trees
Applications:
Broadcast
Convergecast
Leader election
Two variations:
from a given root r
root is not given a priori
-
Constructing a DFS tree with given root
plain parallelization of the sequential algorithm by introducing synchronization
each node i has a set unexplored, initially containing all neighbors of i
A node i (initiated by the root) considers the nodes in unexplored one by one, sending a neighbor j a message M and then waiting for a response (parent or reject) before considering the next node in unexplored
if j has already received M from some other node, j sends a reject to i
-
else, j sets i as its parent, and considers the nodes in its unexplored set one by one
j will send a parent message to i only when it has considered all nodes in its unexplored set
i then considers the next node in its unexplored set
The algorithm terminates when the root has received a parent or reject message from all its neighbours
Worst case no. of messages = 4m
Time complexity O(m)
-
Issues:
1. How does a node find its min. wt. outgoing edge?
2. How does a fragment find its min. wt. outgoing edge?
3. When do two fragments merge?
4. How do two fragments merge?
-
Merging rule for fragments
Suppose F is a fragment with id X, level L, and min. wt. outgoing edge e. Let the fragment at the other end of e be F1, with id X1 and level L1. Then:
if L < L1, F merges into F1; the new fragment has id X1, level L1
if L = L1, and e is also the min. wt. outgoing edge for F1, then F and F1 merge; the new fragment has id X2 = weight of e, and level L + 1; e is called the core edge
otherwise, F waits until one of the above becomes true
-
How to find the min. wt. outgoing edge of a fragment:
the nodes on the core edge broadcast an initiate message to all fragment nodes along fragment edges; it contains the level and id
on receiving initiate, a node finds its min. wt. outgoing edge (in Find state); how?
nodes send a Report message with the min. wt. edge up towards the core edge along fragment edges (and enter Found state)
leaves send their min. wt. outgoing edge; intermediate nodes send the min. of their own min. wt. outgoing edge and the min. edges sent by their children in the fragment; path info to the best edge is kept
when Report reaches the nodes on the core edge, the min. wt. outgoing edge of the fragment is known
-
Fault Tolerance and Recovery
-
Types of tolerance:
Masking: the system always behaves as per specifications even in the presence of faults
Non-masking: the system may violate specifications in the presence of faults. It should at least behave in a well-defined manner
A fault-tolerant system should specify:
the class of faults tolerated
what tolerance is given for each class
-
Different problem variations:
Byzantine agreement (or Byzantine Generals problem)
one process x broadcasts a value v
all nonfaulty processes must agree on a common value (Agreement condition)
the agreed-upon value must be v if x is nonfaulty (Validity condition)
Consensus
each process broadcasts its initial value
must satisfy the Agreement condition
if the initial value of all nonfaulty processes is v, then the agreed-upon value must be v
-
Byzantine Agreement Problem
no solution possible if:
asynchronous system, or
n < 3m + 1 (m = max. no. of faulty processes)
needs at least (m + 1) rounds of message exchange (lower bound result)
Oral messages: messages can be forged/changed in any manner, but the receiver always knows the sender
-
Lamport-Shostak-Pease Algorithm
Recursively defined:
OM(m), m > 0:
1. Source x broadcasts its value to all processes
2. Let vi = value received by process i from the source (0 if no value received). Process i acts as a new source and initiates OM(m − 1), sending vi to the remaining (n − 2) processes
3. For each i, j, i ≠ j, let vj = value received by process i from process j in step 2 using OM(m − 1). Process i uses the value majority(v1, v2, …, vn−1)
-
OM(0):
1. Source x broadcasts its value to all processes
2. Each process uses the value received; if no value is received, 0 is used
Time complexity = m + 1 rounds
Message complexity = O(n^m)
Message complexity can be reduced to polynomial by increasing the time
-
Atomic Actions and Commit Protocols
An action may have multiple subactions executed by different processes at different nodes of a distributed system
Atomic action: either all subactions are done or none are done (all-or-nothing property / global atomicity property) as far as the system state is concerned
Commit protocols: protocols for enforcing the global atomicity property
-
Two-Phase Commit
Assumes the presence of a write-ahead log at each process to recover from local crashes
One process acts as coordinator
Phase 1:
coordinator sends COMMIT_REQUEST to all processes
waits for replies from all processes
on receiving a COMMIT_REQUEST, a process, if the local transaction is successful, writes undo/redo logs in stable storage, and sends an AGREED message to the coordinator. Otherwise, it sends an ABORT
-
Phase 2:
If all processes reply AGREED, the coordinator writes a COMMIT record into the log, then sends COMMIT to all processes. If at least one process has replied ABORT, the coordinator sends ABORT to all. The coordinator then waits for ACKs from all processes. If an ACK is not received within the timeout period, resend. If all ACKs are received, the coordinator writes COMPLETE to the log
On receiving a COMMIT, a process releases all resources/locks, and sends an ACK to the coordinator
On receiving an ABORT, a process undoes the transaction using the undo log, releases all resources/locks, and sends an ACK
-
Ensures global atomicity; either all processes commit or all of them abort
Resilient to crash failures (see the text for different failure scenarios)
Blocking protocol: a crash of the coordinator can block all processes
Non-blocking protocols possible, e.g., the Three-Phase Commit protocol; we will not discuss it in this class
-
Checkpointing & Rollback Recovery
Error recovery:
Forward error recovery: assess the damage due to faults exactly and repair the erroneous part of the system state
less overhead, but hard to assess the effect of faults exactly in general
Backward error recovery: on a fault, restore the system state to a previous error-free state and restart from there
costlier, but a more general, application-independent technique
-
Checkpoint and Rollback Recovery: a form of backward error recovery
Checkpoint:
local checkpoint: local state of a process saved in stable storage for possible rollback on a fault
global checkpoint: collection of local checkpoints, one from each process
Consistent and strongly consistent global checkpoints: similar to consistent and strongly consistent global states respectively (also called a recovery line)
-
Orphan message: a message whose receive is recorded in some local checkpoint of a global checkpoint but whose send is not recorded in any local checkpoint in that global checkpoint (Note: a consistent global checkpoint cannot have an orphan message)
Lost message: a message whose send is recorded but whose receive is not in a global checkpoint
Are lost messages a problem??
not if unreliable channels are assumed (since messages can be lost anyway)
if reliable channels are assumed, this needs to be handled properly! Cannot lose messages!
We will assume unreliable channels for simplicity
-
Performance measures for a checkpointing and recovery algorithm:
during fault-free operation:
checkpointing time
space for storing checkpoints and messages (if needed)
in case of a fault:
recovery time (time to establish the recovery line)
extent of rollback (how far in the past did we roll back? how much computation is lost?)
is the output commit problem handled? (if an output was sent out before the fault, say cash dispensed at a teller m/c, it should not be resent after restarting after the fault)
-
Some parameters that affect performance:
Checkpoint interval (time between two successive checkpoints)
Number of processes
Communication pattern of the application
Fault frequency
Nature of stable storage
-
Classification of Checkpoint & Recovery Algorithms
Asynchronous/Uncoordinated:
every process takes local checkpoints independently
to recover from a fault in one process, all processes coordinate to find a consistent global checkpoint from their local checkpoints
very low fault-free overhead, but recovery overhead is high
Domino effect possible (no consistent global checkpoint exists, so all processes have to restart from scratch)
higher space requirements, as all local checkpoints need to be kept
Good for systems where faults are rare and inter-process communication is not too high (less chance of the domino effect)
-
Synchronous/Coordinated:
all processes coordinate to take a consistent global checkpoint
during recovery, every process just rolls back to its last local checkpoint independently
low recovery overhead, but high checkpointing overhead
no domino effect possible
low space requirement, since only the last checkpoint needs to be stored at each process
-
Communication Induced:
Synchronize checkpointing with communication, since message send/receive is the fundamental cause of inconsistency in a global checkpoint
Ex.: take a local checkpoint right after every send! The last local checkpoint at each process is then always consistent. But too costly
Many more efficient variations exist; we will not discuss them in this class
-
Message logging:
Take a coordinated or uncoordinated checkpoint, and then log (in stable storage) all messages received since the last checkpoint
On recovery, only the recovering process goes back to its last checkpoint, and then replays messages from the log appropriately until it reaches the state right before the fault
The only class that can handle the output commit problem!
Details too complex to discuss in this class
-
Some Checkpointing Algorithms
Asynchronous/Uncoordinated:
see Juang-Venkatesan's algorithm in the text, quite well explained
Synchronous/Coordinated:
Chandy-Lamport's global state collection algorithm can be modified to handle recovery from faults
see Koo-Toueg's algorithm in the text, quite well explained