complete 1 distributed systems
TRANSCRIPT
-
8/7/2019 Complete 1 Distributed Systems
1/118
CS60002: Distributed Systems
-
Text Book: Advanced Concepts in Operating Systems by Mukesh Singhal and
Niranjan G. Shivaratri will cover about half the course.
Copies of papers, notes, etc. will cover the rest.
-
What is a distributed system?
A very broad definition:
A set of autonomous processes
communicating among themselves to perform a task
Autonomous: able to act independently
Communication: shared memory or message passing
"Concurrent system" is probably a better term
-
A more restricted definition:
A network of autonomous computers that communicate by message passing to perform some task
A practical distributed system will probably have both:
Computers that communicate by messages
Processes/threads on a computer that communicate by messages or shared memory
-
Advantages
Resource Sharing
Higher Performance
Fault Tolerance
Scalability
-
Why is it hard to design them?
The usual problem of concurrent
systems:
Arbitrary interleaving of actions makes the system hard to verify
Plus:
No globally shared memory (therefore hard to collect global state)
No global clock
Unpredictable communication delays
-
Models for Distributed
Algorithms
Topology: completely connected, ring, tree, etc.
Communication: shared memory/message passing (reliable? delay? FIFO/causal? broadcast/multicast?)
Synchronous/asynchronous
Failure models (fail-stop, crash, omission, Byzantine)
An algorithm needs to specify the model on which it is supposed to work
-
Complexity Measures
Message complexity: no. of messages
Communication complexity/Bit complexity: no. of bits
Time complexity: for synchronous systems, no. of rounds; for asynchronous systems, several different definitions exist
-
Some Fundamental Problems
Ordering events in the absence of a
global clock
Capturing the global state
Mutual exclusion
Leader election
Clock synchronization
Termination detection
Constructing spanning trees
Agreement protocols
-
Ordering of Events and
Logical Clocks
-
a → b implies a is a potential cause of b
Causal ordering: potential dependencies
The Happened Before relationship causally orders events
If a → b, then a causally affects b
If a ↛ b and b ↛ a, then a and b are concurrent (a || b)
-
Points to note:
if a → b, then C(a) < C(b)
→ is an irreflexive partial order
Total ordering possible by arbitrarily ordering concurrent events by process numbers
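The clock C in the points above is Lamport's logical clock. A minimal sketch of the standard update rules, with the tie-breaking total order; the class and function names here are mine, not the lecture's:

```python
# Lamport logical clock sketch (illustrative).
class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        # increment before each local event
        self.time += 1
        return self.time

    def send(self):
        # a send is an event; its timestamp travels with the message
        self.time += 1
        return self.time

    def receive(self, msg_time):
        # jump past the sender's timestamp, then count the receive event
        self.time = max(self.time, msg_time) + 1
        return self.time

# Total order: break ties between equal timestamps by process number.
def total_order_key(timestamp, pid):
    return (timestamp, pid)
```

Note that `receive` is what makes C(a) < C(b) hold whenever a → b across processes.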
-
Limitation of Lamport's Clock
a → b implies C(a) < C(b)
BUT
C(a) < C(b) doesn't imply a → b !!
So not a true clock !!
-
Solution: Vector Clocks
Ci is a vector of size n (no. of processes)
C(a) is similarly a vector of size n
Update rules:
Ci[i]++ for every event at process i
If a is the send of message m from i to j with vector timestamp tm, then on receive of m:
Cj[k] = max(Cj[k], tm[k]) for all k
-
For events a and b with vector
timestamps ta and tb,
ta = tb iff for all i, ta[i] = tb[i]
ta ≠ tb iff for some i, ta[i] ≠ tb[i]
ta ≤ tb iff for all i, ta[i] ≤ tb[i]
ta < tb iff (ta ≤ tb and ta ≠ tb)
ta || tb iff (ta ≮ tb and tb ≮ ta)
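The update rules and the comparisons above fit in a few lines. A sketch, assuming the common convention that a receive both merges the incoming timestamp and counts as an event (function names are mine):

```python
# Vector clock sketch for n processes (illustrative).
def new_clock(n):
    return [0] * n

def local_event(clock, i):
    clock[i] += 1                     # Ci[i]++ on every event at process i

def on_receive(clock, i, tm):
    # merge: Cj[k] = max(Cj[k], tm[k]) for all k, then count the receive event
    for k in range(len(clock)):
        clock[k] = max(clock[k], tm[k])
    clock[i] += 1

def leq(ta, tb):
    # ta <= tb iff for all i, ta[i] <= tb[i]
    return all(x <= y for x, y in zip(ta, tb))

def lt(ta, tb):
    return leq(ta, tb) and ta != tb

def concurrent(ta, tb):
    return not lt(ta, tb) and not lt(tb, ta)
```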
-
a → b iff ta < tb
Events a and b are causally related iff ta < tb or tb < ta, else they are concurrent
Note that this is still not a total order
-
Birman-Schiper-Stephenson
Protocol
To broadcast m from process i, increment Ci[i], and timestamp m with VTm = Ci
When j ≠ i receives m, j delays delivery of m until:
Cj[i] = VTm[i] − 1 and
Cj[k] ≥ VTm[k] for all k ≠ i
Delayed messages are queued at j, sorted by vector time. Concurrent messages are sorted by receive time.
When m is delivered at j, Cj is updated according to the vector clock rule.
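The delivery condition above is a simple predicate over the receiver's clock and the message timestamp. A sketch, assuming clocks count only broadcasts (the function name is mine):

```python
# Birman-Schiper-Stephenson delivery test (sketch).
def can_deliver(Cj, VTm, sender):
    # Deliver m from `sender` only if it is the next undelivered broadcast
    # from that sender (Cj[sender] == VTm[sender] - 1), and j has already
    # delivered every broadcast the sender had seen when it sent m.
    if Cj[sender] != VTm[sender] - 1:
        return False
    return all(Cj[k] >= VTm[k] for k in range(len(Cj)) if k != sender)
```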
-
Problem of Vector Clock
Message size increases since each message needs to be tagged with the vector
Size can be reduced in some cases by only sending values that have changed
-
Some notations:
LSi : local state of process i
send(mij) : send event of message mij from process i to process j
rec(mij) : similar, receive instead of send
time(x) : time at which state x was recorded
time(send(m)) : time at which send(m) occurred
-
send(mij) ∈ LSi iff time(send(mij)) < time(LSi)
rec(mij) ∈ LSj iff time(rec(mij)) < time(LSj)
transit(LSi, LSj) = { mij | send(mij) ∈ LSi and rec(mij) ∉ LSj }
inconsistent(LSi, LSj) = { mij | send(mij) ∉ LSi and rec(mij) ∈ LSj }
-
Chandy-Lamport's Algorithm
Uses special marker messages.
One process acts as initiator, and starts the state collection by following the marker sending rule below.
Marker sending rule for process P:
P records its state; then for each outgoing channel C from P on which a marker has not been sent already, P sends a marker along C before any further message is sent on C
-
When Q receives a marker along a channel C:
If Q has not recorded its state, then Q records the state of C as empty; Q then follows the marker sending rule
If Q has already recorded its state, it records the state of C as the sequence of messages received along C after Q's state was recorded and before Q received the marker along C
-
Lai and Yang's Algorithm
Similar to Chandy-Lamport's, but does not require FIFO
Boolean value X at each node; False indicates the state is not recorded yet, True indicates recorded
Value of X piggybacked with every application message
Value of X distinguishes pre-snapshot and post-snapshot messages, similar to the marker
-
Mutual Exclusion
-
Mutual Exclusion
very well-understood in shared memory systems
Requirements:
at most one process in critical section (safety)
if more than one requesting process, someone enters (liveness)
a requesting process enters within a finite time (no starvation)
requests are granted in order (fairness)
-
Some Complexity Measures
No. of messages per critical section entry
Synchronization delay
Response time
Throughput
-
Some points to note:
Purpose of REPLY messages from node i to j is to ensure that j knows of all requests of i prior to sending the REPLY (and therefore, possibly any request of i with timestamp lower than j's request)
Requires FIFO channels
3(n − 1) messages per critical section invocation
Synchronization delay = max. message transmission time
Requests are granted in order of increasing timestamps
-
Ricart-Agrawala Algorithm
Improvement over Lamport's
Main idea:
node j need not send a REPLY to node i if j has a request with timestamp lower than the request of i (since i cannot enter before j anyway in this case)
Does not require FIFO
2(n − 1) messages per critical section invocation
Synchronization delay = max. message transmission time
Requests granted in order of increasing timestamps
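The "main idea" above is a single comparison on receipt of a REQUEST. A sketch, assuming requests are ordered as (Lamport timestamp, process id) pairs to break ties (the function name is mine):

```python
# Ricart-Agrawala reply rule (sketch).
def should_defer(my_request, incoming):
    # my_request: this node's outstanding request as (timestamp, pid),
    # or None if it is not requesting. Defer the REPLY iff our own
    # request is ordered before the incoming one.
    return my_request is not None and my_request < incoming
```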
-
Maekawa's Algorithm
Permission obtained from only a subset of other processes, called the Request Set (or Quorum)
Separate Request Set Ri for each process i
Requirements:
for all i, j: Ri ∩ Rj ≠ ∅
for all i: i ∈ Ri
for all i: |Ri| = K, for some K
any node i is contained in exactly D Request Sets, for some D
K = D = sqrt(N) for Maekawa's algorithm
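Maekawa's exact K = D = sqrt(N) sets come from finite projective planes; a simpler construction that still satisfies the pairwise-intersection requirement arranges the N processes in a sqrt(N) × sqrt(N) grid and takes Ri = i's row plus i's column, giving |Ri| = 2·sqrt(N) − 1. A sketch, assuming N is a perfect square (function name is mine):

```python
import math

# Grid-based request sets: any two sets intersect because row i always
# meets column j at one grid cell. (Illustrative; not Maekawa's optimal
# projective-plane construction.)
def grid_quorums(n):
    k = math.isqrt(n)
    assert k * k == n, "this sketch assumes N is a perfect square"
    quorums = []
    for i in range(n):
        row, col = divmod(i, k)
        ri = {row * k + c for c in range(k)} | {r * k + col for r in range(k)}
        quorums.append(ri)
    return quorums
```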
-
A simple version
To request critical section:
i sends a REQUEST message to all processes in Ri
On receiving a REQUEST message:
send a REPLY message if no REPLY message has been sent since the last RELEASE message was received. Update status to indicate that a REPLY has been sent. Otherwise, queue up the REQUEST
To enter critical section:
i enters the critical section after receiving REPLY from all nodes in Ri
-
Message complexity: 3*sqrt(N)
Synchronization delay = 2 * (max. message transmission time)
Major problem: DEADLOCK possible
Needs three more types of messages (FAILED, INQUIRE, YIELD) to handle deadlock. Message complexity can then be 5*sqrt(N)
How to build the request sets?
-
Token-based Algorithms
Single token circulates; enter CS when the token is present
No FIFO required
Mutual exclusion obvious
Algorithms differ in how to find and get the token
Use sequence numbers rather than timestamps to differentiate between old and current requests
-
Suzuki-Kasami Algorithm
Broadcast a request for the token
Process with the token sends it to the requestor if it does not need it
Issues:
distinguishing current vs. outdated requests
determining sites with pending requests
deciding which site to give the token to
-
The token:
a FIFO queue Q of requesting processes
LN[1..n], where LN[j] is the sequence number of the request that j executed most recently
The request message:
REQUEST(i, k): request message from node i for its kth critical section execution
Other data structures:
RNi[1..n] for each node i, where RNi[j] is the largest sequence number received so far by i in a REQUEST message from j
-
To enter critical section:
enter CS if the token is present
To release critical section:
set LN[i] = RNi[i]
for every node j not in Q (in the token), add node j to Q if RNi[j] = LN[j] + 1
if Q is non-empty after the above, delete the first node from Q and send the token to that node
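The release step above, as a sketch with the token modelled as a dict carrying Q and LN (representation and function name are mine):

```python
from collections import deque

# Suzuki-Kasami release step (sketch). `RN` is the releasing node i's
# local request-number array.
def release_cs(i, token, RN):
    token["LN"][i] = RN[i]               # i's latest request is now satisfied
    for j in range(len(RN)):
        # j has an outstanding request iff its newest request number is
        # exactly one past the last one the token satisfied for j
        if j not in token["Q"] and RN[j] == token["LN"][j] + 1:
            token["Q"].append(j)
    if token["Q"]:
        return token["Q"].popleft()      # node to send the token to
    return None                          # nobody waiting: keep the token
```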
-
Points to note:
No. of messages: 0 if the node holds the token already, n otherwise
Synchronization delay: 0 (node has the token) or max. message delay (token is elsewhere)
No starvation
-
Raymond's Algorithm
Forms a (logical) directed tree with the token-holder as root
Each node has a variable Holder that points to its parent on the path to the root. The root's Holder variable points to itself
Each node i has a FIFO request queue Qi
-
To request critical section:
send REQUEST to the parent on the tree, provided i does not hold the token currently and Qi is empty. Then place the request in Qi
When a non-root node j receives a request from i:
place the request in Qj
send REQUEST to the parent if no previous REQUEST has been sent
-
When the root receives a REQUEST:
send the token to the requesting node
set the Holder variable to point to that node
When a node receives the token:
delete the first entry from the queue
send the token to that node
set the Holder variable to point to that node
if the queue is non-empty, send a REQUEST message to the parent (node pointed at by the Holder variable)
-
To execute critical section:
enter if the token is received and own entry is at the top of the queue; delete the entry from the queue
To release critical section:
if the queue is non-empty, delete the first entry from the queue, send the token to that node and make the Holder variable point to that node
if the queue is still non-empty, send a REQUEST message to the parent (node pointed at by the Holder variable)
-
Points to note:
Avg. message complexity O(log n)
Sync. delay (T log n)/2, where T = max. message delay
-
Leader Election
-
Leader Election in Rings
Models:
Synchronous or Asynchronous
Anonymous (no unique ids) or Non-anonymous (unique ids)
Uniform (no knowledge of n, the number of processes) or Non-uniform (knows n)
Known Impossibility Result:
There is no synchronous, non-uniform leader election protocol for anonymous rings
Implications??
-
Election in Asynchronous Rings
LeLann-Chang-Roberts Algorithm
send own id to the node on the left
if an id is received from the right, forward it to the left node only if the received id is greater than own id, else ignore
if own id is received, declare itself leader
works on unidirectional rings
message complexity = O(n^2)
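The rules above can be checked with a small round-based simulation of the unidirectional ring (a sketch; real processes would run asynchronously, and the function name is mine):

```python
# LeLann-Chang-Roberts simulation (sketch). `ids` lists the unique ids
# around the ring; process i's "left" neighbour is (i + 1) % n.
def lcr_leader(ids):
    n = len(ids)
    pending = [[ids[i]] for i in range(n)]   # ids each process will forward
    while True:
        nxt = [[] for _ in range(n)]
        for i in range(n):
            for v in pending[i]:
                j = (i + 1) % n
                if v == ids[j]:
                    return v                 # own id came back: leader
                if v > ids[j]:
                    nxt[j].append(v)         # forward larger ids, drop smaller
        pending = nxt
```

Only the maximum id survives a full trip around the ring, which is why its owner wins.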
-
Hirschberg-Sinclair Algorithm
operates in phases, requires a bidirectional ring
In the kth phase, send own id to 2^k processes on both sides of yourself (directly send only to the next process, with the id and k in the message)
if an id is received, forward it if the received id is greater than own id, else ignore
the last process in the chain sends a reply to the originator if its own id is less than the received id
replies are always forwarded
A process goes to the (k+1)th phase only if it receives a reply from both sides in the kth phase
A process receiving its own id declares itself leader
-
Leader Election in Arbitrary Networks
FloodMax:
synchronous, round-based
at each round, each process sends the max. id seen so far (not necessarily its own) to all its neighbors
after diameter no. of rounds, if max. id seen = own id, declares itself leader
Complexity = O(d.m), where d = diameter of the network, m = no. of edges
does not extend trivially to the asynchronous model
Variations: build different types of spanning trees with no pre-specified roots; the root chosen at the end is the leader (e.g., the DFS spanning tree algorithm we covered earlier)
-
Clock Synchronization
-
Clock Synchronization
Multiple machines with physical clocks. How can we keep them more or less synchronized?
Internal vs. external synchronization
Perfect synchronization not possible because of communication delays
Even synchronization within a bound cannot be guaranteed with certainty because of the unpredictability of communication delays. But still useful!! Ex.: Kerberos, GPS
-
Resynchronization
Periodic resynchronization needed to offset skew
If two clocks with max. drift rate ρ are drifting in opposite directions, max. skew after time t is 2ρt
If the application requires that clock skew < δ, then the resynchronization period
r < δ/(2ρ)
Usually ρ and δ are known
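The bound above as arithmetic (the function name is mine): two clocks drifting apart at rate ρ each accumulate skew 2ρt, so keeping skew below δ requires resynchronizing within δ/(2ρ).

```python
# Max resynchronization period for skew bound delta and drift rate rho
# (seconds of drift per second of real time).
def max_resync_period(delta, rho):
    return delta / (2 * rho)
```

For example, with δ = 10 ms and ρ = 10⁻⁶ (one microsecond of drift per second), clocks must be resynchronized at least every 5000 seconds.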
-
Handling message delay: try to estimate the time the message carrying the time server's time took to reach the sender
measure the round trip time and halve it
make multiple measurements of round trip time, discard too-high values, take the average of the rest
make multiple measurements and take the minimum
use knowledge of processing time at the server if known
Handling fast clocks:
do not set the clock backwards; slow it down over a period of time to bring it in tune with the server's clock
-
Berkeley Algorithm
Centralized as in Cristian's, but the time server is active
time server asks for the time of the other m/cs at periodic intervals
time server averages the times and sends the new time to the m/cs
m/cs set their time (advancing immediately or slowing down gradually) to the new time
Estimation of transmission delay as before
-
Measurement of time
Astronomical:
traditionally used
based on earth's rotation around its axis and around the sun
solar day: interval between two consecutive transits of the sun
solar second: 1/86,400 of a solar day
period of earth's rotation varies, so the solar second is not stable
mean solar second: average the length of a large no. of solar days, then divide by 86,400
-
UTC time is broadcast from different sources around the world, e.g.:
National Institute of Standards & Technology (NIST) runs radio stations, the most famous being WWV; anyone with a proper receiver can tune in
United States Naval Observatory (USNO) supplies time to all defense sources, among others
National Physical Laboratory in the UK
GPS satellites
Many others
-
Reliability ensured by redundant servers
Communication by multicast (usually within LAN servers), symmetric (usually among multiple geographically close servers), or client-server (to higher stratum servers)
Complex algorithms to combine and filter times
Sync. possible to within tens of milliseconds for most machines
But just a best-effort service, no guarantees
See RFC 1305 and www.eecis.udel.edu/~ntp/ for more details
-
Huang's Algorithm
One controlling agent, has weight 1 initially
All other processes are idle initially and have weight 0
Computation starts when the controlling agent sends a computation message to a process
An idle process becomes active on receiving a computation message
B(DW): computation message with weight DW. Can be sent only by the controlling agent or an active process
C(DW): control message with weight DW, sent by active processes to the controlling agent when they are about to become idle
-
Building Spanning Trees
-
Building Spanning Trees
Applications:
Broadcast
Convergecast
Leader election
Two variations:
from a given root r
root is not given a priori
-
Constructing a DFS tree with given root
plain parallelization of the sequential algorithm by introducing synchronization
each node i has a set unexplored, initially containing all neighbors of i
A node i (initiated by the root) considers the nodes in unexplored one by one, sending a neighbor j a message M and then waiting for a response (parent or reject) before considering the next node in unexplored
if j has already received M from some other node, j sends a reject to i
-
else, j sets i as its parent, and considers the nodes in its unexplored set one by one
j will send a parent message to i only when it has considered all nodes in its unexplored set
i then considers the next node in its unexplored set
The algorithm terminates when the root has received a parent or reject message from all its neighbours
Worst case no. of messages = 4m
Time complexity O(m)
-
Issues:
1. How does a node find its min. wt. outgoing edge?
2. How does a fragment find its min. wt. outgoing edge?
3. When do two fragments merge?
4. How do two fragments merge?
-
Merging rule for fragments
Suppose F is a fragment with id X, level L, and min. wt. outgoing edge e. Let the fragment at the other end of e be F1, with id X1 and level L1. Then:
if L < L1, F merges into F1; the new fragment has id X1, level L1
if L = L1, and e is also the min. wt. outgoing edge for F1, then F and F1 merge; the new fragment has id X2 = weight of e, and level L + 1; e is called the core edge
otherwise, F waits until one of the above becomes true
-
How to find the min. wt. outgoing edge of a fragment:
the nodes on the core edge broadcast an initiate message to all fragment nodes along fragment edges; it contains the level and id
on receiving initiate, a node finds its min. wt. outgoing edge (in Find state); how?
nodes send a Report message with the min. wt. edge up towards the core edge along fragment edges (and enter Found state)
leaves send their min. wt. outgoing edge; intermediate nodes send the min. of their own min. wt. outgoing edge and the min. edges sent by their children in the fragment; path info to the best edge is kept
when Report reaches the nodes on the core edge, the min. wt. outgoing edge of the fragment is known
-
Fault Tolerance and Recovery
-
Types of tolerance:
Masking: the system always behaves as per specifications even in the presence of faults
Non-masking: the system may violate specifications in the presence of faults. It should at least behave in a well-defined manner
A fault-tolerant system should specify:
the class of faults tolerated
what tolerance is given for each class
-
Different problem variations:
Byzantine agreement (or Byzantine Generals problem)
one process x broadcasts a value v
all nonfaulty processes must agree on a common value (Agreement condition)
the agreed-upon value must be v if x is nonfaulty (Validity condition)
Consensus
each process broadcasts its initial value
must satisfy the Agreement condition
if the initial value of all nonfaulty processes is v, then the agreed-upon value must be v
-
Byzantine Agreement Problem
no solution possible if:
asynchronous system, or
n < 3m + 1 (m = max. no. of faulty processes)
needs at least (m + 1) rounds of message exchange (lower bound result)
Oral messages: messages can be forged/changed in any manner, but the receiver always knows the sender
-
Lamport-Shostak-Pease Algorithm
Recursively defined:
OM(m), m > 0:
1. Source x broadcasts its value to all processes
2. Let vi = value received by process i from the source (0 if no value received). Process i acts as a new source and initiates OM(m − 1), sending vi to the remaining (n − 2) processes
3. For each i, j, i ≠ j, let vj = value received by process i from process j in step 2 using OM(m − 1). Process i uses the value majority(v1, v2, …, vn−1)
-
OM(0):
1. Source x broadcasts its value to all processes
2. Each process uses the value received; if no value is received, 0 is used
Time complexity = m + 1 rounds
Message complexity = O(n^m)
Message complexity can be reduced to polynomial by increasing the time
-
Atomic Actions and Commit Protocols
An action may have multiple subactions executed by different processes at different nodes of a distributed system
Atomic action: either all subactions are done or none are done (all-or-nothing property / global atomicity property) as far as the system state is concerned
Commit protocols: protocols for enforcing the global atomicity property
-
Two-Phase Commit
Assumes the presence of a write-ahead log at each process to recover from local crashes
One process acts as coordinator
Phase 1:
coordinator sends COMMIT_REQUEST to all processes
waits for replies from all processes
on receiving a COMMIT_REQUEST, a process, if the local transaction is successful, writes undo/redo logs in stable storage, and sends an AGREED message to the coordinator. Otherwise, it sends an ABORT
-
Phase 2:
If all processes reply AGREED, the coordinator writes a COMMIT record into the log, then sends COMMIT to all processes. If at least one process has replied ABORT, the coordinator sends ABORT to all. The coordinator then waits for ACKs from all processes. If an ACK is not received within the timeout period, resend. If all ACKs are received, the coordinator writes COMPLETE to the log
On receiving a COMMIT, a process releases all resources/locks, and sends an ACK to the coordinator
On receiving an ABORT, a process undoes the transaction using the undo log, releases all resources/locks, and sends an ACK
-
Ensures global atomicity; either all processes commit or all of them abort
Resilient to crash failures (see the text for different failure scenarios)
Blocking protocol: a crash of the coordinator can block all processes
Non-blocking protocols possible, e.g., the Three-Phase Commit protocol; we will not discuss it in this class
-
Checkpointing & Rollback Recovery
Error recovery:
Forward error recovery: assess the damage due to faults exactly and repair the erroneous part of the system state
less overhead, but hard to assess the effect of faults exactly in general
Backward error recovery: on a fault, restore the system state to a previous error-free state and restart from there
costlier, but a more general, application-independent technique
-
Checkpoint and Rollback Recovery: a form of backward error recovery
Checkpoint:
local checkpoint: local state of a process saved in stable storage for possible rollback on a fault
global checkpoint: collection of local checkpoints, one from each process
Consistent and strongly consistent global checkpoints: similar to consistent and strongly consistent global states respectively (also called a recovery line)
-
Orphan message: a message whose receive is recorded in some local checkpoint of a global checkpoint but whose send is not recorded in any local checkpoint in that global checkpoint (Note: a consistent global checkpoint cannot have an orphan message)
Lost message: a message whose send is recorded but whose receive is not in a global checkpoint
Are lost messages a problem??
not if unreliable channels are assumed (since messages can be lost anyway)
if reliable channels are assumed, this needs to be handled properly! Cannot lose messages!
We will assume unreliable channels for simplicity
-
Performance measures for a checkpointing and recovery algorithm:
during fault-free operation:
checkpointing time
space for storing checkpoints and messages (if needed)
in case of a fault:
recovery time (time to establish the recovery line)
extent of rollback (how far in the past did we roll back? how much computation is lost?)
is the output commit problem handled? (if an output was sent out before the fault, say cash dispensed at a teller m/c, it should not be resent after restarting after the fault)
-
Some parameters that affect performance:
Checkpoint interval (time between two successive checkpoints)
Number of processes
Communication pattern of the application
Fault frequency
Nature of stable storage
-
Classification of Checkpoint & Recovery Algorithms
Asynchronous/Uncoordinated:
every process takes local checkpoints independently
to recover from a fault in one process, all processes coordinate to find a consistent global checkpoint from their local checkpoints
very low fault-free overhead, but recovery overhead is high
Domino effect possible (no consistent global checkpoint exists, so all processes have to restart from scratch)
higher space requirements, as all local checkpoints need to be kept
Good for systems where faults are rare and inter-process communication is not too high (less chance of the domino effect)
-
Synchronous/Coordinated:
all processes coordinate to take a consistent global checkpoint
during recovery, every process just rolls back to its last local checkpoint independently
low recovery overhead, but high checkpointing overhead
no domino effect possible
low space requirement, since only the last checkpoint needs to be stored at each process
-
Communication Induced:
Synchronize checkpointing with communication, since message send/receive is the fundamental cause of inconsistency in a global checkpoint
Ex.: take a local checkpoint right after every send! The last local checkpoint at each process is then always consistent. But too costly
Many more efficient variations exist; we will not discuss them in this class
-
Message logging:
Take a coordinated or uncoordinated checkpoint, and then log (in stable storage) all messages received since the last checkpoint
On recovery, only the recovering process goes back to its last checkpoint, and then replays messages from the log appropriately until it reaches the state right before the fault
The only class that can handle the output commit problem!
Details too complex to discuss in this class
-
Some Checkpointing Algorithms
Asynchronous/Uncoordinated:
see Juang-Venkatesan's algorithm in the text, quite well explained
Synchronous/Coordinated:
Chandy-Lamport's global state collection algorithm can be modified to handle recovery from faults
see Koo-Toueg's algorithm in the text, quite well explained