Complete 1: Distributed Systems


    CS60002 Distributed Systems

    Text book: Advanced Concepts in Operating Systems by Mukesh Singhal and Niranjan G. Shivaratri will cover about half the course; Xerox copies of papers, notes, etc. will cover the rest.

    What is a distributed system?

    A very broad definition: a set of autonomous processes communicating among themselves to perform a task.

    Autonomous: able to act independently

    Communication: shared memory or message passing

    "Concurrent system" is probably a better term

    A more restricted definition: a network of autonomous computers that communicate by message passing to perform some task.

    A practical distributed system will probably have both:

    computers that communicate by messages

    processes/threads on a computer that communicate by messages or shared memory

    Advantages

    Resource sharing

    Higher performance

    Fault tolerance

    Scalability

    Why is it hard to design them?

    The usual problem of concurrent systems: arbitrary interleaving of actions makes the system hard to verify.

    Plus:

    No globally shared memory (therefore hard to collect global state)

    No global clock

    Unpredictable communication delays

    Models for Distributed Algorithms

    Topology: completely connected, ring, tree, etc.

    Communication: shared memory/message passing (reliable? delay? FIFO/causal? broadcast/multicast?)

    Synchronous/asynchronous

    Failure models (fail-stop, crash, omission, Byzantine)

    An algorithm needs to specify the model on which it is supposed to work.

    Complexity Measures

    Message complexity: no. of messages

    Communication complexity/bit complexity: no. of bits

    Time complexity: for synchronous systems, no. of rounds; for asynchronous systems, different definitions exist.

    Some Fundamental Problems

    Ordering events in the absence of a global clock

    Capturing the global state

    Mutual exclusion

    Leader election

    Clock synchronization

    Termination detection

    Constructing spanning trees

    Agreement protocols

    Ordering of Events and Logical Clocks

    a → b implies a is a potential cause of b

    Causal ordering: potential dependencies

    The happened-before relationship causally orders events

    If a → b, then a causally affects b

    If a ↛ b and b ↛ a, then a and b are concurrent (a || b)

    Points to note:

    if a → b, then C(a) < C(b)

    → is an irreflexive partial order

    Total ordering possible by arbitrarily ordering concurrent events by process numbers

    Limitation of Lamport's Clock

    a → b implies C(a) < C(b)

    BUT

    C(a) < C(b) doesn't imply a → b !!

    So not a true clock !!

    Solution: Vector Clocks

    Ci is a vector of size n (no. of processes)

    C(a) is similarly a vector of size n

    Update rules:

    Ci[i]++ for every event at process i

    if a is the send of message m from i to j with vector timestamp tm, then on receive of m: Cj[k] = max(Cj[k], tm[k]) for all k
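A minimal Python sketch of these update rules (the receive handler below also counts the receive itself as a local event and increments Cj[j], which is the usual convention; the class name is illustrative only):

```python
class VectorClock:
    """Per-process vector clock, following the update rules above."""

    def __init__(self, n, i):
        self.n, self.i = n, i          # n processes, this one is process i
        self.c = [0] * n               # Ci, initially all zeros

    def local_event(self):
        self.c[self.i] += 1            # Ci[i]++ for every event at process i

    def send(self):
        self.local_event()             # a send is an event
        return list(self.c)            # timestamp tm carried on the message

    def receive(self, tm):
        # Cj[k] = max(Cj[k], tm[k]) for all k, then count the receive itself
        self.c = [max(a, b) for a, b in zip(self.c, tm)]
        self.local_event()
```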

    For events a and b with vector timestamps ta and tb:

    ta = tb iff for all i, ta[i] = tb[i]

    ta ≠ tb iff for some i, ta[i] ≠ tb[i]

    ta ≤ tb iff for all i, ta[i] ≤ tb[i]

    ta < tb iff (ta ≤ tb and ta ≠ tb)

    ta || tb iff (ta ≮ tb and tb ≮ ta)

    a → b iff ta < tb

    Events a and b are causally related iff ta < tb or tb < ta; else they are concurrent

    Note that this is still not a total order
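A small sketch of these comparison rules, assuming timestamps are plain Python lists as produced by the VectorClock sketch above:

```python
def leq(ta, tb):
    """ta <= tb iff ta[i] <= tb[i] for all i."""
    return all(x <= y for x, y in zip(ta, tb))

def less(ta, tb):
    """ta < tb iff ta <= tb and ta != tb."""
    return leq(ta, tb) and ta != tb

def concurrent(ta, tb):
    """ta || tb iff neither ta < tb nor tb < ta."""
    return not less(ta, tb) and not less(tb, ta)

# Example: [2, 1, 0] happened before [2, 2, 0]; [1, 0, 2] is concurrent with both.
assert less([2, 1, 0], [2, 2, 0])
assert concurrent([1, 0, 2], [2, 1, 0])
```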

    Birman-Schiper-Stephenson Protocol

    To broadcast m from process i, increment Ci[i], and timestamp m with VTm = Ci

    When j ≠ i receives m, j delays delivery of m until:

    Cj[i] = VTm[i] − 1, and

    Cj[k] ≥ VTm[k] for all k ≠ i

    Delayed messages are queued at j, sorted by vector time. Concurrent messages are sorted by receive time.

    When m is delivered at j, Cj is updated according to the vector clock rule.
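A minimal sketch of the delivery test a receiver j could run on each queued broadcast (here `C` is j's vector clock and `vt`/`sender` come with the message; in this protocol only Ci[i] is bumped at broadcast time, so the clock counts messages delivered per sender):

```python
def deliverable(C, vt, sender):
    """BSS delivery condition at the receiver for a broadcast from `sender`."""
    if C[sender] != vt[sender] - 1:          # exactly the next message from sender
        return False
    return all(C[k] >= vt[k]                 # everything m causally depends on
               for k in range(len(C)) if k != sender)

def deliver(C, vt):
    """Update the receiver's clock when the message is delivered (vector clock rule)."""
    for k in range(len(C)):
        C[k] = max(C[k], vt[k])
```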

    Problem of Vector Clocks

    Message size increases, since each message needs to be tagged with the vector

    Size can be reduced in some cases by only sending values that have changed

    Some notations:

    LSi : local state of process i

    send(mij) : send event of message mij from process i to process j

    rec(mij) : similar, receive instead of send

    time(x) : time at which state x was recorded

    time(send(m)) : time at which send(m) occurred

    send(mij) ∈ LSi iff time(send(mij)) < time(LSi)

    rec(mij) ∈ LSj iff time(rec(mij)) < time(LSj)

    transit(LSi, LSj) = { mij | send(mij) ∈ LSi and rec(mij) ∉ LSj }

    inconsistent(LSi, LSj) = { mij | send(mij) ∉ LSi and rec(mij) ∈ LSj }

    Chandy-Lamport's Algorithm

    Uses special marker messages.

    One process acts as initiator and starts the state collection by following the marker sending rule below.

    Marker sending rule for process P:

    P records its state; then, for each outgoing channel C from P on which a marker has not been sent already, P sends a marker along C before any further message is sent on C.

    When Q receives a marker along a channel C:

    If Q has not recorded its state, then Q records the state of C as empty; Q then follows the marker sending rule.

    If Q has already recorded its state, it records the state of C as the sequence of messages received along C after Q's state was recorded and before Q received the marker along C.
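A minimal sketch of these two rules as event handlers at one process (the `send_marker` callback and the channel-id bookkeeping are illustrative; FIFO channels are assumed, as the algorithm requires):

```python
class SnapshotProcess:
    """Handlers for the marker rules at one process (FIFO channels assumed)."""

    def __init__(self, in_channels, out_channels, send_marker):
        self.in_channels = in_channels        # ids of incoming channels
        self.out_channels = out_channels      # ids of outgoing channels
        self.send_marker = send_marker        # callback: send a MARKER on a channel
        self.recorded_state = None            # this process's recorded local state
        self.channel_state = {}               # channel id -> messages recorded on it
        self.awaiting_marker = set()          # incoming channels still being recorded

    def record_state(self, local_state):
        """Marker sending rule: record own state, then send markers on all out-channels."""
        self.recorded_state = local_state
        self.awaiting_marker = set(self.in_channels)
        for c in self.out_channels:
            self.send_marker(c)

    def on_marker(self, channel, local_state):
        if self.recorded_state is None:
            self.record_state(local_state)
            self.channel_state[channel] = []          # state of this channel: empty
        else:
            # channel state = messages seen on `channel` since recording own state
            self.channel_state.setdefault(channel, [])
        self.awaiting_marker.discard(channel)          # done recording this channel

    def on_message(self, channel, msg):
        if self.recorded_state is not None and channel in self.awaiting_marker:
            self.channel_state.setdefault(channel, []).append(msg)
```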

    Lai and Young's Algorithm

    Similar to Chandy-Lamport's, but does not require FIFO

    Boolean value X at each node; False indicates the state is not recorded yet, True indicates recorded

    Value of X piggybacked with every application message

    Value of X distinguishes pre-snapshot and post-snapshot messages, similar to the marker


    Mutual Exclusion

    Very well-understood in shared-memory systems

    Requirements:

    at most one process in the critical section (safety)

    if more than one requesting process, someone enters (liveness)

    a requesting process enters within a finite time (no starvation)

    requests are granted in order (fairness)

    Some Complexity Measures

    No. of messages per critical section entry

    Synchronization delay

    Response time

    Throughput

    Some points to note:

    The purpose of the REPLY message from node i to j is to ensure that j knows of all requests of i made prior to the sending of the REPLY (and therefore, possibly, any request of i with a timestamp lower than j's request)

    Requires FIFO channels

    3(n − 1) messages per critical section invocation

    Synchronization delay = max. message transmission time

    Requests are granted in order of increasing timestamps

    Ricart-Agrawala Algorithm

    Improvement over Lamport's

    Main idea: node j need not send a REPLY to node i if j has a request with a timestamp lower than the request of i (since i cannot enter before j anyway in this case)

    Does not require FIFO

    2(n − 1) messages per critical section invocation

    Synchronization delay = max. message transmission time

    Requests granted in order of increasing timestamps
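A minimal sketch of the REQUEST handler that embodies this idea (Lamport-timestamped requests are compared as (timestamp, process id) pairs; the broadcast and message plumbing around it are assumed):

```python
class RicartAgrawalaNode:
    def __init__(self, pid, send_reply):
        self.pid = pid
        self.send_reply = send_reply      # callback: send a REPLY to a node
        self.requesting = False           # requesting or executing the CS?
        self.my_request = None            # (timestamp, pid) of my pending request
        self.deferred = []                # nodes whose REPLY I am deferring

    def request_cs(self, ts):
        """Broadcasting REQUEST(ts, pid) to all other nodes is assumed elsewhere."""
        self.requesting, self.my_request = True, (ts, self.pid)

    def on_request(self, ts, sender):
        """Reply immediately unless my own pending request has priority."""
        if self.requesting and self.my_request < (ts, sender):
            self.deferred.append(sender)  # I go first; defer the REPLY
        else:
            self.send_reply(sender)

    def on_release(self):
        """Leaving the CS: answer everyone whose REPLY was deferred."""
        self.requesting, self.my_request = False, None
        for sender in self.deferred:
            self.send_reply(sender)
        self.deferred.clear()
```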

    Maekawa's Algorithm

    Permission obtained from only a subset of other processes, called the Request Set (or Quorum)

    Separate request set Ri for each process i

    Requirements:

    for all i, j: Ri ∩ Rj ≠ ∅

    for all i: i ∈ Ri

    for all i: |Ri| = K, for some K

    any node i is contained in exactly D request sets, for some D

    K = D = sqrt(N) for Maekawa's algorithm

    A simple version

    To request critical section: i sends a REQUEST message to all processes in Ri

    On receiving a REQUEST message: send a REPLY message if no REPLY message has been sent since the last RELEASE message was received, and update status to indicate that a REPLY has been sent; otherwise, queue up the REQUEST

    To enter critical section: i enters the critical section after receiving REPLY from all nodes in Ri

    Message complexity: 3*sqrt(N)

    Synchronization delay = 2 * (max. message transmission time)

    Major problem: DEADLOCK possible

    Need three more types of messages (FAILED, INQUIRE, YIELD) to handle deadlock. Message complexity can be 5*sqrt(N)

    Building the request sets?
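To that last question, one simple (if not optimal) construction is a grid quorum: arrange the N nodes in a sqrt(N) x sqrt(N) grid and let Ri be node i's row plus its column. Any two such sets intersect and every node sits in the same number of sets, but |Ri| is about 2*sqrt(N) − 1 rather than the ≈ sqrt(N) achieved by Maekawa's original construction (which uses finite projective planes). A sketch, assuming N is a perfect square:

```python
import math

def grid_quorums(n):
    """Row+column quorums for n = k*k nodes; any two quorums intersect."""
    k = int(math.isqrt(n))
    assert k * k == n, "this simple construction assumes n is a perfect square"
    quorums = []
    for i in range(n):
        r, c = divmod(i, k)
        row = {r * k + j for j in range(k)}        # everyone in i's row
        col = {j * k + c for j in range(k)}        # everyone in i's column
        quorums.append(row | col)                  # includes i itself
    return quorums

# Example: for n = 16, each quorum has 2*4 - 1 = 7 members, and all pairs overlap.
qs = grid_quorums(16)
assert all(qs[a] & qs[b] for a in range(16) for b in range(16))
```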

    Token-based Algorithms

    Single token circulates; enter the CS when the token is present

    No FIFO required

    Mutual exclusion obvious

    Algorithms differ in how to find and get the token

    Uses sequence numbers rather than timestamps to differentiate between old and current requests

    Suzuki-Kasami Algorithm

    Broadcast a request for the token

    The process with the token sends it to the requestor if it does not need it

    Issues:

    current vs. outdated requests

    determining sites with pending requests

    deciding which site to give the token to

    The token:

    Queue (FIFO) Q of requesting processes

    LN[1..n] : LN[j] is the sequence number of the request that j executed most recently

    The request message:

    REQUEST(i, k): request message from node i for its kth critical section execution

    Other data structures:

    RNi[1..n] for each node i, where RNi[j] is the largest sequence number received so far by i in a REQUEST message from j

    To enter critical section:

    enter the CS if the token is present

    To release critical section:

    set LN[i] = RNi[i]

    for every node j which is not in Q (in the token), add node j to Q if RNi[j] = LN[j] + 1

    if Q is non-empty after the above, delete the first node from Q and send the token to that node
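A minimal sketch of this release step at node i (RN is node i's local array; LN and Q travel inside the token; sending the token is left as a callback):

```python
from collections import deque

def release_cs(i, RN, token, send_token):
    """Suzuki-Kasami release at node i. token = {'LN': [...], 'Q': deque()}."""
    LN, Q = token['LN'], token['Q']
    LN[i] = RN[i]                              # my latest request is now satisfied
    for j in range(len(RN)):
        if j != i and j not in Q and RN[j] == LN[j] + 1:
            Q.append(j)                        # j has an outstanding request
    if Q:
        nxt = Q.popleft()
        send_token(nxt, token)                 # pass the token to the next requester
```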

    Points to note:

    No. of messages: 0 if the node holds the token already, n otherwise

    Synchronization delay: 0 (node has the token) or max. message delay (token is elsewhere)

    No starvation

    Raymond's Algorithm

    Forms a (logical) directed tree with the token-holder as root

    Each node has a variable Holder that points to its parent on the path to the root. The root's Holder variable points to itself

    Each node i has a FIFO request queue Qi

    To request critical section:

    send REQUEST to the parent on the tree, provided i does not hold the token currently and Qi is empty; then place the request in Qi

    When a non-root node j receives a request from i:

    place the request in Qj

    send REQUEST to the parent if no previous REQUEST has been sent

    When the root receives a REQUEST:

    send the token to the requesting node

    set the Holder variable to point to that node

    When a node receives the token:

    delete the first entry from the queue

    send the token to that node and set the Holder variable to point to that node

    if the queue is non-empty, send a REQUEST message to the parent (the node pointed at by the Holder variable)

    To execute critical section:

    enter if the token is received and own entry is at the top of the queue; delete the entry from the queue

    To release critical section:

    if the queue is non-empty, delete the first entry from the queue, send the token to that node and make the Holder variable point to that node

    if the queue is still non-empty, send a REQUEST message to the parent (the node pointed at by the Holder variable)
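A minimal event-handler sketch of these rules at one node (message sending is abstracted behind callbacks; `holder == pid` means this node currently holds the token; the `asked` flag used to avoid duplicate REQUESTs is the standard formulation, which the slides leave implicit):

```python
from collections import deque

class RaymondNode:
    def __init__(self, pid, holder, send_request, send_token):
        self.pid = pid
        self.holder = holder                 # parent toward the token, or self
        self.send_request = send_request     # callback: send REQUEST to a node
        self.send_token = send_token         # callback: send the token to a node
        self.q = deque()                     # FIFO request queue Qi
        self.asked = False                   # REQUEST already sent to parent?
        self.in_cs = False

    def _assign_token(self):
        # if I hold the token and am not using it, satisfy the head of the queue
        if self.holder == self.pid and not self.in_cs and self.q:
            head = self.q.popleft()
            self.asked = False
            if head == self.pid:
                self.in_cs = True            # my own entry is at the head: enter CS
            else:
                self.holder = head           # token moves toward the requester
                self.send_token(head)

    def _make_request(self):
        # ask the parent for the token if someone is waiting and I haven't asked yet
        if self.holder != self.pid and self.q and not self.asked:
            self.send_request(self.holder)
            self.asked = True

    def request_cs(self):
        self.q.append(self.pid)
        self._assign_token()
        self._make_request()

    def on_request(self, j):
        self.q.append(j)
        self._assign_token()
        self._make_request()

    def on_token(self):
        self.holder = self.pid
        self._assign_token()
        self._make_request()

    def release_cs(self):
        self.in_cs = False
        self._assign_token()
        self._make_request()
```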

    Points to note:

    Avg. message complexity O(log n)

    Sync. delay (T log n)/2, where T = max. message delay


    Leader Election

    Leader Election in Rings

    Models:

    Synchronous or asynchronous

    Anonymous (no unique ids) or non-anonymous (unique ids)

    Uniform (no knowledge of n, the number of processes) or non-uniform (knows n)

    Known impossibility result: there is no synchronous, non-uniform leader election protocol for anonymous rings

    Implications??

    Election in Asynchronous Rings

    LeLann-Chang-Roberts Algorithm

    send own id to the node on the left

    if an id is received from the right, forward the id to the left node only if the received id is greater than own id, else ignore

    if own id is received, the node declares itself leader

    works on unidirectional rings

    message complexity = O(n^2) in the worst case
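A runnable sketch that simulates the LCR rule round by round on a unidirectional ring (the ids in the example are made up):

```python
def lcr_leader(ids):
    """Simulate LeLann-Chang-Roberts on a unidirectional ring of unique ids."""
    n = len(ids)
    # pending[k] = ids queued for delivery to process k (arriving from its right)
    pending = [[ids[(k + 1) % n]] for k in range(n)]   # everyone sends own id left
    leader = None
    while leader is None:
        new_pending = [[] for _ in range(n)]
        for k in range(n):
            for rid in pending[k]:
                if rid == ids[k]:
                    leader = ids[k]                       # own id came all the way around
                elif rid > ids[k]:
                    new_pending[(k - 1) % n].append(rid)  # forward larger id to the left
                # else: swallow the smaller id
        pending = new_pending
    return leader

assert lcr_leader([3, 7, 2, 9, 5]) == 9   # the maximum id wins
```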

    Hirschberg-Sinclair Algorithm

    operates in phases, requires a bidirectional ring

    in the kth phase, a process sends its own id to the 2^k processes on both sides of itself (it directly sends only to the next processes, with the id and k in the message)

    if an id is received, forward it if the received id is greater than own id, else ignore

    the last process in the chain sends a reply to the originator if its id is less than the received id

    replies are always forwarded

    a process goes to the (k+1)th phase only if it receives a reply from both sides in the kth phase

    a process receiving its own id declares itself leader

    Leader Election in Arbitrary Networks

    FloodMax:

    synchronous, round-based

    at each round, each process sends the max. id seen so far (not necessarily its own) to all its neighbors

    after diameter no. of rounds, if the max. id seen = own id, the process declares itself leader

    complexity = O(d.m), where d = diameter of the network, m = no. of edges

    does not extend to the asynchronous model trivially

    Variations: build different types of spanning trees with no pre-specified roots; the root chosen at the end is the leader (e.g., the DFS spanning tree algorithm we covered earlier)


    Clock Synchronization

    Multiple machines with physical clocks. How can we keep them more or less synchronized?

    Internal vs. external synchronization

    Perfect synchronization is not possible because of communication delays

    Even synchronization within a bound cannot be guaranteed with certainty because of the unpredictability of communication delays. But it is still useful!! Ex.: Kerberos, GPS

    Resynchronization

    Periodic resynchronization is needed to offset clock skew

    If two clocks are drifting in opposite directions with maximum drift rate ρ, the max. skew after time t is 2ρt

    If the application requires that clock skew < δ, then the resynchronization period r < δ/(2ρ)

    Usually ρ and δ are known
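A tiny worked example of that bound (the numbers are made up): with a maximum drift rate ρ = 50 ppm and a required skew δ = 10 ms, clocks must be resynchronized at least every δ/(2ρ) = 100 s.

```python
rho = 50e-6        # max drift rate: 50 microseconds of drift per second (50 ppm)
delta = 10e-3      # required bound on clock skew: 10 ms
r_max = delta / (2 * rho)
print(r_max)       # 100.0 seconds between resynchronizations
```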

    Handling message delay: try to estimate the time the message carrying the time server's time took to reach the sender

    measure the round trip time and halve it

    make multiple measurements of the round trip time, discard values that are too high, take the average of the rest

    make multiple measurements and take the minimum

    use knowledge of the processing time at the server, if known

    Handling fast clocks:

    do not set the clock backwards; slow it down over a period of time to bring it in tune with the server's clock
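A minimal sketch of the round-trip estimate described above (the `ask_server` callback, which returns the server's timestamp, is assumed; keeping the sample with the smallest RTT follows the "take the minimum" heuristic):

```python
import time

def estimate_server_time(ask_server, samples=5):
    """Estimate the server's current clock by halving the round-trip time."""
    best = None
    for _ in range(samples):
        t0 = time.monotonic()
        server_ts = ask_server()            # server's clock reading
        rtt = time.monotonic() - t0
        if best is None or rtt < best[0]:   # keep the measurement with smallest RTT
            best = (rtt, server_ts)
    rtt, server_ts = best
    return server_ts + rtt / 2              # assume the reply took about rtt/2
```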

    Berkeley Algorithm

    Centralized as in Cristian's, but the time server is active

    the time server asks for the time of the other machines at periodic intervals

    the time server averages the times and sends the new time to the machines

    the machines set their time (advance immediately or slow down gradually) to the new time

    estimation of transmission delay as before
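A small sketch of the averaging step, assuming the server has already collected delay-corrected clock readings from the other machines; rather than shipping an absolute time, it is common to send each machine the adjustment it should apply:

```python
def berkeley_adjustments(server_time, client_times):
    """Return the offset each machine should apply so all clocks meet the average."""
    all_times = [server_time] + list(client_times)
    target = sum(all_times) / len(all_times)        # average of all clock readings
    return {i: target - t for i, t in enumerate(all_times)}

# Example: server reads t = 0, clients are +2 s and -4 s off.
adj = berkeley_adjustments(0.0, [2.0, -4.0])
# average is -0.667 s, so offsets are roughly {-0.667, -2.667, +3.333}
```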

    Measurement of time

    Astronomical:

    traditionally used

    based on the earth's rotation around its axis and around the sun

    solar day: interval between two consecutive transits of the sun

    solar second: 1/86,400 of a solar day

    the period of the earth's rotation varies, so the solar second is not stable

    mean solar second: take the average length of a large number of solar days, then divide by 86,400

    UTC time is broadcast from different sources around the world, e.g.:

    the National Institute of Standards & Technology (NIST) runs radio stations, the most famous being WWV; anyone with a proper receiver can tune in

    the United States Naval Observatory (USNO) supplies time to all defense sources, among others

    the National Physical Laboratory in the UK

    GPS satellites

    many others

    Reliability ensured by redundant servers

    Communication by multicast (usually within LAN servers), symmetric mode (usually between multiple geographically close servers), or client-server (to higher-stratum servers)

    Complex algorithms to combine and filter times

    Synchronization possible to within tens of milliseconds for most machines

    But it is just a best-effort service, no guarantees

    See RFC 1305 and http://www.eecis.udel.edu/~ntp/ for more details

    Huang's Algorithm

    One controlling agent, which has weight 1 initially

    All other processes are idle initially and have weight 0

    The computation starts when the controlling agent sends a computation message to a process

    An idle process becomes active on receiving a computation message

    B(DW): computation message with weight DW; can be sent only by the controlling agent or an active process

    C(DW): control message with weight DW, sent by active processes to the controlling agent when they are about to become idle
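Huang's algorithm detects termination of the underlying computation. The weight rules themselves are on the following (figure-only) slide; the sketch below assumes the standard rules: a sender splits its weight into the B(DW) message, a receiver adds DW to its own weight, an active process about to become idle returns its weight in a C(DW) message, and the controlling agent announces termination once its weight is back to 1.

```python
class HuangProcess:
    def __init__(self, is_agent=False):
        self.weight = 1.0 if is_agent else 0.0
        self.active = is_agent               # the agent starts the computation
        self.is_agent = is_agent

    def send_computation(self):
        """B(DW): give half of my weight to the message."""
        dw = self.weight / 2
        self.weight -= dw
        return dw

    def receive_computation(self, dw):
        self.weight += dw
        self.active = True

    def become_idle(self):
        """C(DW): return all of my weight to the controlling agent."""
        dw, self.weight, self.active = self.weight, 0.0, False
        return dw

    def agent_receive_control(self, dw):
        """Returns True when the agent can declare termination."""
        self.weight += dw
        # termination: the agent is idle and has gathered all the weight back
        return self.is_agent and not self.active and self.weight == 1.0
```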


    Building Spanning Trees

    Applications:

    broadcast

    convergecast

    leader election

    Two variations: the spanning tree is built from a given root r, or the root is not given a priori

    Constructing a DFS tree with a given root

    plain parallelization of the sequential algorithm by introducing synchronization

    each node i has a set unexplored, initially containing all neighbors of i

    a node i (initiated by the root) considers the nodes in unexplored one by one, sending a neighbor j a message M and then waiting for a response (parent or reject) before considering the next node in unexplored

    if j has already received M from some other node, j sends a reject to i

    else, j sets i as its parent and considers the nodes in its unexplored set one by one

    j will send a parent message to i only when it has considered all nodes in its unexplored set

    i then considers the next node in its unexplored set

    the algorithm terminates when the root has received a parent or reject message from all its neighbours

    worst-case no. of messages = 4m

    time complexity O(m)

    Issues:

    1. How does a node find its min. wt. outgoing edge?

    2. How does a fragment find its min. wt. outgoing edge?

    3. When do two fragments merge?

    4. How do two fragments merge?

    Merging rule for fragments

    Suppose F is a fragment with id X, level L, and min. wt. outgoing edge e. Let the fragment at the other end of e be F1, with id X1 and level L1. Then:

    if L < L1, F merges into F1; the new fragment has id X1 and level L1

    if L = L1, and e is also the min. wt. outgoing edge of F1, then F and F1 merge; the new fragment has id X2 = weight of e, and level L + 1; e is called the core edge

    otherwise, F waits until one of the above becomes true

    How to find the min. wt. outgoing edge of a fragment:

    the nodes on the core edge broadcast an initiate message to all fragment nodes along fragment edges; it contains the level and id

    on receiving initiate, a node finds its min. wt. outgoing edge (in the Find state). How?

    nodes send a Report message with their min. wt. edge up towards the core edge along fragment edges (and enter the Found state)

    leaves send their min. wt. outgoing edge; intermediate nodes send the min. of their own min. wt. outgoing edge and the min. edges sent by their children in the fragment; path info to the best edge is kept

    when the Reports reach the nodes on the core edge, the min. wt. outgoing edge of the fragment is known

    Fault Tolerance and Recovery

    Types of tolerance:

    Masking: the system always behaves as per specifications even in the presence of faults

    Non-masking: the system may violate specifications in the presence of faults, but should at least behave in a well-defined manner

    A fault-tolerant system should specify:

    the class of faults tolerated

    what tolerance is given for each class

    Different problem variations

    Byzantine agreement (or Byzantine Generals problem):

    one process x broadcasts a value v

    all nonfaulty processes must agree on a common value (agreement condition)

    the agreed-upon value must be v if x is nonfaulty (validity condition)

    Consensus:

    each process broadcasts its initial value

    satisfy the agreement condition

    if the initial value of all nonfaulty processes is v, then the agreed-upon value must be v

    Byzantine Agreement Problem

    no solution possible if:

    asynchronous system, or

    n < (3m + 1)

    needs at least (m + 1) rounds of message exchange (lower bound result)

    Oral messages: messages can be forged/changed in any manner, but the receiver always knows the sender

    Lamport-Shostak-Pease Algorithm

    Recursively defined: OM(m), m > 0

    1. The source x broadcasts its value to all processes

    2. Let vi = value received by process i from the source (0 if no value received). Process i acts as a new source and initiates OM(m − 1), sending vi to the remaining (n − 2) processes

    3. For each i, j, i ≠ j, let vj = value received by process i from process j in step 2 using OM(m − 1). Process i uses the value majority(v1, v2, ..., vn−1)

    OM(0):

    1. The source x broadcasts its value to all processes

    2. Each process uses the value received; if no value is received, 0 is used

    Time complexity = m + 1 rounds

    Message complexity = O(n^m)

    Message complexity can be reduced to polynomial by increasing the time
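A compact simulation of the recursive structure of OM(m) (the `faulty` behaviour here simply garbles the value it relays; `majority` breaks ties with the default 0):

```python
from collections import Counter

DEFAULT = 0

def majority(values):
    """Most common value; ties fall back to the default."""
    counts = Counter(values).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return DEFAULT
    return counts[0][0]

def send(src, dst, value, faulty):
    # a Byzantine sender may send anything; here it deterministically garbles the value
    return (value + src + dst) % 2 if src in faulty else value

def om(m, source, lieutenants, value, faulty):
    """Return the value each lieutenant decides on in OM(m)."""
    received = {i: send(source, i, value, faulty) for i in lieutenants}
    if m == 0:
        return received
    decisions = {}
    relayed = {j: om(m - 1, j, [i for i in lieutenants if i != j],
                     received[j], faulty)
               for j in lieutenants}
    for i in lieutenants:
        votes = [received[i]] + [relayed[j][i] for j in lieutenants if j != i]
        decisions[i] = majority(votes)
    return decisions

# Example: n = 4, m = 1 faulty process (process 3), nonfaulty source 0 sends 1.
decisions = om(1, 0, [1, 2, 3], 1, faulty={3})
assert decisions[1] == decisions[2] == 1   # nonfaulty lieutenants agree on the source's value
```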

    Atomic Actions and Commit Protocols

    An action may have multiple subactions executed by different processes at different nodes of a distributed system

    Atomic action: either all subactions are done or none are done (all-or-nothing property / global atomicity property) as far as the system state is concerned

    Commit protocols: protocols for enforcing the global atomicity property

    Two-Phase Commit

    Assumes the presence of a write-ahead log at each process to recover from local crashes

    One process acts as coordinator

    Phase 1:

    the coordinator sends COMMIT_REQUEST to all processes

    and waits for replies from all processes

    on receiving a COMMIT_REQUEST, a process, if its local transaction is successful, writes undo/redo logs to stable storage and sends an AGREED message to the coordinator; otherwise, it sends an ABORT

    Phase 2:

    if all processes reply AGREED, the coordinator writes a COMMIT record into the log, then sends COMMIT to all processes; if at least one process has replied ABORT, the coordinator sends ABORT to all

    the coordinator then waits for ACKs from all processes; if an ACK is not received within the timeout period, it resends; when all ACKs are received, the coordinator writes COMPLETE to the log

    on receiving a COMMIT, a process releases all resources/locks and sends an ACK to the coordinator

    on receiving an ABORT, a process undoes the transaction using the undo log, releases all resources/locks, and sends an ACK
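A minimal coordinator-side sketch of the two phases (message transport, the participant side, and the timeout/retry loop for ACKs are abstracted behind the callbacks, whose names are illustrative only):

```python
def two_phase_commit(participants, send, collect_votes, collect_acks, log):
    """Coordinator for 2PC. Returns 'COMMIT' or 'ABORT'."""
    # Phase 1: ask everyone to vote
    for p in participants:
        send(p, "COMMIT_REQUEST")
    votes = collect_votes(participants)          # p -> "AGREED" or "ABORT"

    # Phase 2: decide, log the decision, then inform everyone
    decision = "COMMIT" if all(v == "AGREED" for v in votes.values()) else "ABORT"
    log(decision)                                # write the COMMIT/ABORT record first
    for p in participants:
        send(p, decision)
    collect_acks(participants)                   # resending on timeout happens in here
    log("COMPLETE")
    return decision
```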

    Ensures global atomicity: either all processes commit or all of them abort

    Resilient to crash failures (see the text for different failure scenarios)

    Blocking protocol: a crash of the coordinator can block all processes

    Non-blocking protocols are possible, e.g., the Three-Phase Commit protocol; we will not discuss it in this class

    Checkpointing & Rollback Recovery

    Error recovery:

    Forward error recovery: assess the damage due to faults exactly and repair the erroneous part of the system state; less overhead, but in general it is hard to assess the effect of faults exactly

    Backward error recovery: on a fault, restore the system state to a previous error-free state and restart from there; costlier, but a more general, application-independent technique

    Checkpoint and rollback recovery: a form of backward error recovery

    Checkpoint:

    local checkpoint: the local state of a process saved in stable storage for possible rollback on a fault

    global checkpoint: a collection of local checkpoints, one from each process

    Consistent and strongly consistent global checkpoints: similar to consistent and strongly consistent global states respectively (also called the recovery line)

    Orphan message: a message whose receive is recorded in some local checkpoint of a global checkpoint but whose send is not recorded in any local checkpoint of that global checkpoint (note: a consistent global checkpoint cannot have an orphan message)

    Lost message: a message whose send is recorded but whose receive is not, in a global checkpoint

    Are lost messages a problem??

    not if unreliable channels are assumed (since messages can be lost anyway)

    if reliable channels are assumed, this needs to be handled properly! Cannot lose messages!

    We will assume unreliable channels for simplicity

    Performance measures for a checkpointing and recovery algorithm:

    during fault-free operation:

    checkpointing time

    space for storing checkpoints and messages (if needed)

    in case of a fault:

    recovery time (time to establish the recovery line)

    extent of rollback (how far in the past did we roll back? how much computation is lost?)

    is the output commit problem handled? (if an output was sent out before the fault, say cash dispensed at a teller machine, it should not be resent after restarting after the fault)

    Some parameters that affect performance:

    checkpoint interval (time between two successive checkpoints)

    number of processes

    communication pattern of the application

    fault frequency

    nature of stable storage

    Classification of Checkpoint & Recovery Algorithms

    Asynchronous/Uncoordinated:

    every process takes local checkpoints independently

    to recover from a fault in one process, all processes coordinate to find a consistent global checkpoint from their local checkpoints

    very low fault-free overhead, but high recovery overhead

    the domino effect is possible (no consistent global checkpoint exists, so all processes have to restart from scratch)

    higher space requirements, as all local checkpoints need to be kept

    good for systems where faults are rare and inter-process communication is not too high (less chance of the domino effect)

    Synchronous/Coordinated:

    all processes coordinate to take a consistent global checkpoint

    during recovery, every process just rolls back to its last local checkpoint independently

    low recovery overhead, but high checkpointing overhead

    no domino effect possible

    low space requirement, since only the last checkpoint needs to be stored at each process

    Communication Induced:

    synchronize checkpointing with communication, since message send/receive is the fundamental cause of inconsistency in a global checkpoint

    Ex.: take a local checkpoint right after every send! The last local checkpoint at each process is then always consistent. But too costly

    Many variations exist that are more efficient than the above; we will not discuss them in this class

    Message logging:

    take a coordinated or uncoordinated checkpoint, and then log (in stable storage) all messages received since the last checkpoint

    on recovery, only the recovering process goes back to its last checkpoint, and then replays messages from the log appropriately until it reaches the state right before the fault

    the only class that can handle the output commit problem!

    details are too complex to discuss in this class

    Some Checkpointing Algorithms

    Asynchronous/Uncoordinated:

    see Juang-Venkatesan's algorithm in the text, quite well explained

    Synchronous/Coordinated:

    Chandy-Lamport's global state collection algorithm can be modified to handle recovery from faults

    see Koo-Toueg's algorithm in the text, quite well explained
