6. Distributed Operating Systems

DISTRIBUTED OPERATING SYSTEMS Sandeep Kumar Poonia

Uploaded by sandeep-poonia on 16-Nov-2014 (Category: Education)

Page 1: 6.Distributed Operating Systems

DISTRIBUTED OPERATING SYSTEMS

Sandeep Kumar Poonia

Page 2: 6.Distributed Operating Systems

CANONICAL PROBLEMS IN DISTRIBUTED SYSTEMS

Time ordering and clock synchronization

Leader election

Mutual exclusion

Distributed transactions

Deadlock detection

Page 3: 6.Distributed Operating Systems

THE IMPORTANCE OF SYNCHRONIZATION

Because the various components of a distributed system must cooperate and exchange information, synchronization is a necessity.

The components must agree on the timing and ordering of events. Imagine a banking system that did not track the timing and ordering of financial transactions; similar chaos would ensue if distributed systems were not synchronized.

Constraints, both implicit and explicit, are therefore enforced to ensure synchronization of components.

Page 4: 6.Distributed Operating Systems

CLOCK SYNCHRONIZATION

As in non-distributed systems, the knowledge of “when events occur” is necessary.

However, clock synchronization is often more difficult in distributed systems because there is no ideal time source, and because distributed algorithms must sometimes be used.

Distributed algorithms must overcome:

Scattering of information

Local, rather than global, decision-making

Page 5: 6.Distributed Operating Systems

CLOCK SYNCHRONIZATION

Time is unambiguous in centralized systems: a single system clock keeps time, and all entities use it.

In distributed systems, each node has its own system clock.

Crystal-based clocks are less accurate (about 1 part in a million).

Problem: an event that occurred after another may be assigned an earlier time.

Page 6: 6.Distributed Operating Systems

LACK OF GLOBAL TIME IN DS

It is impossible to guarantee that physical clocks run at the same frequency.

The lack of global time can cause problems.

Example: UNIX make

Edit output.c at a client; output.o is at a server (compiled at the server).

The client machine's clock can lag behind the server machine's clock.

Page 7: 6.Distributed Operating Systems

LACK OF GLOBAL TIME – EXAMPLE

When each machine has its own clock, an event that occurred after another event may nevertheless be assigned an earlier time.

Page 8: 6.Distributed Operating Systems

LOGICAL CLOCKS

For many problems, internal consistency of clocks is important; absolute time is less important. Use logical clocks.

Key idea:

Clock synchronization need not be absolute.

If two machines do not interact, there is no need to synchronize them.

More importantly, processes need to agree on the order in which events occur rather than the time at which they occurred.

Page 9: 6.Distributed Operating Systems

EVENT ORDERING

Problem: define a total ordering of all events that occur in a system.

Events in a single-processor machine are totally ordered.

In a distributed system:

There is no global clock, and local clocks may be unsynchronized.

Events on different machines cannot be ordered using local times.

Key idea [Lamport]:

Processes exchange messages.

A message must be sent before it is received.

Send/receive events are used to order events (and synchronize clocks).

Page 10: 6.Distributed Operating Systems

LOGICAL CLOCKS

Often, a computer need not know the exact time, only relative time. This is known as "logical time".

Logical time is based not on timing but on the ordering of events.

Logical clocks can only advance forward, never in reverse.

Non-interacting processes need not share a logical clock.

Computers generally obtain logical time using interrupts to update a software clock. The more interrupts (the more frequently time is updated), the higher the overhead.

Page 11: 6.Distributed Operating Systems

LAMPORT’S LOGICAL CLOCK SYNCHRONIZATION ALGORITHM

The most common logical clock synchronization algorithm for distributed systems is Lamport's Algorithm. It is used in situations where ordering is important but global time is not required.

Based on the "happens-before" relation:

Event A "happens-before" Event B (A → B) when all processes involved in a distributed system agree that event A occurred first and B occurred subsequently.

This DOES NOT mean that Event A actually occurred before Event B in absolute clock time.

Page 12: 6.Distributed Operating Systems

LAMPORT'S LOGICAL CLOCK SYNCHRONIZATION ALGORITHM

A distributed system can use the "happens-before" relation when:

Events A and B are observed by the same process, or by multiple processes sharing the same global clock.

Event A is the sending of a message and Event B is its receipt, since a message cannot be received before it is sent.

If two events do not communicate via messages, they are considered concurrent: their order cannot be determined, and it does not matter. Concurrent events can be ignored.

Page 13: 6.Distributed Operating Systems

LAMPORT'S LOGICAL CLOCK SYNCHRONIZATION ALGORITHM (CONT.)

In the previous examples, clock C(a) < C(b).

If the events are concurrent, C(a) = C(b).

Concurrent events can only occur on the same system, because every message transfer between two systems takes at least one clock tick.

In Lamport's Algorithm, logical clock values for events may be changed, but always by moving the clock forward; time values can never be decreased.

An additional refinement is often used: if Events A and B are concurrent, C(a) = C(b), some unique property of the processes associated with these events can be used to choose a winner. This establishes a total ordering of all events. Process ID is often used as the tiebreaker.

Page 14: 6.Distributed Operating Systems

LAMPORT'S LOGICAL CLOCK SYNCHRONIZATION ALGORITHM (CONT.)

Lamport's Algorithm can thus be used in distributed systems to ensure synchronization:

A logical clock is implemented in each node in the system.

Each node can determine the order in which events have occurred from that node's own point of view.

The logical clock of one node need not have any relation to real time or to any other node in the system.

Page 15: 6.Distributed Operating Systems

EVENT ORDERING USING HB

Goal: define the notion of the time of an event such that:

If A → B then C(A) < C(B)

If A and B are concurrent, then C(A) may be <, =, or > C(B)

Solution:

Each processor i maintains a logical clock LCi.

Whenever an event occurs locally at i: LCi = LCi + 1

When i sends a message to j, piggyback LCi.

When j receives a message from i: if LCj < LCi then LCj = LCi + 1, else do nothing.

Claim: this algorithm meets the above goals.
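The clock rules above can be sketched in Python (a minimal illustration; the class and method names are my own, not from the slides):

```python
class LamportClock:
    """Sketch of Lamport's logical clock rules for one process."""

    def __init__(self):
        self.time = 0

    def local_event(self):
        # Rule: increment the clock on every local event.
        self.time += 1
        return self.time

    def send(self):
        # A send is itself an event; the value is piggybacked on the message.
        self.time += 1
        return self.time

    def receive(self, msg_time):
        # Rule: fast-forward past the sender's timestamp, then tick once.
        self.time = max(self.time, msg_time) + 1
        return self.time


# Two processes: i sends a message to j.
i, j = LamportClock(), LamportClock()
ts = i.send()          # i's clock becomes 1; 1 travels with the message
recv = j.receive(ts)   # j's clock becomes max(0, 1) + 1 = 2
```

Note that the receive rule guarantees the receive event is timestamped strictly after the send, which is exactly the "fast-forward" correction shown on the next slide.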

Page 16: 6.Distributed Operating Systems

PROCESSES, EACH WITH ITS OWN CLOCK

At time 6, Process 0 sends message A to Process 1.

It arrives at Process 1 at time 16 (it took 10 ticks to make the journey).

Message B from 1 to 2 takes 16 ticks.

Message C from 2 to 1 leaves at 60 and arrives at 56: not possible.

Message D from 1 to 0 leaves at 64 and arrives at 54: not possible.

Page 17: 6.Distributed Operating Systems

LAMPORT’S ALGORITHM CORRECTS THE CLOCKS

Use the "happened-before" relation.

Each message carries the sending time (as per the sender's clock).

When a message arrives, the receiver fast-forwards its clock to be one more than the sending time. (Between every two events, the clock must tick at least once.)

Page 18: 6.Distributed Operating Systems

PHYSICAL CLOCKS

The instantaneous time difference between two computers' clocks is known as skew; the rate at which a clock diverges from true time is its drift. Clock manufacturers specify a maximum drift rate for their products.

Computer clocks are among the least accurate modern timepieces. Inside every computer is a chip containing a quartz crystal oscillator that keeps time; these crystals are inexpensive to produce.

Average loss of accuracy: about 0.86 seconds per day.

This is unacceptable for distributed systems. Several methods are in use to attempt the synchronization of physical clocks in distributed systems:

Page 19: 6.Distributed Operating Systems

PHYSICAL CLOCKS

17th century: time has been measured astronomically.

Solar day: the interval between two consecutive transits of the sun.

Solar second: 1/86,400th of a solar day.

Page 20: 6.Distributed Operating Systems

PHYSICAL CLOCKS

1948: atomic clocks are invented.

Accurate clocks are atomic oscillators (accurate to about one part in 10^13).

The BIH defines TAI (International Atomic Time).

86,400 TAI seconds is now about 3 msec less than a mean solar day.

The BIH solves the problem by introducing leap seconds whenever the discrepancy between TAI and solar time grows to 800 msec.

The resulting time scale is called Universal Coordinated Time (UTC).

When the BIH announces a leap second, power companies raise their frequency to 61 Hz or 51 Hz for 60 or 50 seconds, respectively, to advance all the clocks in their distribution area.

Page 21: 6.Distributed Operating Systems

PHYSICAL CLOCKS - UTC

Coordinated Universal Time (UTC) is the international time standard.

UTC is the current term for what was commonly referred to as Greenwich Mean Time (GMT).

Zero hours UTC is midnight in Greenwich, England, which lies on the zero longitudinal meridian.

UTC is based on a 24-hour clock.

Page 22: 6.Distributed Operating Systems

PHYSICAL CLOCKS

Most clocks are less accurate (e.g., mechanical watches).

Computers use crystal-based clocks (about one part in a million), which results in clock drift.

How do you tell time?

Use astronomical metrics (the solar day).

Coordinated Universal Time (UTC): the international standard based on atomic time; leap seconds are added to keep it consistent with astronomical time.

UTC is broadcast by radio (satellite and earth stations); receivers are accurate to 0.1 to 10 ms.

Machines must be synchronized with a master or with one another.

Page 23: 6.Distributed Operating Systems

CLOCK SYNCHRONIZATION

Each clock has a maximum drift rate ρ: 1 − ρ ≤ dC/dt ≤ 1 + ρ

Two clocks may drift apart by 2ρt in time t.

To limit the skew between two clocks to δ, resynchronize at least every δ/(2ρ) seconds.

Page 24: 6.Distributed Operating Systems

CRISTIAN'S ALGORITHM

Assuming there is one time server with UTC:

Each node in the distributed system periodically polls the time server.

Given the server's reply t, the current time is estimated as t + (Treq + Treply)/2.

This process is repeated several times and an average is taken.

The machine then attempts to adjust its time.

Disadvantages:

The network delay between the client and the time server must be taken into account.

Single point of failure if the time server fails.

Page 25: 6.Distributed Operating Systems

CRISTIAN’S ALGORITHM

Synchronize machines to a time server with a UTC receiver.

Machine P requests the time from the server every δ seconds.

On receiving time t from the server, P sets its clock to t + treply, where treply is the time taken to send the reply to P.

Use (treq + treply)/2 as an estimate of treply.

Improve accuracy by making a series of measurements.
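As a rough sketch, the estimate t + (Treq + Treply)/2 can be computed from one request/reply round trip (the function name and the sample numbers are illustrative, not from the slides):

```python
def cristian_estimate(server_time, t_request_sent, t_reply_received):
    """Estimate the current time per Cristian's algorithm: the server's
    reported time plus half the measured round-trip delay, which serves
    as the estimate of the one-way reply delay."""
    rtt = t_reply_received - t_request_sent
    return server_time + rtt / 2


# Hypothetical measurement: request sent at local time 100.0, reply
# received at 100.4, server reported 103.1. Estimated one-way delay
# is 0.2 s, so the estimate is about 103.3.
est = cristian_estimate(103.1, 100.0, 100.4)
```

In practice the client repeats this several times and prefers the sample with the smallest round trip, since that reply suffered the least queuing delay.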

Page 26: 6.Distributed Operating Systems

PROBLEM WITH CRISTIAN’S ALGORITHM

Major problem:

Time must never run backward. If the sender's clock is fast, C_UTC will be smaller than the sender's current value of C.

Minor problem:

It takes nonzero time for the time server's reply to arrive, and this delay may be large and vary with network load.

Page 27: 6.Distributed Operating Systems

SOLUTION

Major problem: control the clock.

Suppose the timer is set to generate 100 interrupts/sec. Normally each interrupt adds 10 msec to the time. To slow the clock down, add only 9 msec per interrupt; to advance it, add 11 msec.

Minor problem: measure it.

Make a series of measurements for accuracy, and discard any measurement that exceeds a threshold value. The message that came back fastest can be taken to be the most accurate.
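The tick-based slewing described above can be simulated as follows (a toy model; the 100 interrupts/sec and the 9/10/11 msec increments come from the slide, everything else is illustrative):

```python
def tick_increment(offset_ms, normal_ms=10):
    """Choose the per-interrupt increment: add 9 ms when the local clock
    is ahead of the reference, 11 ms when it is behind, 10 ms otherwise.
    Time only ever moves forward; it is never set backward."""
    if offset_ms > 0:        # local clock ahead: slow down
        return normal_ms - 1
    if offset_ms < 0:        # local clock behind: speed up
        return normal_ms + 1
    return normal_ms


clock_ms = 0
ahead = 50  # hypothetical: local clock is 50 ms ahead of the reference
for _ in range(100):         # one second's worth of interrupts at 100/sec
    step = tick_increment(ahead)
    clock_ms += step
    ahead -= (10 - step)     # each short tick absorbs 1 ms of the excess
# After 50 ticks the 50 ms excess is absorbed; the remaining 50 ticks
# are normal, so the software clock advances 50*9 + 50*10 = 950 ms.
```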

Page 28: 6.Distributed Operating Systems

BERKELEY ALGORITHM

Used in systems without a UTC receiver; keeps clocks synchronized with one another.

One computer is the master, the others are slaves.

The master periodically polls the slaves for their times, averages the times, and returns the differences to the slaves.

Communication delays are compensated for as in Cristian's algorithm.

Failure of the master leads to election of a new master.

Page 29: 6.Distributed Operating Systems

BERKELEY ALGORITHM

a) The time daemon asks all the other machines for their clock values

b) The machines answer

c) The time daemon tells everyone how to adjust their clock

Page 30: 6.Distributed Operating Systems


DECENTRALIZED AVERAGING ALGORITHM

Each machine on the distributed system has a daemon, without UTC.

Periodically, at an agreed-upon fixed time, each machine broadcasts its local time.

Each machine then calculates the correct time by averaging all results.
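A minimal sketch of the averaging step each machine performs (names and sample values are illustrative):

```python
def averaged_time(local_time, broadcast_times):
    """Each machine averages its own clock value with the values it
    heard in the agreed-upon broadcast round."""
    samples = [local_time] + list(broadcast_times)
    return sum(samples) / len(samples)


# Three other machines broadcast 100, 104, and 102; this node reads 98.
# Every machine that heard the same set converges on (98+100+104+102)/4.
new_time = averaged_time(98, [100, 104, 102])
```

A real implementation would also discard outliers and compensate for broadcast propagation delay, much as Cristian's algorithm does.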

Page 31: 6.Distributed Operating Systems


NETWORK TIME PROTOCOL (NTP)

Enables clients across the Internet to be synchronized accurately to UTC, overcoming large and variable message delays.

Employs statistical techniques for filtering, based on the past quality of servers and several other measures.

Can survive lengthy losses of connectivity: redundant servers and redundant paths to servers.

Provides protection against malicious interference through authentication techniques.

Page 32: 6.Distributed Operating Systems


NETWORK TIME PROTOCOL (NTP) (CONT.)

Uses a hierarchy of servers located across the Internet. Primary servers are directly connected to a UTC time source.

Page 33: 6.Distributed Operating Systems


NETWORK TIME PROTOCOL (NTP) (CONT.)

NTP has three modes:

Multicast mode: suitable for user workstations on a LAN. One or more servers periodically multicasts the time to the other machines on the network.

Procedure-call mode: similar to Cristian's algorithm. Provides higher accuracy than multicast mode because delays are compensated.

Symmetric mode: pairs of servers exchange pairs of timing messages that contain timestamps of recent message events. The most accurate, but also the most expensive mode.

Although NTP is quite advanced, there is still a drift of 20-35 milliseconds!

Page 34: 6.Distributed Operating Systems

MORE PROBLEMS

Causality

Vector timestamps

Global state and termination detection

Election algorithms

Page 35: 6.Distributed Operating Systems

LOGICAL CLOCKS

For many DS algorithms, associating an event with an absolute real time is not essential; we only need to know an unambiguous order of events.

Lamport's timestamps

Vector timestamps

Page 36: 6.Distributed Operating Systems

LOGICAL CLOCKS (CONT.)

Synchronization based on "relative time".

"Relative time" may not relate to "real time". Example: Unix make (is output.c updated after the generation of output.o?)

What is important is that the processes in the distributed system agree on the ordering in which certain events occur.

Such "clocks" are referred to as logical clocks.

Page 37: 6.Distributed Operating Systems

EXAMPLE: WHY ORDER MATTERS?

Replicated accounts in Jaipur(JP) and Bikaner(BN)

Two updates occur at the same time

Current balance: $1,000

Update 1: add $100 at BN; Update 2: add 1% interest at JP

Applied in different orders at the two replicas: whoops, inconsistent states!

Page 38: 6.Distributed Operating Systems

LAMPORT ALGORITHM

Clock synchronization does not have to be exact.

Synchronization is not needed if there is no interaction between machines; it is only needed when machines communicate, i.e. they must only agree on the ordering of interacting events.

Page 39: 6.Distributed Operating Systems

LAMPORT'S "HAPPENS-BEFORE" PARTIAL ORDER

Given two events e and e', e < e' if:

1. Same process: e <i e', for some process Pi

2. Same message: e = send(m) and e' = receive(m) for some message m

3. Transitivity: there is an event e* such that e < e* and e* < e'

Page 40: 6.Distributed Operating Systems

CONCURRENT EVENTS

Given two events e and e': if neither e < e' nor e' < e, then e || e'

(Figure: events a-f on processes P1-P3, connected by messages m1 and m2, shown against real time.)

Page 41: 6.Distributed Operating Systems

LAMPORT LOGICAL CLOCKS

Substitute synchronized clocks with a global ordering of events:

ei < ej ⇒ LC(ei) < LC(ej)

LCi is a local clock containing increasing values; each process i has its own LCi.

Increment LCi on each event occurrence.

Within the same process i, if ej occurs before ek, then LCi(ej) < LCi(ek).

If es is a send event and er receives that send, then LCi(es) < LCj(er).

Page 42: 6.Distributed Operating Systems

LAMPORT ALGORITHM

Each process increments its local clock between any two successive events.

Each message contains a timestamp.

Upon receiving a message, if the received timestamp is ahead, the receiver fast-forwards its clock to be one more than the sending time.

Page 43: 6.Distributed Operating Systems

LAMPORT ALGORITHM (CONT.)

Timestamp: each event is given a timestamp t.

If es is a send of message m from pi, then t = LCi(es).

When pj receives m, it sets its LCj value as follows:

If t < LCj, increment LCj by one; the message is regarded as the next event on j.

If t ≥ LCj, set LCj to t + 1.

Page 44: 6.Distributed Operating Systems

LAMPORT’S ALGORITHM ANALYSIS (1)

Claim: ei < ej ⇒ LC(ei) < LC(ej)

Proof: by induction on the length of the sequence of events relating ei and ej.

(Figure: events a-g on processes P1-P3 with messages m1 and m2, annotated with clock values 1 through 5.)

Page 45: 6.Distributed Operating Systems

LAMPORT'S ALGORITHM ANALYSIS (2)

LC(ei) < LC(ej) ⇒ ei < ej?

Claim: if LC(ei) < LC(ej), then it is not necessarily true that ei < ej.

(Figure: the same events a-g with clock values, illustrating a counterexample: concurrent events can still have different clock values.)

Page 46: 6.Distributed Operating Systems

TOTAL ORDERING OF EVENTS

Happens-before is only a partial order.

Make the timestamp of an event e of process Pi be the pair (LC(e), i).

(a,b) < (c,d) iff a < c, or a = c and b < d.
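The (LC, process-ID) comparison can be written out directly (in Python, tuple comparison is already lexicographic, so `(3, 1) < (3, 2)` would behave the same; `before` just spells the rule out):

```python
def before(e1, e2):
    """Total order on (LC, process_id) pairs: compare the Lamport clock
    first, and break ties with the process id."""
    lc1, p1 = e1
    lc2, p2 = e2
    return lc1 < lc2 or (lc1 == lc2 and p1 < p2)


# Smaller clock wins; equal clocks fall back to the lower process id.
assert before((3, 1), (4, 0))
assert before((3, 1), (3, 2))
assert not before((3, 2), (3, 2))   # an event is not before itself
```

Because the process id is unique, any two distinct events are comparable, which is what turns the partial order into a total order.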

Page 47: 6.Distributed Operating Systems

APPLICATION: TOTALLY-ORDERED MULTICASTING

Each message is timestamped with the sender's logical time and multicast (including to the sender itself).

When a message is received:

It is put into a local queue, ordered according to timestamp.

An acknowledgement is multicast.

A message is delivered to applications only when it is at the head of the queue and it has been acknowledged by all involved processes.

Page 48: 6.Distributed Operating Systems

APPLICATION: TOTALLY-ORDERED MULTICASTING

Update 1 is time-stamped and multicast. Added to local queues.

Update 2 is time-stamped and multicast. Added to local queues.

Acknowledgements for Update 2 sent/received. Update 2 can now be processed.

Acknowledgements for Update 1 sent/received. Update 1 can now be processed.

(Note: all queues are the same, as the timestamps have been used to ensure the “happens-before” relation holds.)

Page 49: 6.Distributed Operating Systems

LIMITATION OF LAMPORT’S ALGORITHM

ei < ej ⇒ LC(ei) < LC(ej)

However, LC(ei) < LC(ej) does not imply ei < ej.

For instance, (1,1) < (1,3), but events a and e are concurrent.

(Figure: the same events a-g on processes P1-P3, now timestamped with (LC, process-ID) pairs: (1,1), (2,1), (3,2), (4,2), (5,3), (1,3), (2,3).)

Page 50: 6.Distributed Operating Systems

VECTOR TIMESTAMPS

Pi's clock is a vector VTi[]:

VTi[i] = the number of events Pi has stamped.

VTi[j] = what Pi thinks the number of events Pj has stamped is (i ≠ j).

Page 51: 6.Distributed Operating Systems

VECTOR TIMESTAMPS (CONT.)

Initialization: the vector timestamp for each process is initialized to (0,0,…,0).

Local event: when an event occurs on process Pi, VTi[i] ← VTi[i] + 1.

E.g., on processor 3: (1,2,1,3) → (1,2,2,3).

Page 52: 6.Distributed Operating Systems

VECTOR TIMESTAMPS (CONT.)

Message passing: when Pi sends a message to Pj, the message carries the timestamp t[] = VTi[].

When Pj receives the message, it sets VTj[k] to max(VTj[k], t[k]), for k = 1, 2, …, N.

E.g., P2 receives a message with timestamp (3,2,4) and P2's timestamp is (3,4,3); P2 then adjusts its timestamp to (3,4,4).

Page 53: 6.Distributed Operating Systems

COMPARING VECTORS

VT1 = VT2 iff VT1[i] = VT2[i] for all i

VT1 ≤ VT2 iff VT1[i] ≤ VT2[i] for all i

VT1 < VT2 iff VT1 ≤ VT2 and VT1 ≠ VT2

For instance, (1, 2, 2) < (1, 3, 2).
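The merge and comparison rules above can be sketched as follows (assuming fixed-length Python lists as vectors; the function names are my own):

```python
def vt_merge(vt_local, vt_msg):
    """On message receipt: element-wise maximum of the two vectors."""
    return [max(a, b) for a, b in zip(vt_local, vt_msg)]


def vt_leq(v1, v2):
    """v1 <= v2 iff every component of v1 is <= the same component of v2."""
    return all(a <= b for a, b in zip(v1, v2))


def vt_less(v1, v2):
    """v1 < v2 iff v1 <= v2 element-wise and the vectors differ."""
    return vt_leq(v1, v2) and v1 != v2


# The slide's example: P2 at (3,4,3) receives a message stamped (3,2,4).
merged = vt_merge([3, 4, 3], [3, 2, 4])   # component-wise max: [3, 4, 4]

assert vt_less([1, 2, 2], [1, 3, 2])
assert not vt_less([1, 3, 2], [1, 2, 2])
```

When neither `vt_less(v1, v2)` nor `vt_less(v2, v1)` holds (and the vectors differ), the two events are concurrent; this is exactly the property Lamport scalar clocks cannot express.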

Page 54: 6.Distributed Operating Systems

VECTOR TIMESTAMP ANALYSIS

Claim: e < e' iff e.VT < e'.VT

(Figure: events a-g on processes P1-P3 with vector timestamps [1,0,0], [2,0,0], [2,1,0], [2,2,0], [2,2,3], [0,0,1], [0,0,2].)

Page 55: 6.Distributed Operating Systems

APPLICATION: CAUSALLY-ORDERED MULTICASTING

For causally ordered delivery, we also need:

Multicast messages (reliable, but they may arrive out of order).

Vi[i] is only incremented when sending.

When k gets a message from j with timestamp ts, the message is buffered until:

1: ts[j] = Vk[j] + 1 (this is the next timestamp that k is expecting from j)

2: ts[i] ≤ Vk[i] for all i ≠ j (k has seen all messages that j had seen when j sent the message)
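The two buffering conditions can be sketched as a single predicate (using 0-indexed Python lists as vectors; the names are illustrative):

```python
def can_deliver(ts, vk, j):
    """Process k may deliver a message from process j with vector
    timestamp ts iff:
      1) ts[j] == vk[j] + 1          (next message expected from j), and
      2) ts[i] <= vk[i] for i != j   (k has seen everything j had seen).
    Otherwise the message must stay buffered."""
    if ts[j] != vk[j] + 1:
        return False
    return all(ts[i] <= vk[i] for i in range(len(ts)) if i != j)


# P3's reply r carries ts=[1,0,1], but P2 still has V2=[0,0,0]:
# condition 2 fails (P2 has not yet seen P1's post), so r is buffered.
assert not can_deliver([1, 0, 1], [0, 0, 0], j=2)

# After P1's post [1,0,0] is delivered, V2=[1,0,0] and r can be delivered.
assert can_deliver([1, 0, 1], [1, 0, 0], j=2)
```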

Page 56: 6.Distributed Operating Systems

CAUSALLY-ORDERED MULTICASTING

(Figure: P1 posts message a with timestamp [1,0,0]; P3 replies r with [1,0,1]. Message a arrives at P2 before the reply r from P3 does, so both can be delivered in causal order.)

Page 57: 6.Distributed Operating Systems

CAUSALLY-ORDERED MULTICASTING (CONT.)

(Figure: here message a arrives at P2 after the reply r=[1,0,1] from P3. The reply is not delivered right away but buffered; once a=[1,0,0] arrives and is delivered, r is delivered.)

Page 58: 6.Distributed Operating Systems

ORDERED COMMUNICATION

Totally ordered multicast

Use Lamport timestamps

Causally ordered multicast

Use vector timestamps

Page 59: 6.Distributed Operating Systems

VECTOR CLOCKS

Each process i maintains a vector Vi:

Vi[i]: the number of events that have occurred at i.

Vi[j]: the number of events i knows have occurred at process j.

Update vector clocks as follows:

Local event: increment Vi[i].

Send a message: piggyback the entire vector V.

Receipt of a message at j from i: Vj[k] = max(Vj[k], Vi[k]); the receiver is told how many events the sender knows occurred at each other process k. Also Vj[j] = Vj[j] + 1, since the receipt is itself an event at j.

Page 60: 6.Distributed Operating Systems

GLOBAL STATE

Global state of a distributed system:

The local state of each process.

Messages sent but not received (the state of the channels/queues).

Many applications need to know the state of the system: failure recovery, distributed deadlock detection.

Problem: how can you figure out the state of a distributed system, when each process is independent and there is no global clock or synchronization?

Distributed snapshot: a consistent global state.

Page 61: 6.Distributed Operating Systems

GLOBAL STATE (1)

a) A consistent cut

b) An inconsistent cut

Page 62: 6.Distributed Operating Systems

DISTRIBUTED SNAPSHOT ALGORITHM

Assume each process communicates with other processes over unidirectional point-to-point channels (e.g., TCP connections).

Any process can initiate the algorithm:

Checkpoint its local state.

Send a marker on every outgoing channel.

On receiving a marker:

If it is the first marker, checkpoint the local state and send markers on all outgoing channels; save incoming messages on all other channels until a subsequent marker arrives on each channel, at which point stop saving messages for that channel.

Page 63: 6.Distributed Operating Systems

DISTRIBUTED SNAPSHOT

A process finishes when it has received a marker on each incoming channel and processed them all.

Its recorded state: the local state plus the state of all incoming channels. The state is sent to the initiator.

Any process can initiate a snapshot, and multiple snapshots may be in progress. Each is separate, distinguished by tagging the marker with the initiator ID (and a sequence number).

Page 64: 6.Distributed Operating Systems

SNAPSHOT ALGORITHM EXAMPLE

a) Organization of a process and channels for a distributed

snapshot

Page 65: 6.Distributed Operating Systems

SNAPSHOT ALGORITHM EXAMPLE

b) Process Q receives a marker for the first time and records its local state

c) Q records all incoming messages

d) Q receives a marker on its incoming channel and finishes recording the state of the incoming channel

Page 66: 6.Distributed Operating Systems

TERMINATION DETECTION

Detecting the end of a distributed computation.

Notation: let the sender be the predecessor and the receiver the successor.

Two types of markers: Done and Continue.

After finishing its part of the snapshot, process Q sends a Done or a Continue to its predecessor.

Q sends a Done only when all of Q's successors sent a Done, and Q has not received any message since it checkpointed its local state and received a marker on all incoming channels. Otherwise it sends a Continue.

The computation has terminated if the initiator receives Done messages from everyone.

Page 67: 6.Distributed Operating Systems

DISTRIBUTED SYNCHRONIZATION

A distributed system with multiple processes may need to share data or access shared data structures: use critical sections with mutual exclusion.

A single process with multiple threads can use semaphores, locks, or monitors. How do you do this for multiple processes in a distributed system, where processes may be running on different machines?

Solution: a lock mechanism for a distributed environment, which can be centralized or distributed.

Page 68: 6.Distributed Operating Systems

CENTRALIZED MUTUAL EXCLUSION

Assume processes are numbered.

One process is elected coordinator (the highest-ID process).

Every process checks with the coordinator before entering the critical section:

To obtain exclusive access: send a request and await the reply.

To release: send a release message.

The coordinator:

On receiving a request: if the section is available and the queue is empty, send a grant; if not, queue the request.

On receiving a release: remove the next request from the queue and send a grant.
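A sketch of the coordinator's logic under these rules (a single-threaded toy model; a real implementation would exchange messages rather than method calls):

```python
from collections import deque


class Coordinator:
    """Centralized lock server: grant if free, queue otherwise."""

    def __init__(self):
        self.holder = None       # process currently in the critical section
        self.queue = deque()     # waiting requesters, FIFO (fairness)

    def request(self, pid):
        # Grant immediately if the section is free, else queue the request.
        if self.holder is None:
            self.holder = pid
            return "grant"
        self.queue.append(pid)
        return None              # no reply: the requester blocks

    def release(self, pid):
        assert pid == self.holder
        # Hand the lock to the next queued requester, if any.
        self.holder = self.queue.popleft() if self.queue else None
        return self.holder       # pid of the newly granted process, or None


c = Coordinator()
assert c.request(1) == "grant"   # P1 enters the critical section
assert c.request(2) is None      # P2 blocks, queued behind P1
assert c.release(1) == 2         # P1 exits; the grant goes to P2
```

The FIFO queue is what makes the scheme fair, and the three interactions (request, grant, release) match the three messages per critical-section use noted on the next slide.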

Page 69: 6.Distributed Operating Systems

MUTUAL EXCLUSION: A CENTRALIZED ALGORITHM

a) Process 1 asks the coordinator for permission to enter a critical region. Permission is granted.

b) Process 2 then asks permission to enter the same critical region. The coordinator does not reply.

c) When process 1 exits the critical region, it tells the coordinator, which then replies to 2.

Page 70: 6.Distributed Operating Systems

PROPERTIES

Simulates a centralized lock using blocking calls.

Fair: requests are granted the lock in the order they were received.

Simple: three messages per use of a critical section (request, grant, release).

Shortcomings:

Single point of failure. How do you detect a dead coordinator? A process cannot distinguish "lock in use" from a dead coordinator, since there is no response from the coordinator in either case.

Performance bottleneck in large distributed systems.

Page 71: 6.Distributed Operating Systems

DISTRIBUTED ALGORITHM

[Ricart and Agrawala]: needs 2(n−1) messages; based on event ordering and timestamps.

Process k enters a critical section as follows:

Generate a new timestamp TSk = TSk + 1.

Send request(k, TSk) to all other n−1 processes.

Wait until reply(j) is received from all other processes, then enter the critical section.

Upon receiving a request message, process j:

Sends a reply if there is no contention.

If already in the critical section, does not reply; queues the request.

If it also wants to enter, compares TSj with TSk and sends a reply if TSk < TSj; otherwise queues the request.
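The reply decision at process j can be sketched as follows (a simplification: ties on equal timestamps are broken by process id, as in the total ordering discussed earlier; the state names are my own):

```python
def on_request(state, my_ts, my_id, req_ts, req_id):
    """Decide how process j handles an incoming request (sketch).
    state: 'released' (not interested), 'wanted', or 'held'."""
    if state == "released":
        return "reply"           # no contention: reply immediately
    if state == "held":
        return "queue"           # in the critical section: defer
    # Both want the lock: the lower (timestamp, id) pair wins the tie.
    if (req_ts, req_id) < (my_ts, my_id):
        return "reply"           # requester is earlier: let it go first
    return "queue"               # we are earlier: defer the reply


assert on_request("released", None, 2, 8, 1) == "reply"
assert on_request("held", 5, 2, 8, 1) == "queue"
assert on_request("wanted", 9, 2, 8, 1) == "reply"   # requester is earlier
assert on_request("wanted", 5, 2, 8, 1) == "queue"   # we are earlier
```

Queued replies are sent when the process leaves its critical section, which is what lets every requester eventually collect all n−1 replies.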

Page 72: 6.Distributed Operating Systems

A DISTRIBUTED ALGORITHM

a) Two processes want to enter the same critical region at the same moment.

b) Process 0 has the lowest timestamp, so it wins.

c) When process 0 is done, it sends an OK also, so 2 can now enter the critical region.

Page 73: 6.Distributed Operating Systems

PROPERTIES

Fully decentralized, but with n points of failure!

All processes are involved in all decisions, so any overloaded process can become a bottleneck.

Page 74: 6.Distributed Operating Systems

ELECTION ALGORITHMS

Many distributed algorithms need one process to act as coordinator; it doesn't matter which process does the job, we just need to pick one.

Election algorithms: techniques to pick a unique coordinator (aka leader election).

Examples: taking over the role of a failed process; picking a master in the Berkeley clock synchronization algorithm.

Types of election algorithms: the Bully and Ring algorithms.

Page 75: 6.Distributed Operating Systems

BULLY ALGORITHM

Each process has a unique numerical ID.

Processes know the IDs and addresses of every other process.

Communication is assumed reliable.

Key idea: select the process with the highest ID.

A process initiates an election if it has just recovered from failure or if the coordinator has failed.

Three message types: Election, OK, I won.

Several processes can initiate an election simultaneously; we need a consistent result.

O(n²) messages required with n processes.

Page 76: 6.Distributed Operating Systems

BULLY ALGORITHM DETAILS

Any process P can initiate an election.

P sends Election messages to all processes with higher IDs and awaits OK messages.

If no OK messages arrive, P becomes coordinator and sends I won messages to all processes with lower IDs.

If it receives an OK, it drops out and waits for an I won.

If a process receives an Election message, it returns an OK and starts an election of its own.

If a process receives an I won, it treats the sender as coordinator.
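A toy sketch of one election round from a single process's viewpoint, assuming it can already tell which processes answer (a real implementation exchanges Election/OK/I won messages and uses timeouts to detect silence):

```python
def initiate_election(my_id, alive_ids):
    """One bully-algorithm round as seen by my_id.
    alive_ids: the set of other processes that would answer a message
    (a stand-in for 'did not time out')."""
    higher = [p for p in alive_ids if p > my_id]
    if not higher:
        # No higher-ID process answers: no OK will arrive,
        # so my_id wins and announces "I won" to the lower IDs.
        return ("i_won", my_id)
    # Some higher process replies OK: drop out and wait. Since every
    # higher process repeats this logic, the highest alive ID wins.
    return ("coordinator_is", max(higher))


assert initiate_election(7, [1, 3, 5]) == ("i_won", 7)
assert initiate_election(4, [1, 5, 6]) == ("coordinator_is", 6)
```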

Page 77: 6.Distributed Operating Systems

BULLY ALGORITHM EXAMPLE

The bully election algorithm:

a) Process 4 holds an election

b) Processes 5 and 6 respond, telling 4 to stop

c) Now 5 and 6 each hold an election

Page 78: 6.Distributed Operating Systems

BULLY ALGORITHM EXAMPLE

d) Process 6 tells 5 to stop

e) Process 6 wins and tells everyone

Page 79: 6.Distributed Operating Systems

LAST CLASS

Vector timestamps

Global state

Distributed Snapshot

Election algorithms

Page 80: 6.Distributed Operating Systems

TODAY: STILL MORE CANONICAL PROBLEMS

Election algorithms: the Bully algorithm and the Ring algorithm

Distributed synchronization and mutual exclusion

Distributed transactions

Page 81: 6.Distributed Operating Systems

ELECTION ALGORITHMS

Many distributed algorithms need one process to act as coordinator; it doesn't matter which process does the job, we just need to pick one.

Election algorithms: techniques to pick a unique coordinator (aka leader election).

Examples: taking over the role of a failed process; picking a master in the Berkeley clock synchronization algorithm.

Types of election algorithms: the Bully and Ring algorithms.

Page 82: 6.Distributed Operating Systems

BULLY ALGORITHM

Each process has a unique numerical ID.

Processes know the IDs and addresses of every other process.

Communication is assumed reliable.

Key idea: select the process with the highest ID.

A process initiates an election if it has just recovered from failure or if the coordinator has failed.

Three message types: Election, OK, I won.

Several processes can initiate an election simultaneously; we need a consistent result.

Page 83: 6.Distributed Operating Systems

BULLY ALGORITHM DETAILS

Any process P can initiate an election.

P sends Election messages to all processes with higher IDs and awaits OK messages.

If no OK messages arrive, P becomes coordinator and sends I won messages to all processes with lower IDs.

If it receives an OK, it drops out and waits for an I won.

If a process receives an Election message, it returns an OK and starts an election of its own.

If a process receives an I won, it treats the sender as coordinator.

Page 84: 6.Distributed Operating Systems

BULLY ALGORITHM EXAMPLE

The bully election algorithm:

a) Process 4 holds an election

b) Processes 5 and 6 respond, telling 4 to stop

c) Now 5 and 6 each hold an election

Page 85: 6.Distributed Operating Systems

BULLY ALGORITHM EXAMPLE

d) Process 6 tells 5 to stop

e) Process 6 wins and tells everyone

Page 86: 6.Distributed Operating Systems

RING-BASED ELECTION

Processes have unique IDs and are arranged in a logical ring; each process knows its neighbors.

Select the process with the highest ID:

Begin an election if just recovered or if the coordinator has failed.

Send Election to the closest downstream node that is alive (sequentially poll each successor until a live node is found).

Each process tags its ID onto the message.

The initiator picks the node with the highest ID and sends a Coordinator message around.

Multiple elections can be in progress; this wastes network bandwidth but does no harm.

Page 87: 6.Distributed Operating Systems

A RING ALGORITHM

Election algorithm using a ring.

Page 88: 6.Distributed Operating Systems

COMPARISON

Assume n processes and one election in progress.

Bully algorithm:

Worst case: the initiator is the node with the lowest ID, which triggers n−2 elections at higher-ranked nodes: O(n²) messages.

Best case: immediate election: n−2 messages.

Ring: 2(n−1) messages, always.

Page 89: 6.Distributed Operating Systems

A TOKEN RING ALGORITHM

a) An unordered group of processes on a network.

b) A logical ring constructed in software.

Use a token to arbitrate access to the critical section: a process must wait for the token before entering the CS, and passes the token to its neighbor once done (or if not interested).

Detecting token loss is non-trivial.

Page 90: 6.Distributed Operating Systems

COMPARISON

A comparison of three mutual exclusion algorithms:

Algorithm   | Messages per entry/exit | Delay before entry (in message times) | Problems
Centralized | 3                       | 2                                     | Coordinator crash
Distributed | 2(n−1)                  | 2(n−1)                                | Crash of any process
Token ring  | 1 to ∞                  | 0 to n−1                              | Lost token, process crash

Page 91: 6.Distributed Operating Systems

TRANSACTIONS

Transactions provide a higher-level mechanism for atomicity of processing in distributed systems. They have their origins in databases.

Banking example: three accounts, A: $100, B: $200, C: $300.

Client 1: transfer $4 from A to B. Client 2: transfer $3 from C to B.

The result can be inconsistent unless certain properties are imposed on the accesses:

Client 1       | Client 2
Read A: $100   |
Write A: $96   |
               | Read C: $300
               | Write C: $297
Read B: $200   |
               | Read B: $200
               | Write B: $203
Write B: $204  |

Both clients read B as $200, so the final value $204 loses Client 2's update: the $3 taken from C disappears.

Page 92: 6.Distributed Operating Systems

ACID PROPERTIES

Atomic: all or nothing.

Consistent: a transaction takes the system from one consistent state to another.

Isolated: immediate effects are not visible to others (serializable).

Durable: changes are permanent once the transaction completes (commits).

A serializable execution of the same example:

Client 1       | Client 2
Read A: $100   |
Write A: $96   |
Read B: $200   |
Write B: $204  |
               | Read C: $300
               | Write C: $297
               | Read B: $204
               | Write B: $207