CS9222 Advanced Operating Systems
TRANSCRIPT
Unit – IV
Dr. A. Kathirvel
Professor & Head/IT, VCEW
Basic Concepts – Classification of Failures – Basic Approaches to Recovery; Recovery in Concurrent Systems; Synchronous and Asynchronous Checkpointing and Recovery; Checkpointing in Distributed Database Systems; Fault Tolerance; Issues – Two-phase and Nonblocking Commit Protocols; Voting Protocols; Dynamic Voting Protocols
Recovery
Recovery in computer systems refers to restoring a system to its normal operational state.
Recovery may be as simple as restarting a failed computer or restarting failed processes.
Recovery is generally a very complicated process.
For example, a process has memory allocated to it and may have locked shared resources, such as files and memory. Under such circumstances, if the process fails, it is imperative that the resources allocated to it are reclaimed and its partial modifications undone.
Recovery
Computer system recovery:
Restore the system to a normal operational state
Process recovery:
Reclaim resources allocated to process,
Undo modification made to databases, and
Restart the process
Or restart process from point of failure and resume execution
Distributed process recovery (cooperating processes):
Undo effect of interactions of failed process with other cooperating processes.
Replication (hardware components, processes, data):
Main method for increasing system availability
System:
Set of hardware and software components
Designed to provide a specified service (i.e., meet a set of requirements)
System failure: – The system does not meet its requirements, i.e., does not perform its services as specified
Erroneous system state: – A state that could lead to a system failure by a sequence of valid state transitions
– Error: the part of the system state which differs from its intended value
Fault: – An anomalous physical condition, e.g., design errors, manufacturing problems, damage, external disturbances
Recovery (cont.)
Error could lead to system failure
Error is a manifestation of a fault
Process failure: Behavior: the process causes the system state to deviate from its specification (e.g., incorrect computation, process stops execution)
Errors causing process failure: protection violation, deadlocks, timeouts, wrong user input, etc.
Recovery: Abort process or
Restart process from prior state
System failure: Behavior: processor fails to execute
Caused by software errors or hardware faults (CPU/memory/bus/…/ failure)
Recovery: system stopped and restarted in correct state
Assumption: fail-stop processors, i.e. system stops execution, internal state is lost
Secondary Storage Failure: Behavior: stored data cannot be accessed
Errors causing failure: parity error, head crash, etc.
Recovery/Design strategies: Reconstruct content from archive + log of activities
Design mirrored disk system
Communication Medium Failure: Behavior: a site cannot communicate with another operational site
Errors/Faults: failure of switching nodes or communication links
Recovery/Design Strategies: reroute, error-resistant communication protocols
Classification of failures
Failure recovery: restore an erroneous state to an error-free state
Approaches to failure recovery:
Forward-error recovery:
Remove errors in process/system state (if errors can be completely assessed)
Continue process/system forward execution
Backward-error recovery:
Restore process/system to previous error-free state and restart from there
Comparison: Forward vs. Backward error recovery
Backward-error recovery
(+) Simple to implement
(+) Can be used as general recovery mechanism
(-) Performance penalty
(-) No guarantee that fault does not occur again
(-) Some components cannot be recovered
Forward-error Recovery
(+) Less overhead
(-) Limited use, i.e. only when impact of faults understood
(-) Cannot be used as general mechanism for error recovery
Backward and Forward Error Recovery
Principle: restore process/system to a known, error-free “recovery point”/ “checkpoint”.
System model:
Approaches: (1) Operation-based approach
(2) State-based approach
Backward-Error Recovery: Basic approach
[Diagram: the CPU and main memory interact with secondary storage (objects are brought into main memory to be accessed and written back if modified) and with stable storage, which maintains information in the event of system failure and stores logs and recovery points.]
Principle: record all changes made to the state of a process (an ‘audit trail’ or ‘log’) such that the process can be returned to a previous state
Example: a transaction-based environment where transactions update a database; it is possible to commit or undo updates on a per-transaction basis
A commit indicates that the transaction on the object was successful and changes are permanent
(1.a) Updating-in-place
Principle: every update (write) operation to an object creates a log record in stable storage that can be used to ‘undo’ and ‘redo’ the operation
Log content: object name, old object state, new object state
Implementation of a recoverable update operation:
Do operation: update object and write log record
Undo operation: log(old) -> object (undoes the action performed by a do)
Redo operation: log(new) -> object (redoes the action performed by a do)
Display operation: display log record (optional)
Problem: a ‘do’ cannot be recovered if system crashes after write object but before log record write
(1.b) The write-ahead-log protocol
Principle: write the log record before updating the object
(1) The Operation-based Approach
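The do/undo/redo operations above can be sketched as follows. The dictionary-backed "stable storage" and all class and variable names are illustrative, not part of the original protocol description.

```python
# A minimal sketch of the write-ahead log protocol: the log record is
# appended BEFORE the object is updated, so a crash between the two
# steps can always be undone or redone from the log.

class WriteAheadLog:
    def __init__(self):
        self.objects = {}   # stands in for the database of named objects
        self.log = []       # stable log: (name, old_state, new_state)

    def do(self, name, new_state):
        # Write-ahead rule: log first, then update the object.
        old_state = self.objects.get(name)
        self.log.append((name, old_state, new_state))
        self.objects[name] = new_state

    def undo(self):
        # log(old) -> object: undoes the action performed by a do
        name, old_state, _ = self.log.pop()
        self.objects[name] = old_state

    def redo(self, record):
        # log(new) -> object: redoes the action performed by a do
        name, _, new_state = record
        self.objects[name] = new_state

wal = WriteAheadLog()
wal.do("x", 1)
wal.do("x", 2)
wal.undo()          # x is restored to 1 using the logged old state
```

The same log record drives both directions: `undo` replays the old state, `redo` replays the new one, which is what makes recovery after a crash possible.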
Principle: establish frequent ‘recovery points’ or ‘checkpoints’ saving the entire state of process
Actions: ‘Checkpointing’ or ‘taking a checkpoint’: saving process state
‘Rolling back’ a process: restoring a process to a prior state
Note: A process should be rolled back to the most recent ‘recovery point’ to minimize the overhead and delays in the completion of the process
Shadow pages: a special case of the state-based approach; only part of the system state is saved, to minimize recovery overhead
When an object is modified, page containing object is first copied on stable storage (shadow page)
If process successfully commits: shadow page discarded and modified page is made part of the database
If process fails: shadow page used and the modified page discarded
(2) State-based Approach
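The shadow-page behavior described above can be sketched as follows; the in-memory dicts standing in for the database and stable storage, and all names, are illustrative.

```python
# Sketch of the shadow-page technique: before a page is modified for
# the first time, a copy (the shadow page) is saved on stable storage.
# Commit discards the shadows; abort restores them.

class ShadowPagedStore:
    def __init__(self, pages):
        self.pages = dict(pages)   # current (possibly modified) pages
        self.shadows = {}          # stable copies of pages touched so far

    def write(self, page_id, data):
        # First modification of a page saves its shadow copy.
        if page_id not in self.shadows:
            self.shadows[page_id] = self.pages[page_id]
        self.pages[page_id] = data

    def commit(self):
        # Success: shadow pages discarded, modified pages become permanent.
        self.shadows.clear()

    def abort(self):
        # Failure: shadow pages replace the modified pages.
        self.pages.update(self.shadows)
        self.shadows.clear()

store = ShadowPagedStore({"p1": "old"})
store.write("p1", "new")
store.abort()    # p1 is restored from its shadow page
```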
Recovery in concurrent systems
Issue: if one of a set of cooperating processes fails and has to be rolled back to a recovery point, all processes it communicated with since the recovery point have to be rolled back.
Conclusion: In concurrent and/or distributed systems all cooperating processes have to establish recovery points
Orphan messages and the domino effect
Case 1: failure of X after x3 : no impact on Y or Z
Case 2: failure of Y after sending message ‘m’: Y is rolled back to y2; ‘m’ becomes an orphan message, so X is rolled back to x2
Case 3: failure of Z after z2: Y has to roll back to y1, X has to roll back to x1, and Z has to roll back to z1 (the domino effect)
[Diagram: timelines of processes X, Y, Z with checkpoints x1, x2, x3; y1, y2; z1, z2, and a message ‘m’ sent by Y to X after y2.]
Lost messages
• Assume that x1 and y1 are the only recovery points for processes X and Y, respectively
• Assume Y fails after receiving message ‘m’
• Y rolled back to y1, X rolled back to x1
• Message ‘m’ is lost
Note: there is no distinction between this case and the case where message ‘m’ is lost in communication channel and processes X and Y are in states x1 and y1, respectively
[Diagram: timelines of X and Y with checkpoints x1 and y1; Y fails after receiving message ‘m’ from X.]
Problem of livelock
• Livelock: case where a single failure can cause an infinite number of rollbacks
• Process Y fails before receiving message ‘n1’ sent by X
• Y rolled back to y1, no record of sending message ‘m1’, causing X to roll back to x1
• When Y restarts, sends out ‘m2’ and receives ‘n1’ (delayed)
• When X restarts from x1, sends out ‘n2’ and receives ‘m2’
• Y has to roll back again, since there is no record of ‘n1’ being sent
• This causes X to be rolled back again, since it has received ‘m2’ and there is no record of ‘m2’ being sent in Y
• The above sequence can repeat indefinitely
[Diagrams (a) and (b): timelines of X and Y with checkpoints x1 and y1, showing messages n1 and m1 around the failure, and n2 and m2 after the second rollback.]
Consistent set of checkpoints
• Checkpointing in distributed systems requires that all processes (sites) that interact with one another establish periodic checkpoints
• All the sites save their local states: local checkpoints
• All the local checkpoints, one from each site, collectively form a global checkpoint
• The domino effect is caused by orphan messages, which in turn are caused by rollbacks
1. Strongly consistent set of checkpoints
– Establish a set of local checkpoints (one for each process in the set) such that no information flow takes place (i.e., no orphan messages) during the interval spanned by the checkpoints
2. Consistent set of checkpoints
– Similar to the consistent global state
– Each message that is received in a checkpoint (state) should also be recorded as sent in another checkpoint (state)
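The consistency condition in item 2 can be checked mechanically: every message recorded as received in some local checkpoint must also be recorded as sent in another. The representation of checkpoints as sent/received message-id sets is an illustrative assumption.

```python
# Check whether a set of local checkpoints forms a consistent global
# checkpoint: no "orphan" messages, i.e. nothing recorded as received
# that is not recorded as sent.

def is_consistent(checkpoints):
    """checkpoints: {process: {"sent": set_of_msg_ids, "rcvd": set_of_msg_ids}}"""
    all_sent = set().union(*(c["sent"] for c in checkpoints.values()))
    all_rcvd = set().union(*(c["rcvd"] for c in checkpoints.values()))
    # Consistent iff every received message was recorded as sent somewhere.
    return all_rcvd <= all_sent

# Y's checkpoint records receiving 'm', and X's records sending it: consistent.
cps = {
    "X": {"sent": {"m"}, "rcvd": set()},
    "Y": {"sent": set(), "rcvd": {"m"}},
}

# Here 'm' is an orphan: received by X but recorded as sent by no one.
cps_orphan = {
    "X": {"sent": set(), "rcvd": {"m"}},
    "Y": {"sent": set(), "rcvd": set()},
}
```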
Consistency of Checkpoint
• Strongly consistent set of checkpoints
no messages penetrating the set
• Consistent set of checkpoints
no messages penetrating the set backward
[Diagram: checkpoints x1, y1, z1 form a strongly consistent set (no messages penetrate the set), while checkpoints x2, y2, z2 form a consistent set, which may need to deal with lost messages.]
Checkpoint/Recovery Algorithm
• Synchronous
– with global synchronization at checkpointing
• Asynchronous
– without global synchronization at checkpointing
Preliminary (Assumption)
Goal
To make a consistent global checkpoint
Assumptions
– Communication channels are FIFO
– No partition of the network
– End-to-end protocols cope with message loss due to rollback recovery and communication failure
– No failure during the execution of the algorithm
~Synchronous Checkpoint~
Preliminary (Two types of checkpoint)
tentative checkpoint: – a temporary checkpoint; a candidate for a permanent checkpoint
permanent checkpoint: – a local checkpoint at a process; a part of a consistent global checkpoint
~Synchronous Checkpoint~
Checkpoint Algorithm
Algorithm
1. An initiating process (a single process that invokes this algorithm) takes a tentative checkpoint
2. It requests all the processes to take tentative checkpoints
3. It waits to hear from all the processes whether taking a tentative checkpoint succeeded
4. If it learns that all the processes succeeded, it decides all tentative checkpoints should be made permanent; otherwise, they should be discarded
5. It informs all the processes of the decision
6. The processes that receive the decision act accordingly
Supplement
Once a process has taken a tentative checkpoint, it must not send messages until it is informed of the initiator's decision.
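The six-step exchange above can be sketched sequentially, with message passing replaced by direct method calls. The `Process` class, the `can_checkpoint` flag (simulating step-3 success or failure), and all other names are illustrative.

```python
# Sketch of the synchronous checkpoint algorithm: the initiator takes a
# tentative checkpoint, asks the others to do the same, and all tentative
# checkpoints become permanent only if every process succeeded.

class Process:
    def __init__(self, name, can_checkpoint=True):
        self.name = name
        self.can_checkpoint = can_checkpoint   # simulates success/failure
        self.tentative = None
        self.permanent = None

    def take_tentative(self, state):
        if self.can_checkpoint:
            self.tentative = state
        return self.can_checkpoint             # reply to the initiator

    def decide(self, commit):
        if commit:
            self.permanent = self.tentative    # make it permanent
        self.tentative = None                  # otherwise it is discarded

def checkpoint(initiator, cohorts, state):
    # Steps 1-3: initiator checkpoints, requests the others, collects replies.
    ok = initiator.take_tentative(state)
    # list comprehension so every cohort is asked (no short-circuiting)
    ok = all([p.take_tentative(state) for p in cohorts]) and ok
    # Steps 4-6: commit all tentative checkpoints only if all succeeded.
    for p in [initiator] + list(cohorts):
        p.decide(ok)
    return ok

procs = [Process(n) for n in "ABC"]
committed = checkpoint(procs[0], procs[1:], state="s1")
```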
~Synchronous Checkpoint~
Diagram of Checkpoint Algorithm
[Diagram: the initiator takes a tentative checkpoint and sends ‘request to take a tentative checkpoint’ to the other processes; each takes a tentative checkpoint and replies OK; on ‘decide to commit’ the tentative checkpoints become permanent, forming a consistent global checkpoint. One process's checkpoint is shown as unnecessary.]
~Synchronous Checkpoint~
Optimized Algorithm
Each message is labeled by order of sending
Labeling scheme:
⊥ : smallest label
т : largest label
last_label_rcvdX[Y] : the label of the last message that X received from Y after X took its last permanent or tentative checkpoint; if no such message exists, it holds ⊥
first_label_sentX[Y] : the label of the first message that X sent to Y after X took its last permanent or tentative checkpoint; if no such message exists, it holds ⊥
ckpt_cohortX : the set of all processes that may have to take checkpoints when X decides to take a checkpoint
~Synchronous Checkpoint~
[Diagram: X and Y with checkpoints x2, x3 and y1, y2, exchanging labelled messages.]
A checkpoint request needs to be sent only to the processes included in ckpt_cohort
Optimized Algorithm
ckpt_cohortX : { Y | last_label_rcvdX[Y] > ⊥ }
Y takes a tentative checkpoint only if
last_label_rcvdX[Y] >= first_label_sentY[X] > ⊥
~Synchronous Checkpoint~
[Diagram: a message from Y to X after both checkpoints, illustrating last_label_rcvdX[Y] and first_label_sentY[X].]
Optimized Algorithm
Algorithm
1. An initiating process takes a tentative checkpoint
2. It requests each p ∈ ckpt_cohort to take a tentative checkpoint (this message includes the sender's last_label_rcvd[receiver])
3. If the processes that receive the request need to take a checkpoint, they do the same as steps 1–2; otherwise, they return OK messages
4. It waits to receive OK from every p ∈ ckpt_cohort
5. If the initiator learns that all the processes succeeded, it decides all tentative checkpoints should be made permanent; otherwise, they should be discarded
6. It informs each p ∈ ckpt_cohort of the decision
7. The processes that receive the decision act accordingly
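The cohort computation and the checkpoint condition above can be written directly, with ⊥ represented as 0 and message labels as positive integers; the variable names follow the slides, and the data values are illustrative.

```python
# ckpt_cohort and the optimized checkpoint condition from the slides.

BOTTOM = 0   # stands in for ⊥, the smallest label

def ckpt_cohort(last_label_rcvd):
    # ckpt_cohortX = { Y | last_label_rcvdX[Y] > ⊥ }
    return {y for y, label in last_label_rcvd.items() if label > BOTTOM}

def must_take_checkpoint(last_label_rcvd_x_of_y, first_label_sent_y_to_x):
    # Y takes a tentative checkpoint only if
    # last_label_rcvdX[Y] >= first_label_sentY[X] > ⊥
    return last_label_rcvd_x_of_y >= first_label_sent_y_to_x > BOTTOM

# X received a message labelled 2 from Y and nothing from Z.
rcvd = {"Y": 2, "Z": 0}
```

The chained comparison mirrors the slide's condition exactly: both inequalities must hold, so a process that sent nothing since its last checkpoint (label ⊥) never has to checkpoint.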
~Synchronous Checkpoint~
Diagram of Optimized Algorithm
[Diagram: four processes A, B, C, D exchange labelled messages (ab1, ac1, ac2, ba1, ba2, bd1, cb1, cb2, cd1, ca2, dc1, dc2). The initiator takes a tentative checkpoint and sends checkpoint requests only to ckpt_cohortX = { Y | last_label_rcvdX[Y] > ⊥ }. Each process in the cohort takes a tentative checkpoint only if last_label_rcvdX[Y] >= first_label_sentY[X] > ⊥; in the example, 2 >= 1 > 0 and 2 >= 2 > 0 hold while 2 >= 0 > 0 does not. The processes reply OK, and on ‘decide to commit’ the tentative checkpoints become permanent.]
~Synchronous Checkpoint~
Correctness
• A set of permanent checkpoints taken by this algorithm is consistent
– No process sends messages after taking a tentative checkpoint until the receipt of the decision
– New checkpoints include no message from the processes that don’t take a checkpoint
– The set of tentative checkpoints is either made permanent in full or discarded in full
~Synchronous Checkpoint~
Recovery Algorithm
Labeling scheme:
⊥ : smallest label
т : largest label
last_label_rcvdX[Y] : the label of the last message that X received from Y after X took its last permanent or tentative checkpoint; if no such message exists, it holds ⊥
first_label_sentX[Y] : the label of the first message that X sent to Y after X took its last permanent or tentative checkpoint; if no such message exists, it holds ⊥
roll_cohortX : the set of all processes that may have to roll back to their latest checkpoint when process X rolls back
last_label_sentX[Y] : the label of the last message that X sent to Y before X took its latest permanent checkpoint; if no such message exists, it holds т
~Synchronous Recovery~
Recovery Algorithm
roll_cohortX = { Y | X can send messages to Y }
Y will restart from the permanent checkpoint only if
last_label_rcvdY[X] > last_label_sentX[Y]
~Synchronous Recovery~
Recovery Algorithm
Algorithm
1. An initiator requests each p ∈ roll_cohort to prepare to roll back (this message includes the sender's last_label_sent[receiver])
2. If the processes that receive the request need to roll back, they do the same as step 1; otherwise, they return OK messages
3. It waits to receive OK from every p ∈ roll_cohort
4. If the initiator learns that all p ∈ roll_cohort succeeded, it decides to roll back; otherwise, not to roll back
5. It informs each p ∈ roll_cohort of the decision
6. The processes that receive the decision act accordingly
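The rollback condition above can be written directly, with т (the largest label) modelled as infinity; the data values are illustrative.

```python
# The synchronous recovery condition: Y restarts from its permanent
# checkpoint only if it received a message that X "unsends" by rolling
# back, i.e. last_label_rcvdY[X] > last_label_sentX[Y].

TOP = float("inf")   # stands in for т: X sent nothing before its checkpoint

def must_roll_back(last_label_rcvd_y_of_x, last_label_sent_x_to_y):
    return last_label_rcvd_y_of_x > last_label_sent_x_to_y
```

With т as infinity, a process that received nothing (label 0) from a neighbor that sent nothing before its checkpoint never rolls back, matching the `0 > т` case in the diagram.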
~Synchronous Recovery~
Diagram of Synchronous Recovery
[Diagram: four processes A, B, C, D exchange labelled messages (ab1, ac1, ac2, ba1, ba2, bd1, cb1, cb2, dc1, dc2). The initiator sends ‘request to roll back’ to roll_cohortX = { Y | X can send messages to Y }. Each process restarts from its permanent checkpoint only if last_label_rcvdY[X] > last_label_sentX[Y]; in the example, 2 > 1 holds while 0 > 1 and 0 > т do not. The processes reply OK and act on ‘decide to roll back’.]
Drawbacks of Synchronous Approach
• Additional messages are exchanged
• Synchronization delay
• An unnecessary extra load on the system if failure rarely occurs
Asynchronous Checkpoint
Characteristic – Each process takes checkpoints independently
– No guarantee that a set of local checkpoints is consistent
– A recovery algorithm has to search consistent set of checkpoints
– No additional message
– No synchronization delay
– Lighter load during normal execution
Preliminary (Assumptions)
Goal
To find the latest consistent set of checkpoints
Assumptions
– Communication channels are FIFO
– Communication channels are reliable
– The underlying computation is event-driven
~Asynchronous Checkpoint / Recovery~
Preliminary (Two types of log)
• At the receipt of each message, save the event in memory (volatile log)
• The volatile log is periodically flushed to disk (stable log), which serves as a checkpoint
volatile log : quick access
lost if the corresponding processor fails
stable log : slow access
not lost even if processors fail
~Asynchronous Checkpoint / Recovery~
Preliminary (Definition)
Definition
CkPti : the checkpoint (stable log) that processor i rolls back to when a failure occurs
RCVDi←j (CkPti / e ) : the number of messages received by processor i from processor j, per the information stored in the checkpoint CkPti or event e.
SENTi→j(CkPti / e ) : the number of messages sent by processor i to processor j, per the information stored in the checkpoint CkPti or event e
~Asynchronous Checkpoint / Recovery~
Recovery Algorithm
Algorithm
1. When a process crashes, it recovers to its latest checkpoint CkPt
2. It broadcasts a message that it has failed; the others receive this message and roll back to their latest event
3. Each process sends SENT(CkPt) to its neighboring processes
4. Each process waits for SENT(CkPt) messages from every neighbor
5. On receiving SENTj→i(CkPtj) from j, if i notices RCVDi←j(CkPti) > SENTj→i(CkPtj), it rolls back to the event e such that RCVDi←j(e) = SENTj→i(e)
6. Repeat steps 3, 4, and 5 N times (N is the number of processes)
~Asynchronous Checkpoint / Recovery~
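One round of step 5 can be sketched with events represented simply as cumulative receive counts; the function name and data are illustrative.

```python
# One asynchronous-recovery step: given the RCVD counts recorded at
# process i's successive events and the SENT count from neighbor j's
# checkpoint, i keeps the latest event e with RCVD_{i<-j}(e) <= SENT_{j->i}.

def latest_consistent_event(events_rcvd_counts, sent_by_neighbor):
    """events_rcvd_counts: RCVD counts at successive events e0, e1, ...
    Returns the index of the latest event i may keep."""
    keep = 0
    for idx, rcvd in enumerate(events_rcvd_counts):
        if rcvd <= sent_by_neighbor:   # i has not received "unsent" messages
            keep = idx
    return keep

# i recorded RCVD counts 0,1,2,3 at events e0..e3; j's checkpoint says it
# sent only 2 messages, so i must discard e3 and roll back to e2.
rollback_to = latest_consistent_event([0, 1, 2, 3], 2)
```

Repeating this check N times, as step 6 requires, lets rollbacks propagate until the counts agree everywhere.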
Asynchronous Recovery
[Diagram: processes X, Y, Z with events Ex0–Ex3, Ey0–Ey3, Ez0–Ez2, checkpoints x1, y1, z1, and logged (sender, count) pairs such as (Y,2), (X,0), (Z,1). The condition RCVDi←j(CkPti) <= SENTj→i(CkPtj) is checked for each ordered pair X:Y, X:Z, Y:X, Y:Z, Z:X, Z:Y; violated comparisons such as 3 <= 2 and 2 <= 1 force further rollbacks, while comparisons such as 2 <= 2 and 1 <= 1 hold.]
System reliability: Fault-Intolerance vs. Fault-Tolerance
The fault-intolerance (or fault-avoidance) approach improves system reliability by removing the sources of failures (i.e., hardware and software faults) before normal operation begins
The fault-tolerance approach expects faults to be present during system operation, but employs design techniques that ensure the continued correct execution of the computing process
Approaches to fault-tolerance
Approaches:
(a) Mask failures
(b) Well defined failure behavior
Mask failures:
System continues to provide its specified function(s) in the presence of failures
Example: voting protocols
(b) Well-defined failure behavior:
System exhibits a well-defined behavior in the presence of failures
It may or may not perform its specified function(s), but it facilitates actions suitable for fault recovery
Example: commit protocols
A transaction made to a database is made visible only if successful and it commits
If it fails, transaction is undone
Redundancy:
Method for achieving fault tolerance (multiple copies of hardware, processes, data, etc...)
Issues
Process deaths: All resources allocated to a process must be recovered when the process dies
Kernel and remaining processes can notify other cooperating processes
Client-server systems: client (server) process needs to be informed that the corresponding server (client) process died
Machine failure: All processes running on that machine will die
Client-server systems: difficult to distinguish between a process and machine failure
Issue: detection by processes of other machines
Network Failure: Network may be partitioned into subnets
Machines from different subnets cannot communicate
Difficult for a process to distinguish between a machine and a communication link failure
Atomic actions
System activity: sequence of primitive or atomic actions
Atomic Action: Machine Level: uninterruptible instruction
Process Level: Group of instructions that accomplish a task
Example: Two processes, P1 and P2, share a memory location ‘x’ and both modify ‘x’
Process P1 Process P2
… …
Lock(x); Lock(x);
x := x + z; x := x + y; Atomic action
Unlock(x); Unlock(x);
… …
successful exit
System level: group of cooperating process performing a task (global atomicity)
Committing transactions: A sequence of actions treated as an atomic action to preserve consistency (e.g., access to a database)
Commit a transaction: Unconditional guarantee that the transaction will complete successfully (even in the presence of failures)
Abort a transaction: Unconditional guarantee to back out of a transaction, i.e., that all the effects of the transaction have been removed (transaction was backed out)
Events that may cause aborting a transaction: deadlocks, timeouts, protection violation
Mechanisms that facilitate backing out of an aborting transaction
Write-ahead-log protocol
Shadow pages
Commit protocols:
Enforce global atomicity (involving several cooperating distributed processes)
Ensure that all the sites either commit or abort transaction unanimously, even in the presence of multiple and repetitive failures
The two-phase commit protocol
Assumptions:
One process is the coordinator, the others are “cohorts” (at different sites)
Stable storage is available at each site
The write-ahead-log protocol is used
Coordinator
Initialization:
Send a start-transaction message to all cohorts
Phase 1:
Send a commit-request message, requesting all cohorts to commit
Wait for replies from the cohorts
Phase 2:
If all cohorts sent ‘agreed’ and the coordinator agrees, then write a commit record into the log and send a commit message to the cohorts; else send an abort message to the cohorts
Wait for acknowledgments from the cohorts
If the acknowledgment from a cohort is not received within the specified period, resend commit/abort to that cohort
If all acknowledgments are received, write a complete record to the log

Cohorts
If the transaction at the cohort is successful, then write undo and redo logs on stable storage and return an ‘agreed’ message; else return an abort message
If commit is received, release all resources and locks held for the transaction and send an acknowledgment
If abort is received, undo the transaction using the undo log record, release resources and locks, and send an acknowledgment
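The two phases above can be compressed into a sequential sketch, with the coordinator and cohorts in one process and messages modelled as return values; all names and the boolean vote representation are illustrative.

```python
# Sequential sketch of two-phase commit: collect votes (phase 1),
# then distribute the unanimous decision and gather acks (phase 2).

def cohort_vote(transaction_ok):
    # Phase 1 at a cohort: log undo/redo and agree, or vote abort.
    return "agreed" if transaction_ok else "abort"

def two_phase_commit(cohort_states):
    """cohort_states: one boolean per cohort, True if its local
    transaction succeeded."""
    votes = [cohort_vote(ok) for ok in cohort_states]        # Phase 1
    if all(v == "agreed" for v in votes):
        decision = "commit"   # the commit record is logged before sending
    else:
        decision = "abort"
    acks = ["ack" for _ in cohort_states]                    # Phase 2
    return decision, acks

decision, acks = two_phase_commit([True, True, True])
```

A single `abort` vote forces a global abort, which is exactly the unanimity property the protocol enforces.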
NonBlocking Commit Protocols
Our Blocking Theorem from last week states that if network partitioning is possible, then any distributed commit protocol may block.
Let’s assume now that the network cannot partition.
Then we can consult other processes to make progress.
However, if all processes fail, then we are, again, blocked.
Let’s further assume that total failure is not possible, i.e., not all processes are crashed at the same time.
Automata representation
We model the participants with finite state automata (FSA).
The participants move from one state to another as a result of receiving one or several messages or as a result of a timeout event.
Having received these messages, a participant may send some messages before executing the state transition.
Commit Protocol Automata
Final states are divided into Abort states and Commit states (finally, either Abort or Commit takes place).
Once an Abort state is reached, it is not possible to do a transition to a non-Abort state. (Abort is irreversible). Similarly for Commit states (Commit is also irreversible).
The state diagram is acyclic. We denote the initial state by q; the terminal states are a (an abort/rollback state) and c (a commit state). Often there is a wait state, which we denote by w.
Assume the participants are P1,…,Pn. Possible coordinator is P0, when the protocol starts.
2PC Coordinator
[State diagram: q → w on a commit-request from the application, sending VoteReq to P1,…,Pn; w → a on a timeout or a No from one of P1,…,Pn, sending Abort to P1,…,Pn; w → c on Yes from all of P1,…,Pn, sending Commit to P1,…,Pn.]
2PC Participant
[State diagram: q → a on VoteReq from P0, sending No to P0; q → w on VoteReq from P0, sending Yes to P0; w → c on Commit from P0; w → a on Abort from P0.]
Commit Protocol State Transitions
In a commit protocol, the idea is to inform other participants on local progress.
In fact, a state transition without message change is uninteresting, unless the participant moves into a terminal state.
Therefore, unless a participant moves into a terminal state, we may assume that it sends messages to other participants about its change of state.
To simplify our analysis, we may assume that the messages are sent to all other participants. This is not strictly necessary, but assuming less creates unnecessary complication.
Concurrency set
A concurrency set of a state s is the set of possible states among all participants, if some participant is in state s.
In other words, the concurrency set of state s is the set of all states that can co-exist with state s.
2PC Concurrency Sets
[The 2PC coordinator and participant state diagrams, as above.]
Concurrency_set(q) = {q,w,a}, Concurrency_set(a) = {q,w,a}
Concurrency_set(w) = {q,w,a,c}, Concurrency_set(c) = {w,c}
Committable states
We say that a state is committable, if the existence of a participant in this state means that everyone has voted Yes.
If a state is not committable, we say that it is non-committable.
In 2PC, c is the only committable state; q and w are non-committable.
How can a site terminate when there is a timeout?
Either (1) one of the operational sites knows the fate of the transaction, or (2) the operational sites can decide the fate of the transaction.
Knowing the fate of the transaction means, in practice, that there is a participant in a terminal state.
Start by considering a single participant s. The site must infer the possible states of other participants from its own state. This can be done using concurrency sets.
When can’t a single participant unilaterally abort?
Suppose a participant is in a state, which has a commit state in its concurrency set. Then, it is possible that some other participant is in a commit state.
A participant in a state, which has a commit state in its concurrency set, should not unilaterally abort.
When can’t a single participant unilaterally commit?
Suppose a participant is in a state, which has an abort state in its concurrency set. Then, some participant may be in an abort state.
A participant in a state, which has an abort state in its concurrency set, should not unilaterally commit.
Also, a participant that is not in a committable state should not commit.
The Fundamental Non-Blocking Theorem
A protocol is non-blocking, if and only if it satisfies the following conditions: (1) There exists no local state such that its concurrency set contains both an abort and a commit state, and (2) there exists no noncommittable state, whose concurrency set contains a commit state.
Showing the Fundamental Non-Blocking Theorem
From our discussion above it follows that Conditions (1) and (2) are necessary.
We discuss their sufficiency later by showing how to terminate a commit protocol fulfilling conditions (1) and (2).
Observations on 2PC
As the participants exchange messages as they progress, they progress in a synchronised fashion.
In fact, there is always at most one step difference between the states of any two live participants.
We say that the participants keep a one-step synchronisation.
It is easy to see by the Fundamental Non-Blocking Theorem that 2PC is blocking.
One-step synchronisation and non-blocking property
If a commit protocol keeps one-step synchronisation, then the concurrency set of state s consists of s and the states adjacent to s.
By applying this observation and the Fundamental Non-blocking Theorem, we get a useful Lemma:
Lemma
A protocol that is synchronous within one state transition is non-blocking if and only if (1) it contains no state adjacent to both a Commit and an Abort state, and (2) it contains no non-committable state that is adjacent to a Commit state.
How to improve 2PC to get a non-blocking protocol
It is easy to see that the state w is the problematic state – and in two ways: - it has both Abort and Commit in its concurrency set, and - it is a non-committable state, but it has Commit in its concurrency set.
Solution: add an extra state between w and c (adding between w and a would not do – why?)
We are primarily interested in the centralised protocol, but similar decentralised improvement is possible.
3PC Coordinator
[State diagram: q → w on a commit-request from the application, sending VoteReq to P1,…,Pn; w → a on a timeout or a No from one of P1,…,Pn, sending Abort to P1,…,Pn; w → p on Yes from all of P1,…,Pn, sending Prepare to P1,…,Pn; p → c on Ack from all of P1,…,Pn, sending Commit to P1,…,Pn.]
3PC Participant
[State diagram: q → a on VoteReq from P0, sending No to P0; q → w on VoteReq from P0, sending Yes to P0; w → p on Prepare from P0, sending Ack to P0; w → a on Abort from P0; p → c on Commit from P0.]
3PC Concurrency sets (cs)
[The 3PC coordinator and participant state diagrams, as above.]
cs(p) = {w,p,c},
cs(w) = {q,a,w,p},
etc.
3PC and failures
If there are no failures, then clearly 3PC is correct.
In the presence of failures, the operational participants should be able to terminate their execution.
In the centralised case, a need for termination protocol implies that the coordinator is no longer operational.
We discuss a general termination protocol. It makes the assumption that at least one participant remains operational and that the participants obey the Fundamental Non-Blocking Theorem.
Termination
Basic idea: Choose a backup coordinator B – vote or use some preassigned ids.
Backup Coordinator Decision Rule: If B’s state contains a commit state in its concurrency set, commit the transaction; otherwise abort the transaction.
Reasoning behind the rule: if B’s state contains a commit state in its concurrency set, then it is possible that some site has committed; otherwise it is not.
Re-executing termination
It is, of course, possible that the backup coordinator fails.
For this reason, the termination protocol should be executed in such a way that it can be re-executed.
In particular, the termination protocol must not break the one-step synchronisation.
Implementing termination
To keep one-step synchronisation, the termination protocol should be executed in two steps:
1. The backup coordinator B tells the others to make a transition to B’s state; the others answer OK. (This is not necessary if B is in a Commit or Abort state.)
2. B tells the others to commit or abort according to the decision rule.
Fundamental Non-Blocking Theorem Proof - Sufficiency
The basic termination procedure and decision rule is valid for any protocol that fulfills the conditions given in the Fundamental Non-Blocking Theorem.
The existence of a termination protocol completes the proof.
Voting protocols
Principles: Data replicated at several sites to increase reliability
Each replica assigned a number of votes
To access a replica, a process must collect a majority of votes
Vote mechanism: (1) Static voting:
Each replica has number of votes (in stable storage)
A process can access a replica for a read or write operation if it can collect a certain number of votes (read or write quorum)
(2) Dynamic voting
Number of votes or the set of sites that form a quorum change with the state of system (due to site and communication failures)
(2.1) Majority-based approach: The set of sites that can form a majority to allow access to replicated data changes with the changing state of the system
(2.2) Dynamic vote reassignment: Number of votes assigned to a site changes dynamically
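Static voting can be sketched as a pair of quorum checks. The standard constraints r + w > N and 2w > N (which force any two conflicting quorums to intersect) are an assumption consistent with, but not spelled out in, the slides; the vote assignment is illustrative.

```python
# Static voting sketch: each replica holds some votes; an operation
# proceeds only if it gathers a read quorum r or write quorum w.

def quorums_valid(total_votes, r, w):
    # r + w > N prevents concurrent read/write quorums;
    # 2w > N prevents two concurrent write quorums.
    return r + w > total_votes and 2 * w > total_votes

def can_access(votes_collected, quorum):
    return votes_collected >= quorum

# Illustrative assignment: three sites holding 2, 1, 1 votes (N = 4).
votes = {"site1": 2, "site2": 1, "site3": 1}
N = sum(votes.values())
```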
Failure resilient processes
Resilient process: continues execution in the presence of failures with minimum disruption to the service provided (masks failures)
Approaches for implementing resilient processes: Backup processes and
Replicated execution
(1) Backup processes
Each process is made of a primary process and one or more backup processes
The primary process executes, while the backup processes remain inactive
If the primary process fails, a backup process takes over
The primary process establishes checkpoints so that a backup process can restart from them
(2) Replicated execution
Several processes execute the same program concurrently
Majority consensus (voting) on their results
Increases both the reliability and availability of the process
Recovery (fault tolerant) block concept
Provide fault-tolerance within an individual sequential process in which assignments to stored variables are the only means of making recognizable progress
The recovery block is made of:
A primary block (the conventional program),
Zero or more alternates (providing the same function as the primary block, but using different algorithm), and
An acceptance test (performed on exit from a primary or alternate block to validate its actions).
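The control flow of a recovery block, run the primary, validate with the acceptance test, fall back to alternates on failure, can be sketched as follows; the blocks and the square-root scenario are illustrative.

```python
# Recovery block sketch: try the primary block, then each alternate in
# turn, accepting the first result that passes the acceptance test.

def recovery_block(primary, alternates, acceptance_test):
    for block in [primary] + alternates:
        try:
            result = block()
        except Exception:
            continue                 # a raised error counts as a failed block
        if acceptance_test(result):
            return result            # successful exit
    raise RuntimeError("all alternates exhausted")

# Primary is a (deliberately faulty) algorithm; the alternate is a
# slower but correct one; the acceptance test validates the result.
primary = lambda: -1.0               # faulty: a square root cannot be negative
alternate = lambda: 3.0
accept = lambda r: r >= 0 and abs(r * r - 9) < 1e-9
result = recovery_block(primary, [alternate], accept)
```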
Recovery (fault tolerant) Block concept
[Diagram: a recovery block A consists of a primary block AP (program text), an alternate block AQ (program text), and an acceptance test AT; the primary block and the alternate block both feed into the acceptance test.]
N-version programming
[Diagram: modules ‘0’, ‘1’, …, ‘n-1’ execute in parallel and feed their results to a voter.]
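The voter at the heart of N-version programming can be sketched as a majority function over the modules' outputs; the example modules (one deliberately faulty) are illustrative.

```python
# N-version programming sketch: n independently written modules compute
# the same function; the voter returns the majority result, masking a
# faulty minority version.

from collections import Counter

def vote(results):
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError("no majority among the versions")
    return value

modules = [
    lambda x: x * x,         # version 0
    lambda x: x ** 2,        # version 1
    lambda x: x * x + 1,     # version 2: faulty
]
outputs = [m(4) for m in modules]
answer = vote(outputs)       # the majority masks the faulty version
```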