CS9222 Advanced Operating Systems
TRANSCRIPT
Unit – IV
Dr. A. Kathirvel
Professor & Head/IT, VCEW
Basic Concepts – Classification of Failures – Basic Approaches to Recovery; Recovery in Concurrent Systems; Synchronous and Asynchronous Checkpointing and Recovery; Checkpointing in Distributed Database Systems; Fault Tolerance; Issues – Two-phase and Nonblocking Commit Protocols; Voting Protocols; Dynamic Voting Protocols
Recovery
Recovery in computer systems refers to restoring a system to its normal operational state.
Recovery may be as simple as restarting a failed computer or restarting failed processes.
Recovery is generally a very complicated process.
For example, a process has memory allocated to it and may have locked shared resources, such as files and memory. Under such circumstances, if the process fails, it is imperative that the resources allocated to it are reclaimed and its partial modifications undone.
Recovery
Computer system recovery:
Restore the system to a normal operational state
Process recovery:
Reclaim resources allocated to process,
Undo modification made to databases, and
Restart the process
Or restart process from point of failure and resume execution
Distributed process recovery (cooperating processes):
Undo effect of interactions of failed process with other cooperating processes.
Replication (hardware components, processes, data):
Main method for increasing system availability
System:
Set of hardware and software components
Designed to provide a specified service (i.e., meet a set of requirements)
System failure: – The system does not meet its requirements, i.e., does not perform its services as specified
Erroneous system state: – A state that could lead to a system failure by a sequence of valid state transitions
– Error: the part of the system state which differs from its intended value
Fault: – An anomalous physical condition, e.g., design errors, manufacturing problems, damage, external disturbances
Recovery (cont.)
Error could lead to system failure
Error is a manifestation of a fault
Process failure: Behavior: the process causes the system state to deviate from its specification (e.g., incorrect computation, process stops execution)
Errors causing process failure: protection violation, deadlocks, timeouts, wrong user input, etc.
Recovery: Abort process or
Restart process from prior state
System failure: Behavior: processor fails to execute
Caused by software errors or hardware faults (CPU/memory/bus/…/ failure)
Recovery: system stopped and restarted in correct state
Assumption: fail-stop processors, i.e. system stops execution, internal state is lost
Secondary Storage Failure: Behavior: stored data cannot be accessed
Errors causing failure: parity error, head crash, etc.
Recovery/Design strategies: Reconstruct content from archive + log of activities
Design mirrored disk system
Communication Medium Failure: Behavior: a site cannot communicate with another operational site
Errors/Faults: failure of switching nodes or communication links
Recovery/Design Strategies: reroute, error-resistant communication protocols
Classification of failures
Failure recovery: restore an erroneous state to an error-free state
Approaches to failure recovery:
Forward-error recovery:
Remove errors in process/system state (if errors can be completely assessed)
Continue process/system forward execution
Backward-error recovery:
Restore process/system to previous error-free state and restart from there
Comparison: Forward vs. Backward error recovery
Backward-error recovery
(+) Simple to implement
(+) Can be used as general recovery mechanism
(-) Performance penalty
(-) No guarantee that fault does not occur again
(-) Some components cannot be recovered
Forward-error Recovery
(+) Less overhead
(-) Limited use, i.e. only when impact of faults understood
(-) Cannot be used as general mechanism for error recovery
Backward and Forward Error Recovery
Principle: restore process/system to a known, error-free “recovery point”/ “checkpoint”.
System model:
Approaches: (1) Operation-based approach
(2) State-based approach
Backward-Error Recovery: Basic approach
[Diagram: the CPU and main memory interact with secondary storage (objects are brought into main memory to be accessed and written back if modified) and with stable storage, which maintains information in the event of system failure and stores logs and recovery points.]
Principle: record all changes made to the state of a process (an ‘audit trail’ or ‘log’) such that the process can be returned to a previous state
Example: a transaction-based environment where transactions update a database; it is possible to commit or undo updates on a per-transaction basis
A commit indicates that the transaction on the object was successful and changes are permanent
(1.a) Updating-in-place
Principle: every update (write) operation to an object creates a log record in stable storage that can be used to ‘undo’ and ‘redo’ the operation
Log content: object name, old object state, new object state
Implementation of a recoverable update operation:
Do operation: update object and write log record
Undo operation: log(old) -> object (undoes the action performed by a do)
Redo operation: log(new) -> object (redoes the action performed by a do)
Display operation: display log record (optional)
Problem: a ‘do’ cannot be recovered if system crashes after write object but before log record write
(1.b) The write-ahead-log protocol
Principle: write the log record before updating the object
(1) The Operation-based Approach
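The do/undo/redo operations above can be sketched as follows. The dictionary-backed "stable storage" and all class and variable names are illustrative, not part of the original protocol description.

```python
# A minimal sketch of the write-ahead log protocol: the log record is
# appended BEFORE the object is updated, so a crash between the two
# steps can always be undone or redone from the log.

class WriteAheadLog:
    def __init__(self):
        self.objects = {}   # stands in for the database of named objects
        self.log = []       # stable log: (name, old_state, new_state)

    def do(self, name, new_state):
        # Write-ahead rule: log first, then update the object.
        old_state = self.objects.get(name)
        self.log.append((name, old_state, new_state))
        self.objects[name] = new_state

    def undo(self):
        # log(old) -> object: undoes the action performed by a do
        name, old_state, _ = self.log.pop()
        self.objects[name] = old_state

    def redo(self, record):
        # log(new) -> object: redoes the action performed by a do
        name, _, new_state = record
        self.objects[name] = new_state

wal = WriteAheadLog()
wal.do("x", 1)
wal.do("x", 2)
wal.undo()          # x is restored to 1 using the logged old state
```

The same log record drives both directions: `undo` replays the old state, `redo` replays the new one, which is what makes recovery after a crash possible.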
Principle: establish frequent ‘recovery points’ or ‘checkpoints’ saving the entire state of process
Actions: ‘Checkpointing’ or ‘taking a checkpoint’: saving process state
‘Rolling back’ a process: restoring a process to a prior state
Note: A process should be rolled back to the most recent ‘recovery point’ to minimize the overhead and delays in the completion of the process
Shadow pages: a special case of the state-based approach; only part of the system state is saved, to minimize recovery overhead
When an object is modified, page containing object is first copied on stable storage (shadow page)
If process successfully commits: shadow page discarded and modified page is made part of the database
If process fails: shadow page used and the modified page discarded
(2) State-based Approach
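The shadow-page behavior described above can be sketched as follows; the in-memory dicts standing in for the database and stable storage, and all names, are illustrative.

```python
# Sketch of the shadow-page technique: before a page is modified for
# the first time, a copy (the shadow page) is saved on stable storage.
# Commit discards the shadows; abort restores them.

class ShadowPagedStore:
    def __init__(self, pages):
        self.pages = dict(pages)   # current (possibly modified) pages
        self.shadows = {}          # stable copies of pages touched so far

    def write(self, page_id, data):
        # First modification of a page saves its shadow copy.
        if page_id not in self.shadows:
            self.shadows[page_id] = self.pages[page_id]
        self.pages[page_id] = data

    def commit(self):
        # Success: shadow pages discarded, modified pages become permanent.
        self.shadows.clear()

    def abort(self):
        # Failure: shadow pages replace the modified pages.
        self.pages.update(self.shadows)
        self.shadows.clear()

store = ShadowPagedStore({"p1": "old"})
store.write("p1", "new")
store.abort()    # p1 is restored from its shadow page
```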
Recovery in concurrent systems
Issue: if one of a set of cooperating processes fails and has to be rolled back to a recovery point, all processes it communicated with since the recovery point have to be rolled back.
Conclusion: In concurrent and/or distributed systems all cooperating processes have to establish recovery points
Orphan messages and the domino effect
Case 1: failure of X after x3 : no impact on Y or Z
Case 2: failure of Y after sending message ‘m’: Y is rolled back to y2; ‘m’ becomes an orphan message, so X is rolled back to x2
Case 3: failure of Z after z2: Y has to roll back to y1, X has to roll back to x1, and Z has to roll back to z1 (the domino effect)
[Diagram: timelines of processes X, Y, Z with checkpoints x1, x2, x3; y1, y2; z1, z2, and a message ‘m’ sent by Y to X after y2.]
Lost messages
• Assume that x1 and y1 are the only recovery points for processes X and Y, respectively
• Assume Y fails after receiving message ‘m’
• Y rolled back to y1, X rolled back to x1
• Message ‘m’ is lost
Note: there is no distinction between this case and the case where message ‘m’ is lost in communication channel and processes X and Y are in states x1 and y1, respectively
[Diagram: timelines of X and Y with checkpoints x1 and y1; Y fails after receiving message ‘m’ from X.]
Problem of livelock
• Livelock: case where a single failure can cause an infinite number of rollbacks
• Process Y fails before receiving message ‘n1’ sent by X
• Y rolled back to y1, no record of sending message ‘m1’, causing X to roll back to x1
• When Y restarts, sends out ‘m2’ and receives ‘n1’ (delayed)
• When X restarts from x1, sends out ‘n2’ and receives ‘m2’
• Y has to roll back again, since there is no record of ‘n1’ being sent
• This causes X to be rolled back again, since it has received ‘m2’ and there is no record of ‘m2’ being sent in Y
• The above sequence can repeat indefinitely
[Diagrams (a) and (b): timelines of X and Y with checkpoints x1 and y1, showing messages n1 and m1 around the failure, and n2 and m2 after the second rollback.]
Consistent set of checkpoints
• Checkpointing in distributed systems requires that all processes (sites) that interact with one another establish periodic checkpoints
• All the sites save their local states: local checkpoints
• All the local checkpoints, one from each site, collectively form a global checkpoint
• The domino effect is caused by orphan messages, which in turn are caused by rollbacks
1. Strongly consistent set of checkpoints
– Establish a set of local checkpoints (one for each process in the set) such that no information flow takes place (i.e., no orphan messages) during the interval spanned by the checkpoints
2. Consistent set of checkpoints
– Similar to the consistent global state
– Each message that is received in a checkpoint (state) should also be recorded as sent in another checkpoint (state)
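The consistency condition in item 2 can be checked mechanically: every message recorded as received in some local checkpoint must also be recorded as sent in another. The representation of checkpoints as sent/received message-id sets is an illustrative assumption.

```python
# Check whether a set of local checkpoints forms a consistent global
# checkpoint: no "orphan" messages, i.e. nothing recorded as received
# that is not recorded as sent.

def is_consistent(checkpoints):
    """checkpoints: {process: {"sent": set_of_msg_ids, "rcvd": set_of_msg_ids}}"""
    all_sent = set().union(*(c["sent"] for c in checkpoints.values()))
    all_rcvd = set().union(*(c["rcvd"] for c in checkpoints.values()))
    # Consistent iff every received message was recorded as sent somewhere.
    return all_rcvd <= all_sent

# Y's checkpoint records receiving 'm', and X's records sending it: consistent.
cps = {
    "X": {"sent": {"m"}, "rcvd": set()},
    "Y": {"sent": set(), "rcvd": {"m"}},
}

# Here 'm' is an orphan: received by X but recorded as sent by no one.
cps_orphan = {
    "X": {"sent": set(), "rcvd": {"m"}},
    "Y": {"sent": set(), "rcvd": set()},
}
```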
Consistency of Checkpoint
• Strongly consistent set of checkpoints
no messages penetrating the set
• Consistent set of checkpoints
no messages penetrating the set backward
[Diagram: checkpoints x1, y1, z1 form a strongly consistent set (no messages penetrate the set), while checkpoints x2, y2, z2 form a consistent set, which may need to deal with lost messages.]
Checkpoint/Recovery Algorithm
• Synchronous
– with global synchronization at checkpointing
• Asynchronous
– without global synchronization at checkpointing
Preliminary (Assumption)
Goal
To make a consistent global checkpoint
Assumptions
– Communication channels are FIFO
– No partition of the network
– End-to-end protocols cope with message loss due to rollback recovery and communication failure
– No failure during the execution of the algorithm
~Synchronous Checkpoint~
Preliminary (Two types of checkpoint)
tentative checkpoint: – a temporary checkpoint; a candidate for a permanent checkpoint
permanent checkpoint: – a local checkpoint at a process; a part of a consistent global checkpoint
~Synchronous Checkpoint~
Checkpoint Algorithm
Algorithm
1. An initiating process (a single process that invokes this algorithm) takes a tentative checkpoint
2. It requests all the processes to take tentative checkpoints
3. It waits to hear from all the processes whether taking a tentative checkpoint succeeded
4. If it learns that all the processes succeeded, it decides all tentative checkpoints should be made permanent; otherwise, they should be discarded
5. It informs all the processes of the decision
6. The processes that receive the decision act accordingly
Supplement
Once a process has taken a tentative checkpoint, it must not send messages until it is informed of the initiator's decision.
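The six-step exchange above can be sketched sequentially, with message passing replaced by direct method calls. The `Process` class, the `can_checkpoint` flag (simulating step-3 success or failure), and all other names are illustrative.

```python
# Sketch of the synchronous checkpoint algorithm: the initiator takes a
# tentative checkpoint, asks the others to do the same, and all tentative
# checkpoints become permanent only if every process succeeded.

class Process:
    def __init__(self, name, can_checkpoint=True):
        self.name = name
        self.can_checkpoint = can_checkpoint   # simulates success/failure
        self.tentative = None
        self.permanent = None

    def take_tentative(self, state):
        if self.can_checkpoint:
            self.tentative = state
        return self.can_checkpoint             # reply to the initiator

    def decide(self, commit):
        if commit:
            self.permanent = self.tentative    # make it permanent
        self.tentative = None                  # otherwise it is discarded

def checkpoint(initiator, cohorts, state):
    # Steps 1-3: initiator checkpoints, requests the others, collects replies.
    ok = initiator.take_tentative(state)
    # list comprehension so every cohort is asked (no short-circuiting)
    ok = all([p.take_tentative(state) for p in cohorts]) and ok
    # Steps 4-6: commit all tentative checkpoints only if all succeeded.
    for p in [initiator] + list(cohorts):
        p.decide(ok)
    return ok

procs = [Process(n) for n in "ABC"]
committed = checkpoint(procs[0], procs[1:], state="s1")
```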
~Synchronous Checkpoint~
Diagram of Checkpoint Algorithm
[Diagram: the initiator takes a tentative checkpoint and sends ‘request to take a tentative checkpoint’ to the other processes; each takes a tentative checkpoint and replies OK; on ‘decide to commit’ the tentative checkpoints become permanent, forming a consistent global checkpoint. One process's checkpoint is shown as unnecessary.]
~Synchronous Checkpoint~
Optimized Algorithm
Each message is labeled by order of sending
Labeling scheme:
⊥ : smallest label
т : largest label
last_label_rcvdX[Y] : the label of the last message that X received from Y after X took its last permanent or tentative checkpoint; if no such message exists, it holds ⊥
first_label_sentX[Y] : the label of the first message that X sent to Y after X took its last permanent or tentative checkpoint; if no such message exists, it holds ⊥
ckpt_cohortX : the set of all processes that may have to take checkpoints when X decides to take a checkpoint
~Synchronous Checkpoint~
[Diagram: X and Y with checkpoints x2, x3 and y1, y2, exchanging labelled messages.]
A checkpoint request needs to be sent only to the processes included in ckpt_cohort
Optimized Algorithm
ckpt_cohortX : { Y | last_label_rcvdX[Y] > ⊥ }
Y takes a tentative checkpoint only if
last_label_rcvdX[Y] >= first_label_sentY[X] > ⊥
~Synchronous Checkpoint~
[Diagram: a message from Y to X after both checkpoints, illustrating last_label_rcvdX[Y] and first_label_sentY[X].]
Optimized Algorithm
Algorithm
1. An initiating process takes a tentative checkpoint
2. It requests each p ∈ ckpt_cohort to take a tentative checkpoint (this message includes the sender's last_label_rcvd[receiver])
3. If the processes that receive the request need to take a checkpoint, they do the same as steps 1–2; otherwise, they return OK messages
4. It waits to receive OK from every p ∈ ckpt_cohort
5. If the initiator learns that all the processes succeeded, it decides all tentative checkpoints should be made permanent; otherwise, they should be discarded
6. It informs each p ∈ ckpt_cohort of the decision
7. The processes that receive the decision act accordingly
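The cohort computation and the checkpoint condition above can be written directly, with ⊥ represented as 0 and message labels as positive integers; the variable names follow the slides, and the data values are illustrative.

```python
# ckpt_cohort and the optimized checkpoint condition from the slides.

BOTTOM = 0   # stands in for ⊥, the smallest label

def ckpt_cohort(last_label_rcvd):
    # ckpt_cohortX = { Y | last_label_rcvdX[Y] > ⊥ }
    return {y for y, label in last_label_rcvd.items() if label > BOTTOM}

def must_take_checkpoint(last_label_rcvd_x_of_y, first_label_sent_y_to_x):
    # Y takes a tentative checkpoint only if
    # last_label_rcvdX[Y] >= first_label_sentY[X] > ⊥
    return last_label_rcvd_x_of_y >= first_label_sent_y_to_x > BOTTOM

# X received a message labelled 2 from Y and nothing from Z.
rcvd = {"Y": 2, "Z": 0}
```

The chained comparison mirrors the slide's condition exactly: both inequalities must hold, so a process that sent nothing since its last checkpoint (label ⊥) never has to checkpoint.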
~Synchronous Checkpoint~
Diagram of Optimized Algorithm
[Diagram: four processes A, B, C, D exchange labelled messages (ab1, ac1, ac2, ba1, ba2, bd1, cb1, cb2, cd1, ca2, dc1, dc2). The initiator takes a tentative checkpoint and sends checkpoint requests only to ckpt_cohortX = { Y | last_label_rcvdX[Y] > ⊥ }. Each process in the cohort takes a tentative checkpoint only if last_label_rcvdX[Y] >= first_label_sentY[X] > ⊥; in the example, 2 >= 1 > 0 and 2 >= 2 > 0 hold while 2 >= 0 > 0 does not. The processes reply OK, and on ‘decide to commit’ the tentative checkpoints become permanent.]
~Synchronous Checkpoint~
Correctness
• A set of permanent checkpoints taken by this algorithm is consistent
– No process sends messages after taking a tentative checkpoint until the receipt of the decision
– New checkpoints include no message from the processes that don’t take a checkpoint
– The set of tentative checkpoints is either made permanent in full or discarded in full
~Synchronous Checkpoint~
Recovery Algorithm
Labeling scheme:
⊥ : smallest label
т : largest label
last_label_rcvdX[Y] : the label of the last message that X received from Y after X took its last permanent or tentative checkpoint; if no such message exists, it holds ⊥
first_label_sentX[Y] : the label of the first message that X sent to Y after X took its last permanent or tentative checkpoint; if no such message exists, it holds ⊥
roll_cohortX : the set of all processes that may have to roll back to their latest checkpoint when process X rolls back
last_label_sentX[Y] : the label of the last message that X sent to Y before X took its latest permanent checkpoint; if no such message exists, it holds т
~Synchronous Recovery~
Recovery Algorithm
roll_cohortX = { Y | X can send messages to Y }
Y will restart from the permanent checkpoint only if
last_label_rcvdY[X] > last_label_sentX[Y]
~Synchronous Recovery~
Recovery Algorithm
Algorithm
1. An initiator requests each p ∈ roll_cohort to prepare to roll back (this message includes the sender's last_label_sent[receiver])
2. If the processes that receive the request need to roll back, they do the same as step 1; otherwise, they return OK messages
3. It waits to receive OK from every p ∈ roll_cohort
4. If the initiator learns that all p ∈ roll_cohort succeeded, it decides to roll back; otherwise, not to roll back
5. It informs each p ∈ roll_cohort of the decision
6. The processes that receive the decision act accordingly
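The rollback condition above can be written directly, with т (the largest label) modelled as infinity; the data values are illustrative.

```python
# The synchronous recovery condition: Y restarts from its permanent
# checkpoint only if it received a message that X "unsends" by rolling
# back, i.e. last_label_rcvdY[X] > last_label_sentX[Y].

TOP = float("inf")   # stands in for т: X sent nothing before its checkpoint

def must_roll_back(last_label_rcvd_y_of_x, last_label_sent_x_to_y):
    return last_label_rcvd_y_of_x > last_label_sent_x_to_y
```

With т as infinity, a process that received nothing (label 0) from a neighbor that sent nothing before its checkpoint never rolls back, matching the `0 > т` case in the diagram.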
~Synchronous Recovery~
Diagram of Synchronous Recovery
[Diagram: four processes A, B, C, D exchange labelled messages (ab1, ac1, ac2, ba1, ba2, bd1, cb1, cb2, dc1, dc2). The initiator sends ‘request to roll back’ to roll_cohortX = { Y | X can send messages to Y }. Each process restarts from its permanent checkpoint only if last_label_rcvdY[X] > last_label_sentX[Y]; in the example, 2 > 1 holds while 0 > 1 and 0 > т do not. The processes reply OK and act on ‘decide to roll back’.]
Drawbacks of Synchronous Approach
• Additional messages are exchanged
• Synchronization delay
• An unnecessary extra load on the system if failure rarely occurs
Asynchronous Checkpoint
Characteristic – Each process takes checkpoints independently
– No guarantee that a set of local checkpoints is consistent
– A recovery algorithm has to search consistent set of checkpoints
– No additional message
– No synchronization delay
– Lighter load during normal execution
Preliminary (Assumptions)
Goal
To find the latest consistent set of checkpoints
Assumptions
– Communication channels are FIFO
– Communication channels are reliable
– The underlying computation is event-driven
~Asynchronous Checkpoint / Recovery~
Preliminary (Two types of log)
• At the receipt of each message, save the event in memory (volatile log)
• The volatile log is periodically flushed to disk (stable log), which serves as a checkpoint
volatile log : quick access
lost if the corresponding processor fails
stable log : slow access
not lost even if processors fail
~Asynchronous Checkpoint / Recovery~
Preliminary (Definition)
Definition
CkPti : the checkpoint (stable log) that processor i rolls back to when a failure occurs
RCVDi←j (CkPti / e ) : the number of messages received by processor i from processor j, per the information stored in the checkpoint CkPti or event e.
SENTi→j(CkPti / e ) : the number of messages sent by processor i to processor j, per the information stored in the checkpoint CkPti or event e
~Asynchronous Checkpoint / Recovery~
Recovery Algorithm
Algorithm
1. When a process crashes, it recovers to its latest checkpoint CkPt
2. It broadcasts a message that it has failed; the others receive this message and roll back to their latest event
3. Each process sends SENT(CkPt) to its neighboring processes
4. Each process waits for SENT(CkPt) messages from every neighbor
5. On receiving SENTj→i(CkPtj) from j, if i notices RCVDi←j(CkPti) > SENTj→i(CkPtj), it rolls back to the event e such that RCVDi←j(e) = SENTj→i(e)
6. Repeat steps 3, 4, and 5 N times (N is the number of processes)
~Asynchronous Checkpoint / Recovery~
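One round of step 5 can be sketched with events represented simply as cumulative receive counts; the function name and data are illustrative.

```python
# One asynchronous-recovery step: given the RCVD counts recorded at
# process i's successive events and the SENT count from neighbor j's
# checkpoint, i keeps the latest event e with RCVD_{i<-j}(e) <= SENT_{j->i}.

def latest_consistent_event(events_rcvd_counts, sent_by_neighbor):
    """events_rcvd_counts: RCVD counts at successive events e0, e1, ...
    Returns the index of the latest event i may keep."""
    keep = 0
    for idx, rcvd in enumerate(events_rcvd_counts):
        if rcvd <= sent_by_neighbor:   # i has not received "unsent" messages
            keep = idx
    return keep

# i recorded RCVD counts 0,1,2,3 at events e0..e3; j's checkpoint says it
# sent only 2 messages, so i must discard e3 and roll back to e2.
rollback_to = latest_consistent_event([0, 1, 2, 3], 2)
```

Repeating this check N times, as step 6 requires, lets rollbacks propagate until the counts agree everywhere.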
Asynchronous Recovery
[Diagram: processes X, Y, Z with events Ex0–Ex3, Ey0–Ey3, Ez0–Ez2, checkpoints x1, y1, z1, and logged (sender, count) pairs such as (Y,2), (X,0), (Z,1). The condition RCVDi←j(CkPti) <= SENTj→i(CkPtj) is checked for each ordered pair X:Y, X:Z, Y:X, Y:Z, Z:X, Z:Y; violated comparisons such as 3 <= 2 and 2 <= 1 force further rollbacks, while comparisons such as 2 <= 2 and 1 <= 1 hold.]
System reliability: Fault-Intolerance vs. Fault-Tolerance
The fault-intolerance (or fault-avoidance) approach improves system reliability by removing the sources of failures (i.e., hardware and software faults) before normal operation begins
The fault-tolerance approach expects faults to be present during system operation, but employs design techniques that ensure the continued correct execution of the computing process
Approaches to fault-tolerance
Approaches:
(a) Mask failures
(b) Well defined failure behavior
Mask failures:
System continues to provide its specified function(s) in the presence of failures
Example: voting protocols
(b) Well-defined failure behavior:
System exhibits a well-defined behavior in the presence of failures
It may or may not perform its specified function(s), but it facilitates actions suitable for fault recovery
Example: commit protocols
A transaction made to a database is made visible only if successful and it commits
If it fails, transaction is undone
Redundancy:
Method for achieving fault tolerance (multiple copies of hardware, processes, data, etc...)
Issues
Process deaths: All resources allocated to a process must be recovered when the process dies
Kernel and remaining processes can notify other cooperating processes
Client-server systems: client (server) process needs to be informed that the corresponding server (client) process died
Machine failure: All processes running on that machine will die
Client-server systems: difficult to distinguish between a process and machine failure
Issue: detection by processes of other machines
Network Failure: Network may be partitioned into subnets
Machines from different subnets cannot communicate
Difficult for a process to distinguish between a machine and a communication link failure
Atomic actions
System activity: sequence of primitive or atomic actions
Atomic Action: Machine Level: uninterruptible instruction
Process Level: Group of instructions that accomplish a task
Example: Two processes, P1 and P2, share a memory location ‘x’ and both modify ‘x’
Process P1 Process P2
… …
Lock(x); Lock(x);
x := x + z; x := x + y; Atomic action
Unlock(x); Unlock(x);
… …
successful exit
System level: group of cooperating process performing a task (global atomicity)
Committing transactions: A sequence of actions treated as an atomic action to preserve consistency (e.g., access to a database)
Commit a transaction: Unconditional guarantee that the transaction will complete successfully (even in the presence of failures)
Abort a transaction: Unconditional guarantee to back out of a transaction, i.e., that all the effects of the transaction have been removed (transaction was backed out)
Events that may cause aborting a transaction: deadlocks, timeouts, protection violation
Mechanisms that facilitate backing out of an aborting transaction
Write-ahead-log protocol
Shadow pages
Commit protocols:
Enforce global atomicity (involving several cooperating distributed processes)
Ensure that all the sites either commit or abort transaction unanimously, even in the presence of multiple and repetitive failures
The two-phase commit protocol
Assumptions:
One process is the coordinator, the others are “cohorts” (at different sites)
Stable storage is available at each site
The write-ahead-log protocol is used
Coordinator
Initialization:
Send a start-transaction message to all cohorts
Phase 1:
Send a commit-request message, requesting all cohorts to commit
Wait for replies from the cohorts
Phase 2:
If all cohorts sent ‘agreed’ and the coordinator agrees, then write a commit record into the log and send a commit message to the cohorts; else send an abort message to the cohorts
Wait for acknowledgments from the cohorts
If the acknowledgment from a cohort is not received within the specified period, resend commit/abort to that cohort
If all acknowledgments are received, write a complete record to the log

Cohorts
If the transaction at the cohort is successful, then write undo and redo logs on stable storage and return an ‘agreed’ message; else return an abort message
If commit is received, release all resources and locks held for the transaction and send an acknowledgment
If abort is received, undo the transaction using the undo log record, release resources and locks, and send an acknowledgment
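The two phases above can be compressed into a sequential sketch, with the coordinator and cohorts in one process and messages modelled as return values; all names and the boolean vote representation are illustrative.

```python
# Sequential sketch of two-phase commit: collect votes (phase 1),
# then distribute the unanimous decision and gather acks (phase 2).

def cohort_vote(transaction_ok):
    # Phase 1 at a cohort: log undo/redo and agree, or vote abort.
    return "agreed" if transaction_ok else "abort"

def two_phase_commit(cohort_states):
    """cohort_states: one boolean per cohort, True if its local
    transaction succeeded."""
    votes = [cohort_vote(ok) for ok in cohort_states]        # Phase 1
    if all(v == "agreed" for v in votes):
        decision = "commit"   # the commit record is logged before sending
    else:
        decision = "abort"
    acks = ["ack" for _ in cohort_states]                    # Phase 2
    return decision, acks

decision, acks = two_phase_commit([True, True, True])
```

A single `abort` vote forces a global abort, which is exactly the unanimity property the protocol enforces.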
NonBlocking Commit Protocols
Our Blocking Theorem from last week states that if network partitioning is possible, then any distributed commit protocol may block.
Let’s assume now that the network cannot partition.
Then we can consult other processes to make progress.
However, if all processes fail, then we are, again, blocked.
Let’s further assume that total failure is not possible, i.e., not all processes are crashed at the same time.
Automata representation
We model the participants with finite state automata (FSA).
The participants move from one state to another as a result of receiving one or several messages or as a result of a timeout event.
Having received these messages, a participant may send some messages before executing the state transition.
Commit Protocol Automata
Final states are divided into Abort states and Commit states (finally, either Abort or Commit takes place).
Once an Abort state is reached, it is not possible to do a transition to a non-Abort state. (Abort is irreversible). Similarly for Commit states (Commit is also irreversible).
The state diagram is acyclic. We denote the initial state by q; the terminal states are a (an abort/rollback state) and c (a commit state). Often there is a wait state, which we denote by w.
Assume the participants are P1,…,Pn. Possible coordinator is P0, when the protocol starts.
2PC Coordinator
[State diagram: q → w on a commit-request from the application, sending VoteReq to P1,…,Pn; w → a on a timeout or a No from one of P1,…,Pn, sending Abort to P1,…,Pn; w → c on Yes from all of P1,…,Pn, sending Commit to P1,…,Pn.]
2PC Participant
[State diagram: q → a on VoteReq from P0, sending No to P0; q → w on VoteReq from P0, sending Yes to P0; w → c on Commit from P0; w → a on Abort from P0.]
Commit Protocol State Transitions
In a commit protocol, the idea is to inform other participants on local progress.
In fact, a state transition without message change is uninteresting, unless the participant moves into a terminal state.
Therefore, unless a participant moves into a terminal state, we may assume that it sends messages to other participants about its change of state.
To simplify our analysis, we may assume that the messages are sent to all other participants. This is not strictly necessary, but assuming less creates unnecessary complication.
Concurrency set
A concurrency set of a state s is the set of possible states among all participants, if some participant is in state s.
In other words, the concurrency set of state s is the set of all states that can co-exist with state s.
2PC Concurrency Sets
[The 2PC coordinator and participant state diagrams, as above.]
Concurrency_set(q) = {q,w,a}, Concurrency_set(a) = {q,w,a}
Concurrency_set(w) = {q,w,a,c}, Concurrency_set(c) = {w,c}
Committable states
We say that a state is committable, if the existence of a participant in this state means that everyone has voted Yes.
If a state is not committable, we say that it is non-committable.
In 2PC, c is the only committable state; q and w are non-committable.
How can a site terminate when there is a timeout?
Either (1) one of the operational sites knows the fate of the transaction, or (2) the operational sites can decide the fate of the transaction.
Knowing the fate of the transaction means, in practice, that there is a participant in a terminal state.
Start by considering a single participant s. The site must infer the possible states of other participants from its own state. This can be done using concurrency sets.
When can’t a single participant unilaterally abort?
Suppose a participant is in a state, which has a commit state in its concurrency set. Then, it is possible that some other participant is in a commit state.
A participant in a state, which has a commit state in its concurrency set, should not unilaterally abort.
When can’t a single participant unilaterally commit?
Suppose a participant is in a state, which has an abort state in its concurrency set. Then, some participant may be in an abort state.
A participant in a state, which has an abort state in its concurrency set, should not unilaterally commit.
Also, a participant that is not in a committable state should not commit.
The Fundamental Non-Blocking Theorem
A protocol is non-blocking, if and only if it satisfies the following conditions: (1) There exists no local state such that its concurrency set contains both an abort and a commit state, and (2) there exists no noncommittable state, whose concurrency set contains a commit state.
Showing the Fundamental Non-Blocking Theorem
From our discussion above it follows that Conditions (1) and (2) are necessary.
We discuss their sufficiency later by showing how to terminate a commit protocol fulfilling conditions (1) and (2).
Observations on 2PC
As the participants exchange messages as they progress, they progress in a synchronised fashion.
In fact, there is always at most one step difference between the states of any two live participants.
We say that the participants keep a one-step synchronisation.
It is easy to see by the Fundamental Non-Blocking Theorem that 2PC is blocking.
One-step synchronisation and non-blocking property
If a commit protocol keeps one-step synchronisation, then the concurrency set of state s consists of s and the states adjacent to s.
By applying this observation and the Fundamental Non-blocking Theorem, we get a useful Lemma:
Lemma
A protocol that is synchronous within one state transition is non-blocking if and only if (1) it contains no state adjacent to both a Commit and an Abort state, and (2) it contains no non-committable state that is adjacent to a Commit state.
How to improve 2PC to get a non-blocking protocol
It is easy to see that the state w is the problematic state – and in two ways: - it has both Abort and Commit in its concurrency set, and - it is a non-committable state, but it has Commit in its concurrency set.
Solution: add an extra state between w and c (adding between w and a would not do – why?)
We are primarily interested in the centralised protocol, but similar decentralised improvement is possible.
3PC Coordinator
[State diagram: q → w on a commit-request from the application, sending VoteReq to P1,…,Pn; w → a on a timeout or a No from one of P1,…,Pn, sending Abort to P1,…,Pn; w → p on Yes from all of P1,…,Pn, sending Prepare to P1,…,Pn; p → c on Ack from all of P1,…,Pn, sending Commit to P1,…,Pn.]
3PC Participant
[State diagram: q → a on VoteReq from P0, sending No to P0; q → w on VoteReq from P0, sending Yes to P0; w → p on Prepare from P0, sending Ack to P0; w → a on Abort from P0; p → c on Commit from P0.]
3PC Concurrency sets (cs)
[The 3PC coordinator and participant state diagrams, as above.]
cs(p) = {w,p,c},
cs(w) = {q,a,w,p},
etc.
3PC and failures
If there are no failures, then clearly 3PC is correct.
In the presence of failures, the operational participants should be able to terminate their execution.
In the centralised case, a need for termination protocol implies that the coordinator is no longer operational.
We discuss a general termination protocol. It makes the assumption that at least one participant remains operational and that the participants obey the Fundamental Non-Blocking Theorem.
Termination
Basic idea: Choose a backup coordinator B – vote or use some preassigned ids.
Backup Coordinator Decision Rule: If B’s state contains a commit state in its concurrency set, commit the transaction; otherwise abort the transaction.
Reasoning behind the rule: if B’s state contains a commit state in its concurrency set, then it is possible that some site has committed; otherwise it is not.
Re-executing termination
It is, of course, possible that the backup coordinator fails.
For this reason, the termination protocol should be executed in such a way that it can be re-executed.
In particular, the termination protocol must not break the one-step synchronisation.
Implementing termination
To keep one-step synchronisation, the termination protocol should be executed in two steps:
1. The backup coordinator B tells the others to make a transition to B’s state; the others answer OK. (This is not necessary if B is in a Commit or Abort state.)
2. B tells the others to commit or abort according to the decision rule.
Fundamental Non-Blocking Theorem Proof - Sufficiency
The basic termination procedure and decision rule is valid for any protocol that fulfills the conditions given in the Fundamental Non-Blocking Theorem.
The existence of a termination protocol completes the proof.
Voting protocols
Principles: Data replicated at several sites to increase reliability
Each replica assigned a number of votes
To access a replica, a process must collect a majority of votes
Vote mechanism: (1) Static voting:
Each replica has number of votes (in stable storage)
A process can access a replica for a read or write operation if it can collect a certain number of votes (read or write quorum)
(2) Dynamic voting
Number of votes or the set of sites that form a quorum change with the state of system (due to site and communication failures)
(2.1) Majority-based approach: The set of sites that can form a majority to allow access to replicated data changes with the changing state of the system
(2.2) Dynamic vote reassignment: Number of votes assigned to a site changes dynamically
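Static voting can be sketched as a pair of quorum checks. The standard constraints r + w > N and 2w > N (which force any two conflicting quorums to intersect) are an assumption consistent with, but not spelled out in, the slides; the vote assignment is illustrative.

```python
# Static voting sketch: each replica holds some votes; an operation
# proceeds only if it gathers a read quorum r or write quorum w.

def quorums_valid(total_votes, r, w):
    # r + w > N prevents concurrent read/write quorums;
    # 2w > N prevents two concurrent write quorums.
    return r + w > total_votes and 2 * w > total_votes

def can_access(votes_collected, quorum):
    return votes_collected >= quorum

# Illustrative assignment: three sites holding 2, 1, 1 votes (N = 4).
votes = {"site1": 2, "site2": 1, "site3": 1}
N = sum(votes.values())
```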
Failure resilient processes
Resilient process: continues execution in the presence of failures with minimum disruption to the service provided (masks failures)
Approaches for implementing resilient processes: Backup processes and
Replicated execution
(1) Backup processes
Each process is made of a primary process and one or more backup processes
The primary process executes, while the backup processes remain inactive
If the primary process fails, a backup process takes over
The primary process establishes checkpoints so that a backup process can restart from them
(2) Replicated execution
Several processes execute the same program concurrently
Majority consensus (voting) on their results
Increases both the reliability and availability of the process
Recovery (fault tolerant) block concept
Provide fault-tolerance within an individual sequential process in which assignments to stored variables are the only means of making recognizable progress
The recovery block is made of:
A primary block (the conventional program),
Zero or more alternates (providing the same function as the primary block, but using different algorithm), and
An acceptance test (performed on exit from a primary or alternate block to validate its actions).
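The control flow of a recovery block, run the primary, validate with the acceptance test, fall back to alternates on failure, can be sketched as follows; the blocks and the square-root scenario are illustrative.

```python
# Recovery block sketch: try the primary block, then each alternate in
# turn, accepting the first result that passes the acceptance test.

def recovery_block(primary, alternates, acceptance_test):
    for block in [primary] + alternates:
        try:
            result = block()
        except Exception:
            continue                 # a raised error counts as a failed block
        if acceptance_test(result):
            return result            # successful exit
    raise RuntimeError("all alternates exhausted")

# Primary is a (deliberately faulty) algorithm; the alternate is a
# slower but correct one; the acceptance test validates the result.
primary = lambda: -1.0               # faulty: a square root cannot be negative
alternate = lambda: 3.0
accept = lambda r: r >= 0 and abs(r * r - 9) < 1e-9
result = recovery_block(primary, [alternate], accept)
```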
Recovery (fault tolerant) Block concept
[Diagram: a recovery block A consists of a primary block AP (program text), an alternate block AQ (program text), and an acceptance test AT; the primary block and the alternate block both feed into the acceptance test.]
N-version programming
[Diagram: modules ‘0’, ‘1’, …, ‘n-1’ execute in parallel and feed their results to a voter.]
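The voter at the heart of N-version programming can be sketched as a majority function over the modules' outputs; the example modules (one deliberately faulty) are illustrative.

```python
# N-version programming sketch: n independently written modules compute
# the same function; the voter returns the majority result, masking a
# faulty minority version.

from collections import Counter

def vote(results):
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError("no majority among the versions")
    return value

modules = [
    lambda x: x * x,         # version 0
    lambda x: x ** 2,        # version 1
    lambda x: x * x + 1,     # version 2: faulty
]
outputs = [m(4) for m in modules]
answer = vote(outputs)       # the majority masks the faulty version
```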