Fault Tolerance
Motivation: Systems need to be much more reliable than their components
Use Redundancy: Extra items that can be used to make up for failures
Types of Redundancy:
» Hardware
» Software
» Time
» Information
Fault-Tolerant Scheduling
Fault Tolerance: The ability of a system to suffer component failures and still function adequately
Fault-Tolerant Scheduling: Save enough time in a schedule that the system can still function despite a certain number of processor failures
FT-Scheduling: Model
System Model
» Multiprocessor system
» Each processor has its own memory
» Tasks are preloaded into assigned processors
Task Model
» Tasks are independent of one another
» Schedules are created ahead of time
Basic Idea
Preassign backup copies, called ghosts. Assign ghosts to the processors along with the primary copies
» A ghost and a primary copy of the same task can’t be assigned to the same processor
» For each processor, all the primaries and a particular subset of the ghost copies assigned to it should be feasibly schedulable on that processor
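The placement constraint above can be checked mechanically. The sketch below is illustrative, not from the slides: the function name and the dict-based assignment representation are assumptions.

```python
# Sketch: validate a ghost/primary placement against the rule that a ghost
# and a primary copy of the same task must never share a processor.

def valid_assignment(primaries, ghosts):
    """primaries, ghosts: dicts mapping task name -> assigned processor."""
    for task, proc in primaries.items():
        # Reject any task whose ghost landed on the same processor
        # as its primary copy.
        if ghosts.get(task) == proc:
            return False
    return True

# Two tasks, each with its ghost on the other processor: valid.
print(valid_assignment({"t1": "P0", "t2": "P1"},
                       {"t1": "P1", "t2": "P0"}))  # True
# Ghost co-located with its primary: invalid.
print(valid_assignment({"t1": "P0"}, {"t1": "P0"}))  # False
```

A real scheduler would additionally run a feasibility test (e.g., a utilization or response-time check) over each processor's primaries plus every admissible subset of its ghosts, which this sketch omits.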
Requirements
Two main variations:
» Current and future iterations of the task have to be saved if a processor fails
» Only future iterations need to be saved; the current iteration can be discarded
Forward and Backward Masking
Forward Masking: Mask the output of failed units without significant loss of time
Backward Masking: After detecting an error, try to fix it by recomputing or some other means
Failure Types
Permanent: The fault is incurable
Transient: The unit is faulty for some time, following which it starts functioning correctly again
Intermittent: Frequently cycles between a faulty and a non-faulty state
Faults and Errors
A fault is some physical defect or malfunction
An error is a manifestation of a fault
Latency:
» Fault Latency: Time between occurrence of a fault and its manifestation as an error
» Error Latency: Time between the generation of an error and its being caught by the system
Hardware Failure Recovery
If transient, it may be enough to wait for the fault to go away and then reinvoke the computation
If permanent, reassign the tasks to other, functional, processors
Faults: Output Characteristics
Stuck-at: A line is stuck at 0 or 1
Dead: No output (e.g., high-impedance state)
Arbitrary: The output changes with time
Factors Affecting HW F-Rate
Temperature
Radiation
Power surges
Mechanical shocks
HW failure rate often follows the “bathtub” curve
Some Terminology
Fail-safe Systems: Systems which end up in a “safe” state upon failure
» Example: All traffic lights turning red in an intersection
Fail-stop Systems: Systems that stop producing output when they fail
Example of HW Redundancy
Triple-Modular Redundancy (TMR):
» Three units run the same algorithm in parallel
» Their outputs are voted on and the majority is picked as the output of the TMR cluster
» Can forward-mask up to one processor failure
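The TMR voter described above can be sketched in a few lines. This is a minimal illustration (function name is assumed); a hardware voter would of course be a circuit, not code.

```python
from collections import Counter

def tmr_vote(outputs):
    """Majority vote over three replica outputs.
    Masks up to one faulty unit; with two or more failures
    there may be no majority, so we signal an error."""
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: more than one unit failed")
    return value

# One faulty unit (7) is forward-masked without any recomputation.
print(tmr_vote([42, 42, 7]))  # 42
```

Note that this performs exact (bit-by-bit) comparison; replicas whose correct outputs can legitimately differ in low-order bits need one of the inexact voters discussed later.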
Mathematical Background
Basic laws of probability
» Density and distribution functions
» Notion of stochastic independence
» Expectation, variance, etc.
Memoryless distribution
Markov chains
» Steady-state & transient solutions
Bayes’s Law
Hardware FT
N-Modular Redundancy (NMR)
» Basic structure
» Variations
» Reliability evaluation: Independent failures, Correlated failures
» Voter: Bit-by-bit comparison, Median, Formalized majority, Generalized k-plurality
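Two of the voter variants listed above are easy to sketch: a median voter (useful when correct numeric outputs may differ slightly) and a generalized k-plurality voter (accept the most frequent value only if it occurs at least k times). Function names are illustrative assumptions.

```python
import statistics
from collections import Counter

def median_voter(outputs):
    # Median voter: with at most (N-1)/2 faulty units, the median of N
    # numeric outputs is bracketed by correct values.
    return statistics.median(outputs)

def k_plurality_voter(outputs, k):
    # Generalized k-plurality: pick the most common value if it occurs
    # at least k times; otherwise declare voter failure.
    value, count = Counter(outputs).most_common(1)[0]
    if count >= k:
        return value
    raise RuntimeError("no value reached the required plurality")

print(median_voter([10.0, 10.2, 99.9]))         # 10.2: outlier ignored
print(k_plurality_voter([1, 1, 2, 3, 1], k=3))  # 1
```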
Exploiting Appln Semantics
Acceptance Test: Specify a range outside which the output is tagged as faulty (or at least suspicious)
No acceptance test is perfect:
» Sensitivity: Probability of catching an incorrect output
» Specificity: Probability that an output which is flagged as wrong is really wrong; Specificity = 1 - False Positive Probability
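A range-based acceptance test of the kind described above is trivial to code. The sketch below is illustrative; the bounds come from application semantics and the sensor example is an assumption.

```python
def acceptance_test(value, low, high):
    """Tag outputs outside [low, high] as faulty or suspicious.
    The bounds must come from knowledge of the application."""
    return low <= value <= high

# Hypothetical example: a vehicle speed reading with a plausible
# physical range of 0..300 km/h.
print(acceptance_test(120.0, 0.0, 300.0))  # True: accepted
print(acceptance_test(-5.0, 0.0, 300.0))   # False: flagged as faulty
```

Note the trade-off the slide points at: widening the range raises specificity (fewer false alarms) but lowers sensitivity (more bad outputs slip through), and vice versa.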
Checkpointing
Store partial results in a safe place
When failure occurs, roll back to the latest checkpoint and restart
Issues:
» Checkpoint positioning
» Implementation: Kernel level, Application level
» Correctness: Can be a problem in distributed systems
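The checkpoint/rollback cycle can be shown with a toy application-level example. This is a sketch under obvious simplifications: the class is invented for illustration, and the "safe place" here is just an in-memory copy rather than non-volatile storage.

```python
import copy

class CheckpointedApp:
    """Toy application-level checkpointing: snapshot the state,
    roll back to the latest snapshot after a failure."""
    def __init__(self):
        self.state = {"sum": 0}
        self._checkpoint = copy.deepcopy(self.state)

    def checkpoint(self):
        # In a real system this would be written to non-volatile storage.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        self.state = copy.deepcopy(self._checkpoint)

app = CheckpointedApp()
app.state["sum"] += 10
app.checkpoint()
app.state["sum"] += 5   # work done after the checkpoint...
app.rollback()          # ...is lost when a simulated failure forces rollback
print(app.state["sum"])  # 10
```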
Terminology
Checkpointing Overhead: The part of the checkpointing activity that is not hidden from the application
Checkpointing Latency: Time from when a checkpoint starts being taken to when it is stored in non-volatile storage
Reducing Chkptg Overhead
Buffer checkpoint writes
Don’t checkpoint “dead” variables:
» Never used again by the program, or
» Next operation with respect to the variable is a write
» Problem is how to identify dead variables
Don’t checkpoint read-only stuff, like code
Reducing Chkptg Latency
Consider compressing the checkpoint. Usefulness of this approach depends on:
» Extent of the compression possible
» Work required to execute the compression algorithm
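A quick sketch of the trade-off: serialize some state, compress it, and compare sizes. The state and library choices (`pickle`, `zlib`) are illustrative assumptions; the point is that the benefit depends entirely on how compressible the state is versus the CPU time spent compressing it.

```python
import pickle
import zlib

# Highly regular state compresses well; random data would not.
state = {"values": list(range(1000))}
raw = pickle.dumps(state)
compressed = zlib.compress(raw)

print(len(compressed) < len(raw))  # True for this regular state
# Decompression must round-trip exactly, or the checkpoint is useless.
restored = pickle.loads(zlib.decompress(compressed))
print(restored == state)  # True
```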
Optimization of Chkptg
Objective in general-purpose systems is usually to minimize the expected execution time
Objective in real-time systems is to maximize the probability of meeting task deadlines
» Need a mathematical model to determine this
» Generally, we place checkpoints approximately equidistant from each other and just determine the optimal number of them
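For the general-purpose (expected-execution-time) objective, a classical closed form not given on the slide is Young's first-order approximation: with checkpoint overhead C and mean time between failures M, the optimal spacing is roughly sqrt(2·C·M). The sketch below uses that formula with illustrative numbers.

```python
import math

def young_interval(overhead_s, mtbf_s):
    """Young's first-order approximation of the optimal checkpoint
    interval: T_opt ~= sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * overhead_s * mtbf_s)

# Illustrative numbers: 10 s checkpoint overhead, 24 h MTBF.
print(round(young_interval(10.0, 24 * 3600)))  # ~1315 s between checkpoints
```

The real-time objective on the slide (maximizing deadline-meeting probability) generally needs a fuller stochastic model, which is why the slide calls for one.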
Distributed Checkpointing
Ordering of Events:
» Easy to do if there’s just one thread
» If there are multiple threads:
Events in the same thread are trivial to order
Event A in thread X is said to precede Event B in thread Y if there is some communication from X after event A that arrives at Y before event B
Given two events A and B in separate threads:
– A could precede B
– B could precede A
– They could be concurrent
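The "precedes" relation above is the classical happens-before relation, and Lamport logical clocks assign timestamps consistent with it. The sketch below is an assumed minimal implementation (class and method names are illustrative).

```python
class LamportThread:
    """Lamport logical clock: timestamps respect happens-before."""
    def __init__(self):
        self.clock = 0

    def local_event(self):
        self.clock += 1
        return self.clock

    def send(self):
        self.clock += 1
        return self.clock              # timestamp carried by the message

    def receive(self, msg_ts):
        # Jump past the sender's timestamp, then tick.
        self.clock = max(self.clock, msg_ts) + 1
        return self.clock

x, y = LamportThread(), LamportThread()
a = x.local_event()   # event A in thread X
ts = x.send()         # X communicates with Y after A...
b = y.receive(ts)     # ...arriving at Y before event B
print(a < b)          # True: the clocks reflect that A precedes B
```

The converse does not hold: a smaller Lamport timestamp does not prove precedence, which is exactly why two events can be concurrent.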
Distributed Checkpointing
Domino Effect: An uncontrolled cascade of rollbacks can roll the entire system back to the starting state
To avoid the domino effect, we can coordinate the checkpointing
» Tightly synchronize the checkpoints in all processors
» Koo-Toueg algorithm
Checkptg with Clock Sync
Assume the clock skew is bounded by ε and the minimum message delivery time is δ
Each processor:
» Takes a local checkpoint at some specified time T
» Following its checkpoint, it does not send out any messages until it is sure that any message it sends will be received only after the recipient has itself checkpointed; i.e., until its local clock reads T + ε − δ
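A tiny sketch of the silent window, writing `skew` for the clock-skew bound ε and `min_delay` for the minimum delivery time δ (names and numeric values are illustrative assumptions):

```python
def silent_until(T, skew, min_delay):
    """Local clock time until which a processor that checkpointed at
    local time T must stay silent. Any message sent after this time
    arrives at least min_delay later, so even a recipient whose clock
    lags by the full skew has already passed its own local time T."""
    return T + max(skew - min_delay, 0.0)

# Checkpoint at local time 100, skew bound 2.0, minimum delay 0.5:
print(silent_until(100.0, skew=2.0, min_delay=0.5))  # 101.5
```

If the minimum delivery time already exceeds the skew bound, no silent window is needed at all, which the `max(..., 0.0)` captures.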
Koo-Toueg Algorithm
A processor that wants to checkpoint:
» Does so, locally
» Tells all processors which have communicated with it the last message (timestamp or message number) received from them
If these processors don’t have a checkpoint recording the transmission of this message, they take a checkpoint
This can result in a surge of checkpointing activity visible at the non-volatile storage
Software Fault Tolerance
It is practically impossible to produce a large piece of software that is bug-free
» E.g., Even the space shuttle flew with several potentially disastrous bugs despite extensive testing
Single-version Fault Tolerance
Multi-version Fault Tolerance
Fault Models
Reasonably trustworthy hardware fault models exist
Many software fault models exist in the literature, but not one can be fully trusted to represent reality
Single-Version FT
Wrappers: Code “wrapped around” the software that checks for consistency and correctness
Software Rejuvenation: Reboot the machine reasonably frequently
Use data diversity: Sometimes an algorithm may fail on some data but not if these data are subjected to minor perturbations
Multi-version FT
Very, very expensive
Two basic approaches:
» N-version programming
» Recovery Blocks
N-Version Programming (NVP)
Theoretically appealing, but hard to make it effective
Basic Idea:
» Have N independent teams of programmers develop applications independently
» Run them in parallel and vote on them
» If they are truly independent, they will be highly reliable
Failure Diversity
Effectiveness hinges on whether faults in the versions are statistically independent of one another
Forces against truly independent failures:
» Common programming “culture”
» Common specifications
» Common algorithms
» Common software/hardware platforms
Failure Diversity
Incidental Diversity
» Prohibit interaction between teams of programmers working on different versions and hope they produce independently failing versions
Forced Diversity
» Diverse specifications
» Diverse programming languages
» Diverse development tools and compilers
» Cognitively diverse teams: Probably not realistic
Experimental Results
Experiments suggest that correlated failures do occur at a much higher rate than would be the case if failures in the versions were stochastically independent
Example: Study conducted by Brilliant, Knight, and Leveson at UVa and UCI
» 27 students writing code for an anti-missile application
» 93 correlated failures observed: if true independence had existed, we’d have expected about 5
Recovery Blocks
Also uses multiple versions
Only one version is active at any time
If the output of this version fails an acceptance test, another version is activated
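The recovery-block structure above maps directly onto a try-each-alternate loop. The sketch below is an assumed minimal implementation; the "buggy primary" and the acceptance test are fabricated for illustration.

```python
def recovery_block(alternates, acceptance_test, x):
    """Run the versions one at a time; return the first output
    that passes the acceptance test."""
    for version in alternates:
        try:
            out = version(x)
        except Exception:
            continue  # a crashing version also fails the block
        if acceptance_test(out):
            return out
    raise RuntimeError("all versions failed the acceptance test")

primary   = lambda x: -1      # hypothetical buggy primary: bogus negative result
secondary = lambda x: x * x   # simpler, presumably more trustworthy alternate

# A square must be non-negative; the primary fails that test.
result = recovery_block([primary, secondary], lambda v: v >= 0, 7)
print(result)  # 49: secondary's output accepted
```

Note how the scheme's effectiveness rests entirely on the acceptance test's sensitivity, tying back to the earlier discussion of imperfect acceptance tests.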
Byzantine Failures
The worst failure mode known
Original Motivating Problem (~1978):
» A sensor needs to disseminate its output to a set of processors. How can we ensure that:
– If the sensor is functioning correctly: All functional processors obtain the correct sensor reading
– If the sensor is malfunctioning: All functional processors agree on the sensor reading
Byzantine Generals Problem
Some divisions of the Byzantine Army are besieging a city. They must all coordinate their attacks (or coordinate their retreat) to avoid disaster
The overall commander communicates to his divisional commanders by means of a confidential messenger. This messenger is trustworthy and doesn’t alter the message; it can only be read by its intended recipient
Byz Generals Problem (contd.)
If the C-in-C is loyal
» He sends consistent orders to the subordinate generals
» All loyal subordinates must obey his order
If the C-in-C is a traitor
» All loyal subordinate generals must agree on some default action (e.g., running away)
Impossibility with 3 Generals
Suppose there are 2 divisions, A and B. The Commander-in-chief is a traitor and sends a message to Com(A) saying “Attack!” and to Com(B) saying “Retreat!”
Com(A) sends a messenger to Com(B), saying “The boss told me to attack!”
Com(B) receives:
» Direct order from the C-in-C saying “Retreat”
» Message from Com(A) saying “I was ordered to attack”
Byz. Generals Problem (contd.)
Com(B)’s dilemma:
» Either the C-in-C or Com(A) is a traitor: it is impossible to know which
» Further communication with Com(A) won’t add any useful information
» Not possible to ensure that if Com(A) and Com(B) are both loyal, they both agree on the same action
The problem cannot be solved if there are 3 generals who may include at least one traitor
Byz. Generals Problem (contd.)
Central Result: To reach agreement with a total of N participants with up to m traitors, we must have N > 3m
Byzantine Generals Algorithm
Byz(0) // no-failure algorithm
» C-in-C sends his order to every subordinate
» The subordinate uses the order he receives, or the default if he receives no order
Byz(m) // For up to m traitors (failures)
» (1) C-in-C sends order to every subordinate G_i: let this be received as v_i
» (2) G_i acts as the C-in-C in a Byz(m-1) algorithm to circulate this order to his colleagues
» (3) For each (i,j) such that i != j, let w_(i,j) be the order that G_i got from G_j in step 2, or the default if no message was received. G_i calculates the majority of the orders {v_i, w_(i,j)} and uses it as the correct order to follow
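The recursion above is the Lamport-Shostak-Pease OM(m) algorithm, and a small simulation makes its behavior concrete. The sketch below is an assumption-laden illustration: function names are invented, and the traitor model (a traitor deterministically flips any order it relays) is just one of the arbitrary behaviors a Byzantine node could exhibit.

```python
from collections import Counter

def om(m, commander, lieutenants, order, traitors):
    """Byz(m) simulation. Returns {lieutenant: decided order}.
    Assumed fault model: a traitorous sender flips every order it relays;
    loyal senders relay faithfully."""
    def sent(sender, value):
        if sender in traitors:
            return "ATTACK" if value == "RETREAT" else "RETREAT"
        return value

    if m == 0:
        # Byz(0): everyone simply uses what the commander sent.
        return {lt: sent(commander, order) for lt in lieutenants}

    # Step (1): each lieutenant i receives v_i from the commander.
    v = {lt: sent(commander, order) for lt in lieutenants}
    decisions = {}
    for i in lieutenants:
        values = [v[i]]
        for j in lieutenants:
            if j != i:
                # Step (2): j acts as commander in Byz(m-1) to relay v_j;
                # w_(i,j) is what i receives from that sub-round.
                others = [k for k in lieutenants if k != j]
                values.append(om(m - 1, j, others, v[j], traitors)[i])
        # Step (3): majority over {v_i, w_(i,j)}.
        decisions[i] = Counter(values).most_common(1)[0][0]
    return decisions

# N = 4, m = 1 (satisfies N > 3m). Traitorous lieutenant L3:
# the loyal lieutenants still agree on the loyal commander's order.
result = om(1, "C", ["L1", "L2", "L3"], "ATTACK", traitors={"L3"})
print(result["L1"], result["L2"])  # ATTACK ATTACK
```

Running it instead with a traitorous commander (`traitors={"C"}`) shows the other half of the requirement: the loyal lieutenants end up agreeing with one another, even though the order they agree on is not the one the commander intended.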