Fault Tolerance
Motivation: Systems need to be much more reliable than their components
Use Redundancy: Extra items that can be used to make up for failures
Types of Redundancy:
» Hardware
» Software
» Time
» Information
Fault-Tolerant Scheduling
Fault Tolerance: The ability of a system to suffer component failures and still function adequately
Fault-Tolerant Scheduling: Save enough time in a schedule that the system can still function despite a certain number of processor failures
FT-Scheduling: Model
System Model
» Multiprocessor system
» Each processor has its own memory
» Tasks are preloaded into assigned processors
Task Model
» Tasks are independent of one another
» Schedules are created ahead of time
Basic Idea
Preassign backup copies, called ghosts. Assign ghosts to the processors along with the primary copies
» A ghost and a primary copy of the same task can’t be assigned to the same processor
» For each processor, all the primaries and a particular subset of the ghost copies assigned to it should be feasibly schedulable on that processor
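The placement constraint above can be checked mechanically. The sketch below is illustrative, not from the slides: the function name and the dict-based assignment representation are assumptions.

```python
# Sketch: validate a ghost/primary placement against the rule that a ghost
# and a primary copy of the same task must never share a processor.

def valid_assignment(primaries, ghosts):
    """primaries, ghosts: dicts mapping task name -> assigned processor."""
    for task, proc in primaries.items():
        # Reject any task whose ghost landed on the same processor
        # as its primary copy.
        if ghosts.get(task) == proc:
            return False
    return True

# Two tasks, each with its ghost on the other processor: valid.
print(valid_assignment({"t1": "P0", "t2": "P1"},
                       {"t1": "P1", "t2": "P0"}))  # True
# Ghost co-located with its primary: invalid.
print(valid_assignment({"t1": "P0"}, {"t1": "P0"}))  # False
```

A real scheduler would additionally run a feasibility test (e.g., a utilization or response-time check) over each processor's primaries plus every admissible subset of its ghosts, which this sketch omits.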
Requirements
Two main variations:
» Current and future iterations of the task have to be saved if a processor fails
» Only future iterations need to be saved; the current iteration can be discarded
Forward and Backward Masking
Forward Masking: Mask the output of failed units without significant loss of time
Backward Masking: After detecting an error, try to fix it by recomputing or some other means
Failure Types
Permanent: The fault is incurable
Transient: The unit is faulty for some time, following which it starts functioning correctly again
Intermittent: Frequently cycles between a faulty and a non-faulty state
Faults and Errors
A fault is some physical defect or malfunction
An error is a manifestation of a fault
Latency:
» Fault Latency: Time between occurrence of a fault and its manifestation as an error
» Error Latency: Time between the generation of an error and its being caught by the system
Hardware Failure Recovery
If transient, it may be enough to wait for the fault to go away and then reinvoke the computation
If permanent, reassign the tasks to other, functional, processors
Faults: Output Characteristics
Stuck-at: A line is stuck at 0 or 1
Dead: No output (e.g., high-impedance state)
Arbitrary: The output changes with time
Factors Affecting HW F-Rate
Temperature
Radiation
Power surges
Mechanical shocks
HW failure rate often follows the “bathtub” curve
Some Terminology
Fail-safe Systems: Systems which end up in a “safe” state upon failure
» Example: All traffic lights turning red in an intersection
Fail-stop Systems: Systems that stop producing output when they fail
Example of HW Redundancy
Triple-Modular Redundancy (TMR):
» Three units run the same algorithm in parallel
» Their outputs are voted on and the majority is picked as the output of the TMR cluster
» Can forward-mask up to one processor failure
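The TMR voter described above can be sketched in a few lines. This is a minimal illustration (function name is assumed); a hardware voter would of course be a circuit, not code.

```python
from collections import Counter

def tmr_vote(outputs):
    """Majority vote over three replica outputs.
    Masks up to one faulty unit; with two or more failures
    there may be no majority, so we signal an error."""
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: more than one unit failed")
    return value

# One faulty unit (7) is forward-masked without any recomputation.
print(tmr_vote([42, 42, 7]))  # 42
```

Note that this performs exact (bit-by-bit) comparison; replicas whose correct outputs can legitimately differ in low-order bits need one of the inexact voters discussed later.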
Mathematical Background
Basic laws of probability
» Density and distribution functions
» Notion of stochastic independence
» Expectation, variance, etc.
Memoryless distribution
Markov chains
» Steady-state & transient solutions
Bayes’s Law
Hardware FT
N-Modular Redundancy (NMR)
» Basic structure
» Variations
» Reliability evaluation: Independent failures, Correlated failures
» Voter: Bit-by-bit comparison, Median, Formalized majority, Generalized k-plurality
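Two of the voter variants listed above are easy to sketch: a median voter (useful when correct numeric outputs may differ slightly) and a generalized k-plurality voter (accept the most frequent value only if it occurs at least k times). Function names are illustrative assumptions.

```python
import statistics
from collections import Counter

def median_voter(outputs):
    # Median voter: with at most (N-1)/2 faulty units, the median of N
    # numeric outputs is bracketed by correct values.
    return statistics.median(outputs)

def k_plurality_voter(outputs, k):
    # Generalized k-plurality: pick the most common value if it occurs
    # at least k times; otherwise declare voter failure.
    value, count = Counter(outputs).most_common(1)[0]
    if count >= k:
        return value
    raise RuntimeError("no value reached the required plurality")

print(median_voter([10.0, 10.2, 99.9]))         # 10.2: outlier ignored
print(k_plurality_voter([1, 1, 2, 3, 1], k=3))  # 1
```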
Exploiting Appln Semantics
Acceptance Test: Specify a range outside which the output is tagged as faulty (or at least suspicious)
No acceptance test is perfect:
» Sensitivity: Probability of catching an incorrect output
» Specificity: Probability that an output which is flagged as wrong is really wrong; Specificity = 1 - False Positive Probability
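A range-based acceptance test of the kind described above is trivial to code. The sketch below is illustrative; the bounds come from application semantics and the sensor example is an assumption.

```python
def acceptance_test(value, low, high):
    """Tag outputs outside [low, high] as faulty or suspicious.
    The bounds must come from knowledge of the application."""
    return low <= value <= high

# Hypothetical example: a vehicle speed reading with a plausible
# physical range of 0..300 km/h.
print(acceptance_test(120.0, 0.0, 300.0))  # True: accepted
print(acceptance_test(-5.0, 0.0, 300.0))   # False: flagged as faulty
```

Note the trade-off the slide points at: widening the range raises specificity (fewer false alarms) but lowers sensitivity (more bad outputs slip through), and vice versa.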
Checkpointing
Store partial results in a safe place
When failure occurs, roll back to the latest checkpoint and restart
Issues:
» Checkpoint positioning
» Implementation: Kernel level, Application level
» Correctness: Can be a problem in distributed systems
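The checkpoint/rollback cycle can be shown with a toy application-level example. This is a sketch under obvious simplifications: the class is invented for illustration, and the "safe place" here is just an in-memory copy rather than non-volatile storage.

```python
import copy

class CheckpointedApp:
    """Toy application-level checkpointing: snapshot the state,
    roll back to the latest snapshot after a failure."""
    def __init__(self):
        self.state = {"sum": 0}
        self._checkpoint = copy.deepcopy(self.state)

    def checkpoint(self):
        # In a real system this would be written to non-volatile storage.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        self.state = copy.deepcopy(self._checkpoint)

app = CheckpointedApp()
app.state["sum"] += 10
app.checkpoint()
app.state["sum"] += 5   # work done after the checkpoint...
app.rollback()          # ...is lost when a simulated failure forces rollback
print(app.state["sum"])  # 10
```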
Terminology
Checkpointing Overhead: The part of the checkpointing activity that is not hidden from the application
Checkpointing Latency: Time from when a checkpoint starts being taken to when it is stored in non-volatile storage
Reducing Chkptg Overhead
Buffer checkpoint writes
Don’t checkpoint “dead” variables:
» Never used again by the program, or
» Next operation with respect to the variable is a write
» Problem is how to identify dead variables
Don’t checkpoint read-only stuff, like code
Reducing Chkptg Latency
Consider compressing the checkpoint. Usefulness of this approach depends on:
» Extent of the compression possible
» Work required to execute the compression algorithm
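A quick sketch of the trade-off: serialize some state, compress it, and compare sizes. The state and library choices (`pickle`, `zlib`) are illustrative assumptions; the point is that the benefit depends entirely on how compressible the state is versus the CPU time spent compressing it.

```python
import pickle
import zlib

# Highly regular state compresses well; random data would not.
state = {"values": list(range(1000))}
raw = pickle.dumps(state)
compressed = zlib.compress(raw)

print(len(compressed) < len(raw))  # True for this regular state
# Decompression must round-trip exactly, or the checkpoint is useless.
restored = pickle.loads(zlib.decompress(compressed))
print(restored == state)  # True
```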
Optimization of Chkptg
Objective in general-purpose systems is usually to minimize the expected execution time
Objective in real-time systems is to maximize the probability of meeting task deadlines
» Need a mathematical model to determine this
» Generally, we place checkpoints approximately equidistant from each other and just determine the optimal number of them
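For the general-purpose (expected-execution-time) objective, a classical closed form not given on the slide is Young's first-order approximation: with checkpoint overhead C and mean time between failures M, the optimal spacing is roughly sqrt(2·C·M). The sketch below uses that formula with illustrative numbers.

```python
import math

def young_interval(overhead_s, mtbf_s):
    """Young's first-order approximation of the optimal checkpoint
    interval: T_opt ~= sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * overhead_s * mtbf_s)

# Illustrative numbers: 10 s checkpoint overhead, 24 h MTBF.
print(round(young_interval(10.0, 24 * 3600)))  # ~1315 s between checkpoints
```

The real-time objective on the slide (maximizing deadline-meeting probability) generally needs a fuller stochastic model, which is why the slide calls for one.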
Distributed Checkpointing
Ordering of Events:
» Easy to do if there’s just one thread
» If there are multiple threads:
Events in the same thread are trivial to order
Event A in thread X is said to precede Event B in thread Y if there is some communication from X after event A that arrives at Y before event B
Given two events A and B in separate threads:
– A could precede B
– B could precede A
– They could be concurrent
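The "precedes" relation above is the classical happens-before relation, and Lamport logical clocks assign timestamps consistent with it. The sketch below is an assumed minimal implementation (class and method names are illustrative).

```python
class LamportThread:
    """Lamport logical clock: timestamps respect happens-before."""
    def __init__(self):
        self.clock = 0

    def local_event(self):
        self.clock += 1
        return self.clock

    def send(self):
        self.clock += 1
        return self.clock              # timestamp carried by the message

    def receive(self, msg_ts):
        # Jump past the sender's timestamp, then tick.
        self.clock = max(self.clock, msg_ts) + 1
        return self.clock

x, y = LamportThread(), LamportThread()
a = x.local_event()   # event A in thread X
ts = x.send()         # X communicates with Y after A...
b = y.receive(ts)     # ...arriving at Y before event B
print(a < b)          # True: the clocks reflect that A precedes B
```

The converse does not hold: a smaller Lamport timestamp does not prove precedence, which is exactly why two events can be concurrent.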
Distributed Checkpointing
Domino Effect: An uncontrolled cascade of rollbacks can roll the entire system back to the starting state
To avoid the domino effect, we can coordinate the checkpointing
» Tightly synchronize the checkpoints in all processors
» Koo-Toueg algorithm
Checkptg with Clock Sync
Assume the clock skew is bounded by ε and the minimum message delivery time is δ
Each processor:
» Takes a local checkpoint at some specified time T
» Following its checkpoint, it does not send out any messages until it is sure that any message it sends will be received only after the recipient has itself checkpointed; i.e., until its local clock reads T + ε − δ
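A tiny sketch of the silent window, writing `skew` for the clock-skew bound ε and `min_delay` for the minimum delivery time δ (names and numeric values are illustrative assumptions):

```python
def silent_until(T, skew, min_delay):
    """Local clock time until which a processor that checkpointed at
    local time T must stay silent. Any message sent after this time
    arrives at least min_delay later, so even a recipient whose clock
    lags by the full skew has already passed its own local time T."""
    return T + max(skew - min_delay, 0.0)

# Checkpoint at local time 100, skew bound 2.0, minimum delay 0.5:
print(silent_until(100.0, skew=2.0, min_delay=0.5))  # 101.5
```

If the minimum delivery time already exceeds the skew bound, no silent window is needed at all, which the `max(..., 0.0)` captures.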
Koo-Toueg Algorithm
A processor that wants to checkpoint:
» Does so, locally
» Tells all processors which have communicated with it the last message (timestamp or message number) received from them
If these processors don’t have a checkpoint recording the transmission of this message, they take a checkpoint
This can result in a surge of checkpointing activity visible at the non-volatile storage
Software Fault Tolerance
It is practically impossible to produce a large piece of software that is bug-free
» E.g., Even the space shuttle flew with several potentially disastrous bugs despite extensive testing
Single-version Fault Tolerance
Multi-version Fault Tolerance
Fault Models
Reasonably trustworthy hardware fault models exist
Many software fault models exist in the literature, but not one can be fully trusted to represent reality
Single-Version FT
Wrappers: Code “wrapped around” the software that checks for consistency and correctness
Software Rejuvenation: Reboot the machine reasonably frequently
Use data diversity: Sometimes an algorithm may fail on some data but not if these data are subjected to minor perturbations
Multi-version FT
Very, very expensive
Two basic approaches:
» N-version programming
» Recovery Blocks
N-Version Programming (NVP)
Theoretically appealing, but hard to make it effective
Basic Idea:
» Have N independent teams of programmers develop applications independently
» Run them in parallel and vote on them
» If they are truly independent, they will be highly reliable
Failure Diversity
Effectiveness hinges on whether faults in the versions are statistically independent of one another
Forces against truly independent failures:
» Common programming “culture”
» Common specifications
» Common algorithms
» Common software/hardware platforms
Failure Diversity
Incidental Diversity
» Prohibit interaction between teams of programmers working on different versions and hope they produce independently failing versions
Forced Diversity
» Diverse specifications
» Diverse programming languages
» Diverse development tools and compilers
» Cognitively diverse teams: Probably not realistic
Experimental Results
Experiments suggest that correlated failures do occur at a much higher rate than would be the case if failures in the versions were stochastically independent
Example: Study conducted by Brilliant, Knight, and Leveson at UVa and UCI
» 27 students writing code for an anti-missile application
» 93 correlated failures observed: if true independence had existed, we’d have expected about 5
Recovery Blocks
Also uses multiple versions
Only one version is active at any time
If the output of this version fails an acceptance test, another version is activated
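The recovery-block structure above maps directly onto a try-each-alternate loop. The sketch below is an assumed minimal implementation; the "buggy primary" and the acceptance test are fabricated for illustration.

```python
def recovery_block(alternates, acceptance_test, x):
    """Run the versions one at a time; return the first output
    that passes the acceptance test."""
    for version in alternates:
        try:
            out = version(x)
        except Exception:
            continue  # a crashing version also fails the block
        if acceptance_test(out):
            return out
    raise RuntimeError("all versions failed the acceptance test")

primary   = lambda x: -1      # hypothetical buggy primary: bogus negative result
secondary = lambda x: x * x   # simpler, presumably more trustworthy alternate

# A square must be non-negative; the primary fails that test.
result = recovery_block([primary, secondary], lambda v: v >= 0, 7)
print(result)  # 49: secondary's output accepted
```

Note how the scheme's effectiveness rests entirely on the acceptance test's sensitivity, tying back to the earlier discussion of imperfect acceptance tests.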
Byzantine Failures
The worst failure mode known
Original Motivating Problem (~1978):
» A sensor needs to disseminate its output to a set of processors. How can we ensure that:
– If the sensor is functioning correctly: All functional processors obtain the correct sensor reading
– If the sensor is malfunctioning: All functional processors agree on the sensor reading
Byzantine Generals Problem
Some divisions of the Byzantine Army are besieging a city. They must all coordinate their attacks (or coordinate their retreat) to avoid disaster
The overall commander communicates to his divisional commanders by means of a confidential messenger. This messenger is trustworthy and doesn’t alter the message; it can only be read by its intended recipient
Byz Generals Problem (contd.)
If the C-in-C is loyal
» He sends consistent orders to the subordinate generals
» All loyal subordinates must obey his order
If the C-in-C is a traitor
» All loyal subordinate generals must agree on some default action (e.g., running away)
Impossibility with 3 Generals
Suppose there are 2 divisions, A and B. The Commander-in-chief is a traitor and sends a message to Com(A) saying “Attack!” and to Com(B) saying “Retreat!”
Com(A) sends a messenger to Com(B), saying “The boss told me to attack!”
Com(B) receives:
» Direct order from the C-in-C saying “Retreat”
» Message from Com(A) saying “I was ordered to attack”
Byz. Generals Problem (contd.)
Com(B)’s dilemma:
» Either the C-in-C or Com(A) is a traitor: it is impossible to know which
» Further communication with Com(A) won’t add any useful information
» Not possible to ensure that if Com(A) and Com(B) are both loyal, they both agree on the same action
The problem cannot be solved if there are 3 generals who may include at least one traitor
Byz. Generals Problem (contd.)
Central Result: To reach agreement with a total of N participants with up to m traitors, we must have N > 3m
Byzantine Generals Algorithm
Byz(0) // no-failure algorithm
» C-in-C sends his order to every subordinate
» The subordinate uses the order he receives, or the default if he receives no order
Byz(m) // For up to m traitors (failures)
» (1) C-in-C sends order to every subordinate G_i: let this be received as v_i
» (2) G_i acts as the C-in-C in a Byz(m-1) algorithm to circulate this order to his colleagues
» (3) For each (i,j) such that i != j, let w_(i,j) be the order that G_i got from G_j in step 2, or the default if no message was received. G_i calculates the majority of the orders {v_i, w_(i,j)} and uses it as the correct order to follow
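The recursion above is the Lamport-Shostak-Pease OM(m) algorithm, and a small simulation makes its behavior concrete. The sketch below is an assumption-laden illustration: function names are invented, and the traitor model (a traitor deterministically flips any order it relays) is just one of the arbitrary behaviors a Byzantine node could exhibit.

```python
from collections import Counter

def om(m, commander, lieutenants, order, traitors):
    """Byz(m) simulation. Returns {lieutenant: decided order}.
    Assumed fault model: a traitorous sender flips every order it relays;
    loyal senders relay faithfully."""
    def sent(sender, value):
        if sender in traitors:
            return "ATTACK" if value == "RETREAT" else "RETREAT"
        return value

    if m == 0:
        # Byz(0): everyone simply uses what the commander sent.
        return {lt: sent(commander, order) for lt in lieutenants}

    # Step (1): each lieutenant i receives v_i from the commander.
    v = {lt: sent(commander, order) for lt in lieutenants}
    decisions = {}
    for i in lieutenants:
        values = [v[i]]
        for j in lieutenants:
            if j != i:
                # Step (2): j acts as commander in Byz(m-1) to relay v_j;
                # w_(i,j) is what i receives from that sub-round.
                others = [k for k in lieutenants if k != j]
                values.append(om(m - 1, j, others, v[j], traitors)[i])
        # Step (3): majority over {v_i, w_(i,j)}.
        decisions[i] = Counter(values).most_common(1)[0][0]
    return decisions

# N = 4, m = 1 (satisfies N > 3m). Traitorous lieutenant L3:
# the loyal lieutenants still agree on the loyal commander's order.
result = om(1, "C", ["L1", "L2", "L3"], "ATTACK", traitors={"L3"})
print(result["L1"], result["L2"])  # ATTACK ATTACK
```

Running it instead with a traitorous commander (`traitors={"C"}`) shows the other half of the requirement: the loyal lieutenants end up agreeing with one another, even though the order they agree on is not the one the commander intended.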