fault tolerance


Upload: abdul-burton

Post on 31-Dec-2015


Page 1: Fault Tolerance

Fault Tolerance

Motivation: Systems need to be much more reliable than their components

Use Redundancy: Extra items that can be used to make up for failures

Types of Redundancy:

» Hardware

» Software

» Time

» Information

Page 2: Fault Tolerance

Fault-Tolerant Scheduling

Fault Tolerance: The ability of a system to suffer component failures and still function adequately

Fault-Tolerant Scheduling: Save enough time in a schedule that the system can still function despite a certain number of processor failures

Page 3: Fault Tolerance

FT-Scheduling: Model

System Model

» Multiprocessor system

» Each processor has its own memory

» Tasks are preloaded into assigned processors

Task Model

» Tasks are independent of one another

» Schedules are created ahead of time

Page 4: Fault Tolerance

Basic Idea

Preassign backup copies, called ghosts.

Assign ghosts to the processors along with the primary copies

» A ghost and a primary copy of the same task can’t be assigned to the same processor

» For each processor, all the primaries and a particular subset of the ghost copies assigned to it should be feasibly schedulable on that processor

Page 5: Fault Tolerance

Requirements

Two main variations:

» Current and future iterations of the task have to be saved if a processor fails

» Only future iterations need to be saved; the current iteration can be discarded

Page 6: Fault Tolerance

Forward and Backward Masking

Forward Masking: Mask the output of failed units without significant loss of time

Backward Masking: After detecting an error, try to fix it by recomputing or some other means

Page 7: Fault Tolerance

Failure Types

Permanent: The fault is incurable

Transient: The unit is faulty for some time, following which it starts functioning correctly again

Intermittent: Frequently cycles between a faulty and a non-faulty state

Page 8: Fault Tolerance

Faults and Errors

A fault is some physical defect or malfunction

An error is a manifestation of a fault

Latency:

» Fault Latency: Time between occurrence of a fault and its manifestation as an error

» Error Latency: Time between the generation of an error and its being caught by the system

Page 9: Fault Tolerance

Hardware Failure Recovery

If transient, it may be enough to wait for the fault to go away and then reinvoke the computation

If permanent, reassign the tasks to other, functional, processors

Page 10: Fault Tolerance

Faults: Output Characteristics

Stuck-at: A line is stuck at 0 or 1

Dead: No output (e.g., high-impedance state)

Arbitrary: The output changes with time

Page 11: Fault Tolerance

Factors Affecting HW F-Rate

Temperature

Radiation

Power surges

Mechanical shocks

HW failure rate often follows the “bathtub” curve

Page 12: Fault Tolerance

Some Terminology

Fail-safe Systems: Systems which end up in a “safe” state upon failure

» Example: All traffic lights turning red in an intersection

Fail-stop Systems: Systems that stop producing output when they fail

Page 13: Fault Tolerance

Example of HW Redundancy

Triple-Modular Redundancy (TMR):

» Three units run the same algorithm in parallel

» Their outputs are voted on and the majority is picked as the output of the TMR cluster

» Can forward-mask up to one processor failure
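
The voting step can be sketched in a few lines of Python. This is only an illustration: the unit functions `healthy` and `faulty` and the helper names are hypothetical, and a real TMR voter is usually implemented in hardware.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of units, or None if there is none."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None

def tmr(units, x):
    """Run the redundant units on the same input and vote on their outputs."""
    return majority_vote([unit(x) for unit in units])

# Hypothetical units: two healthy ones and one that always outputs a wrong value.
healthy = lambda x: x * x
faulty = lambda x: -1

result = tmr([healthy, faulty, healthy], 7)  # the single faulty output is masked
print(result)  # 49
```

With two or more faulty units the majority may be wrong or missing, which is why TMR masks only one failure.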

Page 14: Fault Tolerance

Mathematical Background

Basic laws of probability

» Density and distribution functions

» Notion of stochastic independence

» Expectation, variance, etc.

Memoryless distribution

» Markov chains

Steady-state & transient solutions

Bayes’s Law

Page 15: Fault Tolerance

Hardware FT

N-Modular Redundancy (NMR)

» Basic structure

– Variations

» Reliability evaluation

– Independent failures

– Correlated failures

» Voter:

– Bit-by-bit comparison

– Median

– Formalized majority

– Generalized k-plurality
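
For the independent-failure case, the reliability of an NMR cluster follows from the binomial distribution. The sketch below assumes that every unit has the same reliability `r`, that units fail independently, and that the voter itself never fails; none of these assumptions is stated on the slide. A median voter for numeric outputs is included as well.

```python
from math import comb
from statistics import median

def nmr_reliability(n, r):
    """Probability that a strict majority of n independent units (each with reliability r) works."""
    need = n // 2 + 1
    return sum(comb(n, k) * r**k * (1 - r)**(n - k) for k in range(need, n + 1))

def median_vote(outputs):
    """Median voter: a single wildly wrong numeric output cannot become the result."""
    return median(outputs)

print(nmr_reliability(3, 0.9))
print(median_vote([10.0, 10.1, 55.0]))  # 10.1
```

Under these assumptions, redundancy only helps when the units are already fairly reliable: for `r` below 0.5, a TMR cluster is less reliable than a single unit.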

Page 16: Fault Tolerance

Exploiting Appln Semantics

Acceptance Test: Specify a range outside which the output is tagged as faulty (or at least suspicious)

No acceptance test is perfect:

» Sensitivity: Probability of catching an incorrect output

» Specificity: Probability that an output which is flagged as wrong is really wrong

– Specificity = 1 - False Positive Probability
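
A range-based acceptance test and the two quality measures, as worded on this slide, can be sketched as follows. The sample readings, the range 0 to 100, and the function names are made up for illustration.

```python
def acceptance_test(value, low, high):
    """Accept the output only if it lies inside the plausible range."""
    return low <= value <= high

def evaluate(samples, low, high):
    """samples: list of (output, is_actually_wrong) pairs.

    Returns (sensitivity, specificity) using the slide's wording:
    sensitivity = fraction of wrong outputs that get flagged,
    specificity = fraction of flagged outputs that are really wrong.
    """
    flagged = [wrong for value, wrong in samples if not acceptance_test(value, low, high)]
    wrong_total = sum(1 for _, wrong in samples if wrong)
    caught = sum(1 for wrong in flagged if wrong)
    sensitivity = caught / wrong_total if wrong_total else None
    specificity = caught / len(flagged) if flagged else None
    return sensitivity, specificity

# Hypothetical sensor outputs with ground truth attached.
samples = [(20, False), (150, True), (-5, True), (60, True),
           (101, False), (120, False), (50, False)]
sensitivity, specificity = evaluate(samples, 0, 100)
```

Here the wrong output 60 slips through because it lies inside the range, and the correct outputs 101 and 120 are flagged anyway. Widening the range lowers the first kind of miss at the cost of the second.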

Page 17: Fault Tolerance

Checkpointing

Store partial results in a safe place

When failure occurs, roll back to the latest checkpoint and restart

Issues:

» Checkpoint positioning

» Implementation

– Kernel level

– Application level

» Correctness: Can be a problem in distributed systems
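
A minimal application-level sketch of checkpoint and rollback is shown below. The file name, the checkpoint interval of 10 iterations, and the simulated failure are all assumptions made for the example.

```python
import os
import pickle
import tempfile

def save_checkpoint(path, state):
    """Write to a temporary file, then rename, so a crash mid-write cannot corrupt the checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path, default):
    """Return the latest checkpoint, or the initial state if none exists yet."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)

def run(path, n, fail_at=None):
    """Sum 0..n-1, checkpointing every 10 iterations; optionally simulate a failure."""
    state = load_checkpoint(path, {"i": 0, "total": 0})
    while state["i"] < n:
        if fail_at is not None and state["i"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["i"]
        state["i"] += 1
        if state["i"] % 10 == 0:
            save_checkpoint(path, state)
    return state["total"]

path = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
try:
    run(path, 100, fail_at=57)   # fails after the checkpoint taken at i = 50
except RuntimeError:
    pass
total = run(path, 100)           # rolls back to i = 50 and finishes the job
```

The work done between iteration 50 and the failure at 57 is lost and redone, which is the cost that checkpoint positioning tries to balance against checkpointing overhead.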

Page 18: Fault Tolerance

Terminology

Checkpointing Overhead: The part of the checkpointing activity that is not hidden from the application

Checkpointing Latency: Time from when a checkpoint starts being taken until it is stored in non-volatile storage.

Page 19: Fault Tolerance

Reducing Chkptg Overhead

Buffer checkpoint writes

Don’t checkpoint “dead” variables:

» Never used again by the program, or

» Next operation with respect to the variable is a write

» Problem is how to identify dead variables

Don’t checkpoint read-only stuff, like code

Page 20: Fault Tolerance

Reducing Chkptg Latency

Consider compressing the checkpoint. Usefulness of this approach depends on:

» Extent of the compression possible

» Work required to execute the compression algorithm

Page 21: Fault Tolerance

Optimization of Chkptg

Objective in general-purpose systems is usually to minimize the expected execution time

Objective in real-time systems is to maximize the probability of meeting task deadlines

» Need a mathematical model to determine this

» Generally, we place checkpoints approximately equidistant from each other and just determine the optimal number of them
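
One possible model for choosing the number of equidistant checkpoints is sketched below for the general-purpose objective (minimum expected execution time), not the real-time one. It assumes Poisson failures with a constant rate, instant failure detection, zero rollback cost, and a checkpoint of fixed cost after every segment. Under those assumptions the expected time to finish a segment of length s is (e^(rate·s) - 1) / rate. The parameter values are made up.

```python
import math

def expected_time(work, n, overhead, rate):
    """Expected completion time with n equal segments, each followed by a checkpoint."""
    segment = work / n + overhead
    return n * (math.exp(rate * segment) - 1) / rate

def best_checkpoint_count(work, overhead, rate, max_n=1000):
    """Brute-force search for the segment count with the smallest expected time."""
    return min(range(1, max_n + 1),
               key=lambda n: expected_time(work, n, overhead, rate))

# Hypothetical task: 100 time units of work, checkpoint cost 1, failure rate 0.01.
n_opt = best_checkpoint_count(100.0, 1.0, 0.01)
print(n_opt, expected_time(100.0, n_opt, 1.0, 0.01))
```

Too few checkpoints means long re-execution after a failure; too many means the overhead dominates. The search finds the balance point for this particular model.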

Page 22: Fault Tolerance

Distributed Checkpointing

Ordering of Events:

» Easy to do if there’s just one thread

» If there are multiple threads:

– Events in the same thread are trivial to order

– Event A in thread X is said to precede Event B in thread Y if there is some communication from X after event A that arrives at Y before event B

Given two events A and B in separate threads,

– A could precede B

– B could precede A

– They could be concurrent
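
Vector clocks are one standard way to decide which of the three cases holds. They are not mentioned on the slide and are used here only as an illustration; the `Thread` class and its method names are hypothetical.

```python
def happened_before(a, b):
    """Vector timestamp a precedes b if it is component-wise <= b and not equal to it."""
    return all(x <= y for x, y in zip(a, b)) and a != b

def relation(a, b):
    """Classify two event timestamps."""
    if happened_before(a, b):
        return "A precedes B"
    if happened_before(b, a):
        return "B precedes A"
    return "concurrent"

class Thread:
    def __init__(self, index, n):
        self.index = index
        self.clock = [0] * n

    def local_event(self):
        self.clock[self.index] += 1
        return tuple(self.clock)

    def send(self):
        """Sending is an event; its timestamp travels with the message."""
        return self.local_event()

    def receive(self, msg_clock):
        """Merge the sender's knowledge, then count the receive as a local event."""
        self.clock = [max(x, y) for x, y in zip(self.clock, msg_clock)]
        return self.local_event()

x, y = Thread(0, 2), Thread(1, 2)
a = x.local_event()      # event A in thread X
c = y.local_event()      # an unrelated event in thread Y
msg = x.send()           # communication from X after A
b = y.receive(msg)       # event B in thread Y, after the message arrives

print(relation(a, b))    # A precedes B
print(relation(a, c))    # concurrent
```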

Page 23: Fault Tolerance

Distributed Checkpointing

Domino Effect: An uncontrolled cascade of rollbacks can roll the entire system back to the starting state

To avoid the domino effect, we can coordinate the checkpointing

» Tightly synchronize the checkpoints in all processors

» Koo-Toueg algorithm

Page 24: Fault Tolerance

Checkptg with Clock Sync

Assume the clock skew is bounded by δ and the minimum message delivery time is d_min

Each processor:

» Takes a local checkpoint at some specified time, T

» Following its checkpoint, it does not send out any messages until it is sure that the message will be received only after the recipient has itself checkpointed; i.e., until its local clock reads at least T + δ - d_min
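
The length of the silent period can be worked out from the two bounds. The sketch below uses hypothetical parameter names (`checkpoint_time`, `skew_bound`, `min_delay`) and assumes the worst case, in which the recipient's clock lags the sender's by the full skew bound and the message takes only the minimum delivery time.

```python
def earliest_send_time(checkpoint_time, skew_bound, min_delay):
    """Local clock reading from which sending is safe again after the checkpoint."""
    return checkpoint_time + max(0.0, skew_bound - min_delay)

def worst_case_arrival(send_time, skew_bound, min_delay):
    """Arrival time on the recipient's clock in the worst case described above."""
    return send_time - skew_bound + min_delay

# Hypothetical numbers: checkpoint at local time 100, skew at most 5, delivery takes at least 2.
t_send = earliest_send_time(100.0, 5.0, 2.0)
print(t_send)                                # 103.0
print(worst_case_arrival(t_send, 5.0, 2.0))  # 100.0, i.e. not before the recipient's checkpoint
```

If the minimum delivery time is at least as large as the skew bound, no silent period is needed at all under these assumptions.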

Page 25: Fault Tolerance

Koo-Toueg Algorithm

A processor that wants to checkpoint,

» Does so, locally

» Tells all processors which have communicated with it the last message (timestamp or message number) received from them

If these processors don’t have a checkpoint recording the transmission of this message, they take a checkpoint

This can result in a surge of checkpointing activity visible at the non-volatile storage

Page 26: Fault Tolerance

Software Fault Tolerance

It is practically impossible to produce a large piece of software that is bug-free

» E.g., Even the space shuttle flew with several potentially disastrous bugs despite extensive testing

Single-version Fault Tolerance

Multi-version Fault Tolerance

Page 27: Fault Tolerance

Fault Models

Reasonably trustworthy hardware fault models exist

Many software fault models exist in the literature, but not one can be fully trusted to represent reality

Page 28: Fault Tolerance

Single-Version FT

Wrappers: Code “wrapped around” the software that checks for consistency and correctness

Software Rejuvenation: Reboot the machine reasonably frequently

Use data diversity: Sometimes an algorithm may fail on some data but not if these data are subjected to minor perturbations

Page 29: Fault Tolerance

Multi-version FT

Very, very expensive

Two basic approaches

» N-version programming

» Recovery Blocks

Page 30: Fault Tolerance

N-Version Programming (NVP)

Theoretically appealing, but hard to make effective

Basic Idea:

» Have N independent teams of programmers develop applications independently

» Run them in parallel and vote on them

» If the versions fail truly independently, the voted result will be highly reliable

Page 31: Fault Tolerance

Failure Diversity

Effectiveness hinges on whether faults in the versions are statistically independent of one another

Forces against truly independent failures:

» Common programming “culture”

» Common specifications

» Common algorithms

» Common software/hardware platforms

Page 32: Fault Tolerance

Failure Diversity

Incidental Diversity

» Prohibit interaction between teams of programmers working on different versions and hope they produce independently failing versions

Forced Diversity

» Diverse specifications

» Diverse programming languages

» Diverse development tools and compilers

» Cognitively diverse teams: Probably not realistic

Page 33: Fault Tolerance

Experimental Results

Experiments suggest that correlated failures do occur at a much higher rate than would be the case if failures in the versions were stochastically independent

Example: Study conducted by Brilliant, Knight, and Leveson at UVa and UCI

» 27 students writing code for anti-missile application

» 93 correlated failures observed: if true independence had existed, we’d have expected about 5

Page 34: Fault Tolerance

Recovery Blocks

Also uses multiple versions

Only one version is active at any time

If the output of this version fails an acceptance test, another version is activated
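
The control flow of a recovery block can be sketched as below. The square-root versions and the acceptance test are hypothetical, and the sketch leaves out the state restoration that a real recovery block performs before activating the next version.

```python
import math

def recovery_block(versions, acceptance_test, x):
    """Try the versions in order and return the first output that passes the acceptance test."""
    for version in versions:
        try:
            result = version(x)
        except Exception:
            continue  # a crash counts as a failed version
        if acceptance_test(x, result):
            return result
    raise RuntimeError("all versions failed the acceptance test")

# Hypothetical primary with a bug for inputs above 100, and a simpler alternate.
primary = lambda x: x ** 0.5 if x <= 100 else -1.0
alternate = math.sqrt

def is_square_root(x, result):
    """Acceptance test: the result must be non-negative and square back to x."""
    return result >= 0 and abs(result * result - x) <= 1e-9 * max(1.0, x)

print(recovery_block([primary, alternate], is_square_root, 16.0))   # 4.0
print(recovery_block([primary, alternate], is_square_root, 144.0))  # 12.0
```

Unlike N-version programming, the alternate runs only when the primary's output is rejected, so the scheme is only as good as its acceptance test.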

Page 35: Fault Tolerance

Byzantine Failures

The worst failure mode known

Original Motivating Problem (~1978):

» A sensor needs to disseminate its output to a set of processors. How can we ensure that:

– If the sensor is functioning correctly: All functional processors obtain the correct sensor reading

– If the sensor is malfunctioning: All functional processors agree on the sensor reading

Page 36: Fault Tolerance

Byzantine Generals Problem

Some divisions of the Byzantine Army are besieging a city. They must all coordinate their attacks (or coordinate their retreat) to avoid disaster

The overall commander communicates to his divisional commanders by means of a confidential messenger. This messenger is trustworthy and doesn’t alter the message; the message can only be read by its intended recipient

Page 37: Fault Tolerance

Byz Generals Problem (contd.)

If the C-in-C is loyal

» He sends consistent orders to the subordinate generals

» All loyal subordinates must obey his order

If the C-in-C is a traitor

» All loyal subordinate generals must agree on some default action (e.g., running away)

Page 38: Fault Tolerance

Impossibility with 3 Generals

Suppose there are 2 divisions, A and B, so the 3 generals are the commander-in-chief, Com(A), and Com(B).

Commander-in-chief is a traitor and sends message to Com(A) saying “Attack!” and to Com(B) saying “Retreat!”

Com(A) sends a messenger to Com(B), saying “The boss told me to attack!”

Com(B) receives:

» Direct order from the C-in-C saying “Retreat”

» Message from Com(A) saying “I was ordered to attack”

Page 39: Fault Tolerance

Byz. Generals Problem (contd.)

Com(B)’s dilemma:

» Either the C-in-C or Com(A) is a traitor: it is impossible to know which

» Further communication with Com(A) won’t add any useful information

» Not possible to ensure that if Com(A) and Com(B) are both loyal, they both agree on the same action

The problem cannot be solved if there are 3 generals, one of whom may be a traitor

Page 40: Fault Tolerance

Byz. Generals Problem (contd.)

Central Result: To reach agreement with a total of N participants with up to m traitors, we must have N > 3m

Page 41: Fault Tolerance

Byzantine Generals Algorithm

Byz(0) // no-failure algorithm

» C-in-C sends his order to every subordinate

» The subordinate uses the order he receives, or the default if he receives no order

Page 42: Fault Tolerance

Byz(m) // For up to m traitors (failures)

» (1) C-in-C sends order to every subordinate, G_i: let this be received as v_i

» (2) G_i acts as the C-in-C in a Byz(m-1) algorithm to circulate this order to his colleagues

» (3) For each (i,j) such that i ≠ j, let w_(i,j) be the order that G_i got from G_j in step 2 or the default if no message was received. G_i calculates the majority of the orders {v_i, w_(i,j) for all j ≠ i} and uses it as the correct order to follow
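
The recursion in Byz(m) can be simulated in a few lines. In this sketch the generals are numbered, the default order is "retreat", and a traitor's behaviour (telling even-numbered recipients "attack" and odd-numbered ones "retreat") is an arbitrary assumption made for the example. Every message is assumed to arrive, so the "no message" case never arises.

```python
from collections import Counter

DEFAULT = "retreat"

def majority(values):
    """Strict majority of the orders, or the default when there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else DEFAULT

def send(sender, receiver, value, traitors):
    """A loyal sender relays the value; a traitor lies (hypothetical lying pattern)."""
    if sender in traitors:
        return "attack" if receiver % 2 == 0 else "retreat"
    return value

def byz(m, commander, subordinates, value, traitors):
    """Return {subordinate: order it settles on} after running Byz(m)."""
    # Step 1: the commander sends its order; v[i] is what G_i receives.
    v = {i: send(commander, i, value, traitors) for i in subordinates}
    if m == 0:
        return v
    # Step 2: each G_j acts as commander in Byz(m-1) to circulate v[j].
    w = {j: byz(m - 1, j, [g for g in subordinates if g != j], v[j], traitors)
         for j in subordinates}
    # Step 3: G_i takes the majority of v[i] and what it obtained from every other G_j.
    return {i: majority([v[i]] + [w[j][i] for j in subordinates if j != i])
            for i in subordinates}

# N = 4, m = 1: loyal commander 0, subordinate 3 is a traitor.
case_loyal = byz(1, 0, [1, 2, 3], "attack", traitors={3})
# N = 4, m = 1: the commander itself is the traitor.
case_traitor = byz(1, 0, [1, 2, 3], "attack", traitors={0})
print(case_loyal[1], case_loyal[2])   # attack attack
print(set(case_traitor.values()))     # {'retreat'}
```

In the first case the loyal subordinates obey the loyal commander; in the second they all agree on the same order even though the commander sent conflicting ones. The dictionary also contains an entry for the traitor, which callers should ignore.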