
Copyright 2004 Koren & Krishna, ECE655/Krishna, Part 3

C. M. Krishna, Fall 2006

UNIVERSITY OF MASSACHUSETTS
Dept. of Electrical & Computer Engineering

Fault Tolerant Computing
ECE 655

Part 3
Complex Structures


Non Series/Parallel Systems

♦Each path represents a configuration allowing the system to operate successfully, e.g., ADF

♦The reliability can be calculated by expanding about a single module i:

♦$R_{system} = R_i \cdot \mathrm{Prob}\{\text{System works} \mid i \text{ is fault-free}\} + (1-R_i) \cdot \mathrm{Prob}\{\text{System works} \mid i \text{ is faulty}\}$

♦Draw two new diagrams: in (a) module i is operational; in (b) module i is faulty

♦Module i is selected so that the two new diagrams are closer to simple series/parallel structures


Expanding about C

♦The process of expanding can be repeated until the resulting diagrams are of the series/parallel type

♦Figure (a) needs further expansion about E

♦Figure (a) should not be viewed as a parallel connection of A and B, connected serially to D and E in parallel. Such a diagram would have the path BCDF, which is not a valid path

[Figure: diagram (a), C operational; diagram (b), C faulty]


Expanding about C and E

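♦$R_{system} = R_C \cdot \mathrm{Prob}\{\text{System works} \mid C \text{ is operational}\} + (1-R_C)\, R_F\, [1-(1-R_A R_D)(1-R_B R_E)]$

♦Expanding about E yields

♦$\mathrm{Prob}\{\text{System works} \mid C \text{ is operational}\} = R_E R_F\, [1-(1-R_A)(1-R_B)] + (1-R_E)\, R_A R_D R_F$

♦Substituting results in

♦$R_{system} = R_C\, [R_E R_F (R_A + R_B - R_A R_B) + (1-R_E)\, R_A R_D R_F] + (1-R_C)\, [R_F (R_A R_D + R_B R_E - R_A R_D R_B R_E)]$

♦Example: $R_A = R_B = R_C = R_D = R_E = R_F = R$

♦$R_{system} = R^3 (R^3 - 3R^2 + R + 2)$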
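The expansion can be cross-checked numerically. The following is a minimal sketch, assuming the system's success paths are exactly ADF, BEF and ACEF and that modules fail independently; the function names are illustrative, not part of the course material. It sums the probability of every module state in which some path is intact and compares that with the closed form obtained by expanding about C and E.

```python
from itertools import product

# Assumed success paths of the example diagram: ADF, BEF and ACEF.
PATHS = [set("ADF"), set("BEF"), set("ACEF")]
MODULES = "ABCDEF"


def exact_reliability(rel):
    """Sum the probability of every module state in which some path is fully working."""
    total = 0.0
    for state in product([True, False], repeat=len(MODULES)):
        up = {m for m, ok in zip(MODULES, state) if ok}
        if any(path <= up for path in PATHS):
            prob = 1.0
            for m, ok in zip(MODULES, state):
                prob *= rel[m] if ok else 1.0 - rel[m]
            total += prob
    return total


def expanded_about_c_and_e(rel):
    """Closed form from expanding about C, then about E in the C-operational case."""
    a, b, c, d, e, f = (rel[m] for m in MODULES)
    c_up = e * f * (a + b - a * b) + (1 - e) * a * d * f
    c_down = f * (a * d + b * e - a * d * b * e)
    return c * c_up + (1 - c) * c_down


r = 0.9
rel = {m: r for m in MODULES}
print(exact_reliability(rel), expanded_about_c_and_e(rel),
      r**3 * (r**3 - 3 * r**2 + r + 2))
```

The three printed values should agree up to floating-point rounding. The enumeration visits all 64 states, so it is only practical for small diagrams.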

[Figure: diagrams (a) and (b)]


Upper Bound on Reliability

♦If the structure is too complicated, derive upper and lower bounds on $R_{system}$

♦An upper bound: $R_{system} \le 1 - \prod_i (1 - R_{path_i})$

∗ $R_{path_i}$ - reliability of the modules in series along path i

∗ Assuming all paths are in parallel

♦Example - the paths are ADF, BEF and ACEF

♦$R_{system} \le 1 - (1 - R_A R_D R_F)(1 - R_B R_E R_F)(1 - R_A R_C R_E R_F)$

♦If $R_A = R_B = R_C = R_D = R_E = R_F = R$ then

♦$R_{system} \le R^3 (R^7 - 2R^4 - R^3 + R + 2)$

♦The upper bound can be used to derive the exact expression: perform the multiplication and replace every occurrence of $R_i^j$ by $R_i$

∗ On each path every module is used only once and its reliability should be raised only to its first power
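Both the bound and the "first power only" rule can be sketched in a few lines. This is an illustrative sketch, assuming independent module failures and the paths ADF, BEF and ACEF; the function names are hypothetical. Taking the set union of the modules in each product term is one way to keep a shared module's reliability at its first power.

```python
from itertools import combinations
from math import prod

# Assumed success paths of the example diagram.
PATHS = [frozenset("ADF"), frozenset("BEF"), frozenset("ACEF")]


def upper_bound(rel, paths=PATHS):
    """1 - prod(1 - R_path_i): the paths treated as independent parallel branches."""
    return 1.0 - prod(1.0 - prod(rel[m] for m in p) for p in paths)


def exact_from_paths(rel, paths=PATHS):
    """Multiply out the bound, keeping each module's reliability at its first power.

    Each term of the expanded product corresponds to a subset of paths; the
    set union lists every module in that subset once, however many of the
    chosen paths share it.
    """
    total = 0.0
    for k in range(1, len(paths) + 1):
        sign = 1.0 if k % 2 == 1 else -1.0
        for combo in combinations(paths, k):
            modules = frozenset().union(*combo)
            total += sign * prod(rel[m] for m in modules)
    return total


rel = {m: 0.9 for m in "ABCDEF"}
print(upper_bound(rel), exact_from_paths(rel))
```

With all reliabilities equal to R, `exact_from_paths` reduces to 2R^3 + R^4 - 3R^5 + R^6, which is the same polynomial as R^3 (R^3 - 3R^2 + R + 2).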


Lower Bound on Reliability

♦A lower bound is calculated based on the minimal cut sets of the system diagram

♦A minimal cut set: a minimal list of modules such that the removal (due to a fault) of all of them will cause a working system to fail

♦Minimal cut sets: F, AB, AE, DE and BCD

♦The lower bound is

♦$R_{system} \ge \prod_i (1 - Q_{cut_i})$

∗ $Q_{cut_i}$ - probability that minimal cut i is faulty (i.e., all its modules are faulty)

♦Example - $R_A = R_B = R_C = R_D = R_E = R_F = R$

♦$R_{system} \ge R^5 (24 - 60R + 62R^2 - 33R^3 + 9R^4 - R^5)$
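As a rough check of the cut-set bound, the sketch below verifies that each listed set is a minimal cut of the paths ADF, BEF and ACEF, then evaluates the product. It assumes independent module failures; the helper names are illustrative.

```python
from math import prod

# Assumed success paths and the minimal cut sets listed for the example.
PATHS = [set("ADF"), set("BEF"), set("ACEF")]
CUTS = [set("F"), set("AB"), set("AE"), set("DE"), set("BCD")]


def is_minimal_cut(cut, paths=PATHS):
    """A cut intersects every path; it is minimal if removing any one module
    from it leaves at least one path untouched."""
    hits_all = all(cut & p for p in paths)
    minimal = all(not all((cut - {m}) & p for p in paths) for m in cut)
    return hits_all and minimal


def lower_bound(rel, cuts=CUTS):
    """prod(1 - Q_cut_i), where Q_cut_i is the probability that every module
    in cut i is faulty."""
    return prod(1.0 - prod(1.0 - rel[m] for m in cut) for cut in cuts)


rel = {m: 0.9 for m in "ABCDEF"}
print(all(is_minimal_cut(c) for c in CUTS), lower_bound(rel))
```

For equal reliabilities the product is R [1 - (1-R)^2]^3 [1 - (1-R)^3], which multiplies out to the polynomial in the example.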


Variations on NMR

♦Unit-level Modular Redundancy

♦Voters are no longer as critical as in NMR; a single faulty voter will be no worse than a single faulty unit

♦The level at which the replication and voting are applied can be further lowered at the expense of additional voters increasing the size and delay of the system


Triplicated Processor/Memory System

♦All communications (in either direction) between the triplicated processors and triplicated memories go through majority voting

♦This organization has a higher reliability than a single majority voting of a triplicated processor/memory structure


Active/Dynamic Redundancy

♦Previous variations of N-modular redundancy require considerable hardware to instantaneously mask errors

♦Temporary erroneous results may be acceptable if the system can detect such errors and reconfigure itself

∗ Replacing the faulty module by a fault-free spare

♦Example - an active (or dynamic) redundancy scheme


Reliability - Active Spares

♦If all spare modules are active (powered), they have the same failure rate - similar to a basic parallel system

♦The system reliability is thus

♦$R_{dynamic}(t) = R_{det}(t)\, [1 - (1 - R(t))^N]$

♦$R(t)$ - reliability of a module

♦$R_{det}(t)$ - reliability of the Detection and Reconfiguration unit
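The active-spares formula is easy to evaluate directly. The sketch below assumes that N counts all powered modules and, purely for illustration, that each module follows an exponential law R(t) = exp(-lambda t); neither the exponential model nor the function names come from the slides.

```python
from math import exp


def module_reliability(t, failure_rate):
    """Assumed exponential model R(t) = exp(-lambda * t); an illustration only."""
    return exp(-failure_rate * t)


def r_dynamic_active(r, r_det, n):
    """R_det * [1 - (1 - R)^N] for N powered modules sharing one
    detection and reconfiguration unit."""
    return r_det * (1.0 - (1.0 - r) ** n)


r = module_reliability(t=1000.0, failure_rate=1e-4)
for n in (1, 2, 3):
    print(n, r_dynamic_active(r, r_det=0.999, n=n))
```

With n = 1 the expression collapses to R_det(t) R(t), so a single module behind an imperfect detection unit is less reliable than the bare module.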


Reliability - Standby Spares

♦If spare modules are not expected to fail (e.g., are not powered in order to conserve energy), the reliability of a system with one active module and one standby spare is

♦$R_{dynamic}(t) = R(t) + C\, R(t)\, (1 - R(t))$

♦where C is the coverage factor: the probability that the faulty active module will be correctly diagnosed and disconnected and the good spare will be successfully connected

♦Generalizing to the case of N spares -

♦$R_{dynamic}(t) = R(t) \sum_{k=0}^{N} C^k (1 - R(t))^k$
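The standby-spares sum can be sketched as below; the function and parameter names are illustrative. The sum is a geometric series in C (1 - R(t)), so with perfect coverage (C = 1) it collapses to 1 - (1 - R(t))^(N+1).

```python
def r_dynamic_standby(r, coverage, n_spares):
    """R * sum_{k=0}^{N} C^k (1 - R)^k for one active module and N standby spares."""
    return r * sum((coverage * (1.0 - r)) ** k for k in range(n_spares + 1))


r = 0.9
for c in (1.0, 0.95, 0.8):
    print(c, [r_dynamic_standby(r, c, n) for n in range(4)])
```

Printing a few rows like this makes the effect of coverage visible: when C is well below 1, adding further spares changes the result very little.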


Hybrid Redundancy

♦An NMR system masks permanent and intermittent failures, but its reliability drops below that of a single module for very long mission times

♦Hybrid redundancy overcomes this by adding spare modules to replace active modules once they become faulty

♦A hybrid system consists of a core of N processors (NMR), and M spares


Hybrid Redundancy - Reliability

♦The reliability of a hybrid system with a TMR core and M spares is

♦$R_{hybrid}(t) = R_{voter}(t)\, R_{reconf}(t)\, \left(1 - m\, R(t)\, [1 - R(t)]^{m-1} - [1 - R(t)]^m\right)$

∗ m=M+3 is the total number of modules

∗ $R_{voter}(t)$ and $R_{reconf}(t)$ are the reliabilities of the voter and of the comparison and reconfiguration circuitry

∗ Assuming that any fault in the voter or in the comparison and reconfiguration circuit will cause a system fault

♦In practice, not all faults in these circuits will be fatal: the reliability will be higher

♦More accurate $R_{hybrid}(t)$: detailed analysis of the voter and comparison & reconfiguration circuits and the ways they can fail
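The bracketed factor is the probability that at least two of the m modules are fault-free: one minus the probability that exactly one works, minus the probability that none works. A minimal sketch with illustrative function names, assuming independent and identical modules, checks this against a direct binomial sum:

```python
from math import comb


def r_hybrid(r, m_spares, r_voter=1.0, r_reconf=1.0):
    """TMR core plus M spares: the system survives while at least two of the
    m = M + 3 modules are fault-free and the voter and reconfiguration
    circuitry have not failed."""
    m = m_spares + 3
    return r_voter * r_reconf * (1.0 - m * r * (1.0 - r) ** (m - 1) - (1.0 - r) ** m)


def at_least_two_good(r, m):
    """Binomial probability that two or more of m independent modules work."""
    return sum(comb(m, k) * r**k * (1.0 - r) ** (m - k) for k in range(2, m + 1))


r = 0.9
for spares in range(4):
    print(spares, r_hybrid(r, spares), at_least_two_good(r, spares + 3))
```

With no spares (M = 0) and ideal support circuitry the expression reduces to the familiar TMR value 3R^2 - 2R^3.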


Sift-Out Modular Redundancy

♦Like NMR, all N modules are active - the voter works on the outputs of all still operational modules

♦Besides the voter, a comparison and switching circuit - compares output of each module to outputs of other still operational modules

♦A module whose output disagrees with other outputs is switched out

♦Simpler than hybrid redundancy

♦Should not be too aggressive in the purging (sifting-out) process - the vast majority of failures are transient and will go away

∗ Purge a module only if it produces incorrect outputs over a sustained period of time
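One way to avoid over-aggressive purging is a per-module strike counter that resets whenever the module agrees with the vote. The sketch below is hypothetical: the class name, the `purge_after` threshold and the plurality vote (ties go to the value seen first) are assumptions for illustration, not a circuit described in the slides.

```python
from collections import Counter


class SiftOutSketch:
    """Hypothetical sift-out controller with a sustained-disagreement threshold."""

    def __init__(self, n_modules, purge_after=3):
        self.active = set(range(n_modules))
        self.strikes = {i: 0 for i in range(n_modules)}
        self.purge_after = purge_after  # consecutive disagreements before purging

    def step(self, outputs):
        """outputs maps module id -> output for this cycle; returns the voted value."""
        live = {i: outputs[i] for i in self.active}
        voted, _ = Counter(live.values()).most_common(1)[0]
        for i, value in live.items():
            if value == voted:
                self.strikes[i] = 0  # a transient fault is forgiven
            else:
                self.strikes[i] += 1
                if self.strikes[i] >= self.purge_after:
                    self.active.discard(i)  # sustained disagreement: switch out
        return voted


s = SiftOutSketch(5)
s.step({0: 1, 1: 1, 2: 1, 3: 0, 4: 0})  # module 4 glitches once
s.step({0: 1, 1: 1, 2: 1, 3: 0, 4: 1})
s.step({0: 1, 1: 1, 2: 1, 3: 0, 4: 1})  # module 3 is wrong a third time
print(sorted(s.active))
```

In this run module 4 recovers after a single disagreement and stays in the system, while module 3 disagrees on three consecutive cycles and is switched out.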


Triplex-Duplex Architecture

♦This approach ties together processors to form duplexes

♦A triplex is then formed out of these duplexes

♦When the processors in a duplex disagree, both of them are switched out of the system

♦The triplex-duplex arrangement allows a simpler identification of faulty processors

♦Further, the triplex can continue to function even if only one duplex is left functional, since the duplex arrangement allows us to detect faults
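The switching rule can be sketched as follows. This is a hypothetical illustration: the slides do not say how the outputs of the surviving duplexes are combined, so the plurality vote here is an assumption, as are the function and parameter names.

```python
from collections import Counter


def triplex_duplex_step(duplexes, outputs):
    """duplexes: list of (p, q) processor-id pairs still in the system.
    outputs: dict mapping processor id -> output for this cycle.
    Returns (surviving_duplexes, voted_value); voted_value is None if no duplex is left."""
    # A duplex whose two processors disagree is switched out as a unit.
    surviving = [(p, q) for p, q in duplexes if outputs[p] == outputs[q]]
    if not surviving:
        return surviving, None
    # Assumed combination rule: plurality vote over one output per surviving duplex.
    values = [outputs[p] for p, _ in surviving]
    voted, _ = Counter(values).most_common(1)[0]
    return surviving, voted


duplexes = [(0, 1), (2, 3), (4, 5)]
outputs = {0: 7, 1: 7, 2: 7, 3: 9, 4: 7, 5: 7}  # processor 3 is faulty
print(triplex_duplex_step(duplexes, outputs))
```

The comparison inside a duplex does the fault detection, which is why a single remaining duplex can still deliver an output that is known to be consistent.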
