Copyright 2004 Koren & Krishna ECE655/Krishna Part.3 .1
C. M. Krishna, Fall 2006
UNIVERSITY OF MASSACHUSETTS, Dept. of Electrical & Computer Engineering
Fault Tolerant Computing, ECE 655
Part 3: Complex Structures
Non Series/Parallel Systems
♦Each path represents a configuration allowing the system to operate successfully, e.g., ADF
♦The reliability can be calculated by expanding about a single module i:
♦$R_{system} = R_i \cdot \mathrm{Prob}\{\text{System works} \mid i \text{ is fault-free}\} + (1-R_i) \cdot \mathrm{Prob}\{\text{System works} \mid i \text{ is faulty}\}$
♦Draw two new diagrams: in (a) module i is operational; in (b) module i is faulty
♦Module i is selected so that the two new diagrams are closer to simple series/parallel structures
Expanding about C
♦The process of expanding can be repeated until the resulting diagrams are of the series/parallel type
♦Figure (a) needs further expansion about E
♦Figure (a) should not be viewed as a parallel connection of A and B, connected serially to D and E in parallel. Such a diagram would have the path BCDF, which is not a valid path
[Figure: diagrams (a), C operational, and (b), C faulty]
Expanding about C and E
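♦$R_{system} = R_C \cdot \mathrm{Prob}\{\text{System works} \mid C \text{ is operational}\} + (1-R_C) R_F [1-(1-R_A R_D)(1-R_B R_E)]$
♦Expanding about E yields
♦$\mathrm{Prob}\{\text{System works} \mid C \text{ is operational}\} = R_E R_F [1-(1-R_A)(1-R_B)] + (1-R_E) R_A R_D R_F$
♦Substituting results in
♦$R_{system} = R_C [R_E R_F (R_A+R_B-R_A R_B) + (1-R_E) R_A R_D R_F] + (1-R_C) [R_F (R_A R_D + R_B R_E - R_A R_D R_B R_E)]$
♦Example: $R_A=R_B=R_C=R_D=R_E=R_F=R$
$R_{system} = R^3 (R^3 - 3R^2 + R + 2)$
[Figure: diagrams (a), C operational, and (b), C faulty]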
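The expansion about C and E can be cross-checked numerically. The sketch below is not part of the original slides: it assumes independent module failures, takes the success paths ADF, BEF and ACEF as the full description of the diagram, and uses illustrative reliability values. It enumerates all 64 module states and compares the result with the expanded formula.

```python
from itertools import product

MODULES = "ABCDEF"
# Success paths of the example diagram, as listed in the slides.
PATHS = ["ADF", "BEF", "ACEF"]


def works(up):
    """True if every module of at least one success path is operational."""
    return any(all(m in up for m in path) for path in PATHS)


def exact_reliability(rel):
    """Sum the probabilities of all 2^6 module states in which the system works."""
    total = 0.0
    for state in product([True, False], repeat=len(MODULES)):
        up = {m for m, ok in zip(MODULES, state) if ok}
        prob = 1.0
        for m, ok in zip(MODULES, state):
            prob *= rel[m] if ok else 1.0 - rel[m]
        if works(up):
            total += prob
    return total


def expansion_formula(rel):
    """Formula obtained by expanding about C and then about E."""
    a, b, c, d, e, f = (rel[m] for m in MODULES)
    given_c_up = e * f * (a + b - a * b) + (1 - e) * a * d * f
    given_c_down = f * (a * d + b * e - a * d * b * e)
    return c * given_c_up + (1 - c) * given_c_down


# Illustrative (made-up) module reliabilities.
rel = {"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.95, "E": 0.85, "F": 0.99}
print(exact_reliability(rel), expansion_formula(rel))

# Equal-reliability case: compare with R^3 (R^3 - 3R^2 + R + 2).
r = 0.9
equal = dict.fromkeys(MODULES, r)
print(exact_reliability(equal), r**3 * (r**3 - 3 * r**2 + r + 2))
```

Brute-force enumeration is only practical for small diagrams, which is why the expansion technique matters for larger ones.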
Upper Bound on Reliability
♦If structure is too complicated - derive upper and lower bounds on $R_{system}$
♦An upper bound - $R_{system} \le 1 - \prod_i (1 - R_{path_i})$
∗ $R_{path_i}$ - reliability of modules in series along path i
∗ Assuming all paths are in parallel
♦Example - the paths are ADF, BEF and ACEF
♦$R_{system} \le 1 - (1 - R_A R_D R_F)(1 - R_B R_E R_F)(1 - R_A R_C R_E R_F)$
♦If $R_A=R_B=R_C=R_D=R_E=R_F=R$ then
♦$R_{system} \le R^3 (R^7 - 2R^4 - R^3 + R + 2)$
♦Upper bound can be used to derive the exact expression: perform multiplication and replace every occurrence of $R_i^j$ by $R_i$
∗ On each path every module is used only once and its reliability should be raised only to its first power
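Both the path-based upper bound and the rule of reducing powers to the first power can be sketched in a few lines. This is an illustration under the assumption of independent module failures, not code from the course. Multiplying out the bound and keeping each module only once in every product is the same as inclusion-exclusion over unions of paths, which is how `exact_from_paths` (a name chosen here) is written.

```python
from itertools import combinations
from math import prod

# Success paths of the example diagram, as listed in the slides.
PATHS = ["ADF", "BEF", "ACEF"]


def upper_bound(rel):
    """1 - prod(1 - R_path_i): treats the paths as independent parallel branches."""
    return 1.0 - prod(1.0 - prod(rel[m] for m in path) for path in PATHS)


def exact_from_paths(rel):
    """Expand the bound, reducing every power R_i^j to R_i.

    After the reduction, a product of k path terms contains each module in
    the union of those paths exactly once, with sign (-1)^(k+1).
    """
    total = 0.0
    for k in range(1, len(PATHS) + 1):
        for group in combinations(PATHS, k):
            union = set().union(*group)
            total += (-1) ** (k + 1) * prod(rel[m] for m in union)
    return total


r = 0.9
equal = dict.fromkeys("ABCDEF", r)
print("upper bound:", upper_bound(equal))
print("exact      :", exact_from_paths(equal))
```

For equal reliabilities the two functions reproduce the polynomials R^3 (R^7 - 2R^4 - R^3 + R + 2) and R^3 (R^3 - 3R^2 + R + 2) respectively.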
Lower Bound on Reliability
♦A lower bound is calculated based on minimal cut sets of the system diagram
♦A minimal cut set: a minimal list of modules such that the removal (due to a fault) of all modules will cause a working system to fail
♦Minimal cut sets: F, AB, AE, DE and BCD
♦The lower bound is
♦$R_{system} \ge \prod_i (1 - Q_{cut_i})$
∗ $Q_{cut_i}$ - probability that the minimal cut i is faulty (i.e., all its modules are faulty)
♦Example - $R_A=R_B=R_C=R_D=R_E=R_F=R$
♦$R_{system} \ge R^5 (24 - 60R + 62R^2 - 33R^3 + 9R^4 - R^5)$
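The cut-set bound can be evaluated the same way. The sketch below assumes independent module failures and equal module reliability R. It computes the lower bound from the minimal cut sets listed on this slide and compares it with the exact and upper-bound polynomials for the same example.

```python
from math import prod

# Minimal cut sets of the example diagram, as listed in the slides.
CUT_SETS = ["F", "AB", "AE", "DE", "BCD"]


def lower_bound(rel):
    """prod(1 - Q_cut_i), where Q_cut_i is the probability that all modules of cut i are faulty."""
    return prod(1.0 - prod(1.0 - rel[m] for m in cut) for cut in CUT_SETS)


def exact_equal(r):
    """Exact reliability of the example for equal module reliability R."""
    return r**3 * (r**3 - 3 * r**2 + r + 2)


def upper_equal(r):
    """Path-based upper bound of the example for equal module reliability R."""
    return r**3 * (r**7 - 2 * r**4 - r**3 + r + 2)


for r in (0.5, 0.9, 0.99):
    equal = dict.fromkeys("ABCDEF", r)
    print(f"R={r}: {lower_bound(equal):.6f} <= {exact_equal(r):.6f} <= {upper_equal(r):.6f}")
```

The printed values show how tightly the two bounds bracket the exact reliability at each R.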
Variations on NMR
♦Unit-level Modular Redundancy
♦Voters are no longer as critical as in NMR; a single faulty voter will be no worse than a single faulty unit
♦The level at which the replication and voting are applied can be further lowered at the expense of additional voters increasing the size and delay of the system
Triplicated Processor/Memory System
♦All communications (in either direction) between the triplicated processors and triplicated memories go through majority voting
♦This organization has a higher reliability than a triplicated processor/memory structure with a single majority vote
Active/Dynamic Redundancy
♦Previous variations of N-modular redundancy - considerable hardware to instantaneously mask errors
♦Temporary erroneous results may be acceptable if system can detect such errors and reconfigure itself
∗ Replacing the faulty module by a fault-free spare
♦Example - an active (or dynamic) redundancy scheme
Reliability - Active Spares
♦If all spare modules are active (powered) they have the same failure rate - similar to a basic parallel system
♦The system reliability is thus
♦$R_{dynamic}(t) = R_{det}(t) [1 - (1 - R(t))^N]$
♦R(t) - reliability of module
♦$R_{det}(t)$ - reliability of Detection and Reconfiguration unit
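As a quick illustration of the active-spares formula, the function below evaluates it at a fixed time instant. The function name and the numeric values are chosen here for illustration only. N is taken as the total number of powered modules, as in a basic parallel system.

```python
def active_spares_reliability(r_module, r_det, n):
    """R_det * [1 - (1 - R)^N] for N powered modules at one time instant."""
    return r_det * (1.0 - (1.0 - r_module) ** n)


# Illustrative values: module reliability 0.9, detection/reconfiguration unit 0.99.
for n in (1, 2, 3, 4):
    print(n, active_spares_reliability(0.9, 0.99, n))
```

The output makes the ceiling visible: however many modules are added, the result cannot exceed the reliability of the detection and reconfiguration unit.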
Reliability - Standby Spares
♦If spare modules are not expected to fail (e.g., are not powered in order to conserve energy), the reliability of a system with one active module and one standby spare is
♦$R_{dynamic}(t) = R(t) + C R(t) (1 - R(t))$
♦where C is the coverage factor: probability that the faulty active module will be correctly diagnosed and disconnected and the good spare will be successfully connected
♦Generalizing to the case of N spares -
♦$R_{dynamic}(t) = R(t) \sum_{k=0}^{N} C^k (1 - R(t))^k$
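The standby-spares sum is easy to evaluate directly. The sketch below uses a function name and numeric values chosen here for illustration. It implements the N-spare expression and confirms that N = 1 reduces to the single-spare formula.

```python
def standby_reliability(r, coverage, n_spares):
    """R * sum_{k=0}^{N} C^k (1 - R)^k for one active module and N standby spares."""
    return r * sum((coverage * (1.0 - r)) ** k for k in range(n_spares + 1))


r, c = 0.9, 0.95  # illustrative module reliability and coverage factor
print(standby_reliability(r, c, 1), r + c * r * (1 - r))  # the two values agree
for n in (0, 1, 2, 5):
    print(n, standby_reliability(r, c, n))
```

With imperfect coverage (C < 1) the gain from each additional spare shrinks quickly, since every term is multiplied by another factor of C (1 - R).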
Hybrid Redundancy
♦An NMR system masks permanent and intermittent failures but its reliability drops below that of a single module for very long mission times
♦Hybrid redundancy overcomes this by adding spare modules to replace active modules once they become faulty
♦A hybrid system consists of a core of N processors (NMR), and M spares
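The drop below single-module reliability can be seen for the N = 3 case. The sketch below relies on assumptions not stated in this section: a perfect voter, for which the standard TMR expression is 3R^2 - 2R^3, and an exponential module reliability R(t) = exp(-λt) with a hypothetical failure rate. Under those assumptions the crossover occurs at R = 0.5, i.e., t = ln 2 / λ.

```python
from math import exp, log


def tmr_reliability(r):
    """2-out-of-3 majority reliability with a perfect voter (assumed here)."""
    return 3 * r**2 - 2 * r**3


FAILURE_RATE = 1e-4  # hypothetical failures per hour, for illustration only


def module_reliability(t):
    """Exponential module reliability R(t) = exp(-lambda * t) (an assumption)."""
    return exp(-FAILURE_RATE * t)


t_cross = log(2) / FAILURE_RATE  # time at which R(t) = 0.5
for t in (0.5 * t_cross, t_cross, 2 * t_cross):
    r = module_reliability(t)
    print(f"t={t:9.1f} h  single={r:.4f}  TMR={tmr_reliability(r):.4f}")
```

Before the crossover TMR is better than a single module; after it, the redundant system is worse, which is the motivation for adding spares.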
Hybrid Redundancy - Reliability
♦The reliability of a hybrid system with a TMR core and M spares is
♦$R_{hybrid}(t) = R_{voter}(t) R_{reconf}(t) (1 - m R(t) [1 - R(t)]^{m-1} - [1 - R(t)]^m)$
∗ m=M+3 is the total number of modules
∗ $R_{voter}(t)$ and $R_{reconf}(t)$ are the reliability of voter and comparison and reconfiguration circuitry
∗ Assuming that any fault in voter or comparison and reconfiguration circuit will cause a system fault
♦In practice, not all faults in these circuits will be fatal: the reliability will be higher
♦More accurate $R_{hybrid}(t)$: detailed analysis of voter and comparison and reconfiguration circuits and the ways they can fail
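The bracketed term in the hybrid formula is the probability that at least two of the m modules are fault-free: one minus the probability that exactly one works, minus the probability that none works. The sketch below checks that reading against a direct binomial sum. The function names and numeric values are chosen here for illustration.

```python
from math import comb


def hybrid_reliability(r, spares, r_voter=1.0, r_reconf=1.0):
    """R_voter * R_reconf * (1 - m R (1-R)^(m-1) - (1-R)^m), with m = M + 3."""
    m = spares + 3
    return r_voter * r_reconf * (1.0 - m * r * (1.0 - r) ** (m - 1) - (1.0 - r) ** m)


def at_least_two_of_m(r, m):
    """Probability that two or more of m independent modules are fault-free."""
    return sum(comb(m, k) * r**k * (1.0 - r) ** (m - k) for k in range(2, m + 1))


r = 0.9  # illustrative module reliability
for spares in (0, 1, 2, 3):
    print(spares, hybrid_reliability(r, spares), at_least_two_of_m(r, spares + 3))
```

With M = 0 and ideal voter and reconfiguration circuitry, the expression reduces to the plain TMR value.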
Sift-Out Modular Redundancy
♦Like NMR all N modules are active - Voter of outputs of all still operational modules
♦Besides the voter, a comparison and switching circuit - compares output of each module to outputs of other still operational modules
♦A module whose output disagrees with other outputs is switched out
♦Simpler than hybrid redundancy
♦Should not be too aggressive in the purging (sifting-out) process - the vast majority of failures are transient and will go away
∗ Purge a module only if it produces incorrect outputs over a sustained period of time
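One way to avoid overly aggressive purging is to count consecutive disagreements per module and switch a module out only when the count reaches a threshold. The sketch below is a hypothetical illustration, not the circuit described in the slides: the class name, the `purge_after` parameter and the use of a plurality vote among the still-active modules are all assumptions made here.

```python
from collections import Counter


class SiftOutVoter:
    """Votes over active modules and purges persistently disagreeing ones."""

    def __init__(self, n_modules, purge_after=3):
        self.active = set(range(n_modules))
        self.strikes = [0] * n_modules  # consecutive disagreements per module
        self.purge_after = purge_after  # hypothetical persistence threshold

    def step(self, outputs):
        """Take one output per module (indexed by module id) and return the voted value."""
        votes = Counter(outputs[i] for i in self.active)
        voted, _ = votes.most_common(1)[0]  # plurality among active modules
        for i in sorted(self.active):
            if outputs[i] == voted:
                self.strikes[i] = 0  # a transient fault has gone away
            else:
                self.strikes[i] += 1
                if self.strikes[i] >= self.purge_after:
                    self.active.discard(i)  # sustained disagreement: sift out
        return voted


voter = SiftOutVoter(n_modules=3, purge_after=2)
voter.step([1, 1, 0])  # module 2 disagrees once: kept
voter.step([1, 1, 1])  # agreement resets its strike count
voter.step([1, 1, 0])
voter.step([1, 1, 0])  # second consecutive disagreement: purged
print(voter.active)
```

Resetting the strike count on agreement is what lets transient faults pass without removing the module.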
Triplex-Duplex Architecture
♦This approach ties together processors to form duplexes
♦A triplex is then formed out of these duplexes
♦When the processors in a duplex disagree, both of them are switched out of the system
♦The triplex-duplex arrangement allows a simpler identification of faulty processors
♦Further, the triplex can continue to function even if only one duplex is left functional, since the duplex arrangement allows us to detect faults
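The switching rule can be sketched as a single step function. This is a hypothetical illustration and not an implementation from the course: the function name, the data layout (a dict from duplex id to a pair of processor outputs) and the plurality vote over surviving duplexes are assumptions made here.

```python
from collections import Counter


def triplex_duplex_step(active, outputs):
    """Run one voting step.

    active  - set of duplex ids still in the system
    outputs - dict mapping duplex id to a pair (out_a, out_b) of processor outputs
    Returns (voted_output, surviving_duplex_ids).
    """
    # A duplex whose two processors disagree is switched out entirely.
    survivors = {d for d in active if outputs[d][0] == outputs[d][1]}
    if not survivors:
        raise RuntimeError("no functional duplex left")
    # Vote over the outputs of the surviving duplexes (plurality, assumed here).
    votes = Counter(outputs[d][0] for d in survivors)
    voted, _ = votes.most_common(1)[0]
    return voted, survivors


value, active = triplex_duplex_step({0, 1, 2}, {0: (5, 5), 1: (5, 4), 2: (5, 5)})
print(value, active)  # duplex 1 disagreed internally and was removed

value, active = triplex_duplex_step({0}, {0: (7, 7)})
print(value, active)  # a single remaining duplex still produces a checked output
```

The second call reflects the last bullet: a lone duplex cannot outvote anything, but its internal comparison still detects a fault.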