
  • Page 1

    Copyright 2004 Koren & Krishna ECE655/Krishna Part.3 .1

    C. M. Krishna
    Fall 2006

    UNIVERSITY OF MASSACHUSETTS
    Dept. of Electrical & Computer Engineering

    Fault Tolerant Computing
    ECE 655

    Part 3
    Complex Structures


    Non Series/Parallel Systems

    ♦Each path represents a configuration allowing the system to operate successfully, e.g., ADF

    ♦The reliability can be calculated by expanding about a single module i :

    ♦$R_{system} = R_i \, \text{Prob}\{\text{System works} \mid i \text{ is fault-free}\} + (1 - R_i) \, \text{Prob}\{\text{System works} \mid i \text{ is faulty}\}$

    ♦Draw two new diagrams: in (a) module i is operational; in (b) module i is faulty

    ♦Module i is selected so that the two new diagrams are closer to simple series/parallel structures

  • Page 2


    Expanding about C

    ♦The process of expanding can be repeated until the resulting diagrams are of the series/parallel type

    ♦Figure (a) needs further expansion about E

    ♦Figure (a) should not be viewed as a parallel connection of A and B, connected serially to D and E in parallel. Such a diagram will have the path BCDF, which is not a valid path

    [Figures: (a) C is operational; (b) C is faulty]


    Expanding about C and E

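    ♦$R_{system} = R_C \, \text{Prob}\{\text{System works} \mid C \text{ is operational}\} + (1 - R_C) \, R_F \, [1 - (1 - R_A R_D)(1 - R_B R_E)]$

    ♦Expanding about E yields

    ♦$\text{Prob}\{\text{System works} \mid C \text{ is operational}\} = R_E R_F \, [1 - (1 - R_A)(1 - R_B)] + (1 - R_E) \, R_A R_D R_F$

    ♦Substituting results in

    ♦$R_{system} = R_C \, [R_E R_F (R_A + R_B - R_A R_B) + (1 - R_E) \, R_A R_D R_F] + (1 - R_C) \, [R_F (R_A R_D + R_B R_E - R_A R_D R_B R_E)]$

    ♦Example: $R_A = R_B = R_C = R_D = R_E = R_F = R$

    ♦$R_{system} = R^3 (R^3 - 3R^2 + R + 2)$

    [Figures: (a) and (b)]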
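The expansion result can be checked numerically. The sketch below is only an illustration: it assumes the success paths of the example diagram are ADF, BEF and ACEF (the paths the slides list for this diagram), and the function names are hypothetical. It compares the expansion formula against a brute-force sum over all up/down states of the six modules.

```python
from itertools import product

MODULES = "ABCDEF"
# Assumed success paths of the example diagram (as listed in the slides).
PATHS = [set("ADF"), set("BEF"), set("ACEF")]

def by_enumeration(rel):
    """Sum the probability of every module state in which some path is fully up."""
    total = 0.0
    for state in product([True, False], repeat=len(MODULES)):
        up = {m for m, ok in zip(MODULES, state) if ok}
        if any(path <= up for path in PATHS):
            prob = 1.0
            for m, ok in zip(MODULES, state):
                prob *= rel[m] if ok else 1.0 - rel[m]
            total += prob
    return total

def by_expansion(rel):
    """Expansion about C, then about E, following the slide's formulas."""
    a, b, c, d, e, f = (rel[m] for m in MODULES)
    given_c_up = e * f * (1 - (1 - a) * (1 - b)) + (1 - e) * a * d * f
    given_c_down = f * (1 - (1 - a * d) * (1 - b * e))
    return c * given_c_up + (1 - c) * given_c_down

def equal_reliability_form(r):
    """Closed form for R_A = ... = R_F = R."""
    return r**3 * (r**3 - 3 * r**2 + r + 2)

rel = {m: 0.9 for m in MODULES}
print(by_enumeration(rel), by_expansion(rel), equal_reliability_form(0.9))
```

The enumeration grows as 2^n in the number of modules, so it is useful only as a cross-check on small diagrams; the expansion method is what scales to hand analysis.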

  • Page 3


    Upper Bound on Reliability

    ♦If structure is too complicated - derive upper and lower bounds on $R_{system}$

    ♦An upper bound - $R_{system} \le 1 - \prod_i (1 - R_{path_i})$

    ∗ $R_{path_i}$ - reliability of modules in series along path $i$
    ∗ Assuming all paths are in parallel

    ♦Example - the paths are ADF, BEF and ACEF

    ♦$R_{system} \le 1 - (1 - R_A R_D R_F)(1 - R_B R_E R_F)(1 - R_A R_C R_E R_F)$

    ♦If $R_A = R_B = R_C = R_D = R_E = R_F = R$ then

    ♦$R_{system} \le R^3 (R^7 - 2R^4 - R^3 + R + 2)$

    ♦Upper bound can be used to derive the exact expression: perform multiplication and replace every occurrence of $R_i^j$ by $R_i$

    ∗ On each path every module is used only once and its reliability should be raised only to its first power
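The bound and the "first power only" rule can be sketched in a few lines. This is an illustrative sketch with hypothetical function names: `upper_bound` treats the three paths as if they were independent and in parallel, while `exact_from_paths` multiplies out the same product and takes the set union of modules in each term, which is one way to keep every module reliability at its first power.

```python
from itertools import combinations
from math import prod

# Paths of the example diagram, as listed in the slides.
PATHS = [frozenset("ADF"), frozenset("BEF"), frozenset("ACEF")]

def upper_bound(rel, paths=PATHS):
    """1 - prod(1 - R_path_i), treating the paths as independent parallel branches."""
    return 1.0 - prod(1.0 - prod(rel[m] for m in path) for path in paths)

def exact_from_paths(rel, paths=PATHS):
    """Expand the product; a set union makes each module appear once per term."""
    total = 0.0
    for k in range(1, len(paths) + 1):
        sign = 1.0 if k % 2 == 1 else -1.0
        for combo in combinations(paths, k):
            modules = frozenset().union(*combo)
            total += sign * prod(rel[m] for m in modules)
    return total

r = 0.9
rel = {m: r for m in "ABCDEF"}
print(upper_bound(rel), exact_from_paths(rel))
```

For equal reliabilities the second function reproduces $R^3(R^3 - 3R^2 + R + 2)$, the result obtained by expanding about C and E.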


    Lower Bound on Reliability

    ♦A lower bound is calculated based on minimal cut sets of the system diagram

    ♦A minimal cut set: a minimal list of modules such that the removal (due to a fault) of all modules will cause a working system to fail

    ♦Minimal cut sets: F, AB, AE, DE and BCD

    ♦The lower bound is

    ♦$R_{system} \ge \prod_i (1 - Q_{cut_i})$

    ∗ $Q_{cut_i}$ - probability that the minimal cut $i$ is faulty (i.e., all its modules are faulty)

    ♦Example - $R_A = R_B = R_C = R_D = R_E = R_F = R$

    ♦$R_{system} \ge R^5 (24 - 60R + 62R^2 - 33R^3 + 9R^4 - R^5)$
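A minimal sketch of the cut-set bound, using the five minimal cut sets listed above. The function name is hypothetical; the check against the polynomial and against the exact equal-reliability expression $R^3(R^3 - 3R^2 + R + 2)$ is included only as a sanity test.

```python
from math import prod

# Minimal cut sets of the example diagram, as listed in the slides.
CUTS = ["F", "AB", "AE", "DE", "BCD"]

def lower_bound(rel, cuts=CUTS):
    """prod(1 - Q_cut_i), where Q_cut_i is the probability that every module in cut i is faulty."""
    return prod(1.0 - prod(1.0 - rel[m] for m in cut) for cut in cuts)

r = 0.9
rel = {m: r for m in "ABCDEF"}
bound = lower_bound(rel)
polynomial = r**5 * (24 - 60 * r + 62 * r**2 - 33 * r**3 + 9 * r**4 - r**5)
exact = r**3 * (r**3 - 3 * r**2 + r + 2)
print(bound, polynomial, exact)
```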

  • Page 4


    Variations on NMR

    ♦Unit-level Modular Redundancy

    ♦Voters are no longer as critical as in NMR; a single faulty voter will be no worse than a single faulty unit

    ♦The level at which the replication and voting are applied can be further lowered at the expense of additional voters increasing the size and delay of the system


    Triplicated Processor/Memory System

    ♦All communications (in either direction) between the triplicated processors and triplicated memories go through majority voting

    ♦This organization has a higher reliability than a single majority voting of a triplicated processor/memory structure

  • Page 5


    Active/Dynamic Redundancy

    ♦Previous variations of N-modular redundancy - considerable hardware to instantaneously mask errors

    ♦Temporary erroneous results may be acceptable if system can detect such errors and reconfigure itself

    ∗ Replacing the faulty module by a fault-free spare

    ♦Example - an active (or dynamic) redundancy scheme


    Reliability - Active Spares

    ♦If all spare modules are active (powered) they have the same failure rate - similar to a basic parallel system

    ♦The system reliability is thus

    ♦$R_{dynamic}(t) = R_{det}(t) \, [1 - (1 - R(t))^N]$

    ♦$R(t)$ - reliability of module

    ♦$R_{det}(t)$ - reliability of Detection and Reconfiguration unit
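The active-spares formula is easy to evaluate directly. In the sketch below the function names are hypothetical, N is taken to be the total number of powered modules, and the exponential failure law $R(t) = e^{-\lambda t}$ is an assumption added for illustration (the slides do not fix a failure model).

```python
import math

def module_reliability(t, failure_rate):
    """Assumed exponential failure law R(t) = exp(-lambda * t); not stated in the slides."""
    return math.exp(-failure_rate * t)

def r_dynamic_active(r_module, r_det, n_modules):
    """R_det * [1 - (1 - R)^N]: at least one of N powered modules survives."""
    return r_det * (1.0 - (1.0 - r_module) ** n_modules)

r = module_reliability(t=1000.0, failure_rate=1e-4)
for n in (1, 2, 3):
    print(n, r_dynamic_active(r, r_det=0.999, n_modules=n))
```

Because $R_{det}(t)$ multiplies the whole expression, the detection and reconfiguration unit caps the achievable reliability no matter how many modules are added.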

  • Page 6


    Reliability - Standby Spares

    ♦If spare modules are not expected to fail (e.g., are not powered in order to conserve energy), the reliability of a system with one active module and one standby spare is

    ♦$R_{dynamic}(t) = R(t) + C \, R(t) \, (1 - R(t))$

    ♦where $C$ is the coverage factor: probability that the faulty active module will be correctly diagnosed and disconnected and the good spare will be successfully connected

    ♦Generalizing to the case of N spares -

    ♦$R_{dynamic}(t) = R(t) \sum_{k=0}^{N} C^k (1 - R(t))^k$
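As a small illustration of the standby-spares sum, the sketch below evaluates it for a given module reliability, coverage factor and number of spares. The function and parameter names are hypothetical; the formula itself is the one on the slide.

```python
def r_dynamic_standby(r_module, coverage, n_spares):
    """R * sum_{k=0}^{N} C^k (1 - R)^k for one active module and N standby spares."""
    return r_module * sum(
        (coverage * (1.0 - r_module)) ** k for k in range(n_spares + 1)
    )

# With imperfect coverage, extra spares give rapidly diminishing returns.
for n in range(4):
    print(n, r_dynamic_standby(r_module=0.9, coverage=0.95, n_spares=n))
```

With one spare the sum reduces to $R(t) + C\,R(t)(1 - R(t))$, and with no spares it reduces to $R(t)$.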


    Hybrid Redundancy

    ♦An NMR system masks permanent and intermittent failures but its reliability drops below that of a single module for very long mission times

    ♦Hybrid redundancy overcomes this by adding spare modules to replace active modules once they become faulty

    ♦A hybrid system consists of a core of N processors (NMR), and M spares

  • Page 7


    Hybrid Redundancy - Reliability

    ♦The reliability of a hybrid system with a TMR core and M spares is

    ♦$R_{hybrid}(t) = R_{voter}(t) \, R_{reconf}(t) \left( 1 - m \, R(t) \, [1 - R(t)]^{m-1} - [1 - R(t)]^{m} \right)$

    ∗ $m = M + 3$ is the total number of modules
    ∗ $R_{voter}(t)$ and $R_{reconf}(t)$ are the reliability of voter and comparison and reconfiguration circuitry
    ∗ Assuming that any fault in voter or comparison and reconfiguration circuit will cause a system fault

    ♦In practice, not all faults in these circuits will be fatal: the reliability will be higher

    ♦More accurate $R_{hybrid}(t)$: detailed analysis of voter and comparison & reconfiguration circuits and the ways they can fail
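The hybrid formula says the system survives as long as at least two of the m modules are fault-free. A minimal sketch (hypothetical function name, voter and reconfiguration reliabilities defaulting to 1 purely for illustration) makes it easy to see how spares change the result:

```python
def r_hybrid(r_module, n_spares, r_voter=1.0, r_reconf=1.0):
    """TMR core plus M spares: fails only if fewer than two of m = M + 3 modules are good."""
    m = n_spares + 3
    q = 1.0 - r_module
    at_most_one_good = m * r_module * q ** (m - 1) + q ** m
    return r_voter * r_reconf * (1.0 - at_most_one_good)

for spares in range(4):
    print(spares, r_hybrid(r_module=0.9, n_spares=spares))
```

With no spares the expression reduces to the familiar TMR result $3R^2 - 2R^3$.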


    Sift-Out Modular Redundancy

    ♦Like NMR, all N modules are active - the voter operates on the outputs of all still-operational modules

    ♦Besides the voter, a comparison and switching circuit - compares output of each module to outputs of other still operational modules

    ♦A module whose output disagrees with other outputs is switched out

    ♦Simpler than hybrid redundancy

    ♦Should not be too aggressive in the purging (sifting-out) process - vast majority of failures are transient and will go away

    ∗ Purging a module only if it produces incorrect outputs over a sustained period of time
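The "do not purge too aggressively" rule can be sketched as a counter of consecutive disagreements. Everything in this sketch is an assumption made for illustration: the class name, the `purge_after` threshold, and the simplification of comparing each module against the majority value rather than pairwise against every other module.

```python
from collections import Counter

class SiftOutVoter:
    """Votes over active modules and purges one only after sustained disagreement."""

    def __init__(self, n_modules, purge_after=3):
        self.active = set(range(n_modules))
        self.strikes = {i: 0 for i in range(n_modules)}
        self.purge_after = purge_after  # hypothetical threshold, in voting cycles

    def step(self, outputs):
        """outputs maps module index -> output value for this cycle."""
        votes = Counter(outputs[i] for i in self.active)
        majority, _ = votes.most_common(1)[0]
        for i in list(self.active):
            if outputs[i] == majority:
                self.strikes[i] = 0  # a transient fault is forgiven once it clears
            else:
                self.strikes[i] += 1
                if self.strikes[i] >= self.purge_after:
                    self.active.discard(i)  # sustained disagreement: switch it out
        return majority

voter = SiftOutVoter(n_modules=3)
voter.step({0: 1, 1: 1, 2: 0})      # one-off disagreement, module 2 stays
for _ in range(3):
    voter.step({0: 5, 1: 5, 2: 7})  # three cycles in a row, module 2 is purged
print(sorted(voter.active))
```

A larger threshold tolerates longer transients at the cost of leaving a permanently faulty module in the vote for more cycles.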

  • Page 8


    Triplex-Duplex Architecture

    ♦This approach ties together processors to form duplexes

    ♦A triplex is then formed out of these duplexes

    ♦When the processors in a duplex disagree, both of them are switched out of the system

    ♦The triplex-duplex arrangement allows a simpler identification of faulty processors

    ♦Further, the triplex can continue to function even if only one duplex is left functional, since the duplex arrangement allows us to detect faults