EE141 System-on-Chip Test Architectures, Ch. 3: Fault-Tolerant Design

Post on 21-Dec-2015


Page 1: Chapter 3 - Fault-Tolerant Design

Page 2: What is this chapter about?

Gives Overview of Fault-Tolerant Design
Focus on
  – Basic Concepts in Fault-Tolerant Design
  – Metrics Used to Specify and Evaluate Dependability
  – Review of Coding Theory
  – Fault-Tolerant Design Schemes
    – Hardware Redundancy
    – Information Redundancy
    – Time Redundancy
  – Examples of Fault-Tolerant Applications in Industry

Page 3: Fault-Tolerant Design

Introduction
Fundamentals of Fault Tolerance
Fundamentals of Coding Theory
Fault Tolerant Schemes
Industry Practices
Concluding Remarks

Page 4: Introduction

Fault Tolerance
  – Ability of system to continue error-free operation in presence of unexpected fault
Important in mission-critical applications
  – E.g., medical, aviation, banking, etc.
  – Errors very costly
Becoming important in mainstream applications
  – Technology scaling causing circuit behavior to become less predictable and more prone to failures
  – Needing fault tolerance to keep failure rate within acceptable levels

Page 5: Faults

Permanent Faults
  – Due to manufacturing defects, early life failures, wearout failures
  – Wearout failures due to various mechanisms
    – e.g., electromigration, hot carrier degradation, dielectric breakdown, etc.
Temporary Faults
  – Only present for short period of time
  – Caused by external disturbance or marginal design parameters

Page 6: Temporary Faults

Transient Errors (Non-recurring errors)
  – Caused by external disturbance
    – e.g., radiation, noise, power disturbance, etc.
Intermittent Errors (Recurring errors)
  – Caused by marginal design parameters
  – Timing problems
    – e.g., races, hazards, skew
  – Signal integrity problems
    – e.g., crosstalk, ground bounce, etc.

Page 7: Redundancy

Fault Tolerance requires some form of redundancy
  – Time Redundancy
  – Hardware Redundancy
  – Information Redundancy

Page 8: Time Redundancy

Perform Same Operation Twice
  – See if get same result both times
  – If not, then fault occurred
  – Can detect temporary faults
  – Cannot detect permanent faults
    – Would affect both computations
Advantage
  – Little to no hardware overhead
Disadvantage
  – Impacts system or circuit performance

Page 9: Hardware Redundancy

Replicate hardware and compare outputs
  – From two or more modules
  – Detects both permanent and temporary faults
Advantage
  – Little or no performance impact
Disadvantage
  – Area and power for redundant hardware

Page 10: Information Redundancy

Encode outputs with error detecting or correcting code
  – Code selected to minimize redundancy for class of faults
Advantage
  – Less hardware to generate redundant information than replicating module
Drawback
  – Added complexity in design

Page 11: Failure Rate

$\lambda(t)$ = Component failure rate
  – Measured in FITs (failures per $10^9$ hours)

Page 12: System Failure Rate

System constructed from components
No Fault Tolerance
  – Any component fails, whole system fails

$\lambda_{sys} = \sum_{i=1}^{k} \lambda_{c,i}$

Page 13: Reliability

If component working at time 0
  – R(t) = Probability still working at time t
Exponential Failure Law
  – If failure rate assumed constant
    – Good approximation if past infant mortality period

$R(t) = e^{-\lambda t}$

Page 14: Reliability for Series System

Series System
  – All components need to work for system to work

[Diagram: components A, B, and C connected in series]

$R_{sys} = R_A \cdot R_B \cdot R_C$

Page 15: System Reliability with Redundancy

System reliability with component B in Parallel
  – Can tolerate one component B failing

[Diagram: A in series with two parallel copies of B, followed by C]

$R_{sys} = R_A \cdot \left(1 - (1 - R_B)^2\right) \cdot R_C = R_A \cdot (2R_B - R_B^2) \cdot R_C$
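The series and parallel reliability formulas can be checked numerically. The sketch below is only an illustration: the function names and the component reliability of 0.9 are assumptions chosen for the example, not values from the slides.

```python
import math

def reliability(failure_rate, t):
    """Exponential failure law: R(t) = exp(-lambda * t)."""
    return math.exp(-failure_rate * t)

def series(*rs):
    """Series system: every component must work, so reliabilities multiply."""
    return math.prod(rs)

def parallel(*rs):
    """Parallel group: fails only if every copy fails."""
    return 1.0 - math.prod(1.0 - r for r in rs)

# Hypothetical component reliabilities (assumed for illustration only).
r_a = r_b = r_c = 0.9

r_series = series(r_a, r_b, r_c)                      # A, B, C in series
r_redundant = series(r_a, parallel(r_b, r_b), r_c)    # B duplicated in parallel
```

With these assumed values, duplicating B raises the B stage from 0.9 to 0.99, so the system improves even though A and C remain single points of failure.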

Page 16: Mean-Time-to-Failure (MTTF)

Average time before system fails
  – Equal to area under reliability curve

$MTTF = \int_0^\infty R(t)\,dt$

For Exponential Failure Law

$MTTF = \int_0^\infty e^{-\lambda t}\,dt = \frac{1}{\lambda}$

Page 17: Maintainability

If system failed at time 0
  – M(t) = Probability repaired and operational at time t
System repair time divided into
  – Passive repair time
    – Time for service engineer to travel to site
  – Active repair time
    – Time to locate failing component, repair/replace, and verify system operational
    – Can be improved through designing system so easy to locate failed component and verify

Page 18: Repair Rate and MTTR

$\mu$ = rate at which system repaired
  – Analogous to failure rate
Maintainability often modeled as

$M(t) = 1 - e^{-\mu t}$

Mean-Time-to-Repair (MTTR) = $1/\mu$

Page 19: Availability

System Availability
  – Fraction of time system is operational

[Figure: system state S (1 or 0) versus time t, with axis marks t0 to t4, annotated with failures and normal system operation]

$\text{system availability} = \frac{MTTF}{MTTF + MTTR}$

Page 20: Availability

Telephone Systems
  – Required to have system availability of 0.9999 (“four nines”)
High-Reliability Systems
  – May require 7 or more nines
Fault-Tolerant Design
  – Needed to achieve such high availability from less reliable components
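To make the availability targets concrete, here is a minimal sketch of the availability relation MTTF / (MTTF + MTTR). The helper names and the hour figures are assumptions picked so the arithmetic lands on four nines.

```python
def availability(mttf_hours, mttr_hours):
    """Fraction of time the system is operational."""
    return mttf_hours / (mttf_hours + mttr_hours)

def max_mttr_for(target, mttf_hours):
    """Largest MTTR that still meets a target availability (solves the formula for MTTR)."""
    return mttf_hours * (1.0 - target) / target

# Hypothetical system: fails on average every 9999 hours, repaired in 1 hour.
a = availability(9999.0, 1.0)
budget = max_mttr_for(0.9999, 9999.0)
```

Under these assumed numbers the system just reaches 0.9999, which shows how small the repair budget is relative to the time between failures.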

Page 21: Coding Theory

Coding
  – Using more bits than necessary to represent data
  – Provides way to detect errors
    – Errors occur when bits get flipped
Error Detecting Codes
  – Many types
  – Detect different classes of errors
  – Use different amounts of redundancy
  – Ease of encoding and decoding data varies

Page 22: Block Code

Message = Data Being Encoded
Block code
  – Encodes m messages with n-bit codeword
If no redundancy
  – m messages encoded with $\log_2(m)$ bits
  – minimum possible

$\text{redundancy} = 1 - \frac{\log_2(m)}{n}$

Page 23: Block Code

To detect errors, some redundancy needed
  – Space of distinct $2^n$ blocks partitioned into codewords and non-codewords
Can detect errors that cause codeword to become non-codeword
Cannot detect errors that cause codeword to become another codeword

Page 24: Separable Block Code

Separable
  – n-bit blocks partitioned into
    – k information bits directly representing message
    – (n-k) check bits
  – Denoted (n,k) Block Code
Advantage
  – k-bit message directly extracted without decoding
Rate of Separable Block Code = k/n

Page 25: Example of Separable Block Code

(4,3) Parity Code
  – Check bit is XOR of 3 message bits
  – message 101, codeword 1010
Single Bit Parity

$\text{redundancy} = 1 - \frac{\log_2(m)}{n} = 1 - \frac{\log_2(2^k)}{n} = 1 - \frac{k}{n} = \frac{n-k}{n} = \frac{1}{n}$

$\text{rate} = \frac{k}{n} = \frac{n-1}{n}$
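The (4,3) parity example can be sketched in a few lines. The function names are illustrative; the message 101 and codeword 1010 are the ones given on the slide.

```python
def parity_encode(bits):
    """Append the XOR of the message bits as a single check bit."""
    check = 0
    for b in bits:
        check ^= b
    return bits + [check]

def parity_ok(codeword):
    """A received word is a codeword when the XOR of all its bits is 0."""
    total = 0
    for b in codeword:
        total ^= b
    return total == 0

codeword = parity_encode([1, 0, 1])          # slide example: message 101
single_error = [1, 1, 1, 0]                  # one flipped bit is detected
double_error = [0, 1, 1, 0]                  # two flipped bits go unnoticed
```

The last case illustrates why single-bit parity has distance 2: any even number of bit flips maps one codeword onto another.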

Page 26: Example of Non-Separable Block Code

One-Hot Code
  – Each Codeword has single 1
  – Example of 8-bit one-hot
    – 10000000, 01000000, 00100000, 00010000, 00001000, 00000100, 00000010, 00000001
  – Redundancy = $1 - \log_2(8)/8 = 5/8$

$\text{redundancy} = 1 - \frac{\log_2(m)}{n} = 1 - \frac{\log_2(n)}{n}$

Page 27: Linear Block Codes

Special class
  – Modulo-2 sum of any 2 codewords also codeword
  – Null space of (n-k)×n Boolean matrix
    – Called Parity Check Matrix, H
For any n-bit codeword c
  – $cH^T = 0$
  – All 0 codeword exists in any linear code

Page 28: Linear Block Codes

Generator Matrix, G
  – k×n Matrix
  – Codeword c for message m: $c = mG$
  – $GH^T = 0$

Page 29: Systematic Block Code

First k-bits correspond to message
Last n-k bits correspond to check bits
For Systematic Code
  – $G = [\,I_{k \times k} : P_{k \times (n-k)}\,]$
  – $H = [\,P^T_{(n-k) \times k} : I_{(n-k) \times (n-k)}\,]$
Example

$G = \begin{bmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \end{bmatrix} \qquad H = \begin{bmatrix} 1 & 1 & 1 & 1 \end{bmatrix}$

Page 30: Distance of Code

Distance between two codewords
  – Number of bits in which they differ
Distance of Code
  – Minimum distance between any two codewords in code
  – If n=k (no redundancy), distance = 1
  – Single-bit parity, distance = 2
Code with distance d
  – Detect d-1 errors
  – Correct up to $\lfloor (d-1)/2 \rfloor$ errors
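As a hedged illustration of the distance definition, the sketch below brute-forces the minimum pairwise distance of a small code. The helper names are assumptions; the code being measured is the (4,3) single-bit parity code described earlier in the deck.

```python
from itertools import combinations

def hamming_distance(a, b):
    """Number of bit positions in which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))

def code_distance(codewords):
    """Minimum distance over all pairs of distinct codewords."""
    return min(hamming_distance(a, b) for a, b in combinations(codewords, 2))

# Build the (4,3) parity code: 3 message bits plus their XOR.
parity_code = []
for m in range(8):
    msg = format(m, "03b")
    parity_code.append(msg + str(msg.count("1") % 2))

d = code_distance(parity_code)
detectable = d - 1            # errors guaranteed to be detected
correctable = (d - 1) // 2    # errors guaranteed to be corrected
```

Exhaustive pairwise comparison is quadratic in the number of codewords, so it only suits small codes like this one.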

Page 31: Error Correcting Codes

Code with distance 3
  – Called single error correcting (SEC) code
Code with distance 4
  – Called single error correcting and double error detecting (SEC-DED) code
Procedure for constructing SEC code
  – Described in [Hamming 1950]
  – Any H-matrix with all columns distinct and no all-0 column is SEC

Page 32: Hamming Code

For any value of n
  – SEC code constructed by
    – setting each column in H equal to binary representation of column number (starting from 1)
  – Number of rows in H equal to $\lceil \log_2(n+1) \rceil$
Example of SEC Hamming Code for n=7

$H = \begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 1 & 1 \\ 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 1 & 0 & 1 & 0 & 1 & 0 & 1 \end{bmatrix}$

Page 33: Error Correction in Hamming Code

Syndrome, s
  – $s = Hv^T$ for received vector v
  – If v is codeword
    – Syndrome = 0
  – If v non-codeword and single-bit error
    – Syndrome will match one of columns of H
    – Will contain binary value of bit position in error

Page 34: Example of Error Correction

For (7,4) Hamming Code
  – Suppose codeword 0110011 has one-bit error changing it to 1110011
  – Syndrome 001 matches column 1 of H, so bit 1 is in error

$s = Hv^T$, shown in row form as $vH^T$:

$\begin{bmatrix} 1 & 1 & 1 & 0 & 0 & 1 & 1 \end{bmatrix} \begin{bmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix} = \begin{bmatrix} 0 & 0 & 1 \end{bmatrix}$
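The syndrome computation can be sketched as follows. This assumes the n=7 parity-check matrix whose column j is the binary representation of j with the most significant bit in the top row; the function names are illustrative.

```python
# Parity-check matrix for n = 7: column j (1-based) holds the binary value of j,
# most significant bit in the top row (an assumption about row ordering).
H = [[(j >> shift) & 1 for j in range(1, 8)] for shift in (2, 1, 0)]

def syndrome(v):
    """s = H v^T over GF(2), returned as a list of 3 bits (MSB first)."""
    return [sum(h * x for h, x in zip(row, v)) % 2 for row in H]

def correct_single_error(v):
    """Flip the bit named by the syndrome; position 0 means no error seen."""
    s = syndrome(v)
    position = s[0] * 4 + s[1] * 2 + s[2]
    fixed = list(v)
    if position:
        fixed[position - 1] ^= 1
    return fixed, position

received = [1, 1, 1, 0, 0, 1, 1]             # slide example: bit 1 flipped
fixed, position = correct_single_error(received)
```

Because each column of H is its own index, the syndrome needs no lookup table: it directly names the failing bit position.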

Page 35: SEC-DED Code

Make SEC Hamming Code SEC-DED
  – By adding parity check over all bits
  – Extra parity bit
    – 1 for single-bit error
    – 0 for double-bit error
  – Makes possible to detect double bit error
    – Avoid assuming single-bit error and miscorrecting it

Page 36: Example of Error Correction

For (7,3) SEC-DED Hamming Code
  – Suppose codeword 0110011 has two-bit error changing it to 1010011
    – Syndrome doesn’t match any column in H
    – Overall parity bit of syndrome is 0, indicating double-bit error

$s = Hv^T$, shown in row form as $vH^T$:

$\begin{bmatrix} 1 & 0 & 1 & 0 & 0 & 1 & 1 \end{bmatrix} \begin{bmatrix} 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 1 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 1 \end{bmatrix} = \begin{bmatrix} 0 & 1 & 1 & 0 \end{bmatrix}$
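A sketch of how the extra parity check separates single from double errors. It assumes the 7-bit code whose H-matrix is the Hamming matrix (column j = binary value of j) with an added all-ones row; the function names and result strings are illustrative.

```python
def secded_syndrome(v):
    """Return (position part, overall parity part) of the syndrome.

    The position part is the XOR of the 1-based indices of all set bits,
    which equals H v^T when column j of H is the binary value of j.
    """
    position = 0
    parity = 0
    for j, bit in enumerate(v, start=1):
        if bit:
            position ^= j
            parity ^= 1
    return position, parity

def classify(v):
    position, parity = secded_syndrome(v)
    if position == 0 and parity == 0:
        return "no error"
    if parity == 1 and position != 0:
        return f"single-bit error at position {position}"
    if parity == 0:
        return "double-bit error detected"
    return "uncorrectable error"

clean = classify([0, 1, 1, 0, 0, 1, 1])
single = classify([1, 1, 1, 0, 0, 1, 1])
double = classify([1, 0, 1, 0, 0, 1, 1])     # slide example: bits 1 and 2 flipped
```

In the double-error case the position part is nonzero but the parity part is 0, which is what stops the decoder from miscorrecting.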

Page 37: Hsiao Code

Weight of column
  – Number of 1’s in column
Constructing n-bit SEC-DED Hsiao Code
  – First use all possible weight-1 columns
    – Then all possible weight-3 columns
    – Then weight-5 columns, etc.
  – Until n columns formed
  – Number of check bits r is the smallest value with $2^{r-1} \ge n$, i.e., $\lceil \log_2(n) \rceil + 1$ (there are $2^{r-1}$ odd-weight columns of r bits)
  – Minimizes number of 1’s in H-matrix
    – Less hardware and delay for computing syndrome
    – Disadvantage: Correction logic more complex

Page 38: Example of Hsiao Code

(7,3) Hsiao Code
  – Uses weight-1 and weight-3 columns

$H = \begin{bmatrix} 0 & 0 & 0 & 1 & 0 & 1 & 1 \\ 0 & 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 0 & 1 & 1 & 1 \end{bmatrix}$

Page 39: Unidirectional Errors

Errors in block of data which only cause 0→1 or 1→0, but not both
  – Any number of bits in error in one direction
Example
  – Correct codeword 111000
  – Unidirectional errors could cause
    – 001000, 000000, 101000 (only 1→0 errors)
  – Non-unidirectional errors
    – 101001, 011001, 011011 (both 1→0 and 0→1)

Page 40: Unidirectional Error Detecting Codes

All unidirectional error detecting (AUED) Codes
  – Detect all unidirectional errors in codeword
  – Single-bit parity is not AUED
    – Cannot detect even number of errors
  – No linear code is AUED
    – All linear codes must contain all-0 vector, so cannot detect all 1→0 errors

Page 41: Two-Rail Code

Two-Rail Code
  – One check bit for each information bit
    – Equal to complement of information bit
  – Two-Rail Code is AUED
  – 50% Redundancy
Example of (6,3) Two-Rail Code
  – Message 101 has Codeword 101010
  – Set of all codewords
    – 000111, 001110, 010101, 011100, 100011, 101010, 110001, 111000

Page 42: Berger Codes

Lowest redundancy of separable AUED codes
  – For k information bits, $\lceil \log_2(k+1) \rceil$ check bits
Check bits equal to binary representation of number of 0’s in information bits
Example
  – Information bits 1000101
    – $\log_2(7+1) = 3$ check bits
    – Check bits equal to 100 (4 zeros)

Page 43: Berger Codes

Codewords for (5,3) Berger Code
  – 00011, 00110, 01010, 01101, 10010, 10101, 11001, 11100
If unidirectional errors
  – Contain 1→0 errors
    – increase 0’s in information bits
    – can only decrease binary number in check bits
  – Contain 0→1 errors
    – decrease 0’s in information bits
    – can only increase binary number in check bits
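Berger encoding and checking can be sketched briefly. The function names are assumptions, and words are represented as bit strings for readability; the expected codewords are the ones listed for the (5,3) code and the 1000101 example.

```python
def berger_encode(info):
    """Append the count of 0's in the information bits, in binary."""
    k = len(info)
    r = k.bit_length()          # equals ceil(log2(k + 1)) for k >= 1
    zeros = info.count("0")
    return info + format(zeros, f"0{r}b")

def berger_check(word, k):
    """Valid when the check field equals the number of 0's in the first k bits."""
    info, check = word[:k], word[k:]
    return info.count("0") == int(check, 2)

codewords = [berger_encode(format(m, "03b")) for m in range(8)]
long_example = berger_encode("1000101")      # 4 zeros, 3 check bits

# A 1->0 error in the information bits raises the zero count while the
# check field can only stay the same or shrink, so the mismatch is detected.
corrupted = "00101"                           # from codeword 10101
```

The check works because unidirectional errors push the zero count and the stored count in opposite directions, so they can never cancel.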

Page 44: Berger Codes

If 8 information bits
  – Berger code requires $\lceil \log_2(8+1) \rceil = 4$ check bits

$\text{redundancy} = 1 - \frac{\log_2(m)}{n} = 1 - \frac{\log_2(2^k)}{n} = 1 - \frac{8}{12} = \frac{4}{12} \approx 33.3\%$

(16,8) Two-Rail Code
  – Requires 50% redundancy
Redundancy advantage of Berger Code
  – Increases as k increased

Page 45: Constant Weight Codes

Constant Weight Codes
  – Non-separable, but lower redundancy than Berger
  – Each codeword has same number of 1’s
Example
  – 2-out-of-3 constant weight code: 110, 011, 101
AUED code
  – Unidirectional errors always change number of 1’s

Page 46: Constant Weight Codes

Number of codewords in m-out-of-n code

$C^n_m = \binom{n}{m}$

Codewords maximized when m close to n/2 as possible
  – n/2-out-of-n when n even
  – (n/2-0.5 or n/2+0.5)-out-of-n when n odd
  – Minimizes redundancy of code

Page 47: Example

6-out-of-12 constant weight code

$C^{12}_6 = 924$ codewords

$\text{redundancy} = 1 - \frac{\log_2(m)}{n} = 1 - \frac{\log_2(924)}{12} = 17.9\%$

12-bit Berger Code
  – Only $2^8 = 256$ codewords

$\text{redundancy} = 1 - \frac{\log_2(m)}{n} = 1 - \frac{\log_2(2^8)}{12} = 33.3\%$
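The two redundancy figures can be reproduced directly from the definition redundancy = 1 - log2(m)/n. The function name below is an assumption; `math.comb` requires Python 3.8 or later.

```python
import math

def redundancy(num_codewords, n):
    """Redundancy of a block code with the given number of n-bit codewords."""
    return 1.0 - math.log2(num_codewords) / n

# 6-out-of-12 constant weight code
cw_codewords = math.comb(12, 6)
cw_redundancy = redundancy(cw_codewords, 12)

# 12-bit Berger code: 8 information bits, 4 check bits
berger_redundancy = redundancy(2 ** 8, 12)
```

The same function applies to any block code in the deck, since the definition depends only on the codeword count and the block length.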

Page 48: Constant Weight Codes

Advantage
  – Less redundancy than Berger codes
Disadvantage
  – Non-separable
  – Need decoding logic
    – to convert codeword back to binary message

Page 49: Burst Error

Burst Error
  – Common, multi-bit errors tend to be clustered
    – Noise source affects contiguous set of bus lines
  – Length of burst error
    – number of bits from first to last error, inclusive
  – Wrap around from last to first bit of codeword
Example: Original codeword 00000000
  – 00111100 is burst error length 4
  – 00110100 is burst error length 4
    – Any number of errors between first and last error

Page 50: Cyclic Codes

Special class of linear code
  – Any codeword shifted cyclically is another codeword
  – Used to detect burst errors
  – Less redundancy required to detect burst error than general multi-bit errors
    – Some distance 2 codes can detect all burst errors of length 4
    – detecting all possible 4-bit errors requires distance 5 code

Page 51: Cyclic Redundancy Check (CRC) Code

Most widely used cyclic code
  – Uses binary alphabet based on GF(2)
CRC code is (n,k) block code
  – Formed using generator polynomial, g(x)
    – called code generator
    – degree n-k polynomial (same degree as number of check bits)

$g(x) = g_{n-k}x^{n-k} + \dots + g_2x^2 + g_1x + g_0$

$c(x) = m(x)\,g(x)$

Page 52: (6,4) CRC codewords formed as c(x) = m(x)g(x)

Message  m(x)                 g(x)     c(x)                     Codeword
0000     0                    x^2 + 1  0                        000000
0001     1                    x^2 + 1  x^2 + 1                  000101
0010     x                    x^2 + 1  x^3 + x                  001010
0011     x + 1                x^2 + 1  x^3 + x^2 + x + 1        001111
0100     x^2                  x^2 + 1  x^4 + x^2                010100
0101     x^2 + 1              x^2 + 1  x^4 + 1                  010001
0110     x^2 + x              x^2 + 1  x^4 + x^3 + x^2 + x      011110
0111     x^2 + x + 1          x^2 + 1  x^4 + x^3 + x + 1        011011
1000     x^3                  x^2 + 1  x^5 + x^3                101000
1001     x^3 + 1              x^2 + 1  x^5 + x^3 + x^2 + 1      101101
1010     x^3 + x              x^2 + 1  x^5 + x                  100010
1011     x^3 + x + 1          x^2 + 1  x^5 + x^2 + x + 1        100111
1100     x^3 + x^2            x^2 + 1  x^5 + x^4 + x^3 + x^2    111100
1101     x^3 + x^2 + 1        x^2 + 1  x^5 + x^4 + x^3 + 1      111001
1110     x^3 + x^2 + x        x^2 + 1  x^5 + x^4 + x^2 + x      110110
1111     x^3 + x^2 + x + 1    x^2 + 1  x^5 + x^4 + x + 1        110011

Page 53: CRC Code

Linear block code
  – Has G-matrix and H-matrix
  – G-matrix shifted version of generator polynomial

$G = \begin{bmatrix} g_{n-k} & \cdots & g_1 & g_0 & 0 & \cdots & 0 \\ 0 & g_{n-k} & \cdots & g_1 & g_0 & \cdots & 0 \\ \vdots & & \ddots & & & \ddots & \vdots \\ 0 & \cdots & 0 & g_{n-k} & \cdots & g_1 & g_0 \end{bmatrix}$

Page 54: CRC Code Example

(6,4) CRC code generated by $g(x) = x^2 + 1$

$G = \begin{bmatrix} 1 & 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 0 & 1 \end{bmatrix}$

Page 55: Systematic CRC Codes

To obtain systematic CRC code
  – codewords formed using Galois division
    – nice because LFSR can be used for performing division

$r(x) = \text{remainder of } \frac{m(x)\,x^{n-k}}{g(x)}$

$c(x) = m(x)\,x^{n-k} + r(x)$

Page 56: Galois Division Example

Encode $m(x) = x^2 + x$ with $g(x) = x^2 + 1$
  – Requires dividing $m(x)\,x^{n-k} = x^4 + x^3$ by g(x)
  – Remainder $r(x) = x + 1$
    – $c(x) = m(x)\,x^{n-k} + r(x) = (x^2 + x)(x^2) + x + 1 = x^4 + x^3 + x + 1$

            111
          -------
    101 ) 11000
          101
          ---
           110
           101
           ---
            110
            101
            ---
             11   remainder
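The Galois division step can be sketched with integers standing in for GF(2) polynomials (bit i is the coefficient of x^i). The function names are assumptions, and this is one straightforward way to do the division, not necessarily how it is presented in the course.

```python
def gf2_mod(dividend, divisor):
    """Remainder of polynomial division over GF(2) using XOR as subtraction."""
    dlen = divisor.bit_length()
    while dividend.bit_length() >= dlen:
        # Align the divisor under the leading 1 of the dividend and subtract.
        dividend ^= divisor << (dividend.bit_length() - dlen)
    return dividend

def crc_encode(message, generator):
    """Systematic codeword: message shifted by deg(g), then remainder appended."""
    degree = generator.bit_length() - 1
    shifted = message << degree
    return shifted | gf2_mod(shifted, generator)

remainder = gf2_mod(0b11000, 0b101)          # (x^4 + x^3) mod (x^2 + 1)
codeword = crc_encode(0b0110, 0b101)         # m(x) = x^2 + x
all_codewords = [crc_encode(m, 0b101) for m in range(16)]
```

Because the message bits are left untouched in the high positions, the resulting code is systematic: the receiver can read the message without decoding.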

Page 57: (6,4) systematic CRC codewords formed as c(x) = m(x)x^(n-k) + r(x)

Message  m(x)                 g(x)     r(x)    c(x)                     Codeword
0000     0                    x^2 + 1  0       0                        000000
0001     1                    x^2 + 1  1       x^2 + 1                  000101
0010     x                    x^2 + 1  x       x^3 + x                  001010
0011     x + 1                x^2 + 1  x + 1   x^3 + x^2 + x + 1        001111
0100     x^2                  x^2 + 1  1       x^4 + 1                  010001
0101     x^2 + 1              x^2 + 1  0       x^4 + x^2                010100
0110     x^2 + x              x^2 + 1  x + 1   x^4 + x^3 + x + 1        011011
0111     x^2 + x + 1          x^2 + 1  x       x^4 + x^3 + x^2 + x      011110
1000     x^3                  x^2 + 1  x       x^5 + x                  100010
1001     x^3 + 1              x^2 + 1  x + 1   x^5 + x^2 + x + 1        100111
1010     x^3 + x              x^2 + 1  0       x^5 + x^3                101000
1011     x^3 + x + 1          x^2 + 1  1       x^5 + x^3 + x^2 + 1      101101
1100     x^3 + x^2            x^2 + 1  x + 1   x^5 + x^4 + x + 1        110011
1101     x^3 + x^2 + 1        x^2 + 1  x       x^5 + x^4 + x^2 + x      110110
1110     x^3 + x^2 + x        x^2 + 1  1       x^5 + x^4 + x^3 + 1      111001
1111     x^3 + x^2 + x + 1    x^2 + 1  0       x^5 + x^4 + x^3 + x^2    111100

Page 58: Generating Check Bits for CRC Code

Use LFSR
  – With characteristic polynomial equal to g(x)
  – Append n-k 0’s to end of message
Example: $m(x) = x^2 + x + 1$ and $g(x) = x^3 + x + 1$

[Figure: 3-bit LFSR with initial state 0 0 0; input 111000 (message 111 followed by appended 0’s); final state 0 1 0]

Final state after shifting equals remainder

Page 59: Checking CRC Codeword

Checking Received Codeword for Errors
  – Shift codeword into LFSR
    – with same characteristic polynomial as used to generate it
  – If final state of LFSR non-zero, then error

[Figure: 3-bit LFSR with initial state 0 0 0; codeword to check 111010]
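The bit-serial behaviour of the LFSR can be modelled in software. The sketch below computes the same remainder the LFSR would hold after shifting, but it is a behavioural model with assumed names, not a description of the exact flip-flop and XOR placement in the figure.

```python
def lfsr_remainder(bits, generator):
    """Shift bits in MSB first; return the final register contents as an int.

    `generator` includes the leading term, e.g. 0b1011 for x^3 + x + 1.
    """
    degree = generator.bit_length() - 1
    state = 0
    for bit in bits:
        state = (state << 1) | bit
        if state >> degree:          # a term of degree n-k appeared
            state ^= generator       # feedback subtracts g(x)
    return state

G = 0b1011                                            # g(x) = x^3 + x + 1
check_bits = lfsr_remainder([1, 1, 1, 0, 0, 0], G)    # message 111 plus three 0's
good = lfsr_remainder([1, 1, 1, 0, 1, 0], G)          # codeword 111010
bad = lfsr_remainder([1, 1, 0, 0, 1, 0], G)           # same codeword with one bit flipped
```

The same routine serves both roles described above: generating check bits from a zero-padded message and flagging a received word whose final state is non-zero.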

Page 60: Selecting Generator Polynomial

Key issue for CRC Codes
  – If first and last bit of polynomial are 1
    – Will detect burst errors of length n-k or less
  – If generator polynomial is multiple of (x+1)
    – Will detect any odd number of errors
  – If $g(x) = (x+1)\,p(x)$ where p(x) primitive of degree n-k-1 and $n < 2^{n-k-1}$
    – Will detect single, double, triple, and odd errors

Page 61: Commonly Used CRC Generators

CRC code                        Generator Polynomial
CRC-5 (USB token packets)       x^5 + x^2 + 1
CRC-12 (Telecom systems)        x^12 + x^11 + x^3 + x^2 + x + 1
CRC-16-CCITT (X25, Bluetooth)   x^16 + x^12 + x^5 + 1
CRC-32 (Ethernet)               x^32 + x^26 + x^23 + x^22 + x^16 + x^12 + x^11 + x^10 + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1
CRC-64 (ISO)                    x^64 + x^4 + x^3 + x + 1

Page 62: Fault Tolerance Schemes

Adding Fault Tolerance to Design
  – Improves dependability of system
  – Requires redundancy
    – Hardware
    – Time
    – Information

Page 63: Hardware Redundancy

Involves replicating hardware units
  – At any level of design
    – gate-level, module-level, chip-level, board-level
Three Basic Forms
  – Static (also called Passive)
    – Masks faults rather than detects them
  – Dynamic (also called Active)
    – Detects faults and reconfigures to spare hardware
  – Hybrid
    – Combines active and passive approaches

Page 64: Static Redundancy

Masks faults so no erroneous outputs
  – Provides uninterrupted operation
  – Important for real-time systems
    – No time to reconfigure or retry operation
  – Simple self-contained
    – No need to update or rollback system state

Page 65: Triple Module Redundancy (TMR)

Well-known static redundancy scheme
  – Three copies of module
  – Use majority voter to determine final output
  – Error in one module out-voted by other two

[Figure: Module 1, Module 2, and Module 3 feeding a Majority Voter]
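A majority voter reduces to a 2-of-3 function on each output bit. The sketch below models module outputs as integers and votes bitwise; the function name and the sample values are assumptions for illustration.

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority of three module outputs."""
    return (a & b) | (a & c) | (b & c)

correct = 0b1011
faulty = 0b0011                      # hypothetical module with one wrong bit

masked = majority_vote(correct, correct, faulty)    # one bad module is out-voted
overruled = majority_vote(correct, faulty, faulty)  # two agreeing bad modules win
```

The second case shows the limit of the scheme: TMR masks a fault in one module, but two modules failing the same way defeat the voter.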

Page 66: TMR Reliability and MTTF

TMR works if any 2 modules work
  – $R_m$ = reliability of each module
  – $R_v$ = reliability of voter

$R_{TMR} = R_v\left[R_m^3 + C^3_2 R_m^2 (1 - R_m)\right] = R_v\,(3R_m^2 - 2R_m^3)$

MTTF for TMR

$MTTF_{TMR} = \int_0^\infty R_{TMR}\,dt = \int_0^\infty R_v\,(3R_m^2 - 2R_m^3)\,dt = \int_0^\infty e^{-\lambda_v t}\left(3e^{-2\lambda_m t} - 2e^{-3\lambda_m t}\right)dt = \frac{3}{2\lambda_m + \lambda_v} - \frac{2}{3\lambda_m + \lambda_v}$

Page 67: Comparison with Simplex

Neglecting fault rate of voter

$MTTF_{TMR} = \frac{3}{2\lambda_m} - \frac{2}{3\lambda_m} = \frac{5}{6\lambda_m} = \frac{5}{6} \cdot \frac{1}{\lambda_m} = \frac{5}{6}\,MTTF_{simplex}$

TMR has lower MTTF, but
  – Can tolerate temporary faults
  – Higher reliability for short mission times

Page 68: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Comparison with Simplex

Crossover point

$$R_{TMR} = R_{simplex}$$
$$3e^{-2\lambda_m t} - 2e^{-3\lambda_m t} = e^{-\lambda_m t}$$
$$\text{Solve: } t = \frac{\ln 2}{\lambda_m} \approx 0.7\,MTTF_{simplex}$$

$R_{TMR} > R_{simplex}$ when
  Mission time shorter than 70% of MTTF
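The crossover can be checked numerically. Substituting x = e^{-λ_m t} turns the equality 3x^2 - 2x^3 = x into 2x^2 - 3x + 1 = 0, whose nontrivial root is x = 1/2, that is t = ln 2 / λ_m. The sketch below assumes a perfect voter; the function names are illustrative.

```python
import math


def r_simplex(t: float, lam: float) -> float:
    return math.exp(-lam * t)


def r_tmr(t: float, lam: float) -> float:
    x = math.exp(-lam * t)
    return 3 * x**2 - 2 * x**3


lam = 1.0                      # failure rate, so MTTF_simplex = 1 / lam
t_cross = math.log(2) / lam    # about 0.69 * MTTF_simplex

# TMR wins before the crossover and loses after it.
tmr_better_early = r_tmr(0.5, lam) > r_simplex(0.5, lam)
tmr_worse_late = r_tmr(1.0, lam) < r_simplex(1.0, lam)
```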

Page 69: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

N-Modular Redundancy (NMR)

NMR
  N modules along with majority voter
    – TMR is the special case N = 3
  Number of failed modules masked = (N-1)/2 for odd N
  As N increases, MTTF decreases
    – But reliability for short missions increases

If goal is only to tolerate temporary faults
  TMR sufficient
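The trade-off between N and mission time can be illustrated with the binomial form of NMR reliability: the system works when a majority of the N modules work. This is a sketch assuming independent module failures, odd N, and a perfect voter; `r_nmr` is a hypothetical helper name.

```python
from math import comb


def r_nmr(n: int, rm: float) -> float:
    """Probability that at least a majority of n modules (each with reliability rm) work."""
    majority = n // 2 + 1
    return sum(comb(n, k) * rm**k * (1 - rm) ** (n - k) for k in range(majority, n + 1))


# High module reliability (short mission): more modules help.
short_mission = (r_nmr(3, 0.9), r_nmr(5, 0.9))
# Low module reliability (long mission): more modules hurt.
long_mission = (r_nmr(3, 0.4), r_nmr(5, 0.4))
```

For N = 3 the sum reduces to the TMR expression 3R_m^2 - 2R_m^3.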

Page 70: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Interwoven Logic

Replace each gate with 4 gates using an interconnection pattern that automatically corrects errors
Traditionally not as attractive as TMR
  Requires lots of area overhead
Renewed interest by researchers investigating emerging nanoelectronic technologies

Page 71: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Interwoven Logic with 4 NOR Gates

[Figure: a circuit of four NOR gates (1 to 4) with inputs X and Y, and its interwoven version in which each gate is replaced by four copies (1a-1d, 2a-2d, 3a-3d, 4a-4d)]

Page 72: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Example of Error on Third Y Input

[Figure: the interwoven circuit (gates 1a-4d, inputs X and Y) annotated with logic values for an error on the third Y input, shown alongside the original four-gate circuit]

Page 73: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Dynamic Redundancy

Involves
  Detecting fault
  Locating faulty hardware unit
  Reconfiguring system to use spare fault-free hardware unit

Page 74: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Unpowered (Cold) Spares

Advantage
  Extends lifetime of spares

Equations
  Assume spare not failing until powered
  Perfect reconfiguration capability

$$R_{w/\,cold\_spare}(t) = (1 + \lambda t)\,e^{-\lambda t}$$
$$MTTF_{w/\,cold\_spare} = \frac{2}{\lambda}$$
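The cold-spare expressions follow from a short derivation, sketched here under the slide's assumptions (the spare cannot fail while unpowered, and detection and reconfiguration are perfect) plus the assumption that primary and spare share the same constant failure rate λ. The system survives to time t if the primary survives, or if the primary fails at some time τ and the spare then survives from τ to t.

```latex
\begin{align*}
R_{w/\,cold\_spare}(t)
  &= e^{-\lambda t}
   + \int_0^t \lambda e^{-\lambda \tau}\, e^{-\lambda (t-\tau)}\, d\tau \\
  &= e^{-\lambda t} + \lambda t\, e^{-\lambda t}
   = (1 + \lambda t)\, e^{-\lambda t} \\
MTTF_{w/\,cold\_spare}
  &= \int_0^\infty (1 + \lambda t)\, e^{-\lambda t}\, dt
   = \frac{1}{\lambda} + \frac{1}{\lambda}
   = \frac{2}{\lambda}
\end{align*}
```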

Page 75: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Unpowered (Cold) Spares

One cold spare doubles MTTF
  Assuming faults always detected and reconfiguration circuitry never fails
Drawback of cold spare
  Extra time to power and initialize
  Cannot be used to help in detecting faults
  Fault detection requires either
    – periodic offline testing
    – online testing using time or information redundancy

Page 76: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Powered (Hot) Spares

Can use spares for online fault detection
One approach is duplicate-and-compare
  If outputs mismatch then fault occurred
    – Run diagnostic procedure to determine which module is faulty and replace with spare
  Any number of spares can be used

[Figure: Module A and Module B drive a comparator that gives the output and an agree/disagree signal; a spare module stands by]

Page 77: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Pair-and-a-Spare

Avoids halting system to run diagnostic procedure when fault occurs

[Figure: two duplicate-and-compare pairs (Modules A/B and Modules C/D), each with a comparator giving an output and an agree/disagree signal; a switch selects which pair drives the system output]

Page 78: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

TMR/Simplex

When one module in TMR fails
  Disconnect one of remaining modules
  Improves MTTF while retaining advantages of TMR when 3 good modules
TMR/Simplex
  Reliability always better than either TMR or Simplex alone
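The claim that TMR/Simplex is never worse than either scheme can be checked with a closed form that is not on the slide. Assuming a perfect voter, perfect detection and switching, and a constant module failure rate λ, the system survives if all three modules survive, or if the first failure occurs at some time τ and the retained module survives to t. That gives R = 1.5e^{-λt} - 0.5e^{-3λt}. The function names below are illustrative.

```python
import math


def r_simplex(t: float, lam: float) -> float:
    return math.exp(-lam * t)


def r_tmr(t: float, lam: float) -> float:
    x = math.exp(-lam * t)
    return 3 * x**2 - 2 * x**3


def r_tmr_simplex(t: float, lam: float) -> float:
    """TMR/Simplex reliability under the idealized assumptions stated above."""
    return 1.5 * math.exp(-lam * t) - 0.5 * math.exp(-3 * lam * t)


lam = 1.0
# Normalized mission times t * lam from 0.1 to 2.0.
times = [0.1 * i for i in range(1, 21)]
always_best = all(
    r_tmr_simplex(t, lam) >= max(r_tmr(t, lam), r_simplex(t, lam)) for t in times
)
```

Integrating the same expression gives an MTTF of 4/(3λ) under these assumptions, compared with 1/λ for simplex and 5/(6λ) for TMR.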

Page 79: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Comparison of Reliability vs Time

[Plot: reliability (0.3 to 1) versus normalized mission time T/MTTF (0 to 1) for Simplex, TMR, and TMR/Simplex]

Page 80: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Hybrid Redundancy

Combines both static and dynamic redundancy
  Masks faults like static
  Detects and reconfigures like dynamic

Page 81: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

TMR with Spares

If TMR module fails
  Replace with spare
    – can be either hot or cold spare
While system has three working modules
    – TMR will provide fault masking for uninterrupted operation

Page 82: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Self-Purging Redundancy

Uses threshold voter instead of majority voter
  Threshold voter outputs 1 if number of inputs that are 1 is at least the threshold
    – Otherwise outputs 0
  Requires hot spares
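A behavioral sketch of self-purging may help. It assumes single-bit module outputs, a voter that outputs 1 when at least `threshold` enabled inputs are 1, and an elementary switch per module that permanently forces the module's input to the voter to 0 once the module disagrees with the voter. The class name `SelfPurging` and its interface are hypothetical.

```python
class SelfPurging:
    def __init__(self, n_modules: int = 5, threshold: int = 2):
        self.enabled = [True] * n_modules   # elementary switch state per module
        self.threshold = threshold

    def step(self, module_outputs):
        # Purged modules contribute 0 to the voter.
        gated = [o if en else 0 for o, en in zip(module_outputs, self.enabled)]
        out = 1 if sum(gated) >= self.threshold else 0
        # Any enabled module that disagrees with the voter purges itself.
        for i, o in enumerate(module_outputs):
            if self.enabled[i] and o != out:
                self.enabled[i] = False
        return out


sp = SelfPurging()
outputs = [
    sp.step([1, 1, 1, 1, 0]),  # module 5 fails (correct value 1) and is purged
    sp.step([0, 0, 0, 1, 1]),  # module 4 fails (correct value 0) and is purged
    sp.step([1, 1, 0, 0, 0]),  # module 3 fails (correct value 1) and is purged
    sp.step([1, 1, 1, 1, 1]),  # two good modules still reach the threshold
    sp.step([0, 0, 1, 1, 1]),  # purged modules are ignored
]
```

With threshold 2, two modules failing to 1 in the same cycle would reach the threshold on their own, which is why simultaneous double failures are not tolerated.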

Page 83: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Self-Purging Redundancy

[Figure: Modules 1 to 5 each connect through an elementary switch to a threshold voter (threshold 2). Inset, elementary switch: a flip-flop with S and R inputs and an initialization signal, plus an AND gate between the module and the voter]

Page 84: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Self-Purging Redundancy

Compared with 5MR
  Self-purging with 5 modules
    – Can tolerate up to 3 failing modules (5MR cannot)
    – Cannot tolerate two modules failing simultaneously (5MR can)
Compared with TMR with 2 spares
  Self-purging with 5 modules
    – Simpler reconfiguration circuitry
    – Requires hot spares (TMR with spares can use either hot or cold spares)

Page 85: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Time Redundancy

Advantage
  Less hardware
Drawback
  Cannot detect permanent faults
If error detected
  System needs to roll back to known good state before resuming operation

Page 86: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Repeated Execution

Repeat operation twice
  Simplest time redundancy approach
  Detects temporary faults occurring during one execution (but not both)
    – Causes mismatch in results
  Can reuse same hardware for both executions
    – Only one copy of functional hardware needed

Page 87: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Repeated Execution

Requires mechanism for storing and comparing results of both executions
  In processor, can store in memory or on disk and use software to compare
Main cost
  Additional time for redundant execution and comparison
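In software, repeated execution amounts to running the operation twice and comparing the stored results. The sketch below is illustrative only: `execute_with_repeat`, `MismatchError`, and the fault-injecting `make_glitchy_adder` are hypothetical names, and the glitch is simulated by flipping one result bit on a chosen call.

```python
class MismatchError(Exception):
    """Raised when the two executions disagree."""


def execute_with_repeat(op, *args):
    first = op(*args)    # result of first execution is stored
    second = op(*args)   # same hardware/software is reused
    if first != second:
        raise MismatchError(f"results differ: {first} vs {second}")
    return first


def make_glitchy_adder(glitch_on_call: int):
    """Adder that flips the low result bit on the given call number (1-based)."""
    calls = {"n": 0}

    def add(a, b):
        calls["n"] += 1
        result = a + b
        if calls["n"] == glitch_on_call:
            result ^= 1  # simulated temporary fault
        return result

    return add


clean_result = execute_with_repeat(make_glitchy_adder(0), 2, 3)  # call 0 never occurs
```

A fault that corrupts both executions in the same way would go unnoticed, which is why this approach cannot detect permanent faults.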

Page 88: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Multi-threaded Redundant Execution

Can use in processor-based system that can run multiple threads
  Two copies of thread executed concurrently
  Results compared when both complete
  Takes advantage of processor's built-in capability to exploit processing resources
    – Reduces execution time
    – Can significantly reduce performance penalty

Page 89: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Multiple Sampling of Outputs

Done at circuit level
  Sample once at end of normal clock cycle
  Sample again after delay of t
  Two samples compared to detect mismatch
    – Indicates error occurred
  Detects fault whose duration is less than t
  Performance overhead depends on
    – Size of t relative to normal clock period

Page 90: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Multiple Sampling of Outputs

Simple approach using two latches

[Figure: a main latch clocked by Clk and a shadow latch clocked by Clk+t; comparing the two latched values gives the error signal]

Page 91: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Multiple Sampling of Outputs

Approach using stability checker at output

[Figure: timing diagram showing normal clock periods, intervals t, and stability checking periods; gate-level stability checker (AND and OR gates) with a checking-period input and an error signal output]

Page 92: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Diverse Recomputation

Use same hardware, but perform computation differently second time
  Can detect permanent faults that affect only one computation
For arithmetic or logical operations
  Shift operands when performing second computation [Patel 1982]
  Detects permanent fault affecting only one bit-slice
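Recomputation with shifted operands can be illustrated on a simplified logic unit. The sketch assumes a bitwise AND unit that is one bit wider than the data, with a permanent stuck-at-0 fault on one output bit-slice; `make_and_unit` and `reso_check` are hypothetical names, and a real arithmetic unit would also have to handle carries across slices.

```python
def make_and_unit(stuck_slice=None):
    """Bitwise AND unit; optionally output bit `stuck_slice` is stuck at 0."""

    def unit(a: int, b: int) -> int:
        result = a & b
        if stuck_slice is not None:
            result &= ~(1 << stuck_slice)  # permanent fault in one bit-slice
        return result

    return unit


def reso_check(unit, a: int, b: int):
    """Return (result, error_detected) using a second pass with shifted operands."""
    plain = unit(a, b)
    shifted = unit(a << 1, b << 1) >> 1  # same data flows through different slices
    return plain, plain != shifted


faulty = make_and_unit(stuck_slice=2)
a = b = 0b0100

repeat_detects = faulty(a, b) != faulty(a, b)  # plain repetition sees no mismatch
result, shifted_detects = reso_check(faulty, a, b)
```

In the second pass the data bit that used bit-slice 2 now uses bit-slice 3, so the stuck slice corrupts only one of the two results and the comparison exposes it.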

Page 93: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Information Redundancy

Based on Error Detecting and Correcting Codes
Advantage
  Detects both permanent and temporary faults
  Implemented with less hardware overhead than using multiple copies of module
Disadvantage
  More complex design

Page 94: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Error Detection

Error detecting codes used to detect errors
  If error detected
    – Roll back to previous known error-free state
    – Retry operation

Page 95: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Rollback

Requires adding storage to save previous state
  Amount of rollback depends on latency of error detection mechanism
  Zero-latency error detection
    – rollback implemented by preventing system state from updating
  If errors detected after n cycles
    – need rollback restoring system to state at least n clock cycles earlier

Page 96: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Checkpoint

Execution divided into set of operations
  Before each operation executed
    – checkpoint created where system state saved
  If any error detected during operation
    – roll back to last checkpoint and retry operation
  If multiple retries fail
    – operation halts and system flags that permanent fault has occurred
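The checkpoint-and-retry loop described above can be sketched as follows. The sketch assumes the system state is a dictionary, each operation mutates it in place, and the error-detection mechanism signals an error by raising an exception; `DetectedError`, `PermanentFault`, and `run_with_checkpoints` are hypothetical names.

```python
import copy


class DetectedError(Exception):
    """Raised by an operation when the error detection mechanism fires."""


class PermanentFault(Exception):
    """Raised when retries are exhausted."""


def run_with_checkpoints(state: dict, operations, max_retries: int = 3) -> dict:
    for op in operations:
        checkpoint = copy.deepcopy(state)        # save state before the operation
        for _attempt in range(max_retries + 1):
            try:
                op(state)
                break                            # operation completed without error
            except DetectedError:
                state.clear()                    # roll back to the checkpoint
                state.update(copy.deepcopy(checkpoint))
        else:
            raise PermanentFault(f"{op.__name__} failed after {max_retries} retries")
    return state


attempts = {"n": 0}


def flaky_increment(state):
    state["x"] += 1                              # state is updated before the error shows
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise DetectedError()                    # temporary fault on first attempt only


def double(state):
    state["x"] *= 2


final = run_with_checkpoints({"x": 1}, [flaky_increment, double])
```

The rollback matters because the failed attempt has already modified the state; without restoring the checkpoint the retry would increment twice.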

Page 97: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Error Detection

Encode outputs of circuit with error detecting code
  Non-codeword output indicates error

[Figure: m inputs feed the functional logic (k outputs) and a check bit generator (c check bits); a checker examines the k outputs together with the c check bits and raises an error indication]

Page 98: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Self-Checking Checker

Has two outputs
  Normal error-free case: (1,0) or (0,1)
  If equal to each other, then error: (0,0) or (1,1)
Cannot have single error indicator output
    – Stuck-at-0 fault on output could never be detected

Page 99: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Totally Self-Checking Checker

Requires three properties
  Code Disjoint
    – all codeword inputs mapped to codeword outputs, and all non-codeword inputs to non-codeword outputs
  Fault Secure
    – for all codeword inputs, checker in presence of fault will either produce correct codeword output or non-codeword output (not incorrect codeword)
  Self-Testing
    – for each fault, at least one codeword input gives error indication

Page 100: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Duplicate-and-Compare

Equality checker indicates error
  Undetected error can occur only if common-mode fault affecting both copies
  Only faults after stems detected
  Over 100% overhead (including checker)

[Figure: primary inputs fan out at stems to two copies of the functional logic; an equality checker compares their outputs and gives the error indication]

Page 101: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Single-Bit Parity Code

Totally self-checking checker formed by removing final gate from XOR tree

[Figure: functional logic and parity prediction logic feed the checker, which produces the two error indication outputs EI0 and EI1]

Page 102: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Single-Bit Parity Code

Cannot detect an even number of bit errors
  Can ensure no even-bit errors by generating each output with independent cone of logic
    – Only single-bit errors can occur due to single-point fault
    – Typically requires a lot of overhead

Page 103: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Parity-Check Codes

Each check bit is parity for some set of output bits

Example: 6 outputs and 3 check bits

                 Z1 Z2 Z3 Z4 Z5 Z6 c1 c2 c3
Parity Group 1    1  0  0  1  1  0  1  0  0
Parity Group 2    0  1  1  0  0  0  0  1  0
Parity Group 3    0  0  0  0  0  1  0  0  1
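The example code above can be exercised directly. In this sketch the parity groups are copied from the table (c1 covers Z1, Z4, Z5; c2 covers Z2, Z3; c3 covers Z6), even parity is assumed, and the helper names are illustrative.

```python
# Zero-based indices of the outputs covered by each check bit, from the table.
PARITY_GROUPS = [
    (0, 3, 4),  # c1: Z1, Z4, Z5
    (1, 2),     # c2: Z2, Z3
    (5,),       # c3: Z6
]


def check_bits(z):
    """Even-parity check bits for the six outputs z = [Z1..Z6]."""
    return [sum(z[i] for i in group) % 2 for group in PARITY_GROUPS]


def is_codeword(z, c):
    return check_bits(z) == list(c)


z = [1, 0, 1, 1, 0, 1]
c = check_bits(z)

single_error = [1, 1, 1, 1, 0, 1]       # Z2 flipped
double_same_group = [1, 1, 0, 1, 0, 1]  # Z2 and Z3 flipped, both in group 2
```

The single error is detected, but the double error inside one parity group is not, which is why the grouping should follow the fanout structure of the circuit.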

Page 104: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Parity-Check Codes

For c check bits and k functional outputs
  $2^{ck}$ possible parity-check codes
  Can choose code based on structure of circuit to minimize undetected error combinations
  Fanouts in circuit determine possible error combinations due to single-point fault

Page 105: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Checker for Parity-Check Codes

Constructed from single-bit parity checkers and two-rail checkers

[Figure: three parity checkers, over (Z1, Z4, Z5, c1), (Z2, Z3, c2), and (Z6, c3), feed two two-rail checkers that produce the outputs E0 and E1]

Page 106: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Two-Rail Checkers

Totally self-checking two-rail checker

[Figure: AND-OR circuit with inputs A0, A1, B0, B1 and outputs C0, C1]
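The logic of a two-rail checker can be written out explicitly. The sketch uses the commonly cited AND-OR form, C0 = A0·B0 + A1·B1 and C1 = A0·B1 + A1·B0; the exact wiring of the slide's figure is not recoverable from the transcript, so treat these equations as an assumption.

```python
def two_rail_checker(a0: int, a1: int, b0: int, b1: int):
    """Combine two two-rail pairs (a0, a1) and (b0, b1) into one pair (c0, c1)."""
    c0 = (a0 & b0) | (a1 & b1)
    c1 = (a0 & b1) | (a1 & b0)
    return c0, c1


# Enumerate all 16 input combinations and record whether the output pair is
# complementary exactly when both input pairs are complementary.
code_disjoint = True
for n in range(16):
    a0, a1, b0, b1 = [(n >> i) & 1 for i in range(4)]
    c0, c1 = two_rail_checker(a0, a1, b0, b1)
    inputs_valid = (a0 != a1) and (b0 != b1)
    if (c0 != c1) != inputs_valid:
        code_disjoint = False
```

Because the output is itself a two-rail pair, checkers can be cascaded in a tree to reduce many pairs to a single pair.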

Page 107: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Berger Codes

Inverter-free circuit
  Inverters only at primary inputs
  Can be synthesized using only algebraic factoring [Jha 1993]
  Only unidirectional errors possible for single-point faults
    – Can use unidirectional code
    – Berger code gives 100% coverage
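A Berger code appends a binary count of the 0s in the data bits. The sketch below assumes that convention and a most-significant-bit-first check field; `berger_encode` and `berger_is_codeword` are illustrative names.

```python
def berger_encode(data):
    """Append the binary count of zeros in `data` (a list of 0/1 bits)."""
    k = len(data)
    width = k.bit_length()            # enough bits to count 0..k zeros
    zeros = data.count(0)
    check = [(zeros >> i) & 1 for i in reversed(range(width))]
    return data + check


def berger_is_codeword(word, k):
    """True if the check field of `word` matches its first k data bits."""
    return berger_encode(word[:k]) == word


codeword = berger_encode([1, 0, 1, 1])          # one zero -> check field 001

# Unidirectional 1->0 error hitting two data bits and one check bit.
corrupted = [0, 0, 0, 1, 0, 0, 0]
```

A 1-to-0 error can only raise the zero count of the data while lowering the stored count (and the reverse for 0-to-1 errors), so the two can never agree again after a unidirectional error.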

Page 108: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Constant Weight Codes

Non-separable with lower redundancy
  Drawback: need decoding logic to convert codeword back to its original binary value
  Can use for encoding states of FSM
    – No need for decoding logic

Page 109: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Error Correction

Information redundancy can also be used to mask errors
  Not as attractive as TMR because logic for predicting check bits very complex
  However, very good for memories
    – Check bits stored with data
    – Errors do not propagate in memories as in logic circuits, so SEC-DED usually sufficient

Page 110: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Error Correction

Memories very dense and prone to errors
  Especially due to single-event upsets (SEUs) from radiation
SEC-DED check bits stored in memory
  32-bit word, SEC-DED requires 7 check bits
    – Increases size of memory by 7/32 = 21.9%
  64-bit word, SEC-DED requires 8 check bits
    – Increases size of memory by 8/64 = 12.5%
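The check-bit counts quoted above follow from the Hamming bound for single-error correction, 2^c ≥ k + c + 1 for k data bits, plus one extra overall parity bit for double-error detection. A small sketch (the function name is illustrative):

```python
def sec_ded_check_bits(k: int) -> int:
    """Minimum number of check bits for a SEC-DED code over k data bits."""
    c = 1
    while 2**c < k + c + 1:   # Hamming condition for single-error correction
        c += 1
    return c + 1              # one more bit (overall parity) for double-error detection


overhead_32 = sec_ded_check_bits(32) / 32
overhead_64 = sec_ded_check_bits(64) / 64
```

The relative overhead shrinks as the word gets wider, since the number of check bits grows only logarithmically with the word size.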

Page 111: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Memory ECC Architecture

[Figure: on a write, the data word in passes through a check bit generator, and the write data word and write check bits are stored in memory. On a read, check bits are calculated from the read data word and combined with the read check bits to generate a syndrome, which is used to correct the data before it leaves as data word out]

Page 112: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Hamming Code for ECC RAM

[Figure: RAM core of N words with Z+c+1 bits/word. Generate side: the input data (Z bits) feeds a Hamming check bit generator (c bits) and a parity bit generator. Detect/correct side: a second Hamming check bit generator and parity bit generator feed the Hamming check and parity check, which drive the bit error correction circuit producing the output data]

Error Type                     Condition
No bit error                   Hamming check bits match, no parity error
Single-bit correctable error   Hamming check bits mismatch, parity error
Double-bit error detection     Hamming check bits mismatch, no parity error

                 Z1 Z2 Z3 Z4 Z5 Z6 Z7 Z8 c1 c2 c3 c4
Parity Group 1    1  1  0  1  1  0  1  0  1  0  0  0
Parity Group 2    1  0  1  1  0  1  1  0  0  1  0  0
Parity Group 3    0  1  1  1  0  0  0  1  0  0  1  0
Parity Group 4    0  0  0  0  1  1  1  1  0  0  0  1
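The detect/correct decision table can be turned into a small decoder. This sketch uses the four parity groups from the slide for the 8 data bits, and assumes the extra parity bit is even parity over the data and Hamming check bits. The function names, and the handling of the case the slide does not list (check bits match but parity fails), are assumptions.

```python
# Data-bit membership of each Hamming parity group (columns Z1..Z8 of the table).
GROUPS = [
    [1, 1, 0, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 1, 1, 0],
    [0, 1, 1, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
]


def hamming_checks(data):
    return [sum(g * d for g, d in zip(row, data)) % 2 for row in GROUPS]


def encode(data):
    """Return (data, check bits, overall parity bit) to be stored."""
    checks = hamming_checks(data)
    return data, checks, sum(data + checks) % 2


def decode(data, checks, parity):
    """Return (possibly corrected data, status string)."""
    syndrome = [a ^ b for a, b in zip(hamming_checks(data), checks)]
    parity_error = sum(data + checks) % 2 != parity
    if not any(syndrome):
        # Assumed handling: a lone parity mismatch means the parity bit itself flipped.
        return data, "parity-bit error" if parity_error else "no error"
    if not parity_error:
        return data, "double-bit error detected"
    corrected = list(data)
    columns = [[row[i] for row in GROUPS] for i in range(len(data))]
    if syndrome in columns:                 # syndrome names the failing data bit
        corrected[columns.index(syndrome)] ^= 1
    # Otherwise the flipped bit was a check bit and the data is already correct.
    return corrected, "single-bit error corrected"


original = [1, 0, 1, 1, 0, 0, 1, 0]
stored_data, stored_checks, stored_parity = encode(original)
```

Every data column of the table is distinct and has at least two 1s, so each single-bit error produces a unique syndrome that identifies the bit to flip.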

Page 113: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Memory ECC

SEC-DED generally very effective
  Memory bit-flips tend to be independent and uniformly distributed
  If bit-flip occurs, gets corrected next time memory location accessed
  Main risk is if memory word not accessed for long time
    – Multiple bit-flips could accumulate

Page 114: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Memory Scrubbing

Every location in memory read on periodic basis
  Reduces chance of multiple errors accumulating in a memory word
  Can be implemented by having memory controller cycle through memory during idle periods

Page 115: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Multiple-Bit Upsets (MBU)

Can occur due to single SEU
  Typically occur in adjacent memory cells
Memory interleaving used
  To prevent MBUs from resulting in multiple bit errors in same word

Memory layout with four words interleaved:

Word1 Word2 Word3 Word4 Word1 Word2 Word3 Word4 Word1 Word2 Word3 Word4
Bit1  Bit1  Bit1  Bit1  Bit2  Bit2  Bit2  Bit2  Bit3  Bit3  Bit3  Bit3
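The interleaved layout shown in the figure maps physical columns to logical (word, bit) positions. The sketch below assumes an interleaving factor of 4 as in the figure and 1-based numbering; the helper names are illustrative.

```python
INTERLEAVE = 4  # four words share one physical row, as in the figure


def physical_to_logical(column: int):
    """Map a 0-based physical column to a 1-based (word, bit) pair."""
    return column % INTERLEAVE + 1, column // INTERLEAVE + 1


def words_hit(first_column: int, width: int):
    """Words affected by an upset covering `width` adjacent columns."""
    return [physical_to_logical(c)[0] for c in range(first_column, first_column + width)]


hit = words_hit(2, 3)   # a 3-cell upset starting at physical column 2
```

An upset spanning up to four adjacent cells lands in four different words, so each word sees at most a single-bit error that SEC-DED can correct.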

Page 116: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Type: Long-Life Systems
  Issues: Difficult or expensive to repair
  Goal: Maximize MTTF
  Examples: Satellites, spacecraft, implanted biomedical
  Techniques: Dynamic redundancy

Type: Reliable Real-Time Systems
  Issues: Error or delay catastrophic
  Goal: Fault masking capability
  Examples: Aircraft, nuclear power plant, air bag electronics, radar
  Techniques: TMR

Type: High Availability Systems
  Issues: Downtime very costly
  Goal: High availability
  Examples: Reservation system, stock exchange, telephone systems
  Techniques: No single point of failure; self-checking pairs; fault isolation

Type: High Integrity Systems
  Issues: Data corruption very costly
  Goal: High data integrity
  Examples: Banking, transaction processing, database
  Techniques: Checkpointing; time redundancy; ECC; redundant disks

Type: Mainstream Low-Cost Systems
  Issues: Reasonable level of failures acceptable
  Goal: Meet failure rate expectations at low cost
  Examples: Consumer electronics, personal computers
  Techniques: Often none; memory ECC; bus parity; changing as technology scales

Page 117: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Concluding Remarks

Many different fault-tolerant schemes
Choosing scheme depends on
  Types of faults to be tolerated
    – Temporary or permanent
    – Single or multiple point failures
    – etc.
  Design constraints
    – Area, performance, power, etc.

Page 118: EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design

Concluding Remarks

As technology scales
  Circuits increasingly prone to failure
  Achieving sufficient fault tolerance will be major design issue