shantanu dutt (student involved: hasan arslan) ece dept. university of illinois -chicago

Evaluation of Computer Faults Due to EM Interference: Concepts, Simulation Environment and Some Results

Shantanu Dutt (Student Involved: Hasan Arslan)
ECE Dept.
University of Illinois - Chicago

Outline

• Past Work: General Fault Detection and Tolerance, EMI Faults
• Our Goals
• Our Methodology for Fault Detection and Classification
• Experimental Results
• Conclusions and Future Work

Past Work – General Fault Detection and Tolerance

• Off-line testing (mainly for hard faults)
• Concurrent on-line testing (operational faults):
  - Adding external hardware; monitoring data, address and control lines
• Memory: error-detecting & correcting codes
• Computer systems:
  - Watchdog processor: detecting control flow errors in program execution [Mahmood & McCluskey, TC’88]
  - Algorithm-based fault tolerance: use of some property of the computation for self-checking [Huang & Abraham, TC’84, Dutt & Assad, TC’96]

Past Work On EM/Radiation-Induced Faults

Detection of high level computer failure due to different types of EM signals [Mojert et al., EMC’01]

Failure in real-time communication & control systems from communication line errors due to EM signals [Kohlberg & Carter, EMC’01]

Also: Radiation Hardened Processors: Leon and ERC32 processors (http sites). But primarily only ECC for memory and register file---simple fault tolerance but probably targeting the most likely source of “permanent” faults.

Assumptions/Scenarios of Past Work

Past work on general fault detection:
• Random single (sometimes double) faults
• Deterministic faults
• Types of faults: permanent, transient, intermittent; intermittent type not generally tackled

Past work on EM-induced faults:
• No how/why/what analysis and classification of computer failure due to EM interference

Goals of Our Work

• We will determine and classify the following types of computer system behavioral errors (i.e., program errors) due to different patterns, extent, duration and location of faults:
  - Control flow errors: incorrect sequence of instruction execution. Causes: address generation errors, memory faults, bus faults
  - Data errors. Causes: computation errors, memory & bus faults
  - Termination errors (hung processor & crashes). Causes: control unit (C.U.) transition to dead-end states, invalid instruction, out-of-bound address, divide-by-zero, spurious interrupts (?)
• Note: error types are NOT mutually exclusive
• Provide broad-based recipes for FT and reliable operation
• To the best of our knowledge, this is a more comprehensive analysis of fault effects on a computer system than any attempted previously
• Comprehensive analysis is needed due to the nature of EM effects: all-pervasive, periodic, clustered

Our System of Fault Analysis in a Computer System

[Figure: block diagram of the fault analysis system.
• Computer system = processor + memory + ext. buses; uses a VHDL model of a modern microprocessor (DLX & SuperScalar DLX)
• Fault injection: in each component, control of fault duration, frequency, number, and pattern (random, clustered)
• Observation & error classification
• Detection & classification using CFC & ABFT/data encoding]

Characteristics of Fault Injection Methods -- Previous Work

                  Hardware                        Software
                  With contact   Without contact  Compilation         Runtime
Cost              High           High             Low                 Low
Damage            High           Low              None                None
Trigger           Yes            No               Yes                 Yes
Repeatability     High           Low              High                High
Controllability   High           Low              High                High
Acc. FIP          Chip pin       Chip int.        Reg., mem., soft.   Reg., mem., I/O cont./port

Our Fault Injection Approach

• Fault types (stuck-at 0, stuck-at 1, single random, clustered, multiple random, etc.)

• Inject faults in a “software” model (VHDL) of a computer -- advantages of both the h/w and s/w approaches w/o the disadvantages

• Variable duration and frequency of faults

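[Figure: fault injection setup. Components shown: Memory, DLX CPU, address bus, data bus, and a fault generator on the buses. The fault generator contains Counter_1, Counter_2, a variable-width, variable-period pulse generator, and a MUX with inputs data, 1, 0.]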
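The pulse-generator-plus-MUX idea can be mimicked in software. The following is a minimal Python sketch, not the VHDL used in this work: the function names, the choice of bit flips as the fault model, and the assumption that each pulse starts at the beginning of its repeat period are all illustrative.

```python
import random

def fault_windows(rng, period_range, duration_range, end_time):
    """Return (start, stop) intervals in ns during which the fault is active.

    Assumption: each pulse starts at the beginning of its repeat period.
    """
    windows = []
    t = 0
    while t < end_time:
        period = rng.randint(*period_range)
        duration = min(rng.randint(*duration_range), period)
        windows.append((t, t + duration))
        t += period
    return windows

def corrupt(word, rng, n_bits=4, width=32):
    """Flip n_bits distinct, randomly chosen bits of a bus word."""
    mask = 0
    for position in rng.sample(range(width), n_bits):
        mask |= 1 << position
    return word ^ mask

def bus_value(word, t, windows, rng):
    """MUX: pass the word through unless time t falls in an active window."""
    if any(start <= t < stop for start, stop in windows):
        return corrupt(word, rng)
    return word

rng = random.Random(0)  # fixed seed so a run can be repeated
windows = fault_windows(rng, (305, 425), (5, 25), end_time=2000)
clean = 0x000EF020
faulty = corrupt(clean, rng)
```

Seeding the random generator is what gives the repeatability that the software-model approach is credited with above.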

Methodologies for Control Flow Errors [Mahmood & McCluskey, TC’88]

A node is a block of instructions with a branch at the end

A derived signature of a node is a function (e.g., or, LFSR) of all its instructions

A program graph is one in which there is an arc from node u to v if the branch at u can lead to node v

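[Figure: example program graph with nodes v1-v6. MAIN instruction stream: NOP Sign(v4); ADD r1 r3; LD r2 address; NOP sign(v5); NOP sign(v6); BLT r4 r8 off. WD instruction stream: Sign(v4); BRT v6 v5.]

[Figure: system organization. The processor and memory hierarchy are connected by the memory bus; the watchdog monitors the memory bus and receives a signal from the branch circuit.]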
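A derived signature can be illustrated with a few lines of Python. This is only a sketch: XOR is chosen here as the signature function for brevity (an LFSR would be another option), and the instruction words in the example block are hypothetical.

```python
from functools import reduce

def block_signature(words):
    """Derived signature of a node: XOR of all its 32-bit instruction words."""
    return reduce(lambda a, b: a ^ b, words, 0) & 0xFFFFFFFF

def watchdog_check(fetched_words, expected_signature):
    """Compare the run-time signature with the precomputed one."""
    return block_signature(fetched_words) == expected_signature

# Hypothetical block of three instruction words ending in a branch.
block = [0x20070003, 0xAC670000, 0x00A62820]
reference = block_signature(block)  # computed before execution

corrupted = list(block)
corrupted[1] ^= 1 << 21             # single bit flip on the bus

clean_ok = watchdog_check(block, reference)
fault_seen = not watchdog_check(corrupted, reference)
```

An XOR signature aliases when two faults flip the same bit position in different words of a block, which is one reason to consider stronger signature functions.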

Methodologies for Control Flow Errors: CFC Checking Using a Watchdog

• WD compares the information gathered concurrently to the information previously provided
• Complexity lies between that of current circuit-level and system-level techniques
• 90% error coverage for single errors [Mahmood & McCluskey, TC’88]

[Figure: the program graph (nodes v1-v6) and the MAIN/WD instruction streams, plus the watchdog state diagram. States: Checking START, Wait for new block, Compute Block Sig., Check Block Sig., Check Branch Sig., Error flag. Transition labels: block header detected, new block detected, middle of block, end of block, sig. ok, corrupted sig., error.]

Data Error Methodologies: Algorithm-Based Fault Tolerance

• Data errors are difficult to detect: they occur inside the microprocessor and are not necessarily observable to an external WD processor
• Use properties of the computation to check correctness of computed data
• E.g., the linearity property f(v1 + v2) = f(v1) + f(v2) of a computation f( ) can be used to check it:
  - Pre-compute v’ = v1 + v2 + … + vk (input checksum)
  - Compute f(v1), …, f(vk)
  - Compute u = f(v1) + f(v2) + … + f(vk) (output checksum)
  - Check if f(v’) = u; inequality indicates computation error(s)
• Can be used for linear computations such as matrix multiplication, matrix addition, Gaussian elimination [Huang & Abraham, TC’84], [Dutt & Assad, TC’96]
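The checksum test can be sketched in Python for one linear computation, an integer matrix-vector product. The matrix, the input vectors, and the function names below are illustrative assumptions, not taken from the experiments.

```python
def f(matrix, v):
    """Linear computation under test: matrix-vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in matrix]

def vector_sum(vectors):
    """Element-wise sum of a list of equal-length vectors."""
    return [sum(column) for column in zip(*vectors)]

def abft_check(matrix, inputs, outputs):
    """Return True if f(input checksum) equals the output checksum."""
    input_checksum = vector_sum(inputs)    # v' = v1 + ... + vk
    output_checksum = vector_sum(outputs)  # u  = f(v1) + ... + f(vk)
    return f(matrix, input_checksum) == output_checksum

M = [[1, 2], [3, 4]]
vs = [[1, 0], [2, 5], [-3, 7]]
outs = [f(M, v) for v in vs]

outs_faulty = [list(o) for o in outs]
outs_faulty[1][0] += 1                     # simulated computation error

ok = abft_check(M, vs, outs)
detected = not abft_check(M, vs, outs_faulty)
```

Integer arithmetic keeps the comparison exact; with floating-point data the equality test would need a tolerance.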

Data Error Methodologies: Data Encoding

Data that is numerically processed can be encoded and checked if the output of arithmetic operations is still encoded (e.g., Berger, AN codes)

A simple coding scheme is AN coding: a number N is transformed to A·N, where A is odd, say, 3

Works for addition: 3·N1 + 3·N2 = 3·(N1 + N2) -- check if the result is still a multiple of 3; if not, then error

100% detection of single faults -- a single fault changes the result by +/- 2^i, and so the result is no longer a multiple of 3.
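A small Python sketch of AN coding with A = 3 makes the single-fault argument concrete. The operand values and the 32-bit word width are assumptions for illustration.

```python
A = 3  # odd code multiplier

def encode(n):
    """Encode an integer as a multiple of A."""
    return A * n

def is_valid(codeword):
    """A codeword is valid only if it is still a multiple of A."""
    return codeword % A == 0

def decode(codeword):
    """Recover the original integer, rejecting invalid codewords."""
    if not is_valid(codeword):
        raise ValueError("AN code check failed")
    return codeword // A

total = encode(17) + encode(25)   # addition preserves the code
result = decode(total)

# Flipping any single bit changes the value by +/- 2^i, never a multiple of 3.
all_single_flips_detected = all(
    not is_valid(total ^ (1 << i)) for i in range(32)
)
```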

Methodologies for Termination Errors

• Valid address range registers R_low, R_high in processor -- check each generated address to see if it is in range
  - Can detect crashes due to invalid addresses
• Timeout mechanisms -- store an upper bound on execution time for each block in the watchdog; if the time is exceeded at run time, flag an error
  - Can detect infinite loops or a hung processor due to control unit faults
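Both checks are simple comparisons, as the following Python sketch shows. The register values, the cycle bound, and the class name are hypothetical; the actual checks in this work are described in VHDL.

```python
R_LOW, R_HIGH = 0x00000000, 0x00000FFF  # hypothetical valid address range

def address_ok(address, low=R_LOW, high=R_HIGH):
    """Range check applied to every generated address."""
    return low <= address <= high

class BlockTimer:
    """Watchdog-side timeout: flags a block that runs past its bound."""

    def __init__(self, bound_cycles):
        self.bound_cycles = bound_cycles
        self.elapsed = 0

    def tick(self):
        """Advance one cycle; return True once the bound is exceeded."""
        self.elapsed += 1
        return self.elapsed > self.bound_cycles

timer = BlockTimer(bound_cycles=3)
flags = [timer.tick() for _ in range(5)]  # error flagged from the 4th cycle on
in_range = address_ok(0x000000EC)
out_of_range = address_ok(0x00010000)
```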

Current Implementation

• Fault injection w/ various controls (duration, frequency, extent, pattern) for a non-pipelined DLX processor in VHDL
• Fault injection on memory data/address buses
• Description of a watchdog processor in VHDL for control flow checking + infinite-loop termination errors
• Valid address range registers in processor
• ECC (1-error correction and 2-error detection) of memory (commercial feature) and buses (non-standard)
• Some error analysis results for a simple Fibonacci computation: f(i) = f(i-1) + f(i-2), i=2 to 99, f(0)=f(1)=0

Current Implementation -- ECC Capabilities on Memory and Buses

[Figure: ECC datapath. Labels shown: Memory with encoder/decoder (Enc/Dec), CPU with encoder/decoder (Enc/Dec), fault injector, check, PCH. Adr. Reg., L. Adr. Reg., address. Word format: 32 + 7 ECC = 39 bits. Example values: Add r30,r0,r14 = 000ef020; rfe; 410ee0304181ee8008 4380ee8818.]
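A 32 + 7 = 39-bit word is the size of an extended Hamming code with single-error correction and double-error detection. The Python sketch below shows one such code; the bit layout (overall parity at index 0, Hamming parity bits at positions 1, 2, 4, 8, 16, 32) is an assumption for illustration and not necessarily the layout used in the VHDL model.

```python
PARITY_POSITIONS = (1, 2, 4, 8, 16, 32)
DATA_POSITIONS = [p for p in range(1, 39) if p not in PARITY_POSITIONS]

def encode(data):
    """Encode a 32-bit word into a list of 39 bits (index 0 = overall parity)."""
    bits = [0] * 39
    for i, position in enumerate(DATA_POSITIONS):
        bits[position] = (data >> i) & 1
    for parity in PARITY_POSITIONS:
        bits[parity] = sum(bits[p] for p in range(1, 39) if p & parity) % 2
    bits[0] = sum(bits) % 2
    return bits

def decode(bits):
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    bits = list(bits)
    syndrome = 0
    for position in range(1, 39):
        if bits[position]:
            syndrome ^= position
    parity_error = sum(bits) % 2 == 1
    if syndrome == 0 and not parity_error:
        status = "ok"
    elif parity_error and syndrome < 39:
        bits[syndrome] ^= 1  # syndrome 0 means the overall parity bit itself
        status = "corrected"
    else:
        status = "uncorrectable"  # e.g. a double-bit error
    data = sum(bits[p] << i for i, p in enumerate(DATA_POSITIONS))
    return data, status

word = 0x000EF020
codeword = encode(word)
single = list(codeword)
single[7] ^= 1
double = list(single)
double[20] ^= 1
```

With this layout, `decode(single)` recovers the word and reports a correction, while `decode(double)` reports an uncorrectable error without attempting a repair.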

Some Error Observations

Address     Orig. instruction    Corrupted inst.
00000040    ADDI R7, R0, 3       ADDI R7, R0, 3
00000044    SW 0(R3), R7         SW 0(R2), R23
. .         . .                  . .
000000A4    LW R3, -12(R30)      LW R3, -12(R30)
000000A8    LHI R4, 0            LHI R4, 0
. .         . .                  . .
000000D4    LW R6, -12(R30)      LW R6, -12(R30)
000000D8    SLLI R6, R6, 2       SLLI R14, R4, 2
000000DC    ADD R5, R5, R6       ADD R5, R5, R6
000000E0    LW R4, 0(R4)         LW R4, 0(R4)
000000E4    LW R5, 0(R5)         LW R5, 0(R5)
000000E8    ADD R4, R4, R5       ADD R4, R4, R5
000000EC    SW 0(R3), R4         SW 0(R3), R4
. .         . .                  . .
0000010C    LW R3, -12(R30)      LW R3, 1040(R14)
00000110    ADDI R3, R3, 1       ADDI R3, R3, 1
00000114    SW -12(R30), R3      SW -12(R30), R3

Error annotations on the slide: wrong initialization; Ind. Addr. Err.; Inc. unknown value as index value; Invalid Addr. Err.

Some Error Observations (contd.)

[Figure: TRAP instruction format, with the TRAP opcode in bits 0-5 and Trap_id in bits 6-31. Trap_0 code = 44000000. Example encodings: Slli r5,r5,#4 = 50c60004; Add r4,r4,r5 = 00a62820. With 2-bit faults an instruction becomes 44XXXXX4, i.e., Trap_0.]

• DLX uses TRAP_0 to stop execution. The processor checks the first 6 bits (0-5) for the trap instruction and the last bit (31) for the trap_id. No other bit is checked.
• For trap instructions, if the last bit is 0, then execution stops (Trap_0). Unfortunately, for most ALU instructions (add, and, xor, rfe, etc.), the last bit is also 0.
• For all ALU instructions, the first 6 bits are always 0. When the 2nd and 5th bits are set, they become trap instructions. Hence their distance is 2.
• DLX interprets the last 5 bits (27-31) as trap_id (bits 6-26 are ignored). Non-trap instructions interpret bits 6-10 as src./dst. register.
• The check for trap/non-trap instructions is extended to bits 6-10, to increase the min. distance from 2 to 3.
• Premature stops due to trap_0 are thus reduced.
• More refined schemes to increase the min. distance are on-going work.
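The distance-2 observation can be checked directly on the instruction words quoted on this slide. In the Python sketch below, the helper names are illustrative, and bit 0 is taken to be the most significant bit, matching the slide's numbering.

```python
def opcode(word):
    """Top 6 bits of a 32-bit instruction word (bits 0-5 in the slide's numbering)."""
    return (word >> 26) & 0x3F

def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

TRAP_0 = 0x44000000  # Trap_0 code from the slide
SLLI = 0x50C60004    # Slli encoding from the slide
ADD = 0x00A62820     # Add encoding from the slide

distance_alu = hamming(opcode(ADD), opcode(TRAP_0))
distance_slli = hamming(opcode(SLLI), opcode(TRAP_0))

# Two flipped opcode bits turn the Slli word into a trap whose last bit is 0.
faulted = SLLI ^ ((opcode(SLLI) ^ opcode(TRAP_0)) << 26)
looks_like_trap_0 = opcode(faulted) == opcode(TRAP_0) and faulted & 1 == 0
```

Both distances evaluate to 2, so a 2-bit fault in the opcode field is enough to halt the program early, which is what motivates extending the trap check to additional bits.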

Experimental Setup: Fault Injection Parameters

• Repeat period: 10 ns - 800 ns (f = 100 MHz - 1.25 MHz)
• Duration range: 5 ns - 400 ns
• Clock cycle: 22 ns
• 4 random errors simulated on the data bus w/ the following characteristics:

Setting      Repeat period range (ns)   Duration range (ns)
Low_Low      305 - 425                  5 - 25
Med_Med      160 - 440                  180 - 220
High_High    305 - 425                  300 - 400

[Figure: timing diagram of the fault pulse train, labeled R=20 and D=5.]

Experimental Results

[Figure: bar chart, "Avg. executed instructions for each simulation" (1134 instructions with no fault). Bar values shown: 70, 74, 116, 169. Configurations: No Addr. Corr. Trap_0; No Addr. Corr. Trap_Fixed; Addr. Corr. Trap_0; Addr. Corr. Trap_Fixed.]

[Figure: bar chart, "Avg. executed instructions for each run stopped by trap". Bar values shown: 314.1, 441, 154, 276.]

• When the trap is fixed, fewer runs are terminated because of a trap, but invalid address termination (IAT) errors increase.
• Avg. instructions executed when the simulation stopped because of IAT: 35.42 (first type), 38.37 (second type), 88.46 (third type), 124.54 (fourth type).

Experimental Results (Cont.): Simulation Times, Data Computation

[Figure: bar chart, "Avg. execution time for each simulation" (265,620 ns for the non-faulty run). Bar values shown: 13,258 ns; 14,015 ns; 27,009 ns; 38,826 ns. Configurations: No Addr. Corr. Trap_0; No Addr. Corr. Trap_Fixed; Addr. Corr. Trap_0; Addr. Corr. Trap_Fixed.]

[Figure: bar chart, "Avg. array elements updated". Bar values shown: 3.86, 6.35, 13.23, 15.18.]

• The longer a simulation runs, the more data elements it calculates.

Experimental Results

52 simulations for Low_Low

[Figure: stacked bar chart (0-100%) of data errors, control flow errors, and termination errors for T_0, T_F, A_0, A_F.]

T_0 (No Addr. Corr., Trap_0): 410 errors; error density: 7/100
  Termination errors: 43 (10%; 14 trap, 29 invalid address); CF: 66 (16%); data errors: 301 (74%)
T_F (No Addr. Corr., Trap_Fixed): 424 errors; error density: 7/100
  Termination errors: 41 (9.6%; 9 trap, 32 invalid address); CF: 76 (18%); data errors: 307 (73%)
A_0 (Addr. Corr., Trap_0): 444 errors; error density: 2.2/100
  Termination errors: 38 (6.8%; 11 trap, 27 invalid address); CF: 54 (13.6%); data errors: 315 (79.5%)
A_F (Addr. Corr., Trap_Fixed): 446 errors; error density: 1.7/100
  Termination errors: 27 (6%; 6 trap, 21 invalid address); CF: 24 (5.4%); data errors: 395 (88.6%)

• The longer the program runs, the more data errors it produces.
• When the trap is fixed, more simulations are completed, but invalid address terminations increase.
• When addresses are corrected, invalid address errors are reduced; the simulation executes more instructions, which increases the data errors.

Experimental Results (cont.)

52 simulations for Med_Med

[Figure: stacked bar chart (0-100%) of error types for T_0, T_F, A_0, A_F.]

T_0 (No Addr. Corr., Trap_0): 68 errors; error density: 15/100
T_F (No Addr. Corr., Trap_Fixed): 82 errors; error density: 11/100
  Termination errors: 52 (63.5%; 0 trap, 52 invalid address); CF: 7 (8.5%); data errors: 23 (28%)
A_0 (Addr. Corr., Trap_0): 150 errors; error density: 10/100
  Termination errors: 52 (34%; 7 trap, 43 invalid address); CF: 54 (36%); data errors: 44 (30%)
A_F (Addr. Corr., Trap_Fixed): 175 errors; error density: 8/100

• Increasing the fault injection period reduces the number of executed instructions, so the error density increases sharply.

Experimental Results (cont.)

52 simulations for High_High

[Figure: stacked bar chart (0-100%) of error types for T_0, T_F, A_0, A_F.]

T_0 (No Addr. Corr., Trap_0): 61 errors; error density: 35/100
  Termination errors: 52 (85%; 1 trap, 51 invalid address); CF: 9 (15%); data errors: 0 (0%)
T_F (No Addr. Corr., Trap_Fixed): 90 errors; error density: 48/100
  Termination errors: 52 (57%; 0 trap, 52 invalid address); CF: 38 (43%); data errors: 0 (0%)
A_0 (Addr. Corr., Trap_0): 93 errors; error density: 22/100
  Termination errors: 52 (55.3%; 9 trap, 43 invalid address); CF: 41 (43.6%); data errors: 1 (1.1%)
A_F (Addr. Corr., Trap_Fixed): 52 errors; error density: 26/100
  Termination errors: 52 (100%; 4 trap, 48 invalid address); CF: 0 (0%); data errors: 0 (0%)

• The process is never able to calculate a Fibonacci value because of the high fault injection rate. None of the simulations completes.

Error Coverage

For error coverage, we ran our simulation 122 times with repeat period 300 ns - 500 ns and duration range 150 ns - 250 ns.

T_0: No Addr. Corr., Trap_0
Total: 434 erroneous instructions executed

[Figure: bar chart (0-100%) of ECC, control flow, and data coverage.]

• ECC coverage: 95%. Of 434 erroneous instructions, 411 were covered by ECC.
• Control flow coverage: 20%. 90 erroneous instructions were covered by the WD. We are injecting 4-bit faults; if the process jumps into the middle of a block, the WD spends time getting to the beginning of a block.
• Data coverage: no value given.

Error Coverage (cont.)

For error coverage, we ran our simulation 122 times with repeat period 300 ns - 500 ns and duration range 150 ns - 250 ns.

T_F: No Addr. Corr., Trap_Fixed

[Figure: bar chart (0-100%) of ECC, control flow, and data coverage.]

• ECC coverage: 95%
• Control flow coverage: 23%
• Data coverage: no value given.

Error Coverage (cont.)

For error coverage, we ran our simulation 122 times with repeat period 300 ns - 500 ns and duration range 150 ns - 250 ns.

A_0: Addr. Corr., Trap_0

[Figure: bar chart (0-100%) of ECC, control flow, and data coverage.]

• ECC coverage: 83%
• Control flow coverage: 20%
• Data coverage: 13%. There were 89 data errors; 12 (13%) of them were covered by 3N coding.

Error Coverage (cont.)

For error coverage, we ran our simulation 122 times with repeat period 300 ns - 500 ns and duration range 150 ns - 250 ns.

A_F: Addr. Corr., Trap_Fixed

[Figure: bar chart (0-100%) of ECC, control flow, and data coverage.]

• ECC coverage: 82%
• Control flow coverage: 18%. 762 err. inst. covered by WD (18%). We are injecting 4-bit faults; if the process jumps into the middle of a block, the WD spends time getting to the beginning of a block.
• Data coverage: 19%. There were 106 data errors; 20 (19%) of them were covered by 3N coding.

Error Coverage (cont.)

For error coverage, 20 runs were selected that resulted in complete simulations, w/ combinations of period 305 - 460 ns and duration range 5 - 60 ns.

Addr. Corr., Trap_0

[Figure: bar chart (0-100%) of ECC, control flow, and data coverage.]

• ECC coverage: 80%
• Control flow coverage: 39%. 170 erroneous instructions were covered by the WD. We are injecting 4-bit faults; if the process jumps into the middle of a block, the WD spends time getting to the beginning of a block.
• Data coverage: 23%. There were 217 data errors; 51 (23%) of them were covered by 3N coding.

Conclusions

• Have completed a significant but preliminary fault simulation of the DLX processor in VHDL
• Obtained % of termination, control and data errors for different fault durations and frequencies
• Encoding the TRAP instruction to have a min. distance from other instructions helps in reducing incorrect termination
• Need to have ECC for register fields of instructions to reduce incorrect address generation and data errors
• It seems possible to catch most errors with the combination of mechanisms we have suggested, so at least a fail-safe mode can be guaranteed with high confidence, though there is room for improvement in control & data error detection

Future Work

• Other fault patterns (e.g., clusters); correlation with EM-induced fault work by others in our group
• Other block signature techniques (e.g., LFSR) with better fault coverage
• Aliasing analysis (math., empirical) for signatures
• Perform error analysis for more substantial “real-life” programs (scientific computations, non-numeric, system or O.S.)
• Fault injection and analysis for SuperScalar DLX
• Fault injection and analysis of on-chip processor components (integer and FP ALU, register files, control unit, internal buses, power/ground lines)

Looking Further Ahead

Q: Are there patterns of errors that lead to computer crashes w/ high probability?

Q: If so, can the detection of such patterns be used to shut down the computer in a fail-safe manner (save state & data for later resumption)?

Q: Are there patterns of errors that are characteristic of EM-induced faults versus random single/double faults?

Q: If so, can these be used as “early detection & warning” of EM interference?

Future: Based on the correlation of system errors to EM faults, determine fault tolerance/ error minimization techniques for EM-induced faults.
