Design and Simulation of an EM-Fault-Tolerant Processor with Micro-Rollback, Control-Flow Checking and ECC
Franco Trovo, Shantanu Dutt & Hasan Arslan
Univ. of Illinois at Chicago
Outline
• Goals
• Solution Adopted
  • Control Flow Checking
  • Hamming encoding on the buses
  • Instruction Micro rollback
• Motorola 68040 and VHDL description
• Simulation results
• Conclusion
Assumptions/Scenarios of Past FD/FT Work
• Past work on general fault detection:
  • Random single (sometimes double) faults
  • Deterministic faults
  • Types of faults: permanent, transient, intermittent; the intermittent type is not generally tackled
• Past work on EM-induced faults:
  • No how/why/what analysis and classification of computer failure due to EM interference
Broad Goals of Our Work
• Determine and classify the following types of computer system behavioral errors (i.e., program errors) caused by different patterns, extent, duration and location of EM-type faults:
  • Control flow errors: incorrect sequence of instruction execution. Causes: address generation errors, memory faults, bus faults
  • Data errors. Causes: computation errors, memory & bus faults
  • Termination errors (hung processor & crashes). Causes: control unit (C.U.) transition to dead-end states, invalid instruction, out-of-bound address, divide-by-zero, spurious interrupts
  • Note: error types are NOT mutually exclusive
• Provide recipes for FT and reliable operation
In This Work
• Will detect:
  • Control flow errors: incorrect sequence of instruction execution. Causes: address generation errors, memory faults, bus faults
  • Raw bus errors, using ECC
• Provide an FT mechanism that uses these detections for reliable operation
FD/FT Solutions
• Fault Detection:
  • Control flow checking (CFC) by concurrent error detection using a watchdog (WD) processor
  • Hamming ECC (2-error detecting) on data & address buses
• Fault Tolerance:
  • Instruction micro rollback triggered by:
    • Hamming ECC
    • WD-monitored CFC
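
The slides do not say which Hamming code length is used on the buses. As a minimal sketch, assuming a Hamming(7,4) code used for detection only (no correction), the following shows why such a code flags every 1-bit and 2-bit bus error. The function names `hamming74_encode`, `hamming74_syndrome` and `bus_error_detected` are illustrative and not taken from the VHDL model.

```python
def hamming74_encode(nibble: int) -> int:
    """Encode 4 data bits into a 7-bit Hamming codeword (assumed code length)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    bits = [0] * 8                       # positions 1..7, index 0 unused
    bits[3], bits[5], bits[6], bits[7] = d
    bits[1] = bits[3] ^ bits[5] ^ bits[7]    # parity over positions 1,3,5,7
    bits[2] = bits[3] ^ bits[6] ^ bits[7]    # parity over positions 2,3,6,7
    bits[4] = bits[5] ^ bits[6] ^ bits[7]    # parity over positions 4,5,6,7
    word = 0
    for pos in range(1, 8):
        word |= bits[pos] << (pos - 1)
    return word


def hamming74_syndrome(word: int) -> int:
    """XOR of the positions of all set bits; zero for a valid codeword."""
    syndrome = 0
    for pos in range(1, 8):
        if (word >> (pos - 1)) & 1:
            syndrome ^= pos
    return syndrome


def bus_error_detected(received: int) -> bool:
    """Detection-only use: any non-zero syndrome would request a micro rollback."""
    return hamming74_syndrome(received) != 0
```

With minimum distance 3, any one or two flipped bits leave a non-zero syndrome, while some 3-bit patterns (for example positions 1, 2 and 3 together) map one codeword onto another and escape detection. That is consistent with the deck pairing the Hamming check with CFC rather than relying on it alone.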
General Structure of a System with a Watchdog
[Figure: block diagram with main processor, main memory, data bus, address bus and watchdog processor. The watchdog processor performs various checks (CFC, address, etc.).]
General Structure of a WD-Monitored System with On-Chip Cache
[Figure: block diagram with CPU, cache, main memory (MM), watchdog (WD), address bus and data bus.]
Control Flow Checking [Mahmood, et al., IEEE TC’88]
• Hybrid solution for detecting wrong block sequence execution
• Starting from a program, it extracts a Control Flow Graph
• Each node is associated with a block of branch-free instructions plus a branch at the end
• Each edge is associated with a possible branch between two blocks
Example program:
  Block A
  If cond1 then
    Block B
    if cond2 then Block D else Block E
  Else
    Block C
  End if
  Block F
[Figure: control flow graph of the example program, with nodes A, B, C, D, E and F.]
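
As a rough illustration of what "wrong block sequence" means, the sketch below encodes the example program as a graph and reports the first illegal block-to-block transition in an observed trace. The edge set is inferred from the pseudocode of the example (A branches to B or C, B to D or E, and C, D, E fall through to F), and the names `CFG` and `first_illegal_transition` are hypothetical. The scheme on the slides works with signatures inserted into the blocks rather than with block names, so this only shows the graph-legality idea.

```python
# Successor sets inferred from the example program (an assumption of this sketch).
CFG = {
    "A": {"B", "C"},
    "B": {"D", "E"},
    "C": {"F"},
    "D": {"F"},
    "E": {"F"},
    "F": set(),
}


def first_illegal_transition(trace):
    """Return the first (src, dst) pair that is not an edge of CFG, else None."""
    for src, dst in zip(trace, trace[1:]):
        if dst not in CFG.get(src, set()):
            return (src, dst)
    return None
```

For instance, the trace A, B, D, F follows the graph, whereas A, C, E contains the transition C to E, which no edge allows.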
Control Flow Checking
• Block: branch-free set of instructions
• Signature: information added to the block in order to distinguish one block from another
[Figure: block augmentation and signature insertion for the CFG with nodes A to F. Labels in the figure: jump-free set of instructions, JUMP, JUMP sign 1, JUMP sign 2; branch-free set of instructions, Branch, BLOCK sign, Sign of 1st bra, Sign of 2nd bra; Block.]
CFC Implemented State Diagram
[Figure: watchdog state diagram. States: Reset, Begin Block, Header, Middle Block, Signature 1, Signature 2, Branch, GET1S, GET2S. Decisions (Y/N): "Computed Sign. Eq. Header Sign?" and "Header Sign Eq. Bra Signatures?". Error states: Wrong Bra, Wrong Jump or Faulted Signature, Wrong Computed Signature, Signature Expected. The slide repeats the CFG (nodes A to F) and the augmented block layout (BLOCK sign, Sign of 1st bra, Sign of 2nd bra, branch-free set of instructions, Branch), with one block marked "No Branch signs".]
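
The two decisions in the state diagram can be read as two comparisons made by the watchdog. The sketch below is one possible reading, not the authors' implementation: it assumes the signature of a block is the XOR of its instruction words (the slides do not specify the signature function), and that the header signature of a newly entered block must match one of the branch signatures stored in the previous block. `Block`, `compute_signature` and `check_block` are hypothetical names.

```python
from dataclasses import dataclass
from functools import reduce
from typing import List, Optional


@dataclass
class Block:
    block_sign: int          # header signature stored in the block
    branch_signs: List[int]  # signatures of the legal successor blocks
    body: List[int]          # branch-free instruction words


def compute_signature(words: List[int]) -> int:
    """Assumed signature function: XOR of all instruction words."""
    return reduce(lambda a, b: a ^ b, words, 0)


def check_block(prev: Optional[Block], cur: Block) -> str:
    # "Header Sign Eq. Bra Signatures?": skipped when the previous block
    # carries no branch signatures (an assumption of this sketch).
    if prev is not None and prev.branch_signs:
        if cur.block_sign not in prev.branch_signs:
            return "wrong branch or faulted signature"
    # "Computed Sign. Eq. Header Sign?"
    if compute_signature(cur.body) != cur.block_sign:
        return "wrong computed signature"
    return "ok"
```

A corrupted instruction word changes the computed signature of the block, and a jump to an unexpected block shows up as a header signature that is absent from the previous block's branch signatures.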
Micro Rollback [Tamir, et al., IEEE TC’90]
[Figure: micro rollback support, in two parts. Individual state registers (RAM based): current register and backup registers with valid (v) bits and priority logic, to/from processor. Register file, caches and main memory (DWB based): register file with a DWB made of a CAM holding register addresses, valid (v) bits, priority circuit, decoder and FIFO; Bus 1, Bus 2 and Write lines.]
Support for Micro Rollback for Register File - example
Instruction sequence: MOVE 0000, D0; ADD 000F, D0; MOVE 0001, A3 (f); SUB 0002, D0; micro rollback of 2 levels; …
[Figure, repeated on every animation frame: register file with a DWB made of a CAM (register addresses), valid bits, priority circuit, decoder and FIFO; Bus 1, Bus 2 and Write lines.]
DWB contents on successive frames, with values listed in the order they appear on the slides:
• Frame 1: no DWB values shown
• Frame 2: valid bits 100000; register addresses D0 XX XX XX XX XX; data 0000 XXXX XXXX XXXX XXXX XXXX
• Frame 3: valid bits 110000; register addresses D0 XX XX XX XX D0; data 000F XXXX XXXX XXXX XXXX 0000
• Frame 4: valid bits 111000; register addresses A3 XX XX XX D0 D0; data 0101 XXXX XXXX XXXX 0000 000F
• Frame 5: valid bits 1 1 1 1 (other two entries 0, XX, XXXX); register addresses D0 D0 A3 D0; data 0000 000D 0101 000F
• Frame 6: valid bits 1 1 0 0 (other two entries 0, XX, XXXX); register addresses D0 D0 A3 D0; data 0000 000D 0101 000F
• Frame 7: valid bits 1 1 and 1 0 (other two entries 0, XX, XXXX); register addresses D0 D0 and D0 A3; data 0000, 000D, 0001, 000F
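
The register-file example can be mimicked in software. The following is a minimal sketch, reading DWB as a delayed write buffer: writes are held in a FIFO before they reach the register file, a read returns the most recent pending write to that register (the role played by the CAM and priority circuit in the figure), and a rollback of n levels discards the n newest pending writes. It assumes one register write per instruction, and the class and method names are illustrative only.

```python
from collections import deque


class DelayedWriteBuffer:
    """Toy model of a register file fronted by a delayed write buffer."""

    def __init__(self, depth: int):
        self.depth = depth      # maximum rollback distance, in writes
        self.pending = deque()  # (register, value) pairs, oldest first
        self.regfile = {}       # committed register state

    def write(self, reg: str, value: int) -> None:
        # When the buffer is full, the oldest write retires to the register file.
        if len(self.pending) == self.depth:
            old_reg, old_val = self.pending.popleft()
            self.regfile[old_reg] = old_val
        self.pending.append((reg, value))

    def read(self, reg: str) -> int:
        # The newest pending write to this register takes priority.
        for r, v in reversed(self.pending):
            if r == reg:
                return v
        return self.regfile[reg]

    def rollback(self, levels: int) -> None:
        if levels > len(self.pending):
            raise ValueError("rollback distance exceeds buffered writes")
        for _ in range(levels):
            self.pending.pop()


def replay_slide_example():
    dwb = DelayedWriteBuffer(depth=6)
    dwb.write("D0", 0x0000)                    # MOVE 0000, D0
    dwb.write("D0", dwb.read("D0") + 0x000F)   # ADD 000F, D0
    dwb.write("A3", 0x0101)                    # MOVE 0001, A3 hit by a fault
    dwb.write("D0", dwb.read("D0") - 0x0002)   # SUB 0002, D0
    dwb.rollback(2)                            # undo the last two writes
    dwb.write("A3", 0x0001)                    # re-execute MOVE 0001, A3
    dwb.write("D0", dwb.read("D0") - 0x0002)   # re-execute SUB 0002, D0
    return dwb.read("D0"), dwb.read("A3")
```

Replaying the slide sequence this way ends with D0 = 000D and A3 = 0001, since the faulted write of 0101 to A3 never reaches the register file.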
CFC with Micro Rollback - Priority
• Two concurrent fault detection techniques can each request a micro rollback from the processor
• They generally request different numbers of rollback levels
• Which technique should have priority when the Hamming code (HC) and the WD detect an error simultaneously?
  • We assign the priority to the Hamming code
  • Reason: shorter jump backs
  • A rationale also exists for WD priority
[Figure: HC and WD requests arriving at the MRB unit, labeled uRB=1 and uRB=3.]
CFC with Instruction Micro Rollback - State Diagram
[Figure: the CFC state diagram (states Reset, Begin Block, Header, Middle Block, Signature 1, Signature 2, Branch, GET1S, GET2S; decisions "Computed Sign. Eq. Header Sign?" and "Header Sign Eq. Jump Signatures?"; error states Wrong Branch, Wrong Computed Signature, Wrong Branch or Faulted Signatures) annotated with micro rollback distances urb_d = 1, 2, 3 and bsize, next to the CFG (nodes A to F) and the augmented block layout.]
• The error state "Wrong Branch or Faulted Signatures" has multiple points of micro rollback, selected by t = number of times the same error state is encountered:
  • t < t1: urb to BEGIN_BLOCK (1 instr); read the header sign. again
  • t1 <= t < t2: urb to "Branch" (2 instr); re-execute the previous block's branch
  • t >= t2: urb to MIDDLE BLOCK (3 instr); re-read the 2 branch signs. of the previous block
• Hamming code: urb_d = 1 (re-execute previous branch)
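
The priority rule and the escalating rollback points can be combined into one small decision function. This is a sketch under stated assumptions: it covers only the "Wrong Branch or Faulted Signatures" case for the WD, it takes the thresholds t1 and t2 as parameters because the slides give no values for them, and the name `select_rollback_distance` is hypothetical.

```python
def select_rollback_distance(hc_error: bool, wd_error: bool,
                             t: int, t1: int, t2: int) -> int:
    """Return the number of instructions to roll back (0 means no rollback).

    t counts how many times the same WD error state has been encountered.
    """
    if hc_error:
        # The Hamming code has priority on simultaneous detection: urb_d = 1.
        return 1
    if wd_error:
        if t < t1:
            return 1   # back to BEGIN_BLOCK: read the header signature again
        if t < t2:
            return 2   # back to "Branch": re-execute the previous block's branch
        return 3       # back to MIDDLE BLOCK: re-read both branch signatures
    return 0
```

Giving the Hamming code priority means a simultaneous detection costs a 1-instruction rollback; if the control-flow error persists afterwards, the WD path is taken on the next detection and its distance grows with t.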
Improved VHDL Model of 68040 + Watchdog connections
[Figure: block diagram of the 68040 VHDL model with CPU, BC, instruction cache and data cache. Hamming encoders and decoders (Enc \ Dec) sit on the buses: IABUS1, IABUS2, OABUS1, OABUS2, IDBUS, ODBUS, Address Bus and Data Bus; control signals enable, rw and ready. WD connections are labeled: Hamming code error detect. bits, control lines, data buses.]
Simulation Environment
• The Total Fault Injection Time is the total duration of the intermittent fault on the bus or buses considered.
• The Delay Time is the time that the FG waits before starting the fault injection.
• The Period Time is the period of the intermittent fault.
• The Fault Time is the duration of the injection of a given fault.
[Figure: timing diagram of the Fault Enable signal, marking Start Fault Injection, First Fault Injected, Second Fault Injected, Delay Time, Period Time, Fault Time and Total Fault Injection Time.]
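
The four timing parameters determine when the Fault Enable signal is active. A minimal sketch, assuming time is counted in clock cycles, that each period starts with the fault active, and that the total injection window starts right after the delay (the function name `fault_enable` is illustrative):

```python
def fault_enable(t: int, delay: int, period: int,
                 fault_time: int, total: int) -> bool:
    """True when an intermittent fault is being injected at cycle t."""
    if not (delay <= t < delay + total):
        return False                          # outside the injection window
    return (t - delay) % period < fault_time  # active part of each period


def duty_cycle(period: int, fault_time: int) -> float:
    """Fraction of each period during which the fault is active."""
    return fault_time / period
```

With a delay of 2, a period of 4, a fault time of 1 and a total injection time of 8, the fault is active at cycles 2 and 6 only, which is a 25% duty cycle.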
Fault Parameters Values
Simulations run on the model:
• Faults injected on all cache buses
• Fault types:
  • Random: double, triple, quadruple faults
  • Clustered: 1 cluster of 2 bits, 1 cluster of 4 bits, 2 clusters of 2 bits
• Three values of repeat frequency:
  • Low (100 clock cycles = 100KHz)
  • Medium (10 clock cycles = 1MHz)
  • High (1 clock cycle = 10MHz)
• Three values of duty cycle:
  • 25%: all the simulations
  • 50%: all except high frequency and 4 faults
  • 75%: all 2 faults, and 3 faults at middle frequencies
Simulation Results (contd.)
[Figure: bar chart "Overall correctness of execution - sorted", vertical axis 0 to 100. Series: Correct without WD, Correct with WD, Fail safe with WD, Incorrect runs with WD. Data labels in extraction order: 45, 55, 35, 30; 76, 72, 65, 64; 11, 18, 21, 18; 11, 8, 13, 16. Category labels not available.]
[Figure: bar chart "Average execution time (completed runs) vs kind of fault injection", vertical axis 0 to 1200000. Categories: No Faults, 2 Random Faults, 1 Cluster 2 bits, 3 Random Faults, 1 Cluster 4 bits, 2 Clusters 2 bits, 4 Random Faults. Series: 100KHz, 1MHz, 10MHz.]
Simulation Results (contd.)
NOTE:
• The Hamming code (HC) has better error coverage for cluster faults
• The block signature check (part of CFC) has better error coverage for random faults
Simulation Results (contd.)
[Figure: bar chart "Average execution time - low frequency [1 cluster 4 bits]", vertical axis 0 to 450000. Categories: only correct run, only not correct run, finished run, not finished run, no fault injection. Series: 25% dc, 50% dc, 75% dc.]
Conclusions
• Micro-rollback coupled with FD for the first time
• Micro-rollable WD state diagram for the first time
• More extensive fault patterns than previous work
• Good reliability for our FD/FT solutions (correct or fail-safe execution)
  • 3 faults: 94% low freq, 90% mid freq & 90% high freq
  • 4 faults: 86% low freq, 80% mid freq & 80% high freq
• Average execution time is linear with duty cycle and almost quadratic with the fault injection frequency
  • Time overhead, 3 faults: 11% low, 12% med, 64% high freq
  • Time overhead, 4 faults: 16% low, 32% med, 182% high freq
• Data buses are less tolerant to faults than address buses (faults on the latter cause more CFC errors and so are detected more easily)
Future Work
• Introduction of other fault detection techniques as triggers for micro rollback
  • Lower-level fault detection, such as micro-instruction control flow checking: can detect internal processor faults
  • Higher-level fault detection, such as algorithm-based fault tolerance (ABFT) for checking data errors: can detect external & internal faults affecting data