University of Michigan Electrical Engineering and Computer Science

A Microarchitectural Analysis of Soft Error Propagation in a Production-Level Embedded Microprocessor

Jason Blome, Scott Mahlke, Daryl Bradley*, Krisztián Flautner*

Advanced Computer Architecture Lab, University of Michigan
*ARM Ltd.

Post on 22-Dec-2015




Embedded Everywhere

Patterson and Hennessy 2005

• Not just cellphones
• Safety-critical applications:
  ► Automotive
  ► Healthcare


Embedded Domain Constraints

• Power-efficient performance
  ► Longer clock cycle times
  ► Increased logic depth between stages
  ► Higher area ratio of combinational logic to state elements
• Less speculative state
  ► Potentially less masking
• Limited real estate

All of these high-level constraints affect the behavior of faults and the potential of fault tolerance techniques


Objectives

• Understand the effects of transient faults on a typical embedded design
  ► Architectural contributions to soft error effects
  ► Production-grade core
    • Reference synthesis flow
    • Design for test methodologies
• Simulate faults in both combinational and sequential logic


Soft Error Rate Contributions

[Figure: soft error rate contributions, Shivakumar 2002]

[Figure: soft error rate contributions, Mitra 2005. Pie chart with segments labeled Static Combinational Logic, Unprotected SRAMs, and Sequential Elements; values shown: 49%, 11%, 40%]

Increasing contribution of faults in combinational logic to the overall soft error rate


Processor Model

[Figure: ARM926EJ-S block diagram. Labeled blocks: Instruction Fetch, Instruction Decode, Register Bank, Instruction Address Logic, Data Address Logic, Multiply, ALU, Shift, Data Interface, Mux Array, Instruction cache with MMU, Data cache with MMU, Write Buffer/Bus Interface]

• ARM926EJ-S
• Cell library characterized for 130 nm
• 5 ns clock cycle time


Analysis Infrastructure

[Figure: fault injection/error analysis framework. Labeled components: testbench, benchmark, reference design, test design, fault injection scheduler, error checking and logging, report generation]
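The components named on this slide can be illustrated with a toy campaign. The sketch below is not the authors' framework: the three-input circuit, the node names `n1` and `out`, and the functions `toy_design`, `run_campaign`, and `report` are all hypothetical stand-ins. It only shows the general pattern of running the same stimulus through a fault-free reference and a faulted copy, then logging whether the outputs differ.

```python
import random

def toy_design(a, b, c, flip_node=None):
    """Hypothetical combinational design: out = (a AND b) OR c.

    flip_node optionally names an internal node whose value is inverted,
    standing in for a transient fault on that node.
    """
    n1 = a & b
    if flip_node == "n1":
        n1 ^= 1
    out = n1 | c
    if flip_node == "out":
        out ^= 1
    return out

def run_campaign(trials, seed=0):
    """Inject one fault per trial and compare against the fault-free run."""
    rng = random.Random(seed)
    log = []
    for _ in range(trials):
        a, b, c = (rng.randint(0, 1) for _ in range(3))  # benchmark stimulus
        node = rng.choice(["n1", "out"])                 # fault injection scheduler
        golden = toy_design(a, b, c)                     # reference design
        faulty = toy_design(a, b, c, flip_node=node)     # test design
        # error checking and logging
        log.append({"node": node, "inputs": (a, b, c), "error": golden != faulty})
    return log

def report(log):
    """Report generation: count expressed and masked faults."""
    errors = sum(entry["error"] for entry in log)
    return {"trials": len(log), "errors": errors, "masked": len(log) - errors}
```

In this toy circuit a fault on `n1` is masked whenever `c` is 1, while a fault on `out` is always visible, so the report separates masked from expressed faults in the same spirit as the error and masking rates discussed later in the talk.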


Fault Masking

• Logical: faulted value does not affect logical operation of the circuit
• Latching-Window: the fault pulse does not reach a state element within the latching window
• Electrical: the fault pulse is electrically attenuated by subsequent gates in the circuit
• Architectural/Software: incorrect state is overwritten before it is read

[Figure: logic gate example with 0 values illustrating logical masking; CLK waveform marking tsetup and thold around the latching window]
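Two of these masking mechanisms are easy to show in a few lines. The sketch below is a simplified illustration under stated assumptions, not the model used in the study: `logically_masked` checks whether inverting one input of a 2-input AND gate changes its output, and `latched` treats a fault pulse as captured only if it overlaps the window from `clock_edge - t_setup` to `clock_edge + t_hold`. All function names, parameters, and the example times are hypothetical.

```python
def and_gate_output(a, b, fault_on_a=False):
    """2-input AND gate; optionally invert input a to model a fault."""
    if fault_on_a:
        a ^= 1
    return a & b

def logically_masked(a, b):
    """True if a fault on input a leaves the AND output unchanged."""
    return and_gate_output(a, b) == and_gate_output(a, b, fault_on_a=True)

def latched(pulse_start, pulse_end, clock_edge, t_setup, t_hold):
    """True if the pulse overlaps the latching window around the clock edge.

    Simplifying assumption: any overlap with
    [clock_edge - t_setup, clock_edge + t_hold] counts as a capture.
    """
    window_start = clock_edge - t_setup
    window_end = clock_edge + t_hold
    return pulse_start < window_end and pulse_end > window_start
```

For example, `logically_masked(1, 0)` is true because the other input already forces the output to 0, and a pulse from 1.0 to 1.5 (arbitrary time units) is latching-window masked for an edge at 5.0 with `t_setup=0.2` and `t_hold=0.1`.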


Observed Error Rates

Faults Occurring in Registers

Error Site                  Error Rate   Masking Rate
Microarchitectural State    94%          6%
Architectural State         7%           93%
Top-level Ports             4%           96%

Faults Occurring in Combinational Logic

Error Site                  Error Rate   Masking Rate
Microarchitectural State    16%          84%
Architectural State         4%           96%
Top-level Ports             3%           97%

At the software interface, error rates within 3%
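The arithmetic behind these figures can be checked directly. The sketch below transcribes the error rates from this slide; treating the first table as register faults and the second as combinational-logic faults follows the order of the panel titles, and reading "the software interface" as the Architectural State row is an assumption. The names `error_rate`, `masking_rate`, and `gap` are illustrative only.

```python
# Error rates (%) transcribed from the slide. Which table belongs to which
# fault class is inferred from the order of the panel titles (assumption).
error_rate = {
    "registers": {"microarchitectural": 94, "architectural": 7, "ports": 4},
    "logic":     {"microarchitectural": 16, "architectural": 4, "ports": 3},
}

def masking_rate(rate):
    """Masking rate is the complement of the error rate, in percent."""
    return 100 - rate

def gap(site):
    """Difference in error rate between the two fault classes, in points."""
    return abs(error_rate["registers"][site] - error_rate["logic"][site])
```

Under these assumptions the two fault classes differ by 78 percentage points at the microarchitectural level but by only 3 points at the architectural level, which matches the "within 3%" remark.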


Observed Error Rates

Faults Occurring in Registers

Cycle   Average Bit Errors
1       1.26
2       3.19
3       3.06
4       5.52

Faults Occurring in Combinational Logic

Cycle   Average Bit Errors
1       41.49
2       45.33
3       47.76
4       49.54

Faults in combinational logic have a much more dramatic effect on system state


Architectural Errors per Cycle

[Figure: two panels, "Faults Occurring in Registers" and "Faults Occurring in Combinational Logic". Each plots Relative Frequency (0 to 1) against Number of Architectural Errors (log scale, 1 to 1000), with one series per cycle for Cycle 1 through Cycle 10]


Architectural Corruption Characteristics

[Figure: two panels, each with one series per cycle for Cycle 1 through Cycle 10.
"Bits per Architectural Register Corrupted": Relative Frequency (0 to 0.7) against Corrupt Bits per Architectural Register (1 to 31).
"Number of Architectural Registers Corrupted": Relative Frequency (0 to 0.7) against Number of Corrupted Architectural Registers (1 to 26)]


Results Summary

• Faults occurring in logic:
  ► Will likely be much more frequent in embedded design
  ► Tend to have a more dramatic effect on system state
  ► Multi-bit/multi-register architectural errors common
• Design for test methodologies can greatly impact soft error characteristics
• Error rates at the software interface consistent with those observed in high-performance microprocessors


Traditional Error Detection/Protection

• Reliable Encoding
  ► ECC/Parity
    • Limited use for faults in logic
    • Unclear where/how much to protect
• Redundant Computation
  ► In space
    • Area/energy overhead
  ► In time
    • Energy overhead
    • Requires performance slack


Case Study I

[Figure: ARM926EJ-S block diagram (same blocks as the Processor Model slide) with the IRoute block labeled and per-cycle error annotations]

Cycle 1: 51 Errors
  instr_reg_ID[0, 16, 22, 31]
  ID_decode_info[0, 16, 31]
  stored_instr[29, 30]
Cycle 2: 51 Errors
  instr_reg_EX[0, 16, 22, 31]
  EX_decode_info[0, 16, 31]
Cycle 3: 17 Errors
  ALU_out[0, 1, 2, 3, 4, 5, 6]
Cycle 4: 18 Errors
  ALU_result_wb[0, 1, 2, 3, 4, 5, 6]
Cycle 5: 29 Errors
  Reg0_reg[0, 1, 2, 3, 4, 5, 6]


Case Study II

[Figure: ARM926EJ-S block diagram (same blocks as the Processor Model slide) with the IPipe block labeled and per-cycle error annotations]

Cycle 1: 9 Errors
  instr_reg_ID[3, 12, 17, 18, 24, 26, 29, 30, 31]
Cycle 2: 62 Errors
  instr_reg_EX
  shifter_data_opEx_regShifter_data_reg
  alu_cc_reg
Cycle 3: 49 Errors
  Shifter_data_EX
  alu_out_reg
Cycle 4: 183 Errors
  writeback and forwarding state
  register bank


Fault Characteristics

• Case Study I: uCORE.uIRoute.U600
  ► First cycle error sites: 51 errors
    • uIRoute.INSTRHeld_reg[0]
    • uIRoute.INSTRHeld_reg[16]
    • uIRoute.INSTRHeld_reg[22]
    • uIRoute.INSTRHeld_reg[31]
    • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[0]
    • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[16]
    • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[31]
    • u9EJ.uARM9.uCORECTL.uIPIPE.StoredInstrInt_reg[29]
    • u9EJ.uARM9.uCORECTL.uIPIPE.StoredInstrInt_reg[30]
• Case Study II: uCORE.u9EJ.uARM9.uCORECTL.uIPIPE.U3626
  ► First cycle error sites: 9 errors
    • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[3]
    • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[12]
    • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[17]
    • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[18]
    • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[24]
    • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[26]
    • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[29]
    • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[30]
    • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[31]


Embedded Design Space Potential

• Leverage significant signal fanout
• Determine that a fault has occurred during the cycle that it occurs
  ► Transition detection circuits
• Selectively deploy fault detection units
  ► Intersection of high fanout fault targets
  ► No roll-back necessary – simply flush the pipeline
  ► Low cost/area overhead critical for embedded designs
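One way to read "intersection of high fanout fault targets" is as a coverage problem: place detectors on the few state elements that many different fault sites corrupt. The sketch below uses a greedy selection as a stand-in; the talk does not say how placement would actually be chosen, and `place_detectors`, `fault_fanout`, and `budget` are hypothetical names. The example data is an abbreviated subset of the first-cycle error sites listed for the two case studies.

```python
def place_detectors(fault_fanout, budget):
    """Greedily pick state elements that observe the most fault sites.

    fault_fanout maps a fault site to the set of state elements it
    corrupts in the first cycle. Returns (chosen elements, fault sites
    left unobserved). Greedy choice is an assumption, not the authors' method.
    """
    uncovered = set(fault_fanout)
    chosen = []
    while uncovered and len(chosen) < budget:
        counts = {}
        for site in uncovered:
            for reg in fault_fanout[site]:
                counts[reg] = counts.get(reg, 0) + 1
        if not counts:
            break
        # sorted() makes tie-breaking deterministic (alphabetical order)
        best = max(sorted(counts), key=counts.get)
        chosen.append(best)
        uncovered = {s for s in uncovered if best not in fault_fanout[s]}
    return chosen, uncovered

# Abbreviated first-cycle error sites from the two case studies
fanout = {
    "U600": {"INSTRHeld_reg[0]", "INSTRHeld_reg[31]",
             "IDarmDeint_reg[0]", "IDarmDeint_reg[31]"},
    "U3626": {"IDarmDeint_reg[3]", "IDarmDeint_reg[12]", "IDarmDeint_reg[31]"},
}
chosen, uncovered = place_detectors(fanout, budget=1)
```

With this data a single detector on `IDarmDeint_reg[31]` observes both fault sites, since that bit is the only element common to both fanout sets.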


Conclusion

• Design domain critical:
  ► Affects fault behavior
  ► Limits applicable tolerance techniques
• Key observations:
  ► Faults in combinational logic much more likely in embedded designs
  ► Faults in combinational logic behave dramatically differently than those in state elements
  ► Fault fanout offers potential for low overhead detection


Soft Error Terminology

[Figure: diagram with the labels transient fault, soft error, and transistor]


Dependence on Fault Duration

[Figure: Frequency of Expressed Errors (0 to 0.12) plotted against Fault Duration (1500 to 4500)]


Pulse Detection

[Figure: pulse detection circuit. Labels: flip-flop with D, CLK, Q, and ~Q; shadow latch; error output]
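The slide gives only the circuit labels, so the behavior sketched below is an assumption in the general style of shadow-latch detection schemes: the flip-flop samples D at the clock edge, the shadow latch samples D a short delay later, and a mismatch raises the error signal. The names `value_at`, `detect_pulse`, and `shadow_delay`, and the waveform representation, are hypothetical.

```python
def value_at(waveform, t):
    """Value of a signal at time t.

    waveform is a list of (time, value) change points sorted by time,
    with the first entry giving the initial value.
    """
    value = waveform[0][1]
    for time, v in waveform:
        if time <= t:
            value = v
        else:
            break
    return value

def detect_pulse(waveform, clock_edge, shadow_delay):
    """Return (q, shadow, error) for one clock edge.

    q is what the flip-flop captures at the edge, shadow is what the
    shadow latch captures shadow_delay later, and error flags a mismatch.
    """
    q = value_at(waveform, clock_edge)
    shadow = value_at(waveform, clock_edge + shadow_delay)
    return q, shadow, q != shadow
```

As a usage example, a signal that sits at 0 but glitches to 1 between times 4.9 and 5.2 would be captured as 1 by a flip-flop clocked at 5.0, while a shadow latch sampling 0.5 later sees 0, so the error output is raised within the same cycle.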


Microarchitectural Errors per Cycle

[Figure: two panels, "Faults Occurring in Registers" and "Faults Occurring in Combinational Logic". Each plots Relative Frequency (0 to 1) against Number of Microarchitectural Errors (log scale, 1 to 10000), with one series per cycle for Cycle 1 through Cycle 10]

Multi-bit errors common for faults in combinational logic