layali rashid , karthik pattabiraman and sathish...

28
T OWARDS UNDERSTANDING THE EFFECTS OF INTERMITTENT HARDWARE F AULTS ON PROGRAMS Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING THE UNIVERSITY OF BRITISH COLUMBIA

Upload: others

Post on 17-Jul-2020

24 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Layali Rashid , Karthik Pattabiraman and Sathish …webhost.laas.fr/TSF/WDSN10/WDSN10_files/Slides/WDSN10...Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan Created Date

TOWARDS UNDERSTANDING THE EFFECTS OF

INTERMITTENT HARDWARE FAULTS ON PROGRAMS

Layali Rashid, Karthik Pattabiraman and Sathish GopalakrishnanDEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING

THE UNIVERSITY OF BRITISH COLUMBIA

Page 2: Layali Rashid , Karthik Pattabiraman and Sathish …webhost.laas.fr/TSF/WDSN10/WDSN10_files/Slides/WDSN10...Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan Created Date

Motivation: Why Intermittent Faults?

� Intermittent faults are likely to be a significant concern in future processors� Do not persist forever unlike permanent faults

� Persist for longer duration than transient faults

� May impact program more than transient faults� May impact program more than transient faults

� Assumption:

� An intermittent fault affects two or more consecutive instructions in the program.

Page 3: Layali Rashid , Karthik Pattabiraman and Sathish …webhost.laas.fr/TSF/WDSN10/WDSN10_files/Slides/WDSN10...Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan Created Date

Contributions

� Study the impact of intermittent faults on programs.

� Model the propagation of intermittent faults in programs at the instruction-level.

� Validate the model using fault injections.� Validate the model using fault injections.

Page 4: Layali Rashid , Karthik Pattabiraman and Sathish …webhost.laas.fr/TSF/WDSN10/WDSN10_files/Slides/WDSN10...Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan Created Date

Motivation: Why Model Error Propagation?

� Fault injection experiments are prohibitively expensive.� Intermittent faults vary in location and duration.

� An order of magnitude slower than modeling.

� Modeling error propagation provides more insights that may help in tolerating faults.

Page 5: Layali Rashid , Karthik Pattabiraman and Sathish …webhost.laas.fr/TSF/WDSN10/WDSN10_files/Slides/WDSN10...Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan Created Date

Primary Research Questions

� Do all intermittent faults lead to program crash?

� How many instructions are executed before the program crashes? program crashes?

� How many variables are corrupted by the fault before the program crashes?

Page 6: Layali Rashid , Karthik Pattabiraman and Sathish …webhost.laas.fr/TSF/WDSN10/WDSN10_files/Slides/WDSN10...Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan Created Date

Approach

Crash ModelFault Model

Dynamic Dependency Graph

SimpleScalarsimulator

Evaluate using FI

Page 7: Layali Rashid , Karthik Pattabiraman and Sathish …webhost.laas.fr/TSF/WDSN10/WDSN10_files/Slides/WDSN10...Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan Created Date

Approach

Crash Model

Fault Model• Decoder•ALU Unit• Load/Store Unit

SimpleScalarsimulator

Evaluate using FI

Dynamic Dependency Graph

Page 8: Layali Rashid , Karthik Pattabiraman and Sathish …webhost.laas.fr/TSF/WDSN10/WDSN10_files/Slides/WDSN10...Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan Created Date

Approach

Fault Model

Crash Model•Memory address•Branch/jump address•Function call address

SimpleScalarsimulator

Evaluate using FI

Dynamic Dependency Graph

Page 9: Layali Rashid , Karthik Pattabiraman and Sathish …webhost.laas.fr/TSF/WDSN10/WDSN10_files/Slides/WDSN10...Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan Created Date

Approach

Crash ModelFault Model

Dynamic Dependency Graph is a directed acyclic graph that models the dynamic dependencies between instructions. [Agrawal '90]

SimpleScalarsimulator

Evaluate using FI

Page 10: Layali Rashid , Karthik Pattabiraman and Sathish …webhost.laas.fr/TSF/WDSN10/WDSN10_files/Slides/WDSN10...Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan Created Date

Code Fragment Node

mov R1, #5 1

mov R2, #6 2

mov R3, #7 3

ld R4, R1, Array_Addr 4AA

1

4

2

5

Array_Addr

#5 #6

3

6

#7

A

Example

ld R4, R1, Array_Addr 4

ld R5, R2, Array_Addr 5

ld R6, R3, Array_Addr 6

mult R7, R5, R4 7

4 5

7

6

R R...

Page 11: Layali Rashid , Karthik Pattabiraman and Sathish …webhost.laas.fr/TSF/WDSN10/WDSN10_files/Slides/WDSN10...Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan Created Date

Code Fragment Node

mov R1, #5 1

mov R2, #6 2

mov R3, #7 3

ld R4, R1, Array_Addr 4

1

4

2

5

Array_Addr

#5 #6

3

6

#7

A A A

Example

ld R4, R1, Array_Addr 4

ld R5, R2, Array_Addr 5

ld R6, R3, Array_Addr 6

mult R7, R5, R4 7

4 5

7

6

R R...

A node is a value produced by a dynamic instruction

Page 12: Layali Rashid , Karthik Pattabiraman and Sathish …webhost.laas.fr/TSF/WDSN10/WDSN10_files/Slides/WDSN10...Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan Created Date

Code Fragment Node

mov R1, #5 1

mov R2, #6 2

mov R3, #7 3

ld R4, R1, Array_Addr 4AA

1

4

2

5

Array_Addr

#5 #6

3

6

#7

A

Example

ld R4, R1, Array_Addr 4

ld R5, R2, Array_Addr 5

ld R6, R3, Array_Addr 6

mult R7, R5, R4 7

4 5

7

6

R R...

The edges represent the instructions’ operands:•A is an address operand• R is a regular operand.

Page 13: Layali Rashid , Karthik Pattabiraman and Sathish …webhost.laas.fr/TSF/WDSN10/WDSN10_files/Slides/WDSN10...Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan Created Date

DDG Metrics

� Intermittent Propagation Set (IPS): set of program values to which an intermittent fault propagates,

� Crash Distance (CD): number of instructions � Crash Distance (CD): number of instructions that execute from the time an intermittent fault occurs until the program crashes (due to fault).

Page 14: Layali Rashid , Karthik Pattabiraman and Sathish …webhost.laas.fr/TSF/WDSN10/WDSN10_files/Slides/WDSN10...Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan Created Date

Example

Code Fragment Node

mov R1, #5 1

mov R2, #6 2

mov R3, #7 3

ld R4, R1, Array_Addr 4AA

1 2

5

Array_Addr

#5 #6

3

6

#7

A

Intermittent Error

4ld R4, R1, Array_Addr 4

ld R5, R2, Array_Addr 5

ld R6, R3, Array_Addr 6

mult R7, R5, R4 7

5

7

6

R R...

4

Intermittent Propagation Set (1,2) = {?}Crash Distance (1, 2) = ?

Page 15: Layali Rashid , Karthik Pattabiraman and Sathish …webhost.laas.fr/TSF/WDSN10/WDSN10_files/Slides/WDSN10...Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan Created Date

Example

Code Fragment Node

mov R1, #5 1

mov R2, #6 2

mov R3, #7 3

ld R4, R1, Array_Addr 4AA

1 2

5

Array_Addr

#5 #6

3

6

#7

A

4

Transient Error

Crash Nodeld R4, R1, Array_Addr 4

ld R5, R2, Array_Addr 5

ld R6, R3, Array_Addr 6

mult R7, R5, R4 7

5

7

6

R R...

Transient Propagation Set (1) = {1, 4}Transient Crash Distance (1) = 4

4Crash Node

Page 16: Layali Rashid , Karthik Pattabiraman and Sathish …webhost.laas.fr/TSF/WDSN10/WDSN10_files/Slides/WDSN10...Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan Created Date

Example

Code Fragment Node

mov R1, #5 1

mov R2, #6 2

mov R3, #7 3

ld R4, R1, Array_Addr 4AA

1

4

2Array_Addr

#5 #6

3

6

#7

A

5

Transient Error

ld R4, R1, Array_Addr 4

ld R5, R2, Array_Addr 5

ld R6, R3, Array_Addr 6

mult R7, R5, R4 7

4

7

6

R R...

5

Transient Propagation Set (1) = {1, 4}Transient Crash Distance (1) = 4

Transient Propagation Set (2) = {2, 5}Transient Crash Distance (2) = 4

Page 17: Layali Rashid , Karthik Pattabiraman and Sathish …webhost.laas.fr/TSF/WDSN10/WDSN10_files/Slides/WDSN10...Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan Created Date

Example

Code Fragment Node

mov R1, #5 1

mov R2, #6 2

mov R3, #7 3

ld R4, R1, Array_Addr 4AA

1 2

5

Array_Addr

#5 #6

3

6

#7

A

4

Intermittent Error

Crash Nodeld R4, R1, Array_Addr 4

ld R5, R2, Array_Addr 5

ld R6, R3, Array_Addr 6

mult R7, R5, R4 7

5

7

6

R R...

Intermittent Propagation Set (1,2) = {1, 2, 4}Crash Distance (1, 2) = 4

4Crash Node

Page 18: Layali Rashid , Karthik Pattabiraman and Sathish …webhost.laas.fr/TSF/WDSN10/WDSN10_files/Slides/WDSN10...Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan Created Date

Approach

Crash ModelFault Model

Dynamic

SimpleScalarsimulator

Evaluate using FI

Dynamic Dependency Graph

Page 19: Layali Rashid , Karthik Pattabiraman and Sathish …webhost.laas.fr/TSF/WDSN10/WDSN10_files/Slides/WDSN10...Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan Created Date

Experimental Setup

� Evaluating the Model’s Accuracy� Intermittent fault injections in instruction level

simulator (SimpleScalar)

� Measure the difference between the predicted and the actual CD for crashesactual CD for crashes

� Computation of Intermittent Fault Propagation� Construct the DDG of each program.

� Find the IPS and the CD for each fault

Page 20: Layali Rashid , Karthik Pattabiraman and Sathish …webhost.laas.fr/TSF/WDSN10/WDSN10_files/Slides/WDSN10...Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan Created Date

Benchmarks

� Preliminary results for two programs: Matrix Multiply and Insertion Sort.

� Each program has about 11,000 static MIPS instructions.

Page 21: Layali Rashid , Karthik Pattabiraman and Sathish …webhost.laas.fr/TSF/WDSN10/WDSN10_files/Slides/WDSN10...Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan Created Date

Results: DDG Model Vs. SimpleScalar

� 88% of the expected CD fall within 10 nodes from the actual ones and 97% fall within 100 nodes.

Page 22: Layali Rashid , Karthik Pattabiraman and Sathish …webhost.laas.fr/TSF/WDSN10/WDSN10_files/Slides/WDSN10...Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan Created Date

Results: CD Absolute values

� 95% of the faults cause program to crash within 10 nodes of the fault’s start.

Page 23: Layali Rashid , Karthik Pattabiraman and Sathish …webhost.laas.fr/TSF/WDSN10/WDSN10_files/Slides/WDSN10...Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan Created Date

Results: Effect of Fault Length

Page 24: Layali Rashid , Karthik Pattabiraman and Sathish …webhost.laas.fr/TSF/WDSN10/WDSN10_files/Slides/WDSN10...Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan Created Date

Conclusions and Discussion� We enhanced Dynamic Dependency Graph to model intermittent

fault propagation in programs.

� 88% of the expected faults' CDs fall within 10 nodes of the actual CDs.

� The majority of the intermittent faults cause programs to crash The majority of the intermittent faults cause programs to crash within few hundreds of dynamic instructions.

� Discussion� Detection using software-based techniques of intermittent faults

can be efficient.

� Diagnosis of intermittent faults is possibly feasible using software-based techniques.

� Recovery using check-pointing techniques on the order of thousands of instructions will be effective.

Page 25: Layali Rashid , Karthik Pattabiraman and Sathish …webhost.laas.fr/TSF/WDSN10/WDSN10_files/Slides/WDSN10...Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan Created Date

THANKYOU

Page 26: Layali Rashid , Karthik Pattabiraman and Sathish …webhost.laas.fr/TSF/WDSN10/WDSN10_files/Slides/WDSN10...Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan Created Date

BACKUP SLIDES

Page 27: Layali Rashid , Karthik Pattabiraman and Sathish …webhost.laas.fr/TSF/WDSN10/WDSN10_files/Slides/WDSN10...Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan Created Date

Insertion Sort CD

Page 28: Layali Rashid , Karthik Pattabiraman and Sathish …webhost.laas.fr/TSF/WDSN10/WDSN10_files/Slides/WDSN10...Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan Created Date

Insertion Sort IPS