Page 1: New challenges for designers of fault tolerant Embedded Systems based on future technologies

Carlos Arthur Lang Lisbôa, Luigi Carro

Instituto de Informática, Programa de Pós-Graduação em Computação
Universidade Federal do Rio Grande do Sul - Porto Alegre, RS, Brazil

IESS - Schloß Langenargen, Germany, September 15th, 2009

Page 2: Outline

• Introduction: concepts and definitions
• Motivation: new challenges imposed by future technologies
• Radiation induced faults: the major challenges
• Existing mitigation techniques vs. the new scenario
• Desired properties of new radiation induced faults mitigation techniques
• Recent solutions working at different abstraction levels to deal with transient faults
• Conclusions

Page 3: Concepts and Definitions

• Faults
• Errors
• Failures

Page 4: Concepts and Definitions

• Duration of errors and faults
  o Permanent
  o Transient
  o Intermittent

Page 5: Technology trends (1)

• Transistor size

Device sizes are decreasing
Node capacitances are decreasing

Page 6: Technology trends (2)

• Transistor Vth

[Figure: Power Supply and Threshold Voltage trends]

Node voltages are decreasing

Page 7: Single event upset

A transistor changes from OFF to ON state!

Page 8: SEE and Technology trends (1)

• Consequences of C and V reduction

HIGH C + HIGH V → HIGH Q = C·V

Page 9: SEE and Technology trends (2)

• Consequences of C and V reduction

LOW C + LOW V → LOW Q = C·V
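As a rough numeric illustration of the two cases above, the sketch below evaluates Q = C·V for a "high" node and a "low" node. The capacitance and voltage values are illustrative assumptions, not figures from the slides.

```python
def node_charge(capacitance_farads: float, voltage_volts: float) -> float:
    """Charge stored at a circuit node, Q = C * V (in coulombs)."""
    return capacitance_farads * voltage_volts

# Hypothetical values, chosen only to show the trend.
q_high = node_charge(10e-15, 1.8)  # high C, high V
q_low = node_charge(1e-15, 0.9)    # low C, low V
charge_ratio = q_high / q_low

print(f"high node: {q_high:.2e} C, low node: {q_low:.2e} C, ratio: {charge_ratio:.1f}x")
```

The less charge a node stores, the less charge a particle strike has to deposit to disturb its logic value.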

Page 10: Concepts and Definitions

• Radiation Induced Faults
  o Single Event Effects - SEEs
  o Single Event Transients - SETs
  o Single Event Upsets - SEUs
  o Soft Error - SE
  o Multiple Bit Upsets - MBUs
• Soft Error Rate - SER

Page 11: The Soft Error Problem

Single Event Upset (SEU)

[Figure: flip-flops with D, Q and CLK signals and logic values 0 and 1]

Page 12: The Soft Error Problem

[Figure labels: Transient Fault, Soft Error]

Page 13: Concepts and Definitions

• Masking of faults and errors
  o Logical
  o Latching window
  o Electrical
  o Architectural
  o Software

Page 14: Example of Fault Masking in Microprocessors

• Logical: faulty value does not affect logical operation of the circuit

[Figure: circuit example with logic values 0 and 0]

[Blome et al, CASES, 2006]
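Logical masking can be illustrated with a single gate. The sketch below is a minimal example assuming a 2-input AND gate (the gate type and the helper names are hypothetical, not taken from the slide): a flipped input is masked whenever the other input already forces the output.

```python
def and_gate(a: int, b: int) -> int:
    return a & b

def is_logically_masked(gate, inputs, faulty_index):
    """True if flipping inputs[faulty_index] leaves the gate output unchanged."""
    faulty = list(inputs)
    faulty[faulty_index] ^= 1  # inject a bit flip on one input
    return gate(*inputs) == gate(*faulty)

# Other input is 0: the AND output stays 0, so the fault is masked.
masked_when_other_is_0 = is_logically_masked(and_gate, (0, 1), faulty_index=1)
# Other input is 1: the flipped input reaches the output.
masked_when_other_is_1 = is_logically_masked(and_gate, (1, 1), faulty_index=1)

print(masked_when_other_is_0, masked_when_other_is_1)
```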

Page 15: Example of Fault Masking in Microprocessors

• Latching-Window: the fault pulse does not reach a state element within the latching window

[Figure: CLK waveform with tsetup and thold around the clock edge]

[Blome et al, CASES, 2006]
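The latching-window condition can be written as a simple interval test. This is a simplified sketch: it assumes that any overlap between the pulse and the interval [clock edge - tsetup, clock edge + thold] leads to a capture, and all time values are made-up numbers in arbitrary units.

```python
def is_latched(pulse_start, pulse_end, clk_edge, t_setup, t_hold):
    """True if the transient pulse overlaps the latching window of a state element."""
    window_start = clk_edge - t_setup
    window_end = clk_edge + t_hold
    return pulse_start < window_end and pulse_end > window_start

# Hypothetical timings: clock edge at t = 1000, tsetup = 30, thold = 20.
early_pulse_latched = is_latched(100, 150, clk_edge=1000, t_setup=30, t_hold=20)
edge_pulse_latched = is_latched(980, 1010, clk_edge=1000, t_setup=30, t_hold=20)

print(early_pulse_latched, edge_pulse_latched)
```

Under this model, a wider pulse overlaps the window for a larger range of arrival times, which is one reason long duration transients are harder to mask.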

Page 16: Example of Fault Masking in Microprocessors

• Electrical: the fault pulse is electrically attenuated by subsequent gates in the circuit

[Blome et al, CASES, 2006]

Page 17: Example of Fault Masking in Microprocessors

• Architectural/Software: incorrect state is written before it is read

[Figure: register file (entries 0 to 5) and decoder executing the sequence below]

mov r5, 8
mov r2, 4
add r6, r2, r5

[Blome et al, CASES, 2006]

Page 18: Example of Fault Masking in Microprocessors

• Architectural/Software: incorrect state is written before it is read

[Figure: register file (entries 0 to 5) and decoder executing the sequence below, next animation step]

mov r5, 8
mov r2, 4
add r6, r2, r5

[Blome et al, CASES, 2006]

Page 19: Example of Fault Masking in Microprocessors

• Architectural/Software: incorrect state is written before it is read

[Figure: register file (entries 0 to 5) and decoder executing the sequence below, next animation step]

mov r5, 8
mov r2, 4
add r6, r2, r5

[Blome et al, CASES, 2006]

Page 20: Example of Fault Masking in Microprocessors

• Architectural/Software: incorrect state is written before it is read

[Figure: register file (entries 0 to 5) and decoder executing the sequence below, final animation step]

mov r5, 8
mov r2, 4
add r6, r2, r5

[Blome et al, CASES, 2006]
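The three-instruction sequence from the slides can be replayed in a toy interpreter to show when a register-file upset is masked. This is a sketch under stated assumptions: the interpreter, the fault-injection parameters, and the initial register value of 0 are all hypothetical; only the instruction sequence comes from the slides.

```python
def run(program, fault_before_step=None, fault_reg=None, fault_mask=0):
    """Execute a tiny mov/add program, optionally flipping bits in one register."""
    regs = {}
    for step, (op, dst, *src) in enumerate(program):
        if step == fault_before_step:
            # Bit flip in the register file just before this instruction executes.
            regs[fault_reg] = regs.get(fault_reg, 0) ^ fault_mask
        if op == "mov":
            regs[dst] = src[0]
        elif op == "add":
            regs[dst] = regs[src[0]] + regs[src[1]]
    return regs

PROGRAM = [("mov", "r5", 8), ("mov", "r2", 4), ("add", "r6", "r2", "r5")]

golden = run(PROGRAM)
# Upset in r5 before "mov r5, 8": overwritten before it is read, so it is masked.
masked = run(PROGRAM, fault_before_step=0, fault_reg="r5", fault_mask=1)
# Upset in r5 after the mov but before the add: the wrong value is read.
visible = run(PROGRAM, fault_before_step=2, fault_reg="r5", fault_mask=1)

print(golden["r6"], masked["r6"], visible["r6"])
```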

Page 21: Outline

• Introduction: concepts and definitions
• Motivation: new challenges imposed by future technologies
• Radiation induced faults: the major challenges
• Existing mitigation techniques vs. the new scenario
• Desired properties of new radiation induced faults mitigation techniques
• Recent solutions working at different abstraction levels to deal with transient faults
• Conclusions

Page 22: Motivation: Future Technologies

• The good news:
  o Smaller devices ☺
    → Denser circuits, less area
  o Faster devices ☺
    → Higher performance
  o Less power consumption ☺
    → Longer battery life (portable systems)

Page 23: Motivation: Future Technologies

• The bad news:
  o Higher defect rates
    → Lower yield
  o Higher sensitivity to radiation
    → Increased SER: combinational logic
    → Multiple simultaneous faults
    → Long duration transients

Page 24: Outline

• Introduction: concepts and definitions
• Motivation: new challenges imposed by future technologies
• Radiation induced faults: the major challenges
• Existing mitigation techniques vs. the new scenario
• Desired properties of new radiation induced faults mitigation techniques
• Recent solutions working at different abstraction levels to deal with transient faults
• Conclusions

Page 25: Major Challenges

• Long Duration Transients (LDTs)
Transient widths and device speeds scale at different paces, so transient pulses will last longer than the cycle times of circuits.
Temporal redundancy techniques will not cope.

• Multiple Simultaneous Faults
Smaller distances between devices will allow a single particle to affect more than one device.
The single fault model will fail.

Page 26: Transient width studies

[Figures: transient width measurements from DODD, 2004 and FERLET-CAVROIS, 2006]

Page 27: Propagation delay (*) vs. Technologies

Technology (nm)      180      130      90       32      180/32
10-inverter chain    508.4    157.8    120.2    79.6    6.39

[Figure: inverter chain between two clocked elements (in, out, clk) and simulated waveforms for 32 nm, 90 nm, 130 nm and 180 nm]

(*) simulated using parameters from PTM web site and HSPICE tool

Page 28: Transient widths vs. Propagation delays

Cycle time and transient width scaling across technologies

[Figure: cycle time (ps), 0 to 600, vs. technology (180nm, 130nm, 100nm, 90nm, 70nm, 32nm). Series: Width 20MeV, Width 10MeV (*), Cycle 10 Inv, Cycle 8 Inv, Cycle 6 Inv, Cycle 4 Inv. Annotations: transient width scaling: max. 1.37 x; 6.39 x]

(*) 180, 130, and 100nm from [DODD, 2004], 70 nm from [Ferlet-Cavrois 2006]
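The two scaling factors quoted in this part of the talk can be compared directly. The sketch below recomputes the 180/32 ratio from the 10-inverter chain delays listed in the propagation delay table and sets it against the "max. 1.37 x" transient width scaling from the chart. Reading 6.39 x as the cycle-time scaling is an interpretation of the chart annotations, and the variable names are hypothetical.

```python
# 10-inverter chain delays per technology node (nm), as listed in the table.
delay_10_inv = {180: 508.4, 130: 157.8, 90: 120.2, 32: 79.6}

cycle_scaling = delay_10_inv[180] / delay_10_inv[32]  # how much faster the logic gets
width_scaling_max = 1.37                              # how much narrower transients get, at most

# How much longer a transient becomes relative to the cycle time.
relative_growth = cycle_scaling / width_scaling_max

print(f"cycle scaling: {cycle_scaling:.2f}x, relative transient growth: {relative_growth:.2f}x")
```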

Page 29: Single event, multiple effects [Rossi 2005 *]

[*] Multiple Transient Faults in Logic: An Issue for Next Generation ICs?, Daniele Rossi et al, DFT 2005

Page 30: Outline

• Introduction: concepts and definitions
• Motivation: new challenges imposed by future technologies
• Radiation induced faults: the major challenges
• Existing mitigation techniques vs. the new scenario
• Desired properties of new radiation induced faults mitigation techniques
• Recent solutions working at different abstraction levels to deal with transient faults
• Conclusions

Page 31: LDT Effects on Temporal Redundancy

• Time Redundancy [Anghel et al, 2000]

Page 32: LDT Effects on Temporal Redundancy

• Time Redundancy [Anghel et al, 2000]

Increase delay? ⇒ Higher performance penalty!!!

Page 33: LDT Effects on Space Redundancy

• Space Redundancy [Nieuwland et al, 2006]

Page 34: LDT Effects on Space Redundancy

• Space Redundancy [Nieuwland et al, 2006]

Cannot cope with long duration transients!!!

Page 35: LDT Effects on Space Redundancy

- DMR can cope with LDTs affecting one of the modules
- allows detection only, requires recomputation
- area and power overheads above 100% (too much for Embedded Systems, ES)
- weak point: comparator

Page 36: LDT Effects on Space Redundancy

- TMR can cope with LDTs affecting one of the modules
- allows detection and correction
- area and power overheads above 200% (too much for ES)
- weak point: voter

Page 37: Multiple simultaneous errors [Sorin 2009 *]

• It is an interesting open problem.
• If forecasts of greatly increased fault rates come to pass, error detection schemes targeting single error scenarios may be insufficient.
• Most current schemes assume a single error scenario.
• Some existing schemes may do well, but there are no results demonstrating that capability.

[*] Fault Tolerant Computer Architecture, Daniel J. Sorin, Morgan & Claypool, 2009

Page 38: Multiple Effects vs. Space Redundancy

- DMR: what if a single particle affects two modules?
- different output bits affected (O1i, O2j) → OK
- same output bit affected (O1k, O2k) → PROBLEM! Comparator will not detect error

Page 39: Multiple Effects vs. Space Redundancy

- TMR: what if a single particle affects two modules?
- different output bits affected (O1i, O2j) → no majority!
- same output bit affected (O1k, O2k) → EVEN WORSE → Voter will select erroneous output!
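The DMR and TMR cases from the last two slides can be reproduced with a few lines of code. This is a minimal sketch assuming word-level comparison and word-level majority voting (a bitwise voter would behave differently when different bits are hit); the 4-bit output value and the helper names are hypothetical.

```python
from collections import Counter

def flip(word: int, bit: int) -> int:
    """Return word with one bit inverted, modelling an upset on a module output."""
    return word ^ (1 << bit)

def dmr_detects(out1: int, out2: int) -> bool:
    """DMR comparator: an error is detected only if the two outputs differ."""
    return out1 != out2

def tmr_vote(out1: int, out2: int, out3: int):
    """Word-level TMR voter: returns the majority word, or None if there is no majority."""
    value, count = Counter([out1, out2, out3]).most_common(1)[0]
    return value if count >= 2 else None

GOLDEN = 0b1010  # hypothetical fault-free module output

# One particle affects two modules.
dmr_diff_bits = dmr_detects(flip(GOLDEN, 0), flip(GOLDEN, 1))    # detected
dmr_same_bit = dmr_detects(flip(GOLDEN, 0), flip(GOLDEN, 0))     # escapes detection
tmr_diff_bits = tmr_vote(flip(GOLDEN, 0), flip(GOLDEN, 1), GOLDEN)  # no majority
tmr_same_bit = tmr_vote(flip(GOLDEN, 0), flip(GOLDEN, 0), GOLDEN)   # wrong word wins

print(dmr_diff_bits, dmr_same_bit, tmr_diff_bits, bin(tmr_same_bit))
```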

Page 40: Outline

• Introduction: concepts and definitions
• Motivation: new challenges imposed by future technologies
• Radiation induced faults: the major challenges
• Existing mitigation techniques vs. the new scenario
• Desired properties of new radiation induced faults mitigation techniques
• Recent solutions working at different abstraction levels to deal with transient faults
• Conclusions

Page 41: Analysis

• Currently known mitigation techniques based on temporal redundancy cannot cope with LDTs.
• Space redundancy based mitigation techniques:
  - are able to cope with LDTs;
  - may fail when subject to multiple faults;
  - impose very high area and power overheads;
  - are not suited for the Embedded Systems arena.
• The development of new low cost techniques to face those new challenges is mandatory.

Page 42: Desired properties of new approaches

• Tolerance to LDTs and multiple simultaneous faults
• Error detection area overhead << DMR
• Error correction area overhead << TMR
• Low performance overhead
• Additional concern for Embedded Systems: low power consumption

Page 43: Suggested approach

Work at higher abstraction levels with low cost

System Level
Algorithm Level
Architecture Level
Circuit Level
Component Level
Technology Level

"Computer users do not notice if a transistor fails or a bit of SRAM is flipped by a cosmic ray; they notice when their programs crash" [Sorin, 2009]

Page 44: Outline

• Introduction: concepts and definitions
• Motivation: new challenges imposed by future technologies
• Radiation induced faults: the major challenges
• Existing mitigation techniques vs. the new scenario
• Desired properties of new radiation induced faults mitigation techniques
• Recent solutions working at different abstraction levels to deal with transient faults
• Conclusions

Page 45: Recently proposed solutions (1 of 6)

Working at circuit level with low cost to cope with increased SER in combinational logic

System Level
Algorithm Level
Architecture Level
Circuit Level: Combinational Hamming
Component Level
Technology Level

Page 46: SER evolution [*]

[Figure: SER evolution]

[*] Baumann, R., "Soft Errors in Advanced Computer Systems", IEEE Design and Test of Computers, vol. 22, no. 3, IEEE Computer Society, New-York-London, May-June 2005, pp 258-266.

Page 47: SER Trend: Latches & Chip impact

[Figure: "SER Trend: Full Chip"; normalized SER vs. Technology (nm) at 180, 130, 90, 65, 45, 32; series: logic, cache arrays]

Source: Intel Barcelona

Page 48: Combinational Hamming

Conventional Hamming applications:
- data storage and communications hardening
- number of inputs = number of outputs

Combinational logic: number of inputs ≠ number of outputs

Page 49: Combinational Hamming

Hamming codeword for 4-output circuits

k1 = s3 ⊕ s2 ⊕ s0
k2 = s3 ⊕ s1 ⊕ s0
k3 = s2 ⊕ s1 ⊕ s0
P = k1 ⊕ k2 ⊕ s3 ⊕ k3 ⊕ s2 ⊕ s1 ⊕ s0
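The check-bit equations above can be exercised in a few lines. The sketch below computes k1, k2, k3 and P for a 4-bit output and uses the standard Hamming syndrome to locate a single flipped bit. The codeword ordering (k1, k2, s3, k3, s2, s1, s0) is read off the order of terms in the P equation, and the function names are hypothetical; how the slides' circuit generates and checks these bits in hardware is not shown here.

```python
def check_bits(s3, s2, s1, s0):
    """Hamming check bits and overall parity for the 4 output bits s3..s0."""
    k1 = s3 ^ s2 ^ s0
    k2 = s3 ^ s1 ^ s0
    k3 = s2 ^ s1 ^ s0
    p = k1 ^ k2 ^ s3 ^ k3 ^ s2 ^ s1 ^ s0
    return k1, k2, k3, p

# Assumed codeword positions 1..7, following the term order of the P equation.
POSITIONS = ["k1", "k2", "s3", "k3", "s2", "s1", "s0"]

def locate_single_error(s3, s2, s1, s0, k1, k2, k3):
    """Return the name of the single flipped bit, or None if the syndrome is zero."""
    c1 = k1 ^ s3 ^ s2 ^ s0
    c2 = k2 ^ s3 ^ s1 ^ s0
    c3 = k3 ^ s2 ^ s1 ^ s0
    syndrome = c1 + 2 * c2 + 4 * c3
    return None if syndrome == 0 else POSITIONS[syndrome - 1]

k1, k2, k3, p = check_bits(1, 0, 1, 1)
no_error = locate_single_error(1, 0, 1, 1, k1, k2, k3)
flipped_s1 = locate_single_error(1, 0, 0, 1, k1, k2, k3)  # s1 upset from 1 to 0

print((k1, k2, k3, p), no_error, flipped_s1)
```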

Page 50: Combinational Hamming

[Figure: ripple carry adder]

Ripple carry adder: 7 inputs and 4 outputs

Page 51: Combinational Hamming: Experiments

Sample circuits: adders and multipliers

ID         I (inputs)   O (outputs)   Area (μm²)     Power (mW)   Delay (ns)
4+4        8            5             263.758        0.334        0.780
5+5        10           6             445.549        1.165        1.320
6+6        12           7             493.513        3.572        1.670
7+7        14           8             575.765        4.168        1.482
4+4+cin    9            5             296.758        0.394        0.830
5+5+cin    11           6             487.286        1.579        1.520
6+6+cin    13           7             590.279        3.712        1.130
4×4        8            8             2,993.088      8.357        2.940
5×5        10           10            6,993.088      8.357        2.940
6×6        12           12            27,865.910     29.278       5.600
7×7        14           14            121,649.969    112.609      13.250

Page 52: Combinational Hamming: Results

Areas (µm²)

ID         Standard       Hamming        Hamming overhead
4+4        263.758        498.449        88.980%
5+5        445.549        924.943        107.596%
6+6        493.513        1,207.267      144.627%
7+7        575.765        1,408.478      144.627%
4+4+cin    296.758        516.449        74.030%
5+5+cin    487.286        938.179        92.532%
6+6+cin    590.279        1,417.765      140.186%
4×4        2,993.088      3,796.460      26.841%
5×5        6,993.088      11,810.657     68.890%
6×6        27,865.910     48,609.331     74.440%
7×7        121,649.969    176,320.018    44.940%
Mean       14,786.815     22,495.272     91.608%

Page 53: Combinational Hamming: Results

Power (mW)

ID         Standard    Hamming    Hamming overhead
4+4        0.334       0.697      108.692%
5+5        1.165       1.598      37.246%
6+6        3.572       6.990      95.658%
7+7        4.168       8.155      95.658%
4+4+cin    0.394       0.807      104.831%
5+5+cin    1.579       1.911      21.006%
6+6+cin    3.712       7.812      110.427%
4×4        8.357       11.989     43.472%
5×5        8.357       11.989     43.472%
6×6        29.278      41.365     41.285%
7×7        112.609     97.835     87.120%
Mean       15.775      17.377     71.715%

Page 54: Combinational Hamming: Results

Propagation Delays (ns)

ID         Standard    Hamming    Hamming overhead
4+4        0.780       1.120      43.590%
5+5        1.320       1.760      33.333%
6+6        1.670       2.170      29.940%
7+7        1.482       2.170      46.457%
4+4+cin    0.830       1.200      44.578%
5+5+cin    1.520       1.870      23.026%
6+6+cin    1.130       1.700      50.442%
4×4        2.940       3.690      25.510%
5×5        2.940       3.690      25.510%
6×6        5.600       6.900      23.214%
7×7        13.250      14.180     7.019%
Mean       3.042       3.677      32.056%

Page 55: Combinational Hamming vs. TMR

Areas (µm²)

ID         TMR            Hamming        Reduction over TMR
4+4        952.474        498.449        47.668%
5+5        1,530.087      924.943        39.550%
6+6        1,706.219      1,207.267      29.243%
7+7        1,985.216      1,408.478      29.052%
4+4+cin    1,051.474      516.449        50.883%
5+5+cin    1,655.298      938.179        43.323%
6+6+cin    1,996.517      1,417.765      28.988%
4×4        9,237.184      3,796.460      58.900%
5×5        21,301.664     11,810.657     44.555%
6×6        83,984.610     48,609.331     42.121%
7×7        365,401.266    176,320.018    51.746%
Mean       44,618.364     22,495.272     42.366%
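The percentage columns in the result tables appear to follow the usual relative-difference formulas. That is an assumption inferred from the data rather than stated on the slides; the sketch below checks it against the area figures of the 4+4 adder, with hypothetical function names.

```python
def overhead_pct(standard: float, hardened: float) -> float:
    """Overhead of the hardened circuit relative to the unprotected one, in percent."""
    return (hardened - standard) / standard * 100.0

def reduction_pct(tmr: float, hardened: float) -> float:
    """Saving of the hardened circuit relative to the TMR version, in percent."""
    return (tmr - hardened) / tmr * 100.0

# Area values (µm²) for the 4+4 adder, taken from the tables.
area_standard, area_hamming, area_tmr = 263.758, 498.449, 952.474

area_overhead_4p4 = overhead_pct(area_standard, area_hamming)
area_reduction_4p4 = reduction_pct(area_tmr, area_hamming)

print(f"overhead: {area_overhead_4p4:.3f}%, reduction over TMR: {area_reduction_4p4:.3f}%")
```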

Page 56: Combinational Hamming vs. TMR

Power (mW)

ID         TMR        Hamming    Reduction over TMR
4+4        1.103      0.697      36.788%
5+5        3.615      1.598      55.781%
6+6        10.858     6.990      35.628%
7+7        12.665     8.155      35.611%
4+4+cin    1.283      0.807      37.083%
5+5+cin    4.858      1.911      60.668%
6+6+cin    11.278     7.812      30.735%
4×4        25.231     11.989     52.482%
5×5        25.271     11.989     52.557%
6×6        88.075     41.365     53.034%
7×7        338.110    97.835     71.064%
Mean       47.486     17.377     47.403%

Page 57: Combinational Hamming vs. TMR

Propagation Delays (ns)

ID         TMR       Hamming    Overhead over TMR
4+4        1.090     1.120      2.752%
5+5        1.630     1.760      7.975%
6+6        1.980     2.170      9.596%
7+7        1.792     2.170      21.116%
4+4+cin    1.140     1.200      5.263%
5+5+cin    1.830     1.870      2.186%
6+6+cin    1.440     1.700      18.056%
4×4        3.250     3.690      13.538%
5×5        3.250     3.690      13.538%
6×6        5.910     6.900      16.751%
7×7        13.560    14.180     4.572%
Mean       3.352     3.677      9.705%

Page 58: Recently proposed solutions (2 of 6)

Working at algorithm level with low cost error detection for matrix multiplication algorithm

System Level
Algorithm Level: Matrix Multiplication Hardening
Architecture Level
Circuit Level
Component Level
Technology Level

Page 59: Fault-Tolerant Matrix Multiplication

• MxM is a widely used algorithm:
  • signal and image processing,
  • weather prediction,
  • finite element analysis,
  • control systems, etc.
• Error correction ↔ System performance
• Computational cost: O(n³)

[Figure: A × B ⇒ C, with A, B and C of size n × n]

Page 60: Alternative approaches

• Duplication With Comparison (DWC)
  Detection only, > 100% overhead
• Triple Modular Redundancy (TMR)
  Correction, > 200% overhead
• Freivalds, 1979
  Detection only, probabilistic, overhead < 100%
• Subject technique (Lisboa, ETS 2007)
  Detection only, deterministic, overhead << 100%

Page 61: Freivalds' technique [*]

[Figure: A × B ⇒ C; C × r ⇒ Cr, with A, B and C of size n × n and r = (r1, ..., rn)]

Vector r: random 0's and 1's

[*] Freivalds, R. 1979. Fast probabilistic algorithms. In Mathematical Foundations of CS. Lecture Notes in Computer Science, vol. 74. Springer-Verlag, New York, pp. 57-69.

Page 62: Freivalds' technique

[Figure: A × B ⇒ C; C × r ⇒ Cr. In parallel, r is multiplied through A and B to give the vectors Ar = (Ar1, ..., Arn) and ABr = (ABr1, ..., ABrn)]

Vector r: random 0's and 1's

Page 63: Freivalds' technique

[Figure: A × B ⇒ C; C × r ⇒ Cr. In parallel, r is multiplied through A and B to give the vectors Ar and ABr, and Cr is compared with ABr]

If Cr = ABr, OK; otherwise, ERROR
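A compact way to see the check in action is the textbook formulation A·(B·r) = C·r. The sketch below uses that formulation with plain Python lists and exact integer arithmetic; the order in which the slide's figure builds its intermediate vector may differ, and the helper names and test matrices are hypothetical.

```python
import random

def mat_vec(m, v):
    """Multiply matrix m (list of rows) by column vector v."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]

def freivalds_check(a, b, c, rng=random):
    """One round of the check: True means no error was observed."""
    r = [rng.randint(0, 1) for _ in range(len(c[0]))]  # random 0/1 vector
    abr = mat_vec(a, mat_vec(b, r))  # two matrix-vector products, O(n^2) each
    cr = mat_vec(c, r)
    return abr == cr

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C_GOOD = [[19, 22], [43, 50]]
C_BAD = [[20, 22], [43, 50]]  # one corrupted element

ok = freivalds_check(A, B, C_GOOD)
# The check is probabilistic: a single round may miss the error, repeated rounds rarely do.
caught = any(not freivalds_check(A, B, C_BAD) for _ in range(64))

print(ok, caught)
```

A correct product always passes; a faulty one is caught in a given round only if the random vector happens to select the corrupted column, which is why the technique is described as probabilistic.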

Page 64: Basic subject technique [*]

• The main difference with respect to Freivalds' technique is that here the vector r has only 1's.
• This means that to calculate Ar and Cr only additions are needed, no multiplications.
• The computational cost of verification is thereby significantly decreased.

[*] Lisbôa, C. A., Erigson, M. I., and Carro, L., "System level approaches for mitigation of long duration transient faults in future technologies", in Proceedings of the 12th IEEE European Test Symposium - ETS 2007, pp. 165-170, IEEE Computer Society, Los Alamitos, CA, May 2007.

Page 65: Basic subject technique

[Figure: A × B ⇒ C; the elements of C are summed to give Cr, the elements of A are summed to give Ar, and Ar is combined with B to give ABr]

Cr_i = Σ C_ik, k = 1...n
Ar_i = Σ A_ik, k = 1...n

If Cr = ABr, OK; otherwise, ERROR
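With an all-ones vector the products with r reduce to plain sums. The sketch below is one consistent way to realise this, not necessarily the exact indexing used on the slide: it assumes the sums are taken down each column, so that the summed vector of A, multiplied by B, must equal the summed vector of C. Function names and test matrices are hypothetical.

```python
def column_sums(m):
    """Sum each column of matrix m (additions only, no multiplications)."""
    return [sum(col) for col in zip(*m)]

def vec_mat(v, m):
    """Multiply row vector v by matrix m (n * n multiplications)."""
    return [sum(x * y for x, y in zip(v, col)) for col in zip(*m)]

def ones_check(a, b, c):
    """True if the sums of C agree with the sums of A propagated through B."""
    ar = column_sums(a)   # Ar: additions only
    abr = vec_mat(ar, b)  # ABr: one vector-matrix product
    cr = column_sums(c)   # Cr: additions only
    return abr == cr

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C_GOOD = [[19, 22], [43, 50]]
C_BAD = [[20, 22], [43, 50]]  # one corrupted element

good_passes = ones_check(A, B, C_GOOD)
bad_passes = ones_check(A, B, C_BAD)

print(good_passes, bad_passes)
```

Because no randomness is involved, a single corrupted element of C always changes one of the sums and is therefore always flagged.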

Extended Subject Technique [*]

• compute vectors Br and BrT (only sums)

\[ Br_i = \sum_{j=1}^{n} B_{ij} \ \text{(row sums)}, \qquad BrT_j = \sum_{i=1}^{n} B_{ij} \ \text{(column sums)} \]

[*] Lisboa, C.; Argyrides, C.; Pradhan, D.; and Carro, L., "Algorithm Level Fault Tolerance: a Technique to Cope with Long Duration Transient Faults in Matrix Multiplication Algorithms", in Proceedings of the 26th IEEE VLSI Test Symposium (VTS 2008), San Diego, CA, USA, April 2008.

Extended Subject Technique

• compute vectors Br and BrT (only sums)
• compute vectors ABr = A × Br and ABrT = A × BrT

[Diagram: matrix A with the vectors Br (Br_1 ... Br_n) and BrT (BrT_1 ... BrT_n) as inputs, producing the vectors ABr (ABr_1 ... ABr_n) and ABrT (ABrT_1 ... ABrT_n).]

Extended Subject Technique

• compute vectors Br and BrT (only sums)
• compute vectors ABr = A × Br and ABrT = A × BrT
• compute vectors Cr and CrT (only sums)

\[ Cr_i = \sum_{j=1}^{n} C_{ij} \ \text{(row sums)}, \qquad CrT_j = \sum_{i=1}^{n} C_{ij} \ \text{(column sums)} \]

Extended Subject Technique

• Verification:
  • If ABr = Cr AND ABrT = CrT, then NO ERROR
  • Otherwise: some Cr_i != ABr_i and some CrT_j != ABrT_j

[Diagram: matrix C with the vectors Cr and ABr beside its rows and the vectors CrT and ABrT below its columns; mismatching entries are marked "!=".]

Extended Subject Technique - Example

       -2082  -3582  11793           6129            6129
C =     2160    -61  13645    Cr =  15744    ABr =   9637     (15744 != 9637)
        2280   3222  -2565           2937            2937

CrT  =  2358   -421  22873
ABrT =  2358  -6528  22873    (-421 != -6528)
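The verification step can be sketched as a search for the single mismatching row and column. This is an illustration only: the names `single_mismatch` and `locate_error` are hypothetical, indices are 0-based, and it assumes at most one erroneous element in C.

```c
#include <stddef.h>

/* Index of the only position where x and y differ,
 * -1 if they are equal, -2 if they differ in more than one position. */
static int single_mismatch(const long *x, const long *y, size_t n)
{
    int found = -1;
    for (size_t i = 0; i < n; i++) {
        if (x[i] != y[i]) {
            if (found >= 0)
                return -2;
            found = (int)i;
        }
    }
    return found;
}

/* Returns 0 when ABr = Cr and ABrT = CrT (NO ERROR),
 * 1 when exactly one row and one column mismatch (their 0-based
 * indices are stored in *row and *col), and -1 in any other case. */
int locate_error(const long *cr, const long *abr,
                 const long *crt, const long *abrt,
                 size_t n, int *row, int *col)
{
    int r = single_mismatch(cr, abr, n);
    int c = single_mismatch(crt, abrt, n);

    if (r == -1 && c == -1)
        return 0;
    if (r >= 0 && c >= 0) {
        *row = r;
        *col = c;
        return 1;
    }
    return -1;
}
```

With the vectors of the example slide, the mismatches fall on the second row and second column, that is, on element C[2,2] in the slides' 1-based notation.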

Results: Verification Cost

Total Verification Cost (# of add equivalent operations)

 n   Multiplication   Freivalds   Subject   Extended
 2               36          58        26         52
 4              304         244       116        232
 8            2,496       1,000       488        976
16           20,224       4,048     2,000      4,000
32          162,816      16,288     8,096     16,192
64        1,306,624      65,344    32,576     65,152

Results: Recomputation Cost

Subject (whole matrix) vs. Extended (single element)

 n     Subject     %   Extended       %
 2          36   100          9   25.0
 4         304   100         19    6.25
 8       2,496   100         39    1.56
16      20,224   100         79    0.39
32     162,816   100        159    0.10
64   1,306,624   100        319    0.02

Minimizing the recomputation time

       -2082  -3582  11793           6129            6129
C =     2160    -61  13645    Cr =  15744    ABr =   9637     (15744 != 9637)
        2280   3222  -2565           2937            2937

CrT  =  2358   -421  22873
ABrT =  2358  -6528  22873    (-421 != -6528)

Single element recomputation:

C[i,j] = Σ A[i,k] * B[k,j], k=1...n

Cheaper:

C[2,2]-(Cr[2]-ABr[2]) = -6,168

or

C[2,2]-(CrT[2]-ABrT[2]) = -6,168
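The cheaper correction can be sketched as a single in-place update. The function name `correct_element` and the 0-based, row-major layout are assumptions made for illustration; the arithmetic is the one shown on the slide.

```c
#include <stddef.h>

/* Corrects the erroneous element c[row][col] of the n x n row-major
 * matrix c in place, using the checksum of the faulty result row
 * (cr_row) and the checksum derived from the operands (abr_row).
 * Costs two subtractions. Returns the corrected value. */
long correct_element(long *c, size_t n, size_t row, size_t col,
                     long cr_row, long abr_row)
{
    c[row * n + col] -= (cr_row - abr_row);
    return c[row * n + col];
}
```

For the example matrix, `correct_element(c, 3, 1, 1, 15744, 9637)` turns the faulty -61 into -6168, after which the second row sums to 9637 again. The column checksums CrT[2] and ABrT[2] can be passed in the same way.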

Minimizing the recomputation time

Computational cost when an error occurs

 n    Multiplication       Verification      Recomputation   Total Cost
     4n^3 + n^2(n-1)    10n^2 + 6n(n-1)            2
 2                36                 52                  2           90
 4               304                232                  2          538
 8             2,496                976                  2        3,474
16            20,224              4,000                  2       24,226
32           162,816             16,192                  2      179,010
64         1,306,624             65,152                  2    1,371,778
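The table can be reproduced directly from its column formulas. In this sketch the function names are hypothetical; the weighting of one multiplication as four add-equivalent operations is an inference from the formula 4n^3 + n^2(n-1), not something the slide states.

```c
/* Cost of the n x n matrix multiplication, in add-equivalent operations:
 * 4n^3 + n^2(n-1). */
unsigned long mult_cost(unsigned long n)
{
    return 4UL * n * n * n + n * n * (n - 1UL);
}

/* Cost of the extended verification: 10n^2 + 6n(n-1). */
unsigned long verification_cost(unsigned long n)
{
    return 10UL * n * n + 6UL * n * (n - 1UL);
}

/* Total cost when one error occurs: multiplication + verification
 * + 2 operations to correct the faulty element. */
unsigned long total_cost_with_error(unsigned long n)
{
    return mult_cost(n) + verification_cost(n) + 2UL;
}
```

For n = 64 these functions return 1,306,624, 65,152 and 1,371,778, matching the last row of the table.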

Minimizing the recomputation time

Improvement over extended technique

 n   Extended Technique   Minimum cost technique   % Cost Reduced
 2                   36                        2            94.44
 4                  304                        2            99.34
 8                2,496                        2            99.92
16               20,224                        2            99.99
32              162,816                        2            99.99
64            1,306,624                        2            99.99


Minimizing the recomputation time

Improvement over previous techniques

 N   Subject Technique   Extended Technique   % Cost Reduction   Minimum cost technique   % Cost Reduction
 2                  36                    9              77.77                        2              94.44
 4                 304                   19              89.47                        2              99.34
 8               2,496                   39              94.87                        2              99.92
16              20,224                   79              97.47                        2              99.99
32             162,816                  159              98.74                        2              99.99
64           1,306,624                  319              99.37                        2              99.99

Recently proposed solutions (3 of 6)

Working at algorithm level with low cost for runtime error detection

System Level
Algorithm Level: Using Invariants for Runtime Error Detection
Architecture Level
Circuit Level
Component Level
Technology Level

Goal

• Achieve tolerance to long duration transient pulses
  • at algorithmic level
  • with low performance overhead
  • in an automatic fashion
  • generalized to other algorithms

Alternative approaches

• Software based error detection techniques

  • Duplication with Comparison: increases memory usage and execution time. [Rebaudengo et al, 1999]

  • Self Checking Block Signatures: imposes coding and performance penalties. [Goloubeva et al, 2003]

  • Use of object oriented languages and libraries in some approaches leads to increased memory footprint and requires source code modification. [Benso, 2005]

Alternative approaches

• An algorithm level technique is proposed in [Lisboa, 2007] for matrix multiplication hardening
  • Far lower computational cost than recompute and compare (32x32 matrix: only 4.97% time increase).
  • Exploits algorithm properties: conditions that hold after the execution of the algorithm, known as program invariants or post-conditions, are checked.

IDEA

Use algorithm properties as a means for run-time error detection.

Subject technique

• Invariants

  • Properties that always hold during program execution:
    • Pre-conditions
    • Post-conditions
    • Loop invariants

  • Usually used in the software engineering arena, to check if a program performs its tasks as expected after maintenance.

Subject technique

• Daikon Tool [Ernst et al, 2001]

  • Automatically detects potential invariants for a given program.

  • Identification of a testable set of invariants is feasible for small programs.

  • Linear relationships between up to 3 variables.

  • Limited support for complex data structures.

Methodology

• Fault injection campaigns
• Main program is divided into smaller, less complex pieces of code.
• Daikon is used to extract the invariants of each part.
• Verification code is appended after the algorithm code.

[Diagram: the program body in main() is decomposed into program slices; the invariant detector extracts the invariants of each slice; the "include verification code" step produces the slices with their verification code.]

Methodology

• Fault coverage and performance evaluation

[Diagram: the modified code (a main() in which each program slice is followed by its verification code) feeds two flows. Performance evaluation produces a timing report. Fault coverage evaluation runs the steps generate reference, random fault setup, fault injection and check detection, repeats them F times, and then produces an analysis report.]

Methodology

• Reference and execution results are compared.

• The outcome of the comparison is checked against the verification flag.

• Statistical analysis with report generation.
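The comparison described above can be sketched as a small classifier for one fault injection run. The enum, its labels and the function names are hypothetical; the slides only define the two counted conditions, "( Reference ≠ Result ) AND ( verification = error )" and "verification = error".

```c
/* Possible outcomes of one fault injection run (labels are illustrative). */
typedef enum {
    NO_EFFECT,     /* result matches reference, no error flagged       */
    DETECTED,      /* result differs from reference and error flagged  */
    UNDETECTED,    /* result differs from reference, no error flagged  */
    FLAGGED_ONLY   /* result matches reference but error flagged       */
} outcome;

/* Confronts the result comparison with the verification flag. */
outcome classify(long reference, long result, int verification_error)
{
    int wrong = (reference != result);

    if (wrong)
        return verification_error ? DETECTED : UNDETECTED;
    return verification_error ? FLAGGED_ONLY : NO_EFFECT;
}

/* Detection rate in percent for a campaign of `samples` injections. */
double detection_rate_percent(unsigned detections, unsigned samples)
{
    return samples ? 100.0 * detections / samples : 0.0;
}
```

Under this labelling, the rate marked (*) in the result tables counts only DETECTED runs, while the rate marked (**) counts DETECTED plus FLAGGED_ONLY runs.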

Experimental results and analysis

• The subject methodology was applied to a test program, split into 5 code pieces:

  • Evaluation of the Baskara formula ( domain ).

  • Iterative integer multiplication.

  • Conditional statement execution.

  • Arithmetic expression evaluation.

  • Square root calculation.

Experimental results and analysis

Test case program

    /* baskara() */
    x1=-1.1;
    x2=-1.1;
    if (a==0 && b!=0){
        x1=-c/b;
        x2=x1;
    }
    else{
        delta= pow(b,2) - 4*a*c;
        if (a!=0 && delta>=0){
            x1=(-b + sqrt(delta) )/(2*a);
            x2=(-b - sqrt(delta) )/(2*a);
        }
    }
    /* mult() */
    while(k1>0){
        if ((k1%2)==0 ){
            k1/=2;
            x1+=x1;
        }
        else{
            k1--;
            m1+=x1;
        }
    }
    /* mult() */
    while(k2>0){
        if ((k2%2)==0 ){
            k2/=2;
            x2+=x2;
        }
        else{
            k2--;
            m2+=x2;
        }
    }
    /* biggerminus() */
    if(m1>m2){
        bg=m1-m2;
    }
    else{
        bg=m2-m1;
    }
    /* sum() */
    s = a + b - c;
    /* sqrt() */
    if(s<0){
        sq=sqrt(-s);
    }
    else{
        sq=sqrt(2*s);
    }
    /* biggerminus() */
    if(sq>bg){
        r=sq-bg;
    }
    else{
        r=bg-sq;
    }

Experimental results and analysis

• Example of invariants inferred for the mult( ) algorithm which are used for verification

inputs(x,y) >= 0

    ..mult():::EXIT
    ::y == orig(::z)
    ::y == 0
    ::z >= 0
    ::y <= ::x
    ::y <= ::z
    ::y <= orig(::y)
    ::y <= orig(::x)
    ::x >= orig(::x)
    ::z >= orig(::y)

inputs(x,y) > 0

    ..mult():::EXIT
    ::y == orig(::z)
    ::y == 0
    ::y < ::x
    ::y < ::z
    ::y < orig(::y)
    ::y < orig(::x)
    ::x <= ::z
    ::x % orig(::x)==0
    ::x >= orig(::x)
    ::z % orig(::y)==0
    ::z >= orig(::y)
    ::z % orig(::x)==0
    ::z >= orig(::x)
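Appending verification code to the mult( ) slice might look like the sketch below. It is an illustration, not the authors' generated code: the function name `mult_checked` and the `fault_mask` parameter (a simulated fault XORed into the result) are hypothetical, only a subset of the exit invariants for inputs > 0 is checked, and the caller is assumed to pass x > 0 and y > 0.

```c
/* Iterative multiplication as in the test program, followed by checks
 * of some exit invariants. x and y must be > 0. fault_mask simulates a
 * fault by flipping bits of the result (0 means no fault).
 * Returns 0 if all checked invariants hold, 1 if an error is flagged. */
int mult_checked(long x, long y, long fault_mask, long *z_out)
{
    const long orig_x = x, orig_y = y;
    long z = 0;

    while (y > 0) {
        if ((y % 2) == 0) {
            y /= 2;
            x += x;
        } else {
            y--;
            z += x;
        }
    }
    z ^= fault_mask;  /* simulated fault, for illustration only */
    *z_out = z;

    /* verification code appended after the algorithm */
    if (y != 0) return 1;
    if (x < orig_x || x % orig_x != 0) return 1;
    if (z < orig_x || z < orig_y) return 1;
    if (z % orig_x != 0 || z % orig_y != 0) return 1;
    return 0;
}
```

A fault that turns 6 × 7 = 42 into 43 violates `z % orig_x == 0` and is flagged, whereas one that turns it into 84 satisfies every checked invariant and escapes, which is consistent with the partial coverage the deck reports.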

Experimental results and analysis

• Fault injection campaigns

  • 2000 samples (saturation) for each slice and complete program.

Algorithm         Correct detections   Detection rate*
mult( )                         1141            57.05 %
baskara( )                       394            19.70 %
sum( )                           388            19.40 %
biggerminus( )                   539            26.95 %
square( )                        288            14.40 %

* ( Reference ≠ Result ) AND ( verification = error )

Experimental results and analysis

• Fault injection campaigns

  • 2000 samples (saturation) for each slice and complete program.

Algorithm         Correct detections   Detection rate**
mult( )                         1963             98.15 %
baskara( )                      1621             81.05 %
sum( )                          1729             86.45 %
biggerminus( )                  1630             81.50 %
square( )                       1031             51.55 %

** verification = error

Experimental results and analysis

• Fault injection campaigns

  • 2000 samples (saturation) for each slice and complete program.

                       (*)       (**)
mult( )              57.05%    98.15%
Baskara( )           19.70%    81.05%
sum( )               19.40%    86.45%
biggerminus( )       26.95%    81.50%
sqrt( )              14.40%    51.55%
Complete program     18.75%    36.20%

* ( Reference ≠ Result ) AND ( verification = error )
** verification = error

Experimental results and analysis

• Performance overhead

Algorithm           Execution time   Verification time   Time increase
mult( )                  190.00 ns             5.00 ns          2.63 %
baskara( )               207.33 ns           104.83 ns         50.56 %
sum( )                    90.16 ns             0.67 ns          0.74 %
biggerminus( )            87.50 ns            12.66 ns         12.65 %
square( )                169.33 ns             3.50 ns          2.02 %
complete program         493.20 ns            68.80 ns         13.95 %

Analysis

• Provides a low cost error detection mechanism, when invariants are detected.

• Better performance using program slices.

• Coverage still low.

• Coding style to enhance detection.

• Lack of automatic tools to handle complex data structures.

• Automatic generation of invariants is still a bottleneck.

Recently proposed solutions (4 of 6)

Working at software level

SIFT: Software Implemented Fault Tolerance

System Level
Algorithm Level
Architecture Level
Circuit Level
Component Level
Technology Level

Data-oriented Approaches

• Provide a solution for tolerating the effects of faults affecting the data the program manipulates

SWIFT

• Introduced by Rebaudengo, Politecnico di Torino, Italy
• Used for hardening any operation among variables
• Based on automatic algorithm-level modifications that introduce information (duplication code) and time redundancies

[Violante, M. Politecnico di Torino, 2006]

SWIFT

Basic principle:
• Each variable must be replicated two times
• Each operation among variables must be replicated two times
• After every usage of a variable, its two replicas must be checked for consistency
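The three rules can be illustrated on a single addition. This is a hand-written sketch, not the output of the automatic transformation the deck refers to; the type `dup_long` and the function names are hypothetical.

```c
/* A variable replicated two times. */
typedef struct {
    long v0;
    long v1;
} dup_long;

/* Builds a duplicated variable from a plain value. */
dup_long dup_make(long value)
{
    dup_long d = { value, value };
    return d;
}

/* Hardened c = a + b.
 * The operation is executed once per replica, and the replicas of the
 * variables that were used (a and b) are checked afterwards.
 * Returns 0 if the replicas are consistent, 1 if an error is detected. */
int dup_add(dup_long a, dup_long b, dup_long *c)
{
    c->v0 = a.v0 + b.v0;   /* operation on the first replicas  */
    c->v1 = a.v1 + b.v1;   /* operation on the second replicas */

    if (a.v0 != a.v1 || b.v0 != b.v1)
        return 1;
    return 0;
}
```

An optimizing compiler may merge the two copies, so a real implementation would have to keep them distinct, for example by declaring the replicas `volatile`.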

SWIFT

[Figure: SWIFT]

[Violante, M. Politecnico di Torino, 2006]

SWIFT

Success stories:
• Motorola 68040
• Intel 8051
• IBM PowerPC
• Gaisler LEON1/LEON2

Fault models:
• SEUs
• SETs

[Violante, M. Politecnico di Torino, 2006]

ED4I

• Introduced by McCluskey, Stanford University, USA
• Used for hardening any operation among variables
• Based on algorithm-level modifications that introduce time redundancies (replicated with shifted operands)

Basic principle:
• Compute one solution S = f(x)
• Compute a shifted solution S' = f(x·k)
• Verify whether S and S' are consistent

[Violante, M. Politecnico di Torino, 2006]
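For a function that is linear in its operands, the consistency check is S' = k·S. The sketch below uses a sum of array elements as f; the function names and the factor k = -2 used in the usage note are illustrative assumptions, not taken from the slides.

```c
#include <stddef.h>

/* f(x·k): sum of the elements of x, each multiplied by k.
 * With k = 1 this is the original computation S = f(x). */
long sum_scaled(const long *x, size_t n, long k)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += k * x[i];
    return s;
}

/* Returns 1 if S and the shifted solution S' agree (S' == k * S), else 0. */
int ed4i_consistent(long s, long s_shifted, long k)
{
    return s_shifted == k * s;
}
```

A caller would compute `s = sum_scaled(x, n, 1)` and `s_shifted = sum_scaled(x, n, -2)` and flag an error when `ed4i_consistent(s, s_shifted, -2)` returns 0. Because the two runs exercise the datapath with different bit patterns, a fault is unlikely to corrupt both results in a consistent way.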

ED4I

[Figures: ED4I, three slides]

[Violante, M. Politecnico di Torino, 2006]

Control-oriented Approaches

• Provide a solution for tolerating the effects of faults affecting the program's execution flow

Control Flow Errors

[Figure: Control Flow Errors]

[Violante, M. Politecnico di Torino, 2006]

ECCA

• Introduced by Abraham, University of Texas, USA
• Used for detecting control-flow errors

Based on:
• Modifications to the program source code
• Trigger of division-by-zero exception for error detection

Basic approach:
• Assign an odd signature to each of the program's basic blocks
• Maintain a run-time signature associated with the currently executed basic block
• While entering a basic block, set the run-time signature according to the current basic block and check the correctness of the flow
• While exiting a basic block, set the run-time signature according to the next basic block

[Violante, M. Politecnico di Torino, 2006]

ECCA

[Figures: ECCA, four slides]

[Violante, M. Politecnico di Torino, 2006]

CFCSS

• Introduced by McCluskey, Stanford University, USA
• Used for detecting control-flow errors

Based on:
• Modifications to the program source code
• Use of logic operations to track control-flow execution

Basic approach:
• Assign a signature to each of the program's basic blocks
• During program execution, a run-time signature is continuously updated
• While entering a basic block:
  • The run-time signature is updated
  • The consistency of the run-time signature with a pre-defined one is evaluated

[Violante, M. Politecnico di Torino, 2006]
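The entry check can be sketched with an XOR update. This is a simplified illustration: it assumes the XOR-difference scheme commonly described for CFCSS, handles only blocks with a single predecessor, and uses hypothetical function names and signature values.

```c
/* Signature difference stored for a block with one predecessor:
 * d = s_pred XOR s_block, computed at compile time. */
unsigned cfcss_diff(unsigned pred_sig, unsigned block_sig)
{
    return pred_sig ^ block_sig;
}

/* Executed while entering a basic block.
 * Updates the run-time signature *g with the block's difference and
 * compares it with the block's pre-defined signature.
 * Returns 0 if the flow is consistent, 1 on a control-flow error. */
int cfcss_enter(unsigned *g, unsigned block_sig, unsigned diff)
{
    *g ^= diff;
    return (*g == block_sig) ? 0 : 1;
}
```

On a legal transition the XOR turns the predecessor's signature into the block's own. After an illegal jump the run-time signature holds some other block's signature, so the update yields a value different from the expected one.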

CFCSS

[Figures: CFCSS, three slides]

[Violante, M. Politecnico di Torino, 2006]

CFCSS

• Low-cost techniques:
  • Logic operations are not time consuming
  • Few operations are added, resulting in low code penalty

• Error detection is very critical: it changes the program's graph by introducing a jump

[Violante, M. Politecnico di Torino, 2006]

YACCA

• Introduced by Massimo Violante, Politecnico di Torino, Italy

• Used for detecting control-flow errors

Based on:
• Modifications to the program source code
• Use of logic operations to track control-flow execution

[Violante, M. Politecnico di Torino, 2006]

YACCA

Basic principle:
• Two signatures are assigned to each of the program's basic blocks (enter and exit signatures, Bx1, Bx2)
• A run-time signature is constantly updated
• When entering a basic block:
  • Check the correctness of the execution
  • Set the run-time signature to the enter one
• When exiting a basic block:
  • Check the correctness of the execution
  • Set the run-time signature to the exit one

[Violante, M. Politecnico di Torino, 2006]
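The enter and exit steps can be sketched as follows. This is a deliberately simplified illustration: the slides say YACCA tracks the flow with logic operations, whereas this sketch substitutes plain comparisons, and the struct, field and function names are hypothetical.

```c
#include <stddef.h>

/* Per-block data: enter and exit signatures (Bx1, Bx2) and the exit
 * signatures of the blocks allowed to precede this one. */
typedef struct {
    unsigned enter_sig;
    unsigned exit_sig;
    const unsigned *allowed_prev;
    size_t n_prev;
} yacca_block;

/* Executed when entering a block: checks that the run-time signature
 * is one of the allowed predecessor exit signatures, then sets it to
 * the enter signature. Returns 0 if correct, 1 on a control-flow error. */
int yacca_enter(unsigned *code, const yacca_block *b)
{
    int ok = 0;
    for (size_t i = 0; i < b->n_prev; i++)
        if (*code == b->allowed_prev[i])
            ok = 1;
    *code = b->enter_sig;
    return ok ? 0 : 1;
}

/* Executed when exiting a block: checks that the run-time signature is
 * still the enter signature, then sets it to the exit signature. */
int yacca_exit(unsigned *code, const yacca_block *b)
{
    int ok = (*code == b->enter_sig);
    *code = b->exit_sig;
    return ok ? 0 : 1;
}
```

Both functions only return an error flag and never branch, so the checks themselves add no jumps to the program.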

YACCA

[Figures: YACCA, five slides]

[Violante, M. Politecnico di Torino, 2006]

YACCA

• Low-cost techniques:
  • Logic operations are not time consuming
  • Few operations are added, resulting in low code penalty
  • The program's graph is not modified

[Violante, M. Politecnico di Torino, 2006]

Comparison

[Figure: Comparison]

[Violante, M. Politecnico di Torino, 2006]

Some figures

• Experimental setup
  • Matrix multiplication program
  • Intel 8051 processor
  • Hardware-accelerated fault injection in:
    • Code segment
    • Data segment
    • Processor's registers
  • SEU fault model

[Violante, M. Politecnico di Torino, 2006]

Some Figures

• System failures due to SEUs in the code segment:
  • Un-hardened program: 1.0
  • ABFT: 4x better
  • ED4I: 4x better
  • SWIFT+YACCA: 6x better

[Violante, M. Politecnico di Torino, 2006]

Some Figures

• System failures due to SEUs in the data segment:
  • Un-hardened program: 1.0
  • ABFT: 6x better
  • ED4I: 29x better
  • SWIFT+YACCA: ∞ better (0 system failures observed)

[Violante, M. Politecnico di Torino, 2006]

Some Figures

• System failures due to SEUs in the processor's registers:
  • Un-hardened program: 1.0
  • ABFT: 9x better
  • ED4I: 13x better
  • SWIFT+YACCA: 15x better

[Violante, M. Politecnico di Torino, 2006]

Some Figures

• Time increase:
  • Un-hardened program: 1.0
  • ABFT: 3.8x
  • ED4I: 1.9x
  • SWIFT+YACCA: 3.5x

• Code increase:
  • Un-hardened program: 1.0
  • ABFT: 2.3x
  • ED4I: 1.6x
  • SWIFT+YACCA: 3.9x

[Violante, M. Politecnico di Torino, 2006]

Some Figures

• Data increase:
  • Un-hardened program: 1.0
  • ABFT: 2.0x
  • ED4I: 1.9x
  • SWIFT+YACCA: 2.2x

[Violante, M. Politecnico di Torino, 2006]

Hybrid SIFT

• Software-only SIFT may introduce an unacceptable time penalty

• Moving some tasks to hardware may reduce this overhead

• Masking, detection, location, and recovery implemented in software and in hardware

• Possible approaches:
  • Lockstep execution
  • Watchdogs
  • Lightweight watchdogs

Recently proposed solutions (5 of 6)

Working at system (software and hardware) level

SWAT: SoftWare Anomaly Treatment

System Level
Algorithm Level
Architecture Level
Circuit Level
Component Level
Technology Level

Li, M.-L.; Ramachandran, P.; Sahoo, S. K.; Adve, S.; Adve, V.; and Zhou, Y. Understanding the propagation of hard errors to software and implications for resilient system design. In Proc. of the 13th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, 2008.

Main concepts

• Detection of errors when they affect software behavior is preferable to detection at hardware level

• SWAT exploits this concept to achieve low cost error detection for cores at software level, by checking:
  o Fatal exceptions
  o Program crashes or hangs
  o Unusually high amount of operating system activity

• Some hardware errors that do not manifest themselves in software behaviors are not detected by SWAT

• SWAT suffers from the drawbacks of high level error detection mechanisms that will be discussed later

Recently proposed solutions (6 of 6)

Working at lower levels to detect errors and at higher system levels to correct them.

Application Layer
Middleware/Architectural Layer
Configurable/Programming Layer
Register/Logic Layer
Technology Layer

[Diagram: the five layers stacked from Application (top) to Technology (bottom), with two arrows alongside labeled "Error Reports" and "Reconfiguration".]

Albrecht, C.; Koch, R.; Pionteck, T.; and Glösekötter, P. Towards a Flexible Fault-Tolerant System-on-Chip. 22nd International Conference on Architecture of Computing Systems - Workshop Proceedings - ARCS 2009, pp. 83-90, VDE Verlag GmbH, Berlin, 2009.

Page 135: New challenges forNew challenges for designers of fault

Main conceptsMain concepts

• SoC is divided into several layers• SoC is divided into several layers

• Each layer has specific fault tolerance mechanisms:y

o Detection is cheaper at lower layers

o Correction is better performed at higher layers

• Lower layers notify upper layers when an error is detected

• Upper layers send reconfiguration information to lower layers according to application requirements

• Key issue: interfaces between layers to report errors and to inform lower layers about the level of reliability needed by the application
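The two-way interface between layers can be sketched as follows. This is only an illustrative model, not the interface proposed by Albrecht et al.: the `Layer` class, its method names, the `build_stack` helper and the reliability level strings are all assumptions made for this example.

```python
from typing import List, Optional


class Layer:
    """One layer of the stack, linked to its neighbors (hypothetical model)."""

    def __init__(self, name: str):
        self.name = name
        self.upper: Optional["Layer"] = None
        self.lower: Optional["Layer"] = None
        self.reliability_level = "normal"
        self.received_reports: List[str] = []

    def report_error(self, description: str) -> None:
        """Called when this layer detects an error: notify the layers above."""
        if self.upper is not None:
            self.upper.on_error_report(self.name, description)

    def on_error_report(self, source: str, description: str) -> None:
        """Record a report from below and pass it further up."""
        self.received_reports.append(f"{source}: {description}")
        if self.upper is not None:
            self.upper.on_error_report(source, description)

    def send_reconfiguration(self, level: str) -> None:
        """Ask all layers below to adopt the requested reliability level."""
        if self.lower is not None:
            self.lower.apply_reconfiguration(level)

    def apply_reconfiguration(self, level: str) -> None:
        self.reliability_level = level
        if self.lower is not None:
            self.lower.apply_reconfiguration(level)


def build_stack(names: List[str]) -> List[Layer]:
    """Create layers from bottom to top and link neighbors."""
    layers = [Layer(n) for n in names]
    for low, high in zip(layers, layers[1:]):
        low.upper = high
        high.lower = low
    return layers


stack = build_stack(["Technology", "Register/Logic",
                     "Configuration/Programming",
                     "Middleware/Architectural", "Application"])
technology, application = stack[0], stack[-1]
technology.report_error("transient upset")   # error report travels upward
application.send_reconfiguration("high")     # reconfiguration travels downward
print(application.received_reports, technology.reliability_level)
```

In this sketch the report reaches every layer above the detecting one, and the reliability request reaches every layer below the requesting one. A real system would more likely let each layer decide whether to handle or forward a report.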


Sample roles of layers

• Technology layer

o Built-in current sensors detect transient upsets

o Upper layer can configure detection capabilities

• Register/Logic layer

o EDAC used to harden memories

o TMR used to harden logic

o Upper layer can enable/disable detection mechanisms

• Configuration/Programming layer (in reconfigurable platforms)

o Reconfiguration can be used to disable faulty modules

o Periodical relocation of active modules reduces degradation
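As a reminder of how TMR masks a single upset, the sketch below applies a bitwise majority vote to three redundant copies of a word. The function names `tmr_vote` and `tmr_disagreement` are hypothetical and chosen for this example. In hardware the voter is a small combinational circuit, which the Python expression only mimics.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three redundant copies of a word."""
    return (a & b) | (a & c) | (b & c)


def tmr_disagreement(a: int, b: int, c: int) -> bool:
    """True when at least one copy differs, i.e. an error was detected."""
    return not (a == b == c)


golden = 0b1011_0010
upset = golden ^ 0b0000_1000          # one copy suffers a single bit flip

voted = tmr_vote(golden, upset, golden)
print(voted == golden, tmr_disagreement(golden, upset, golden))
```

The voter masks any error confined to one copy. If two copies are corrupted in the same bit position, the majority is wrong and the error passes through, which is one reason multiple simultaneous transients are a concern for TMR.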


Sample roles of layers

• Middleware/Architectural layer

o Applies well-known redundancy techniques such as TMR at component level

o Redundant modules are designed independently to allow detection of SEUs and design errors

o Test mechanisms can be used to check modules at run time

o Checkpoints can be used to allow error recovery

• Application layer

o Almost everything can be used to improve reliability at this level

o Software implemented TMR, EDAC and other techniques can be used
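Checkpoint-based recovery can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not a technique taken from the cited work: `run_with_checkpoints`, the dictionary-based state, the `plausible` check and the simulated bit flip are all hypothetical.

```python
import copy


def run_with_checkpoints(state, steps, check, max_retries=3):
    """Run each step; on a failed check, roll back to the checkpoint and retry."""
    for step in steps:
        checkpoint = copy.deepcopy(state)      # save state before the step
        for _attempt in range(max_retries + 1):
            step(state)
            if check(state):
                break                          # step result accepted
            state.clear()                      # roll back to the saved state
            state.update(copy.deepcopy(checkpoint))
        else:
            raise RuntimeError("error persisted after all retries")
    return state


faults = {"remaining": 1}                      # inject one transient fault


def increment(state):
    state["count"] += 1
    if faults["remaining"] > 0:
        faults["remaining"] -= 1
        state["count"] ^= 0x40                 # simulated transient bit flip


def plausible(state):
    """Application-level check: the counter must stay in a known range."""
    return 0 <= state["count"] <= 10


result = run_with_checkpoints({"count": 0}, [increment] * 3, plausible)
print(result)
```

Because the fault is transient, re-executing the step from the checkpoint succeeds. A permanent fault would exhaust the retries and would have to be handled by other means, such as disabling the faulty module.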


Outline

• Introduction: concepts and definitions

• Motivation: new challenges imposed by future technologies

• Radiation induced faults: the major challenges

• Existing mitigation techniques vs. the new scenario

• Desired properties of new radiation induced faults mitigation techniques

• Recent solutions working at different abstraction levels to deal with transient faults

• Conclusions


Conclusions

• New low cost mitigation techniques, providing error detection and error correction, must be developed

• Circuit level approaches can be better than TMR, but still impose significant area and power overheads

• Algorithm level mitigation is a better approach, but it is hard to generalize and automate


High level error detection: pros and cons

[Sorin, 2009]

• Checking at a higher level:

o reduces hardware costs

o reduces the number of false positives

o is necessary anyway for certain types of errors

• However:

o provides little diagnostic information (type and location)

o longer and potentially unbounded error detection latency

o recovery process may be more complex


Final Remark

• There is NO silver bullet!

• Combine hardware and software based techniques at different levels

• Leverage the specific strengths of each technique at each level.


Thank You!

Questions?

Contact: [email protected], [email protected]

Copy of slides available at http://www.inf.ufrgs.br/~calisboa/IESS2009


References (in order of appearance)

• BLOME, J. A., GUPTA, S., FENG, S., and MAHLKE, S. Cost-efficient soft error protection for embedded microprocessors. In: INTERNATIONAL CONFERENCE ON COMPILERS, ARCHITECTURE AND SYNTHESIS FOR EMBEDDED SYSTEMS, CASES 2006, 2006, Proceedings… Los Alamitos, USA: IEEE Computer Society, 2006, p. 421-431.

• DODD, P. et al. Production and propagation of single-event transients in high-speed digital logic ICs. IEEE Transactions On Nuclear Science, Los Alamitos, USA: IEEE Computer Society, 2004, v. 51, n. 6 (part 2), p. 3278-3284.

• FERLET-CAVROIS, V. et al. Statistical analysis of the charge collected in SOI and bulk devices under heavy ion and proton irradiation: implications for digital SETs. IEEE Transactions On Nuclear Science, Los Alamitos, USA: IEEE Computer Society, 2006, v. 53, n. 6 (part 1), p. 3242-3252.

• ROSSI, D. et al. Multiple transient faults in logic: an issue for next generation ICs? In: IEEE INTERNATIONAL SYMPOSIUM ON DEFECT AND FAULT TOLERANCE IN VLSI SYSTEMS, 20., DFT 2005, 2005, Monterey, USA. Proceedings… Los Alamitos, USA: IEEE Computer Society, 2005, p. 352-360.

• ANGHEL, L.; NICOLAIDIS, M. Cost reduction and evaluation of a temporary faults detection technique. In: DESIGN, AUTOMATION AND TEST IN EUROPE CONFERENCE, 2000, DATE 2000, Paris, FRA. Proceedings… New York, USA: ACM Press, 2000, p. 591-598.


• NIEUWLAND, A.; JASAREVIC, S.; JERIN, G. Combinational logic soft error analysis and protection. In: IEEE INTERNATIONAL ON-LINE TEST SYMPOSIUM, 12., IOLTS 2006, Lake of Como, ITA. Proceedings… Los Alamitos, USA: IEEE Computer Society, 2006. p. 99-104.

• SORIN, D. J. Fault Tolerant Computer Architecture. USA: Morgan & Claypool, 2009.

• PRADHAN, D. Fault-tolerant computer system design. Upper Saddle River, USA : Prentice-Hall, 1995.

• BAUMANN, R. Soft errors in advanced computer systems. IEEE Design and Test of Computers, New York, USA: IEEE Computer Society, 2005, v. 22, n. 3, p. 258-266.

• HAMMING, R. Error Detecting and Error Correcting Codes. The Bell System Technical Journal, 1950, v. 29, n. 2, p. 147-160.

• ALMUHKAIZIM, S. and MAKRIS, Y. Fault Tolerant Design of Combinational and Sequential Logic based on a Parity Check Code. In: Proceedings of the 18th IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT 2003), IEEE Computer Society, Los Alamitos, CA, October 2003, pp. 344-351.

• FREIVALDS, R. Fast probabilistic algorithms. In: FREIVALDS, R. Mathematical Formulations of CS. New York, USA: Springer-Verlag, 1979. p. 57-69. (Lecture Notes in Computer Science).


• LISBOA, C. A., ERIGSSON, M. I., and CARRO, L. System level approaches for mitigation of long duration transient faults in future technologies. In: IEEE EUROPEAN TEST SYMPOSIUM, 12., ETS 2007, Freiburg, DEU. Proceedings… Los Alamitos, USA: IEEE Computer Society, 2007, p. 165-170.

• LISBOA, C.; ARGYRIDES, C.; PRADHAN, D.; and CARRO, L. Algorithm level fault tolerance: a technique to cope with long duration transient faults in matrix multiplication algorithms. In: IEEE VLSI TEST SYMPOSIUM, 26., VTS 2008, San Diego, USA. Proceedings… [S.l.: s.n.], 2008.

• LISBOA, C. et al. Invariant checkers: an efficient low cost technique for run-time transient errors detection. In: IEEE INTERNATIONAL ON-LINE TESTING SYMPOSIUM, 15., IOLTS 2009, Sesimbra, POR. Proceedings… [S.l.: s.n.], 2009.

• REBAUNDENGO, M. et al. Soft-error detection through software fault-tolerance techniques. In: IEEE INTERNATIONAL SYMPOSIUM ON DEFECT AND FAULT TOLERANCE IN VLSI SYSTEMS, 14., DFT1999, 1999, Albuquerque, USA. Proceedings… New York, USA: IEEE Computer Society, 1999, p. 210-218.

• GOLOUBEVA, O. et al. Soft error detection using control flow assertions. In: INTERNATIONAL SYMPOSIUM ON DEFECT AND FAULT TOLERANCE, 18., 2003, Boston, USA. Proceedings… Los Alamitos, USA: IEEE Computer Society, 2003, p. 581-588.


• BENSO, A. et al. PROMON: a profile monitor of software applications. In: IEEE WORKSHOP ON DESIGN AND DIAGNOSTICS OF ELECTRONIC CIRCUITS AND SYSTEMS, 8., DDECS05, Sopron, HUN. Proceedings… New York, USA: IEEE Computer Society, 2005, p. 81-86.

• [DAIKON] ERNST, M.; COCKRELL, J.; GRISWOLD, W. Dynamically discovering likely program invariants to support program evolution. IEEE Transactions on Software Engineering. New York, USA: IEEE Computer Society, 2001, v. 27, n. 2, p.99–123.

• KASTENSMIDT, F.; CARRO, L.; REIS, R. Fault-Tolerance Techniques for SRAM-Based FPGA. New York, USA: Springer, 2006, 183 p.

• REBAUNDENGO, M. et al. Soft-error detection through software fault-tolerance techniques. In: IEEE INTERNATIONAL SYMPOSIUM ON DEFECT AND FAULT TOLERANCE IN VLSI SYSTEMS, 14., DFT1999, 1999, Albuquerque, USA. Proceedings… New York, USA: IEEE Computer Society, 1999, p. 210-218.

• [ABFT] HUANG, K.; ABRAHAM, J. Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers. New York, USA: IEEE Computer Society, 1984, v. C-33, n. 6, p. 518-528.

• [EDDI] OH, N., SHIRVANI, P. P., McCLUSKEY, E.J. EDDI: Error Detection by Duplicated Instructions. IEEE Transactions on Reliability, IEEE Reliability Society, 2002, v. 51, n. 1, p. 63-75.


• [ED4I] OH, N.; MITRA, S.; McCLUSKEY, E. J. ED4I: error detection by diverse data and duplicated instructions. IEEE Transactions on Computers, IEEE Computer Society, 2002, v. 51, n. 2, p. 180-199.

• [ECCA] ALKHALIFA, Z. et al. Design and evaluation of system-level checks for on-line control flow error detection. IEEE Transactions on Parallel and Distributed Systems, New York, USA: IEEE Computer Society, 1999, v. 10, n. 6, p. 627-641.

• [EDDI] OH, N., SHIRVANI, P. P., McCLUSKEY, E.J. EDDI: Error Detection by Duplicated Instructions. IEEE Transactions on Reliability, IEEE Reliability Society, 2002, v. 51, n. 1, p. 111-122.

• [YACCA] VIOLANTE, M. Dependability assurance by design. Internal report, Politecnico di Torino, Italy, available at http://www.cad.polito.it/~sonza/diistp03/lucidi/2007/03-assurance.pdf.

• [SWAT] LI, M.-L.; Ramachandran, P.; Sahoo, S. K.; Adve, S.; Adve, V.; and Zhou, Y. Understanding the propagation of hard errors to software and implications for resilient system design. In Proc. of the 13th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, 2008.

• ALBRECHT, C. et al. Towards a Flexible Fault-Tolerant System-on-Chip. In: INTERNATIONAL CONFERENCE ON ARCHITECTURE OF COMPUTING SYSTEMS, 22., 2009, ARCS 2009, Karlsruhe, GER. Proceedings… Berlin, GER: VDE Verlag GMBH, 2009, p. 83-90.
