Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors

Viacheslav Izosimov, Petru Eles, Zebo Peng (Embedded Systems Lab (ESLAB), Linköping University, Sweden); Ilia Polian (Institute for Computer Science, Albert-Ludwigs-University of Freiburg, Germany); Paul Pop (Dept. of Informatics and Mathematical Modeling, Technical University of Denmark (DTU), Denmark)


Page 1: Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors

Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors

Viacheslav Izosimov, Petru Eles, Zebo Peng
Embedded Systems Lab (ESLAB), Linköping University, Sweden

Ilia Polian
Institute for Computer Science, Albert-Ludwigs-University of Freiburg, Germany

Paul Pop
Dept. of Informatics and Mathematical Modeling, Technical University of Denmark (DTU), Denmark

Page 2: Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors

Motivation

Hard real-time safety-critical applications:
- Time-constrained
- Cost-constrained
- Quality-of-service
- Fault-tolerant
- etc.

Focus on transient faults and intermittent faults

Page 3: Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors

Transient and Intermittent Faults

Sources:
- Radiation
- Electromagnetic interference (EMI)
- Lightning storms
- Internal EMI
- Crosstalk
- Power supply fluctuations
- Software errors (Heisenbugs)

Errors caused by transient (intermittent) faults have to be tolerated before they crash the system

Page 4: Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors

Hardening

Improving the hardware architecture to reduce the error rate:
- Hardware redundancy (selective duplication of gates/units/nodes, dedicated additional hardware modules/flip-flops)
- Re-designing the hardware to reduce susceptibility to transient faults
- Using higher voltages / lower frequencies / larger transistor sizes
- Protecting with shields

Leads to lower performance:
- Use of technologies a few generations back
- Increase of the critical path and silicon area

Very expensive:
- Extra design effort / more expensive technologies
- More silicon / increase in the number of gates or computation units
- Low production volumes

Still may not guarantee the required reliability levels at affordable cost!

Page 5: Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors

Software-level Fault Tolerance

[Figure: process P1 and its re-executions]

Software fault tolerance:
- Reliability increases with time redundancy
- Leads to lower performance: fault tolerance overheads, and overheads due to error detection, voting, agreement
- Low hardware cost

Often cannot guarantee the required reliability levels and, at the same time, meet deadlines!

Page 6: Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors

Motivation

A trade-off between hardware and software fault tolerance has to be addressed to provide a reliable and low-cost system!

Neither hardening nor pure software-level fault tolerance can guarantee the required level of reliability…

Fault tolerance against transient faults may lead to significant performance or cost overhead!

Page 7: Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors


Outline

Motivation

Architecture

Application example

Fault tolerance: hardening & re-execution

Hardening/re-execution trade-off

Problem formulation & design strategy

Experimental results

Conclusions

Page 8: Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors

Architecture

[Figure: application graph with processes P1 to P5 and messages m1, m2]

Fault tolerance techniques:
- Processes: re-execution
- Computation nodes: hardening
- Messages: fault-tolerant predictable protocol

Transient faults: error rates are given for each hardening version (h-version) of each computation node

The reliability goal ρ = 1 - γ, where γ is the maximum probability of a system failure due to transient faults on any computation node within a time unit τ

Page 9: Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors

Application Example

Hardening versions of computation node N1, running process P1; reliability goal ρ = 1 - 10^-5

h-version    t      p          Cost
h = 1        80     4·10^-2    10
h = 2        100    4·10^-4    20
h = 3        160    4·10^-6    40

t: worst-case execution time; p: process failure probability; Cost: h-version cost

With more hardening:
- Reliability increases: process failure probabilities decrease
- Worst-case execution times increase: hardening performance degradation (HPD)
- Cost increases!

Page 10: Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors

System Failure Probability (SFP) Analysis

We have proposed a system failure probability (SFP) analysis to connect error rates and the reliability goal to the number of re-executions in software

Non-trivial, exact, safe

P1 on N1 with h = 1: t = 80, p = 4·10^-2; ρ = 1 - 10^-5; T = 360 ms

SFP analysis result: main execution + k = 6 re-executions

Page 11: Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors

System Failure Probability (SFP) Analysis

We have proposed a system failure probability (SFP) analysis to connect error rates and the reliability goal to the number of re-executions in software

P1 on N1 with h = 2: t = 100, p = 4·10^-4; ρ = 1 - 10^-5; T = 360 ms

SFP analysis result: main execution + k = 2 re-executions

Page 12: Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors

System Failure Probability (SFP) Analysis

We have proposed a system failure probability (SFP) analysis to connect error rates and the reliability goal to the number of re-executions in software

P1 on N1 with h = 3: t = 160, p = 4·10^-6; ρ = 1 - 10^-5; T = 360 ms

SFP analysis result: main execution + k = 1 re-execution
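The three results above (k = 6, 2 and 1) can be reproduced with a minimal sketch. It assumes a single process on a single node, so that more than k faults in one period means the main execution and all k re-executions are faulty (probability p^(k+1)), and it assumes a time unit of one hour, which matches the exponent 10000 used with T = 360 ms in the worked computation on Pages 36 and 37. The function name and parameters are illustrative, not taken from the slides.

```python
import math

def min_reexecutions(p, period_ms, tau_ms, gamma, k_max=100):
    """Smallest k with (1 - p**(k + 1)) ** (tau / T) >= 1 - gamma.

    Assumes one process on one node: the node fails in a period only if
    the main execution and all k re-executions are hit by a fault.
    """
    periods = tau_ms / period_ms     # periods per reliability time unit
    target = math.log1p(-gamma)      # log(1 - gamma)
    for k in range(k_max + 1):
        p_fail = p ** (k + 1)        # probability of more than k faults
        # Compare in the log domain to avoid rounding next to 1.0.
        if periods * math.log1p(-p_fail) >= target:
            return k
    raise ValueError("goal not reachable within k_max re-executions")

TAU_MS = 3_600_000  # assumed time unit: one hour
for h, p in {1: 4e-2, 2: 4e-4, 3: 4e-6}.items():
    k = min_reexecutions(p, period_ms=360, tau_ms=TAU_MS, gamma=1e-5)
    print(f"h = {h}: k = {k}")
```

Working with logarithms keeps the comparison stable when the per-period failure probability is many orders of magnitude below 1.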

Page 13: Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors

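Application Example

Process P1 on computation node N1; recovery overhead μ = 20 ms; D = 360 ms; ρ = 1 - 10^-5

h-version    t      p          Cost
h = 1        80     4·10^-2    10
h = 2        100    4·10^-4    20
h = 3        160    4·10^-6    40

Schedules on N1 for each h-version (P1/i is the i-th execution of P1):
N1, h = 3: P1/1, P1/2
N1, h = 2: P1/1, P1/2, P1/3
N1, h = 1: P1/1, P1/2, P1/3, P1/4, P1/5, P1/6, P1/7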
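The comparison of the three schedules can be sketched as below. The sketch assumes that the worst-case schedule length of a single process with k re-executions is (k + 1)·t + k·μ, that t is given in milliseconds, and that k = 6, 2, 1 are the values reported by the SFP analysis for h = 1, 2, 3; the helper names are illustrative.

```python
# Values from the slides: worst-case execution time t, number of
# re-executions k from the SFP analysis, and h-version cost.
H_VERSIONS = {
    1: {"t": 80, "k": 6, "cost": 10},
    2: {"t": 100, "k": 2, "cost": 20},
    3: {"t": 160, "k": 1, "cost": 40},
}
MU = 20         # recovery overhead in ms
DEADLINE = 360  # deadline D in ms

def worst_case_length(t, k, mu):
    """Assumed model: k + 1 executions plus one recovery overhead per re-execution."""
    return (k + 1) * t + k * mu

def cheapest_feasible(h_versions, mu, deadline):
    """Return the h-version with the lowest cost that still meets the deadline."""
    feasible = [
        (v["cost"], h)
        for h, v in h_versions.items()
        if worst_case_length(v["t"], v["k"], mu) <= deadline
    ]
    return min(feasible)[1] if feasible else None

for h, v in H_VERSIONS.items():
    print(h, worst_case_length(v["t"], v["k"], MU))
print("cheapest feasible h-version:", cheapest_feasible(H_VERSIONS, MU, DEADLINE))
```

Under these assumptions the least hardened version needs too many re-executions to fit into the deadline, while the two more hardened versions fit and differ only in cost.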

Page 14: Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors

Application Example

[Figure: application graph with processes P1 to P4 and messages m1 to m4, mapped on computation nodes N1 and N2]

ρ = 1 - 10^-5; recovery overhead μ = 15 ms; D = 360 ms

Node N1      h = 1 (Cost 16)      h = 2 (Cost 32)      h = 3 (Cost 64)
Process      t      p             t      p             t      p
P1           60     1.2·10^-3     75     1.2·10^-5     90     1.2·10^-10
P2           75     1.3·10^-3     90     1.3·10^-5     105    1.3·10^-10
P3           60     1.4·10^-3     75     1.4·10^-5     90     1.4·10^-10
P4           75     1.6·10^-3     90     1.6·10^-5     105    1.6·10^-10

Node N2      h = 1 (Cost 20)      h = 2 (Cost 40)      h = 3 (Cost 80)
Process      t      p             t      p             t      p
P1           50     1·10^-3       60     1·10^-5       75     1·10^-10
P2           65     1.2·10^-3     75     1.2·10^-5     90     1.2·10^-10
P3           50     1.2·10^-3     60     1.2·10^-5     75     1.2·10^-10
P4           65     1.3·10^-3     75     1.3·10^-5     90     1.3·10^-10

Page 15: Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors

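Application Example

Alternative architectures, with the process instances shown in each schedule (Pi/j is the j-th execution of Pi):

a) N1 with h = 2: P1, P3, P2/1, P4/1, P2/2, P4/2; cost Ca = 32
b) N2 with h = 2: P1, P3, P2/1, P4/1, P2/2, P4/2; cost Cb = 40
c) N1 with h = 3: P1, P3, P2, P4; cost Cc = 64
d) N2 with h = 3: P1, P3, P2, P4; cost Cd = 80
e) N1 with h = 2 and N2 with h = 2: P1, P2/1, P2/2 on N1; P3/1, P3/2, P4 on N2; messages m2, m3 on the bus; cost Ce = 72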
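The five costs are the sums of the h-version costs of the nodes used, taken from the tables on Page 14. A small sketch makes that explicit; the dictionary layout and function name are illustrative.

```python
# h-version costs from the tables on Page 14: (node, h) -> cost
H_VERSION_COST = {
    ("N1", 1): 16, ("N1", 2): 32, ("N1", 3): 64,
    ("N2", 1): 20, ("N2", 2): 40, ("N2", 3): 80,
}

# Each alternative is the list of (node, h) pairs it uses.
ALTERNATIVES = {
    "a": [("N1", 2)],
    "b": [("N2", 2)],
    "c": [("N1", 3)],
    "d": [("N2", 3)],
    "e": [("N1", 2), ("N2", 2)],
}

def architecture_cost(nodes):
    """Total cost of an architecture: sum of the costs of its h-versions."""
    return sum(H_VERSION_COST[node] for node in nodes)

for name, nodes in ALTERNATIVES.items():
    print(f"C{name} = {architecture_cost(nodes)}")
```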

Page 16: Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors

Problem Formulation (Input)

Input:
- Application as a set of directed acyclic graphs
- Reliability goal
- Deadline D, period T
- Recovery overhead
- Bus-based hardware architecture

1. Process worst-case execution times for all h-versions of computation nodes
2. Process failure probabilities for all h-versions
3. Costs of all h-versions
4. Worst-case message sizes, transformed into the worst-case transmission times on the bus

Page 17: Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors

Problem Formulation (Output)

Output:
- Selection of h-versions of computation nodes
- Mapping of all processes
- Maximum number of re-executions (by using our SFP analysis)
- Schedule (static cyclic) of all processes and messages

The final solution has to:
1. Be schedulable
2. Meet the reliability goal
3. Minimize the overall system cost

Page 18: Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors

Design Optimization Strategy

[Diagram]
Re-execution Optimization (based on SFP)
- Input: reliability goal, period T, process failure probabilities, mapping + hardening setup
- Uses: SFP analysis
- Output: number of re-executions
- Condition: satisfy reliability

Page 19: Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors

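Design Optimization Strategy

[Diagram: nested optimization steps, outer step first]
1. Hardening Optimization + Scheduling
   - Input: reliability goal, period T, process failure probabilities, mapping
   - Passes on: hardening setup
   - Condition: meet deadline
2. Re-execution Optimization (based on SFP)
   - Uses: SFP analysis
   - Output: number of re-executions
   - Condition: satisfy reliability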

Page 20: Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors

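Design Optimization Strategy

[Diagram: nested optimization steps, outer step first]
1. Mapping Optimization + Scheduling
   - Input: reliability goal, period T, architecture selection (set of nodes)
   - Passes on: mapping
   - Condition: meet deadline
2. Hardening Optimization + Scheduling
   - Passes on: hardening setup
   - Condition: meet deadline
3. Re-execution Optimization (based on SFP)
   - Uses: SFP analysis
   - Output: number of re-executions
   - Condition: satisfy reliability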

Page 21: Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors

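Design Optimization Strategy

[Diagram: nested optimization steps, outer step first]
1. Architecture Optimization
   - Passes on: architecture selection
   - Condition: best cost
2. Mapping Optimization + Scheduling
   - Passes on: mapping
   - Condition: meet deadline
3. Hardening Optimization + Scheduling
   - Passes on: hardening setup
   - Condition: meet deadline
4. Re-execution Optimization (based on SFP)
   - Uses: SFP analysis
   - Output: number of re-executions
   - Condition: satisfy reliability

DATE'05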

Page 22: Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors

Selected Experimental Results

[Figure: % accepted architectures (y-axis, 0 to 100) as a function of soft error rate (SER; x-axis: 10^-12, 10^-11, 10^-10) for MAX, MIN and OPT]

MAX: hardware optimization
MIN: software optimization
OPT: combined architecture

Accepted architecture:
- Satisfying maximum accepted cost
- Satisfying reliability goal
- Schedulable

Hardening performance degradation (HPD): 5% (performance difference between the least hardened and the most hardened versions)

Maximum cost: 20

Page 23: Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors

Selected Experimental Results

[Figure: % accepted architectures (y-axis, 0 to 100) as a function of soft error rate (SER; x-axis: 10^-12, 10^-11, 10^-10) for MAX, MIN and OPT]

MAX: hardware optimization
MIN: software optimization
OPT: combined architecture

Hardening performance degradation (HPD): 100%

Maximum cost: 20

Page 24: Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors

Conclusions

Design optimization strategy for minimization of overall system cost by trading off between hardening and re-execution:
- Hardware + software fault tolerance techniques
- System failure probability (SFP) analysis
- A set of design optimization heuristics

Combining hardware and software fault tolerance techniques is essential for obtaining cost-efficient implementations of fault-tolerant embedded systems

Page 25: Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors

System Failure Probability (SFP) Analysis

Given:
- Application as a set of directed acyclic graphs, period T
- Reliability goal
- Architecture composed of a set of h-versions of computation nodes
- Mapping of processes on the nodes
- Process failure probabilities for all h-versions
- The number of re-executions kj on each node Nj

Output:
- True, if the system reliability is above or equal to the reliability goal
- False, if the system reliability is below the reliability goal

Page 26: Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors

System Failure Probability (SFP) Analysis

Probability of a system failure during period T due to transient faults, or the probability that any of the nodes Nj experiences more than kj transient faults during period T

(τ is the time unit for the reliability goal ρ)

Page 27: Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors

System Failure Probability (SFP) Analysis

Probability that node Nj experiences more than kj transient faults
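In line with the worked computation on Pages 36 and 37, where the probability of more than one fault is obtained by subtracting the no-fault and the one-fault probabilities from 1, this quantity can be written as follows. This is a reconstruction under that assumption; $\Pr(f; N_j^h)$ stands for the probability that exactly $f$ faults occur on node $N_j$ with hardening level $h$ and are all tolerated.

```latex
\Pr(f > k_j; N_j^h) = 1 - \Pr(0; N_j^h) - \sum_{f=1}^{k_j} \Pr(f; N_j^h)
```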

Page 28: Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors

System Failure Probability (SFP) Analysis

Terms of the formula:
- No-fault probability on node Nj
- Probability that all the combinations of exactly f faults are tolerated on node Nj
- Probability that all the combinations of f ≤ kj faults are tolerated on node Nj

Page 29: Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors

System Failure Probability (SFP) Analysis

No-fault probability on node Nj:
- A product of the no-fault probabilities of all the processes mapped on node Nj
- Each factor uses the probability of failure of process Pi on node Nj with hardening level h
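Written out, with $p_{ijh}$ as an assumed symbol for the failure probability of process $P_i$ on node $N_j$ with hardening level $h$, the product described here takes the following form. It agrees with the first step of the worked computation on Page 36, where the same quantity is written Pr(NF; ·).

```latex
\Pr(0; N_j^h) = \prod_{P_i \text{ mapped on } N_j^h} \left(1 - p_{ijh}\right)
```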

Page 30: Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors

System Failure Probability (SFP) Analysis

Probability that all the combinations of exactly f faults are tolerated on node Nj, built from:
- Probability of recovery from f faults in a particular fault scenario s* on node Nj

S* is a multiset!
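One way to write this term, as a reconstruction consistent with the one-fault step on Page 37, is given below. It assumes that each fault scenario $s^*$ is a multiset of $f$ processes drawn from those mapped on $N_j$ (the same process may be hit more than once), that $S^*$ collects these scenarios, and that $p_{ijh}$ is the failure probability of process $P_i$ on node $N_j$ with hardening level $h$.

```latex
\Pr(f; N_j^h) = \Pr(0; N_j^h) \cdot
  \sum_{s^* \in S^*,\; |s^*| = f} \;\; \prod_{P_i \in s^*} p_{ijh}
```

For two processes and $f = 1$ the sum reduces to $p_{1jh} + p_{2jh}$, which is the factor $(1.2 \cdot 10^{-5} + 1.3 \cdot 10^{-5})$ used on Page 37.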

Page 31: Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors

System Failure Probability (SFP) Analysis

Node failure probability, built from:
- No-fault probability on node Nj
- Probability that all the combinations of exactly f faults are tolerated on node Nj

Page 32: Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors


System Failure Probability (SFP) Analysis

System failure probability during period T:
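Assuming that faults on different nodes are independent, the probability that at least one node $N_j$ sees more than $k_j$ faults in a period can be written as the complement of all nodes staying within their limits. For two nodes this reduces to $a + b - a \cdot b$, the form used in the worked computation on Page 36.

```latex
\Pr\Big(\bigcup_{j} (f > k_j; N_j^h)\Big)
  = 1 - \prod_{j} \Big(1 - \Pr(f > k_j; N_j^h)\Big)
```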

Page 33: Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors


System Failure Probability (SFP) Analysis

The evaluation criteria:
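Following the final step of the worked computation on Pages 36 and 37, where the per-period survival probability is raised to the number of periods in the time unit and compared with the reliability goal, the criterion can be written as below. Here $\tau$ is the time unit of the reliability goal $\rho$ and $T$ is the period, so $\tau / T = 10000$ in that example.

```latex
\Big(1 - \Pr\Big(\bigcup_{j} (f > k_j; N_j^h)\Big)\Big)^{\tau / T} \ge \rho
```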

Page 34: Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors

System Failure Probability (SFP) Analysis

Computation example:

[Figure: schedule on N1 with h = 2 (P1, P2/1, P2/2) and N2 with h = 2 (P3/1, P3/2, P4), with messages m2 and m3 on the bus]

Page 35: Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors

System Failure Probability (SFP) Analysis

Node N1      h = 1 (Cost 16)      h = 2 (Cost 32)      h = 3 (Cost 64)
Process      t      p             t      p             t      p
P1           60     1.2·10^-3     75     1.2·10^-5     90     1.2·10^-10
P2           75     1.3·10^-3     90     1.3·10^-5     105    1.3·10^-10
P3           60     1.4·10^-3     75     1.4·10^-5     90     1.4·10^-10
P4           75     1.6·10^-3     90     1.6·10^-5     105    1.6·10^-10

Node N2      h = 1 (Cost 20)      h = 2 (Cost 40)      h = 3 (Cost 80)
Process      t      p             t      p             t      p
P1           50     1·10^-3       60     1·10^-5       75     1·10^-10
P2           65     1.2·10^-3     75     1.2·10^-5     90     1.2·10^-10
P3           50     1.2·10^-3     60     1.2·10^-5     75     1.2·10^-10
P4           65     1.3·10^-3     75     1.3·10^-5     90     1.3·10^-10

[Figure: schedule on N1 with h = 2 (P1, P2/1, P2/2) and N2 with h = 2 (P3/1, P3/2, P4), with messages m2 and m3 on the bus]

Page 36: Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors

System Failure Probability (SFP) Analysis

1) No re-execution:

Probability of no faulty processes for both nodes N1^2 and N2^2:
Pr(NF; N1^2) = (1 - 1.2·10^-5)·(1 - 1.3·10^-5) = 0.99997500015
Pr(NF; N2^2) = (1 - 1.2·10^-5)·(1 - 1.3·10^-5) = 0.99997500015

Probability of more than no faults:
Pr([f > 0]F; N1^2) = 1 - 0.99997500015 = 0.000024999844
Pr([f > 0]F; N2^2) = 1 - 0.99997500015 = 0.000024999844

The system failure probability during period T without any re-executions:
Pr([f > 0]F; N1^2 ∪ [f > 0]F; N2^2) = 0.000024999844 + 0.000024999844 - 0.000024999844 · 0.000024999844 = 0.00004999907

With T = 360 ms: (1 - 0.00004999907)^10000 ≈ 0.6065 < ρ = 1 - 10^-5

FALSE!

Page 37: Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors

System Failure Probability (SFP) Analysis

2) One re-execution on each node:

Probability of exactly one fault to be tolerated with re-execution on each node:
Pr(1F; N1^2) = 0.99997500015·(1.2·10^-5 + 1.3·10^-5) = 0.00002499937
Pr(1F; N2^2) = 0.99997500015·(1.2·10^-5 + 1.3·10^-5) = 0.00002499937

Probability of more than 1 fault:
Pr([f > 1]F; N1^2) = 1 - 0.99997500015 - 0.00002499937 = 4.8·10^-10
Pr([f > 1]F; N2^2) = 1 - 0.99997500015 - 0.00002499937 = 4.8·10^-10

The system failure probability during period T with one re-execution on each node:
Pr([f > 1]F; N1^2 ∪ [f > 1]F; N2^2) = 9.6·10^-10

With T = 360 ms: (1 - 9.6·10^-10)^10000 ≈ 0.9999904 ≥ ρ = 1 - 10^-5

TRUE!
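The two cases can be checked with a short sketch of the analysis. It assumes independent faults, enumerates fault scenarios as multisets of processes, and takes the time unit to be one hour (so that τ / T = 10000 for T = 360 ms); the function names are illustrative. Because the slides work with truncated intermediate values, the per-node probability of more than one fault comes out slightly below the 4.8·10^-10 shown above, without changing either verdict.

```python
import math
from itertools import combinations_with_replacement

def node_failure_probability(p_list, k):
    """Probability of more than k faults on a node in one period.

    p_list holds the failure probabilities of the processes mapped on
    the node; scenarios of f faults are multisets of size f.
    """
    pr_no_fault = math.prod(1.0 - p for p in p_list)
    pr_tolerated = pr_no_fault
    for f in range(1, k + 1):
        scenarios = combinations_with_replacement(p_list, f)
        pr_tolerated += pr_no_fault * sum(math.prod(s) for s in scenarios)
    return 1.0 - pr_tolerated

def sfp_analysis(nodes, period_ms, tau_ms, gamma):
    """True if the system reliability over tau meets the goal 1 - gamma.

    nodes is a list of (p_list, k) pairs, one per computation node.
    """
    pr_all_ok = math.prod(
        1.0 - node_failure_probability(p_list, k) for p_list, k in nodes
    )
    system_failure = 1.0 - pr_all_ok
    periods = tau_ms / period_ms
    # Compare in the log domain to avoid rounding next to 1.0.
    return periods * math.log1p(-system_failure) >= math.log1p(-gamma)

P_NODE = [1.2e-5, 1.3e-5]   # both nodes at h = 2 in the example
TAU_MS = 3_600_000          # assumed time unit: one hour
for k in (0, 1):
    ok = sfp_analysis([(P_NODE, k), (P_NODE, k)], 360, TAU_MS, 1e-5)
    print(f"k = {k} on each node: {ok}")
```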

Page 38: Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors

System Failure Probability (SFP) Analysis

[Figure: schedule on N1 with h = 2 (P1, P2/1, P2/2) and N2 with h = 2 (P3/1, P3/2, P4), with messages m2 and m3 on the bus]

SFPA ( ) True

Page 39: Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors


Questions?