Scheduling and Optimization of Fault-Tolerant Embedded Systems

Viacheslav Izosimov, Embedded Systems Lab (ESLAB), Linköping University, Sweden. Presentation of Licentiate Thesis.


Post on 25-Feb-2016


TRANSCRIPT

Page 1: Scheduling and Optimization of Fault-Tolerant Embedded Systems

Scheduling and Optimization of Fault-Tolerant Embedded Systems

Viacheslav Izosimov

Embedded Systems Lab (ESLAB)

Linköping University, Sweden

Presentation of Licentiate Thesis

Page 2: Scheduling and Optimization of Fault-Tolerant Embedded Systems

Motivation

Hard real-time applications
Time-constrained
Cost-constrained
Fault-tolerant
etc.

Focus on transient faults and intermittent faults

Page 3: Scheduling and Optimization of Fault-Tolerant Embedded Systems

Motivation

Transient faults

Causes: radiation, electromagnetic interference (EMI), lightning storms

Happen for a short time
Cause corruption of data and miscalculation in logic
Do not cause permanent damage to circuits
Causes are outside system boundaries

Page 4: Scheduling and Optimization of Fault-Tolerant Embedded Systems

Motivation

Intermittent faults

Causes:
Internal EMI
Crosstalk
Power supply fluctuations
Init (Data)
Software errors (Heisenbugs)

Manifest similarly to transient faults
Happen repeatedly
Causes are inside system boundaries

Page 5: Scheduling and Optimization of Fault-Tolerant Embedded Systems

Motivation

Errors caused by transient faults have to be tolerated before they crash the system
However, fault tolerance against transient faults leads to significant performance overhead
Transient faults are more likely to occur as the size of transistors is shrinking and the frequency is growing

Page 6: Scheduling and Optimization of Fault-Tolerant Embedded Systems

Motivation

Hard real-time applications
Time-constrained
Cost-constrained
Fault-tolerant
etc.

The Need for Design Optimization of Embedded Systems with Fault Tolerance

Page 7: Scheduling and Optimization of Fault-Tolerant Embedded Systems

Outline

Motivation
Background and limitations of previous work
Thesis contributions:
  Scheduling with fault tolerance requirements
  Fault tolerance policy assignment
  Checkpoint optimization
  Trading-off transparency for performance
  Mapping optimization with transparency
Conclusions and future work

Page 8: Scheduling and Optimization of Fault-Tolerant Embedded Systems

General Design Flow

System Specification
Architecture Selection
Mapping & Hardware / Software Partitioning
Scheduling
Back-end Synthesis

Feedback loops connect the steps
Fault Tolerance Techniques are applied within the flow

Page 9: Scheduling and Optimization of Fault-Tolerant Embedded Systems

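Fault Tolerance Techniques

[Figure: three fault tolerance techniques applied to process P1, drawn on a time axis from 0 to 60.
Re-execution: P1 is executed again on node N1 (P1/1, P1/2).
Rollback recovery with checkpointing: P1 is split into segments on node N1 and only the faulty segment is re-executed.
Active replication: replicas P1(1) on node N1 and P1(2) on node N2.
Overheads annotated: error-detection overhead, recovery overhead, checkpointing overhead.]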
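
As a rough illustration of how re-execution and active replication trade time for hardware, the sketch below computes worst-case completion times for both. The cost model and the names (`wcet`, `recovery`, `k`) are assumptions made for this example, not formulas quoted from the thesis: error detection is taken to be included in `wcet`, every fault forces one full re-run plus one recovery overhead, and replicas run in parallel on separate nodes.

```python
def reexecution_worst_case(wcet: float, recovery: float, k: int) -> float:
    """Worst-case finish time on one node when up to k faults hit the process.

    Assumed model: each fault costs one recovery overhead plus a full re-run.
    """
    return (k + 1) * wcet + k * recovery


def replication_cost(wcet: float, k: int) -> tuple[float, int]:
    """Worst-case finish time and node count for active replication.

    Assumed model: k + 1 replicas run in parallel, so at least one survives.
    """
    return wcet, k + 1


if __name__ == "__main__":
    # Hypothetical numbers, chosen only for illustration.
    print(reexecution_worst_case(wcet=40, recovery=5, k=2))  # 130
    print(replication_cost(wcet=40, k=2))                    # (40, 3)
```

Under these assumptions re-execution needs one node but more than three times the execution time, while replication keeps the time at one execution and pays with three nodes.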

Page 10: Scheduling and Optimization of Fault-Tolerant Embedded Systems

Limitations of Previous Work

Design optimization with fault tolerance is limited

Process mapping is not considered together with fault tolerance issues

Multiple faults are not addressed in the framework of static cyclic scheduling

Transparency, if at all addressed, is restricted to a whole computation node

Page 11: Scheduling and Optimization of Fault-Tolerant Embedded Systems

Outline

Motivation
Background and limitations of previous work
Thesis contributions:
  Scheduling with fault tolerance requirements
  Fault tolerance policy assignment
  Checkpoint optimization
  Trading-off transparency for performance
  Mapping optimization with transparency
Conclusions and future work

Page 12: Scheduling and Optimization of Fault-Tolerant Embedded Systems

Fault-Tolerant Time-Triggered Systems

Processes: re-execution, active replication, rollback recovery with checkpointing
Messages: fault-tolerant predictable protocol
Transient faults: maximum k transient faults within each application run (system period)

[Figure: application graph with processes P1 to P5 and messages m1 and m2.]

Page 13: Scheduling and Optimization of Fault-Tolerant Embedded Systems

Scheduling with Fault Tolerance Requirements

Conditional Scheduling
Shifting-based Scheduling

Page 14: Scheduling and Optimization of Fault-Tolerant Embedded Systems

Conditional Scheduling

[Figure: application with processes P1 and P2 and message m1, k = 2, time axis from 0 to 200. Schedule table so far: P1 starts at 0 under the condition true; no entry yet for P2.]

Page 15: Scheduling and Optimization of Fault-Tolerant Embedded Systems

Conditional Scheduling

[Figure: same example, k = 2, time axis from 0 to 200. Schedule table so far: P1 at 0 (condition true); P2 at 40 (condition on F_P1). Timeline: P1 followed by P2.]

Page 16: Scheduling and Optimization of Fault-Tolerant Embedded Systems

Conditional Scheduling

[Figure: same example, k = 2, time axis from 0 to 200. Schedule table so far: P1 at 0 and 45; P2 at 40. Condition columns: true and two conditions on F_P1. Timeline: P1/1 followed by its re-execution P1/2.]

Page 17: Scheduling and Optimization of Fault-Tolerant Embedded Systems

Conditional Scheduling

[Figure: same example, k = 2, time axis from 0 to 200. Entries added to the schedule table: 90 and 130, alongside 0, 45 and 40. Condition columns: true, two conditions on F_P1, and F_P1 ∧ F_P1. Timeline: P1/1, P1/2, P1/3, then P2.]

Page 18: Scheduling and Optimization of Fault-Tolerant Embedded Systems

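Conditional Scheduling

[Figure: same example, k = 2, time axis from 0 to 200. Entries added to the schedule table: 85 and 140, alongside 0, 45, 40, 90 and 130. Condition columns: true, two conditions on F_P1, two conditions of the form F_P1 ∧ F_P1, and F_P1 ∧ F_P1 ∧ F_P2. Timeline: P1/1, P1/2, then P2/1, P2/2.]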

Page 19: Scheduling and Optimization of Fault-Tolerant Embedded Systems

Conditional Scheduling

[Figure: same example, k = 2, time axis from 0 to 200. Entries added to the schedule table: 95 and 150, alongside 0, 45, 40, 90, 130, 85 and 140. Condition columns added: F_P1 ∧ F_P2 and F_P1 ∧ F_P2 ∧ F_P2. Timeline: P1, then P2/1, P2/2, P2/3.]

Page 20: Scheduling and Optimization of Fault-Tolerant Embedded Systems

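Conditional Scheduling

Fault-Tolerance Conditional Process Graph

[Figure: fault-tolerance conditional process graph (FT-CPG) for the application with processes P1 and P2 and message m1, k = 2. The graph contains three copies of P1, six copies of P2 and three copies of m1. Conditional edges are labelled with fault conditions on individual copies, for example on the first copy of P1 and on the first, second and fourth copies of P2.]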

Page 21: Scheduling and Optimization of Fault-Tolerant Embedded Systems

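Conditional Schedule Table

[Table: conditional schedule table for P1, m1 and P2 (k = 2, nodes N1 and N2). Rows: P1, m1, P2. Columns are fault conditions: true, F_P1, F_P1 ∧ F_P1, F_P1 ∧ F_P1 ∧ F_P2, F_P1 ∧ F_P2, F_P1 ∧ F_P2 ∧ F_P2 (overbars on negated conditions are not shown). Start-time entries: 0, 40, 45, 50, 85, 90, 95, 105, 130, 140, 150, 160.]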
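
A conditional schedule table stores, for every combination of observed faults, when each execution attempt starts. The sketch below builds such a table for processes that run back to back on one node. It is a simplified illustration, not the scheduling algorithm from the thesis: the function name, the single-node setup (no message or bus) and the input values are assumptions.

```python
from itertools import product


def conditional_start_times(wcets, recovery, k):
    """Enumerate fault scenarios and the resulting start time of each attempt.

    wcets: worst-case execution times of processes run back to back on one node.
    A scenario is a tuple (f_1, ..., f_n) with f_i = faults hitting process i
    and sum(f_i) <= k. Assumed model: a faulty attempt runs to completion and
    is followed by one recovery overhead before the next attempt.
    """
    table = {}
    for scenario in product(range(k + 1), repeat=len(wcets)):
        if sum(scenario) > k:
            continue
        t = 0
        starts = []
        for wcet, faults in zip(wcets, scenario):
            attempts = []
            for attempt in range(faults + 1):
                attempts.append(t)
                t += wcet
                if attempt < faults:
                    t += recovery
            starts.append(attempts)
        table[scenario] = starts
    return table


if __name__ == "__main__":
    # Hypothetical values: P1 takes 40, P2 takes 50, recovery overhead 5, k = 2.
    for scenario, starts in sorted(conditional_start_times([40, 50], 5, 2).items()):
        print(scenario, starts)
```

The hypothetical inputs were picked so that the computed start times (0, 40, 45, 85, 90, 95, 130, 140, 150) coincide with entries visible on the conditional scheduling slides. That match is suggestive only; the source does not state the execution times or the recovery overhead of this example.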

Page 22: Scheduling and Optimization of Fault-Tolerant Embedded Systems

Conditional Scheduling

Advantages:
Generates short schedules
Allows transparency to be traded off against performance (to be discussed later...)

Drawbacks:
Requires a lot of memory to store schedule tables
Scheduling algorithm is very slow

Alternative: shifting-based scheduling

Page 23: Scheduling and Optimization of Fault-Tolerant Embedded Systems

Shifting-based Scheduling

Constraints:
Messages sent over the bus should be scheduled at one time
Faults on one computation node must not affect other computation nodes

Advantages:
Requires less memory
Schedule generation is very fast

Drawbacks:
Schedules are longer
Does not allow transparency to be traded off against performance (to be discussed later...)

Page 24: Scheduling and Optimization of Fault-Tolerant Embedded Systems

Ordered FT-CPG

[Figure: ordered FT-CPG for an application with processes P1 to P4 and messages m1 to m3, k = 2, under the ordering constraints "P2 after P1" and "P3 after P4". The graph shows the numbered copies of each process and nodes labelled S in front of the messages m1, m2 and m3.]

Page 25: Scheduling and Optimization of Fault-Tolerant Embedded Systems

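Root Schedules

[Figure: root schedule with processes P1, P2, P4 and P3 on nodes N1 and N2 and messages m1, m2 and m3 on the bus. Annotations: recovery slack for P1 and P2; worst-case scenario for P1, in which P1 is re-executed.]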
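
The root schedule reserves recovery slack that several processes can share. A minimal sketch of that idea follows, under an assumed model in which all k faults may hit the longest process in the group and each fault costs one re-run plus one recovery overhead. The function names and numbers are hypothetical.

```python
def shared_recovery_slack(wcets, recovery, k):
    """Slack reserved once for a group of processes sharing a node.

    Assumed model: in the worst case all k faults hit the longest process,
    and each fault costs one re-run plus one recovery overhead.
    """
    return k * (max(wcets) + recovery)


def root_schedule_length(wcets, recovery, k):
    """Fault-free execution of all processes followed by the shared slack."""
    return sum(wcets) + shared_recovery_slack(wcets, recovery, k)


if __name__ == "__main__":
    # Hypothetical values for two processes on the same node.
    print(shared_recovery_slack([30, 20], recovery=5, k=2))  # 70
    print(root_schedule_length([30, 20], recovery=5, k=2))   # 120
```

With separate slacks the same two processes would need 2 * (30 + 5) + 2 * (20 + 5) = 120 time units of slack instead of 70, which is why sharing shortens the root schedule.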

Page 26: Scheduling and Optimization of Fault-Tolerant Embedded Systems

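Extracting Execution Scenarios

[Figure: execution scenario extracted from the root schedule, with P1, P2, P3 and P4 on nodes N1 and N2 and messages m1, m2 and m3 on the bus. P4 is re-executed twice (P4/1, P4/2, P4/3).]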

Page 27: Scheduling and Optimization of Fault-Tolerant Embedded Systems

Memory Required to Store Schedule Tables

Frozen    20 proc.            40 proc.             60 proc.              80 proc.
          k=1   k=2   k=3     k=1   k=2   k=3      k=1   k=2    k=3      k=1   k=2    k=3
100%      0.13  0.28  0.54    0.36  0.89  1.73     0.71  2.09   4.35     1.18  4.21   8.75
75%       0.22  0.57  1.37    0.62  2.06  4.96     1.20  4.64   11.55    2.01  8.40   21.11
50%       0.28  0.82  1.94    0.82  3.11  8.09     1.53  7.09   18.28    2.59  12.21  34.46
25%       0.34  1.17  2.95    1.03  4.34  12.56    1.92  10.00  28.31    3.05  17.30  51.30
0%        0.39  1.42  3.74    1.17  5.61  16.72    2.16  11.72  34.62    3.41  19.28  61.85

Applications with more frozen nodes require less memory

Highlighted column (40 processes, k = 3): 1.73, 4.96, 8.09, 12.56, 16.72

Page 28: Scheduling and Optimization of Fault-Tolerant Embedded Systems

Memory Required to Store Root Schedule

Frozen    20 proc.   40 proc.   60 proc.   80 proc.
100%      0.016      0.034      0.054      0.070

(one value per application size; the slide shows no separate values for k = 1, 2, 3)

Shifting-based scheduling requires very little memory

Highlighted comparison: 1.73 versus 0.03

Page 29: Scheduling and Optimization of Fault-Tolerant Embedded Systems

Schedule Generation Time and Quality

Shifting-based scheduling is much faster than conditional scheduling
Shifting-based scheduling requires 0.2 seconds to generate a root schedule for an application of 120 processes and 10 faults
Conditional scheduling already takes 319 seconds to generate a schedule table for an application of 40 processes and 4 faults

Shifting-based schedules are ~15% worse (in terms of fault tolerance overhead) than conditional schedules with 100% of inter-processor messages set to frozen

Page 30: Scheduling and Optimization of Fault-Tolerant Embedded Systems

Fault Tolerance Policy Assignment

Checkpoint Optimization

Page 31: Scheduling and Optimization of Fault-Tolerant Embedded Systems

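Fault Tolerance Policy Assignment

[Figure: three policies for tolerating two faults in process P1.
Re-execution: P1/1, P1/2, P1/3 on node N1.
Replication: P1(1), P1(2), P1(3) on nodes N1, N2, N3.
Re-executed replicas: P1(1)/1 and P1(1)/2 on node N1, P1(2) on node N2.]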
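
The three policies on this slide can be read as different splits of the k faults between spatial redundancy (replicas) and time redundancy (re-executions). The sketch below lists the splits under a simplified assumption that is not taken from the thesis: r replicas absorb r - 1 faults, and the remaining faults must be covered by re-executions. The function name and parameters are hypothetical.

```python
def policy_options(k, max_nodes):
    """List (replicas, reexecutions) pairs that tolerate k faults.

    Assumed model: r replicas absorb r - 1 faults through redundancy, and the
    remaining k - (r - 1) faults must be covered by re-executions.
    """
    options = []
    for replicas in range(1, min(k + 1, max_nodes) + 1):
        options.append((replicas, k - (replicas - 1)))
    return options


if __name__ == "__main__":
    for replicas, reexecutions in policy_options(k=2, max_nodes=3):
        print(f"{replicas} replica(s), {reexecutions} re-execution(s)")
```

For k = 2 and three nodes this yields pure re-execution (1 replica, 2 re-executions), re-executed replicas (2 replicas, 1 re-execution) and pure replication (3 replicas, 0 re-executions).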

Page 32: Scheduling and Optimization of Fault-Tolerant Embedded Systems

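Re-execution vs. Replication

Worst-case execution times:
      N1   N2
P1    40   50
P2    40   60
P3    50   70

[Figure: schedules on N1, N2 and the bus for two applications, A1 and A2, built from P1, P2 and P3 with messages m1 and m2.
A1: with re-execution the deadline is missed; with replication (P1(1), P2(1), P3(1) and P1(2), P2(2), P3(2), messages m1(1), m1(2), m2(1), m2(2)) the deadline is met. Replication is better.
A2: with re-execution the deadline is met; with replication (messages m1(1), m1(2)) the deadline is missed. Re-execution is better.]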

Page 33: Scheduling and Optimization of Fault-Tolerant Embedded Systems

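Fault Tolerance Policy Assignment

Worst-case execution times:
      N1   N2
P1    40   50
P2    60   60
P3    80   80
P4    40   50

[Figure: application with processes P1 to P4 and messages m1 to m3, scheduled on N1, N2 and the bus against a deadline.
No replication (P1 to P4, message m2 on the bus): deadline missed.
All processes replicated (P1(1) to P4(2), messages m1(1) to m3(2)): deadline missed.
Optimization of fault tolerance policy assignment (P1 replicated as P1(1) and P1(2); P2, P3 and P4 not replicated; messages m1(2) and m2(1) on the bus): deadline met.]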

Page 34: Scheduling and Optimization of Fault-Tolerant Embedded Systems

Optimization Strategy

Design optimization:
  Fault tolerance policy assignment
  Mapping of processes and messages
  Root schedules

Three tabu-search optimization algorithms:
  1. Mapping and Fault Tolerance Policy assignment (MRX): re-execution, replication or both
  2. Mapping and only Re-Execution (MX)
  3. Mapping and only Replication (MR)

Techniques used: tabu-search, shifting-based scheduling
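
To make the role of tabu search concrete, here is a minimal generic sketch applied to a toy mapping and policy assignment problem. It is not the MRX, MX or MR algorithm from the thesis: the cost function is a hypothetical stand-in for a schedule length, the execution costs (40 and 25) are invented, and a real flow would call the scheduler instead.

```python
import itertools


def tabu_search(initial, neighbours, cost, iterations=50, tenure=5):
    """Minimal tabu search: always move to the best non-tabu neighbour.

    Recently visited solutions are kept in a short tabu list so the search
    can leave a local minimum instead of stepping straight back into it.
    """
    current = best = initial
    tabu = [initial]
    for _ in range(iterations):
        candidates = [s for s in neighbours(current) if s not in tabu]
        if not candidates:
            break
        current = min(candidates, key=cost)
        tabu = (tabu + [current])[-tenure:]
        if cost(current) < cost(best):
            best = current
    return best


# Hypothetical toy problem: each process gets a (node, policy) pair.
NODES = ("N1", "N2")
POLICIES = ("re-execution", "replication")
CHOICES = list(itertools.product(NODES, POLICIES))


def neighbours(solution):
    """All solutions that differ in the choice made for exactly one process."""
    for i, choice in enumerate(solution):
        for other in CHOICES:
            if other != choice:
                yield solution[:i] + (other,) + solution[i + 1:]


def toy_cost(solution):
    """Stand-in for a schedule length; a real flow would call the scheduler."""
    load = {node: 0 for node in NODES}
    for node, policy in solution:
        load[node] += 40 if policy == "re-execution" else 25
    return max(load.values())


if __name__ == "__main__":
    start = tuple(("N1", "re-execution") for _ in range(4))
    result = tabu_search(start, neighbours, toy_cost)
    print(result, toy_cost(result))
```

The search passes through several moves that do not improve the cost before it reaches the balanced solution, which is the behaviour the tabu list is meant to allow.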

Page 35: Scheduling and Optimization of Fault-Tolerant Embedded Systems

Experimental Results

[Figure: average % deviation from MRX (axis from 0 to 100, with 20 and 80 highlighted) versus number of processes (20, 40, 60, 80, 100). Curves: Mapping and replication (MR), Mapping and re-execution (MX), Mapping and policy assignment (MRX).]

Schedulability improvement under resource constraints

Page 36: Scheduling and Optimization of Fault-Tolerant Embedded Systems

Checkpoint Optimization

[Figure: process P1 on node N1 split by a checkpoint into two execution segments; after a fault only the affected segment is executed again.]

Page 37: Scheduling and Optimization of Fault-Tolerant Embedded Systems

Locally Optimal Number of Checkpoints

Process P1: C1 = 50 ms, k = 2, overhead parameters of 15 ms, 5 ms and 10 ms

[Figure: execution of P1 with one to five checkpoints (one to five execution segments), plotted against the number of checkpoints.]
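
The trade-off behind a locally optimal number of checkpoints is that more checkpoints shorten each re-executed segment but add overhead to the fault-free run. The sketch below searches for the best count by brute force. The cost model is an assumption made for illustration, and so is the way the slide's overhead values are combined into `per_checkpoint` and `recovery`; the source does not say which value belongs to which parameter.

```python
def worst_case_time(wcet, n, per_checkpoint, recovery, k):
    """Worst-case execution time of one process split into n segments.

    Assumed model: each segment adds a fixed overhead, and each of the k
    faults re-runs one segment of length wcet / n plus a recovery overhead.
    """
    return wcet + n * per_checkpoint + k * (wcet / n + recovery)


def locally_optimal_checkpoints(wcet, per_checkpoint, recovery, k, max_n=20):
    """Try every checkpoint count up to max_n and keep the cheapest."""
    return min(
        range(1, max_n + 1),
        key=lambda n: worst_case_time(wcet, n, per_checkpoint, recovery, k),
    )


if __name__ == "__main__":
    # Hypothetical parameter assignment: 15 ms per segment, 15 ms recovery.
    best = locally_optimal_checkpoints(wcet=50, per_checkpoint=15, recovery=15, k=2)
    print(best, worst_case_time(50, best, 15, 15, 2))
```

A brute-force scan is used instead of a closed-form expression so that the sketch stays valid if the assumed cost model is replaced by another one.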

Page 38: Scheduling and Optimization of Fault-Tolerant Embedded Systems

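Globally Optimal Number of Checkpoints

Application: P1 and P2 connected by message m1, k = 2
P1: C1 = 50 ms, overhead parameters 10, 5, 10
P2: C2 = 60 ms, overhead parameters 10, 5, 10

[Figure: two checkpoint distributions. Three execution segments each for P1 and P2 give a schedule length of 265; two execution segments each give 255.]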

Page 39: Scheduling and Optimization of Fault-Tolerant Embedded Systems

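Globally Optimal Number of Checkpoints

Application: P1 and P2 connected by message m1, k = 2
P1: C1 = 50 ms, overhead parameters 10, 5, 10
P2: C2 = 60 ms, overhead parameters 10, 5, 10

[Figure: a) three execution segments each for P1 and P2, schedule length 265; b) two execution segments each for P1 and P2, schedule length 255.]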
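
The slide shows that choosing the checkpoint count of each process in isolation is not optimal for the application as a whole. One way to see why is to let the processes share a single recovery slack, as in the sketch below. The cost model, the function names and the parameter values are assumptions for illustration; the model is not claimed to be the one used in the thesis.

```python
from itertools import product


def schedule_length(wcets, counts, per_checkpoint, recovery, k):
    """Length of a chain of checkpointed processes that share recovery slack.

    Assumed model: fault-free time is the sum of each WCET plus its
    checkpoint overheads; the shared slack covers k faults hitting the
    process with the longest segment.
    """
    fault_free = sum(c + n * per_checkpoint for c, n in zip(wcets, counts))
    slack = k * max(c / n + recovery for c, n in zip(wcets, counts))
    return fault_free + slack


def globally_optimal_checkpoints(wcets, per_checkpoint, recovery, k, max_n=6):
    """Exhaustively search all combinations of checkpoint counts."""
    options = product(range(1, max_n + 1), repeat=len(wcets))
    return min(
        options,
        key=lambda counts: schedule_length(wcets, counts, per_checkpoint, recovery, k),
    )


if __name__ == "__main__":
    # Hypothetical parameters: 15 per segment, recovery 10, k = 2.
    wcets = [50, 60]
    best = globally_optimal_checkpoints(wcets, per_checkpoint=15, recovery=10, k=2)
    print(best, schedule_length(wcets, best, 15, 10, 2))
    print((3, 3), schedule_length(wcets, (3, 3), 15, 10, 2))
```

Under this assumed model two checkpoints per process give a length of 250 and three per process give 260. The slide reports 255 and 265, so the sketch does not reproduce the exact figures, but it shows the same effect: fewer checkpoints than the per-process optimum can give a shorter overall schedule.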


Page 41: Scheduling and Optimization of Fault-Tolerant Embedded Systems

Global Optimization vs. Local Optimization

Does the optimization reduce the fault tolerance overheads on the schedule length?

[Figure: % deviation from MC0 (how much smaller the fault tolerance overhead is), axis from 0% to 40%, versus application size (the number of tasks: 40, 60, 80, 100), for 4 nodes and 3 faults. MC: Global Optimization of Checkpoint Distribution. MC0: Local Optimization of Checkpoint Distribution.]

Page 42: Scheduling and Optimization of Fault-Tolerant Embedded Systems

Trading-off Transparency for Performance

Mapping Optimization with Transparency

Page 43: Scheduling and Optimization of Fault-Tolerant Embedded Systems

FT Implementations with Transparency

Good for debugging and testing
Transparency is achieved with frozen processes and messages

[Figure: application with processes P1 to P5 and messages m1 and m2. The legend distinguishes regular processes/messages from frozen processes/messages; P3 is marked as frozen.]

Page 44: Scheduling and Optimization of Fault-Tolerant Embedded Systems

No Transparency

Worst-case execution times (X: mapping not allowed), k = 2, overhead 5 ms:
      N1   N2
P1    30   X
P2    20   X
P3    X    20
P4    X    30

[Figure: schedules of P1 to P4 and messages m1 to m3 on N1, N2 and the bus, against a deadline, for the no fault scenario and for the worst-case fault scenario (P1 and P4 are re-executed).]

Processes start at different times
Messages are sent at different times

Page 45: Scheduling and Optimization of Fault-Tolerant Embedded Systems

Full Transparency and Customized Transparency

[Figure: schedules of P1 to P4 and messages m1 to m3 for the same application, each compared against the deadline, under three settings: no transparency, customized transparency (shown with its no fault scenario and with P1 re-executed) and full transparency (with P3 re-executed).]

Page 46: Scheduling and Optimization of Fault-Tolerant Embedded Systems

Trading-Off Transparency for Performance

How much longer is the schedule with fault tolerance?

Four (4) computation nodes, recovery time 5 ms. Columns are ordered by increasing transparency.

Processes  0%             25%            50%             75%             100%
           k=1 k=2 k=3    k=1 k=2 k=3    k=1 k=2 k=3     k=1 k=2 k=3     k=1 k=2 k=3
20         24  44  63     32  60  92     39  74  115     48  83  133     48  86  139
40         17  29  43     20  40  58     28  49  72      34  60  90      39  66  97
60         12  24  34     13  30  43     19  39  58      28  54  79      32  58  86
80         8   16  22     10  18  29     14  27  39      24  41  66      27  43  73

Highlighted row (40 processes, k = 2): 29, 40, 49, 60, 66

Trading transparency for performance is essential

Page 47: Scheduling and Optimization of Fault-Tolerant Embedded Systems

Mapping with Transparency

Worst-case execution times, k = 2, overhead 10 ms:
      N1   N2
P1    30   30
P2    40   40
P3    50   50
P4    60   60
P5    40   40
P6    50   50

[Figure: application with processes P1 to P6 and messages m1 to m4. Schedules on N1, N2 and the bus against a deadline: the optimal mapping without transparency, and the worst-case fault scenario for that mapping, in which P4 is re-executed twice (P4/1, P4/2, P4/3).]

Page 48: Scheduling and Optimization of Fault-Tolerant Embedded Systems

Mapping with Transparency

Worst-case execution times, k = 2, overhead 10 ms:
      N1   N2
P1    30   30
P2    40   40
P3    50   50
P4    60   60
P5    40   40
P6    50   50

[Figure: application with processes P1 to P6 and messages m1 to m4. Schedules on N1, N2 and the bus against a deadline: the worst-case fault scenario with transparency for the "optimal" mapping, in which P2 is re-executed twice (P2/1, P2/2, P2/3), and the worst-case fault scenario with transparency and optimized mapping, in which P4 is re-executed twice (P4/1, P4/2, P4/3).]

Page 49: Scheduling and Optimization of Fault-Tolerant Embedded Systems

Design Optimization

Hill-climbing mapping optimization heuristic

Schedule length is evaluated in one of two ways:
1. Conditional Scheduling (CS): slow
2. Schedule Length Estimation (SE): fast
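
The sketch below shows the general shape of a hill-climbing mapping heuristic driven by a cost estimate. It is a generic illustration, not the heuristic from the thesis: `toy_estimate` is a hypothetical stand-in for schedule length estimation that simply returns the largest node load, and the execution times are invented.

```python
# Hypothetical execution times, used only by the toy estimate below.
WCET = {"P1": 30, "P2": 40, "P3": 50, "P4": 60}


def hill_climb(mapping, nodes, estimate):
    """Greedy mapping optimization: keep remapping one process at a time
    while the estimated schedule length improves."""
    best_cost = estimate(mapping)
    improved = True
    while improved:
        improved = False
        for process in list(mapping):
            for node in nodes:
                if node == mapping[process]:
                    continue
                candidate = {**mapping, process: node}
                cost = estimate(candidate)
                if cost < best_cost:
                    mapping, best_cost, improved = candidate, cost, True
    return mapping, best_cost


def toy_estimate(mapping):
    """Stand-in for schedule length estimation: the largest node load."""
    load = {}
    for process, node in mapping.items():
        load[node] = load.get(node, 0) + WCET[process]
    return max(load.values())


if __name__ == "__main__":
    start = {process: "N1" for process in WCET}
    print(hill_climb(start, ["N1", "N2"], toy_estimate))
```

In this toy run the search stops at a cost of 110 although a balanced 90/90 split exists, which is the usual limitation of hill climbing. Because the estimate is called for every candidate move, a fast estimator matters far more here than in a single scheduling run.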

Page 50: Scheduling and Optimization of Fault-Tolerant Embedded Systems

Experimental Results

How much faster is schedule length estimation (SE) compared to conditional scheduling (CS)?

Setup: 4 nodes, 15 applications, 25% of processes and 50% of messages are frozen, recovery overhead = 5 ms

Run time in seconds:
               k = 2 faults    k = 3 faults    k = 4 faults
               SE     CS       SE     CS       SE     CS
20 processes   0.01   0.07     0.02   0.28     0.04   1.37
30 processes   0.13   0.39     0.19   2.93     0.26   31.50
40 processes   0.32   1.34     0.50   17.02    0.69   318.88

For 40 processes and k = 4 faults (0.69 s versus 318.88 s), schedule length estimation (SE) is more than 400 times faster than conditional scheduling (CS)
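
The speed-up claim can be checked directly from the reported run times. The snippet below recomputes the ratio for every table cell; only the helper name `speedup` is introduced here.

```python
# Run times in seconds from the table: (processes, k) -> (SE, CS).
RUN_TIMES = {
    (20, 2): (0.01, 0.07), (20, 3): (0.02, 0.28), (20, 4): (0.04, 1.37),
    (30, 2): (0.13, 0.39), (30, 3): (0.19, 2.93), (30, 4): (0.26, 31.50),
    (40, 2): (0.32, 1.34), (40, 3): (0.50, 17.02), (40, 4): (0.69, 318.88),
}


def speedup(se_seconds, cs_seconds):
    """How many times faster SE is than CS."""
    return cs_seconds / se_seconds


if __name__ == "__main__":
    for (processes, k), (se, cs) in sorted(RUN_TIMES.items()):
        print(f"{processes} processes, k = {k}: {speedup(se, cs):.1f}x")
```

The ratio grows with both application size and k, and only the largest case (40 processes, k = 4) exceeds a factor of 400.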

Page 51: Scheduling and Optimization of Fault-Tolerant Embedded Systems

Experimental Results

How much is the improvement when transparency is taken into account?

Setup: 4 computation nodes, 15 applications, 25% of processes and 50% of messages are frozen, recovery overhead = 5 ms

               k = 2 faults   k = 3 faults   k = 4 faults
20 processes   32.89%         32.20%         30.56%
30 processes   35.62%         31.68%         30.58%
40 processes   28.88%         28.11%         28.03%

For 30 processes and k = 3 faults, the schedule length of fault-tolerant applications is 31.68% shorter on average if transparency is considered during mapping

Page 52: Scheduling and Optimization of Fault-Tolerant Embedded Systems

Outline

Motivation
Background and limitations of previous work
Thesis contributions:
  Scheduling with fault tolerance requirements
  Fault tolerance policy assignment
  Checkpoint optimization
  Trading-off transparency for performance
  Mapping optimization with transparency
Conclusions and future work

Page 53: Scheduling and Optimization of Fault-Tolerant Embedded Systems

Conclusions

Scheduling with fault tolerance requirements
  Two novel scheduling techniques
  Handling customized transparency requirements, trading-off transparency for performance
  Fast scheduling alternative with low memory requirements for schedules

Page 54: Scheduling and Optimization of Fault-Tolerant Embedded Systems

Conclusions

Design optimization with fault tolerance
  Policy assignment optimization strategy
  Estimation-driven mapping optimization that can handle customized transparency requirements
  Optimization of the number of checkpoints

Approaches and algorithms have been evaluated on a large number of synthetic applications and a real-life example, a vehicle cruise controller

Page 55: Scheduling and Optimization of Fault-Tolerant Embedded Systems

Design Optimization of Embedded Systems with Fault Tolerance is Essential

Page 56: Scheduling and Optimization of Fault-Tolerant Embedded Systems

Future Work

Fault-Tree Analysis
Probabilistic Fault Model
Soft Real-Time
Some More…