Tolerating Communication and Processor Failures in Distributed Real-Time Systems

Hamoudi Kalla, Alain Girault and Yves Sorel

Grenoble, November 13, 2003


Outline

• Introduction

• Modeling distributed real-time systems

• The Fault model

• Related work

• Processor fault tolerance

• Communication fault tolerance

• Conclusion and future work


Introduction

[Figure: design flow. A compiler turns the high-level program into a model of the algorithm. This model, together with the architecture specification, distribution constraints, execution times, real-time constraints and failure specification, is the input of the fault-tolerant distribution and scheduling heuristic, which produces a fault-tolerant distributed static schedule. A code generator then produces the fault-tolerant distributed code.]


Modeling distributed real-time systems

a. Algorithm Model

« I1 and I2 » are input operations (sensors)

« O » is the output operation (actuator)

« A, B and C » are computation operations

[Figure: algorithm graph with operations I1, I2, A, B, C and O.]
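The algorithm model can be sketched as a directed acyclic graph of operations. In the minimal Python sketch below, the operation names come from the slide, but the edge set (`DEPENDS_ON`) is a hypothetical one chosen for illustration, since the edges of the original figure are not recoverable here.

```python
from graphlib import TopologicalSorter

# Operation kinds, as named on the slide.
KIND = {
    "I1": "input", "I2": "input",
    "A": "computation", "B": "computation", "C": "computation",
    "O": "output",
}

# ASSUMPTION: hypothetical data dependencies (operation -> its predecessors).
DEPENDS_ON = {
    "A": {"I1"},
    "B": {"I1", "I2"},
    "C": {"A", "B"},
    "O": {"C"},
}

def execution_order(depends_on):
    """Return one order in which every operation follows its predecessors."""
    return list(TopologicalSorter(depends_on).static_order())

order = execution_order(DEPENDS_ON)
```

Any valid schedule has to respect such an order on each processor and across communications.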


Modeling distributed real-time systems

b. Architecture Model

« P1, P2 and P3 » are processors

« B1 and B2 » are communication buses

[Figure: architecture graph with processors P1, P2 and P3 and buses B1 and B2. Inset: a processor consists of a computation unit, memory and co-processors.]


The Fault Model

1. Tolerating a fixed number of fail-silent processors.

2. Tolerating a fixed number of fail-silent buses: complete and partial faults.

[Figure: three copies of the architecture (P1, P2, P3, B1, B2) illustrating processor faults, complete bus faults and partial bus faults.]
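The three fault classes can be made concrete with a small reachability check. The sketch below is an illustration only: it assumes every processor is attached to every bus, and it assumes that a partial bus fault cuts some processors off from a bus while the others can still use it. The names `FaultState` and `usable_buses` are hypothetical.

```python
from dataclasses import dataclass, field

PROCESSORS = {"P1", "P2", "P3"}
BUSES = {"B1", "B2"}  # ASSUMPTION: each processor is attached to both buses

@dataclass
class FaultState:
    # Fail-silent processors: they stop producing any output.
    failed_processors: set = field(default_factory=set)
    # Complete bus fault: no processor can use the bus.
    failed_buses: set = field(default_factory=set)
    # Partial bus fault (ASSUMED semantics): bus -> processors cut off from it.
    cut_off: dict = field(default_factory=dict)

def usable_buses(src, dst, state):
    """Buses over which src can still reach dst under the given faults."""
    if src in state.failed_processors or dst in state.failed_processors:
        return set()
    result = set()
    for bus in BUSES:
        if bus in state.failed_buses:
            continue
        isolated = state.cut_off.get(bus, set())
        if src in isolated or dst in isolated:
            continue
        result.add(bus)
    return result
```

For example, with `FaultState(cut_off={"B1": {"P3"}})`, P1 and P2 can still use both buses, while P1 can reach P3 over B2 only.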


Problem

Find a distributed schedule of the algorithm on the architecture that is fault-tolerant to processor and communication failures.

[Figure: the algorithm graph (I1, I2, A, B, C, O) is scheduled onto the architecture (P1, P2, P3, B1, B2).]


Related Work (1)

1. Time-Triggered Architecture (TTA): active replication of operations and communications. (20 years = 100 master's theses and 25 doctoral theses)

2. Forward Error Correction (FEC): passive or active replication of operations and active replication of communications.


Related Work (2)

1. Time-Triggered Architecture (TTA):

Processor fault tolerance: k replicas or copies of each operation are actively allocated to separate processors.

Communication fault tolerance: k’ replicas or copies of each communication are actively allocated to separate buses.


Related Work (3)

2. Forward Error Correction (FEC):

Processor fault tolerance: k replicas or copies of each operation are actively or passively allocated to separate processors.

Communication fault tolerance: First, each communication is encoded by the FEC code into k’ messages carrying redundant information. Next, the k’ messages are actively allocated to separate buses.
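To illustrate the idea of coding one communication into several messages with redundant information, here is a minimal sketch using a single XOR parity message. This is only one simple erasure code, chosen for brevity; it is not the FEC code used in the cited work, and the function names `encode` and `decode` are hypothetical. It tolerates the loss of any one of the messages, for instance the one carried by a failed bus.

```python
def encode(data: bytes, n_messages: int) -> list:
    """Split data into n_messages - 1 equal fragments plus one XOR parity message."""
    n_frag = n_messages - 1                    # requires n_messages >= 2
    size = -(-len(data) // n_frag)             # ceiling division
    padded = data.ljust(size * n_frag, b"\0")  # zero-pad the last fragment
    frags = [padded[i * size:(i + 1) * size] for i in range(n_frag)]
    parity = bytes(size)
    for f in frags:
        parity = bytes(a ^ b for a, b in zip(parity, f))
    return frags + [parity]

def decode(messages: list, length: int) -> bytes:
    """Rebuild the data when at most one message (marked None) was lost."""
    missing = [i for i, m in enumerate(messages) if m is None]
    if len(missing) > 1:
        raise ValueError("single parity tolerates only one lost message")
    if missing:
        size = len(next(m for m in messages if m is not None))
        rebuilt = bytes(size)
        # XOR of all surviving messages equals the missing one.
        for m in messages:
            if m is not None:
                rebuilt = bytes(a ^ b for a, b in zip(rebuilt, m))
        messages = list(messages)
        messages[missing[0]] = rebuilt
    return b"".join(messages[:-1])[:length]
```

With `encode(data, 3)`, each of the three messages would go to a separate bus, and `decode` recovers `data` if any single one is lost.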


Processor fault tolerance

Use active software replication of operations, where each operation is replicated on k different processors to tolerate k processor failures.
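A minimal sketch of active replication follows. It assumes fail-silent processors and places k + 1 copies of each operation in total, so that at least one copy survives any k processor failures; this counting is an assumption of the sketch. The round-robin placement and the names `place_replicas` and `surviving_replicas` are hypothetical and stand in for a real scheduling heuristic.

```python
def place_replicas(operations, processors, k):
    """Assign each operation to k + 1 distinct processors (round-robin)."""
    if k + 1 > len(processors):
        raise ValueError("need at least k + 1 processors")
    placement = {}
    for i, op in enumerate(operations):
        # Consecutive indices modulo len(processors) are distinct
        # because k + 1 <= len(processors).
        placement[op] = [processors[(i + j) % len(processors)]
                         for j in range(k + 1)]
    return placement

def surviving_replicas(placement, failed):
    """Replicas of each operation that still run after the given processors fail."""
    return {op: [p for p in procs if p not in failed]
            for op, procs in placement.items()}
```

Because all replicas run in every execution, no failure detection is needed on the processor side: a consumer simply uses the first result it receives.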


Communication fault tolerance

a. Use passive software replication of communications, which requires a « watchdog timer ».

b. Split each data communication into k messages (data fragmentation).
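Passive replication of a communication with a watchdog timer can be sketched with simulated time. In this illustration the message is sent on a primary bus only; if it has not arrived when the watchdog expires, it is sent again on a backup bus. The function `deliver` and its parameters are hypothetical, and `timeout` is assumed to be at least `transfer_time`.

```python
def deliver(send_time, primary_bus, backup_bus, failed_buses,
            transfer_time, timeout):
    """Return (bus used, arrival time), or None if both buses have failed.

    The message is first sent on primary_bus only. If it has not arrived
    when the watchdog expires (send_time + timeout), it is re-sent on
    backup_bus.
    """
    if primary_bus not in failed_buses:
        return primary_bus, send_time + transfer_time
    if backup_bus not in failed_buses:
        # The failure is noticed only when the watchdog fires.
        return backup_bus, send_time + timeout + transfer_time
    return None
```

The sketch shows the trade-off of passive replication: no extra bus load in the fault-free case, but a recovery delay of one watchdog timeout when the primary bus fails.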



Why data fragmentation of communication?

1. Distinction between complete and partial communication faults.

2. Enable rapid recovery from processor and bus failures.
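One possible reading of the recovery argument is that, once a communication is fragmented over several buses, only the fragment carried by a failed bus needs to be sent again. The sketch below illustrates that reading under assumptions: one fragment per bus, and lost fragments re-sent on the first surviving bus. The names `fragment` and `send_fragmented` are hypothetical.

```python
def fragment(data: bytes, k: int) -> list:
    """Split data into k contiguous fragments of near-equal size."""
    size = -(-len(data) // k)  # ceiling division
    return [data[i * size:(i + 1) * size] for i in range(k)]

def send_fragmented(data, buses, failed_buses):
    """Send one fragment per bus; re-send fragments of failed buses elsewhere.

    Returns (reassembled data, number of bytes that had to be re-sent).
    """
    alive = [b for b in buses if b not in failed_buses]
    if not alive:
        raise RuntimeError("no bus left")
    received = []
    resent = 0
    for bus, frag in zip(buses, fragment(data, len(buses))):
        if bus in failed_buses:
            # ASSUMPTION: only this fragment is retransmitted, on a surviving bus.
            bus = alive[0]
            resent += len(frag)
        received.append((bus, frag))
    return b"".join(frag for _, frag in received), resent
```

With two buses and a 10-byte message, a failure of one bus costs a 5-byte retransmission instead of the whole message.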


Recovery from failures

1. Processor fault

2. Partial bus fault

3. Complete bus fault

[Figures: Example (1) and Example (2).]

Conclusion and future work

Result

A new method to tolerate both communication and processor failures in distributed real-time systems, which may reduce the load and the overhead of recovery from failures.

Future work

Implementation of the proposed method in the SynDEx tool.

Simulations.


Questions?
