Tolerating Communication and Processor Failures in Distributed Real-Time Systems

Hamoudi Kalla, Alain Girault and Yves Sorel

Grenoble, November 13, 2003


POPART. Rhône-Alpes.



Outline

• Introduction

• Modeling distributed real-time systems

• The Fault model

• Related work

• Processor fault tolerance

• Communication fault tolerance

• Conclusion and future work


Introduction

[Figure: design flow]

• High level program → Compiler → Model of the algorithm

• Other inputs: architecture specification, distribution constraints, execution times, real-time constraints, failure specification

• Fault-tolerant distribution and scheduling heuristic → Fault-tolerant distributed static schedule

• Code generator → Fault-tolerant distributed code


Modeling distributed real-time systems

a. Algorithm Model

« I1 and I2 » are input operations (sensors)

« O » is the output operation (actuator)

« A, B and C » are computation operations

[Figure: algorithm graph with operations I1, I2, A, B, C and O]


Modeling distributed real-time systems

b. Architecture Model

« P1, P2 and P3 » are processors

« B1 and B2 » are communication buses

[Figure: architecture graph with processors P1, P2, P3 and buses B1, B2; inset: a processor made of a computation unit, a memory and two co-processors]
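The algorithm and architecture models can be written down as two small data structures. This is only a sketch: the slides name the operations, processors and buses but the figures are lost, so the data-dependency edges and the bus connectivity below are assumptions chosen for illustration, as are the class and method names.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class Algorithm:
    """Data-flow graph: operations plus (producer, consumer) dependencies."""
    operations: Set[str]
    edges: Set[Tuple[str, str]]

    def predecessors(self, op: str) -> Set[str]:
        """Operations whose output is consumed by `op`."""
        return {src for src, dst in self.edges if dst == op}


@dataclass
class Architecture:
    """Processors and, for each bus, the set of processors attached to it."""
    processors: Set[str]
    buses: Dict[str, Set[str]]

    def buses_between(self, p: str, q: str) -> List[str]:
        """Buses that connect both processors, in name order."""
        return sorted(b for b, procs in self.buses.items()
                      if p in procs and q in procs)


# Hypothetical edges: the original figure is not available.
algo = Algorithm(
    operations={"I1", "I2", "A", "B", "C", "O"},
    edges={("I1", "A"), ("I2", "B"), ("A", "C"), ("B", "C"), ("C", "O")},
)

# Assumption: every processor is attached to both buses.
arch = Architecture(
    processors={"P1", "P2", "P3"},
    buses={"B1": {"P1", "P2", "P3"}, "B2": {"P1", "P2", "P3"}},
)
```

With these assumptions, `algo.predecessors("C")` gives the operations C must wait for, and `arch.buses_between("P1", "P3")` lists the buses available for a P1 to P3 communication.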


The Fault Model

1. Tolerating a fixed number of fail-silent processors.

2. Tolerating a fixed number of fail-silent buses: complete and partial faults.

[Figure: three panels on the architecture P1, P2, P3, B1, B2: processor faults, complete bus faults, partial bus faults]


Problem

Find a distributed schedule of the algorithm on the architecture that is fault-tolerant to processor and communication failures.

[Figure: the algorithm graph (I1, I2, A, B, C, O) is scheduled onto the architecture (P1, P2, P3, B1, B2)]


Related Work (1)

1. Time-Triggered Architecture (TTA): active replication of operations and communications. (20 years = 100 master's theses and 25 doctoral theses)

2. Forward Error Correction (FEC): passive or active replication of operations and active replication of communications.


Related Work (2)

1. Time-Triggered Architecture (TTA):

Processor fault tolerance: k replicas or copies of each operation are actively allocated to separate processors.

Communication fault tolerance: k' replicas or copies of each communication are actively allocated to separate buses.


Related Work (3)

2. Forward Error Correction (FEC):

Processor fault tolerance: k replicas or copies of each operation are actively or passively allocated to separate processors.

Communication fault tolerance: First, each communication is encoded with the FEC code into k' messages carrying redundant information. Next, the k' messages are actively allocated to separate buses.
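To make the idea of encoding one communication into several redundant messages concrete, here is a minimal sketch using a single XOR parity fragment. This is an assumption for illustration only: the slides do not say which FEC code the cited work uses, and the function names are hypothetical. With one parity fragment, any one of the messages may be lost (for example because its bus failed) and the data can still be rebuilt.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def fec_encode(data: bytes, k: int) -> List[bytes]:
    """Encode `data` into k messages: k-1 data fragments plus one parity."""
    if k < 2:
        raise ValueError("need at least one data fragment and one parity")
    n = k - 1
    size = -(-len(data) // n)               # ceiling division
    padded = data.ljust(size * n, b"\0")    # pad so fragments are equal-sized
    fragments = [padded[i * size:(i + 1) * size] for i in range(n)]
    parity = reduce(xor_bytes, fragments)
    return fragments + [parity]


def fec_decode(messages: List[Optional[bytes]], length: int) -> bytes:
    """Rebuild the data; at most one message may be missing (None)."""
    missing = [i for i, m in enumerate(messages) if m is None]
    if len(missing) > 1:
        raise ValueError("a single parity fragment recovers only one loss")
    repaired = list(messages)
    if missing:
        present = [m for m in messages if m is not None]
        # The XOR of all k messages is zero, so the missing one is the
        # XOR of the others.
        repaired[missing[0]] = reduce(xor_bytes, present)
    return b"".join(repaired[:-1])[:length]


data = b"sensor-frame"
messages = fec_encode(data, 3)   # one message per bus, for example
messages[0] = None               # the bus carrying message 0 failed
recovered = fec_decode(messages, len(data))
```

The receiver needs the original length to strip the padding; here it is passed explicitly, which is a simplification.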


Outline

• Introduction

• Modeling distributed real-time systems

• The Fault model

• Related work

• Processor fault tolerance

• Communication fault tolerance

• Conclusion and future work


Processor fault tolerance

Use active software replication of operations: each operation is replicated on k+1 different processors to tolerate k processor failures.
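Active replication of operations can be sketched as a placement problem. The code below assumes that k+1 replicas on distinct processors are needed to survive k fail-silent processor failures, and it uses a simple least-loaded placement rule. That rule and the function names are illustrative assumptions, not the scheduling heuristic of the talk.

```python
from itertools import combinations
from typing import Dict, List, Set


def replicate(operations: List[str], processors: List[str],
              k: int) -> Dict[str, List[str]]:
    """Place k+1 replicas of each operation on distinct processors."""
    if k + 1 > len(processors):
        raise ValueError("not enough processors for k+1 distinct replicas")
    load = {p: 0 for p in processors}
    placement: Dict[str, List[str]] = {}
    for op in operations:
        # Pick the k+1 least-loaded processors (ties broken by name).
        chosen = sorted(processors, key=lambda p: (load[p], p))[:k + 1]
        for p in chosen:
            load[p] += 1
        placement[op] = chosen
    return placement


def survives(placement: Dict[str, List[str]], failed: Set[str]) -> bool:
    """True if every operation keeps at least one replica on a live processor."""
    return all(any(p not in failed for p in procs)
               for procs in placement.values())


ops = ["I1", "I2", "A", "B", "C", "O"]
procs = ["P1", "P2", "P3"]
placement = replicate(ops, procs, k=1)

# Every single-processor failure is tolerated with k = 1.
tolerated = all(survives(placement, set(f)) for f in combinations(procs, 1))
```

Because all replicas run in every iteration, no failure detection is needed for processors: the surviving replica's result is simply used.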


Communication fault tolerance (1)

a. Use passive software replication of communications, which needs a « watchdog timer ».

b. Split each data communication into k messages (data fragmentation).


Communication fault tolerance (2)

a. Use passive software replication of communications, which needs a « watchdog timer ».
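Passive replication with a watchdog timer can be illustrated with a small timing model. This is a sketch under assumptions: one primary bus and one backup bus per communication, a fixed transfer time, and a backup copy that is sent only once the watchdog has expired. All names and parameters are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Delivery:
    bus: str      # bus that actually carried the message
    time: float   # time at which the receiver has the data


def passive_send(primary: str, backup: str, failed_buses: Set[str],
                 start: float, transfer: float,
                 timeout: float) -> Optional[Delivery]:
    """Send on the primary bus; fall back to the backup after a timeout."""
    if timeout < transfer:
        raise ValueError("watchdog would expire before a healthy transfer ends")
    if primary not in failed_buses:
        # Failure-free case: a single copy is sent, with no extra bus load.
        return Delivery(primary, start + transfer)
    # Nothing arrived before the watchdog deadline, so the backup copy
    # is transmitted on the other bus.
    deadline = start + timeout
    if backup not in failed_buses:
        return Delivery(backup, deadline + transfer)
    return None  # both buses failed: the communication is lost


ok = passive_send("B1", "B2", set(), start=0.0, transfer=3.0, timeout=4.0)
fallback = passive_send("B1", "B2", {"B1"}, start=0.0, transfer=3.0, timeout=4.0)
```

The model shows the trade-off: without failures only one copy uses the bus, but a failure costs the full timeout plus a second transfer.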


Communication fault tolerance (3)

b. Split each data communication into k messages (data fragmentation).
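Data fragmentation can be sketched as splitting the payload into k messages and spreading them over the buses. The round-robin assignment and the helper names below are assumptions for illustration; the slides do not state how fragments are mapped to buses.

```python
from typing import Dict, List


def fragment(data: bytes, k: int) -> List[bytes]:
    """Split `data` into k consecutive messages (the last may be shorter)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    size = -(-len(data) // k)  # ceiling division
    return [data[i * size:(i + 1) * size] for i in range(k)]


def reassemble(fragments: List[bytes]) -> bytes:
    """Concatenate the fragments in order to rebuild the data."""
    return b"".join(fragments)


def assign(fragments: List[bytes], buses: List[str]) -> Dict[int, str]:
    """Assumed policy: map fragment indices to buses round-robin."""
    return {i: buses[i % len(buses)] for i in range(len(fragments))}


def to_resend(assignment: Dict[int, str], failed_bus: str) -> List[int]:
    """Only fragments that travelled on the failed bus must be sent again."""
    return [i for i, bus in sorted(assignment.items()) if bus == failed_bus]


data = b"ABCDEFGHIJ"
frags = fragment(data, 4)
assignment = assign(frags, ["B1", "B2"])
lost = to_resend(assignment, "B2")
```

Under this policy a bus failure affects only part of the data, so recovery retransmits the lost fragments rather than the whole communication.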


Communication fault tolerance (3)

Why data fragmentation of communications?

1. Distinction between complete and partial communication faults.


Communication fault tolerance (4)

Why data fragmentation of communications?

2. Enables rapid recovery from processor and bus failures.


Recovery from failures (1)

1. Processor fault


Recovery from failures (2)

2. Partial bus fault


Recovery from failures (3)

3. Complete bus fault


Example (1)


Example (2)


Conclusion and future work

Result

A new method to tolerate both communication and processor failures in distributed real-time systems, which may reduce the load and the overhead of recovery from failures.

Future work

• Implementation of the proposed method in the SynDEx tool.

• Simulations.


Questions?