7. Fault Tolerance Through Dynamic or Standby Redundancy


Upload: hafwen

Post on 25-Feb-2016


TRANSCRIPT

Page 1: 7. Fault Tolerance Through Dynamic or Standby Redundancy

7. Fault Tolerance Through Dynamic or Standby Redundancy

7.5 Forward Recovery Systems

Upon the detection of a failure, the system discards the current erroneous state and determines the correct state without any loss of computation.

There are two different approaches:

a) Hardware Redundancy
   – Static Redundancy
   – Dynamic Redundancy

b) Software Redundancy

Page 2: 7. Fault Tolerance Through Dynamic or Standby Redundancy

7.5.1 Static Redundancy Approaches

There are three approaches to masking failures:

– Active Masking Redundancy
– Active Redundancy Using Fail-Stop Modules
– Active Redundancy Using Self-Diagnosis

Page 3: 7. Fault Tolerance Through Dynamic or Standby Redundancy

Active Masking Redundancy:

Uses an adequate level of replication to tolerate failures, by voting on the outputs of all the replicas.

E.g., TMR (Triple Modular Redundant) systems mask a single failure without any performance loss.
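The voting step can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the names `majority_vote`, `tmr`, `square`, and `faulty_square` are hypothetical, and replica outputs are assumed to be hashable values that can be compared for equality.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of the replicas.

    Raises ValueError when no value has a strict majority, i.e. when the
    failures exceed what this level of replication can mask.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among replica outputs")
    return value


def tmr(replicas, x):
    """Run all replicas on the same input and vote on their outputs."""
    return majority_vote([replica(x) for replica in replicas])


def square(x):
    return x * x


def faulty_square(x):
    return x * x + 1  # hypothetical replica with a persistent fault


# One faulty replica out of three is outvoted by the two correct ones.
result = tmr([square, faulty_square, square], 7)
print(result)
```

The faulty replica never has to be identified for the output to be correct, which is why masking costs no recovery time.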

Page 4: 7. Fault Tolerance Through Dynamic or Standby Redundancy

Active Redundancy Using Fail-Stop Modules:

Multiple processor modules actively execute each process. Each processor is assumed to be fail-stop. Thus, if one of the processors fails, it stops executing, and the other processors executing the task continue functioning without any performance penalty, even in the presence of failures.


Page 5: 7. Fault Tolerance Through Dynamic or Standby Redundancy

E.g., in a given system, each subsystem is duplicated, forming a pair. One of the replicas is identified as the spare. Each subsystem and its spare are themselves made self-checking by replication, so the HW is replicated 4 times. All 4 copies of the HW are tightly synchronized. When a fault is detected in a subsystem by its self-checking mechanism, the subsystem disconnects itself and the spare starts providing its service without any interruption or rollback.
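A minimal Python sketch of this pair-and-spare arrangement follows. It is an illustration under stated assumptions, not the mechanism of any particular machine: `SelfCheckingPair` and `pair_and_spare_step` are hypothetical names, lockstep synchronization is modelled by calling both pairs on the same input, and module outputs are assumed never to be `None`.

```python
class SelfCheckingPair:
    """Two copies of a module whose outputs are compared on every step."""

    def __init__(self, module_a, module_b):
        self.modules = (module_a, module_b)
        self.connected = True

    def step(self, x):
        """Return the agreed output, or None once the pair has disconnected."""
        if not self.connected:
            return None
        out_a, out_b = (module(x) for module in self.modules)
        if out_a != out_b:
            self.connected = False  # fail-stop: disconnect on the first mismatch
            return None
        return out_a


def pair_and_spare_step(primary, spare, x):
    """Run both pairs on the same input and use the first connected pair."""
    outputs = [primary.step(x), spare.step(x)]
    for out in outputs:
        if out is not None:
            return out
    raise RuntimeError("both the subsystem and its spare have failed")


def double(x):
    return 2 * x


def faulty_double(x):
    return 2 * x if x < 3 else -1  # hypothetical fault that appears at x >= 3


primary = SelfCheckingPair(double, faulty_double)
spare = SelfCheckingPair(double, double)
outputs = [pair_and_spare_step(primary, spare, x) for x in range(5)]
print(outputs, primary.connected)
```

Because the spare has been executing the same inputs all along, its output is available in the very step in which the primary disconnects, so nothing is recomputed.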


Page 6: 7. Fault Tolerance Through Dynamic or Standby Redundancy

Active Redundancy Using Self-Diagnosis:

Analogous to the approach using fail-stop modules; however, instead of a concurrent self-checking mechanism, self-diagnosis tasks are used to identify the faulty processor.


Page 7: 7. Fault Tolerance Through Dynamic or Standby Redundancy

E.g., in the reconfigurable duplication mechanism, the process is replicated on 2 processors and their outputs are continuously compared. If a mismatch is detected, indicating a failure of at least one of the processors in the pair, each processor runs self-diagnostic tasks to determine whether it has failed. Once the faulty processor is identified, the output of the fault-free processor can be accepted as correct.

The use of self-diagnostic tasks instead of concurrent self-checking results in a slight computation overhead for determining the faulty processor after a fault is detected.
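The compare-then-diagnose flow can be sketched as follows. This is a hedged illustration: `duplex_with_self_diagnosis` and `diagnose` are hypothetical names, and the self-diagnostic task is modelled here as a simple known-answer test, which is an assumption rather than something the slides specify.

```python
def duplex_with_self_diagnosis(proc_a, proc_b, diagnose, x):
    """Compare two processors' outputs; on a mismatch, run self-diagnosis.

    `diagnose(proc)` is a hypothetical self-test returning True when the
    processor passes. Returns (output, diagnosis_was_run).
    """
    out_a, out_b = proc_a(x), proc_b(x)
    if out_a == out_b:
        return out_a, False  # no fault detected, no diagnosis overhead
    # Mismatch: at least one processor has failed. The self-diagnosis below
    # is the extra computation paid only after a fault is detected.
    healthy = [out for proc, out in ((proc_a, out_a), (proc_b, out_b))
               if diagnose(proc)]
    if len(healthy) != 1:
        raise RuntimeError("self-diagnosis could not isolate one faulty processor")
    return healthy[0], True


def good(x):
    return x * x


def bad(x):
    return x * x + 1  # hypothetical faulty processor


def diagnose(proc):
    return proc(2) == 4  # known-answer self-test (an assumption of this sketch)


with_fault = duplex_with_self_diagnosis(good, bad, diagnose, 5)
without_fault = duplex_with_self_diagnosis(good, good, diagnose, 5)
print(with_fault, without_fault)
```

The second element of the returned pair makes the overhead visible: diagnosis runs only on the faulty execution.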

Page 8: 7. Fault Tolerance Through Dynamic or Standby Redundancy

7.5.2 Dynamic Redundancy Approaches

Forward recovery schemes based on dynamic redundancy and checkpointing try to avoid rollback even in the presence of failures. The fault is thus tolerated without the performance penalty of a rollback.

E.g., consider a duplex system that detects failures by periodically checkpointing the two modules in the system and then comparing their states.

When a failure is detected, the roll-forward checkpointing scheme tries to determine which of the two processing modules, if any, is fault-free.
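One plausible reading of this scheme, sketched in Python: on a state mismatch, a spare retries the interval from the last agreed checkpoint, and whichever module matches the spare is taken to be fault-free. The function names, the `fault_a` and `fault_b` injection hooks, and the assumption that the spare itself is fault-free during the retry are all choices made for this sketch. The retry is run sequentially here, whereas a real system would let the two modules keep executing while the spare retries.

```python
def run_interval(step, state, fault=None):
    """Execute one checkpoint interval; `fault` optionally corrupts the result."""
    new_state = step(state)
    return fault(new_state) if fault else new_state


def rfcs_interval(step, checkpoint, fault_a=None, fault_b=None):
    """One checkpoint interval of a duplex system with a spare for retry.

    Returns (next_checkpoint, action), where action is "commit",
    "roll-forward", or "rollback".
    """
    state_a = run_interval(step, checkpoint, fault_a)
    state_b = run_interval(step, checkpoint, fault_b)
    if state_a == state_b:
        return state_a, "commit"  # states agree: take a new checkpoint
    # Mismatch: the spare retries the interval from the last good checkpoint.
    state_s = run_interval(step, checkpoint)
    if state_s == state_a:
        return state_a, "roll-forward"  # B was faulty; B adopts A's state
    if state_s == state_b:
        return state_b, "roll-forward"  # A was faulty; A adopts B's state
    return checkpoint, "rollback"  # neither module is fault-free


def step(state):
    return state + 1


no_fault = rfcs_interval(step, 10)
single_fault = rfcs_interval(step, 10, fault_b=lambda s: s + 100)
double_fault = rfcs_interval(step, 10, fault_a=lambda s: s - 5,
                             fault_b=lambda s: s + 100)
print(no_fault, single_fault, double_fault)
```

Only the double-fault case falls back to the old checkpoint; with a single fault the computation keeps the work done in the interval.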

Page 9: 7. Fault Tolerance Through Dynamic or Standby Redundancy

[Figure: Concurrent retry in the Roll-Forward Checkpointing Scheme (RFCS).]

Page 10: 7. Fault Tolerance Through Dynamic or Standby Redundancy

[Figure: Concurrent retry in the Roll-Forward Checkpointing Scheme (RFCS).]

Page 11: 7. Fault Tolerance Through Dynamic or Standby Redundancy

Variations of the RFCS may assume that each module has a built-in fault detection capability, such as parity checks or exception detection. Thus, 4 different scenarios can be conceptualized:

                                         Resources Used
Recovery Strategy                        With Spare           No Spare
Optimistic (only single faults)          Roll-forward (I)     Roll-forward (I); Rollback (I)*
Pessimistic (double faults may occur)    Roll-forward (II)    Rollback (II)

Table: Three different recovery schemes (* no built-in fault detection capability included).

Page 12: 7. Fault Tolerance Through Dynamic or Standby Redundancy

[Figure: Optimistic scheme with or without spare, Roll-forward (I). Modules A and B over checkpoint intervals I1 and I2, with a roll-forward.]

In an optimistic recovery strategy, one trusts the built-in detection capability to the fullest extent. This scheme does not require the use of a spare, even though one may be available.

Page 13: 7. Fault Tolerance Through Dynamic or Standby Redundancy

Pessimistic schemes.

In the pessimistic recovery strategy, it may be noted that although module B is already suspected to be faulty, a more conservative action is taken in case A experienced a failure during I1 that escaped the built-in detection capability.

[Figure: Pessimistic scheme with spare, rolling forward with all single faults.]

[Figure: Pessimistic scheme with spare, rolling back with double faults.]

Page 14: 7. Fault Tolerance Through Dynamic or Standby Redundancy

[Figure: Three different roll-forward schemes. Axes: Performance and Reliability; curves 1, 2, and 3.]

The ideal curve 1 is preferred because it allows a small reduction in reliability to be traded off against a large gain in performance. (This is the case of Optimistic Recovery Strategies).

Page 15: 7. Fault Tolerance Through Dynamic or Standby Redundancy

Generally, the mean completion time, given that a failure has occurred, is lower for the roll-forward scheme under both the optimistic and the pessimistic strategies.

Without any failure, all the schemes perform similarly.

When there is no built-in detection capability, the pessimistic and the corresponding optimistic scheme have identical reliabilities. Since there is no built-in detection, there is no way to identify the faulty module without comparison between operating modules and the spare one.

When there is 100% fault detection, the schemes with and without a spare have identical reliabilities.

Page 16: 7. Fault Tolerance Through Dynamic or Standby Redundancy

Note:

λ = failure rate;

c = detection coverage (indicates the degree of built-in detection capability);

n = number of checkpoint intervals.

Page 17: 7. Fault Tolerance Through Dynamic or Standby Redundancy

[Figure: Performance comparison between optimistic and pessimistic schemes: mean completion time, given a fault. The optimistic scheme is better.]

[Figure: Reliability comparison between optimistic and pessimistic schemes. The pessimistic scheme is better.]

Curve labels: Rollback, Roll-forward, Optimistic, Pessimistic.

Page 18: 7. Fault Tolerance Through Dynamic or Standby Redundancy

One of the important advantages of a roll-forward scheme is its minimal degradation in I/O performance.

[Figure: Permanent delay in rollback scheme outputs in the event of a fault.]

In the rollback scheme, all outputs after I1 experience a delay of one checkpoint interval.

Page 19: 7. Fault Tolerance Through Dynamic or Standby Redundancy

The outputs x and y are the only ones delayed; all other outputs occur at their regularly scheduled intervals.

[Figure: Temporary delay in roll-forward scheme outputs in the event of a fault. Timeline labels: modules A, B, and spare; checkpoint intervals I1 to I6; "Spare Activated" and "Spare Release" events; system outputs x, y, z, w, v.]

Page 20: 7. Fault Tolerance Through Dynamic or Standby Redundancy

[Figure: Forward Recovery Using Checkpointing.]

Page 21: 7. Fault Tolerance Through Dynamic or Standby Redundancy

7.5.3 Software Redundancy-Based Approach for Forward Error Recovery

The previous approaches primarily require HW redundancy (+300%).

This approach requires a certain degree of SW redundancy, as well as HW redundancy:

SW redundancy is implemented by using Recovery Blocks. Recovery blocks are a language construct that supports the incorporation of program redundancy into a fault-tolerant program in a concise and easily readable form.

Page 22: 7. Fault Tolerance Through Dynamic or Standby Redundancy

The syntax of the recovery block is:

    ensure T
        by B1
        else by B2
        ...
        else by Bn
        else error

Where: T is the acceptance test; B1 denotes the primary try block; Bk denotes the (k - 1)th alternate try block.
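The construct can be emulated in a general-purpose language. The Python sketch below is an assumption-laden illustration rather than the original language construct: `recovery_block`, `is_sorted`, `buggy_sort`, and `fallback_sort` are hypothetical names, each try block is given its own copy of the input state, and a try block that raises an exception is treated as having failed the acceptance test.

```python
import copy


def recovery_block(acceptance_test, try_blocks, state):
    """Emulate: ensure T by B1 else by B2 ... else by Bn else error."""
    for block in try_blocks:
        candidate = copy.deepcopy(state)  # every try block starts from the same state
        try:
            result = block(candidate)
        except Exception:
            continue  # a crashing try block counts as a failed try
        if acceptance_test(result):
            return result
    raise RuntimeError("error: every try block failed the acceptance test")


def is_sorted(xs):
    """Acceptance test T: the result must be in non-decreasing order."""
    return all(a <= b for a, b in zip(xs, xs[1:]))


def buggy_sort(xs):
    return xs[::-1]  # hypothetical faulty primary try block B1


def fallback_sort(xs):
    return sorted(xs)  # alternate try block B2


result = recovery_block(is_sorted, [buggy_sort, fallback_sort], [3, 1, 2])
print(result)
```

The acceptance test here checks only ordering. A stricter test would also confirm that the result is a permutation of the input, at the cost of more checking time.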

Page 23: 7. Fault Tolerance Through Dynamic or Standby Redundancy

[Figure: Distributed Recovery Block.]