“Designing Masking Fault Tolerance via Nonmasking Fault Tolerance”

Oğuzhan YILDIRIM, Erkin GÜVEL, Boğaziçi University, Computer Engineering Department, [email protected] [email protected]


Post on 07-Jan-2016


Page 1: “Designing Masking Fault Tolerance via Nonmasking Fault Tolerance“

“Designing Masking Fault Tolerance via Nonmasking Fault Tolerance”

Oğuzhan YILDIRIM – Erkin GÜVEL

Boğaziçi University, Computer Engineering Department

[email protected] [email protected]

Page 2: Introduction

Masking fault-tolerance guarantees that programs continually satisfy their specification in the presence of faults.

Nonmasking fault-tolerance does not guarantee as much: it merely guarantees that when faults stop occurring, program executions converge to states from where programs continually (re)satisfy their specification.
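The difference between the two guarantees can be illustrated with a minimal Python sketch. It is not part of the presentation. It assumes that an execution is a finite list of states, that the specification can be checked state by state with a hypothetical predicate `spec`, and that `faults_stop_at` is the index from which no further fault occurs. All names are illustrative.

```python
from typing import Callable, Dict, Sequence

State = Dict[str, int]  # hypothetical state representation


def is_masking(trace: Sequence[State], spec: Callable[[State], bool]) -> bool:
    """Masking: every state, even one reached while faults occur, satisfies spec."""
    return all(spec(state) for state in trace)


def is_nonmasking(trace: Sequence[State], spec: Callable[[State], bool],
                  faults_stop_at: int) -> bool:
    """Nonmasking: once faults stop, the trace reaches a point from which
    spec holds in every remaining state."""
    suffix = trace[faults_stop_at:]
    return any(all(spec(s) for s in suffix[i:]) for i in range(len(suffix)))


def at_most_one_leader(state: State) -> bool:
    # Hypothetical specification used only for this example.
    return state["leaders"] <= 1


# A fault strikes at index 1; no fault occurs from index 2 onward.
trace = [{"leaders": 1}, {"leaders": 2}, {"leaders": 2},
         {"leaders": 1}, {"leaders": 1}]

print(is_masking(trace, at_most_one_leader))                       # False
print(is_nonmasking(trace, at_most_one_leader, faults_stop_at=2))  # True
```

The example trace violates the specification for a while and then recovers, so it is acceptable for a nonmasking program but not for a masking one.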

Page 3: Objectives

We will show that a practical method for designing masking fault tolerance is to first design nonmasking fault tolerance and then transform the nonmasking fault-tolerant program minimally so as to achieve masking fault tolerance.

Page 4: Outline

Novel method for the design of “masking” fault-tolerant systems

Actions

– Critical

– Noncritical

Overview of the methodology

Case study

Page 5: The Importance

It is often simpler and cheaper to design nonmasking fault-tolerance than to design masking fault tolerance.

It is often simpler and cheaper to design safe programs, or programs with well-defined failure behavior, than to design masking fault-tolerant programs.

Page 6: Critical Actions

Critical actions are those actions whose execution in the presence of faults can violate the system specification.

– In database transactions, the actions that produce an output or commit a result are critical.

Page 7: Noncritical Actions

Noncritical actions need not mask faults; in other words, when noncritical actions execute, the system state may be “unsafe”.

However, executing noncritical actions in unsafe states must not allow the system to remain in unsafe states forever; otherwise the system would never execute its critical actions.

Page 8: Overview

First Stage: The system is designed so that after faults stop occurring, subsequent execution of the system actions guarantees that the system reaches a safe state.

Second Stage: The critical actions are modified so that their execution always masks faults.

Page 9: First Stage

In this stage, first, a nonmasking fault-tolerant version of the program is designed. Then, certain actions of the nonmasking fault-tolerant program are distinguished as being critical.

No specific approach is prescribed; many acceptable methods exist.

Page 10: First Stage (Cont.)

One approach is to design the tolerance requirement hand-in-hand with the other requirements of the program.

Another is to transform an existing fault-intolerant program into one that is nonmasking fault-tolerant.

Page 11: Second Stage

In this stage, a “safe predicate” is first identified for each critical action. Then, each critical action is augmented so that it is executed only in states where its safe predicate holds. Finally, the augmentation itself is shown to mask the effects of faults. The resulting program is masking fault-tolerant.

No specific approach is prescribed; many acceptable methods exist.

Page 12: Second Stage (Cont.)

One approach is to add actions that check whether the program state satisfies the state predicate, and to allow execution to proceed only when the check succeeds.

Another is to enforce real-time constraints on the execution of critical actions.
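The checking approach can be sketched as a wrapper around a critical action. This is a minimal illustration in Python, not the presenters' implementation. The helper `guard`, the `commit` action, and the `data_is_consistent` predicate are hypothetical names chosen for the example.

```python
from typing import Callable, Dict

State = Dict[str, object]  # hypothetical state representation


def guard(action: Callable[[State], State],
          safe_predicate: Callable[[State], bool]) -> Callable[[State], State]:
    """Wrap a critical action so that it runs only when its safe predicate holds."""
    def guarded(state: State) -> State:
        if safe_predicate(state):
            return action(state)
        return state  # unsafe state: skip the critical action, change nothing
    return guarded


def commit(state: State) -> State:
    # Hypothetical critical action: commit the pending result.
    return {**state, "committed": state["pending"]}


def data_is_consistent(state: State) -> bool:
    # Hypothetical safe predicate for the commit action.
    return state["checksum_ok"] is True


safe_commit = guard(commit, data_is_consistent)

corrupted = {"pending": 42, "committed": None, "checksum_ok": False}
repaired = {**corrupted, "checksum_ok": True}

print(safe_commit(corrupted)["committed"])  # None
print(safe_commit(repaired)["committed"])   # 42
```

The noncritical actions of the nonmasking program are left untouched. They are what eventually bring the state back to one where the guard lets the critical action through.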

Page 13: Application

Case Study: Leader Election

System Logic

Arora’s Program: Spanning tree

Leader Election Study

Page 14: System Logic

A system consists of processes, which have unique integer ids, and channels, each of which connects a unique pair of processes.

At any instant, each process is either “up” or “down”.

Systems are subject to fail-stop faults and repairs of processes.
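This system model can be written down as a small data structure. The Python sketch below is an illustration only. The class `System` and its method names are assumptions made for the example and do not come from the presentation.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Set


@dataclass
class System:
    up: Dict[int, bool] = field(default_factory=dict)           # process id -> is it up?
    channels: Set[FrozenSet[int]] = field(default_factory=set)  # unordered pairs of ids

    def add_process(self, pid: int) -> None:
        if pid in self.up:
            raise ValueError(f"process id {pid} is not unique")
        self.up[pid] = True

    def add_channel(self, a: int, b: int) -> None:
        if a == b or a not in self.up or b not in self.up:
            raise ValueError("a channel connects two distinct known processes")
        self.channels.add(frozenset((a, b)))

    def fail_stop(self, pid: int) -> None:
        self.up[pid] = False  # the process stops and takes no further steps

    def repair(self, pid: int) -> None:
        self.up[pid] = True

    def up_neighbors(self, pid: int) -> List[int]:
        """Ids of processes that share a channel with pid and are currently up."""
        result = []
        for channel in self.channels:
            if pid in channel:
                (other,) = channel - {pid}
                if self.up[other]:
                    result.append(other)
        return sorted(result)


system = System()
for pid in (1, 2, 3):
    system.add_process(pid)
system.add_channel(1, 2)
system.add_channel(1, 3)

system.fail_stop(3)
print(system.up_neighbors(1))  # [2]
system.repair(3)
print(system.up_neighbors(1))  # [2, 3]
```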

Page 15: Arora’s Nonmasking Program

We use Arora’s nonmasking fault-tolerant program for the distributed maintenance of a rooted spanning tree.

Specifically, it allows faults to yield program states where there are multiple trees and unrooted trees.

To deal with unrooted trees, the program has actions that inform all processes in unrooted trees that they have no root process.
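What “unrooted” means can be illustrated with a small Python sketch. This is not Arora’s program, which works through local actions that the slides do not detail. It is a centralized check written under two assumptions: each process stores a parent pointer, and a root is an up process that is its own parent. The function name `find_root` is hypothetical.

```python
from typing import Dict, Optional


def find_root(pid: int, parent: Dict[int, int], up: Dict[int, bool]) -> Optional[int]:
    """Follow parent pointers from pid. Return the root's id, or None if
    pid belongs to an unrooted tree."""
    seen = set()
    current = pid
    while current not in seen:
        if not up[current]:
            return None        # the path runs into a failed process
        if parent[current] == current:
            return current     # an up process that is its own parent is a root
        seen.add(current)
        current = parent[current]
    return None                # the parent pointers form a cycle


# A chain 4 -> 3 -> 2 -> 1, with process 1 as the root.
parent = {1: 1, 2: 1, 3: 2, 4: 3}
up = {1: True, 2: True, 3: True, 4: True}
print(find_root(4, parent, up))  # 1

up[2] = False                    # fail-stop of process 2
print(find_root(4, parent, up))  # None
print(find_root(1, parent, up))  # 1
```

After process 2 fails, processes 3 and 4 form an unrooted tree. These are the processes that the program must inform that they have no root.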

Page 16: Leader Election Problem

The critical action is the one that declares a process to be the leader.

A unique process is to be elected as the leader; at no point during election may multiple processes declare themselves as leaders.

The goal is to design a masking fault-tolerant program for leader election.

Page 17: Leader Election Problem (Cont.)

Our tree maintenance program elects a unique process as leader.

However, in the presence of faults, our tree maintenance program allows multiple processes to declare themselves as leaders.

Page 18: Defining Critical Action

In keeping with the proposed method, we proceed by identifying the critical actions in the nonmasking fault-tolerant tree maintenance program.

After this identification, the nonmasking fault-tolerant program is augmented to yield a masking fault-tolerant system.

Page 19: Defining Critical Action (Cont.)

The critical actions in the tree maintenance program are the actions that elect a process as leader.

Such an action is safely executed only in states where no process has been elected as leader.
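The safe predicate for the leader declaration can be sketched in Python as follows. The sketch assumes a global view of all processes, which a real distributed program does not have, so it only illustrates the predicate and not how it is checked. The names `no_leader_elected` and `try_declare_leader` are hypothetical.

```python
from typing import Dict


def no_leader_elected(is_leader: Dict[int, bool]) -> bool:
    """Safe predicate of the critical action: no process is currently leader."""
    return not any(is_leader.values())


def try_declare_leader(pid: int, is_leader: Dict[int, bool]) -> bool:
    """Critical action, guarded by its safe predicate. Returns True if pid became leader."""
    if not no_leader_elected(is_leader):
        return False
    is_leader[pid] = True
    return True


is_leader = {1: False, 2: False, 3: False}
print(try_declare_leader(3, is_leader))  # True
print(try_declare_leader(1, is_leader))  # False
print(sum(is_leader.values()))           # 1
```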

Page 20: Section 2

This section covers checking that the critical action is executed in a safe state, and guaranteeing that the critical action is implemented in a masking fault-tolerant way.

Page 21: Checking Critical Action

A diffusing computation is used to check whether the critical action is executed in a safe state.

This diffusing computation verifies the safe-state requirement by reaching all other processes and determining that no process is leader.

Page 22: Diffusing Computation

The diffusing computation we design consists of two phases: “propagate” and “complete”.

The computation spreads down the tree and then returns up it:

Upon receiving a diffusing computation from its parent in the tree, a process enters the propagate phase and propagates the computation to all of its neighbors.

Upon receiving a response from all of its neighbors, the process sends a response to its parent and reverts its phase to complete.
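The two phases can be traced with a short sequential simulation in Python. It is a simplification under stated assumptions: the computation runs over a fixed tree given as a hypothetical `children` map, recursion stands in for message passing, and each process reports whether anyone in its subtree is a leader. A real implementation would run concurrently and exchange messages.

```python
from typing import Dict, List


def diffuse(node: int, children: Dict[int, List[int]], is_leader: Dict[int, bool],
            phase: Dict[int, str], log: List[str]) -> bool:
    """Return True if no process in the subtree rooted at node is a leader."""
    phase[node] = "propagate"            # computation received from the parent
    log.append(f"{node}: propagate")
    result = not is_leader[node]
    for child in children.get(node, []):
        child_result = diffuse(child, children, is_leader, phase, log)
        result = result and child_result
    phase[node] = "complete"             # all responses in, answer the parent
    log.append(f"{node}: complete")
    return result


children = {1: [2, 3], 2: [4]}           # process 1 is the root
is_leader = {1: False, 2: False, 3: False, 4: False}
phase = {pid: "complete" for pid in is_leader}
log: List[str] = []

print(diffuse(1, children, is_leader, phase, log))  # True
print(log[0], "...", log[-1])                       # 1: propagate ... 1: complete

is_leader[4] = True
print(diffuse(1, children, is_leader, phase, []))   # False
```

The root is the first process to enter the propagate phase and the last to return to complete, so its result summarizes the whole tree.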

Page 23: Masking the Critical Action

If a child suffers a fail-stop fault, the parent completes prematurely with the result value false.

A new diffusing computation is then started and assigned a new sequence number.

In this way, masking is achieved through redundancy: the diffusing computation is repeated.
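The retry scheme can be sketched in Python as follows. This is a simplified simulation, not the presenters' code. It assumes the root stays up, that a hypothetical list `down_per_round` says which processes are down during each attempt, and that the sequence number is simply the attempt count. In a real program the sequence number would also let processes tell messages of different computations apart.

```python
from typing import Dict, List, Optional, Set


def diffuse(node: int, children: Dict[int, List[int]], is_leader: Dict[int, bool],
            down: Set[int]) -> bool:
    """One diffusing computation. A failed child forces a premature result of False."""
    result = not is_leader[node]
    for child in children.get(node, []):
        if child in down:
            result = False               # child fail-stopped: complete prematurely
        else:
            child_result = diffuse(child, children, is_leader, down)
            result = result and child_result
    return result


def check_until_safe(root: int, children: Dict[int, List[int]],
                     is_leader: Dict[int, bool],
                     down_per_round: List[Set[int]]) -> Optional[int]:
    """Start a new diffusing computation per round, each with its own sequence
    number. Return the sequence number of the first one that reports a safe
    state, or None if none does."""
    for seq, down in enumerate(down_per_round, start=1):
        if diffuse(root, children, is_leader, down):
            return seq
    return None


children = {1: [2, 3], 2: [4]}
is_leader = {1: False, 2: False, 3: False, 4: False}

# Process 4 is down during the first computation and repaired before the second.
print(check_until_safe(1, children, is_leader, [{4}, set()]))  # 2
```

A computation disturbed by a fault never reports a safe state, so the critical action is never executed on the strength of an incomplete check.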

Page 24: Fail-Stop Repair

[Figure: recomputation in case of a fault. The root is waiting for an answer when a fault occurs in the diffusing computation.]

Page 25: Conclusion

In this presentation, we presented a novel method for designing masking fault-tolerant programs. First, a nonmasking fault-tolerant program was designed to ensure that once faults stop occurring the program eventually reaches a safe state.

Then, a masking component was designed to ensure that the composite program is masking fault-tolerant.

Page 26: References

Designing Masking Fault-tolerance via Nonmasking Fault-tolerance. Department of Computer and Information Science, The Ohio State University, Columbus, Ohio.

B. Randell. System structure for software fault tolerance. IEEE Transactions on Software Eng., pages 220-232, 1975.

J.-C. Laprie. Dependable computing and fault tolerance: Concepts and terminology. Proceedings of the 15th International Symposium on Fault-Tolerant Computing, pages 2-11, 1985.

Internet Research

Page 27: Thanks for Listening

Any questions?