introduction to fault tolerance by sahithi podila

Introduction to Fault Tolerance

By Sahithi Podila

Basic Concepts

Fault tolerance in distributed systems is closely related to the notion of dependable systems.

Dependability

Dependability is a term that covers several useful requirements for distributed systems:

1. Availability
2. Reliability
3. Safety
4. Maintainability

Fault tolerance in distributed systems

Dependability

Availability is the property that the system is ready to perform its functions on behalf of the user at any given moment. A highly available system is most likely to be working at any given instant of time.

Reliability is the property that the system can run continuously without failure. It is defined in terms of a time interval rather than an instant: a highly reliable system is most likely to keep working for a relatively long period of time without interruption.
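The difference between the two properties can be made concrete with a little arithmetic. The sketch below is only an illustration under assumed numbers that are not part of the slides: system A is down for one millisecond every hour, and system B never crashes but is shut down for two weeks every year. The helper name `availability` is hypothetical.

```python
def availability(uptime: float, downtime: float) -> float:
    """Fraction of time the system is operational (hypothetical helper)."""
    return uptime / (uptime + downtime)

HOUR_SECONDS = 3600.0

# System A (assumed): down for 1 ms in every hour.
avail_a = availability(HOUR_SECONDS - 0.001, 0.001)
longest_run_a_hours = 1.0  # it can never run longer than an hour without interruption

# System B (assumed): never crashes, but is shut down 14 days per 365-day year.
avail_b = availability(365 - 14, 14)
longest_run_b_hours = (365 - 14) * 24  # uninterrupted run between shutdowns

print(f"A: availability={avail_a:.7f}, longest run={longest_run_a_hours} h")
print(f"B: availability={avail_b:.3f}, longest run={longest_run_b_hours} h")
```

Under these assumptions, A is highly available but not very reliable, while B is highly reliable but noticeably less available.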

Dependability

Safety means that when the system temporarily fails to operate correctly, nothing disastrous happens.

Maintainability refers to how easily the system can be repaired after a failure.

Fault and Error

A system is said to fail when it cannot provide one or more of the services it is required to provide.

An error is a part of the system's state that may lead to a failure. A fault is the cause of an error.

Fault Tolerance

Fault tolerance is the ability of a system to keep providing its services even in the presence of faults.

Types of Fault

Transient: These faults occur once and then disappear.

Intermittent: These faults occur, go away, and then reappear at irregular times. They are difficult to diagnose.

Permanent: These faults remain until the faulty component is diagnosed and replaced with a working one. Example: burnt-out chips.

Failure Models

Types of failure

Type of failure               Description
Crash failure                 A server halts, but is working correctly until it halts
Omission failure              A server fails to respond to incoming requests
  Receive omission            A server fails to receive incoming messages
  Send omission               A server fails to send messages
Timing failure                A server’s response lies outside the specified time interval
Response failure              A server’s response is incorrect
  Value failure               The value of the response is wrong
  State transition failure    The server deviates from the correct flow of control
Arbitrary failure             A server may produce arbitrary responses at arbitrary times
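As a rough illustration of how some of these failure types look from a client's point of view, the sketch below classifies a single request. Everything in it is an assumption for illustration: the names `Failure` and `classify`, the time window, and the idea that the client knows the expected value. A crash failure looks like an omission when only one request is observed, and arbitrary failures cannot be detected reliably this way.

```python
from enum import Enum
from typing import Optional


class Failure(Enum):
    NONE = "no failure"
    OMISSION = "omission failure"
    TIMING = "timing failure"
    RESPONSE = "response failure (value failure)"


def classify(reply: Optional[int], elapsed: float,
             earliest: float, latest: float, expected: int) -> Failure:
    """Classify one request as seen by the client (hypothetical helper).

    reply    -- value returned by the server, or None if nothing arrived
    elapsed  -- seconds between sending the request and receiving the reply
    earliest, latest -- the specified time interval for a response
    expected -- the correct value, assumed known here for illustration
    """
    if reply is None:
        return Failure.OMISSION   # no response to the incoming request
    if not (earliest <= elapsed <= latest):
        return Failure.TIMING     # response outside the specified interval
    if reply != expected:
        return Failure.RESPONSE   # the value of the response is wrong
    return Failure.NONE


print(classify(None, 0.0, 0.0, 1.0, 42))   # no reply at all
print(classify(42, 2.5, 0.0, 1.0, 42))     # correct value, but too late
print(classify(41, 0.2, 0.0, 1.0, 42))     # on time, but wrong value
```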

Redundancy

Failure Masking by Redundancy

Three kinds of redundancy

Information redundancy: Extra information (bits) is added so that garbled bits can be recovered.

Time redundancy: An action is performed and, if needed, performed again. Example: transactions, which can be redone if they abort.

Physical redundancy: Extra physical components are added so that the system can tolerate the malfunctioning of some components.
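The first two kinds of redundancy can be sketched in a few lines. This is a minimal illustration, not a production design: `retry`, `encode`, `decode`, and `flaky` are hypothetical names, and the triple-repetition code stands in for more efficient error-correcting codes such as Hamming codes.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def retry(action: Callable[[], T], attempts: int = 3) -> T:
    """Time redundancy: perform the action again if it fails."""
    last_error = None
    for _ in range(attempts):
        try:
            return action()
        except Exception as err:  # a real system would catch narrower errors
            last_error = err
    raise RuntimeError(f"action failed {attempts} times") from last_error


def encode(bits: List[int]) -> List[int]:
    """Information redundancy: store every bit three times."""
    return [b for b in bits for _ in range(3)]


def decode(coded: List[int]) -> List[int]:
    """Recover each bit by majority over its three copies."""
    return [1 if sum(coded[i:i + 3]) >= 2 else 0
            for i in range(0, len(coded), 3)]


# Time redundancy masks a transient fault: the first call fails, the second succeeds.
calls = {"n": 0}

def flaky() -> str:
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("transient fault")
    return "ok"

result = retry(flaky)

# Information redundancy masks one garbled bit per group of three.
coded = encode([1, 0, 1])
coded[4] ^= 1  # flip one bit "in transit"
recovered = decode(coded)
print(result, recovered)
```

Repeating an action is only safe when doing it twice has the same effect as doing it once, which is one reason transactions are the usual example.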

Physical Redundancy

Physical redundancy is a well-known technique for fault tolerance. The following example illustrates how fault tolerance is achieved by using physical redundancy in an electronic circuit.

Triple modular redundancy

Triple modular redundancy is a general technique for fault tolerance.

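Each device is replicated three times, and each stage is followed by a set of three voters. Each voter has three inputs and one output: if two or three of the inputs agree, the output equals that value; if all three inputs differ, the output is undefined.

If device A1 fails, the circuit still works, because A2 and A3 produce correct outputs and outvote it.

A fault in voter V1 has effectively the same consequence as a fault in device B1: in both cases the output of B1 is incorrect and is outvoted at the next stage.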
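The voting step can be sketched in software as follows. This is a simplified illustration of the idea rather than a model of the actual circuit: `vote` and `tmr_stage` are hypothetical names, devices are modelled as plain functions, and an undefined output is represented by `None`.

```python
from typing import Callable, List, Optional


def vote(a: int, b: int, c: int) -> Optional[int]:
    """Majority voter: return the value at least two inputs agree on.

    Returns None when all three inputs differ (output undefined).
    """
    if a == b or a == c:
        return a
    if b == c:
        return b
    return None


def tmr_stage(replicas: List[Callable[[int], int]],
              inputs: List[int]) -> List[Optional[int]]:
    """Run three replicas of a device, then three voters over their outputs."""
    outputs = [device(x) for device, x in zip(replicas, inputs)]
    # Each of the three voters sees all three outputs and feeds one replica
    # of the next stage.
    return [vote(*outputs) for _ in range(3)]


def good(x: int) -> int:
    return x + 1   # the intended behaviour of the device (assumed)

def faulty(x: int) -> int:
    return -1      # a replica that has failed

# Replica A1 is faulty, but A2 and A3 outvote it.
after_a = tmr_stage([faulty, good, good], [1, 1, 1])
print(after_a)
```

Because the voters are replicated as well, a single faulty voter only corrupts the input of one device in the next stage, and that device is then outvoted in turn.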

Reference: Andrew S. Tanenbaum and Maarten van Steen. Distributed Systems: Principles and Paradigms. Second Edition, 2007.

Thank You
