Introduction to Fault Tolerance by Sahithi Podila

Upload: elfrieda-knight

Post on 18-Jan-2018


Page 1: Introduction to Fault Tolerance By Sahithi Podila

Introduction to Fault Tolerance

By Sahithi Podila

Page 2

Basic Concepts

Page 3

Fault tolerance in distributed systems

Being fault tolerant is strongly related to what are called dependable systems.

Dependability

Dependability is a term that covers a number of useful requirements for distributed systems:

1. Availability
2. Reliability
3. Safety
4. Maintainability

Page 4

Dependability

Availability is the property that the system is ready to perform its functions on behalf of the user at any given moment. A highly available system is most likely to be working at a given instant in time.

Reliability is the property that the system can run continuously without failure. It is defined over a time interval rather than an instant: a highly reliable system is most likely to keep working for a relatively long period of time without interruption.
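The difference between an instant (availability) and an interval (reliability) can be made concrete with a small calculation. This is only an illustrative sketch: the function name `availability` and the two systems A and B are hypothetical and do not come from the slides.

```python
def availability(uptime: float, downtime: float) -> float:
    """Fraction of time the system is ready for use (both arguments in the same unit)."""
    return uptime / (uptime + downtime)

# Hypothetical system A: goes down for 1 millisecond every hour.
# Very high availability, yet poor reliability: it never runs a full hour
# without a failure.
a = availability(uptime=3600.0 - 0.001, downtime=0.001)  # seconds

# Hypothetical system B: never crashes, but is shut down for 2 weeks a year.
# High reliability while it runs, yet lower availability than A.
b = availability(uptime=50.0, downtime=2.0)  # weeks

print(f"A: {a:.7%}")
print(f"B: {b:.2%}")
```

System A scores higher on availability even though it fails far more often, which is why the two properties are listed separately.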

Page 5

Dependability

Safety means that when the system temporarily fails to operate correctly, nothing catastrophic happens.

Maintainability refers to how easily a failed system can be repaired.

Page 6

Fault and Error

A failure occurs when a system cannot provide one or more of the services it is required to provide.

An error is a part of the system state that may lead to a failure. A fault is the cause of an error.

Page 7

Fault Tolerance

Fault tolerance is the ability of a system to keep providing its services even in the presence of faults.

Types of Fault

Transient: These faults occur once and then disappear.
Intermittent: These faults occur, vanish, and then reappear at irregular times. They are difficult to diagnose.
Permanent: These faults continue to exist until the faulty component is replaced. Example: burnt-out chips.
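Because a transient fault usually goes away when the operation is repeated, a simple retry loop is often enough to tolerate it, while a permanent fault keeps failing however often the operation is retried. The following is a minimal sketch of that idea. `TransientFault`, `retry`, and `make_flaky` are hypothetical names introduced only for illustration.

```python
import time

class TransientFault(Exception):
    """Hypothetical exception standing in for a fault that may vanish on retry."""

def retry(operation, attempts=3, delay=0.0):
    """Call operation() up to `attempts` times; re-raise the fault if all attempts fail."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientFault:
            if attempt == attempts:
                raise  # still failing: the fault is probably not transient
            time.sleep(delay)

def make_flaky(failures):
    """Return an operation that fails `failures` times and then succeeds."""
    state = {"left": failures}

    def operation():
        if state["left"] > 0:
            state["left"] -= 1
            raise TransientFault("temporary glitch")
        return "ok"

    return operation

# A fault that occurs once is masked by the second attempt.
result = retry(make_flaky(failures=1))
print(result)
```

A retry loop cannot tell an intermittent fault from a transient one; it only observes whether the current attempt succeeded.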

Page 8

Failure Models

Page 9

Types of failure

Type of failure             Description
Crash failure               A server halts, but is working correctly until it halts
Omission failure            A server fails to respond to incoming requests
  Receive omission          A server fails to receive incoming messages
  Send omission             A server fails to send messages
Timing failure              A server’s response lies outside the specified time interval
Response failure            A server’s response is incorrect
  Value failure             The value of the response is wrong
  State transition failure  The server deviates from the correct flow of control
Arbitrary failure           A server may produce arbitrary responses at arbitrary times
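From a client's point of view, some of these failure types can be told apart by looking at a single request, and some cannot. The sketch below is one hypothetical way to express that. The `Failure` enum, the `classify` function, and all of its parameters are assumptions made for illustration. State transition failures and arbitrary failures are left out because a check of one reply cannot detect them.

```python
from enum import Enum
from typing import Optional

class Failure(Enum):
    NONE = "no failure observed"
    CRASH_OR_OMISSION = "no response (crash or omission failure)"
    TIMING = "timing failure"
    VALUE = "response failure (value failure)"

def classify(reply: Optional[int], elapsed: float,
             earliest: float, latest: float, expected: int) -> Failure:
    """Classify the outcome of one request as seen by the client."""
    if reply is None:
        # A silent server looks the same to the client whether it crashed
        # or merely failed to receive or send a message.
        return Failure.CRASH_OR_OMISSION
    if not (earliest <= elapsed <= latest):
        # The response arrived outside the specified time interval.
        return Failure.TIMING
    if reply != expected:
        return Failure.VALUE
    return Failure.NONE

print(classify(reply=None, elapsed=0.0, earliest=0.0, latest=1.0, expected=42).value)
print(classify(reply=42, elapsed=2.0, earliest=0.0, latest=1.0, expected=42).value)
```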

Page 10

Redundancy

Page 11

Failure Masking by Redundancy

Three kinds of redundancy

Information redundancy: Extra information (bits) is added in order to recover from garbled bits.
Time redundancy: An action is performed and then, if needed, performed again. Example: transactions.
Physical redundancy: Extra physical components are added so that the system can tolerate the malfunctioning of some components.
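Information redundancy can be illustrated with an error-correcting code. The slides do not name a specific code, so the sketch below uses Hamming(7,4) as one well-known example: three parity bits are added to four data bits, which is enough to locate and repair any single garbled bit. The function names `encode` and `decode` are chosen for illustration only.

```python
def encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def decode(codeword):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome is the 1-based position of the flipped bit (0 = no error).
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]

word = encode([1, 0, 1, 1])
garbled = list(word)
garbled[4] ^= 1  # flip one bit "in transit"
recovered = decode(garbled)
print(recovered)
```

The code repairs one flipped bit per codeword. Two or more flipped bits in the same codeword are beyond what this particular code can correct.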

Page 12

Physical Redundancy

Physical redundancy is a well-known technique for fault tolerance. The following example illustrates how fault tolerance is achieved by using physical redundancy in an electronic circuit.

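Page 13

Triple modular redundancy

Triple modular redundancy (TMR) is a general technique for fault tolerance.

Each device is replicated three times, and each stage of the circuit is followed by three voters. A voter has three inputs and one output: if two or three of its inputs agree, the output equals that input; otherwise the output is undefined.

If device A1 fails, the circuit still works, because the voters still receive two correct inputs from A2 and A3 and outvote A1.

A fault in voter V1 has the same effect as a fault in device B1: in both cases the output of B1 is incorrect and is outvoted at the next stage.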
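The voting step can be sketched in a few lines. This is a software illustration of the idea, not a model of the actual circuit: `vote`, `stage`, and the two device functions `device_a` and `device_b` are hypothetical, and a faulty replica is simulated by overwriting its output with -1.

```python
from typing import Optional

def vote(a: int, b: int, c: int) -> Optional[int]:
    """Majority voter: return the value at least two inputs agree on, else None."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    return None  # all three inputs differ: output is undefined

def stage(inputs, device, faulty=None):
    """Run three replicas of `device`, then three voters that feed the next stage."""
    outputs = [device(x) for x in inputs]
    if faulty is not None:
        outputs[faulty] = -1  # stand-in for an incorrect output from one replica
    # Every voter sees all three replica outputs.
    return [vote(*outputs) for _ in range(3)]

def device_a(x):
    return x + 1

def device_b(x):
    return x * 2

# Replica A1 (index 0) fails, but A2 and A3 outvote it.
after_a = stage([5, 5, 5], device_a, faulty=0)
after_b = stage(after_a, device_b)
print(after_a, after_b)

# A faulty voter V1 hands B1 a wrong input; B2 and B3 still outvote B1.
after_bad_voter = stage([-1, 6, 6], device_b)
print(after_bad_voter)
```

In both runs a single wrong value is masked, which matches the observation that a fault in V1 behaves like a fault in B1.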

Page 14

Reference: Andrew S. Tanenbaum and Maarten van Steen. Distributed Systems: Principles and Paradigms. Second Edition, 2007.

Thank You