Introduction to Fault Tolerance
By Sahithi Podila
Basic Concepts
Fault tolerance in distributed systems is closely related to the notion of dependable systems.
Dependability
Dependability is a term that covers a number of useful requirements for distributed systems:
1. Availability
2. Reliability
3. Safety
4. Maintainability
Fault tolerance in distributed systems
Dependability
Availability is the property that the system is ready to perform its functions on behalf of the user at any given moment. A highly available system is most likely to be working at any given instant in time.
Reliability is the property that the system can run continuously without failure. It is defined in terms of a time interval rather than an instant. A highly reliable system is most likely to keep working without interruption for a relatively long period of time.
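The difference between the two properties can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name `availability` and the two systems' uptime and downtime figures are assumptions chosen for the example, not measurements.

```python
def availability(uptime_s: float, downtime_s: float) -> float:
    """Fraction of total time during which the system is ready for use."""
    return uptime_s / (uptime_s + downtime_s)

# Hypothetical system A: goes down for 1 millisecond every hour.
# It is almost always available, yet it never runs longer than
# one hour without a failure, so its reliability is poor.
avail_a = availability(uptime_s=3600 - 0.001, downtime_s=0.001)

# Hypothetical system B: never crashes, but is shut down for
# 14 days every year. It runs for 351 days without interruption
# (high reliability), yet its availability is noticeably lower.
DAY = 86400
avail_b = availability(uptime_s=351 * DAY, downtime_s=14 * DAY)

print(f"A: {avail_a:.7f}")
print(f"B: {avail_b:.4f}")
```

System A is the more available of the two, system B the more reliable, which is why the two requirements are listed separately.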
Dependability
Safety means that when the system temporarily fails to operate correctly, nothing catastrophic happens.
Maintainability refers to how easily a failed system can be repaired.
Fault and Error
A system is said to fail when it cannot provide one or more of its required services.
An error is the part of the system's state that may lead to a failure. A fault is the cause of an error.
Fault Tolerance
Fault tolerance is the ability of a system to keep providing its services even in the presence of faults.
Types of Fault
Transient: These faults occur once and then disappear.
Intermittent: These faults occur, vanish, and then reappear at irregular intervals. They are difficult to diagnose.
Permanent: These faults persist until the faulty component is diagnosed and replaced with a working one. Example: burnt-out chips.
Failure Models
Types of failure
Type of failure               Description
Crash failure                 A server halts, but is working correctly until it halts
Omission failure              A server fails to respond to incoming requests
  Receive omission            A server fails to receive incoming messages
  Send omission               A server fails to send messages
Timing failure                A server’s response lies outside the specified time interval
Response failure              A server’s response is incorrect
  Value failure               The value of the response is wrong
  State transition failure    The server deviates from the correct flow of control
Arbitrary failure             A server may produce arbitrary responses at arbitrary times
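As a rough illustration of how a client might map what it observes onto the failure types above, here is a minimal sketch. The names `Failure` and `classify`, and the idea that the client knows the expected value and a deadline, are assumptions made for the example. Note that from the outside a client cannot tell a crash from an omission, so a missing reply is simply reported as an omission here.

```python
from enum import Enum
from typing import Optional

class Failure(Enum):
    NONE = "no failure"
    OMISSION = "omission failure"
    TIMING = "timing failure"
    VALUE = "response failure (value failure)"

def classify(response: Optional[int], elapsed_s: float,
             expected: int, deadline_s: float) -> Failure:
    """Classify one request/response exchange as seen by the client."""
    if response is None:           # no reply arrived at all
        return Failure.OMISSION
    if elapsed_s > deadline_s:     # reply arrived outside the time interval
        return Failure.TIMING
    if response != expected:       # reply arrived in time, but is wrong
        return Failure.VALUE
    return Failure.NONE

print(classify(None, 0.0, expected=42, deadline_s=1.0).value)
print(classify(42, 2.5, expected=42, deadline_s=1.0).value)
print(classify(41, 0.2, expected=42, deadline_s=1.0).value)
```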
Redundancy
Failure Masking by Redundancy
Three kinds of redundancy
Information redundancy: Extra bits are added so that garbled bits can be recovered.
Time redundancy: An action is performed again if needed. Example: transactions.
Physical redundancy: Extra physical components are added so that the system can tolerate the malfunction of some components.
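Time redundancy is the easiest of the three to sketch in code: the same action is simply attempted again when it fails. The example below is a minimal illustration, not a production retry policy. The helper `retry` and the simulated `flaky_action` are hypothetical names, and the sketch assumes the action is safe to repeat.

```python
def retry(action, attempts=3):
    """Run `action` up to `attempts` times and return its first successful result."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return action()
        except Exception as exc:   # the fault showed up; try once more
            last_error = exc
    raise last_error               # every attempt failed

calls = 0

def flaky_action():
    """Simulated action hit by a transient fault on its first two calls."""
    global calls
    calls += 1
    if calls < 3:
        raise RuntimeError("transient fault")
    return "ok"

result = retry(flaky_action, attempts=3)
print(result, "after", calls, "calls")
```

Repeating an action helps with transient and intermittent faults. A permanent fault would make every attempt fail, which is where physical redundancy comes in.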
Physical Redundancy
Physical redundancy is a well-known technique for fault tolerance. The following example illustrates how fault tolerance is achieved by using physical redundancy in an electronic circuit.
Triple modular redundancy
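Triple modular redundancy (TMR) is a general technique for fault tolerance.
Each device is replicated three times, and each stage is followed by three voters. A voter takes three inputs: if two or three of them agree, its output is that value; if all three differ, the output is undefined.
If device A1 fails, the circuit still works, because each voter still receives two correct inputs from A2 and A3.
A fault in voter V1 feeds an incorrect input to B1, so it has the same effect as a fault in B1 itself and is masked in the same way.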
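The voting rule can be sketched in software as follows. This is only an illustration of the majority-vote idea, not a model of a real circuit. The function names `vote` and `stage`, and the use of `None` to represent an undefined output, are assumptions made for the example.

```python
from collections import Counter

def vote(a, b, c):
    """Majority voter: the value at least two inputs agree on, else None."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    return value if count >= 2 else None

def stage(devices, inputs):
    """Run three replicated devices, then three voters over their outputs."""
    outputs = [device(x) for device, x in zip(devices, inputs)]
    # Each of the three voters sees all three device outputs.
    return [vote(*outputs) for _ in range(3)]

def good(x):
    return x + 1        # correct behaviour of a device (hypothetical)

def faulty(x):
    return x + 100      # device A1 producing wrong output (hypothetical)

# A1 is faulty, A2 and A3 are correct: the fault is masked by the voters.
after_a = stage([faulty, good, good], [1, 1, 1])
print(after_a)

# The next stage (B1, B2, B3) therefore receives three correct inputs.
after_b = stage([good, good, good], after_a)
print(after_b)
```

Replicating the voters as well as the devices means that a single faulty voter is masked by the following stage instead of becoming a single point of failure.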
Reference: Andrew S. Tanenbaum and Maarten van Steen. Distributed Systems: Principles and Paradigms. Second Edition, 2007.
Thank You