fault avoidance and tolerance technique
8/13/2019 Fault Avoidance and Tolerance Technique
FAULT AVOIDANCE AND
TOLERANCE TECHNIQUE
BY
Nithiyanandham M.
Pavithra R.
The fault-intolerance (or fault-avoidance) approach improves system reliability by removing the sources
of failures (i.e., hardware and software faults) before
normal operation begins.
The fault-tolerance approach expects faults to be
present during system operation, but employs design
techniques that ensure the continued correct
execution of the computing process.
Fault-Intolerance and Fault-Tolerance
Dependability
Dependability Includes
Availability
Reliability
Safety
Maintainability
Availability: a measure of whether a system is ready to be used
immediately, i.e., the system is up and running at any given moment.
Reliability: a measure of whether a system can run continuously
without failure, i.e., the system keeps functioning for a long period of time.
A system that goes down for 1 ms every hour has an availability of more than
99.99%, but is unreliable.
A system that never crashes but is shut down for a week once every year is 100% reliable but only about 98% available.
Safety: a measure of how safe failures are.
When the system fails, nothing serious happens.
For instance, a high degree of safety is required for systems
controlling nuclear power plants.
Maintainability: a measure of how easy it is to repair a system.
A highly maintainable system may also show a high degree of
availability.
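The two availability figures quoted above can be checked with a few lines of arithmetic. The sketch below assumes a 365-day year, and the function name `availability` is only illustrative.

```python
HOURS_PER_YEAR = 365 * 24          # assumption: leap years ignored
MS_PER_HOUR = 60 * 60 * 1000

def availability(uptime, downtime):
    """Fraction of total time during which the system is up."""
    return uptime / (uptime + downtime)

# System A: down for 1 ms in every hour (fails often, but briefly).
system_a = availability(MS_PER_HOUR - 1, 1)

# System B: never crashes, but is shut down one week (168 h) per year.
week_hours = 7 * 24
system_b = availability(HOURS_PER_YEAR - week_hours, week_hours)

print(f"A: {system_a:.5%}")
print(f"B: {system_b:.2%}")
```

System A comes out well above 99.99% even though it fails every hour, while System B lands at roughly 98%, which is the contrast between availability and reliability drawn on this slide.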
A fault is the cause of an error.
An error is a part of the system state that may lead to a failure.
A system fails when it cannot meet its promises (specifications).
Faults fall into three categories:
Transient (appear once and disappear)
Intermittent (appear, disappear, and reappear)
Example: a loose contact on a connector causes an intermittent fault
Permanent (appear and persist until repaired)
FAULTS
Any fault may be
fail-silent (fail-stop)
Byzantine
Synchronous system vs. asynchronous system
E.g., IP packet versus serial port transmission
ACHIEVING FAULT TOLERANCE
Redundancy
information redundancy
Hamming codes, parity memory, ECC memory
time redundancy
Timeout & retransmit
physical redundancy/replication
TMR, RAID disks, backup servers
Replication vs. redundancy:
Replication: multiple identical units functioning concurrently and
voting on the outcome
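As an illustration of information redundancy, here is a minimal sketch of a Hamming(7,4) code, which adds three parity bits to four data bits so that any single flipped bit can be located and corrected. The function names `hamming_encode` and `hamming_decode` are illustrative and not taken from the slides.

```python
def hamming_encode(data):
    """Encode 4 data bits into a 7-bit codeword (positions 1..7).

    Parity bits sit at positions 1, 2 and 4; data bits at 3, 5, 6 and 7.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming_decode(codeword):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # parity check over positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # parity check over positions 2, 3, 6, 7
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]   # parity check over positions 4, 5, 6, 7
    error_position = s1 + 2 * s2 + 4 * s4   # 0 means no error detected
    if error_position:
        c[error_position - 1] ^= 1          # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]]

word = hamming_encode([1, 0, 1, 1])
word[4] ^= 1                     # simulate a single-bit fault in transit
recovered = hamming_decode(word)
print(recovered)
```

The extra three bits are the redundant information: they carry no new data, but they let the receiver mask a single-bit fault without asking for a retransmission.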
REDUNDANCY
Redundancy is a key technique for hiding failures.
While one unit is functioning, others are available to take over in
case that unit ceases to work.
Redundancy types:
1. Information: add extra (control) information
Error-correction codes in messages
2. Time: perform an action persistently until it succeeds:
Transactions
3. Physical: add extra components (S/W & H/W)
Process replication, electronic circuits
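Time redundancy can be sketched as a retry loop: the same action is repeated until it succeeds. The version below is a simplified illustration that assumes a bounded number of attempts (the slide says only "until it succeeds"); `TransientError`, `retry`, and `make_flaky_sender` are hypothetical names introduced for the example.

```python
class TransientError(Exception):
    """Stands in for a timeout or a lost acknowledgement."""

def retry(action, max_attempts=5):
    """Repeat `action` until it succeeds or the attempts are used up."""
    for _ in range(max_attempts):
        try:
            return action()
        except TransientError:
            continue   # transient fault: simply try again
    raise TransientError(f"gave up after {max_attempts} attempts")

def make_flaky_sender(failures):
    """Build a sender that fails `failures` times, then succeeds."""
    remaining = {"count": failures}
    def send():
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise TransientError("no acknowledgement received")
        return "ack"
    return send

result = retry(make_flaky_sender(2))   # succeeds on the third attempt
print(result)
```

Retrying masks transient and some intermittent faults, but it cannot help with a permanent fault, which is why the sketch eventually gives up.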
ACTIVE REPLICATION
Technique for fault tolerance through physical redundancy
No redundancy:
Triple Modular Redundancy (TMR):
Threefold component replication to detect and
correct a single component failure
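The voting step of TMR can be sketched in software as a majority vote over three replica outputs. This is a simplified model, assuming the replicas return directly comparable values; `tmr_vote` and the sample replicas are illustrative names.

```python
from collections import Counter

def tmr_vote(a, b, c):
    """Return the value that at least two of the three replicas agree on."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority: more than one replica has failed")
    return value

def good_replica(x):
    return x * x

def faulty_replica(x):
    return x * x + 1   # simulates a single failed component

# One faulty replica is outvoted by the two correct ones.
outputs = [replica(7) for replica in (good_replica, faulty_replica, good_replica)]
voted = tmr_vote(*outputs)
print(voted)
```

If two replicas fail in different ways, no value reaches a majority and the voter can only report the disagreement, which is why TMR is said to tolerate a single component failure.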
Availability: how much fault tolerance?
100% fault tolerance cannot be achieved.
The closer we wish to get to 100%, the more expensive the system will be.
Availability: % of time that the system is functioning
Five nines: system is up 99.999% of the time: about 5.26 minutes
downtime per year
Three nines: system is up 99.9% of the time: 8.76 hours
downtime per year
Downtime includes all time when the system is unavailable.
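The yearly downtime implied by an availability percentage follows from simple arithmetic. The sketch below assumes a 365-day year; the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600 minutes, leap years ignored

def downtime_minutes_per_year(availability_percent):
    """Expected yearly downtime, in minutes, for a given availability."""
    return (100 - availability_percent) / 100 * MINUTES_PER_YEAR

for label, percent in [("three nines", 99.9), ("five nines", 99.999)]:
    minutes = downtime_minutes_per_year(percent)
    print(f"{label} ({percent}%): {minutes:.2f} min = {minutes / 60:.2f} h per year")
```

Under these assumptions, three nines corresponds to about 8.76 hours of downtime per year and five nines to about 5.26 minutes.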
GOAL OF FAULT TOLERANT
COMPUTING
Dependability
Reliability
Availability
Safety
Security
Performability
Maintainability
Testability
Goal of tolerance
POINTS OF FAILURE
Points of failure: A system is k-fault tolerant if it can
withstand k faults.
Need k+1 components with silent faults
k can fail and one will still be working
Need 2k+1 components with Byzantine faults
k can generate false replies: k+1 will provide a majority
vote
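The component counts above can be sketched as follows. The helper names `components_needed` and `majority` are illustrative, and the replies are modeled as plain strings for simplicity.

```python
from collections import Counter

def components_needed(k, byzantine):
    """Components required to tolerate k faults of the given kind."""
    return 2 * k + 1 if byzantine else k + 1

def majority(replies):
    """Return the reply given by more than half the components, else None."""
    value, count = Counter(replies).most_common(1)[0]
    return value if count > len(replies) // 2 else None

k = 2

# Silent faults: failed components say nothing, so one survivor is enough.
n_silent = components_needed(k, byzantine=False)        # k + 1 = 3

# Byzantine faults: failed components may lie, so the k + 1 correct
# replies must outnumber the k false ones.
n_byzantine = components_needed(k, byzantine=True)      # 2k + 1 = 5
replies = ["correct"] * (n_byzantine - k) + ["false"] * k
decision = majority(replies)
print(n_silent, n_byzantine, decision)
```

With only 2k components, k false replies could tie with k correct ones and no majority would exist, which is why the extra component is needed.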