fault avoidance and tolerance technique

Upload: tipodeincognito

Post on 04-Jun-2018


  • 8/13/2019 Fault Avoidance and Tolerance Technique

    FAULT AVOIDANCE AND

    TOLERANCE TECHNIQUE

    BY

    Nithiyanandham M.

    Pavithra R.

    The fault-intolerance (or fault-avoidance) approach improves system reliability by removing the sources of failure (i.e., hardware and software faults) before normal operation begins.

    The fault-tolerance approach expects faults to be present during system operation, but employs design techniques that ensure the continued correct execution of the computing process.

    Fault-Intolerance and Fault-Tolerance

    Dependability

    Dependability Includes

    Availability

    Reliability

    Safety

    Maintainability

    Availability: A measure of whether a system is ready to be used immediately, i.e., the system is up and running at any given moment.

    Reliability: A measure of whether a system can run continuously without failure, i.e., the system continues to function for a long period of time.

    A system that goes down for 1 ms every hour has an availability of more than 99.99%, but is unreliable.

    A system that never crashes but is shut down for one week every year is 100% reliable but only about 98% available.

    Safety: A measure of how safe failures are: when the system fails, nothing serious happens.

    For instance, a high degree of safety is required for systems controlling nuclear power plants.

    Maintainability: A measure of how easy it is to repair a system.

    A highly maintainable system may also show a high degree of availability.
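
    The two availability figures above follow from simple arithmetic. The sketch below works them out; the function name and its arguments are illustrative and do not come from the slides.

```python
# Availability as the fraction of time a system is usable.
# `availability` is a hypothetical helper, not part of the original slides.

def availability(downtime: float, period: float) -> float:
    """Return availability as a fraction; downtime and period share one unit."""
    if period <= 0 or not 0 <= downtime <= period:
        raise ValueError("need period > 0 and 0 <= downtime <= period")
    return 1.0 - downtime / period

# Down for 1 ms in every hour (3,600,000 ms): highly available, yet it
# fails every hour, so it is not reliable.
frequent_short_outages = availability(1, 3_600_000)

# Down for one week in a 52-week year: never crashes, but is less available.
yearly_shutdown = availability(1, 52)

print(f"1 ms per hour:   {frequent_short_outages:.5%}")
print(f"1 week per year: {yearly_shutdown:.1%}")
```

    The first system is far more available than the second even though it fails thousands of times per year, which is why availability and reliability are treated as separate properties.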

    A fault is the cause of an error.

    An error is a part of a system's state that may lead to a failure.

    A system fails when it cannot meet its promises (specifications).

    Faults fall into three categories:

    Transient (appear once and then disappear)

    Intermittent (appear, disappear, and reappear), e.g., a loose contact on a connector

    Permanent (appear and persist until repaired)

    FAULTS

    Any fault may be

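    fail-silent (fail-stop): the faulty component stops producing output

    Byzantine: the faulty component keeps running but produces incorrect or arbitrary output

    Synchronous system vs. asynchronous system

    E.g., IP packet versus serial port transmission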

    ACHIEVING FAULT TOLERANCE

    Redundancy

    information redundancy

    Hamming codes, parity memory, ECC memory

    time redundancy

    Timeout & retransmit

    physical redundancy/replication

    TMR, RAID disks, backup servers

    Replication vs. redundancy:

    Replication: multiple identical units functioning concurrently; vote on outcome
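
    As an illustration of information redundancy, the sketch below implements a Hamming(7,4) code: 4 data bits are stored with 3 extra parity bits, which is enough to locate and correct any single flipped bit. The function names are illustrative; the bit layout (parity bits at positions 1, 2 and 4) is the standard textbook one, not something specified in the slides.

```python
# Hamming(7,4): 4 data bits + 3 parity bits, corrects one flipped bit.
# Codeword layout (1-based positions): p1 p2 d1 p4 d2 d3 d4

def hamming74_encode(data):
    """Encode a list of 4 bits into a 7-bit codeword."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_decode(code):
    """Return (data_bits, error_position); position 0 means no error seen."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s4   # syndrome read as a binary number
    if error_position:
        c[error_position - 1] ^= 1          # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], error_position

word = hamming74_encode([1, 0, 1, 1])
corrupted = list(word)
corrupted[4] ^= 1                           # simulate a single-bit fault
recovered, position = hamming74_decode(corrupted)
print(recovered, position)
```

    This code only handles one flipped bit per codeword; two simultaneous bit errors would be miscorrected.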

    REDUNDANCY

    Redundancy is a key technique for hiding failures.

    When one unit is functioning, others are available to fill in if that unit ceases to work.

    Redundancy types:

    1. Information: add extra (control) information

    Error-correction codes in messages

    2. Time: perform an action persistently until it succeeds:

    Transactions

    3. Physical: add extra components (S/W & H/W)

    Process replication, electronic circuits
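
    Time redundancy can be sketched as a retry loop: the same action is repeated until it succeeds or an attempt budget runs out. Everything named below (`retry`, `FlakyOperation`, the attempt limit and the delay) is a hypothetical example rather than anything defined in the slides.

```python
import time

def retry(action, attempts=5, delay_s=0.0):
    """Call action() until it succeeds; return (result, attempts_used)."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action(), attempt
        except Exception as exc:          # treat any exception as a fault
            last_error = exc
            time.sleep(delay_s)           # wait before trying again
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error

class FlakyOperation:
    """Hypothetical operation with a transient fault: fails N times, then works."""
    def __init__(self, failures):
        self.remaining = failures

    def __call__(self):
        if self.remaining > 0:
            self.remaining -= 1
            raise TimeoutError("no reply")
        return "ok"

result, attempts_used = retry(FlakyOperation(failures=2))
print(result, attempts_used)
```

    Retrying masks transient and some intermittent faults. A permanent fault simply exhausts the attempt budget, which is where physical redundancy becomes necessary.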

    ACTIVE REPLICATION

    Technique for fault tolerance through physical redundancy

    No redundancy:

    Triple Modular Redundancy (TMR):

    Threefold component replication to detect and

    correct a single component failure
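
    A minimal software sketch of the TMR idea: three replicas compute the same result and a voter forwards the majority value, so one faulty replica is outvoted. The names `tmr_vote` and `tmr_run` and the sample replicas are assumptions made for illustration; real TMR is usually implemented in hardware, with voters that may themselves be replicated.

```python
from collections import Counter

def tmr_vote(a, b, c):
    """Return the value produced by at least two of the three replicas."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority: more than one replica is faulty")
    return value

def tmr_run(replicas, x):
    """Run three replicas on the same input and vote on their outputs."""
    a, b, c = (replica(x) for replica in replicas)
    return tmr_vote(a, b, c)

def healthy(x):
    return x * x

def faulty(x):
    return -1          # simulated fault: always returns a wrong answer

print(tmr_run([healthy, faulty, healthy], 4))
```

    The voter masks a single faulty replica without needing to know which one failed. Outputs must be hashable values for `Counter` to compare them.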

    Availability: how much fault tolerance?

    100% fault tolerance cannot be achieved.

    The closer we wish to get to 100%, the more expensive the system will be.

    Availability: % of time that the system is functioning

    Five nines: system is up 99.999% of the time: about 5.26 minutes downtime per year

    Three nines: system is up 99.9% of the time: 8.76 hours downtime per year

    Downtime includes all time when the system is unavailable.
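
    Yearly downtime for a given availability level can be derived directly. The sketch below assumes a 365-day year (8,760 hours); the helper name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24   # 8760 hours, ignoring leap years

def downtime_hours_per_year(availability_percent):
    """Hours per year a system is down at the given availability (in %)."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return (1 - availability_percent / 100) * HOURS_PER_YEAR

three_nines_hours = downtime_hours_per_year(99.9)
five_nines_minutes = downtime_hours_per_year(99.999) * 60

print(f"99.9%   -> {three_nines_hours:.2f} hours per year")
print(f"99.999% -> {five_nines_minutes:.2f} minutes per year")
```

    Each additional nine divides the allowed downtime by ten, which is why the cost of the system grows quickly as the target approaches 100%.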

    GOAL OF FAULT-TOLERANT COMPUTING

    Dependability

    Reliability

    Availability

    Safety

    Security

    Performability

    Maintainability

    Testability

    Goal of tolerance

    POINTS OF FAILURE

    Points of failure: A system is k-fault tolerant if it can withstand k faults.

    Need k+1 components with silent faults

    k can fail and one will still be working

    Need 2k+1 components with Byzantine faults

    k can generate false replies: k+1 will provide a majority vote
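
    The component counts for k-fault tolerance can be written as a small helper. This is a sketch under the slide's own assumptions (independent components, a trusted voter in the Byzantine case); the function name is hypothetical.

```python
def components_needed(k, byzantine):
    """Minimum number of components needed to withstand k faulty ones."""
    if k < 0:
        raise ValueError("k must be non-negative")
    # Silent faults: one survivor is enough, so k + 1.
    # Byzantine faults: the k + 1 correct replies must outvote k false ones.
    return 2 * k + 1 if byzantine else k + 1

for k in (1, 2, 3):
    silent = components_needed(k, byzantine=False)
    arbitrary = components_needed(k, byzantine=True)
    print(f"k={k}: {silent} for silent faults, {arbitrary} for Byzantine faults")
```

    For k = 1 the Byzantine case gives 3 components, which matches Triple Modular Redundancy.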