lect 1 intro taxonomy

Upload: himanshuagra

Post on 04-Apr-2018


  • 7/31/2019 Lect 1 Intro Taxonomy

    1/50

    Fault Tolerant Systems

    Dependable & Secure Systems


    Text Book to be Followed


    Course Outline

    Introduction - Basic concepts
    Dependability measures
    Redundancy techniques
    Hardware fault tolerance
    Error detecting and correcting codes
    Redundant disks (RAID)
    Fault-tolerant networks
    Software fault tolerance
    Checkpointing
    Case studies of fault-tolerant systems
    Defect tolerance in VLSI circuits
    Fault detection in cryptographic systems
    Simulation techniques


    Need For Fault Tolerance - Critical Applications

    Aircraft, nuclear reactors, chemical plants, medical equipment

    A malfunction of a computer in such applications can lead to catastrophe

    Their probability of failure must be extremely low, possibly one in a billion per hour of operation

    Also included - financial applications


    Need for Fault Tolerance - Harsh Environments

    A computing system may operate in a harsh environment where it is subjected to:
        electromagnetic disturbances
        particle hits and the like

    A very large number of failures means the system will not produce useful results unless some fault tolerance is incorporated


    Need For Fault Tolerance - Highly Complex Systems

    Complex systems consist of millions of devices

    Every physical device has a certain probability of failure

    A very large number of devices implies that the likelihood of failures is high

    Without fault tolerance, the system will experience faults at such a frequency that it is rendered useless
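
    The effect of device count can be sketched numerically. The example below assumes that all devices fail independently with the same probability over some interval and that any single device failure counts as a system fault; both are simplifying assumptions, not statements from the slides, and the function name is hypothetical.

```python
def prob_any_failure(p_device: float, n_devices: int) -> float:
    """Probability that at least one of n devices fails.

    Assumes independent, identically likely device failures
    (an illustrative assumption, not part of the lecture).
    """
    return 1.0 - (1.0 - p_device) ** n_devices


if __name__ == "__main__":
    # Same per-device failure probability, growing device counts.
    for n in (1, 1_000, 1_000_000):
        print(n, prob_any_failure(1e-6, n))
```

    Even a one-in-a-million per-device failure probability gives a system-level fault probability above one half once a million such devices are involved.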


    Fault Taxonomy


    Basic Concepts of Dependability

    Dependability: the trustworthiness of a computer system such that reliance can be justifiably put on the service it delivers.

    It is the system property that integrates such attributes as reliability, availability, safety, security, survivability, and maintainability.

    A systematic exposition of the concepts of dependability consists of three parts: the threats to, the attributes of, and the means by which dependability is attained.


    Dependability Tree


    Fault-Error-Failure Model

    System Under Consideration

    Fault: cause of error (and failure)

    Error: unintended state

    Failure: deviation of actual service from intended service

    Faults and errors are states; failures are external events.

    Failure denotes an element's inability to perform its designed function because of errors in the element or its environment, which in turn are caused by various faults.


    Fault, Error, Failure Examples

    Example 1
        Fault: cosmic ray knocks charge off of a DRAM cell
        Error: bit flip in memory
        Failure: computation produces incorrect result

    Example 2
        Fault: software bug could allow for a NULL pointer
        Error: bug gets exercised and we get a NULL pointer
        Failure: program takes a segmentation fault when it tries to access the pointer


    Duration of Faults/Errors

    Transient (soft): occurs once and disappears
        E.g., cosmic ray knocks charge off transistor → bit flip
        Tend to be due to transient physical phenomena
        Also known as Single Event Upset (SEU)

    Intermittent: occurs occasionally
        E.g., loose connection → occasionally open circuit
        E.g., bug in software for rounding → incorrect data

    Permanent (hard): occurs and does not go away
        E.g., broken connection → always open circuit


    Software Faults/Errors

    Types of bugs (or errors/failures that are due to bugs)

    Incorrect algorithm
    Array bounds violation
    Memory leak (typical of C and C++, where memory is freed manually; Java's garbage collector avoids this form)
        Allocating memory, but not de-allocating it
    Reference to NULL pointer (undefined behavior in C and C++; Java raises a NullPointerException instead)
    Incorrect synchronization in multithreaded code
        Allowing more than 1 thread in critical section at a time
        Blocking when holding a lock
    Inability to handle unanticipated inputs


    Software Failure

    What happens if we exercise a software bug?

    Failures can occur in:

    User-level software
        Incorrect data
        Livelock/deadlock
        Exception that triggers OS to kill process
            Segmentation fault
            Bus error

    Operating system software (including device drivers)
        Livelock/deadlock
        Crash and reboot
        Incorrect I/O


    Dependability and its Attributes

    Availability: readiness for correct service

    Reliability: continuity of correct service

    Safety: absence of catastrophic consequences on the user(s) and the environment

    Confidentiality: absence of unauthorized disclosure of information

    Integrity: absence of improper system alterations

    Maintainability: ability to undergo modifications and repairs

    Security is a composite of the attributes availability, confidentiality, and integrity.


    Traditional Measures - Reliability

    Assumption: The system can be in one of two states: up or down

    Examples:
        Lightbulb - good or burned out
        Wire - connected or broken

    Reliability, R(t): Probability that the system is up during the whole interval [0,t], given it was up at time 0

    Related measure - Mean Time To Failure, MTTF: Average time the system remains up before it goes down and has to be repaired or replaced
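
    A small numeric sketch of these two measures follows. It assumes a constant failure rate (the exponential failure law), which the slide does not state; under that assumption R(t) = exp(-rate * t) and MTTF = 1 / rate. The function names and the example rate are illustrative only.

```python
import math


def reliability(t_hours: float, failure_rate: float) -> float:
    """R(t) under an assumed constant failure rate (failures per hour)."""
    return math.exp(-failure_rate * t_hours)


def mttf(failure_rate: float) -> float:
    """Mean time to failure for the same exponential model: 1 / rate."""
    return 1.0 / failure_rate


if __name__ == "__main__":
    rate = 1e-3  # hypothetical: one failure per 1000 hours on average
    print("MTTF (hours):", mttf(rate))
    print("R(100 h):", reliability(100.0, rate))
    print("R(MTTF):", reliability(mttf(rate), rate))
```

    Under this model the probability of surviving until t = MTTF is only about 37 percent, which is why MTTF alone says little about short-mission reliability.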


    Traditional Measures - Availability

    Availability, A(t): Fraction of time the system is up during the interval [0,t]

    Point Availability, A_p(t): Probability that the system is up at time t

    Long-Term Availability, A:

    $$A = \lim_{t \to \infty} A(t) = \lim_{t \to \infty} A_p(t)$$

    Availability is used in systems with recovery/repair

    Related measures:

    Mean Time To Repair, MTTR

    Mean Time Between Failures, MTBF = MTTF + MTTR

    $$A = \frac{\mathrm{MTTF}}{\mathrm{MTBF}} = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}$$
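
    As a worked example of long-term availability, the sketch below evaluates A = MTTF / (MTTF + MTTR) and converts the result into expected downtime per year. The function names and the sample figures are illustrative assumptions.

```python
def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Long-term availability A = MTTF / MTBF, with MTBF = MTTF + MTTR."""
    mtbf_hours = mttf_hours + mttr_hours
    return mttf_hours / mtbf_hours


def downtime_per_year_hours(availability: float, hours_per_year: float = 8760.0) -> float:
    """Expected hours of downtime in a year of 8760 hours."""
    return (1.0 - availability) * hours_per_year


if __name__ == "__main__":
    a = steady_state_availability(mttf_hours=999.0, mttr_hours=1.0)
    print("A =", a)
    print("downtime per year (h):", downtime_per_year_hours(a))
```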


    Need For More Measures

    The assumption of the system being in state up or down is very limiting

    Example: A processor with one of its several hundreds of millions of gates stuck at logic value 0 and the rest functional - may affect the output of the processor once in every 25,000 hours of use

    The processor is not fault-free, but cannot be defined as being down

    More detailed measures than the general reliability and availability are needed


    Computational Capacity Measures

    Example: N processors in a gracefully degrading system

    System is useful as long as at least one processor remains operational

    Let P_i = Prob{i processors are operational}

    $$R(t) = \sum_{i=1}^{N} P_i$$

    Let c = computational capacity of a processor (e.g., number of fixed-size tasks it can execute)

    Computational capacity of i processors: C_i = i c

    Average computational capacity of system:

    $$\sum_{i=1}^{N} C_i P_i$$
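
    The sums on this slide, read here as R(t) = sum of P_i for i = 1..N and average capacity = sum of C_i P_i for i = 1..N, can be evaluated once the P_i are known. The sketch below assumes each processor is independently operational with the same probability at time t, so the P_i follow a binomial distribution; that model and all names are assumptions for illustration.

```python
from math import comb  # available in Python 3.8+


def operational_distribution(n: int, p_up: float) -> list:
    """P_i for i = 0..n, assuming independent processors (binomial model)."""
    return [comb(n, i) * p_up**i * (1.0 - p_up) ** (n - i) for i in range(n + 1)]


def system_reliability(dist: list) -> float:
    """Probability that at least one processor is operational: sum of P_i, i >= 1."""
    return sum(dist[1:])


def average_capacity(dist: list, c: float) -> float:
    """Sum over i of C_i * P_i, with C_i = i * c."""
    return sum(i * c * p_i for i, p_i in enumerate(dist))


if __name__ == "__main__":
    dist = operational_distribution(n=4, p_up=0.9)
    print("R(t) =", system_reliability(dist))
    print("average capacity =", average_capacity(dist, c=2.0))
```

    Under the binomial assumption the average capacity reduces to N * p * c, which gives a quick sanity check on the sum.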


    Another Measure - Performability

    Another approach - consider everything from the perspective of the application

    Application is used to define accomplishment levels L1, L2,...,Ln

    Each represents a level of quality of service delivered by the application

    Example: Li indicates i system crashes during the mission time period T

    Performability is a vector (P(L1),P(L2),...,P(Ln)) where P(Li) is the probability that the computer functions well enough to permit the application to reach up to accomplishment level Li
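
    For the crash-count example, a performability vector can be sketched under one possible model. The code below assumes crashes arrive as a Poisson process with a constant rate, and it reads level Li as "exactly i crashes during T", starting from i = 0; both the model and that reading are assumptions, and the names are hypothetical.

```python
import math


def performability_vector(crash_rate_per_hour: float, mission_hours: float, max_level: int) -> list:
    """[P(L0), ..., P(Ln)] where Li means exactly i crashes during the mission.

    Uses an assumed Poisson crash model: P(i) = exp(-m) * m**i / i!,
    with m = rate * mission time.
    """
    mean_crashes = crash_rate_per_hour * mission_hours
    return [
        math.exp(-mean_crashes) * mean_crashes**i / math.factorial(i)
        for i in range(max_level + 1)
    ]


if __name__ == "__main__":
    vector = performability_vector(crash_rate_per_hour=0.01, mission_hours=100.0, max_level=3)
    for level, prob in enumerate(vector):
        print(f"P(L{level}) = {prob:.4f}")
```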


    Network Connectivity Measures

    Focus on the network that connects the processors

    Classical Node and Line Connectivity - the minimum number of nodes and lines, respectively, that have to fail before the network becomes disconnected

    This measure indicates how vulnerable the network is to disconnection

    A network disconnected by the failure of just one (critically positioned) node is potentially more vulnerable than another which requires several nodes to fail before it becomes disconnected
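
    Node and line connectivity can be computed by brute force for small networks. The sketch below tries every subset of nodes (or lines) in increasing size and reports the first size whose removal disconnects the graph. It assumes a simple undirected graph given as a node list and an edge list; the helper names are hypothetical, and the exhaustive search is only practical for small examples.

```python
from itertools import combinations


def is_connected(nodes, edges) -> bool:
    """Depth-first search over the surviving nodes and edges."""
    nodes = set(nodes)
    if not nodes:
        return True
    adj = {v: set() for v in nodes}
    for a, b in edges:
        if a in nodes and b in nodes:
            adj[a].add(b)
            adj[b].add(a)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for w in adj[v] - seen:
            seen.add(w)
            stack.append(w)
    return seen == nodes


def node_connectivity(nodes, edges) -> int:
    """Fewest node failures that disconnect the network.

    Returns n - 1 for graphs that no node removal can disconnect
    (the usual convention for complete graphs).
    """
    nodes = list(nodes)
    for k in range(len(nodes) - 1):
        for removed in combinations(nodes, k):
            if not is_connected(set(nodes) - set(removed), edges):
                return k
    return len(nodes) - 1


def line_connectivity(nodes, edges) -> int:
    """Fewest line (edge) failures that disconnect the network."""
    edges = list(edges)
    for k in range(len(edges) + 1):
        for removed in combinations(edges, k):
            kept = [e for e in edges if e not in removed]
            if not is_connected(nodes, kept):
                return k
    return len(edges)


if __name__ == "__main__":
    ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
    star = [(0, 1), (0, 2), (0, 3)]
    print("ring:", node_connectivity(range(4), ring), line_connectivity(range(4), ring))
    print("star:", node_connectivity(range(4), star), line_connectivity(range(4), star))
```

    The star network is disconnected by the failure of its single central node, while the ring needs two failures, matching the vulnerability comparison made above.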


    Connectivity - Examples


    Network Resilience Measures

    Classical connectivity distinguishes between only two network states: connected and disconnected

    It says nothing about how the network degrades as nodes fail before becoming disconnected

    Two possible resilience measures:
        Average node-pair distance
        Network diameter - maximum node-pair distance

    Both are calculated given the probability of node and/or link failure
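
    Both resilience measures can be sketched for one specific failure scenario. The code below computes the average node-pair distance and the diameter of the surviving network after a given set of nodes has failed, using breadth-first search on hop counts. Weighting the results by failure probabilities, as the slide describes, is left out; the function names and the choice to return None for a disconnected network are assumptions.

```python
from collections import deque


def bfs_distances(adj: dict, source) -> dict:
    """Hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def distance_measures(nodes, edges, failed=()):
    """Return (average node-pair distance, diameter) of the surviving network.

    Returns None if the surviving network is disconnected or has fewer
    than two nodes, since the measures are then undefined.
    """
    failed = set(failed)
    alive = [v for v in nodes if v not in failed]
    adj = {v: set() for v in alive}
    for a, b in edges:
        if a in adj and b in adj:
            adj[a].add(b)
            adj[b].add(a)
    pair_distances = []
    for idx, a in enumerate(alive):
        dist = bfs_distances(adj, a)
        for b in alive[idx + 1:]:
            if b not in dist:
                return None
            pair_distances.append(dist[b])
    if not pair_distances:
        return None
    return sum(pair_distances) / len(pair_distances), max(pair_distances)


if __name__ == "__main__":
    ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
    print("no failures:", distance_measures(range(5), ring))
    print("node 0 failed:", distance_measures(range(5), ring, failed=[0]))
    print("nodes 0 and 2 failed:", distance_measures(range(5), ring, failed=[0, 2]))
```

    In the five-node ring, one node failure leaves the network connected but stretches both measures, which is exactly the degradation that classical connectivity does not capture.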


    Means to Attain Dependability

    Fault prevention: means to prevent the occurrence or introduction of faults

    Fault tolerance: means to avoid service failures in the presence of faults

    Fault removal: means to reduce the number and severity of faults

    Fault forecasting: means to estimate the present number, the future incidence, and the likely consequences of faults

    Note:

    Fault prevention and fault tolerance aim to provide the ability to deliver a service that can be trusted. [Procurement]

    Fault removal and fault forecasting aim to reach confidence in that ability by justifying that the functional and dependability specifications are adequate and that the system is likely to meet them. [Validation]


    Failure Modes

    A system does not always fail in the same way. Its failure modes characterize incorrect service according to four viewpoints:
        the failure domain
        the perception of a failure by system users
        the detectability of failures
        the consequences of failures on the environment


    A Taxonomy of Faults

    All faults that may affect a system during its life are classified according to eight basic viewpoints.


    Classes of Faults Tree Representation


    Classes of Combined Faults


    Key System/Functional Unit Properties

    Fail Safe: In case of a fault, the system or functional unit transitions to a safe state.

    Fail Silent: In case of a fault, the output interfaces are disabled in a way that no further outputs are made.

    Fail Operational: The ability of a system or functional unit to continue normal operation at its output interfaces despite the presence of hardware or software faults.

    Graceful Degradation: The system continues to operate in the presence of errors, accepting partial degradation of performance during recovery.


    EASIS's View on Fail-Silent Electronic Control Unit (ECU)


    CPU Faults/Errors

    Processing core:
    I. Calculating errors (e.g. HW fault, logic error)
    II. Value errors (e.g. HW fault, memory/register corruption, EMI, SEU, etc.)
    III. Program flow errors (e.g. HW error)
    IV. Interrupt errors (sequence, frequency, delay, disregarding, etc.)
    V. Algorithmic errors (= compiler/logic synthesizer errors / design faults)
    VI. Timing errors

    RAM/ROM:
    VII. Errors in the RAM/ROM (memory cell defective)
    VIII. Faulty RAM/ROM access (wrong memory address)
    IX. Faulty memory mapping (= compiler or linker errors / design faults)
    X. Memory overflow

    I/O-Interface:
    XI. Interface errors (errors in ADC/digital IO/ ...)


    Supervisor Faults/Errors

    I. Internal error (the same as CPU faults/errors if the supervisor is a processor).
    II. Synchronization lost between CPU and supervisor.
    III. Supervisor and CPU are getting different information from the outer world.
    IV. Supervisor loses control over the enable lines.
    V. CPU and supervisor use different, but both valid, rules to judge the control.


    SW Related Faults/Errors

    Scheduling Faults/Errors
    I. Missed activation deadline
    II. Missed termination deadline

    Communication between SW components
    I. Data values of the received data are faulty
    II. The data is received later than a deadline
    III. The data is received too early
    IV. The data cannot be sent out in the given time range
    V. The data cannot be sent out
    VI. API access fault (e.g. dynamic argument is out of range)


    Actuator Faults/Errors

    I. The actuator is not driven.

    II. The actuator is permanently driven (without controller command).

    III. The actuator is not driven at the right time.

    IV. The actuator is not driven with the correct performance.

    V. The actuator cannot be driven correctly.


    Sensor Faults/Errors

    I. The sensor delivers no value or an error signal.

    II. The read value of the sensor is wrong.

    III. The sensor delivers a value with a wrong timing.


    Internal Power Supply Faults

    I. Over voltage
    II. Under voltage
    III. Short circuit
    IV. Over current (due to erroneously activated actuators, defective actuators, defective components, misuse of components, etc.)
    V. Leakage current too high
    VI. Brown out (slow decrease of the supply voltage below the minimum limit)
    VII. Startup timing
    VIII. Shutdown timing


    External Power Supply Faults

    I. Over voltage (load dump, ISO pulse, generator error)

    II. Under voltage (due to low battery, line break)

    III. Current limit

    IV. Short circuit


    Faults/Errors in Communication Systems

    At the node level
    I. Data values of a received message are faulty (faulty data value).
    II. The message is received later than a deadline (late message).
    III. The message is received too early.
    IV. The message cannot be sent out in the given time range.
    V. The message cannot be sent out.

    At the system level
    I. All receivers of the message (in a special case only one receiver exists) regard the message as faulty with respect to the same main fault type, which is one of the node-level faults (I to III).
    II. All receivers of the message regard the message as faulty with respect to one of the main fault types (I to III), which can be different for each receiver.
    III. Some of the receivers get a correct message, while the others get a faulty message with respect to one of the main fault types (I to III), which is the same for each receiver of the faulty message.
    IV. Some of the receivers get a correct message, while the others get a faulty message with respect to one of the main fault types (I to III), which can be different for each receiver of the faulty message.


    Comprehensive Fault Model

    Specification Faults

    Adequacy faults: some of the properties expressed in the specification are in contradiction with the required properties.

    Over-specification: the specification satisfies the required properties, but some feasible solutions are excluded because of the presence of unnecessary properties; the specification is too detailed.

    Under-specification: all the properties expressed in the specification are adequate, but some unacceptable solutions are accepted; the specification is not precise enough.

    Source: NUREG/CR-6316 Guidelines


    Requirement Faults (NASA fault taxonomy)

    Incompleteness
    Omitted/Missing
    Incorrect
    Ambiguous
    Infeasible
    Inconsistent
    Over-specification
    Not Traceable
    Misplaced
    Unachievable Item
    Non-verifiable
    Intentional Deviation
    Redundant or Duplicate


    Design Faults

    Software design faults
        Application design faults
        Basic software design faults
        Scheduling faults
        Services faults
        Calibration faults

    Firmware design faults

    Hardware design faults
        Component design faults
        ECU design faults

    Malicious design faults
        Disrupt or halt service; causing denial of service; improper modification of system behavior

    System design faults
        Relating to architecture design, communication infrastructure, wiring harness, EMI protection, etc.


    Manufacturing Faults

    Arise from weaknesses in the manufacturing and assembly processes at the various levels of detail, from component manufacturing to the final vehicle assembly.

    Such a fault could be caused by low quality in materials/components, but may also be caused by a software/hardware fault in the manufacturing system.


    Operational Faults (Refer to EASIS Fault Model)

    Hardware faults
        Node faults
            CPU faults
            Supervisor/watchdog faults
            Internal communication (SPI) faults
            Reset logic faults
        Actuator faults
        Sensor faults
        Power-supply faults
        Communication faults/errors


    Operational Faults (Contd.)

    Susceptibility faults
        Electrical susceptibility (EMI transported by cabling)
        Electromagnetic susceptibility (transported by air)
        Environmental susceptibility

    Maintenance faults
        Wrong software download
        Wrong replacement parts
        Wrong maintenance procedure followed

    Malicious faults
        Software intrusions
        Hardware intrusions


    Fault Hypothesis

    The fault hypothesis partitions the fault space into two sets

    Level-1 faults: this is the set of faults that will be tolerated by the fault-tolerance mechanisms.

    Level-2 faults: this is the set of faults that will not be tolerated by the fault-tolerance mechanisms. These faults must be rare events.

    If there is no precise fault hypothesis available, it is impossible to test the proper behavior of the fault-tolerance mechanisms.

    If, during the test and installation phase, it is found that level-2 faults are not rare events, then there exists a fundamental design problem:
        Either the fault hypothesis is wrong
        Or the implementation is deficient.


    Hardware Redundancy

    Extra hardware is added to override the effects of a failed component

    Static Hardware Redundancy - for immediate masking of a failure
        Example: Use three processors and vote on the result. The wrong output of a single faulty processor is masked

    Dynamic Hardware Redundancy - Spare components are activated upon the failure of a currently active component

    Hybrid Hardware Redundancy - A combination of static and dynamic redundancy techniques
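
    The three-processor voting example can be sketched in software. Here each "processor" is modeled as a plain Python callable and the voter accepts a result only when a strict majority agrees; the function names and the simulated faulty module are illustrative assumptions, not a description of any particular hardware voter.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of modules.

    Assumes a non-empty list of hashable outputs. Raises RuntimeError
    when no value has a majority, i.e. the fault cannot be masked.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: fault cannot be masked")
    return value


def run_redundant(modules, x):
    """Run every module on the same input and vote on the results."""
    return majority_vote([module(x) for module in modules])


if __name__ == "__main__":
    def healthy(x):
        return x * x

    def faulty(x):
        return x * x + 1  # simulated wrong output of one processor

    # One faulty module out of three is outvoted by the two healthy ones.
    print(run_redundant([healthy, faulty, healthy], 3))
```

    With three modules the voter masks one wrong output; two disagreeing faulty modules leave no majority, and the sketch reports that instead of guessing.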


    Software Redundancy Example

    Multiple teams of programmers

    Write different versions of software for the same function

    The hope is that such diversity will ensure that not all the copies will fail on the same set of input data


    Information Redundancy

    Add check bits to original data bits so that an error in the data bits can be detected and even corrected

    Error detecting and correcting codes have been developed and are being used

    Information redundancy often requires hardware redundancy to process the additional check bits
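
    One well-known instance of check bits that both detect and correct is the Hamming(7,4) code, used here purely as an illustration; the slides do not name a specific code. The sketch stores three parity bits at positions 1, 2 and 4 of a 7-bit codeword and uses the syndrome to locate and flip a single erroneous bit. Function names are hypothetical.

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword.

    Codeword layout by position 1..7: p1, p2, d1, p4, d2, d3, d4.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct at most one flipped bit; return (data bits, error position).

    The error position is 0 when no error was detected, otherwise the
    1-based position of the bit that was flipped back.
    """
    c = list(codeword)  # work on a copy
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], syndrome


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    corrupted = word[:]
    corrupted[4] ^= 1  # flip the bit at position 5
    print(word, corrupted, hamming74_decode(corrupted))
```

    The code corrects any single-bit error in the 7-bit word; two simultaneous errors would be miscorrected, which is the usual limit of this particular code.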


    Time Redundancy

    Provide additional time during which a failed execution can be repeated

    Most failures are transient - they go away after some time

    If enough slack time is available, the failed unit can recover and redo the affected computation
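
    Re-execution within the available slack can be sketched as a bounded retry loop. The example assumes that a failed execution is detected and surfaces as a Python exception, and that the slack is expressed as a maximum number of attempts; both are simplifications, and the names are hypothetical.

```python
def run_with_retries(task, max_attempts=3):
    """Re-execute task until it succeeds or the attempt budget is spent.

    Returns (result, attempts used). Raises RuntimeError if every
    attempt fails, which suggests the fault is not transient.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(), attempt
        except Exception as exc:  # a detected failure, possibly transient
            last_error = exc
    raise RuntimeError(f"still failing after {max_attempts} attempts") from last_error


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky():
        # Simulated transient fault: the first two executions fail.
        calls["count"] += 1
        if calls["count"] < 3:
            raise IOError("transient fault")
        return 42

    print(run_with_retries(flaky, max_attempts=3))
```

    A permanent fault exhausts the attempt budget, so time redundancy alone only helps with the transient case described above.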
