fault injection

Upload: himanshuagra

Post on 04-Apr-2018


  • 7/31/2019 Fault Injection

    1/19

    4/8/20

    Assessing System Dependability

    Purnendu Sinha

    General Motors R&D, India Science Lab

    ITPL

    Fault-Injection (FI)

    A technique that injects faults to create errors and failures at the HW, SW, or combined HW & SW level.

    It is important for evaluating the dependability of computer systems.

    Simulation-based FI is often used to evaluate the dependability of a system that is still in the concept and design phases.


    Taxonomy of Faults

    FAULTS: Physical Faults, Interaction Faults, Design Faults

    Faults Induced by the User

    Software Faults: Initialization Faults, Assignment Faults, Condition Check Faults, Function Faults, Documentation Faults

    Hardware Faults: Memory Faults, CPU Faults, Bus Faults, I/O Faults

    Foundation

    When fault injection is considered on a target system, the input domain corresponds to a set of faults F and a set of activations A that specifies the domain used to functionally exercise the system.

    The output domain corresponds to a set of readouts R and a set of derived measures M.

    The FARM sets constitute the major attributes that can be used to fully characterize fault injection.

    Level of Abstraction of FI

    Axiomatic models: the analytical models used to model the structure and the dependability and/or performance behavior of the system, such as Reliability Block Diagrams, Fault Trees, Petri nets, etc.

    Empirical models: models that incorporate more complex or detailed behavioral and/or structural descriptions.

    Physical models: prototypes actually implementing the hardware and/or software features of the developed system.

    Foundation (Contd.): Impact of Models on the FARM Attributes

    The F set: In axiomatic models, it is described by stochastic processes whose parameters are characterized by probability distributions. Empirical models (with realistic distributions for the parameters) include fault simulation methods at the component, gate, circuit, or system level. Physical models (prototypes) may be SW, HW, or HW-SW: for SW, the F set corresponds to simple alterations in the source code of programs; for HW or HW-SW, the F set is mainly based on physical faults.

    The A set: In axiomatic models, the A set is described by stochastic processes. In empirical models, the A set describes the behavior of the system in a form where elementary parameters can be more appropriately identified and assigned. For physical models that are SW only or HW only, the A set consists of a set of test data patterns aimed at exercising the injected faults; for combined SW and HW, it is application-dependent.

    The R & M sets: In axiomatic models, the M set corresponds to dependability measures such as reliability, MTTF, etc. For both empirical and physical models, the measures in M can be obtained only experimentally from a series of fault injection case studies.

    For each experiment, a fault f is selected in F and an activation trajectory a is described in A. The reactions of the system are observed and form a readout r that fully characterizes the outcome of the experiment.

    A fault-injection experiment is characterized by the triple <f, a, r>. The readouts r of the individual experiments form a global readout set R for the test sequence, which can be used to elaborate a measure in M.
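    The FARM loop can be sketched in a few lines of Python. This is only an illustrative toy, not taken from the slides: the target system (an 8-bit value stored with an even-parity bit), the helper names encode, parity_ok and run_experiment, and the particular F and A sets are all assumptions chosen to keep the example small.

```python
from itertools import product

def encode(data: int) -> int:
    """Toy target system: append an even-parity bit to an 8-bit value."""
    data &= 0xFF
    parity = bin(data).count("1") % 2
    return (data << 1) | parity

def parity_ok(word: int) -> bool:
    """Error detector of the toy target: the 9-bit word must have even parity."""
    return bin(word).count("1") % 2 == 0

def run_experiment(fault, activation):
    """One experiment <f, a, r>: store a, flip the bits named by f, observe."""
    word = encode(activation)
    for bit in fault:
        word ^= 1 << bit
    return {"fault": fault, "activation": activation,
            "detected": not parity_ok(word)}

# F set: all single-bit flips of the 9-bit word plus three double-bit flips.
F = [(b,) for b in range(9)] + [(0, 1), (2, 3), (4, 5)]
# A set: data values used to exercise the system.
A = [0x00, 0x5A, 0xFF]
# R set: one readout per <f, a> pair.
R = [run_experiment(f, a) for f, a in product(F, A)]
# M set: a measure derived from the readouts.
M = {"detection_coverage": sum(r["detected"] for r in R) / len(R)}
print(M)
```

    In this toy setup parity catches every single-bit flip and misses every double-bit flip, so the derived measure depends directly on how the F set is chosen.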


    Basics of Fault Injection

    Where: to apply change (location, abstraction/system level)

    What: to inject (what should be injected/corrupted?)

    Which: trigger to use (event, instruction, timeout, exception, code mutation)

    When: to inject (corresponding to type of faults)

    How often: to inject (corresponding to type of faults)

    What to record and interpret? What's the purpose?

    How is the system loaded at the time of the injection?

    Workload (real, realistic, synthetic)

    System resources

    Simulation time explosion

    When too much detail is simulated

    When extremely small failure probabilities require large simulation runs

    Coverage and Latency

    The aim is to find characteristics of an event X, where X may be detection, recovery, etc.

    Coverage of event X: the conditional probability of X occurring.

    E.g., the probability of error detection given that an error exists in the system.

    Latency of event X: the time from the earliest possible occurrence of X to its actual monitored occurrence.

    E.g., the time from error occurrence to error detection.
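    As a minimal sketch of how these two quantities might be estimated from experiment readouts, assume each readout is a pair (error_time, detection_time), with detection_time set to None when the error was never detected. The record layout, the function name and the sample numbers are hypothetical.

```python
from statistics import mean

def coverage_and_latency(records):
    """Estimate detection coverage and mean detection latency.

    records: list of (error_time, detection_time) pairs, one per experiment
    in which an error existed; detection_time is None if it went undetected.
    """
    if not records:
        raise ValueError("at least one experiment is required")
    latencies = [detected - error
                 for error, detected in records if detected is not None]
    coverage = len(latencies) / len(records)   # P(detection | error exists)
    mean_latency = mean(latencies) if latencies else None
    return coverage, mean_latency

# Hypothetical readouts: three detected errors, one undetected.
records = [(10, 12), (20, 26), (30, None), (40, 41)]
cov, lat = coverage_and_latency(records)
print(cov, lat)
```

    Note that latency is averaged only over the detected errors, so it should always be reported together with the coverage figure.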


    Experimental Analysis: Prototype Phase

    Approach and Goals

    The system runs under a controlled workload

    Controlled fault injections are used to evaluate the system in the presence of faults

    Information produced

    The failure process from fault occurrence to system recovery: error latency, propagation, detection, and recovery (may include reconfiguration)

    Limitations/Issues

    Can only study artificial faults; injected faults should induce failure scenarios representative of actual system operation

    Experimental Analysis: Operational Phase

    Approach and Goals

    Study naturally occurring errors

    Measure systems in the field under real workloads

    Analyze collected error/failure and performance data

    Information produced

    Actual error/failure characteristics and insight into analytical models (failure rates, time-to-failure distributions)

    Limitations/Issues

    HW/SW instrumentation, analysis tools

    Approach limited to detected errors; conditions in the field can vary widely


    Operational Phase (Contd.)

    Measurement-based analysis uses actual data, which contains much information about naturally occurring errors and failures, and sometimes about recovery attempts.

    Given field error data collected from a system, the study consists of four steps:

    Step I: extract the necessary information from the field data, classify errors and failures, and coalesce repeated error reports.

    Step II: identify appropriate models and estimate various measures of interest from the coalesced data.

    Step III: solve these models to obtain dependability measures.

    Step IV: carefully interpret the models and measures obtained from the data.
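    The coalescing part of Step I can be illustrated with a small time-window heuristic: reports of the same error type that arrive within a fixed window of the previous report are merged into one event. The slides do not prescribe a coalescing rule, so the windowing rule, the record layout and the function name below are assumptions.

```python
def coalesce(reports, window):
    """Merge repeated error reports into events.

    reports: iterable of (timestamp, error_type) pairs.
    window:  a report joins the open event of its type if it arrives no
             later than `window` time units after that event's last report.
    """
    events = []
    open_event = {}  # error_type -> index of its most recent event
    for ts, kind in sorted(reports):
        idx = open_event.get(kind)
        if idx is not None and ts - events[idx]["last"] <= window:
            events[idx]["last"] = ts
            events[idx]["count"] += 1
        else:
            events.append({"type": kind, "first": ts, "last": ts, "count": 1})
            open_event[kind] = len(events) - 1
    return events

# Hypothetical field log: a burst of disk errors, one memory error,
# and a second disk burst much later.
log = [(0, "disk"), (2, "disk"), (3, "mem"), (4, "disk"),
       (100, "disk"), (101, "disk")]
events = coalesce(log, window=5)
for event in events:
    print(event)
```

    The coalesced events, rather than the raw reports, would then feed the model identification and estimation in Step II.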

    FI Environment & Implementation Methods

    Fault models

    HW: open, bridging, bit-flip, spurious, power-surge, stuck-at faults

    SW: storage data corruption (register, memory, disk), communication data corruption (bus, network), manifestation of SW defects (machine level and higher levels)
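    A few of these fault models can be expressed as small functions over integer bit vectors. This is a sketch under assumptions: the function names are illustrative, and the bridging fault is modeled as a wired-AND of two lines, which is only one of several possible bridging behaviors.

```python
def bit_flip(word: int, bit: int) -> int:
    """Bit-flip fault: invert one bit of the word."""
    return word ^ (1 << bit)

def stuck_at(word: int, bit: int, value: int) -> int:
    """Stuck-at fault: force one bit to 0 or 1 regardless of its real value."""
    return word | (1 << bit) if value else word & ~(1 << bit)

def bridge_and(word: int, bit_a: int, bit_b: int) -> int:
    """Bridging fault (assumed wired-AND): both lines carry a AND b."""
    shorted = (word >> bit_a) & (word >> bit_b) & 1
    return stuck_at(stuck_at(word, bit_a, shorted), bit_b, shorted)

def corrupt_buffer(buf: bytearray, offset: int, bit: int) -> None:
    """Storage data corruption: flip one bit of a byte in place."""
    buf[offset] ^= 1 << bit

print(bin(bit_flip(0b1010, 0)))       # 0b1011
print(bin(stuck_at(0b1010, 1, 0)))    # 0b1000
print(bin(stuck_at(0b1010, 2, 1)))    # 0b1110
print(bin(bridge_and(0b1010, 1, 0)))  # 0b1000
```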

    Fault Injection Targets: Where to Inject?

    Various Fault Injection Approaches

    Physical fault injection

    EMI, radiation

    Simulated fault injection

    Injections into VHDL-model

    Hardware fault-injection

    Pin-level injection

    Scan chains

    Software implemented fault injection (SWIFI)

    Bit-flips, mutations

    Code and data segments

    Interfaces


    Physical Fault Injection

    Reproduce extreme environmental conditions

    EMI/Radiation

    Heat/Shock

    Voltage drops/spikes etc

    Advantages

    Real/actual faults

    Tangible

    Simple experiments

    Disadvantages

    Difficult to control/repeat

    Needs at least a prototype

    Simulation-based Fault Injection

    Using a model of the system

    VHDL

    MatLab

    SystemC

    Spice

    Advantages

    Usable during design

    Controllable

    Disadvantages

    Requires a model

    Model accuracy

    Slow


    Simulated Fault Injection

    Electrical level (electrical circuits, physical process): change current, change voltage

    Logical level (logic gates, logic operation): stuck at 0 or 1, inverted fault

    Functional level (functional units): change CPU register, flip memory bits, etc.

    Hardware-based Fault Injection

    Inject faults using hardware (similar to physical)

    Pin-level injection

    Scan chains

    Advantages

    Controllable

    Close to real faults

    Disadvantages

    Requires special equipment

    Reachability


    HW Fault-Injection (1/2)

    HW FI with contact: In pin-level injection, the injector has direct physical contact with the target system, producing voltage and current changes externally to the target chip.

    Active probes: add current via probes attached to the pins. Usually limited to stuck-at faults, although bridging faults can also be handled.

    Socket insertion: inserts a socket between the target hardware and its circuit board. Can inject stuck-at, open, or more complex logic faults (inverted, ANDed, ORed) into the target hardware.

    Provides good controllability of fault times and locations with little or no perturbation to the target system.

    HW Fault-Injection (2/2)

    HW FI without contact: The injector has no direct physical contact with the target system.

    An external source produces some physical phenomenon, such as heavy-ion radiation or electromagnetic interference, causing spurious currents inside the target chip.

    It is difficult to control the exact time and location of a fault injection because one cannot precisely control the exact moment of heavy-ion emission or electromagnetic field creation.


    Why Inject SW Faults?

    Software faults are most probably the major cause of computer system outages

    Goals:

    Experimental risk assessment in component-based software development

    Dependability evaluation of COTS components

    Robustness testing

    Fault tolerance layer evaluation

    Dependability benchmarking

    Software Implemented Fault Injection (SWIFI)

    A testing technique that aids in understanding how SW behaves when stressed in unusual ways.

    Variations in the technique allow it to be applied to many types of SW and for different purposes.

    Manipulate bits in memory locations and registers

    Emulation of HW faults

    Change text segment of processes

    Emulation of SW faults (bugs, defects)

    Dynamic: E.g., Op-code switch during operation

    Static: Change source code and recompile (mutation)

    Usage of SWIFI

    Finding defects in software

    Robustness Testing

    COTS Validation/Determining failure modes

    Safety Verification

    Security Assessment

    Software Testability Analysis

    SWIFI

    Attractive because it does not require expensive hardware

    When the target is an application, the fault injector is inserted into the application itself or layered between the application and the OS.

    When the target is the OS, the fault injector must be embedded in the OS.

    Shortcomings:

    Cannot inject faults into locations inaccessible to SW

    SW probes may alter the workload running on the target or even change the structure of the program

    The poor time resolution of the approach may cause fidelity problems. It is acceptable for long-latency faults (memory faults) but problematic for short-latency faults (bus or CPU faults)

    Characterization of SWIFI methods

    Compile-time injection: the program instruction is modified before the program image is loaded and executed

    Run-time injection: during run-time, a mechanism is needed to trigger FI


    SWIFI: Compile-time Injection

    Rather than injecting faults into the HW of the target system, inject errors into the source code or assembly code of the target program to emulate the effect of HW, SW, and transient faults.

    The modified code alters the target program instructions. The injection generates an erroneous software image, and when the system executes the faulty image, it activates the fault.

    Requires modification of the program that will evaluate the fault effect, but requires no additional software during run-time.

    Causes no perturbation to the target system during execution.

    As the fault effect is hard-coded, it can be used to emulate permanent faults.
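    One way to picture compile-time injection is source-level mutation with Python's standard ast module: the program text is altered before it is compiled, and the resulting faulty image needs no injector at run-time. The target function in_range, the mutation operator FlipLessThan and the helper build are hypothetical names chosen for this sketch.

```python
import ast

# Hypothetical target program, held as source text.
SOURCE = """
def in_range(x, low, high):
    return low <= x and x < high
"""

class FlipLessThan(ast.NodeTransformer):
    """Mutation operator: replace every '<' with '<=' (a condition-check fault)."""

    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [ast.LtE() if isinstance(op, ast.Lt) else op
                    for op in node.ops]
        return node

def build(source, mutate=False):
    """Compile the source, optionally injecting the fault before compilation."""
    tree = ast.parse(source)
    if mutate:
        tree = ast.fix_missing_locations(FlipLessThan().visit(tree))
    namespace = {}
    exec(compile(tree, "<target>", "exec"), namespace)
    return namespace["in_range"]

golden = build(SOURCE)
faulty = build(SOURCE, mutate=True)

# The fault is present in every run of the faulty image, but it is only
# activated when the workload exercises the boundary value.
print(golden(5, 0, 10), faulty(5, 0, 10))    # True True
print(golden(10, 0, 10), faulty(10, 0, 10))  # False True
```

    The second call shows why the activation set matters: without an input at the boundary, the hard-coded fault would never manifest as an error.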

    SWIFI: Run-time Injection

    A mechanism is needed to trigger fault injection at run-time.

    Triggering mechanisms include:

    Timeout: the timeout event generates an interrupt to invoke fault injection.

    The timer can be a HW or SW timer.

    Since it injects faults on the basis of time rather than specific events or system state, it produces unpredictable fault effects and program behavior.

    Can emulate transient and intermittent HW faults.

    Exception/trap: when a HW exception occurs or a SW trap instruction inserted into the target application executes, an interrupt is generated that transfers control to an interrupt handler, which acts as the fault injector.

    It can inject a fault whenever certain events or conditions occur.

    Code insertion: instructions are added to the target program that allow fault injection to occur before particular instructions.

    Performs fault injection at run-time and adds instructions instead of changing the existing code.

    The fault injector may exist as part of the target program and run in user mode rather than system mode.
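    Two of these triggers can be imitated at user level in Python. The sketch below is an analogy rather than a real SWIFI tool: code insertion is modeled by a decorator that runs injector code before the target function, and the timeout trigger by a software timer from the standard threading module. All names (inject_before, scale, state, flip_lsb) are hypothetical.

```python
import functools
import threading

def inject_before(trigger, corrupt):
    """Code insertion: run injector logic before each call of the target."""
    def decorator(func):
        calls = 0

        @functools.wraps(func)
        def wrapper(*args):
            nonlocal calls
            calls += 1
            if trigger(calls, args):   # event-based trigger
                args = corrupt(args)   # corrupt the operands, not the code
            return func(*args)
        return wrapper
    return decorator

# Flip the least significant bit of the first operand on the third call only,
# which emulates a transient fault.
@inject_before(trigger=lambda n, args: n == 3,
               corrupt=lambda args: (args[0] ^ 0x01,) + args[1:])
def scale(x, factor):
    return x * factor

results = [scale(4, 2) for _ in range(4)]
print(results)

# Timeout trigger: a software timer invokes the injector after a delay,
# independent of what the workload is doing at that moment.
state = {"sensor": 0b0100}

def flip_lsb():
    state["sensor"] ^= 0b0001

timer = threading.Timer(0.01, flip_lsb)
timer.start()
timer.join()   # wait for the injection so the final state can be inspected
print(bin(state["sensor"]))
```

    The decorator variant fires at a well-defined point in the program, whereas the timer variant fires at a point in time, which is why time-based injection tends to give less predictable effects.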


    Summary of Techniques for SWIFI

    SW Fault-Injection Techniques

    Type            Method
    SW fault        Modify the text segment of the program
    SW error        Modify the data segment of the program
    Memory fault    Flip memory bits
    CPU fault       Use a trap to modify the memory area of the saved CPU register
    Bus fault       Use traps before and after an instruction to change the code or data used by the instruction, then restore them after the instruction is executed
    Network fault   Modify or delete transmitted messages
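    As an illustration of the last row of the table (modifying or deleting transmitted messages), the sketch below wraps message delivery in a faulty channel. The class name FaultyChannel, its rate parameters and the single-bit corruption model are assumptions made for this example.

```python
import random
from typing import Optional

class FaultyChannel:
    """Delivers messages, but may delete or modify them on the way."""

    def __init__(self, drop_rate: float, corrupt_rate: float, seed: int = 0):
        self.drop_rate = drop_rate
        self.corrupt_rate = corrupt_rate
        self.rng = random.Random(seed)   # seeded so campaigns are repeatable

    def send(self, payload: bytes) -> Optional[bytes]:
        """Return the payload as the receiver would see it, or None if lost."""
        roll = self.rng.random()
        if roll < self.drop_rate:
            return None                                  # message deleted
        if roll < self.drop_rate + self.corrupt_rate and payload:
            damaged = bytearray(payload)
            index = self.rng.randrange(len(damaged))
            damaged[index] ^= 1 << self.rng.randrange(8)  # flip one bit
            return bytes(damaged)                        # message modified
        return payload                                   # delivered intact

lossy = FaultyChannel(drop_rate=1.0, corrupt_rate=0.0)
noisy = FaultyChannel(drop_rate=0.0, corrupt_rate=1.0)
clean = FaultyChannel(drop_rate=0.0, corrupt_rate=0.0)

print(lossy.send(b"ping"))
print(noisy.send(b"ping"))
print(clean.send(b"ping"))
```

    Seeding the random generator keeps an injection campaign reproducible, which matters when a particular failure needs to be replayed.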


    Many Tools Available

    DEPEND, MEFISTO

    Evaluating HW/SW architectures using simulations

    FERRARI, DOCTOR, RIFLE, Xception, FIST, Messaline

    Evaluate tolerance against HW faults

    DEFINE, FIAT, FTAPE

    Evaluate tolerance against HW and SW faults

    MAFALDA, NFTAPE, PROPANE

    Evaluate effects of HW & SW faults and analyze error propagation

    Ballista

    OS Robustness testing

    DEPEND

    Provides a library of objects to behaviorally model a system's hardware components; using these objects, a control program written in C++ simulates system operation and models system SW.

    The objects automatically inject faults, initiate repairs, and compile statistics.

    Permanent, transient, and user-defined faults can be injected with latency or at correlated times.

    FI scheme based on workload.


    Messaline

    Uses both active probes and sockets to conduct pin-level FI

    Can inject stuck-at, open, bridging, and complex logical faults

    The injection, activation, and collection modules are implemented in HW; the SW management module resides on a PC

    Signals collected from the target system can provide feedback to the injector.

    A device is associated with each injection point to sense when and if each fault is activated and produces an error.

    FIST: Fault Injection System for Study of Transient Fault Effect

    Employs both contact and contact-less methods to create transient faults

    Uses heavy-ion radiation to create transient faults at random locations inside a chip.

    The radiation source sits inside a vacuum chamber with two small processors (Ref and Test CPU)

    In addition to radiation, FIST allows injection of power disturbance faults (to cause gate propagation delay faults)


    Xception

    Takes advantage of the advanced debugging and performance monitoring features present in many modern processors to inject more realistic faults.

    Uses a processor's built-in hardware exception triggers to trigger fault injection. The fault injector is implemented as an exception handler and requires modification of the interrupt handler vector.

    Events that can trigger fault injection include: an opcode fetch from a specified address, an operand load from a specified address, an operand store to a specified address, and a specified time elapsing since start-up.

    Each fault has a specified fault mask: a set of bits that determines which corresponding bits in the target location will be injected.
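    The combination of an address-based trigger and a fault mask can be mimicked on a simulated memory. This is not Xception's actual implementation: the class InstrumentedMemory, its method names, the one-shot trigger and the choice of XOR (bit-flip) as the mask operation are all assumptions for illustration.

```python
class InstrumentedMemory:
    """Byte-addressable memory with a fault trigger on operand loads."""

    def __init__(self, size: int):
        self.cells = [0] * size
        self.load_triggers = {}   # address -> fault mask

    def arm_on_load(self, address: int, mask: int) -> None:
        """Arm a one-shot fault: the next load from `address` is corrupted."""
        self.load_triggers[address] = mask & 0xFF

    def store(self, address: int, value: int) -> None:
        self.cells[address] = value & 0xFF

    def load(self, address: int) -> int:
        value = self.cells[address]
        mask = self.load_triggers.pop(address, None)  # disarm after firing
        if mask is not None:
            value ^= mask   # flip exactly the bits selected by the mask
        return value

mem = InstrumentedMemory(16)
mem.store(3, 0b11110000)
mem.arm_on_load(3, 0b00000101)

first = mem.load(3)    # trigger fires: bits 0 and 2 are flipped
second = mem.load(3)   # trigger already consumed: clean value
print(bin(first), bin(second))
```

    Because the stored cell itself is never modified, the fault behaves as a transient error on the load path; replacing the XOR with an AND or OR of the mask would instead approximate stuck-at behavior.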

    Characteristics of Fault Injection Methods


    Key Issues in Fault Injection

    Effective fault injection mechanisms using hardware, software, and hybrid technology to accurately assess and validate networked systems

    Practical evaluation methods to accurately quantify fault effects and recovery mechanisms in complex environments

    Evaluation of error detection, diagnosis, and recovery techniques

    Quantification of confidence in fault-injection-based validation

    Usable fault tolerance benchmarks for assessing systems and networks

    Common evaluation/validation framework

    References

    1. R.K. Iyer, D. Tang, Experimental Analysis of Computer System Dependability, Chapter 5, Fault-Tolerant Computer System Design, edited by D.K. Pradhan, Prentice Hall, 1994.

    2. J. Clark, D.K. Pradhan, Fault-Injection: A Method for Validating Computer-System Dependability, IEEE Computer, pp. 47-56, June 1995.

    3. M-C. Hsueh, T. Tsai, R.K. Iyer, Fault-Injection Techniques and Tools, IEEE Computer, pp. 75-82, April 1997.

    Look for references to other tools/techniques in these papers and the book chapter.