fault injection

7/31/2019 Fault Injection


4/8/20

Assessing System Dependability

Purnendu Sinha

General Motors R&D, India Science Lab

ITPL, [email protected]

Fault-Injection (FI)

Fault injection is important for evaluating the dependability of computer systems.

Simulation-based FI is often used to evaluate the dependability of a system that is still in the concept and design phases.

FI is a technique that injects faults and creates errors/failures at the HW, SW or HW & SW levels.


Taxonomy of Faults

FAULTS: Physical Faults, Design Faults, Interaction Faults

Faults Induced by the User

Software Faults: Initialization Faults, Assignment Faults, Condition Check Faults, Function Faults, Documentation Faults

Hardware Faults: Memory Faults, CPU Faults, Bus Faults, I/O Faults


Foundation


When fault injection is applied to a target system, the input domain corresponds to a set of faults F and a set of activations A that specifies the domain used to functionally exercise the system.

The output domain corresponds to a set of readouts R and a set of derived measures M.

Together, the FARM sets constitute the major attributes that can be used to fully characterize fault injection.

Level of Abstraction of FI

Axiomatic models: the analytical models used to model the structure and the dependability and/or performance behavior of the system, such as Reliability Block Diagrams, Fault Trees, Petri nets, etc.

Empirical models: models that incorporate more complex or detailed behavioral and/or structural descriptions.

Physical models: prototypes that actually implement the hardware and/or software features of the developed system.



Foundation (Contd.) - Impact of models on the FARM attributes.


The F set: In axiomatic models, it is described by stochastic processes whose parameters are characterized by probability distributions. In empirical models, which use realistic distributions for the parameters, examples include fault simulation methods at the component, gate, circuit or system level. Physical models (prototypes) may be SW, HW or HW-SW: for SW, the F set corresponds to simple alterations in the source code of programs; for HW or HW-SW, the F set is mainly based on physical faults.

The A set: In axiomatic models, the A set is described by stochastic processes. In empirical models, the A set describes the behavior of the system in a form where elementary parameters can be more appropriately identified and assigned. For physical models that are SW only or HW only, the A set consists of test data patterns aimed at exercising the injected faults; for combined SW and HW, it is application-dependent.


The R & M sets: In axiomatic models, the M set corresponds to dependability measures such as reliability, MTTF, etc. For both empirical and physical models, the measures in M can be obtained only experimentally from a series of fault injection case studies.

For each experiment, a fault f is selected in F and an activation trajectory a is described in A. The reactions of the system are observed and form a readout r that fully characterizes the outcome of the experiment.

A fault-injection experiment is characterized by the triple (f, a, r). The readouts r of the individual experiments form a global readout set R for the test sequence, which can be used to derive a measure in M.
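The FARM attributes can be made concrete with a small sketch. Everything in it is an illustrative assumption rather than part of the slides: the toy target system (a sum over four 7-bit words guarded by a range check), the fault model (single bit flips), and the names `Fault`, `run_system` and `run_experiment`. It only shows how F, A, R and M relate to one another in a fault-injection campaign.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Fault:
    """One element f of the fault set F: flip one bit of one input word."""
    word_index: int
    bit: int


def run_system(activation, fault=None):
    """Hypothetical target: sums the words; a range check flags any word > 100."""
    words = list(activation)
    if fault is not None:
        words[fault.word_index] ^= 1 << fault.bit  # inject the bit flip
    detected = any(w > 100 for w in words)
    return detected, sum(words)


def run_experiment(fault, activation):
    """Return the readout r for one (f, a) pair by comparing with a fault-free run."""
    _, golden_sum = run_system(activation)
    detected, faulty_sum = run_system(activation, fault)
    if detected:
        return "detected"
    return "undetected_failure" if faulty_sum != golden_sum else "no_effect"


# F: every single-bit flip in four 7-bit words; A: three activation patterns.
F = [Fault(i, b) for i in range(4) for b in range(7)]
A = [(10, 20, 30, 40), (90, 95, 100, 5), (64, 64, 64, 64)]

# R: one (f, a, r) triple per experiment; M: a measure derived from R.
R = [(f, a, run_experiment(f, a)) for f, a in product(F, A)]
counts = Counter(r for _, _, r in R)
M = {"detection_coverage": counts["detected"] / len(R)}
print(counts, M)
```

Running every pair in F x A is feasible only because the toy sets are tiny; a real campaign would normally sample from F and A instead.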


Basics of Fault Injection

Where: to apply change (location, abstraction/system level)

What: to inject (what should be injected/corrupted?)

Which: trigger to use (event, instruction, timeout, exception,code mutation)

When: to inject (corresponding to type of faults)

How: often to inject (corresponding to type of faults)

What to record and interpret? What is the purpose?

How is the system loaded at the time of the injection?

Workload (real, realistic, synthetic)

System resources

Simulation time explosion

When too much detail is simulated

When extremely small failure probabilities require large simulation runs
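A rough calculation shows why small failure probabilities lead to large simulation runs. The sketch below assumes independent runs and uses the standard normal approximation for a binomial proportion; the function name `runs_needed` and the 10% relative-error target are illustrative choices, not values from the slides.

```python
import math


def runs_needed(p, rel_error=0.1, z=1.96):
    """Approximate number of independent runs needed so that the confidence
    interval half-width (normal approximation, z = 1.96 for about 95%)
    is rel_error * p when the true failure probability is p."""
    return math.ceil(z * z * (1.0 - p) / (rel_error * rel_error * p))


for p in (1e-2, 1e-4, 1e-6):
    print(f"p = {p:g}: about {runs_needed(p):,} runs")
```

The required number of runs grows roughly as 1/p, which is why rare failures call for techniques other than brute-force simulation.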


Coverage and Latency

The aim is to find characteristics of Event X

Event X may be detection, recovery, etc.

Coverage of Event X

Conditional probability of Event X occurring

E.g. probability of error detection given that an error exists in the system

Latency of Event X

Time from the earliest possible occurrence of Event X to the actual monitored occurrence

E.g. time from error occurrence to error detection
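As a minimal sketch of how these two quantities might be estimated from experiment logs, assume each record holds the time an error occurred and the time it was detected (or `None` if it was never detected). The record layout, the sample values and the function names are hypothetical.

```python
from statistics import mean

# Hypothetical records: (error_occurrence_time, detection_time or None), in seconds.
records = [(0.0, 0.4), (1.0, 1.1), (2.0, None), (3.0, 3.5)]


def coverage(records):
    """Estimate P(detection | error exists) as the fraction of detected errors."""
    return sum(1 for _, detected_at in records if detected_at is not None) / len(records)


def latencies(records):
    """Detection latency of each detected error: detection time minus error time."""
    return [detected_at - error_at for error_at, detected_at in records
            if detected_at is not None]


print("coverage:", coverage(records))
print("mean latency:", mean(latencies(records)))
```

Undetected errors contribute to coverage but have no latency, so the latency statistics describe detected errors only.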


Experimental Analysis: Prototype Phase

Approach and Goals

System runs under controlled workload

Controlled fault injections are used to evaluate the system in the presence of faults

Information produced

The failure process from fault occurrence to system recovery: error latency, propagation, detection and recovery (may include reconfiguration)

Limitations/Issues

Can only study artificial faults; injected faults should induce failure scenarios representative of actual system operation


Expt. Analysis: Operational Phase

Approach and Goals

Study naturally occurring errors

Measure systems in the field under real workloads

Analyze collected error/failure and performance data

Information produced

Actual error/failure characteristics and insight into analytical models (failure rates, time-to-failure distributions)

Limitations/Issues

HW/SW instrumentation, analysis tools

Approach limited to detected errors; conditions in the field can vary widely


Operational Phase (Contd.)

Measurement-based analysis uses actual data, which contains much information about naturally occurring errors and failures, and sometimes about recovery attempts.

Given field error data collected from a system, the study consists of four steps:

Step I: extract the necessary information from the field data, classify errors and failures, and coalesce repeated error reports.

Step II: identify appropriate models and estimate various measures of interest from the coalesced data.

Step III: solve these models to obtain dependability measures.

Step IV: carefully interpret the models and measures obtained from the data.
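Steps I and II can be sketched on a toy error log. The log format of (timestamp, error type) pairs, the 60-second coalescing window and the function names are assumptions made for illustration; real field data needs system-specific parsing and a justified choice of window.

```python
def coalesce(reports, window=60.0):
    """Step I (sketch): merge reports of the same error type that arrive within
    `window` seconds of the previous report of that type into one event."""
    events = []
    last_seen = {}
    for timestamp, kind in sorted(reports):
        if kind not in last_seen or timestamp - last_seen[kind] > window:
            events.append((timestamp, kind))
        last_seen[kind] = timestamp
    return events


def mean_time_between(events):
    """Step II (sketch): estimate the mean time between coalesced error events."""
    times = [t for t, _ in events]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return sum(gaps) / len(gaps) if gaps else None


reports = [(0, "disk"), (10, "disk"), (50, "disk"),
           (500, "mem"), (520, "mem"), (2000, "disk")]
events = coalesce(reports)
print(events, mean_time_between(events))
```

Without coalescing, the six raw reports would look like six independent errors and would badly underestimate the time between errors.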


FI Environment & Implementation Methods

Fault models

HW: open, bridging, bit-flip, spurious, power-surge, stuck-at faults

SW: storage data corruption (register, memory, disk), communication data corruption (bus, network), manifestation of SW defects (machine level and higher levels)


Fault Injection Targets: Where to Inject?


Various Fault Injection Approaches

Physical fault injection

EMI, radiation

Simulated fault injection

Injections into VHDL-model

Hardware fault-injection

Pin-level injection

Scan chains

Software implemented fault injection (SWIFI)

Bit-flips, mutations

Code and Data segments

Interfaces


Physical Fault Injection

Reproduce extreme environmental conditions

EMI/Radiation

Heat/Shock

Voltage drops/spikes, etc.

Advantages

Real/actual faults

Tangible

Simple experiments

Disadvantages

Difficult to control/repeat

Needs at least a prototype


Simulation-based Fault Injection

Using a model of the system

VHDL

MATLAB

SystemC

SPICE

Advantages

Usable during design

Controllable

Disadvantages

Requires a model

Model accuracy

Slow


Simulated Fault Injection

Fault injection at three levels:

Electrical level (physical process; electrical circuits): change current, change voltage

Logical level (logic operation; logic gates): stuck at 0 or 1, inverted fault

Functional level (functional units): change CPU register, flip memory bits, etc.
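At the logical level, a stuck-at fault can be simulated by forcing one net of a gate-level model to a constant and comparing the output with the fault-free circuit. The three-input circuit, the net names `n1` and `out`, and the helper `detecting_vectors` below are made up for this sketch.

```python
from itertools import product


def circuit(a, b, c, stuck=None):
    """Toy gate-level model: out = (a AND b) OR c.
    `stuck` optionally forces one net to a constant, e.g. ("n1", 0)."""
    def net(name, value):
        return stuck[1] if stuck and stuck[0] == name else value

    n1 = net("n1", a & b)      # internal net driven by the AND gate
    return net("out", n1 | c)  # primary output driven by the OR gate


def detecting_vectors(stuck):
    """Input vectors for which the stuck-at fault changes the circuit output."""
    return [v for v in product((0, 1), repeat=3)
            if circuit(*v) != circuit(*v, stuck=stuck)]


print("n1 stuck-at-0 is visible for:", detecting_vectors(("n1", 0)))
print("n1 stuck-at-1 is visible for:", detecting_vectors(("n1", 1)))
```

Vectors that do not appear in the result mask the fault, which is one reason the choice of activation patterns matters as much as the choice of faults.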


Hardware-based Fault Injection

Inject faults using hardware (similar to physical)

Pin-level injection

Scan chains

Advantages

Controllable

Close to real faults

Disadvantages

Requires special equipment

Reachability


HW Fault-Injection (1/2)

HW FI with contact: In pin-level injection, the injector has direct physical contact with the target system, producing voltage and current changes externally to the target chip.

Active probes: add current via probes attached to the pins; usually limited to stuck-at faults, although bridging faults can also be handled.

Socket insertion: inserts a socket between the target hardware and its circuit board; can inject stuck-at, open, or more complex logic faults (inverted, ANDed, ORed) into the target hardware.

Provides good controllability of fault times and locations with little or no perturbation to the target system.


HW Fault-Injection (2/2)

HW FI without contact: The injector has no direct physical contact with the target system.

An external source produces some physical phenomenon, such as heavy-ion radiation or electromagnetic interference, causing spurious currents inside the target chip.

It is difficult to control the exact time and location of a fault injection, because one cannot precisely control the exact moment of heavy-ion emission or electromagnetic field creation.


Why Inject SW Faults?

Software faults are most probably the major cause of computer system outages

Goals:

Experimental risk assessment in component-based software development

Dependability evaluation of COTS components

Robustness testing

Fault tolerance layer evaluation

Dependability benchmarking


Software Implemented Fault Injection (SWIFI)

A testing technique that aids in understanding how SW behaves when stressed in unusual ways.

Variations in the technique allow it to be applied to many types of SW and for different purposes.

Manipulate bits in memory locations and registers

Emulation of HW faults

Change text segment of processes

Emulation of SW faults (bugs, defects)

Dynamic: E.g., Op-code switch during operation

Static: Change source code and recompile (mutation)
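Manipulating bits in memory to emulate a HW fault can be illustrated with a byte buffer standing in for a memory region. The helper `flip_bit` and the 32-bit little-endian variable are assumptions of this sketch; a real SWIFI tool would write to the address space or registers of the target process instead.

```python
import struct


def flip_bit(buffer: bytearray, bit_index: int) -> None:
    """Flip a single bit of the buffer in place (bit 0 is the LSB of byte 0)."""
    byte, bit = divmod(bit_index, 8)
    buffer[byte] ^= 1 << bit


# A 32-bit little-endian integer playing the role of a variable in memory.
memory = bytearray(struct.pack("<I", 1000))
flip_bit(memory, 3)
corrupted = struct.unpack("<I", memory)[0]
print("value after the bit flip:", corrupted)
```

Because XOR is its own inverse, flipping the same bit again restores the original value, which is convenient for emulating transient faults.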


Usage of SWFI

Finding defects in software

Robustness Testing

COTS Validation/Determining failure modes

Safety Verification

Security Assessment

Software Testability Analysis


SWIFI

Attractive because it does not require expensive hardware

If the target is an application, the fault injector is inserted into the application itself or layered between the application and the OS.

If the target is the OS, the fault injector must be embedded in the OS.

Shortcomings:

Cannot inject faults into locations inaccessible to SW

SW probes may alter the workload running on the target or even change the structure of the program

The poor time resolution of the approach may cause fidelity problems. It is acceptable for long-latency faults (memory faults) but problematic for short-latency faults (bus or CPU faults)

Characterization of SWIFI methods

Compile-time injection: the program instructions are modified before the program image is loaded and executed

Run-time injection: during run-time, a mechanism is needed to trigger FI


SWIFI: Compile-time Injection

Rather than injecting faults into the HW of the target system, inject errors into the source code or assembly code of the target program to emulate the effect of HW, SW and transient faults

The modified code alters the target program instructions, which produces an erroneous software image; when the system executes this faulty image, it activates the fault.

Requires modification of the program that will evaluate the fault effect, but requires no additional software during run-time.

Causes no perturbation to the target system during execution.

As the fault effect is hard-coded, it can be used to emulate permanent faults.
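A compile-time injection can be sketched as a source-level mutation applied before the program image is built. The example uses Python's standard `ast` module to turn every `<` into `<=`, which emulates a condition check fault; the target function `in_range` and the class name `LtToLtE` are invented for illustration, and real tools typically mutate C source or assembly code.

```python
import ast

SOURCE = """
def in_range(x, limit):
    return x < limit
"""


class LtToLtE(ast.NodeTransformer):
    """Mutation operator: replace every '<' comparison with '<='."""

    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [ast.LtE() if isinstance(op, ast.Lt) else op for op in node.ops]
        return node


def build(source, mutate=False):
    """Parse the source, optionally mutate it, then compile and load the image."""
    tree = ast.parse(source)
    if mutate:
        tree = ast.fix_missing_locations(LtToLtE().visit(tree))
    namespace = {}
    exec(compile(tree, "<target>", "exec"), namespace)
    return namespace["in_range"]


original = build(SOURCE)
faulty = build(SOURCE, mutate=True)
print(original(5, 5), faulty(5, 5))
```

The fault is baked into the compiled image, so no injector runs alongside the target; the fault is activated only by inputs that reach the boundary case.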


SWIFI: Run-time Injection

A mechanism is needed to trigger fault injection at run-time.

Triggering mechanisms include:

Timeout: the timeout event generates an interrupt to invoke fault injection

The timer can be a HW or SW timer

Since it injects faults on the basis of time rather than specific events or system state, it produces unpredictable fault effects and program behavior

Can emulate transient and intermittent HW faults

Exception/trap: when a HW exception occurs, or when a SW trap instruction inserted into the target application executes, an interrupt is generated that transfers control to an interrupt handler, which is in effect the fault injector.

It can inject a fault whenever certain events or conditions occur.

Code insertion: instructions are added to the target program that allow fault injection to occur before particular instructions.

Performs fault injection at run-time and adds instructions instead of changing the code

The fault injector may exist as part of the target program and run in user mode rather than system mode
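The code-insertion style of run-time injection can be approximated in a high-level language by wrapping a target function so that extra instructions run before it. In the sketch below the trigger is an event (the n-th call) rather than a timer, and the names `inject_on_call` and `flip_lsb_of_first` are hypothetical helpers, not part of any tool named in these slides.

```python
import functools


def inject_on_call(n, corrupt):
    """Wrap a function so that `corrupt` alters its arguments on the n-th call only."""
    def decorator(func):
        calls = 0

        @functools.wraps(func)
        def wrapper(*args):
            nonlocal calls
            calls += 1
            if calls == n:            # trigger: a specific event, not a timeout
                args = corrupt(args)  # inserted code runs before the target code
            return func(*args)
        return wrapper
    return decorator


def flip_lsb_of_first(args):
    """Fault model: flip the least significant bit of the first argument."""
    return (args[0] ^ 1,) + args[1:]


@inject_on_call(3, flip_lsb_of_first)
def add(a, b):
    return a + b


results = [add(2, 2) for _ in range(5)]
print(results)
```

The injector here lives inside the target program and runs with the same privileges, and because it fires on one call only it behaves like a transient fault.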


Summary of Techniques for SWIFI

Type             Method

SW Fault         Modify the text segment of the program

SW Error         Modify the data segment of the program

Memory Fault     Flip memory bits

CPU Fault        Use a trap to modify the memory area of the saved CPU register

Bus Fault        Use traps before and after an instruction to change the code or data used by the instruction, and then restore them after the instruction is executed

Networked Fault  Modify or delete transmitted messages
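The last technique in the list above, modifying or deleting messages, can be sketched as a faulty channel placed between sender and receiver. The function `faulty_channel`, its probability parameters and the single-bit corruption model are assumptions for illustration only.

```python
import random


def faulty_channel(messages, drop_prob, corrupt_prob, rng):
    """Deliver messages through a channel that drops some messages and flips
    one random bit in others. `rng` is a random.Random instance so that a
    campaign can be repeated with the same seed."""
    delivered = []
    for msg in messages:
        roll = rng.random()
        if roll < drop_prob:
            continue                                   # delete the message
        if roll < drop_prob + corrupt_prob:
            data = bytearray(msg)
            data[rng.randrange(len(data))] ^= 1 << rng.randrange(8)
            msg = bytes(data)                          # modify the message
        delivered.append(msg)
    return delivered


sent = [b"abc", b"hello", b"fault"]
received = faulty_channel(sent, drop_prob=0.2, corrupt_prob=0.3, rng=random.Random(1))
print(received)
```

Seeding the generator makes each experiment repeatable, which helps when a run that exposed a failure needs to be replayed.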

SW Fault-Injection Techniques


Many Tools Available

DEPEND, MEFISTO

Evaluating HW/SW architectures using simulations

FERRARI, DOCTOR, RIFLE, Xception, FIST, Messaline

Evaluate tolerance against HW faults

DEFINE, FIAT, FTAPE

Evaluate tolerance against HW and SW faults

MAFALDA, NFTAPE, PROPANE

Evaluate effects of HW & SW faults and analyze error propagation

Ballista

OS Robustness testing


DEPEND

Provides a library of objects to behaviorally model a system's hardware components; using these objects, a control program written in C++ simulates system operation and models system SW

The objects automatically inject faults, initiate repairs, and compile statistics.

Permanent, transient, and user-defined faults can be injected with latency or at correlated times.

FI scheme based on workload.


Messaline

Uses both active probes and sockets to conduct pin-level FI

Can inject stuck-at, open, bridging, and complex logical faults

The injection, activation and collection modules are implemented in HW; the SW management module resides on a PC

Signals collected from the target system can provide feedback to the injector.

A device is associated with each injection point to sense when and if each fault is activated and produces an error.

FIST: Fault Injection System for Study of Transient Fault Effect

Employs both contact and contact-less methods to create transient faults

Uses heavy-ion radiation to create transient faults at random locations inside a chip.

The radiation source is placed inside a vacuum chamber with two small processors (Ref and Test CPU)

In addition to radiation, FIST allows for injection of power disturbance faults (to cause gate propagation delay faults)



Xception

Uses a processor's built-in hardware exception triggers to trigger fault injection. The fault injector is implemented as an exception handler and requires modification of the interrupt handler vector.

Events that can trigger fault injection include: opcode fetch from a specified address, operand load from a specified address, operand store to a specified address, and a specified time elapsing since start-up.

Each fault has a specified fault mask: a set of bits that determines which corresponding bits in the target location will be injected.

Takes advantage of the advanced debugging and performance monitoring features present in many modern processors to inject more realistic faults
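The idea of a fault mask can be illustrated with plain bitwise operations. The three modes below (bit flip, stuck-at-1, stuck-at-0) and the function `apply_mask` are an illustrative assumption about how a mask might be applied; they are not taken from the Xception tool itself.

```python
def apply_mask(value, mask, mode, width=32):
    """Corrupt only the bits of `value` selected by `mask`.
    `mode` chooses what happens to the selected bits (assumed fault models)."""
    full = (1 << width) - 1
    if mode == "flip":
        return (value ^ mask) & full
    if mode == "stuck_at_1":
        return (value | mask) & full
    if mode == "stuck_at_0":
        return value & ~mask & full
    raise ValueError(f"unknown mode: {mode}")


value, mask = 0b1010, 0b0110
for mode in ("flip", "stuck_at_1", "stuck_at_0"):
    print(mode, format(apply_mask(value, mask, mode), "04b"))
```

Bits where the mask is 0 are never touched, so one mask can describe single-bit as well as multi-bit faults in the same target location.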

Characteristics of Fault Injection Methods


Key Issues in Fault Injection

Effective fault injection mechanisms using hardware, software, and hybrid technology to accurately assess and validate networked systems

Practical evaluation methods to accurately quantify fault effects and recovery mechanisms in complex environments

Evaluation of error detection, diagnosis, and recovery techniques

Quantification of confidence in fault-injection-based validation

Usable fault tolerance benchmarks for assessing systems and networks

Common evaluation/validation framework


References

1. R.K. Iyer, D. Tang, Experimental Analysis of Computer System Dependability, Chapter 5, Fault-Tolerant Computer System Design, Edited by D.K. Pradhan, Prentice Hall, 1994.

2. J. Clark, D.K. Pradhan, Fault-Injection: A Method for Validating Computer-System Dependability, IEEE Computer, pp. 47-56, June 1995.

3. M-C. Hsueh, T. Tsai, R.K. Iyer, Fault-Injection Techniques and Tools, IEEE Computer, pp. 75-82, April 1997.

Look for references to other tools/techniques in these papers and the book chapter.
