fault injection
7/31/2019 Fault Injection
4/8/20
Assessing System Dependability
Purnendu Sinha
General Motors R&D, India Science Lab
ITPL, [email protected]
Fault-Injection (FI)
A technique that injects faults and creates errors/failures at the HW, SW, or HW & SW levels.
It is important for evaluating the dependability of computer systems.
Simulation-based FI is often used to evaluate the dependability of a system that is still in the concept and design phases.
Taxonomy of Faults
Faults: Physical Faults, Design Faults, Interaction Faults
Faults Induced by the User
Software Faults: Initialization Faults, Assignment Faults, Condition Check Faults, Function Faults, Documentation Faults
Hardware Faults: Memory Faults, CPU Faults, Bus Faults, I/O Faults
Foundation
When fault injection is considered on a target system, the input domain corresponds to a set of faults F and a set of activations A that specifies the domain used to functionally exercise the system.
The output domain corresponds to a set of readouts R and a set of derived measures M.
The FARM sets constitute the major attributes that can be used to fully characterize fault injection.
Level of Abstraction of FI
Axiomatic models: the analytical models used to model the structure and the dependability and/or performance behavior of the system, such as Reliability Block Diagrams, Fault Trees, Petri nets, etc.
Empirical models: models that incorporate more complex or detailed behavioral and/or structural descriptions.
Physical models: prototypes that actually implement the hardware and/or software features of the developed system.
Foundation (Contd.) - Impact of models on the FARM attributes
The F set: In axiomatic models, it is described by stochastic processes whose parameters are characterized by probabilistic distributions. Examples of empirical models (with realistic distributions for the parameters) include fault simulation methods at the component, gate, circuit, or system level. Physical models (prototypes) may be SW, HW, or HW-SW: for SW, the F set corresponds to simple alterations in the source code of programs; for HW or HW-SW, the F set is mainly based on physical faults.
The A set: In axiomatic models, the A set is described by stochastic processes. In empirical models, the A set describes the behavior of the system in a form where elementary parameters can be more appropriately identified and assigned. For physical models, in the SW-only and HW-only cases, the A set consists of a set of test data patterns aimed at exercising the injected faults; for combined SW and HW, it is application-dependent.
The R & M sets: In axiomatic models, the M set corresponds to dependability measures such as reliability, MTTF, etc. For both empirical and physical models, the measures in M can be obtained only experimentally from a series of fault injection case studies.
For each experiment, a fault f is selected in F and an activation trajectory a is described in A. The reactions of the system are observed and form a readout r that fully characterizes the outcome of the experiment.
A fault-injection experiment is characterized by the triple (f, a, r), where the readouts r of the individual experiments form a global set of readouts R for the test sequence and can be used to elaborate a measure in M.
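The FARM loop can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the target is a toy parity-protected byte, faults are XOR masks, and the names `encode`, `parity_ok`, and `run_experiment` are hypothetical. It only shows how triples (f, a, r) accumulate into R and how a measure in M (here, detection coverage) is derived from R.

```python
from itertools import product

def encode(data: int) -> int:
    """Toy target: store 7 data bits plus an even-parity bit in bit 7."""
    parity = bin(data & 0x7F).count("1") % 2
    return (data & 0x7F) | (parity << 7)

def parity_ok(word: int) -> bool:
    """The target's error-detection mechanism: overall parity must be even."""
    return bin(word & 0xFF).count("1") % 2 == 0

def run_experiment(f: int, a: int) -> str:
    """One experiment: activation a (data value), fault f (XOR mask) -> readout r."""
    corrupted = encode(a) ^ f
    return "undetected" if parity_ok(corrupted) else "detected"

# F: fault set (two single-bit flips, one double-bit flip); values are illustrative.
F = [0b00000001, 0b00010000, 0b00000011]
# A: activation set (data values used to exercise the target).
A = [0x00, 0x2A, 0x7F]

# R: one (f, a, r) triple per experiment.
R = [(f, a, run_experiment(f, a)) for f, a in product(F, A)]

# M: a derived measure, here the fraction of injected faults that were detected.
coverage = sum(r == "detected" for _, _, r in R) / len(R)
print(f"{len(R)} experiments, detection coverage = {coverage:.2f}")
```

In this toy setup the parity check catches every single-bit flip and misses every double-bit flip, so the measured coverage depends directly on how F is chosen.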
Basics of Fault Injection
Where: to apply the change (location, abstraction/system level)
What: to inject (what should be injected/corrupted?)
Which: trigger to use (event, instruction, timeout, exception, code mutation)
When: to inject (corresponding to the type of faults)
How often: to inject (corresponding to the type of faults)
What to record and interpret? What is the purpose?
How is the system loaded at the time of the injection?
Workload (real, realistic, synthetic)
System resources
Simulation time explosion
When too much detail is simulated
When extremely small failure probabilities require large simulation runs
Coverage and Latency
The aim is to find characteristics of EventX
EventX may be detection, recovery, etc.
Coverage of EventX
Conditional probability of EventX occurring
E.g., probability of error detection given that an error exists in the system
Latency of EventX
Time from the earliest possible occurrence of EventX to the actual monitored occurrence
E.g., time from error occurrence to error detection
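As a rough illustration of the two measures, the sketch below estimates detection coverage and detection latency from a list of experiment records. The record layout (error time, detection time or `None`) and the numbers are made up for the example; they are not taken from any real campaign.

```python
from statistics import mean

# One record per experiment: (time the error occurred, time it was detected).
# Times are in milliseconds; None means the error was never detected.
records = [(100, 104), (220, None), (350, 351), (410, 430)]

def detection_coverage(records):
    """Estimate P(detection | error exists) as detected / total experiments."""
    detected = [rec for rec in records if rec[1] is not None]
    return len(detected) / len(records)

def detection_latencies(records):
    """Latency per detected error: detection time minus error occurrence time."""
    return [detected - occurred for occurred, detected in records if detected is not None]

coverage = detection_coverage(records)
latencies = detection_latencies(records)
print(f"coverage = {coverage:.2f}, mean latency = {mean(latencies):.1f} ms")
```

Undetected errors count against coverage but contribute no latency sample, so the two measures should be reported together.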
Experimental Analysis: Prototype Phase
Approach and Goals
System runs under a controlled workload
Controlled fault injections are used to evaluate the system in the presence of faults
Information produced
The failure process from fault occurrence to system recovery: error latency, propagation, detection, and recovery (may include reconfiguration)
Limitations/Issues
Can only study artificial faults; injected faults should induce/create failure scenarios representative of actual system operation
Experimental Analysis: Operational Phase
Approach and Goals
Study naturally occurring errors
Measure systems in the field under real workloads
Analyze collected error/failure and performance data
Information produced
Actual error/failure characteristics and insight into analytical models (failure rates, time-to-failure distributions)
Limitations/Issues
HW/SW instrumentation, analysis tools
Approach limited to detected errors; conditions in the field can vary widely
Operational Phase (Contd.)
Measurement-based analysis uses actual data, which contains much information about naturally occurring errors and failures, and sometimes about recovery attempts.
Given field error data collected from a system, the study consists of four steps:
Step I: extract the necessary information from the field data, classify errors and failures, and coalesce repeated error reports.
Step II: identify appropriate models and estimate various measures of interest from the coalesced data.
Step III: solve these models to obtain dependability measures.
Step IV: carefully interpret the models and measures obtained from the data.
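Steps I and II can be sketched in a few lines of Python. The coalescing rule used here (a report that arrives within a fixed window of the previous report belongs to the same error event) is one common heuristic and is an assumption of this example, as are the timestamps and the function name `coalesce`.

```python
def coalesce(timestamps, window):
    """Step I: group repeated error reports into events.

    A report that arrives within `window` time units of the previous
    report is treated as a repeat of the same underlying error.
    """
    events = []
    for t in sorted(timestamps):
        if events and t - events[-1][-1] <= window:
            events[-1].append(t)   # repeat of the current event
        else:
            events.append([t])     # start of a new event
    return events

# Illustrative error-log timestamps (arbitrary time units).
reports = [0, 2, 3, 100, 101, 250, 400, 402]
events = coalesce(reports, window=5)

# Step II: estimate a simple measure from the coalesced data,
# here the mean time between the starts of consecutive error events.
starts = [event[0] for event in events]
gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
mean_time_between_errors = sum(gaps) / len(gaps)
print(len(events), "events, mean time between errors =", round(mean_time_between_errors, 1))
```

The window size strongly affects the result: too small and one error is counted many times, too large and distinct errors are merged.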
FI Environment & Implementation Methods
Fault models
HW: open, bridging, bit-flip, spurious, power-surge, stuck-at faults
SW: storage data corruption (register, memory, disk), communication data corruption (bus, network), manifestation of SW defects (machine level and higher levels)
Fault Injection Targets: Where to Inject?
Various Fault Injection Approaches
Physical fault injection
EMI, radiation
Simulated fault injection
Injections into a VHDL model
Hardware fault injection
Pin-level injection
Scan chains
Software-implemented fault injection (SWIFI)
Bit-flips, mutations
Code and data segments
Interfaces
Physical Fault Injection
Reproduce extreme environmental conditions
EMI/Radiation
Heat/Shock
Voltage drops/spikes etc
Advantages
Real/actual faults
Tangible
Simple experiments
Disadvantages
Difficult to control/repeat
Needs at least a prototype
Simulation-based Fault Injection
Using a model of the system
VHDL
MATLAB
SystemC
SPICE
Advantages
Usable during design
Controllable
Disadvantages
Requires a model
Model accuracy
Slow
Simulated Fault Injection
Fault injection by level of the simulated model:
Electrical level (electrical circuits; physical process): change current, change voltage
Logical level (logic gates; logic operation): stuck at 0 or 1, inverted fault
Functional level (functional units): change CPU register, flip memory bits, etc.
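A logical-level injection can be sketched with a tiny gate-level model. The example below is an illustration only: the circuit (a full adder), the net names, and the `stuck` parameter are assumptions chosen for the sketch, not part of any particular simulator. It forces one net to a fixed value and lists the input vectors for which the faulty circuit differs from the fault-free one.

```python
from itertools import product

def full_adder(a, b, cin, stuck=None):
    """Gate-level full adder. `stuck` is an optional (net_name, value) pair
    that forces the named net to 0 or 1 (a stuck-at fault)."""
    def net(name, value):
        if stuck is not None and stuck[0] == name:
            return stuck[1]        # the faulty net ignores its driver
        return value

    x1 = net("x1", a ^ b)          # XOR gate
    s = net("s", x1 ^ cin)         # XOR gate, sum output
    a1 = net("a1", a & b)          # AND gate
    a2 = net("a2", x1 & cin)       # AND gate
    cout = net("cout", a1 | a2)    # OR gate, carry output
    return s, cout

def mismatches(stuck):
    """Input vectors on which the faulty circuit differs from the golden one."""
    return [v for v in product((0, 1), repeat=3)
            if full_adder(*v, stuck=stuck) != full_adder(*v)]

print("a1 stuck-at-0:", mismatches(("a1", 0)))
print("x1 stuck-at-1:", mismatches(("x1", 1)))
```

Vectors that do not appear in the list leave the fault dormant, which is why the activation set matters as much as the fault set.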
Hardware-based Fault Injection
Inject faults using hardware (similar to physical)
Pin-level injection
Scan chains
Advantages
Controllable
Close to real faults
Disadvantages
Requires special equipment
Reachability
HW Fault-Injection (1/2)
HW FI with contact: In pin-level injection, the injector has direct physical contact with the target system, producing voltage and current changes externally to the target chip.
Active probes: add current via probes attached to the pins; usually limited to stuck-at faults, although bridging faults can also be handled.
Socket insertion: inserts a socket between the target hardware and its circuit board; can inject stuck-at, open, or more complex logic faults (inverted, ANDed, ORed) into the target hardware.
Provides good controllability of fault times and locations with little or no perturbation to the target system.
HW Fault-Injection (2/2)
HW FI without contact: The injector has no direct physical contact with the target system.
An external source produces some physical phenomenon, such as heavy-ion radiation or electromagnetic interference, causing spurious currents inside the target chip.
It is difficult to control the exact time and location of an injection, because one cannot precisely control the exact moment of heavy-ion emission or electromagnetic field creation.
Why Inject SW Faults?
Software faults are most probably the major cause of computer system outages
Goals:
Experimental risk assessment in component-based software development
Dependability evaluation of COTS components
Robustness testing
Fault tolerance layer evaluation
Dependability benchmarking
Software Implemented Fault Injection (SWIFI)
A testing technique that aids in understanding how SW behaves when stressed in unusual ways.
Variations in the technique allow it to be applied to many types of SW and for different purposes.
Manipulate bits in memory locations and registers
Emulation of HW faults
Change text segment of processes
Emulation of SW faults (bugs, defects)
Dynamic: E.g., Op-code switch during operation
Static: Change source code and recompile (mutation)
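The first variation, manipulating bits in memory to emulate a HW fault, can be sketched as follows. This is a toy model in Python: the "memory" is a `bytearray` holding one 32-bit little-endian integer, and `flip_bit` is a hypothetical helper, so it shows the idea rather than how a real injector reaches into a process image.

```python
import struct

def flip_bit(memory: bytearray, byte_offset: int, bit: int) -> None:
    """Emulate a single-bit memory fault by inverting one bit in place."""
    memory[byte_offset] ^= 1 << bit

# The target's data: a 32-bit unsigned integer stored little-endian.
memory = bytearray(struct.pack("<I", 1000))

# Inject: flip bit 7 of byte 1, i.e. bit 15 of the stored integer.
flip_bit(memory, byte_offset=1, bit=7)

corrupted = struct.unpack("<I", memory)[0]
print(f"original = 1000, corrupted = {corrupted}")
```

A single flipped bit changes the value by a power of two, so the visible effect ranges from negligible to catastrophic depending on which bit is hit.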
Usage of SWIFI
Finding defects in software
Robustness Testing
COTS Validation/Determining failure modes
Safety Verification
Security Assessment
Software Testability Analysis
SWIFI
Attractive because it does not require expensive hardware
If the target is an application, the injector is inserted into the application or layered between the application and the OS.
If the target is the OS, the injector must be embedded in the OS.
Shortcomings:
Cannot inject faults into locations inaccessible to SW
SW probes may alter the workload running on the target or even change the structure of the program
The poor time resolution of the approach may cause fidelity problems. This is acceptable for long-latency faults (memory faults) but problematic for short-latency faults (bus or CPU faults)
Characterization of SWIFI methods
Compile-time injection: the program instruction is modified before the program image is loaded and executed
Run-time injection: during run-time, a mechanism is needed to trigger FI
SWIFI: Compile-time Injection
Rather than injecting faults into the HW of the target system, inject errors into the source code or assembly code of the target program to emulate the effect of HW, SW, and transient faults.
The modified code alters the target program instructions; the injection thus generates an erroneous software image, and when the system executes the faulty image, it activates the fault.
Requires modification of the program that will evaluate the fault effect; requires no additional software during run-time.
Causes no perturbation to the target system during execution.
As the fault effect is hard-coded, it can be used to emulate permanent faults.
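A small Python analogue of compile-time injection is sketched below. It is an assumption-laden illustration: Python's `ast` module stands in for editing source or assembly code, `compile`/`exec` stand in for rebuilding the program image, and the target function `in_range` and the mutation (`<` becomes `<=`, a condition-check fault) are invented for the example.

```python
import ast

SOURCE = """
def in_range(x, limit):
    return x < limit
"""

class LtToLtE(ast.NodeTransformer):
    """Mutation operator: turn every `<` comparison into `<=`."""
    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [ast.LtE() if isinstance(op, ast.Lt) else op for op in node.ops]
        return node

def build(source, mutate=False):
    """'Compile' the target, optionally injecting the fault before the image is built."""
    tree = ast.parse(source)
    if mutate:
        tree = ast.fix_missing_locations(LtToLtE().visit(tree))
    namespace = {}
    exec(compile(tree, "<target>", "exec"), namespace)
    return namespace["in_range"]

golden = build(SOURCE)
faulty = build(SOURCE, mutate=True)

# The fault is permanent in the faulty image, but only the boundary input activates it.
for x in (4, 5, 6):
    print(x, golden(x, 5), faulty(x, 5))
```

No injector code runs alongside the faulty image, which matches the point above about zero run-time perturbation.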
SWIFI: Run-time Injection
A mechanism is needed to trigger fault injection at run-time.
Triggering mechanisms include:
Timeout: the timeout event generates an interrupt to invoke fault injection
The timer can be a HW or SW timer
Since it injects faults on the basis of time rather than specific events or system state, it produces unpredictable fault effects and program behavior
Can emulate transient and intermittent HW faults
Exception/trap: when a HW exception occurs or a SW trap instruction inserted into the target application executes, an interrupt is generated that transfers control to an interrupt handler, which is basically the fault injector.
It can inject faults whenever certain events or conditions occur.
Code insertion: instructions are added to the target program that allow fault injection to occur before particular instructions.
Performs fault injection at run-time and adds instructions instead of changing the code
The fault injector may exist as part of the target program and run in user mode rather than system mode
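The code-insertion style can be sketched in Python as a hook call added in front of a particular statement of the target. The `Injector` class, its count-based trigger, and the `accumulate` target are all hypothetical; a real tool would insert machine instructions, and a timeout or trap trigger would replace the call counter used here to keep the example deterministic.

```python
class Injector:
    """Run-time injector that fires once, the n-th time its hook is reached."""
    def __init__(self, trigger_count, fault):
        self.trigger_count = trigger_count
        self.fault = fault
        self.calls = 0
        self.fired = False

    def hook(self, state):
        self.calls += 1
        if not self.fired and self.calls == self.trigger_count:
            self.fault(state)      # corrupt the target's state
            self.fired = True      # one-shot: emulates a transient fault

def flip_low_bit(state):
    """Fault model for this sketch: invert bit 0 of the running total."""
    state["total"] ^= 1

def accumulate(values, injector=None):
    """Target program: sums the values; the hook call is the inserted code."""
    state = {"total": 0}
    for v in values:
        if injector is not None:
            injector.hook(state)   # inserted before the add instruction
        state["total"] += v
    return state["total"]

golden = accumulate([1, 2, 3, 4])
faulty = accumulate([1, 2, 3, 4], Injector(trigger_count=3, fault=flip_low_bit))
print("golden =", golden, "faulty =", faulty)
```

The injector lives inside the target program and runs with its privileges, and the original statements are left unchanged; only the hook call is added.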
Summary of Techniques for SWIFI
Type            Method
SW Fault        Modify the text segment of the program
SW Error        Modify the data segment of the program
Memory Fault    Flip memory bits
CPU Fault       Use a trap to modify the memory area of the saved CPU register
Bus Fault       Use traps before and after an instruction to change the code or data used by the instruction, then restore them after the instruction is executed
Network Fault   Modify or delete transmission messages
SW Fault-Injection Techniques
Many Tools Available
DEPEND, MEFISTO
Evaluating HW/SW architectures using simulations
FERRARI, DOCTOR, RIFLE, Xception, FIST, Messaline
Evaluate tolerance against HW faults
DEFINE, FIAT, FTAPE
Evaluate tolerance against HW and SW faults
MAFALDA, NFTAPE, PROPANE
Evaluate effects of HW & SW faults and analyze error propagation
Ballista
OS Robustness testing
DEPEND
Provides a library of objects to behaviorally model a system's hardware components; using these objects, a control program written in C++ simulates system operation and models system SW.
The objects automatically inject faults, initiate repairs, and compile statistics.
Permanent, transient, and user-defined faults can be injected with latency or at correlated times.
FI scheme based on workload.
Messaline
The injection, activation, and collection modules are implemented in HW; the SW management module resides on a PC.
Signals collected from the target system can provide feedback to the injector.
A device is associated with each injection point to sense when and if each fault is activated and produces an error.
Uses both active probes and sockets to conduct pin-level FI.
Can inject stuck-at, open, bridging, and complex logical faults.
FIST: Fault Injection System for Study of Transient Fault Effect
Employs both contact and contactless methods to create transient faults.
Uses heavy-ion radiation to create transient faults at random locations inside a chip.
The radiation source is placed inside a vacuum chamber with two small processors (reference and test CPUs).
In addition to radiation, FIST allows injection of power-disturbance faults (to cause gate propagation-delay faults).
Xception
Uses a processor's built-in hardware exception triggers to trigger fault injection. The fault injector is implemented as an exception handler and requires modification of the interrupt handler vector.
Events that can trigger fault injection include: opcode fetch from a specified address, operand load from a specified address, operand store to a specified address, and a specified time passing since start-up.
Each fault has a specified fault mask: a set of bits that determines which corresponding bits in the target location will be injected.
Takes advantage of the advanced debugging and performance-monitoring features present in many modern processors to inject more realistic faults.
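The fault-mask idea can be illustrated with plain bitwise operations. The sketch below is not Xception's interface: the function `apply_fault`, its `kind` argument, and the three operations (flip, stuck-at-1, stuck-at-0) are assumptions chosen to show how a mask selects the bits that get corrupted.

```python
def apply_fault(value: int, mask: int, kind: str = "flip", width: int = 32) -> int:
    """Corrupt only the bits of `value` that are set in `mask`.

    kind: "flip" inverts the selected bits, "stuck1" forces them to 1,
    "stuck0" forces them to 0 (hypothetical operation names).
    """
    full = (1 << width) - 1
    if kind == "flip":
        result = value ^ mask
    elif kind == "stuck1":
        result = value | mask
    elif kind == "stuck0":
        result = value & ~mask
    else:
        raise ValueError(f"unknown fault kind: {kind}")
    return result & full

value, mask = 0b1010, 0b0110
for kind in ("flip", "stuck1", "stuck0"):
    print(kind, format(apply_fault(value, mask, kind), "04b"))
```

Bits outside the mask are never touched, so the same mask can be reused across different target locations and fault kinds.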
Characteristics of Fault Injection Methods
Key Issues in Fault Injection
Effective fault injection mechanisms using hardware, software, and hybrid technology to accurately assess and validate networked systems
Practical evaluation methods to accurately quantify fault effects and recovery mechanisms in complex environments
Evaluation of error detection, diagnosis, and recovery techniques
Quantification of confidence in fault-injection-based validation
Usable fault tolerance benchmark for assessing systems and networks
Common evaluation/validation framework
References
1. R.K. Iyer, D. Tang, Experimental Analysis of Computer System Dependability, Chapter 5, Fault-Tolerant Computer System Design, edited by D.K. Pradhan, Prentice Hall, 1994.
2. J. Clark, D.K. Pradhan, Fault-Injection: A Method for Validating Computer-System Dependability, IEEE Computer, pp. 47-56, June 1995.
3. M-C. Hsueh, T. Tsai, R.K. Iyer, Fault-Injection Techniques and Tools, IEEE Computer, pp. 75-82, April 1997.
Look for references to other tools/techniques in these papers/book chapter.