lect 1 intro taxonomy

7/31/2019 Lect 1 Intro Taxonomy


Fault Tolerant Systems

Dependable & Secure Systems



Text Book to be Followed



Course Outline

Introduction - Basic concepts

Dependability measures

Redundancy techniques

Hardware fault tolerance

Error detecting and correcting codes

Redundant disks (RAID)

Fault-tolerant networks

Software fault tolerance

Checkpointing

Case studies of fault-tolerant systems

Defect tolerance in VLSI circuits

Fault detection in cryptographic systems

Simulation techniques



Need for Fault Tolerance - Critical Applications

Aircraft, nuclear reactors, chemical plants, medical equipment

A malfunction of a computer in such applications can lead to catastrophe

Their probability of failure must be extremely low, possibly one in a billion per hour of operation

Also included - financial applications



Need for Fault Tolerance - Harsh Environments

A computing system operating in a harsh environment, where it is subjected to

  electromagnetic disturbances

  particle hits and the like

A very large number of failures means that the system will not produce useful results unless some fault tolerance is incorporated



Need for Fault Tolerance - Highly Complex Systems

Complex systems consist of millions of devices

Every physical device has a certain probability of failure

A very large number of devices implies that the likelihood of failures is high

The system will experience faults at such a frequency that it is rendered useless
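
The argument on this slide can be made concrete with a short calculation. The sketch below assumes, purely for illustration, that devices fail independently and with the same probability p over some fixed interval; neither assumption is stated on the slide, and the function name and the numbers are hypothetical.

```python
def prob_at_least_one_failure(p: float, n: int) -> float:
    """Probability that at least one of n devices fails.

    Assumption (not from the slides): failures are independent and every
    device has the same failure probability p over the interval of interest.
    """
    return 1.0 - (1.0 - p) ** n


# Hypothetical per-device failure probability of one in a million.
for n in (1_000, 1_000_000, 100_000_000):
    print(n, prob_at_least_one_failure(1e-6, n))
```

Under these assumptions, even a very small per-device probability pushes the system-level probability toward 1 as the device count grows, which is the point the slide makes.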



Fault Taxonomy



Basic Concepts of Dependability

Dependability: the trustworthiness of a computer system such that reliance can justifiably be placed on the service it delivers.

It is the system property that integrates such attributes as reliability, availability, safety, security, survivability, and maintainability.

A systematic exposition of the concepts of dependability consists of three parts: the threats to, the attributes of, and the means by which dependability is attained.



Dependability Tree



Fault-Error-Failure Model

System Under Consideration

Cause of Error (& Failure): Fault

Unintended State: Error

Deviation of Actual Service from Intended Service: Failure

Faults and errors are states; failures are external events.

Failure denotes an element's inability to perform its designed function because of errors in the element or its environment, which in turn are caused by various faults.



Fault, Error, Failure Examples

Example 1 (hardware)

Fault: cosmic ray knocks charge off of DRAM cell

Error: bit flip in memory

Failure: computation produces incorrect result

Example 2 (software)

Fault: software bug could allow for NULL pointer

Error: bug gets exercised and we get NULL pointer

Failure: program gets a segmentation fault when it tries to access the pointer
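
The hardware example can be traced in a few lines of code. This is only an illustrative sketch: the "memory" is a Python list, the fault is injected by hand with a hypothetical `flip_bit` helper, and the values are made up.

```python
def flip_bit(value: int, bit: int) -> int:
    """Model the fault: invert one bit of a stored integer."""
    return value ^ (1 << bit)


correct_memory = [10, 20, 30]
memory = list(correct_memory)               # stored state before the fault
expected_sum = sum(correct_memory)          # the intended service

memory[1] = flip_bit(memory[1], 3)          # fault: bit 3 of the second word flips
has_error = memory != correct_memory        # error: the stored state is now wrong

observed_sum = sum(memory)                  # the corrupted word is used in a computation
has_failure = observed_sum != expected_sum  # failure: the delivered result deviates

print(memory, observed_sum, has_error, has_failure)
```

If the corrupted word were overwritten before being read, the error would never propagate to the output and no failure would occur, which is why the model keeps the three terms separate.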



Duration of Faults/Errors

Transient (soft): occurs once and disappears

  E.g., cosmic ray knocks charge off transistor -> bit flip

  Tend to be due to transient physical phenomena

  Also known as Single Event Upset (SEU)

Intermittent: occurs occasionally

  E.g., loose connection -> occasionally open circuit

  E.g., software bug in rounding -> incorrect data

Permanent (hard): occurs and does not go away

  E.g., broken connection -> always open circuit



Software Faults/Errors

Types of bugs (or errors/failures that are due to bugs)

Incorrect algorithm

Array bounds violation

Memory leak (C, C++; garbage collection in Java avoids this form)

  Allocating memory, but not de-allocating it

Reference to NULL pointer (undefined behavior in C and C++; Java throws a NullPointerException instead)

Incorrect synchronization in multithreaded code

  Allowing more than 1 thread in critical section at a time

  Blocking when holding a lock

Inability to handle unanticipated inputs
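
As an illustration of the synchronization item in the list above, the sketch below protects a shared counter with a lock so that only one thread at a time is inside the critical section. The counter, the thread count, and the function name are hypothetical; the slides do not prescribe any particular mechanism.

```python
import threading

counter = 0
lock = threading.Lock()


def add_many(n: int) -> None:
    """Increment the shared counter n times."""
    global counter
    for _ in range(n):
        with lock:            # at most one thread may hold the lock at a time
            counter += 1      # read-modify-write on shared state (critical section)


threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)
```

Without the lock, two threads could read the same old value and both write back the same new one, losing an update. Keeping the locked region this small also limits the second problem on the slide, blocking while holding a lock.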



Software Failure

What happens if we exercise a software bug?

Failures can occur in:

User-level software

  Incorrect data

  Livelock/deadlock

  Exception that triggers OS to kill process

    Segmentation fault

    Bus error

Operating system software (including device drivers)

  Livelock/deadlock

  Crash and reboot

  Incorrect I/O



Dependability and its Attributes

Availability: readiness for correct service

Reliability: continuity of correct service

Safety: absence of catastrophic consequences on the user(s) and the environment

Confidentiality: absence of unauthorized disclosure of information

Integrity: absence of improper system alterations

Maintainability: ability to undergo modifications and repairs

Security is a composite of the attributes availability, confidentiality, and integrity.



Traditional Measures - Reliability

Assumption: The system can be in one of two states: up or down

Examples:

  Lightbulb - good or burned out

  Wire - connected or broken

Reliability, R(t): Probability that the system is up during the whole interval [0,t], given it was up at time 0

Related measure - Mean Time To Failure, MTTF: Average time the system remains up before it goes down and has to be repaired or replaced
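
To make R(t) and MTTF tangible, the sketch below evaluates them under one common modelling assumption that is not stated on this slide: a constant failure rate lambda, which gives R(t) = exp(-lambda t) and MTTF = 1/lambda. The function names and the numeric rate are hypothetical.

```python
import math


def reliability(t: float, failure_rate: float) -> float:
    """R(t) = exp(-lambda * t), assuming a constant failure rate (an assumption)."""
    return math.exp(-failure_rate * t)


def mean_time_to_failure(failure_rate: float) -> float:
    """MTTF is the integral of R(t) over [0, infinity), which is 1/lambda here."""
    return 1.0 / failure_rate


lam = 1e-4  # hypothetical rate: on average one failure per 10,000 hours
print(reliability(1_000, lam))    # probability of staying up for the first 1,000 hours
print(mean_time_to_failure(lam))
```

Under this model the probability of surviving for exactly one MTTF is exp(-1), about 37%, so MTTF should not be read as a guaranteed lifetime.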



Traditional Measures - Availability

Availability, A(t): Fraction of time the system is up during the interval [0,t]

Point Availability, A_p(t): Probability that the system is up at time t

Long-Term Availability, A:

$$A = \lim_{t \to \infty} A(t) = \lim_{t \to \infty} A_p(t)$$

Availability is used in systems with recovery/repair

Related measures:

  Mean Time To Repair, MTTR

  Mean Time Between Failures, MTBF = MTTF + MTTR

$$A = \frac{\mathrm{MTTF}}{\mathrm{MTBF}} = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}$$
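
The relation on this slide, A = MTTF / MTBF with MTBF = MTTF + MTTR, is easy to evaluate numerically. The following sketch uses made-up MTTF and MTTR values in hours; the function name is hypothetical.

```python
def long_term_availability(mttf: float, mttr: float) -> float:
    """A = MTTF / MTBF, where MTBF = MTTF + MTTR."""
    mtbf = mttf + mttr
    return mttf / mtbf


# Hypothetical system: fails on average every 999 hours, takes 1 hour to repair.
a = long_term_availability(mttf=999.0, mttr=1.0)
downtime_hours_per_year = (1.0 - a) * 24 * 365

print(a, downtime_hours_per_year)
```

The same availability can be reached with very different MTTF/MTTR pairs, so availability alone says nothing about how often the service is interrupted.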



Need For More Measures

The assumption of the system being in state up or down is very limiting

Example: A processor with one of its several hundreds of millions of gates stuck at logic value 0 and the rest functional - may affect the output of the processor once in every 25,000 hours of use

The processor is not fault-free, but cannot be defined as being down

More detailed measures than the general reliability and availability are needed



Computational Capacity Measures

Example: N processors in a gracefully degrading system

System is useful as long as at least one processor remains operational

Let P_i = Prob {i processors are operational}

$$R(t) = \sum_{i=1}^{N} P_i$$

Let c = computational capacity of a processor (e.g., number of fixed-size tasks it can execute)

Computational capacity of i processors: C_i = i c

Average computational capacity of system:

$$\sum_{i=1}^{N} C_i P_i$$
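
The two sums on this slide, the reliability (sum of P_i for i from 1 to N) and the average computational capacity (sum of C_i P_i with C_i = i c), can be evaluated once the P_i are known. The sketch below obtains the P_i from a binomial distribution, which assumes that each processor is operational independently with the same probability p; that model, the function names, and the numbers are assumptions made only for this example.

```python
from math import comb


def operational_distribution(n: int, p: float) -> list[float]:
    """P_i for i = 0..n, assuming each processor is up independently with probability p."""
    return [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]


def system_reliability(P: list[float]) -> float:
    """Sum of P_i for i >= 1: at least one processor is operational."""
    return sum(P[1:])


def average_capacity(P: list[float], c: float) -> float:
    """Sum of C_i * P_i for i >= 1, with C_i = i * c."""
    return sum(i * c * P[i] for i in range(1, len(P)))


P = operational_distribution(n=4, p=0.9)
print(system_reliability(P), average_capacity(P, c=1.0))
```

For this hypothetical 4-processor system the system is almost always usable, yet the average capacity is visibly below the full capacity of 4c, which is the information the up/down model hides.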



Another Measure - Performability

Another approach - consider everything from the perspective of the application

Application is used to define accomplishment levels L1, L2,...,Ln

Each represents a level of quality of service delivered by the application

Example: Li indicates i system crashes during the mission time period T

Performability is a vector (P(L1),P(L2),...,P(Ln)) where P(Li) is the probability that the computer functions well enough to permit the application to reach up to accomplishment level Li



Network Connectivity Measures

Focus on the network that connects the processors

Classical Node and Line Connectivity - the minimum number of nodes and lines, respectively, that have to fail before the network becomes disconnected

Measure indicates how vulnerable the network is to disconnection

A network disconnected by the failure of just one (critically-positioned) node is potentially more vulnerable than another which requires several nodes to fail before it becomes disconnected
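
For small networks, node connectivity as defined above can be computed by brute force: try removing every set of k nodes, for increasing k, until the remaining network is disconnected. The sketch below does exactly that. It is exponential in the number of nodes and is meant only to illustrate the definition; the function names and the two example topologies are hypothetical. Line connectivity could be computed the same way by removing edges instead of nodes.

```python
from itertools import combinations


def is_connected(nodes, edges) -> bool:
    """Depth-first search over the sub-network induced by `nodes`."""
    nodes = set(nodes)
    if len(nodes) <= 1:
        return True
    adj = {v: set() for v in nodes}
    for u, v in edges:
        if u in nodes and v in nodes:
            adj[u].add(v)
            adj[v].add(u)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == nodes


def node_connectivity(nodes, edges) -> int:
    """Smallest number of node failures that disconnects the network."""
    nodes = list(nodes)
    for k in range(len(nodes) - 1):
        for removed in combinations(nodes, k):
            remaining = [v for v in nodes if v not in removed]
            if not is_connected(remaining, edges):
                return k
    # No removal set disconnects a complete graph; by convention return n - 1.
    return len(nodes) - 1


ring = [(i, (i + 1) % 5) for i in range(5)]   # 5-node ring
star = [(0, i) for i in range(1, 5)]          # hub 0 with four leaves
print(node_connectivity(range(5), ring), node_connectivity(range(5), star))
```

The star is the "critically-positioned node" case from the slide: losing the hub alone disconnects it, while the ring survives any single node failure.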



Connectivity - Examples



Network Resilience Measures

Classical connectivity distinguishes between only two network states: connected and disconnected

It says nothing about how the network degrades as nodes fail before becoming disconnected

Two possible resilience measures:

  Average node-pair distance

  Network diameter - maximum node-pair distance

Both calculated given probability of node and/or link failure
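
The two measures named above can be computed with breadth-first search. The sketch below evaluates them for a hypothetical 6-node ring before and after one specific node failure. It does not average over failure probabilities, as the slide ultimately calls for, and it assumes the network stays connected; all names are illustrative.

```python
from collections import deque


def bfs_distances(adj: dict, source) -> dict:
    """Hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def distance_measures(adj: dict) -> tuple:
    """Return (average node-pair distance, diameter); the graph must be connected."""
    nodes = list(adj)
    pair_distances = []
    for i, u in enumerate(nodes):
        dist = bfs_distances(adj, u)
        pair_distances.extend(dist[v] for v in nodes[i + 1:])
    return sum(pair_distances) / len(pair_distances), max(pair_distances)


def without_node(adj: dict, failed) -> dict:
    """Copy of the network with one failed node and its links removed."""
    return {u: [v for v in nbrs if v != failed]
            for u, nbrs in adj.items() if u != failed}


ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
healthy = distance_measures(ring)
degraded = distance_measures(without_node(ring, 0))
print(healthy, degraded)
```

The ring remains connected after the failure, so classical connectivity reports no change, while both distance measures get worse.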



Means to Attain Dependability

Fault prevention: means to prevent the occurrence or introduction of faults

Fault tolerance: means to avoid service failures in the presence of faults

Fault removal: means to reduce the number and severity of faults

Fault forecasting: means to estimate the present number, the future incidence, and the likely consequences of faults

Note:

Fault prevention and fault tolerance aim to provide the ability to deliver a service that can be trusted. [Procurement]

Fault removal and fault forecasting aim to reach confidence in that ability by justifying that the functional and dependability specifications are adequate and that the system is likely to meet them. [Validation]



Failure Modes

A system does not always fail in the same way. Its failure modes characterize incorrect service according to four viewpoints:

  the failure domain

  the perception of a failure by system users

  the detectability of failures

  the consequences of failures on the environment



A Taxonomy of Faults

All faults that may affect a system during its life are classified according to eight basic viewpoints.



Classes of Faults Tree Representation



Classes of Combined Faults



Key System/Functional Unit Properties

Fail Safe: In case of a fault, the system or functional unit transitions to a safe state.

Fail Silent: In case of a fault, the output interfaces are disabled in a way that no further outputs are made.

Fail Operational: The ability of a system or functional unit to continue normal operation at its output interfaces despite the presence of hardware or software faults.

Graceful Degradation: The system continues to operate in the presence of errors, accepting partial degradation of performance during recovery.


EASIS's View on Fail-Silent Electronic Control Unit (ECU)



CPU Faults/Errors

Processing core:

I. Calculating errors (e.g., HW fault, logic error)

II. Value errors (e.g., HW fault, memory/register corruption, EMI, SEU, etc.)

III. Program flow errors (e.g., HW error)

IV. Interrupt errors (sequence, frequency, delay, disregarding, etc.)

V. Algorithmic errors (= compiler/logic synthesizer errors / design faults)

VI. Timing errors

RAM/ROM:

VII. Errors in the RAM/ROM (memory cell defective)

VIII. Faulty RAM/ROM access (wrong memory address)

IX. Faulty memory mapping (= compiler or linker errors / design faults)

X. Memory overflow

I/O-Interface:

XI. Interface errors (errors in ADC/digital IO/ ...)



Supervisor Faults/Errors

I. Internal error (the same as CPU faults/errors if the supervisor is a processor).

II. Synchronization lost between CPU and supervisor.

III. Supervisor and CPU are getting different information from the outer world.

IV. Supervisor loses control over the enable lines.

V. CPU and supervisor use different, but both valid, rules to judge the control.



SW Related Faults/Errors

Scheduling Faults/Errors

I. Missed activation deadline

II. Missed termination deadline

Communication between SW components

I. Data values of the received data are faulty

II. The data is received later than a deadline

III. The data is received too early

IV. The data cannot be sent out in the given time range

V. The data cannot be sent out

VI. API access fault (e.g., dynamic argument is out of range)



Actuator Faults/Errors

I. The actuator is not driven.

II. The actuator is permanently driven (without controller command).

III. The actuator is not driven at the right time.

IV. The actuator is not driven with the correct performance.

V. The actuator cannot be driven correctly.



Sensor Faults/Errors

I. The sensor delivers no value or an error signal.

II. The read value of the sensor is wrong.

III. The sensor delivers a value with a wrong timing.



Internal Power Supply Faults

I. Over voltage

II. Under voltage

III. Short circuit

IV. Over current (due to erroneously activated actuators, defective actuators, defective components, misuse of components, etc.)

V. Leakage current too high

VI. Brown out (slow decrease of the supply voltage below the minimum limit)

VII. Startup timing

VIII. Shutdown timing



External Power Supply Faults

I. Over voltage (load dump, ISO pulse, generator error)

II. Under voltage (due to Battery Low, line break)

III. Current limit

IV. Short circuit



Faults/Errors in Communication Systems

At the node level

I. Data values of a received message are faulty (faulty data value).

II. The message is received later than a deadline (late message).

III. The message is received too early.

IV. The message cannot be sent out in the given time range.

V. The message cannot be sent out.

At the system level

I. All receivers of the message (in a special case only one receiver exists) regard the message as faulty with respect to the same main fault type, which is one of the faults (I to III).

II. All receivers of the message regard the message as faulty with respect to one of the main fault types (I to III), which can be different for each receiver.

III. Some of the receivers get a correct message, while the others get a faulty message with respect to one of the main fault types (I to III), which is the same for each receiver of the faulty message.

IV. Some of the receivers get a correct message, while the others get a faulty message with respect to one of the main fault types (I to III), which can be different for each receiver of the faulty message.



Comprehensive Fault Model

Specification Faults

Adequacy faults: some of the properties expressed in the specification are in contradiction with the required properties.

Over-specification: the specification satisfies the required properties, but some feasible solutions are excluded because of the presence of unnecessary properties; the specification is too detailed.

Under-specification: all the properties expressed in the specification are adequate, but some unacceptable solutions are accepted; the specification is not precise enough.

Source: NUREG/CR-6316 Guidelines



Requirement Faults (NASA fault taxonomy)

Incompleteness

Omitted/Missing

Incorrect

Ambiguous

Infeasible

Inconsistent

Over-specification

Not Traceable

Misplaced

Unachievable Item

Non-verifiable

Intentional Deviation

Redundant or Duplicate



Design Faults

Software design faults

  Application design faults

  Basic software design faults

    Scheduling faults

    Services faults

  Calibration faults

  Firmware design faults

Hardware design faults

  Component design faults

  ECU design faults

Malicious design faults

  Disrupt or halt service; causing denial of service; improper modification of system behavior

System design faults

  Relating to architecture design, communication infrastructure, wiring harness, EMI protection, etc.



Manufacturing Faults

Arise from weaknesses in the manufacturing and assembly processes at the various levels of detail, from component manufacturing to final vehicle assembly.

Such a fault could be caused by low quality in materials/components, but may also be caused by a software/hardware fault in the manufacturing system.



Operational Faults (Refer to EASIS Fault Model)

Hardware faults

  Node faults

    CPU faults

    Supervisor/watchdog faults

    Internal communication (SPI) faults

    Reset logic faults

  Actuator faults

  Sensor faults

  Power-supply faults

Communication faults/errors



Operational Faults (Contd.)

Susceptibility faults

  Electrical susceptibility (EMI transported by cabling)

  Electromagnetic susceptibility (transported by air)

  Environmental susceptibility

Maintenance faults

  Wrong software download

  Wrong replacement parts

  Wrong maintenance procedure followed

Malicious faults

  Software intrusions

  Hardware intrusions


Fault Hypothesis

The fault hypothesis partitions the fault space into two sets

Level-1 faults: the set of faults that will be tolerated by the fault-tolerance mechanisms.

Level-2 faults: the set of faults that will not be tolerated by the fault-tolerance mechanisms. These faults must be rare events.

If no precise fault hypothesis is available, it is impossible to test the proper behavior of the fault-tolerance mechanisms.

If, during the test and installation phase, level-2 faults turn out not to be rare events, then there is a fundamental design problem:

  Either the fault hypothesis is wrong

  Or the implementation is deficient.



Hardware Redundancy

Extra hardware is added to override the effects of a failed component

Static Hardware Redundancy - for immediate masking of a failure

  Example: Use three processors and vote on the result. The wrong output of a single faulty processor is masked

Dynamic Hardware Redundancy - Spare components are activated upon the failure of a currently active component

Hybrid Hardware Redundancy - A combination of static and dynamic redundancy techniques
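
The three-processor voting example of static redundancy can be sketched in software as a majority voter. This is only a model of the idea: the "processors" are ordinary Python functions, the faulty one is hypothetical, and a real voter would be a hardware circuit.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of the modules."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: the fault cannot be masked")
    return value


def healthy_processor(x):
    return x * x


def faulty_processor(x):
    return x * x + 1      # hypothetical faulty unit producing a wrong result


modules = (healthy_processor, faulty_processor, healthy_processor)
result = majority_vote([m(7) for m in modules])
print(result)
```

The voter masks one wrong output immediately, without first identifying which module is faulty. With two simultaneous faults producing different wrong values, no majority exists and the sketch raises an error instead of returning a result.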



Software Redundancy Example

Multiple teams of programmers write different versions of software for the same function

The hope is that such diversity will ensure that not all the copies will fail on the same set of input data
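
A toy version of this idea is sketched below: three differently written integer square root routines are run on the same input and their results are voted on. The third version carries a deliberately injected, hypothetical bug for one input so that the effect of diversity is visible; none of this code comes from the lecture.

```python
import math
from collections import Counter


def isqrt_library(n: int) -> int:
    """Version 1: standard library routine."""
    return math.isqrt(n)


def isqrt_binary_search(n: int) -> int:
    """Version 2: binary search for the largest x with x*x <= n."""
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid * mid <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


def isqrt_newton_buggy(n: int) -> int:
    """Version 3: integer Newton iteration with a bug injected for n == 15."""
    x = n
    while x * x > n:
        x = (x + n // x) // 2
    return x + 1 if n == 15 else x   # deliberate bug, for illustration only


def n_version_isqrt(n: int) -> int:
    """Run all versions and return the majority result."""
    results = [f(n) for f in (isqrt_library, isqrt_binary_search, isqrt_newton_buggy)]
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise RuntimeError("versions disagree without a majority")
    return value


print(isqrt_newton_buggy(15), n_version_isqrt(15))
```

The vote helps only because the other two versions do not share the bug. If the teams make the same mistake on the same inputs, the majority is wrong too, which is why the slide speaks of a hope rather than a guarantee.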



Information Redundancy

Add check bits to original data bits so that an error in the data bits can be detected and even corrected

Error detecting and correcting codes have been developed and are being used

Information redundancy often requires hardware redundancy to process the additional check bits
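
One classic instance of check bits is the Hamming(7,4) code, which adds three parity bits to four data bits and can correct any single-bit error. The slide does not name a specific code, so the choice here is only an example; the function names are hypothetical.

```python
def hamming74_encode(data):
    """Encode 4 data bits as 7 code bits; parity bits sit at positions 1, 2, 4."""
    d3, d5, d6, d7 = data
    p1 = d3 ^ d5 ^ d7          # covers positions 1, 3, 5, 7
    p2 = d3 ^ d6 ^ d7          # covers positions 2, 3, 6, 7
    p4 = d5 ^ d6 ^ d7          # covers positions 4, 5, 6, 7
    return [p1, p2, d3, p4, d5, d6, d7]


def hamming74_decode(code):
    """Return (data bits, error position); position 0 means no error was detected."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4   # syndrome gives the 1-based error position
    if position:
        c[position - 1] ^= 1          # correct the single flipped bit
    return [c[2], c[4], c[5], c[6]], position


word = [1, 0, 1, 1]
sent = hamming74_encode(word)
received = list(sent)
received[4] ^= 1                      # single-bit error at position 5
recovered, position = hamming74_decode(received)
print(sent, received, recovered, position)
```

The encoder and the syndrome computation are the extra processing that the last bullet refers to; in hardware they are XOR trees sitting next to the memory or the bus. This code assumes at most one flipped bit per word and will miscorrect double errors.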



Time Redundancy

Provide additional time during which a failed execution can be repeated

Most failures are transient - they go away after some time

If enough slack time is available, the failed unit can recover and redo the affected computation
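
In software, time redundancy often takes the form of re-executing a computation when a failure is detected. The sketch below retries a task a bounded number of times. The task class, the use of `RuntimeError` as the failure signal, and the attempt limit (standing in for the available slack time) are all assumptions made for the example.

```python
def run_with_retries(task, max_attempts: int = 3):
    """Re-execute task until it succeeds or the attempt budget is used up.

    Returns (result, number of attempts used).
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(), attempt
        except RuntimeError as exc:      # hypothetical signal for a detected failure
            last_error = exc
    raise RuntimeError(f"still failing after {max_attempts} attempts") from last_error


class TransientlyFailingTask:
    """Hypothetical task that fails on its first `failures` calls, then succeeds."""

    def __init__(self, failures: int):
        self.remaining = failures

    def __call__(self) -> int:
        if self.remaining > 0:
            self.remaining -= 1
            raise RuntimeError("transient fault")
        return 42


result, attempts = run_with_retries(TransientlyFailingTask(failures=2))
print(result, attempts)
```

Retrying helps only against transient faults. A permanent fault fails on every attempt, exhausts the budget, and has to be handled by one of the other forms of redundancy.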

