Post on 28-Dec-2015

FAULT-TOLERANT COMPUTING

Jenn-Wei Lin
Department of Computer Science and Information Engineering

Fu Jen Catholic University

Motivation and Introduction
Lecture Set 1

ECE 753 Fault Tolerant Computing 2

General Information

• Textbook
– Marin L. Shooman: Reliability of Computer Systems and Networks: Fault Tolerance, Analysis, and Design, John Wiley and Sons, 2002.
– D.P. Siewiorek and R.S. Swarz: Reliable Computer Systems: Design and Evaluation, 3rd ed. A. K. Peters, 1999.
– D. K. Pradhan, editor, Fault-Tolerant Computer System Design, Prentice-Hall, 1996. (This book is out of print.)

• Paper
– Dependable Computing Conference

• Grading Policy
– Exam: 20%
– Presentation: 40% (four)
– Term report & project: 40%

Overview

• Motivation

• Introduction

• Terminology

• Fundamental Principles

• Fault-Error-Failure concept

Motivation

• Informal Definition

• Key Attributes

• Who, What and Why Study

• Examples

Motivation

• What is Fault-Tolerance?

A “fault-tolerant system” is one that continues to perform at the desired level of service in spite of failures in some of the components that constitute the system.

Motivation (contd.)

• Who is concerned about fault-tolerance?
– System users

• Who is concerned at design stages?
– Universities
    • R, d, and a (Research, development, applications)
– Industry
    • r, D, and A (research, Development, Applications)

Motivation (contd.)

Examples

• General Purpose Systems
– PCs: RAMs with parity checks
– Workstations: error detection (HW), occasional corrective action (SW), ECC (HW), keeping log (SW)

Motivation (contd.)

Examples

• Reliable Systems
– Telephone systems
– Banking systems, e.g. ATM
– Stock market
– Football games display/ticketing

Motivation (contd.)

Examples

• Critical and Life Critical Systems
– Manned and unmanned space borne systems
– Aircraft control systems
– Nuclear reactor control systems
– Life support systems

Motivation (contd.)

Examples

• Reliable -> Critical Systems
– 911 telephone switching system
– Traffic light control system
– Automobile control system (ABS, fuel injection system)

Introduction

– Historical perspective and major push

– Goals of fault-tolerance

– Applications of fault-tolerance

Introduction (contd.)

• Historical Perspective
– Not a new concept
– First use by J. von Neumann, 1956

• Major push
– Space program
– HW fault tolerance first
– SW fault tolerance later
– Merge the two

Introduction (contd.)

• Applications
– Space borne system
    • long life system
– Airplane control system
    • critical system
– Transaction processing system
    • high availability system
– Switching system
    • high availability over certain level of performance

Terminology

• Reliability and concept of probability
– R(t): conditional probability that a system provides continuous proper service in the interval [0,t], given that it provided the desired service at time 0.

• Availability
– The probability that an item is up at any point in time
– Uptime/(Uptime+Downtime)

• Dependability
– Property of a computer system that allows reliance to be placed justifiably on the service it delivers
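
The Uptime/(Uptime+Downtime) ratio above can be sketched in a few lines. This is only an illustration: the function names, the 8760-hour year, and the sample figures are assumptions, not values from the lecture.

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Fraction of observed time the system was up: Uptime / (Uptime + Downtime)."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return uptime_hours / total


def downtime_per_year(avail: float, hours_per_year: float = 8760.0) -> float:
    """Expected hours of downtime in one year at the given availability."""
    return (1.0 - avail) * hours_per_year


# Hypothetical observation: 8750 hours up, 10 hours down in one year.
a = availability(uptime_hours=8750.0, downtime_hours=10.0)
print(f"availability = {a:.5f}")
print(f"downtime per year = {downtime_per_year(a):.1f} h")
```

Turning the ratio back into hours of downtime per year is often the more intuitive way to compare availability targets.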

Fundamental Principles

• Dependability

• Impairments
– Faults, errors, failures

• Means
– Fault Avoidance, Fault Tolerance, Fault Removal, Fault Forecasting

• Measures
– Reliability, Availability, Maintainability

Fundamental Principles (contd.)

• A set of methods, tools and solutions that enable development of dependable systems.

- Fault Prevention: how to prevent fault occurrence or introduction,
- Fault Tolerance: how to ensure a service capable of fulfilling the system’s function in the presence of faults,
- Fault Removal: how to reduce the presence (number, seriousness) of faults,
- Fault Forecasting: how to estimate the present number, the future incidence, and the consequences of faults

Fundamental Principles (contd.)

• Fault Avoidance: To prevent, by construction, fault occurrence. E.g., nearly fault-free components, shielding against electromagnetic fields
– Drawbacks:
    - Cost of near-perfect components is high
    - Cost of maintenance personnel

• Fault Tolerance: To provide, by redundancy, service complying with specification in spite of faults occurring
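
One way to picture "service by redundancy" is a majority voter over three replicas of the same computation. The sketch below is illustrative only: `square`, `faulty_square`, and `run_redundant` are hypothetical names, and the injected fault is an assumption made for the demonstration.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(results: Sequence[int]) -> int:
    """Return the value produced by a strict majority of replicas."""
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError("no majority: too many replicas disagree")
    return value


def square(x: int) -> int:
    """Fault-free replica."""
    return x * x


def faulty_square(x: int) -> int:
    """Hypothetical replica with a fault that corrupts its result."""
    return x * x + 1


def run_redundant(replicas: Sequence[Callable[[int], int]], x: int) -> int:
    """Run every replica on the same input and vote on the outputs."""
    return majority_vote([replica(x) for replica in replicas])


# One of three replicas is faulty; the voter masks its wrong answer.
result = run_redundant([square, faulty_square, square], 7)
print(result)
```

With three replicas the voter masks any single faulty result. If two replicas fail in different ways, no majority exists and the sketch raises an error instead of returning a possibly wrong value.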

Fundamental Principles (contd.)

• Fault Removal: To minimize, by verification, the presence of faults. E.g., Am I building the right system? Concepts of coverage, etc.

• Fault Forecasting: To estimate, by evaluation, the presence, occurrence and consequences of faults. E.g., For how long will the system be right?

Fundamental Principles (contd.)

• Reliability: A measure of continuous delivery of proper service (or equivalently, of the time to failure) from a reference initial time

• Availability: A measure of the delivery of the proper service with respect to the alternation of delivery of proper and improper service

• Maintainability: A measure of continuous delivery of improper service (time to restoration or repair)
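
The three measures above are commonly tied together under a simplifying assumption that the slides do not state: a constant failure rate λ and a constant repair rate μ. Under that assumption (and only then), the usual textbook relations are:

```latex
R(t) = e^{-\lambda t}, \qquad
\mathrm{MTTF} = \int_0^\infty R(t)\,dt = \frac{1}{\lambda}, \qquad
\mathrm{MTTR} = \frac{1}{\mu}, \qquad
A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}} = \frac{\mu}{\lambda + \mu}
```

Here MTTF (mean time to failure) summarizes reliability, MTTR (mean time to repair) summarizes maintainability, and A is the steady-state availability.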

Fundamental Principles (contd.)

• Hardware Redundancy
– Low level
– High level

• Software Redundancy

• Time Redundancy

• Information Redundancy

Fundamental Principles (contd.)

• Hardware Redundancy - Low level
– Logic level
    • Example 1 - Self checking circuits
    • Example 2 - Arithmetic code: a modular adder using the mathematical principle
      (A + B) mod k = ((A mod k) + (B mod k)) mod k

• Hardware Redundancy - High level
– Triplicate or 5-copies as in space shuttle
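
The arithmetic-code identity on this slide can be turned into a simple residue check on an adder. The following is a sketch under stated assumptions: the modulus k = 3, the function names, and the injected bit-flip fault are all hypothetical choices for illustration.

```python
from typing import Callable


def exact_adder(x: int, y: int) -> int:
    """Fault-free adder."""
    return x + y


def faulty_adder(x: int, y: int) -> int:
    """Hypothetical adder whose output has bit 2 flipped (sum off by 4)."""
    return (x + y) ^ 0b100


def checked_add(a: int, b: int, k: int = 3,
                adder: Callable[[int, int], int] = exact_adder) -> int:
    """Add a and b, then verify (A + B) mod k == ((A mod k) + (B mod k)) mod k."""
    total = adder(a, b)
    expected_residue = ((a % k) + (b % k)) % k
    if total % k != expected_residue:
        raise ArithmeticError("residue check failed: adder output is erroneous")
    return total


print(checked_add(25, 17))  # residues agree, so the sum is accepted

try:
    checked_add(25, 17, adder=faulty_adder)
except ArithmeticError as exc:
    print("detected:", exc)
```

The check only detects errors that change the sum's residue. An error that is an exact multiple of k passes unnoticed, which is one reason the choice of k matters.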

Fundamental Principles (contd.)

• Software Redundancy
– Use two different programs/algorithms

• Time Redundancy
– Re-compute or redo the task and compare the results
– May or may not use the same hardware/software

• Information Redundancy
– Backup information
– Use of ECC

• Question - What kind of FT is achieved?
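
As a rough illustration of time redundancy, the sketch below runs a task twice and accepts the result only when both runs agree. Everything here is an assumption made for the example: the `TransientAdder` class, the single bit flip on its first call, and the retry limit.

```python
from typing import Callable


def run_with_time_redundancy(task: Callable[[], int], max_attempts: int = 3) -> int:
    """Run the task twice and compare; repeat the pair if the results differ."""
    for _ in range(max_attempts):
        first = task()
        second = task()
        if first == second:
            return first
    raise RuntimeError("results never agreed within the attempt limit")


class TransientAdder:
    """Hypothetical unit computing 2 + 3, hit by a transient fault on its first call."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self) -> int:
        self.calls += 1
        result = 2 + 3
        if self.calls == 1:
            result ^= 1  # transient single-bit flip, first call only
        return result


task = TransientAdder()
value = run_with_time_redundancy(task)
print(value, "after", task.calls, "calls")
```

On the slide's closing question: recomputing on the same hardware exposes a transient fault like this one, but a permanent fault that produces the same wrong result on every run would pass the comparison.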

Fault-Error-Failure concept

• Intuitive definitions
• Origins of faults
• Methods to break FEF chain
• Attribute of faults

Fault-Error-Failure concept (contd.)

Intuitive definitions

• Fault
– An anomalous physical condition caused by a manufacturing problem, fatigue, external disturbance (intentional or unintentional), design flaw, …
– Causes

• Error - Effect of activation of a fault

• Failure - Overall system effect of an error

Fault -> Error -> Failure

Fault-Error-Failure concept (contd.)

• Failure occurs when the delivered service deviates from the specified service; failures are caused by errors

• Error is the manifestation of a fault within a program or data structure

• Fault is an incorrect state of hardware or software resulting from failures of components, physical interferences from the environment, operator error or incorrect design
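
The three definitions above can be traced through a toy example. This is a sketch under assumptions not found in the slides: a one-bit memory cell with a permanent stuck-at-0 fault, and a service whose specified output is the stored bit ANDed with an enable signal. All names are hypothetical.

```python
class StuckAtZeroCell:
    """One-bit memory cell with a hypothetical permanent stuck-at-0 fault."""

    def __init__(self) -> None:
        self.stored = 0

    def write(self, bit: int) -> None:
        self.stored = 0  # the fault: the cell can never hold a 1

    def read(self) -> int:
        return self.stored


def observe(written_bit: int, enable: int) -> str:
    """Classify how far the fault propagates for one write/read cycle."""
    cell = StuckAtZeroCell()
    cell.write(written_bit)
    error = cell.read() != written_bit   # erroneous internal state
    delivered = cell.read() & enable     # service actually delivered
    specified = written_bit & enable     # service required by the specification
    if delivered != specified:
        return "fault -> error -> failure"
    if error:
        return "fault -> error (no failure: error not visible at the output)"
    return "fault dormant (not activated)"


print(observe(written_bit=0, enable=1))
print(observe(written_bit=1, enable=0))
print(observe(written_bit=1, enable=1))
```

Writing a 0 never activates the fault, writing a 1 with the output disabled produces an error that stays internal, and writing a 1 with the output enabled lets the error reach the delivered service as a failure.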

Fault-Error-Failure concept (contd.)

Causes of faults

• Specification mistakes
– Incorrect algorithms, architectures, or hardware and software design specification

• Implementation mistakes
– Process of transforming hardware and software specifications into the physical hardware and the actual software
– Poor design, poor component selection, poor construction, software coding mistakes

• Component defects
– Manufacturing imperfections, random device defects, and component wear-out

• External disturbance
– Radiation, electromagnetic interference, operator mistakes, battle damage, and environmental extremes

Fault-Error-Failure concept (contd.)

Causes of faults

[Figure: causes of faults. Boxes: Specification Mistakes, Implementation Mistakes, External Disturbances, Component Defects; Software Faults, Hardware Faults; Errors; System Failures]

Fault-Error-Failure concept (contd.)

Characteristics of faults

• Fault nature
– Specify the type of fault
    • Is the fault a hardware or a software fault?

• Fault duration
– Specify the length of time that a fault is active
    • Permanent fault
    • Transient fault
        – Appears and disappears within a very short period of time
    • Intermittent fault
        – Appears, disappears, and reappears repeatedly

• Fault extent
– Fault is localized to a given hardware or software module or globally affects the hardware, the software, or both.

• Fault value
– Determinate or indeterminate
    • Fault sensitive to either the data or time
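
The three fault durations listed above can be contrasted with a small timeline model. This is a hypothetical sketch: the discrete time steps, the `onset` and `period` parameters, and the rule that an intermittent fault recurs at a fixed period are all simplifying assumptions.

```python
from enum import Enum


class Duration(Enum):
    PERMANENT = "permanent"
    TRANSIENT = "transient"
    INTERMITTENT = "intermittent"


def is_active(kind: Duration, t: int, onset: int = 2, period: int = 4) -> bool:
    """Whether a fault of the given duration class is active at time step t."""
    if t < onset:
        return False
    if kind is Duration.PERMANENT:
        return True                      # stays active once it appears
    if kind is Duration.TRANSIENT:
        return t == onset                # active for a single step only
    return (t - onset) % period == 0     # intermittent: reappears periodically


def timeline(kind: Duration, steps: int = 10) -> str:
    """Render activity over time: 'X' = fault active, '.' = fault inactive."""
    return "".join("X" if is_active(kind, t) else "." for t in range(steps))


for kind in Duration:
    print(f"{kind.value:12s} {timeline(kind)}")
```

Real intermittent faults rarely recur at a fixed period. The regular spacing here only makes the three patterns easy to compare side by side.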