zs01_04intro

Dependable Systems: 1. Introduction

Prof. Dr. Miroslaw Malek

Wintersemester 2004/05

www.informatik.hu-berlin.de/rok/zs

DS - I - INTR - 2

An ever more Complex World

DS - I - INTR - 3

Course Activities

Lectures

Project

Presentation

Invited speakers

Conferences and workshops

Some websites:

www.dependability.org

www.paradise.caltech.edu

www.weibull.com/knowledge/rel_glossary.htm

www.crhc.uiuc.edu

www.reflexsoftware.com

DS - I - INTR - 4

Dependable Computing Systems: Topical Outline

Introduction:

Motivation

System views

Dependability rings

Dependable design methodology

Dependability concepts, measures and models:

Basic definitions

Dependability measures

Dependability models / examples

Dependability evaluation tools

Testing techniques:

Testing technique principles

Processor / Memory / Network testing

DS - I - INTR - 5

Dependable Computing Systems: Topical Outline

Topological testing

Behavior

Organization

Hierarchy

Fault diagnosis techniques

Fault detection techniques

Fault location (isolation) methods

Fault recovery and tolerance techniques (system level)

Dynamic techniques

Static techniques

Hybrid techniques

DS - I - INTR - 6

Dependable Computing Systems: Topical Outline

Fault-tolerant and fault-secure memories:

Fault-tolerant techniques in manufacturing

Replication, coding, reconfiguration

Network fault tolerance:

Computer networks

Basic techniques

Example multistage networks

Case studies:

ESS and 3B20

FTMP: Fault-tolerant multiprocessor

SIFT: Software-implemented fault tolerance

Communication controller

Fault-tolerant building block architecture

DS - I - INTR - 7

References (General) 1

Chang, H. Y., E. G. Manning and G. Metze, Fault Diagnosis in Digital Systems, Wiley-Interscience, 1970.

Friedman, A. D. and P. R. Menon, Fault Detection in Digital Circuits, Prentice-Hall, 1971.

Breuer, M. A. and A.D. Friedman, Diagnosis and Reliable Design of Digital Systems, Computer Science Press, 1976.

T. Anderson, B. Randell. Computing Systems Reliability. Cambridge University Press, 1979. 482p. CSR.

Kraft, G. D. and W. N. Toy, Microprogrammed Control and Reliable Design of Small Computers, Prentice-Hall, 1981.

DS - I - INTR - 8


Anderson, T. and P.A. Lee, Fault Tolerance Principles and Practice, Prentice-Hall, 1982.

Siewiorek, D.P. and R. S. Swarz, The Theory and Practice of Reliable Systems Design, Digital Press, 1982 & 1995.

Lala, P.K., Fault Tolerant and Fault Testable Hardware Design, Prentice-Hall International, 1985.

Pradhan, D. K. (ed.), Fault Tolerant Computing: Theory and Techniques, Vols. I and II, Prentice-Hall, 1986.

Avizienis, A., H. Kopetz and J. C. Laprie (eds.), The Evolution of Fault-Tolerant Computing, Springer-Verlag, 1987.

DS - I - INTR - 9


T. Anderson. Safe and Secure Computing Systems. Oxford, Blackwell Scientific, 1989.

Johnson, B. W., Design and Analysis of Fault Tolerant Digital Systems, Addison-Wesley, 1989.

Negrini, R., M. G. Sami and R. Stefanelli, Fault Tolerance Through Reconfiguration in VLSI and WSI Arrays, MIT Press, 1989.

Laprie, J. C. (ed.), Dependable Computing and Fault-Tolerant Systems, Vol. 5: Dependability: Basic Concepts and Terminology, Springer-Verlag Wien New York, 1992.

DS - I - INTR - 10


Landwehr, C. E., B. Randell, L. Simoncini (eds.), Dependable Computing and Fault-Tolerant Systems, Vol. 8, Dependable Computing for Critical Applications 3, Springer-Verlag Wien New York, 1993.

M. Banatre, P. A. Lee. Hardware and software architectures for fault tolerance, Lecture Notes in Computer Science, 774, Springer-Verlag, 1994.

Koob, G. M. and C. G. Lau (eds.), Foundations of Dependable Computing, System Implementation, Kluwer Academic Publishers, 1994.

DS - I - INTR - 11


Koob, G. M. and C. G. Lau (eds.), Foundations of Dependable Computing, Paradigms for Dependable Applications, Kluwer Academic Publishers, 1994.

Koob, G. M. and C. G. Lau (eds.), Foundations of Dependable Computing, Models and Frameworks for Dependable Systems, Kluwer Academic Publishers, 1994.

Malek, M. (ed.), Responsive Computing, Kluwer Academic Publishers, 1994.

Fussell, D. S. and M. Malek (eds.), Responsive Computer Systems: Steps Toward Fault-Tolerant Real-Time Systems, Kluwer Academic Publishers, 1995.

DS - I - INTR - 12


Cristian, F., G. Le Lann and T. Lunt (eds.), Dependable Computing and Fault-Tolerant Systems, Vol. 9, Dependable Computing for Critical Applications 4, Springer-Verlag Wien New York, 1995.

Dhiraj K. Pradhan, Fault-Tolerant Computer System Design, Prentice-Hall, 1996.

Birman K. P., Building Secure and Reliable Network Applications, Prentice-Hall and Manning Publishing Company, 1997

A. A. Shvartsman, Fault-Tolerant Parallel Computation, Kluwer, 1997

S. Montenegro, Sichere und fehlertolerante Steuerungen, Hanser München, 1999.

DS - I - INTR - 13


F. Redmill, T. Anderson. Towards System Safety. Springer-Verlag, 1999.

W. Schneeweiss, Die Fehlerbaum-Methode, LiLoLe-Verlag, 1999

S. Krakowiak, S. K. Shrivastava. Recent Advances in Distributed Systems, Berlin, Lecture Notes in Computer Science, 1752, Springer-Verlag, 2000.

F. Redmill and T. Anderson. Lessons in System Safety. Springer-Verlag, 2000

Myers, G. J., Software Reliability Principles and Practice, Wiley-Interscience, 1976.

DS - I - INTR - 14

References (Reliability Evaluation)

Trivedi, K. S., Probability and Statistics with Reliability, Queuing, and Computer Science Applications, Prentice-Hall, 1982.

Ascher, H. and H. Feingold, Repairable Systems Reliability, Marcel Dekker, 1984.

Musa, J. D., A. Iannino and K. Okumoto, Software Reliability: Measurement, Prediction, Application, McGraw-Hill, 1987.

W. Schneeweiss, Petri Nets for Reliability Modeling, LiLoLe, 1999

DS - I - INTR - 15

References (E-Commerce and Internet)

Daniel A. Menasce, Virgilio A. F. Almeida, Capacity Planning for Web Performance: Metrics, Models, and Methods, Prentice Hall, 1998

F.J Kauffels, E-Business, Taschenbuch, 1998

Eric Siegel, Designing Quality of Service Solutions for the Enterprise, John Wiley & Sons, 1999

Daniel A. Menasce, Virgilio A. F. Almeida, Scaling for E-Business: Technologies, Models, Performance, and Capacity Planning, Prentice Hall, 2000

Wasim Rajput, E-Commerce Systems Architecture and Applications, Artech House, 2000

DS - I - INTR - 16

References (Coding)

Sellers, E. F., M. Y. Hsiao and L. W. Bearnson, Error Detecting Logic for Digital Computers, McGraw-Hill, 1968.

Peterson, W. and E. Weldon, Error-Correcting Codes (2nd ed.), MIT Press, 1972.

Wakerly, J., Error Detecting Codes, Self-Checking Circuits and Applications, The Computer Science Library, 1978.

Lin, S. and D. J. Costello, Error Control Coding: Fundamentals and Applications, Prentice-Hall, 1983.

Nagle, H. T., J. D. Irwin and D. Hoffman, Error Detecting and Correcting Codes for Computer Scientist and Engineers, MacMillan Publishers, 1986.

DS - I - INTR - 17

References (Software) 1

Myers, G. J., The Art of Software Testing, Wiley-Interscience, 1979.

Deutsch, M. D., Software Verification and Validation, Prentice-Hall, 1982.

Shooman, M. L., Software Engineering, McGraw-Hill, 1983.

Beizer, B., Software Testing Techniques, Van Nostrand Reinhold, 1983.

Bernstein, P. A., V. Hadzilacos and N. Goodman, Concurrency Control and Recovery in Database Systems, Addison-Wesley, 1987.

DS - I - INTR - 18

References (Software) 2

Neufelder, A. M., Ensuring Software Reliability, Marcel Dekker Inc., 1993.

Lyu, M. R. (ed.), Software Fault Tolerance, John Wiley and Sons, 1995.

Lyu, M. R. (ed.), Handbook of Software Reliability Engineering, Computer Science Press, 1995.

DS - I - INTR - 19

References (Journals)

Special Issues

IEEE Trans. on Computers

IEEE Trans. on Reliability

IEEE Trans. on Software Engineering

IEEE Trans. on Parallel and Distributed Systems

Computer

Design and Test

Electronics

Computer Design

Journal of Electronic Testing: Theory and Applications

Journal of Parallel and Distributed Computing

DS - I - INTR - 20

References (Conference Proceedings) 1

Fault-Tolerant Computing Symposium (since 2000 International Conference on Dependable Systems and Networks)

European Dependable Computing Conference

Symposium on Reliable Distributed Systems

Reliability and Maintainability Symposium

Reliability in Distributed Software and Database Systems Symposium

Test Conference

International Conference on Computer Safety, Reliability and Security

DS - I - INTR - 21

References (Conference Proceedings) 2

International Symposium on Software Reliability Engineering

Pacific Rim International Symposium on Dependable Computing

Distributed Computing Systems Conference

Parallel Processing Conference

Real-Time Systems Symposium

Computer Architecture Symposium

http://liinwww.ira.uka.de/bibliography (over 1.1 M refs)

DS - I - INTR - 22

Introduction

Objectives:

To motivate the need for dependable systems

To introduce various views of computer systems and their relations to computer system dependability

To present basic concepts and approaches

To introduce dependable design methodology

DS - I - INTR - 23

Introduction

Contents:

Motivation

System views

System dependability concepts

Approaches to dependable design

Dependability rings

Dependable design methodology

DS - I - INTR - 24

A Stunning Prediction

[Figure: prediction over 5-10 years. Labels: Hardware, Software, Spare Parts, Software + Upgrades, Liveware, Cost of damage. Values shown: 50 %, 91 %.]

DS - I - INTR - 25

Types of Systems

Dependable (reliable) system

A system which delivers a required service during its lifetime

Fault-tolerant computer system

A system that has the capability to continue the correct execution of its programs and input/output functions in the presence of faults

DS - I - INTR - 26

Types of Systems

Real-time computer system

A system that delivers service to a user within a specified deadline (physical time, duration, etc.)

Responsive computer system

A fault-tolerant real-time system that delivers satisfactory service in a timely manner

DS - I - INTR - 27

Motivation

Economic necessity

Ever-growing reliance

Life saving

Novice users

Harsh environments

More complex systems

DS - I - INTR - 28

Device Reliability and System Reliability

[Figure: Mean Time Between Failures (MTBF) in years, on a logarithmic scale from 10^1 to 10^6, plotted against the years 1950 to 2010. Curves: Equivalent Device Reliability, System Reliability, Minimum Acceptable Reliability. Technology generations along the time axis: relays, vacuum tubes, semiconductors, SSI, MSI, LSI, VLSI.]
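One way to read the gap between device reliability and system reliability is the series-system model. The Python sketch below is only an illustration under assumptions that are not on the slide: the system fails as soon as any one of n identical, independent devices fails, and every device has a constant failure rate. Under those assumptions the failure rates add, so the system MTBF is the device MTBF divided by n. The function name and the numbers are hypothetical.

```python
def series_system_mtbf(device_mtbf_years: float, n_devices: int) -> float:
    """MTBF of a series system built from n identical, independent devices.

    Assumption (not from the slide): constant failure rates, so the
    device failure rates add and system MTBF = device MTBF / n.
    """
    if device_mtbf_years <= 0 or n_devices <= 0:
        raise ValueError("MTBF and device count must be positive")
    return device_mtbf_years / n_devices


if __name__ == "__main__":
    # Illustrative numbers only: each device has an MTBF of 10^6 years.
    for n in (1, 1_000, 1_000_000):
        print(f"{n:>9} devices -> system MTBF {series_system_mtbf(1e6, n):g} years")
```

Even very reliable devices yield a modest system MTBF once a million of them must all work, which is one motivation for fault tolerance at the system level.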

DS - I - INTR - 29

Dependability-Performance Tradeoff

[Figure: availability (ticks from 0.9 to 0.9999 and higher) plotted against throughput in MFLOPS (10^1 to 10^8), locating three classes of systems: Massively Parallel / Distributed Systems, Commercial Fault-Tolerant Systems, and Ultra Reliable Systems.]

DS - I - INTR - 30

Examples

E-commerce systems

Air traffic control

Flight systems

Communication systems (telephone and internet)

Banking systems

Defense systems

Airline seat reservations

Household appliances

Video games

DS - I - INTR - 31

View 1: System Life Cycle

System constraints

Obsolescence, needs, new technology

Concept formulation

System specification

Design

Prototype

Production

Installation

Operational life

Modification and retirement

Note that testing, verification or validation should occur after every phase of the life cycle.

Very few tools exist, and only for some steps of the cycle.

DS - I - INTR - 32

View 2: Packaging Levels of Integration

Applications

Applications modules

Special-purpose languages

Standard languages

Operating systems

Cabinets/frames

Boxes/cages

Printed circuit boards/cards, wafers, TCMs

Integrated circuits (Chips)

Dependability must be considered at every level.

System decomposition (partitioning) may have a significant impact on dependability.

DS - I - INTR - 33

View 3: Workload View

[Figure: workload of liveware and of hardware/software, divided into preparation, useful work, semi-useful work, fault servicing and idling.]

Eliminate idling and use it for testing to improve dependability

DS - I - INTR - 34

View 4: Levels of Abstraction 1

Level: PMS
Components: processors, memories, switches, links (networks), controllers, ALUs, I/Os

Level: Program
Sublevel: HLL, ISP (Instruction Set)
Components: software, memory state, processor state, effective address calculation, instruction decode, instruction execution

Level: Logic
Sublevel: Register transfer level (RTL)
Components: data paths, registers, data operators, control (hardwired), microprogramming (microstore)

DS - I - INTR - 35

View 4: Levels of Abstraction 2

Level: Circuit
Components: transistors, resistors, capacitors, inductors, power sources, diodes

Level: Quantum & electromagnetic
Components: disks, tapes

DS - I - INTR - 36

View 5: Computer System 1

HARDWARE

CPUs

I/O devices, memories

Interconnection networks

FIRMWARE

Microprogram & microprogramming systems

LIVEWARE

Maintenance personnel

Operators

System Designers

System Analysts

Programmers

Users

SOFTWARE

Packages

Assemblers

Compilers

Operating systems

Utility programs

Debugging programs

File processing programs

DS - I - INTR - 37

View 5: Computer System 2

Faults are attributed to:

Hardware: 20% - 65%

Software: 20% - 80%

People: 15% - 40%

AT&T's figures (hardware-software-people): 20-40-40% (2/3 applications + 1/3 operating system)

DS - I - INTR - 38

View 6: The Six Phases

Warning: If you do not follow dependable design methodology you may end up with the following scenario!

Six phases of a project

1. Enthusiasm

2. Disillusionment

3. Panic and hysteria

4. Search for the guilty

5. Punishment of the innocent

6. Praise and awards for the non-participants

(Author unknown; found in one of the computer companies)

DS - I - INTR - 39

System Dependability Concepts

Availability

Instantaneous availability is the probability that a system is performing correctly at time t; for non-repairable systems it equals the reliability:

A(t) = R(t)

Steady-state availability is the probability that a system will be operational at any random point of time and is expressed as the fraction of time a system is operational during its expected lifetime

$A_s = \frac{\text{UPTIME}}{\text{LIFETIME}}$
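As a worked example of the uptime/lifetime ratio, the following Python sketch computes steady-state availability and the downtime per year it implies. The function names, the units (hours) and the sample numbers are illustrative assumptions, not part of the course material.

```python
HOURS_PER_YEAR = 8760.0  # 365 days; leap years ignored for simplicity


def steady_state_availability(uptime_hours: float, lifetime_hours: float) -> float:
    """Fraction of the expected lifetime during which the system is operational."""
    if lifetime_hours <= 0:
        raise ValueError("lifetime must be positive")
    if not 0 <= uptime_hours <= lifetime_hours:
        raise ValueError("uptime must lie between 0 and the lifetime")
    return uptime_hours / lifetime_hours


def downtime_hours_per_year(availability: float) -> float:
    """Expected downtime per year implied by a steady-state availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must lie between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical system: up for 9,990 of 10,000 hours.
    a = steady_state_availability(9_990, 10_000)
    print(f"availability = {a:.4f}")
    print(f"downtime per year = {downtime_hours_per_year(a):.2f} hours")
```

Expressing availability as downtime per year often makes the difference between, say, 0.999 and 0.9999 easier to judge.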

DS - I - INTR - 40

System Dependability Concepts

Survivability

The probability that a system will deliver the required service in the presence of an a priori defined set of faults or any subset of it

Reliability

The conditional probability that the system will perform its intended function without failure throughout the interval [0, t], provided it was fully operational at time t = 0
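The definition of reliability does not fix a failure law. A common textbook choice, used here purely as an assumption, is a constant failure rate λ, which gives R(t) = exp(-λt) and a mean time to failure of 1/λ. The Python sketch below evaluates this model; the function names and the failure rate are hypothetical.

```python
import math


def reliability(t_hours: float, failure_rate_per_hour: float) -> float:
    """R(t) = exp(-lambda * t), assuming a constant failure rate lambda.

    The constant-rate (exponential) model is an assumption made for this
    sketch; the definition of reliability itself does not require it.
    """
    if t_hours < 0 or failure_rate_per_hour < 0:
        raise ValueError("time and failure rate must be non-negative")
    return math.exp(-failure_rate_per_hour * t_hours)


def mean_time_to_failure(failure_rate_per_hour: float) -> float:
    """MTTF = 1 / lambda under the same constant-rate assumption."""
    if failure_rate_per_hour <= 0:
        raise ValueError("failure rate must be positive")
    return 1.0 / failure_rate_per_hour


if __name__ == "__main__":
    rate = 1e-4  # hypothetical: one failure per 10,000 hours on average
    for t in (0, 1_000, 10_000):
        print(f"R({t}) = {reliability(t, rate):.4f}")
    print(f"MTTF = {mean_time_to_failure(rate):.0f} hours")
```

Note that R(0) = 1 matches the condition that the system is fully operational at t = 0, and that under this model only about 37% of systems survive to t = MTTF.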

DS - I - INTR - 41

Approaches

Fault intolerance

Fault tolerance

Maintainability

Hardware/software trade-offs

DS - I - INTR - 42

Hardware Continuum

Hardware instructions, with example machines:

Integer arithmetic, Add/Sub: M6800

Integer arithmetic, Mul/Div: MC68000

Floating-point arithmetic: VAX-11/780, IBM-30XX

Vector processing: Cray-YMP, Hitachi

Multiprocessing (e.g., submachine set-up): Systolic arrays, Grid, reconfigurable or experimental multicomputers

DS - I - INTR - 43

Software Continuum

SOFTWARE

VERTICAL MIGRATION is the transfer of a function's implementation from software to firmware and/or hardware, or vice versa.

Vertical Migration improves performance and dependability, and reduces cost.

DS - I - INTR - 44

Dependability (Reliability) Rings For Fault Tolerance

Dependability rings, each closed by an acceptance test:

Logic level

Register-transfer level

System hardware

Operating system, languages and application

Each dependability ring should provide measures and mechanisms for fault tolerance (detection, location, testability and recovery)
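The ring structure can be sketched as an ordered list of levels, each paired with an acceptance test. The Python sketch below assumes (this is not stated on the slide) that rings are checked in the listed order and that checking stops at the first ring whose test fails, which also locates the fault at that level. The acceptance tests are placeholder lambdas and the function name is hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# A ring is a level name plus an acceptance test returning True on success.
Ring = Tuple[str, Callable[[], bool]]


def first_failing_ring(rings: List[Ring]) -> Optional[str]:
    """Run the acceptance tests in order.

    Returns the name of the first ring whose test fails, or None if
    every ring passes. Stopping at the first failure is an assumption
    of this sketch.
    """
    for name, acceptance_test in rings:
        if not acceptance_test():
            return name
    return None


# Placeholder tests; a real system would run diagnostics at each level.
rings: List[Ring] = [
    ("logic level", lambda: True),
    ("register-transfer level", lambda: True),
    ("system hardware", lambda: False),  # simulated fault
    ("operating system, languages and application", lambda: True),
]

if __name__ == "__main__":
    failed = first_failing_ring(rings)
    print("all rings passed" if failed is None else f"fault located in: {failed}")
```

In a fuller design each ring would also carry its own recovery action, so that a failed acceptance test triggers recovery at the level where the fault was detected.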

DS - I - INTR - 45

Bootstrap Test Rings in a Multicomputer System

Processor

Memories

Network

Test rings

Diagnostic and maintenance processor(s)

DS - I - INTR - 46

Dependable Design Methodology

Identify fault classes, fault latency and fault impact

Determine qualitative and quantitative specifications for fault tolerance and evaluate your design in its specific environment

Identify weak spots and assess potential damage

Decompose the system

Develop fault and error detection techniques and algorithms

Develop fault isolation techniques and algorithms

Develop recovery/reintegration/restart

Evaluate degree of fault tolerance

Refine, iterate for improvement; try to eliminate weak spots and minimize potential damage

DS - I - INTR - 47

Real-time Systems Design

Identify time-critical tasks and specify their timing (deadlines, durations, frequency, periodicity, if any). Characterize the system load and environment.

Characterize timing of a system (hardware and software).

Map timing specification onto a system timing (find the best resource allocation and scheduling methods), and incorporate concurrent monitoring.

Verify and validate the design for quantitative and qualitative specifications.

Refine, iterate and fine-tune the design.
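A first sanity check when mapping a timing specification onto system timing is processor utilization. The Python sketch below assumes independent periodic tasks on a single processor, with each deadline equal to the task period; none of these assumptions come from the slide, and the task set is invented. A total utilization above 1 means no scheduler can meet all deadlines. For preemptive earliest-deadline-first scheduling under these assumptions, a utilization of at most 1 is also sufficient.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PeriodicTask:
    name: str
    duration_ms: float  # worst-case execution time per activation
    period_ms: float    # activation period; deadline assumed equal to period


def utilization(tasks: List[PeriodicTask]) -> float:
    """Sum of duration/period over all tasks (single processor)."""
    return sum(task.duration_ms / task.period_ms for task in tasks)


def passes_utilization_check(tasks: List[PeriodicTask]) -> bool:
    """Necessary condition for schedulability: utilization must not exceed 1."""
    return utilization(tasks) <= 1.0


# Hypothetical task set for illustration.
tasks = [
    PeriodicTask("sensor", duration_ms=2, period_ms=10),
    PeriodicTask("control", duration_ms=5, period_ms=20),
    PeriodicTask("logging", duration_ms=10, period_ms=100),
]

if __name__ == "__main__":
    print(f"utilization = {utilization(tasks):.2f}")
    print("passes check" if passes_utilization_check(tasks) else "overloaded")
```

The check is deliberately coarse: it ignores resource sharing, precedence constraints and the overhead of concurrent monitoring, all of which the verification and validation step would still have to cover.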

DS - I - INTR - 48

Responsive System Design 1

Determine qualitative and quantitative specifications for fault tolerance and task timeliness which meet user requirements.

Determine system timing (hardware and software), and assess damage, availability and responsiveness.

Develop and time fault and error detection techniques and algorithms.

Develop and time fault isolation techniques and algorithms.

Develop and time recovery/reintegration/restart.

DS - I - INTR - 49

Responsive System Design 2

Map timing specification onto system timing under appropriate assumptions and incorporate concurrent monitoring.

Evaluate responsiveness.

Refine and iterate for improvement.

Responsive systems need architects of space and architects of time.
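Timing the detection, isolation and recovery steps makes a simple budget check possible. The Python sketch below assumes a single fault per task activation and that the recovery time already includes any re-execution; both assumptions, the function names and the numbers are illustrative and not taken from the slides. A task is treated as responsive if its worst-case time, including fault handling, still fits within the deadline.

```python
def worst_case_response_ms(
    execution_ms: float,
    detection_ms: float,
    isolation_ms: float,
    recovery_ms: float,
) -> float:
    """Worst-case response time when one fault occurs during the task.

    Assumption: a single fault per activation, and recovery_ms includes
    any re-execution needed after recovery.
    """
    return execution_ms + detection_ms + isolation_ms + recovery_ms


def meets_deadline_despite_fault(
    execution_ms: float,
    detection_ms: float,
    isolation_ms: float,
    recovery_ms: float,
    deadline_ms: float,
) -> bool:
    """True if the task still meets its deadline when a fault is handled."""
    total = worst_case_response_ms(execution_ms, detection_ms, isolation_ms, recovery_ms)
    return total <= deadline_ms


if __name__ == "__main__":
    # Hypothetical timings in milliseconds.
    print(meets_deadline_despite_fault(40, 5, 5, 20, deadline_ms=100))
    print(meets_deadline_despite_fault(40, 5, 5, 60, deadline_ms=100))
```

The first call fits the 100 ms deadline with 30 ms to spare; the second does not, which would send the design back to the refine-and-iterate step.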

DS - I - INTR - 50

References 1

C. G. Bell, J. C. Mudge and J. E. McNamara, "Seven Views of Computer Systems", Chapter 1 in Computer Engineering by the same authors, Digital Press, 1978.

G.J. Lipovski and M. Malek, Parallel Computing: Theory and Comparisons, Wiley-Interscience, New York, 1987.

M. Malek, "Parallel Computer Systems Testing and Integration", in Testing and Diagnosis of VLSI and LSI, M. G. Sami and F. Lombardi (eds.), Kluwer, 1988.

DS - I - INTR - 51

References 2

Pankaj Jalote, Fault Tolerance in Distributed Systems, Prentice-Hall, 1994

Dhiraj K. Pradhan, Fault-Tolerant Computer System Design, Prentice-Hall, 1996.

Birman K. P., Building Secure and Reliable Network Applications, Prentice-Hall and Manning Publishing Company, 1997
