Dependable Systems 1. Introduction Prof. Dr. Miroslaw Malek Winter semester 2004/05 www.informatik.hu-berlin.de/rok/zs

  • Dependable Systems 1. Introduction

    Prof. Dr. Miroslaw Malek

    Winter semester 2004/05

    www.informatik.hu-berlin.de/rok/zs

  • DS - I - INTR - 2

    An ever more Complex World

  • DS - I - INTR - 3

    Course Activities

    Lectures

    Project

    Presentation

    Invited speakers

    Conferences and workshops

    Some websites: www.dependability.org

    www.paradise.caltech.edu

    www.weibull.com/knowledge/rel_glossary.htm

    www.crhc.uiuc.edu

    www.reflexsoftware.com

  • DS - I - INTR - 4

    Course Activities

    Introduction

        Motivation

        System views

        Dependability rings

        Dependable design methodology

    Dependability concepts, measures and models

        Basic definitions

        Dependability measures

        Dependability models / examples

        Dependability evaluation tools

    Testing techniques

        Testing techniques principles

        Processor / Memory / Network testing

  • DS - I - INTR - 5

    Dependable Computing Systems: Topical Outline

    Topological testing

    Behavior

    Organization

    Hierarchy

    Fault diagnosis techniques

    Fault detection techniques

    Fault location (isolation) methods

    Fault recovery and tolerance techniques (system level)

    Dynamic techniques

    Static techniques

    Hybrid techniques

  • DS - I - INTR - 6

    Dependable Computing Systems: Topical Outline

    Fault-tolerant and fault-secure memories

        Fault-tolerant techniques in manufacturing

        Replication, coding, reconfiguration

    Network fault tolerance

        Computer networks

        Basic techniques

        Example multistage networks

    Case studies

        ESS and 3B20

        FTMP: Fault-tolerant multiprocessor

        SIFT: Software-implemented fault tolerance

        Communication controller

        Fault-tolerant building block architecture

  • DS - I - INTR - 7

    References (General) 1

    Chang, H. Y., E. G. Manning and G. Metze, Fault Diagnosis in Digital Systems, Wiley-Interscience, 1970.

    Friedman, A. D. and P. R. Menon, Fault Detection in Digital Circuits, Prentice-Hall, 1971.

    Breuer, M. A. and A.D. Friedman, Diagnosis and Reliable Design of Digital Systems, Computer Science Press, 1976.

    T. Anderson, B. Randell. Computing Systems Reliability. Cambridge University Press, 1979. 482p. CSR.

    Kraft, G. D. and W. N. Toy, Microprogrammed Control and Reliable Design of Small Computers, Prentice-Hall, 1981.

  • DS - I - INTR - 8

    References (General) 2

    Anderson, T. and P.A. Lee, Fault Tolerance Principles and Practice, Prentice-Hall, 1982.

    Siewiorek, D.P. and R. S. Swarz, The Theory and Practice of Reliable Systems Design, Digital Press, 1982 & 1995.

    Lala, P.K., Fault Tolerant and Fault Testable Hardware Design, Prentice-Hall International, 1985.

    Pradhan, D. K. (ed.), Fault Tolerant Computing: Theory and Techniques, Vols. I and II, Prentice-Hall, 1986.

    Avizienis, A., H. Kopetz and J. C. Laprie (eds.), The Evolution of Fault-Tolerant Computing, Springer-Verlag, 1987.

  • DS - I - INTR - 9

    References (General) 3

    T. Anderson. Safe and Secure Computing Systems. Oxford, Blackwell Scientific, 1989.

    Johnson, B. W., Design and Analysis of Fault Tolerant Digital Systems, Addison-Wesley, 1989.

    Negrini, R., M. G. Sami and R. Stefanelli, Fault Tolerance Through Reconfiguration in VLSI and WSI Arrays, MIT Press, 1989.

    Laprie, J. C. (ed.), Dependable Computing and Fault-Tolerant Systems, Vol. 5: Dependability: Basic Concepts and Terminology, Springer-Verlag Wien New York, 1992.

  • DS - I - INTR - 10

    References (General) 4

    Landwehr, C. E., B. Randell, L. Simoncini (eds.), Dependable Computing and Fault-Tolerant Systems, Vol. 8, Dependable Computing for Critical Applications 3, Springer-Verlag Wien New York, 1993.

    M. Banatre, P. A. Lee. Hardware and software architectures for fault tolerance, Lecture Notes in Computer Science, 774, Springer-Verlag, 1994.

    Koob, G. M. and C. G. Lau (eds.), Foundations of Dependable Computing, System Implementation, Kluwer Academic Publishers, 1994.

  • DS - I - INTR - 11

    References (General) 5

    Koob, G. M. and C. G. Lau (eds.), Foundations of Dependable Computing, Paradigms for Dependable Applications, Kluwer Academic Publishers, 1994.

    Koob, G. M. and C. G. Lau (eds.), Foundations of Dependable Computing, Models and Frameworks for Dependable Systems, Kluwer Academic Publishers, 1994.

    Malek, M. (ed.), Responsive Computing, Kluwer Academic Publishers, 1994.

    Fussell, D. S. and M. Malek (eds.), Responsive Computer Systems: Steps Toward Fault-Tolerant Real-Time Systems, Kluwer Academic Publishers, 1995.

  • DS - I - INTR - 12

    References (General) 6

    Cristian, F., G. Le Lann and T. Lunt (eds.), Dependable Computing and Fault-Tolerant Systems, Vol. 9, Dependable Computing for Critical Applications 4, Springer-Verlag Wien New York, 1995.

    Dhiraj K. Pradhan, Fault-Tolerant Computer System Design, Prentice-Hall, 1996.

    Birman K. P., Building Secure and Reliable Network Applications, Prentice-Hall and Manning Publishing Company, 1997

    A. A. Shvartsman, Fault-Tolerant Parallel Computation, Kluwer, 1997

    S. Montenegro, Sichere und fehlertolerante Steuerungen, Hanser, München, 1999.

  • DS - I - INTR - 13

    References (General) 7

    F. Redmill, T. Anderson. Towards System Safety. Springer-Verlag, 1999.

    W. Schneeweiss, Die Fehlerbaum-Methode, LiLoLe-Verlag, 1999

    S. Krakowiak, S. K. Shrivastava. Recent Advances in Distributed Systems, Berlin, Lecture Notes in Computer Science, 1752, Springer-Verlag, 2000.

    F. Redmill and T. Anderson. Lessons in System Safety. Springer-Verlag, 2000

    Myers, G. J., Software Reliability Principles and Practice, Wiley-Interscience, 1976.

  • DS - I - INTR - 14

    References (Reliability Evaluation)

    Trivedi, K. S., Probability and Statistics with Reliability, Queuing and Computer Science Applications, Prentice-Hall, 1982.

    Ascher, H. and H. Feingold, Repairable Systems Reliability, Marcel Dekker, 1984.

    Musa, J. D., A. Iannino and K. Okumoto, Software Reliability: Measurement, Prediction, Application, McGraw-Hill, 1987.

    W. Schneeweiss, Petri Nets for Reliability Modeling, LiLoLe, 1999

  • DS - I - INTR - 15

    References (E-Commerce and Internet)

    Daniel A. Menasce, Virgilio A. F. Almeida, Capacity Planning for Web Performance : Metrics, Models, and Methods, Prentice Hall, 1998

    F.J Kauffels, E-Business, Taschenbuch, 1998

    Eric Siegel, Designing Quality of Service Solutions for the Enterprise, John Wiley & Sons, 1999

    Daniel A. Menasce, Virgilio A. F. Almeida, Scaling for E-Business: Technologies, Models, Performance, and Capacity Planning, Prentice Hall, 2000

    Wasim Rajput, E-Commerce Systems Architecture and Applications, Artech House, 2000

  • DS - I - INTR - 16

    References (Coding)

    Sellers, E. F., M. Y. Hsiao and L. W. Bearnson, Error Detecting Logic for Digital Computers, McGraw-Hill, 1968.

    Peterson, W. W. and E. J. Weldon, Error-Correcting Codes (2nd ed.), MIT Press, 1972.

    Wakerly, J., Error Detecting Codes, Self-Checking Circuits and Applications, The Computer Science Library, 1978.

    Lin, S. and D. J. Costello, Error Control Coding: Fundamentals and Applications, Prentice-Hall, 1983.

    Nagle, H. T., J. D. Irwin and D. Hoffman, Error Detecting and Correcting Codes for Computer Scientist and Engineers, MacMillan Publishers, 1986.

  • DS - I - INTR - 17

    References (Software) 1

    Myers, G. J., The Art of Software Testing, Wiley-Interscience, 1979.

    Deutsch, M. D., Software Verification and Validation, Prentice-Hall, 1982.

    Shooman, M. L., Software Engineering, McGraw-Hill, 1983.

    Beizer, B., Software Testing Techniques, Van Nostrand Reinhold, 1983.

    Bernstein, P. A., V. Hadzilacos and N. Goodman, Concurrency Control and Recovery in Database Systems, Addison-Wesley, 1987.

  • DS - I - INTR - 18

    References (Software) 2

    Neufelder, A. M., Ensuring Software Reliability, Marcel Dekker Inc., 1993.

    Lyu, M. R. (ed.), Software Fault Tolerance, John Wiley and Sons, 1995.

    Lyu, M. R. (ed.), Handbook of Software Reliability Engineering, Computer Science Press, 1995.

  • DS - I - INTR - 19

    References (Journals)

    Special Issues

    IEEE Trans. on Computers

    IEEE Trans. on Reliability

    IEEE Trans. on Software Engineering

    IEEE Trans. on Parallel and Distributed Systems

    Computer

    Design and Test

    Electronics

    Computer Design

    Journal of Electronic Testing: Theory and Applications

    Journal of Parallel and Distributed Computing

  • DS - I - INTR - 20

    References (Conference Proceedings) 1

    Fault-Tolerant Computing Symposium (since 2000 International Conference on Dependable Systems and Networks)

    European Dependable Computing Conference

    Symposium on Reliable Distributed Systems

    Reliability and Maintainability Symposium

    Reliability in Distributed Software and Database Systems Symposium

    Test Conference

    International Conference on Computer Safety, Reliability and Security

  • DS - I - INTR - 21

    References (Conference Proceedings) 2

    International Symposium on Software Reliability Engineering

    Pacific Rim International Symposium on Dependable Computing

    Distributed Computing Systems Conference

    Parallel Processing Conference

    Real-Time Systems Symposium

    Computer Architecture Symposium

    http://liinwww.ira.uka.de/bibliography (over 1.1 M refs)

  • DS - I - INTR - 22

    Introduction

    Objectives:

    To motivate dependable systems

    To introduce various views of computer systems and their relations to computer system dependability

    To present basic concepts and approaches

    To introduce dependable design methodology

  • DS - I - INTR - 23

    Introduction

    Contents:

    Motivation

    System views

    System dependability concepts

    Approaches to dependable design

    Dependability rings

    Dependable design methodology

  • DS - I - INTR - 24

    A Stunning Prediction

    [Figure: shares of hardware, software, spare parts, software + upgrades, liveware and cost of damage; values shown: 50 % and 91 %; time frame: 5-10 years.]

  • DS - I - INTR - 25

    Types of Systems

    Dependable (reliable) system

    A system which delivers a required service during its lifetime

    Fault-tolerant computer system

    A system that has the capability to continue the correct execution of its programs and input/output functions in the presence of faults

  • DS - I - INTR - 26

    Types of Systems

    Real-time computer system:

    A system that delivers service to a user within a specified deadline (physical time, duration, etc.)

    Responsive computer system:

    A fault-tolerant real-time system that delivers satisfactory service in a timely manner

  • DS - I - INTR - 27

    Motivation

    Economic necessity

    Ever-growing reliance

    Life saving

    Novice users

    Harsh environments

    More complex systems

  • DS - I - INTR - 28

    Device Reliability and System Reliability

    [Figure: mean time between failures (MTBF) in years, on a logarithmic axis from 10^1 to 10^6, plotted against the years 1950 to 2010. Curves: equivalent device reliability, system reliability and minimum acceptable reliability. Technology eras along the time axis: relays, vacuum tubes, semiconductors, SSI, MSI, LSI, VLSI.]
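
    The gap between device reliability and system reliability can be made concrete with a small calculation. The sketch below is not from the lecture. It assumes a series system (any single device failure fails the system), independent devices and constant failure rates, so that failure rates simply add. The function names and the numbers are hypothetical.

```python
def series_system_mtbf(component_mtbfs_years):
    """MTBF of a series system, assuming independent components with
    constant failure rates (assumption, not stated on the slide)."""
    if not component_mtbfs_years:
        raise ValueError("at least one component is required")
    # With constant failure rates, the rate of each component is 1 / MTBF,
    # and the rates of a series system add up.
    total_failure_rate = sum(1.0 / mtbf for mtbf in component_mtbfs_years)
    return 1.0 / total_failure_rate


def required_device_mtbf(device_count, target_system_mtbf_years):
    """Device MTBF needed so that a series system of identical devices
    reaches the target system MTBF (same assumptions as above)."""
    return device_count * target_system_mtbf_years


# Hypothetical example: 1000 identical devices, each with an MTBF of 10^4 years.
system_mtbf = series_system_mtbf([1e4] * 1000)
print(f"system MTBF: {system_mtbf:.1f} years")

# Hypothetical example: device MTBF needed for 10^6 devices and a 1-year system MTBF.
print(f"required device MTBF: {required_device_mtbf(1_000_000, 1.0):.0f} years")
```

    Under these assumptions the system MTBF shrinks in proportion to the number of devices, which is one reason why growing system complexity calls for fault tolerance rather than better devices alone.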

  • DS - I - INTR - 29

    Dependability-Performance Tradeoff

    [Figure: availability (axis marks 0.9, 0.99, 0.999, 0.9999) plotted against throughput in MFLOPS (logarithmic axis from 10^1 to 10^8). Regions: ultra reliable systems, commercial fault-tolerant systems, massively parallel / distributed systems.]

  • DS - I - INTR - 30

    Examples

    E-commerce systems

    Air traffic control

    Flight systems

    Communication systems (telephone and internet)

    Banking systems

    Defense systems

    Airline seat reservations

    Household appliances

    Video games

  • DS - I - INTR - 31

    View 1: System Life Cycle

    System constraints

    Obsolescence

    Needs

    New technology

    Concept formulation

    System specification

    Design

    Prototype

    Production

    Installation

    Operational life

    Modification and retirement

    Notice that testing, verification or validation should occur after every phase of the life cycle

    Very few tools exist, and only for some steps of the cycle

  • DS - I - INTR - 32

    View 2: Packaging Levels of Integration

    Applications

    Applications modules

    Special-purpose languages

    Standard languages

    Operating systems

    Cabinets/frames

    Boxes/cages

    Printed circuit boards/cards, wafers, TCMs

    Integrated circuits (Chips)

    Dependability must be considered at every level.

    System decomposition (partitioning) may have a significant impact on dependability.

  • DS - I - INTR - 33

    View 3: Workload View

    [Figure: breakdown of time for liveware and for hardware/software into preparation, useful work, semi-useful work, fault servicing and idling.]

    Eliminate idling and use it for testing to improve dependability

  • DS - I - INTR - 34

    View 4: Levels of Abstraction 1

    Level | Sublevel | Components

    PMS | - | processors, memories, switches, links (networks), controllers, ALUs, I/Os

    Program | HLL, ISP (Instruction Set) | software, memory state, processor state, effective address calculation, instruction decode, instruction execution

    Logic | Register transfer level (RTL) | data paths, registers, data operators, control (hardwired), microprogramming (microstore)

  • DS - I - INTR - 35

    View 4: Levels of Abstraction 2

    Level | Sublevel | Components

    Circuit | - | transistors, resistors, capacitors, inductors, power sources, diodes

    Quantum & electromagnetic | - | disks, tapes

  • DS - I - INTR - 36

    View 5: Computer System 1

    HARDWARE

    CPUs

    I/O devices, memories

    Interconnection networks

    FIRMWARE

    Microprogram & microprogramming systems

    Liveware

    Maintenance personnel

    Operators

    System Designers

    System Analysts

    Programmers

    Users

    SOFTWARE

    Packages

    Assemblers

    Compilers

    Operating systems

    Utility programs

    Debugging programs

    File processing programs

  • DS - I - INTR - 37

    View 5: Computer System 2

    Faults are attributed to:

    Hardware: 20% - 65%

    Software: 20% - 80%

    People: 15% - 40%

    AT&T's figures: 20% - 40% - 40% (2/3 applications + 1/3 operating system)

  • DS - I - INTR - 38

    View 6: The Six Phases

    Warning: If you do not follow dependable design methodology you may end up with the following scenario!

    Six phases of a project

    1. Enthusiasm

    2. Disillusionment

    3. Panic and hysteria

    4. Search for the guilty

    5. Punishment of the innocent

    6. Praise and awards for the non-participants

    (Author unknown, found in one of the computer companies)

  • DS - I - INTR - 39

    System Dependability Concepts

    Availability

    Instantaneous availability A(t) is the probability that a system is performing correctly at time t. For non-repairable systems it equals the reliability:

    $A(t) = R(t)$

    Steady-state availability is the probability that a system will be operational at any random point of time and is expressed as the fraction of time a system is operational during its expected lifetime:

    $A_s = \frac{\text{UPTIME}}{\text{LIFETIME}}$
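
    As a minimal sketch of the steady-state definition (uptime divided by lifetime), the following Python snippet computes the availability and the downtime per year that it implies. The function names and the example figures are hypothetical and serve only as illustration.

```python
def steady_state_availability(uptime_hours, lifetime_hours):
    """Fraction of the expected lifetime during which the system is operational."""
    if lifetime_hours <= 0:
        raise ValueError("lifetime must be positive")
    if not 0 <= uptime_hours <= lifetime_hours:
        raise ValueError("uptime must lie between 0 and the lifetime")
    return uptime_hours / lifetime_hours


def downtime_minutes_per_year(availability):
    """Expected downtime per (365-day) year implied by a given availability."""
    minutes_per_year = 365 * 24 * 60
    return (1.0 - availability) * minutes_per_year


# Hypothetical example: a 10-year lifetime with 87.6 hours of total downtime.
lifetime = 10 * 365 * 24            # 87600 hours
uptime = lifetime - 87.6
a_s = steady_state_availability(uptime, lifetime)
print(f"A_s = {a_s:.4f}")
print(f"downtime per year = {downtime_minutes_per_year(a_s):.1f} minutes")
```

    Expressing availability as downtime per year is often easier to compare against user requirements than the raw probability.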

  • DS - I - INTR - 40

    System Dependability Concepts

    Survivability

    The probability that a system will deliver the required service in the presence of an a priori defined set of faults or any of its subsets

    Reliability

    The conditional probability that the system will perform its intended function without failure up to time t, provided it was fully operational at time t = 0
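
    The reliability definition above does not fix a particular model. A common textbook choice, used here purely as an assumption, is a constant failure rate, which gives the exponential model R(t) = exp(-lambda * t) with a mean time to failure of 1 / lambda. The sketch below evaluates that model. The function names and the failure rate are hypothetical.

```python
import math


def reliability(failure_rate_per_hour, t_hours):
    """R(t) under the assumed constant-failure-rate (exponential) model."""
    if failure_rate_per_hour < 0 or t_hours < 0:
        raise ValueError("failure rate and time must be non-negative")
    return math.exp(-failure_rate_per_hour * t_hours)


def mean_time_to_failure(failure_rate_per_hour):
    """MTTF of the exponential model: the reciprocal of the failure rate."""
    if failure_rate_per_hour <= 0:
        raise ValueError("failure rate must be positive")
    return 1.0 / failure_rate_per_hour


# Hypothetical failure rate: one failure per 10,000 hours on average.
rate = 1e-4
print(f"R(0)      = {reliability(rate, 0):.4f}")       # fully operational at t = 0
print(f"R(1000 h) = {reliability(rate, 1000):.4f}")
print(f"R(MTTF)   = {reliability(rate, mean_time_to_failure(rate)):.4f}")
```

    Under this model the system survives up to its MTTF with a probability of only about 37 percent, so MTTF alone says little about short-mission reliability.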

  • DS - I - INTR - 41

    Approaches

    Fault intolerance

    Fault tolerance

    Maintainability

    Hardware/software trade-offs

  • DS - I - INTR - 42

    Hardware Continuum

    Hardware

    Instructions | Examples

    Integer arithmetic Add/Sub | M6800

    Mul/Div | MC68000

    Floating-point arithmetic | VAX-11/780, IBM-30XX

    Vector processing | Cray-YMP, Hitachi

    Multiprocessing (e.g., submachine set-up) | Systolic arrays, Grid, reconfigurable or experimental multicomputers

  • DS - I - INTR - 43

    Software Continuum

    SOFTWARE

    VERTICAL MIGRATION is the transfer of the implementation of functions from software to firmware and/or hardware, or vice versa.

    Vertical migration can improve performance and dependability, and reduce cost.

  • DS - I - INTR - 44

    Dependability (Reliability) Rings For Fault Tolerance

    Logic level

    Acceptance test

    Register-transfer level

    Acceptance test

    System hardware

    Acceptance test

    Operating system, languages and application

    Acceptance test

    [Figure label: dependability rings]

    Each dependability ring should provide measures and mechanisms for fault tolerance (detection, location, testability and recovery)
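
    The slide does not say how an acceptance test is implemented. One possible reading, offered only as an assumption, is the recovery-block style: run a primary routine, check its result with an acceptance test, and fall back to an alternate if the check fails. The sketch below illustrates that idea at the application level. All names are hypothetical, and state restoration between attempts is omitted because the example routines have no side effects.

```python
import math


class AllAlternatesFailed(Exception):
    """Raised when no alternate produces a result that passes the test."""


def run_with_acceptance_test(alternates, acceptance_test):
    """Try each alternate in order and return the first accepted result."""
    for alternate in alternates:
        try:
            result = alternate()
        except Exception:
            # An exception counts as a detected error: try the next alternate.
            continue
        if acceptance_test(result):
            return result
    raise AllAlternatesFailed("no alternate passed the acceptance test")


x = 2.0


def faulty_sqrt():
    # Hypothetical faulty primary: returns a wrong value.
    return -1.0


def backup_sqrt():
    # Hypothetical alternate: uses the standard library.
    return math.sqrt(x)


def sqrt_is_acceptable(r):
    # Acceptance test: the result must be non-negative and square back to x.
    return r >= 0 and abs(r * r - x) < 1e-9


result = run_with_acceptance_test([faulty_sqrt, backup_sqrt], sqrt_is_acceptable)
print(f"accepted result: {result:.6f}")
```

    The acceptance test covers detection, the ordered list of alternates covers recovery, and the index of the rejected alternate gives a coarse form of fault location.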

  • DS - I - INTR - 45

    Bootstrap Test Rings in a Multicomputer System

    Processor

    Memories

    Network

    Test rings

    Diagnostic and maintenance processor(s)

  • DS - I - INTR - 46

    Dependable Design Methodology

    Identify fault classes, fault latency and fault impact

    Determine qualitative and quantitative specs for fault tolerance and evaluate your design in specific environment

    Identify weak spots and assess potential damage

    Decompose the system

    Develop fault and error detection techniques and algorithms

    Develop fault isolation techniques and algorithms

    Develop recovery/reintegration/restart

    Evaluate degree of fault tolerance

    Refine, iterate for improvement; try to eliminate weak spots and minimize potential damage

  • DS - I - INTR - 47

    Real-time Systems Design

    Identify time-critical tasks and specify their timing (deadlines, durations, frequency, periodicity, if any). Characterize the system load and environment.

    Characterize timing of a system (hardware and software).

    Map timing specification onto a system timing (find the best resource allocation and scheduling methods), and incorporate concurrent monitoring.

    Verify and validate the design for quantitative and qualitative specifications.

    Refine, iterate and fine-tune the design.
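
    The step of mapping a timing specification onto system timing can be illustrated with a schedulability check. The slide names no particular method. The sketch below swaps in the classical utilization bound for rate-monotonic scheduling, which assumes independent periodic tasks, deadlines equal to periods, and preemptive fixed-priority scheduling on a single processor. The task set and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeriodicTask:
    name: str
    wcet_ms: float      # worst-case execution time
    period_ms: float    # period, assumed equal to the deadline


def utilization(tasks):
    """Total processor utilization of the task set."""
    return sum(task.wcet_ms / task.period_ms for task in tasks)


def rm_utilization_bound(task_count):
    """Utilization bound n * (2^(1/n) - 1) for rate-monotonic scheduling."""
    return task_count * (2 ** (1.0 / task_count) - 1)


def passes_rm_bound(tasks):
    """Sufficient (not necessary) test: True means all deadlines are met
    under the stated assumptions; False means the test is inconclusive."""
    if not tasks:
        return True
    return utilization(tasks) <= rm_utilization_bound(len(tasks))


# Hypothetical time-critical task set.
tasks = [
    PeriodicTask("sensor_read", wcet_ms=1, period_ms=4),
    PeriodicTask("control_law", wcet_ms=2, period_ms=8),
    PeriodicTask("telemetry", wcet_ms=1, period_ms=10),
]
print(f"utilization = {utilization(tasks):.3f}")
print(f"bound       = {rm_utilization_bound(len(tasks)):.3f}")
print(f"passes      = {passes_rm_bound(tasks)}")
```

    A task set that fails this test is not necessarily unschedulable. An exact analysis, or a different scheduling method, would be the next step in the refine-and-iterate loop.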

  • DS - I - INTR - 48

    Responsive System Design 1

    Determine qualitative and quantitative specifications for fault tolerance and task timeliness which meet user requirements.

    Determine system timing (hardware and software), and assess damage, availability and responsiveness.

    Develop and time fault and error detection techniques and algorithms.

    Develop and time fault isolation techniques and algorithms.

    Develop and time recovery/reintegration/restart.

  • DS - I - INTR - 49

    Responsive System Design 2

    Map timing specification onto system timing under appropriate assumptions and incorporate concurrent monitoring.

    Evaluate responsiveness.

    Refine and iterate for improvement.

    Responsive systems need architects of space and architects of time.

  • DS - I - INTR - 50

    References 1

    C. G. Bell, J. C. Mudge and J. E. McNamara, "Seven Views of Computer Systems", Chapter 1 in the book by the same authors titled Computer Engineering, Digital Press, 1978.

    G. J. Lipovski and M. Malek, Parallel Computing: Theory and Comparisons, Wiley-Interscience, New York, 1987.

    M. Malek, "Parallel Computer Systems Testing and Integration", in the book titled Testing and Diagnosis of VLSI and LSI, M. G. Sami and F. Lombardi (eds.), Kluwer, 1988.

  • DS - I - INTR - 51

    References 2

    Pankaj Jalote, Fault Tolerance in Distributed Systems, Prentice-Hall, 1994.

    Dhiraj K. Pradhan, Fault-Tolerant Computer System Design, Prentice-Hall, 1996.

    Birman K. P., Building Secure and Reliable Network Applications, Prentice-Hall and Manning Publishing Company, 1997.