-
Dependable Systems: 1. Introduction
Prof. Dr. Miroslaw Malek
Wintersemester 2004/05
www.informatik.hu-berlin.de/rok/zs
-
DS - I - INTR - 2
An ever more Complex World
-
DS - I - INTR - 3
Course Activities
Lectures
Project
Presentation
Invited speakers
Conferences and workshops
Some websites:
www.dependability.org
www.paradise.caltech.edu
www.weibull.com/knowledge/rel_glossary.htm
www.crhc.uiuc.edu
www.reflexsoftware.com
-
DS - I - INTR - 4
Course Activities
Introduction
  Motivation
  System views
  Dependability rings
  Dependable design methodology
Dependability concepts, measures and models
  Basic definitions
  Dependability measures
  Dependability models / examples
  Dependability evaluation tools
Testing techniques
  Testing techniques principles
  Processor / Memory / Network testing
-
DS - I - INTR - 5
Dependable Computing Systems Topical Outline:
Topological testing
  Behavior
  Organization
  Hierarchy
Fault diagnosis techniques
  Fault detection techniques
  Fault location (isolation) methods
Fault recovery and tolerance techniques (system level)
  Dynamic techniques
  Static techniques
  Hybrid techniques
-
DS - I - INTR - 6
Dependable Computing Systems Topical Outline:
Fault-tolerant and fault-secure memories
  Fault-tolerant techniques in manufacturing
  Replication, coding, reconfiguration
Network fault tolerance
  Computer networks
  Basic techniques
  Example multistage networks
Case studies
  ESS and 3B20
  FTMP: Fault-tolerant multiprocessor
  SIFT: Software-implemented fault tolerance
  Communication controller
  Fault-tolerant building block architecture
-
DS - I - INTR - 7
References (General) 1
Chang, H. Y., E. G. Manning and G. Metze, Fault Diagnosis in Digital Systems, Wiley-Interscience, 1970.
Friedman, A. D. and P. R. Menon, Fault Detection in Digital Circuits, Prentice-Hall, 1971.
Breuer, M. A. and A.D. Friedman, Diagnosis and Reliable Design of Digital Systems, Computer Science Press, 1976.
T. Anderson, B. Randell. Computing Systems Reliability. Cambridge University Press, 1979. 482p. CSR.
Kraft, G. D. and W. N. Toy, Microprogrammed Control and Reliable Design of Small Computers, Prentice-Hall, 1981.
-
DS - I - INTR - 8
References (General) 2
Anderson, T. and P.A. Lee, Fault Tolerance Principles and Practice, Prentice-Hall, 1982.
Siewiorek, D.P. and R. S. Swarz, The Theory and Practice of Reliable Systems Design, Digital Press, 1982 & 1995.
Lala, P.K., Fault Tolerant and Fault Testable Hardware Design, Prentice-Hall International, 1985.
Pradhan, D. K. (ed.), Fault Tolerant Computing: Theory and Techniques, Vols. I and II, Prentice-Hall, 1986.
Avizienis, A., H. Kopetz and J. C. Laprie (eds.), The Evolution of Fault-Tolerant Computing, Springer-Verlag, 1987.
-
DS - I - INTR - 9
References (General) 3
T. Anderson. Safe and Secure Computing Systems. Oxford, Blackwell Scientific, 1989.
Johnson, B. W., Design and Analysis of Fault Tolerant Digital Systems, Addison-Wesley, 1989.
Negrini, R., M. G. Sami and R. Stefanelli, Fault Tolerance Through Reconfiguration in VLSI and WSI Arrays, MIT Press, 1989.
Laprie, J. C. (ed.), Dependable Computing and Fault-Tolerant Systems, Vol. 5: Dependability: Basic Concepts and Terminology, Springer-Verlag Wien New York, 1992.
-
DS - I - INTR - 10
References (General) 4
Landwehr, C. E., B. Randell, L. Simoncini (eds.), Dependable Computing and Fault-Tolerant Systems, Vol. 8, Dependable Computing for Critical Applications 3, Springer-Verlag Wien New York, 1993.
M. Banatre, P. A. Lee. Hardware and software architectures for fault tolerance, Lecture Notes in Computer Science, 774, Springer-Verlag, 1994.
Koob, G. M. and C. G. Lau (eds.), Foundations of Dependable Computing, System Implementation, Kluwer Academic Publishers, 1994.
-
DS - I - INTR - 11
References (General) 5
Koob, G. M. and C. G. Lau (eds.), Foundations of Dependable Computing, Paradigms for Dependable Applications, Kluwer Academic Publishers, 1994.
Koob, G. M. and C. G. Lau (eds.), Foundations of Dependable Computing, Models and Frameworks for Dependable Systems, Kluwer Academic Publishers, 1994.
Malek, M. (ed.), Responsive Computing, Kluwer Academic Publishers, 1994.
Fussell, D. S. and M. Malek (eds.), Responsive Computer Systems, Steps Toward Fault-Tolerant Real-Time Systems, Kluwer Academic Publishers, 1995.
-
DS - I - INTR - 12
References (General) 6
Cristian, F., G. Le Lann and T. Lunt (eds.), Dependable Computing and Fault-Tolerant Systems, Vol. 9, Dependable Computing for Critical Applications 4, Springer-Verlag Wien New York, 1995.
Dhiraj K. Pradhan, Fault-Tolerant Computer System Design, Prentice-Hall, 1996.
Birman, K. P., Building Secure and Reliable Network Applications, Prentice-Hall and Manning Publishing Company, 1997.
A. A. Shvartsman, Fault-Tolerant Parallel Computation, Kluwer, 1997.
S. Montenegro, Sichere und fehlertolerante Steuerungen, Hanser, München, 1999.
-
DS - I - INTR - 13
References (General) 7
F. Redmill, T. Anderson. Towards System Safety. Springer-Verlag, 1999.
W. Schneeweiss, Die Fehlerbaum-Methode, LiLoLe-Verlag, 1999
S. Krakowiak, S. K. Shrivastava. Recent Advances in Distributed Systems, Berlin, Lecture Notes in Computer Science, 1752, Springer-Verlag, 2000.
F. Redmill and T. Anderson. Lessons in System Safety. Springer-Verlag, 2000
Myers, G. J., Software Reliability Principles and Practice, Wiley-Interscience, 1976.
-
DS - I - INTR - 14
References (Reliability Evaluation)
Trivedi, K. S., Probability and Statistics with Reliability, Queuing, and Computer Science Applications, Prentice-Hall, 1982.
Ascher, H. and H. Feingold, Repairable Systems Reliability, Marcel Dekker, 1984.
Musa, J. D., A. Iannino and K. Okumoto, Software Reliability: Measurement, Prediction, Application, McGraw-Hill, 1987.
W. Schneeweiss, Petri Nets for Reliability Modeling, LiLoLe, 1999
-
DS - I - INTR - 15
References (E-Commerce and Internet)
Daniel A. Menasce, Virgilio A. F. Almeida, Capacity Planning for Web Performance : Metrics, Models, and Methods, Prentice Hall, 1998
F.J Kauffels, E-Business, Taschenbuch, 1998
Eric Siegel, Designing Quality of Service Solutions for the Enterprise, John Wiley & Sons, 1999
Daniel A. Menasce, Virgilio A. F. Almeida, Scaling for E-Business: Technologies, Models, Performance, and Capacity Planning, Prentice Hall, 2000
Wasim Rajput, E-Commerce Systems Architecture and Applications, Artech House, 2000
-
DS - I - INTR - 16
References (Coding)
Sellers, E. F., M. Y. Hsiao and L. W. Bearnson, Error Detecting Logic for Digital Computers, McGraw-Hill, 1968.
Peterson, W. and E. Weldon, Error-Correcting Codes (2nd ed.), MIT Press, 1972.
Wakerly, J., Error Detecting Codes, Self-Checking Circuits and Applications, The Computer Science Library, 1978.
Lin, S. and D. J. Costello, Error Control Coding: Fundamentals and Applications, Prentice-Hall, 1983.
Nagle, H. T., J. D. Irwin and D. Hoffman, Error Detecting and Correcting Codes for Computer Scientist and Engineers, MacMillan Publishers, 1986.
-
DS - I - INTR - 17
References (Software) 1
Myers, G. J., The Art of Software Testing, Wiley-Interscience, 1979.
Deutsch, M. D., Software Verification and Validation, Prentice-Hall, 1982.
Shooman, M. L., Software Engineering, McGraw-Hill, 1983.
Beizer, B., Software Testing Techniques, Van Nostrand Reinhold, 1983.
Bernstein, P. A., V. Hadzilacos and N. Goodman, Concurrency Control and Recovery in Database Systems, Addison-Wesley, 1987.
-
DS - I - INTR - 18
References (Software) 2
Neufelder, A. M., Ensuring Software Reliability, Marcel Dekker Inc., 1993.
Lyu, M. R. (ed.), Software Fault Tolerance, John Wiley and Sons, 1995.
Lyu, M. R. (ed.), Handbook of Software Reliability Engineering, Computer Science Press, 1995.
-
DS - I - INTR - 19
References (Journals)
Special Issues
IEEE Trans. on Computers
IEEE Trans. on Reliability
IEEE Trans. on Software Engineering
IEEE Trans. on Parallel and Distributed Systems
Computer
Design and Test
Electronics
Computer Design
Journal of Electronic Testing: Theory and Applications
Journal of Parallel and Distributed Computing
-
DS - I - INTR - 20
References (Conference Proceedings) 1
Fault-Tolerant Computing Symposium (since 2000 International Conference on Dependable Systems and Networks)
European Dependable Computing Conference
Symposium on Reliable Distributed Systems
Reliability and Maintainability Symposium
Reliability in Distributed Software and Database Systems Symposium
Test Conference
International Conference on Computer Safety, Reliability and Security
-
DS - I - INTR - 21
References (Conference Proceedings) 2
International Symposium on Software Reliability Engineering
Pacific Rim International Symposium on Dependable Computing
Distributed Computing Systems Conference
Parallel Processing Conference
Real-Time Systems Symposium
Computer Architecture Symposium
http://liinwww.ira.uka.de/bibliography (over 1.1 M refs)
-
DS - I - INTR - 22
Introduction
Objectives:
To motivate the need for dependable systems
To introduce various views of computer systems and their relations to computer system dependability
To present basic concepts and approaches
To introduce dependable design methodology
-
DS - I - INTR - 23
Introduction
Contents:
Motivation
System views
System dependability concepts
Approaches to dependable design
Dependability rings
Dependable design methodology
-
DS - I - INTR - 24
A Stunning Prediction
[Figure: cost categories Hardware, Software, Spare Parts, Software + Upgrades, Liveware and Cost of damage; data labels 50 % and 91 %; time horizon 5-10 years]
-
DS - I - INTR - 25
Types of Systems
Dependable (reliable) system
A system which delivers a required service during its lifetime
Fault-tolerant computer system
A system that can continue the correct execution of its programs and input/output functions in the presence of faults
-
DS - I - INTR - 26
Types of Systems
Real-time computer systems:
systems that deliver service to a user within a specified deadline (physical time, duration, etc.)
Responsive computer systems:
fault-tolerant real-time systems that deliver satisfactory service in a timely manner
-
DS - I - INTR - 27
Motivation
Economic necessity
Ever-growing reliance
Life saving
Novice users
Harsh environments
More complex systems
-
DS - I - INTR - 28
Device Reliability and System Reliability
[Figure: Mean Time between Failures (MTBF) in years, log scale from 10^1 to 10^6, plotted against the years 1950 to 2010. Curves: Equivalent Device Reliability, Minimum Acceptable Reliability, System Reliability. Technology eras: Relays, Vacuum tubes, Semiconductors, SSI, MSI, LSI, VLSI]
-
DS - I - INTR - 29
Dependability-Performance Tradeoff
[Figure: Availability (ticks 0.9, 0.99, 0.999, 0.9999) plotted against Throughput (MFLOPS), log scale from 10^1 to 10^8. Regions: Massively Parallel / Distributed Systems, Commercial Fault-Tolerant Systems, Ultra Reliable Systems]
-
DS - I - INTR - 30
Examples
E-commerce systems
Air traffic control
Flight systems
Communication systems (telephone and internet)
Banking systems
Defense systems
Airline seat reservations
Household appliances
Video games
-
DS - I - INTR - 31
View 1: System Life Cycle
System constraints
Obsolescence, needs, new technology
Concept formulation
System specification
Design
Prototype
Production
Installation
Operational life
Modification and retirement
Notice that testing, verification or validation should occur after every phase of the life cycle.
Very few tools exist, and only for some steps of the cycle.
-
DS - I - INTR - 32
View 2: Packaging Levels of Integration
Applications
Application modules
Special-purpose languages
Standard languages
Operating systems
Cabinets/frames
Boxes/cages
Printed circuit boards/cards, wafers, TCMs
Integrated circuits (Chips)
Dependability must be considered at every level.
System decomposition (partitioning) may have a significant impact on dependability.
-
DS - I - INTR - 33
View 3: Workload View
[Figure: workload of Liveware and Hardware/Software divided into Preparation, Useful work, Semi-useful work, Fault servicing and Idling]
Eliminate idling: use idle time for testing to improve dependability.
-
DS - I - INTR - 34
View 4: Levels of Abstraction 1
PMS level: processors, memories, switches, links (networks), controllers, ALUs, I/Os
Program level, HLL and ISP (instruction set) sublevels: software, memory state, processor state, effective address calculation, instruction decode, instruction execution
Logic level, register transfer (RTL) sublevel: data paths, registers, data operators, control (hardwired), microprogramming (microstore)
-
DS - I - INTR - 35
View 4: Levels of Abstraction 2
Circuit level: transistors, resistors, capacitors, inductors, power sources, diodes
Quantum & electromagnetic level: disks, tapes
-
DS - I - INTR - 36
View 5: Computer System 1
HARDWARE
CPUs
I/O devices, memories
Interconnection networks
FIRMWARE
Microprogram & microprogramming systems
LIVEWARE
Maintenance personnel
Operators
System Designers
System Analysts
Programmers
Users
SOFTWARE
Packages
Assemblers
Compilers
Operating systems
Utility programs
Debugging programs
File processing programs
-
DS - I - INTR - 37
View 5: Computer System 2
Faults are attributed to:
Hardware: 20% - 65%
Software: 20% - 80%
People: 15% - 40%
AT&T's figures: 20% - 40% - 40% (2/3 applications + 1/3 operating system)
-
DS - I - INTR - 38
View 6: The Six Phases
Warning: If you do not follow dependable design methodology you may end up with the following scenario!
Six phases of a project
1. Enthusiasm
2. Disillusionment
3. Panic and hysteria
4. Search for the guilty
5. Punishment of the innocent
6. Praise and awards for the non-participants
(Author unknown; found in one of the computer companies)
-
DS - I - INTR - 39
System Dependability Concepts
Availability
Instantaneous availability is the probability that a system is performing correctly at time t. For non-repairable systems it is equal to the reliability:
$A(t) = R(t)$
Steady-state availability is the probability that a system will be operational at any random point in time. It is expressed as the fraction of time the system is operational during its expected lifetime:
$A_s = \frac{\mathrm{UPTIME}}{\mathrm{LIFETIME}}$
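The steady-state definition above can be tried out numerically. The following is a minimal Python sketch, not part of the original slides: the function names, the 8,760-hour year and the sample uptime figure are illustrative assumptions.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def steady_state_availability(uptime: float, lifetime: float) -> float:
    """Fraction of the lifetime during which the system was operational."""
    if lifetime <= 0:
        raise ValueError("lifetime must be positive")
    if not 0 <= uptime <= lifetime:
        raise ValueError("uptime must lie between 0 and lifetime")
    return uptime / lifetime


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year at a given availability."""
    if not 0 <= availability <= 1:
        raise ValueError("availability must lie between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical system that was up 8750 hours out of one year.
    a = steady_state_availability(uptime=8750.0, lifetime=float(HOURS_PER_YEAR))
    print(f"availability = {a:.5f}")
    for level in (0.9, 0.99, 0.999, 0.9999):
        print(f"{level}: {downtime_hours_per_year(level):.2f} hours down per year")
```

Reading availability as downtime per year makes the difference between neighbouring availability levels easier to compare than the raw probabilities.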
-
DS - I - INTR - 40
System Dependability Concepts
Survivability
The probability that a system will deliver the required service in the presence of an a priori defined set of faults or any subset of it
Reliability
The conditional probability that the system will perform its intended function without failure through time t, provided it was fully operational at time t = 0
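To make the reliability definition above concrete, the sketch below evaluates R(t) under one common modelling choice, a constant failure rate, which gives the exponential law R(t) = exp(-lambda * t). The slides do not state a failure law here, so the exponential model, the function names and the sample failure rate are all assumptions of this example.

```python
import math


def reliability(t: float, failure_rate: float) -> float:
    """Probability of no failure in [0, t], given the system works at t = 0.

    Assumes a constant failure rate (exponential failure law); this is an
    assumption of the sketch, not something the lecture prescribes.
    """
    if t < 0 or failure_rate < 0:
        raise ValueError("t and failure_rate must be non-negative")
    return math.exp(-failure_rate * t)


def mean_time_to_failure(failure_rate: float) -> float:
    """Mean time to failure under the same constant-rate assumption."""
    if failure_rate <= 0:
        raise ValueError("failure_rate must be positive")
    return 1.0 / failure_rate


if __name__ == "__main__":
    rate = 1e-4  # hypothetical: one failure per 10,000 hours on average
    for hours in (0, 1000, 10000):
        print(f"R({hours}) = {reliability(hours, rate):.4f}")
    print(f"MTTF = {mean_time_to_failure(rate):.0f} hours")
```

Under this model the reliability at t = 0 is 1 and it decays monotonically, which matches the condition in the definition that the system is fully operational at the start.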
-
DS - I - INTR - 41
Approaches
Fault intolerance
Fault tolerance
Maintainability
Hardware/software trade-offs
-
DS - I - INTR - 42
Hardware Continuum
Hardware instructions and example machines:
Integer arithmetic, Add/Sub: M6800
Integer arithmetic, Mul/Div: MC68000
Floating-point arithmetic: VAX-11/780, IBM-30XX
Vector processing: Cray-YMP, Hitachi
Multiprocessing (e.g., submachine set-up): Systolic arrays, Grid, reconfigurable or experimental multicomputers
-
DS - I - INTR - 43
Software Continuum
SOFTWARE
Vertical migration is the transfer of a function's implementation from software to firmware and/or hardware, or vice versa.
Vertical migration can improve performance and dependability and reduce cost.
-
DS - I - INTR - 44
Dependability (Reliability) Rings For Fault Tolerance
Logic level
Acceptance test
Register-transfer level
Acceptance test
System hardware
Acceptance test
Operating system, languages and application
Acceptance test
Dependability rings
Each dependability ring should provide measures and mechanisms for fault tolerance (detection, location, testability and recovery)
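The ring structure above can be pictured as a sequence of acceptance tests, one per ring. The sketch below is only an illustration in Python: the `Ring` type, the function name and the idea of checking rings from the innermost (logic level) outward are assumptions suggested by the order in which the slide lists the rings, and the acceptance tests themselves are stand-in callables.

```python
from typing import Callable, List, Optional, Tuple

# A ring is a (name, acceptance_test) pair; the test returns True on success.
Ring = Tuple[str, Callable[[], bool]]


def first_failing_ring(rings: List[Ring]) -> Optional[str]:
    """Run acceptance tests from the innermost ring outward.

    Returns the name of the first ring whose acceptance test fails,
    or None if every ring passes. Outer rings are not tested once an
    inner ring has failed, because they depend on it.
    """
    for name, acceptance_test in rings:
        if not acceptance_test():
            return name
    return None


if __name__ == "__main__":
    # Hypothetical acceptance tests; a real system would probe hardware
    # and software at each level instead of returning constants.
    rings: List[Ring] = [
        ("logic level", lambda: True),
        ("register-transfer level", lambda: True),
        ("system hardware", lambda: False),
        ("operating system, languages and application", lambda: True),
    ]
    print(first_failing_ring(rings))
```

Stopping at the first failure gives a coarse form of fault location: the reported ring is the lowest level at which an acceptance test detected a problem.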
-
DS - I - INTR - 45
Bootstrap Test Rings in a Multicomputer System
Processor
Memories
Network
Test rings
Diagnostic and maintenance processor(s)
-
DS - I - INTR - 46
Dependable Design Methodology
Identify fault classes, fault latency and fault impact
Determine qualitative and quantitative specs for fault tolerance and evaluate your design in its specific environment
Identify weak spots and assess potential damage
Decompose the system
Develop fault and error detection techniques and algorithms
Develop fault isolation techniques and algorithms
Develop recovery/reintegration/restart techniques
Evaluate degree of fault tolerance
Refine, iterate for improvement; try to eliminate weak spots and minimize potential damage
-
DS - I - INTR - 47
Real-time Systems Design
Identify time-critical tasks and specify their timing (deadlines, durations, frequency, periodicity, if any). Characterize the system load and environment.
Characterize the timing of the system (hardware and software).
Map the timing specification onto the system timing (find the best resource allocation and scheduling methods), and incorporate concurrent monitoring.
Verify and validate the design for quantitative and qualitative specifications.
Refine, iterate and fine-tune the design.
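The mapping step above usually includes some check that the specified task timing fits on the available processor. As one possible illustration, the Python sketch below applies the classic utilization bound for rate-monotonic scheduling of independent periodic tasks (deadline equal to period, one processor). The slides do not name a scheduling method, so rate-monotonic scheduling, the function names and the sample task set are assumptions of this example. The test is sufficient but not necessary: a task set that fails it may still be schedulable.

```python
from typing import List, Tuple

# A task is (worst_case_execution_time, period), in the same time unit.
Task = Tuple[float, float]


def utilization(tasks: List[Task]) -> float:
    """Total processor utilization: sum of execution time over period."""
    for wcet, period in tasks:
        if wcet < 0 or period <= 0:
            raise ValueError("need wcet >= 0 and period > 0")
    return sum(wcet / period for wcet, period in tasks)


def rm_utilization_bound(n: int) -> float:
    """Utilization bound n * (2**(1/n) - 1) for n periodic tasks."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return n * (2.0 ** (1.0 / n) - 1.0)


def passes_rm_test(tasks: List[Task]) -> bool:
    """True if the task set meets the sufficient utilization bound."""
    if not tasks:
        return True
    return utilization(tasks) <= rm_utilization_bound(len(tasks))


if __name__ == "__main__":
    # Hypothetical task set: (execution time, period) in milliseconds.
    tasks = [(1.0, 4.0), (1.0, 5.0), (2.0, 10.0)]
    print(f"U = {utilization(tasks):.3f}")
    print(f"bound = {rm_utilization_bound(len(tasks)):.3f}")
    print("passes" if passes_rm_test(tasks) else "inconclusive")
```

A task set that does not pass would need either an exact response-time analysis or a different allocation, which is where the refine-and-iterate step comes in.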
-
DS - I - INTR - 48
Responsive System Design 1
Determine qualitative and quantitative specifications for fault tolerance and task timeliness which meet user requirements.
Determine system timing (hardware and software); assess damage, availability and responsiveness.
Develop and time fault and error detection techniques and algorithms.
Develop and time fault isolation techniques and algorithms.
Develop and time recovery/reintegration/restart.
-
DS - I - INTR - 49
Responsive System Design 2
Map timing specification onto system timing under appropriate assumptions and incorporate concurrent monitoring.
Evaluate responsiveness.
Refine and iterate for improvement.
Responsive systems need architects of space and architects of time.
-
DS - I - INTR - 50
References 1
C. G. Bell, J. C. Mudge and J. E. McNamara, "Seven Views of Computer Systems", Chapter 1 in the book by the same authors titled Computer Engineering, Digital Press, 1978.
G.J. Lipovski and M. Malek, Parallel Computing: Theory and Comparisons, Wiley-Interscience, New York, 1987.
M. Malek, Parallel Computer Systems Testing and Integration, in the book titled Testing and Diagnosis of VLSI and LSI, M. G. Sami and F. Lombardi (eds.), Kluwer, 1988.
-
DS - I - INTR - 51
References 2
Pankaj Jalote, Fault Tolerance in Distributed Systems, Prentice-Hall, 1994.
Dhiraj K. Pradhan, Fault-Tolerant Computer System Design, Prentice-Hall, 1996.
Birman, K. P., Building Secure and Reliable Network Applications, Prentice-Hall and Manning Publishing Company, 1997.