safety in large technology systems - university of michigandelittle/technology failure... · 2004....

Safety in large technology systems Technology Residential College October 13, 1999 Dan Little


Page 1: Safety in large technology systems - University of Michigandelittle/technology failure... · 2004. 8. 18. · Technology failure • sources of failure – management failures

Safety in large technology systems

• Technology Residential College
• October 13, 1999
• Dan Little

Technology failure

• Why do large, complex systems sometimes fail so spectacularly? Do the easy explanations of “operator error,” “faulty technology,” or “complexity” suffice? Are there managerial causes of technology failure? Are there design principles and engineering protocols that can enhance large system safety? What is the role of software in safety and failure?

Surprising failures

• Franco-Prussian war, Israeli intelligence failure in Yom Kippur war

• The Mercedes A-Class and the moose test

• Chernobyl nuclear power meltdown

Therac-25

• high energies
• computer control rather than electro-mechanical control
• positioning the turntable: x-ray beam flattener
• 15,000 rad administered rather than 200 rad

Causes of failure

• complexity and multiple causal pathways and relations
• defective procedures
• defective training systems
• “human” error
• faulty design

Technology failure

• sources of failure
  – management failures
  – design failures
  – proliferating random failures
  – “storming” the system
• design for “soft landings”
• crisis management

Information and decision-making

• information flow and management of complex technology systems
• complex organizations pursue multiple objectives simultaneously
• complex organizations pursue the same objective along different and conflicting paths

Sources of potential failure

• hardware interlocks replaced with software checks on turntable position
• cryptic malfunction codes; frequent messages
• excessive operator confidence in safety systems
• lack of effective mechanism for reporting and investigating failures

• poor software engineering practices;

Causes of failure

• “The causes of accidents are frequently, if not almost always, rooted in the organization--its culture, management, and structure. These factors are all critical to the eventual safety of the engineered system” (Leveson, 47).

Organizational factors

• “Large-scale engineered systems are more than just a collection of technological artifacts: They are a reflection of the structure, management, procedures, and culture of the engineering organization that created them, and they are also, usually, a reflection of the society in which they were created” (Leveson, 47).

Advice for better software design

• design for the worst case
• avoid “single point of failure” designs
• design “defensively”
• investigate failures carefully and extensively
• look for “root cause,” not symptom or specific transient cause
• embed audit trails; design for simplicity
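
A minimal sketch of two items from this list, "design defensively" and "embed audit trails." The class name `DoseController`, the method `request_dose`, and the 200 rad ceiling are hypothetical and chosen only to echo the Therac-25 slide; this is not a description of any real controller.

```python
from dataclasses import dataclass, field


@dataclass
class DoseController:
    """Hypothetical controller: every request is range-checked and logged."""

    max_dose_rad: float = 200.0  # assumed ceiling, for illustration only
    audit_trail: list = field(default_factory=list)

    def request_dose(self, dose_rad: float) -> bool:
        # Defensive check: accept only values inside the explicitly allowed
        # range. NaN fails both comparisons, so it is rejected as well.
        allowed = 0.0 < dose_rad <= self.max_dose_rad
        # Audit trail: record every request, whether accepted or rejected,
        # so a later investigation can look for root causes.
        self.audit_trail.append((dose_rad, "accepted" if allowed else "rejected"))
        return allowed


controller = DoseController()
controller.request_dose(150.0)    # within range
controller.request_dose(15000.0)  # far outside range, refused and logged
```

The point of the sketch is that the refusal and the record happen in the same place, so a rejected request can never go unlogged.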

Design for safety

• hazard elimination
• hazard reduction
• hazard control
• damage reduction

System safety

• builds in safety, not simply adding it on to a completed design
• deals with systems as a whole rather than subsystems or components
• takes a larger view of hazards than just failures
• emphasizes analysis rather than past experience and standards

System safety (2)

• emphasizes qualitative rather than quantitative approaches
• recognizes the importance of tradeoffs and conflicts in system design

• more than just system engineering

Hazard analysis

• development: identify and assess potential hazards
• operations: examine an existing system to improve its safety
• licensing: examine a planned system to demonstrate acceptable safety to a regulatory authority

Hazard analysis (2)

• construct an exhaustive inventory of hazards early in design
• classify by severity and probability
• construct causal pathways that lead to hazards
• design so as to eliminate, reduce, control, or ameliorate
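
The "classify by severity and probability" step can be sketched as a small ranking table. The category labels, the numeric scores, and the three example hazards below are illustrative assumptions, not taken from the slides; a real analysis would use the categories its own standard prescribes.

```python
from dataclasses import dataclass

# Assumed ordinal scales; higher numbers mean worse.
SEVERITY = {"negligible": 1, "marginal": 2, "critical": 3, "catastrophic": 4}
LIKELIHOOD = {"improbable": 1, "remote": 2, "occasional": 3, "frequent": 4}


@dataclass(frozen=True)
class Hazard:
    name: str
    severity: str
    likelihood: str

    @property
    def rank(self) -> int:
        # Simple severity-times-likelihood score, one of several possible rules.
        return SEVERITY[self.severity] * LIKELIHOOD[self.likelihood]


def prioritize(hazards):
    """Sort worst first; ties on rank are broken by severity."""
    return sorted(
        hazards,
        key=lambda h: (h.rank, SEVERITY[h.severity]),
        reverse=True,
    )


# Hypothetical inventory entries.
inventory = [
    Hazard("cryptic error message", "marginal", "frequent"),
    Hazard("radiation overdose", "catastrophic", "remote"),
    Hazard("display flicker", "negligible", "occasional"),
]
for hazard in prioritize(inventory):
    print(hazard.rank, hazard.name)
```

Breaking ties by severity reflects one possible design choice: when two hazards score the same, the one with the worse outcome is examined first.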

Safe software design

• control software should be designed with maximum simplicity (408)
• design should be testable; limited number of states
• avoid multitasking, use polling rather than interrupts
• design should be easily readable and understood
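
One way to picture "limited number of states" and "polling rather than interrupts" is a single sequential loop over an explicit transition table. The states, event names, and the rule that any unlisted transition latches a fault are assumptions made for this sketch.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    READY = "ready"
    FIRING = "firing"
    FAULT = "fault"


# Every legal transition is listed here; anything else is treated as a fault.
TRANSITIONS = {
    (State.IDLE, "arm"): State.READY,
    (State.READY, "disarm"): State.IDLE,
    (State.READY, "fire"): State.FIRING,
    (State.FIRING, "done"): State.IDLE,
}


def step(state: State, event: str) -> State:
    if state is State.FAULT:
        return State.FAULT  # a fault is latched until an operator intervenes
    return TRANSITIONS.get((state, event), State.FAULT)


def run(events) -> State:
    """Poll events one at a time in a single loop: no threads, no interrupts."""
    state = State.IDLE
    for event in events:
        state = step(state, event)
    return state
```

With four states and four legal transitions, the whole behavior fits in one table, which is what makes exhaustive testing and review by reading practical.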

Safe software (2)

• interactions between components should be limited and straightforward
• worst-case timing should be determinable by review of code
• code should include only the minimum features and capabilities required by the system; no unnecessary or undocumented features

Safe software (3)

• critical decisions (launch a missile) should not be made on values often taken by failed components -- 0 or 1.
• messages should be designed to eliminate the possibility of computer hardware failures having hazardous consequences (missile launch example)
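
A hedged sketch of the idea that a critical decision should not hinge on 0 or 1: the command is encoded as a multi-bit pattern, and its opposite as the complement, so a stuck-at-zero or stuck-at-one line decodes to neither. The two 16-bit code words and the function names are hypothetical.

```python
# Hypothetical 16-bit command words, chosen for illustration only.
PROCEED_CODE = 0xA5C3
HOLD_CODE = 0x5A3C  # bitwise complement of PROCEED_CODE within 16 bits


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def decode(word: int) -> str:
    """Map a received word to a command; anything unexpected is 'invalid'."""
    if word == PROCEED_CODE:
        return "proceed"
    if word == HOLD_CODE:
        return "hold"
    # Typical failed-hardware values such as 0x0000, 0x0001, or 0xFFFF end up
    # here and are never interpreted as a command.
    return "invalid"
```

Because the two code words differ in every bit position, a handful of flipped bits cannot turn one valid command into the other.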

Safe software (4)

• strive for maximal decoupling of parts of a software control system
• accidents in tightly coupled systems are a result of unplanned interactions
• the flexibility of software encourages coupling and multiple functions; important to resist this impulse.

Safe software (5)

• “Adding computers to potentially dangerous systems is likely to increase accidents unless extra care is put into system design” (411).

Human interface considerations

• unambiguous error messages (Therac-25)
• operator needs extensive knowledge about the “theory” of the system
• alarms need to be comprehensible (TMI); spurious alarms minimized
• operator needs knowledge about timing and sequencing of events
• design of control board is critical

Control panel anomalies

Risk assessment and prediction

• What is involved in assessing risk?
  – probability of failure
  – prediction of consequences of failure
  – failure pathways
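
As a rough illustration of how "probability of failure" and "failure pathways" combine, the sketch below evaluates a tiny pathway with AND and OR gates. It assumes the component failures are statistically independent, which real systems often violate; the function names and the numbers are made up.

```python
from math import prod


def p_all(probs):
    """AND gate: every listed failure must occur (independence assumed)."""
    return prod(probs)


def p_any(probs):
    """OR gate: at least one listed failure occurs (independence assumed)."""
    return 1.0 - prod(1.0 - p for p in probs)


# Hypothetical pathway: the hazard needs (sensor A fails OR sensor B fails)
# AND the operator misses the alarm. All probabilities are invented.
p_sensor_path = p_any([0.5, 0.5])
p_hazard = p_all([p_sensor_path, 0.5])
print(p_hazard)
```

The independence assumption is the weak point of such arithmetic: a shared cause, such as one management decision affecting several components, makes the real probability higher than the product suggests.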

Reasoning about risk

• How should we reason about risk?
• Expected utility: probability of outcome × utility of outcome
• Probability and science
• How to anticipate failure scenarios?
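
The expected-utility rule on this slide, written out as a short calculation. The function name and the example numbers are invented for illustration.

```python
import math


def expected_utility(outcomes):
    """Sum of probability x utility over (probability, utility) pairs."""
    outcomes = list(outcomes)
    total_probability = sum(p for p, _ in outcomes)
    if not math.isclose(total_probability, 1.0):
        raise ValueError("probabilities must sum to 1")
    return sum(p * u for p, u in outcomes)


# Made-up option: 50% chance of gaining 10 units, 50% chance of losing 4.
option = [(0.5, 10.0), (0.5, -4.0)]
print(expected_utility(option))  # 0.5 * 10 + 0.5 * (-4) = 3.0
```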

Compare scenarios

• nuclear power vs coal power
• automated highway system vs routine traffic accidents

Ordinary reasoning and judgment

• well-known “fallacies” of ordinary reasoning:
  – time preference
  – framing
  – risk aversion

Large risks and small risks

• the decision-theory approach: minimize expected harms
• the decision-making reality: large harms are more difficult to absorb, even if smaller in overall consequence

• example: JR West railway
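
The tension between the two bullets above can be shown with two invented risks: one frequent and small, one rare and large. The `Risk` class and all numbers are assumptions for this sketch and say nothing about the JR West case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    probability: float  # chance the harm occurs in a given period
    harm: float         # size of the harm if it occurs, arbitrary units

    @property
    def expected_harm(self) -> float:
        return self.probability * self.harm


frequent_small = Risk("frequent, small", probability=0.1, harm=10.0)
rare_large = Risk("rare, large", probability=0.001, harm=500.0)
options = [frequent_small, rare_large]

# Decision-theory rule: accept the risk with the lower expected harm.
by_expected_harm = min(options, key=lambda r: r.expected_harm)
# Worst-case rule: accept the risk whose single worst outcome is smaller.
by_worst_case = min(options, key=lambda r: r.harm)

print(by_expected_harm.name)  # the rare, large risk (expected harm 0.5 vs 1.0)
print(by_worst_case.name)     # the frequent, small risk (worst case 10 vs 500)
```

The two rules pick different options from the same data, which is the gap between decision theory and decision-making practice that the slide points to.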

Scope and limits of simulations

• Computer simulations permit “experiments” on different scenarios presented to complex systems
• Simulations are not reality
• Simulations represent some factors and exclude others
• Simulations rely on a mathematization of the process that may be approximate or even false.