software system safetysunnyday.mit.edu/16.355/safety.pdf · 2015. 12. 11. · mit aero/astro dept....
TRANSCRIPT
-
c
.
Nancy G. Leveson
Software System Safety
to the source is given. Abstractingwith credit is permitted.that the copies are not made or distributed for direct commercial advantage and provided that credit
Copyright by the author, July 2002.All rights reserved. Copying without fee is permitted provided
http://sunnyday.mit.edu
MIT Aero/Astro Dept. ([email protected])
-
��������� ��� ���
��������� ��� ���
���������������
���������������
Accident with No Component Failures
LC
COMPUTER
WATER
COOLING
CONDENSER
VENT
REFLUX
REACTOR
VAPOR
LA
CATALYST
GEARBOX
c
c
Caused by interactive complexity and tight coupling
Exacerbated by the introduction of computers.
Arise in interactions among components
Single or multiple component failures
Component Failure Accidents
Types of Accidents
Usually assume random failure
System Accidents
No components may have "failed"
..
-
��������� ��� ���
� �������� ��� ��������� � ��!
1. A "simple" system has a small number of unknowns in itsinteractions within the system and with its environment.
2. A system is intellectually unmanageable when the level of
3. Introducing new technology introduces unknowns and
Interactive Complexity
We seem not to trust one another as much as would bedesirable. In lieu of trusting each other, are we putting
not educating our children sufficiently well to understandtoo much trust in our technology? . . . Perhaps we are
the reasonable uses and limits of technology.
Thomas B. Sheridan
Computers and Risk
��������������"
��������������#
The underlying factor is intellectual manageability
c
c
Complexity is a moving target
interactions reaches the point where they cannot be thoroughly
planned
understood
anticipated
guarded against
even "unk−unks."
-
� �������� ��� ��������� � ��!
Limits have changed from structural integrity and physical
Build safety in by enforcing constraints on behavior
Enforce discipline and control complexity
A Possible Solution
constraints of materials to intellectual limits
Software must always open water valve before catalyst valve
catalyst is added to reactor.Water must be flowing into reflux condenser whenever
Software safety constraint:
System safety constraint:
Example (batch reactor)
� �������� ��� ��������� � ��!
Improve communication among engineers
are physical extensions of it.Displays are directly connected to process and thus
Mechanical systems
��������������$
1.
Stages in Process Control System Evolution
��������������%c
Design decisions highly constrained by:
Limited possibility of action at a distance
Physics of underlying process
Available space
Direct sensory perception of process
c
-
� �������� ��� ��������� � ��!
� �������� ��� ��������� � ��!
Capability for action at a distance
Need to provide an image of process to operators
Need to provide feedback on actions taken.
Relaxed constraints on designers but created newpossibilities for designer and operator error.
Stages in Process Control System Evolution (2)
2. Electromechanical systems
��������������&
���������������('
Stages in Process Control System Evolution (3)
c
c
3. Computer−based systems
Relaxes even more constraints and introducesmore possibility for error.
Allow multiplexing of controls and displays.
But constraints shaped environment in ways that efficientlytransmitted valuable process information and supportedcognitive processes of operators.
Finding it hard to capture and present these qualities in new systems.
-
The primary safety problem in computer−based systems
is the lack of appropriate constraints on design.
The job of the system safety engineer is to identify the
design constraints necessary to maintain safety and to
ensure the system and software design enforces them.
The Problem to be Solved
� �������� ��� ��������� � ��!
.
c �����������������
-
From a blue ribbon panel report on the V−22 Osprey problems:
From an FAA report on ATC software architectures:
Confusing Safety and Reliability
reliability."en route automation systems must posses ultra−highconsideration as a safety−critical system. Therefore,"The FAA’s en route automation meets the criteria for
) ��* ��� +����-,�� ��� � ���� � � � +
) ��* ��� +����-,�� ��� � ���� � � � +
Recommendation: Improve reliability, then verify byextensive test/fix/test in challenging environments."
.
..
to safety accordingly.their nature, and we must change our approachesAccidents in high−tech systems are changing
ReliabilitySafety
"Safety [software]: ...
.
���������������(�
���������������("
c
c
-
) ��* ��� +����-,�� ��� � ���� � � � +
) ��* ��� +����-,�� ��� � ���� � � � +
Reliability: The probability an item will perform its requiredfunction in the specified manner over a given timeperiod and under specified or assumed conditions.
Concerned primarily with failures and failure rate reduction
Parallel redundancyStandby sparingSafety factors and marginsDeratingScreeningTimed replacements
Reliability Engineering Approach to Safety
(Note: Most software−related accidents result from errors
from assumed conditions.)in specified requirements or function and deviations
���������������(#
���������������(.
c
c
Failure: Nonperformance or inability of system or componentto perform its intended function for a specified timeunder specified environmental conditions.
A basic abnormal occurrence, e.g.,
burned out bearing in a pump
relay not closing properly when voltage applied
Fault: Higher−order events, e.g.,relay closes at wrong time due to improper functioningof an upstream component.
All failures are faults but not all faults are failures.
Does Software Fail?
-
) ��* ��� +�����,��/��� � ���� � � � +
Highly reliable components are not necessarily safe.
0���-��������1�(2
all operating according to specificationOr may be caused by interactions of components
reliability analyses are based.outside parameters and time limits upon which
e.g. Accidents may be caused by equipment operation
Many accidents occur without any component ‘‘failure’’
May even decrease safety.Omits important factors in accidents.
Failure rates in hardware are quantifiable.Techniques exist to increase component reliability
Assumes accidents are the result of component failure.
Reliability Engineering Approach to Safety (2)c
-
Accidents occur when these assumptions are incorrect.
U.K. ATC software
Ariane 5
Therac−25
) ��* ��� +����-,�� ��� � ���� � � � +
) ��* ��� +����-,�� ��� � ���� � � � +
Safety and reliability are different qualities!
Are usually caused by flawed requirements
Incomplete or wrong assumptions about operation ofcontrolled system or required operation of computer.
reliable will not make it safer under these conditions.
Unhandled controlled−system states and environmentalconditions.
Merely trying to get the software ‘‘correct’’ or to make it
Software−Related Accidents
���������������($
���������������(%
COTS makes safety analysis more difficult.
c
c
One of most common factors in software−related accidents
Software contains assumptions about its environment.
Most likely to change the features embedded in orcontrolled by the software.
Software Component Reuse
-
������3������ ���4��
) ��* ��� +����-,�� ��� � ���� � � � +
throughanalysisdesign management
identificationevaluationelimination control
A planned, disciplined, and systematic approach topreventing or reducing accidents throughout the lifecycle of a system.
‘‘Organized common sense ’’ (Mueller, 1968)
Primary concern is the management of hazards:
System Safety
MIL−STD−882
���������������(&
����������������'
Hazard
c
what is specified in requirements.Software has unintended (and unsafe) behavior beyond
Requirements do not specify some particular behavior
behavior unsafe from a system perspective.Correctly implements requirements but specified
required for system safety (incomplete)
Software may be highly reliable and ‘‘correct’’ and stillbe unsafe.
Software−Related Accidents (con’t.)
c
-
Process Steps
1. Perform a Preliminary Hazard Analysis
2. Perform a System Hazard Analysis (not just Failure Analysis) Identifies potential causes of hazards
3. Identify appropriate design constraints on system, software,
4. Design at system level to eliminate or control hazards.
5. Trace unresolved hazards and system hazard controls tosoftware requirements.
and humans.
������3������ ���4��
������3������ ���4��
Produces hazard list
Change analysis
Verification
Hazard resolution
Hazard identification
Operations
Operational feedback
Development
System Safety (2)
�����������������Management
c
c
���������������5�
Design
4. Minimize damage.3. Control the hazard if it occurs.2. Prevent or minimize the occurrence of the hazard1. Eliminate the hazard
Hazard resolution precedence:
developmentConceptual
throughout system development and use.Hazard analysis and control is a continuous, iterative process
-
Process Steps (2)
6.
Human factors analyses (usability, workload, etc.)
Mode confusion and other human error analyses
Robustness (environment) analysis
Software hazard analysis
Simulation and animation
Software requirements review and analysis
������3������ ���4��
Derive from system hazard analysis
Specifying Safety Constraints
What must not do is not inverse of what must do
Need to specify what software must NOT do
Need to specify off−nominal behavior
Most software requirements only specify nominal behavior
������3������ ���4��
Completeness
����������������"
����������������#
c
c
-
9.
Process Steps (4)
Periodic audits
Performance monitoring
Incident and accident analysis
Change analysis
Operational Analysis and Auditing
������3������ ���4��
8.
7.
Off−nominal and safety testing
Exception−handling etc.
Elimination of unnecessary functions
Separation of critical functions
Assertions and run−time checking
Defensive programming
Implementation with safety in mind
������3������ ���4��
Process Steps (3)
����������������.
����������������2
c
c
-
687:9=5=�? @�A5BC7
Other Human FactorsEvaluation (workload, situationawareness, etc.)
Performance Monitoring
Hazard List
Usability Analysis
Task Allocation Principles
Training Requirements
DFE�GIH(JKD�L E�L E�M
Operator Goals and
Operator Task and
Responsibilities
NKOQPQO H(RFSTG�R O L MFE�UVFW DFXQY V�ZF[\V R�]�DF^QL Z JH_DO Y O D�E�GIX Z Sa` Z E�R�EFH
bIZ GFR W D�EFGIR�^QD W c D�H_R Z `�RFJ�D�H Z J
d L R W GIH_R O H_L EFM�e�L E O H_D W W DFH(L Z EFe
f `FR�J�DFH(L Z E O
Fault Tree Analysis
g:9�hi9�ji@lknmpo5q
Simulation/Experiments
Change Analysis
Incident and accident analysis
Periodic audits
Performance MonitoringChange Analysis
Periodic audits
Preliminary Hazard Analysis
System Hazard Analysis
Safety Verification
Operational AnalysisOperational Analysis
Operator Task Analysis
Safety Requirements andConstraints
Completeness/ConsistencyAnalysis
State Machine Hazard Analysis
Deviation Analysis (FMECA)
Mode Confusion Analysis
Human Error Analysis
Timing and other analyses
Safety TestingSoftware FTA
Simulation and Animation
Preliminary Task Analysis
Engineering
DFE�GIG�R O L M�EIX Z E O H_J�DFL EFH OZ `�R�JKD�H(L Z EFD W JKR�r c L JKR�SIR�E�H Os RFE�R�JKD�H(R O�PQO H_R�STDFE�G
t RFJ�L u(L XQDFH(L Z E
DFE�G Z `�RFJ�D�H Z JvSID�E c D W OGFLO ` W D PQO eFH(JKD�L E�L EFMISaDFH(R�JKL D W O eXZ SI` Z E�RFE�H O e�X Z E�H(J Z�W O D�E�Gw R O L M�EID�E�GIX Z E O H_J c X�H
RFE�^QL J Z E�SIR�E�H_D W D O�O c SI`�H_L Z E Ox G�RFE�H(L u P\OQPQO H_R�SyM Z D W O D�E�G
c
N L E�X W c G�L E�MIz bIx UMFR�E�R�JKD�H_ROQPQO H_R�SyG�R O L M�E
{
-
������3������ ���4��
������3������ ���4��
[1.2] TCAS shall provide collision avoidance protection for any two
High−Level Requirements
aircraft closing horizontally at any rate up to 1200 knots and vertically up to 10,000 feet per minute.
Assumption: Commercial aircraft can operate up to 600 knots and
the planes can close horizontally up to 1200 knots and vertically up to 10,000 fpm.
Design and Safety Constraints
critical phases of flight nor disrupt aircraft operation.
[SC5.1] The pilot of a TCAS−equipped aircraft must have theoption to switch to the Traffic−Advisory−Only mode where TAs are displayed but display of resolution advisories is prohibited.
Assumption: This feature will be used during final approach to parallel runways when two aircraft are projected to come close to each other and TCAS would call for an evasive maneuver.
5000 fpm during vertical climb or controlled descent (and therefore
[SC5] The system must not disrupt the pilot and ATC operations during
Level 1: System Purpose
����������������$
����������������&Example Level 1 Safety Constraints for TCAS
2.36, 2.38, 2.48, 2.49.2
SC−7.1 Crossing maneuvers must be avoided if possible.
SC−7 TCAS must not create near misses (result in a hazardous level of verticalseparation) that would not have occurred had the aircraft not carried TCAS.
2.51, 2.56.3, 2.65.3, 2.66
SC−7.2 The reversal of a displayed advisory must be extremelyrare.
2.52
SC−7.3 TCAS must not reverse an advisory if the pilot willhaveinsufficient time to respond to the RA before the closestpoint of approach (four seconds or less) or if own andintruder aircraft are separated by less than 200 feet verticallywhen 10 seconds or less remain to closest point of approach.
c
c
-
closest point of approach, the vertical separation is less than100 feet and the horizontal separation is less than 500 feet.
H1:
H2: TCAS causes controlled maneuver into ground
e.g. descend command near terrain
TCAS causes pilot to lose control of the aircraft.H3:
H4: TCAS interferes with other safety−related systems
e.g. interferes with ground proximity warning
Hazard List for TCAS
������3������ ���4��
������3������ ���4��
Near midair collision (NMAC): An encounter for which, at the
��������������"�'
��������������"5�
c
c
Level 1: System Purpose (3)
System Limitations
Operator Requirements
Human−Interface Requirements
Hazard and other System Analyses
L.5 TCAS provides no protection against aircraft with nonoperational or non−Mode C transponders.
OP. 4 After the threat is resolved the pilot shall return promptly and smoothly to his/her previously assigned flight path.
-
�i����3|����� ���4��
No RA inputs are provided to the display.
OP.4
OP.10
OP.10
OP.1
TCAS does not display a resolution advisory.
1.23.1
...
Altitude errors put threat on ground
Altitude errors put threat in non−threat position.
RA beyond CPA>
-
������3������ ���4��
result from hilly terrain (
1.SC4.9).
Sea Level
Barometric
of Ground
AirborneDeclared
Ground Level
OWN TCAS
RadarAltimeter
Value
Estimated Elevation
Altimeter
Allowance
2.19 When below 1700 feet AGL, the CAS logic uses the difference
}0~�-~����F�c
own TCAS is within 500 feet of the ground.FTA−320). All RAs are inhibited when
reduce vacillations in the display of traffic advisories that mighttracked altitude is below this estimate. Hysteresis is provided to Traffic and resolution advisories are inhibited for any intruder whose180 feet, TCAS considers the target to be on the ground (altitude − radar altitude + 180 feet). If this altitude is less thanapproximate altitude of the target above the ground (barometricpressure altitude value received from the target to determine thelevel (see Figure 2.5). It then subtracts the latter value from thedetermine the approximate elevation of the ground above seabetween its own aircraft pressure altitude and radar altitude to
180−footon GroundDeclared
on GroundDeclared
-
��~���� ���
��~���� ���Level 3 Modeling Language Example
. .
.
.
.
.
.
.
.
.
.
v� v
1v� �
-
ò �ó�� ô�|��õ ö��÷
4. Establish the hazard log.
1. Identify system hazards
2. Translate system hazards into high−level
3. Assess hazards if required to do so.
system safety design constraints.
Preliminary Hazard Analysis
.
.
..
ò �ó�� ô�|��õ ö��÷
Door that closes on an obstruction does not reopen or reopeneddoor does not reclose.
Doors cannot be opened for emergency evacuation.
}�~��~������ø
}�~��~������ù
c
Door closes while someone is in doorway
Door opens while improperly aligned with station platform.
Door opens while train is in motion.
Train starts with door open.
System Hazards for Automated Train Doors
c
-
other than a safe point of touchdown on assigned runway (CFIT)
ò �ó�� ô�|��õ ö��÷
violate minimum separation.
with stationary objects or leaves the paved area.
Controlled aircraft executes an extreme maneuver within its
Identify the system hazards for this cruise−control systemExercise:
The cruise control system operates only when the engine is running.
traveling at that instant is maintained. The system monitors the car’sWhen the driver turns the system on, the speed at which the car is
speed by sensing the rate at which the wheels are turning, and itmaintains desired speed by controlling the throttle position. After the system has been turned on, the driver may tell it to start increasingspeed, wait a period of time, and then tell it to stop increasing speed.Throughout the time period, the system will increase the speed at afixed rate, and then will maintain the final speed reached.
The driver may turn off the system at any time. The system will turnoff if it senses that the accelerator has been depressed far enough tooverride the throttle control. If the system is on and senses that thebrake has been depressed, it will cease maintaining speed but will notturn off. The driver may tell the system to resume speed, whereuponit will return to the speed it was maintaining before braking and resumemaintenance of that speed.
authorization.
Controlled airborne aircraft and an intruder in controlled airspace
ò �ó�� ô�|��õ ö��÷
Controlled airborne aircraft gets too close to a fixed obstable
Controlled aircraft operates outside its performance envelope.
Aircraft on ground comes too close to moving objects or collides
Aircraft enters a runway for which it does not have clearance.
performance envelope.
Loss of aircraft control.
System Hazards for Air Traffic ControlControlled aircraft violate minimum separation standards (NMAC).
Airborne controlled aircraft enters an unsafe atmospheric region.
Controlled airborne aircraft enters restricted airspace without
}�~��~������ú
}�~��~�����û�ü
c
c
-
ò �ó-� ô�|��õ ö�-÷
1. A pair of controlled aircraft
1b. ATC shall provide conflict alerts.
maintain safe separation betweenaircraft.
1a. ATC shall provide advisories that
direct aircraft into areas with unsafeatmospheric conditions.
2a. ATC must not issue advisories that
2b. ATC shall provide weather advisoriesand alerts to flight crews.
2c. ATC shall warn aircraft that enter an unsafe atmospheric region.
Hazards must be translated into design constraints.
Door areas must be clear before door
Door opens while train is in motion.
violate minimum separation
}�~�-~����Fû5ýc
Example PHA for ATC Approach Control
areas, thunderstorm cells)(icing conditions, windshear
REQUIREMENTS/CONSTRAINTSHAZARDS
unsafe atmospheric region.2. A controlled aircraft enters an
standards.
doorway.
Door must be capable of opening only after
motion.Doors must remain closed while train is in
any door open.Train must not be capable of moving withTrain starts with door open.
DESIGN CRITERIONHAZARD
train is stopped and properly aligned with
Doors cannot be opened foremergency evacuation.
Door that closes on an obstructiondoes not reopen or reopened door does not reclose.
Door closes while someone is in
with station platform.Door opens while improperly aligned
emergency evacuation.anywhere when the train is stopped forMeans must be provided to open doors
reclose.removal of obstruction and then automaticallyAn obstructed door must reopen to permit
closing begins.
platform unless emergency exists (see below).
-
ò �ó-� ô�|��õ ö�-÷
Example PHA for ATC Approach Control (2)
c
to avoid intruders if at all possible.
HAZARDS REQUIREMENTS/CONSTRAINTS
6. Loss of controlled flight or lossof airframe integrity.
safety of flight.
the pilot or aircraft cannot fly or that 6c. ATC must not issue advisories that
6b. ATC advisories must not distractor disrupt the crew from maintaining
degrade the continued safe flight of the aircraft.
it at the wrong place.
that cause an aircraft to fall below
6a. ATC must not issue advisories outsidethe safe performance envelope of theaircraft.
6d. ATC must not provide advisories
the standard glidepath or intersect
}�~�-~����Fû�þ
5. ATC shall provide alerts and advisories
HAZARDS REQUIREMENTS/CONSTRAINTS
3. A controlled aircraft entersrestricted airspace withoutauthorization.
4. A controlled aircraft gets too close to a fixed obstacle or terrain other than a safe point oftouchdown on assigned runway.
5. A controlled aircraft and anintruder in controlled airspaceviolate minimum separationstandards.
3a. ATC must not issue advisories thatdirect an aircraft into restricted airspaceunless avoiding a greater hazard.
3b. ATC shall provide timely warnings toaircraft to prevent their incursion intorestricted airspace.
4. ATC shall provide advisories that maintain safe separation betweenaircraft and terrain or physical obstacles.
-
of discrete states required to describe software behavior.
ÿ ~�����÷ ~���~��� �|��õ ö��÷
ÿ ~�����÷ ~���~��� �|��õ ö��÷
2) Use abstraction and metamodels to handle large number
complexity of internal design to accomplish the behavior.
Requirements Validation
Requirements are source of most operational errors and almost
requirements.
all the software contributions to accidents.
Much of software hazard analysis effort therefore should focus on
Problem is dealing with complexity
1) Use blackbox models to separate external behavior from
Test Coverage Analysis and Test Case Generation
Automatic code generation?
Human Error Analysis
Software Deviation Analysis
State Machine Hazard Analysis (backwards reachability)
Completeness
Model Execution, Animation, and Visualization
Requirements Analysis
modeler needs to describe.drastically reduce number of states and transitionsBut new types of state machine modeling languages
Do not have continuous math to assist us
c
}�~��~�����û�û
}�~��~�����û�
c
-
I/O
I/O
How were criteria derived?
Mapped the parts of a control loop to a state machine
Defined completeness for each part of state machine
States, inputs, outputs, transitionsMathematical completeness
Added basic engineering principles (e.g., feedback)
Added what have learned from accidents
ÿ ~�����÷ ~���~��� �|��õ ö��÷
ÿ ~�����÷ ~���~��� �|��õ ö��÷
might be designed.that of any other undesired program thatthe desired behavior of the software fromRequirements are sufficient to distinguishCompleteness:
c
}�~��~�����û�Ç
}�~��~�����û�È
c
Requirements Completeness Criteria (2)
Most software−related accidents involve software requirementsdeficiencies.
Accidents often result from unhandled and unspecified cases.
We have defined a set of criteria to determine whether arequirements specification is complete.
Derived from accidents and basic engineering principles.
Validated (at JPL) and used on industrial projects.
Requirements Completeness
-
ÿ ~�����÷ ~���~��� p|��õ ö�-÷
tools can check them.Most integrated into SpecTRM−RL language design or simple
Mode transitions
c }�~��~�-0��û�ø
Value and timing
Requirements Completeness Criteria (3)
Path RobustnessPreemptionReversibilityFeedbackLatencyData ageRobustness
Load and capacity
Inputs and outputs
Startup, shutdown
About 60 criteria in all including human−computer interaction.
(won’t go through them all they are in the book)
Human−computer interfaceFailure states and transitionsEnvironment capacity
-
� ~��÷ ��
� ~��÷ ��
Design should incorporate basic safety design principles
RedundancySafety Factors and Margins
Reducing exposureIsolation and containmentProtection systems and fail−safe design
HAZARD ELIMINATION
HAZARD REDUCTION
HAZARD CONTROL
DAMAGE REDUCTION
Safe Design Precedence
Software design must enforce safety constraints
Should be able to trace from requirements to code (vice versa)
Design for Safety
}�~��~�����û�ú
}�~��~�����Ç�ü
Failure Minimization
c
c
Increasing effectiveness
Decreasing cost
SubstitutionSimplificationDecouplingElimination of human errorsReduction of hazardous materials or conditions
Design for controllabilityBarriers
Lockins, Lockouts, Interlocks