software safety engineering (s2e) program status dan fitch march 7, 2001
TRANSCRIPT
Software Safety Engineering(S2E) Program Status
Dan Fitch
March 7, 2001
Software Safety Program - Overview
General Safety Concepts - WHY
Software Safety and CLCS - HOW
Known Hazards
Designing for Safety
Safety & Reliability Thread
Current Status
Software Safety – What is it?
[Figure: a monitored value shown against its limits. Labels: Limit, Anticipate, Detect, Control, Mitigate; Rate, Slope, Absolute Value; Prevent, Limit Damage, Return to Safe State.]
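The figure labels above (Anticipate, Detect, Rate, Slope, Absolute Value) suggest two kinds of limit check: one on the value itself and one on how fast it is changing. A minimal sketch of that idea follows. The class name, function name, thresholds, and readings are all hypothetical and are not taken from CLCS.

```python
from dataclasses import dataclass


@dataclass
class Limits:
    absolute_max: float  # hypothetical absolute-value limit
    rate_max: float      # hypothetical rate (slope) limit, in units per second


def check_sample(prev: float, curr: float, dt: float, limits: Limits) -> str:
    """Classify one sample as 'ok', 'anticipate', or 'detect'."""
    if curr >= limits.absolute_max:
        return "detect"       # the absolute limit has already been reached
    slope = (curr - prev) / dt
    if slope >= limits.rate_max:
        return "anticipate"   # rising fast enough to threaten the limit
    return "ok"


limits = Limits(absolute_max=100.0, rate_max=5.0)
readings = [90.0, 91.0, 97.0, 101.0]  # made-up samples, one per second
states = [
    check_sample(prev, curr, 1.0, limits)
    for prev, curr in zip(readings, readings[1:])
]
print(states)  # ['ok', 'anticipate', 'detect']
```

The rate check fires one sample before the absolute check does, which is the point of anticipating rather than only detecting.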
Software Safety – What is it?
Definitions
Functionally-critical
Mission completion
Safety-Critical
Humans = Life & Limb
Hardware = $10^6
Some set theory
Input versus output
Some Theory…
[Figure: set diagram relating the Set of Inputs to the Set of Outputs. Regions: Known Safe, Known Unsafe, Assumed Safe, Unknowns.]
Sources: Normal Operation, Hardware Failures, Human Intervention, Models/Simulators
Software Safety – Why do it?
Direction:
DoD: Mil-Std-882D, DoD-Std-2167
NASA: NSTS-07700, NSS-8719.13, NASA-GB-1740.13, NSS-22206, NSS-22254, Direction from Dan Goldin
CLCS: 84K-00055, KDP-P-2901
Software Safety – Why do it?
Objective: Identify & Mitigate Risk
Known Fault Scenarios – by requirements, analyses & test
Possible Unknowns – by design approach & further test
“Knowns”
Hardware fault-driven scenarios
Legacy of hardware failure data available from the 1970’s
Hardware-driven hazards
May be analyzed – the SSA
May be tested – specific fault injection
Identifies Risk & Yields Design Changes – Issues/ESRs
The Safety Case – Summary of Risk Findings
“Unknowns”
“Stuff” Happens
Software doesn’t fail – It just doesn’t do what we thought it would
Hardware and some functions (e.g., seeds & races) cause most random errors
Specification & Coding errors = Prime Cause
90% of errors are in the specifications
C++ and Java are inherently powerful, but dangerous
Ferengi Software Safety Rule #76
If it "touches*" hardware that can impact the safety of people or equipment, an SSA is absolutely necessary.
*(i.e., controls, monitors, or mitigates the risk of using)
SSA - What and When
Assessment of risk factors due to software
Hardware Hazards
SFMEA and SFTA
KDP-P-2901
Schedule:
30 days before the first interaction with Flight Hardware
In time for 5A/B Testing
Presented at TRR/ORR
KDP-P-2901 SSA Process
[Figure, built up over successive slides: the System Safety Analysis laid over the development life cycle. Life-cycle labels: Conceptual Design, Detail Design, Code Development, Val/Ver Test, System Test, 5A/B (With Hdwr), Readiness Reviews, TRR/ORR. Milestone labels: IPT/DP-1, SRS/DP-2, DDS/ODS/DP-3, 3A/B, 4A/B. System Safety Analysis labels: S-C Matrix, PHA, FTA/FMEA, Risk Assessment, CHAWS*, SSA Report. Flow labels: Hazards, Issues, Risk, CM-Driven Changes.]
*CHAWS = CLCS Hazard Analysis Worksheet
Software Fault Tree Analysis
Works backward from the fault to its root causes
Uses design details of the entire system
Leads to better understanding of causes and their prevention
Unknown fault events not considered
Fault Tree Analysis
[Figure: example fault tree. Top Event: Fill Valve not closed. Events shown: Human did not notice pressure; S/W did not react to over pressure; S/W did not anticipate rapid pressure rise; Other Root Cause. Legend: Top Event, Intermediate Events, Basic Fault Events, Causal Relationship, AND.]
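A fault tree like the fill-valve example can be evaluated mechanically once its gates are written down. The sketch below is illustrative only: the gate arrangement (an AND at the top, an OR beneath the software branch) is an assumption, since the slide text does not preserve how the events are connected, and the event keys are hypothetical names.

```python
from typing import Callable, Dict

State = Dict[str, bool]
Node = Callable[[State], bool]


def basic(name: str) -> Node:
    """Basic fault event: true when the named fault is present in the state."""
    return lambda state: state.get(name, False)


def and_gate(*children: Node) -> Node:
    """Output event occurs only if every input event occurs."""
    return lambda state: all(child(state) for child in children)


def or_gate(*children: Node) -> Node:
    """Output event occurs if any input event occurs."""
    return lambda state: any(child(state) for child in children)


# Assumed structure, for illustration only.
sw_did_not_react = or_gate(
    basic("sw_did_not_anticipate_rapid_rise"),
    basic("other_root_cause"),
)
fill_valve_not_closed = and_gate(
    basic("human_did_not_notice_pressure"),
    sw_did_not_react,
)

both_fail = {
    "human_did_not_notice_pressure": True,
    "sw_did_not_anticipate_rapid_rise": True,
}
human_catches_it = {"sw_did_not_anticipate_rapid_rise": True}

print(fill_valve_not_closed(both_fail))         # True
print(fill_valve_not_closed(human_catches_it))  # False
```

Under this assumed structure the top event needs both the human and the software barrier to fail. Working backward from the top event to such combinations is what the analysis is for.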
Analysis & CLCS Architecture
[Figure: CLCS architecture layers (Hardware Safing, System S/W, Sys Srvcs, Apps Srvcs, Applications) shown with the labels Hazardous Event, Detection & Anticipation, Control & Mitigation, and Remaining Risk.]
The Software FMEA
Predicted hardware failures followed to their conclusion through the software
What can go wrong?
What happens when it does?
Must know system failures up front
Won’t prevent the unexpected
CLCS
Spiral Development Cultural Changes
Failure of software Test
SSA – Traditional Approach
Failure Modes & Effects Analysis
Fault Tree Analysis
Traditional Development
•All or most code available
•A lot known about the system
•Too late…
SSA - An Iterative Process
Safety Criticality Assessment
Engineering Design Changes
Failure Modes & Effects Analysis
Fault Tree Analysis
Spiral Development
SSA - Where
S&MA will perform a Software Safety Analysis (SSA) for each Delivery and every location; i.e., as we step up to each new drop.
After the initial SSA, an update of the analysis and a new SSA report will be done for each modification to the safety-critical software.
SSA - Planning
On a PERT chart, the SSA preparation activity begins during preparation of the design specifications and has a finish-to-finish relationship with the validation/verification (4A/B) testing.
[Figure: timeline from Design Begin … Val/Ver Test, showing PHA, FTA, FMEA, Risk Assessment, and SSA Report.]
Ferengi Software Safety Rule #304
The SSA isn’t enough.
CLCS
Spiral Development Cultural Changes
Failure of software Test
Paradigms
Software Failures:
“Software does not fail - it just does not perform as intended”
Dr Nancy Leveson, MIT
Paradigms
Design and test for functionality:
Also specify what the system should not do.
Then test it.
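Specifying what the system should not do leads to tests that assert a forbidden outcome never happens, alongside the usual functional tests. A minimal sketch, assuming a hypothetical fill-valve command function and a made-up pressure limit (neither comes from CLCS):

```python
import math

MAX_SAFE_PRESSURE = 100.0  # hypothetical limit, illustration only


def command_fill_valve(requested_open: bool, pressure: float) -> bool:
    """Return the valve command; True means open.

    "Should not" requirement: never command the valve open unless the
    pressure is known to be below the limit. Writing the guard as
    `not (pressure < limit)` also rejects NaN readings.
    """
    if not (pressure < MAX_SAFE_PRESSURE):
        return False
    return requested_open


def test_functionality() -> None:
    # What the system should do.
    assert command_fill_valve(True, 50.0) is True
    assert command_fill_valve(False, 50.0) is False


def test_should_not() -> None:
    # What the system should not do: open at or above the limit,
    # or when the reading is not a valid number.
    for pressure in (100.0, 150.0, math.inf, math.nan):
        assert command_fill_valve(True, pressure) is False


test_functionality()
test_should_not()
```

The second test is the one a purely functional test plan tends to omit.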
Some Theory… 2nd Look
[Figure: set diagram relating the Set of Inputs to the Set of Outputs. Regions: Known Safe, Known Unsafe, Assumed Safe, Unknowns, with Fault Injection (added known).]
Sources: Normal Operation, Hardware Failures, Human Intervention, Models/Simulators
Design for Safety
“Program and Project Responsibilities” Dan Goldin message:
Safety is more than FMEA and FTA
Safety must be designed in at the earliest
Existing Specifications
Must include safety
Methods & techniques for mitigation of hazards
Requirements – Traceable and Testable
Initiatives
Dan Goldin: “Design for Safety”
Smart Practices applied early to designs
Early engineering changes are cheaper
Provide draft guidance for design of safety-critical software
Process changes
Design Guidelines – NASA-GB-1740.13
Peer reviews – enhanced checklist
Test development – Fault Injection for Robustness
Works to prevent unforeseen fault scenarios
Objectives
Known fault scenarios:
Analysis
Redesign
Test – functionality and robustness
Unknowns:
Design them out of the system
Test – fault injection
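Fault injection testing can be sketched as wrapping a normal input source so that it sometimes fails, then checking that the software under test still ends up in a safe state. The example below is a minimal illustration under stated assumptions: the sensor, the control step, the state names, and the fault rate are all hypothetical.

```python
import random
from typing import Callable


class SensorFault(Exception):
    """Raised by the injector to simulate a failed sensor read."""


def make_faulty_sensor(
    read: Callable[[], float], fault_rate: float, rng: random.Random
) -> Callable[[], float]:
    """Wrap a sensor read so that a fraction of calls raise SensorFault."""
    def faulty_read() -> float:
        if rng.random() < fault_rate:
            raise SensorFault("injected fault")
        return read()
    return faulty_read


def control_step(read_pressure: Callable[[], float]) -> str:
    """Hypothetical control step: any fault or over-pressure goes to the safe state."""
    try:
        pressure = read_pressure()
    except SensorFault:
        return "SAFE_STATE"
    return "SAFE_STATE" if pressure >= 100.0 else "RUN"


rng = random.Random(0)  # fixed seed keeps the test run repeatable
sensor = make_faulty_sensor(lambda: 50.0, fault_rate=0.3, rng=rng)
results = [control_step(sensor) for _ in range(1000)]

# Robustness property: no injected fault ever produces another state.
assert set(results) <= {"RUN", "SAFE_STATE"}
print(results.count("SAFE_STATE"), "of", len(results), "steps hit an injected fault")
```

Each injected fault turns an untested input into a tested one, so the test explores behaviour that the known fault scenarios do not list.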
S/W Safety – Where we are.
Safety-Critical software identified & in engineering review
Software Safety Integration Team formed
Software FTA/FMEA in work
Will be recurring due to spiral development
Design for Safety concepts being integrated
Safety & Reliability Thread introduced
Post-SSA Analysis Tools being procured
S/W Safety – What’s Next?
Today
“Design for Safety” and “Known Fault Analyses”
Tomorrow
Recursive and bi-directional analyses
Reliability predictions, Markov, Numerical Integration, Weibull analysis techniques
Probabilistic fault injection techniques
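As an illustration of the Weibull technique named above, the standard two-parameter Weibull model gives the probability of surviving to time t as R(t) = exp(-(t/scale)^shape). The sketch below uses that textbook form. The parameter values are made up and do not describe any CLCS component.

```python
import math


def weibull_reliability(t: float, shape: float, scale: float) -> float:
    """Probability of surviving past time t (t >= 0)."""
    return math.exp(-((t / scale) ** shape))


def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Instantaneous failure rate at time t (t > 0)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


# Illustrative parameters only.
scale = 1000.0  # characteristic life, in hours
for shape in (0.5, 1.0, 2.0):  # early-life, constant-rate, wear-out
    r = weibull_reliability(500.0, shape, scale)
    h = weibull_hazard(500.0, shape, scale)
    print(f"shape={shape}: R(500 h)={r:.3f}, hazard={h:.6f} per hour")
```

A shape below 1 models a falling failure rate, exactly 1 reduces to the constant-rate exponential model, and above 1 models wear-out. At t equal to the scale parameter, reliability is exp(-1), about 0.368, whatever the shape.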
Summary
Life on the Leading Edge
Probably the “Largest real-time safety-critical control system on the planet”
Safety is our #1 core value
We are on front and center stage – The NASA team is watching