U.S. Department of Transportation Federal Aviation Administration
Reliability, Maintainability, and Availability (RMA) Handbook
November 19, 2015 FAA RMA-HDBK-006C V1.1
Federal Aviation Administration 800 Independence Avenue, SW
Washington, DC 20591
Table of Contents
1 SCOPE ................................................................................................................................... 17
2 DOCUMENT OVERVIEW ................................................................................................. 19
3 APPLICABLE DOCUMENTS .......................................................................................... 22
3.1 Specifications, standards, and handbooks ......................................................................... 22
3.2 FAA Orders ....................................................................................................................... 22
3.3 Non-Government Publications.......................................................................................... 22
4 DEFINITIONS ...................................................................................................................... 23
5 PURPOSE AND OBJECTIVES .......................................................................................... 32
5.1 Background ....................................................................................................................... 32
6 RMA REQUIREMENTS MANAGEMENT APPROACH .............................................. 34
7 DERIVATION OF NAS-LEVEL RMA REQUIREMENTS ............................................ 36
7.1 NAS-RD-2013 Severity Assessment Process ................................................................... 36
7.1.1 Severity Level Assessment ............................................................................................ 36
7.1.2 Severity Level Assessment Roll-Up .............................................................................. 38
7.2 Development of Service Threads ...................................................................................... 39
7.2.1 System of Systems Taxonomy of FAA Systems ........................................................... 40
7.2.2 Categorization of NAPRS Services ............................................................................... 42
7.3 Service Thread Contribution ............................................................................................. 46
7.4 Scaling of Service Threads ............................................................................................... 50
7.4.1 Facility Grouping Schema ............................................................................................. 50
7.4.1.1 ARTCCs ...................................................................................................................... 52
7.4.1.2 TRACONs................................................................................................................... 52
7.4.1.3 ATCTs......................................................................................................................... 54
7.4.1.4 Unstaffed Facilities ..................................................................................................... 54
7.4.2 Scaling Service Threads to Facility Groups................................................................... 54
7.4.3 Environmental Complications ....................................................................................... 55
7.5 Assign Service Thread Loss Severity Category................................................................ 55
7.6 STLSC Matrix Development ............................................................................................ 57
7.6.1 Terminal STLSC Matrix ................................................................................................ 59
7.6.2 En Route STLSC Matrix ................................................................................................ 61
7.6.3 “Other” Service Thread STLSC Matrix ......................................................................... 62
7.7 NAS-RD-2013 RMA Requirements ................................................................................. 64
7.7.1 Information Systems ...................................................................................................... 65
7.7.2 Remote/Distributed and Standalone Systems ................................................................ 69
7.7.3 Infrastructure and Enterprise Systems ........................................................................... 71
7.7.3.1 Power Systems ............................................................................................................ 72
7.7.3.2 Heating, Ventilation and Air Conditioning (HVAC) Subsystems .............................. 76
7.7.3.3 Enterprise Infrastructure ............................................................................................. 77
7.7.3.3.1 Overview of Enterprise Infrastructure Systems ....................................................... 77
7.7.3.3.1.1 Service-Oriented Architecture .............................................................................. 77
7.7.3.3.1.2 Cloud Architectures .............................................................................................. 79
7.7.3.3.2 Communications Transport ...................................................................................... 80
7.7.3.3.3 Deriving RMA Requirements for Enterprise Infrastructure Systems (EIS) ............ 81
7.7.3.3.4 Increasing Reliability in EISs .................................................................................. 85
7.8 Summary of Process for Deriving RMA Requirements ................................................... 88
8 ACQUISITION STRATEGIES AND GUIDANCE .......................................................... 89
8.1 Preliminary Requirements Analysis ................................................................................. 89
8.1.1 System of Systems Taxonomy of FAA NAS Systems and Associated Allocation Methods ... 90
8.1.1.1 Information Systems ................................................................................................... 91
8.1.1.2 Remote/Distributed and Standalone Systems ............................................................. 92
8.1.1.3 Mission Support Systems ............................................................................................ 94
8.1.1.4 Infrastructure and Enterprise Systems ........................................................................ 94
8.1.1.4.1 Power Systems ......................................................................................................... 95
8.1.1.4.2 HVAC Subsystems .................................................................................................. 96
8.1.1.4.3 Communications Transport ...................................................................................... 97
8.1.1.4.4 Enterprise Infrastructure Systems (EIS) .................................................................. 97
8.1.2 Analyzing Scheduled Downtime Requirements ............................................................ 98
8.1.3 Modifications to STLSC Levels .................................................................................... 99
8.1.4 Redundancy and Fault Tolerance Requirements ........................................................... 99
8.1.5 Preliminary Requirements Analysis Checklist ............................................................ 100
8.2 Procurement Package Preparation .................................................................................. 100
8.2.1 System Specification Document (SSD) ....................................................................... 100
8.2.1.1 System Quality Factors ............................................................................................. 101
8.2.1.2 System Design Characteristics .................................................................................. 107
8.2.1.3 System Operations .................................................................................................... 107
8.2.1.4 Leasing Services ....................................................................................................... 108
8.2.1.5 System Specification Document RMA Checklist ..................................................... 109
8.2.2 Statement of Work ....................................................................................................... 109
8.2.2.1 Technical Interchange Meetings ............................................................................... 109
8.2.2.2 Documentation .......................................................................................................... 110
8.2.2.3 Risk Reduction Activities ......................................................................................... 113
8.2.2.4 Reliability Modeling ................................................................................................. 113
8.2.2.5 Performance Modeling.............................................................................................. 113
8.2.2.6 Monitor and Control Design Requirement ............................................................... 113
8.2.2.7 Fault Avoidance Strategies ....................................................................................... 114
8.2.2.8 Reliability Growth .................................................................................................... 114
8.2.2.9 Statement of Work Checklist .................................................................................... 115
8.2.3 Information for Proposal Preparation .......................................................................... 115
8.2.3.1 Inherent Availability Model ...................................................................................... 115
8.2.3.2 Proposed M&C Design Description and Specifications ........................................... 115
8.2.3.3 Fault Tolerant Design Description ............................................................................ 116
8.3 Proposal Evaluation ........................................................................................................ 116
8.3.1 Reliability, Maintainability and Availability Modeling and Assessment .................... 116
8.3.2 Fault-Tolerant Design Evaluation ................................................................................ 116
8.3.3 Performance Modeling and Assessment ...................................................................... 116
8.4 Contractor Design Monitoring ........................................................................................ 117
8.4.1 Formal Design Reviews ............................................................................................... 117
8.4.2 Technical Interchange Meetings .................................................................................. 117
8.4.3 Risk Management ........................................................................................................ 117
8.4.3.1 Fault Tolerance Infrastructure Risk Management .................................................... 118
8.4.3.1.1 Application Fault Tolerance Risk Management .................................................... 119
8.4.3.2 Performance Monitoring Risk Management ............................................................. 120
8.4.3.3 Software Reliability Growth Plan Monitoring .......................................................... 120
8.5 Design Validation and Acceptance Testing .................................................................... 121
8.5.1 Fault Tolerance Diagnostic Testing ............................................................................. 121
8.5.2 Functional Testing ....................................................................................................... 121
8.5.3 Reliability Growth Testing .......................................................................................... 121
9 SERVICE THREAD MANAGEMENT ............................................................................. 123
9.1 Revising Service Thread Requirements .......................................................................... 123
9.2 Adding a New Service Thread ........................................................................................ 123
9.3 FSEP and NAPRS ........................................................................................................... 124
10 RMA REQUIREMENTS ASSESSMENT ........................................................................ 125
10.1 RMA Feedback Paths ..................................................................................................... 125
10.2 Requirements Analysis ................................................................................................... 129
10.3 Architecture Assessment ................................................................................................. 130
11 NOTES ................................................................................................................................. 132
11.1 Updating this Handbook ................................................................................................. 132
11.2 Bibliography ................................................................................................................... 132
11.3 References ....................................................................................................................... 133
Appendix A SAMPLE REQUIREMENTS .......................................................................... 141
A.1 System Quality Factors ................................................................................................... 141
A.2 System Design Characteristics ........................................................................................ 142
A.3 System Operations .......................................................................................................... 144
Appendix B RELIABILITY/AVAILABILITY TABLES FOR REPAIRABLE REDUNDANT SYSTEMS ....................................................................................................... 165
B.1 Availability Table ........................................................................................................... 165
B.2 Mean Time between Failure (MTBF) Graphs ................................................................ 166
Appendix C STATISTICAL METHODS AND LIMITATIONS ...................................... 170
C.1 Reliability Modeling and Prediction ............................................................................... 170
C.2 Maintainability ................................................................................................................ 171
C.3 Availability ..................................................................................................................... 171
C.4 Modeling Repairable Redundant Systems ...................................................................... 172
C.5 Availability Allocation.................................................................................................... 179
C.6 Modeling and Allocation Issues ...................................................................................... 181
Appendix D FORMAL RELIABILITY DEMONSTRATION TEST PARAMETERS . 183
Appendix E SERVICE THREAD DIAGRAM AND DEFINITIONS .............................. 188
Appendix F EVOLUTION OF THE FAA RMA PARADIGM ......................................... 215
F.1 The Traditional RMA Paradigm ..................................................................................... 215
F.2 Agents of Change ............................................................................................................ 215
F.2.1 Technology and Requirements Driven Reliability Improvements .............................. 216
F.2.2 Fundamental Statistical Limitations ........................................................................... 217
F.2.2.1 Reliability Modeling ................................................................................................. 217
F.2.2.2 Reliability Verification and Demonstration .............................................................. 219
F.2.3 Use of Availability as a Conceptual Specification ...................................................... 220
F.2.4 RMA Issues for Software-Intensive Systems .............................................................. 221
F.2.4.1 Software Reliability Characteristics ......................................................................... 221
F.2.4.1.1 Software Reliability Curve .................................................................................... 223
F.2.4.2 Software Reliability Growth ..................................................................................... 224
F.2.4.2.1 Reliability Growth Program ................................................................................... 224
F.2.4.2.2 Reliability Growth Process .................................................................................... 224
F.2.5 RMA Considerations for Systems Using COTS or NDI Hardware Elements ............. 225
Appendix G SOFTWARE RELIABILITY GROWTH IN THE ENGINEERING LIFE-CYCLE ......................................................................................................................... 226
G.1 Relevant Software Identification .................................................................................... 226
G.2 Effort Identification ........................................................................................................ 226
G.2.1 “Full” Effort ................................................................................................................. 229
G.2.2 “Moderate” Effort ........................................................................................................ 229
G.2.3 “Minimum” Effort ....................................................................................................... 229
G.2.4 “None” Effort ............................................................................................................... 229
G.3 Goals Guidance ............................................................................................................... 230
G.4 Overarching Methodologies & Tools ............................................................................. 239
G.4.1 Metrics ......................................................................................................................... 239
G.4.2 Software Fault Taxonomies ......................................................................................... 239
G.4.3 Tools ............................................................................................................................ 239
Appendix H POWER SYSTEM CATEGORY ALLOCATIONS ..................................... 240
Appendix I GLOSSARY ........................................................................................................ 243
I.1 Acronyms ........................................................................................................................ 243
I.2 Definitions....................................................................................................................... 248
Appendix J QUICK LOOK GUIDE TO USE OF THIS HANDBOOK ........................... 251
Table of Figures
Figure 4-1 Operational Availability Entity-Relationship Diagram .............................................. 24
Figure 4-2 Failure / Restoration Timeline .................................................................................... 28
Figure 7-1 Functional Architecture ............................................................................................... 36
Figure 7-2 NAS System Taxonomy .............................................................................................. 42
Figure 7-3 Example Thread Diagram ........................................................................................... 46
Figure 7-4 Effect of Service Interruptions on NAS Capacity ....................................................... 46
Figure 7-5 Service Thread Loss Severity Categories - Case 1 ..................................................... 47
Figure 7-6 Potential Safety-Critical Service Thread - Case 2....................................................... 48
Figure 7-7 Decomposition of Safety-Critical Service into Threads ............................................. 49
Figure 7-8 Comparison of TRACONs over time by annual total number of operations .............. 53
Figure 7-9 Service/Function - Terminal Service Thread STLSC Matrix ..................................... 60
Figure 7-10 Service/Function – En Route Service Thread STLSC Matrix .................................. 61
Figure 7-11 Service/Function – “Other” Service Thread STLSC Matrix ..................................... 62
Figure 7-12 Example State Diagram............................................................................................. 65
Figure 7-13 Terminal Power System ............................................................................................ 74
Figure 7-14 En Route Power System ............................................................................................ 75
Figure 7-15 “Other” Power System .............................................................................................. 76
Figure 7-16 EIS Architecture for Essential Services .................................................................... 84
Figure 7-17 EIS Architecture for Efficiency-Critical Services..................................................... 85
Figure 8-1 Acquisition Process Flow Diagram ............................................................................. 89
Figure 8-2 NAS System of Systems Taxonomy ........................................................................... 90
Figure 10-1 RMA Process Diagram ........................................................................................... 125
Figure 10-2 Deployed System Performance Feedback Path....................................................... 126
Figure 10-3 Service Thread Availability Histogram .................................................................. 127
Figure 10-4 Reliability Histogram for Unscheduled Interruptions ............................................. 128
Figure 10-5 Requirements Analysis............................................................................................ 130
Figure 10-6 Architecture Assessment ......................................................................................... 131
Figure B-1 Mean Time between Failure for a "Two Needing One" Redundant Combination .. 167
Figure B-2 Mean Time between Failure for a “Two Needing One” Redundant Combination .. 168
Figure B-3 Mean Time between Failure for a “Two Needing One” Redundant Combination .. 169
Figure C-1 General State Transition Diagram for Three-State System ...................................... 174
Figure C-2 Simplified Transition Diagram ................................................................................. 176
Figure C-3 Coverage Failure ...................................................................................................... 178
Figure C-4 Availability Model.................................................................................................... 179
Figure D-1 Operating Characteristic Curves .............................................................................. 183
Figure D-2 Risks and Decision Points Associated with OC Curve ............................................ 184
Figure D-3 Effect of Increasing Test Time on OC Curve .......................................................... 185
Figure E-1 Functional Notional Architecture ............................................................................. 191
Figure E-2 Automatic Dependent Surveillance Service (ADSS) ............................................... 192
Figure E-3 Airport Surface Detection Equipment (ASDES) ...................................................... 192
Figure E-4 Beacon Data (Digitized) (BDAT) ............................................................................. 193
Figure E-5 Backup Emergency Communications Service (BUECS) ......................................... 193
Figure E-6 Composite Flight Data Processing Service (CFAD) ................................................ 194
Figure E-7 Composite Oceanic Display and Planning Service (CODAP) ................................. 194
Figure E-8 Anchorage Composite Offshore Flight Data Service (COFAD) .............................. 195
Figure E-9 Composite Radar Data Processing Service (CRAD) (CCCH/EBUS) ..................... 195
Figure E-10 Composite Radar Data Processing Service (CRAD) (EAS/EBUS) ....................... 196
Figure E-11 En Route Communications (ECOM) ...................................................................... 196
Figure E-12 En Route Terminal Automated Radar Service (ETARS) ....................................... 197
Figure E-13 FSS Communications Service (FCOM) ................................................................. 197
Figure E-14 Flight Data Entry and Printout Service (FDAT) ..................................................... 198
Figure E-15 Flight Data Input/Output Remote (FDIOR) ........................................................... 198
Figure E-16 Flight Service Station Automated Service (FSSAS) .............................................. 199
Figure E-17 Interfacility Data Service (IDAT) ........................................................................... 199
Figure E-18 Low Level Wind Service (LLWS) ......................................................................... 200
Figure E-19 MODE-S Data Link Data Service (MDAT) ........................................................... 200
Figure E-20 MODE-S Secondary Radar Service (MSEC) ......................................................... 201
Figure E-21 NADIN Service Threads (NADS, NAMS, NDAT) ............................................... 201
Figure E-22 Radar Data (Digitized) (RDAT) ............................................................................. 202
Figure E-23 Remote Monitoring/Maintenance Logging System Service (RMLSS) .................. 202
Figure E-24 Remote Tower Alphanumeric Display System Service (RTADS)......................... 203
Figure E-25 Remote Tower Display Service (RTDS) ................................................................ 203
Figure E-26 Runway Visual Range Service (RVRS) ................................................................. 204
Figure E-27 Terminal Automated Radar Service (TARS) ......................................................... 204
Figure E-28 Terminal Communications Service (TCOM) ......................................................... 205
Figure E-29 Terminal Doppler Weather Radar Service (TDWRS) ............................................ 205
Figure E-30 Traffic Flow Management System Service (TFMSS) ............................................ 206
Figure E-31 Terminal Radar Service (TRAD) ............................................................................ 206
Figure E-32 Terminal Surveillance Backup (TSB) (NEW) ........................................................ 207
Figure E-33 Terminal Secondary Radar (TSEC) ........................................................................ 207
Figure E-34 Terminal Voice Switch (TVS) (NAPRS Facility) .................................................. 208
Figure E-35 Terminal Voice Switch Backup (TVSB) (New) ..................................................... 208
Figure E-36 Visual Guidance Service (VGS) ............................................................................. 209
Figure E-37 VSCS Training and Backup System (VTABS) (NAPRS Facility) ........................ 209
Figure E-38 Voice Switching and Control System Service (VSCSS) ........................................ 210
Figure E-39 WAAS/GPS Service (WAAS) ................................................................................ 210
Figure E-40 WMSCR Data Service (WDAT) ............................................................................ 211
Figure E-41 WMSCR Service Threads (WMSCR) .................................................................... 211
Figure E-42 R/F Approach and Landing Services ...................................................................... 212
Figure E-43 R/F Navigation Service........................................................................................... 212
Figure E-44 ARTCC En Route Communications Services (Safety-Critical Thread Pair) ......... 213
Figure E-45 Terminal Voice Communications Safety-Critical Service Thread Pair ................. 213
Figure E-46 Terminal Surveillance Safety-Critical Service Thread Pair (1) .............................. 214
Figure E-47 Terminal Surveillance Safety-Critical Service Thread Pair (2) .............................. 214
Figure F-1 FAA System Reliability Improvements .................................................................... 216
Figure F-2 NAS Stage A Recovery Effectiveness ...................................................................... 218
Figure F-3 Coverage Sensitivity of Reliability Models .............................................................. 219
Figure F-4 Notional Hardware Failure Curve ............................................................................. 223
Figure F-5 Notional Software Failure Curve .............................................................................. 223
Figure F-6 Notional Reliability Growth Management................................................................ 225
Figure H-1 Terminal STLSC with Power System ...................................................................... 240
Figure H-2 En Route STLSC with Power System ...................................................................... 241
Figure H-3 “Other” STLSC with Power System ........................................................................ 242
Figure J-1 FAA Lifecycle Management Process ........................................................................ 251
Figure J-2 ANG B Program Level Requirements Process.......................................................... 252
Figure J-3 RMA Process: IARD Phase ....................................................................................... 253
Figure J-4 STLSC Matrix Illustration ......................................................................................... 255
Figure J-5 Initial Investment Analysis Phase.............................................................................. 257
Figure J-6 RMA Process: Final Investment Analysis Phase ...................................................... 257
Figure J-7 Solution Implementation Phase ................................................................................. 258
Figure J-8 In-Service Management Phase .................................................................................. 259
Table of Tables
Table 7-1 NAS Architecture Services and Functions ................................................................... 37
Table 7-2 Mapping of NAPRS Services and Service Threads ..................................................... 44
Table 7-3 Summary of Mapping of Service Threads to STLSC Matrices ................................... 45
Table 7-4 Facility Grouping and Descriptions.............................................................................. 51
Table 7-5 TRACON Grouping ..................................................................................................... 53
Table 7-6 Criteria Used for Tower Classification [73] ................................................................. 54
Table 7-7 Noted Discrepancies ..................................................................................................... 63
Table 7-8 Information Service Thread Reliability, Maintainability, and Recovery Times .......... 66
Table 7-9 Remote/ Distributed and Standalone Systems Services Thread ................................... 70
Table 7-10 RMA Characteristics of FTI Services ........................................................................ 81
Table 7-11 Enterprise Infrastructure Architecture (EIS) .............................................................. 83
Table 8-1 RMA Related Data Item Description ......................................................................... 111
Table 11-1 References ................................................................................................................ 133
Table G-1 MIL-STD-882C Software Control Categories .......................................................... 226
Table G-2 Software Risk Index .................................................................................................. 227
Table G-3 Software Risk Matrix Guidance ................................................................................ 228
Table G-4 Software Risk Matrix Guidance ................................................................................ 228
Table G-5 Oversight Guidance ................................................................................................... 228
Table G-6 Software Reliability Growth Effort Specifics ........................................................... 231
Table G-7 Software Reliability Growth Phase Notes ................................................................. 237
Table I-1 Acronyms .................................................................................................................... 243
Table I-2 Terms........................................................................................................................... 248
REVISION HISTORY
Date        Version   Comments

            1.16      Reorganized STLSC matrices into Terminal, En Route, and Other; combined Info and R/D threads for each domain. Added power codes to the matrices. Fixed text and references referring to the matrices. Added power text.

            2.01      Added Preface. Numerous changes to align with the NAS Enterprise Architecture, a functionally organized NAS-RD-2010 (NAS-SR-1000), and miscellaneous organizational name changes. Added the Software Reliability Growth Plan.

9/30/2013   3.0       Reorganized and significantly updated the document, including the NAS Taxonomy Diagram, STLSC matrices, and power architecture. Significant additions include scalability factors for service threads and Enterprise Infrastructure Systems.

8/31/2015   3.1       Restructured the document for readability. Updated the document to reflect NAS-RD-2013 and the reorganization of NAS Enterprise Requirements to align with the NAS Enterprise Architecture. Updated Appendix E, Service Thread Diagrams and Definitions, to align with the current NAS Service Threads; updated STLSC matrices and associated tables to match the new diagrams. Revised old Appendix G and reversed the order of Appendices F and G. Added Appendix J, Quick Look Guide to Use of This Handbook. Added Attachment 1, Economic Impact of Availability.
FAA Reliability, Maintainability, and Availability (RMA) Handbook
FAA RMA-HDBK-006C V1.1
13
PREFACE
The tools and techniques that are the foundation for reliability management were developed in
the late 1950s and early 1960s. In that timeframe, the pressures of the Cold War and the space
race led to increasingly complex electronic equipment, which in turn created reliability
problems that were exacerbated by the use of this equipment in missile and space
applications that did not permit repair of failed hardware.
Development of this Reliability, Maintainability, and Availability (RMA) Handbook was
undertaken in recognition that the changes of the past four decades have made traditional
approaches to RMA specification and verification inadequate, creating a need for a dramatic
change in the way RMA issues are addressed.
The Handbook has been updated to align with the National Airspace System (NAS) Enterprise
Architecture (NAS EA) and support significant changes to the System Effectiveness and RMA
Requirements in the latest version of the NAS Requirements Document, the NAS-RD-2013.
Users of this Handbook should check for newer editions of the NAS-RD prior to applying the
techniques and processes described here to specific NAS requirements. The NAS-Level
requirements are published on the NAS EA portal. This Handbook uses the NAS EA
terminology throughout and differentiates “overloaded” terms that have one definition in the
context of the NAS EA and different definitions elsewhere in the Federal Aviation
Administration (FAA).
The Handbook also includes an appendix (Appendix J) to provide a “quick start” guide for
individuals who are time constrained or do not require the full level of detail provided by the
handbook.
The traditional RMA approach discussed above is not suitable for modern automation systems.
Appendix F outlines how, over the last 40 years, technology advanced, system characteristics
changed, and the severity of the applications increased. It describes several areas in which
these evolving changes have affected the way the FAA has traditionally viewed RMA requirements in
a legalistic sense, i.e., as requirements that form part of legally binding contracts with which
contractors must comply. These changes have degraded the Government’s ability to write and
manage RMA requirements that satisfy the following three (out of ten) characteristics of good
requirements cited in the FAA’s System Engineering Manual (SEM):
Allocable
Attainable (achievable or feasible)
Verifiable
This Handbook describes a new approach to RMA requirements management that focuses on
associating NAS-Level requirements with service threads and assigning each service thread
requirements that are achievable, verifiable, and consistent with the severity of the service
provided to users and specialists. The focus of the RMA management approach is on early
identification and mitigation of technical risks affecting the performance of fault-tolerant
systems, followed by an aggressive software reliability growth program to provide contractual
incentives to find and remove latent software defects.
This Handbook serves as guidance for systems engineers, architects and developers who are
defining RMA requirements for FAA acquisitions or are implementing systems in response to
such requirements. The Handbook presents users with information about RMA practices and
provides a reference to assist users with developing realistic RMA requirements for hardware
and software or understanding such requirements in the FAA context.
This Handbook is for guidance only and cannot be cited as a requirement.
This Handbook defines a process for allocating NAS-Enterprise-Level RMA requirements to
FAA systems. Doing so facilitates the standardization of requirements across procured systems,
and promotes a common understanding among the FAA community and its affiliates. This
Handbook also describes the evolution of the FAA’s RMA paradigm to foster stakeholder
understanding of the rationale that forms the basis for the guidance provided.
The breadth and scope of system acquisitions or implementations are wide, ranging from major
new system acquisitions to direct one-for-one replacement of existing systems. This broad range
does not lend itself to a one-size-fits-all RMA requirements management
approach. Major new system acquisitions encompass the entire
acquisition life-cycle and are the focus of this Handbook. It is incumbent on stakeholders to
consider the scope of the program/project under consideration and tailor the process prior to
applying the techniques and processes described herein. Doing so requires early involvement
and concurrence by appropriate stakeholders (e.g., ANG and AJW) and subject matter experts
(SME) to ensure operationally acceptable RMA characteristics in fielded systems.
The primary purpose of defining NAS Enterprise-Level RMA requirements is to relate NAS
system-level functional requirements to verifiable specifications for the hardware and software
systems that implement them and to establish a baseline for the requirements.
The processes used in this Handbook are based on the concept of the service thread: the string of
systems and functions necessary to deliver a given service, e.g., separation assurance. These service
threads are derived from National Airspace Performance Reporting System (NAPRS)
“Services.” Service threads bridge the gap between un-allocated functional requirements and the
specifications of the systems that support them. Service threads also provide a vehicle for
allocating NAS Enterprise-Level RMA-related requirements to specifications for the
systems/functions that comprise the service threads.
Since the first version of the RMA Handbook was issued, the NAS System Requirements
Specification (NAS-SR-1000) was reorganized and reissued with a functional alignment (NAS-
SR-1000 Version B) and more recently rewritten to align with the NAS EA. In NAS-RD-2013,
individual functional requirements are each assigned a Service Thread Loss Severity Category
(STLSC—pronounced “Still See”) of Safety-critical, Efficiency-Critical, Essential, or Routine.
Service threads are categorized by the consequences of their loss on NAS operations, given the
time-critical nature of their support for maintaining the safe and orderly flow of air traffic.
Associating a STLSC with functional requirements is appropriate as it identifies the level of
service thread that should be used to support or implement that requirement. This process is
analogous to the NAS EA Operational Activity to Systems Function Traceability Matrix (SV-5)
mapping of the relationship between the operational activities and the systems and functions
that support them.
This version of the Handbook provides new material on the following subjects:
HW/SW Availability Requirements
Software Reliability Growth Program
Reliability, Maintainability, Availability, Logistics, and Life Cycle Cost
HW/SW Availability Requirements
Service thread and/or system-level (hardware and software) availability requirements have been
eliminated. In their place, stringent recovery time requirements and a well-constructed software
reliability growth program are recommended.
Statistical limitations make it impossible to accurately predict, demonstrate, or verify the high
levels of service thread or system operational availability required for air traffic applications.
Establishing “requirements” for which compliance cannot be verified
violates a key premise of the FAA System Engineering Manual (SEM) guidelines for good
requirements and diverts resources from the real issues affecting the operational availability of
automation functions.
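As a hedged illustration of these statistical limitations (the numbers below are hypothetical, and the calculation uses the standard zero-failure chi-square MTBF demonstration bound, not an FAA method), consider the failure-free test time needed to demonstrate a five-nines availability requirement:

```python
import math

# Hypothetical requirement: operational availability A = 0.99999 ("five nines")
# with an assumed mean time to restore (MTTR) of 0.5 hours.
A = 0.99999
mttr_hours = 0.5

# A = MTBF / (MTBF + MTTR) implies the MTBF that must be demonstrated:
mtbf_hours = mttr_hours * A / (1 - A)   # roughly 50,000 hours between outages

# Classic zero-failure demonstration: to show MTBF >= m at confidence C,
# the required failure-free test time is T = m * ln(1 / (1 - C))
# (chi-square lower confidence bound with 2 degrees of freedom).
confidence = 0.90
test_hours = mtbf_hours * math.log(1 / (1 - confidence))

print(f"Required MTBF: {mtbf_hours:,.0f} h")
print(f"Failure-free test time at {confidence:.0%} confidence: "
      f"{test_hours:,.0f} h (~{test_hours / 8766:.0f} years)")
```

Roughly thirteen system-years of failure-free operation would be needed to demonstrate the implied MTBF at even modest confidence, which is why verifiable recovery-time requirements are recommended in place of availability requirements.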
For hardware, inherent availability should be used only to drive the hardware architecture;
however, the reliability of modern processors is such that using inherent availability to
determine redundancy requirements is virtually unnecessary. For software-intensive automation
systems, the most significant RMA design driver is the required recovery time from hardware
and software failures, not inherent availability requirements. The required recovery time drives
the fault-tolerant design, the need for redundant hardware and software, timing budgets, and
monitoring overhead constraints. Recovery time requirements for Safety-Critical or Efficiency-
Critical Service Threads require redundant processing, necessitating high inherent availability.
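The point about modern processor reliability can be sketched with standard series/parallel availability algebra (the component values below are illustrative assumptions, not FAA figures):

```python
# Inherent availability of a single unit: Ai = MTBF / (MTBF + MTTR),
# considering only corrective maintenance (no logistics delays).
mtbf = 20_000.0   # hours between failures; assumed value for a modern COTS processor
mttr = 1.0        # hours to repair or replace; assumed value

ai_single = mtbf / (mtbf + mttr)

# An independent redundant pair is unavailable only when both units are down:
ai_pair = 1 - (1 - ai_single) ** 2

print(f"Single unit Ai:    {ai_single:.6f}")
print(f"Redundant pair Ai: {ai_pair:.10f}")
```

Even the simplex figure approaches four nines, and a redundant pair pushes inherent availability beyond anything that could meaningfully be verified, so redundancy decisions are driven by recovery-time and fault-tolerance needs rather than by inherent availability alone.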
Availability requirements should never be applied to software. In place of availability
requirements, software reliability growth programs should be required (refer to Appendix F) to
track the quality and maturity of the software. The primary factor affecting service thread and/or
system-level reliability is software. Software reliability is not a static characteristic, but changes
over time as latent software defects are discovered and corrected. At the NAS EA level, it is
appropriate to identify the system architecture requirements that are necessary to provide a
foundation that is capable of meeting the operational availability needs for each STLSC. This
includes establishing automatic recovery time requirements and requiring one or more
independent backup service threads for Safety-Critical Service Threads. Achieving software
reliability, however, requires an effective reliability growth program, and the RMA Handbook
now includes an improved and more detailed description of how to structure a reliability growth
program and sample templates for use in preparing the Statement of Work in Appendix G.
Software Reliability Growth Program
The software reliability growth section of Appendix G has been expanded in recognition of the
increased emphasis on the importance of this activity in developing automation systems with
acceptable reliability and recovery time characteristics.
The focus of the software reliability growth program is shifted from Mean Time Between
Failure (MTBF)-centered metrics to Problem Report-centered metrics. The objective is to
provide incentives for contractors to aggressively find and correct defects in the software.
The concept of establishing top-level MTBF decision thresholds for the first site and final site
has been abandoned. Decisions concerning when the system is stable enough to send to the field
and when it has met Operational Readiness Demonstration (ORD) criteria should be made
jointly by program management and operational personnel, based on the frequency of
interruptions, their duration, and their effect on user confidence in the system.
It is not realistic to establish inflexible NAS system-level criteria.
Reliability, Maintainability, Availability, Logistics, and Life Cycle Cost
The RMA requirements for the hardware elements comprising service threads are best
determined by acquisition managers, based on the unique circumstances of a particular
application, not a “top-down” mathematical allocation of a system-level requirement.
Hardware RMA requirements need to consider factors such as the availability of Commercial-
off-the-shelf (COTS) products, location of maintenance personnel, staffing, logistics support
policies, level of repair, etc. Clearly, RMA issues for hardware located in a major facility with
24/7 maintenance staffing will differ from those for hardware located on a remote mountain top.
Whether to provide a remotely activated spare should be driven by Life Cycle Cost (LCC)
considerations, not the allocation of an arbitrary system-level requirement. In a few sections of
this Handbook, this revision has added some minor discussion of LCC issues to be considered
by acquisition managers when establishing RMA characteristics for hardware components.
SCOPE
THIS HANDBOOK IS FOR GUIDANCE ONLY AND CANNOT BE CITED AS A REQUIREMENT
This Handbook is intended to serve as guidance for systems engineers, architects and
developers who are defining Reliability, Maintainability and Availability (RMA) requirements
for Federal Aviation Administration (FAA) acquisitions or are implementing systems in
response to such requirements. It is also intended as an information source for the broader user
community e.g., Technical Operations personnel. It applies equally to new acquisition and
established, iterative development programs. The Handbook presents users with information on
a new approach to RMA requirements management and provides a reference to assist users with
developing realistic RMA requirements for hardware and software or understanding such
requirements in the FAA context.
Appendix J provides a “quick-start” guide to make the handbook methodology available to
individuals who are time constrained or do not require the full level of detail provided by the
handbook.
This Handbook describes the evolution of the FAA’s RMA paradigm as the basis for
understanding the new approach and defines a process, to be used as guidance, for allocating
National Airspace System (NAS) Enterprise-Level RMA requirements to FAA systems and
documents. Doing so facilitates the standardization of requirements across procured systems,
and promotes a common understanding among the FAA stakeholder community.
The primary purpose of defining NAS Enterprise-Level RMA requirements is to relate NAS
system-level functional requirements to verifiable specifications for the hardware and software
systems that implement them and to establish a baseline for the requirements.
This document addresses RMA considerations associated with four general categories of NAS
systems:
1. Automated information systems that continuously integrate and update data from
remote sources to provide timely decision-support services to Air Traffic Control
(ATC) specialists (Section 7.7.1 and 8.1.1.1)
2. Remote/Distributed and Standalone Systems that provide services such as
navigation, surveillance, and communications to support NAS ATC systems
(Section 7.7.2 and 8.1.1.2)
3. Infrastructure systems (Section 7.7.3) that provide services such as power (Section
8.1.1.4.1), Heating, Ventilating, and Air Conditioning (HVAC) systems (Section
8.1.1.4.2), and communications transport (Section 8.1.1.4.3) in support of NAS
facilities, as well as Enterprise Infrastructure Systems (EIS), which host multiple
services across NAS facilities (Section 8.1.1.4.4)
4. Mission support systems that assist in the design of NAS airspace and the utilization
of the electromagnetic spectrum (Section 8.1.1.3)
This document presents guidance on the treatment of RMA considerations for each category of
system and is intended for use by acquisition managers and their staffs in the preparation,
conduct and execution of FAA procurements.
This revision of the Handbook expands the scope of RMA analysis to include “Right-sizing”
analysis. Today, NAS requirements are assigned to one of four availability categories based on
the criticality of the service the requirement supports. As defined in Section 4, these are:
Safety-Critical
Efficiency-Critical
Essential
Routine
Currently, there is no method for clearly delineating between Efficiency-Critical and Essential
NAS services. Some NAS level requirements have been considered overly restrictive in
designating NAS services as Efficiency-Critical for the entire NAS. This impacts current FAA
initiatives to “Right-Size” the NAS [1]. ANG-B has studied the impact of Availability on airline
and passenger economic costs, using facility operations and propagated delay as metrics. This
study is presented in Attachment 1, which may serve as the basis for a technique for delineating
the Efficiency-Critical / Essential boundary on a facility basis for individual systems or
programs. There may be no impact on overall NAS efficiency due to loss of “efficiency critical”
services in a low capacity region of airspace.
Where legacy systems are involved with no clear successor or replacement program or system,
the NAS EA staff should regularly assess the RMA characteristics of those legacy systems and
identify shortfalls and issues.
[1] Federal Aviation Administration Strategic Initiatives 2014-2018
DOCUMENT OVERVIEW
This Handbook covers three major topics. The first, addressed in Section 6, describes a new
RMA requirements management approach focused on identifying and mitigating technical risks
affecting the performance of fault-tolerant systems, followed by an aggressive software
reliability growth program to provide contractual incentives to find and remove latent software
defects.
The second major topic, contained in Section 7, describes how the NAS-Level RMA
requirements were developed. This material is included to provide the background information
necessary to develop an understanding of the RMA requirements management approach.
The third major topic, contained in Section 8, addresses the specific tasks to be performed by
service units, acquisition managers, and their technical support personnel to apply the NAS-
Level requirements to major system acquisitions. The section is organized in the order of a
typical procurement action. It provides a detailed discussion of specific RMA activities
associated with the development of a procurement package and continues throughout the
acquisition cycle until the system has successfully been deployed. The approach is designed to
help ensure that the specified RMA characteristics are actually realized in fielded systems.
The elements of this approach are summarized below:
Section 6: RMA Requirements Management Approach
This Handbook describes a new approach to RMA requirements management that focuses on
associating NAS-Level requirements with service threads and assigning each service thread
requirements that are achievable, verifiable, and consistent with the severity of the service
provided to users and specialists. The focus of the RMA management approach is on early
identification and mitigation of technical risks affecting the performance of fault-tolerant
systems, followed by an aggressive software reliability growth program to provide contractual
incentives to find and remove latent software defects. The reader is encouraged to refer to
Appendix F as it provides context and rationale for the guidance provided herein.
The key elements of the approach are:
Map the NAS-Level functional requirements to a set of generic service threads based on the
NAPRS services reported for deployed systems. (Section 6)
Assign Service Thread Loss Severity Categories (STLSC) of “Safety-Critical,” “Efficiency-
Critical,” “Essential,” “Routine,” and “Remote/Distributed” to the service threads based on the
effect of the loss of the service thread on NAS safety and efficiency of operations. (Sections 6
and 7.3)
Distinguish between Efficiency-Critical threads whose interruptions can be safely
managed by reducing capacity that may, however, cause significant traffic disruption
vs. Safety-Critical threads whose interruption could present a significant safety
hazard during the transition to reduced capacity operations. (Sections 6 and 7.3)
Allocate NAS-Level availability requirements to service threads based on the
severity and associated availability requirements of the NAS capabilities supported
by the threads.
Recognize that the probability of achieving the availability requirements for any
service thread identified as Safety-Critical is unacceptably low; therefore, where
possible, decompose the thread into at least two alternate, independent new threads,
each with a STLSC no greater than “Efficiency-Critical.”
Recognize that the availability requirements associated with “Efficiency-Critical”
service threads will require redundancy and fault tolerance to mask the effect of
software failures.
Move from using availability as a contractual requirement to verifiable parameters
such as MTBF, Mean Time To Restore (MTTR), recovery times, and mean time
between successful recoveries. Couple these with an
aggressive software reliability growth program.
Use RMA models only as a rough order of magnitude confirmation of the potential
of the proposed hardware configuration to achieve the inherent availability of the
hardware, not a prediction of operational reliability.
Focus RMA effort, during development, on design review and risk reduction testing
activities to identify and resolve problem areas that could prevent the system from
approaching its theoretical potential.
Recognize that “pass/fail” reliability qualification tests are impractical for systems
with high reliability requirements and substitute an aggressive software reliability
growth program.
Use NAPRS data from the National Airspace System Performance Analysis System
(NASPAS) to provide feedback on the RMA performance of currently fielded
systems to assess the reasonableness and attainability of new requirements, and to
verify that the requirements for new systems will result in systems with RMA
characteristics that are at least as good as those of the systems they replace.
Apply these principles throughout the acquisition process.
The application of these RMA management methods is
discussed in detail in Section 8. These methods apply equally to new acquisitions
and established iterative development programs. All phases of the acquisition
process are addressed, including preliminary requirements analysis, allocation,
preparation of procurement documents, proposal evaluation, contractor monitoring,
and design qualification and acceptance testing.
Throughout all phases of design and allocation, appropriately chosen subject matter
experts (SME) should be utilized to ensure successful incorporation of an
operational viewpoint into the requirements. SMEs can be both operational Air
Traffic Control personnel from appropriate facility types and locations as well as
technical operations and maintenance personnel with insight into maintainability and
repair aspects of systems to be fielded. Early involvement of SMEs during
procurement will help to ensure both clarification of the expected role and utility of a
future system, and help set operational expectations for future systems. SME input
should be carefully documented and weighed against NAS EA functional
expectations for a system and where necessary, fed back into the concept of
operations (CONOPS).
Section 7: Derivation of NAS-Level RMA Requirements
This section introduces the concept of a service thread and documents the procedures
used to map NAS Architecture functional requirements to generic service threads to serve as the
basis for allocating the requirements to specific systems. Section 7.7.3 provides guidance for
allocating NAS-Level requirements to Enterprise Infrastructure Systems (EIS).
Section 8: Acquisition Strategies and Guidance
This section describes the specific tasks to be performed by technical staffs of FAA Service
Units and acquisition managers to apply the NAS-Level requirements to system level
specifications and provides guidance and examples for the preparation of RMA portions of the
procurement package to include:
Preliminary Requirements Analysis
Procurement Package Preparation
System Specification Document (SSD)
Statement of Work (SOW)
Information for Proposal Preparation (IFPP)
Proposal Evaluation
Reliability, Maintainability and Availability Modeling and Assessment
Fault-Tolerant Design Evaluation
Contractor Design Monitoring
Formal Design Reviews
Technical Interchange Meetings
Risk Management
Design Validation and Acceptance Testing
Fault Tolerance Diagnostic Testing
Section 9: Service Thread Management
This section describes the process for updating the service thread database to maintain
consistency with the NAPRS services in response to the introduction of new services, system
deployments, modifications to NAPRS services, etc.
Section 10: RMA Requirements Assessment
This section describes the approach used to compare new requirements with the performance of fielded
systems to verify the achievability of proposed requirements, ensure that the reliability of new
systems will be at least as good as that of existing systems, and to identify deficiencies in the
performance of currently fielded systems.
APPLICABLE DOCUMENTS
Documents referenced in this Section are versions current as of the revision of this document.
Readers are urged to check for and utilize the most current versions of these documents.
3.1 Specifications, standards, and handbooks

FEDERAL AVIATION ADMINISTRATION
FAA-G-2100H, ELECTRONIC EQUIPMENT, GENERAL REQUIREMENTS, 9 May 2005
FAA-STD-067, FAA Standard Practice for Preparation of Specifications, 4 December 2009.
FAA System Engineering Manual, Version 3.1, 11 October 2006.
NAS-RD-2013, National Airspace System Requirements Document 2013, Baseline, 11 August
2014.
NAS-SR-1000, FAA System Requirements, 21 March 1985.
NAS Enterprise Architecture Portal. Version 8.2, 9 January 2014.
DEPARTMENT OF DEFENSE
MIL-HDBK-217F, Reliability Prediction of Electronic Equipment, 2 December 1991.
MIL-HDBK-472, Maintainability Prediction, Notice 1, 12 January 1984.
MIL-HDBK-781A, Reliability Test Methods, Plans, and Environments for Engineering,
Development, Qualification, and Production, April 1996.
MIL-STD-471A, Maintainability Verification/Demonstration/Evaluation, 27 March 1973.
MIL-STD-882E, Department of Defense Standard Practice for System Safety, 11 May 2012.
MIL-STD-967, Department of Defense Standard Practice, Defense Handbooks Format and
Content, 1 August 2003.
3.2 FAA Orders

FAA Order JO 6040.15, National Airspace Performance Reporting System (NAPRS)
FAA Order 6000.36A, Communications Diversity, 11/14/95
DRAFT FAA Order 6000.36B, Communications Diversity
FAA Order 6000.5D - Facility, Service, and Equipment Profile (FSEP)
FAA Order 6000.15G - General Maintenance Handbook for National Airspace System (NAS)
Facilities
FAA Order 6950.2D, Electrical Power Policy Implementation at National Airspace System
Facilities, 10/16/03.
3.3 Non-Government Publications

IEEE J-STD-016, Standard for Information Technology Software Life Cycle Processes
DEFINITIONS
This section provides definitions of RMA terms. Three basic categories of definitions are
presented in this section:
1. Definitions of commonly used RMA terms and effectiveness measures
2. Definitions of RMA effectiveness measures tailored to address unique characteristics
of FAA fault-tolerant automation systems
3. Definitions of unique terms used both in this document and in the RMA section of
NAS-RD-2013
Definitions for commonly used RMA effectiveness terms are based on those provided in
MIL-STD-721. In some cases, where multiple definitions exist, the standard definitions have been
modified or expanded to provide additional clarity or resolve inconsistencies. Less frequently
used terms are defined in Appendix I.2.
For unique terms created during the preparation of the document and the RMA section of the
NAS-RD-2013, a brief definition is included along with a pointer to the section of the
Handbook where the detailed rationale is provided. This document assumes the reader is
familiar with the NAS Enterprise Architecture (NAS EA) (Version 8.0 or greater) and its
associated terminology. Readers unfamiliar with the NAS EA are referred to the website:
nasea.faa.gov.
AVAILABILITY: The probability that a system or constituent piece is operational
at any randomly selected instant of time or, alternatively, the fraction of the total available
operating time that the system or constituent piece is operational. Measured as a probability,
availability may be defined in several ways, which allows a variety of issues to be addressed
appropriately, including (see Figure 4-1):
Operational Availability (AOp): AOp is the ratio of total operating facility/service hours
to maximum facility/service hours, expressed as a percentage. It is the Local or
Enterprise NAS availability that is needed to support ATC operations.
Human Availability (AH): The availability of the effect a human has on the operational
availability of the NAS Service. This includes the availability of operators to provide the
NAS Service and includes the availability of personnel to maintain system availability.
Service Availability (AS): The availability of the services including both the hardware
and software components. This may include multiple systems, communications links
and facilities required to provide that service. This service availability includes
scheduled and unscheduled service outages during hours of operation.
Facility Availability (AF): The availability of a facility providing or necessary to
provide service(s). It consists of the availability of the facility infrastructure components
(i.e. power, HVAC, etc.).
Communication Availability (AC): The availability related to the communications
interchange between facilities and systems required to deliver services.
System Availability (ASY): The availability of a particular piece of equipment or system
(at the “system-level”). System Availability comprises the Inherent Availability of the
system hardware and Software Availability. This includes the availability of the system,
as installed, as well as the availability of equipment, components and tools required to
maintain the system and keep it operational.
Inherent Availability (AI): The probability that a particular piece of equipment or
system will operate satisfactorily at a given point in time when used under stated
conditions in an ideal support environment. It excludes software availability (ASW),
logistics time, waiting or administrative downtime, and preventive maintenance
downtime. It includes corrective maintenance downtime. Inherent availability is
generally derived from analysis of an engineering design and is based on quantities
under control of the designer. For FAA systems, ANG regards firmware as a component
of hardware. Failures attributed to firmware should be treated as hardware failures.
Similarly, firmware updates should be treated as hardware revisions.
Software Availability (ASW): The availability of the software components independent
of the hardware.
Information Availability (AInfo): The accessibility and usability of information to
end-users and systems.
Figure 4-1 Operational Availability Entity-Relationship Diagram
This availability entity-relationship diagram may also be useful as a framework for projecting
risk, both for specific predictable hazards, and unknown hazards. For instance, AH has a
predictable vulnerability to large block retirements due to workforce aging and past hiring
patterns, but also has an unpredictable vulnerability to mass staff outages due to unlikely, but
historically realistic scenarios like epidemic disease, labor disruption, or terrorist action.
Risks to facility availability AF are also realistic (earthquake, flood, fire), as are risks to software
availability ASW (Y2K-like problems, unavailability of Ada programmers) and communications AC
(prohibitive costs of rural service, rapid technology obsolescence). As more and more FAA
systems become reliant on public internet resources (weather data, collaboration data, UAS
operator connectivity) AInfo will become increasingly vulnerable both to routine Internet
availability problems (backhoe), and to major disruption ranging from just-because-we-can
denial of service attacks to nation state actions against critical infrastructure.
The advantage of characterizing risk in this manner is that these risks can then be fed into
service availability prediction, which can then feed analysis of alternatives.
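The availability measures defined above reduce to simple ratios. The sketch below illustrates two of them in Python; the formulas (Inherent Availability as MTBF/(MTBF + MTTR), Operational Availability as an uptime ratio) are the standard textbook forms, and all numeric figures are invented for illustration, not values taken from this Handbook.

```python
def inherent_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """A_I = MTBF / (MTBF + MTTR): corrective-maintenance downtime only."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def operational_availability(operating_hours: float, max_hours: float) -> float:
    """A_Op = total operating facility/service hours / maximum hours."""
    return operating_hours / max_hours

# Illustrative figures only, not FAA requirements.
a_i = inherent_availability(mtbf_hours=2500.0, mttr_hours=0.5)
a_op = operational_availability(operating_hours=8750.0, max_hours=8760.0)
print(f"A_I  = {a_i:.6f}")   # 0.999800
print(f"A_Op = {a_op:.4%}")  # 99.8858%
```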
CERTIFICATION: A quality control method used by FAA Technical Operations Services
(TechOps) to ensure NAS systems and services are performing as expected. TechOps
determines certification requirements. TechOps is authorized to render an independent
discretionary judgment about the provision of advertised services. Also because of the need to
separate profit motivations from operational decisions and the desire to minimize liability,
certification and oversight of the NAS are inherently governmental functions. [FAA Order
6000.30, Definitions Para 11.d]
COVERAGE: Probability of successful recovery from a failure given that a failure occurred.
FACILITY: Generally, any installation of equipment designated to aid in the navigation,
communication, surveillance, or control of air traffic. Specifically, the term denotes the total
electronic equipment, power generation, or distribution systems and any structure used to house,
support, and/or protect the use of equipment and systems. A facility may include a number of
systems, subsystems, or equipment.
FAILURE: The event or inoperable state in which any item or part of an item does not, or
would not perform as specified.
Dependent Failure: A failure caused by the failure of an associated item(s), e.g. failure
of a computer due to loss of external power.
Independent Failure: A failure that is not caused by the failure of any other item, e.g.,
failure of a computer due to failure of its internal power supply.
FAILURE MODE AND EFFECTS ANALYSIS (FMEA): A procedure for analyzing each
potential failure mode in a system to determine its overall results or effects on the system and to
classify each potential failure mode according to its severity.
FAILURE RATE: The total number of failures within an item population, divided by the total
number of operating hours.
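The failure-rate definition above is a straightforward ratio, and under a constant-failure-rate assumption its reciprocal gives MTBF. A minimal sketch with invented figures:

```python
def failure_rate(total_failures: int, total_operating_hours: float) -> float:
    """Total failures within an item population / total operating hours."""
    return total_failures / total_operating_hours

# 40 identical units operated 1,000 hours each, 8 failures observed
# (figures invented for illustration).
lam = failure_rate(8, 40 * 1000.0)
print(f"failure rate = {lam:.6f} failures/hour")  # 0.000200
print(f"MTBF = {1.0 / lam:.1f} hours")            # 5000.0
```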
FAULT TOLERANCE: Fault tolerance is an attribute of a system that is capable of
automatically detecting, isolating, and recovering from unexpected hardware or software
failures.
INDEPENDENT ALTERNATE SERVICE THREADS: Independent alternate service
threads entail at least two service threads composed of separate system components that provide
alternate data paths. They provide levels of reliability and availability that cannot be achieved
with a single service thread.
Ideally, alternate threads should not share a single power source. If alternate threads do share a
single power source, the power source must be designed, or the power system topology must be
configured, to minimize failures that could cause multiple threads to fail. The independent
alternate threads may share displays, provided adequate redundant displays are provided to
permit the specialist to relocate to an alternate display in the event of a display failure.
Independent Alternate Service Threads may or may not require diverse hardware and software,
but all threads should be active and available at all times. Users need to be able to select either
thread at will without the need for a system switchover. (See Section 6.3 for a detailed discussion
of Independent Alternate Service Threads.)
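The availability gain from independent alternate service threads can be approximated with the standard parallel-redundancy formula, assuming thread failures are truly independent (the caution above about shared power sources is precisely about protecting that assumption). The thread availability value below is invented for illustration:

```python
def parallel_availability(thread_availability: float, n_threads: int = 2) -> float:
    """Service is lost only if every independent thread is down simultaneously."""
    return 1.0 - (1.0 - thread_availability) ** n_threads

# A single 0.999 thread vs. two independent alternates (illustrative value).
print(f"one thread:  {0.999:.6f}")
print(f"two threads: {parallel_availability(0.999):.6f}")  # 0.999999
```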
INHERENT VALUE: A measure of reliability, maintainability, or availability that includes
only the effects of an item’s hardware design and its application, and assumes an ideal operation
and support environment functioning with perfect software.
LOWEST REPLACEABLE UNIT (LRU): For restoration purposes, an LRU is an assembly,
printed circuit board, or chassis-mounted component that can easily be removed and replaced.
MAINTAINABILITY: The measure of the ability of an item to be retained in or restored to
specified condition through maintenance performed, at each prescribed level of maintenance
and repair, by appropriately skilled personnel using prescribed procedures and resources.
Many maintainability effectiveness measures have inconsistent and conflicting definitions, and
the same acronym sometimes represents more than one measure. These inconsistencies
generally arise as a consequence of the categories of downtime that are included in a
maintainability effectiveness measure.2 The following definitions reflect the usage in this
document and the NAS-RD-2013:
Maintenance Significant Items (MSI) – Hardware elements that are difficult to
replace, e.g., cables, backplanes, and antennas.
Mean Down Time (MDT) – Mean Down Time is an operational performance measure
that includes all sources of system downtime, including corrective maintenance,
preventive maintenance, travel time, administrative delays, and logistics supply time.
Mean Time to Repair (MTTR) – Mean Time to Repair is a basic measure of
maintainability. It is the sum of corrective maintenance times (required at any specific
level of repair) divided by the total number of failures experienced by an item that is
repaired at that level, during a particular interval, and under stated conditions. The
MTTR is an inherent design characteristic of the equipment. Traditionally, this
characteristic represents an average of the times needed to diagnose, remove,
and replace failed hardware components. In effect, it is a measure of the extent to which
physical characteristics of the equipment facilitate access to failed components in
combination with the effectiveness of diagnostics and built-in test equipment.3
MTTR is predicted by inserting a broad range of failed components and measuring the
times to diagnose and replace them. It is calculated by statistically combining the
component failure rates and the measured repair times for each component. The measure
assumes an ideal support environment in which trained technicians with all necessary
tools and spare parts are immediately available – but it does not include scheduled
downtime for preventive maintenance or such things as the time needed for a technician
to arrive on scene or delays in obtaining necessary spare parts.
2 Maintenance events performed by FAA personnel are defined as any activity performed, including scheduled and
unscheduled events. For the purposes of this Handbook, scheduled maintenance events are assumed not to impact
Availability and are excluded from availability calculations.
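The MTTR calculation described above, statistically combining component failure rates with measured repair times, is conventionally a failure-rate-weighted average. A sketch with invented component data:

```python
# (name, failure rate in failures/hour, mean repair time in hours);
# all values invented for illustration.
components = [
    ("power supply",   1.0e-4, 0.50),
    ("processor card", 5.0e-5, 0.75),
    ("disk unit",      2.0e-4, 0.25),
]

total_rate = sum(rate for _, rate, _ in components)
# Failure-rate-weighted average: components that fail more often
# contribute more to the expected repair time.
mttr = sum(rate * repair for _, rate, repair in components) / total_rate
print(f"predicted MTTR = {mttr:.4f} hours")  # 0.3929
```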
Mean Time to Repair is a metric that is commonly used in procurement of systems or
components where the contractor is held responsible for design and manufacturing
aspects of Maintainability, but is not (or cannot be) held responsible for performing the
repair or maintaining the logistics system. In specifying MTTR, systems engineers
should consult with TechOps on what proportion of the NAS-RD specified MTTR
(Restore) can be allocated to the repair. Figure 4-2, Failure / Restoration Timeline,
illustrates the sequence of events in the failure, repair, and restoration of service.
MTTR (NAS-RD Definition)
The current NAS-RD-2013 uses the acronym MTTR to mean Mean Time to Restore,
e.g. “3.3.1.2.0-5 The MTTR for non-routine service thread components shall be less than
or equal to 0.5 hours.”
NAS-RD Mean Time to Restore is commonly understood to exclude travel time and time
to obtain parts not stored on site, but to include an FAA-specific process of
re-certification of the failed system prior to notification of availability to the operational
users.
Mean Time to Restore Service (MTTRS) – The MTTRS is also an inherent measure
of the design characteristics of complex systems. It represents the time needed to
manually restore service following an unscheduled service failure requiring manual
intervention. Like MTTR, it includes only unscheduled downtime and assumes an ideal
support environment, but the MTTRS includes not only the time for hardware
replacements, but also the time for software reloading and system restart. MTTRS
does not include the time for the successful operation of automatic fault detection and
recovery mechanisms that may be part of the system design. The performance
specifications for the operation of automatic recovery mechanisms are addressed
separately.
MTTRS may be a suitable quality metric for services provided by a contractor on a
leased basis. Because leased services are generally provided on a per-location or per-item
(circuit, etc.) basis, caution should be exercised in developing MTTRS numbers for
aggregated locations or items. In such circumstances, a Maximum Time to Restore
Service should also be specified on a per-item basis.
Figure 4-2 Failure / Restoration Timeline
NON-DEVELOPMENTAL ITEM (NDI): An NDI is a system, or element of a system, that is
used in a developmental program but was developed under a previous program or by a
commercial enterprise.
An understanding of the relationship between reliability metrics and other terms is necessary for
the application of these factors. System failures may be caused by the hardware, the user, or
faulty maintenance.
The basic Software Reliability incident classification includes:
Mission failures – Loss of any essential functions, including system hardware failures,
operator errors, and publication errors. Related to mission reliability.
System failures – Software malfunction that may affect essential functions. Related to
maintenance reliability.
Unscheduled spares demands – Relates to supply reliability.
System/mission failures requiring spares – Relates to mission, maintenance, and supply
reliabilities. [17]
RECOVERY TIME: For systems that employ redundancy and automatic recovery, the total
time required to detect, isolate, and recover from failures. Recovery time is a performance
requirement. While successful automatic recoveries occurring within the prescribed recovery
time are not counted as downtime in RMA computations, requirements for systems employing
automatic recovery do limit the allowable frequency of automatic recovery actions.
RELIABILITY: Reliability can be expressed either as the probability that an item or system
will operate in a satisfactory manner for a specified period of time, or, when used under stated
conditions, in terms of its Mean Time between Failures (MTBF). Expressing reliability as a
probability is more appropriate for systems such as missile systems that have a finite mission
time. For repairable systems that must operate continuously, reliability is usually expressed as
the probability that a system will perform a required function under specific conditions for a
stated period of time. It is a function of MTBF, according to the formula
where “t” is the mission time and “m” is the MTBF. Also, reliability is often expressed as the
raw MTBF value, in hours, rather than calculating R according to the above formula.
Software reliability metrics are similar to hardware metrics for a repairable system. The data
provided is commonly a series of failure times or other events. The data is used during software
development to measure time between events, analyze the improvement resulting from
removing errors and making decisions about when to release or update a software product
version. Metrics are also used to assess software or system stability. [13]
Mean Time Between Failure (MTBF) – MTBF is a basic measure of Reliability. MTBF
is the average time between failures of system components.
Mean Time Between Outage (MTBO) – MTBO is an operational performance
measure for deployed systems that corresponds to the inherent MTBF measure. A
measure of the time between unscheduled interruptions, MTBO is monitored by
NAPRS. It is computed by dividing the total operating hours by the number of outages.
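The exponential relationship given above can be sketched directly; the mission time and MTBF below are invented for illustration:

```python
import math

def reliability(mission_time_hours: float, mtbf_hours: float) -> float:
    """R = e^(-t/m): probability of failure-free operation for t hours."""
    return math.exp(-mission_time_hours / mtbf_hours)

# An 8-hour operational period against a 2,000-hour MTBF (invented figures).
print(f"R(8 h) = {reliability(8.0, 2000.0):.4f}")  # 0.9960
```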
SERVICE: The term “service” has different meanings in the contexts of the NAS Enterprise
Architecture (Version 8.0 or greater) and NAPRS and FSEP.
NAS Enterprise Architecture "Services" are services, such as separation assurance,
provided to NAS users. These services are provided by a combination of ATC
specialists and the systems that support them. Each NAS EA Service is comprised of
one or more NAS-RD functions.
NAPRS / FSEP “Services” as defined in FAA order 6000.5 are services that represent
an end product, which is delivered to a user (air traffic specialists, the aviation public or
military) that results from an appropriate combination of systems, subsystems,
equipment, and facilities. NAPRS provides the guidance on how to report the status of
these services. To distinguish these services from NAS EA Services, NAPRS / FSEP
services will be referred to in this document as NAPRS service threads, or simply
“service threads.”
SERVICE-ORIENTED ARCHITECTURE: A software design and software architecture
design pattern based on structured collections of discrete software modules, known as services,
which collectively provide the complete functionality of a large software application.[50]
SERVICE THREADS: Service threads are strings of systems/functions that support one or
more of the NAS EA Functions. These service threads represent specific data paths (e.g. radar
surveillance data) to air traffic specialists or pilots. The threads are defined in terms of
narratives and Reliability Block Diagrams depicting the systems that comprise them. They are
based on the reportable services defined in NAPRS. Note that some new service threads have
been added to the set of NAPRS services, and some of the NAPRS services that are components
of higher-level threads have been removed. (See Section 7 for a detailed discussion of the
service thread concept.)
SERVICE THREAD LOSS SEVERITY CATEGORY (STLSC): Each service thread is
assigned a STLSC based on the severity of impact that loss of the thread could have on the safe
and/or efficient operation and control of aircraft. (See Section 7.4 for a detailed discussion of
the STLSC concept.) The Service Thread Loss Severity Categories are:
Safety-Critical - A key service in the protection of human life. Loss of a Safety-Critical
service increases the risk of loss of human life.
Efficiency-Critical - A key service that is used in present operation of the NAS. Loss of
an Efficiency-Critical service has a major impact on present operational capacity.
Essential - A service that if lost would significantly raise the risk associated with
providing efficient NAS operations.
Routine - A service which, if lost, would have a minor impact on the risk associated
with providing safe and efficient NAS operations.
Service threads can also be characterized as Remote/Distributed where loss of a service thread
element, i.e., radar, air/ground communications site, or display console, would incrementally
degrade the overall effectiveness of the service thread but would not render the service thread
inoperable.
The distinction between Efficiency-Critical and Essential services is in the extent of the impact
of the outage on air traffic. Loss of an Efficiency-Critical service would cause significant
disruption of traffic flow affecting multiple facilities across a region of Metroplex or En Route
Center size or larger. Loss of an Essential service would cause significant disruption of traffic
flow within a local area, e.g., a single airport or TRACON. When designing systems providing
Efficiency-Critical service, cost/benefit decisions may be required that call for provision of
lower-level (Essential) service at smaller or less critical locations. The techniques described in
Appendix J were developed to aid in developing criteria for making this decision. ANG-B7 can
provide assistance in applying these techniques.
SEVERITY: A relative measure of the consequence of a failure mode, sensitivity to outage
downtime, and its frequency of occurrence.
SOFTWARE RELIABILITY:
The American National Standards Institute (ANSI) and Institute of Electrical and Electronics
Engineers (IEEE) have defined Software Reliability in ANSI/IEEE STD-729-1991 [3] as:
“The probability of failure-free software operation for a specified period of time in a specified
environment”
NASA-STD-8739.8 NASA Software Assurance Standard defines Software Reliability as a
discipline of Software Assurance that:
1. “Defines the requirements for software controlled system fault/failure detection,
isolation, and recovery;
2. Reviews the software development processes and products for software error prevention
and/or reduced functionality states; and,
3. Defines the process for measuring and analyzing defects and defines/derives the
reliability and maintainability factors.”
TARGET OPERATIONAL AVAILABILITY: The desired operational availability
associated with a given NAS Service.
VALIDATION: The process of confirming that a product or work product satisfies or will satisfy
the stakeholders’ needs. It determines that a system does everything it should and nothing it
should not do (i.e., that the system requirements are unambiguous, correct, complete, consistent,
operationally and technically feasible, and verifiable).
VERIFICATION: The process which confirms that the system and its elements meet the
specified requirements; that the system is built as specified.
PURPOSE AND OBJECTIVES
RMA guidance, methods, and techniques are found across a number of disparate sources, e.g.,
FAA Orders, Standards, Handbooks, Specifications, and Non-Government Publications. This
Handbook is a compilation of salient information garnered from these source documents. This
information has been synthesized and formulated into a systematic approach to RMA
requirements management supported by newly established RMA management methods and
tasks.
The Goal: Develop realistic RMA requirements for hardware and software and ensure
that specified RMA characteristics are realized in fielded systems.
The Handbook provides a single concise source to help users gain an understanding of the
subject matter within the context of their discipline. Those needing a deeper understanding of
the subject matter are directed to the referenced documents. The Handbook assists readers in
developing basic skills and describes the significance and implications of the new RMA
management approach so that existing skills can be applied in a new context.
The purpose of this Handbook is twofold:
1. Introduce a new approach to RMA requirements management that is responsive to
changes in the traditional approach to RMA specification and verification.
2. Provide a reference to assist stakeholders with developing realistic RMA requirements
for hardware and software.
The objective is to facilitate the standardization of requirements across procured systems,
promoting a common understanding among the FAA community and its affiliates.
5.1 Background
The primary purpose of defining NAS Enterprise-Level RMA requirements is to relate NAS
system-level functional requirements to verifiable specifications for the hardware and software
systems that implement them and establish a "floor" for the minimum level of those
requirements. An intermediate step in this process is the introduction of the concept of generic
service threads that define the systems/functions which support the various NAS Services to
controllers and/or pilots. The service threads bridge the gap between un-allocated functional
requirements and the specifications of the systems that support them. They also provide the
vehicle for allocating NAS Enterprise-Level RMA-related4 requirements to specifications for
the systems/functions that comprise the service threads. The NAS-Level requirements on which
this Handbook is based are published on the NAS EA portal. This Handbook uses the NAS EA
portal terminology (Version 8.0 or greater) throughout and differentiates “overloaded” terms
that have one definition in the context of the NAS EA and different definitions elsewhere in the
FAA.
NAS Enterprise Level RMA requirements are provided to satisfy the following objectives:
4 The term “RMA-related requirement(s)” includes, in addition to the standard reliability, maintainability and
availability requirements, other design characteristics that contribute to the overall system reliability and
availability in a more general sense (e.g., fail-over time for redundant systems, frequency of execution of fault
isolation and detection mechanisms, system monitoring and control, on-line diagnostics, etc.).
Provides a means of allocating RMA requirements from FAA CONOPS to program
requirements documents.
Establishes a common framework upon which to justify future additions and deletions of
requirements.
Provides uniformity and consistency of requirements across procured systems,
promoting common understanding among the specifying engineers and the development
contractors.
Establishes and maintains a baseline for validation and improvement of the RMA
characteristics of fielded systems.
Provides a starting point for the scaling of Program-Level RMA requirements based on
the capacity of various facility groups. (Scaling service availability requirements to
facility groups based on size and special local needs are discussed in Section 7.4.1.2).
Provides guidance in Section 7.7.3 for Program-Level RMA requirements related to
Infrastructure Systems such as Power and Enterprise Infrastructure Systems.
Purpose of this Handbook
This Handbook:
1) Introduces an RMA requirements management approach that focuses on associating
NAS-Level requirements with service threads and assigning each service thread requirements
that are achievable, verifiable, and consistent with the severity of the service provided to users
and specialists.
2) Provides guidance to stakeholders in the form of a process/tasks to:
Interpret and allocate the NAS-RD-2013 RMA requirements to systems.
Decompose the NAS Enterprise-Level RMA requirements into realistic and achievable
requirements documents and design characteristics.
Establish risk management activities to permit the monitoring of critical fault tolerance
and RMA characteristics during system design and development.
Establish a software reliability growth program to ensure that latent design defects are
systematically exposed and corrected during testing at the contractor’s plant, FAA
testing facilities and subsequent deployment.
Update and maintain the NAS Enterprise-Level RMA requirements definition process.
The RMA Handbook is intended to be a living document. It will be updated periodically to
reflect changes to NAS requirements as well as to incorporate the experience gained from using
techniques described in it and from downstream procurements and implementations.
RMA REQUIREMENTS MANAGEMENT APPROACH
This Handbook describes a new approach to RMA requirements management that focuses on
associating NAS-Level requirements with service threads and assigning each service thread
requirements that are achievable, verifiable, and consistent with the severity of the service
provided to users and specialists. The focus of the RMA management approach is on early
identification and mitigation of technical risks affecting the performance of fault-tolerant
systems, followed by an aggressive software reliability growth program to provide contractual
incentives to find and remove latent software defects. The reader is encouraged to refer to
Appendix J as it provides context and rationale for the guidance provided herein.
The key elements of the approach are:
1. Map the NAS-Level functional requirements to a set of generic service threads based
on the NAPRS services reported for deployed systems (Section 7.2). Assign Service
Thread Loss Severity Categories (STLSC) of “Safety-Critical,” “Efficiency-Critical,”
“Essential,” “Routine,” and “Remote/Distributed” to the service threads
based on the effect of the loss of the service thread on NAS safety and efficiency of
operations. (Sections 7.2.2 and 7.3)
2. Distinguish between Efficiency-Critical threads whose interruptions can be safely
managed by reducing capacity that may, however, cause significant traffic disruption
vs. Safety-Critical threads whose interruption could present a significant safety
hazard during the transition to reduced capacity operations. (Sections 7.2.2 and 7.3)
3. Allocate NAS-Level availability requirements to service threads based on the
severity and associated availability requirements of the NAS capabilities supported
by the threads.
4. Recognize that the probability of achieving the availability requirements for any
service thread identified as Safety-Critical is unacceptably low; therefore, where
possible, decompose the thread into alternate (at least two) independent new threads,
each with a STLSC no greater than “Efficiency-Critical.”
5. Recognize that the availability requirements associated with “Efficiency-Critical”
service threads will require redundancy and fault tolerance to mask the effect of
software failures.
6. Move from using availability as a contractual requirement to verifiable parameters
such as MTBF, Mean Time to Restore (MTTR), recovery times, and mean time
between successful recoveries. Couple these with an aggressive software reliability
growth program.
7. Use RMA models only as a rough order of magnitude confirmation of the potential
of the proposed hardware configuration to achieve the inherent availability of the
hardware, not a prediction of operational reliability.
8. Focus RMA effort, during development, on design review and risk reduction testing
activities to identify and resolve problem areas that could prevent the system from
approaching its theoretical potential.
9. Recognize that “pass/fail” reliability qualification tests are impractical for systems
with high reliability requirements and substitute an aggressive software reliability
growth program.
10. Use NAPRS data from NASPAS to provide feedback on the RMA performance of
currently fielded systems to assess the reasonableness and attainability of new
requirements, and to verify that the requirements for new systems will result in
systems with RMA characteristics that are at least as good as those of the systems
they replace.
11. Apply these principles throughout the acquisition process.
The application of these RMA management methods for the new approach is discussed in detail
in Section 8. These methods apply equally to new acquisitions and established iterative
development programs. All phases of the acquisition process are addressed, including
preliminary requirements analysis, allocation, preparation of procurement documents, proposal
evaluation, contractor monitoring, and design qualification and acceptance testing.
Throughout all phases of design and allocation, appropriately chosen subject matter experts
(SMEs) should be consulted to ensure that an operational viewpoint is incorporated into the
requirements. SMEs can be operational Air Traffic Control personnel from appropriate facility
types and locations, as well as Technical Operations and maintenance personnel with insight
into the maintainability and repair aspects of the systems to be fielded. Early involvement of
SMEs during procurement helps clarify the expected role and utility of a future system and
helps set operational expectations for it. SME input should be carefully documented, weighed
against NAS EA functional expectations for the system, and, where necessary, fed back into the
CONOPS.
DERIVATION OF NAS-LEVEL RMA REQUIREMENTS
The NAS-RD-2013 document has been rewritten to up-level the requirements and align them
with the NAS EA. This section presents background information on the methodology used to
determine the severity level associated with each NAS Service/Function. The resultant severity
level is then applied to set the severity level of the associated service thread.
7.1 NAS-RD-2013 Severity Assessment Process
As shown in Figure 7-1, the NAS-RD-2013 is organized around system functional hierarchies
and system functions defined by the current version of the NAS EA.
• Service Family: Mission Services
– Service Category: Information Services
• Enterprise Service: Aeronautical Information Management
– Major Function: The NAS shall manage NAS configuration information.
• Function: The NAS shall acquire NAS configuration information.
Figure 7-1 Functional Architecture
7.1.1 Severity Level Assessment
Associated with each Enterprise Service are a number of individual functional and non-
functional requirements. Associated with each functional requirement is a “severity level”
assessment defining the type of service thread upon which that functional requirement should be
implemented. The severity level definitions in this version of the RMA Handbook have been
updated to align with the definitions in NAS-RD-2013.
The severity definitions are:
Safety-Critical - A key service in the protection of human life. Loss of a Safety-Critical service
increases the risk of loss of human life.
Efficiency-Critical - A key service used in the present operation of the NAS. Loss of an
Efficiency-Critical service has a major impact on present operational capacity.
Essential - A service that if lost would significantly raise the risk associated with providing safe
and efficient NAS operations.
Routine - A service which, if lost, would have a minor impact on the risk associated with
providing safe and efficient NAS operations.
Service threads can also be characterized as Remote/Distributed, where loss of a service thread
element (e.g., a radar, an air/ground communications site, or a display console) would
incrementally degrade the overall effectiveness of the service thread but would not render the
service thread inoperable.
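The four severity definitions above amount to a decision rule on the consequence of losing a service. A minimal sketch of that rule follows; the category names come from the Handbook, but the ordering values and predicate names are illustrative:

```python
from enum import Enum

class Severity(Enum):
    SAFETY_CRITICAL = 4      # loss increases the risk of loss of human life
    EFFICIENCY_CRITICAL = 3  # loss has a major impact on operational capacity
    ESSENTIAL = 2            # loss significantly raises operational risk
    ROUTINE = 1              # loss has only a minor impact on risk

def classify(loss_risks_life: bool,
             major_capacity_impact: bool,
             significant_risk_increase: bool) -> Severity:
    """Map the consequence of losing a service to its severity level,
    testing the most severe condition first."""
    if loss_risks_life:
        return Severity.SAFETY_CRITICAL
    if major_capacity_impact:
        return Severity.EFFICIENCY_CRITICAL
    if significant_risk_increase:
        return Severity.ESSENTIAL
    return Severity.ROUTINE
```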
The RMA principles developed in this Handbook will ensure that the loss of a single system
will not prevent the NAS from exercising safe separation and control of aircraft. Where loss of
a single system may increase the risk to an unacceptable level, there are procedures to cover any
loss of surveillance, loss of communications, or even loss of both. Implementation of these
procedures may severely disrupt the efficiency of NAS operations, but the most important
objective of maintaining separation is preserved. Pilots also have a responsibility and are
motivated to maintain separation. But the RISK of doing so at a busy ARTCC or TRACON
without automation support is too high. Mitigating this risk leads to the requirement for high-
reliability systems. Table 7-1 shows the Severity Level assignments for the requirements
specified in the NAS-RD-2013. Criteria for defining "loss" of a service vary on a service-by-
service basis. Loss criteria are defined in the NAPRS service definitions linked to in Table 7-2
and Table 7-8. Note
that these links are to a site behind the FAA employee firewall.
Table 7-1 NAS Architecture Services and Functions

NAS-RD-2013 Section   Requirements                                              Severity Level
3.1        Service Family: Mission Services
3.1.1      Category: Information Services
3.1.1.1    Aeronautical Information Management Enterprise Services   Essential
3.1.1.2    Flight and State Data Management Enterprise Services      Efficiency-Critical
3.1.1.3    Surveillance Information Management Enterprise Services   Safety-Critical
3.1.1.4    Weather Information Management Enterprise Services        Essential**
3.1.2      Category: Traffic Services
3.1.2.1    Separation Management Enterprise Services                 Safety-Critical
3.1.2.2    Trajectory Management Enterprise Services                 Efficiency-Critical
3.1.2.3    Flow Contingency Management Enterprise Services           Efficiency-Critical
3.1.2.4    Short Term Capacity Management Enterprise Services        Efficiency-Critical*
3.1.3      Category: Mission Support Services
3.1.3.1    Long Term Capacity Management Enterprise Services         Routine
3.1.3.2    System and Service Analysis Enterprise Services           Essential*
3.1.3.3    System and Service Management Enterprise Services         Essential
3.1.3.4    Safety Management Enterprise Services                     Essential
3.2        Service Family: Technical Infrastructure Services
3.2.1      Surveillance Data Collection Enterprise Services          Safety-Critical
3.2.2      Weather Data Collection Enterprise Services               Essential
3.2.3      Navigation Support Enterprise Services                    Efficiency-Critical
*This section includes requirements at varying severity levels but they have been “rolled up” to
the most severe level, as discussed in Section 7.1.2.
** NAS Requirements Document RD-2013 shows Section 3.1.1.4 Weather Information
Management requirements as Essential. The Next Generation (NextGen) Aviation Weather
Division is advocating that the collective severity level for weather services should be
Efficiency-Critical.
7.1.2 Severity Level Assessment Roll-Up
Severity levels associated with individual functional and non-functional requirements were then
rolled up to the NAS Enterprise Service level or major function level. The general rule was
simply to examine all of the severity levels of the individual functional requirements contained
under each NAS Enterprise Service or major function and then assign an overall severity level
based on the highest severity level of any of the individual constituent functional requirements
under it. However, mechanistically following this process could have led to unintended
consequences. In most cases, all of the functional requirements assigned to a NAS Enterprise
Service have the same severity level, so the Enterprise Service simply inherited the severity of
its constituent requirements. Where this was not the case, the severity level was rolled up at the
major function level instead.
While performing the roll-up, the severity level assignment of each individual functional
requirement was re-examined to make sure it was consistent within the context of the major
function under consideration. The guiding principle was that the severity level assigned to an
individual requirement must be realistic and constitute a significant factor in providing the
major function. The results of applying the severity level roll-up rule appear in the severity NAS
RD Roll-Up column in the matrices in Figure 7-9, Figure 7-10 and Figure 7-11 at the end of this
section.
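The roll-up rule described above can be sketched as a maximum over an ordered severity scale. The ordering Routine < Essential < Efficiency-Critical < Safety-Critical follows the definitions in Section 7.1.1; the code itself is illustrative:

```python
# Severity levels in ascending order of severity (per Section 7.1.1).
SEVERITY_ORDER = ["Routine", "Essential", "Efficiency-Critical", "Safety-Critical"]

def roll_up(requirement_severities):
    """Overall severity of an Enterprise Service or major function is the
    highest severity of any constituent functional requirement."""
    return max(requirement_severities, key=SEVERITY_ORDER.index)

# A major function with mostly Essential requirements and one
# Efficiency-Critical requirement rolls up to Efficiency-Critical:
roll_up(["Essential", "Essential", "Efficiency-Critical"])
```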
NAS-RD-2013 maps enterprise requirements to NAS subsystems that contribute to those
requirements. The subsystems that were selected to be represented in the NAS-RD-2013 are
those that are under NAS configuration control. These subsystems are represented in the NAS-
SV-1H, the Facility, Service, and Equipment Profile (FSEP), or the NAS-MD-001. All selected
subsystems must be denoted as subsystems on the NAS-SV-1H. Variants, functions, and
components are not included. The roll-up process presupposes that all of the functional
requirements comprising a NAS service will eventually be allocated to the same system. In this
case, any lower severity functional requirements contained within a major function of a higher
severity simply inherit the higher severity of the major function and the system to which the
requirements of that major function are allocated.
Consider the case where a NAS service/major function includes requirements that are almost
certain to be allocated to different systems, for example, surveillance functions and weather
functions. Should availability requirements for a system be driven by the consequence of one of
its functional requirements being included in a critical major function? This is at the heart of the
conundrum of attempting to drive the RMA requirements for real systems from unallocated
functional requirements. The challenge is to devise a method for mapping the unallocated
requirements associated with the NAS EA to something more tangible that can be related to
specifications for real systems.
The method proposed is to relate the NAS-RD-2013 requirements to a set of real-world services
that are based on the services monitored by the National Airspace Performance Reporting
System (NAPRS), as defined in FAA Order JO 6040.15[30]. To distinguish these NAPRS-
based services from the NAS Architecture Services, they are designated as service threads, as
discussed in the next section.
7.2 Development of Service Threads
NAS-RD-2013 services are supported by one or more service threads providing services to
user/specialists (i.e., pilots and controllers). One example is surveillance data derived from
sources and sensors, processed into tracks, to which relevant data is associated and displayed to
controllers. Another service thread is navigation data delivered to pilots. Service threads are
realized from interconnected facilities and systems.
The FAA’s Air Traffic and Technical Operations organizations have for years monitored a set
of “service threads” under NAPRS. NAPRS tracks the operational availability and other RMA
characteristics of what it calls “services” (not to be confused with the NAS EA Services)
delivered by individual service threads to specialists. Since these services represent a
“contract” between Technical Operations and Air Traffic Operations, NAPRS services do not
include NAS services, such as navigation, that are not used by air traffic controllers. (The
performance of navigation facilities is monitored, but no corresponding service is defined.)
RMA Service Thread development is based on NAPRS services defined in FAA Order 6040.15.
Basing the service threads on the NAPRS services provides several benefits. FAA personnel are
familiar with the NAPRS services and they provide a common basis of understanding among
the Air Traffic Operations, Technical Operations and headquarters personnel. The operational
data collected by NAPRS allows proposed RMA requirements to be compared with the
performance of currently fielded systems to provide a check on the reasonableness of the
requirements. The use of service threads permits the NAS architecture to evolve as components
within a thread are replaced without the need to change the thread itself. To realize these
benefits, the service threads used in this document should correlate as closely as possible with
the services defined in NAPRS.
However, it has been necessary to define some additional service threads that are not presently
included in FAA Order JO 6040.15.
The NAPRS monitors the performance of operational systems, while the NAS-RD-2013 looks
toward requirements for future systems. Accordingly, new service threads will need to be
created from time to time as the NAS evolves. This process should be closely coordinated with
future revisions to NAPRS services.
The following section provides the traceability between the NAPRS services and the current list
of service threads used in this Handbook.
7.2.1 System of Systems Taxonomy of FAA Systems
FAA systems used to provide the capabilities, as described by the set of requirements specified
in the NAS-RD-2013, can be divided into four major categories: 1) Information Systems, 2)
Remote/Distributed and Standalone Systems, 3) Mission Support Systems and 4) Infrastructure
and Enterprise Systems. Figure 7-2 NAS System Taxonomy presents the NAS System
Taxonomy on which definitions and requirements allocation methodologies for the various
categories of systems can be based. Strategies for each of these system categories are presented
in the paragraphs that follow.
What is represented in Figure 7-2 is a system of systems taxonomy. As such, most NAS
services and resulting service threads will comprise systems from more than one of the major
categories in the taxonomy.
For example, En Route Automation Modernization (ERAM), which is an Automation System
under the Information Systems category in the taxonomy, also relies on and employs other
Information Systems such as the En Route Communications Gateway (ECG). In addition, it
receives inputs from multiple instances of Remote / Distributed & Standalone Systems such as
Air Route Surveillance Radars (ARSR) and is supported locally by Infrastructure Systems such
as power and HVAC. NAS-wide Enterprise System services are provided by communication
transport systems, e.g., the NAS Messaging Replacement (NMR), via the FAA
Telecommunications Infrastructure (FTI).
For the purposes of this Handbook, these taxonomy categories are provided as guidance in
categorizing a system or service such that the RMA approaches described in this Handbook are
properly applied. In cases where it is not clear which category a system or service falls under or
which RMA approach should be applied, then a subject matter expert (SME) should be
consulted for further clarification.
1. Information Systems are the primary focus of the requirements allocation methodology
described in this Handbook. These systems are generally computer systems located in
major facilities staffed by ATC personnel. They consolidate large quantities of
information for use by operational personnel in performing the NAS Air Traffic Control
Mission. They usually have high severity and availability requirements, because their
failure could affect large volumes of information and many users. Typically, they
employ fault tolerance, redundancy, and automatic fault detection and recovery to
achieve high availability. These systems can be mapped to the NAS Services and
Capabilities functional requirements.
2. The Remote/Distributed and Standalone Systems category includes remote sensors,
remote air-to-ground communications, inter-facility data communications and
navigation sites, as well as distributed subsystems such as display terminals that may be
located within a major facility. Failures of single elements, or even combinations of
elements, can degrade performance at an operational facility, but generally they do not
result in the total loss of the surveillance, communications, navigation, or display
capability. Most of the service threads in the Remote/Distributed and Standalone
Systems category are covered by the diversity techniques required by FAA Order
6000.36, Communications Diversity.
Figure 7-2 NAS System Taxonomy
3. The Mission Support Systems category includes systems used to assist in managing the
design of NAS airspace and the utilization of the electromagnetic spectrum. The NAS-
RD-2013 severity definitions and associated availabilities are based on the real time air
traffic control mission. Therefore, there is no basis for allocating these requirements to
service threads and systems that indirectly support the air traffic control mission but that
are not directly involved in the control of air traffic.
4. In the Infrastructure & Enterprise Systems category the scope of the systems included is
limited to those systems that provide power, environment, communications and
enterprise infrastructure services to the facilities that house the information systems.
These systems can cause failures to the systems they support, so traditional allocation
methods and the assumption of independence of failures do not apply to them.
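The incremental degradation described for Remote/Distributed systems (category 2) can be quantified with a simple overlapping-coverage model: service is degraded, not lost, so long as enough sites remain up. A sketch assuming independent site failures; the site counts and availability figures are hypothetical, and the independence assumption does not hold for the shared infrastructure noted in category 4:

```python
from math import comb

def prob_at_least_k_up(n: int, k: int, site_avail: float) -> float:
    """Probability that at least k of n independent, identical sites are
    available (binomial sum over the acceptable states)."""
    p = site_avail
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Four overlapping radio sites at 0.99 availability each, where coverage
# survives as long as any two sites are up:
prob_at_least_k_up(4, 2, 0.99)   # ~0.999996, far above any single site
```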
7.2.2 Categorization of NAPRS Services
The NAPRS services were mapped to service threads, categorized in accordance with the major
system categories shown in the taxonomy. The “Remote/Distributed” service threads represent
services that are provided by remote sensor and voice communications sites. These services
generally represent a “many-to-one” or a “one-to-many” relationship with the control site.
Failure of one of these services may degrade operations, but overlapping coverage and diversity
in the set of distributed sites allows communications or surveillance functions to be maintained.
These classifications and the requirements derivation methods appropriate to each classification
are discussed in detail in Paragraph 8.1.1.
Table 7-2 lists the services from NAPRS used in this document and shows the mapping of these
services to the four categories of service threads established by the taxonomy. Several new
service threads that do not have a corresponding NAPRS service have been created.
RNS - R/F Navigation Service – Navigation Support includes functions performed by ground-
based navigation and landing systems that provide electronic reference signals to assist
an aircraft in determining its position relative to a navigation fix or runway. It also
includes the provision of visual reference to flight crews.
RALS - R/F Approach and Landing Service – TBD
VGS - Visual Guidance Service – TBD
TVSB - Terminal Voice Switch Backup – Provides backup air-to-ground and ground-to-
ground voice communications essential in the terminal domain for the safe, orderly,
and efficient flow of air traffic.
Power is addressed in Section 7.7.3.1, where a methodology to derive RMA requirements for
power distribution services is presented.
The first column of Table 7-2 provides the names of each of the services defined in NAPRS. A
“(NEW)” entry in this column indicates a newly created service thread that does not have a
corresponding facility or service in NAPRS. The remaining columns indicate the category of the
service thread (Information, Remote/Distributed, Support, or Infrastructure) and the domain of
the service thread (Terminal, En Route, or Other). NAPRS services that have not been mapped
to a service thread are also identified in these columns.
The revised RMA requirements development process has augmented the NAPRS services in
Table 7-2 with some additional service threads to include those services that are part of NAS
but not included in the list of NAPRS services. NAPRS services representing lower-level
services from remote inputs that are included in composite higher-level services provided to
user/specialists have been mapped to the Remote/Distributed column in Table 7-2 because the
overall availability of the distributed communications and surveillance architecture is addressed
by a “bottom-up” application of diversity and overlapping coverage techniques as directed by
FAA Order 6000.36 instead of a top-down mathematical allocation of NAS-RD-2013
availability requirements. This is a consequence of the complex set of criteria that can affect the
number and placement of remote sensor and communication sites and the overall availability of
communications and surveillance services within a facility’s airspace. Many of these factors
such as the effects of geography, terrain, traffic patterns, and man-made obstacles are not easily
quantified and must be separately considered on a case-by-case basis. It is not practical to
attempt to establish a priority set of requirements for services in the Remote/Distributed and
Standalone Systems category.
Table 7-2 Mapping of NAPRS Services and Service Threads

Columns: FAA Order 6040.15 Service; Severity Level; Service Thread; service thread category
(Information Service Thread, Remote/Distributed Service Thread, Mission Support Systems, or
Infrastructure & Enterprise Systems); and Domain (Terminal, Figure 7-8; En Route, Figure 7-9;
Other, Figure 7-10).
ADSS Automatic Dependent Surveillance Service Efficiency-Critical ADSS X X
ARINC HF Voice Communications Link Efficiency-Critical ARINC X
ARSR Air Route Surveillance Radar Efficiency-Critical ARSR X
ASDES Airport Surface Detection Equipment Service Essential ASDES X
BDAT Beacon Data(Digitized) Efficiency-Critical BDAT X
BUECS Backup Emergency Communications Service Efficiency-Critical BUECS X
CFAD Composite Flight Data Proc. Efficiency-Critical CFAD X
CODAP Composite Oceanic Display and Planning Efficiency-Critical CODAP X
COFAD Composite Offshore Flight Data Efficiency-Critical COFAD X
CRAD Composite Radar Data Processing Service Efficiency-Critical CRAD X
ECOM En Route Communications Efficiency-Critical ECOM X
ECSS Emergency Communications Systems Service Efficiency-Critical ECSS X
ECVEX En Route Communications Voice Exchange Service Safety-Critical** ECVEX X
ETARS En Route Terminal Automated Radar Service Safety-Critical** ETARS X
FCOM Flight Service Station Communications Essential FCOM X X
FDAT Flight Data Entry and Printout Efficiency-Critical FDAT X
FDIOR Flight Data Input/Output Remote Essential FDIOR X
FSSAS Flight Service Station Automated Service Essential FSSAS X
IDAT Interfacility Data Service Efficiency-Critical IDAT X
LLWS Low Level Wind Service Essential LLWS X
MDAT Mode S Data Link Data Service Efficiency-Critical MDAT X X
MSEC Mode S Secondary Radar Service Efficiency-Critical MSEC X X
NAMS NAS Message Transfer Service Efficiency-Critical NAMS X
NMRS NAS Messaging Replacement Service Efficiency-Critical NMRS X
RDAT Radar Data (Digitized) Efficiency-Critical RDAT X
RMLSS Remote Monitoring and Logging System Service Essential RMLSS X X
RTDS Radar Tower Display System Efficiency-Critical RTDS X
RTADS Radar Tower Automation Display Service Efficiency-Critical RTADS X
RVRS Runway Visual Range Service Essential RVRS X
STDDS SWIM Terminal Data Distribution System Essential STDDS X
TARS Terminal Automated Radar Service Safety-Critical** TARS X
TBFM Time Based Flow Management Efficiency-Critical TBFM X
TBFMR Time Based Flow Management Remote Display Efficiency-Critical TBFMR X
TCE Transceiver Communications Equipment Not Rated* TCE X
TCOM Terminal Communications Efficiency-Critical TCOM X
TCVEX Terminal Communications Voice Exchange Safety-Critical** TCVEX X
TDWRS Terminal Doppler Weather Radar Service Essential TDWRS X
TFMS Traffic Flow Management System Efficiency-Critical TFMS X X
TFMSS Traffic Flow Management System Service Efficiency-Critical TFMSS X
TRAD Terminal Radar Efficiency-Critical TRAD X X
TSEC Terminal Secondary Radar Efficiency-Critical TSEC X X
TVS Terminal Voice Switch Efficiency-Critical TVS X
TVSB Terminal Voice Switch Backup (New) Efficiency-Critical TVSB X
VGS Visual Guidance Service (New) Efficiency-Critical VGS X
VSCSS Voice Switching and Control System Service Efficiency-Critical VSCS X
VTABS VSCS Training and Backup Switch Efficiency-Critical VTABS X
WAAS Wide Area Augmentation System Service Essential WAASS X X
WDAT WMSCR Data Service Essential WDAT X
WIS Weather Information Service Essential WIS X X X
WMSCR Weather Message Switching Center Replacement Essential WMSCR X
RALS R/F Approach and Landing Services (New) Efficiency-Critical RALS X
RNS R/F Navigation Service (New) Efficiency-Critical RNS X
*TCE is a Secondary backup to the TCOM service. The initial (E-C) backup for TCOM is to use the alternate (hardwired) headset jack to directly access the radio site.
** Two Efficiency-Critical threads make up the Safety-Critical thread.
The service thread list will continue to evolve as the NAS architecture evolves and the NAPRS
list of reportable services is updated; encapsulating the mapping in a single table (Table 7-2)
facilitates the resulting additions, deletions, and modifications. The intent of the table is also to
assure that there have been no inadvertent omissions in the construction of the matrices in
Figure 7-9, Figure 7-10, and Figure 7-11. Table 7-3 summarizes numerically the mapping of the
54 service threads defined in Table 7-2 to those matrices. The total number of service threads in
the three matrices (62) is greater than the 54 defined in Table 7-2 because some service threads
appear in both the Terminal and En Route matrices and are therefore counted twice.
Table 7-3 Summary of Mapping of Service Threads to STLSC Matrices

                        Information  R/D      Mission  Infrastructure
                        Service      Service  Support  & Enterprise
                        Threads      Threads  Threads  Service Threads  Totals
Terminal STLSC Matrix       12          11       1                         24
En Route STLSC Matrix       10          17                    1            28
“Other” STLSC Matrix         5           3       1            1            10
Total                       27          31       2            2            62
The final set of service threads is presented in the three matrices illustrated in Figure 7-9, Figure
7-10, and Figure 7-11.
These matrices relate the service threads for the Terminal, En Route, and “Other” domains to
the Services and major functions in NAS-RD-2013. The following paragraphs describe the
process and rationale for constructing the matrices.
Each service thread is defined by a verbal description and a diagram. Each service thread
diagram specifies the NAS Services/Functions supported by the service thread. A sample
diagram illustrating the Terminal Automated Radar Service (TARS) Service Thread is
illustrated in Figure 7-3.
Figure 7-3 Example Thread Diagram: Terminal Automated Radar Service (TARS), showing
TRAD/TSEC broadband backup to TARS for Safety-Critical terminal radar service. The
diagram depicts a remote radar site (ASR with ATCRBS) feeding the TRAD and TSEC threads
via RCL/RML/FAA lines to terminal automation and displays, supporting 3.1.2.1 Provide
Separation Management, 3.1.1.4 Provide Weather Information, and 3.1.2.2 Provide Trajectory
Management. The TRAD, TARS, and TSEC threads are Efficiency-Critical in the Terminal
domain.
7.3 Service Thread Contribution
To characterize the significance of the loss of a service thread, the NAS-RD-2013 severity
assessment process looked at the anatomy of a typical failure scenario.
Figure 7-4 Effect of Service Interruptions on NAS Capacity (based on a report prepared for the
Air Traffic System Development Integrated Product Team for Terminal). The figure plots local
NAS capacity over time: normal capacity with automation, a service failure, a transition hazard
interval with an efficiency loss as a result of the failure, and a steady-state reduced capacity.
Figure 7-4 depicts the elements of a service failure scenario. Before the failure, with fully
functional automation and supporting infrastructure, a certain level of local NAS capacity is
achievable. After the failure, a hazard period exists while the capacity is reduced to maintain
safety.
The potential effect of reduced capacity on efficiency depends on the level of demand. If the
demand remains far below the airspace capacity available with the remaining level of
automation, then, even at the reduced capacity, there is no effect on efficiency. Trouble begins
when the demand is close to the available capacity. If implementing procedures to
accommodate a service thread failure causes the demand to exceed the available capacity, then
queues start to build, and efficiency is impacted. The reduced capacity may be local, but the
effects could propagate regionally or nationwide. The result is a potential loss of system
efficiency, with significant economic consequences as flights are delayed and cancelled.
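The queue buildup described above can be illustrated with a simple deterministic fluid model: while demand exceeds the reduced capacity, backlog grows at the difference of the two rates. All rates and durations here are hypothetical:

```python
def backlog_after_outage(demand_per_hr: float, capacity_per_hr: float,
                         outage_hours: float) -> float:
    """Flights queued by the end of an outage: backlog accumulates only
    while demand exceeds the (possibly reduced) capacity."""
    return max(0.0, (demand_per_hr - capacity_per_hr) * outage_hours)

# Demand well below the reduced capacity: no queue, no efficiency impact.
backlog_after_outage(30, 60, 2.0)   # 0.0
# Demand above the reduced capacity during a 2-hour outage:
backlog_after_outage(70, 60, 2.0)   # 20.0 flights queued
```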
Now, consider a critical NAS capability, such as flight plan processing supported by a service
thread “A” (See Figure 7-5). The effect of the loss of a service thread on NAS safety and
efficiency is characterized by the Service Thread Loss Severity Category.
In Case 1, when the service thread fails, the controller switches to manual procedures that
reduce traffic flow and increase separation to maintain safety. Depending on the level of
demand at that local NAS facility, the transition hazard period may be traversed without
compromising safety. However, when the level of demand at that local facility is significant, the
loss of the service thread may have efficiency implications and a significant ripple effect on the
broader NAS. If it does, the service thread is assigned a Loss Severity Category of Efficiency-
Critical. Because the loss of an Efficiency-Critical Service Thread has regional or nation-wide
impact, it might receive much attention and be disruptive, but not life threatening.
Figure 7-5 Service Thread Loss Severity Categories - Case 1. The figure summarizes Case 1:
when automation fails, the controller switches to manual procedures and safety is maintained
during the transition; the impact of the loss of the service thread on safety and efficiency
determines its Service Thread Loss Severity Category (STLSC): Efficiency-Critical if the loss
is disruptive and results in delays but is not life-threatening, or Essential if there is some lesser
impact.
If loss of the service thread “A” has only localized impact, then it is considered to be of Loss
Severity Category Essential. Loss of the service thread has an impact on local NAS efficiency.
Now consider Case 2 (see Figure 7-6): again a NAS service, such as aircraft-to-aircraft
separation, but this time proposed to be supported by a single service thread “X”. The level
of demand at the local NAS facility, though, is such that the transition hazard period cannot be
traversed without compromising safety. This is a potentially Safety-Critical situation that should
not be, and is not, supported today by a single service thread. Loss of such a “Safety-Critical”
service thread would likely result in a significant safety risk and increased controller stress
levels during the transition to reduced capacity operations.
Figure 7-6 Potential Safety-Critical Service Thread - Case 2
Note, “Safety-critical” relates to an assessment of the degree of hazard involved in the transition
to a lower Local NAS Capacity. This designation distinguishes this set of circumstances from
the more common safety analysis methods intended to define a direct cause and effect
relationship between a failure and its consequences – for example, loss of digital fly-by-wire
will almost certainly result in the loss of the aircraft and/or life. In contrast, loss of a Safety-
Critical service thread will put undue stress on controllers, may result in some violations of
separation standards, and an increased risk of a serious incident, but there is no certainty that a
serious incident will result from the interruption.
Establishing requirements to make a Safety-Critical Service Thread so reliable that it will
“virtually never fail” is unrealistic given today’s state-of-the-art in software-intensive systems.
The level of reliability and availability that would be required to support a Safety-Critical
Service Thread cannot be predicted or verified with enough accuracy to be useful, and has never
been achieved in the field. For these reasons, any such requirements are meaningless. The FAA has learned this in the past and has no Safety-Critical Service Threads in the field. However, Safety-Critical services do exist that consist of two Efficiency-Critical threads.

Case 2
• When Service Thread X fails, switch to manual procedures
• Significant risk that safety could be compromised during the transition to manual procedures
• Severity of Service Thread X is determined by the risk of compromising safety during the transition
  – Safety-Critical – life-threatening
  – Service Thread X cannot be made reliable enough to reduce the hazard probability to acceptable levels
  – The NAS has none today
Perhaps a hypothetical single service thread supporting terminal surveillance would, then, be Safety-Critical. In the field, however, the surveillance service has been decomposed into two independent service threads, both providing the Composite Radar Data Processing (CRAD) service. This is accomplished by paired service threads employing an ERAM system and an independent Enhanced Back-up Surveillance System (EBUS), each of which is only Efficiency-Critical. Similarly, the communications switching service has been decomposed into two independent service threads, i.e., the Voice Switching and Control System (VSCS) and the VSCS Training and Backup System. By providing two independent threads, the unachievable requirements for a single Safety-Critical thread are avoided5.
Figure 7-7 (Case 3) adds a second service thread, “Y,” to complement service thread “X”. Thread “Y” supplies sufficient functionality to maintain safety while the hazard period is traversed: safety is maintained because the controller can switch to the alternate service thread. It may also provide sufficient capacity to maintain efficiency and minimize impact on the orderly flow of the NAS.
Figure 7-7 Decomposition of Safety-Critical Service into Threads
Whether or not the service thread needs to be full-service or reduced-capability depends on how
much time is spent on the backup service thread.
5 For an extensive discussion of redundancy and diversity applied to air traffic control systems
see En Route Automation Redundancy Study Task, Final Report, March 2000.
Case 3
• Independent backup Service Thread Y introduced
• Safety is maintained because the controller can switch to the backup Service Thread
• Efficiency may or may not be sustained, depending on the capability of Service Thread Y
• Safety is always maintained
The bottom line is – if a new service thread is determined to be “Safety-Critical,” i.e. the
transition to manual procedures presents a significant risk to safety, the potential new service
thread must be divided into at least two independent service threads that can serve as primary
and backup(s).
7.4 Scaling of Service Threads

The severity of a service thread cannot always be assessed without examining the context in which the service is deployed. In particular, the distinction between Efficiency-Critical and
Essential Service Threads and between Essential and Routine may depend on the size and type
of the staffed facility, and on environmental factors including geography, air space design, and
local climate conditions.
This Section introduces the concept of Scaling of Service Threads based on Facility Groups and
the environment. This revision of the Handbook has updated the STLSC matrix format to
account for service thread scalability by severity in a manner similar to the mapping of Facility
Power System Architecture to Service Threads in previous revisions.
7.4.1 Facility Grouping Schema
The schema in this section places the various types of FAA facilities into eight groups (Table
7-4) to provide an additional level of accuracy when obtaining Service Thread Loss Severity
Categories (STLSCs). In some instances, the severities of the same service thread may vary
based on the type of facility utilizing it, so it is important to add this additional dimension to the
STLSC assessment.
There are four major groups that facilities fall under: ARTCCs, TRACONs, ATCTs, and Unstaffed Facilities. Staffed Facilities that are combinations of any of the three staffed groups should be classified under the portion of the facility that controls the greater airspace and volume of traffic; see the descriptions in Table 7-4.
Table 7-4 Facility Grouping and Descriptions

Group 1:
• Air Route Traffic Control Center (ARTCC): An air traffic control facility that provides air traffic control service to aircraft operating on IFR flight plans within controlled airspace and principally during the en route phase of flight. When equipment capabilities and controller workload permit, certain advisory/assistance services may be provided to VFR aircraft.
• Combined Center/Radar Approach Control (CERAP): An FAA air traffic control facility combining the functions of an ARTCC and a TRACON.
• Combined Control Facility: An air traffic control facility which provides approach control services for one or more airports as well as en route air traffic control (center control) for a large area of airspace. Some may provide tower services along with approach control and en route services.

Groups 2 (Large), 3 (Med), 4 (Small):
• Terminal Radar Approach Control (TRACON): An FAA air traffic control facility using radar and air/ground communications to provide approach control services to aircraft arriving, departing, or transiting the airspace controlled by the facility.
• Radar Approach Control (RAPCON): A terminal air traffic control facility using radar and non-radar capabilities to provide approach control services to aircraft arriving, departing, or transiting airspace controlled by the facility.
• Combined TRACON and Tower with Radar: An air traffic control facility which provides radar control to aircraft arriving or departing the primary airport and adjacent airports, and to aircraft transiting the terminal's airspace. This facility is divided into two functional areas: radar approach control positions and tower positions. These two areas are located within the same facility, or in close proximity to one another, and controllers rotate between both areas.
• Combination Non-Radar Approach Control and Tower without Radar: An air traffic control facility that provides air traffic control services for the airport at which the tower is located and, without the use of radar, approach and departure control services to aircraft operating under Instrument Flight Rules (IFR) to and from one or more adjacent airports.
• Combined TRACON: An air traffic control terminal that provides radar approach control services for two or more large hub airports, as well as other satellite airports, where no single airport accounts for more than 60 percent of the total Combined TRACON facility's air traffic count. This terminal requires such a large number of radar control positions that it precludes the rotation of controllers through all positions.

Groups 5 (Large), 6 (Med), 7 (Small):
• Tower without Radar: An airport traffic control tower that provides service using direct observation, primarily to aircraft operating under Visual Flight Rules (VFR). These terminals are located at airports where the principal user category is low performance aircraft.
• Towers with Display (VFR): An ATCT providing takeoff and landing services only. It does not provide approach control services.
• Towers with Radar: Single facilities that combine an approach control with an ATCT, providing both radar and non-radar air traffic control services.

Group U (Unstaffed):
• Radar: Radio detection and ranging equipment. Radars determine the aircraft's bearing and distance by measuring the interval between the transmission and reception of a radio pulse.
• Navigational Aids (NavAid): Any visual or electronic devices used by pilots to navigate, including VOR/DME, NDB, TACAN, etc.
• Remote Communications Air/Ground (RCAG): These unstaffed facilities enable communication between pilots and ATC specialists.
7.4.1.1 ARTCCs

The grouping of ARTCCs is a simple process using the assumption that “ARTCCs of all sizes will have service threads of the same severity.” The single group for ARTCCs includes CERAPs and Combined Control Facilities. These combined facilities were included in this grouping because of the severity of en route centers: service threads in a center must handle a much higher volume of traffic than those in a TRACON or ATCT.
7.4.1.2 TRACONs

To group TRACONs, three classes are established based on facility levels. Facility level is calculated by a complex formula that takes unweighted traffic volume and modifies it through a number of factors, including geography, runway layout, class of airspace, and type of facility [71]. The facility level may relate to the severity of service threads and the scope of
impact during service downtimes. The three TRACON groups as they relate to facility level are
captured in Table 7-5 and can be referenced in Table 7-4, Facility Group 2, 3 and 4.
Table 7-5 TRACON Grouping
  Small:  Facility Level X ≤ 8
  Medium: Facility Level 9 ≤ X ≤ 10
  Large:  Facility Level X ≥ 11
The Large class of TRACONs consists of all the major facilities that deal with significantly
more traffic than the rest of the facilities in the country. Following that reasoning, the remaining
TRACONs are placed into either the Small or Medium class based on their facility level, which is also directly related to the volume of traffic. As an example, Figure 7-8 shows the difference in traffic volume between the various levels of facilities. It is clear that Facility Levels 11 and 12 are in a category of their own, with some facilities handling up to about 2 million operations per year. There is also the possibility that facilities might change groupings due to an upgrade or downgrade, but the criteria for changing facility levels set a standard that avoids constant upgrades and downgrades [72]. This stability of facility levels can also be seen in Figure 7-8: in the data from 2006 to 2012 [74], the facilities bordering the grouping boundaries did not change levels at all despite the overall decrease in air traffic.
Figure 7-8 Comparison of TRACONs over time by annual total number of operations
7.4.1.3 ATCTs

Within the ATCT group, the service threads mainly relate to traffic flow at the various airports, so the classification of ATCTs is also based on Facility Level. Facilities included in this grouping are the “towers without radar,” “towers with display,” and “towers with radar” facilities. The ATCTs are divided into Large, Medium, and Small groups.
Table 7-6 Criteria Used for Tower Classification [73]
  Small:  Facility Level X ≤ 8
  Medium: Facility Level 9 ≤ X ≤ 10
  Large:  Facility Level X ≥ 11
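The grouping rule shared by Tables 7-5 and 7-6 can be sketched as a small classifier. This is an illustrative sketch only; the function name and class labels are not defined by the Handbook.

```python
# Illustrative sketch of the Table 7-5 / Table 7-6 grouping rule.
# The function name and labels are hypothetical, not Handbook-defined.

def size_class(facility_level: int) -> str:
    """Map a TRACON or ATCT facility level X to its size class:
    Small if X <= 8, Medium if 9 <= X <= 10, Large if X >= 11."""
    if facility_level <= 8:
        return "Small"
    if facility_level <= 10:
        return "Medium"
    return "Large"

print(size_class(8), size_class(9), size_class(11))  # Small Medium Large
```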
7.4.1.4 Unstaffed Facilities

Unstaffed Facilities are remote locations housing systems that provide services to one or more associated staffed facilities (ARTCC or TRACON). These facilities may be FAA owned or owned by contractors providing leased services (e.g., SBS). Services provided by unstaffed facilities have a severity level based on the size and type of the staffed facilities serviced, and on any special environmental factors. Unstaffed Facilities' services are thus rated at the highest severity level required by the staffed facility or facilities. For example, suppose an airport surveillance radar site provides surveillance services to TRACON “X”, ATCT “Y”, and ARTCC “Z”, for which the severity of the surveillance services is determined to be Essential, Essential, and Efficiency-Critical, respectively. In this case the surveillance services provided by the radar site would be considered Efficiency-Critical.
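The radar-site example above amounts to taking a maximum over an ordered severity scale. A minimal sketch, assuming a simple numeric ranking of the categories (the ranking values themselves are an assumption, used only to order the categories):

```python
# Severity roll-up for an unstaffed facility: the site inherits the highest
# severity required by any staffed facility it serves. The numeric ranks
# are illustrative assumptions.

SEVERITY_RANK = {"Routine": 0, "Essential": 1, "Efficiency-Critical": 2}

def unstaffed_severity(required_severities):
    """Return the highest severity among the staffed facilities served."""
    return max(required_severities, key=SEVERITY_RANK.__getitem__)

# TRACON "X" and ATCT "Y" require Essential; ARTCC "Z" requires
# Efficiency-Critical, so the radar site is rated Efficiency-Critical.
print(unstaffed_severity(["Essential", "Essential", "Efficiency-Critical"]))
```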
7.4.2 Scaling Service Threads to Facility Groups

The STLSC values assigned in the Handbook should be used as a guide, but in the process of determining the appropriate RMA requirements for services at a given facility, it is recommended to consult SMEs to:

• Determine the severity of service threads in operations at varying facility types
• Identify back-up approaches that can be used in the event a service thread is lost
• Determine logistic support considerations that may impact availability, such as level of replacement (system/subsystem vs. spare parts), spares availability, and travel time to remote sites
• Verify local system architecture and communications connectivity at the reliability block diagram level
• Define local environment or seasonal conditions impacting availability
Involving SMEs in these activities will ensure that the appropriate service level requirements
are derived.
When further guidance is required, one potential approach is to conduct a Service Risk Assessment (SRA). An SRA is a process that utilizes SMEs to assess risk at a service thread level. One example would be to determine whether a service thread increases in risk due to the effects of co-locating several facilities into one facility. This example would utilize SMEs to analyze the service threads where they are today and where they will be in the future, for instance in a new combined facility. The SRA team would then compare the availability results from both the existing and future assessments and determine whether there are any decreases in availability. One potential area where availability could decrease is changes in the back-up procedures or services. If a decrease is found, the SRA team would identify mitigations that, when implemented, could return the service threads to today's levels. This is only one example of many areas, including EIS, where the SRA process could be applied to ensure service thread availability in the NAS.
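The availability-comparison step of an SRA could be sketched as below. All thread names and availability figures are invented for illustration; the actual assessment relies on SME judgment, not just a numeric comparison.

```python
# Hypothetical sketch of the SRA comparison step: flag service threads whose
# assessed availability decreases when moving to a future (e.g., combined)
# facility. Thread names and numbers are illustrative only.

def threads_needing_mitigation(existing: dict, future: dict) -> list:
    """Return (sorted) threads whose future availability drops below today's."""
    return sorted(t for t, a in future.items() if a < existing[t])

existing = {"Thread-A": 0.9999, "Thread-B": 0.99999}
future   = {"Thread-A": 0.9995, "Thread-B": 0.99999}  # back-up procedure changed
print(threads_needing_mitigation(existing, future))   # ['Thread-A']
```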
7.4.3 Environmental Complications

Some services may require higher STLSC values at certain facilities due to environmental factors. For example, at a tower where there are often visibility issues, such as fog or terrain blocking line-of-sight, surface surveillance may warrant a greater STLSC value than at a facility where out-the-window visibility is usually optimal. In these cases, the Handbook-assigned STLSC values should be used as a recommendation, but SMEs should also be consulted to make the optimal determination for NAS safety and efficiency.
7.5 Assign Service Thread Loss Severity Category

In assessing the severity of a service thread failure, there are two issues to consider:
1. How hazardous is the transition to a reduced steady-state capacity? For example, is safety compromised while controllers increase separation and reduce traffic flow to achieve the reduced capacity state?
2. What is the severity of the impact of the service thread failure on NAS efficiency and traffic flow? This severity depends on several non-system-related factors such as the level of demand, level of the facility, time of day, and weather conditions.
If the transition risk is acceptable, the only issue in question is the effect of a failure on the
efficiency of NAS operations. If the effect could cause widespread delays and flight
cancellations, the service thread is considered “Efficiency-Critical.” If the reduction in NAS
capacity results in some, but not widespread, disruptions to traffic flow, then the service thread
is rated “Essential.” If, however, the hazard during transition to reduced traffic flow is
significant, prudent steps must be taken to reduce the probability of that hazardous transition to
an acceptable level. Experience has shown that this cannot be accomplished with a single service thread; instead, a prudent design dictates the use of two independent service threads, each designed to support the Safety-Critical requirement, together with a simple manual capability for switching from one to the other (sometimes called a “knife switch” capability)6.
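The two-question assessment above can be expressed as a short decision procedure. This is a sketch under stated assumptions: boolean inputs stand in for the SME judgments, and the Routine category is out of scope of this simplified view.

```python
# Hypothetical sketch of the STLSC decision described in the text:
# first ask whether the transition to reduced capacity is hazardous,
# then how broad the efficiency impact would be.

def assign_stlsc(transition_hazard_significant: bool,
                 widespread_efficiency_impact: bool) -> str:
    if transition_hazard_significant:
        # Per the text, such a service must be delivered by two independent
        # service threads with a simple manual ("knife switch") changeover.
        return "Safety-Critical"
    if widespread_efficiency_impact:
        return "Efficiency-Critical"
    return "Essential"

print(assign_stlsc(False, True))   # Efficiency-Critical
print(assign_stlsc(False, False))  # Essential
```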
6 The most significant weakness of the HOST-DARC system was the reduced functionality available with the backup data path (DARC). Analysis of the NASPAS data shows that 98% of the time controllers were required to use the DARC system was to accommodate scheduled outages. This is essentially a procedural issue concerning Technical Operations technicians, although the close coupling of redundant Host processors is a significant factor. Whether or not the percentage of time spent on DARC to accommodate scheduled Host outages could have been reduced, scheduled outages would always have comprised a significant portion of total Host outages due to the desirability of retaining closely coupled resources within a data path to provide high availability, as discussed above. Therefore, to mitigate the effects of scheduled and unscheduled outages on full service availability, a higher functionality secondary data path was implemented through EBUS (EBUS does not fully implement flight data processing). The HOST replacement system, ERAM, has implemented secondary pathing that provides reduced capacity and different functionality but the same service.
Redundancy alone is not enough to mitigate the effects of system faults. Redundant system
components can provide continuing operations after a system fault only if the resource
experiencing said fault is independent of its equivalent, redundant component. In the case of
standby redundancy, the ability to provide continuing operations depends on the successful
operation of complex automatic fault detection and recovery mechanisms. The beneficial effects
of redundancy are therefore maximized when resources are organized into active redundant
groups, and each group is made to be independent of the other groups. In this context,
independence means that resources in one group do not rely on resources either contained in
other groups or relied upon by the resources comprising other groups. Independence applies to
hardware and software.
FAA systems operate at extremely high availability in large part because of the physical
independence between the primary and secondary data path equipment. This is in contrast to the
tight coupling of resources generally found between redundant resources within a data path.
This tight coupling induces a risk, however, that certain failure modes will result in the loss of
all tightly coupled resources within a data path. One or more separate active data paths loosely
coupled to the failed data path is provided to ensure there is no disruption of service when these
more comprehensive failure modes are encountered. A simple switching mechanism (or isolated voting mechanism) is also needed between the multiple data paths; in the current system this consists of a capability to switch nearly instantaneously between service threads by depressing a single button. Therefore, we conclude that to continue to provide this high level of service availability, redundant independent data paths are needed in the target architecture.
For these reasons, the target architecture for systems supporting critical services should provide separate and independent, full-functionality, continuously active data paths, with each data path composed of tightly coupled redundant components. This is required to ensure that the extremely high availability of equipment and services achieved with the current system is carried into the future, and to mitigate the effects of scheduled outages better than the current system does. The use of independent data paths is critical to achieving extremely high availability, and care must be taken not to compromise this characteristic during any phase of the system life cycle. Full functionality on both data paths is needed to mitigate the impacts of planned and unplanned primary channel outages, both of which are inevitable in an automation system.
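As a back-of-the-envelope illustration (not a Handbook requirement), the value of two truly independent data paths follows from elementary reliability math: if each path independently achieves availability A, the service is unavailable only when both paths are down at once.

```python
# Availability of a service delivered over n independent, continuously
# active data paths, each with availability a. The formula holds only if
# path failures are statistically independent -- exactly the property the
# text warns is at risk when resources are tightly coupled across paths.

def service_availability(a: float, n_paths: int = 2) -> float:
    return 1.0 - (1.0 - a) ** n_paths

# Two independent paths at 0.999 each yield roughly "six nines":
print(service_availability(0.999, 2))   # ~0.999999
print(service_availability(0.999, 1))   # single path, for comparison
```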
This leads us to the following definitions for Service Thread Loss Severity Categories with
comments on the defining characteristics of each:
1) Safety-Critical – Service thread loss would present an unacceptable safety hazard during transition to reduced capacity operations.
   • Loss of a service thread supporting the service would impact safety unless a simple, manual switchover to a backup service thread was successfully accomplished. Depending on operational requirements, the secondary service thread might have a lower capacity or functionality than the primary.
   • FAA experience has shown that this capability is achievable if the service is delivered by at least two independent service threads, each built with off-the-shelf components in fault tolerant configurations.
2) Efficiency-Critical – Service thread loss could be accommodated by reducing capacity without compromising safety, but the resulting impact might have a localized or system-wide economic impact on NAS efficiency.
   • Experience has shown that this is achievable by a service thread built of off-the-shelf components in a fault tolerant configuration.

3) Essential – Service thread loss could be accommodated by reducing capacity without compromising safety, with only a localized impact on NAS efficiency.
   • Experience has shown that this is achievable by a service thread built of good quality, industrial-grade, off-the-shelf components.

4) Routine – A service which, if lost, would have a minor impact on the risk associated with providing safe and efficient NAS operations.
For Remote/Distributed threads, loss of a service thread element, i.e., radar, air/ground
communications site, or display console, would incrementally degrade the overall effectiveness
of the service thread but would not render the service thread inoperable.
The NAS-RD-2013 RMA requirements development process has assigned a STLSC to each
identified service thread.
7.6 STLSC Matrix Development

The results of this process are summarized in the matrices in Figure 7-9, Figure 7-10, and Figure 7-11, which are provided as samples from NAS-RD-20137. These matrices provide the mapping between the NAS architecture capabilities in Table 7-1 and the service threads in Table 7-2.
The three matrices represent the service threads contained in each of the Terminal, En Route,
and “Other” domains. All of the matrices contain service threads in both the Information and
Remote/Distributed and Standalone Systems categories illustrated in Table 7-1. In addition, the
“Other” matrix contains the service threads in the Support Systems and Infrastructure and
Enterprise Information Systems categories of the taxonomy. Although the matrices are
organized around the domains of the service threads, the development of the RMA requirements
for the service threads depends on their category in the taxonomy (Information,
Remote/Distributed and Standalone, Support, or Infrastructure and Enterprise Information
Systems).
The column to the right of the NAS Services/Functions in the matrices represents the results of
“rolling up” the severities assigned to each of the individual NAS-RD-2013 functional
requirements contained in a NAS architecture capability to a single value representing the
severity of the entire NAS function. Each NAS architecture capability is assigned a severity
level of “Critical,” “Essential,” or “Routine.” (The only capabilities having a “Routine” severity
were those capabilities associated with non-real-time mission support. All of the capabilities
associated with the real-time air traffic control mission have a severity of “Essential” or higher.)
Each matrix is divided into two sections. The top section above the black row contains
information concerning the characteristics of the service threads, including the overall STLSC for each service thread, and differentiates the service levels required for differing facility sizes. The section of the matrix below the black row shows the mapping between the NAS Services/Functions and the service threads.

7 In the event of a discrepancy between these matrices and those in NAS-RD-20XX, the NAS EA-RD-20XX takes precedence.
The individual cell entries in a row indicate which of the service threads support a given major
function. The individual cell entries in a service thread column indicate which of the various
architecture capabilities are supported by the service thread. The numerical entries in the cells
represent the STLSC associated with the loss of a service thread on each of the specific
architecture capabilities that are associated with that thread. A cell entry of “N” for “not rated”
indicates one of two conditions: (1) Loss of the major function is overshadowed by the loss of a
much more critical major function, which renders the provision of the major function
meaningless in that instance, or (2) The major function is used very infrequently, and should not
be treated as a driver for RMA requirements. For example, loss of the ability to communicate with aircraft may affect the capability to provide NAS status advisories, but the effect of the loss of air-ground communications on the far more critical capability to maintain aircraft-to-aircraft separation overshadows the capability to provide NAS status advisories, so it is “Not Rated” in this instance.
A column labeled “Manual Procedures” has been added on the right side of each of the
matrices, with a “P” in every cell. This is to illustrate that the NAS capabilities are provided by
FAA operational personnel using a combination of automation tools (service threads) and
manual procedures. When service threads fail, manual procedures can still be employed to
continue to provide the NAS capabilities supported by the failed service thread(s). The “P” in
every cell indicates that there are always manual procedures to provide the NAS capabilities, so
that the NAS-RD-2013 availability requirements associated with Service/Function severity can
be achieved despite service thread interruptions.
The first row below the service thread names shows the pairing of service threads providing
Safety-Critical services. The Safety-Critical service thread pairs are coded red and designated
by an arrow spanning the two service threads. Note that the STLSC for each of the service
threads making up a Safety-Critical pair is “Efficiency-Critical.” This recognizes the fact that a
single service thread is incapable of achieving the level of availability needed for Safety-Critical
applications. The availability associated with each STLSC was obtained by “rolling up” the
availabilities associated with NAS Service/Functions as discussed in Section 7.1.2.
The overall Service Thread Loss Severity Category (STLSC) for each service thread was
obtained by “rolling up” the STLSCs for each of the cells in a service thread column. The
overall STLSC for each service thread is represented by the highest severity of any of the cells
in the service thread’s column. The overall STLSCs are in the second row below the service
thread names.
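The column roll-up described above is again a maximum over the severity scale, this time skipping “N” (not rated) cells. A minimal sketch with invented cell values:

```python
# Overall STLSC of a service thread = highest severity appearing in any
# cell of its matrix column; "N" (not rated) cells do not participate.
# The ranking values and the sample column are illustrative only.

RANK = {"Routine": 0, "Essential": 1, "Efficiency-Critical": 2}

def column_roll_up(cells):
    """Roll up a service thread column to its overall STLSC."""
    rated = [c for c in cells if c != "N"]
    return max(rated, key=RANK.__getitem__)

print(column_roll_up(["Essential", "N", "Efficiency-Critical", "Essential"]))
# -> Efficiency-Critical
```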
The row(s) beneath these two rows represent the power system architectures associated with the
service threads. Power distribution systems used by the service threads are discussed in greater
detail in Section 7.7.3.
Each of the matrices contains two categories of service threads: service threads representing
information services provided to controllers by systems located within the facility, and service
threads representing Remote/Distributed Services that include remote surveillance and
communications sites serving the facility and intercommunications between the facility and
other remote facilities.
The Remote/Distributed Service Threads provide an STLSC for reference, but use a “D” in the cells of the R/D Service Thread columns in addition to an STLSC value to indicate which NAS capabilities are supported by the R/D Service Thread and their severity. The “D” is used to
indicate that the diversity techniques in FAA Order 6000.36 are used to achieve the required
level of availability. It should be noted that an R/D Service Thread is a generic representation of
a service thread with multiple instantiations. For example, in an ARTCC, the En Route
Communications Service Thread (ECOM) will typically have several dozen instantiations of the
service thread at specific locations (e.g. Atlantic City RCAG) to provide complete and
overlapping coverage for the center’s airspace.
The columns of the matrices have been color-coded to indicate which of the service threads are
not directly mapped from the NAPRS services. Grey indicates newly created service threads
that are not included in NAPRS.
7.6.1 Terminal STLSC Matrix
The matrix in Figure 7-9 shows all service threads associated with the Terminal domain. These
include Information Service Threads, Remote/Distributed Service Threads and Infrastructure
and Enterprise Systems Service Threads.
NOTE(s):
• TARS is a “Safety-Critical” Umbrella Service comprised of two Efficiency-Critical Service Threads (see Figure E-27). Figure E-27 illustrates two redundant Terminal Service Threads, i.e., TARS-1 and TARS-2.
• Terminal Communications Voice Exchange (TCVEX) is a “Safety-Critical” Umbrella Service comprised of two Efficiency-Critical Service Threads, i.e., the Terminal Voice Communications Safety-Critical Service Thread Pair (see Figure E-45). Figure E-45 illustrates a Service Thread Pair consisting of two redundant Efficiency-Critical Terminal Service Threads.
The Remote/Distributed Service Threads are characterized by equipment that is located at the
control facility, (e.g. TRACON) and equipment that is remotely located and linked to the
control facility by one or more communications paths. The equipment in the control facility is
powered by the Critical bus of the Critical Power Distribution System. The remote equipment is
powered by a separate power source. The last row above the black row, “Remote Site Power
Architecture,” specifies the power architecture requirements at the remote sites. In contrast with
the Information Service Threads, Remote/Distributed Threads do not associate a quantitative
STLSC availability requirement with each service thread.
The overall availability of the critical surveillance and communications services provided by
these R/D Service Threads is achieved by employing diversity and redundancy techniques to
circumvent failures of individual service thread instantiations. The diversity requirements for
the service threads with a STLSC rating of “D” are contained in FAA Order 6000.36A,
Communications and Surveillance Service Diversity. Although these threads support critical
NAS-RD-2013 capabilities, the required availability is achieved, not by a single service thread
instantiation, but by a diverse architecture of distributed surveillance and communications sites
with overlapping coverage.
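The arithmetic behind this diversity argument can be sketched numerically. The site availability and overlap count below are illustrative assumptions, not NAS-RD allocations, and independence of site failures is exactly what the diversity requirements are designed to provide:

```python
# Coverage availability from overlapping, independently failing sites.
a_site = 0.999        # availability of one surveillance/communications site thread (assumed)
n_overlapping = 3     # sites whose coverage overlaps a given volume of airspace (assumed)

# Coverage of the volume is lost only if every overlapping site is down at once.
a_coverage = 1 - (1 - a_site) ** n_overlapping
```

This is why no single site thread needs a high availability requirement: three modestly available overlapping sites already yield a coverage availability far beyond what any one thread could economically provide.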
FAA Reliability, Maintainability, and Availability (RMA) Handbook
FAA RMA-HDBK-006C V1.1
60
NOTE: The newly created service thread “Terminal Voice Switch Backup” is not currently a
physical backup system. Rather, it represents a capability and manual procedure for controllers
to bypass a failed terminal voice switch by plugging directly into selected air/ground
communications sites. Since this represents a workable, but significantly degraded
communications capability, it is conceivable that a backup voice switch could be introduced at
some point as the NAS architecture evolves. This is an example of how procedures can be used
to assure continuity of services. The concept is particularly applicable to sites where the traffic
density permits use of manual procedures to work around failed equipment without seriously
affecting the safety and efficiency of operations.
The detailed methods for applying these requirements in the acquisition of new systems are
described in Section 8.1.
Figure 7-9 Service/Function - Terminal Service Thread STLSC Matrix (8)
(8) This figure is provided as a sample only; refer to the latest NAS-RD for the approved requirements.
Reference Section 7.6 of the RMA Handbook
[Matrix not reproduced here. Columns list the Facility Power System Inherent Availability Requirement and the Terminal service threads (ASDES, TFMS, TBFMR, Terminal Surveillance Safety-Critical Service Thread Pairs (1) and (2), TVS, TVSB, RVRS, VGS, RALS, WIS, LLWS, ADSS, FCOM, MDAT, MSEC, RTADS, RTDS, TCOM, TCE, ECSS, TRAD, TSEC, STDDS, and Manual Procedures); rows list the NAS-RD-2013 Section 3.1 Mission Services and Section 3.2 Technical Infrastructure Services, with STLSC rows for Large/Medium/Small TRACONs and Towers and the associated unstaffed facility architectures. TARS and TCVEX appear as Umbrella Services.]
Service Thread Loss Severity Categories:
1 Safety-Critical = Paired Efficiency-Critical Threads
2 Efficiency-Critical = .9999
3 Essential = .999
2/3 Based on Highest Associated Facility
M Mission Support Services
P Manual Procedures
N Not Rated
Color Key: Safety-Critical (S-C), Efficiency-Critical (E-C), Essential (E), Routine (R)
7.6.2 En Route STLSC Matrix
The En Route STLSC matrix in Figure 7-10 is similar to the Terminal STLSC matrix in Figure
7-9. The detailed methods for applying these requirements in the acquisition of new systems are
described in Section 8.1.
NOTE: En Route Communication Voice Exchange (ECVEX) Service is a "Safety-Critical"
Umbrella Service comprised of two Efficiency-Critical Service Threads, i.e., the ARTCC En
Route Communications Services (Safety-Critical Thread Pair) (see Figure E-44). Figure E-44
illustrates a Service Thread Pair consisting of two redundant Efficiency-Critical En Route
Service Threads.
Figure 7-10 Service/Function – En Route Service Thread STLSC Matrix (9)
(9) This figure is provided as a sample only; refer to the latest NAS-RD for the approved requirements.
Reference Section 7.6 of the RMA Handbook
[Matrix not reproduced here; refer to the latest NAS-RD for the approved En Route requirements. The Service Thread Loss Severity Categories and Color Key are as defined for Figure 7-9.]
FAA Reliability, Maintainability, and Availability (RMA) Handbook
FAA RMA-HDBK-006C V1.1
62
7.6.3 "Other" Service Thread STLSC Matrix
The "Other" service thread STLSC matrix in Figure 7-11 contains the service threads that are
not in Terminal facilities or ARTCCs, such as the Weather Information Service (WIS) and NAS
Message Transfer Service (NAMS) threads. In addition, the matrix includes service threads
representing Mission Support Services and the Remote Monitoring and Logging System Service
(RMLSS).
The Mission Support Service Thread with a STLSC rating of "M" is a generic service thread
that encompasses a wide variety of simulators, database management systems, and manual
procedures used to manage the design of the NAS airspace and to monitor and maintain the
systems used in the performance of the NAS air traffic control mission. These systems are, for
the most part, not 24/7 real-time systems and, in any event, cannot be directly related to the
NAS-RD-2013 severity definitions relating to the safe and efficient control of air traffic. The
RMA requirements for systems providing mission support services are not derived from NAS-
RD-2013 but instead are established by acquisition managers, based on what is commercially
available and on life cycle cost considerations. The detailed methods for applying these
requirements in the acquisition of new systems are described in Section 8.1.
Figure 7-11 Service/Function – "Other" Service Thread STLSC Matrix (10)
(10) This figure is provided as a sample only; refer to the latest NAS-RD for the approved requirements.
Reference Section 7.6 of the RMA Handbook
[Matrix not reproduced here. Columns list the "Other" service threads (TFMSS, TDWRS, WMSCR, WDAT, WIS, FCOM, FSSAS, WAAS, NMRS, RMLSS, and Manual Procedures); rows list the NAS-RD-2013 Section 3.1 Mission Services and Section 3.2 Technical Infrastructure Services. The Service Thread Loss Severity Categories and Color Key are as defined for Figure 7-9.]
In general, the naming of service threads in this Handbook follows the naming of NAPRS
services. NAPRS is not always consistent in giving separate names to systems and the services
they provide. Where the Handbook uses a NAPRS system name, it should be read as a service
thread name, not a system name.
In developing the service thread diagrams in Appendix E for this edition of the Handbook, only
selected STLSC-referenced threads were diagrammed. Table 7-7 lists the discrepancies between
the STLSCs and Appendix E, as well as specific services referenced or diagrammed that are not
in NAPRS.
Table 7-7 Noted Discrepancies

Service | Discrepancy
ARINC HF Voice Communications Link | In STLSC; not in NAPRS, and no service thread exists
ARSR Air Route Surveillance Radar | In STLSC, but no service thread diagram exists
ECSS Emergency Communications System Service | In STLSC, but no service thread diagram exists
FSSAS Flight Service Station Automated Service | Diagram differs from the current NAS-RD, which only considers Alaska
NMRS NAS Messaging and Logging System | In STLSC, but no service thread diagram exists
RVRS Runway Visual Range Service | In STLSC, but no service thread diagram exists
STDDS SWIM Terminal Data Distribution System | In STLSC, but no service thread diagram exists
TBFM Time Based Flow Management | In STLSC, but no service thread diagram exists
TBFMR Time Based Flow Management Remote Display | In STLSC, but no service thread diagram exists
TCE Transceiver Communications Equipment | In STLSC, but no service thread diagram exists
TSB Terminal Surveillance Backup | Diagram is not in NAPRS or STLSC
TVSB Terminal Voice Switch Backup | STLSC and service thread diagrams differ from NAPRS
WIS Weather Information Service (service provided by CSS-Wx) | In STLSC; not in NAPRS, and no service thread exists
ECVEX En Route Communication Voice Exchange | In STLSC and NAPRS. Corresponds to service thread diagram "ARTCC En Route Communications Services (Safety-Critical Thread Pair)" (see Figure E-44)
TCVEX Terminal Communications Voice Exchange | In STLSC and NAPRS. Corresponds to "Terminal Voice Communications Safety-Critical Service Thread Pair" (see Figure E-45)
TFMS Traffic Flow Management System | In STLSC and NAPRS, but no service thread diagram exists
WMSCR Service Threads | Diagram is not in NAPRS or STLSC
WAAS/GPS Service | Diagram is not in NAPRS or STLSC
VTABS VSCS Training and Backup Switch | In STLSC, and in the service thread diagram as VTABS VSCS Training and Backup System (NAPRS Facility)
7.7 NAS-RD-2013 RMA Requirements
Availability is an operational performance measure (see F.2.3, Use of Availability as a
Conceptual Specification) that is not well suited to contractual requirements or specifications.
MIL-STD-961E, the Department of Defense standard for the format and content of military
specifications, precludes citing availability as a requirement together with measures of
reliability and maintainability.
The primary uses of the inherent availability requirements associated with the service threads
are to:
Compare architecture alternatives during preliminary requirements analysis,
Identify the need for redundancy and fault tolerance, and
Provide a criterion for assessing the initial acceptability of architectures proposed by
contractors.
Since availability cannot be used as a direct performance measure for verification purposes, this
Handbook relies instead on a combination of requirements for reliability, maintainability, and
verifiable recovery times that accurately specify the frequency and duration of service
interruptions to users/specialists.
It is important to note that NAS-RD-2013 specifies the minimum availability for NAS services.
To achieve this availability, the hardware supporting the service thread should have an inherent
availability greater than that of the service it supports. For example, an Essential NAS service
requires a minimum availability of three nines (.999), so the information system comprising the
thread that supports it must have at least four nines (.9999) inherent availability.
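This margin can be checked with the standard inherent availability relation, Ai = MTBF / (MTBF + MTTR). The MTBF and MTTR figures below are the Essential-thread values from Table 7-8; the rest is an illustrative sketch, not a prescribed calculation:

```python
# Inherent availability from MTBF and MTTR (standard definition):
#   Ai = MTBF / (MTBF + MTTR)
def inherent_availability(mtbf_hours: float, mttr_hours: float) -> float:
    return mtbf_hours / (mtbf_hours + mttr_hours)

# An Essential thread per Table 7-8: MTBF 5,000 h, MTTR 0.5 h.
# The result is about four nines, giving margin over the three-nines
# (.999) service-level requirement.
ai = inherent_availability(5000.0, 0.5)
```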
7.7.1 Information Systems
Table 7-8 presents the reliability, maintainability, and recovery times for each of the Information
service threads shown in the matrices in Figure 7-9, Figure 7-10, and Figure 7-11.
The maintainability requirement, specifically the MTTR, is based on the FAA Technical
Operations standard of 30 minutes. Recovery times are specified for those service threads that
are required to incorporate fault-tolerant automatic recovery. Two values of MTBF are specified.
The first value is the mean time between failures for which successful automatic recoveries are
performed within the prescribed recovery time. The second value is the mean time between
service interruptions for which the restoration time exceeds the prescribed recovery time, either
because of unsatisfactory operation of the automatic recovery mechanisms or because human
intervention is required to restore service. (For service threads that do not require automatic
recovery, the automatic recovery time is "N/A" and both MTBF values are equal.)
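The two MTBF values translate directly into an availability budget. The sketch below is illustrative only, using the Efficiency-Critical figures from Table 7-8 (6-second recovery, 300 h MTBF with automatic recovery, 0.5 h MTTR, 50,000 h MTBF without automatic recovery):

```python
# Unavailability contribution of the two interruption classes.
recovery_time_h = 6 / 3600   # 6-second automatic recovery, in hours
mtbf_auto_h = 300            # MTBF for interruptions recovered automatically
mttr_h = 0.5                 # manual restoration time (MTTR)
mtbf_manual_h = 50_000       # MTBF for interruptions needing intervention

# Each class contributes (outage duration) / (time between outages).
unavailability = recovery_time_h / mtbf_auto_h + mttr_h / mtbf_manual_h
availability = 1 - unavailability
```

Even though automatic recoveries occur far more often, their short duration keeps the combined availability above the .9999 Efficiency-Critical level.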
Figure 7-12 provides a general state transition diagram illustrating the two MTBF values that
need to be specified for a redundant configuration. Within this figure, the circles represent
system states: in S1 both components are functioning, in S2 one component is down, and in S3
both components are down.
Figure 7-12 Example State Diagram
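The three-state model of Figure 7-12 can be evaluated in closed form for a two-component redundant pair. This is a minimal sketch assuming exponential failure and repair times, a single repair resource, and illustrative rates (not NAS-RD allocations):

```python
# Steady-state solution of the S1/S2/S3 state diagram for a redundant pair.
lam = 1 / 5000.0  # per-component failure rate, 1/MTBF (per hour, assumed)
mu = 1 / 0.5      # repair rate, 1/MTTR (per hour, assumed)

# Birth-death balance equations: 2*lam*P1 = mu*P2 and lam*P2 = mu*P3,
# normalized so that P1 + P2 + P3 = 1.
p1 = 1 / (1 + 2 * lam / mu + 2 * (lam / mu) ** 2)
p2 = (2 * lam / mu) * p1
p3 = (lam / mu) * p2

# The service is up in S1 (both components good) and S2 (one component down);
# only S3 (both components down) is a service interruption.
availability = p1 + p2
```

The small probability of S3 is what the second, much larger MTBF value (mean time between interruptions exceeding the recovery time) captures.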
Table 7-8 Information Service Thread Reliability, Maintainability, and Recovery Times (10)

Service Thread | Severity Level | MTTR (hours) | Automatic Recovery Time (sec) | MTBF With Automatic Recovery (hours) | MTBF Without Automatic Recovery (hours)
ASDES Airport Surface Detection Equipment Service | Essential | 0.5 | N/A | 5,000 | 5,000
CFAD Composite Flight Data Processing Service | Efficiency-Critical | 0.5 | 6 | 300 | 50,000
CODAP Composite Oceanic Display and Planning Service | Efficiency-Critical | 0.5 | 6 | 300 | 50,000
COFAD Anchorage Composite Offshore Flight Data Service | Efficiency-Critical | 0.5 | 6 | 300 | 50,000
CRAD Composite Radar Data Processing Service (EAS/EBUS) | Efficiency-Critical | 0.5 | 6 | 300 | 50,000
CRAD Composite Radar Data Processing Service (CCCH/EBUS) | Efficiency-Critical | 0.5 | 6 | 300 | 50,000
ECVEX En Route Communication Voice Exchange Service (1) | Safety-Critical | 0.5 | N/A (2) | N/A (2) | 500,000
ETARS En Route Terminal Automated Radar Service (3) | Safety-Critical | 0.5 | N/A (2) | N/A (2) | 500,000
FSSAS Flight Service Station Automated Service | Essential | 0.5 | N/A | 5,000 | 5,000
LLWS Low Level Wind Service | Essential | 0.5 | N/A | 5,000 | 5,000
RALS R/F Approach and Landing Services | Efficiency-Critical | 0.5 | 6 | 300 | 50,000
RMLSS Remote Monitoring/Maintenance Logging System Service | Essential | 0.5 | N/A | 5,000 | 5,000
RTADS Remote Tower Alphanumeric Display System Service (4) | Efficiency-Critical | 0.5 | 6 | 300 | 50,000
RTADS Remote Tower Alphanumeric Display System Service (4) | Essential | 0.5 | N/A | 5,000 | 5,000
RVRS Runway Visual Range Service | Essential | 0.5 | N/A | 5,000 | 5,000
TARS Terminal Automated Radar Service (5) | Safety-Critical | 0.5 | N/A (2) | N/A (2) | 500,000
TBFM Time Based Flow Management | Efficiency-Critical | 0.5 | 6 | 300 | 50,000
TBFMR Time Based Flow Management Remote Display (6) | Efficiency-Critical | 0.5 | 6 | 300 | 50,000
TCVEX Terminal Communications Voice Exchange (7) | Safety-Critical | 0.5 | N/A (2) | N/A (2) | 500,000
TDWRS Terminal Doppler Weather Radar Service | Essential | 0.5 | N/A | 5,000 | 5,000
TFMS Traffic Flow Management System (6) | Efficiency-Critical | 0.5 | 6 | 300 | 50,000
TFMS Traffic Flow Management System (6) | Essential | 0.5 | N/A | 5,000 | 5,000
TFMSS Traffic Flow Management System Service | Efficiency-Critical | 0.5 | 6 | 300 | 50,000
TVSB Terminal Voice Switch Backup (New) | Efficiency-Critical | 0.5 | 6 | 300 | 50,000
TVS Terminal Voice Switch (NAPRS Facility) | Efficiency-Critical | 0.5 | 6 | 300 | 50,000
TVS Terminal Voice Switch (NAPRS Facility) (4) | Essential | 0.5 | N/A | 5,000 | 5,000
VGS Visual Guidance Service | Efficiency-Critical | 0.5 | 6 | 300 | 50,000
VSCSS Voice Switching and Control System Service (1) | Efficiency-Critical | 0.5 | 6 | 300 | 50,000
VTABS VSCS Training and Backup Switch (NAPRS Facility) (1) | Efficiency-Critical | 0.5 | 6 | 300 | 50,000
WAAS Wide Area Augmentation System Service | Efficiency-Critical | 0.5 | 6 | 300 | 50,000
WDAT WMSCR Data Service | Essential | 0.5 | N/A | 5,000 | 5,000
WIS Weather Information Service (8) | Essential | 0.5 | N/A | 5,000 | 5,000
WMSCR Weather Message Switching Center Replacement | Essential | 0.5 | N/A | 5,000 | 5,000
(1) ECVEX: Umbrella service thread encompassing two Efficiency-Critical Service Threads, i.e., (1) VSCS Voice
Switching and Control System Service (VSCSS), Figure E-38, and (2) VSCS Training and Backup Switch (VTABS)
(NAPRS Facility), Figure E-37
(2) Safety-Critical Service cutover is by manual switching only
(3) ETARS: Umbrella service thread encompassing two Efficiency-Critical Service Threads, i.e., (1) CRAD
Composite Radar Data Processing Service (EAS/EBUS), Figure E-10, and (2) CRAD Composite Radar Data
Processing Service (Central Computer Complex Host (CCCH)/EBUS), Figure E-9
(4) RTADS and TVS are rated Essential in Small Towers and Efficiency-Critical at all other levels
(5) TARS: Umbrella service thread encompassing two Efficiency-Critical Service Threads, i.e., Terminal
Surveillance Safety-Critical Service Thread Pair (1), Figure E-46, and Terminal Surveillance Safety-Critical Service
Thread Pair (2), Figure E-47
(6) TFMS and TBFMR are Efficiency-Critical at Large Towers/TRACONs and Essential at all other levels
(7) TCVEX: Umbrella service thread encompassing two Efficiency-Critical Service Threads, i.e., (1) TVS Terminal
Voice Switch (NAPRS Facility), Figure E-34, and (2) TVSB Terminal Voice Switch Backup (New), Figure E-35
(8) WIS is not in NAPRS; it is a composite of all weather-related services and is representative of the planned
future CSS-Wx service
(10) This table is provided as a sample only; refer to the latest NAS-RD for the approved requirements.
If no NAPRS service exists, links go to Appendix E.
7.7.2 Remote/Distributed and Standalone Systems
The NAS-RD-2013 does not provide RMA requirements for Remote/Distributed and
Standalone Systems. This Handbook provides a methodology, to be validated by SMEs, for
determining Remote/Distributed and Standalone Systems RMA requirements. The inputs to this
methodology are the STLSC matrices, the FAA Communications Diversity Order 6000.36, and
financial and site-specific criteria. The RMA characteristics for these systems, therefore, are
established primarily by life cycle cost (LCC) and diversity considerations. The
Remote/Distributed Service Threads are presented in
Table 7-9. Diversity issues are a unique function of local traffic patterns, terrain, etc. This topic
is discussed in greater detail in Paragraph 8.1.1.2.
Table 7-9 Remote/Distributed and Standalone Systems Service Threads

Service Thread | Control Facility | Remote Site | Service Type
ADSS Automatic Dependent Surveillance Service | ARTCC, TRACON, ATCT | GBT (SBS) | ADS-B Surveillance Reports
ARINC HF Voice Communications Link (1) | Oceanic | ARINC | Oceanic A/G Voice Comm.
ARSR Air Route Surveillance Radar | ARTCC, TRACON | ARSR | Primary and Secondary Long Range Radar
BDAT Beacon Data (Digitized) | ARTCC, TRACON | ARSR, ASR | Digitized Secondary Radar Reports
BUECS Backup Emergency Communications Service | ARTCC | BUEC | En Route A/G Voice Comm.
ECOM En Route Communications | ARTCC | RCAG | En Route A/G Voice Comm.
ECSS Emergency Communications System Service | TRACON, ATCT | RTR | Terminal A/G Voice Comm.
FCOM Flight Service Station Communications | AFSS | AFSS, ATCT, VOR | AFSS A/G Voice Comm.
FDAT Flight Data Entry and Printout | ARTCC | TRACON, ATCT | Flight Plan Data Transfer
FSSAS Flight Service Station Automated Service | ARTCC | AFSS | Weather, Flight and State Data Management
IDAT Interfacility Data Service | ARTCC | ARTCC, TRACON | Computer-to-Computer Data Transfer
MDAT Mode S Data Link Data Service | ARTCC, TRACON | ARSR, ASR | Mode S Data Link Reports from Radar Site
MSEC Mode S Secondary Radar Service | ARTCC, TRACON | ARSR, ASR | Mode S Secondary Radar Reports
NAMS NAS Message Transfer Service | NEMC Switching Center | ARTCC | Transfer of message data between NEMC Switching Center and ARTCC NAMS Concentrator
RDAT Radar Data (Digitized) | ARTCC, TRACON | ARSR | Digitized Primary Radar Reports
R/F Navigation Service | ATCT, ARTCC | VOR, DME | Radio Frequency Navigation and Landing Services
RTADS Radar Tower Automation Display Service | TRACON | ATCT | Automation Data Provided to Towers
RTDS Radar Tower Display System | ATCT | TRACON | Radar Display in ATCT from Remote TRACON Source
TCE Transceiver Communications Equipment | ATCT | (3) | Terminal A/G Voice Comm.
TCOM Terminal Communications | TRACON, ATCT | RTR | Terminal A/G Voice Comm.
TRAD Terminal Radar | TRACON, ATCT | ASR | Primary Radar Reports
TSEC Terminal Secondary Radar | TRACON, ATCT | ASR | Secondary Radar Reports
WAASS Wide Area Augmentation System Service | WAASS Master Station | Wide Area Reference Stations | GPS Navigation Accuracy Enhancement
(1) ARINC HF Voice Communications is a leased service and is not in NAPRS; link to ARINC site
(2) STDDS is not in NAPRS; link to FSEP
(3) Multiple handheld radios in ATCT
7.7.3 Infrastructure and Enterprise Systems
Infrastructure and Enterprise Systems provide power, environment, communications, and
enterprise infrastructure services to the facilities that house the information systems. Within the
system-of-systems taxonomy in Figure 7-1, four subcategories of Infrastructure System types
are defined: Power, HVAC, Communications Transport, and Enterprise Infrastructure Systems.
Because they support the facilities themselves, Infrastructure and Enterprise Systems can cause
failures of the systems they support, such that traditional allocation methods and the assumption
of independence of failures do not apply to them.
This section defines RMA approaches to Infrastructure and Enterprise Systems in support of
NAS service availability. Applicable standards, orders, and references are listed to provide
guidance to the practitioner in the specification of RMA requirements for Infrastructure and
Enterprise Systems.
7.7.3.1 Power Systems
The RMA requirements for power systems, as defined in the STLSC matrices, are based on the
STLSCs of the threads they support as well as the facility level in which they are installed. All
ARTCCs have the same RMA requirements and the same power architecture. The inherent
availability requirements for Critical Power Distribution Systems (CPDS) are derived from the
NAS-RD-2013 severity requirements for NAS Services/Functions.
In the Terminal domain, there is a wide range of traffic levels between the largest facilities and
the smallest facilities. At larger terminal facilities, the service thread loss severity is comparable
to that of ARTCCs and the severity requirements are the same. Loss of service threads resulting
from power interruptions can have a critical effect on air traffic efficiency as operational
personnel reduce capacity to maintain safe separation. This could increase safety hazards to
unacceptable levels during the transition to manual procedures.
The power system architecture codes used in the matrices were derived from FAA Order
6950.2D, Electrical Power Policy Implementation at National Airspace System Facilities. This
order contains design standards and operating procedures for power systems to ensure power
system availability consistent with the severities of the service threads supported by the power
services.
However, at smaller terminal facilities, manual procedures can be invoked without a significant
impact on either safety or efficiency. Accordingly, the severity ratings of these facilities can be
reduced from those applied to the larger facilities.
Inherent availability requirements should in no way be interpreted to be an indication of the
predicted operational performance of a CPDS. The primary purpose of these requirements is
simply to establish whether a dual path redundant architecture is required or whether a less
expensive radial CPDS architecture is adequate for smaller terminal facilities.
In order to meet the severity requirements, dual path architectures have been employed. The
power for Safety-Critical Service Thread pairs should be partitioned across the dual power paths
such that failure of one power path will not cause the failure of both service threads in the
Safety-Critical Service Thread pair.
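The value of partitioning can be sketched with simple probability. The single-path availability below is an illustrative assumption, not a NAS-RD allocation, and path failures are assumed independent:

```python
# Why Safety-Critical thread pairs are partitioned across dual power paths.
a_path = 0.999  # availability of a single power path (assumed)

# Both threads fed from the same path: one path outage drops the whole pair.
p_pair_loss_same_path = 1 - a_path

# Threads partitioned across two independent paths: the pair is lost only
# if both paths are down simultaneously.
p_pair_loss_partitioned = (1 - a_path) ** 2
```

Partitioning reduces the probability of losing the entire pair by a factor of roughly one over the single-path unavailability, which is why the dual-path design rule is stated in terms of thread placement rather than component quality.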
For smaller facilities, such as those using commercial power with a simple engine generator or
battery backup, there is no allocation of NAS-RD-2013 RMA requirements. Although CPDS
architectures can be tailored to meet inherent availability requirements through the application
of redundancy, there is no such flexibility in simple single-path architectures using
commercial-off-the-shelf (COTS) components. Accordingly, for these systems, only the
configuration will be specified, using the Power Source Codes defined in FAA Order 6950.2D;
no NAS-RD-2013 allocated inherent availability requirements are imposed on the acquisition of
COTS power system components. The reliability and maintainability of COTS components
shall be in accordance with best commercial practices.
The power system requirements for Terminal facilities are presented in Figure 7-13, which also
establishes standard power system configurations meeting these requirements. The standards for
power systems are contained in FAA Order 6950.2D. The figure indicates that the larger
facilities require a dual path redundant CPDS architecture capable of meeting the .999998
inherent availability requirement. Smaller facilities can use a single path CPDS architecture
capable of meeting .9998 inherent availability. The smallest facilities do not require a CPDS
architecture and use the specified power system architecture code with no NAS-RD-2013
allocated availability requirement. Figure 7-14 and Figure 7-15 present the power system
requirements for En Route facilities and "Other" (non-operational) facilities, respectively.
Figure 7-13 Terminal Power System11
11 These Architectures are based on the most current Power Systems Orders as of 2006, but should be verified with
Power System Implementation AJW-222, ATO-W/ATC Facilities before inclusion in any new design or RMA
calculation.
[Figure 7-13 is the "Service/Capability/Power - Terminal Service Thread STLSC Matrix" and cannot be reproduced in running text (Reference: Section 7.6 of the RMA Handbook). Its columns are the Terminal service threads: ASDES Airport Surface Detection Equipment Service; TFMS Traffic Flow Management System; TBFMR Time Based Flow Management Remote Display; Terminal Surveillance Safety-Critical Service Thread Pairs (1) and (2); TVS Terminal Voice Switch (NAPRS Facility); TVSB Terminal Voice Switch Backup; RVRS Runway Visual Range Service; VGS Visual Guidance Service; RALS R/F Approach and Landing Services; WIS Weather Information Service; LLWS Low Level Wind Service; ADSS Automatic Dependent Surveillance Service; FCOM Flight Service Station Communications; MDAT Mode S Data Link Data Service; MSEC Mode S Secondary Radar Service; RTADS Remote Tower Alphanumeric Display System Service; RTDS Radar Tower Display System; TCOM Terminal Communications; TCE Transceiver Communications Equipment; ECSS Emergency Communications Systems Service; TRAD Terminal Radar; TSEC Terminal Secondary Radar; STDDS SWIM Terminal Data Distribution System; and Manual Procedures. Umbrella services: TARS Terminal Automated Radar Service and TCVEX Terminal Communications Voice Exchange Service. The rows cover the NAS-RD-2013 Section 3.1 Mission Services and Section 3.2 Technical Infrastructure Services, the NAS RD Roll-Up, and the Facility Power System Inherent Availability Requirement, with STLSC columns for Large/Medium/Small TRACONs and Large/Medium/Small Towers. Terminal facility power architectures range from CPDS Type 2 at Level 12 consolidated TRACONs (e.g., PCT) down to commercial power only at Level 1 facilities and airport-provided commercial power at remote sites.
Color Key: Safety-Critical (S-C); Efficiency-Critical (E-C); Essential (E); Routine (R).
Power System Architecture codes: C2 = CPDS Type 2; C1 = CPDS Type 1; B = Basic; 2A = Commercial Power + EG + UPS; 1A = Commercial Power + EG + Mini UPS; U = Commercial Power + UPS (no EG); D = Commercial Power + Batteries; V = Photovoltaic/Wind + Batteries; Z = Independent Generation; 1 = Commercial Power + EG; 4 = Commercial Power; 8 = Dual Independent Commercial Power; S = Same as Host Facility Power System Architecture.
H = High Inherent Availability = .999998; R = Reduced Inherent Availability = .9998; # = Commercial Power Provided by Airport.]
Figure 7-14 En Route Power System12
12 These Architectures are based on the most current Power Systems Orders as of 2006, but should be verified with
Power System Implementation AJW-222, ATO-W/ATC Facilities before inclusion in any new design or RMA
calculation.
[Figure 7-14 is the "Service/Capability - En Route Thread STLSC Matrix" and cannot be reproduced in running text (Reference: Section 7.6 of the RMA Handbook). Its columns are the En Route service threads: CFAD Composite Flight Data Processing Service; CODAP Composite Oceanic Display and Planning Service; COFAD Anchorage Composite Offshore Flight Data Service; CRAD Composite Radar Data Processing Service (EAS/EBUS); CRAD Composite Radar Data Processing Service (CCCH/EBUS); TBFM Time Based Flow Management; TFMS Traffic Flow Management; VSCSS Voice Switching and Control System Service; VTABS VSCS Training and Backup System (NAPRS Facility); WIS Weather Information Service; ADSS Automatic Dependent Surveillance Service; ARINC HF Voice Communications Link; ARSR Air Route Surveillance Radar; BDAT Beacon Data (Digitized); BUECS Backup Emergency Communications Service; ECOM En Route Communications; FCOM Flight Service Station Communications; FDAT Flight Data Entry and Printout; IDAT Interfacility Data Service; MDAT Mode S Data Link Data Service; MSEC Mode S Secondary Radar Service; NAMS NAS Message Transfer Service; RDAT Radar Data (Digitized); TRAD Terminal Radar; TSEC Terminal Secondary Radar; R/F Navigation Service; FDIOR Flight Data Input/Output Remote; RMLSS Remote Monitoring/Maintenance Logging System Service; and Manual Procedures. Umbrella services: ETARS En Route Terminal Automated Radar Service and ECVEX En Route Communication Voice Exchange Service. The rows cover the NAS-RD-2013 Section 3.1 Mission Services and Section 3.2 Technical Infrastructure Services, the NAS RD Roll-Up, the Service Thread Loss Severity Category, and the Facility Power System Inherent Availability Requirement. The Control Facility power system architecture is CPDS Type 2 (H = .999998); remote sites use the associated unstaffed facility architectures.
Color Key: Safety-Critical (S-C); Efficiency-Critical (E-C); Essential (E); Routine (R).
Power System Architecture codes: C2 = CPDS Type 2; C1 = CPDS Type 1; B = Basic; 2A = Commercial Power + EG + UPS; 1A = Commercial Power + EG + Mini UPS; U = Commercial Power + UPS (no EG); D = Commercial Power + Batteries; V = Photovoltaic/Wind + Batteries; Z = Independent Generation; 1 = Commercial Power + EG; 4 = Commercial Power; 8 = Dual Independent Commercial Power; S = Same as Host Facility Power System Architecture.
H = High Inherent Availability = .999998; R = Reduced Inherent Availability = .9998; # = Commercial Power Provided by Airport.]
Figure 7-15 “Other” Power System13
7.7.3.2 Heating, Ventilation and Air Conditioning (HVAC) Subsystems
FAA facilities utilize a broad variety of heating, ventilation and air conditioning (HVAC)
subsystems. FAA facility standards specify temperature and humidity ranges for operations and
equipment spaces, but do not require any specific availability levels. FAA Order 6480.7D
requires that ATCTs and TRACONs be equipped with redundant air conditioning systems for
critical spaces,14 but no specific performance requirements can be relied upon in designing for
overall operational availability. Failures of NAS facilities due to habitability or safety issues are
dealt with procedurally according to pre-approved contingency plans and generally involve
transfer of ATC functions to adjoining facilities. From an RMA point of view, the important
13 These Architectures are based on the most current Power Systems Orders as of 2006, but should be verified with
Power System Implementation AJW-222, ATO-W/ATC Facilities before inclusion in any new design or RMA
calculation.
14 Critical spaces in ATC facilities are the tower cab, communications equipment rooms, telco rooms, operations
rooms, and the radar and automation equipment rooms.
[Figure 7-15 is the "Service/Capability - Other Service Thread STLSC Matrix" and cannot be reproduced in running text (Reference: Section 7.6 of the RMA Handbook). Its columns are the "Other" (non-operational) facility service threads: TFMSS Traffic Flow Management System Service; TDWRS Terminal Doppler Weather Radar Service; WMSCR Weather Message Switching Center Replacement; WDAT WMSCR Data Service; WIS Weather Information Service; FCOM Flight Service Station Communications; FSSAS Flight Service Station Automated Service; WAAS Wide Area Augmentation System Service; NMRS NAS Messaging and Logging System; RMLSS Remote Monitoring and Logging System Service; and Manual Procedures. The rows cover the NAS-RD-2013 Section 3.1 Mission Services and Section 3.2 Technical Infrastructure Services, the NAS RD Roll-Up, and the Safety-Critical Thread Pairing / Service Thread Loss Severity Category. Control facility and remote site power system architectures are predominantly 2A (Commercial Power + EG + UPS), D (Commercial Power + Batteries), or S (Same as Host Facility).
Color Key: Safety-Critical (S-C); Efficiency-Critical (E-C); Essential (E); Routine (R).
Power System Architecture codes: C2 = CPDS Type 2; C1 = CPDS Type 1; B = Basic; 2A = Commercial Power + EG + UPS; 1A = Commercial Power + EG + Mini UPS; U = Commercial Power + UPS (no EG); D = Commercial Power + Batteries; V = Photovoltaic/Wind + Batteries; Z = Independent Generation; 1 = Commercial Power + EG; 4 = Commercial Power; 8 = Dual Independent Commercial Power; S = Same as Host Facility Power System Architecture.
H = High Inherent Availability = .999998; R = Reduced Inherent Availability = .9998; # = Commercial Power Provided by Airport.]
HVAC failure or degradation consideration is to prevent immediate failure of Safety-Critical
services. Systems contributing to Safety-Critical services should be designed to fail gracefully
in the event of cooling or heating loss; for example, they should not automatically shut down on
detecting loss of coolant airflow. System designers should be aware of facility Contingency
Plans and design for degradation with the "Transition Hazard Interval," as shown in Figure 7-4,
in mind.
7.7.3.3 Enterprise Infrastructure
Enterprise Infrastructure consists of the enterprise networks, systems, and services required to
provide a NAS service. Like power systems, enterprise infrastructure is required to support
NAS services and NAS service threads at facilities. Unlike power systems, however, these
networks, systems, and services are distributed across geographic regions, extending well
beyond the facility, and provide the communications and information framework needed to
support NAS services nationally.
The following subsections provide background on these networks, systems, and services, their
relevance to RMA, and the challenges associated with acquisition and implementation. An
RMA approach and methodology for enterprise infrastructure is presented, supplementing what
is defined in Section 6.
7.7.3.3.1 Overview of Enterprise Infrastructure Systems
Efforts are underway to implement Enterprise Infrastructure Systems (EISs) to support new and
existing NAS Services. Many new services are being introduced, and existing capabilities and
services are being migrated to Enterprise Infrastructure Systems based on Service-Oriented
Architectures (SOAs) and Cloud-Computing technologies.[48][76]
The following sections provide background information on SOA and Cloud-Computing,
discuss the challenges associated with implementing these technologies, and, in Section
7.7.3.3.3, derive the RMA requirements for EISs.
7.7.3.3.1.1 Service-Oriented Architecture
SOA is a platform design pattern that standardizes the hosting and orchestration of a distributed
set of published services, which facilitates software reuse across platforms. The FAA is
transitioning to a Service-Oriented Architecture (SOA) under the FAA System-Wide
Information Management (SWIM) Program.[77] The purpose of SWIM is to improve
enterprise information sharing across the NAS. The following is a list of key SOA enabling
components currently being implemented by the SWIM program:
Services Registry: Services will be published within the NAS Service Registry /
Repository (NSRR) Universal Description, Discovery and Integration (UDDI)
Registry.[44]
o All Services typically register with the UDDI.
o Consumer Services find Producer Services via the UDDI (SWIM will use a publish /
subscribe web services approach).
o The Orchestration Engine relies on the UDDI registry to determine whether the
services to be orchestrated are registered and available.
Enterprise Messaging and Communications: The NAS Enterprise Messaging Service
(NEMS) Message Brokers and Enterprise Service Bus (ESB) will be used for Enterprise
Messaging. FTI provides the physical links to support SWIM messaging.[44]
Governance: SWIM will be responsible for the governance and orchestration of services.
Services will be orchestrated as a part of a work flow.
Information on SWIM standards and documentation can be found on the SWIM website:
http://www.faa.gov/about/office_org/headquarters_offices/ato/service_units/techops/atc_comms
_services/swim/
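The registry interactions described above (producers registering, consumers discovering, and the Orchestration Engine confirming that every service in a workflow is registered and available) can be modeled abstractly. The sketch below illustrates the pattern only; the class, method names, and endpoints are hypothetical and do not represent the NSRR/UDDI API.

```python
# Hypothetical model of the register / discover / orchestrate pattern.
# This is not the NSRR or UDDI interface; it only shows the interaction.

class ServiceRegistry:
    def __init__(self):
        # name -> {"endpoint": ..., "available": ...}
        self._services = {}

    def register(self, name: str, endpoint: str, available: bool = True):
        """Producer side: publish a service into the registry."""
        self._services[name] = {"endpoint": endpoint, "available": available}

    def find(self, name: str):
        """Consumer side: look up a producer service by name."""
        return self._services.get(name)

    def can_orchestrate(self, workflow: list) -> bool:
        """Orchestration check: every service in the workflow must be
        registered and currently available."""
        return all(
            (svc := self._services.get(name)) is not None and svc["available"]
            for name in workflow
        )

registry = ServiceRegistry()
registry.register("flight-data", "https://example.invalid/fd")               # hypothetical
registry.register("weather", "https://example.invalid/wx", available=False)  # hypothetical

print(registry.can_orchestrate(["flight-data"]))             # True
print(registry.can_orchestrate(["flight-data", "weather"]))  # False
```

The orchestration check fails when any dependency is unregistered or unavailable, which is the behavior the text attributes to the Orchestration Engine's reliance on the UDDI registry.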
The following list outlines the challenges associated with migrating NAS services to a SOA:
Implementing Enterprise Information Management will most likely require distributed
data information architecture. The SWIM Terminal Data Distribution System (STDDS)
is an example of this architecture. This may require some form of distributed data
registry and repository beyond the SWIM Service Registry.
SWIM plans for a reliable and durable messaging framework through NEMS. Systems
planning to utilize SOA messaging solutions need to include requirements for NEMS
messaging as well as the ability for the maintainer to diagnose problems in message
delivery between services.
Dependency on several core services for operation. These include messaging services,
interface management, enterprise service management, system security and enterprise
governance. Reliability and maintainability requirements need to be specified for these
services, including MTTR and MTBF values as well as known approaches to improve
reliability and maintenance of SOA core components.
Potential for single-points of vulnerability and long-recovery times. As a result,
requirements need to be specified which mitigate single-points of vulnerability, such that
NAS systems and services can continue to operate, maintaining the ability to
communicate, process and publish / consume information in the absence of these SOA
core components.
Increased message size and message processing lead to increased latencies and impact
service availability. The RMA practitioner needs to consider the impact of the
messaging protocol, content, and possible link speeds, and whether the overall latency
requirement and the availability for the NAS service can be maintained.
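The latency check described in the last challenge above can be sketched as simple arithmetic: estimate the serialization delay of the enveloped message plus processing time, and compare it against the end-to-end budget. All numbers below are hypothetical assumptions, not FAA requirements.

```python
# Hedged sketch of a latency budget check for SOA messaging. The payload,
# envelope overhead, link speed, processing time, and budget are all
# assumed values for illustration only.

def one_way_latency_ms(payload_bytes: int, envelope_bytes: int,
                       link_bps: float, processing_ms: float) -> float:
    """Serialization delay of the enveloped message plus broker/endpoint
    processing time, ignoring propagation and queuing delays."""
    total_bits = (payload_bytes + envelope_bytes) * 8
    return total_bits / link_bps * 1000.0 + processing_ms

budget_ms = 250.0                       # assumed end-to-end requirement
latency = one_way_latency_ms(
    payload_bytes=4_000,                # raw report size (assumed)
    envelope_bytes=2_000,               # XML/SOAP envelope overhead (assumed)
    link_bps=1_544_000,                 # T1-class access link
    processing_ms=40.0,                 # broker + endpoint processing (assumed)
)
print(f"{latency:.1f} ms against a {budget_ms:.0f} ms budget")
print("within budget" if latency <= budget_ms else "budget exceeded")
```

The point of the sketch is that envelope overhead and message processing, not just the raw payload, consume the budget; halving the link speed or doubling the envelope can push a marginal design over its requirement.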
Traditionally, RMA efforts have focused on inherent availability and on increasing software
reliability to increase a system's MTBF. Employing redundancy and utilizing fault tolerance
techniques allows systems to achieve higher availabilities. As a result, MTBF has improved
dramatically for Safety-Critical, Efficiency-Critical, and Essential NAS services. Further, the
introduction of independent service threads, redundant systems, and fault tolerant software has
facilitated meeting high availability requirements for Safety-Critical applications. These
approaches will need to be carried forward and applied to the new Enterprise Infrastructure
Systems and enterprise services that provide these levels of NAS services. Further, given the
complexity of implemented SOAs and Cloud-Computing and their still-limited demonstrated
reliability, recovery approaches will need to be specified to minimize MTTR, in addition to
improving reliability, so that the necessary availability can be achieved.
7.7.3.3.1.2 Cloud Architectures
Cloud-Computing is a utility computing model that leverages virtualization in its
implementation: resources are provisioned and used only when they are required, and may be
procured as a service and used on an as-needed basis.
Cloud service providers offer products in a number of categories and configurations. Cloud
service offerings are typically categorized by the architectural layers accessed by the user with
designations such as IaaS, PaaS, SaaS and others. The FAA is already utilizing a Cloud service
for email in a configuration that is classified as Software as a Service (SaaS). Elements of the
SWIM program are implemented as a Cloud software platform and are categorized as Platform
as a Service (PaaS). Implementation or re-hosting of NAS systems in the Cloud could also be
done at the VM level and would be categorized as being implemented on Infrastructure as a
Service (IaaS). Other configurations and forms of service are offered or may be offered in the
future, but it is incumbent on the system designer to thoroughly understand both the architecture
and contractual aspects of the service offering in designing for availability.
Cloud offerings are also categorized by the physical and logical locations and user populations
of the Cloud. The National Institute of Standards and Technology (NIST) categorizes Clouds as
Private, Public, Community, or Hybrid. As the names imply, a Private Cloud is dedicated to a
single customer and does not share resources beyond the customer's contracted or owned
boundaries. A Public Cloud, on the other hand, offers services to a range of unaffiliated
customers and is likely to share resources (compute, storage, and network) among all users. A
Community Cloud is a restricted Public Cloud in which the Cloud population is limited to an
agreed or understood set of customers; a Cloud sharing service among aviation-related users for
collaboration would be an example. A Hybrid Cloud provides for automated migration or
expansion of services between Private and Public Clouds based on demand or other factors. As
with service offerings, the system designer must make choices that will impact availability
when choosing a Cloud type. Security is a major consideration for Cloud applications and
becomes a factor in RMA considerations, both due to exposure to non-technical risks to
availability and due to the impact of increased complexity.
Cloud-hosted applications, services and infrastructure can support high-availability and
redundancy by providing clustered general purpose processing, network, and storage resources.
However, the design and implementation of software or service redundancy features is usually
left to the system designer and not implemented transparently by the vendor. Well-designed
Cloud-Computing architectures can be resilient and can support the geographic distribution of
storage and processing, such that applications and services can be supported across multiple
sites in cases of contingency and continuity of operations as a result of man-made or natural
disasters.
A Cloud architecture typically consists of:
A clustered set of servers which support load-balancing as well as automated failover of
virtual machines.
A hypervisor operating system running on each server within the cluster to host virtual
machines (Windows, Linux, etc.), virtual machine-hosted applications (such as SWIM
Core services), and virtual applications.
A converged network supporting user traffic, storage area network and Cloud
management traffic. This includes communications between clustered servers at the
hypervisor level to manage failover and virtual machine (VM) restarts.
Network-based storage (SAN)
o Supports hosting of VM images and supports rapid automated failover of hosted
VMs by exposing that VM to multiple servers in a cluster.
o Supports operating system (OS) management, such as the hypervisor OS.
A Cloud management, provisioning and orchestration layer consisting of hardware,
hypervisor OS and virtual machines, hosted applications, and services management.
Security overlay supporting access controls, compartmentalization of data and
processing.
There are challenges in allocating RMA requirements to enterprise Cloud-Computing systems
and services because:
These systems are complex, distributed real-time systems which utilize existing COTS
information technology hardware and software not specifically designed to meet NAS
requirements.
Cloud-computing services may be outsourced to a large commercial service provider
with associated Service-Level Agreements (SLAs).
When Clouds fail, there are potentially long recovery times and data loss. Implemented
Clouds consist of layers for management, hardware (servers, network, and storage),
hypervisor OS, VMs and applications. If a Cloud fails, recovery times tend to be very
long because it takes time to restore operations at each level of the Cloud and orchestrate
restoration such that the environment may be recovered and recertified for operation. As
a result, geographically separated primary, backup and disaster recovery sites are
typically required to support both contingency and business continuity operations.
However, there have been several highly publicized cases in commercial industry where
unknown single points of vulnerability existed which led to all sites failing and outages
lasting hours or even days due to long recovery times.[46] As a result, care must be
taken in adopting new technologies to support new and existing NAS Services.
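The layered-recovery point above can be made concrete with simple arithmetic: because each Cloud layer must be restored (and the restoration orchestrated) before the layer above it can come back, the recovery times add rather than overlap. The per-layer times below are hypothetical assumptions for illustration only.

```python
# Rough sketch of why Cloud recovery times can be long. Layers restore
# sequentially, so the estimated MTTR is the SUM of per-layer recovery
# times, not their maximum. All durations are assumed values.

RECOVERY_MINUTES = {
    "hardware (servers/network/SAN)": 30,
    "hypervisor OS": 15,
    "virtual machines": 20,
    "hosted applications/services": 25,
    "recertification for operation": 45,
}

def total_recovery_minutes(layers: dict) -> int:
    """Sequential restoration: total recovery is the sum over layers."""
    return sum(layers.values())

for layer, minutes in RECOVERY_MINUTES.items():
    print(f"{layer:32s} {minutes:3d} min")
print(f"estimated MTTR: {total_recovery_minutes(RECOVERY_MINUTES)} minutes")
```

Even with modest per-layer times, the sequential sum quickly exceeds the restoration windows that high-availability services can tolerate, which is why the text calls for geographically separated primary, backup, and disaster recovery sites rather than reliance on in-place recovery.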
These difficulties in implementing NAS services with SOA and Cloud-Computing do not
alleviate the need for the NAS service to meet the availability requirements described in
NAS-RD-2013. In the following sections, RMA approaches are described that leverage these
new technologies, services, and architectural approaches so that NAS availability requirements
can be achieved.
7.7.3.3.2 Communications Transport
The FAA is transitioning to an Internet Protocol (IP) communications architecture and is leasing
services where RMA characteristics are incorporated in SLAs. In designing systems employing
both inter- and intra-facility communications, the designer must now take into account both
FAA Order 6000.36, "Communications Diversity" [57], which specifies route diversity design
approaches for typical NAPRS Services, and the selection of appropriate SLA options. In
specifying the Communications Transport requirements for a new service, the characteristics of
a number of new or planned FAA systems and services should be taken into account.
The FTI is implemented as a services contract and covers inter-facility communications only.
Services provided by the FTI contractor are detailed in the "FTI Operations Reference Guide"
[78]. FTI provides traditional serial telecommunications services replacing legacy FAA-owned
or leased transport links, IP backbone and routing services, and SOA services for on-ramping to
the SWIM infrastructure. NVS, Data Comm, and SWIM are all planned to utilize FTI IP and
SOA services. FTI offers a tiered set of RMA and latency characteristics based on an SLA,
with diversity and avoidance routing options based on the requirements of FAA Order 6000.36.
While FTI RMA characteristics are not incorporated into service thread availability
calculations, system designers should be careful to specify the FTI RMA service class
appropriate to the service thread requirements. The RMA characteristics of FTI services are set
out in Table 7.1 of the FTI Operations Reference Guide and are reproduced here in Table 7-10.
A comprehensive listing of FTI services can be found in the FAA Telecommunications
Services Description (FTSD), Attachment J.1 to the FTI contract DTFA01-02-D-03006.
Table 7-10 RMA Characteristics of FTI Services

RMA Level   12-Month Availability   Outages per 12 Months   Mean Time Between Outages (MTBO)   Restoration
RMA 1       0.9999971               15                      584 Hours                          6 seconds
RMA 2       0.9999719               15                      584 Hours                          58.8 seconds
RMA 3       0.9998478               15                      584 Hours                          8 minutes
RMA 4       0.9979452               15                      583 Hours                          3 hours
RMA 5       0.9972603               15                      582 Hours                          4 hours
RMA 6       0.9904215               6                       522 Hours                          n/a
RMA 7       0.997                   n/a                     n/a                                24 hours
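The figures in Table 7-10 can be cross-checked with simple arithmetic: a 12-month availability implies an annual downtime budget, and for the fast-restoration classes the availability is closely approximated by MTBO / (MTBO + restoration time). The sketch below is an informal check, not part of the FTI contract.

```python
# Informal arithmetic on Table 7-10: annual downtime implied by a 12-month
# availability, and availability implied by MTBO plus restoration time.
# The MTBO-based formula matches the table closely for RMA 1 and RMA 2;
# the slower classes do not reduce to this simple relationship.

HOURS_PER_YEAR = 8760.0

def annual_downtime_hours(availability: float) -> float:
    """Downtime per year implied by a 12-month availability figure."""
    return (1.0 - availability) * HOURS_PER_YEAR

def availability_from_mtbo(mtbo_hours: float, restore_hours: float) -> float:
    """Steady-state availability given mean time between outages and
    restoration time per outage."""
    return mtbo_hours / (mtbo_hours + restore_hours)

print(f"RMA 1 downtime: {annual_downtime_hours(0.9999971) * 3600:.0f} s/yr")
print(f"RMA 5 downtime: {annual_downtime_hours(0.9972603):.1f} h/yr")
print(f"RMA 1 from MTBO: {availability_from_mtbo(584.0, 6.0 / 3600.0):.7f}")
```

RMA 1 allows on the order of 90 seconds of downtime per year, while RMA 5's 0.9972603 corresponds to about one full day; the spread across service classes is roughly three orders of magnitude, which is why matching the FTI service class to the service thread requirement matters.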
7.7.3.3.3 Deriving RMA Requirements for Enterprise Infrastructure Systems
(EIS)
Section 6 defined a process for deriving RMA requirements. The portion of that process used to
determine a service thread's STLSC is retained for EISs. A key difference between EISs and
information systems is that in most cases an EIS will support multiple services across multiple
facilities. RMA requirements for an EIS are premised on the service thread with the highest
loss severity category that the EIS supports. Using this approach, the RMA practitioner can
determine the loss severity category for the EIS. RMA requirements for an EIS should never be
less than what is prescribed for its most critical NAS service.
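The rule stated above (an EIS inherits the loss severity category of the most severe service thread it supports) reduces to a simple maximum over an ordered set of categories. The sketch below is a minimal illustration; the thread list in the usage example is hypothetical.

```python
# Minimal sketch of the EIS severity rule: take the most severe loss
# severity category among the supported service threads. The ordering
# follows the Handbook's categories (Safety-Critical most severe).

SEVERITY_RANK = {
    "Safety-Critical": 0,      # lower rank = more severe
    "Efficiency-Critical": 1,
    "Essential": 2,
    "Routine": 3,
}

def eis_severity(supported_thread_severities: list) -> str:
    """Return the highest (most severe) loss severity category present
    among the service threads the EIS supports."""
    return min(supported_thread_severities, key=SEVERITY_RANK.__getitem__)

# Hypothetical EIS supporting threads of mixed severity:
print(eis_severity(["Essential", "Efficiency-Critical", "Routine"]))
# -> Efficiency-Critical
```

Because the EIS RMA requirement must never be less than that of its most critical supported service, adding even one more-severe thread to an EIS raises the requirement for the entire platform.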
Several considerations must be addressed before appropriate RMA requirements can be
specified for an EIS. First, the impact of losing a service at a larger facility, such as a TRACON
with a Facility Level of 9 through 12, may require the practitioner to specify a more reliable EIS
architecture. For existing service threads, the STLSC is scaled based on facility level, and the
value is provided within the STLSC matrices (Figure 7-9, Figure 7-10, and Figure 7-11). In
addition to facility level, it is important that the RMA practitioner consider the effects of the
concentration of services that results when an EIS is utilized to support numerous services. A
Service Risk Assessment (SRA) is useful for studying the effects of concentrated services. One
method of increasing service reliability is to specify higher Communications Transport service
levels. In addition to increased Communications Transport availability, various methods for
increasing availability are discussed in Section 7.7.3.3.4.
Table 7-11 Enterprise Infrastructure Architecture (EIS)

Essential
Recommended EIS architecture: At least two service providing units with automatic failover
between them; one network connection to the client side of the service-client pair.
FTI service class recommendation: A single FTI service with a service class featuring RMA 4
(.998 availability) or greater.

Efficiency-Critical
Recommended EIS architecture: At least two pairs of redundant service providing units located
in separate geographic locations with automatic failover between them; at least two network
connections to the client side of the service-client pair, with each connection going to a separate
point of presence.
FTI service class recommendation: Dual avoided FTI services with a service class featuring
RMA 4. Service classes featuring RMA 3 (.9998) can be considered but provide a less robust
connection.

Safety-Critical
Recommended EIS architecture: Due to data synchronization issues arising from the increased
latency inherent in a widely distributed enterprise system, and the potential for a single point of
failure within the FTI Operations IP Network, a Safety-Critical EIS platform needs to be vetted
and certified to deliver Safety-Critical services.
FTI service class recommendation: An appropriate FTI service class should be selected as part
of the development of a Safety-Critical EIS platform.
After determining the severity of the services the EIS will support, the associated scalability
factors, and the effects of service concentration, the RMA practitioner should refer to this
Handbook for a recommended EIS architecture.
Table 7-11 defines a basic set of architecture models based on the severity of the services the
EIS supports. These basic architectures leverage current architectural approaches and best
practices as applied to information systems and extend them to EISs. Additionally, Table 7-11
provides guidance for selecting an appropriate FTI service class. FTI service class
recommendations apply to both EIS architectures and producer or consumer systems. The
architectures in Table 7-11 are intended to give the practitioner the absolute minimum required
for an EIS to support each severity level. An abstracted view of the architectures is provided in
Figure 7-16 for Essential services and Figure 7-17 for Efficiency-Critical services.
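The value of the dual FTI connections recommended in Table 7-11 can be illustrated with a short availability computation. This is a simplified sketch that assumes the two services fail independently (which is the intent of avoidance routing); real FTI service-class figures and failure correlations would need to come from the service provider.

```python
# Illustrative only: combined availability of n independent parallel paths,
# assuming independent failures (the goal of dual "avoided" routing).
def parallel_availability(a: float, n: int = 2) -> float:
    """Availability of n independent paths in parallel: 1 - (1 - a)^n."""
    return 1.0 - (1.0 - a) ** n

single_rma4 = 0.998  # RMA-4 service-class availability from Table 7-11
print(f"Single RMA-4 path: {single_rma4:.6f}")
print(f"Dual RMA-4 paths : {parallel_availability(single_rma4, 2):.6f}")  # 0.999996
```

Any common-mode failure (a shared point of presence, for example) breaks the independence assumption, which is why Table 7-11 requires each connection to go to a separate point of presence.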
Multiple challenges must be addressed before an EIS can host Safety-Critical services. Any
Safety-Critical service that relies on SOA and/or Cloud-Computing models will need a fully
vetted and approved architecture prior to implementation. Information systems providing
Safety-Critical services also have to synchronize input data between the service’s constituent
Efficiency-Critical threads for inputs that are not provided to both threads simultaneously.
This is typically achieved with dual redundant processors that reside in each thread and
synchronize data between threads through a redundant interface. When a failover occurs, latency
extends the transitional hazard period while data is synchronized between threads. Because EISs
are widely distributed, the effects of latency on data synchronization are compounded.
Figure 7-16 EIS Architecture for Essential Services
The principal difference between the architectural recommendations for information systems and
those for EISs is redundancy. A redundant configuration provides the increased availability
needed to support multiple services, as well as resilience to software faults.
Within Figure 7-17, dual data paths are provided to the client, allowing access to either EIS A
or EIS B via the FTI Operations IP network. Physically redundant configurations should be used
for all systems supporting Efficiency-Critical services. EISs should be located in separate
facilities so that the geographic separation between them provides for contingencies.
Figure 7-17 EIS Architecture for Efficiency-Critical Services
7.7.3.3.4 Increasing Reliability in EISs
The following set of approaches reduces the negative impact of errors and increases reliability
for EIS services.
Preventative Maintenance: Certain classes of failures can be avoided through preventative
maintenance. For example, a software application with a memory leak can lead to a total
system failure, but the failure can be avoided if the machine is restarted before the
system fails. Such failures are particularly problematic in remote, unstaffed facilities,
where no maintainer is present to repair or restart the system. One approach to avoiding
these types of failures, particularly for remote, unstaffed facilities, is to provision an
automatic restart during periods of low utilization, before the system is expected to fail.
Such preventative maintenance processes can be automated within the software application or
via independent software applications. To minimize the impact of the loss of the service,
restarts should ideally occur during a scheduled outage. Preventative maintenance is also
applicable to hardware components and improves overall availability.
Recovery Approaches: The recovery approaches defined below address cases where failures are
expected, cannot be prevented, or where software reliability is unknown. They also address
cases where software reliability growth plans, as described in Appendix A, cannot be applied
to the system, software, or service in question.
o Automatic Diagnosis, Operator Intervention: A failing system should aid the
maintainer in determining the source of a hardware failure or software fault. For an EIS,
this is difficult, particularly for service threads that span multiple systems and
facilities. The key to this technique is therefore employing data mining and software to
pinpoint failures based on trace-log information, such as failed requests, failed message
transfers, etc. Pinpointing the problem helps the maintainers and operators isolate the
failure, thus reducing the MTTR.
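A minimal form of the trace-log analysis described above is a failure count per component. This sketch assumes a hypothetical record format; a real EIS would mine far richer trace data (request IDs, timestamps, causal chains) across systems.

```python
# Illustrative sketch: flag the component most implicated by failure
# records in a trace log. The record format here is hypothetical.
from collections import Counter

def likely_fault_source(trace_records):
    """Return the component with the most FAILED records, or None."""
    failures = Counter(r["component"] for r in trace_records
                       if r["status"] == "FAILED")
    return failures.most_common(1)[0][0] if failures else None

log = [
    {"component": "msg-broker", "status": "OK"},
    {"component": "registry",   "status": "FAILED"},
    {"component": "registry",   "status": "FAILED"},
    {"component": "msg-broker", "status": "FAILED"},
]
print(likely_fault_source(log))  # registry
```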
o Fine-Grained Partitioning and Recursive Recovery: This approach applies to
systems that cannot tolerate the long downtimes that result from unannounced restarts. It
applies to Enterprise Services and software components and assumes that most
software defects can cause software to crash, deadlock, spin, leak memory, or
otherwise fail in a way that leaves a reboot or restart as the only option. This
approach can be useful for systems residing in remote, unstaffed facilities, where a
maintainer is not present, and in cases where full system restarts are not acceptable.
The approach involves partitioning services, separating stateful software
components in a hierarchical manner so that faults can be isolated and services or
processes can be restarted without the entire system failing. These restarts can occur
periodically, at a time of minimal impact before a failure is expected, or when a
failure is detected. For more information on this approach and how it can be applied
and specified, refer to [54].
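The escalation idea behind recursive recovery can be sketched as follows: restart the smallest failed partition first and escalate to its parent only if that fails. The partition names and the restart predicate are hypothetical; see [54] for the actual technique.

```python
# Hypothetical sketch of recursive recovery: attempt restarts from the
# smallest partition upward, escalating only when a restart fails.
def recursive_restart(levels, restart):
    """levels: ordered smallest partition -> whole system.
    restart(name) -> True if restarting that level cleared the fault.
    Returns the level that recovered the system, or None."""
    for name in levels:
        if restart(name):
            return name
    return None

hierarchy = ["flight-plan-service", "service-container", "host"]
# Pretend a service-level restart is sufficient to clear the fault:
print(recursive_restart(hierarchy, lambda n: n == "flight-plan-service"))
# flight-plan-service
```

The benefit is that a fault confined to one service never forces a whole-system restart, which keeps downtimes short in unstaffed facilities.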
o Reversible, Undoable Systems: Rapid rollback of stateful, persistent updates
addresses failures that result from an erroneous input to the system or from an update
that causes the system to fail. This capability is typically built into virtualization
and storage area network solutions via a snapshot mechanism, which allows the
virtualized system and its state to be completely restored to the point in time at
which the snapshot was taken. This approach is recommended for EIS systems with
software applications, services, and/or operating systems that are expected to be
updated frequently.
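The snapshot-and-rollback pattern can be sketched in miniature. Real implementations operate at the virtual machine or storage layer; this hypothetical example applies the same idea to an in-memory configuration, with the update and health check as placeholders.

```python
# Hypothetical snapshot/rollback sketch: capture state before a risky
# update and restore it if the update leaves the system unhealthy.
import copy

def apply_with_rollback(state, update, healthy):
    snapshot = copy.deepcopy(state)  # point-in-time snapshot
    update(state)
    if not healthy(state):
        state.clear()
        state.update(snapshot)       # roll back to the snapshot
    return state

cfg = {"route_table": "v1"}
bad_update = lambda s: s.update({"route_table": "v2-corrupt"})
print(apply_with_rollback(cfg, bad_update,
                          lambda s: "corrupt" not in s["route_table"]))
# {'route_table': 'v1'}
```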
o Rapid, Graceful Restart: This approach applies to systems where predicted
software reliability is unknown, where there is limited influence on software
reliability growth, and where system restarts are tolerated but long downtimes are not.
Several techniques are available, such as machine reboots to a known state or snapshot,
or a physical machine specified and configured so that it can be restarted and
operational within a short period of time. These techniques are typically applied to
systems where information is not persisted or stored; however, they can also be applied
to services that retrieve state data from another system upon restart.
Redundancy Approaches: In a fault-free environment, redundancy would not be required.
However, where the level of software reliability is not predictable and the
ability to manage and control software reliability growth is limited,
redundancy is critical to achieving operational availability for Efficiency-Critical and
Safety-Critical NAS Services.
o Redundancy with Failover: Redundancy is the provisioning of functional capabilities
that act as a backup and automatically take over when a failure is detected. Based
on Section 7.3, redundancy is necessary to meet availability requirements for
systems supporting Efficiency-Critical and Safety-Critical NAS Services. These
redundancy approaches also extend to Enterprise Infrastructure Systems and
Services. Enterprise Services and Systems should be designed to provide stateful
redundancy with rapid failover and potentially hot standby. Stateless redundancy,
i.e., cold standby, can also be considered for such applications as long as the
switchover time does not impinge on the overall availability goals and loss of
information is not a concern. This approach can be applied to physical systems
running enterprise software, to virtualized systems and software (i.e., VMs), and to
enterprise services. For example, redundancy could be applied to a UDDI registry
via two virtual machines running on a Cloud; alternatively, the UDDI registry could be
extended to the facility, redundant with the centralized UDDI registry, so that
services at that facility can still discover registered services should the
centralized registry go down. In another example, a set of 20 duplicate services
running across multiple physical or virtual machines could be redundant with one
another and provide load balancing. All of these examples provision functional
capabilities to act as a backup when failures occur.
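The failover behavior described above reduces, at its simplest, to routing work to the first unit that passes a health check. This is an illustrative sketch; the unit names and the health-check predicate are hypothetical, and a production failover mechanism would also handle state transfer and failback.

```python
# Minimal failover sketch (hypothetical health checks): route requests to
# the first service-providing unit that passes its health check.
def select_unit(units, healthy):
    """units: ordered [primary, backup, ...]; healthy(u) -> bool."""
    for u in units:
        if healthy(u):
            return u
    raise RuntimeError("all service-providing units down")

units = ["eis-a", "eis-b"]
print(select_unit(units, lambda u: True))          # eis-a (primary healthy)
print(select_unit(units, lambda u: u != "eis-a"))  # eis-b (primary failed)
```

For stateful redundancy, the backup must already hold current state (hot standby) or retrieve it on takeover; otherwise the switchover time grows and may impinge on the availability goals noted above.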
o N, N+1 Redundant Systems with Failover: This approach allows the maintainer to
upgrade one system (to the N+1 baseline) while leaving the backup system
unchanged at the current baseline (N). It can also be applied to physical
systems, virtual machines, and enterprise services, as described in the previous
section.
o Recovery-Oriented Computing (ROC) Approach for Enterprise Infrastructure
Systems: The traditional fault-tolerance community has focused on reducing MTBF
and has only occasionally devoted attention to recovery; however, the ROC approach “…
focuses on MTTR under the assumption that failures are generally problems that can
be known ahead of time and should be avoided.” [54] ROC assumes that failures
will occur and that software faults are inevitable. It takes a more pragmatic
approach by focusing on reducing recovery times in order to increase availability.
The approach takes nothing away from improving MTBF but instead recognizes,
through over a decade of experience in implementing complex enterprise-level
systems, that significant failures should be expected. By improving MTTR,
operational availability is improved beyond what an MTBF-only approach can achieve.
The approach offers a different perspective on planning for failure. Instead of
specifying requirements to fix, avoid, prevent, or tolerate failures (approaches that
can be extremely expensive to implement), it focuses on recovering from failures based
on priority of impact and specifies techniques that are known to speed up recovery
times.
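The ROC argument can be made quantitative with the standard availability relation A = MTBF / (MTBF + MTTR). The figures below are illustrative only, but they show that a tenfold reduction in MTTR yields the same availability gain as a tenfold increase in MTBF.

```python
# Operational availability from MTBF and MTTR: A = MTBF / (MTBF + MTTR).
# Illustrates the ROC point that cutting recovery time raises availability
# as effectively as raising time between failures.
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    return mtbf_hours / (mtbf_hours + mttr_hours)

print(f"{availability(1000, 1.0):.5f}")   # 0.99900  baseline
print(f"{availability(1000, 0.1):.5f}")   # 0.99990  10x faster recovery
print(f"{availability(10000, 1.0):.5f}")  # 0.99990  10x better MTBF
```

Since MTTR improvements (automated diagnosis, rapid restart, failover) are often far cheaper than MTBF improvements, this is the economic core of the ROC approach.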
For more information on this approach and methodology, consult the following
references. The ROC approach was developed at UC Berkeley (see reference [54] for the
approach and rationale) and was adopted and expanded by Microsoft as part of its
Resilience Modeling and Analysis process (see reference [53] for the process,
methodology, and templates). Additional references are provided that should aid in the
specification of requirements and in implementation.
7.8 Summary of Process for Deriving RMA Requirements

Section 6 of this Handbook summarizes a process for deriving RMA requirements for all areas
of the System of Systems taxonomy. Section 7.1 details the NAS-RD severity assessment
process, which assigns a severity level to each of the NAS-RD functional requirements. A
process for developing service threads is provided in Section 7.2, along with an introduction
to the NAS System of Systems taxonomy. Section 7.3 explains how to assess a service thread’s
severity by considering the impact of transitioning to the manual backup procedures that are in
place to provide service in the event of a service thread loss. Since the impact of the loss of a
service is not the same for all FAA facilities, Section 7.4 presents a method for scaling an STLSC
based on facility size. Section 7.5 describes the process of assigning STLSCs to service threads.
The following sections, Sections 7.6 and 7.7, assign STLSCs to service threads and explain how
RMA requirements are derived for each area of the taxonomy. Section 7.6 provides STLSC
matrices for each domain of the NAS, Terminal and En Route; service threads that do not
reside within a specific domain appear in the “Other” STLSC matrix. RMA characteristics for
Remote/Distributed Service Threads are determined using technical and Life Cycle Cost
considerations; however, the STLSC matrices in Section 7.6 provide a starting point for
Remote/Distributed threads. With the necessary background provided in the preceding
sections, Section 7.7 discusses the development of RMA requirements for each area of the
taxonomy, including requirements for inherent availability and reliability (MTBF). An
approach for deriving RMA requirements for infrastructure and enterprise systems concludes
Section 7. Sample RMA requirements are provided in Appendix A.
ACQUISITION STRATEGIES AND GUIDANCE

Acquisition cycles can span many months or even years. Successful deployment of a
complex, high-reliability system that meets the user’s expectations for reliability,
maintainability, and availability depends on the definition, execution, and monitoring of a set
of interrelated tasks. The first step is to derive, from NAS-RD-20XX, the requirements for the
specific system being acquired. Next, the RMA portions of the procurement package must be
prepared and technically evaluated. Following that, a set of incremental activities intended to
establish increasing levels of confidence that the system being designed, built, and tested meets
those requirements runs throughout the design and development phases of the system.
Completing the cycle is an approach to monitoring performance in the field to determine
whether the resulting system meets, or even exceeds, requirements over its lifetime. This
information then forms a foundation for the specification of new or replacement systems.
Figure 8-1 depicts the relationship of the major activities of the recommended process. Each
step is keyed to the section that describes the document to be produced. The following
paragraphs describe each of these documents in more detail.
Figure 8-1 Acquisition Process Flow Diagram
8.1 Preliminary Requirements Analysis

This section presents the methodology for applying NAS-Level Requirements to major system
acquisitions. The NAS-Level Requirements are analyzed to determine the RMA requirements
allocated to the system and their potential implications for the basic architectural
characteristics of the system to be acquired. Where system RMA levels require redundancy, this
must be explicitly factored into the specification at this stage. SMEs should take into account
the operational needs and economic impacts of RMA requirements. The potential requirements are
then compared with the measured performance of currently fielded systems to build confidence in
the achievability of the proposed requirements and to ensure that the specified RMA
characteristics of a newly acquired system will support levels of safety equal to, or better
than, those of the systems it replaces.
To begin this process, determine the category of the system being acquired: Information
Systems; Remote/Distributed and Standalone Systems; Mission Support Systems; or Infrastructure
and Enterprise Systems. Each of these categories is treated differently, as discussed in the
following sections.
8.1.1 System of Systems Taxonomy of FAA NAS Systems and Associated
Allocation Methods
There is no single allocation methodology that can logically be applied across all types of FAA
systems. Allocations from NAS-Level requirements to the diverse FAA systems comprising the
NAS require different methodologies for different system types. NAS systems are classified
into four major categories: 1) Information Systems, 2) Remote/Distributed and Standalone
Systems, 3) Mission Support Systems, and 4) Infrastructure & Enterprise Systems, as discussed
in Section 7.2. The taxonomy of FAA system classifications described in Section 7.2 and
illustrated in Figure 7-1 is repeated in Figure 8-2. This taxonomy is the basis on which
definitions and allocation methodologies for the various categories of systems are established.
Strategies for each of these system categories are presented in the paragraphs that follow.
Figure 8-2 NAS System of Systems Taxonomy
1. Information Systems are generally computer systems located in major facilities staffed
by Air Traffic Control personnel. These systems consolidate large quantities of
information for use by operational personnel. They usually have high severity and
availability requirements because their failure could affect large volumes of information
and many users. Typically, they employ fault tolerance, redundancy, and automatic fault
detection and recovery to achieve high availability. These systems can be mapped to the
NAS Services and Capabilities functional requirements.
2. The Remote/Distributed and Standalone Systems category includes remote sensors,
communications, and navigation sites – as well as distributed subsystems such as display
terminals – that may be located within a major facility. Failures of single elements, or
even combinations of elements, can degrade performance at an operational facility, but
generally they do not result in the total loss of the surveillance, communications,
navigation, or display capability.
3. Infrastructure & Enterprise Systems provide power, environmental control, communications,
and enterprise infrastructure to the facilities.
4. Mission Support Systems are the systems used to manage the design, operation and
maintenance of the systems used in the performance of the air traffic control mission.
Remote/Distributed Service Threads achieve the overall availability required by NAS-RD-2013
through the use of qualitative architectural diversity techniques as specified in FAA Order
6000.36. Primarily, these involve multiple instantiations of the service thread with overlapping
coverage. The ensemble of service thread instantiations provides overall continuity of service
despite failures of individual service thread instantiations. The RMA requirements for the
systems and subsystems comprising R/D service threads are determined by the NAS
Requirements Group. Acquisition Managers determine the most cost effective method of
implementation taking into account what is technically achievable and Life Cycle Cost
considerations. Procedures for determining the RMA characteristics of the Power Systems
supplying service threads are discussed in Section 8.1.1.4.
8.1.1.1 Information Systems

The starting point for the development of RMA requirements is the set of three matrices
developed in the previous section: Figure 7-9, Figure 7-10, and Figure 7-11. For Information
Systems, select the matrix pertaining to the domain in which a system is being upgraded or
replaced, and review the service threads listed in the matrix to determine which service
thread(s) pertain to that system.
For systems that are direct replacements for existing systems:
1. Use the Service/Function STLSC matrix to identify the service thread that encompasses
the system being replaced. If more than one service thread is supported by the system,
use the service thread with the highest STLSC value (e.g., ERAM supports both the
CRAD surveillance service thread and the CFAD flight data processing service thread).
2. Use the severity associated with the highest STLSC value to determine the appropriate
system severity requirement.
Use the NAS-RD-20XX requirements presented in Table 7-8 to determine appropriate baseline
MTBF, MTTR, and recovery time (if applicable) values for each of the service threads, ensuring
consistency with STLSC severities.
For systems that are not simple replacements of systems contained in existing service threads,
define a new service thread. The appropriate STLSC matrix for the domain and the service
thread Reliability, Maintainability, and Recovery Times table (Table 7-8) need to be updated,
and a new Service Thread Diagram needs to be created and included in Appendix E. As
discussed in the preceding section, the practical purpose of the severity is to determine
fundamental system architecture issues, such as whether fault tolerance and automatic
recovery are required, and to ensure that adequate levels of redundancy will be incorporated
into the system architecture. The primary driver of the actual operational availability will be
the reliability of the software and the automatic recovery mechanisms.
8.1.1.2 Remote/Distributed and Standalone Systems

This category includes systems with Remote/Distributed and Standalone elements, such as radar
sites, air-to-ground communications sites, and navigation aids. These systems are characterized
by their spatial diversity. The surveillance and communications resources for a major facility
such as a TRACON or ARTCC are provided by a number of remote sites. Failure of a remote
site may or may not degrade the overall surveillance, communications, or navigation function,
depending on the degree of overlapping coverage, but the service and space diversity of these
remote systems makes total failure virtually impossible.
Attempts have been made in the past to perform a top-down allocation to a subsystem of
distributed elements. To do so requires that a hypothetical failure definition be defined for
the subsystem. For example, the surveillance subsystem could be considered down if two out
of fifty radar sites are inoperable. Such a failure definition is admittedly arbitrary and
ignores the unique characteristics of each installation, including air route structure,
geography, overlapping coverage, etc. Because such schemes rely almost entirely on “r out of
n” criteria for subsystem failure definitions, the availability allocated to an individual
element of a Remote/Distributed and Standalone System may be much lower than what could
reasonably be expected from a quality piece of equipment.
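The “r out of n” criterion above can be made concrete with a short binomial computation. The site count, per-site availability, and threshold below are illustrative only, chosen to match the hypothetical example in the text; they are not allocated FAA values.

```python
# Illustration of an "r out of n" failure criterion: probability that at
# least k of n independent sites are up, each with availability a.
from math import comb

def k_of_n_availability(k: int, n: int, a: float) -> float:
    """P(at least k of n independent sites operational)."""
    return sum(comb(n, i) * a**i * (1 - a)**(n - i) for i in range(k, n + 1))

# "Subsystem down if two of fifty radar sites are inoperable" means at
# least 49 of 50 sites must be up; with a notional 0.99 per-site value:
print(f"{k_of_n_availability(49, 50, 0.99):.4f}")  # about 0.91
```

The result shows the arbitrariness the text describes: even high-quality sites yield a modest subsystem figure under such a criterion, which is why top-down allocation to distributed elements is rejected in favor of diversity techniques.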
For these reasons, a top-down allocation from NAS requirements to the elements comprising a
distributed subsystem is not appropriate, and this category of systems has been isolated as
Remote/Distributed Service Threads in the STLSC matrices in Figure 7-9, Figure 7-10, and
Figure 7-11. STLSC values are listed for Remote/Distributed Service Threads only to provide
RMA practitioners with a starting point for deriving the RMA characteristics of
Remote/Distributed and Standalone Systems.
The RMA requirements for the individual elements comprising a Remote/Distributed and
Standalone System should be determined by life-cycle cost considerations and by the experience
of FAA acquisition specialists in setting realistic and achievable requirements. The overall
reliability characteristics of the entire distributed subsystem are achieved through the use of
diversity.
FAA Order 6000.36, “Communication and Surveillance Service Diversity,” establishes the
national guidance to reduce the vulnerability of these Remote/Distributed services to single
points of failure. The order provides for the establishment of regional Communications and
Surveillance Working Groups (CSWGs) to develop regional communications diversity and
surveillance plans for all ARTCCs, pacer airports, and other Level 5 terminals.
The scope of FAA Order 6000.36 includes communications services and surveillance services.
The NAPRS services to which the order applies are listed in Appendix 1 of the order. They
correspond to the NAPRS services that were mapped to the Remote/Distributed and Standalone
Systems category and designated as supporting critical NAS Architecture Capabilities in the
matrices in Figure 7-9, Figure 7-10, and Figure 7-11.
FAA Order 6000.36 defines five different diversity approaches that may be employed:
1. Service Diversity – services provided via alternate sites (e.g., overlapping radar or
communications coverage).
2. Route or Circuit Diversity – Physical separation of outside dual route or loop cable
systems.
3. Space Diversity – antennas at different locations.
4. Media Diversity – radio/microwave, public telephone network, satellite, etc.
5. Frequency Diversity – the utilization of different frequencies to achieve diversity in
communications and surveillance.
The type(s) and extent of diversity to be used are to be determined, based on local and regional
conditions, in a bottom-up fashion by communications working groups.
FAA Order 6000.36 tends to support the approach recommended in this Handbook – exempting
Remote/Distributed services and systems from top-down allocation of NAS-RD-2013
availability requirements. The number and placement of the elements should be determined by
FAA specialists knowledgeable in the operational characteristics and requirements for a specific
facility instead of by a mechanical mathematical allocation process. Ensuring that the NAS-
Level severities are not degraded by failures of Remote/Distributed and Standalone Systems in
a service thread can best be achieved through the judicious use of diversity techniques tailored
to the local characteristics of a facility.
The key points in the approach for Remote/Distributed and Standalone Systems are that the path
to achieving NAS-Level availability requirements employs diversity techniques, that the RMA
specifications for individual Remote/Distributed elements are the outgrowth of a business
decision by the FAA Service Unit, and that these decisions are based on trade-off analyses
involving factors such as what is available, what is achievable, and how increased reliability
requirements might reduce the costs of equipment operation and maintenance.
Distributed display consoles have been included in this category, since the same allocation
rationale has been applied to them. For the same reasons given for remote systems, the
reliability requirements for individual display consoles should be primarily a business decision
determined by life cycle cost tradeoff analyses. The number and placement of consoles should
be determined by operational considerations.
Airport surveillance radars are also included in this category. Even though they are not
distributed like the en route radar sensors, their RMA requirements still should be determined
by life cycle cost tradeoff analyses. Some locations may require more than one radar – based on
the level of operations, geography and traffic patterns – but, as with subsystems with distributed
elements, the decision can best be made by personnel knowledgeable in the unique operational
characteristics of a given facility.
Navigation systems are remote from the air traffic control facilities and may or may not be
distributed. The VOR navigation system consists of many distributed elements, but an airport
instrument landing system (ILS) does not. Because the service threads are the responsibility of
Air Traffic personnel, NAVAIDS that provide services to aircraft (and not to Air Traffic
personnel) are not included in the NAPRS 6040.15 service threads. Again, RMA requirements
for navigation systems should be determined by life-cycle cost tradeoff analyses, and the
redundancy, overlapping coverage, and placement should be determined on a case-by-case basis
by operational considerations determined by knowledgeable experts.
8.1.1.3 Mission Support Systems

Mission Support services used for airspace design and management of the NAS infrastructure
are generally not real-time services and are not reportable services within NAPRS. For these
reasons, it is not appropriate to allocate to this category of services and systems the
NAS-RD-20XX availabilities associated with the real-time services used to perform the air
traffic control mission. The RMA requirements for the systems and subsystems comprising
Mission Support Service Threads are determined by SMEs and verified by Acquisition Managers
in accordance with what is achievable and with Life Cycle Cost considerations.
8.1.1.4 Infrastructure and Enterprise Systems

The following four subsections discuss procurement impacts for Infrastructure and Enterprise
Systems, including 1) power systems, 2) heating, ventilation, and air conditioning (HVAC)
systems, 3) Communications Transport, and 4) Enterprise Infrastructure Systems (EIS). The
complex interactions of infrastructure systems with the systems they support violate the
independence assumption that underlies conventional RMA allocation and prediction. By
their very nature, systems in an air traffic control facility depend on the supporting
infrastructure systems for their continued operation, and failures of infrastructure systems
can be a direct cause of failures in the systems they support.
Moreover, failures of infrastructure services may or may not cause failures in the service
threads they support, and the duration of a failure in an infrastructure service is not
necessarily the same as the duration of the resulting failure in a supported service thread. For
example, a short power interruption of less than a second can cause a computer system failure
that disrupts operations for hours. In contrast, an interruption in HVAC service may have no
effect at all on the supported services, provided that HVAC service is restored before
environmental conditions deteriorate beyond what the supported systems can tolerate.
Communications Transport services are procured by the FAA on a leased-services basis.
Leased services are treated differently from in-house services from the RMA viewpoint.
Communications Transport services are not designed by the FAA; rather, they are specified
according to desired performance parameters such as RMA and Diversity/Avoidance.
Enterprise Services constitute yet another subcategory of services from the RMA point of
view. Multiple enterprise services are hosted on an EIS across various facilities with varying
levels of service availability. One or more of those services may be utilized depending on the
service end-user. RMA requirements are dependent on the highest service severity level of the
constituent EIS services. Acquisition managers should utilize SMEs and consider the need for
more stringent RMA requirements due to aggregated services and contingency planning
requirements.
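The aggregation rule above (EIS RMA requirements follow the highest service severity level of the constituent services) can be sketched in a few lines of Python. The service names and the numeric severity ordering below are illustrative assumptions, not values taken from this Handbook:

```python
# Illustrative sketch: derive an EIS RMA severity level by taking the most
# stringent severity among the services it hosts. The service names and the
# numeric ordering (including a "Routine" level) are hypothetical examples.

SEVERITY_ORDER = {
    "Routine": 0,
    "Essential": 1,
    "Efficiency-Critical": 2,
    "Safety-Critical": 3,
}

def eis_severity(constituent_services: dict) -> str:
    """Return the highest severity level among services hosted on an EIS."""
    return max(constituent_services.values(),
               key=lambda level: SEVERITY_ORDER[level])

services = {
    "weather-dissemination": "Essential",
    "flight-data-publication": "Efficiency-Critical",
    "admin-messaging": "Routine",
}
print(eis_severity(services))  # -> Efficiency-Critical
```

In practice, as the text notes, SMEs may still impose more stringent requirements than this simple maximum because of aggregation and contingency-planning considerations.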
FAA Reliability, Maintainability, and Availability (RMA) Handbook
FAA RMA-HDBK-006C V1.1
95
8.1.1.4.1 Power Systems
Due to the complex interaction of the infrastructure systems with the service threads they
support, top-down allocations of NAS-RD-2013 availability requirements are limited to simple
inherent availability requirements that can be used to determine the structure of the power
system architecture. The allocated power system requirements are shown in Figure 7-13, Figure
7-14, and Figure 7-15 ("Other" Power System). The inherent availability requirement for power
systems at larger facilities is derived from the NAS-RD-2013 requirement of 0.99999 for
Safety-Critical
capabilities. It should be emphasized that these inherent availability requirements serve only to
drive the power system architectures, and should not be considered to be representative of the
predicted operational availability of the power system or the service threads it supports.
At smaller terminal facilities, the inherent availability requirements for the Critical Power
Distribution System can be reduced because the reduced traffic levels at these facilities allow
manual procedures to be used to compensate for power interruptions without causing serious
disruptions in either safety or efficiency of traffic movement.
The smallest terminal facilities do not require a Critical Power Distribution System. The power
systems at these facilities generally consist of commercial power with an engine generator or
battery backup. The availability of these power systems is determined by the availability of the
commercial power system components employed. Allocated NAS-RD-2013 requirements are
not applicable to these systems.
The FAA Power Distribution Systems are developed using standard commercial off-the-shelf
power system components whose RMA characteristics cannot be specified by the FAA. The
RMA characteristics of commercial power system components are documented in IEEE Std
493-1997, Recommended Practice for the Design of Reliable Industrial and Commercial Power
Systems, (Gold Book). This document presents the fundamentals of reliability analysis applied
to the planning and design of electric power distribution systems, and contains a catalog of
commercially available power system components and operational reliability data for the
components. Engineers use the Gold Book and the components discussed in it to determine the
configuration and architecture of power systems required to support a given level of availability.
Since the RMA characteristics of the power system components are fixed, the only way power
system availability can be increased is through the application of redundancy and diversity in
the power system architecture.
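The effect of redundancy on availability can be sketched as a simple parallel combination of independent sources. The availability figures below are illustrative placeholders only, not data from the Gold Book:

```python
# Sketch: how redundancy raises power-system availability when component RMA
# characteristics are fixed. The availability values are illustrative
# placeholders, NOT values from IEEE Std 493 (Gold Book).

def parallel(*avails):
    """Availability of redundant sources: up unless all sources are down."""
    unavail = 1.0
    for a in avails:
        unavail *= (1.0 - a)
    return 1.0 - unavail

commercial = 0.999   # hypothetical single commercial feed
generator = 0.99     # hypothetical engine generator backup

print(f"commercial only:        {commercial:.6f}")
print(f"commercial + generator: {parallel(commercial, generator):.6f}")
```

With these illustrative numbers, the parallel combination reaches 0.99999, which is why architecture (redundancy and diversity) rather than component selection is the lever for meeting high availability targets.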
It should be noted that, although the inherent reliability and availability of a power distribution
system can be predicted to show that the power system is compliant with the allocated NAS-
RD-20XX availability requirements, the dependent relationship between power systems and the
systems they support precludes the use of conventional RMA modeling techniques to predict
the operational reliability and availability of the power system and the service threads it
supports.15
The FAA has developed a set of standard power system architectures and used computer
simulation models to verify that the standard architectures comply with the derived NAS-RD-
2013 requirements. The standards and operating practices for power systems are documented in
FAA Order 6950.2D, Electrical Power Policy Implementation at National Airspace System
Facilities. Since compliance of the standard power system architectures with the NAS-RD-2013
availability requirements has already been demonstrated, there is no need for additional
modeling efforts. All that is required is to select the appropriate architecture.
The focus for FAA power systems is on the sustainment of the existing aging power systems,
many of whose components are approaching or have exceeded end-of-life expectations, and the
development of a new generation of power systems for future facility consolidation and renewal
of aging facilities. The primary objectives of this Handbook with respect to power systems are
to:
- Document the relationship between service threads and the power system architectures
in FAA Order 6950.2D.
- Demonstrate that the inherent availability of existing power system architectures is
consistent with the derived NAS-RD-2013 availability requirements.
- Identify potential "red flags" for terminal facilities that may be operating with
inadequate power distribution systems as a consequence of traffic growth.
- Provide power system requirements for new facilities.
The matrices in Figure 7-13, Figure 7-14, and Figure 7-15 ("Other" Power System) encapsulate
the information required to achieve these
objectives. It is only necessary to look at the power system architecture row(s) in the
appropriate matrix to determine the required power system architecture for a facility.
8.1.1.4.2 HVAC Subsystems
HVAC Subsystems are normally procured in the process of building or upgrading an Air Traffic
Control facility. Building design is controlled by a series of Orders, including FAA Order
6480.7D, which requires that ATCTs and TRACONs be equipped with redundant air
conditioning systems for critical spaces16, but does not specify RMA requirements. Power for
cooling and ventilation of critical spaces is provided from the Essential Power bus and is backed
up by the local generator.
15 The interface standards between infrastructure systems and the systems they support are an area of concern. For
example, if power glitches are causing computer system failures, should the power systems be made more stable, or
should the computer systems be made more tolerant? This tradeoff between automation and power system
characteristics is important and deserves further study; however, it is considered outside the scope of this
Handbook.
16 Critical spaces in ATC facilities are the tower cab, communications equipment rooms, telco rooms, operations
rooms, and the radar and automation equipment rooms.
System designers will not normally be required to provide equipment heating and cooling and
will not have control over HVAC specifications. Temperature and humidity standards for
HVAC are set at the facility level. If system equipment has unusual requirements, it may be
necessary to make special provisions, but this should be avoided because of the cost and
maintenance impacts that may need to be borne directly by the system program office. System
designers should be aware of facility Contingency Plans and design for degradation with the
"Transition Hazard Interval", as shown in Figure 7-4.
8.1.1.4.3 Communications Transport
Communications Transport is a leased service that facilitates the transfer of information
between FAA systems. The FTI contract provides standard FAA ground to ground
Communications Transport services for inter-facility communications, as described in Section
7.7.3.3.2. The FTI provides both legacy telephony and data services and IP services. New inter-
facility communications links must be procured through the FTI contract [82]. Services
provided by the FTI contractor are detailed in the “FTI Operations Reference Guide” [78].
Similarly, the Data Comm program provides or will provide leased Data Communications
services between ATC facilities, Flight Operations Centers (FOCs) and appropriately equipped
aircraft both on the ground and in the air.
The FTI offers a tiered set of RMA characteristics based on a contractual Service Level
Agreement (SLA), incorporating diversity and avoidance routing options to meet the
requirements of FAA Order 6000.36. System designers should be careful to specify the FTI
RMA service class appropriate to the service thread requirements. The RMA characteristics of
FTI services are set out in Table 7.1 of the FTI Operations Reference Guide and are reproduced
in Table 7-10.
The RMA methodology and approach for EIS is described in Section 6.7.3.3.1 and should be
followed at appropriate points throughout the acquisition process. It is particularly important to
plan for provision of procedural and physical redundancy. The NAS Voice System (NVS) and
the Surveillance Infrastructure Modernization (SIM) programs will utilize FTI IP transport and
routing capabilities to increase the flexibility of NAS contingency and business continuity
options. The use of SRAs as mentioned in Section 6.7.3.3.1 is an important element in the
process of designing procedural backups involving multi-site contingency planning. An
example of an appropriate process for selection of FTI service levels can be found in Section
4.2 of the SWIM "Solution Guide for Segment 2A" [79].
Guidance for selection of Data Comm options is not available at this time, so designers of
systems requiring air-to-ground Data Communications Transport or ATC-to-FOC data
connectivity should consult the Data Comm program for up-to-date guidance.
8.1.1.4.4 Enterprise Infrastructure Systems (EIS)
An EIS is acquired and implemented to support multiple service threads. The EIS must be
developed to ensure that service threads utilizing an EIS meet their respective operational
availability requirements. The SLAs and/or system-level requirements need to take into account
that the EIS will contribute to the overall operational availability of service threads. This
includes the specification of RMA approaches defined in Section 7.7.3.3.3.
EIS functionality for RMA requirements must be assessed based on the use of the service. EISs
extend beyond the facility and host a concentration of services. Because of this concentration of
services, EIS RMA requirements should be addressed by diversifying services and data.
Diversity of services means that redundant services run on separate processors within the EIS.
Diversity of data means that there are redundant and physically separate data paths through the
EIS.
SMEs are needed to assess the service in order to determine where loss of data is not acceptable
and where the availability of information is important to the overall availability of the service
thread. An assessment of these requirements is addressed in Section 6.7.3.3.4. The following
EIS RMA
principles should be specified:
- Durable and reliable messaging is required for the EIS to support service threads where
information cannot be lost in transit.
- Availability of data to ensure that data provided by the EIS is available to the
Information System should there be a failure at another point in the EIS.
- Recoverability of data to ensure that data being provided and transported by the EIS will
remain available to the Information System requiring it.
Further, since many EISs leverage COTS software and hardware, software reliability growth
planning and software assurance activities are particularly important for service threads that are
Essential or Efficiency-Critical. Refer to Appendix F for more information regarding
Government oversight and the required contractor activities as they relate to software reliability
growth.
8.1.2 Analyzing Scheduled Downtime Requirements
Operational functions serve as an input to RMA requirements, as discussed in this section of the
acquisition planning process. The issue of scheduled downtime for a system must be addressed.
Scheduled downtime is an important factor in ensuring the operational suitability of the system
being acquired and in reducing negative economic impact to the NAS.
The anticipated frequency and duration of scheduled system downtime to perform preventive
maintenance tasks, software upgrades, adaptation data changes, etc. must be considered with
respect to the anticipated operational profile for the system. The preventive maintenance
requirements of the system hardware include cleaning, changing filters, real-time performance
monitoring, and running off-line diagnostics to detect deteriorating components.
Many NAS systems are not needed on a 24/7 basis. Some airports have periods of reduced
operational capacity and some weather systems are only needed during periods of adverse
weather. If projected downtime requirements can be accommodated without unduly disrupting
Air Traffic Control operations by scheduling downtime during periods of reduced operational
capacity or when the system is not needed, then there is no negative impact. A requirement to
limit the frequency and duration of required preventive maintenance could be added to the
maintainability section of the System Specification Document (SSD). However, since most of
the automation system hardware is COTS, the preventive maintenance requirements should be
known in advance and will not be affected by any requirements added to the SSD. Therefore,
additional SSD maintainability requirements are only appropriate for custom-developed
hardware.
Conversely, if scheduled downtime cannot be accommodated without disrupting air traffic
control operations, it is necessary to re-examine the approach being considered. Alternative
solutions should be evaluated, for example, adding an independent backup system to supply the
needed service while the primary system is unavailable.
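The downtime-accommodation check described above can be sketched as a simple feasibility test. The task names, durations, and reduced-capacity window below are hypothetical illustrations:

```python
# Sketch: check whether projected scheduled-downtime tasks can be absorbed
# within a facility's reduced-capacity period. All task names, durations,
# and the window length are hypothetical.

def downtime_fits(tasks_minutes: dict, window_minutes: int) -> bool:
    """True if all scheduled tasks fit within the low-traffic window."""
    return sum(tasks_minutes.values()) <= window_minutes

tasks = {
    "clean and change filters": 45,
    "off-line diagnostics": 60,
    "adaptation data update": 30,
}
window = 3 * 60  # e.g. a 01:00-04:00 reduced-capacity period

print(downtime_fits(tasks, window))  # -> True (135 min fits in 180 min)
```

If the check fails, the text's alternative applies: re-examine the approach, for example by adding an independent backup system to carry the service during the outage.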
8.1.3 Modifications to STLSC Levels
In order to provide a more accurate assessment of availability requirements, service threads
should be evaluated based on the characteristics (i.e. size, type, and any environmental factors)
of the facility in which the services are to be deployed. This approach is facilitated by the
grouping schema implemented in the STLSC matrices in Section 7.6. This approach permits the
scaling of service thread requirements to the needs of varying facilities. To ensure appropriate
requirements are derived, the RMA practitioner is advised to consult with SMEs to determine
the severity of the service threads and identify alternative approaches that can be used in the
event of a service thread loss.
8.1.4 Redundancy and Fault Tolerance Requirements
Required inherent availability of a service thread determines the need for redundancy and fault
tolerance in the hardware architecture. If the failure and repair rates of a single set of system
elements cannot meet the inherent availability requirements, redundancy and automatic fault
detection and recovery mechanisms must be added. There must be an adequate number of
hardware elements such that, given their failure and repair rates, the combinatorial probability
of running out of spares is consistent with the inherent availability requirements.
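The combinatorial check described above can be sketched as a k-out-of-n availability calculation, assuming independent, identical elements; the MTBF and MTTR values are illustrative, not requirements from this Handbook:

```python
# Sketch: availability of a k-out-of-n configuration, where the system is up
# as long as at least k of n identical, independent elements are up. Element
# availability uses A = MTBF / (MTBF + MTTR); the numbers are illustrative.
from math import comb

def element_availability(mtbf_hours, mttr_hours):
    return mtbf_hours / (mtbf_hours + mttr_hours)

def k_of_n_availability(k, n, a):
    """Probability that at least k of n independent elements are up."""
    return sum(comb(n, i) * a**i * (1 - a)**(n - i)
               for i in range(k, n + 1))

a = element_availability(mtbf_hours=5000, mttr_hours=4)
print(f"single element: {a:.6f}")
print(f"2-of-3 cluster: {k_of_n_availability(2, 3, a):.9f}")
```

Adding elements beyond the minimum needed for capacity (spares) raises the k-of-n availability toward the requirement, which is the design lever the paragraph describes.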
There are other reasons beyond the inherent availability of the hardware architecture that may
dictate a need for redundancy and/or fault tolerance. Even if the system hardware can meet the
inherent hardware availability, redundancy may be required to achieve the required recovery
times and provide the capability to recover from software failures.
All service threads with a STLSC of “Efficiency-Critical” have rapid recovery time
requirements because of the potentially severe consequences of lengthy service interruptions on
the efficiency of NAS operations. These recovery time requirements will, in all probability, call
for the use of redundancy and fault-tolerant techniques. The lengthy times associated with
rebooting a computer to recover from software failures or "hangs" indicate a need for a standby
computer that can rapidly take over from a failed computer.
In addition, a software reliability growth plan can improve software reliability as specified in
Appendix F. Based on the software control category, as well as the hazard level, guidance is
provided in specifying the level of oversight by the government and processes to be followed by
the contractor. The level of oversight and effort is based on software risk. For more information
on this approach, how to determine the software control category, hazard category and software
risk, as well as the amount of oversight and process required by the contractor, please refer to
Appendix F.
8.1.5 Preliminary Requirements Analysis Checklist
- Determine the category of the system being acquired from the Taxonomy Chart in Figure 7-3.
- For Information Systems, identify the service thread containing the system to be acquired
(Section 7.6). Determine availability requirements from the NAS-RD-20XX. Determine the
RMA requirements for that service thread from the NAS-RD-20XX, corresponding to Table
7-8 in this Handbook.
- For power systems, determine the availability requirements according to the highest STLSC
of the service threads being supported and the Facility Level, as specified in Section 7.7.3.1.
Select a standard power system configuration based on FAA Order 6950.2D that will meet
the availability requirements.
- For Enterprise Infrastructure Systems, determine the availability requirements according to
the highest STLSC of the service threads being supported by the EIS and the Facility
Levels, as discussed in Section 7.7.3.3. Select an appropriate Enterprise Infrastructure
architecture that will meet the service availability requirements, as presented in Table 7-11.
- For remote communications links, use the requirements in the Communications section of
the NAS-RD-20XX.
- Ensure the RMA requirements for other distributed subsystems such as radars, air-to-ground
communications, and display consoles are determined by technical feasibility and life cycle
cost considerations.
8.2 Procurement Package Preparation
The primary objectives to be achieved in preparing the procurement package are as follows:
- To provide the specifications that define the RMA and fault tolerance requirements for
the delivered system and form the basis of a binding contract between the successful
offeror and the Government.
- To define the effort required of the contractor to provide the documentation,
engineering, and testing required to monitor the design and development effort, and to
support risk management, design validation, and the testing of reliability growth
activities.
- To provide guidance to prospective offerors concerning the content of the RMA sections
of the technical proposal, including design descriptions and program management data
required to facilitate the technical evaluation of the offerors' fault-tolerant design
approach, risk management, software fault avoidance, and reliability growth programs.
8.2.1 System Specification Document (SSD)
The SSD serves as the contractual basis for defining the design characteristics and performance
that are expected of the system. From the standpoint of fault tolerance and RMA characteristics,
it is necessary to define the quantitative RMA and performance characteristics of the automatic
fault detection and recovery mechanisms. It is also necessary to define the operational
requirements needed to permit FAA facilities personnel to perform monitoring and control and
manual recovery operations as well as diagnostic and support activities.
While it is not appropriate to dictate specifics as to the system design, it is important to take
operational needs and system realities into account. These characteristics are driven by
operational considerations of the system and could affect its ability to participate in a redundant
relationship with another service thread. Examples include limited numbers of consoles and
limitations on particular consoles to accomplish particular system functions.
A typical specification prepared in accordance with FAA-STD-067 will contain the following
sections:
1. Scope
2. Applicable Documents
3. Definitions
4. General Requirements
5. Detailed Requirements
6. Notes
The information relevant to RMA is in Section 5 of FAA-STD-067, “Detailed Requirements.”
The organization within this section can vary, but generally, RMA requirements appear in three
general categories:
1. System Quality Factors
2. System Design Characteristics
3. System Operations
Automation systems also include a separate subsection on the functional requirements for the
computer software. Functional requirements may include RMA-related requirements for
monitoring and controlling system operations. Each of these sections will be presented
separately. This section and Appendix A contain sample checklists and/or sample requirements.
These forms of guidance are presented for use in constructing a tailored set of specification
requirements. The reader is cautioned not to use the requirements verbatim, but instead to use
them as a basis for creating a system-specific set of specification requirements.
8.2.1.1 System Quality Factors
System Quality Factors contain quantitative requirements specifying characteristics such as
reliability, maintainability, and availability, as well as performance requirements for data
throughput and response times.
Availability
The availability requirements to be included in the SSD are determined by the procedures
described in Section 8.1.
The availability requirements in the SSD are built upon inherent hardware availability. The
inherent availability represents the theoretical maximum availability that could be achieved by
the system if automatic recovery were one hundred percent effective and there were no failures
caused by latent software defects. This construct strictly represents the theoretical availability of
the system hardware based only on the reliability (MTBF) and maintainability (MTTR) of the
hardware components and the level of redundancy provided. It does not include the effects of
scheduled downtime, shortages of spares, or unavailable or poorly trained service personnel.
Imposing an inherent availability requirement only serves to ensure that the proposed hardware
configuration is potentially capable of meeting the NAS-Level requirement, based on the
reliability and maintainability characteristics of the system components and the redundancy
provided. Inherent availability is not a testable requirement. Verification of compliance with the
inherent availability requirement is substantiated by the use of straightforward combinatorial
availability models that are easily understood by both contractor and government personnel.
The contractor must, of course, supply supporting documentation that verifies the realism of the
component or subsystem MTBF and MTTR values used in the model.
The inherent availability of a single element is based on the following equation:

A = MTBF / (MTBF + MTTR)  (Equation 8-1)

The inherent availability of a string of elements, all of which must be up for the system to be up,
is given by:

A_T = A1 × A2 × A3 × ... × An  (Equation 8-2)

The inherent availability of a two-element redundant system (considered operational if both
elements are up, or if the first is up and the second is down, or if the first is down and the
second is up) is given by:

A_Inherent = A1·A2 + A1·(1 - A2) + (1 - A1)·A2  (Equation 8-3)

A_Inherent = 1 - (1 - A1)(1 - A2)  (Equation 8-4)

(Where Ā = (1 - A) is the probability that an element is not available, Equation 8-4 may also be
written A_Inherent = 1 - Ā1·Ā2.)

The above equations are straightforward, easily understood, and combinable to model more
complicated architectures. They illustrate that the overriding goal for the verification of
compliance with the inherent availability requirement should be to "keep it simple." Since this
requirement is not a significant factor in the achieved operational reliability and availability of
the delivered system, the effort devoted to it need not be more than a simple combinatorial
model as in Equation 8-4, or a comparison with the tabulated values in Appendix B. This is
simply a necessary first step in assessing the adequacy of the proposed hardware architecture.
Attempting to use more sophisticated models to "prove" compliance with operational
requirements is misleading, wastes resources, and diverts attention from addressing more
significant problems that can significantly impact the operational performance of the system.
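The inherent availability equations above can be expressed as short executable checks. The MTBF and MTTR values are illustrative only:

```python
# The inherent-availability equations (8-1 through 8-4) as executable
# checks. The MTBF/MTTR values are illustrative, not requirements.

def inherent(mtbf, mttr):                  # Equation 8-1
    return mtbf / (mtbf + mttr)

def series(*avails):                       # Equation 8-2
    product = 1.0
    for a in avails:
        product *= a
    return product

def redundant_pair(a1, a2):                # Equation 8-4
    return 1.0 - (1.0 - a1) * (1.0 - a2)

a1 = inherent(mtbf=2000, mttr=0.5)
a2 = inherent(mtbf=2000, mttr=0.5)

# Equation 8-3 (sum of the three operational states) equals Equation 8-4:
eq3 = a1 * a2 + a1 * (1 - a2) + (1 - a1) * a2
assert abs(eq3 - redundant_pair(a1, a2)) < 1e-12

print(f"series pair:    {series(a1, a2):.8f}")
print(f"redundant pair: {redundant_pair(a1, a2):.8f}")
```

This is exactly the "keep it simple" level of modeling the text calls for: a few lines of combinatorial arithmetic, not a sophisticated simulation.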
The inherent availability requirement provides a common framework for evaluating repairable
redundant system architectures. In a SSD, this requirement is intended to ensure that the
theoretical availability of the hardware architecture can meet key operational requirements.
Compliance with this requirement is verified by simple combinatorial models. The inherent
availability requirement is only a preliminary first step in a comprehensive plan that is described
in the subsequent sections to attempt to ensure the deployment of a system with operationally
suitable RMA characteristics.
The use of the inherent availability requirement is aimed primarily at service threads with a
STLSC level of “Efficiency-Critical.” (As discussed in Paragraph 7.3, any service threads
assessed as potentially “Safety-Critical” must be decomposed into two “Efficiency-Critical”
service threads.) Systems participating in threads with an “Efficiency-Critical” STLSC level
will likely employ redundancy and fault tolerance to achieve the required inherent availability
and recovery times. The combined availability of a two-element redundant configuration is
given by Equation 8-4. The use of inherent availability as a requirement for systems
participating in service threads with a STLSC level of "Essential" and not employing
redundancy can be verified with the basic single-element availability equation, Equation 8-1.
Reliability
Most of the hardware elements comprising modern automation systems are commercial off-the-
shelf products. Their reliability is a “given.” True COTS products are not going to be
redesigned for FAA acquisitions. Attempting to do so would significantly increase costs and
defeat the whole purpose of attempting to leverage commercial investment. There are, however,
some high-level constraints on the element reliability that are imposed by the inherent
availability requirements in the preceding paragraphs.
For hardware that is custom-developed for the FAA, it is inappropriate to attempt a top-level
allocation of NAS-Level RMA requirements. Acquisition specialists who are cognizant of life
cycle cost issues and the current state-of-the-art for these systems can best establish their
reliability requirements.
For redundant automation systems, the predominant sources of unscheduled interruptions are
latent software defects. For systems with extensive newly developed software, these defects are an
inescapable fact of life. For these systems, it is unrealistic to attempt to follow the standard
military reliability specification and acceptance testing procedures that were developed for
electronic equipment having comparatively low reliability. These procedures were developed
for equipment that had MTBFs on the order of a few hundred hours. After the hardware was
developed, a number of pre-production models would be locked in a room and left to operate
for a fixed period of time. At the end of the period, the Government would determine the
number of equipments still operating and accept or reject the design based on proven statistical
decision criteria.
Although it is theoretically possible to insert any arbitrarily high reliability requirement into a
specification, it should be recognized that the resulting contract provision would be
unenforceable. There are several reasons for this. There is a fundamental statistical limitation
for reliability acceptance tests that is imposed by the number of test hours needed to obtain a
statistically valid result. A general “rule of thumb” for formal reliability acceptance tests is that
the total number of test hours should be about ten times the required MTBF. As the total
number of hours available for reliability testing is reduced below this value, the range of
uncertainty about the value of true MTBF increases rapidly, as does the risk of making an
incorrect decision about whether or not to accept the system. (The quantitative statistical basis
for these statements is presented in more detail in Appendix C.)
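The statistical limitation can be illustrated under a standard exponential failure model: a failure-free test of length T demonstrates, at confidence c, only an MTBF lower bound of T / (-ln(1 - c)). This is a textbook result and only a sketch of the fuller treatment in Appendix C; the required MTBF value below is hypothetical:

```python
# Sketch of the statistical limitation on reliability acceptance testing.
# Under an exponential failure model, a zero-failure test of length T
# demonstrates, at confidence c, only MTBF >= T / (-ln(1 - c)). This is the
# standard exponential result; the required MTBF value is hypothetical.
from math import log

def demonstrated_mtbf_zero_failures(test_hours, confidence):
    """Lower confidence bound on MTBF after a failure-free test."""
    return test_hours / (-log(1.0 - confidence))

required_mtbf = 10_000  # hypothetical requirement, hours
for t in (required_mtbf, 10 * required_mtbf):
    lower_bound = demonstrated_mtbf_zero_failures(t, confidence=0.90)
    print(f"{t:>7} test hours -> demonstrated MTBF >= {lower_bound:,.0f} h")
```

Even a failure-free test as long as the required MTBF demonstrates less than half of it at 90 percent confidence; roughly ten times the required MTBF in test hours is needed before the bound clears the requirement with margin, consistent with the rule of thumb above.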
For “Efficiency-Critical” systems, the required test period for one system could last hundreds of
years. Alternatively, a hundred systems could be tested for one year. Neither alternative is
practical. The fact that most of the failures result from correctable software mistakes that should
not recur once they are corrected also makes a simple reliability acceptance test impractical.
Finally, given the nature of major system acquisitions and the large investment of resources
they represent, it is not realistic to terminate a program based on the result of a reliability
acceptance test; this, too, makes reliability compliance testing impractical.
In the real world, the only viable option is to keep testing the system and correcting problems
until the system becomes stable enough to send to the field – or the cost and schedule overruns
cause the program to be restructured or terminated. To facilitate this process a System-Level
driver, with repeatable complex ATC scenarios, is valuable. In addition, a data extraction and
data reduction and analysis (DR&A) process that assists in ferreting out and characterizing the
latent defects is also necessary.
It would be wrong to conclude there should be no reliability requirements in the SSD. Certainly,
the Government needs reliability requirements to obtain leverage over the contractor and ensure
that adequate resources are applied to expose and correct latent software defects until the system
reaches an acceptable level of operational reliability. Reliability growth requirements should be
established that define the minimum level of reliability to be achieved before the system is
deployed to the first site, and a final level of reliability that must be achieved by the final site.
The primary purpose of these requirements is to serve as a metric that indicates how aggressive
the contractor has been at fixing problems as they occur. The FAA customarily documents
problems observed during testing as Program Trouble Reports (PTRs).
Table 7-8 provides an example of the NAS-RD-2013 requirements for the MTBF, MTTR, and
recovery time for each of the service threads. For systems employing automatic fault detection
and recovery, the reliability requirements are coupled to the restoration time. For example, if a
system is designed to recover automatically within t seconds, there needs to be a limit on the
number of successful automatic recoveries, i.e. an MTBF requirement for interruptions that are
equal to, or less than, t seconds. A different MTBF requirement is established for restorations
that take longer than t seconds, to address failures for which automatic recovery is unsuccessful.
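The coupling between MTBF requirements and the recovery-time threshold t can be sketched by classifying logged interruptions by recovery duration; the outage log and operating hours below are hypothetical:

```python
# Sketch: split an interruption log into recoveries completed within the
# automatic-recovery budget t and those exceeding it, then compute a
# separate MTBF for each class. Event data are hypothetical.

def mtbf_by_recovery_class(operating_hours, outage_durations_s, t_seconds):
    """Return (MTBF_fast, MTBF_slow) for outages <= t and > t seconds."""
    fast = sum(1 for dur in outage_durations_s if dur <= t_seconds)
    slow = len(outage_durations_s) - fast
    mtbf_fast = operating_hours / fast if fast else float("inf")
    mtbf_slow = operating_hours / slow if slow else float("inf")
    return mtbf_fast, mtbf_slow

outage_durations_s = [2, 1800, 4, 6, 90]  # hypothetical one-year log
fast, slow = mtbf_by_recovery_class(8760, outage_durations_s, t_seconds=10)
print(f"MTBF (auto-recovery <= t): {fast:.0f} h")
print(f"MTBF (recovery > t):       {slow:.0f} h")
```

Each class would then be compared against its own requirement: one MTBF limit for successful automatic recoveries within t seconds, and a separate, typically more stringent, limit for failures where automatic recovery was unsuccessful.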
The establishment of the MTBF and recovery time requirements in Table 7-8 draws upon a
synthesis of operational needs, the measured performance of existing systems, and the practical
realities of the current state of the art for automatic recovery. The recovery time is measured
as the time from detection of the failure to full service recovery and is tested by evaluators, not
operators. The reliability requirements, when combined with a 30-minute MTTR using Equation
8-1, yield availabilities that meet or exceed the inherent availability requirements for the service
threads.
The allowable recovery times were developed to balance operational needs with practical
realities. While it is operationally desirable to make the automatic recovery time as short as
possible, reducing the recovery time allocation excessively can impose severe restrictions on the
design and stability of the fault tolerance mechanisms. It also can dramatically increase the
performance overhead generated by the steady state operation of error detecting “heartbeats”
and other status monitoring activities.
Although some automatic recoveries can be completed quickly, recoveries that require a
complete system “warm start,” or a total system reboot, can take much longer. These recovery
times are determined by factors such as the size of the applications and operating system, and
the speed of the processor and associated storage devices. There are only a limited number of
things that can be done to speed up recovery times that are driven by hardware speed and the
size of the program.
The reliability MTBF requirements can be further subdivided to segregate failures requiring
only a simple application restart or system reconfiguration from those that require a warm start
or a complete system reboot. The MTBF requirements are predicated on the assumption that
any system going to the field should be at least as good as the system it replaces. Target
requirements are set to equal the reliability of currently fielded systems, as presented in the
6040.20 NAPRS reports.
The MTBF values in the table represent the final steady-state values at the end of the reliability
growth program, when the system reaches operational readiness. However, it is both necessary
and desirable to begin deliveries to the field before this final value is reached. The positive
benefits of doing this are that testing many systems concurrently increases the overall number of
test hours, and field testing provides a more realistic test environment. Both of these factors
tend to increase the rate of exposure of latent software defects, accelerate the reliability growth
rate, and build confidence in the system’s reliability.
The NAS-RD-2013 reliability values in Table 7-8 refer to STLSC specifically associated with
the overall service threads, but because of the margins incorporated in the service thread
availability allocation, the reliability values (MTBFs) in Table 7-8 can be applied directly to any
system in the thread. When incorporating the NAS-RD-2013 reliability values into a SSD, these
should be the final values defined by some program milestone, such as delivery to the last
operational site, to signal the end of the reliability growth program. To implement a reliability
growth program, it is necessary to define a second set of MTBF requirements that represent the
criteria for beginning deliveries to operational sites. The values chosen should represent a
minimum level of system stability acceptable to field personnel. FAA field personnel need to be
involved both in establishing these requirements and in their testing at the William J. Hughes
Technical Center (WJHTC). Involvement of field personnel in the test process will help to
build their confidence, ensure their cooperation, and foster their acceptance of the system.
Appendix A provides examples of reliability specifications that have been used in previous
procurements. They may or may not be appropriate for any given acquisition. They are intended
to be helpful in specification preparation.
Maintainability
Maintainability requirements traditionally pertain to such inherent characteristics of the
hardware design as the ability to isolate, access, and replace a failed component. These
characteristics are generally fixed for COTS components. The inherent availability requirements
impose some constraints on maintainability because the inherent availability depends on the
hardware MTBF and MTTR and the number of redundant elements. In systems constructed
with COTS hardware, the MTTR is considered to be the time required to remove and replace a
failed element of the COTS hardware with a spare. Additional maintainability requirements may be
specified in this section provided they do not conflict with the goal to employ COTS hardware
whenever practical.
The FAA generally requires a Mean Time to Repair of 30 minutes. For systems using COTS
hardware, the MTTR refers to the time required to remove and replace the failed COTS
hardware element.
System Performance Requirements
System performance and response times are closely coupled to reliability issues. The
requirement to have rapid and consistent automatic fault detection and recovery times imposes
inflexible response time requirements on the internal messages used to monitor the system’s
health and initiate automatic recovery actions. If the allocated response times are exceeded,
false alarms may be generated and inconsistent and incomplete recovery actions will result.
At the same time, the steady state operation of the system monitoring and fault tolerance
heartbeats imposes a significant overhead on the system workload. The system must be
designed with sufficient reserve capacity to be able to accommodate temporary overloads in the
external workload or the large numbers of error messages that may result during failure and
recovery operations. The reserve capacity also must be large enough to accommodate the
seemingly inevitable software growth and overly optimistic performance predictions and model
assumptions.
Specification of the automatic recovery time requirements must follow a synthesis of
operational needs and the practical realities of the current performance of computer hardware.
There is a significant challenge in attempting to meet stringent air traffic control operational
requirements with imperfect software running on commercial computing platforms. The FAA
strategy has been to employ software fault tolerance mechanisms to mask hardware and
software failures.
A fundamental tradeoff must be made between operational needs and performance constraints
imposed by the hardware platform. From an operational viewpoint, the recovery time should be
as short as possible, but reducing the recovery time significantly increases the steady state
system load and imposes severe constraints on the internal fault tolerance response times
needed to ensure stable operation of the system. The FAA’s definition of recovery time is the
elapsed time between detection of failure to full restoration of service. This is a measurement
made during testing, not in operation.
Although it is the contractor’s responsibility to allocate recovery time requirements to lower
level system design parameters, attempting to design to unrealistic parameters can significantly
increase program risk. Ultimately, it is likely that the recovery time requirement will need to be
relaxed to an achievable value. It is preferable to avoid the unnecessary cost and schedule
expenses that result from attempting to meet an unrealistic requirement. While the Government
always should attempt to write realistic requirements, it also must monitor the development
effort closely to continually assess the contractor’s performance and the realism of the
requirement. A strategy for accomplishing this is presented in Paragraph 8.4.3.2.
Once the automatic recovery mechanisms are designed to operate within a specific recovery
time, management must recognize that there are some categories of service interruptions that
cannot be restored within the specified automatic recovery time. The most obvious class of this
type of failure is a hardware failure that occurs when a redundant element is unavailable. Other
examples are software failures that cause the system to hang, unsuccessful recovery attempts,
etc. When conventional recovery attempts fail, it may be necessary to reboot some computers in
the system, which may or may not require specialist intervention.
The recommended strategy for specifying reliability requirements that accommodate these
different categories of failures is to establish a separate set of requirements for each failure
category. Each set of requirements should specify the duration of the interruption and the
allowable MTBF for a particular type of interruption. For example:
- Interruptions that are recovered automatically within the required recovery time
- Interruptions that require reloading software
- Interruptions that require reboot of the hardware
- Interruptions that require hardware repair or replacement
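A requirement set structured this way can be represented directly in a small data model; the category durations and MTBF figures below are hypothetical placeholders for illustration, not NAS-RD-2013 values.

```python
from dataclasses import dataclass

@dataclass
class FailureCategory:
    name: str              # type of interruption
    max_duration_s: float  # allowable interruption duration
    required_mtbf_h: float # allowable MTBF for this category

# Hypothetical requirement set; real values come from NAS-RD-20XX
# and the program's SSD, not from this sketch.
categories = [
    FailureCategory("automatic recovery", 6.0, 2_000.0),
    FailureCategory("software reload", 300.0, 10_000.0),
    FailureCategory("hardware reboot", 900.0, 20_000.0),
    FailureCategory("hardware repair/replacement", 1_800.0, 50_000.0),
]

def categorize(duration_s: float) -> FailureCategory:
    """Assign an observed interruption to the first category whose
    allowable duration covers it; anything longer than the last
    threshold falls into the final category."""
    for cat in categories:
        if duration_s <= cat.max_duration_s:
            return cat
    return categories[-1]

print(categorize(4.0).name)    # automatic recovery
print(categorize(600.0).name)  # hardware reboot
```

Scoring each observed interruption against a category this way lets each category's MTBF requirement be tracked separately during reliability growth testing.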
8.2.1.2 System Design Characteristics
This section of the SSD contains requirements related to design characteristics of hardware and
software that can affect system reliability and maintainability. Many of these requirements will
be unique to the particular system being acquired.
8.2.1.3 System Operations
This section of the SSD contains RMA-related requirements for the following topics:
- Monitor and Control (M&C) - The Monitor and Control function is dual purpose. It contains functionality to automatically monitor and control system operation, and it contains functionality that allows a properly qualified specialist to interact with the system to perform monitor and control operations, system configuration, system diagnosis, and other RMA-related activities. Design characteristics include functional requirements and requirements for the Computer/Human Interface (CHI) with the system operator.
- System Analysis Recording (SAR) - The System Analysis and Recording function provides the ability to monitor system operation, record the monitored data, and play it back at a later time for analysis. SAR data is used for incident and accident analysis, performance monitoring, and problem diagnosis.
- Startup/Start over - Startup/Start over is one of the most critical system functions and has a significant impact on the ability of the system to meet its RMA requirements, especially for software-intensive systems.
- Software Deployment, Downloading, and Cutover - Software Loading and Cutover is a set of functions associated with the transfer, loading, and cutover of software to the system. Cutover could be to a new release or a prior release.
- Certification17 - Certification is an inherently human process of analyzing available data to determine if the system is worthy of performing its intended function. One element of data is often the results of a certification function that is designed to exercise end-to-end system functionality using known data and predictable results. Successful completion of the certification function is one element of data used by the Specialist to determine the system is worthy of certification. Some systems employ a background diagnostic or verification process to provide evidence of continued system certifiability.
- Transition - Transition is a set of requirements associated with providing functionality required to support the transition to upgraded or new systems.
- Maintenance Support - Maintenance support is a collection of requirements associated with performing preventative and corrective maintenance of equipment and software.
- Test Support - Test support is a collection of requirements associated with supporting system testing before, during, and after installation of the system. System-level drivers capable of simulating realistic and stressful operations in a test environment and a data extraction and analysis capability for recording and analyzing test data are both essential components in an aggressive reliability growth program. Requirements for additional test support tools that are not in System Analysis Recording should be included here.
- M&C Training - Training support is a collection of requirements associated with supporting training of system specialists.

17 This definition is from a procurement perspective. For the full definition of Certification of NAS systems see FAA Order 6000.30, Definitions Para 11.d.
8.2.1.4 Leasing Services
Leasing services does not relieve programs from considering the RMA requirements of the
NAS-RD-20XX. Services which are procured to provide capabilities specified in the RD must
be designed to meet the specified RMA requirements. In the case of leased services, this can be
approached through including specific service level requirements in the leasing agreement in the
form of an SLA. The FTI contract for communications services described in Section 8.1.1.4.3 is
a good example of incorporation of procurement of services meeting required FAA RMA levels
through SLAs. In addition to RMA levels, programs leasing services must also take into account
the diversity and avoidance requirements of Order 6000.36. If a vendor is unable to meet these
requirements, the program should examine the risks involved before proceeding with a leasing
strategy. This is a situation suitable for conducting a Service Risk Assessment (SRA).
8.2.1.5 System Specification Document RMA Checklist
- Include NAS-RD-20XX inherent availability requirements.
- Include NAS-RD-20XX MTBF, MTTR, and recovery time requirements.
- Develop initial MTBF criteria for shipment of the system to the first operational site.
- Consider the potential need for additional RMA quality factors for areas such as Operational Positions, Monitor & Control Positions, Data Recording, Operational Transition, etc.
- Review checklists of potential design characteristics.
- Review checklists of potential requirements for System Operations.
- Incorporate requirements for test tools such as System-Level Drivers and Data Extraction and Analysis to support a reliability growth program.
- Ensure the RMA requirements for other distributed subsystems such as radars, air-to-ground communications, and display consoles are not derived from NAS-Level NAS-RD-20XX requirements. These requirements must be determined by technical feasibility and life cycle cost considerations.
8.2.2 Statement of Work
The Statement of Work describes the RMA-related tasks required of the contractor to design,
analyze, monitor risk, implement fault avoidance programs, and prepare the documentation and
engineering support required to provide Government oversight of the RMA, Monitor and
Control function, fault tolerant design effort, support fault tolerance risk management and
conduct reliability growth testing. Typical activities to be called out include:
- Conduct Technical Interchange Meetings (TIMs)
- Prepare Documentation and Reports, e.g.,
  o RMA Program Plans
  o RMA Modeling and Prediction Reports
  o Failure Modes and Effects Analysis
- Perform Risk Reduction Activities
- Develop Reliability Models
- Conduct Performance Modeling Activities
- Develop a Monitor and Control Design
8.2.2.1 Technical Interchange Meetings
The following text is an example of an SOW requirement for technical interchange meetings:
The Contractor shall conduct and administratively support periodic TIMs when directed by the
Contracting Officer. TIMs may also be scheduled in Washington, DC, Atlantic City, NJ, or at
another location approved by the FAA. TIMs may be held individually or as part of scheduled
Program Management Reviews (PMRs). During the TIMs the Contractor and the FAA will
discuss specific technical activities, including studies, test plans, test results, design issues,
technical decisions, logistics, and implementation concerns to ensure continuing FAA visibility
into the technical progress of the contract.
This generic SOW language may be adequate to support fault tolerance TIMs, without
specifically identifying the fault tolerance requirements. The need for more specific language
should be discussed with the Contracting Officer.
8.2.2.2 Documentation
It is important for management to appropriately monitor and tailor documentation needs to the
severity and/or size of the system being acquired.
For the purposes of monitoring the progress of the fault-tolerant design, informal documentation
is used for internal communication between members of the contractor’s design team.
Acquisition managers should develop strategies for minimizing formal “boilerplate” CDRL
items and devise strategies for obtaining Government access to real-time documentation of the
evolving design.
Depending on factors including the severity and size of the system being acquired, documentation
required to support RMA and Fault Tolerance design monitoring may include formal
documentation such as RMA program plans, RMA modeling and prediction reports and other
standardized reports for which the FAA has standard Data Item Descriptions (DIDs). Table 8-1
depicts typical DIDs, their Title, Description and Application. Additional information,
including a more comprehensive listing of DIDs can be found at https://sowgen.faa.gov/.
Table 8-1 RMA Related Data Item Descriptions

DID Ref. No.: DI-NDTI-81585A
Title: Reliability Test Plan
Description: This plan describes the overall reliability test planning and its total integrated test requirements. The purpose of the Reliability Test Plan (RTP) is to document: (1) the RMA-related requirements to be tracked; (2) the models, prototypes, or other techniques the Contractor proposes to use; (3) how the models, prototypes, or other techniques will be validated and verified; (4) the plan for collecting RMA data during system development; and (5) the interactions among engineering groups and software developers that must occur to implement the RTP.
Applicability/Interrelationship: This document will be used by the procuring activity for review, approval, and subsequent surveillance and evaluation of the contractor's reliability test program. It delineates required reliability tests, their purpose, and their schedule. The Reliability Test Plan describes the strategy for predicting RMA values. The Reliability Test Reports document the actual results from applying the predictive models and identify performance risk mitigation strategies, where required.
Relevance to RMA: The Reliability Test Plan identifies and describes planned contractor activities for implementation of reliability testing and Environmental Stress Screening (ESS), if required by the contract. The plan lists all the tests to be conducted for the primary purpose of obtaining data for use in reliability analysis and evaluation of the contract item or constituent elements thereof. The Plan should identify the set of RMA requirements to be tested, and the report should demonstrate that the proposed set of RMA requirements is sufficient for effective risk management. This DID should be reviewed by program RMA personnel.

DID Ref. No.: DI-RELI-80687
Title: Failure Modes, Effects, and Criticality Analysis Report
Description: The report contains the results of the contractor's failure modes, effects, and criticality analysis (FMECA).
Applicability/Interrelationship: The FMECA Report shall be in accordance with MIL-STD-15438.
Relevance to RMA: The report is used to measure fault detection and failure tolerance and to identify single-point failure modes and their compensating features. This DID should be reviewed by program RMA personnel.

DID Ref. No.: DI-RELI-81496
Title: Reliability Block Diagrams and Mathematical Model Report
Description: This report documents data used to determine mission reliability and to support reliability allocations, predictions, assessments, design analyses, and trade-offs associated with end items and units of the hardware breakdown structure.
Applicability/Interrelationship: This DID is applicable during the Mission Analysis, Investment Analysis, and Solution Implementation phases.
Relevance to RMA: This report should include reliability block diagrams, mathematical models, and supplementary information suitable for allocation, prediction, assessment, and failure mode, effects, and criticality analysis tasks related to the end item and units of the hardware breakdown structure. This DID should be reviewed by program RMA personnel.

DID Ref. No.: DI-TMSS-81586A
Title: Reliability Test Reports
Description: These reports are formal records of the results of the contractor's reliability tests and will be used by the procuring activity to evaluate the degree to which the reliability requirements have been met, including: (1) currently estimated values of specified RMA measures; (2) Variance Analysis Reports and corresponding Risk Mitigation Plans; (3) uncertainties or deficiencies in RMA estimates; and (4) allocation of RMA requirements to hardware and software elements that results in an operational system capable of meeting all RMA requirements.
Applicability/Interrelationship: The Reliability Test Reports contain the results of each test or other action taken to demonstrate the level of reliability achieved in the contract end item and its constituent elements required by the contract. The Reliability Test Plan describes the strategy for predicting RMA values; the Reliability Test Reports document the actual results from applying the strategy.
Relevance to RMA: This report provides a means of feedback to ensure the reliability requirements in the contract are met. This DID should be reviewed by program RMA personnel.

DID Ref. No.: FAA-SE-005
Title: Reliability Prediction Report (RPR)
Description: The purpose of this report is to document analysis results and supporting assumptions that demonstrate that the Contractor's proposed system design will satisfy the RMA requirements in the system specifications. It provides the FAA with an indication of the predicted reliability of a system or subsystem it is acquiring.
Applicability/Interrelationship: This report contains all of the information necessary to calculate reliability predictions and is used to assess system compliance with the RMA requirements, identify areas of risk, support generation of Maintenance Plans, and support logistics planning and cost studies. The models and analyses documented in this report shall also support the reliability growth projections.
Relevance to RMA: The relevance of this report to RMA includes the use of reliability block diagrams, reliability mathematical models, reliability prediction, operational redundancy, and derived MTBF. The report should document the results of analysis of the proposed system's ability to satisfy both the reliability and maintainability design requirements of the specification. Review of this DID should be the responsibility of program RMA personnel.
These reports must be generated according to a delivery schedule that is part of the contract. The
timing and frequency of these reports should be negotiated to match the progress of the
development of the fault-tolerant design. The fact that these CDRL items are contract
deliverables, upon which contractual performance is measured, limits their usefulness.
8.2.2.3 Risk Reduction Activities
The SOW must include adequate levels of contractor support for measurement and tracking of
critical fault tolerance design parameters and risk reduction demonstrations. These activities are
further described in Section 8.4.3.
8.2.2.4 Reliability Modeling
Reliability modeling requirements imposed on the contractor should be limited to simple
combinatorial availability models that demonstrate compliance with the inherent availability
requirement. Complex models intended to predict the reliability of undeveloped software and the
effectiveness of fault tolerance mechanisms are highly sensitive to unsubstantiated assumptions,
tend to waste program resources, and generate a false sense of complacency.
8.2.2.5 Performance Modeling
In contrast to reliability modeling, performance modeling can be a valuable tool for monitoring
the progress of the design. The success of the design of the fault tolerance mechanisms is highly
dependent on the response times for internal health and error messages. The operation of the
fault tolerance mechanisms in turn can generate a significant processing and communications
overhead.
It is important that the Statement of Work include the requirement to continually maintain and
update workload predictions, software processing path lengths, and processor response time and
capacity predictions. Although performance experts generally assume lead on performance
modeling requirements, these requirements should be reviewed to ensure that they satisfy the
RMA/fault-tolerant needs.
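To see why heartbeat traffic belongs in the performance model, the sketch below estimates the steady-state processor fraction consumed by heartbeat handling; the period, per-message cost, and element count are illustrative assumptions, not measured values.

```python
def heartbeat_cpu_fraction(period_s: float, cost_ms: float, n_elements: int) -> float:
    """Fraction of one processor consumed by steady-state heartbeat
    handling: messages arriving per second times per-message cost."""
    messages_per_s = n_elements / period_s
    return messages_per_s * (cost_ms / 1000.0)

# 50 monitored elements, 1-second heartbeat period, 2 ms handling cost each:
overhead = heartbeat_cpu_fraction(1.0, 2.0, 50)
print(f"{overhead:.1%}")  # 10.0%
```

Halving the heartbeat period to shorten detection time doubles this overhead, which is the tradeoff between recovery time and steady-state load described above.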
8.2.2.6 Monitor and Control Design Requirement
The specification of the Monitor and Control requirements is a particularly difficult task, since
the overall system design is either unknown at the time the specification is being prepared, or, in
the case of a design competition, there are two or more different designs. In the case of
competing designs, the specification must not include detail that could be used to transfer design
data between offerors. The result is that the SSD requirements for the design of the M&C
position are likely to be too general to be very effective in giving the Government the necessary
leverage to ensure an effective user interface for the monitoring and control of the system.
The unavoidable ambiguity of the requirements is likely to lead to disagreements between the
contractor and the Government over the compliance of the M&C design unless the need to
jointly evolve the M&C design after contract award is anticipated and incorporated into the
SOW.
(An alternative way of dealing with this dilemma is presented in Section 8.2.3.2, requiring the
offerors to present a detailed design in their proposals and incorporate the winner’s design into
the contractual requirements.)
8.2.2.7 Fault Avoidance Strategies
The Government may want to mandate that the contractor employ procedures designed to
uncover fault tolerance design defects such as fault tree analysis or failure modes and effects
analysis. Caution should be used in generally mandating these techniques for software
developments, as they are more generally applied to weapons systems or nuclear power plants
where cause and effect are more obvious than in a decision support system. Recent advances in
software development techniques are, however, making these techniques more feasible for
application to critical systems, see for example Rebecca Menes and Herb Hecht, “Safety and
Certification of UAVs”, SAE Aerotech Symposium, 2007.
It is assumed that more general fault avoidance strategies such as those used to promote software
quality will be specified by software engineering specialists independent of the RMA/Fault
Tolerance requirements.
8.2.2.8 Reliability Growth
Planning for an aggressive reliability growth program is an essential part of the development and
testing of software-intensive systems used in critical applications. As discussed in Section 5, it is
no longer practical to attempt a legalistic approach to enforce contractual compliance with the
reliability requirements for high reliability automation systems. The test time required to obtain a
statistically valid sample on which to base an accept/reject decision would be prohibitive. The
inherent reliability of an automation system architecture represents potential maximum reliability
if the software is perfect. The achieved reliability of an automation system is limited by
undiscovered latent software defects causing system failures. The objective of the reliability
growth program is to expose and correct latent software defects so that the achieved reliability
approaches the inherent reliability.
The SSD contains separate MTBF values for the first site and the last site that can be used as
metrics representing two points on the reliability growth curve. These MTBF values are
calculated by dividing the test time by the number of failures. Because a failure review board
will determine which failures are considered relevant and also expunge failures that have been
fixed or that do not reoccur during a specified interval, there is a major subjective component in
this measure. The MTBF obtained in this manner should not be viewed as a statistically valid
estimate of the true system MTBF. If the contractor fixes the cause of each failure soon after it
occurs, the MTBF could be infinite because there are no open trouble reports – even if the
system is experiencing a failure every day. The MTBF calculated in this manner should be
viewed as a metric that measures a contractor's responsiveness in fixing problems in a timely
manner. The MTBF requirements are thus an important component in a successful reliability
growth program.
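The metric described above can be made concrete: test time divided by the count of failures the review board scored as relevant. The board dispositions in this sketch are hypothetical.

```python
def observed_mtbf(test_hours: float, failure_dispositions: list) -> float:
    """Test time divided by the count of failures the review board
    scored as relevant (True). Returns infinity when no relevant
    failures remain open, which is exactly why this figure is a
    responsiveness metric, not a statistical MTBF estimate."""
    relevant = sum(1 for d in failure_dispositions if d)
    return float("inf") if relevant == 0 else test_hours / relevant

# 1,000 test hours; the board keeps 2 of 5 reported failures as relevant:
print(observed_mtbf(1000.0, [True, False, True, False, False]))  # 500.0
```

The infinite result when every failure is expunged is the pathological case the text warns about: a contractor who fixes each failure promptly can show an arbitrarily high MTBF even while failures keep occurring.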
The SOW needs to specify the contractor effort required to implement the reliability growth
program. The SSD needs to include requirements for the additional test tools, simulators, data
recording capability, and data reduction and analysis capability that will be required to support
the reliability growth program. Software Reliability Planning is explained in detail in Appendix
F.
A wide range of tools can be used at the various engineering life-cycle phases for software
reliability. Authoritative sources on this subject include DOT/FAA/AR-06/35,
“Software Development Tools for Safety-Critical, Real-Time Systems Handbook” [25], and
DOT/FAA/AR-06/36, “Assessment of Software Development Tools for Safety-Critical,
Real-Time Systems” [21]. Specific GOTS tools include the AMSAA Reliability Growth technology
and the DO-278 standard for Software Integrity Assurance for CNS/ATM Systems [40] [10][16].
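For readers unfamiliar with the AMSAA (Crow) growth model mentioned above, the sketch below shows its basic relation, in which expected cumulative failures follow lam * t**beta; the parameter values are illustrative only, not from any FAA program.

```python
def amsaa_instantaneous_mtbf(t: float, lam: float, beta: float) -> float:
    """Instantaneous MTBF under the AMSAA/Crow model, in which the
    expected cumulative failure count is E[N(t)] = lam * t**beta.
    The failure intensity is its derivative; beta < 1 indicates
    positive reliability growth (failures arriving more slowly)."""
    failure_intensity = lam * beta * t ** (beta - 1.0)
    return 1.0 / failure_intensity

# Illustrative parameters: lam = 0.5, beta = 0.6. MTBF grows as
# cumulative test time accumulates:
for t in (100.0, 1_000.0, 10_000.0):
    print(f"{t:>8.0f} h -> MTBF {amsaa_instantaneous_mtbf(t, 0.5, 0.6):.1f} h")
```

Fitting lam and beta to observed failure times lets a program project when the growing MTBF will cross the first-site and last-site requirement values.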
8.2.2.9 Statement of Work Checklist
- Provide for RMA and Fault Tolerance TIMs.
- Define CDRL Items and DIDs to provide the documentation needed to monitor the development of the fault-tolerant design and the system's RMA characteristics.
- Provide for Risk Reduction Demonstrations of critical elements of the fault-tolerant design.
- Limit required contractor RMA modeling effort to basic one-time combinatorial models of inherent reliability/availability of the system architecture.
- Incorporate requirements for continuing performance modeling to track the processing overhead and response times associated with the operation of the fault tolerance mechanisms, M&C position, and data recording capability.
- Provide for contractor effort to evolve the M&C design in response to FAA design reviews.
- Provide for contractor effort to use analytical tools to discover design defects during the development.
- Provide for contractor support for an aggressive reliability growth program.
8.2.3 Information for Proposal Preparation
The IFPP describes material that the Government expects to be included in the offeror’s
proposal. The following information should be provided to assist in the technical evaluation of
the fault tolerance and RMA sections of the proposal.
8.2.3.1 Inherent Availability Model
A simple inherent availability model should be included to demonstrate that the proposed
architecture is compliant with the NAS-Level availability requirement. The model’s input
parameters include the element MTBF and MTTR values and the amount of redundancy
provided. The offeror should substantiate the MTBF and MTTR values used as model inputs,
preferably with field data for COTS products, or with reliability and maintainability predictions
for the individual hardware elements.
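A minimal combinatorial model of the kind described might look like the following; the architecture and the MTBF/MTTR figures are hypothetical stand-ins for the offeror-substantiated values the text calls for.

```python
def element_availability(mtbf_h: float, mttr_h: float) -> float:
    """Inherent availability of a single element."""
    return mtbf_h / (mtbf_h + mttr_h)

def parallel(a: float, n: int) -> float:
    """Availability of n redundant elements where any one suffices:
    unavailable only if all n are down simultaneously."""
    return 1.0 - (1.0 - a) ** n

def series(*avails: float) -> float:
    """Availability of elements that must all be up."""
    result = 1.0
    for a in avails:
        result *= a
    return result

# Hypothetical architecture: dual-redundant processor (MTBF 10,000 h,
# MTTR 0.5 h) in series with a single LAN switch (MTBF 50,000 h, MTTR 0.5 h).
proc_pair = parallel(element_availability(10_000.0, 0.5), 2)
lan = element_availability(50_000.0, 0.5)
print(f"{series(proc_pair, lan):.8f}")
```

Even this simple model makes clear that the single non-redundant element dominates the result, which is the kind of insight the Government evaluation should look for in the offeror's substantiation.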
8.2.3.2 Proposed M&C Design Description and Specifications
As discussed in Section 8.2.2.6, it will be difficult or impossible for the Government to
incorporate an unambiguous specification for the M&C position into the SSD. This is likely to
lead to disagreements between the contractor and the Government concerning what is considered
to be compliant with the requirements.
There are two potential ways of dealing with this. One is to request that offerors propose an
M&C design that is specifically tailored to the needs of their proposed system. The M&C
designs would be evaluated as part of the proposal technical evaluation. The winning
contractor’s proposed M&C design would then be incorporated into the contract and made
contractually binding.
Traditionally, the FAA has not used this approach, although it is commonly used in the
Department of Defense. The approach satisfies two important objectives. It facilitates the
specification of design-dependent aspects of the system and it encourages contractor innovation.
The other is to attempt to defer specification of the M&C function until after contract award,
have the contractor propose an M&C design, review the approach and negotiate a change to the
contract to incorporate the approved approach.
The selection of either approach should be explored with the FAA Contracting Officer.
8.2.3.3 Fault Tolerant Design Description
The offeror’s proposal should include a complete description of the proposed design approach
for redundancy management and automatic fault detection and recovery. The design should be
described qualitatively. In addition, the offeror should provide quantitative substantiation that the
proposed design can comply with the recovery time requirements.
The offeror should also describe the strategy and process for incorporating fault tolerance
mechanisms in the application software to handle unwanted, unanticipated, or erroneous inputs
and responses.
8.3 Proposal Evaluation
The following topics represent the key factors in evaluating each offeror’s approach to
developing a system that will meet the operational needs for reliability and availability.
8.3.1 Reliability, Maintainability and Availability Modeling and Assessment
The evaluation of the offeror’s inherent availability model is simple and straightforward. All that
is required is to confirm that the model accurately represents the architecture and that the
mathematical formulas are correct. The substantiation of the offeror’s MTBF and MTTR values
used as inputs to the model should also be reviewed and evaluated. Appendix B provides tables
and charts that can be used to check each offeror’s RMA model.
8.3.2 Fault-Tolerant Design Evaluation
The offeror’s proposed design for automatic fault detection and recovery/redundancy
management should be evaluated for its completeness and consistency. A critical factor in the
evaluation is the substantiation of the design’s compliance with the recovery time requirements.
There are two key aspects of the fault-tolerant design. The first is the design of the infrastructure
component that contains the protocols for health monitoring, fault detection, error recovery, and
redundancy management.
Equally important is the offeror’s strategy for incorporating fault tolerance into the application
software. Unless fault tolerance is embedded into the application software, the ability of the
fault-tolerant infrastructure to effectively mask software faults will be severely limited. The
ability to handle unwanted, unanticipated, or erroneous inputs and responses must be
incorporated during the development of the application software.
8.3.3 Performance Modeling and Assessment
An offeror should present a complete model of the predicted system loads, capacity, and
response times. Government experts in performance modeling should evaluate these models.
Fault tolerance evaluators should review the models in the following areas:
Latency of fault tolerance protocols: The ability to respond within the allocated response
time is critical to the success of the fault tolerance design. It should be noted that, at the
proposal stage, the level of the design may not be adequate to address this issue.
System Monitoring Overhead and Response Times: The offeror should provide
predictions of the additional processor loading generated to support both the system
monitoring performed by the M&C function as well as by the fault tolerance heartbeat
protocols and error reporting functions. Both steady-state loads and peak loads generated
during fault conditions should be considered.
Relation to Overall System Capacity and Response Times: The system should be sized
with sufficient reserve capacity to accommodate peaks in the external workload without
causing slowdowns in the processing of fault tolerance protocols. Adequate memory
should be provided to avoid paging delays that are not included in the model predictions.
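As a rough illustration of the steady-state monitoring overhead discussed above (the node count, heartbeat period, and per-message processing cost are hypothetical, not recommended values):

```python
def heartbeat_overhead(nodes, period_s, proc_time_s):
    """Fraction of one CPU consumed processing heartbeats on the monitor.

    Each monitored node sends one heartbeat per period; each message
    costs proc_time_s of CPU time on the monitoring processor.
    """
    return (nodes * proc_time_s) / period_s

# 40 monitored nodes, a 1 s heartbeat period, 0.5 ms per message:
# the steady-state load on the monitor is 2% of one CPU.
steady = heartbeat_overhead(40, 1.0, 0.0005)
```

Peak loads during fault conditions (error bursts, recovery traffic) would sit on top of this steady-state figure, which is why the reserve-capacity point above matters.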
8.4 Contractor Design Monitoring
The following topics represent the key design monitoring activities that help ensure a system
meets the operational needs for reliability and availability.
8.4.1 Formal Design Reviews
Formal design reviews are a contractual requirement. Although these reviews are often too large
and formal to include a meaningful dialog with the contractor, they do present an opportunity to
escalate technical issues to management’s attention.
8.4.2 Technical Interchange Meetings
The contractor’s design progress should be reviewed in monthly TIMs. In addition to describing
the design, the TIM should address the key timing parameters governing the operation of the
fault tolerance protocols, the values allocated to the parameters, and the results of model
predictions and/or measurements made to substantiate the allocations.
8.4.3 Risk Management
The objective of the fault tolerance risk management activities is to expose flaws in the design as
early as possible, so that they can be corrected “off the critical path” without affecting the overall
program cost and schedule. Typically, major acquisition programs place major emphasis on
formal design reviews such as the specification reviews, the system design reviews, preliminary
and critical design reviews. After the CDR has been successfully completed, lists of Computer
Program Configuration Items (CPCIs) are released for coding, beginning the implementation
phase of the contract. After CDR, there are no additional formal technical software reviews until
the end of implementation phase when the Functional and Physical Configuration Audits (FCA
and PCA) and formal acceptance tests are conducted.
Separate fault tolerance risk management activities should be established for:
Fault tolerant infrastructure
Error handling in software applications
Performance monitoring
The fault-tolerant infrastructure will generally be developed by individuals whose primary
objective is to deliver a working infrastructure. Risk management activities associated with the
infrastructure development are directed toward uncovering logic flaws and timing/performance
problems.
In contrast to hardware designers and the overall system architect, application developers are not
primarily concerned with fault tolerance. Their main challenge is to develop the functionality
required of the application. Under schedule pressure to demonstrate the required functionality,
developers often overlook or indefinitely postpone building in the fault tolerance capabilities
that need to be embedded in the application software. Once development has been largely
completed, it can be extremely difficult to incorporate fault tolerance into the applications
after the fact. Risk management for software application fault
tolerance consists of establishing standards for applications developers and ensuring that the
standards are followed.
Risk management of performance is typically focused on the operational functionality of the
system. Special emphasis needs to be placed on the performance monitoring risk management
activity to make sure that failure, failure recovery operations, system initialization/re-
initialization, and switchover characteristics are properly modeled.
8.4.3.1 Fault Tolerance Infrastructure Risk Management
The development of a fault-tolerant infrastructure primarily entails constructing mechanisms that
monitor the health of the system hardware and software as well as provide the logic to switch,
when necessary, to redundant elements.
The primary design driver for the fault tolerance infrastructure is the required recovery time.
Timing parameters must be established to achieve a bounded recovery time, and the system
performance must accommodate the overhead associated with the fault tolerance monitoring and
deliver responses within established time boundaries. The timing budgets and parameters for the
fault-tolerant design are derived from this requirement. The fault-tolerant timing parameters, in
turn, determine the steady state processing overhead imposed by the fault tolerance
infrastructure.
The risks associated with the fault tolerance infrastructure can be generally categorized
as follows:
System Performance Risk
System Resource Usage
System Failure Coverage
If the system is to achieve a bounded recovery time, it is necessary to employ synchronous
protocols. The use of these protocols, in turn, imposes strict performance requirements on such
things as clock synchronization accuracy, end-to-end communications delays for critical fault
tolerance messages, and event processing times.
The first priority in managing the fault tolerance infrastructure risks is to define the timing
parameters and budgets required to meet the recovery time specification. Once this has been
accomplished, performance modeling techniques can be used to make initial predictions and
measurements of the performance of the developed code can be compared with the predictions to
identify potential problem areas.
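The budget-definition step can be sketched as follows (the heartbeat period, miss threshold, switchover time, and margin below are hypothetical allocations, not recommended values):

```python
def detection_time(period_s, missed_allowed):
    """Worst-case fault detection latency for a heartbeat protocol:
    the fault occurs just after a heartbeat is sent, and failure is not
    declared until missed_allowed consecutive heartbeats have been missed."""
    return period_s * (missed_allowed + 1)

def budget_ok(recovery_req_s, period_s, missed_allowed, switchover_s, margin_s=0.0):
    """Check that detection + switchover + margin fits the recovery requirement."""
    return detection_time(period_s, missed_allowed) + switchover_s + margin_s <= recovery_req_s

# Hypothetical 6 s recovery requirement: 1 s heartbeats, failure declared
# after 2 misses (3 s worst-case detection), 2 s switchover, 0.5 s margin;
# 3 + 2 + 0.5 = 5.5 s fits within the 6 s budget.
assert budget_ok(6.0, 1.0, 2, 2.0, 0.5)
```

Once allocations like these are fixed, measured code performance can be compared against each term of the budget to localize problems.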
The risk management program should address such factors as the overall load imposed on the
system by the fault tolerance infrastructure and the prediction and measurement of clock
synchronization accuracy, end-to-end communication delays, and event processing times.
Although it is virtually impossible to predict the system failure coverage in advance, or verify it
after-the-fact with enough accuracy to be useful, a series of risk reduction demonstrations using
Government generated scenarios that attempt to “break” the fault-tolerant mechanisms has
proven to be effective in exposing latent design defects in the infrastructure software. Using this
approach, the defects can often be corrected before deployment.
8.4.3.1.1 Application Fault Tolerance Risk Management
Monitoring the embedded fault tolerance capabilities in application software is particularly
challenging because functionality, not fault tolerance, is the primary focus of the application
software developers. Risk management in this area consists of:
Establishing fault tolerance design guidelines for application developers, and
Monitoring the compliance of the application software with the design guidelines.
The overall fault tolerance infrastructure is primarily concerned with redundancy management –
that is, with monitoring the “health” of hardware and software modules and performing whatever
reconfigurations and switchovers are needed to mask failures of these modules. In essence, the
fault tolerance infrastructure software deals with the interaction of “black boxes.”
In contrast with this basic infrastructure, application fault tolerance is intimately connected with
details of the functions that the application performs and with how it interfaces with other
applications. Consider a possible scenario: one application module asks another to amend a
flight plan, but the receiving application has no record of that flight plan. Among the possible
responses, the receiving application could simply reject the amendment, it could request that the
entire flight plan be resubmitted, or it could send an error message to the controller who (it
assumes) submitted the request.
What should not be allowed to happen in the above scenario would be for the error condition to
propagate up to the interface between the application module and the fault tolerance
infrastructure. At that level, the only way to handle the problem would be to switch to a standby
application module – and that module would just encounter the same problem. Simply stated, the
fault tolerance infrastructure is not equipped to handle application-specific error conditions. This
high-level capability should only handle catastrophic software failures such as a module crash or
other non-recoverable error.
In the development of fault-tolerant application software it is important to establish definitive
fault tolerance programming standards for the application software developers. These standards
should specify different classes of faults and the manner in which they should be handled.
Programmers should be required to handle errors at the lowest possible level and prohibited from
simply propagating the error out of their immediate domain.
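A minimal sketch of the flight-plan scenario above, showing an error handled at the application level rather than propagated to the fault tolerance infrastructure (the function, data structure, and response fields are all hypothetical illustrations):

```python
def amend_flight_plan(plans, flight_id, amendment):
    """Apply an amendment to a stored flight plan.

    A missing plan is handled here, at the lowest possible level, with an
    application-specific response; it is never allowed to escape to the
    fault tolerance infrastructure, which could only switch to a standby
    module that would encounter the same problem.
    """
    if flight_id not in plans:
        # Hypothetical application-level response: reject the amendment
        # and ask the sender to resubmit the entire flight plan.
        return {"status": "rejected", "reason": "unknown flight plan",
                "action": "resubmit full flight plan"}
    plans[flight_id].update(amendment)
    return {"status": "accepted"}

plans = {"AAL123": {"route": "J75 DCA"}}
assert amend_flight_plan(plans, "UAL9", {"route": "J121 BWI"})["status"] == "rejected"
```

Only non-recoverable conditions, such as a module crash, should reach the infrastructure level.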
Since an application programmer’s primary focus is on delivering the required functionality for
their application, it will be a continuing battle to monitor their compliance with the fault
tolerance programming standards. Automated tools are available that can search the source code for
exception handling and identify questionable exception handling practices. Failure Modes and
Effects Analysis (FMEA) techniques can be used to review the error handling associated with
transactions between software application modules. Traditional FMEA and Failure Mode,
Effects and Criticality Analysis (FMECA) techniques or System Safety practices defined in
MIL-STD-882D are oriented toward military weapons systems and are focused on failures
that directly cause injury or loss of life.
What is needed for application fault tolerance is a systematic approach to identify potential
erroneous responses in the communications between software applications and verification that
appropriate responses to the error conditions are incorporated into the software.
The important point to recognize is that the fault tolerance infrastructure alone cannot ensure a
successful fault-tolerant system. Without “grassroots” fault tolerance embedded throughout the
application software, the redundancy management fault tolerance infrastructure will be
ineffective in ensuring a high reliability system.
Fault tolerance must be embedded in the applications from the ground up, as the software is
developed. It can be extremely difficult to attempt to incorporate it after the fact.
The job of the application fault tolerance risk management activity is to ensure that the
programmers have fault tolerance programming standards at the start of software development
and to continuously track their adherence to the standards throughout the implementation phase.
8.4.3.2 Performance Monitoring Risk Management
As noted in 8.2.1.1, system performance and response times are closely coupled to reliability
issues. The requirement to have rapid, consistent automatic fault detection and recovery times
imposes rigid and inflexible response time requirements on the internal messages used to
monitor the system’s health and initiate automatic recovery actions. If the allocated response
times are exceeded, false alarms may be generated and inconsistent and incomplete recovery
actions will result.
Although it is the contractor’s responsibility to allocate recovery time requirements to lower
level system design parameters, attempting to design to unrealistic parameters can significantly
increase program risk. Ultimately, it is likely that the recovery time requirement will need to be
relaxed to an achievable value. It is preferable, however, to avoid the unnecessary cost and
schedule expenses that result from attempting to meet an unrealistic requirement. The
Government should attempt to write realistic requirements. It is also necessary to watch the
development closely through a contractor-developed, but Government-monitored, risk
management effort. Establishing performance parameters tailored to the performance dependent
RMA characteristics and formally monitoring those parameters through periodic risk
management activities is an effective means of mitigating the associated risks.
8.4.3.3 Software Reliability Growth Plan Monitoring
In cases where effort is specified for the contractor as part of Software Reliability Growth
Planning, as described in Appendix F, Government monitoring may be required to verify
that the contractor is meeting the requirements of the plan and that the plan is being
implemented and followed. Required activities may include audits, contractor performance
reporting, monitoring of software assurance activities, and review of deliverables. The level
of effort required of the contractor and the Government depends upon the effort identified in
Appendix G.2.
8.5 Design Validation and Acceptance Testing
As discussed previously, it is not possible to verify compliance with stringent reliability
requirements within practical cost and schedule constraints. There is, however, much that can be
done to build confidence in the design and operation of the fault tolerance mechanisms and in the
overall stability of the system and its readiness for deployment.
8.5.1 Fault Tolerance Diagnostic Testing
Despite an aggressive risk management program, many performance and stability problems do
not materialize until large scale testing begins. The SAR and the DR&A capabilities provide an
opportunity to leverage the data recorded during system testing to observe the operation of the
fault tolerance protocols and diagnose problems and abnormalities experienced during their
operation.
For system testing to be effective, the SAR and DR&A capabilities should be available when
testing begins. Without these capabilities it is difficult to diagnose and correct internal software
problems.
8.5.2 Functional Testing
Much of the test time at the WJHTC is devoted to verifying compliance with each of the
functional requirements. This testing should also include verification of compliance with the
functional requirements for the system operations functions, including:
Monitor and Control (M&C)
System Analysis and Recording (SAR)
Data Reduction and Analysis (DR&A)
8.5.3 Reliability Growth Testing
As discussed in Appendix C, a formal reliability demonstration test in which the system is either
accepted or rejected based on the test results is not feasible. The test time required to obtain a
statistically valid sample is prohibitive, and the large number of software failures encountered in
any major software development program would virtually ensure failure to demonstrate
compliance with the requirements. Establishing perfect “pass-fail” criteria for a major system
acquisition is not a viable alternative; rather, the program manager should establish an
acceptable failure rate as the acceptance criterion. Attaining and maintaining this level is illustrated in
Figure F-5.
Reliability growth testing is an ongoing process of testing and correcting failures. Reliability
growth was initially developed to discover and correct hardware design defects. Statistical
methods were developed to predict the system MTBF at any point in time and to estimate the
additional test time required to achieve a given MTBF goal.
Reliability growth testing applied to automation systems is a process of exposing and correcting
latent software defects. The hundreds of software defects exposed during system testing, coupled
with the stringent reliability requirements for these systems, preclude the use of statistical
methods to accurately predict the test time to reach a given MTBF prior to system deployment.
There is no statistically valid way to verify compliance with reliability requirements at the
WJHTC prior to field deployment. There is a simple reason for this: it is not possible to obtain
enough operating hours at the WJHTC to reduce the number of latent defects to the level needed
to meet the reliability requirements.
The inescapable conclusion is that it will be necessary to field systems that fall short of meeting
the reliability requirements. The large number of additional operating hours accumulated by
multiple system installations will increase the rate at which software errors are found and
corrected, and thus the growth of the system MTBF.
To be successful, the reliability growth program must address two issues. First, the contractor
must be aggressive in promptly correcting software defects. The contractor must be given a
powerful incentive to keep the best people on the job through its completion, instead of moving
them to work on new opportunities. This can be accomplished by a process called “expunging.”
Under this process, the system MTBF is computed by dividing the operating hours by the number
of failures. However, if the contractor can demonstrate that the cause of a failure has been
corrected, the failure is “expunged” from the list of failures. A failure that cannot be
repeated within 30 days is also expunged from the database.
Thus, if all PTRs are fixed immediately, the computed MTBF would be infinite even if the
system were failing on a daily basis. This measure is statistically meaningless as a true indicator of
the system MTBF. It is, however, a useful metric for assessing the responsiveness of the
contractor in fixing the backlog of accumulated PTRs. Since the Government representatives
decide when to expunge errors from the database, they have considerable leverage over the
contractor by controlling the value of the MTBF reported to senior program management
officials. There may be other or better metrics that could be used to measure the contractor’s
responsiveness in fixing PTRs. The important thing is that there must be a process in place to
measure the success of the contractor’s program to support reliability growth.
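The expunging computation can be sketched as follows (a simplified illustration; the failure identifiers and hours are hypothetical):

```python
def expunged_mtbf(operating_hours, failures, expunged_ids):
    """Computed MTBF after removing failures whose causes the contractor
    has demonstrably corrected ("expunged"). Returns infinity when every
    failure has been expunged, illustrating why this metric measures
    contractor responsiveness rather than true system reliability."""
    remaining = [f for f in failures if f not in expunged_ids]
    if not remaining:
        return float("inf")
    return operating_hours / len(remaining)

# 1,000 operating hours, 4 recorded failures, 3 corrected and expunged:
# only one failure remains in the computation.
mtbf = expunged_mtbf(1000.0, ["F1", "F2", "F3", "F4"], {"F1", "F2", "F3"})
# → 1000.0
```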
The second issue that must be addressed during the reliability growth program is the
acceptability of the system to field personnel. In all probability, the system will be deployed to
field sites before it has met the reliability requirements. Government field personnel should be
involved in the reliability growth testing at the WJHTC and concur in the decision concerning
when the system is sufficiently stable to warrant sending it to the field.
As discussed in Appendix C, it is not possible to verify compliance with stringent reliability
requirements within practical cost and schedule constraints. There is, however, much that can be
done to build confidence in the design and operation of the fault tolerance mechanisms and in the
overall stability of the system and its readiness for deployment. The way to accomplish this is to
provide the test tools, personnel, and test time to pursue an aggressive reliability growth
program.
SERVICE THREAD MANAGEMENT
The NAS-RD RMA requirements have been designed so that they should be largely independent
of changes in the NAS Architecture or the NAS-RD-20XX functional requirements. The concept
of associating RMA requirements with severity levels and severity levels with NAS functions
remains constant even as the NAS-RD evolves.
Table 7-2 defines the mapping between the approved set of NAPRS services and the set of
service threads. The complete set of service threads consists of most, but not all, of the NAPRS
services, NAPRS facilities that have been converted to service threads, and newly created service
threads that are not defined in NAPRS services.
When existing service threads are substantially changed or new service threads are added, SMEs
should be consulted to ensure that service thread severity is appropriately categorized and that
other impacts to the NAS are captured.
9.1 Revising Service Thread Requirements
One of the advantages of the service thread-based approach is that the service threads can
remain almost constant as the NAS Architecture evolves. Many, if not most, of the changes to
the NAS Architecture involve replacement of a facility representing a block in the reliability
block diagram for the thread. Thus, the basic thread does not need to change, only the name of a
block in the thread. As the NAS evolves, the service thread diagrams should evolve with it. It is
the responsibility of the NAS Requirements Management Group to maintain these service thread
diagrams. Because the service threads need to address requirements and specification issues long
before prototyping and deployment, systems approved for acquisition or major prototype systems
may need to be included.
All changes from the set of service threads set forth for the approved services in NAPRS should
be coordinated with ATO Technical Operations.
9.2 Adding a New Service Thread
Service threads may need to be added in the future to accommodate new NAS capabilities. With
the structure provided by this Handbook, new service threads will not necessarily have to rely on
the NAS-Level RMA severity definitions. Rather, it will be possible to move straight to defining
the STLSC for the new service thread and its place in the NAS. Then, all RMA-related
requirements should follow more easily.
A major objective of this Handbook is to couple the RMA requirements to real-world NAS
services. This linkage is achieved within the STLSC matrices, which associate RMA
requirements with service threads derived from real-world NAS services. Since there will
be a need to create services in addition to those defined in NAPRS, it will be important to
continue to distinguish between the official operational FAA services and services that have been
created to support requirements development and acquisition planning.
All deviations from the set of service threads set forth for the approved services in NAPRS
should be coordinated with ATO Technical Operations. With this coordination, as new services
are deployed, there can be an orderly transition between the hypothetical service threads and the
NAPRS services.
9.3 FSEP and NAPRS
FAA Order 6000.5 defines the Facility, Service, and Equipment Profile (FSEP). The NAS is a complex
collection of facilities, systems, procedures, aircraft, and people. The physical components of the
NAS, excluding people, which provide for safe separation and control over aircraft are referred
to as the NAS Infrastructure. The FSEP provides an accurate inventory of the NAS Operational
Infrastructure and the NAPRS pseudo-services.
Current information on the FSEP and appropriate Technical Operations points of contact can be
found on the FSEP website at: FSEP website link
FAA Order 6040.15, National Airspace Performance Reporting System (NAPRS) serves as a
timely and accurate performance information reporting system for use in determining and
evaluating the operating condition of NAS facilities/services.
Current information on reportable NAPRS facilities and services and appropriate Technical
Operations points of contact can be found on the NAPRS website at: NAPRS website link
Note: Not all facilities and services in FSEP are NAPRS reportable.
RMA REQUIREMENTS ASSESSMENT
This handbook allocates RMA requirements to service threads that are based on the NAPRS
services. The service thread approach applies the NAS-Level requirements to real-world services
and facilities that are precisely defined and well-understood in the engineering as well as
operational communities in the FAA.
Several benefits accrue from using this approach, including the ability to close the loop between
the measured RMA characteristics of operational services and systems and the NAS-Level
requirements for these systems.
10.1 RMA Feedback Paths
Testing and operational RMA data should be fed back into the mission, investment analysis, and
system design phases. This will allow FAA personnel to validate that NAS level RMA
requirements are appropriate and achievable. Overall constraints on availability due either to
technical factors or to cost impacts revealed through testing or evaluation of operational
performance may require adjustment of the NAS EA availability category for a system. Such an
adjustment may require a change in the original operations concept to account for reduced
availability for a specific function. Alternatively, it may be accepted that a system can only
provide a service that requires a lower availability level. Clearly, failure of a system to meet its
original designed availability such that a change to a lower availability level is required may
have a serious impact on a program. This feedback path is shown in Figure 10-1 in black,
originating from “Operational Evaluation” and flowing back to the original “Operations
Concept”.
Figure 10-1 RMA Process Diagram
If operational testing reveals failures to meet either MTBF or MTTR requirements, it may still be
possible to provide the required availability by adjusting one of the two. This case is shown in
the red line flowing from “Operational Evaluation” to “System Specification”. An increase in
MTBF may require redesign using higher quality components or a higher degree of redundancy.
Since operational evaluation tests both functionality and maintainability, it can reveal
deficiencies in MTTR as well as in MTBF. Reduction in MTTR will normally require
simplification of maintenance, provision of more spares, planning for more preventive
maintenance, or hiring more staff. In either case, a cost impact is likely, but it is usually of lesser
magnitude than changes required after fielding a system.
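The MTBF/MTTR adjustment described above follows directly from A = MTBF / (MTBF + MTTR); a short sketch with illustrative values only:

```python
def required_mttr(availability, mtbf_hours):
    """MTTR needed to hold a given inherent availability at a fixed MTBF,
    from A = MTBF / (MTBF + MTTR) solved for MTTR."""
    return mtbf_hours * (1.0 - availability) / availability

def required_mtbf(availability, mttr_hours):
    """MTBF needed to hold a given inherent availability at a fixed MTTR."""
    return mttr_hours * availability / (1.0 - availability)

# Hypothetical case: holding A = 0.999 when the fielded MTBF is only
# 2,000 h requires driving MTTR down to about 2 h; conversely, with a
# fixed 2 h MTTR, an MTBF of roughly 2,000 h is needed.
mttr = required_mttr(0.999, 2000.0)
```

Either lever carries the cost implications noted above, which is why such trades are run before, not after, fielding.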
WJHTC staff routinely attempt to verify compliance with the SSD, as illustrated by the red
quadrant of Figure 10-1. With this feedback loop, however, it often proves too costly or time
consuming to verify compliance of high availability systems with RMA requirements to any
level of statistical significance. About the best that can be done is to demonstrate that the system
has achieved suitable stability for field deployment and continue to collect reliability information
in the field. With many systems in the field, the rate of exposing (and correcting) latent software
defects increases and the software reliability growth rate increases.
After formal testing, in-service acceptance, and once systems have been deployed for operational
use, data on their RMA characteristics is collected through the NAPRS, the official service
established to provide insight into the performance of fielded systems.
The availabilities assigned to service threads provide a second feedback loop from the NAPRS
field performance data to the NAS-Level requirements, as shown in red in Figure 10-2.
This redundancy provides a mechanism for verifying the realism and achievability of the
requirements, and helps to ensure that the requirements for new systems will be at least as good
as the performance of existing systems.
As with the black feedback path in Figure 10-1, the real world achieved availability of fielded
systems can be used to set expectations for future systems and to adjust requirements in future
issuances of the NAS RD. This process may involve revising the ConOps for a particular NAS
function, or in causing a re-examination of the assigned availability category of the NAS
functions involved.
Figure 10-2 Deployed System Performance Feedback Path
Closing the loop provides two benefits; it allows system engineers to:
1. Check the realism of requirements and identify operational deficiencies.
2. Examine overall characteristics of the current NAS Architecture and identify weak spots and/or areas where financial resources are not being allocated properly.
For multi-phase programs, the examination of NAPRS statistics should be an ongoing responsibility of the program’s systems engineering staff, which should analyze fielded-system data, identify RMA issues where they exist, and propose solutions.
Where new systems or capabilities are identified outside of the scope of an existing program, the
identifying office should engage with the FSEP and NAPRS staffs to identify fielded threads that
may contribute to their proposed system and ensure that the RMA characteristics of existing
services are compatible with planned service thread development.
Figure 10-3 presents a histogram of the equipment and service availabilities of service threads for operationally deployed systems, covering five years of operational data from FY 2000 through FY 2005. Each bin reports the number of input values that are equal to or greater than the bin value but still less than the next bin value; the last bin reports the number of input values equal to or greater than the last bin value. The figure shows that most of the service thread availabilities are in the .999 to .9999 range.
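The bin convention just described (a value is counted in a bin when it is equal to or greater than the bin value but less than the next, with an open-ended final bin) can be sketched in a few lines of Python; the availability values below are illustrative, not actual NAPRS data:

```python
from bisect import bisect_right

def availability_histogram(values, bin_edges):
    """Count values per bin: a value lands in bin i when
    bin_edges[i] <= value < bin_edges[i+1]; the final ("More")
    bin collects everything >= the last edge."""
    counts = [0] * len(bin_edges)
    for v in values:
        i = bisect_right(bin_edges, v) - 1  # rightmost edge <= v
        if i >= 0:                          # ignore values below the first edge
            counts[i] += 1
    return counts

# Illustrative service thread availabilities (not NAPRS data)
edges = [0.99, 0.999, 0.9999, 0.99999]
observed = [0.9991, 0.99995, 0.9951, 0.99991, 0.999999]
print(availability_histogram(observed, edges))  # → [1, 1, 2, 1]
```

This matches the figure's convention: the second bin (0.999 ≤ a < 0.9999) is where most observed thread availabilities fall.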
Figure 10-3 Service Thread Availability Histogram
Figure 10-4 illustrates the mean time between unscheduled interruptions for the same period.
Most of the MTBO values fall in the range of 1,000 to 50,000 hours, although the values range to
more than 100,000 hours. A significant number of data points are greater than 100,000 hours.
The average MTBO for all facilities is 52,000 hours, while the median is only 16,000 hours. A
cursory examination of the raw data indicates that most of the facilities with MTBOs below
10,000 hours are older facilities, while newly acquired systems are generally above 30,000
hours.
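The gap between the 52,000-hour mean and the 16,000-hour median is characteristic of a right-skewed distribution: a few facilities with very large MTBOs pull the mean well above the median. A sketch with hypothetical MTBO values (not FAA data) illustrates the effect:

```python
from statistics import mean, median

# Hypothetical facility MTBOs in hours: most are modest, but a few
# far exceed 100,000 hours, mimicking the skew seen in the NAPRS data.
mtbo = [4_000, 8_000, 12_000, 16_000, 20_000, 30_000, 150_000, 400_000]

print(f"mean = {mean(mtbo):,} h, median = {median(mtbo):,.0f} h")
# → mean = 80,000 h, median = 18,000 h
```

Because of this skew, the median is usually the more representative summary of typical facility reliability, while the mean is dominated by the most reliable facilities.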
Figure 10-4 Reliability Histogram for Unscheduled Interruptions
The data shown in Figure 10-3 and Figure 10-4 require significant manual effort to extract from FAA databases. One difficulty is the lack of standardization of data elements among reporting systems and among the systems reported on. The FAA plans to further enhance the reporting of maintenance actions through implementation of the Automated Maintenance Management System (AMMS), which will simplify data recording by technicians through the use of barcodes and reduce current requirements for multiple data entry into NAPRS and other systems. A challenge for future FAA maintenance systems, such as AMMS, will be to provide data for analysis in more convenient and usable formats and to provide a common data model.
Standardization of operational parameters tested both in operations and during the test phase is
also needed to facilitate formal RMA feedback within the acquisition process. The FAA
Acquisition Management System (AMS) defines the existence of Critical Performance
Requirements (CPRs) and mandates a high-level review of test results. The AMS does not define specific CPRs, but leaves their selection up to the program. The AMS CPR concept is based on the Department of Defense (DoD) concept of key performance parameters (KPPs). DoD practice mandates that availability be treated as a key performance parameter and receive upper-management review at all key decision points in the DoD acquisition process. Where availability above the Routine level is required in a NAS system, designating availability or other suitable RMA metrics as CPRs will facilitate additional feedback and management visibility at earlier phases of the acquisition cycle. In summary, before the introduction of service threads, there was no satisfactory way to relate the NAS-Level requirements to the performance of existing systems. The use of service threads based on the NAPRS services as the basis for the NAS-RD-2013 RMA requirements now allows the requirements to be compared with the performance of existing systems.
10.2 Requirements Analysis
The block labeled “NAS Level Requirements” in Figure 10-2 has been expanded in Figure 10-5 to illustrate the process and considerations used to assess the reasonableness of the NAS-Level Service Thread RMA requirements.
Proposed RMA requirements are compared with the performance of currently fielded systems as
measured by NAPRS. If the proposed requirements are consistent with the performance of
currently fielded systems, then the requirements can be assumed to be realistic.
If the performance of currently fielded systems exceeds the proposed requirements, the principle
that new systems being acquired must be at least as good as the systems they are replacing
dictates that the RMA requirements must be made more stringent.
On the other hand, if the proposed new requirements significantly exceed the performance of
existing systems, the requirements either are unrealistically stringent or the fielded systems are
not performing in an operationally acceptable manner. The fact that the requirements are not
consistent with the observed performance of existing systems is not, per se, an unacceptable
situation. The motivation for replacing existing systems is often that the reliability of these
systems has deteriorated to the point where their operational suitability is questionable, or the
cost to maintain them has become excessive. The operational suitability of the existing systems
must be considered when proposed requirements are being evaluated.
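The triage logic described in this section can be sketched as a small function. Comparing availabilities on a log scale of "nines" and using a half-nine margin are illustrative assumptions, not FAA policy:

```python
import math

def nines(a):
    """Availability expressed as 'nines': 0.999 → 3.0."""
    return -math.log10(1.0 - a)

def assess_requirement(proposed, fielded, margin=0.5):
    """Triage a proposed availability requirement against the
    NAPRS-measured availability of fielded systems."""
    delta = nines(proposed) - nines(fielded)
    if delta < 0:
        return "tighten"    # fielded systems already exceed the proposal
    if delta > margin:
        return "review"     # far beyond fielded performance: unrealistic,
                            # or the fielded systems are operationally unsuitable
    return "realistic"      # consistent with fielded performance

print(assess_requirement(0.9999, 0.99995))  # → tighten
print(assess_requirement(0.99999, 0.999))   # → review
print(assess_requirement(0.9995, 0.9993))   # → realistic
```

As the text notes, a "review" outcome is not automatically a defect in the requirement; it triggers an examination of whether the existing systems are themselves operationally suitable.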
Figure 10-5 Requirements Analysis
10.3 Architecture Assessment
Another benefit of using the service thread approach is that, through use of the NAPRS system, it readily supports closed-loop corrective action systems, such as a Failure Reporting, Analysis, and Corrective Action System (FRACAS) or a Data Reporting, Analysis, and Corrective Action System (DRACAS), that can be used to assess the NAS Architecture. The additional feedback
path is illustrated in Figure 10-6. This data can support the analysis of the contributions of the
components of a service thread to the overall reliability of the service. The objective of this
analysis process is to work toward improving the overall reliability of the NAS Architecture by
identifying the weak links and applying resources to those areas that will have the greatest
potential for improving the overall NAS reliability. For example, if analysis shows that the
predominant cause of interruptions of surveillance services is the failure of communications links
or power interruptions, then attempting to acquire highly reliable radar or surveillance processing
systems alone will not improve the overall reliability of surveillance services. The analysis of
field data can assist system engineers in focusing on those areas of the NAS Architecture that
offer the greatest opportunity for improving the reliability of NAS services.
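The weak-link reasoning above can be illustrated by modeling a service thread as components in series, so that the service is up only when every component is up. The component names and availability figures below are hypothetical, and real threads may include redundancy:

```python
def thread_availability(components):
    """Availability of a purely serial thread: the product of the
    component availabilities (all components must be up)."""
    a = 1.0
    for availability in components.values():
        a *= availability
    return a

def weakest_link(components):
    """The component whose availability bounds the thread's."""
    return min(components, key=components.get)

# Hypothetical surveillance thread: the radar itself is highly
# reliable; the communications link dominates service interruptions.
thread = {"radar": 0.99999, "comm_link": 0.9990,
          "power": 0.9995, "processing": 0.99995}

print(weakest_link(thread))                  # → comm_link
print(f"{thread_availability(thread):.5f}")  # ≈ 0.99844
```

In this sketch, buying a still more reliable radar barely moves the product while the communications link stays at 0.9990, which is exactly the point the surveillance example makes.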
Figure 10-6 Architecture Assessment
NOTES
11.1 Updating this Handbook
This Handbook is designed to be a living document that will develop and be refined over time,
both through changes in the NAS and the NAS-RD-20XX, and through its use to assist in the
preparation of RMA packages for system acquisitions. While the first process will be driven by
FAA Systems Engineering, the second process will only be possible if the users of the Handbook
comment on its use.
While the Handbook is being used for its intended purpose, the acquisition manager or Business
Unit personnel should keep notes regarding the areas where the Handbook was either helpful or
where it was lacking. These notes and comments about the Handbook should then be provided to
the NAS Systems Engineering Services Office, NAS Requirements Services Division (ANG-
B1), so that they can be incorporated into future revisions of the Handbook.
11.2 Bibliography
The following sources contain information that is pertinent to the understanding of the RMA
related issues discussed in this Handbook. Reviewing these resources will equip the user of this
Handbook with the necessary background to develop, interpret and monitor the fulfillment of
RMA requirements.
Abouelnaga, Ball, Dehn, Hecht, and Sievers. Specifying Dependability for Large Real-Time
Systems. 1995 Pacific Rim International Symposium on Fault-Tolerant Systems (PRFTS),
December 1995.
Avizienis, Algirdas and Ball, Danforth. On the Achievement of a Highly Dependable and Fault-
Tolerant Air Traffic Control System. IEEE Computer, February 1987.
Ball, Danforth. User Benefit Infrastructure: An Integrated View. MITRE Technical Report MTR
95W0000115, September 1996.
Ball, Danforth. COTS/NDI Fault Tolerance Options for the User Benefit Infrastructure. MITRE
Working Note WN 96W0000124, September 1996.
Brooks, Frederick. The Mythical Man-Month: Essays on Software Engineering. Addison-Wesley, 1975.
DeCara, Phil. En Route Domain Decision Memorandum.
En Route Automation Systems Supportability Review, Volume I: Hardware, October 18, 1996.
En Route Automation Systems Supportability Review, Volume II: Software & Capacity,
February 12, 1997.
Hierro, Max del. Addressing the High Availability Computing Needs of Demanding Telecom
Applications. VMEbus Systems, October 1998.
IEEE STD 493-1997, Recommended Practice for the Design of Reliable Industrial and
Commercial Power Systems, (Gold Book).
Mills, Dick. Dancing on the Rim of the Canyon, August 21, 1998
Resnick, Ron I. A Modern Taxonomy of High Availability. http://www.interlog.com/~resnick/
ron.html, December 1998
Talotta, Michael. Software Diversity Study for EDARC Replacement, May 1997.
Voas, Ghosh, Charron, and Kassab. Reducing Uncertainty about Common Mode Failures.
Pacific Northwest Software Quality Conference, October 1996.
Voas, Jeffery and Kassab, Lora. Simulating Specification Errors and Ambiguities in Systems
Employing Design Diversity. http://www.itd.nrl.navy.mil/ITD/5540/publications/
CHACS/1997/1997kassab-PNSQ97.pdf, December 1998.
Wellman, Frank. Software Costing. Prentice Hall, 1992.
Other documents that are pertinent to the understanding of the RMA related issues discussed in
this Handbook (but may not be directly applicable to FAA) include:
DoD Reliability, Availability, Maintainability-Cost (RAM-C) Report Manual, 01 Jun 2009
ANSI/GEIA-STD-0009 Reliability Program Standard for Systems Design Development and
Manufacturing, 09 Jul 2010
Designing and Assessing Supportability in DoD Weapon Systems: A Guide to Increased
Reliability and Reduced Logistics Footprint, October 24, 2003
DoD Guide for Achieving Reliability, Availability, and Maintainability, August 3, 2005
MIL-STD-3034, Reliability-Centered Maintenance Process, June 22, 2011
DoD Manual 4152.22-M, Reliability Centered Maintenance, June 2011
MIL-HDBK-217F, Reliability Prediction of Electronic Equipment, 2 December 1991.
MIL-HDBK-781A, Reliability Test Methods, Plans, and Environments for Engineering,
Development, Qualification, and Production, April 1996
11.3 References
All references used throughout this document are listed in Table 11-1.
Table 11-1 References
Reference # ID Title
[1] JINTAO99 Pan, Jiantao, “Software Reliability,” Carnegie Mellon
University (CMU), 1999
[2] FAA-HDBK-006A “Reliability, Maintainability, and Availability (RMA)
Handbook,” FAA-HDBK-006A, Federal Aviation
Administration, 23 August 2010
[3] ANSI/IEEE STD-729-
1991
“Standard Glossary of Software Engineering
Terminology”, STD-729-1991, ANSI/IEEE, 1991
[4] ROOK90 Rook, Paul editor, “Software Reliability Handbook,”
Centre for Software Reliability, City University, London,
U.K., 1990
[5] PENTTI2002 Haapanen, Pentti, Helminen, Atte, “Failure Mode and Effects Analysis of Software-Based Automation Systems,”
STUK-YTO-TR 190, VTT Industrial Systems, Helsinki,
Finland, August 2002
[6] KEENE94 Keene, S. J., “Comparing Hardware and Software
Reliability,” Reliability Review, 14(4), December 1994,
pp. 5-7, 21
[7] RAC96 “Introduction to Software Reliability: A state of the Art
Review,” Reliability Analysis Center (RAC), 1996
[8] KLUTKE2003 Klutke, Georgia-Ann, Kiessler, Peter C., Wortman, M.
A., “A Critical Look at the Bathtub Curve,” IEEE
Transactions on Reliability, Vol. 52, No. 1, March 2003
[9] HALLEY1693 Halley, E., “An estimate of the degrees of the mortality of
mankind, drawn from curious tables of the births and
funerals at the city of Breslau; with an attempt to
ascertain the price of annuities upon lives,” Philosophical
Transactions Royal Society of London, Vol. 17, pp. 596–
610, 1693.
[10] AMSAA2003 “Reliability Technology – Reliability Growth,” United
States Army Materiel Systems Analysis Activity
(AMSAA), 2003
http://www.amsaa.army.mil/ReliabilityTechnology/RelGr
owth.htm
[11] RELIASOFT2010 “Reliability Growth Analysis,” ReliaSoft Corporation,
2010
[12] NASA-STD-8739.8 “SW Assurance Standard,” NASA-STD-8739.8
w/Change 1, National Aeronautics Space Administration
(NASA), 28 July 2004
[13] DODRAM2005 “DoD Guide for Achieving Reliability, Availability, and
Maintainability,” Department of Defense (DoD), August
23, 2005
[14] CHILLAREGE1992 Chillarege, Ram, Bhandari, Inderpal S., Chaar, Jarir
K.,Halliday, Michael J., Moebus, Diane S., Ray, Bonnie
K., Wong, Man-Yuen, “Orthogonal Defect Classification
– A Concept for In-Process Measurements,” IEEE
Transactions on Software Engineering, Vol 18, No. 11,
November 1992
[15] AST2004 “Associate Administrator for Commercial Space
Transportation Research and Development
Accomplishments FY 2004,” FAA, October 2004
[16] TR-652 “AMSAA Reliability Growth Guide,” TR-652, AMSAA,
September 2000
[17] DoD 3235.1-H “Test and Evaluation of System Reliability, Availability,
and Maintainability – A Primer,” DoD 3235.1-H, DoD,
March 1982
[18] FAA/AR-01/116 “Software Service History Handbook,” DOT/FAA/AR-
01/116, FAA, January 2002
[19] DO-178B “Software Considerations in Airborne Systems and
Equipment Certification,” DO-178B, Radio Technical
Commission for Aeronautics (RTCA) Inc., December
1992
[20] AC 20-115B “RTCA, Inc., Document RTCA/DO-178B,” Advisory
Circular 20-115B, FAA, January 1993
[21] FAA/AR-06/36 “Assessment of Software Development Tools for Safety-
Critical, Real-Time Systems,” DOT/FAA/AR-06/36,
FAA, July 2007
[22] FAA-HDBK-008A “NAS SR-1000 Requirements Allocation Handbook,”
FAA-HDBK-008A, FAA, 5 March 2010
[23] FAA-HDBK-
T&E2008
“Air Traffic Organization NextGen and Operations
Planning Services – Test and Evaluation Handbook,”
Version 1.0 FAA, 21 August 2008
[24] NASA-GB-8719.13 “NASA Software Safety Guidebook,” NASA-GB-8719.13,
National Aeronautics and Space Administration (NASA),
31 March 2004
[25] FAA/AR-06/35 “Software Development Tools for Safety-Critical, Real-
Time Systems Handbook,” DOT/FAA/AR-06/35, FAA,
June 2007
[26] AC25.1309-1A “System Design and Analysis,” AC25.1309-1A, FAA, 21
June 1988
[27] MURRAY2007 Murray, Daniel P., Hardy, Terry L., “Developing Safety-
Critical Software Requirements for Commercial Reusable
Launch Vehicles,” 14 May 2007
[28] MIL-STD-882C “System Safety Program Requirements,” MIL-STD-
882C, DoD, 19 January 1993
[29] ATO-S 2008-12 “Safety Risk Management Guidance for System
Acquisitions (SRMGSA),” Air Traffic Organization,
Office of Safety (ATO-S) 2008-12, Version 1.5, FAA,
December 2008
[30] FAA-JO-6040.15 FAA Order JO 6040.15G, National Airspace Performance Reporting System (NAPRS).
[31] FAAV&V2009 “FAA AMS Lifecycle Verification and Validation
Guidelines,” Version 1.0, FAA, 2 December 2009
[32] RM2010 “Requirements Management,” FAA, 3 August 2010
https://employees.faa.gov/org/linebusiness/ato/operations
/technical_operations/best_practices/discipline/requireme
nts/
[33] SSHDBK2000 “FAA System Safety Handbook,” FAA, 30 December
2000
http://www.faa.gov/library/manuals/aviation/risk_manage
ment/ss_handbook/
[34] MIL-STD-1629A “Procedures for Performing a Failure Mode, Effects, and
Criticality Analysis,” MIL-STD-1629A, DoD, 24
November 1980
[35] IEC 60812:2006(E) “Analysis techniques for system reliability – Procedure
for failure mode and effects analysis (FMEA),” IEC
60812:2006(E), International Electrotechnical
Commission (IEC), 2006
[36] FAA-STD-026A "Software Development for the National Airspace System
(NAS)," FAA-STD-026A, FAA, 1 June 2001
[37] MIL-HDBK-217 “Reliability Prediction of Electronic Equipment,” MIL-
HDBK-217, DoD, 2 January 1990
[38] MIL-STD-781 “And Confidence Intervals on Mean Time Between
Failures,” DoD, 13 July 1979
[39] MIL-HDBK-472 "Maintainability Prediction," DoD, 24 MAY 1966
[40] DO-278 "Software Integrity Assurance Considerations for
Communications, Navigation, Surveillance and Air
Traffic Management (CNS/ATM) Systems," DO-278A,
Radio Technical Commission for Aeronautics (RTCA),
13 December 2011
[41] SWIMPROG “SWIM Program Overview,” FAA, Last Accessed 26
June 2013
http://www.faa.gov/about/office_org/headquarters_office
s/ato/service_units/techops/atc_comms_services/swim/pr
ogram_overview/
[42] FAASOAC Hritz, Mike, “SOA, Cloud and Service Technology In The
FAA NAS,” Cloud Services and Technology Symposium,
25 September 2012
http://www.servicetechsymposium.com/dl/presentations/s
oa_cloud_and_service_technology_at_the_faa.pdf
[43] VVSTORY Bilicki, Harry, Lee, David, Molz, Maureen, “Verification
& Validation Though Storyboarding,” FAA, 11 October
2012
http://www.faa.gov/about/office_org/headquarters_office
s/ang/offices/tc/library/v&vsummit2012/h.%20bilicki/sto
ry%20board%20hb-dl-10-2-2012.ppsx
[44] SWIMGOV "System Wide Information Management (SWIM)
Governance Policies," v1.1, FAA, 13 August 2010
http://www.faa.gov/about/office_org/headquarters_office
s/ato/service_units/techops/atc_comms_services/swim/do
cumentation/media/compliancy/2010-08-13 Governance
Policies v1 1.pdf
[45] SUPPSERV "Common Support Services Cloud Request for
Information (RFI) Attachment 2," DRAFT Version 1.8,
FAA, 11 September 2012
https://faaco.faa.gov/attachments/Attachment_2_Draft_Cl
oud_Tenant_Application_Descriptions.docx
[46] CLDDOWN Hesseldahl, Arik, "Amazon’s Cloud Is Down Again,
Taking Heroku and GitHub With It," All Things D, 22
October 2012
http://allthingsd.com/20121022/amazons-cloud-is-down-
again-taking-heroku-and-github-with-it/
[47] SERVDISRUP "Summary of the Amazon EC2 and Amazon RDS Service
Disruption in the US East Region," Amazon Web
Services, 29 April 2011
http://aws.amazon.com/message/65648/
[48] SWIMDOCS “SWIM Documents,” FAA, Last Accessed 26 June 2013
http://www.faa.gov/about/office_org/headquarters_office
s/ato/service_units/techops/atc_comms_services/swim/do
cumentation/
[49] FAACLDSTRTG "FAA Cloud Computing Strategy," v1.0, FAA, May 2012
http://www.faa.gov/about/office_org/headquarters_office
s/ato/service_units/techops/atc_comms_services/swim/do
cumentation/media/cloud_computing/FAA%20Cloud%2
0Computing%20Strategy%20v1.0.pdf
[50] SOA2010 Velte, Anthony T., “Cloud Computing: A Practical
Approach,” ISBN 978-0-07-162694-1, McGraw Hill,
2010
[51] ENTSERV Cory Janssen, "Enterprise Services," Techopedia, Last
Accessed 27 June 2013
http://www.techopedia.com/definition/25404/enterprise-
services
[52] CLDTNTS “Technical Descriptions of Representative Tenant
Programs,” , DTFACT-13-R-00013 Attachment J-3,
FAA, 24 April 2013
[53] CLDRES Bills, D., Foy, S., Li, M., Mercuri, M., Wescott, J.,
"Resilience by design for cloud services, A structured
methodology for prioritizing engineering investments,"
Microsoft, 2013
[54] ROC2002 Patterson, D., Brown, A., Broadwell, P., Candea, G.,
Chen, M., Cutler, J., Enriquez, P., Fox, A., Kıcıman, E.,
Merzbacher, M., Oppenheimer, D., Sastry, N., Tetzlaff,
W., Traupman, J., and Treuhaft, N., "Recovery Oriented
Computing (ROC): Motivation, Definition, Techniques,
and Case Studies," Computer Science Technical Report
UCB//CSD-02-1175, University of California at
Berkeley, 15 March 2002
[55] NASEASV4 "Mid-Term Systems/Services Functionality Description
(SV-4)," Version 2.0, FAA - ANG - NAS EA, 20
December 2012
[56] DAUGLOSS “Glossary of Defense Acquisition Acronyms and Terms,”
15th Ed., DoD, December 2012
[57] COMSDIV "Communications Diversity," 6000.36A, FAA, 14
November 1995
[58] O6000.36A "Communications Diversity," Order 6000.36A, FAA, 14
November 1995
[59] FTISERV04 "Engineering Guide to FTI Services," Version 2.0, FAA,
August 2004
[60] GASHI09 Gashi, I., Stankovic, V., Leita, C., Thonnard, O., "An
Experimental Study of Diversity with Off-the-Shelf
AntiVirus Engines," Eighth IEEE International
Symposium on Network Computing and Applications,
IEEE, 2009
[61] Sridharan12 Sridharan, S., "A Performance Comparison of
Hypervisors for Cloud Computing - Master of Science in
Computer and Information Sciences Thesis," University
of North Florida, August 2012
[62] JO1900.47C “Air Traffic Organization Operational Contingency
Plan,” Joint Order 1900.47C, FAA, 22 October 2009
[63] O1110.154 "Establishment of Federal Aviation Administration Next
Generation Facilities Special Program Management
Office," FAA Order 1110.154, FAA, 1 September 2010
[64] VOLKMER13 Volkmer, H., "There will be no reliable cloud (part 1-3),"
4-9 April 2013
http://blog.hendrikvolkmer.de/2013/04/03/there-will-be-
no-reliable-cloud-part-1/
http://blog.hendrikvolkmer.de/2013/04/09/there-will-be-
no-reliable-cloud-part-2/
http://blog.hendrikvolkmer.de/2013/04/12/there-will-be-
no-reliable-cloud-part-3/
[65] TTL13 "Voice Quality Measurement," Technology Training
Limited, Last Accessed 9 July 2013
[66] GABELA11 Gabel, M., Gilad-Bachrach, R., Bjørner, N., Schuster, A., "Latent Fault Detection in Cloud Services," Microsoft Research, 13 July 2011
http://research.microsoft.com/pubs/151507/main.pdf
[67] CHEN02 Chen, W., Toueg, S., Aguilera, M. K., "On the Quality of
Service of Failure Detectors," Vol. 51 No. 5, IEEE
Transactions on Computers, May 2002
http://research.microsoft.com/en-
us/people/aguilera/qos_ieee_tc2002.pdf
[68] JAVAEX13 "The JavaTM Tutorials - Exceptions," Oracle, Last
Accessed 10 July 2013
http://docs.oracle.com/javase/tutorial/essential/exceptions
/
[69] BAYLE07 Bayle, T. "Preventive Maintenance Strategy for Data
Centers," APC, 2007
[70] MCKEOWN13 McKeown, M., Kommalapati, H., Roth, J., "Disaster
Recovery and High Availability for Windows Azure
Applications," Microsoft , 20 June 2013
http://msdn.microsoft.com/en-
us/library/windowsazure/dn251004.aspx
[71] ATCFORMULA “Air Traffic Control Complexity Formula for terminal
and En Route Pay Setting by Facility”
Federal Aviation Administration (FAA), June 2009
[72] FAA-NATCA “Agreement Between The Federal Aviation
Administration And The National Air Traffic Controllers
Association”
Federal Aviation Administration(FAA), June 2011
[73] RMATFDM “Reliability, Maintainability, Availability Report for the
Terminal Flight Data Manager (TFDM)”
Federal Aviation Administration(FAA), August 2012
[74] ATADS “Air Traffic Activity System (ATADS)”
Federal Aviation Administration (FAA), July 2013
http://aspm.faa.gov/opsnet/sys/Main.asp
[76] SWIMUPDATE “System Wide Information Management (SWIM) -
Program Overview and Status Update”, PMO Industry
Form SWIM, Jim Robb, 2012
http://www.faa.gov/about/office_org/headquarters_office
s/ato/service_units/techops/atc_comms_services/swim/do
cumentation/media/briefings/swim_presentation_industry
%20forum.pdf
[77] SWIMCUR “SWIM The Current”, Issue 7, September 2012
http://www.faa.gov/about/office_org/headquarters_office
s/ato/service_units/techops/atc_comms_services/swim/do
cumentation/media/newsletters/SWIM_TheCurrent_7_FI
NAL.pdf
[78] FTIOPS “FTI Operations Reference Guide”, provided GFI
[79] SWIMSOL Thomson, D., Prabhu, V., Jalleta, E., & Balakrishnan, K. (2013). (SWIM) Solution Guide for Segment 2A. Washington, D.C.: Federal Aviation Administration.
[80] SWIMVOCAB Federal Aviation Administration. (2013, May 10). SWIM
Controlled Vocabulary. Retrieved September 18, 2013,
from Federal Aviation Administration:
http://www.faa.gov/about/office_org/headquarters_office
s/ato/service_units/techops/atc_comms_services/swim/vo
cabulary/
[81] UDDIOASIS OASIS. (2004). UDDI Spec Technical Committee Draft.
Burlington : OASIS.
[82] FTICONT Attachment J.1
FAA Telecommunications Services Description (FTSD)
DTF A01-02-D-03006
Appendix A SAMPLE REQUIREMENTS
This appendix presents sample requirements that the reader may find useful in developing
System Level Specifications and other procurement documents. The reader is cautioned that
these checklists contain requirements that may not be applicable to every system. Some of the
requirements may need to be tailored to a specific system. The requirements are organized
around the documents and paragraphs of those documents where they are most applicable. The
numbers in parentheses, e.g., (3.7.1.A) (~18190-18220), following the sample requirements
provide cross references to the SR-1000 requirements from which they were derived. Numbers
not preceded by a tilde “~” refer to the March 1995 version of SR-1000; numbers preceded by a
tilde refer to the SR-1000A version that is based on the NAS Architecture.
The standard outline for System Level Specifications has three separate paragraphs for
requirements related to RMA: System Quality Factors, System Design Characteristics and
System Operations. The paragraphs below present sample requirements for each of the three
sections.
A.1 System Quality Factors
System Quality Factors include those requirements associated with attributes that apply to the
overall system. They typically include requirements for Availability, Reliability and
Maintainability.
Availability Requirements – The following table presents potential availability requirements.
Potential Availability Quality Factor Requirements
The system shall have a minimum inherent availability of (*). (3.8.1.B)18
* This value is determined by referencing Table 7-8.
Reliability Requirements – The following table presents potential reliability requirements.
Potential Reliability Quality Factor Requirements
The predicted Mean Time Between Failures (MTBF) for the system shall be not less than
(*) hours. * This value is determined by referencing Table 7-8.
The reliability of the system shall conform to Table 7-8 (Reliability Growth Table).
18 Parenthetical references are to the NAS-SR-1000a.
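As a cross-check on the sample quality factors above, the standard inherent-availability relation Ai = MTBF / (MTBF + MTTR) links the availability, reliability, and maintainability values. A sketch (with illustrative numbers, not FAA requirements) shows how a target Ai and a 30-minute MTTR imply a minimum MTBF:

```python
def inherent_availability(mtbf_hours, mttr_hours):
    """Ai = MTBF / (MTBF + MTTR): steady-state availability
    considering corrective maintenance only (no logistics delays)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def required_mtbf(target_ai, mttr_hours):
    """Minimum MTBF needed to meet a target Ai at a given MTTR."""
    return target_ai * mttr_hours / (1.0 - target_ai)

# A 0.999 inherent-availability target with a 0.5 h (30 min) MTTR
# implies an MTBF of about 499.5 hours.
print(f"{required_mtbf(0.999, 0.5):.1f}")
print(f"{inherent_availability(500, 0.5):.6f}")
```

A check of this kind helps confirm that the availability, MTBF, and MTTR values inserted into the sample requirements are mutually consistent before they are placed in a specification.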
Maintainability Requirements – The following table presents potential maintainability
requirements.
Potential Maintainability Quality Factor Requirements
The mean time to repair (MTTR) for all equipment shall be 30 minutes or less.
The mean time to restore service (MTTRS) shall be 30 minutes or less.
The maximum time to restore service shall be 120 minutes or less for failed Floor
Replaceable Units (FRUs) and Lowest Replaceable Units (LRUs).
The maximum time to restore service shall be 8 hours or less for Maintenance
Significant Items (MSIs).
Restoral times shall include diagnostic time (fault isolation); removal of the failed Lowest Replaceable Unit (LRU), Floor Replaceable Unit (FRU), or Maintenance Significant Item (MSI); replacement and installation of the new LRU, FRU, or MSI, including any adjustments or data loading necessary to initialize the LRU, FRU, or MSI (including any operating system and/or application software); all hardware adjustments, verifications, and certifications required to return the subsystem to normal operation; and repair verification, assuming qualified repair personnel are available and on-site when needed.
Subsystem preventive maintenance shall not be required more often than once every
three months. (3.7.1.)
Preventive maintenance on any subsystem shall not require more than 2 staff hours of continuous effort by one individual.
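Because the restoral-time requirements above bundle several activities into a single 30-minute budget, it can be useful to allocate the budget across activities and check the sum. The per-activity allocations below are purely illustrative:

```python
# Hypothetical allocation (minutes) of the 30-minute restoral budget
# across the activities the sample requirement enumerates.
restoral_budget = {
    "fault_isolation": 8,
    "remove_failed_unit": 4,
    "install_replacement": 6,
    "data_load_and_init": 7,
    "verify_and_certify": 5,
}

total = sum(restoral_budget.values())
assert total <= 30, "allocation exceeds the 30-minute MTTRS budget"
print(total)  # → 30
```

An allocation of this kind also makes explicit which activities (for example, data loading) consume the largest share of the restoral budget and therefore deserve design attention.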
A.2 System Design Characteristics
System Design Characteristics related to RMA – The following table presents potential system
availability related design characteristics.
Potential Availability System Design Characteristics
The system shall have no single point of failure. (3.8.1.C)
Reliability Design Characteristics – The following table presents potential system reliability
related design characteristics.
Potential Reliability Design Characteristics
The system shall restart without requiring manual reentry of data.
Where redundant hardware or software is used to satisfy reliability requirements, the
system shall automatically switchover from a failed element to the redundant
element.
Where redundant hardware or software is used to satisfy reliability requirements, the
system shall monitor the health of all redundant elements.
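The automatic-switchover and health-monitoring design characteristics above can be illustrated with a minimal failover sketch; the element names and selection logic are illustrative, not a prescribed design:

```python
def select_active(health, current):
    """Keep the current element while it is healthy; otherwise fail
    over automatically to the first healthy redundant element."""
    if health.get(current, False):
        return current
    for name, healthy in health.items():
        if healthy:
            return name
    raise RuntimeError("no healthy redundant element available")

# Monitored health of all redundant elements (hypothetical names)
health = {"channel_a": False, "channel_b": True}
print(select_active(health, "channel_a"))  # → channel_b
```

The sketch shows why continuous health monitoring is a prerequisite: without it, the system cannot know whether the redundant element is itself available to receive the switchover.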
Maintainability Design Characteristics – The following table presents potential maintainability
design characteristics.
Potential Maintainability Design Characteristics
The system shall support scheduled hardware maintenance operations without
increasing specialist workload.
The system shall support scheduled software maintenance operations without
increasing specialist workload.
The system shall enable field level technical personnel to correct equipment failures by replacing faulty Lowest Replaceable Units (LRUs) and Floor Replaceable Units (FRUs).
The system shall permit the technician to physically remove and replace a diagnosed Floor Replaceable Unit (FRU) within (TBD) minutes.
The system shall permit replacement of any Lowest Replaceable Unit (LRU) while all
functional operations continue uninterrupted on redundant equipment.
The system shall permit replacement of any Floor Replaceable Unit (FRU) while all
functional operations continue uninterrupted on redundant equipment.
The system shall permit replacement of any Maintenance Significant Item (MSI) while
all functional operations continue uninterrupted on redundant equipment.
[Optional, for systems employing multiple, independent data paths] Maintenance
operations performed on a single data path shall not impact operations on the
alternate data path.
The system shall support the capture of system state at a particular point in time.
The system shall support automated scheduled restarts to a known state.
A.3 System Operations
Maintainability Functional Requirements – The following table presents potential maintainability
functional requirements.
Potential Maintainability Functional Requirements
Failed resources shall be isolatable from the system for performance of maintenance
operations.
System elements shall require less than one hour of Periodic Maintenance (PM) per
year for each element, subsystem, and their respective Lowest Replaceable Units
(LRUs) and Floor Replaceable Units (FRUs), excluding any mechanical devices
(such as printers).
All Lowest Replaceable Units (LRUs) shall be accessible and removable at the
equipment's operational location.
All Floor Replaceable Units (FRUs) shall be accessible and removable at the
equipment's operational location.
All Maintenance Significant Items (MSIs) shall be accessible and removable at the
equipment's operational location.
The system shall be available for operational use during the following routine tasks:
Maintenance
Hardware diagnostics
Software diagnostics
Verification testing
Certification testing
Training
The system shall provide for the building and implementing of specific databases.
The system shall provide for identifying software problems. (3.7.1.A) (~18190-18220)
The system shall provide for identifying hardware problems. (3.7.1.A) (~18190-18220)
The system shall provide for collecting support data. (3.7.1D.2) (~18900-18910)
The system shall provide for displaying problem description data. (3.7.1.D.2.c)
(~18870)
The system shall receive software versions from selected software support sites.
The system shall reload a selected software version from a storage device.
The system shall test that the version or modification to existing software meets
requirements for operational use. (3.7.1.C.1.c) (~18450)
The system shall verify that the version or modification to existing software meets
requirements for operational use. (3.7.1.B) (~18320)
The system shall validate that the version or modification to existing software meets
requirements for operational use. (3.7.1.B) (~18320)
The system shall accept new operational software.
The system shall accept new maintenance software.
The system shall accept new test software.
The system shall accept new training software.
Monitor and Control – The following table presents potential Monitor and Control (M&C)
General functional requirements.
Potential M&C General Functional Requirements
The system shall have a Monitor and Control (M&C) function. (3.7.1.A) (~17880-
18000)
Specialists shall be provided with a means to interact with the M&C function via M&C
commands. (3.7.1.C.3.a)
The Monitor and Control (M&C) function shall monitor system health. (3.7.1.A)
(~17890, 17970-17980)
The Monitor and Control (M&C) function shall monitor system performance. (3.7.1.A)
(~17900)
The Monitor and Control (M&C) function shall control system configuration. (3.7.1.A)
(~17990-18000)
[Optional, for systems with multiple data paths.] The Monitor and Control (M&C)
function shall support verification and certification of one data path while the other
data path supports normal operation. (3.7.1.B) (~18320) (3.7.1.C.1.c) (~18450)
Upon Monitor and Control (M&C) command, the M&C function shall create a hard
copy printout of specialist-selected textual output, including displayed status and
error messages.
The system shall continue operations without interruption whenever one or more M&C
Positions fail.
The system shall perform automatic recovery actions in response to the failure of any
hardware or software component without reliance on the Monitor and Control
(M&C) function.
Upon Monitor and Control (M&C) command, the M&C function shall restore
applications databases after restart from internal recovery.
Upon Monitor and Control (M&C) command, the M&C function shall restore
applications databases after restart by reconstitution from external sources.
Upon Monitor and Control (M&C) command, the M&C function shall test non-
operational assemblies and identify failed assemblies to the LRU and FRU level
without any degradation to normal operations. (3.7.1.A.1.b) (~18060)
Upon Monitor and Control (M&C) command, the M&C function shall initiate off-line
diagnostics to test and isolate an indicated fault in an LRU without the use of
operational equipment. (3.7.1.A.1.b) (~18060)
Upon Monitor and Control (M&C) command, the M&C function shall initiate off-line
diagnostics to test and isolate an indicated fault in an FRU without the use of
operational equipment. (3.7.1.A.1.b) (~18060)
Upon Monitor and Control (M&C) command, the M&C function shall initiate off-line
diagnostics to test and isolate an indicated fault in an MSI without the use of
operational equipment. (3.7.1.A.1.b) (~18060)
The system shall automatically recover from a power outage.
The system shall automatically recover from a software fault.
The M&C function shall support the rapid rollback of updates to a known state.
System Monitoring Functional Requirements – The following table presents potential M&C
System Monitoring functional requirements.
Potential M&C System Monitoring Functional Requirements
The Monitor and Control (M&C) function shall monitor all critical parameters required
to determine the operational status of each software component of the system.
(3.7.1.A) (~17880-17890)
The Monitor and Control (M&C) function shall collect equipment status data. (3.7.1.A)
(~17890)
The Monitor and Control (M&C) function shall collect equipment performance data.
(3.7.1.A) (~17890)
The Monitor and Control (M&C) function shall display equipment status data. (3.7.1.A)
(~17890)
The Monitor and Control (M&C) function shall display equipment performance data.
(3.7.1.A) (~17890)
The Monitor and Control (M&C) function shall monitor parameters required to
determine the operational status of each hardware component of the system, at a
minimum to the Lowest Replaceable Unit (LRU) level. (3.7.1.A) (~17880-17890)
The Monitor and Control (M&C) function shall monitor parameters required to
determine the operational status of each external system interface. (3.7.1.A)
(~17880-17890)
The Monitor and Control (M&C) function shall monitor parameters required to
determine the current system configuration. (3.7.1.A) (~17880-17890)
The Monitor and Control (M&C) function shall monitor parameters required to
determine the current hardware identification configuration. (3.7.1.A) (~17880-
17890)
The Monitor and Control (M&C) function shall monitor parameters required to
determine the current software identification configuration. (3.7.1.A) (~17880-
17890)
The Monitor and Control (M&C) function shall monitor parameters required to
determine the configuration of all reconfigurable resources. (3.7.1.A) (~17880-
17890)
[Optional for systems with multiple data paths] The M&C function shall monitor
parameters required to determine which data path has been selected by each
operational position. (3.7.1.A) (~17880-17890)
The Monitor and Control (M&C) function shall monitor parameters required to derive
the status of the M&C position. (3.7.1.A) (~17880-17890)
The Monitor and Control (M&C) function shall monitor parameters required to derive
the availability status of each operational function of the system. (3.7.1.A) (~17880-
17890)
The Monitor and Control (M&C) function shall monitor parameters required to derive
the availability status of each operational support function of the system. (3.7.1.A)
(~17880-17890)
The Monitor and Control (M&C) function shall monitor parameters required to derive
the system-level status of the system. (3.7.1.A) (~17880-17890)
The Monitor and Control (M&C) function shall monitor parameters required to certify
the system. (3.7.1.A) (~17880-17890)
The Monitor and Control (M&C) function shall record system performance parameters
every (TBD) seconds. (3.7.1.D) (~18820)
The Monitor and Control (M&C) function shall perform system performance data
collection while meeting other performance requirements. (3.7.1.A) (~17900)
The Monitor and Control (M&C) function shall perform system performance data
collection without the need for specialist intervention. (3.7.1.A) (~17900)
The Monitor and Control (M&C) function shall determine the alarm/normal condition
and state change events of all sensor parameters, derived parameters, ATC
specialist positions, M&C positions, system functions and subsystem operations.
The Monitor and Control (M&C) function shall determine the alarm/normal condition
and state change events of all sensor parameters. (3.7.1.A.3.a) (~17960)
The Monitor and Control (M&C) function shall determine the alarm/normal condition
and state change events of all derived parameters. (3.7.1.A.3.a) (~17960)
The Monitor and Control (M&C) function shall determine the alarm/normal condition
and state change events of all specialist positions. (3.7.1.A.3.a) (~17960)
The Monitor and Control (M&C) function shall determine the alarm/normal condition
and state change events of all M&C positions. (3.7.1.A.3.a) (~17960)
The Monitor and Control (M&C) function shall determine the alarm/normal condition
and state change events of all system functions. (3.7.1.A.3.a) (~17960)
The Monitor and Control (M&C) function shall determine the alarm/normal condition
and state change events of all subsystem operations. (3.7.1.A.3.a) (~17960)
The Monitor and Control (M&C) function shall provide state change comparisons as
part of status determination. (3.7.1.A.3.a) (~17960)
The M&C function shall report status notifications to the M&C position without
specialist intervention. (3.7.1.A.3.a) (~17960)
The Monitor and Control (M&C) function shall display alarm notifications to the M&C
position within (TBD) seconds of their occurrence. (3.7.1.A.3.a) (~17960)
The Monitor and Control (M&C) function shall include the monitored parameter
associated with an alarm condition in an alarm notification.
The Monitor and Control (M&C) function shall include, when reporting or displaying
an alarm condition, the date and time that the condition was declared.
The Monitor and Control (M&C) function shall display system state changes to the
M&C position within (TBD) seconds. (3.7.1.A.3.a) (~17960)
The Monitor and Control (M&C) function shall include the monitored parameter
associated with a state change occurrence in a state change notification.
The Monitor and Control (M&C) function shall include the date and time that a
condition was declared in a state change notification.
The Monitor and Control (M&C) function shall display return-to-normal notifications
to the M&C position within (*) seconds. (3.7.1.A.3.a) (~17960) *Value to be
supplied by the Business Unit.
The Monitor and Control (M&C) function shall include the monitored parameter
associated with a return-to-normal condition in a return-to-normal notification.
The Monitor and Control (M&C) function shall include the date and time that a
condition was declared in a return-to-normal notification.
All generated alarm/return-to-normal/state change notifications shall be retained in a
form which allows on-line specialist-selectable retrieval for a period of at least (*)
hours. (3.7.1.A.3.c) (~18310) *Value to be supplied by the Business Unit.
The Monitor and Control (M&C) function shall display specific monitored parameters
when requested by the M&C position. (3.7.1.A) (~17900)
The Monitor and Control (M&C) function shall continually monitor: [Include a specific
requirement for each that applies.] (3.7.1.A) (~17900)
network and network component utilization
processor utilization
input/output peripheral attachment path utilization
peripheral device utilization
memory page fault rates
memory utilization
software utilization
operating system parameters
The Monitor and Control (M&C) function shall display a selected set of monitored
parameters when requested by the M&C position. (3.7.1.A) (~17900)
The Monitor and Control (M&C) function shall display the most recently acquired
monitor parameters in performance data reports. (3.7.1.A) (~17900)
The Monitor and Control (M&C) function shall display the most recently determined
alarm/normal conditions in performance data reports. (3.7.1.A) (~17900)
The Monitor and Control (M&C) function shall display subsystem status when
requested by the M&C position. (3.7.1.A) (~17900)
The Monitor and Control (M&C) function shall display control parameters when
requested by the M&C position. (3.7.1.A) (~17900)
All reported parameters shall be logically grouped according to subsystem structure.
Each reported parameter logical grouping shall be uniquely identifiable.
Each reported parameter within a logical grouping shall be uniquely identifiable.
The M&C function shall continually monitor:
- Message failures
- Message failure rates
The M&C function shall monitor system process memory utilization levels.
The M&C function shall monitor system processor utilization levels.
System Control – The following table presents potential M&C Control Functional requirements.
Potential M&C System Control Functional Requirements
The M&C function shall support initializing the system. (3.7.1.A.1) (~17990-18000)
The M&C function shall support startup of the system. (3.7.1.A.1) (~17990-18000)
The M&C function shall support restarting the system with recovery data. (3.7.1.A.1)
(~17990-18000)
The M&C function shall support restarting the system without recovery data.
(3.7.1.A.1) (~17990-18000)
The M&C function shall support the option of restarting individual processors with
recovery data. (3.7.1.A.1) (~17990-18000)
The M&C function shall support the option of restarting individual processors without
recovery data. (3.7.1.A.1) (~17990-18000)
The M&C function shall support the option of restarting individual consoles with
recovery data. (3.7.1.A.1) (~17990-18000)
The M&C function shall support the option of restarting individual consoles without
recovery data. (3.7.1.A.1) (~17990-18000)
The M&C function shall control the shutdown of the system. (3.7.1.A.1)
(~17990-18000)
The M&C function shall control the shutdown of individual processors. (3.7.1.A.1)
(~17990-18000)
The M&C function shall control the shutdown of individual consoles. (3.7.1.A.1)
(~17990-18000)
The M&C function shall control the loading of new software releases into system
processors. (3.7.1.A.1) (~17990-18000)
The M&C function shall have the capability to control the cutover of new software
releases in system processors. (3.7.1.A.1) (~17990-18000)
The M&C function shall have the capability to control the cutover of prior releases in
system processors. (3.7.1.A.1) (~17990-18000)
The M&C function shall control the initiating of the System Analysis Recording (SAR)
function. (3.7.1.A.1) (~17990-18000)
The M&C function shall control the stopping of the System Analysis Recording (SAR)
function. (3.7.1.A.1) (~17990-18000)
The M&C function shall control what data is recorded by the System Analysis
Recording (SAR) function. (3.7.1.A.1) (~17990-18000)
The M&C function shall enable/disable the alarm/normal detection of monitored
parameters. (3.7.1.A.1) (~17990-18000)
The M&C function shall support automated or manual recursive restart of software
processes.
M&C Computer/Human Interface (CHI) Requirements – The following table presents potential
M&C CHI requirements.
Potential M&C CHI Requirements
[If applicable] All M&C position displays shall be presented to the specialist in the
form of movable, resizable windows.
The M&C function shall provide a set of views that allow the specialist to “drill down”
to obtain increasingly detailed performance and resource status.
The M&C function shall simultaneously display a minimum of (TBD) displays on the
same workstation, with no restrictions as to display content.
The M&C function shall display an applicable error message if an invalid request or
command is entered.
The M&C function shall display graphical information using redundant information
coding [e.g. color, shapes, auditory coding] to highlight resource status.
The M&C function shall display list information using redundant information coding
(e.g. color, shapes, auditory coding) to highlight resource status.
The M&C function shall support command composition using a combination of
keyboard entries and pointer device selections.
The M&C function shall support command initiation using a combination of keyboard
entries and pointer device selections.
The M&C function shall display commands under development for confirmation prior
to execution.
The M&C function shall initialize all specialist-modifiable system parameters to default
values.
The M&C function shall provide consistent and standardized command entry such that
similar actions are commanded in similar ways.
The M&C function shall prevent inadvertent or erroneous actions that can degrade
operational capability.
M&C function generated messages shall be presented in concise, meaningful text, such
that the translation of error, function, or status codes is not required of the specialist
in order to understand the information.
M&C function generated alerts shall be presented in concise, meaningful text, such that
the translation of error, function, or status codes is not required of the specialist in
order to understand the information.
M&C function generated warnings shall be presented in concise, meaningful text, such
that the translation of error, function, or status codes is not required of the specialist
in order to understand the information.
M&C function generated visual alarms shall warn of errors, out of tolerance conditions,
recovery actions, overloads, or other conditions that may affect system operation or
configuration. [Include individual requirements for each that applies.] (3.7.1.A.3)
(~17960 and 18450)
Errors
Out of tolerance conditions
Recovery action
Overloads
M&C function generated aural alarms shall warn of conditions that may affect system
operation or configuration. [Include individual requirements for each that applies.]
(3.7.1.A.3) (~17960 and 18450)
Errors
Out of tolerance conditions
Recovery action
Overloads
M&C function generated visual alarms shall be designed to incorporate clearly
discriminative features which distinguish the warning (e.g., color, blink, size, etc.)
from other display information.
The M&C function shall allow the M&C specialist to reset existing aural and visual
alarms with a single action.
After executing a command to disable alarm/normal detection, the M&C function shall
provide a command response for monitored parameters with the condition “status
disabled”.
System Analysis Recording (SAR) – The System Analysis and Recording function provides the
ability to monitor system operation, record the monitored data, and play it back at a later time for
analysis. SAR data is used for incident and accident analysis, performance monitoring and
problem diagnosis. The following table presents potential SAR Functional requirements.
Potential System Analysis Recording Functional Requirements
The system shall provide a System Analysis and Recording (SAR) function. (3.7.1.D)
(~18800)
The SAR function shall record significant system events. (3.7.1.A) (~17890)
The SAR function shall record significant performance data. (3.7.1.A) (~17890)
The SAR function shall record significant system resource utilization. (3.7.1.A)
(~17890)
The SAR function shall record selected data while performing all system functions.
(3.7.1.A) (~17890-17900)
The SAR function shall record selected system data, including system error logs, for
off-line reduction and analysis of system problems and performance. (3.7.1.A)
(~17890-17900)
The SAR function shall periodically record the selected data when errors/abnormal
conditions are detected. (3.7.1.A) (~17890-17900)
The SAR function shall automatically dump selected memory areas when
errors/abnormal conditions are detected. (3.7.1.A) (~17890-17900)
The SAR function shall record every (TBD) seconds internal state information when
errors/abnormal conditions are detected. (3.7.1.A) (~17890-17900)
The data items and the conditions under which they will be recorded by the SAR
function shall be determined by adaptation.
The data items and the conditions under which they will be recorded by the SAR
function shall be determined by M&C commands.
The SAR function shall record all system recordings on removable storage media at a
single location for a minimum of (TBD) hours without specialist intervention.
(3.7.1.A.3) (~18290)
The SAR function shall support continuous recording of system data while transitioning
from one unit of recording media to another. (3.7.1.A.3) (~18290)
The SAR function shall record identifying information, including date and time, on
each unit of recording media. [Include individual requirements for each that
applies]. (3.7.1.A) (~17890-17900)
Site identity
Program version number
Adaptation identity
Data start/end date and time
The SAR function shall record changes in resource monitoring parameters. (3.7.1.A)
(~17890-17900)
The SAR function shall record changes in recording selection parameters. (3.7.1.A)
(~17890-17900)
The SAR function shall provide off-line data reduction of recorded system data for
analysis of the system's technical and operational performance.
Startup/Restart – The Startup/Restart function is one of the most critical system functions and
has a significant impact on the ability of the system to meet its RMA requirements, especially for
software-intensive systems. The following table presents potential Startup/Restart Functional
requirements.
Potential System Startup/Restart Functional Requirements
The system shall have the capability to re-establish communications and reconstitute its
databases as necessary following a startup/restart.
Upon startup or restart, the system shall re-establish communications with all
interfaces.
The system shall restart from a power-on condition in (TBD) seconds. (3.8.1.D)
(~19070-19090)
Software Loading and Cutover is a set of functions associated with the transfer, loading and
cutover of software to the system. Cutover could be to a new release or a prior release. The
following table presents potential Software Loading and Cutover Functional requirements.
Potential Software Loading and Cutover Functional Requirements
The system shall support the following tasks with no disruption to or degradation of on-
going system operations or performance, except during firmware upgrades.
Loading of data
System software
Operating systems
Downloadable firmware
System adaptation data
The system shall store [TBD] complete versions of application software and associated
adaptation data in each system processor.
The system shall store [TBD] levels of operating system software in each system
processor.
When software is loaded into a processor, positive verification shall be performed to
confirm that all software is loaded without corruption, with the results reported to
the M&C function. (3.7.1.A.1.c) (~18070-18080)
[If applicable.] Under the control of the M&C function and upon M&C command, the
system shall cutover system processors on the non-operational data path to a
previously loaded version of the software and adaptation data with no effect on the
operational data path.
[If applicable.] Under the control of the M&C function and upon M&C command, the
system shall test and evaluate software and associated adaptation versions on the
non-operational data path of the system with no effect on the operational data path
portion of the system.
[If applicable.] Under the control of the M&C function and upon M&C command, the
system shall analyze performance on the non-operational data path of the system
with no effect on the operational portion of the system.
[If applicable.] Under the control of the M&C function and upon M&C command, the
system shall perform problem analysis on the non-operational data path of the
system with no effect on the operational data path of the system.
Under the control of the M&C function and upon M&C command, the system shall
perform system level, end-to-end tests.
The system shall perform all system level, end-to-end tests with no degradation of on-
going operations or system performance.
Certification – Certification is an inherently human process of analyzing available data to
determine if the system is worthy of performing its intended function. One element of data is
often the results of a certification function that is designed to exercise end-to-end system
functionality using known data and predictable results. Successful completion of the
certification function is one element of data used by the Specialist to determine the system is
worthy of certification. Some systems employ a background diagnostic or verification process to
provide evidence of continued system certifiability. The following table presents potential
Certification Functional requirements.
Potential Certification Functional Requirements
Prior to allowing an off-line LRU to be configured as part of the on-line operational
system, the M&C function shall automatically initiate comprehensive
tests/diagnostics on that off-line LRU (to the extent possible for that LRU), and
report the results to the M&C position. (3.7.1.B) (~18320)
The M&C function shall automatically perform real-time, on-line, periodic tests
without interruption or degradation to operations on all LRUs, and reporting any
out-of-tolerance results to the M&C position. (3.7.1.B) (~18320)
The M&C function shall, upon M&C command, modify the frequency of the
background verification tests, from a minimum frequency of once every (TBD)
hours to a maximum frequency of once every (TBD) minutes. (3.7.1.B) (~18320)
The M&C function shall, upon M&C command, initiate the background verification
test for a specified LRU, and receive a hard copy printout of the test results.
(3.7.1.B) (~18320)
The M&C function shall provide on-line system certification of the entire system
without interruption or degradation to operations. (3.7.1.C.1.c) (~18450)
The M&C function shall, upon M&C command, manually initiate on-line certification
of the system. (3.7.1.C.1.c) (~18450)
Transition – Transition is a set of requirements associated with providing the functionality
required to support the transition to new or upgraded systems. The following table presents
potential Transition Functional requirements.
Potential Transition Functional Requirements
The M&C function shall inhibit inputs to the system when the system is in a
test/monitor mode to prevent inadvertent interference with ATC operations. (3.7.3)
The M&C function shall concurrently perform system-level testing and shadow mode
testing and training at system positions without affecting on-going ATC operations.
(3.7.2.A) (3.7.3)
The M&C function shall reconstitute displayed data such that all outputs needed for
operations are available at the selected controlling position equipment.
Maintenance support is a collection of requirements associated with performing preventive and
corrective maintenance of equipment and software. The following table presents potential
Maintenance Support Functional requirements.
Potential Maintenance Support Functional Requirements
The M&C function shall control the facilities, equipment, and systems necessary to
perform preventive maintenance activities including adjustment, diagnosis,
replacement, repair, reconditioning, and recertification. (3.7.1.C) (~18360)
The system shall provide test circuitry and analysis capabilities to allow diagnosis of
the cause of a system/equipment failure, isolation of the fault, and operational
checkout. (3.7.1.C.2) (~18370)
Test Support Functions – Test support is a collection of requirements associated with supporting
system testing before, during and after installation of the system. The following table presents
potential Test Support Functional requirements.
Potential Test Support Functional Requirements
The system shall provide test sets, test drivers, scenarios, simulators and other test
support items required to provide a realistic test environment. (3.7.3)
The system shall record, reduce and analyze the test data. (3.7.3)
Training support is a collection of requirements associated with supporting training of system
specialists. The following table presents potential M&C Training requirements.
Potential M&C Training Requirements
The system shall perform training operations concurrently with ongoing ATC
operations, with no impact to operations. (3.7.2.A)
The M&C function shall configure system resources to support Air Traffic operational
training in a simulated training environment. (3.7.2.A)
Upon M&C command, the M&C function shall initiate Air Traffic operational training
in a simulated training environment. (3.7.2.A)
Upon M&C command, the M&C function shall terminate Air Traffic operational
training in a simulated training environment. (3.7.2.A)
Operational software shall be used in training exercises. (3.7.2.A)
RELIABILITY/AVAILABILITY TABLES FOR
REPAIRABLE REDUNDANT SYSTEMS
B.1 Availability Table
Table B-1 illustrates the improvement in availability achieved by adding a redundant element.
The table can be used to assist in the evaluation of inherent availability models of redundant
systems.
Table B-1. Combinatorial Availability for a "Two Needing One" Redundant
Configuration

Element Availability    System Availability for N = 2, R = 1
0.99                    0.9999
0.995                   0.999975
0.999                   0.999999
0.9995                  0.99999975
0.9999                  0.99999999
0.99995                 1.00000000
0.99999                 1.00000000
0.999995                1.00000000
0.999999                1.00000000
0.9999995               1.00000000
0.9999999               1.00000000
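The table values follow from the independence assumption behind the combinatorial model: a "two needing one" pair is unavailable only when both elements are down simultaneously, so system availability is 1 - (1 - A)^2, where A is the availability of a single element. The following Python sketch is illustrative only (the function name is ours, not the handbook's) and reproduces the first rows of the table:

```python
def system_availability(element_availability: float) -> float:
    """Combinatorial availability of a 'two needing one' redundant pair.

    Assumes independent, identical elements: the pair is unavailable
    only when both elements are down, so A_sys = 1 - (1 - A_e)**2.
    """
    return 1.0 - (1.0 - element_availability) ** 2

# Reproduce the first rows of Table B-1.
for a_e in (0.99, 0.995, 0.999, 0.9995, 0.9999):
    print(f"{a_e:<8} -> {system_availability(a_e):.8f}")
```

Beyond element availabilities of roughly 0.99995, the result rounds to 1.00000000 at eight decimal places, which is why the lower rows of the table saturate.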
B.2 Mean Time between Failure (MTBF) Graphs
The graphs shown in Figure B-1, Figure B-2, and Figure B-3 illustrate the reliability
improvement achieved with a dual redundant configuration. The X-axis represents the MTBF of
a single element, and the Y-axis represents the MTBF of the redundant configuration. The system
reliability for repairable redundant systems is also affected by the time to return failed elements
to service. The separate curves on each graph represent different values of MTTR.
The three graphs are based on different ranges of reliability of the individual elements
comprising the redundant configuration. The charts were computed using the Einhorn equations
presented in Appendix C.
[Chart: element MTBF (hours, 0 to 12,000) on the X-axis versus redundant combination MTBF (hours, 0 to 120,000,000) on the Y-axis; separate curves for MTTR = 0.5, 1.0, and 2.0 hours]
Figure B-1 Mean Time between Failure for a "Two Needing One" Redundant Combination
[Chart: element MTBF (hours, 0 to 12,000) on the X-axis versus redundant combination MTBF (hours, 0 to 120,000,000) on the Y-axis; separate curves for MTTR = 0.5, 1.0, and 2.0 hours]
Figure B-2 Mean Time between Failure for a “Two Needing One” Redundant Combination
[Chart: element MTBF (hours, 0 to 16,000) on the X-axis versus redundant combination MTBF (hours, 0 to 250,000,000) on the Y-axis; separate curves for MTTR = 0.5, 1.0, and 2.0 hours]
Figure B-3 Mean Time between Failure for a “Two Needing One” Redundant Combination
[Chart: element MTBF (hours, 0 to 35,000) on the X-axis versus redundant combination MTBF (hours, 0 to 1,000,000,000) on the Y-axis; separate curves for MTTR = 0.5, 1.0, and 2.0 hours]
STATISTICAL METHODS AND LIMITATIONS
C.1 Reliability Modeling and Prediction
The statistical basis for reliability modeling was originally developed in the 1950s, when
electronic equipment was fabricated with discrete components such as capacitors, resistors, and
transistors. The overall reliability of electronic equipment is related to the numbers and failure
rates of the individual components used in the equipment. Two fundamental assumptions form
the basis for conventional parts count reliability models:
The failure rates of components are assumed to be constant. (After a short initial
burn-in interval and before end-of-life wear out—the “bathtub curve.”)
The failures of individual components occur independently of one another.
The constant failure rate assumption allows the use of an exponential distribution to describe the
distribution of time to failure, so that the probability that a component will survive for time t is
given by
R = e^(−λt)    [C-1]
(Where R is the survival probability, λ is the constant failure rate, and t is the time.)
The assumption of independent failures means that the failure of one component does not affect
the probability of failure of another component. Hence the probability of all components
surviving is the product of the individual survival probabilities.
R_T = R_1 × R_2 × ⋯ × R_n    [C-2]
Because of the exponential distribution of failures, the total failure rate is simply the sum of the
individual failure rates and the total reliability is
R_T = e^(−λ_T t)    [C-3]
Where λ_T is given by
λ_T = Σ(i = 1 to n) λ_i    [C-4]
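As a quick numerical check of Equations [C-2] through [C-4] (a sketch with invented failure rates, not handbook data), the product of individual exponential survival probabilities equals a single exponential with the summed failure rate:

```python
import math

# Hypothetical component failure rates (failures/hour), for illustration only.
rates = [1e-4, 2e-4, 5e-5]
t = 1000.0  # operating time in hours

# [C-2]: product of the individual survival probabilities R_i = exp(-lambda_i * t)
r_product = math.prod(math.exp(-lam * t) for lam in rates)

# [C-3]/[C-4]: one exponential with total rate lambda_T = sum of the lambda_i
r_total = math.exp(-sum(rates) * t)

print(r_product, r_total)  # the two forms agree
```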
The equation for predicting the equipment failure rate using the parts count method is given by
MIL-HDBK-217 as
λ_EQUIP = Σ(i = 1 to N) N_i (λ_G π_Q)_i    [C-5]
Where
λ_EQUIP = Total equipment failure rate (failures/10^6 hours)
λ_Gi = Generic failure rate for the ith generic part (failures/10^6 hours)
π_Qi = Quality factor for the ith generic part
N_i = Quantity of the ith generic part
N = Number of different generic part categories in the equipment
This reliability prediction technique worked reasonably well for simple "black box" electronic
equipment. However, the introduction of fault-tolerant redundant computer systems created a
need for more complex modeling techniques, which are discussed in Section C.4.
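Equation [C-5] is a simple weighted sum. The sketch below shows the arithmetic; the part quantities, generic failure rates, and quality factors are invented for illustration and are not taken from MIL-HDBK-217's tables:

```python
# Parts-count prediction per [C-5]: lambda_equip = sum(N_i * lambda_Gi * pi_Qi).
# Generic failure rates are in failures per million hours; all entries below
# are hypothetical.
parts = [
    # (quantity, generic failure rate, quality factor)
    (120, 0.0036, 1.0),   # e.g., film resistors
    (45,  0.0120, 2.0),   # e.g., ceramic capacitors
    (10,  0.0500, 1.5),   # e.g., bipolar transistors
]

lambda_equip = sum(n * lam_g * pi_q for n, lam_g, pi_q in parts)
mtbf_hours = 1e6 / lambda_equip  # convert failures/10^6 hours to an MTBF
print(lambda_equip, mtbf_hours)
```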
C.2 Maintainability
Maintainability is defined in MIL-STD-721Rev C as: “The measure of the ability of an item to
be retained in or restored to specified condition when maintenance is performed by personnel
having specified skill levels, using prescribed procedures and resources, at each prescribed level
of maintenance and repair.”
Maintainability prediction methods depend primarily on two basic parameters: the failure rates of
components at the level of maintenance actions, and the repair or replacement times for the
components. Historically, maintainability predictions for electronic equipment involved a
detailed examination of the components’ failure rates and the measured time required for
diagnosing and replacing or repairing each of the failed components. A statistical model
combined the failure rates and repair times of the equipment’s components to determine an
overall MTTR or MDT.
Maintainability is a design characteristic of the equipment. Repair times are affected by the
quality of built-in test equipment (BITE), diagnostic tools, and the ease of access, removal, and
replacement of failed components.
With the advent of redundant, fault-tolerant systems, in which restoration of service is performed
by automatic switchover and corrective maintenance is performed off-line, maintainability is not
as significant as it once was. In addition, the move toward commercial off-the-shelf (COTS)
equipment that is simply removed and replaced has made the traditional maintainability
calculations, as expressed in MIL-HDBK-472, less relevant.
C.3 Availability
Availability is defined in MIL-STD-721 as a measure of the degree to which an item is in an
operable and committable state at the start of a mission when the mission is called for at an
unknown (random) time. As such, availability is the probability that the system will be available
when needed, and the availabilities for independent subsystems can be combined by simply
multiplying the availabilities. (Availability is also used to express the percentage of units that
may be available at the start of a mission, e.g., how many aircraft in a squadron will be available.)
Availability is measured in the field by subtracting the downtime from the total elapsed time to
obtain the time that the system was operational, and dividing this time by the total elapsed time.
Operational availability includes all downtime. Other availability measures have been defined
that exclude various categories of downtime, such as those caused by administrative delays and
logistics supply problems. The purpose of these other measures is to develop metrics that more
accurately reflect the characteristics of the system itself by removing downtime attributable to
deficiencies in the human administration of the system.
Availability is usually not predicted directly, but is derived from both the failure and
repair characteristics of the equipment. Availability is expressed as:
A = MTBF / (MTBF + MTTR)    [C-6]
Where MTBF is the Mean Time between Failures and MTTR is the Mean Time to Repair.
Equivalently, availability can also be expressed as:
A = MUT / (MUT + MDT)    [C-7]
(Where MUT is the Mean Up Time and MDT is the Mean Downtime.)
As discussed earlier, availability allows reliability and maintainability to be traded off. Although
this practice may be acceptable for equipment where optimizing life cycle costs is the primary
consideration, it may not be appropriate for systems that provide critical services to air traffic
controllers, where lengthy service interruptions may be unacceptable, regardless of how
infrequently they are predicted to occur.
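The trade-off can be made concrete with Equation [C-6]. In the sketch below (the numbers are illustrative, not FAA values), two hypothetical systems have identical availability, yet the second fails one-tenth as often and takes ten times longer to restore, which matters greatly for a critical service:

```python
def availability(mtbf: float, mttr: float) -> float:
    """Inherent availability per [C-6]."""
    return mtbf / (mtbf + mttr)

# The same MTBF-to-MTTR ratio yields the same availability, but very
# different outage behavior: many short outages vs. few long ones.
a_short_outages = availability(mtbf=1000.0, mttr=0.5)
a_long_outages = availability(mtbf=10000.0, mttr=5.0)
print(a_short_outages, a_long_outages)
```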
C.4 Modeling Repairable Redundant Systems
The increasing use of digital computers for important real-time and near-real-time operations in
the 1960s created a demand for systems with much greater reliability than could be achieved
with the then-current state of the art for electronic systems constructed with large numbers
of discrete components. For example, the IBM 360 series computers employed in NAS Stage A
had an MTBF on the order of 1000 hours. The path to higher reliability systems was to employ
redundancy and automatic fault detection and recovery. The introduction of repairable
redundant systems required new methods for predicting the reliability of these systems. One of
the first attempts at predicting the reliability of these systems was presented in a paper by S. J.
Einhorn in 1963. He developed a method for predicting the reliability of a repairable redundant
system using the mean time to failure and mean time to repair for the elements of the system. He
assumed that the system elements conformed to the exponential failure and repair time
distributions and that the failure and repair behaviors of the elements are independent of one
another. The Einhorn equation for predicting the reliability of an r out of n redundant system is
presented below.
MUT = [ Σ(j = r to n) C(n, j) U^j D^(n−j) ] / [ r λ C(n, r) U^r D^(n−r) ]    [C-8]
Where MUT is the mean UP time, n is the total number of elements in a subsystem, r is the
number of elements that are required for the system to be UP, U is the steady-state probability
that an element is up, D = 1 − U is the corresponding probability that an element is down, λ is
the element failure rate, and the number of combinations of n things taken r at a time is given by
C(n, r) = n! / (r!(n − r)!) = n(n − 1)(n − 2)⋯(n − r + 1) / r!    [C-9]
The Einhorn method provided a relatively simple way to predict the combinatorial reliability of
an “r out of n” repairable redundant configuration of identical elements. This method assumes
perfect fault detection, isolation and recovery and does not account for switchover failures or
allow for degraded modes of operation.
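A minimal sketch of the Einhorn calculation (assuming, as in [C-8], exponential failure and repair with element MTBF m and MTTR ρ, so that U = m/(m + ρ) and D = 1 − U):

```python
from math import comb

def einhorn_mut(n: int, r: int, mtbf: float, mttr: float) -> float:
    """Mean up time of an r-out-of-n repairable redundant configuration
    per the Einhorn equation [C-8]; perfect fault detection and recovery
    are assumed, as in the text."""
    u = mtbf / (mtbf + mttr)          # element availability U
    d = 1.0 - u                       # element unavailability D
    lam = 1.0 / mtbf                  # element failure rate
    p_up = sum(comb(n, j) * u**j * d**(n - j) for j in range(r, n + 1))
    failure_freq = r * lam * comb(n, r) * u**r * d**(n - r)
    return p_up / failure_freq

# Two elements needing one, 1000-hour element MTBF, 0.5-hour MTTR:
print(einhorn_mut(2, 1, 1000.0, 0.5))  # about one million hours
```

For the two-needing-one case this reduces to m(m + 2ρ)/(2ρ), which is why a modest 1000-hour element MTBF yields a system MUT near one million hours.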
In order to incorporate these additional factors, Markov models were developed to model
reliability and availability. Typically, Markov models of redundant systems assume that the
overall system is organized as a set of distinct subsystems, where each subsystem is composed of
identical elements and the failure of a subsystem is independent of the status of the other
subsystems. In each of the subsystems, redundancy is modeled by a Markov process with the
following typical assumptions:
Failure rates and repair rates are constants.
System crashes resulting from the inability to recover from some failures, even though
operational spares are available, are modeled by means of "coverage" parameters.
(Coverage is defined as the probability that the system can recover, given that a fault
has occurred.)
For recoverable failures, the recovery process is instantaneous if usable spares are
available.
Spare failures are detected immediately.
As soon as a failed unit is repaired, it is assumed to be in perfect condition and is
returned to the pool of spares.
Reliability analysis using Markov models follows four distinct steps:
1. Development of the state transition diagram
2. Mathematical representation (Differential equation setup)
3. Solution of the differential equations
4. Calculation of the reliability measures
An example of a general state transition diagram for a system with three states is provided in
Figure C-1. The circles represent the possible states of the system and the arcs represent the transitions
between the states. A three-state model is used to represent the behavior of a simple system with
two elements, one of which must be operational for the system to be up. In State 1, both elements
are operational. In State 2, one of the elements has failed, but the system is still operational. In
State 3, both elements have failed and the system is down.
Figure C-1 General State Transition Diagram for Three-State System
From the state transition diagram, a set of differential equations can be formulated as follows:
If a system is in State 1 at time t + Δt, then between time t and t + Δt, one of two events must
have occurred: (1) either the system was in State 1 at time t and stayed in that state throughout
the interval Δt, or (2) it was in State 2 or State 3 at time t and a transition to State 1 occurred in
the interval Δt.
The probability of event 1, that the system stayed in State 1 throughout the interval Δt, is equal to
one minus the probability that a transition occurred from State 1 to either State 2 or State 3.
P(E1) = [1 − (p12 + p13)Δt] P1(t)    [C-10]
The probability of event 2 is given by the probability that the system was in State 2 times the
probability of a transition from State 2 to State 1 in Δt, plus the probability that the system was in
State 3 times the probability of a transition from State 3 to State 1 in Δt.
P(E2) = p21 Δt P2(t) + p31 Δt P3(t)    [C-11]
Since the two events are mutually exclusive, the probability of being in State 1 at time t + Δt is
the sum of the probabilities of the two events
P1(t + Δt) = [1 − (p12 + p13)Δt] P1(t) + p21 Δt P2(t) + p31 Δt P3(t)    [C-12]
Rearranging the terms in Equation [C-12] and letting Δt approach zero yields the following
differential equation
dP1(t)/dt = −(p12 + p13) P1(t) + p21 P2(t) + p31 P3(t)    [C-13]
Similarly, the equations for the other two states are
dP2(t)/dt = p12 P1(t) − (p21 + p23) P2(t) + p32 P3(t)    [C-14]
dP3(t)/dt = p13 P1(t) + p23 P2(t) − (p31 + p32) P3(t)    [C-15]
Equations [C-13], [C-14], and [C-15] can be written in matrix form as

  d/dt | P1(t) |   | −(p12 + p13)    p21             p31          | | P1(t) |
       | P2(t) | = |  p12           −(p21 + p23)     p32          | | P2(t) |    [C-16]
       | P3(t) |   |  p13            p23            −(p31 + p32)  | | P3(t) |

or
dP(t)/dt = A P(t)    [C-17]
Where A represents the transition probability matrix (TPM) for the state diagram and the
elements of the matrix represent the transition rates between states. In a reliability or availability
model, these rates are determined primarily by the failure and repair rates of the system
elements.
Typically, the state transition diagram will not include all of the possible state transitions. For
example, reliability models generally do not include any transitions out of the failed state, while
availability models add a repair rate out of the failed state corresponding to the time to restore
the system to full operation following a total failure. Other transitions between states may not be
possible in the particular system being modeled.
Figure C-2 presents a simplified transition diagram for a system with two elements, one of
which is required for full operation. S1 is the state when both elements are up. S2 is the state
when one element is up and the other has failed, but the system is still operational. S3 is the
failed state, when neither of the two elements is operational and the system is down. The only
transitions between states are a result of the failure or repair of an element. The transition
probabilities for the paths in the general model of Figure C-1 that are not shown are set to zero.
Since this is a reliability model that reflects the time to failure of the system, there are no
transitions out of the failed state, S3. It is considered in Markov terminology to be an absorbing
state. Once the system has run out of spares and failed, it stays in the failed state indefinitely.
This simplified model addresses only the combinatorial probability of encountering a second
failure before the first failure has been repaired, i.e. exhausting spares. It does not consider
failures of automatic switchover mechanisms or address other factors such as degraded states,
undetected spare failures, etc.
Figure C-2 Simplified Transition Diagram
This simple example can be used to illustrate how the differential equations can be solved.
Suppose that each element has a failure rate of 0.001 failures/hour (1000 hour MTBF) and a
repair rate of 2 repairs/hour (0.5 hours MTTR). The transition probabilities are then
p12 = .002 (because there are two elements each having a .001 failure rate)
p23 = .001 (because there is only one element left to fail)
p21 = 2
All of the other transition probabilities are zero
Thus the transition probability matrix of Equation [C-17] becomes

  A = | −0.002     2        0 |
      |  0.002    −2.001    0 |    [C-18]
      |  0         0.001    0 |

Since, for reliability prediction, the reliability is expressed by the probability of being in one of
the two "UP" states S1 or S2, the TPM can be further simplified to

  A = | −0.002     2      |
      |  0.002    −2.001  |    [C-19]
Equations [C-13] and [C-14] then become
dP1(t)/dt = −0.002 P1(t) + 2 P2(t)    [C-20]
dP2(t)/dt = 0.002 P1(t) − 2.001 P2(t)    [C-21]
Taking the Laplace transform of these equations yields
s P1(s) − P1(0⁺) = −0.002 P1(s) + 2 P2(s)
s P2(s) − P2(0⁺) = 0.002 P1(s) − 2.001 P2(s)    [C-22]
Rearranging terms and substituting the initial conditions P1(0⁺) = 1 and P2(0⁺) = 0,
(s + 0.002) P1(s) − 2 P2(s) = 1
−0.002 P1(s) + (s + 2.001) P2(s) = 0    [C-23]
The equations in [C-23] can be solved using Cramer's rule as
P1(s) = (s + 2.001) / (s² + 2.003 s + 2×10⁻⁶)
P2(s) = 0.002 / (s² + 2.003 s + 2×10⁻⁶)    [C-24]
The factors of the denominator are (s + 2.003) and (s + 10⁻⁶). Expanding Equations [C-24] by
partial fractions yields
P1(s) = 0.000998503 / (s + 2.003) + 0.999001497 / (s + 10⁻⁶)
P2(s) = −0.000998503 / (s + 2.003) + 0.000998503 / (s + 10⁻⁶)    [C-25]
Since the reliability is given by the sum of the probabilities of being in State 1 or 2, the
reliability is
R(s) = P1(s) + P2(s) = 1 / (s + 10⁻⁶)    [C-26]
and taking the inverse Laplace transformation,
R(t) = e^(−10⁻⁶ t)    [C-27]
This indicates that the reliability is equal to 1.0 at t = 0 and decays exponentially as t increases.
The system mean time between failures (MTBF) is given by the reciprocal of the failure rate in
the exponent, or one million hours. This is the same result that is obtained by using the Einhorn
equation [C-8].
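The Laplace-domain solution above can be checked numerically. The sketch below (ours, for illustration) finds the roots of the characteristic polynomial s² + 2.003 s + 2×10⁻⁶ of the simplified TPM [C-19]; the reciprocal of the magnitude of the small root is the system MTBF:

```python
import math

# Characteristic polynomial of the 2x2 TPM in [C-19]:
# det(sI - A) = s^2 + 2.003*s + 2e-6
b, c = 2.003, 2e-6
disc = math.sqrt(b * b - 4 * c)
root_fast = -(b + disc) / 2   # fast transient, approximately -2.003
# Compute the tiny root from the product of roots (= c) to avoid
# the cancellation error in -(b - disc)/2.
root_slow = c / root_fast     # approximately -1e-6

mtbf = 1.0 / abs(root_slow)
print(mtbf)  # about one million hours, matching the text
```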
In this simple example, there is virtually no difference between a Markov model and the Einhorn
equations. Note that Markov models can be extended almost indefinitely to include additional
system states and transitions between states. For example, our simple reliability model in Figure
C-2 can be extended to include the effects of failure to detect an element failure or to switch
successfully to a spare element by adding the transition path shown in Figure C-3.
Figure C-3 Coverage Failure
The transition path p13 represents a crash failure of the automatic fault detection and recovery
mechanisms. The transition rate from the full-up state to the failed state depends on the
failure rate of the system elements and the value of the coverage parameter, C. Coverage is a
dimensionless parameter between zero and one that represents the probability that recovery from
a failure is successful, given that a failure has occurred. The value of p13 is given by
p13 = 2λ(1 − C)    [C-28]
If the coverage is perfect, with C equal to one, then the transition probability from S1 to S3 is
zero and Figure C-3 becomes equivalent to Figure C-2. If C is equal to zero, then the automatic
recovery mechanisms never work and the system will fail whenever either of the two elements
fails (assuming that the automatic recovery mechanisms are invoked whenever a failure occurs
anywhere in the system, a common practice in systems employing standby redundancy).
If a model of availability instead of reliability is desired, it will be necessary to add a recovery
path from the failed state as in the availability model shown in Figure C-4 .
Figure C-4 Availability Model
The transition from the failed state to the full-up state, p31, is twice the repair rate for a single
element if the capability exists to repair and restore both failed elements simultaneously.
The preceding examples illustrate some very simple Markov models. The number of states and
transition paths can be extended indefinitely to include a wide variety of system nuances, such as
degraded modes of operation and undetected failures of spare elements. However, the number of
elements in the transition probability matrix increases as the square of the number of states,
making the hand calculations illustrated above a practical impossibility. Although the solution
mathematics and methodology are the same, the sheer number of arithmetic manipulations
required makes the solution of the equations a time-consuming and error-prone process. For this
reason, Markov modeling is usually performed using computer tools. Many of these tools can
automatically construct the transition probability matrix (TPM) from the input parameters, solve
the differential equations using numerical methods, and then calculate a variety of RMA
measures from the resulting state probabilities.
C.5 Availability Allocation
A typical reliability block diagram consists of a number of independent subsystems in series.
Since the availability of each subsystem is assumed to be independent of the other subsystems,
the total availability of the series string is given by
A_Total = A_1 × A_2 × ⋯ × A_n    [C-29]
The most straightforward method of allocating availability is to allocate the availability equally
among all of the subsystems in the reliability block diagram. The allocated availability of each
element in the reliability block diagram is then given by
A_Subsystem = (A_Total)^(1/n)    [C-30]
A simpler approximation can be derived by rewriting the availability equation using the
expression
A = 1 − Ā    [C-31]
where Ā represents the unavailability. Rewriting Equation [C-29],
A_Total = (1 − Ā_1)(1 − Ā_2)⋯(1 − Ā_n)    [C-32]
Multiplying terms and discarding higher-order unavailability products yields the following
approximation
A_Total ≈ 1 − n Ā_Subsystem    [C-33]
or, by rearranging terms,
Ā_Subsystem = (1 − A_Total) / n    [C-34]
The approximation given by Equation [C-34] allows the availability allocation to be performed
by simple division instead of calculating the nth root of the total availability as in Equation
[C-30].
Thus, to allocate availability equally across n independent subsystems, it is only necessary to
divide the total unavailability by the number of subsystems in the series string to determine the
allocated unavailability for each subsystem. The allocated availability for each subsystem is then
simply
A_Subsystem = 1 − (1 − A_Total) / n    [C-35]
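The sketch below compares the exact nth-root allocation of [C-30] with the [C-35] approximation for a hypothetical thread requirement (the numbers are illustrative, not NAS allocations):

```python
# Allocate a total availability requirement equally across n subsystems.
a_total = 0.999   # hypothetical end-to-end thread requirement (three "nines")
n = 5             # subsystems in the series string

exact = a_total ** (1.0 / n)           # [C-30]: nth root of the total availability
approx = 1.0 - (1.0 - a_total) / n     # [C-35]: divide the unavailability by n

print(exact, approx)  # both approximately 0.9998
```

The two agree to within the higher-order unavailability products discarded in deriving [C-33], which is why the simple division suffices in practice.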
At this point, it is instructive to reflect on where all of this mathematics is leading. Looking at
Equation [C-35], if n = 10, the allocated unavailability for each subsystem in the string will be an
order of magnitude smaller than the total unavailability of the service thread. This relationship
holds for any value of total availability. For a service thread with ten subsystems, the allocated
availability for each subsystem will always be one "nine" greater than the number of "nines"
required for the total availability of the service thread. Since none of the current threads has ten
subsystems, all subsystem allocations will be less than an order of magnitude more stringent than
the total availability required by the service thread. By simply requiring the unavailability of any
system in a thread to be an order of magnitude smaller than the end-to-end unavailability of the
thread, the end-to-end availability of the thread will be ensured unless there are more than ten
systems in the thread. This convention eliminates the requirement to perform a mathematical
allocation and eliminates the issue of whether the NAS-Level availability should be equally
allocated across all systems in the thread. Mathematical allocations also contribute to the illusion
of false precision; it is likely that allocations would be rounded up to an even number of "nines"
anyway. The risk, of course, of requiring that systems have an availability an order of magnitude
more stringent than the threads they support is that the system availability requirement is greater
than absolutely necessary, and could conceivably cause systems to be more costly.
This should not be a problem for two reasons. First, as discussed in Section 8.1.1, this process
only applies to information systems; other methods are proposed for remote and distributed
elements and facility infrastructure systems. Secondly, a system designer is only required to
show that the architecture's inherent availability meets the allocated requirement. The primary
decision that needs to be made is whether the system needs to employ redundancy and automatic
fault detection and recovery. With no redundancy, an inherent availability of three to four
"nines" is achievable. With minimum redundancy, the inherent availability jumps to six to eight
"nines." Therefore, allocated availabilities in the range of four to five "nines" will not drive the
design: any availability requirement in this range will require redundancy and automatic fault
detection and recovery, which should easily exceed the allocated requirement.
C.6 Modeling and Allocation Issues
RMA models are a key factor in the process of allocating NAS-Level requirements to the
systems that are procured to supply the services and capabilities defined by the NAS
requirements. Although the mathematics used in RMA modeling may appear elegant at first
glance, it is appropriate to reflect upon the limitations of the statistical techniques used to predict
the RMA characteristics of modern information systems.
Although the mathematics used in RMA models is becoming increasingly sophisticated, there is
a danger in placing too much confidence in these models. This is especially true when results can
be obtained by entering a few parameters into a computer tool without a clear understanding of
the assumptions embedded in the tool and the sensitivity of the model results to variations in the
input parameters. One of the most sensitive parameters in the model of a fault-tolerant system is
the coverage parameter. The calculated system reliability or availability is almost entirely
dependent on the value chosen for this parameter. Minor changes in the value of coverage cause
wide variations in the calculated results. Unfortunately, it is virtually impossible to predict the
coverage with enough accuracy to be useful.
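The sensitivity can be illustrated with the three-state model of Figure C-3. The sketch below approximates the system failure rate as the coverage-failure rate 2λ(1 − C) from [C-28] plus the spare-exhaustion rate of roughly 2λ²ρ; this simplification is ours, for illustration only:

```python
# Sensitivity of the predicted system MTBF to the coverage parameter C,
# for a two-needing-one configuration (element MTBF 1000 h, MTTR 0.5 h).
# Approximate system failure rate:
#   2*lam*(1 - C)      coverage failures, per [C-28]
# + 2*lam**2 * rho     exhaustion of spares (combinatorial term)
lam, rho = 0.001, 0.5

def system_mtbf(coverage: float) -> float:
    rate = 2 * lam * (1 - coverage) + 2 * lam**2 * rho
    return 1.0 / rate

for c in (1.0, 0.9999, 0.999, 0.99):
    print(f"C = {c}: MTBF = {system_mtbf(c):,.0f} hours")
```

A change in coverage from 1.0 to 0.99 (one percent) collapses the predicted MTBF from about one million hours to under fifty thousand, which is precisely why unverifiable coverage estimates dominate the model output.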
This raises a more fundamental issue with respect to RMA modeling and verification. The
theoretical basis of RMA modeling rests on the assumptions of constant failure rates and the
statistical independence of physical failures of hardware components. The model is a
"steady-state" representation of a straightforward physical situation.
Physical failures are no longer the dominant factor in system reliability and availability
predictions. Latent, undiscovered design defects created by human beings during the
development of the system now predominate as causes of failures. Although attempts have been
made to incorporate these effects into the models, some fundamental problems remain. First,
conventional reliability modeling used an empirical database of component failure history to
predict the reliability of systems constructed with those components. Historical data has been
collected on software fault density (number of faults/KSLOC). It is difficult, however, to
translate this data into meaningful predictions of system behavior without knowing how the
faults affect the system and how often the code containing a fault will be executed.
Secondly, the reliability of a computer system is not fundamentally a steady-state situation, but a
reliability growth process of finding and fixing latent design defects. Although some have argued
that software reliability eventually reaches a steady state in which fixing problems introduces an
equal number of new problems, there is no practical way to relate the latent fault density of the
software (i.e., bugs/KSLOC) to its run-time failure rate (i.e., failures/hour). Although there have
been some academic attempts to relate fault density to the run-time performance of the software,
the usefulness of such predictions is questionable. They are of little value in acquiring and
accepting new systems, and uncertainty concerning the frequency of software upgrades and
modifications makes the prediction of steady-state software reliability of fielded systems
problematic.
Because of the questionable realism and accuracy of RMA predictions from sophisticated RMA
models, it is neither necessary nor desirable to make the allocation of NAS requirements to
systems unnecessarily complicated. Performing an allocation should not require a Ph.D. in
statistics or a contract with a consulting firm. Accordingly, throughout the development of the
allocation process, the objective has been to make the process understandable, straightforward,
and simple, so that a journeyman engineer can perform the allocations.
FORMAL RELIABILITY DEMONSTRATION TEST
PARAMETERS
MIL-STD-781 defines the procedures and provides a number of suggested test plans for
conducting reliability demonstration tests. The statistics and equations defining the
characteristics of test plans are also presented. For a detailed description of formal reliability
testing, the reader is referred to MIL-STD-781.
This appendix summarizes the fundamental statistical limitations underlying the testing issues
introduced in Appendix C. Figure D-1 illustrates what is known in statistics as an Operating
Characteristic (OC) curve. The OC curve presents the probability of accepting a system versus
multiples of the test MTBF. An ideal reliability qualification test would have the characteristics
of the heavy dashed line. Unfortunately, basing the decision to accept or reject a system on a
limited sample collected during a test of finite duration yields an OC curve more like the other
curve shown, which represents a test scenario from MIL-STD-781 in which the system is tested
for 4.3 times the required MTBF. The system is accepted if it incurs two or fewer failures during
the test period and rejected if it incurs three or more failures.
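The accept probability for this fixed-length test can be computed from the Poisson distribution: with test time T = 4.3 θ1 and true MTBF m, the expected number of failures is T/m, and the system is accepted if two or fewer occur. The sketch below (ours, not taken from MIL-STD-781) yields roughly 20% and 17 to 18%, consistent with the consumer's and producer's risks quoted for this plan:

```python
import math

def accept_probability(true_mtbf_multiple: float,
                       test_time: float = 4.3,
                       max_failures: int = 2) -> float:
    """P(accept): Poisson probability of max_failures or fewer failures,
    with all times expressed in multiples of the lower test MTBF."""
    mean = test_time / true_mtbf_multiple
    return sum(math.exp(-mean) * mean**k / math.factorial(k)
               for k in range(max_failures + 1))

beta = accept_probability(1.0)        # consumer's risk at the minimum MTBF
alpha = 1 - accept_probability(3.0)   # producer's risk at 3x the minimum
print(f"consumer's risk ~ {beta:.1%}, producer's risk ~ {alpha:.1%}")
```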
Figure D-1 Operating Characteristic Curves
An explanation of the important points on an OC curve is illustrated in Figure D-2. This curve
represents the fixed-length test described in the preceding paragraph.
There are two types of incorrect decisions that can occur when an acceptance decision is based
on a limited data sample. The Type I error occurs when a “good” system whose true reliability
meets or exceeds the requirements fails the test. The probability of occurrence of the Type I error
is given by α and is known as the producer’s risk. The Type II error occurs when the test passes a
“bad” system whose true MTBF is below the minimum acceptable requirement. The probability
of the Type II error is given by β and is known as the consumer’s risk.
Figure D-2 Risks and Decision Points Associated with OC Curve
The OC curve graphically illustrates the primary reliability test design parameters. The region
below θ1 is the rejection region. The region above θ0 is the acceptance region. The region
between θ1 and θ0 is an uncertain region in which the system is neither bad enough to demand
rejection nor good enough to demand acceptance. With a discrimination ratio of 3, the
contractor still has an 18% probability of failing the reliability demonstration test even if the true
MTBF of the system is three times the requirement.
In order to balance the producer’s and consumer’s risks, it is necessary to establish two points on
the OC curve. The first point is a lower test MTBF (θ1) that represents the minimum value
acceptable to the Government. The probability of accepting a system that does not meet the
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5
Pro
ba
bilit
y o
f A
cc
ep
tin
g
Multiples of Minimum Required MTBF
Operating Characteristic (OC) Curve
Discrimination Ratio = 3
Producer's Risk
α = 18%
Consumer's Riskβ = 20%
SpecifiedValue θ0
Minimum AcceptableValue θ1
β α
FAA Reliability, Maintainability, and Availability (RMA) Handbook
FAA RMA-HDBK-006C V1.1
185
FAA’s minimum acceptable value is β and represents the risk (in this example 20%) to the
Government of accepting a “bad” system.
The second point is the upper test MTBF (θ0) that represents the specified value of MTBF. The
probability of accepting a system at this point is (1- α) and the probability of rejecting a “good”
system, α, represents the risk to the contractor, in this example, 18%.
The ratio θ0/ θ1 is known as the discrimination ratio, in this case, 3. This example assumes a fixed
test duration of 4.3 times the lower test MTBF. To more closely approach the ideal case in the
previous figure where the discrimination ratio is one and both risks are zero, the test duration
must be significantly increased.
Figure D-3 illustrates the effect of increasing the test duration on the OC curve. The OC curves
are based on a selection of fixed length tests from MIL-STD-781. The test times associated with
each of the curves expressed as multiples of the lower test MTBF are as follows:
XVII = 4.3
XV = 9.3
XI = 21.1
IX = 45
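The test durations above translate directly into calendar time. A minimal sketch of the arithmetic (the 20,000-hour MTBF example is taken from the discussion that follows; the hours-per-year constant is an assumption):

```python
# Convert each MIL-STD-781 test-duration multiple into calendar years of
# test exposure for a system whose lower test MTBF is 20,000 hours.
HOURS_PER_YEAR = 8760  # assumed: 24 h/day x 365 days

mtbf_hours = 20_000
for plan, multiple in [("XVII", 4.3), ("XV", 9.3), ("XI", 21.1), ("IX", 45.0)]:
    years = multiple * mtbf_hours / HOURS_PER_YEAR
    print(f"Test Plan {plan}: {years:.1f} years of test exposure")
```

Plan XVII works out to roughly 10 years and Plan IX to roughly 103, consistent with the durations discussed in this appendix.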
Figure D-3 Effect of Increasing Test Time on OC Curve
[Chart: Probability of Accepting vs. Multiples of Minimum Required MTBF for test plans XVII, XV, XI, and IX]
The steepest curve has a discrimination ratio of 1.5, a consumer's risk of 9.9%, and a producer's risk of 12%. These reduced risks come at a significant price, however. With a test duration multiple of 45, testing an MTBF requirement of 20,000 hours for a modern fault-tolerant system would require 100 years of test exposure. This would require either testing a single system for 100 years or testing a large number of systems for a shorter time, both of which are impractical.
A general rule of thumb in reliability qualification testing is that the test time should be at least
ten times the specified MTBF, which is still impractical for high reliability systems. Even the
shortest test duration of the MIL-STD-781 standard test scenarios would require ten years of test
time for a 20,000 hour MTBF system. Below this test duration, the test results are virtually
meaningless.
While there are many different ways of looking at the statistics of this problem, they all lead to
the same point: to achieve a test scenario with risk levels that are acceptable to both the
Government and the contractor for a high reliability system requires an unacceptably long test
time.
The above arguments are based on conventional textbook statistics theory. They do not address another practical reality: modern software systems are not amenable to fixed reliability qualification tests. Software is dynamic; enhancements, program trouble reports, patches, and so on present an ever-changing situation that must be effectively managed. The only
practical alternative is to pursue an aggressive reliability growth program and deploy the system
to the field only when it is more stable than the system it will replace. Formal reliability
demonstration programs such as those used for the electronic “black boxes” of the past are no
longer feasible.
The OC curves in the charts in this appendix are based on standard fixed length test scenarios
from MIL-STD-781 and calculated using Excel functions as described below.
The statistical properties of a fixed duration reliability test are based on the Poisson distribution.
The lower tail cumulative Poisson distribution is given by
P(ac|θ) = Σ (k = 0 to c) (T/θ)^k e^(-T/θ) / k!          [E-1]
Where
P(ac|θ) = the probability of accepting a system whose true MTBF is θ.
c = maximum acceptable number of failures
θ = True MTBF
θ0 = Upper test MTBF
θ1 = Lower test MTBF
T = Total test time
The Excel statistical function "POISSON (x, mean, cumulative)" returns the function in Equation [E-1] when the following parameters are substituted in the Excel function:
POISSON (c, T/θ, TRUE)
The charts were calculated using the fixed values for c and T associated with each of the sample
test plans from MIL-STD-781. The x axis of the charts is normalized to multiples of the lower
test MTBF, θ1, and covers a range of 0.1 to 5.0 times the minimum MTBF acceptable to the
Government.
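The same calculation can be reproduced outside Excel. A minimal sketch in Python, assuming the fixed-length test example used in this appendix (c = 2 failures, T = 4.3 × θ1); the function name is illustrative:

```python
import math

def prob_accept(c, T, mtbf):
    """Lower-tail cumulative Poisson (Equation E-1): probability of
    observing c or fewer failures in total test time T when the true
    MTBF is mtbf. Equivalent to Excel's POISSON(c, T/mtbf, TRUE)."""
    m = T / mtbf  # expected number of failures during the test
    return sum(math.exp(-m) * m**k / math.factorial(k) for k in range(c + 1))

# Fixed-length test: duration 4.3 x the lower test MTBF, accept on <= 2 failures.
theta1 = 1.0                # normalize the lower test MTBF to 1
T, c = 4.3 * theta1, 2
beta = prob_accept(c, T, theta1)            # consumer's risk at theta = theta1
alpha = 1 - prob_accept(c, T, 3 * theta1)   # producer's risk at theta = 3*theta1
print(f"beta = {beta:.1%}, alpha = {alpha:.1%}")
# beta ~ 19.7% and alpha ~ 17.5%, which the appendix rounds to 20% and 18%
```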
SERVICE THREAD DIAGRAM AND DEFINITIONS
SERVICE THREADS: Service threads are strings of systems/functions that support one or more
of the NAS EA Functions. These service threads represent specific data paths (e.g. radar
surveillance data) to air traffic specialists or pilots. The threads are defined in terms of narratives
and Reliability Block Diagrams depicting the systems that comprise them. They are based on the
reportable services defined in NAPRS.19,20
Service Thread Diagrams are developed using the following template:
1) SERVICE THREAD NAME – Named Service Thread per Table 6-TBD
2) DOMAIN NAME: For example –
Enterprise
Navigation
En Route
Terminal
Oceanic
Flight Service Station
19 Note that some new service threads have been added to the set of NAPRS services, and some of the NAPRS services that are components of higher-level threads have been removed.
20 See Section 7 for a detailed discussion of the service thread concept.
3) SYSTEM/FUNCTION BLOCK DIAGRAM – Depicts the systems/functions and their relationships in block-diagram form.
4) NAS-RD-2013 TRACE – Provides a mapping to the associated NAS-RD-2013 enterprise-level functional requirements per Table TBD.
5) SERVICE THREAD LOSS SEVERITY CATEGORY (STLSC): Each service thread
is assigned a Service Thread Loss Severity Category based on the severity of impact that
loss of the thread could have on the safe and/or efficient operation and control of aircraft.
(See Section 7.4 for a detailed discussion of the STLSC concept.) The Service Thread
Loss Severity Categories are:
Safety-Critical - A key service in the protection of human life. Loss of a Safety-
Critical service increases the risk in the loss of human life.
Efficiency-Critical - A key service that is used in present operation of the NAS. Loss
of an Efficiency-Critical Service has a major impact in the present operational
capacity.
Essential - A service that if lost would significantly raise the risk associated with
providing efficient NAS operations.
Routine - A service which, if lost, would have a minor impact on the risk associated
with providing safe and efficient NAS operations.
Within each STLSC, Service Threads can also be characterized as Remote/Distributed, where loss of a service thread element (e.g., a radar, an air/ground communications site, or a display console) would incrementally degrade the overall effectiveness of the service thread but would not render the service thread inoperable.21
21 Remote/Distributed STLSC is specific to this Handbook and does not appear in NAS-RD-2013.
Appendix E Table of Figures
Figure E-1 Functional Notional Architecture
Figure E-2 Automatic Dependent Surveillance Service (ADSS)
Figure E-3 Airport Surface Detection Equipment (ASDES)
Figure E-4 Beacon Data (Digitized) (BDAT)
Figure E-5 Backup Emergency Communications Service (BUECS)
Figure E-6 Composite Flight Data Processing Service (CFAD)
Figure E-7 Composite Oceanic Display and Planning Service (CODAP)
Figure E-8 Anchorage Composite Offshore Flight Data Service (COFAD)
Figure E-9 Composite Radar Data Processing Service (CRAD) (CCCH/EBUS)
Figure E-10 Composite Radar Data Processing Service (CRAD) (EAS/EBUS)
Figure E-11 En Route Communications (ECOM)
Figure E-12 En Route Terminal Automated Radar Service (ETARS)
Figure E-13 FSS Communications Service (FCOM)
Figure E-14 Flight Data Entry and Printout Service (FDAT)
Figure E-15 Flight Data Input/Output Remote (FDIOR)
Figure E-16 Flight Service Station Automated Service (FSSAS)
Figure E-17 Interfacility Data Service (IDAT)
Figure E-18 Low Level Wind Service (LLWS)
Figure E-19 MODE-S Data Link Data Service (MDAT)
Figure E-20 MODE-S Secondary Radar Service (MSEC)
Figure E-21 NADIN Service Threads (NADS, NAMS, NDAT)
Figure E-22 Radar Data (Digitized) (RDAT)
Figure E-23 Remote Monitoring/Maintenance Logging System Service (RMLSS)
Figure E-24 Remote Tower Alphanumeric Display System Service (RTADS)
Figure E-25 Remote Tower Display Service (RTDS)
Figure E-26 Runway Visual Range Service (RVRS)
Figure E-27 Terminal Automated Radar Service (TARS)
Figure E-28 Terminal Communications Service (TCOM)
Figure E-29 Terminal Doppler Weather Radar Service (TDWRS)
Figure E-30 Traffic Flow Management System Service (TFMSS)
Figure E-31 Terminal Radar Service (TRAD)
Figure E-32 Terminal Surveillance Backup (TSB) (NEW)
Figure E-33 Terminal Secondary Radar (TSEC)
Figure E-34 Terminal Voice Switch (TVS) (NAPRS Facility)
Figure E-35 Terminal Voice Switch Backup (TVSB) (New)
Figure E-36 Visual Guidance Service (VGS)
Figure E-37 VSCS Training and Backup System (VTABS) (NAPRS Facility)
Figure E-38 Voice Switching and Control System Service (VSCSS)
Figure E-39 WAAS/GPS Service (WAAS)
Figure E-40 WMSCR Data Service (WDAT)
Figure E-41 WMSCR Service Threads (WMSCR)
Figure E-42 R/F Approach and Landing Services
Figure E-43 R/F Navigation Service
Figure E-44 ARTCC En Route Communications Services (Safety-Critical Thread Pair)
Figure E-45 Terminal Voice Communications Safety-Critical Service Thread Pair
Figure E-46 Terminal Surveillance Safety-Critical Service Thread Pair (1)
Figure E-47 Terminal Surveillance Safety-Critical Service Thread Pair (2)
Figure E-1 Functional Notional Architecture
[Diagram: a notional hierarchy relating the top-level NAS service ("Transport people and goods from point A to point B in a safe and secure manner") to NAS EA functions such as Assure Separation, Manage Airspace, Provide ATC Advisories, Manage Traffic Flow, Perform TM Synchronization, Identify Emergency Service, Support Navigation, Manage Flight Planning, and Manage NAS Infrastructure; to the umbrella and regular service threads that support them (e.g., ECVEX, CRAD, CFAD, TCVEX, and their constituent threads); and to the underlying Air Traffic and Tech Ops systems, each labeled with its STLSC (Safety-Critical, Efficiency-Critical, Essential, or Routine)]
The Service Threads detailed in Figures E-2 through E-47 were selected based on the criticality and broad application of the services represented. These threads are not currently represented in the NAS EA; however, ANG is currently developing a Functional Architecture that will link systems to high-level NAS functions through Tech Ops NAPRS/FSEP services. Figure E-1 represents a notional hierarchy structure relating function, service, and systems. Future editions of this Handbook will align with this structure.
[Block diagram — Domain: Terminal; STLSC: Efficiency-Critical; NAS-RD-2013 trace: 3.1.2.1 Provide Separation Management, 3.1.1.4 Provide Weather Information, 3.1.2.2 Provide Trajectory Management; elements include GBT, SBS, ADSS automation, TARS, and ATC displays]
Figure E-2 Automatic Dependent Surveillance Service (ADSS)
[Block diagram — Domain: Tower; STLSC: Essential; NAS-RD-2013 trace: 3.1.2.1 Provide Separation Management, 3.1.2.2 Provide Trajectory Management; element: ASDE]
Figure E-3 Airport Surface Detection Equipment (ASDES)
[Block diagram — Domain: En Route; STLSC: Efficiency-Critical; NAS-RD-2013 trace: 3.1.2.1 Provide Separation Management; elements include the ATCRB/MODE-S remote radar site, CD, and EAS at the ARTCC, with broadband (ESEC) and digital (BDAT) paths]
Figure E-4 Beacon Data (Digitized) (BDAT)
[Block diagram — Domain: En Route; STLSC: Efficiency-Critical; NAS-RD-2013 trace: 3.1.2.1 Provide Separation Management; element: BUEC]
Figure E-5 Backup Emergency Communications Service (BUECS)
[Block diagrams (two configurations, CCCH and EAS) — Domain: En Route; STLSC: Efficiency-Critical; NAS-RD-2013 trace: 3.1.1.2 Provide Flight and State Data Management; elements include CCCH/EAS, DSR, ARTS, TMCC, other ARTCCs, FSPs, and the FDIOR, FDAT, and IDAT threads]
Figure E-6 Composite Flight Data Processing Service (CFAD)
[Block diagram — Domain: Oceanic; STLSC: Efficiency-Critical; NAS-RD-2013 trace: 3.1.1.2 Provide Flight and State Data Management; element: ODAPS]
Figure E-7 Composite Oceanic Display and Planning Service (CODAP)
[Block diagram — Domain: Oceanic; STLSC: Efficiency-Critical; NAS-RD-2013 trace: 3.1.1.2 Provide Flight and State Data Management; elements: OCS FDP, OFDPS]
Figure E-8 Anchorage Composite Offshore Flight Data Service (COFAD)
[Block diagram — Domain: En Route; STLSC: Efficiency-Critical; NAS-RD-2013 trace: 3.1.2.1 Provide Separation Management, 3.1.1.4 Provide Weather Information, 3.1.2.2 Provide Trajectory Management; elements: surveillance subsystems, comm links, CCCH, EBUS, LCN, and the DSR display subsystem; noted as being replaced by CRAD/EAS]
Figure E-9 Composite Radar Data Processing Service (CRAD) (CCCH/EBUS)
[Block diagram — Domain: En Route; STLSC: Efficiency-Critical; NAS-RD-2013 trace: 3.1.2.1 Provide Separation Management, 3.1.1.4 Provide Weather Information, 3.1.2.2 Provide Trajectory Management; elements: surveillance subsystems, comm links, EAS, EBUS, and the DSR display subsystem; noted as replacing CRAD/CCCH]
Figure E-10 Composite Radar Data Processing Service (CRAD) (EAS/EBUS)
[Block diagram — Domain: En Route; STLSC: Efficiency-Critical; NAS-RD-2013 trace: 3.1.2.1 Provide Separation Management; elements: RCAG (Remote Center Air/Ground Communications Facility) and the ARTCC]
Figure E-11 En Route Communications (ECOM)
[Block diagram — Domain: Terminal; STLSC: Efficiency-Critical; NAS-RD-2013 trace: 3.1.2.1 Provide Separation Management, 3.1.1.4 Provide Weather Information; elements include ASR, ARSR, ATCRB, terminal automation (ARTS), and the TSEC and TRAD threads]
Figure E-12 En Route Terminal Automated Radar Service (ETARS)
[Block diagram — Domain: Flight Service Station; STLSC: Essential; NAS-RD-2013 trace: 3.1.1.4 Provide Weather Information; elements: FSS/AFSS and remote communications outlets (RCO)]
Figure E-13 FSS Communications Service (FCOM)
[Block diagram — Domain: En Route; STLSC: Efficiency-Critical; NAS-RD-2013 trace: 3.1.1.2 Provide Flight and State Data Management; elements: EAS serving ARTCC, CERAP, TRACON, RAPCON, and ATCT]
Figure E-14 Flight Data Entry and Printout Service (FDAT)
[Block diagram — Domain: En Route; STLSC: Essential; NAS-RD-2013 trace: 3.1.1.2 Provide Flight and State Data Management, 3.1.1.4 Provide Weather Information; elements: EAS serving ARTCC, CERAP, TRACON, RAPCON, and ATCT]
Figure E-15 Flight Data Input/Output Remote (FDIOR)
[Block diagram — Domain: Flight Service Station; STLSC: Essential; NAS-RD-2013 trace: 3.1.1.2 Provide Flight and State Data Management, 3.1.1.4 Provide Weather Information; elements: FSDPS at the ARTCC and multiple AFSS facilities]
Figure E-16 Flight Service Station Automated Service (FSSAS)
[Block diagram — Domain: En Route; STLSC: Efficiency-Critical; NAS-RD-2013 trace: 3.1.1.2 Provide Flight and State Data Management, 3.1.2.2 Provide Trajectory Management; elements: interfacility links among ARTCC, TRACON, and TMCC]
Figure E-17 Interfacility Data Service (IDAT)
[Block diagram — Domain: Tower; STLSC: Essential; NAS-RD-2013 trace: 3.1.1.4 Provide Weather Information; element: LLWAS]
Figure E-18 Low Level Wind Service (LLWS)
[Block diagram — Domain: En Route; STLSC: Efficiency-Critical; NAS-RD-2013 trace: 3.1.2.1 Provide Separation Management; elements: ATCRB/MODE-S remote radar site, Data Link Processor, and EAS, connected via RCL/RML/FAA lines]
Figure E-19 MODE-S Data Link Data Service (MDAT)
[Block diagram — Domain: En Route; STLSC: Efficiency-Critical; NAS-RD-2013 trace: 3.1.2.1 Provide Separation Management; elements: ATCRB/MODE-S remote radar site and EAS, connected via RCL/RML/FAA lines]
Figure E-20 MODE-S Secondary Radar Service (MSEC)
[Block diagram — Domain: Enterprise; STLSC: Efficiency-Critical; NAS-RD-2013 trace: 3.1.1.2 Provide Flight and State Data Management, 3.1.1.4 Provide Weather Information; elements: NADIN switches at Atlanta and Salt Lake (NADS) and NADIN concentrators (NAMS)]
Figure E-21 NADIN Service Threads (NADS, NAMS, NDAT)
[Block diagram — Domain: En Route; STLSC: Efficiency-Critical; NAS-RD-2013 trace: 3.1.2.1 Provide Separation Management; elements: CARSR remote radar site, CD, and EAS at the ARTCC, with broadband (ERAD) and digital (RDAT) paths]
Figure E-22 Radar Data (Digitized) (RDAT)
[Block diagram — Domain: Enterprise; STLSC: Essential; NAS-RD-2013 trace: 3.1.3 Provide Category: Mission Support Services; elements: remote facilities and the MPS, MASS, and MMS at the ARTCC]
Figure E-23 Remote Monitoring/Maintenance Logging System Service (RMLSS)
[Block diagram — Domain: Terminal; STLSC: Efficiency-Critical; NAS-RD-2013 trace: 3.1.2.1 Provide Separation Management, 3.1.2.2 Provide Trajectory Management; elements include ASR, ATCRB, TRACON automation (ARTS), TML video comp, and ATC displays at a satellite tower, connected via RCL/RML/FAA lines]
Figure E-24 Remote Tower Alphanumeric Display System Service (RTADS)
[Block diagram — Domain: Terminal; STLSC: Efficiency-Critical; NAS-RD-2013 trace: 3.1.2.1 Provide Separation Management, 3.1.2.2 Provide Trajectory Management; elements include ASR, ATCRB, TRACON automation (ARTS), TML video comp, and ATC displays at a satellite tower, connected via RCL/RML/FAA lines]
Figure E-25 Remote Tower Display Service (RTDS)
[Block diagram — Domain: Tower; STLSC: Essential; NAS-RD-2013 trace: 3.1.2.1 Provide Separation Management; element: RVR]
Figure E-26 Runway Visual Range Service (RVRS)
[Block diagram — Domain: Terminal; STLSC: Efficiency-Critical; NAS-RD-2013 trace: 3.1.2.1 Provide Separation Management, 3.1.1.4 Provide Weather Information, 3.1.2.2 Provide Trajectory Management; shows TRAD/TSEC broadband backup to TARS for the Safety-Critical terminal radar service; elements include ASR, ATCRB, TRACON automation (ARTS), and displays, connected via RCL/RML/FAA lines]
Figure E-27 Terminal Automated Radar Service (TARS)
[Block diagram — Domain: Tower; STLSC: Efficiency-Critical; NAS-RD-2013 trace: 3.1.2.1 Provide Separation Management, 3.1.2.2 Provide Trajectory Management; elements: Airport Traffic Control Tower (ATCT) and remote transmitter/receivers (RTR), connected via RCL/RML/FAA lines]
Figure E-28 Terminal Communications Service (TCOM)
[Block diagram — Domain: Terminal; STLSC: Essential; NAS-RD-2013 trace: 3.1.1.4 Provide Weather Information; element: TDWR]
Figure E-29 Terminal Doppler Weather Radar Service (TDWRS)
[Block diagram — Domain: Enterprise; STLSC: Efficiency-Critical; NAS-RD-2013 trace: 3.1.2.3 Provide Flow Contingency Management; element: TMCC]
Figure E-30 Traffic Flow Management System Service (TFMSS)
[Block diagram — Domain: Terminal; STLSC: Efficiency-Critical; NAS-RD-2013 trace: 3.1.2.1 Provide Separation Management; elements include ASR, TRACON automation (ARTS), and displays, connected via RCL/FAA lines]
Figure E-31 Terminal Radar Service (TRAD)
[Block diagram — Domain: Terminal; NAS-RD-2013 trace: 3.1.2.1 Provide Separation Management, 3.1.1.4 Provide Weather Information, 3.1.2.2 Provide Trajectory Management; elements: Terminal Voice Switch and Terminal Surveillance Backup]
Figure E-32 Terminal Surveillance Backup (TSB) (NEW)
[Block diagram — Domain: Terminal; STLSC: Efficiency-Critical; NAS-RD-2013 trace: 3.1.2.1 Provide Separation Management; elements include ATCRB, TRACON automation (ARTS), and displays, connected via RCL/FAA lines]
Figure E-33 Terminal Secondary Radar (TSEC)
[Block diagram — Domain: Terminal; STLSC: Efficiency-Critical; NAS-RD-2013 trace: 3.1.2.1 Provide Separation Management; element: Terminal Voice Switch]
Figure E-34 Terminal Voice Switch (TVS) (NAPRS Facility)
[Block diagram — Domain: Terminal; STLSC: Efficiency-Critical; NAS-RD-2013 trace: 3.1.2.1 Provide Separation Management; elements: Terminal Voice Switch and Terminal Voice Switch Backup]
Figure E-35 Terminal Voice Switch Backup (TVSB) (New)
[Block diagram — Domain: Tower; NAS-RD-2013 trace: 3.2.3 Provide Navigation Support (Enterprise); elements: visual guidance systems]
Figure E-36 Visual Guidance Service (VGS)
[Block diagram — Domain: En Route; STLSC: Efficiency-Critical; NAS-RD-2013 trace: 3.1.2.1 Provide Separation Management, 3.1.1.4 Provide Weather Information, 3.1.2.2 Provide Trajectory Management; element: VTABS]
Figure E-37 VSCS Training and Backup System (VTABS) (NAPRS Facility)
[Block diagram — Domain: En Route; STLSC: Efficiency-Critical; NAS-RD-2013 trace: 3.1.2.1 Provide Separation Management, 3.1.1.4 Provide Weather Information, 3.1.2.2 Provide Trajectory Management; elements: redundant VSCS at the ARTCC serving controllers, with the ECOM and BUECS services reaching RCAG and BUEC sites via TELCO/RML/RCL]
Figure E-38 Voice Switching and Control System Service (VSCSS)
[Block diagram — Domain: Navigation; STLSC: Essential; NAS-RD-2013 trace: 3.2.3 Provide Navigation Support (Enterprise); element: WAAS/GPS]
Figure E-39 WAAS/GPS Service (WAAS)
[Block diagram — Domain: Enterprise; STLSC: Essential; NAS-RD-2013 trace: 3.1.1.4 Provide Weather Information; elements: WMSCR at Salt Lake City and Atlanta]
Figure E-40 WMSCR Data Service (WDAT)
[Block diagram — Domain: Enterprise; STLSC: Essential; NAS-RD-2013 trace: 3.1.1.4 Provide Weather Information; elements: WMSCR at Atlanta and Salt Lake connected through NADIN switches]
Figure E-41 WMSCR Service Threads (WMSCR)
[Block diagram — Domain: Tower; STLSC: Efficiency-Critical; NAS-RD-2013 trace: 3.2.3 Provide Navigation Support (Enterprise); elements: R/F approach and landing systems]
Figure E-42 R/F Approach and Landing Services
[Block diagram — Domain: Navigation; STLSC: Efficiency-Critical; NAS-RD-2013 trace: 3.2.3 Provide Navigation Support (Enterprise); elements: R/F navigation systems]
Figure E-43 R/F Navigation Service
[Block diagram — Domain: En Route; STLSC: Efficiency-Critical; elements: redundant VSCS with VTABS backup, serving RCAG sites (ECOM service) and BUEC sites (BUECS service) via TELCO/RML/RCL]
Figure E-44 ARTCC En Route Communications Services (Safety-Critical Thread Pair)
[Block diagram — Domain: Terminal; STLSC: Efficiency-Critical; elements: Terminal Voice Switch and Terminal Voice Switch Backup, each serving remote transmitter/receivers (RTR) via the TCOM thread]
Figure E-45 Terminal Voice Communications Safety-Critical Service Thread Pair
[Block diagram — Domain: Terminal; STLSC: Efficiency-Critical; elements: TARS automation and displays with TRAD/TSEC inputs]
Figure E-46 Terminal Surveillance Safety-Critical Service Thread Pair (1)
[Block diagram — Domain: Terminal; STLSC: Efficiency-Critical; elements: TARS automation and displays with TRAD/TSEC inputs and the Terminal Surveillance Backup]
Figure E-47 Terminal Surveillance Safety-Critical Service Thread Pair (2)
EVOLUTION OF THE FAA RMA PARADIGM
The tools and techniques that are the foundation for reliability management were developed in the late 1950s and early 1960s. In that timeframe, the pressures of the Cold War and the space race led to increasing complexity of electronic equipment, which in turn created reliability problems that were exacerbated by the use of these "equipments" in missile and space applications that did not permit repair of failed hardware. This section examines the traditional approaches to RMA specification and verification and describes how changes that have occurred over the past four decades have created a need for a dramatic change in the way these RMA issues are addressed.
F.1 The Traditional RMA Paradigm
The FAA has traditionally viewed RMA requirements in a legalistic sense. The requirements
have been part of legally binding contracts with which contractors must comply.
Because actual RMA performance can only be determined after a system is installed, a
contractor’s prospective ability to comply with RMA requirements was evaluated using the
predictions of models. Reliability predictions were based on the numbers of discrete components
used in these systems and their associated failure rates. A catalog of failure rates for standard
components was published in MIL-HDBK-217[37]. These failure rates were based on hundreds
of thousands of hours of operating time.
The predicted reliability of equipment still under development was estimated by extrapolating
from attested failure rates with adjustments reflecting the numbers and types of components used
in the new piece of equipment. If the predicted reliability was unacceptable, engineers used
various screening techniques to try to reduce the failure rates of the components. These
compensatory efforts generally increased the costs of equipment built to military specifications,
and despite efforts to improve reliability, complex electronic equipment often had MTBFs of
fewer than 1,000 hours.
To verify that electronic equipment was compliant with the specified reliability, several
preproduction models were often placed in a sealed room for a period of time. There, statistical
decision methods, as described in MIL-STD-781[38], were employed to decide whether the
requirements actually were met and the design was suitable for release to full production.
Maintainability requirements were verified by statistical techniques such as those defined in
MIL-HDBK-472 [39]. These techniques involved statistically combining component failure rates
with actually measured times to identify, remove, and replace a sample of inserted failed
components.
The military standards and handbooks that defined the statistical methods used for predicting and
verifying reliability and maintainability were based on well-established concepts that could be
found in any introductory textbook on engineering statistics.
F.2 Agents of Change
Several factors have tended to make the traditional paradigm obsolete. Among these are:
Dramatic increases in system reliability resulting from a combination of hardware
technology advances, the use of redundancy, and the application of fault tolerance
techniques.
Fundamental statistical limitations associated with reliability prediction and
verification for high reliability systems.
Difficulties associated with the use of availability as a contractual specification.
Increased use of software-intensive digital systems.
Emphasis on the use of Commercial Off-the-Shelf (COTS) hardware.
The implications of these changes on traditional RMA practices and policies are discussed
below.
F.2.1 Technology and Requirements Driven Reliability Improvements
Since the 1960’s, advances in microelectronics and large scale integration have increased the
reliability of digital hardware by almost two orders of magnitude. When the FAA first began to
acquire digital systems in the 1960’s, the hardware elements typically had MTBFs of around
1,000 hours. Over the years, technology advancements in integrated circuits have yielded
dramatic improvements in the reliability of digital hardware. Greater use of automation in critical
applications increased the reliability requirements for these systems, and the increased
requirements exceeded the improvements resulting from microelectronics technology advances
alone. Redundancy and fault tolerance techniques were employed to further increase system
reliability. Figure F-1 summarizes National Airspace Performance Reporting System (NAPRS)
data for FY1999 through FY 2004 that illustrates the dramatic improvement of system reliability
in the past 40 years.
Figure F-1 FAA System Reliability Improvements
[Chart: MTBF (hours), 0 to 100,000, versus year, 1960-2005, showing reliability advances across the DSR, DCCR, CCCH, DARC, ARTS, and CDCDCC systems]
F.2.2 Fundamental Statistical Limitations
Since statistical methods were first applied to RMA modeling, allocation, prediction, and
verification, there has been an exponential growth in the reliability of digital hardware. There has
also been a related growth in demand for higher reliability in systems for use in critical ATC
applications. This exponential growth in the reliability of FAA systems has certainly benefited
their users, but it also has created significant challenges to those who specify, predict, and verify
the RMA characteristics of these systems. Conventional statistical methods, those that have
traditionally been used for these purposes, simply do not scale well to high levels of reliability.
F.2.2.1 Reliability Modeling
Forty years ago, the use of digital computers to process surveillance data and other important
real-time and near-real-time operations created a demand for more reliable systems. When the
FAA began to acquire NAS En Route Stage A in the early 1960’s, the IBM 360 series computer
elements that were used in the Central Computer Complex had an MTBF on the order of 1000
hours, and the combined MTBF of all of the elements in the Central Computer Complex (CCC)
was predicted to be approximately 60 hours. In contrast, the required reliability for the NAS
Stage A CCC was 10,000 hours. The FAA and IBM had to try to achieve this unheard-of level of
reliability with hardware elements whose reliability was an order of magnitude less than the
requirement. They found a way: together, they pioneered the use of redundancy and automatic
fault detection and recovery techniques.
The reliability of the CCC was predicted using a set of Markov-based combinatorial equations
developed by S. J. Einhorn in 1963. Drawing on the element MTBF and MTTR together with the
amount of redundancy, the equations predicted the reliability of repairable redundant
configurations of identical elements. They modeled the average time until the supply of spare
elements was depleted and the system failed, accounting for the repair and return to service of
failed elements. Einhorn’s equations were based
solely on the combinatorial probability of running out of spare elements and assumed perfect
fault coverage, perfect software, and perfect switchover. They did not address the effectiveness
of the automatic fault detection and recovery mechanisms or the effect of software failures on the
predicted MTBF.
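A purely combinatorial model of the kind described can be sketched under the same idealizing assumptions the text notes: independent, identical, repairable elements; perfect fault coverage and switchover; and no software failures. The element numbers below are illustrative only, not the actual CCC values.

```python
from math import comb

def element_availability(mtbf: float, mttr: float) -> float:
    """Steady-state availability of one repairable element."""
    return mtbf / (mtbf + mttr)

def system_availability(n: int, m: int, mtbf: float, mttr: float) -> float:
    """P(at least m of n identical, independent elements are up).

    Purely combinatorial, Einhorn-style: the only failure mechanism
    modeled is running out of spare elements; coverage, switchover,
    and software are all assumed perfect.
    """
    a = element_availability(mtbf, mttr)
    return sum(comb(n, k) * a**k * (1 - a)**(n - k) for k in range(m, n + 1))

# Illustrative numbers: 3 elements, 2 required to operate,
# element MTBF 1,000 hours, MTTR 4 hours.
print(system_availability(3, 2, 1000.0, 4.0))
```

As the text goes on to argue, the optimism of such a model lies precisely in the assumptions it leaves out.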
A simple sensitivity analysis of the effectiveness parameter for automatic fault detection and
recovery mechanisms yielded a graph such as the one shown in Figure F-2. At 100%
effectiveness the CCC 10,000 hour reliability requirement is exceeded by 50%. At 0%
effectiveness, the predicted CCC reliability would be 60 hours. The true MTBF lies somewhere
in between, but the reliability falls so quickly when the fault detection and recovery effectiveness
is less than perfect, that it is virtually impossible to predict it with enough accuracy to be useful.
Developers included fault handling provisions for all known failure classes in an attempt to
achieve 100% effectiveness, but they knew that, because an unknown number of failure modes
remained undiscovered, they were unlikely to reach this goal.22
The models used to predict reliability for fault-tolerant systems were so sensitive to the
effectiveness parameter for the fault tolerance mechanisms that their predictions had little
credibility, and thus little value in forecasting the real-world reliability characteristics of these
systems.
22 Note: as reliability models became more sophisticated and computerized, this parameter became known as
“coverage,” the probability that recovery from a failure will be successful, given that a failure has occurred. The
concepts and underlying mathematics for reliability and availability models are discussed in greater detail in
Appendix C.
Figure F-2 NAS Stage A Recovery Effectiveness (predicted MTBF versus recovery mechanism effectiveness, i.e., coverage, from 0.950 to 1.000)
For a more modern example, consider a hypothetical redundant configuration of two computer
servers with MTBFs of 30,000 hours each and an MTTR of 0.5 hours. Although this configuration
has a theoretical inherent reliability of one billion hours, the chart in Figure F-3 shows that when
coverage drops from 100% to 99%, the predicted reliability drops from one billion hours to 1.5
million hours. At a coverage level of 95%, the predicted reliability drops to 300,000 hours. (Note
that the MTBF axis of the chart is logarithmic.)
Although a 300,000 hour MTBF with a fault coverage of 95% should be more than adequate for
FAA requirements, there is no assurance that this level will be achieved. If the assumed coverage
level is reduced to a more conservative value of 85%, the predicted reliability is still 100,000
hours. This analysis underscores the fact that constructing elaborate and complex mathematical
computer models is unnecessary when it can be shown that the model results are almost entirely
dependent on an input parameter whose value is essentially either a guess or a value the model
has itself derived precisely to get the desired result. The inability to estimate coverage accurately
makes it virtually impossible, when using automatic fault detection and recovery mechanisms, to
predict the reliability of redundant configurations with enough accuracy to be useful.
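The sensitivity described above can be reproduced with a minimal sketch. The duplex approximation used here, system failure rate ≈ 2λ[(1 − c) + cλ·MTTR], is a common textbook form and an assumption on our part, not necessarily the exact model behind Figure F-3; it does, however, reproduce the roughly one-billion-, 1.5-million-, 300,000-, and 100,000-hour figures quoted in the text.

```python
def duplex_mtbf(mtbf_elem: float, mttr: float, coverage: float) -> float:
    """Approximate MTBF of a duplex (1-of-2) repairable pair.

    Uses the common approximation
        lambda_sys ~ 2*lambda*[(1 - c) + c*lambda*MTTR]
    where an uncovered failure (probability 1 - c) takes the system
    down immediately, and a covered failure only causes a system
    failure if the second element also fails during the repair window.
    """
    lam = 1.0 / mtbf_elem
    lam_sys = 2.0 * lam * ((1.0 - coverage) + coverage * lam * mttr)
    return 1.0 / lam_sys

# Two 30,000-hour servers with a 0.5-hour MTTR, as in the text:
for c in (1.00, 0.99, 0.95, 0.85):
    print(f"coverage {c:.2f}: MTBF ~ {duplex_mtbf(30000, 0.5, c):,.0f} h")
```

Note how the uncovered-failure term dominates as soon as coverage drops below 100%, which is the whole point of the sensitivity argument.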
Another important conclusion that can be drawn from Figure F-3 is that simple combinatorial
models are adequate to verify the theoretical inherent reliability capabilities of the hardware
architecture to meet the requirements. Predicting inherent reliability and availability should be
viewed as simply the first step, among many, in evaluating a contractor’s ability to meet the
RMA requirements.
The conclusion is evident: there is no significant additional benefit to be gained from spending
program resources on developing sophisticated computer models. Scarce resources can be better
applied toward developing and applying tools and techniques to find and remove latent defects in
the recovery mechanisms and software applications that could keep the system from achieving
its theoretical maximum inherent reliability. All such tools developed under the program should
be delivered to the FAA for their use after system acceptance.
Figure F-3 Coverage Sensitivity of Reliability Models
F.2.2.2 Reliability Verification and Demonstration
The preceding section illustrated some difficulties in predicting the reliability and availability of
proposed systems before they are developed. Fundamental statistical limitations also make it
difficult to verify the reliability or availability of requirements-driven systems after they are
developed. Although statistical applications work best with large sample sizes, reliability testing
generally obtains limited samples of failures over limited test intervals. High reliability systems
seldom fail; therefore, it is impractical to accumulate enough operating hours prior to fielding a
system to obtain a statistically valid sample of failures. A “rule of thumb” is that to obtain a
statistically valid sample, the number of test hours should be approximately ten times the
required MTBF. For example, verifying a 30,000 hour MTBF would require testing either a single
system for over 30 years or 30 systems for one year. Neither of these alternatives is realistic in the
context of a major system acquisition.
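The arithmetic behind this rule of thumb is straightforward; the only added assumption below is an 8,760-hour year of continuous operation.

```python
HOURS_PER_YEAR = 8760  # 24 * 365, assuming continuous operation

def required_test_hours(required_mtbf: float, factor: float = 10.0) -> float:
    """Rule-of-thumb total test time for a statistically valid failure sample."""
    return factor * required_mtbf

mtbf = 30000.0
hours = required_test_hours(mtbf)        # 300,000 total test hours
print(hours / HOURS_PER_YEAR)            # years on a single system (~34)
print(hours / (30 * HOURS_PER_YEAR))     # years on 30 parallel systems (~1.1)
```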
Several quantitative parameters are used to describe the characteristics of a formal reliability
qualification test, including confidence intervals, producer’s risk, consumer’s risk, and
discrimination ratio. The end result, however, is that – when an accept/reject decision is based on
inadequate test time – there is a significant probability of either accepting a system that does not
meet the requirements (consumer’s risk), or of rejecting a system that does, in fact, meet the
requirements (producer’s risk).23
Arguments underlying these decisions are based strictly on conventional textbook statistical
theory. They fail to address the practical reality that modern software systems are not suited to
evaluation by fixed reliability qualification tests alone. Today’s software is dynamic and
adaptive. Enhancements, program trouble reports, patches, and the like present an ever-changing
reality that must be effectively managed. The only practical alternative, in today’s world, is to
pursue an aggressive reliability growth program and deploy a system to the field only when a
new version of a system can be shown to be more stable than the system it will replace. Formal
reliability demonstration programs, such as those used for the electronic “black boxes” of the
past, are no longer feasible for modern automation systems.
F.2.3 Use of Availability as a Contractual Specification
For the last twenty years, FAA specifications have focused primarily on availability
requirements in place of the more traditional reliability and maintainability requirements that
preceded them. Availability requirements are useful at the highest levels of management. They
provide a quantitative and consistent way of summarizing the need for continuity of NAS
Services. They can facilitate the comparison and assessment of architectural alternatives by
FAA headquarters system engineering personnel. They also bring a useful performance metric to
analyses of operationally deployed systems and Life Cycle Cost tradeoffs. And, because
availability includes all sources of downtime and reflects the perspective of system users, it is a
good overall measure of the operational performance of fielded systems.
There are, however, significant problems with employing availability as a primary RMA
requirement in contractual specifications. This operational performance measure combines
equipment reliability and maintainability characteristics with operation and maintenance factors
that are beyond the control of the contractor as well as outside of the temporal scope of the
contract.
The fundamental concept of availability implies that reliability and maintainability can be traded
off. In other words, a one-hour interruption of a critical service that occurs annually is seen as
equivalent to a 15-second interruption of the same service that occurs every couple of days ―
both scenarios provide approximately the same availability. It should be obvious that
interruptions lasting a few seconds are unlikely to have a major impact on ATC operations, while
interruptions lasting an hour or more have the potential to significantly impact traffic flow and
safety of operations. Contractors should not be permitted, however, to trade off reliability and
maintainability arbitrarily to achieve a specific availability goal. Such tradeoffs have the
potential to adversely impact NAS operations. They also allow a readily measured parameter
such as recovery time to be traded off against an unrealistic and immeasurable reliability
requirement following a logic such as: “It may take two hours to recover from a failure, but it
will be 20,000,000 hours between failures, so the availability is still acceptable, i.e., seven
‘nines.’”
23 The basic mathematics underlying these effects is summarized in Appendix D. For a more detailed discussion of reliability
qualification testing, see MIL-STD-781 [38].
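The tradeoff argued above is easy to verify numerically. The function below is the standard inherent-availability formula, A = MTBF/(MTBF + MTTR); the scenario numbers are taken directly from the text.

```python
def availability(mtbf: float, mttr: float) -> float:
    """Inherent availability A = MTBF / (MTBF + MTTR)."""
    return mtbf / (mtbf + mttr)

# A one-hour outage once a year vs. a 15-second outage every two days:
a_long  = availability(8760.0, 1.0)          # ~0.999886
a_short = availability(48.0, 15.0 / 3600.0)  # ~0.999913

# The "seven nines" argument: a 2-hour recovery "offset" by a claimed
# 20,000,000-hour MTBF.
a_claim = availability(2.0e7, 2.0)

print(a_long, a_short, a_claim)
```

The first two availabilities agree to four nines even though their operational impact is wildly different, which is exactly why availability alone is a poor contractual requirement.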
As pointed out above, during system development, availability can only be predicted using
highly artificial models. Following development, system availability is not easily measured
during testing at the WJHTC. The fundamental statistical sampling limitations associated with
high levels of reliability and availability are not the only problem. Availability cannot be
measured directly; it can only be calculated from measurements of system downtime and the
total operating time.
Deciding how much of the downtime associated with a failure should be included or excluded
from calculations of availability is difficult and contentious. In an operational environment,
matters are clear-cut: simply dividing the time that a system is operational by the total calendar
time yields the availability. In a test environment, however, adjustments are required for
downtimes caused by things like administrative delays, lack of spares, and the like – factors that
the contractor cannot control. Failure review boards are faced with the highly subjective process
of deciding which failures are relevant and how much of the associated downtime to include.
For these reasons, the FAA needs to establish specifications that can be more readily monitored
during development and measured at the contractor’s plant and the WJHTC prior to acceptance
of the system.
F.2.4 RMA Issues for Software-Intensive Systems
The contribution of hardware failures to the overall system reliability for a software-intensive
system is generally negligible. Software reliability, i.e., the presence of latent design defects
(faults), is by far the dominant factor in the overall reliability of these systems. Most models that
predict software reliability rely on historical data on the number of latent defects per thousand
source lines of code (at various stages of development) to guide the discovery and removal
of latent software defects. These models may be useful for estimating test time, manpower and
costs to reduce the fault density to acceptable levels; but they provide no reliable insight into the
run-time behavior of the software or the predicted operational reliability of the system.
Although some academic papers have attempted to develop models that can relate fault density
to the run-time behavior of software, the accuracy and usefulness of these models is questionable
and unproven. Again, the fundamental problem in predicting software reliability is the need to
predict, with some degree of certainty, how frequently each latent fault in the code is likely to
result in an operational failure. Essentially, this is a function of how often a particular section of
code is executed. For routines such as surveillance processing that are scheduled at regular
intervals, this is perhaps feasible. Other areas of code, however, may only be executed rarely. For
a complex system containing a million or more lines of code, with various frequencies of
occurrence, the prediction problem becomes intractable.
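The argument can be made concrete with a toy model: if each latent fault sits on a code path executed f times per hour and triggers a failure with probability p per execution, the software failure rate is roughly the sum of f·p over all faults. All numbers below are hypothetical; the point is that estimating f and p for every fault in a million-line system is exactly the intractable part.

```python
# Toy fault-exposure model. Each latent fault sits on a code path
# executed f times per hour and triggers a failure with probability
# p per execution. All numbers are hypothetical illustrations.
faults = [
    # (executions per hour, trigger probability per execution)
    (3600.0, 1.0e-9),  # fault in a once-per-second surveillance cycle
    (1.0,    1.0e-6),  # fault in an hourly housekeeping task
    (1e-3,   1.0e-2),  # fault in a rarely exercised error handler
]

failure_rate = sum(f * p for f, p in faults)  # failures per hour
mtbf_sw = 1.0 / failure_rate

print(failure_rate, mtbf_sw)
```

Note that in this example the rarely executed error handler contributes the largest share of the failure rate, illustrating why execution frequency alone does not bound the risk of a latent fault.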
F.2.4.1 Software Reliability Characteristics
Unlike Hardware Reliability, Software Reliability is not a direct function of time. Hardware parts
may age and wear out with time and usage, but software is not directly subject to the effects of
physical forces, such as wear-out, during its life cycle. Software will not change over time unless
there is direct intervention to intentionally change or upgrade it. [1]
Software Reliability is an important attribute of software quality, together with functionality,
usability, performance, serviceability, capability, installability, maintainability, and
documentation. Software Reliability is hard to achieve, because the complexity of software tends
to be high. Any system with a high degree of complexity, including software, is hard to bring
to a given level of reliability. System developers tend to push complexity into the
software layer, given the rapid growth of system size and ease of doing so by upgrading the
software.
For example, NextGen Air Traffic Control (ATC) systems will contain between one and two
million lines of code. [4] While the complexity of software is inversely related to software
reliability, it is directly related to other important factors in software quality, especially
functionality and capability. Emphasizing these features will tend to add more complexity to
software. [1]
Some of the distinct characteristics of software compared to hardware are listed below [6]:
Failure cause – Software defects are mainly design defects
Wear-out – Software does not have an energy-related wear-out phase. Errors can occur
without warning.
Repairable System Concept – Periodic restarts can help fix software problems.
Time Dependency and Life Cycle – Software reliability is not a function of
operational time.
Environmental Factors – Do not affect software reliability, except insofar as they might
affect program inputs.
Reliability Prediction – Software reliability cannot be predicted from any physical
basis, since it depends completely on human factors in design.
Redundancy – Cannot improve software reliability if identical software components
are used.
Interfaces – Software interfaces are purely conceptual rather than physical.
Failure Rate Motivators – Usually not predictable from analyses of separate
statements.
Built with Standard Components – Well-understood and extensively tested
standard parts help improve maintainability and reliability. In the software
industry, however, this trend has not been observed for the most part. Code reuse has been
around for some time, but only to a very limited extent.
F.2.4.1.1 Software Reliability Curve
Over time, hardware exhibits the failure characteristics shown in Figure F-4.
Figure F-4 Notional Hardware Failure Curve24,25
However, software failure does not have the same characteristics as hardware failure. The paper
“Introduction to Software Reliability: A State of the Art Review” [7] introduced the Software
Failure curve shown in Figure F-5. It is evident from this curve that there are two major
differences between the hardware and software curves. The first difference is that in the last
phase of the curve, software does not have an increasing failure rate as hardware does. In this
phase, software approaches obsolescence and there is little motivation for upgrades or changes.
Therefore, the failure rate will not change. The second difference is that in the useful-life phase,
software will experience a drastic increase in failure rate each time an upgrade is made. The
failure rate levels off gradually, partly because of the defects found and fixed after the upgrades.
[1]
Figure F-5 Notional Software Failure Curve
24 This curve is also known as the Bathtub Curve.
25 The origins of the Bathtub Curve are unknown [8]. The curve appears in actuarial life-table analysis as long ago as 1693 [9].
F.2.4.2 Software Reliability Growth
During the execution of software product development processes, deficiencies are inadvertently
introduced into products. These deficiencies directly affect the reliability goal or requirements
for a software product. However, it is possible to remove the aforementioned deficiencies so as
to meet the reliability goal or requirements. Reliability Growth is the improvement in reliability
over a period of time due to changes in product design or the manufacturing process. [10]
Reliability growth is related to factors such as the management strategy toward taking corrective
actions, effectiveness of the fixes, reliability requirements, the initial reliability level, reliability
funding and competitive factors. For example, one management team may take corrective
actions for a specified percentage of failures seen during testing, while another management
team with the same design and test information may take corrective actions on a different
percentage of the failures seen during testing. Different management strategies may attain
different reliability values with the same basic design. The effectiveness of the corrective actions
must also be judged relative to the initial reliability at the beginning of testing. If corrective
actions give a 400% improvement in reliability for equipment that initially had one tenth of the
reliability goal, this is not as significant as a 50% improvement in reliability if the system
initially had one half the reliability goal. [11]
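The comparison in this paragraph is simple arithmetic, sketched below with the numbers from the text.

```python
def final_fraction_of_goal(initial_fraction: float, improvement_pct: float) -> float:
    """Reliability after corrective actions, as a fraction of the goal."""
    return initial_fraction * (1.0 + improvement_pct / 100.0)

# A 400% improvement starting at one tenth of the goal:
case_a = final_fraction_of_goal(0.10, 400.0)   # 0.5 -> still half the goal
# A 50% improvement starting at one half of the goal:
case_b = final_fraction_of_goal(0.50, 50.0)    # 0.75 -> three quarters of the goal

print(case_a, case_b)
```

The seemingly dramatic 400% improvement leaves the system further from its goal than the modest 50% improvement, which is the paragraph's point about judging fixes relative to the starting point.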
F.2.4.2.1 Reliability Growth Program
The key factors of a reliability growth program include [11] [22]:
Planning
Requirements
Design
Implementation
Tests and Evaluation
Fielding
Operation
There are processes and procedures for “Software Reliability” that are associated with every
aspect of the Engineering Life-cycle. “FAA System Safety Handbook” [33] discusses many of
the processes and procedures associated with Software Reliability in the Engineering Life-cycle
context. In contrast, Appendix F of this document focuses on Reliability Growth and hence on
just particular aspects of the Engineering Life-cycle.
F.2.4.2.2 Reliability Growth Process
Reliability growth is the product of an iterative process (see Appendix G). Later stages in this
process help to identify potential sources of failures. Further effort is then spent to resolve any
problem areas in either manufacturing process design or product design. The basic iterative
process can be conceptualized as a simple feedback loop with three essential elements to achieve
reliability growth: detection of failure sources, feedback of problems identified, and redesign
effort based on problems identified.
If testing is included to identify failure sources, then “fabrication of hardware/ prototypes/
system” is a necessary fourth element to complete the design loop. “Detection of failure
sources” will also need to function as “verification of redesign effort.” [13]
Figure F-6 Notional Reliability Growth Management26
F.2.5 RMA Considerations for Systems Using COTS or NDI Hardware Elements
Given the pressure to justify the costs of new systems, there is a strong desire to use COTS
hardware as an alternative to custom-developed hardware for FAA systems. This means that the
Government is unlikely to be able to exercise control over the internal design characteristics of
the basic hardware elements used to construct systems. Both the reliability and maintainability of
the elements are predetermined and largely beyond the control of the FAA. The only real option
is to require field data to substantiate a contractor’s claims for the reliability and maintainability
of the COTS products. The FAA’s ability to influence the design of systems employing
COTS/NDI components is primarily limited to demanding the removal and replacement of some
unwanted hardware elements from their mountings, and possibly to requiring that hardware be
built to industrial, instead of commercial, standards.
26 See [16].
SOFTWARE RELIABILITY GROWTH IN THE
ENGINEERING LIFE-CYCLE
The objective of Software Reliability Planning is to develop a Reliability Growth Plan (RGP).
An RGP is dependent on the identification of all relevant software systems/components and
associated effort to be applied to such software systems/components. The following sections
address these dependencies. An RGP should contain quantifiable goals for Reliability as well as
milestones for achieving these goals.
G.1 Relevant Software Identification
The first step in developing an RGP is to identify relevant software components. The
identification of relevant software components is accomplished by using the “Service Thread
Loss Severity Category (STLSC) Matrix,” and the “Service Thread Reliability, Maintainability,
and Recovery Times.” [22] In essence, Subject Matter Experts (SMEs) will review all software
systems/components of a program/project and determine their standing using the aforementioned
documents.
G.2 Effort Identification
The second step in developing an RGP is to determine the level of Reliability Growth effort.
The identification of the level of effort uses the Software Control Categories of MIL-STD-882C
“System Safety Program Requirements” [28] as shown in Table G-127. For an FAA
project/program within Acquisition Management System (AMS), it is recommended that major
software components/systems associated with a particular thread be categorized using Table G-1.
Table G-2 shows the Software Risk Index28 used in determining the level of effort needed for the
Reliability categories29 introduced in the STLSC Matrix, and defines the risk level associated
with each value of the index.
Table G-1 MIL-STD-882C Software Control Categories30
Software Control Category   Degree of Control
IA Software exercises autonomous control over potentially hazardous
hardware systems, subsystems or components without the possibility
of intervention to preclude the occurrence of a hazard. Failure of the
software, or a failure to prevent an event, leads directly to a hazard’s
occurrence.
IIA Software exercises control over potentially hazardous hardware
systems, subsystems, or components allowing time for intervention
by independent safety systems to mitigate the hazard. However, these
systems by themselves are not considered adequate.
27 The Effort Identification method presented here is based on SSHDBK2000 “FAA System Safety Handbook” [33] and NASA-STD-8739.8 “Software Assurance Standard” [12] with modifications.
28 The notion of Risk used in this document is compatible with the “Safety Risk Management Guidance for System Acquisitions (SRMGSA)” [29].
29 It should be noted that, as per FAA-HDBK-006A [2], no Safety-Critical threads have been identified to date.
30 See [24] and [28].
IIB Software item displays information requiring immediate operator
action to mitigate a hazard. Software failures will allow, or fail to
prevent, the hazard’s occurrence.
IIIA Software item issues commands over potentially hazardous hardware
systems, subsystems or components requiring human action to
complete the control function. There are several, redundant,
independent safety measures for each hazardous event.
IIIB Software generates information of a Safety-Critical nature used to
make Safety-Critical decisions. There are several redundant,
independent safety measures for each hazardous event.
IV Software does not control Safety-Critical hardware systems,
subsystems or components and does not provide Safety-Critical
information.
Table G-2 Software Risk Index
Software Risk Index   Risk Definition
5 High Risk: Software controls Safety-Critical hazards
4 Medium Risk: Software control of Safety-Critical or Efficiency-Critical hazards is reduced, but still significant
2 and 3 Moderate Risk: Software controls less significant hazards
1 Low Risk: Software controls Negligible/Marginal hazards
Table G-3 assigns a Software Risk Index to each pair of Software Control Category and Hazard
Category.
Table G-3 Software Risk Matrix Guidance31
Hazard Category:
Software Control Category   Negligible/Marginal   Essential   Efficiency-Critical   Safety-Critical
IA            2   4   5   5
IIA & IIB     5   2   4   5
IIIA & IIIB   1   1   3   4
IV            1   1   2   3
Table G-4 assigns a Reliability Growth effort level to each pair of Software Control Category and
Hazard Category. The definitions of each Reliability Growth Effort Level (i.e., Full, Moderate,
Minimum) follow below.
Table G-4 Software Risk Matrix Guidance32
Hazard Category:
Software Control Category   Negligible/Marginal   Essential   Efficiency-Critical   Safety-Critical
IA            Minimum   Moderate   Full       Full
IIA & IIB     Minimum   Minimum   Moderate   Full
IIIA & IIIB   None      Minimum   Moderate   Moderate
IV            None      None      Minimum    Minimum
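For illustration, the two tables can be encoded as a simple lookup. The values are transcribed from Tables G-3 and G-4 as printed above, and the function name is a hypothetical convenience, not part of any FAA tool.

```python
# Illustrative encoding of Tables G-3 and G-4; values transcribed
# from the tables as printed in this handbook.
HAZARDS = ("Negligible/Marginal", "Essential", "Efficiency-Critical", "Safety-Critical")

RISK_INDEX = {  # Table G-3: Software Risk Index per hazard category
    "IA":          (2, 4, 5, 5),
    "IIA & IIB":   (5, 2, 4, 5),
    "IIIA & IIIB": (1, 1, 3, 4),
    "IV":          (1, 1, 2, 3),
}

EFFORT = {      # Table G-4: Reliability Growth effort per hazard category
    "IA":          ("Minimum", "Moderate", "Full", "Full"),
    "IIA & IIB":   ("Minimum", "Minimum", "Moderate", "Full"),
    "IIIA & IIIB": ("None", "Minimum", "Moderate", "Moderate"),
    "IV":          ("None", "None", "Minimum", "Minimum"),
}

def lookup(control: str, hazard: str) -> tuple:
    """Return (Software Risk Index, Reliability Growth effort) for a software item."""
    i = HAZARDS.index(hazard)
    return RISK_INDEX[control][i], EFFORT[control][i]

print(lookup("IA", "Efficiency-Critical"))  # (5, 'Full')
```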
Table G-5 shows the level of oversight recommended for each Software Risk Index.
Table G-5 Oversight Guidance33
Software Risk Index   Degree of Oversight
5 Full Independent Validation and Verification (IV&V) organization, as well as in-house Software Assurance (SA)
4 In-house SA organization; Possible software Independent Assessment (IA)
31 See [24].
32 See [24].
33 See [24].
3 In-house SA organization
1,2 Minimal in-house SA
G.2.1 “Full” Effort
Systems and subsystems that have severe hazards which can escalate to major failures in a very
short period of time require the greatest level of software safety effort. Some examples of these
types of systems include life support, power generation, and conditioning systems. These
systems may require a formal, rigorous program of quality and safety assurance to ensure
complete coverage and analysis of all requirements, design, code, and tests. Safety analyses,
software development analyses, safety design features, and Software Assurance (SA) oversight
are highly recommended. In addition, IV&V activities may be required. [24]
G.2.2 “Moderate” Effort
Systems and subsystems which fall into this category typically either 1) have a limited hazard
potential or 2) allow enough time for initiating hazard controls that human operators can
respond to the hazardous situation. Examples of these types of
systems include microwave antennas and low power lasers. These systems require a rigorous
program for safety assurance of software identified as Safety-Critical. Non-Safety-Critical
software must be regularly monitored to ensure that it cannot compromise safety controls or
functions. Some analyses are required to assure there are no “undiscovered” Safety-Critical areas
that may need software safety features. Some level of Software Assurance oversight is still
needed to assure late design changes do not affect the safety severity.
A project of this level may require IV&V; however, it is more likely to require a Software
Independent Assessment (IA). A Software Independent Assessment is defined as a review and
analysis of the program/project’s system software development lifecycle and products. The
IA differs in scope from a full IV&V program in that IV&V is applied over the lifecycle of the
system whereas an IA is usually a one-time review of the existing products and plans. In many
ways, IA is an outside audit of the project’s development process and products (documentation,
code, test results, and others). [24]
G.2.3 “Minimum” Effort
For systems in this category, either the inherent hazard potential of a system is very low or
control of the hazard is accomplished by non-software means. Failures of these types of systems
are primarily reliability concerns. This category may include such things as scan platforms and
systems employing hardware interlocks and inhibits. Software development in these types of
systems must be monitored on a regular basis to ensure that safety is not inadvertently
compromised or that features and functions are added which make the software Safety-Critical.
A formal program of software safety is not usually necessary. Of course, good development
practices and SA are always necessary. [24]
G.2.4 “None” Effort
This category implies that no efforts specific to Software Reliability are needed for software
associated with a particular thread. This does not mean that best practices for software
development should be bypassed; rather, it means only that specific efforts to increase
Software Reliability are not needed because the software is not associated with Safety-Critical
applications. However, a program/project manager may still choose to use some of the processes
associated with Software Reliability Growth, since these processes are general and not
exclusively related to Software Reliability Growth. For example, Change Management is
a good practice for all software development, independent of whether the software is
associated with Safety-Critical applications.
G.3 Goals Guidance
Table G-6 below details the applicable methods/processes from all phases of the Engineering
Life-cycle that a program/project manager needs to keep in mind while planning for reliability
growth. Table G-6 also includes a cost/benefit analysis, specific effort guidance for these
methods/processes, and a classification of the type of each method/process. Note that Table G-6
Method/Process items that are further decomposed contain roll-ups of the approximate average
Cost Rating, Benefit Rating, Effort, and Task Type of their sub-components. Table G-7 contains
miscellaneous notes and references for specific Software Reliability methods/processes that are
useful in reliability planning.
Table G-6 and Table G-7 should be used in determining the goals of an RGP. In essence, the
product of using these tables will be a list of processes to be applied during the Engineering
Life-cycle. This list of processes, along with associated milestones, specifies the goals of an RGP.
FAA Reliability, Maintainability, and Availability (RMA) Handbook
FAA RMA-HDBK-006C V1.1
232
Table G-6 Software Reliability Growth Effort Specifics
Phase Method/Process Source Cost Rating34 Benefit Rating35 Effort36 (Minimal / Moderate / Full) Task Type (Software Eng. / Sys & Software Safety / SA & IV&V)
Planning
Development Management Plan DO-278 L H E E E X
Configuration Management Plan DO-278 L H E E E X
Verification & Validation Management Plan NSSG L H R HR E X X
Accreditation Management Plan DO-278 L H NR R E X X X
Quality Assurance Management Plan DO-278 L H NR R E X X X
Development Environment Planning DO-278 L M R HR E X X
Language and Compiler Planning DO-278 L M R HR E X X
Test Environment Planning DO-278 L M R HR E X X X
Development Standards Planning DO-278 L M NR R E X X X
Planning Process Review DO-278 L M E E E X X X
Planning Products Review HBK006 L M E E E X X X
Configuration Management
Configuration Identification DO-278 L M R HR E X X X
Baselines and Traceability DO-278 L M R HR E X X X
Problem Reporting, Tracking, and Corrective Actions DO-278 L H R HR E X X X
Change Control DO-278 L M R HR E X X X
Change Review DO-278 L M R HR E X X X
Configuration Status Accounting DO-278 L M R HR E X X X
Archive, Retrieval, Release DO-278 L M R HR E X X X
Data Control Categorization DO-278 L L R HR E X X X
Load Control DO-278 L M R HR E X X X
Life Cycle Environment Control DO-278 L L R HR E X X X
CM Products Review HBK006 L M R HR E X X X
Quality Assurance
QA Audits DO-278 M M R HR E X X
34 Legend: L - Low; LH - Low to High; LM - Low to Moderate; M - Moderate; MH - Moderate to High; H - High
35 Legend: L - Low; M - Medium; H - High
36 Legend: NR - Not Recommended; R - Recommended; HR - Highly Recommended; E - Essential
QA Conformity Review DO-278 M H R HR E X X
QA Products Review HBK006 L M R HR E X X
Requirements
Requirements Management NSSG L H E E E X
Development of Software Safety Requirements NSSG L H E E E X X
Generic Software Safety Requirements NSSG L H R HR HR X X
Fault and Failure Tolerance NSSG M H R HR E X X
Hazardous Commands NSSG L H E E E X X
Timing, Sizing and Throughput Considerations NSSG M H HR E E X X
Checklists and cross references NSSG L H HR HR E X X
Software Safety Requirements Analysis NSSG M H R HR E X X
Safety Requirements Flow-down Analysis NSSG LH H HR E E X
Requirements Criticality Analysis NSSG L H R HR E X
Specification Analysis NSSG M H NR R HR X X
Reading Analysis and Traceability Analysis NSSG L H NR R HR X
Control-flow analysis NSSG M H NR R HR X
Information-flow analysis NSSG M H NR R HR X
Functional simulation models NSSG M M NR R HR X
Formal Methods - Specification Development NSSG MH H NR HR HR X X
Model Checking NSSG M H NR HR HR X
Timing, Throughput, and Sizing Analysis NSSG L H HR E E X
Fault Tree Analysis NSSG M H R HR E X
Failure Modes, Effects, and Criticality Analysis (FMECA) 37 NSSG H H NR R HR X
Requirements Products Review NSSG M H R E E X X X
Design
COTS/GOTS Analysis DO-278 M H R HR E X
Language Restrictions and Coding Standards NSSG L H E E E X
Defensive Programming NSSG L H HR E E
Complexity Analysis HBK006 L H R HR E X
Design Analysis NSSG M H HR E E X X
Update Previous Analyses NSSG M H HR E E X
37 For Enterprise Services see [53].
Severity Analysis NSSG M M HR E E X
Software Component Risk Assessment NSSG M M HR E E X
Software FTA and Software FMEA HBK006 M M NR R HR X
Design Safety Analysis NSSG M H HR E E X
Independence Analysis NSSG M H R HR E X
Formal Methods and Model Checking NSSG H M NR R HR X
Design Logic Analysis (DLA) NSSG MH H NR R HR X
Design Data Analysis NSSG M H R HR E X
Design Interface Analysis NSSG M H R HR E X
Design Traceability Analysis NSSG L H E E E X X
Software Element Analysis NSSG M M R HR E X
Rate Monotonic Analysis NSSG H M NR R HR X
Dynamic Flowgraph Analysis NSSG H L NR NR R X
Markov Modeling NSSG H L NR R HR X
Requirements State Machines NSSG H L NR R HR X
Recovery Oriented Computing (ROC)38 HBK006 H L R HR E X
Design Products Review NSSG LM H R HR E X X X
Coding
Software Development Techniques NSSG LM M E E E X
Coding Checklists and Standards NSSG L M E E E X
Unit Level Testing NSSG LM H HR HR E X
Program Slicing NSSG M M R R HR X X X
Code Analyses NSSG M M R HR E X X
Code Logic Analysis NSSG H L NR NR R X
Code Data Analysis NSSG M M R HR E X
Code Interface Analysis NSSG M H R HR E X
Unused Code Analysis NSSG M M R HR E X
Interrupt Analysis NSSG L H R HR E X
Test Coverage Analysis NSSG LM M R E E X
Safety-Critical Unit Test Plans NSSG LM M HR HR E X
Final Timing, Throughput, and Sizing Analysis NSSG L H HR E E X
38 For Enterprise Services see [54].
Coding Products Review NSSG L H E E E X
V&V
Integration Testing NSSG M H R HR E X
System Testing NSSG M H E E E X
Regression Testing NSSG L H E E E X
Safety Testing NSSG M H E E E X
Fault Injection Testing39 HBK006 M H R HR E X
Test Analysis NSSG M M E E E X
Test Coverage Analysis NSSG L H R HR E X
Reliability Modeling NSSG MH L NR R HR X
Test Results Analysis NSSG M H E E E X
Requirements-Based Test Selection DO-278 M H HR E E X
Normal Range Test Cases DO-278 M H HR E E X
Robustness Test Cases DO-278 M L R HR E X
Requirements Based Testing Methods DO-278 M H R HR E X
Test Coverage Analysis DO-278 M H R HR E X
Requirements-Based Test Coverage Analysis DO-278 M H R HR E X
Structural Coverage Analysis DO-278 M H R HR E X
Structural Coverage Analysis Resolution DO-278 M H R HR E X
Software Verification Process Traceability DO-278 M H R HR E X
Verification of Adaptation Data Items DO-278 M H R HR E X
V&V Products Review HBK006 L H E E E X
Accreditation
Use of Previously Developed Software DO-278 L H R E E X
Means of Compliance and Planning DO-278 L H R E E X
Compliance Substantiation DO-278 L H R E E X
Modifications to Previously Developed Software DO-278 L H R E E X
Reuse of Previously Approved Software in CNS/ATM System DO-278 L H R E E X
Change of Application or Development Environment DO-278 L H R E E X
Upgrading a Development Baseline DO-278 L H R E E X
Software CM Considerations DO-278 L H R E E X
39For Enterprise services see [53].
Software Quality Assurance Considerations DO-278 L H R E E X
Tool Qualification DO-278 L H R E E X
Determining if Tool Qualification is Needed DO-278 L H R E E X
Determining the Tool Qualification Level DO-278 L H R E E X
Tool Qualification Process DO-278 L H R E E X
Alternative Methods DO-278 L H R E E X
Exhaustive Input Testing DO-278 H L NR R HR X
Considerations for Multiple-Version Dissimilar Software Verification DO-278 L H R E E X
Independence of Multiple-Version Dissimilar Software DO-278 L H R E E X
Multiple Processor-Related Verification DO-278 L H R E E X
Multiple-Version Source Code Verification DO-278 L H R E E X
Tool Qualification for Multiple-Version Dissimilar Software DO-278 L H R E E X
Multiple Simulators and Verification DO-278 L H R E E X
Software Reliability Models DO-278 MH L R E E X
Service Experience DO-278 L H R E E X
Reliance of Service Experience DO-278 L H R E E X
Sufficiency of Service Experience DO-278 L H R E E X
Collection, Reporting, and Analysis of Problems Found During Service Experience DO-278 L H R E E X
Service Experience Information in the Plan for Software Aspects Approval DO-278 L H R E E X
COTS DO-278 L H R E E X
System Aspects of COTS Software DO-278 L H R E E X
COTS Software Planning Process DO-278 L H R E E X
COTS Software Acquisition Process DO-278 L H R E E X
COTS Software Verification Process DO-278 L H R E E X
COTS Software CM Process DO-278 L H R E E X
COTS Software Quality Assurance Process DO-278 L H R E E X
Software Life Cycle Data DO-278 L H R E E X
Changes to COTS from an Earlier Baseline DO-278 L H R E E X
Additional Process Objectives and Outputs by Assurance Level for COTS Software DO-278 L H R E E X
Alternative Methods for Providing Assurance of COTS Software
DO-278 L H R E E X
Service Experience DO-278 L H R E E X
Additional Testing DO-278 L H R E E X
Exhaustive Input Testing DO-278 H H NR R HR X
Robustness Testing DO-278 M L R HR E X
System-Level Testing DO-278 L H R E E X
Long-term Soak Testing DO-278 H H R E E X
Use of System for Training DO-278 M H R E E X
Restriction of Functionality DO-278 L H R E E X
Monitoring and Recovery DO-278 H H R E E X
Design Knowledge DO-278 M H R E E X
Audits and Inspections DO-278 L H R E E X
Prior Product Approval DO-278 L H R E E X
Accreditation Process Review HBK006 L H R E E X
Accreditation Products Review HBK006 L H R E E X
Table G-7 Software Reliability Growth Phase Notes
Phase Sub-Item References Notes
Requirements [23]
[24]
[27]
[32]
[33]
Deficient requirements are the single largest cause of
software project failure, and usually are the root cause
of the worst software defects. [24] Requirements
management is the process of eliciting, documenting,
organizing, communicating, and tracking
requirements. Management of requirements is one of
the most important activities you can do to assure a
safe system. For a complete discussion of
Requirements Management see SSHDBK2000 “FAA
System Safety Handbook” [33] and “Requirements
Management.” [32]
Critical Operational Issues (COIs) are defined in an
FAA procurement program’s Program Requirements
(PR) (Office of Management and Budget (OMB)
Exhibit 300). A COI is decomposed into Measures of
Effectiveness (MoEs) and Measures of Suitability
(MoSs), both of which are further decomposed into
Measures of Performance (MoPs). An MoS addresses
system characteristics that determine whether it is
suitable for use in the NAS. Reliability is a MoS. An
MoP is derived from one or more MoEs or MoSs and
directly translates into a test requirement. Once COIs
have been fully decomposed into MoPs, program
requirements are associated to one or more MoPs. [23]
“Developing Safety-Critical Software Requirements
for Commercial Reusable Launch Vehicles” [27]
contains a representative example of the development
of requirements for Safety-Critical Requirements.
Configuration
Management
Change
Control/Change
Review
[18] Change Management is applicable to all phases of the
Engineering Life-cycle. Change Management is key
to Software Reliability Growth since it provides a
basis for the recording of metrics (Section Metrics)
and fault classification (Section Software Fault
Taxonomies).
Requirements Failure Modes, Effects, and Criticality Analysis (FMECA)
Studies have shown that the biggest cause of software
failures is unstable software requirements during the
requirements process. Software designers
should address software in system reliability design and
analysis activities. A system’s software should be
represented in the reliability block diagram; otherwise
the implicit assumption is that the software will never
fail. Further, if a system includes software, then the
FMECA should recognize the software as a potential
failure point. Neglecting the software makes the
assumption that it will be error-free. [13]
FMECA is a bottom-up analysis that starts with
individual component failure modes and traces their
effects up to overall system failure. The analysis can be
refined as more details become available, reflecting
increased understanding of the system. [24]
FMECA is an excellent hazard analysis and risk
assessment tool, but it has limitations. It does not take
into consideration software and human interactions or
combined failures. Also, it usually provides an
optimistic estimate of reliability. As such, FMECA
should be supplemented with other analytical tools to
develop accurate reliability estimates [15].
Authoritative sources for software FMEA/FMECA
include IEC 60812:2006(E) [35].
Implementation [19]
[20]
[25]
Unit Level Testing [26]
V&V [13]
[17]
[26]
[31]
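The bottom-up FMECA ranking discussed in the notes above can be illustrated with a minimal sketch. The Risk Priority Number (severity × occurrence × detection) used here is one common criticality measure, and the failure modes shown are invented examples, not drawn from the handbook.

```python
# Hypothetical sketch of a bottom-up FMECA criticality ranking.
# RPN = severity x occurrence x detection; the failure modes are invented.

from dataclasses import dataclass

@dataclass
class FailureMode:
    component: str
    mode: str
    severity: int    # 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # 1 (rare) .. 10 (frequent)
    detection: int   # 1 (always detected) .. 10 (undetectable)

    @property
    def rpn(self) -> int:
        # Risk Priority Number: higher means more critical.
        return self.severity * self.occurrence * self.detection

def rank_failure_modes(modes):
    """Order failure modes from highest to lowest criticality."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)

modes = [
    FailureMode("radar feed", "dropped track message", 8, 3, 4),
    FailureMode("display", "stale weather overlay", 5, 4, 2),
]
ranked = rank_failure_modes(modes)
```

As the notes caution, a ranking like this is optimistic on its own and should be supplemented with other analyses (e.g., fault tree analysis) for combined failures and software-human interactions.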
G.4 Overarching Methodologies & Tools
G.4.1 Metrics
A reliability specification requires a definition of what constitutes mission success or
failure. Popular reliability metrics include failure rate, mean time between failures
(MTBF), and mean time between repairs (MTBR).
Software reliability metrics are similar to hardware metrics for a repairable system. The data
provided is commonly a series of failure times or other events. During software development,
the data is used to measure time between events, analyze the improvement resulting from
removing errors, and decide when to release or update a software product version. Metrics
are also used to assess software or system stability. [13]
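As a minimal sketch of these metrics, assuming failure data arrives as cumulative failure times in operating hours (the function names and data below are illustrative, not from the handbook):

```python
# Minimal sketch: MTBF and failure rate from cumulative failure times.
# Assumes the list holds the operating-hour timestamps of each failure.

def mtbf(failure_times_hours):
    """Mean time between failures: average gap between successive failures."""
    gaps = [b - a for a, b in zip([0.0] + failure_times_hours,
                                  failure_times_hours)]
    return sum(gaps) / len(gaps)

def failure_rate(failure_times_hours):
    """Average failures per operating hour (reciprocal of MTBF)."""
    return 1.0 / mtbf(failure_times_hours)

times = [120.0, 300.0, 620.0, 1000.0]  # hours at which failures occurred
# mtbf(times) -> 250.0 hours; failure_rate(times) -> 0.004 failures/hour
```

Tracking how these values change across builds is one simple way to assess the stability trend the text describes.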
G.4.2 Software Fault Taxonomies
Orthogonal Defect Classification (ODC) categorizes each defect into one of several
classes that indicate the part of the process that needs investigation. Software
development processes vary across organizations, although activities are
broadly divided into design, code, and test. [14]
An understanding of the relationship between reliability metrics and other terms is necessary for
the application of these factors. System failures may be caused by the hardware, the user, or
faulty maintenance.
The basic Software Reliability incident classification includes:
Mission failures – loss of any essential functions, including system hardware failures,
operator errors, and publication errors. Related to mission reliability.
System failures – software malfunction that may affect essential functions. Related to
maintenance reliability.
Unscheduled spares demands – Related to supply reliability.
System/mission failures requiring spares – Related to mission, maintenance and supply
reliabilities. [17]
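A tally along the incident classes above, combined with an ODC-style activity attribution, might look like the following sketch; the incident records are invented examples.

```python
# Hedged sketch: counting incidents by the classes listed above and by an
# ODC-style process activity (design/code/test). Records are invented.

from collections import Counter

incidents = [
    {"class": "mission failure", "activity": "code"},
    {"class": "system failure", "activity": "design"},
    {"class": "system failure", "activity": "code"},
    {"class": "unscheduled spares demand", "activity": "test"},
]

by_class = Counter(i["class"] for i in incidents)       # which reliability is hit
by_activity = Counter(i["activity"] for i in incidents) # which process step to probe
```

A skew in `by_activity` (e.g., most defects attributed to design) is the kind of signal ODC uses to point investigation at a specific part of the process.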
G.4.3 Tools
There is a plethora of tools that can be used at the various Engineering Life-cycle phases for
Software Reliability. Authoritative sources on this subject include DOT/FAA/AR-06/35
“Software Development Tools for Safety-Critical, Real-Time Systems Handbook” [25] and
DOT/FAA/AR-06/36 “Assessment of Software Development Tools for Safety-Critical, Real-
Time Systems.” [21] Specific GOTS tools include the U.S. Army Materiel Systems Analysis
Activity (AMSAA) Reliability Growth technology. [10] [16]
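As a hedged illustration of the reliability growth modeling behind tools like the AMSAA technology referenced above, the Crow power-law NHPP model can be fitted to cumulative failure times from a time-truncated test. The data and function names here are invented; this sketch is not the AMSAA tool itself.

```python
# Illustrative sketch of the Crow (AMSAA) power-law reliability growth
# model: failure intensity lambda*beta*t^(beta-1), fitted by maximum
# likelihood to cumulative failure times. Data values are invented.

import math

def amsaa_fit(failure_times, total_time):
    """Maximum-likelihood estimates (lambda_hat, beta_hat) for a
    time-truncated test. beta_hat < 1 indicates reliability growth
    (failure intensity decreasing over the test)."""
    n = len(failure_times)
    beta = n / sum(math.log(total_time / t) for t in failure_times)
    lam = n / total_time ** beta
    return lam, beta

def intensity(lam, beta, t):
    """Instantaneous failure intensity at time t."""
    return lam * beta * t ** (beta - 1)

lam, beta = amsaa_fit([25.0, 100.0, 250.0, 500.0], total_time=1000.0)
```

With the invented data above, the fitted shape parameter comes out below 1, i.e., the failure intensity decreases over the test, which is the signature of reliability growth.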
POWER SYSTEM CATEGORY ALLOCATIONS
Figure H-1 Terminal STLSC with Power System
Reference Section 7.6 of the RMA Handbook
[Figure H-1, the Service/Capability/Power – Terminal Service Thread STLSC Matrix, does not survive text extraction; only its caption, legends, and keys are reproduced here. Columns group Control Facility Information Systems, R/D & Standalone Systems, and I & ES service threads; rows cover NAS RD-2013 Section 3.1 Mission Services, Section 3.2 Technical Infrastructure Services, the Terminal Facility Power Architectures by facility level, and the Large/Medium/Small Tower and TRACON STLSCs.]
Service threads: ASDES Airport Surface Detection Equipment Service; TFMS Traffic Flow Management System; TBFMR Time Based Flow Management Remote Display; Terminal Surveillance Safety-Critical Service Thread Pairs (1) and (2); TVS Terminal Voice Switch (NAPRS Facility); TVSB Terminal Voice Switch Backup; RVRS Runway Visual Range Service; VGS Visual Guidance Service; RALS R/F Approach and Landing Services; WIS Weather Information Service; LLWS Low Level Wind Service; ADSS Automatic Dependent Surveillance Service; FCOM Flight Service Station Communications; MDAT Mode S Data Link Data Service; MSEC Mode S Secondary Radar Service; RTADS Remote Tower Alphanumeric Display System Service; RTDS Radar Tower Display System; TCOM Terminal Communications; TCE Transceiver Communications Equipment; ECSS Emergency Communications Systems Service; TRAD Terminal Radar; TSEC Terminal Secondary Radar; STDDS SWIM Terminal Data Distribution System; Manual Procedures
Umbrella services: TARS Terminal Automated Radar Service; TCVEX Terminal Communications Voice Exchange Service
Color Key: Safety-Critical S-C; Efficiency-Critical E-C; Essential E; Routine R
Power System Architectures: C2 CPDS Type 2; C1 CPDS Type 1; B BASIC; 2A Comm'l Pwr + EG + UPS; 1A Comm'l Pwr + EG + Mini UPS; U Comm'l Pwr + UPS (no EG); D Comm'l Pwr + Batteries; V Photovoltaaic/Wind + Batteries; Z Independent Generation; 1 Comm'l Pwr + EG; 4 Comm'l Pwr; 8 Dual Indep. Comm. Pwr.; S Same as Host Facility Power System Architecture
H = High Inherent Availability = .999998; R = Reduced Inherent Availability = .9998; # = Commercial Power Provided by Airport
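The inherent availability levels in the legend (H = .999998, R = .9998) follow from the standard relation A = MTBF / (MTBF + MTTR). A minimal sketch, with MTBF/MTTR values chosen only for illustration (they are not taken from the handbook):

```python
# Sketch of the inherent availability relation A = MTBF / (MTBF + MTTR).
# The MTBF/MTTR pairs below are illustrative, chosen only to reproduce
# the legend's H (.999998) and R (.9998) availability levels.

def inherent_availability(mtbf_hours, mttr_hours):
    """Steady-state inherent availability of a repairable item."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

high = inherent_availability(mtbf_hours=250_000.0, mttr_hours=0.5)   # ~.999998
reduced = inherent_availability(mtbf_hours=2_500.0, mttr_hours=0.5)  # ~.9998
```

The relation makes the trade-off explicit: at a fixed MTTR, the High level requires roughly two orders of magnitude more MTBF than the Reduced level.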
Figure H-2 En Route STLSC with Power System
Reference Section 7.6 of the RMA Handbook
[Figure H-2, the Service/Capability – En Route Thread STLSC Matrix, does not survive text extraction; only its caption, legends, and keys are reproduced here. Columns group Control Facility Information Systems, R/D & Standalone Systems, and Mission Support Systems service threads; rows cover NAS RD-2013 Section 3.1 Mission Services, Section 3.2 Technical Infrastructure Services, the Control Facility and Remote Site Power System Architectures, and the Service Thread Loss Severity Categories.]
Service threads: CFAD Composite Flight Data Processing Service; CODAP Composite Oceanic Display and Planning Service; COFAD Anchorage Composite Offshore Flight Data Service; CRAD Composite Radar Data Processing Service (EAS/EBUS); CRAD Composite Radar Data Processing Service (CCCH/EBUS); TBFM Time Based Flow Management; TFMS Traffic Flow Management; VSCSS Voice Switching and Control System Service; VTABS VSCS Training and Backup System (NAPRS Facility); WIS Weather Information Service; ADSS Automatic Dependent Surveillance Service; ARINC HF Voice Communications Link; ARSR Air Route Surveillance Radar; BDAT Beacon Data (Digitized); BUECS Backup Emergency Communications Service; ECOM En Route Communications; FCOM Flight Service Station Communications; FDAT Flight Data Entry and Printout; IDAT Interfacility Data Service; MDAT Mode S Data Link Data Service; MSEC Mode S Secondary Radar Service; NAMS NAS Message Transfer Service; RDAT Radar Data (Digitized); TRAD Terminal Radar; TSEC Terminal Secondary Radar; R/F Navigation Service; FDIOR Flight Data Input/Output Remote; RMLSS Remote Monitoring/Maintenance Logging System Service; Manual Procedures
Umbrella services: ETARS En Route Terminal Automated Radar Service; ECVEX En Route Communication Voice Exchange Service
Color Key: Safety-Critical S-C; Efficiency-Critical E-C; Essential E; Routine R
Power System Architectures: C2 CPDS Type 2; C1 CPDS Type 1; B BASIC; 2A Comm'l Pwr + EG + UPS; 1A Comm'l Pwr + EG + Mini UPS; U Comm'l Pwr + UPS (no EG); D Comm'l Pwr + Batteries; V Photovoltaic/Wind + Batteries; Z Independent Generation; 1 Comm'l Pwr + EG; 4 Comm'l Pwr; 8 Dual Indep. Comm. Pwr.; S Same as Host Facility Power System Architecture
H = High Inherent Availability = .999998; R = Reduced Inherent Availability = .9998; # = Commercial Power Provided by Airport
Figure H-3 “Other” STLSC with Power System
Reference Section 7.6 of the RMA Handbook
I & ES
Service
Threads
Mission
Support
Systems
Service
Threads
NA
S R
D R
oll-U
p
TF
MS
S T
raffic
Flo
w M
an
ag
em
nt S
yste
m S
erv
ice
TD
WR
S T
erm
ina
l D
op
ple
r W
ea
the
r R
ad
ar
Se
rvic
e
WM
SC
R W
ea
the
r M
essa
ge
Sw
itch
ing
Ce
nte
r R
ep
lace
me
nt
WD
AT
WM
SC
R D
ata
Se
rvic
e
WIS
We
ath
er
Info
rma
tio
n S
erv
ice
FC
OM
Flig
ht S
erv
ice
Sta
tio
n C
om
mu
nic
atio
ns
FS
SA
S F
ligh
t S
erv
ice
Sta
tio
n A
uto
ma
ted
Se
rvic
e
WA
AS
Wid
e A
rea
Au
gm
en
tatio
n S
yste
m S
erv
ice
NM
RS
N
AS
Me
ssa
gin
g a
nd
Lo
gg
ing
Syste
m
RM
LS
S R
em
ote
Mo
nitro
ing
an
d L
og
gin
g S
yste
m S
erv
ice
Ma
nu
al P
roce
du
res
2 3 3 3 2 3 3 3 2 3 M
Control Facility Power System Architecture 2A 2A 2A 2A 2A S
Remote Site Power System Architecture 2A D S
3.1.1 Information Services
3.1.1.1 Aeronautical Information Management E 3 3 M
3.1.1.2 Flight and State Data Management E-C 3 3 2 P
3.1.1.3 Surveillance InformationManagment* S-C
3.1.1.4 Weather Information Management E 3 3 3 2 3 3
3.1.2 Traffic Services
3.1.2.1 Separation Management S-C
3.1.2.2 Trajectory Management E-C 2 P
3.1.2.3 Flow Contingency Management E-C 2 2 2 P
3.1.2.4 (.0-1 only) Short Term Capacity Management E 3 P
3.1.2.4 (all other) Short Term Capacity Management E-C 2 2
3.1.3 Mission Support Services
3.1.3.1 Long Term Capacity Management * R M
3.1.3.2.0-1 System and Service Analysis * E
3.1.3.2.0-1.0-1 System and Service Analysis * R
3.1.3.2.0-1 (.0-2 thru.0-4) System and Service Analysis E 3 3 3
3.1.3.2.0-1 (.0-5 thru.0-7) System and Service Analysis * E
3.1.3.2.0-2 (all) System and Service Analysis) * E
3.1.3.2.0-3 System and Service Analysis * R
3.1.3.2 System and Service Analysis * E M
3.1.3.3 System and Service Management E 3 3
3.1.3.4 Safety Management * E
3.2.1 Surveillance Data Collection S-C 3 3 3
3.2.2 Weather Data Collection E 3 3 3 3 3
3.2.3 Navigation Support E-C 3 3 3 P
Service/Capability – Other Service Thread STLSC Matrix
Control Facility Information Systems Service Threads
R/D & Standalone Systems Service Threads
Safety Critical Thread Pairing
Service Thread Loss Severity Category
NAS RD-2013 Section 3.1 Mission Services
NAS RD-2013 Section 3.2 Technical Infrastructure Services
Color Key:
Safety-Critical S-C
Efficiency-Critical E-C
Essential E
Routine R
Power System Architectures:
C2 CPDS Type 2; C1 CPDS Type 1; B BASIC; 2A Comec'l Pwr + EG + UPS; 1A Comec'l Pwr + EG + Mini UPS; U Comec'l Pwr + UPS (no EG); D Comec'l Pwr + Batteries; V Photovoltaic/Wind + Batteries; Z Independent Generation; 1 Comec'l Pwr + EG; 4 Comec'l Pwr; 8 Dual Indep. Comm. Pwr.; S Same as Host Facility Power System Architecture
H = High Inherent Availability = .999998
R = Reduced Inherent Availability = .9998
# = Commercial Power Provided by Airport
FAA Reliability, Maintainability, and Availability (RMA) Handbook
FAA RMA-HDBK-006C V1.1
244
GLOSSARY
I.1 Acronyms
See Table I-1 for a list of acronyms and their definitions.
Table I-1 Acronyms
Acronym Definition
ADSS Automatic Dependent Surveillance Service
AMS Acquisition Management System
AMSAA United States Army Materiel Systems Analysis Activity
ANSI American National Standards Institute
ASDES Airport Surface Detection Equipment Service
ATC Air Traffic Control
ATM Air Traffic Management
ATO-S Air Traffic Organization – Office of Safety
BDAT Beacon Data (Digitized)
BITE Built In Test Equipment
BUECS Backup Emergency Communications Service
CCCH Central Computer Complex Host
CFAD Composite Flight Data Processing Service
CHI Computer/Human Interface
CM Configuration Management
CMU Carnegie Mellon University
CNS Communications, Navigation, and Surveillance
CODAP Composite Oceanic Display and Planning Service
COFAD Composite Offshore Flight Data Service (Anchorage)
COI Critical Operational Issue
CONOPS Concept of Operations
COTS Commercial-Off-The-Shelf
CPDS Critical Power Distribution Systems
CPR Critical Performance Parameters Requirements
CRAD Composite Radar Data Processing Service
CRD Concept and Requirements Definition
CSWG Communications and Surveillance Working Group
DLA Design Logic Analysis
DoD Department of Defense
DoT Department of Transportation
DR&A Data Reduction and Analysis
EA Enterprise Architecture
EBUS Enhanced Back-up Surveillance System
ECOM En Route Communications
EIS Enterprise Infrastructure Systems
ERAM En Route Automation Modernization
ESB Enterprise Service Bus
ETARS En Route Terminal Automated Radar Service
FAA Federal Aviation Administration
FCA Functional Configuration Audit
FCOM FSS Communications Service
FDAT Flight Data Entry and Printout
FDIOR Flight Data Input/Output Remote
FMEA Failure Mode and Effects Analysis
FMECA Failure Mode, Effects, and Criticality Analysis
FRU Field Replaceable Unit
FSEP Facility, Service, and Equipment Profile
FSSAS Flight Service Station Automated Service
FTA Fault Tree Analysis
FTI FAA Telecommunication Infrastructure
FTSD FAA Telecommunications Services Description
GOTS Government-Off-The-Shelf
HDBK Handbook
HVAC Heating, Ventilation and Air Conditioning
HW Hardware
IA Independent Assessment
IDAT Interfacility Data Service
IEC International Electrotechnical Commission
IEEE Institute of Electrical and Electronics Engineers
IFPP Information for Proposal Preparation
IP Internet Protocol
IV&V Independent Validation and Verification
KPPs Key Performance Parameters
LCC Lifecycle Cost
LLWAS Low Level Wind Service
LRU Lowest Replaceable Unit
M&C Monitor and Control
MDAT MODE-S Data Link Data Service
MDT Mean Down Time
MIL Military
MoE Measure of Effectiveness
MoP Measure of Performance
MoS Measure of Suitability
MSEC MODE-S Secondary Radar Service
MSI Maintenance Significant Items
MTBF Mean Time Between Failures
MTBR Mean Time Between Repairs
NADS
NAMS NAS Messaging Transfer Service
NAPRS National Airspace Performance Reporting System
NAS National Airspace System
NASA National Aeronautics and Space Administration
NASPAS National Airspace System Performance Analysis System
NDAT
NDI Non-Developmental Item
NEMS NAS Enterprise Messaging Service
NextGen Next Generation
NIST National Institute of Standards and Technology
NSSG NASA Software Safety Guidebook
NVS NAS Voice System
OC Operating Characteristic
ODC Orthogonal Defect Classification
OMB Office of Management and Budget
ORD Operational Readiness Demonstration
OS Operating System
PCA Physical Configuration Audit
PMR Program Management Review
PR Program Requirement
PTR Program Trouble Report
QA Quality Assurance
RAC Reliability Analysis Center
RCAG Remote Communications Air/Ground
RDAT Radar Data (Digitized)
RGP Reliability Growth Plan
RMA Reliability, Maintainability, Availability
RMLSS Remote Monitoring and Logging System Service
ROC Recovery-Oriented Computing
RPR Reliability Prediction Report
RTADS Remote Tower Alphanumeric Display System
RTCA Radio Technical Commission for Aeronautics
RTP Reliability Test Plan
RVRS Runway Visual Range Service
SA Software Assurance
SAR System Analysis and Recording
SEM Systems Engineering Manual
SIM Simplified Transition Diagram
SME Subject Matter Expert
SOA Service Oriented Architecture
SOW Statement of Work
SRA Service Risk Assessment
SRMGSA Safety Risk Management Guidance for System Acquisition
SSD System Specification Document
STDDS SWIM Terminal Data Distribution System
STLSC Service Thread Loss Severity Category
SW Software
SWIM System Wide Information Management
TARS Terminal Automated Radar Service
TCOM Terminal Communications Service
TCVEX Terminal Communications Voice Exchange
TDWRS Terminal Doppler Weather Radar Service
TechOps Technical Operations Services
TFMSS Traffic Flow Management System Service
TIM Technical Interchange Meeting
TPM Transition Probability Matrix
TRAD Terminal Radar Service
TSB Terminal Surveillance Backup
TSEC Terminal Secondary Radar
TVS Terminal Voice Switch
TVSB Terminal Voice Switch Backup
UDDI Universal Description, Discovery and Integration
V&V Validation and Verification
VGS Visual Guidance Service
VSCS Voice Switching and Control System
VSCSS Voice Switching and Control System Service
VTABS VSCS Training and Backup System
WAAS WAAS/GPS Service
WDAT WMSCR Data Service
WIS Weather Information Service
WJHTC William J. Hughes Technical Center
WMSCR Weather Message Switching Center Replacement
I.2 Definitions
See Table I-2 for a list of terms and their definitions.
Table I-2 Terms
Term Definition
Cloud Computing Cloud computing is a colloquial expression used to describe a variety of computing
concepts that involve a large number of computers connected through a real-time
communication network (typically the Internet). Within the scientific community, cloud
computing is a synonym for distributed computing over a network and means the ability to
run a program on many connected computers at the same time. The popularity of the term
can be attributed to its use in marketing to sell hosted services, in the sense of application
service provisioning, that run client-server software at a remote location.
Enterprise Services Enterprise services is an over-arching term to describe an architecture combining
engineering and computer science disciplines to solve practical business problems.
Enterprise services architecture generally includes high-level components and principles
of object-oriented design employed to match the current heterogeneous world of
Information Technology (IT) architecture.
The concept of enterprise services was created in 2002 by Hasso Plattner, the chairman of
SAP AG. Enterprise services architecture includes layers of components aggregating data
and application functions from applications, which creates reusable elements, which are
also called modules. The components use enterprise services for communication.
Enterprise services architecture minimizes the complexity of the connections among the
components to facilitate reuse. The enterprise services architecture allows deployment of
Web services to create applications within the current infrastructure, increasing business
value.
Enterprise services architecture emphasizes abstraction and componentization, which
provides a mechanism for employing both the required internal and external business
standards. The main goal of enterprise services architecture is to create an IT environment
in which standardized components can aggregate and work together to reduce complexity.
To create reusable and useful components, it is equally important to build an
infrastructure allowing components to conform to the changing needs of the
environment.[51]
Enterprise Service
Bus (ESB)
A standards-based integration platform that combines messaging, Web services, data
transformation, and intelligent routing to reliably connect and coordinate the interaction of
significant numbers of diverse applications across extended enterprises with transactional
integrity. [80]
Failure Mode The manner by which a failure is observed. Generally describes the way the failure
occurs and its impact on system operation.[5]
Fault Avoidance The objective of fault avoidance is to produce fault free software. This activity
encompasses a variety of techniques that share the objective of reducing the number of
latent defects in software programs. These techniques include precise (or formal)
specification practices, programming disciplines such as information hiding and
encapsulation, extensive reviews and formal analyses during the development process,
and rigorous testing.
Hypervisor In computing, a hypervisor or virtual machine monitor (VMM) is a piece of computer
software, firmware or hardware that creates and runs virtual machines.
A computer on which a hypervisor is running one or more virtual machines is defined as a
host machine. Each virtual machine is called a guest machine. The hypervisor presents the
guest operating systems with a virtual operating platform and manages the execution of
the guest operating systems. Multiple instances of a variety of operating systems may
share the virtualized hardware resources.
Operational
Availability (AOp)
1. The percentage of time that a system or group of systems within a unit are operationally
capable of performing an assigned mission and can be expressed as uptime / (uptime +
downtime). Development of the Operational Availability metric is a Requirements
Manager responsibility.
2. The degree (expressed as a decimal between 0 and 1, or the percentage equivalent) to
which one can expect a piece of equipment or weapon system to work properly when it is
required, that is, the percent of time the equipment or weapon system is available for use.
AOp represents system “uptime” and considers the effect of reliability, maintainability, and
Mean Logistics Delay Time (MLDT). AOp is the ratio of total operating facility/service
hours to maximum facility/service hours, expressed as a percentage and derived by the
following calculation:
AOp = 100 * (Maximum Available Hours – Total Outage Time) / Maximum Available
Hours
It is the quantitative link between readiness objectives and supportability.[56] The above
definition of AOp is the NAPRS definition, which does not include MLDT.
Orthogonal Defect
Classification
(ODC)
ODC illustrates a way to categorize a defect into different classes that indicate the part of
the process that needs investigation. In the software development process, there are many
variations by different organizations although activities are broadly divided into design,
code, and test. [14]
Mean Logistics
Delay Time
(MLDT)
Indicator of the average time a system is awaiting maintenance and generally includes
time for locating parts and tools; locating, setting up, or calibrating test equipment;
dispatching personnel; reviewing technical manuals; complying with supply procedures;
and awaiting transportation. The MLDT is largely dependent upon the Logistics Support
(LS) structure and environment. [56]
Mean Maintenance
Time (MMT)
A measure of item maintainability taking into account both preventive and corrective
maintenance. Calculated by adding the preventive and corrective maintenance time and
dividing by the sum of scheduled and unscheduled maintenance events during a stated
period. [56]
Mean Time
Between
Maintenance
(MTBM)
A measure of reliability that represents the average time between all maintenance actions,
both corrective and preventive. [56]
NAS Service
Registry/ Repository
(NSRR)
A SWIM-supported capability for making services visible, accessible, and understandable
across the NAS. NSRR supports a flexible mechanism for service discovery, an
automated policies-based way to manage services throughout the services lifecycle, and a
catalog for relevant artifacts. [80]
Recovery-Oriented
Computing (ROC)
A method of computing that focuses on reducing recovery times in the event of a failure
in order to increase availability.
Reliability Growth The improvement in a reliability parameter over a period of time due to changes in
product design or the manufacturing process.[10]
Service Diversity Services provided via alternate sites, e.g. overlapping radar coverage.[57]
Software Reliability The probability that software will not cause a system failure for a specified time under
specified conditions.[3]
Stateful Having the capability to maintain state. Most common applications are inherently stateful.
Subject Matter
Expert (SME)
A person who has extensive knowledge in a particular area or on a specific topic.
System Analysis
and Recording
(SAR)
A system function that records significant system events, performance data, and system
resource utilization for the off-line analysis and evaluation of system performance.
Typical data to be recorded includes:
a. All system inputs
b. All system outputs
c. All system and component recoveries and reconfigurations
d. System status and configuration data including changes
e. Performance and resource utilization of the system and system components
f. Significant security events
System Status
Indications (e.g.,
alarm, return-to-
normal)
Indications in the form of display messages, physical or graphical indicators, and/or aural
alerts designed to communicate a change of status of one or more system elements.
Universal
Description,
Discovery and
Integration (UDDI)
An interface which provides an interoperable, foundational infrastructure for a Web
services-based software environment for both publicly available services and services
only exposed internally within an organization. [81]
Virtual Machine
(VM)
A self-contained operating environment that behaves as if it is a separate computer, with
no access to the host operating system.
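The AOp calculation defined in Table I-2 can be illustrated numerically. The sketch below uses the NAPRS form of the metric (which excludes MLDT); the outage figures are hypothetical, not NAPRS data.

```python
def operational_availability(max_available_hours, total_outage_hours):
    """AOp = 100 * (Maximum Available Hours - Total Outage Time) / Maximum Available Hours.

    This is the NAPRS form of the metric, which does not include
    Mean Logistics Delay Time (MLDT)."""
    return 100.0 * (max_available_hours - total_outage_hours) / max_available_hours

# Hypothetical: one facility-year (8,760 hours) with 0.9 hours of total outage.
aop = operational_availability(8760, 0.9)
print(f"AOp = {aop:.4f}%")  # roughly 99.99%
```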
QUICK LOOK GUIDE TO USE OF THIS HANDBOOK
This Appendix provides a guide to the FAA Reliability, Maintainability, and Availability
Handbook (FAA RMA-HDBK-006C) intended to make the handbook methodology available to
individuals who are time-constrained or do not require the full level of detail provided by the
handbook. The numbered items below cover the primary activities involved in the RMA
requirements management process. The section headings indicate, in general, the portion of the
FAA lifecycle management cycle in which the activities are conducted. An overview of the FAA
lifecycle management cycle is shown in Figure J-1.
The RMA Handbook describes a new paradigm for RMA requirements management that focuses
on applying NAS-Level requirements to service threads and assigning them requirements that are
achievable, verifiable, and consistent with the loss severity of the service provided to users and
specialists.
Figure J-1 FAA Lifecycle Management Process
The focus of the RMA management approach is on early identification and mitigation of
technical risks affecting the performance of fault-tolerant systems, followed by an aggressive
software reliability growth program to provide contractual incentives to find and remove latent
software defects. It should be noted that this process is sequential: each step depends in part or
in whole on its predecessors. It is not possible to enter the RMA process at an arbitrary
point; rather, it is necessary to perform each task within the RMA process leading up to the
desired decision point. Figures J-3 through J-8 provide an overview of the primary activities
involved in the RMA requirements management process as they relate to FAA AMS acquisition
lifecycle. An overview of the ANG-B Program Level Requirements Process is shown in Figure
J-2. This diagram depicts the ANG-B NAS Requirements Services Division’s support for the Service
Analysis, Concept and Requirements Definition (CRD), Initial Investment Analysis (IIA), and
Final Investment Analysis (FIA) phases of the FAA AMS. There are touch-points between the
RMA requirements process, described herein, and the ANG-B requirements process of Figure
J-2. At the highest level these take the form of “Decision Points” which convey the routing of the
process to the next activity depending on the decision made.
Figure J-2 ANG B Program Level Requirements Process
When changes are required to a previously approved system design, during development or in
service, RMA effects of the change must be analyzed and presented as part of the required
change approval process. Any system change that causes a decrease in MTBF or an increase in
MTTR or Automatic Recovery Time for an individual component must be analyzed for its effect
on the Service Threads’ performance including the NAS-RD Availability requirements. Any
change which causes a system to fail to meet these requirements must be coordinated with, and
approved by, ANG-B.
Concept and Requirements Definition Phase
1. Map the NAS-Level functional requirements of your system or service to established service
threads and corresponding NAPRS services OR create a new service thread if none corresponds
to your system or service. (Section 7.1 – 7.2.1)
[Figure J-2 (flowchart): the ANG-B1 Program-Level Requirements Process, version 1.0. Swimlanes for the Program Office & ANG-B1, ANG-B1, ANG-B2, ANG-B3, Engineers, Requester, CRD Lead, and ANG-B Director trace 31 numbered activities across the Service Analysis, CRD, IIA, and FIA phases: from the initial support request through Service Analysis and CRD product development and concurrence, CRDR and IA Readiness approvals, iPRD and fPRD development and concurrence, and the Initial and Final Investment decisions. Per the diagram’s disclaimer, the process map is intended to be read together with the corresponding narratives, not separately.]
SERVICE THREADS: Service threads are strings of systems/functions that support one or
more of the NAS EA Functions. These service threads represent specific data paths (e.g. radar
surveillance data) to air traffic specialists or pilots.
[Figure J-3 (flowchart), Concepts and Requirements Definition phase: map NAS-RD-20XX functional requirements to existing service threads and corresponding NAPRS services, or create a new service thread if none corresponds; develop/update the CRD products (preliminary PR Document (pPRD), Shortfall Analysis Report (SAR), Concepts of Operation (ConOps), Functional Analysis Report (FAR), and set of alternatives for evaluation) and obtain ANG-B Director concurrence at the Investment Analysis Readiness Decision (IARD) milestone. Numbered circles in the figure trace to the numbered items in this Quick Look Guide.]
Figure J-3 RMA Process: IARD Phase
Initial Investment Analysis Phase
2. Confirm that the assigned Service Thread Loss Severity Category (STLSC) for each service
thread supporting your system or service is appropriate. For new service threads, assign a
STLSC to the thread based on the effect of the loss of the service thread on NAS safety and
efficiency of operations. The thread will be designated Safety-Critical, Efficiency-Critical,
Essential, or Routine. (Section 7.3)
SERVICE THREAD LOSS SEVERITY CATEGORY (STLSC): Each service thread is
assigned a Service Thread Loss Severity Category based on the severity of impact that loss of the
thread could have on the safe and/or efficient operation and control of aircraft.
a. Designate a thread Safety-Critical if interruption could present a significant safety hazard during
the transition to reduced capacity operations.
Safety-Critical – A key service in the protection of human life. Loss of a Safety-Critical
service increases the risk in the loss of human life.
FAA experience has shown that this capability is achievable if the service is delivered by at
least two independent service threads, each built with off-the-shelf components in fault
tolerant configurations.
b. Designate a thread Efficiency-Critical if interruptions can be safely managed by reducing
capacity but may cause significant traffic disruption and large scale degraded NAS efficiency.
Efficiency-Critical – A key service used in present operation of the NAS. Loss of an
Efficiency-Critical service has a major impact on present operational capacity. Before a
failure, with fully functional automation and supporting infrastructure, a certain level of local
NAS capacity is achievable. After the failure, backup procedures are used, which creates a
hazard period while capacity is reduced to maintain safety. This reduced capacity
may be local, but the effects could propagate regionally or nationwide. If demand exceeds
the available capacity when implementing backup procedures, a queue starts to build
and efficiency and/or safety is impacted.
Experience has shown that this is achievable by a service thread built of off-the-shelf
components in a fault tolerant configuration.
c. Designate a thread Essential if interruptions can be safely managed by reducing capacity (if
necessary) and may cause only a localized traffic disruption which does not result in large scale
degraded NAS efficiency.
Essential – A service that if lost would significantly raise the risk associated with providing
efficient NAS operations.
Experience has shown that this is achievable by a service thread built of good quality,
industrial-grade, off-the-shelf components.
d. Designate a thread Routine if interruptions pose a negligible risk to providing safe and efficient
operations.
Routine – A service which, if lost, would have a minor impact on the risk associated with
providing safe and efficient NAS operations.
Experience has shown that this is achievable by a service thread built of good quality,
industrial-grade, off-the-shelf components.
Service threads can be designated Remote/Distributed in the case where service is provided by a
Remote/Distributed system. Loss of a service element (e.g., a radar, an air/ground communications
site, or a display console) would incrementally degrade the overall effectiveness of the service
thread (depending on the degree of overlapping coverage) but would not render the service thread
inoperable. These remote/distributed systems are characterized by their spatial diversity, which
makes total failure virtually impossible (Section 7.1.1.2).
Note: Distinction between levels of STLSC may depend on the size and type of the staffed
facility, and on environment factors including geography, air space design, etc. (Sections 7.4 –
7.4.3.)
3. Allocate NAS-Level availability requirements to service threads based on the loss severity and
associated availability requirements of the NAS capabilities supported by the threads (Section
7.6 – 7.6.3).
The results of this process are summarized in matrices (The Terminal Service Thread STLSC
matrix is provided below as an example) which provide the mapping between the NAS
architecture capabilities and the NAPRS service threads. These matrices are presented in more
detail in Section 7.6 – 7.6.3 of the RMA Handbook.
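The allocation in step 3 amounts to a severity lookup: a thread inherits its availability requirement from the most severe NAS capability it supports. A minimal sketch follows; the category codes (S-C, E-C, E, R) are the handbook's, but the thread-to-capability assignments are hypothetical, not the actual STLSC matrix.

```python
# Severity ranking of the handbook's STLSC codes, most severe last.
SEVERITY_ORDER = {"R": 0, "E": 1, "E-C": 2, "S-C": 3}

def thread_stlsc(supported_capability_categories):
    """Return the most severe STLSC among the capabilities a thread supports."""
    return max(supported_capability_categories, key=lambda c: SEVERITY_ORDER[c])

# Hypothetical thread supporting one Efficiency-Critical and two Essential capabilities:
print(thread_stlsc(["E", "E-C", "E"]))  # E-C
```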
4. Recognize that the probability of achieving the availability requirements for any service thread
identified as Safety-Critical is unacceptably low; therefore, where possible, decompose the
thread into alternate (at least two) independent new threads, each with a STLSC no less than
“Efficiency-Critical.” (Section 7.3)
5. If a new service thread is determined to be “Safety-Critical” (i.e., the transition to manual
procedures presents a significant risk to safety), the potential new service thread must be divided
into at least two independent service threads that can serve as primary and backup. Recognize
that the availability requirements associated with “Efficiency-Critical” service threads will
require redundancy and fault tolerance to mitigate the effect of software failures. Employing
redundancy and fault tolerance techniques allows systems to achieve higher
availabilities (Section 8.1.4).
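The rationale for dividing a Safety-Critical service across independent threads can be quantified: the service is lost only when every thread is down at once, so threads of modest availability combine into a much higher service availability. A sketch, using the handbook's Reduced Inherent Availability figure (.9998) for each thread and assuming statistical independence between threads:

```python
def service_availability(thread_availabilities):
    """Availability of a service delivered by independent redundant threads:
    the service is down only when every thread is down simultaneously."""
    unavailability = 1.0
    for a in thread_availabilities:
        unavailability *= (1.0 - a)
    return 1.0 - unavailability

# Two independent threads, each at Reduced Inherent Availability (.9998):
print(service_availability([0.9998, 0.9998]))  # ~0.99999996
```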
Figure J-4 STLSC Matrix Illustration
6. Recognize that availability is an operational performance measure that might not be well suited
to contractual requirements for acceptance of large-scale critical systems. Therefore, augment
any use of availability as a contractual requirement with parameters such as MTBF, Mean Time to
Repair/Restore (MTTR), and other performance measures that are verifiable requirements. For
any software-based systems, couple these measures with an aggressive software reliability
growth program and its associated testing trends and problem report metrics (Section 7.7 –
7.7.2).
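The link between those verifiable parameters and availability is the standard inherent-availability relation, Ai = MTBF / (MTBF + MTTR). The sketch below uses illustrative figures, not handbook requirements:

```python
def inherent_availability(mtbf_hours, mttr_hours):
    """Ai = MTBF / (MTBF + MTTR): the steady-state availability implied
    by the contractual reliability and restoration parameters."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Illustrative: a 10,000-hour MTBF with a 30-minute MTTR.
ai = inherent_availability(10_000, 0.5)
print(f"Ai = {ai:.6f}")  # 0.999950
```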
7. Recognize that RMA requirements may vary among systems and infrastructures.
The guidelines for generating RMA requirements for the different systems/infrastructures (in
bold) are as follows:
The NAS-RD-2013 does not provide RMA requirements for Remote/Distributed and
Standalone Systems. The Handbook provides a methodology that involves the use of the
STLSC matrices and FAA Communications Diversity Order 6000.36. The RMA
characteristics for these systems, therefore, are dictated primarily by life cycle cost (LCC)
and diversity considerations (Section 7.7.2).
The RMA requirements for Power System selection are based on the STLSCs of the threads
they support as well as the facility level in which they are installed (Section 7.7.3.1).
For Heating, Ventilation and Air Conditioning (HVAC) subsystems, FAA facility
standards specify temperature and humidity ranges for operations and equipment spaces, but
do not require any specific availability levels. FAA Order 6480.7D requires that ATCT and
TRACONs be equipped with redundant air conditioning systems for critical spaces (e.g.
tower cab, communications equipment room, etc.), but no specific performance requirements
can be relied upon in designing for overall operational availability (Section 7.7.3.2).
For Communication Transport, the RMA characteristics of the FAA Telecommunication
Infrastructure (FTI) are set out in Table 7.1 of the FTI Operations Reference Guide (Section
7.7.3.3.2).
RMA requirements for an Enterprise Infrastructure System (EIS) are premised on the service
thread with the highest loss severity category that the EIS supports. RMA requirements for
an EIS should never be less than what is prescribed for its most critical NAS service (Section
7.7.3.3.3).
8. Use RMA models only as a rough order of magnitude confirmation of the potential of the
proposed hardware configuration to achieve the inherent availability of the hardware, not a
prediction of operational reliability (Section 8.1.4).
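A rough-order-of-magnitude model of the kind item 8 describes can be as simple as treating the thread as a series string of components and multiplying their inherent availabilities; every component must be up for the thread to be up. The component values below are assumptions for illustration, not handbook figures:

```python
from math import prod

def series_thread_availability(component_availabilities):
    """Rough-order-of-magnitude inherent availability of a service thread
    modeled as a series string: the thread is up only when every component is up."""
    return prod(component_availabilities)

# Illustrative three-component thread (e.g., sensor, processor, display):
print(f"{series_thread_availability([0.9999, 0.99999, 0.9999]):.5f}")  # 0.99979
```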
[Figure J-5 (flowchart), Initial Investment Analysis phase: confirm the assigned STLSC for existing threads, or assign one (Safety-Critical, Efficiency-Critical, Essential, or Routine) to new threads; allocate NAS-Level availability requirements to service threads via the STLSC matrices; decompose any Safety-Critical thread into alternative threads and assign their STLSCs; assign initial RMA contractual parameters (e.g., MTBF, MTTR, recovery times, mean time between successful recoveries) for Communications Transport, Remote/Distributed and Standalone Systems, Power Systems (STLSC matrices/facility level), HVAC (FAA Order 6480.7D), and Enterprise Architecture (FTI Operations Reference Guide, STLSC matrices, Order 6000.36); identify RMA-related tasks for the initial SOW; and develop the Initial PR Document (iPRD, Section 3.3 RMA requirements) for ANG-B Director concurrence at the Initial Investment Decision (IID) milestone.]
Figure J-5 Initial Investment Analysis Phase
Final Investment Analysis Phase
9. Focus RMA effort on reducing risks, developing risk mitigation strategies, and finalizing
requirements. Perform a detailed risk assessment to identify risks and develop risk mitigation
strategies. Update the program requirements document to contain final quantified performance
measures against which solution performance will be assessed during operational testing and
post-implementation review.
[Figure J-6 (flowchart), Final Investment Analysis phase: engineering RMA assessments identify and assess technical risks, supported by RMA modeling activities and risk mitigation; detailed RMA requirements and the SOW are finalized into quantified RMA and performance requirements; the procurement package (final SOW and System Specification Document (SSD) with contractual RMA requirements) is prepared, and the fPRD is developed for ANG-B Director concurrence at the Final Investment Decision (FID) milestone.]
Figure J-6 RMA Process: Final Investment Analysis Phase
Solution Implementation Phase
Focus the RMA effort during development on design review and risk reduction testing activities to
identify and resolve problem areas that could prevent the system from approaching its theoretical
potential (Sections 8.4.1 and 8.2.2.3).
10. Recognize that “pass/fail” reliability qualification tests are impractical for systems with high
reliability requirements, and substitute an aggressive software reliability growth program (Section
8.5.3).
Reliability growth testing is an ongoing process of testing and correcting failures.
Reliability growth was initially developed to discover and correct hardware design defects.
Statistical methods were developed to predict the system MTBF at any point in time and to
estimate the additional test time required to achieve a given MTBF goal.
[Flowchart: the Solution Implementation Phase RMA process. Key elements: a software reliability growth testing loop driven by the quantified RMA parameters in the Final PR Document (fPRD, RMA/performance requirements) and the System Specification Document (SSD); failed test results lead to correcting failures and retesting until the RMA requirement is validated; spans the Initial Investment Decision (IID) to the In-Service Decision (ISD) milestone.]
Figure J-7 Solution Implementation Phase
In-Service Management
11. Use NAPRS data from the NASPAS to provide feedback on the RMA performance of currently
fielded systems to assess the reasonableness and attainability of new requirements, and to verify
that the requirements for new systems will result in systems with RMA characteristics that are at
least as good as those of the systems they replace (Section 10.1).
The availabilities assigned to service threads provide a second feedback loop from the
NAPRS field performance data to the NAS-Level requirements, as shown in Figure 10-2 in
the RMA Handbook. This redundancy provides a mechanism for verifying the realism and
achievability of the requirements, and helps to ensure that the requirements for new systems
will be at least as good as the performance of existing systems.
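One way to picture this feedback check is to compare the availability implied by a new system's proposed MTBF/MTTR requirement against the NAPRS-reported availability of the system it replaces. The sketch below assumes the standard steady-state availability relation; the function names and numbers are illustrative, not drawn from NAPRS data.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability implied by MTBF and MTTR."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def at_least_as_good(new_mtbf, new_mttr, fielded_availability):
    """True if the proposed requirement meets or beats fielded performance."""
    return availability(new_mtbf, new_mttr) >= fielded_availability

# Example: a proposed 2000-hour MTBF / 0.5-hour MTTR requirement,
# checked against a fielded system reporting 0.999 availability.
print(at_least_as_good(2000.0, 0.5, 0.999))
```

A requirement that fails this comparison would warrant revisiting before the replacement system's specification is finalized.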
[Flowchart: the In-Service Management Phase RMA performance feedback paths. Key elements: operations monitor and evaluate operational performance across communications, power systems, operations, and system support; performance information reporting via equipment HDRs, software PTRs, NCPs, and outage reports per FAA Order 6000.36A/B and FAA Order 6950.2E; NAPRS monitors the performance of operational systems and reports operational availability for NAPRS reportable services; feedback flows to Mission Analysis (operations concepts, local NAS operational capacity) and Solution Implementation (SSD MTBF/MTTR feedback, system-level specification) across all AMS acquisition phases, as well as NAS availability feedback to the System Specification Document (SSD) (refer to Figure 10-1, RMA Process Diagram).]
Figure J-8 In-Service Management Phase