
Page 1: Guide to Reliability, Availability and Maintainability

© State of NSW through Transport for NSW 2021

Guide to Reliability, Availability and Maintainability

T MU AM 06002 GU

Guide

Version 2.0

Issue date: 21 May 2021

Page 2: Guide to Reliability, Availability and Maintainability

T MU AM 06002 GU Guide to Reliability, Availability and Maintainability

Version 2.0 Issue date: 21 May 2021

© State of NSW through Transport for NSW 2021

Important message

This document is one of a set of standards developed solely and specifically for use on Transport Assets (as defined in the Asset Standards Authority Charter). It is not suitable for any other purpose.

The copyright and any other intellectual property in this document will at all times remain the property of the State of New South Wales (Transport for NSW).

You must not use or adapt this document or rely upon it in any way unless you are providing products or services to a NSW Government agency and that agency has expressly authorised you in writing to do so. If this document forms part of a contract with, or is a condition of approval by a NSW Government agency, use of the document is subject to the terms of the contract or approval. To be clear, the content of this document is not licensed under any Creative Commons Licence.

This document may contain third party material. The inclusion of third party material is for illustrative purposes only and does not represent an endorsement by NSW Government of any third party product or service.

If you use this document or rely upon it without authorisation under these terms, the State of New South Wales (including Transport for NSW) and its personnel do not accept any liability to you or any other person for any loss, damage, costs and expenses that you or anyone else may suffer or incur from your use and reliance on the content contained in this document. Users should exercise their own skill and care in the use of the document.

This document may not be current and is uncontrolled when printed or downloaded. Standards may be accessed from the Transport for NSW website at www.transport.nsw.gov.au

For queries regarding this document, please email the ASA at [email protected] or visit www.transport.nsw.gov.au


Standard governance

Owner: Senior Manager, Systems Engineering, Asset Management Branch
Authoriser: Director, Asset Management Partnering and Services, Asset Management Branch
Approver: Executive Director, Asset Management Branch, on behalf of the Asset Management Branch Configuration Control Board

Document history

Version  Summary of changes
1.0      First issue 27 July 2015.
2.0      Second issue; changes include updates and amendments to reflect current TfNSW governance.


Preface

The Asset Management Branch, formerly known as the Asset Standards Authority (ASA), is a key strategic branch of Transport for NSW (TfNSW). As the network design and standards authority for NSW Transport Assets, as specified in the ASA Charter, the ASA identifies, selects, develops, publishes, maintains and controls a suite of requirements documents on behalf of TfNSW, the asset owner.

The ASA deploys TfNSW requirements for asset and safety assurance by creating and managing TfNSW's governance models, documents and processes. To achieve this, the ASA focuses on four primary tasks:

• publishing and managing TfNSW's process and requirements documents including TfNSW plans, standards, manuals and guides
• deploying TfNSW's Authorised Engineering Organisation (AEO) framework
• continuously improving TfNSW's Asset Management Framework
• collaborating with the Transport cluster and industry through open engagement

The AEO framework authorises engineering organisations to supply and provide asset related products and services to TfNSW. It works to assure the safety, quality and fitness for purpose of those products and services over the asset's whole-of-life. AEOs are expected to demonstrate how they have applied the requirements of ASA documents, including TfNSW plans, standards and guides, when delivering assets and related services for TfNSW.

Compliance with ASA requirements by itself is not sufficient to ensure satisfactory outcomes for NSW Transport Assets. The ASA expects that professional judgement be used by competent personnel when using ASA requirements to produce those outcomes.

About this document

This guide aims to provide supplier organisations with guidance in managing engineering activities involving systems that are required to be reliable, available and maintainable. The changes in this version include updates and amendments to reflect current TfNSW governance. This is a second issue.


Table of contents

1. Introduction ... 6
2. Purpose ... 6
2.1. Scope ... 6
2.2. Application ... 7
3. Reference documents ... 7
4. Terms and definitions ... 9
5. Reliability, availability and maintainability management ... 10
5.1. Plan reliability, availability and maintainability management activities ... 11
5.2. Definition of system boundaries and assumptions ... 13
5.3. Identification of reliability, availability and maintainability requirements ... 14
5.4. Allocation of reliability, availability and maintainability requirements ... 14
5.5. Development of reliability, availability and maintainability acceptance criteria ... 15
5.6. Reliability, availability and maintainability analysis and modelling ... 15
5.7. Validation of reliability, availability and maintainability requirements ... 16
5.8. Reliability, availability and maintainability deliverables ... 17
6. Reliability, availability and maintainability tools and techniques ... 17
6.1. Reliability block diagram analysis ... 18
6.2. Failure mode, effects and criticality analysis ... 18
6.3. Fault tree analysis ... 19
6.4. Human reliability analysis ... 19
6.5. Maintenance requirements analysis ... 22
6.6. Failure recording analysis and corrective action system ... 23
Appendix A Additional reference documents ... 26
Appendix B Examples of reliability block diagrams ... 27
Appendix C Example of FMECA table - Bogie assembly ... 29
Appendix D Examples of fault tree analysis ... 30
Appendix E Examples of human error analysis ... 31
Appendix F Example of MRA - station escalator ... 33
Appendix G Example of FRACAS incident report – CPU motherboard ... 35


1. Introduction

An Authorised Engineering Organisation (AEO) engaged by Transport for NSW (TfNSW) to undertake engineering activities is required to have reliability, availability and maintainability (RAM) management arrangements in place that are relevant to the engineering services or products that the AEO provides to TfNSW. These arrangements should enable the planning, execution, and reporting of all RAM management activities for a system.

This document provides guidance on complying with the requirements of T MU MD 00009 ST AEO Authorisation Requirements and T MU AM 06006 ST Systems Engineering Standard which mandate RAM requirements.

This guide also elaborates on the RAM guidance described in TS 10504 AEO Guide to Engineering Management.

AEOs should ensure that RAM management activities and outcomes are at an appropriate level required for the scale and complexity of engineering services and systems provided, and should incorporate RAMS requirements in the design and development of systems they are contracted to deliver.

2. Purpose

This document is intended to provide guidance to TfNSW and its supply chain AEOs applying RAM management during engineering specification and asset life cycle stages and activities involving systems that are required to operate dependably.

This ensures that TfNSW and its supply chain of AEOs (and non-AEOs operating under the assurance arrangements of AEOs) are able to demonstrate sufficient control over RAM related risks. This guidance is of particular relevance to suppliers who provide reliability-critical or safety-critical engineering specification and design, in addition to systems engineering, integration and maintenance services.

2.1. Scope

This guide provides guidance to TfNSW and AEOs for RAM management related, in particular, to the system specification, design and maintenance services. It also provides guidance on RAM management principles, methods, techniques and processes used to analyse and deliver RAM requirements from stakeholders including operational, maintenance and interfacing targets. AEOs are assumed to have business-level policies addressing quality, performance and safety.

For this guide, the term reliability, availability, maintainability and safety (RAMS) is used to define an integrated management approach that includes safety. However, this guide is limited to RAM and does not provide guidance on the safety element of RAMS management, as safety assurance is addressed in T MU MD 20001 ST System Safety Standard for New or Altered Assets. Refer to this document for guidance on safety management.

The specific evidence required to demonstrate RAM management processes will depend on the scope and nature of the work undertaken by the AEO (noting that RAM is not applicable to certain AEO services, for example land survey). For that reason, this document does not outline the evidence required to be an AEO; rather it provides an outline of the typical RAM methods and processes that AEOs may need to demonstrate.

2.2. Application

This document applies to Transport cluster agencies and supply chain AEOs, and applies specifically to the management of system and element level reliability, availability and maintainability for new or altered NSW transport assets.

The level of application of RAM management principles and methods should be scaled and tailored according to the degree of system novelty or complexity, the use of unique or non-standard system or product configurations, and the associated level of safety risk.

The application of RAM (in particular reliability) analysis in support of system design may be negligible or zero for some projects where type approved products are used in standard, repeatable, and approved system configurations. This should be reflected in contractual requirements to avoid unnecessary and excessive effort, resources, time and cost.

The need for, and application of, RAM management has different meanings to different asset disciplines (for example, civil and structural engineers generally use the term durability, whereas electrical, electronic and mechanical engineers generally use RAM). The impact of RAM (or durability) management on planning and acquisition of new or altered systems and the specific disciplines that support the system design should be understood.

T MU AM 06006 GU Systems Engineering Guide provides guidance on the level of RAM management activities required for engineering disciplines and identifies a range of public transport engineering projects and the level of RAMS to be applied.

3. Reference documents

The following documents are cited in the text. For dated references, only the cited edition applies. For undated references, the latest edition of the referenced document applies.

International standards

BS EN 60706-5 Maintainability of equipment - Part 5: Testability and diagnostic testing
EN 60300-3-1 Dependability management – Part 3-1: Application guide – Analysis techniques for dependability – Guide on methodology

Australian standards

AS IEC 61025 Fault tree analysis (FTA)
AS IEC 61078 Reliability block diagrams
AS/NZS IEC 60812 Failure mode and effects analysis (FMEA and FMECA)

Transport for NSW standards

T MU AM 01003 F1 Blank FMECA Sheet
T MU AM 01003 ST Development of Technical Maintenance Plans
T MU AM 01010 ST Framework for Developing an Asset Spares Assessment and Strategy
T MU AM 06001 GU AEO Guide to Systems Architectural Design
T MU AM 06006 GU System Engineering Guide
T MU AM 06006 ST Systems Engineering Standard
T MU AM 06007 GU Guide to Requirements Definition and Analysis
T MU AM 06008 ST Operations Concept Definition
T MU AM 06009 ST Maintenance Concept Definition
T MU AM 06016 GU Guide to Verification and Validation
T MU HF 00001 GU Human Factors Integration – General Requirements
T MU MD 00009 ST AEO Authorisation Requirements
T MU MD 20001 ST System Safety Standard for New or Altered Assets
TS 10504 AEO Guide to Engineering Management

Other references

Department of Defense United States of America, 1991, MIL-HDBK-217F Military Handbook Reliability Prediction of Electronic Equipment
Railtrack EE&CS Report, Infrastructure Risk Modelling Geographical Interlocking, RT/S&S/IRM_FTA/11 Issue 1 January 1998
UNISIG, 5 August 2014, SUBSET-088-1, ETCS Application Level 1 - Safety Analysis; Part 1 - Functional Fault Tree, Issue 3.5.4

Note: Appendix A contains a list of other documents not cited in the text that provide additional information and guidance.


4. Terms and definitions

The following terms and definitions apply in this document:

AEO: Authorised Engineering Organisation

assurance: a positive declaration intended to give confidence

authorisation: the conferring of authority, by means of an official instruction and supported by assessment and audit

availability: measure of the percentage of time that an item or system is available to perform its designated function (AS 4292.4)

BRS: business requirements specification

durability: the capability of a structure or any component to satisfy, with planned maintenance (if applicable), the design performance requirements over a specified period of time under the influence of the environmental actions, or as a result of a self-ageing process (ISO 13823). For assets either with no availability to program maintenance or for physically inaccessible assets or parts of assets, the durability requirement will typically need to be increased beyond that required for a maintainable structure, to satisfy the specified design life.

ETCS: European Train Control System

failure: the inability of a system or asset to perform its intended function or satisfy some predetermined conditional attribute

fault tree: logic diagram showing the faults of sub items, external events, or combinations thereof, which cause a predefined, undesired event (IEC 60500)

FTA: fault tree analysis; deductive analysis using fault trees (IEC 60500)

FMECA: failure mode, effects and criticality analysis

FRACAS: failure recording analysis and corrective action system

HEART: human error assessment and reduction technique

HRA: human reliability analysis

MRA: maintenance requirements analysis

maintainability: the probability that a given active maintenance action, for an item under given conditions of use, can be carried out within a stated time interval when the maintenance is performed under stated conditions and using stated procedures and resources (IEC 60050-191)

RAMS: reliability, availability, maintainability and safety

RBD: reliability block diagram; logical, graphical representation of a system showing how the success states of its sub-items (represented by blocks) and combinations thereof affect system success state (AS IEC 61078)

reliability: the ability of an item of equipment or a system to perform a required function under stated conditions for a stated period of time or at a given point in time (AS 4292.4)

review: a method to provide assurance by a competent person that an engineering output complies with relevant standards and specific requirements, and is safe and fit for purpose

SRS: system requirements specification

supplier: a supplier of engineering services or products

system safety: the concurrent application of a systems based approach to safety engineering and of a risk management strategy covering the identification and analysis of hazards and the elimination, control or management of those hazards throughout the life cycle of a system or asset

TfNSW: Transport for NSW

THERP: technique for human error rate prediction

transport asset: assets used for or in connection with or to facilitate the movement of persons and freight by road, rail, sea, air or other mode of transport, and includes transport infrastructure

5. Reliability, availability and maintainability management

T MU MD 00009 ST requires that AEOs demonstrate they have RAM management arrangements in place, relevant to the engineering services or products provided.

The introduction of new or altered assets results in transport network complexity and RAM implications. Implementation decisions should be made based on trade-offs between the implementation costs and the subsequent operation and maintenance costs.

Consideration should be given to the total impact on the existing network, existing maintenance activities such as safety working, and additional access arrangements.

The introduction of new assets that simplify the transport network configuration should generate RAM improvements. However, the introduction of new assets that do not simplify the network may not generate RAM improvements.

Application of RAM engineering is required to ensure optimum transport network effectiveness, safety and availability. RAM engineering is a whole-of-system life cycle philosophy and methodology that is applied during the plan, acquire, operate/maintain, and dispose stages of the system or asset lifecycle. RAM is most effectively applied during the plan/acquire phase for consideration of the future operate/maintain/dispose phases. RAM analysis in the acquisition phase will produce a maintenance schedule that will be used in the operate/maintain phase. If there is a change to the operations concept requiring RAM to be revisited, then it is conducted in the planning phase again.

RAM engineering and management activities that include planning and producing deliverables should be carried out by suitably qualified and experienced individuals. RAM management deliverables should be appropriate and sufficient to provide assurance to relevant stakeholders that the system can satisfy the required high level performance targets. TfNSW should provide the performance targets (generally based on system and service modelling, simulation and analysis), for example, the service reliability performance target of 92% for on-time running of trains.

The RAM activities to be undertaken include, but are not limited to, the following:

• plan the RAM management activities
• define system boundaries and assumptions, dependencies and constraints for RAM analysis
• identify the system RAM requirements
• allocate the requirements to elements (sub-systems)
• develop the RAM system acceptance criteria
• undertake RAM analysis and modelling
• validate the system RAM requirements

System failure recording and analysis is undertaken using a range of tools and processes. These include, but are not limited to, the following:

• failure mode, effects and criticality analysis (FMECA)
• reliability block diagrams (RBDs)
• fault tree analysis (FTA)
• failure recording analysis and corrective action system (FRACAS)

5.1. Plan reliability, availability and maintainability management activities

T MU AM 06006 ST requires that projects consider RAMS performance and how it relates to operational performance for novel systems early in the system life cycle, starting with development of the operations concept definition and maintenance concept definition in accordance with T MU AM 06008 ST Operations Concept Definition and T MU AM 06009 ST Maintenance Concept Definition respectively.

T MU AM 06006 ST also requires that projects consider sustainable operation and maintenance of the new or altered system over the full system life cycle at the beginning of the project. Before undertaking any work related to asset life cycle stages and activities, AEOs should prepare a RAM (or durability) management plan. Depending on the level of project and system complexity, the RAM plan may be combined with other asset related plans to demonstrate how the system RAM requirements will be achieved.

The RAM management plan should focus on managing RAM across the asset life cycle stages and the activities, rules and principles that are required to be adopted, including the following:

• reliability
  o use of proven systems and equipment (assurance figures should be obtained)
  o use of systems that are applicable to the conditions (systems proven in other countries may not be suitable to NSW or Australian environmental conditions)
  o human factors (human reliability and human error analysis)
  o fault tolerance and graceful degradation
  o the levels of redundancy designed into the system
  o design life
• availability
  o maintenance scheduling
  o service recovery times and service availability
  o network modelling to assess capacity against planned utilisation
• maintainability
  o condition monitoring and diagnostics
  o condition inspections
  o obsolescence management
  o human factors associated with the maintenance and repair task
  o maintenance resources
  o access arrangements for maintenance (for example, possessions, maintenance staging areas)
  o maintenance scheduling
  o isolation for maintenance
  o preventative maintenance
  o corrective maintenance
  o human factors considerations for maintenance

The RAM management plan should also include details on the roles and responsibilities required within the organisation to achieve the RAM objectives.

Where there are proposed changes to parts of an existing system, the RAM management plan should consider the resulting impact to the overall system RAM from these changes. The RAM management plan should, where practical, include an assessment of the existing system RAM performance, and the changes to the RAM performance resulting from the new or altered assets.

An example of an impact to reliability is the addition of a platform display to an existing light rail system. The light rail operating contract specifies a maximum of three isolations of the line per year. The platform display system needs to have sufficient reliability to work within this limitation.

An example of impact to availability is the consolidation of maintenance depots from multiple existing locations to a new central location. The relocation of the maintenance depots results in additional travelling distances from the central depot to faults and an increase to the maintenance response time.

An example of impact to maintainability is the addition of two extra running lines to an existing double running line system. These two additional running lines alter the maintainability (access) of the combined services route and sub-stations adjacent to the original two lines. These assets transition from a safe location to a danger zone location and additional safety procedures will be required to maintain these assets.

5.2. Definition of system boundaries and assumptions

System and element boundaries should be defined clearly, by means of a defined system functional and physical architecture, before starting any RAM activities.

Assumptions may be made as a result of incomplete information in instances where programs and systems are large or complex. As system definition progresses, these assumptions should either be confirmed as statements of fact or eliminated within the system design process. Clarification of these assumptions should be made with the asset custodian (client representative).

The system architecture may need to change as a result of iterative design decisions where the architecture cannot satisfy the system RAM requirements.

Refer to T MU AM 06001 GU AEO Guide to Systems Architectural Design for more information.


5.3. Identification of reliability, availability and maintainability requirements The asset custodian (client representative) should provide, early in the asset lifecycle, the high

level system RAM objectives that are in turn aligned to transport service availability targets.

RAM requirements captured from the stakeholders should be well-defined, demonstrable,

include explicit targets and meaningful to allow efficient RAM activities to be conducted. RAM

requirements should be considered in the context of their implementation cost. If the RAM

targets are very exacting, then the resulting implementation cost may be very high and options

analyses should be conducted to determine willingness to accept such costs.

The RAM requirements capture process should start with service reliability and availability in the

business requirements specification (BRS) and be further refined in the system requirements

specification (SRS) development process. Any requirements which fall outside these criteria

should be challenged and clarified as necessary.

Refer to T MU AM 06007 GU Guide to Requirements Definition and Analysis for more

information on requirements management in general.

5.4. Allocation of reliability, availability and maintainability requirements RAM requirements allocation assures that the high level BRS RAM targets are allocated

appropriately at system and element levels. Models based on RBDs and other modelling

techniques should be employed in the allocation process for novel, highly complex systems.

The allocations should be used as an aid to achieving the RAM objectives. These system and

element level RAM targets should then be converted into RAM requirements.

To ensure realistic allocation, system and element RAM requirements should be compared to

empirical data for identical or similar systems (that is, benchmarking) whenever possible. The

empirical data should be validated for its relevance, considering factors such as the modes of

operation, the operating environment and any fine-tuning or adjustments that have been used. If

allocated values are not achievable, design options analysis across systems and elements

should be performed to reallocate system RAM requirements. The process of allocation,

comparison with empirical data, trade-offs and iteration as required should result in system and

element RAM requirements being defined.

The allocation of a RAM target to each system and element should be specific, measurable and

attainable, taking into account the criticality and risks involved in the design, development and

installation.

Systems and elements that are critical to performance should have RAM targets set higher than

other non-critical systems, based on the system level reliability or redundancy employed. When

allocating RAM targets, the number and complexity of the system interfaces and the extent to


which the system will be affected by external factors including the operating environment needs

to be considered.

5.5. Development of reliability, availability and maintainability acceptance criteria

The acceptance criteria for RAM requirements should be agreed between the stakeholders including the asset custodian (client representative) and system developer.

These stakeholders may include representatives from the Transport cluster.

This should include, but not be limited to, the RAM validation principles to be applied and the

tests and analysis to be carried out for the validation. Acceptance criteria should be agreed and

documented through the requirements allocation process, starting with the BRS and then the

SRS. Consideration should be given to the cost of implementing the acceptance criteria.

5.6. Reliability, availability and maintainability analysis and modelling

T MU AM 06006 ST requires that projects use RAMS modelling to appropriately support option selection and development and preliminary system design, to ensure that the new or altered system will meet the stated operational capability and provide value for money over the designed system lifetime.

During the plan and acquire stages of a project, reliability predictions should be used to assess

whether the allocated RAM requirements are achievable. An iterative process of comparing

RAM predictions with RAM requirement allocations combined with trade-off studies, will

eventually result in an efficient design that achieves whole of life RAM performance targets.

Reliability predictions combine lower level component or unit level reliability data through

reliability modelling, and the operating and environmental conditions, to estimate the integrated

system reliability. The validity of the reliability predictions is highly dependent upon the quality of

reliability data and assumptions made.

Whenever possible, reliability predictions should be based on data from similar components or

equipment already in use in service in similar operational environments. For electronic

equipment, parts count prediction methods based on MIL-HDBK-217F Military Handbook

Reliability Prediction of Electronic Equipment can be used to obtain reliability predictions. Where

this is not possible, reliability data may be extrapolated from tests or trials conducted by the

supplier or manufacturer. In all cases the sources of the data should be cited to maintain an

audit trail. Suppliers of original equipment and systems should provide evidence that they

satisfy all RAM requirements and that they are suitable for the intended application.


Reliability prediction should use reliability modelling where practicable for novel, high complexity

systems, such as a RBD, fault tree or a computerised simulation model, to describe the

reliability behaviour of the system and reliability data of the constituent elements.

RAM predictions are performed predominantly for the following purposes:

• reliability

o to evaluate expected reliability performance against target risk of failure

o to identify unexpected weaknesses in a design, including single point failures

o to provide a basis for a testing program

o to predict maintenance effort and cost

• availability

o to evaluate outage times and service disruptions against economic, community and

quality criteria

o to identify critical subsystems and components

o to determine the need for redundant or stand-by equipment configurations

• maintainability

o to determine the most effective maintenance strategy

o to optimise maintenance facilities, diagnostic and training tools, spares holdings and

manning levels

o to assess need for condition monitoring

RBDs and FTA are systematic top-down reliability modelling and analysis techniques, and are

usually best applied when introducing novel, highly complex new or altered systems during

system architecture design during the acquire stage.

In addition to RAM modelling, complementary analysis techniques should be used during design

to concentrate on areas which are critical to the system reliability, such as failure mode, effects,

and criticality analysis (FMECA).

5.7. Validation of reliability, availability and maintainability requirements

Validation should include details of the validation tasks and relevant results against the RAM acceptance criteria. Any limitations and constraints applying to the system should also be noted.

There are numerous sources of international good practice in reliability and maintainability

validation. These include MIL-HDBK 781, EN 60300-3-1 Dependability management – Part 3-1:

Application guide – Analysis techniques for dependability – Guide on methodology and

EN 60706-5 Maintainability of equipment - Part 5: Testability and diagnostic testing.


A RAM report including results from the analysis and verification and validation activities should

be prepared and then issued to stakeholders. Refer to T MU AM 06016 GU Guide to

Verification and Validation for more information. The RAM report should clearly display all

verification and validation failures against RAM acceptance criteria. Corrective action should

then be undertaken to rectify these failures. Validation and verification activities should be

repeated following corrective actions, and the RAM report re-issued.

5.8. Reliability, availability and maintainability deliverables

The following deliverables should be produced during the RAM process:

• RAM management plan including the asset life cycle stages

• BRS RAM requirements with their acceptance criteria

• SRS RAM requirements with their acceptance criteria

• element level RAM requirements with their acceptance criteria

• RAM analysis and modelling with their data

• RAM report including results from the analysis, modelling, verification and validation

activities

6. Reliability, availability and maintainability tools and techniques

Careful consideration should be given to the selection of the appropriate RAM tools and design techniques used to provide RAM results. This consideration involves a critical decision as to whether a simple calculation or a comparison with an existing system is sufficient or whether RAM tools and design techniques are required.

These tools and design techniques may provide different RAM results as the system definition

progresses. These progressive RAM results should be recorded in the RAM report during the

asset life cycle stages.

Different asset types may have different approaches and tools for RAM modelling and analysis.

Communications, signalling and electrical designers may use RBD analysis, failure mode,

effects, and criticality analysis (FMECA) or FTA tools, whereas bridge and structural designers

may use finite element analysis (FEA) tools.

The RAM tools and techniques are explained in Section 6.1 through to Section 6.6.


6.1. Reliability block diagram analysis

AS IEC 61078 Reliability block diagrams describes RBDs as a diagrammatic analysis method for demonstrating the contribution of component reliability to the success or failure of a complex system.

A RBD is drawn as a series of blocks connected in parallel or series configuration, with each

block representing a component of the system with an associated failure rate. Parallel paths are

redundant, meaning that all of the parallel paths must fail for the parallel network to fail. By

contrast, any single failure along a serial path causes the entire serial path to fail.

RBDs are used to calculate the reliability of each element and the contributory effect on the

reliability of the system. This assists in the identification of single points of failure in the system.

Examples of where a RBD would be used are for the development of a station announcement

system and a blue light emergency station provided in Appendix B.
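A series/parallel RBD evaluator, added here as an example and not part of this guide or its appendices: it applies the series and parallel rules above to compute steady-state availability (A = MTBF / (MTBF + MTTR) per block) and flags single points of failure by forcing each block unavailable in turn. The block names echo the blue light emergency display RBD in Appendix B, but the MTBF and MTTR values and the redundant relay chains are constructed for illustration.

```python
"""Series/parallel reliability block diagram (RBD) evaluator.

Added example, not part of T MU AM 06002 GU. Block names follow the
station blue light emergency display RBD in Appendix B; the MTBF/MTTR
values and the duplicated relay chains are constructed for illustration.
"""
from dataclasses import dataclass
from typing import Iterator, List, Optional, Union


@dataclass
class Block:
    """One RBD block: a repairable component with an MTBF and an MTTR (hours)."""
    name: str
    mtbf_h: float
    mttr_h: float

    def availability(self) -> float:
        # Steady-state availability of a repairable item: A = MTBF / (MTBF + MTTR)
        return self.mtbf_h / (self.mtbf_h + self.mttr_h)


@dataclass
class Series:
    """Serial path: any single failure along the path fails the whole path."""
    items: List["Node"]


@dataclass
class Parallel:
    """Redundant paths: all paths must fail for the parallel network to fail."""
    items: List["Node"]


Node = Union[Block, Series, Parallel]


def availability(node: Node, failed: Optional[str] = None) -> float:
    """Availability of a node; the block named `failed` is forced unavailable."""
    if isinstance(node, Block):
        return 0.0 if node.name == failed else node.availability()
    if isinstance(node, Series):
        a = 1.0
        for item in node.items:
            a *= availability(item, failed)   # all must be up
        return a
    u = 1.0
    for item in node.items:
        u *= 1.0 - availability(item, failed)  # all must be down
    return 1.0 - u


def blocks(node: Node) -> Iterator[Block]:
    """Every leaf block of the diagram, in drawing order."""
    if isinstance(node, Block):
        yield node
    else:
        for item in node.items:
            yield from blocks(item)


def single_points_of_failure(node: Node) -> List[str]:
    """Blocks whose failure alone takes the system down (no surviving path)."""
    return [b.name for b in blocks(node) if availability(node, failed=b.name) == 0.0]


# Constructed example: push button in series with two redundant relay/comms
# chains, a UPS and the display. Values are illustrative, not the figure's.
BLUE_LIGHT = Series([
    Block("Emergency Push Button with Key Reset", 100_000, 4),
    Parallel([
        Series([Block("Input Relay A", 50_000, 1), Block("Comms Module A", 100_000, 1),
                Block("Output Relay A", 50_000, 1)]),
        Series([Block("Input Relay B", 50_000, 1), Block("Comms Module B", 100_000, 1),
                Block("Output Relay B", 50_000, 1)]),
    ]),
    Block("240V AC UPS", 600_000, 3),
    Block("Blue Light Display", 100_000, 4),
])


if __name__ == "__main__":
    print(f"System availability: {availability(BLUE_LIGHT):.6f}")
    print("Single points of failure:", ", ".join(single_points_of_failure(BLUE_LIGHT)))
```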

6.2. Failure mode, effects and criticality analysis

AS/NZS IEC 60812 Failure mode and effects analysis (FMEA and FMECA) describes FMECA as a bottom up analysis method that is used to understand failure modes at the unit level and their escalation effect, both at a local and a system wide level. This method requires the system design to be well defined down to unit level.

Each system is decomposed into its elements, usually down to line replaceable unit (LRU) level

where each element is then analysed uniquely to identify functional failures and relevant modes

of failure, and their escalated effect on the next higher level of the system.

This process is employed to identify those elements of a system which have a significant impact

on system reliability, availability and safety. This analysis is further used to promote mitigation

measures leading to improved system reliability and availability.

FMECA is typically used for high level analysis of system reliability through the following

process:

• identification of failure modes and consequences, and facilitation of design modifications

• assessment of failure causation, performance limits and vulnerability issues

• classification of failure modes relative to the severity of their effects

An output of the FMECA should be a reliability critical items list (RCIL). This is a list of items

which have at least one failure mode classified as critical according to its criticality analysis.

Consideration should also be given to common-mode failure where an event causes multiple

systems to fail. For example a transformer explosion in a substation may cause both

transformers (main and standby) in the room to fail simultaneously.


Appendix C provides an example of a FMECA used for the development of a train bogie

system.

Refer to T MU AM 01003 F1 Blank FMECA Sheet for further details.

6.3. Fault tree analysis

FTA should be used for highly complex safety or reliability critical systems.

FTA should be done during the initial stage of the project and updated as more details become

available during subsequent stages of the project.

AS IEC 61025 Fault tree analysis (FTA) describes FTA as a top down deductive failure analysis.

An undesired state of a system (top event) is analysed using Boolean logic to combine a series

of lower-level events and associated probabilities. This analysis method is used to determine

the probability of a safety accident or a particular system level (functional) failure.

The basic symbols used in FTA are grouped as events, gates, and transfer symbols. Event

symbols are used for primary events and intermediate events. Primary events are not further

developed at a lower level on the fault tree. Intermediate events are found at the output of a

gate. Events in a fault tree are associated with statistical probabilities. Gate symbols graphically

describe the mathematical relationship between input and output events. The gate symbols are

derived from Boolean logic symbols. Transfer symbols are used to connect the inputs and

outputs of related fault trees, such as the fault tree of a subsystem to its system.

FTA incorporates the following phases:

• definition of the undesired system top event to analyse

• obtaining an understanding of the system functional breakdown

• construction of the Boolean fault tree from top event down to base events

• assignment of failure rates to the base events

• evaluation of the fault tree (software-based simulation or spreadsheet analysis)

• control of the hazards identified

An example of where FTA would be used is for analysing the risk of failure of the water deluge

system (that forms part of a broader fire suppression system) failing to operate when required.

The contributory factors that lead to this system top event are provided in Appendix D.
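An added example (not from this guide): a small fault tree evaluator that propagates λ and MDT through OR and AND gates, using the top event, gate and basic-event names of the Appendix D deluge example. The gate types and the basic-event λ and MDT values are assumptions, chosen so the arithmetic reproduces the top-event annotations in Figure 5 (λ = 120, MDT 84 h, MTBF = 0.95 years, A = 0.99); that correspondence requires λ in failures per 10^6 hours.

```python
"""Fault tree evaluator: propagates failure rate (lambda) and mean down time
(MDT) through OR and AND gates with the usual rare-event approximations.

Added example, not part of T MU AM 06002 GU. Names follow the Appendix D
deluge tree; gate types and basic-event values are ASSUMED so that the
top event reproduces the figure's annotations (lambda = 120 per 10^6 h,
MDT 84 h, MTBF = 0.95 years, A = 0.99).
"""
from dataclasses import dataclass
from typing import List, Union

HOURS_PER_YEAR = 8760.0


@dataclass
class Event:
    """Basic event: constant failure rate lam (per 10^6 h) and mean down time mdt (h)."""
    name: str
    lam: float
    mdt: float


@dataclass
class Gate:
    name: str
    kind: str              # "OR" or "AND"
    inputs: List["Node"]


Node = Union[Event, Gate]


def evaluate(node: Node) -> Event:
    """Reduce a gate to an equivalent Event carrying (lam, mdt)."""
    if isinstance(node, Event):
        return node
    ins = [evaluate(i) for i in node.inputs]
    if node.kind == "OR":
        # Any input failing fails the output: rates add, MDT is rate-weighted.
        lam = sum(i.lam for i in ins)
        mdt = sum(i.lam * i.mdt for i in ins) / lam
    elif node.kind == "AND":
        # Output fails when one input fails while every other input is already
        # down: lam = sum_i lam_i * prod_{j != i} (lam_j * MDT_j), where
        # lam_j * MDT_j (lam in per-hour units) is the unavailability of input j.
        lam = 0.0
        for i, ev in enumerate(ins):
            term = ev.lam
            for j, other in enumerate(ins):
                if j != i:
                    term *= other.lam * 1e-6 * other.mdt
            lam += term
        # Joint outage lasts until the first input is repaired: harmonic combination.
        mdt = 1.0 / sum(1.0 / i.mdt for i in ins)
    else:
        raise ValueError(f"unknown gate kind {node.kind!r}")
    return Event(node.name, lam, mdt)


def mtbf_years(ev: Event) -> float:
    """MTBF = 1/lambda, converted from hours to years."""
    return 1e6 / ev.lam / HOURS_PER_YEAR


def availability(ev: Event) -> float:
    """Steady-state availability A = MTBF / (MTBF + MDT)."""
    mtbf_h = 1e6 / ev.lam
    return mtbf_h / (mtbf_h + ev.mdt)


# Assumed structure and constructed basic-event values (see module docstring).
DELUGE = Gate("TOP1 Fire Water Deluge Fails", "OR", [
    Event("PUMP Fire Pump", lam=60.0, mdt=24.0),
    Gate("GATE1 Motor Failures", "OR", [
        Event("MOTOR Fire Pump Motor", lam=50.0, mdt=168.0),
        Gate("GATE3 Power Failures", "AND", [
            Event("PSU Mains Power Supply", lam=100.0, mdt=24.0),
            Event("STANDBY Standby Generator", lam=500.0, mdt=168.0),
        ]),
    ]),
    Gate("GATE2 Detection Failures", "AND", [
        Event("DETECT UV fire detector", lam=10.0, mdt=24.0),
        Event("PANEL Fire Detection Panel", lam=5.0, mdt=168.0),
    ]),
])


if __name__ == "__main__":
    top = evaluate(DELUGE)
    print(f"{top.name}: lambda = {top.lam:.1f} per 10^6 h, MDT = {top.mdt:.0f} h")
    print(f"MTBF = {mtbf_years(top):.2f} years, A = {availability(top):.4f}")
```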

6.4. Human reliability analysis

T MU AM 06006 ST requires that projects consider human reliability factors as part of the overall reliability of the system.


The purpose of conducting human reliability analysis (HRA) is to ensure that the actual

operability performance of the system is in line with its designed requirements. Humans are an

integral part of designed systems, playing important roles in operation, accident prevention and

maintenance activities.

Operators and maintainers should be trained and competent; however ‘trained and competent

people’ is not a way of preventing human error. Human error is a normal part of human

performance, and should be appropriately assessed to create resilient systems. A trained and

competent human operator can still take erroneous decisions and actions based on stress,

fatigue, distraction and other human performance degradation factors. Early, appropriate HRA

is essential to ensure the exploration of the appropriate hierarchy of controls. Delayed or

ineffective assessments tend to create dependencies on administrative risk control which can

create latent system weaknesses.

Therefore, analysing and predicting the reliability of a system without assessing human

reliability may result in an over estimation of system performance.

Although there are many ways in which a human can positively impact on system performance,

the focus within a RAM assessment is usually to identify the following:

• human errors that may impact on the RAM of the system

• mitigation measures to reduce likelihood of human errors or to reduce impact of these

errors on the system

• minimum training and capability requirements

These measures can relate to the design of the equipment or the task, or may warrant

additional redundancy or diversity to be incorporated within the overall system design.

In order to be able to identify the errors that can be made and what their likely effect on the

performance of the system would be, it is necessary to identify the following:

• the tasks that are required to be carried out by operators and maintainers

• the likely conditions under which those tasks will be performed

• the potential errors that could be made

With many other aspects of the design in the early stages, information may be at a relatively

high level and should be used to identify those areas of the system where a more detailed

assessment is of most value.

There are a number of methods available for identifying human errors ranging from utilising past

experience through to the application of structured processes based on guidewords or

checklists. Human error should be built into existing system RAM analysis techniques such as

FMECA or FTA.


In those cases where a quantitative assessment is required, techniques for human error rate

prediction may be employed for evaluating the probability of a human error occurring and

impacting on system performance. This should then be incorporated into the system RAM and

performance models to assess the impact on the overall system performance.

Techniques to evaluate the probability of human errors fall into the following three general

categories:

• the use of screening data

• the use of historical or subjective data

• the use of human error databases

Examples of where a HRA would be used are for a ticketing system and a door release system

provided in Appendix E.

Refer to T MU HF 00001 GU Guide to Human Factors Integration – General Requirements for

more information on HRA.

6.4.1. Screening data

A single screening value for human error within a system model may be used in the early stages of an assessment. This enables an organisation to identify where the system is particularly vulnerable to human error and to review the design in terms of the level of redundancy or diversity that is currently built in, or to identify whether a more detailed assessment may be required.

6.4.2. Historical or subjective data

Actual performance data, if available, may be used as an estimation within reliability models. This data is normally only available at the system level and does not specifically highlight the human error contribution. However, it is estimated that approximately 70% - 90% of failures are due to human error and so it is possible to factor the data in this way to obtain a more reliable estimate.

Note: Manufacturer's data generally does not include human errors and so it will be

indicative of performance based on 100% reliability of people.

Alternatively, subjective data may be sought through consultation with users or their opinions

and may be used to modify existing data.


6.4.3. Human error databases

A number of techniques are used for quantitative human error assessments where it is possible to look up a generic human error probability and then modify it according to the specific task.

Commonly used examples include the following:

• Human error assessment and reduction technique (HEART)

The HEART method is based upon the principle that every time a task is performed there is a

possibility of failure and that the probability of failure is affected by one or more error

producing conditions to varying degrees. Error producing conditions include topics such as

training level and frequency, poor procedures, poor system feedback and so on.

Factors which have a significant effect on performance are of greatest interest. These

conditions are applied to a ‘best-case-scenario’ estimate of the failure probability under

ideal conditions to then obtain a final error probability. By forcing consideration of the error

producing conditions potentially affecting a given procedure, the application of HEART also

enables the user to identify a range of potential improvements to system performance.

An example of where HEART would be used is the assessment of a critical maintenance

task.

• Technique for human error rate prediction (THERP)

THERP models human error probabilities using an event tree approach, in a similar way to

an engineering risk assessment, but also considers performance shaping factors that may

influence these probabilities. The probabilities for the HRA event tree, which is the primary

tool for assessment, are nominally calculated from historic databases, local data including

simulated data or from accident reports. The resultant tree portrays a step by step account

of the stages involved in a task in a logical order. The technique is described as a total

human reliability assessment methodology as it simultaneously manages a number of

different activities including task analysis, error identification and human error

quantification.

6.5. Maintenance requirements analysis

Maintenance requirements analysis (MRA) is the inclusion of reliability, availability and safety integrity as a part of the maintenance requirements of the system.

MRA applies reliability theory principles within a structured process designed to identify effective

maintenance and inspection tasks that would detect or delay failures of equipment. This

ensures that maintenance requirements are incorporated into the design activities.


The following elements are inherent in MRA:

• identify the maintenance item

o Identify the items to be maintained at system, element, assembly, unit, component

level as part of the asset breakdown structure.

• establish the function

o Identify all functions associated with the maintenance item.

• establish failure modes and effects

o Identify and analyse all possible failures to or deviations from the specified

functionality associated with the maintenance item. Analyse their escalation effects

from component level to unit level to assembly level to subsystem level to system

level.

• recognise failure

o Identify the means by which each failure is detected and communicated to the

maintainer.

• identify maintenance task options

o Identify how each maintenance item should be repaired or replaced (both preventative

and corrective maintenance tasks).

• establish maintenance task intervals

o Identify a maintenance program which includes the schedule of inspection or

replacement for all maintenance items.

An example of where MRA would be used is for the development of a station escalator provided

in Appendix F.

Refer to T MU AM 01003 ST Development of Technical Maintenance Plans and

T MU AM 01010 ST Framework for Developing an Asset Spares Assessment and Strategy for

further details.

6.6. Failure recording analysis and corrective action system

A FRACAS should be applied from that point in the design cycle at which a version of the product or service approximating the final operational version becomes available until the product or service is decommissioned. The RAM plan should identify the existing FRACAS in place, if any.

The FRACAS is a closed loop process incorporating data reporting, collecting, recording,

analysing, investigating and timely corrective action for all failure incidents. The objective of the

system is to aid design, identify corrective action tasks and evaluate test results in order to


provide confidence in the results of the safety analysis activities in addition to the correct

operation of the safety features.

The effectiveness of FRACAS is dependent upon accurate input data in the form of reports

which should document all the conditions relating to the incident.

Incident reviews should be undertaken to ensure that the impact on the safety and reliability characteristics of the product or service is quickly assessed, and that any corrective actions requiring design changes are quickly approved.

The FRACAS process is outlined as follows and is illustrated in Figure 1:

• an incident report is raised and recorded in a database

• a data search is carried out for related events to determine if there is a growing trend in a

particular failure event type

• the incident is reviewed - if the incident is a new hazard it is recorded as such in the hazard

log

• information concerning the incident is communicated to those that need to know, in order to

control risk

• corrective actions are recommended, as necessary

• if no corrective action is required, the database is updated and the process ends

• the corrective action is authorised and implemented then assessed for success

• if the corrective action is unsuccessful, the incident is re-reviewed, corrective actions are

modified as required, details are updated in the database and the action returns for further

authorisation to proceed

• if the corrective action is successful, the database is updated and the process ends

An example of where FRACAS would be used is the development of a CPU motherboard

provided in Appendix G.


Figure 1 (flowchart not reproduced) steps: Incident raised and recorded → Search for related events → Review incident → Communicate information as necessary → Corrective action necessary? (No: Update database) → Authorise, implement and assess plan → Corrective action successful? (No: return to Review incident; Yes: Update database).

Figure 1 - FRACAS process


Appendix A Additional reference documents

The following documents, not cited in the text, provide additional information and guidance.

I.S. EN 50126 (all parts) Railway Applications - The Specification and Demonstration of

Reliability, Availability, Maintainability and Safety (RAMS)

EN 50128 Railway applications – Communication, signalling and processing systems –

Software for railway control and protection systems

AS ISO 55001 Asset management - Management systems - Requirements

AS IEC 62508 Guidance on human aspects of dependability

T MU MD 00009 SP AEO Authorisation Model

T MU AM 01002 MA Maintenance Requirements Analysis Manual

Department of Defense United States of America, 1987, MIL-HDBK-781A Handbook for Reliability Test Methods, Plans, and Environments for Engineering, Development, Qualification and Production

Department of Defense United States of America, 191995, MIL-STD-2155 Handbook Failure

Reporting, Analysis and Corrective Action Taken

Williams, J.C., HEART – A proposed method for achieving high reliability in process operation

by means of human factors engineering technology in Proceedings of a Symposium on the

Achievement of Reliability in Operating Plant, Safety and Reliability Society,1985, NEC,

Birmingham

Swain, A.D. and Guttmann, H.E., Handbook of Human Reliability Analysis with Emphasis on

Nuclear Power Plant Applications. 1983, NUREG/CR-1278, USNRC

Shappell, S.A. and Wiegmann, D.A., The Human Factors Analysis and Classification System—

HFACS, February 2000, DOT/FAA/AM-00/7

Stanton, N. A., Salmon P. M. et al, Human Factors Methods A practical guide for Engineering

and Design, 2nd Edition, 2013, Ashgate, Aldershot, ISBN 978-1-4094-5754-1


Appendix B Examples of reliability block diagrams

Figure 2 and Figure 3 provide examples of RBDs for station public announcement and station blue light emergency display, respectively.

Figure 2 (diagram not reproduced): RBD blocks for the station public announcement system, including Loudspeakers, STM 64 Port P.MUX AMD II, Matrix Enhanced, Network Fibre, MP50 Call Station, Amplifier Module, V400 Amplifier Mainframe, VIPET, P1 Ethernet Switch, PCAS Workstation, VIPA HOSTVAR 4, Service board and Switch OS6450-24, each block annotated with its MTBF (hours) and MTTR (hours).

Figure 2 - RBD station public announcement (sample fragment)

Note: In Figure 2 mean time between failures is expressed as MTBF and mean time to repair is expressed as MTTR.


Figure 3 (diagram not reproduced): RBD blocks for the station blue light emergency display, including Emergency Push Button with Key Reset, Relays, 240V AC UPS, Blue Light Display and two chains of Input Relay, Comms Module, Output Relay and Alarm Module, each block annotated with its MTBF (50,000 H to 600,000 H) and MTTR (1 to 4 hours).

Figure 3 - RBD station blue light emergency display (sample fragment)

Note: In Figure 3 mean time between failures is expressed as MTBF and mean time to repair is expressed as MTTR.


Appendix C Example of FMECA table - Bogie assembly

Figure 4 provides an example of a FMECA table for a bogie assembly.

Figure 4 - FMECA table (sample fragment)


Appendix D Examples of fault tree analysis

Figure 5 shows an example of a FTA for failure of a fire water deluge system.

Figure 5 (diagram not reproduced): fault tree with top event TOP1 Fire Water Deluge Fails; gates GATE1 Motor Failures, GATE2 Detection Failures and GATE3 Power Failures; and basic events PUMP Fire Pump, MOTOR Fire Pump Motor, DETECT UV fire detector, PANEL Fire Detection Panel, PSU Mains Power Supply and STANDBY Standby Generator. Each node is annotated with its failure rate λ and mean down time MDT.

Fire suppression system example:

• Failure Rate, λ = 1/MTBF

• MDT = Mean Down Time (hrs)

The figure's summary annotation reads: λ = 120 ∴ MTBF = 0.95 years; MDT 84h ∴ A = 0.99.

Figure 5 – Fault tree analysis diagram


Appendix E Examples of human error analysis

Table 1 and Table 2 provide examples of human error analysis for ticketing system and doors release system, respectively.

Table 1 – Example of human error analysis for ticketing system

| Task | Error | Mitigation |
| --- | --- | --- |
| Select ticket type at ticket vending machine | Incorrect ticket type selected | • Machine buttons labelled with the various ticket types • Machine visual display showing ticket type selected |
| Select destination at ticket vending machine | Incorrect destination selected | • Machine buttons labelled with the various destinations • Machine visual display showing destinations |
| Enter coins into ticket vending machine | Coins inserted into the notes reader | • Ticket vending machine has coin slot labelled |
| Enter notes into ticket vending machine | Notes inserted into the coins slot | • Ticket vending machine has the notes reader labelled |
| Enter notes into ticket vending machine | Notes inserted upside down or back to front | • Machine labelled with a diagram showing the correct note orientations |
| Transport the ticket | Ticket bent in transit | • ‘Do not bend this ticket’ marked on the ticket • Ticket made from flexible plastic to avoid damage • Ticket size allows ticket to be placed in a wallet or purse |
| Transport the ticket | Ticket dropped or crushed | |
| Insert ticket into ticket reader | Ticket inserted upside down | • ‘Travel Card’ marked on upside of the ticket |
| Insert ticket into ticket reader | Ticket inserted back to front | • Direction arrow marked on the upside of the ticket |


Table 2 – Example of human error analysis for doors release system

Task: Locate doors release button
Error: Button not located
Mitigation: • Button labelled with doors release • Button illuminated with green lights

Task: Press doors release button
Error: Button not pressed
Mitigation: • Button labelled with doors release • Button illuminated with green lights

Task: Travel on train
Error: Button pressed accidentally
Mitigation: • Button recessed to avoid accidental presses • Button must be pressed for 3 seconds to activate

Task: Travel on train
Error: Button obscured by passengers
Mitigation: • Door labelled requesting passengers to stand clear

Task: Travel on train
Error: Button damaged by passengers
Mitigation: • Button recessed to avoid accidental contact • Button made from material that can withstand high impacts


Appendix F Example of MRA - station escalator

Table 3 provides an example of an MRA for a station escalator.

Table 3 – Example of MRA for station escalator

Maintenance item: Platforms
Functions associated: Entry and exit access
Possible failure modes: Unable to provide entry or exit access
Effect: Platform blocked to passengers
Escalation effect: Passengers unable to use the escalator
Failure recognition: • Visual inspections for damage • Testing
Maintenance task options: • Replace platform panels
Maintenance task intervals: • Daily cleaning, inspection and testing • 6 monthly service inspection and testing

Maintenance item: Steps
Functions associated: Support standing or walking passengers
Possible failure modes: Unable to support passengers
Effect: Steps not safe for passengers
Escalation effect: Passengers unable to use the escalator
Failure recognition: • Visual inspections for damage • Testing
Maintenance task options: • Replace steps
Maintenance task intervals: • Daily cleaning, inspection and testing • 6 monthly service inspection and testing

Maintenance item: Tracks
Functions associated: Provides running surface for the steps
Possible failure modes: Unable to provide running surface for the steps
Effect: Steps unable to move
Escalation effect: Passengers unable to ride on escalator
Failure recognition: • Testing
Maintenance task options: • Lubricate tracks • Replace tracks
Maintenance task intervals: • Daily testing • 6 monthly service inspection and testing

Maintenance item: Tracks
Functions associated: Provides running surface for the handrails
Possible failure modes: Unable to provide running surface for the handrails
Effect: Handrails unable to move
Escalation effect: Passengers unable to use handrail for support
Failure recognition: • Visual inspections for damage • Testing
Maintenance task options: • Lubricate tracks • Replace tracks
Maintenance task intervals: • Daily testing • 6 monthly service inspection and testing

Maintenance item: Drive gears
Functions associated: Provides coupling and speed conversion of the motor to the steps
Possible failure modes: Unable to provide coupling and speed conversion of the motor to the steps
Effect: No or slow movement of steps
Escalation effect: Passengers unable to ride on escalator
Failure recognition: • Testing
Maintenance task options: • Lubricate gears • Replace gears
Maintenance task intervals: • Daily testing • 6 monthly service inspection and testing


Maintenance item: Drive gears
Functions associated: Provides coupling and speed conversion of the motor to handrails
Possible failure modes: Unable to provide coupling and speed conversion of the motor to handrails
Effect: No or slow movement of handrails
Escalation effect: Passengers unable to use handrail for support
Failure recognition: • Testing
Maintenance task options: • Lubricate gears • Replace gears
Maintenance task intervals: • Daily testing • 6 monthly service inspection and testing

Maintenance item: Hand rails
Functions associated: Provides support and stability to passengers
Possible failure modes: Handrails unable to support passengers
Effect: No handrails
Escalation effect: Passengers unable to use handrail for support
Failure recognition: • Visual inspections for damage • Testing
Maintenance task options: • Replace handrails
Maintenance task intervals: • Daily cleaning, inspection and testing • 6 monthly service inspection and testing

Maintenance item: Motors
Functions associated: Provides driving force for handrails and steps
Possible failure modes: Unable to drive the handrails or steps
Effect: No or slow movement of steps or handrails
Escalation effect: Passengers unable to ride on escalator
Failure recognition: • Testing
Maintenance task options: • Replace motors
Maintenance task intervals: • Daily testing • 6 monthly service inspection and testing

Maintenance item: Control system
Functions associated: Regulates speed of steps and handrails
Possible failure modes: Unable to drive the handrails or steps
Effect: No or slow movement of steps or handrails
Escalation effect: Passengers unable to ride on escalator
Failure recognition: • Testing
Maintenance task options: • Replace motors
Maintenance task intervals: • Daily testing • 6 monthly service inspection and testing

Maintenance item: Emergency stop system
Functions associated: Halts movement of steps and handrails in an emergency situation
Possible failure modes: Unable to halt steps and handrails
Effect: Steps and handrail movement
Escalation effect: Passenger injuries
Failure recognition: • Testing
Maintenance task options: • Replace components
Maintenance task intervals: • Daily testing • 6 monthly service inspection and testing

Maintenance item: Glass screens
Functions associated: Protects passengers from moving components
Possible failure modes: Unable to protect passengers from moving components
Effect: Exposed moving parts
Escalation effect: Passenger injuries
Failure recognition: • Visual inspections for damage
Maintenance task options: • Replace components
Maintenance task intervals: • Daily inspection

Maintenance item: Glass screens
Functions associated: Protects passengers from falling
Possible failure modes: Unable to protect passengers from falling
Effect: Passengers fall off steps
Escalation effect: Passenger injuries
Failure recognition: • Visual inspections for damage
Maintenance task options: • Replace components
Maintenance task intervals: • Daily inspection


Appendix G Example of FRACAS incident report – CPU motherboard

Figure 6 provides an example of a FRACAS incident report for a CPU motherboard.

Figure 6 - Example of FRACAS incident report for CPU motherboard