
PRINCIPLES OF SYSTEM SAFETY

ENGINEERING AND MANAGEMENT

Felix Redmill
Redmill Consultancy, London

[email protected]

(c) Felix Redmill, 2011 2

RISK

CERN, May '11

(c) Felix Redmill, 2011 3

SAFETY ENGINEERING AND MANAGEMENT

It is necessary both to achieve appropriate safety and to demonstrate that it has been achieved

• Achieve - not only in design and development, but in all stages of a system’s life cycle

• Appropriate - to the system and the circumstances

• Demonstrate - that all that could reasonably have been done has been done, at every stage of the life cycle

CERN, May '11

(c) Felix Redmill, 2011 4

THE U.K. LAW ON SAFETY

Health and Safety at Work Etc. Act 1974:

Safety risks imposed on others (employees and

the public at large) must be reduced ‘so far as is

reasonably practicable’ (SFAIRP)

CERN, May '11

(c) Felix Redmill, 2011 5

THE HSE’S ALARP PRINCIPLE

(Diagram: three regions on a scale of increasing risk, from top to bottom)

Unacceptable Region (Risk cannot be justified except in extraordinary circumstances)

--- Limit of tolerability threshold ---

ALARP or Tolerability Region (Risk is tolerable only if its reduction is impracticable or if the cost of reduction is grossly disproportionate to the improvement gained)

--- Broadly acceptable threshold ---

Broadly Acceptable Region (Risk is tolerable without reduction, but it is necessary to maintain assurance that it remains at this level)

CERN, May '11

(c) Felix Redmill, 2011 6

THE HSE’S ALARP PRINCIPLE

Intolerable region: risk cannot be justified except in extraordinary circumstances.

The ALARP or tolerability region: risk is tolerable only if further reduction is impracticable or if its cost is grossly disproportionate to the improvement gained.

Broadly acceptable region: it is necessary to maintain assurance that risk remains at this level.

CERN, May '11

(c) Felix Redmill, 2011 7

CALIBRATION OF THE ALARP MODEL (Recommended for the Nuclear Industry)

• Intolerability threshold:

– 1/10000 per year (for the public)

– 1/1000 per year (for employees)

• Broadly acceptable threshold:

– 1/1000000 per year (for everyone)
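A minimal sketch of how these calibration figures might be applied in practice; the thresholds are the ones quoted above, but the function and the example risk values are illustrative only:

```python
# Illustrative only: classify an individual risk of fatality (per year)
# against the ALARP calibration recommended for the nuclear industry.
INTOLERABLE_PUBLIC = 1e-4     # 1/10,000 per year (members of the public)
INTOLERABLE_EMPLOYEE = 1e-3   # 1/1,000 per year (employees)
BROADLY_ACCEPTABLE = 1e-6     # 1/1,000,000 per year (everyone)

def alarp_region(risk_per_year, employee=False):
    """Return the ALARP region in which an individual risk per year falls."""
    limit = INTOLERABLE_EMPLOYEE if employee else INTOLERABLE_PUBLIC
    if risk_per_year >= limit:
        return "unacceptable"
    if risk_per_year <= BROADLY_ACCEPTABLE:
        return "broadly acceptable"
    return "tolerable (ALARP): reduce unless cost is grossly disproportionate"

print(alarp_region(3e-5))         # tolerable (ALARP) for a member of the public
print(alarp_region(3e-4, True))   # tolerable (ALARP) for an employee
print(alarp_region(5e-7))         # broadly acceptable
```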

CERN, May '11

(c) Felix Redmill, 2011 8

A VERY SIMPLE SYSTEM

• Chemicals A and B are mixed in the tank to form product P

• The controller opens and closes the input and output valves
• If an emergency signal arrives, operation ceases

(Diagram: fluids A and B, product P, the controller, and the emergency signal)

CERN, May '11

(c) Felix Redmill, 2011 9

QUESTIONS

• How could the accident have been avoided?

– Better algorithm

• How could the software designer have known that a better algorithm was required?

– Domain knowledge

• But we can’t be sure that such a fault won’t be made, so how can we find and correct such faults?

– Risk analysis techniques

CERN, May '11

(c) Felix Redmill, 2011 10

A SIMPLE THOUGHT EXPERIMENT

• Draw an infusion pump

– Note its mechanical parts

• Think about designing and developing software to control its operation

• Safety consideration: delivery of either too much or too little of a drug could be fatal

• How would you guarantee that it would kill no more than 1 in 10,000 patients per year?

– What level of confidence - on a percentage scale - do you have in your estimate?

CERN, May '11

(c) Felix Redmill, 2011 11

CONFIDENCE IN SAFETY IN ADVANCE

• What is safety?

• How can it be measured?

• What can give confidence that safety is high?

• Need also to demonstrate safety in advance

• Therefore, need to find hazards in advance

CERN, May '11

(c) Felix Redmill, 2011 12

NEW RISKS OF NEW TECHNOLOGY - STRASBOURG AIR CRASH

• One mode: climbs and descents expressed in degrees to the horizontal

– A 3.3 degree descent is represented as ‘minus 3.3’

• Another mode: climbs and descents expressed in units of 100 feet per minute

– A 3300 feet/minute descent is represented as ‘minus 33’

• The plane was descending at 3300 feet/minute

• It needed to descend at an angle of 3.3 degrees to the horizontal

• To interpret correctly, pilot needed to know the mode

• ‘Mode error’ in the system

• Human error? If so, which human made it?

CERN, May '11

(c) Felix Redmill, 2011 13

CONCERN WITH TECHNOLOGICAL RISKS

• We are concerned no longer exclusively with making

nature useful, or with releasing mankind from

traditional constraints, but also and essentially with

problems resulting from techno-economic development

itself [Beck]

• Risks induce often irreversible harm; not restricted to

their places of origin but threaten all forms of life in all

parts of the planet; in some cases could affect those

not alive at the time or place of an accident [Beck]

• People who build, design, plan, execute, sell and

maintain complex systems do not know how they may

work [Coates]

CERN, May '11

(c) Felix Redmill, 2011 14

RISK - AN IMPORTANT SUBJECT

• Risk is a subject of research in many fields,

e.g. Psychology, Sociology, Anthropology, Engineering

• It is a critical element in many fields

e.g. Geography (climate change), Agriculture, Sport, Leisure activities, Transport, Government policy

• It influences government policy (e.g. on bird flu) and local government decisions (e.g. closure of children’s playgrounds)

• An influential sociological theory holds that risk is the defining factor of modern society

(we are in ‘the risk society’)

• Every activity carries risk

• All decisions are ‘risky’

CERN, May '11

(c) Felix Redmill, 2011 15

FUNCTIONAL SAFETY: ACHIEVING UTILITY AND CREATING RISK

We are concerned with safety that depends on the correct functioning of equipment

(Diagram: a control system and the equipment under control together deliver utility, plus risks)

CERN, May '11

(c) Felix Redmill, 2011 16

FUNCTIONAL SAFETY 2

Functional safety depends on hardware, software, humans, data, and interactions between all of these

(Diagram: a control system, the equipment under control, humans (design, operation, maintenance, etc.: human factors), and the environment (management, culture, etc.) together deliver utility, plus risks)

CERN, May '11

(c) Felix Redmill, 2011 17

ORGANISATIONS ARE ALSO COMPLEX (Vincristine example)

• Child received injection for leukaemia into spine instead of into vein

• Root cause analysis showed 40 points at which the accident could have been averted

• Complexity of modern organisational systems
• Need to identify risks in advance
  – But no standard can do this for us
• Requires a risk-based approach

CERN, May '11

(c) Felix Redmill, 2011 18

SAFETY - A DEFINITION

• Safety: freedom from unacceptable risk (IEC)

• Safety is not directly measurable

– But it may be addressed via risk

CERN, May '11

(c) Felix Redmill, 2011 19

VOCABULARY

• In common usage, “Risk” is used to imply:

– Likelihood (e.g. there’s a high risk of infection)

– Consequence (e.g. infection carries a high risk)

– A combination of the two

– Something, perhaps unspecified, to be avoided (e.g. going out into the night is risky)

CERN, May '11

(c) Felix Redmill, 2011 20

RISK - DEFINITIONS

• Risk: a combination of the probability of occurrence of an event and the severity of its consequences if the event did occur (IEC)

• Tolerable risk: a willingness to live with a risk, so as to

secure certain benefits, in the confidence that the risk

is one that is worth taking and that it is being properly

controlled (HSE)

CERN, May '11

(c) Felix Redmill, 2011 21

SAFETY AND RISK

• Safety is only measurable in retrospect

• Safety is gauged by trying to understand risk

• Risk, being of the future, is estimable but not measurable

• The higher the risk, the lower the confidence in safety

• We increase safety by reducing risk

• Two components of risk are significant:

– Likelihood of occurrence

– Magnitude of outcome

CERN, May '11

(c) Felix Redmill, 2011 22

SOME PRINCIPLES

• Absolute safety (zero risk) cannot be achieved

• ‘Doing it well’ does not guarantee safety

– Correct functionality ≠ safety

– We must address safety as well as functionality

• Reliability is not a guarantee of safety

• We require confidence of safety in advance, not retrospectively

• We must not only achieve safety but also demonstrate it

CERN, May '11

(c) Felix Redmill, 2011 23

A RISK-BASED APPROACH

• If safety is addressed via risk

– We must base our safety-management actions on risk

• We must understand the risks in order to manage safety

CERN, May '11

(c) Felix Redmill, 2011 24

SAFETY AS A WAY OF THINKING

• If risk is too high, it must be reduced

• But how do we know how high it is?

– Carry out risk analysis

• But even if the risk is high, does it need to be

reduced?

– We must understand what risk is tolerable in

the circumstances (in the UK, apply the ALARP

Principle)

• Achieving safety demands a combination of

engineering and management approaches

CERN, May '11

(c) Felix Redmill, 2011 25

RISK - TWO COMPONENTS

• Two components:

– Probability (likelihood) of occurrence

– Consequence (magnitude of outcome)

• R = f(P, C) or f(L, C)

CERN, May '11

(c) Felix Redmill, 2011 26

A SIMPLE CALCULATION

Probability of a 100-year flood = 0.01/year

Expected damage = £50M

R (financial Expected Value) = 0.01 × £50,000,000 = £500,000 per year
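The same arithmetic, written out as a tiny sketch (the figures are those on the slide):

```python
# Expected-value calculation for the flood example on this slide.
p_flood_per_year = 0.01        # probability of a 100-year flood in any one year
expected_damage = 50_000_000   # expected damage in pounds if the flood occurs

expected_loss = p_flood_per_year * expected_damage
print(f"Expected financial risk: £{expected_loss:,.0f} per year")  # £500,000 per year
```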

CERN, May '11

(c) Felix Redmill, 2011 27

CONFIDENCE IN RISK VALUES

Accuracy of the result depends on the reliability of information

Reliability of information depends on the pedigree of its source

• What are the sources of the probability and consequence values?

• What are their pedigrees?

• What confidence do we have in the risk values?

• Why do we need confidence?

– Because we derive risk values in order to inform decisions

CERN, May '11

(c) Felix Redmill, 2011 28

CONFIDENCE IN RISK CALCULATIONS

• Where did the information come from?

– What trust do we have in the source?

– When and by whom was it collected?

– Was it ever valid? If so, is it still valid?

• What assumptions have we made?

– How valid are they?

– Are we aware of them?

CERN, May '11

(c) Felix Redmill, 2011 29

COT DEATH CASE

• A study concluded that the risk of a mature, non-smoking, affluent couple suffering a cot death is 1 in 8,543

• Prof. Meadow deduced that the probability of two cot deaths in the same family = 1 in (8,543 × 8,543), i.e. about 1 in 73 million

• Two cot deaths in her family resulted in Mrs Clark being convicted of infanticide

CERN, May '11

(c) Felix Redmill, 2011 30

DEPENDENCE AND NOT RANDOMNESS

• But deaths in the same family are not

independent events

(multiplying the probabilities assumes independence)

• One death in the family rendered Mrs Clark “in the

highest-risk category” for another

– The probability of a second death is considerably greater than 1 in 8,543
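A hedged numerical sketch of the point: the 1-in-8,543 figure is the one quoted in the study, while the dependence factor below is purely illustrative and not an estimate from the case.

```python
# The flawed reasoning multiplied the single-death probability by itself,
# which is only valid if the two deaths are independent events.
p_first = 1 / 8543                       # figure quoted in the study

p_both_if_independent = p_first ** 2
print(f"Assuming independence: about 1 in {1 / p_both_if_independent:,.0f}")  # ~1 in 73 million

# If one death places the family in a much higher-risk category, the
# conditional probability of a second death is far larger. The factor
# of 100 used here is illustrative only.
p_second_given_first = 100 * p_first
p_both_with_dependence = p_first * p_second_given_first
print(f"Allowing for dependence: about 1 in {1 / p_both_with_dependence:,.0f}")  # ~1 in 730,000
```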

CERN, May '11

(c) Felix Redmill, 2011 31

COMMON-MODE FAILURES

• Identifying common-mode failures is a crucial

part of traditional risk analysis

CERN, May '11

(c) Felix Redmill, 2011 32

ASSUMPTIONS

• We never have full knowledge

• We fill the gaps with assumptions

• Assumptions carry uncertainty, and therefore

risk

• Recognise your assumptions (if possible)

• If you admit to them (document them)

– Readers will recognise uncertainty

– Other persons may provide closer

approximations

CERN, May '11

(c) Felix Redmill, 2011 33

CONTROL OF RISK

• Risk is eliminated if either Pr or C is reduced to zero

• And risk is reduced by reduction of Pr or C or both

• In many cases we have no control over C, but we may be able to estimate it

• It may be difficult or impossible to derive Probability, but we may be able to reduce it (e.g. by software testing & fixing)

CERN, May '11

(c) Felix Redmill, 2011 34

WHY TAKE RISKS?

• Progress demands risk

– e.g. exploration, bridge design

– e.g. drug development (TeGenero’s TGN1412)

• Technology provides utility - at the cost of risk

• Suppliers want cheap designs

– e.g. Ronan Point (need to foresee hazards)

• To save money

– e.g. Cutting corners on a project

• To save face

– E.g. “I’d rather die than make a fool of myself”

• We can’t avoid taking risks in every decision and action

CERN, May '11

(c) Felix Redmill, 2011 35

RISK VS. RISK

• Decision is often not between risk and no risk

– But between one risk and another

– Decisions may create risks (unintended consequences)

• Surgery, medication, or keep the illness

• Consequences of being late vs. risks of driving fast

• Solar haze vs. global warming

• Software maintenance can introduce a new defect

– Change creates a new product; test results obsolete

• Recognise both risks

• Assess both

• Carry out impact analysis to identify new hazards

CERN, May '11

(c) Felix Redmill, 2011 36

SOME NOTES ON RISK

• A single-system risk may be tiny, but the overall risk of many systems may be high (individual vs. societal risk)

• Risk per usage may be remote, but daily use may accrue a high risk (one-shot vs. lifetime risk)

• Beware of focusing on a single risk; a system is likely to carry numerous risks

• Confidence in probabilities attributed to rare events must be low

• For random events with histories (e.g. electromechanical equipment failure) frequencies may be deduced

– Given similar circumstances, they may be predictive

• For systematic events (such as software failure) history is not an accurate predictor of the future

CERN, May '11

(c) Felix Redmill, 2011 37

RISK AND UNCERTAINTY

• Risk is of the future - there is always uncertainty

• Risk may be estimated but not measured

• Risk is open to perception

• Identifying and analysing risks requires information

• The quality of information depends on the source

• 'Facts' are open to different interpretations

• Risk cannot be eliminated if goals are to be achieved

• Risk management activities may introduce new risks

• In the absence of certainty, we need to increase confidence

• Reducing uncertainty depends on gathering information

CERN, May '11

(c) Felix Redmill, 2011 38

RISK MANAGEMENT STRATEGIES

• Avoid
• Eliminate
• Reduce
• Minimise - within defined constraints
• Transfer or share
  - Financial risks may be insured
  - Technical risks may be transferred to experts, maintainers
• Hedge
  - Make a second investment, which is likely to succeed if the first fails
• Accept
  - Must do this in the end, when risks are deemed tolerable
  - Need contingency plans to reduce consequences

CERN, May '11

(c) Felix Redmill, 2011 39

RISK IS OPEN TO PERCEPTION

• Voluntary or involuntary
• Control in hands of self or another
• Statistical or personal
• Level of knowledge, uncertainty
• Level of dread or fear evoked
• Short-term or long-term view
• Severity of outcome
• Value of the prize
• Level of excitement
• Status quo bias

Perception determines where we look for risks and what we identify as risks

CERN, May '11

(c) Felix Redmill, 2011 40

PROGRAMME FOR COMBATTING DISEASE (1)

• The country is preparing for the outbreak of a disease which is expected to kill 600 people. Two alternative programmes to combat it have been proposed, and the experts have deduced that the precise estimates of the outcomes are:

• If programme A is adopted, 200 people will be saved

• If programme B is adopted, there is a one third probability that 600 people will be saved and a two thirds probability that nobody will be saved

CERN, May '11

(c) Felix Redmill, 2011 41

PROGRAMME FOR COMBATTING DISEASE (2)

• The country is preparing for the outbreak of a disease which is expected to kill 600 people. Two alternative programmes to combat it have been proposed, and the experts have deduced that the precise estimates of the outcomes are:

• If programme C is adopted, 400 people will die

• If programme D is adopted, there is a one third probability that nobody will die and a two thirds probability that 600 people will die

CERN, May '11

(c) Felix Redmill, 2011 42

UNINTENDED CONSEQUENCES

• Actions always likely to have some unintended results

– Downsizing but losing the wrong staff

e.g. University redundancies in UK

– Staff come to rely entirely on a support system

– Building high in mountain causes landslide

– Abolishing DDT increased malaria

– Store less chemical at plant, more trips by road

• Unintended, but not necessarily unforeseeable

• Foreseeable, but not necessarily obvious

• Carry out hazard and risk analysis

– Better to foresee them than encounter them later

CERN, May '11

(c) Felix Redmill, 2011 43

RISK COMMUNICATION

• The risk analyst is usually not the decision maker
  – Risk information must be transferred (liver biopsy)
• Usually only results are communicated
  – e.g. There’s an 80% chance of success
• But are they correct (Bristol Royal Infirmary)?
  – Overconfidence bias
• Managers rely heavily on risk information collected, analysed, and packaged by other staff
• But what confidence do the staff have in it?
  – What were the information sources, analysis methods, and framing choices?
  – What uncertainties exist?
  – What assumptions were made?

CERN, May '11

(c) Felix Redmill, 2011 44

COMMUNICATION OF RISK INFORMATION

• ‘The risk is one in a million’

• One in seventeen thousand of dying in a road accident

– Age, route, time of day

• Framing

• Appropriate presentation of numbers

– 1 × 10⁻⁶

– One in a million

– One in a city the size of Birmingham

– (A launch per day for 300 years with only one failure)

CERN, May '11

(c) Felix Redmill, 2011 45

RISK IS TRICKY

• We manage risks daily, with reasonable success

• Our risk management is intuitive

– We do not recognise mishaps as the results of poor risk management

• Risk is a familiar subject

• We do not realise what we don’t know

• We do not realise that our intuitive techniques for managing simple situations are not adequate for complex ones

• Risk estimating can be simple arithmetic

• The difficult part is obtaining the correct information on which to base estimates - and decisions

CERN, May '11

(c) Felix Redmill, 2011 46

SAFETY ENGINEERING PRINCIPLES

CERN, May '11

CERN, May '11 47

A SIMPLE MODEL OF SYSTEM SAFETY

(Diagram: from a safe state (in any mode), a failure or unsafe deviation leads to danger; from danger, restoration returns the system to the safe state, or an accident leads to disaster, which is followed by recovery)

(c) Felix Redmill, 2011

(c) Felix Redmill, 2011 48

SAFETY ACROSS THE LIFE CYCLE

• Safety activities must extend across the life of a system

– And must be planned accordingly

• Modern safety standards call for the definition and use of a safety life cycle

• A life-cycle model shows all the phases of a system’s life, relative to each other

• The ‘overall safety lifecycle’ defined in safety standard IEC 61508 is the best known model

CERN, May '11

(c) Felix Redmill, 2011 49

SAFETY LIFE CYCLE MODELS

• Provide a framework for planning safety engineering and management activities

• Provide a guide for creating an infrastructure for the management of project and system documentation

• Remind engineers and managers at each phase that they need to take other phases into consideration in their planning and activities

• Facilitate the demonstration as well as the achievement of safety

• The model in safety standard IEC 61508 is the best known

CERN, May '11

(c) Felix Redmill, 2011 50

OVERALL SAFETY LIFECYCLE

1. Concept
2. Overall scope definition
3. Hazard and risk analysis
4. Overall safety requirements
5. Safety requirements allocation
6. Overall planning of operation and maintenance
7. Overall planning of safety validation
8. Overall planning of installation and commissioning
9. Realisation of safety-related E/E/PES
10. Realisation of other-technology safety-related systems
11. Realisation of external risk reduction facilities
12. Overall installation and commissioning
13. Overall safety validation
14. Overall operation, maintenance and repair
15. Overall modification and retrofit
16. Decommissioning or disposal

CERN, May '11

(c) Felix Redmill, 2011 51

AFTER STAGE THREE

Work done after stage 3 of the model creates

additions to the overall system that were not

included in the hazard and risk analysis of

stage 3

CERN, May '11

SAFETY LIFECYCLE SUMMARY

Understand the functional goals and design

Identify the hazards

Analyse the hazards

Determine and assess the risks posed by the hazards

Specify the risk reduction measures and their SILs

Define the required safety functions (and their SILs)

Carry out safety validation

Operate, maintain, change, decommission, dispose safely

(c) Felix Redmill, 2011 52CERN, May '11

(c) Felix Redmill, 2011 53

THE SAFETY-CASE PRINCIPLE

• The achievement of safety must be demonstrated in advance of deployment of the system

• Demonstration is assessed by independent safety assessors

• Safety assessors will not (cannot)

– Examine a complex system without guidance

– Assume responsibility for the system’s safety

• Their assessment is guided by claims made by developers, owners and operators of the system

• The claims must be for the adequacy of safety in defined circumstances, considering the context and application and the benefits of accepting any residual risks

CERN, May '11

(c) Felix Redmill, 2011 54

THE SAFETY CASE

• The purpose: Demonstration of adequate and appropriate safety of a system, under defined conditions

– To convince ourselves of adequate safety

– The basis of independent safety assessment

– Provides later requirements for proof

(It can protect and it can incriminate)

• The means

– Make claims of what is adequately safe

o And for what it is adequately safe

– Present arguments for why it is adequately safe for the intended purpose

– Provide evidence in support of the arguments

CERN, May '11

(c) Felix Redmill, 2011 55

GENERALIZED EXAMPLE

• Claim: System S is acceptably safe when used in Application A

• Claims are presented as structured arguments

– The use of System S in Application A was subjected to thorough hazard identification

– All the identified hazards were analysed

– All risks associated with the hazards either were found to be tolerable or have been reduced to tolerable levels

– Emergency plans are in place in case of unexpected hazardous events

• The safety arguments are supported by evidence

– e.g., Evidence of all activities and results, descriptions of all relevant documentation, and references to it

CERN, May '11

(c) Felix Redmill, 2011 56

THE CASE MUST BE STRUCTURED

• Demonstration of adequate safety of a system requires demonstration of justified confidence in its components

• Some components may be COTS (commercial off-the-shelf)

– Don’t wait until too late to find that they cannot be justified

• Evidence is derived from different sources, and different categories of sources

• The evidence for each claim must be structured so as to support a logical argument for the “top-level” claim

CERN, May '11

(c) Felix Redmill, 2011 57

PRESENTATION OF SAFETY CLAIMS

• The claims must be structured so that the logic of the overall case is demonstrated

• The principal claims may be presented in a “top-level” document

– With references out to the sources of supporting evidence (gathered throughout the life cycle)

• Modern software-based tools are available for the documentation of safety cases (primarily using GSN, the Goal Structuring Notation)

CERN, May '11

CERN, May '11 58

THE NATURE OF EVIDENCE

• Evidence may need to be probabilistic

– e.g. for nuclear plants

• It may be qualitative

– e.g. for humans, software

• In applications of human control it may need to include

– Competence, training, management, guidelines

• It is collected throughout the system’s life

– Importantly during development

(c) Felix Redmill, 2011

CERN, May '11 59

EVIDENCE PLANNING - 1

• Any system-related documentation (project, operational, maintenance) may be required as evidence

• It should be easily and quickly accessible to

– Creators of safety arguments

– Independent safety assessors

• Numbering and filing systems should be designed appropriately

• Principal safety claims should be identified in advance

– Activities may need to be planned so that appropriate evidence is collected and stored

• The safety case should be developed throughout a development project

– And maintained throughout a system’s life

(c) Felix Redmill, 2011

CERN, May '11 60

EVIDENCE PLANNING - 2

• The safety case structure should be designed early

• Evidential requirements should be identified early and planned for

• Safety case development should be commenced early

• Evidence of software adequacy may be derived from

– Analysis

– Testing

– Proven-in-use

– Process

(c) Felix Redmill, 2011

(c) Felix Redmill, 2011 61

VALIDITY OF A SAFETY CASE

Validity (and continued validity) depends on (examples only)

• Observance of design and operational constraints, e.g.:

– Use only specified components

– Don’t exceed maximum speed

• Environment, e.g.:

– Hardware: atmospheric temperature between T1 and T2 °C

– Software: operating system remains unchanged

• Assumptions remain valid, e.g.:

– Those underlying estimation of occurrence probability

– Routine maintenance carried out according to spec.

CERN, May '11

(c) Felix Redmill, 2011 62

HAZARD AND RISK ANALYSIS

CERN, May '11

(c) Felix Redmill, 2011 63

NEED FOR CLARITY

• Is a software bug a risk?

• Is a banana skin on the ground a risk?

• Is a bunker on a golf course a risk?

CERN, May '11

(c) Felix Redmill, 2011 64

THE CONCEPT OF A HAZARD

• A hazard is the source of risk

– The risks that may arise depend on context

• What outcomes could result from the hazard?

• What consequences might the outcomes lead to?

• Hazards form the basis of risk estimation

CERN, May '11

(c) Felix Redmill, 2011 65

GOLF-SHOT RISK

• What is the probability of hitting your golf ball into a bunker?

• What is the consequence (potential consequence) of hitting your ball into a bunker?

• Should I “take the risk”?

– In this case: What level of risk should I take?

• What’s missing is a question about benefit!

CERN, May '11

WHAT IS RISK ANALYSIS?

• If we take risk to be a function of Probability (Pr) and Consequence (C)
  – Then, in order to estimate the value of a risk, we need to derive values of Pr and C for that risk
• Values may be qualitative or quantitative
• They may need to be more or less “accurate”, depending on importance
• Thus, risk analysis consists of:
  – Identifying relevant hazards
  – Collecting evidence that could contribute to the determination of values of Pr and C for a defined risk
  – Analysing that evidence
  – Synthesising the resulting values of Pr and C

(c) Felix Redmill, 2011 CERN, May '11 66

(c) Felix Redmill, 2011 67

PROBABILITY OF WHAT EXACTLY?

• Automobile standards focus on the control of a vehicle (and the possible loss of it)

• Each different situation will

– Occur with a different probability

– Result in different consequences

– Carry different risks

CERN, May '11

(c) Felix Redmill, 2011 68

CHOICE OF CONSEQUENCE ESTIMATES

• Worst possible

• Worst credible

• Most likely

• “Average”

CERN, May '11

(c) Felix Redmill, 2011 CERN, May '11 69

DETERMINATION OF RISK VALUES

(Diagram: a threat to security, a safety hazard, a threat of damage, a potential for unreliability, or a potential for unavailability is subjected to causal analysis and consequence analysis, from which the risk is determined)

(c) Felix Redmill, 2011 CERN, May '11 70

PRELIMINARY REQUIREMENTS

• Knowledge of the subject of risk

• Understanding of the current situation and context

• Knowledge of the purpose of the intended analysis

– The questions (e.g. tolerability) that it must answer

Such knowledge and understanding are essential to searching for the appropriate information

(c) Felix Redmill, 2011 CERN, May '11 71

BOTTOM-UP ANALYSIS (Herald of Free Enterprise)

Bosun asleep in his cabin when ship is due to depart
↓
Bow doors not closed
↓
Ship puts to sea with bow doors open
↓
Water enters car deck
↓
As ship rolls, water rushes to one side
↓
Ship capsizes
↓
Lives lost

(c) Felix Redmill, 2011 CERN, May '11 72

TOP-DOWN ANALYSIS (Herald of Free Enterprise)

(Fault tree, top event first)
Ship puts to sea with bow doors open
  Bosun did not close doors
    Bosun not available to close doors
      Bosun not on ship
      Bosun on board but not at station
        Bosun asleep in cabin
        Bosun in bar
    Problem with doors and bosun can’t close them
      Door or hinge problem
      Problem with closing mechanism
      Problem with power supply

(c) Felix Redmill, 2011 73

RISK MANAGEMENT PROCESS

• Define scope of study

• Identify the hazards

• Analyse the hazards to determine the risks they pose

• Assess risks against tolerability criteria

• Take risk-management decisions and actions

• It may also be appropriate to carry out emergency planning and prepare for the unexpected

– If so, we need to carry out rehearsals

CERN, May '11

(c) Felix Redmill, 2011 74

FOUR STAGES OF RISK ANALYSIS (But be careful with vocabulary)

• Definition of scope
  – Define the objectives and scope of the study
• Hazard identification
  – Define hazards and hazardous events
• Hazard analysis
  – Determine the sequences leading to hazardous events
  – Determine likelihood and consequences of hazardous events
• Risk assessment
  – Assess tolerability of risks associated with hazardous events

CERN, May '11

(c) Felix Redmill, 2011 75

DEFINITION OF SCOPE

• Types of risks to be studied

– e.g. safety, security, financial

• Risks to whom or what

– e.g. employees, all people, environment, the company, the mission

• Study boundary

– Plant boundary

– Region

• Admissible sources of information

– e.g. stakeholders (which?), experts, public

CERN, May '11

(c) Felix Redmill, 2011 76

SCOPE OF STUDY

• How will the results be used?

– What questions are to be answered?

– What decisions are to be made?

• What accuracy and confidence are required?

• Define study parameters

– What effort to be invested?

– What budget afforded?

– How much time allowed?

CERN, May '11

CERN, May '11 77

OUTCOME SUBJECTIVELY INFLUENCED

• Defining scope is subjective
  – Involves judgement
  – Includes bias
  – Can involve manipulation
• Scope definition
  – Influences the nature and direction of the analysis
  – Is a predisposing factor on its results

(c) Felix Redmill, 2011

(c) Felix Redmill, 2011 CERN, May '11 78

VOCABULARY - CHOICE OF TERMS

• Same term means different things to different people

• Same process given different titles

• There is no internationally agreed vocabulary

• Even individuals use different terms for the same process

• Beware: ask what others mean by their terms

• Have a convention and define it

(c) Felix Redmill, 2011 CERN, May '11 79

SOME TERMS IN USE

Hazard identification, Risk identification

Hazard analysis, Risk analysis

Risk assessment, Risk evaluation

Risk mitigation, Risk reduction, Risk management

Risk analysis, Risk assessment, Risk management

(c) Felix Redmill, 2011 CERN, May '11 80

AN APPROPRIATE CONVENTION?

(Diagram: Risk analysis comprises scope definition, hazard identification, hazard analysis, and risk assessment; Risk management comprises risk analysis together with risk communication, risk mitigation, and emergency planning)

(c) Felix Redmill, 2011 81

HAZARD AND RISK ANALYSIS

• Obtaining appropriate information, from the most appropriate sources, and analyzing it, while making assumptions that are recognized, valid, and as few as possible

CERN, May '11

(c) Felix Redmill, 2011 82

HAZARD IDENTIFICATION

• The foundation of risk analysis

- Identify the hazards (what could go wrong)

- Deduce their causes

- Determine whether they could lead to undesirable outcomes

• Knowing chains of cause and effect facilitates decisions on where to take corrective action

• But many accidents are caused by unexpected interactions rather than by failures

CERN, May '11

(c) Felix Redmill, 2011 83

HAZARD ID METHODS

• Checklists
• Brainstorming

• Expert judgement

• What-if analysis

• Audits and reports

• Site inspections

• Formal and informal staff interviews

• Interviews with others, such as customers, visitors

• Specialised techniques

CERN, May '11

(c) Felix Redmill, 2011 84

WHAT WE FIND DEPENDS ON WHERE WE LOOK

• We don’t find hazards where we don’t look
• We don’t look because
  – We don’t think of looking there
  – We don’t know of that place
  – We assume there are no hazards there
  – We assume that the risks are small
• Must take a methodical approach
  – And be thorough

CERN, May '11

(c) Felix Redmill, 2011 85

CRACK IN “TIRELESS” COOLING SYSTEM

CERN, May '11

(c) Felix Redmill, 2011 86

EXTRACT FROM ‘FILE ON 4’ INTERVIEW (12/12/00)

• (O'Halloran) For the Navy Captain Hurford concedes that the possibility that this critical section of pipework might fail was never even considered in the many years that these 12 submarines of the Swiftsure and Trafalgar classes have been in service.

• (Hurford) "This component was analysed against its duty that it saw in service and was supposed never to crack and so the fact that this crack had occurred in this component in the way that it did and caused a leak before we had detected it, is a serious issue.”

• (O'Halloran) How big a question mark does this place over your general risk probability assumptions … about the whole working of one of these nuclear reactors.

• (Hurford) "It places a question on the surveillance that we do when the submarines are in refit and maintenance, unquestionably”

• (O'Halloran) How long have these various submarines been in service?
• (Hurford) "The oldest of the Swiftsure class came into service in the early seventies"
• (O'Halloran) So has this area of the pipework ever been looked at in any of the submarines, the 12 hunter-killer submarines now in service?
• (Hurford) "No it hasn't, because the design of the component was understood and the calculations showed and experience showed that there would be no problem."
• (O'Halloran) But the calculations were wrong?
• (Hurford) "Clearly there is something wrong with that component that caused the crack and we don't know if it was the calculations or whether it was the way it was made, and that is what is being found out in the analysis at the moment"

CERN, May '11

(c) Felix Redmill, 2011 87

REPEAT FORMAL HAZARD ID

• Hazard identification should be repeated when new information becomes available and when significant change is proposed, e.g.
  – At the concept stage

– When a design is available

– When the system has been built

– Prior to system or environmental change during operation

– Prior to decommissioning

CERN, May '11

(c) Felix Redmill, 2011 88

HAZARD ANALYSIS

• Analyse the identified hazards to determine

– Potential consequences

Worst credible

Most likely

– Ways in which they could lead to undesirable outcomes

– Likelihood of undesirable outcomes

• Sometimes rough estimates are adequate

• Sometimes precision is essential

– Infusion pump (too much: poison; too little: no cure)

– X-ray dose (North Stafford)

CERN, May '11

(c) Felix Redmill, 2011 89

NOTES ON HAZARD ANALYSIS

• Two aspects of a hazard: likelihood and consequence

• Consequence is usually closely related to system goals

– So risk reduction may focus on reduction of likelihood

• Usually numerous hazards associated with a system

• Analysis is based on collecting and deriving appropriate information

• Pedigree of information depends on pedigree of its source

• The question of confidence should always be raised

• Likelihood and consequences may be expressed quantitatively or qualitatively

CERN, May '11

CERN, May '11 90

NOTES ON HAZARD ANALYSIS - 2

• For random events (such as hardware component failure)

– Historic failure frequencies may be known

– Probabilistic hazard analysis may be possible

• For systematic events (such as software failures)

– History is not an accurate predictor of the future

– Qualitative hazard analysis is necessary

• When low event frequencies (e.g. 10⁻⁶ per year) are desired, confidence in figures must be low

(c) Felix Redmill, 2011

CERN, May '11 91

THE NEED FOR QUALITATIVE ANALYSIS AND ASSESSMENT

• When failures and hazardous events are random

– Historic data may exist

– Numerical analysis may be possible

• When failures and hazardous events are systematic, or historic data do not exist

– Qualitative analysis and assessment are necessary

• Example of qualitative techniques is the risk matrix

(c) Felix Redmill, 2011

(c) Felix Redmill, 2011 92

HAZARD ANALYSIS TECHNIQUES

• FMEA analyses the effects of failures

• FMECA analyses the risks attached to failures

• HAZOP analyses both causes and effects

• Event tree analysis (ETA) works forward from

identified hazards or events (e.g. component

failures) to determine their consequences

• Fault tree analysis (FTA) works backwards from

identified hazardous events to determine their

causes

CERN, May '11

(c) Felix Redmill, 2011 93

USING THE RESULTS

• The purpose is to derive reliable information on which to base decisions on risk-management action

• ‘Derive’ - by carrying out risk analysis

• ‘Reliable’ - be thorough as appropriate

• ‘Information’ - the key to decision-making

• ‘Decisions’ - risk analysis is not carried out for its own sake but to inform decisions (usually made by others)

• Hazard identification and analysis may be carried out concurrently

CERN, May '11

(c) Felix Redmill, 2011 94

RISK ASSESSMENT

• To determine the tolerability of analysed risks

– So that risk-management decisions can be taken

• Need tolerability criteria

• Tolerability differs according to circumstance

– e.g. medical

• Tolerability differs in time

– e.g. nuclear

• Tolerability differs according to place

– e.g. oil companies' treatment of staff and environment in different countries

CERN, May '11

(c) Felix Redmill, 2011 95

TOLERABLE RISK

• Risk accepted in a given context based on the current values of society

• Not trivial to determine
• Differs across industry sectors
• May change with time
• Depends on perception
• Should be determined by discussion among parties, including
  – Those posing the risks
  – Those to be exposed to the risks
  – Other stakeholders, e.g. regulators

CERN, May '11

(c) Felix Redmill, 2011 96

THE HSE’S ALARP PRINCIPLE

Intolerable region: risk cannot be justified except in extraordinary circumstances.

The ALARP or tolerability region: risk is tolerable only if further reduction is impracticable or if its cost is grossly disproportionate to the improvement gained.

Broadly acceptable region: it is necessary to maintain assurance that risk remains at this level.

CERN, May '11

(c) Felix Redmill, 2011 97

RISK TOLERABILITY JUDGEMENT

• In the case of machinery

– We may know what could go wrong (uncertainty is low)

– It may be reversible (can fix it after one accident)

• In the case of BSE or genetically modified organisms

– The risk may only be suspected (uncertainty is high)

– It may be irreversible

• Moral values influence the judgement of how much is enough

• If we don’t take risks we don’t make progress

CERN, May '11

(c) Felix Redmill, 2011 98

HAZARD AND RISK ANALYSIS TECHNIQUES

CERN, May '11

(c) Felix Redmill, 2011 CERN, May '11 99

TECHNIQUES

Techniques support risk analysis

– They should not govern it

(c) Felix Redmill, 2011 CERN, May '11 100

TECHNIQUES TO BE CONSIDERED

• Failure (fault) modes and effects analysis (FMEA)
• Failure (fault) modes, effects and criticality analysis (FMECA)
• Hazard and operability studies (HAZOP)
• Event tree analysis (ETA)
• Fault tree analysis (FTA)
• Risk matrices
• Human reliability analysis (HRA - mentioned only)
• Preliminary hazard analysis (PHA)

(c) Felix Redmill, 2011 CERN, May '11 101

A SIMPLE CHEMICAL PLANT

(Diagram: Fluid A is fed by pump P1 through valve V1, and Fluid B by pump P2 through valve V2, into Vat R; a third valve, V3, is also shown)

(c) Felix Redmill, 2011 CERN, May '11 102

FAILURE MODES AND EFFECTS ANALYSIS

• Usually qualitative investigation of the modes of failure of individual system components

• Components may be at any level (e.g. basic, sub-system)

• Components are treated as black boxes

• For each failure mode, FMEA investigates

– Possible causes

– Local effects

– System-level effects

• Corrective action may be proposed

• Best done by a team of experts with different viewpoints

(c) Felix Redmill, 2011 CERN, May '11 103

FAILURE MODES AND EFFECTS ANALYSIS

• Boundary of study must be clearly defined

• Does not usually find the effects of

– Multiple faults or failures

– Failures caused by communication and interactions between components

– Integration of components

– Installation

(c) Felix Redmill, 2011 CERN, May '11 104

EXAMPLE FMEA OF A SIMPLE CHEMICAL PROCESS

Study No. | Item | Failure mode | Possible causes | Local effects | System-level effects | Proposed correction

1 | Pump P1 | Fails to start | 1. No power; 2. Burnt out | Fluid A does not flow | Excess of Fluid B in Vat R | Monitor pump operation

2 | Pump P1 | Burns out | 1. Loss of lubricant; 2. Excessive temperature | Fluid A does not flow | Excess of Fluid B in Vat R | Add alarm to pump monitor

3 | Valve V1 | Sticks closed | 1. No power; 2. Jammed | Fluid A cannot flow | Excess of Fluid B in Vat R | Monitor valve operation

4 | Valve V1 | Sticks open | 1. No power; 2. Jammed | Cannot stop flow of Fluid A | Danger of excess of Fluid A in Vat R | Introduce additional valve in series
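One way the rows of such a table might be held as structured records, so that they can be filtered and reported on; a minimal sketch only, with the fields and two of the rows taken from the table above:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FmeaRow:
    study_no: int
    item: str
    failure_mode: str
    possible_causes: List[str]
    local_effects: str
    system_level_effects: str
    proposed_correction: str

fmea_table = [
    FmeaRow(1, "Pump P1", "Fails to start", ["No power", "Burnt out"],
            "Fluid A does not flow", "Excess of Fluid B in Vat R",
            "Monitor pump operation"),
    FmeaRow(4, "Valve V1", "Sticks open", ["No power", "Jammed"],
            "Cannot stop flow of Fluid A", "Danger of excess of Fluid A in Vat R",
            "Introduce additional valve in series"),
]

# e.g. list every failure mode whose system-level effect mentions Fluid A
for row in fmea_table:
    if "Fluid A" in row.system_level_effects:
        print(row.item, "-", row.failure_mode)
```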

(c) Felix Redmill, 2011 CERN, May '11 105

FAILURE MODES, EFFECTS AND CRITICALITY ANALYSIS

• FMECA = FMEA + analysis of the criticality of each failure

• Criticality in terms of risk or one of its components

– i.e. severity and probability or frequency of occurrence

– Usually quantitative

• Purpose: to identify the failure modes of highest risk

– This facilitates prioritization of corrective actions, focusing attention and resources first where they will have the greatest impact on safety

• Recording: add one or two columns to a FMEA table

(c) Felix Redmill, 2011 CERN, May '11 106

HAZARD AND OPERABILITY STUDIES (HAZOP)

• The methodical study of a documented representation of a system, by a managed team, to identify hazards and operational problems.

• Correct system operation depends on the attributes of the items on the representation remaining within design intent. Therefore, studying what would happen if the attributes deviated from design intent should lead to the identification of the hazards associated with the system's operation. This is the principle of HAZOP.

• 'Guide words' are used to focus the investigation on the various types of deviation.

(c) Felix Redmill, 2011 CERN, May '11 107

A GENERIC SET OF GUIDE WORDS

Guide word: Meaning

No: The complete negation of the design intention. No part of the intention is achieved, but nothing else happens.
More: A quantitative increase.
Less: A quantitative decrease.
As well as: A qualitative increase. All the design intention is achieved, together with something additional.
Part of: A qualitative decrease. Only part of the design intention is achieved.
Reverse: The logical opposite of the design intention.
Other than: A complete substitution where no part of the design intention is achieved but something different occurs.
Early: The design intention occurs earlier in time than intended.
Late: The design intention occurs later in time than intended.
Before: The design intention occurs earlier in a sequence than intended.
After: The design intention occurs later in a sequence than intended.

(c) Felix Redmill, 2011 CERN, May '11 108

HAZOP OVERVIEW

(Flowchart: introductions; presentation of the design representation; examine the design representation methodically; where a possible deviation from design intent is found, examine its consequences and causes, document the results, and define follow-up work; continue until the study is completed or time is up; then agree the documentation and sign off the meeting)

(c) Felix Redmill, 2011 CERN, May '11 109

HAZOP SUMMARY

• HAZOP is a powerful technique for hazard identification - and analysis
• It requires a team of carefully chosen members
• It depends on planning, preparation, and leadership
• A study usually requires several meetings
• The study proceeds methodically
• Guide words are used to focus attention
• Outputs are the identified hazards, recommendations, and questions

(c) Felix Redmill, 2011 CERN, May '11 110

EVENT TREE ANALYSIS

• Starts with an event that might affect safety

• Follows a cause-and-effect chain to the system-level consequence

• Does not assume that an event is hazardous

• Includes both operational and fault conditions

• Each event results in a branch, so N events give 2^N branches

• If event probabilities can be derived

– They may be assigned to the branches of the tree

– The probability of the final events may be calculated

(c) Felix Redmill, 2011 CERN, May '11 111

A SIMPLE EVENT TREE

(Event tree: Valve operates? → Valve monitor O.K.? → Alarm relay O.K.? → Claxon O.K.? → Operator responds? Each question branches Yes/No; the outcomes are classed as one safe outcome and several unsafe outcomes)

(c) Felix Redmill, 2011 CERN, May '11 112

REFINED EVENT TREE

(Refined event tree: Valve operates? → Valve monitor functions? → Alarm relay operates? → Claxon sounds? → Operator responds? with Yes/No branches as before)

(c) Felix Redmill, 2011 CERN, May '11 113

EVENT TREE: ANOTHER EXAMPLE

(Event tree, given that a fire starts:
Fire spreads quickly? No (P=0.6) → Fire contained
  Yes (P=0.4) → Sprinkler fails to work? No (P=0.8) → Fire controlled
    Yes (P=0.2) → People cannot escape? No (P=0.9) → Damage and loss
      Yes (P=0.1) → Multiple fatalities)
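A short sketch computing the outcome probabilities for this event tree, given that a fire starts; the branch probabilities are the ones shown on the slide, and independence of the branches is assumed:

```python
# Branch probabilities from the event tree above (given that a fire starts).
p_spreads         = 0.4   # fire spreads quickly (so 0.6 that it does not)
p_sprinkler_fails = 0.2
p_cannot_escape   = 0.1

outcomes = {
    "Fire contained":      1 - p_spreads,
    "Fire controlled":     p_spreads * (1 - p_sprinkler_fails),
    "Damage and loss":     p_spreads * p_sprinkler_fails * (1 - p_cannot_escape),
    "Multiple fatalities": p_spreads * p_sprinkler_fails * p_cannot_escape,
}

for outcome, p in outcomes.items():
    print(f"{outcome}: {p:.3f}")

# The four outcomes are mutually exclusive and exhaustive, so the probabilities sum to 1.
assert abs(sum(outcomes.values()) - 1.0) < 1e-9
```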

(c) Felix Redmill, 2011 CERN, May '11 114

BOTTOM-UP ANALYSIS (Herald of Free Enterprise)

Bosun asleep in his cabin when ship is due to depart
↓
Bow doors not closed
↓
Ship puts to sea with bow doors open
↓
Water enters car deck
↓
As ship rolls, water rushes to one side
↓
Ship capsizes
↓
Lives lost

(c) Felix Redmill, 2011 CERN, May '11 115

TOP-DOWN ANALYSIS (Herald of Free Enterprise)

(Fault tree, top event first)
Ship puts to sea with bow doors open
  Bosun did not close doors
    Bosun not available to close doors
      Bosun not on ship
      Bosun on board but not at station
        Bosun asleep in cabin
        Bosun in bar
    Problem with doors and bosun can’t close them
      Door or hinge problem
      Problem with closing mechanism
      Problem with power supply

(c) Felix Redmill, 2011 CERN, May '11 116

FAULT TREE ANALYSIS

• Starts with a single 'top event' (e.g., a system failure)
• Combines the chains of cause and effect which could lead to the top event
• Next-level causes are identified and added to the tree, and this is repeated until the most basic causes are identified
• Causal relationships are defined by AND and OR gates
• One lower-level event may cause several higher-level events
• Examples of causal events: component failure, human error, software bug
• Each top-level undesired event requires its own fault tree
• Probabilities may be attributed to events, and from these the probability of the top event may be derived

(c) Felix Redmill, 2011 CERN, May '11 117

EXAMPLE FTA OF SIMPLE CHEMICAL PROCESS

(Fault tree, top event first:
Excess of Fluid A in Vat R - OR of:
  Too much of Fluid A enters Vat R
    Valve V1 sticks open
  Insufficient Fluid B enters Vat R - OR of:
    Run out of Fluid B - OR of:
      Tank leaking
      Tank not filled
    Pump P2 fails
    Valve V2 sticks shut)
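A minimal sketch of how basic-event probabilities could be propagated up this fault tree. The tree structure is the one shown above; the numerical probabilities are invented purely for illustration, and the basic events are assumed to be independent:

```python
def p_or(*probs):
    """Probability that at least one of several independent events occurs (OR gate)."""
    q = 1.0
    for p in probs:
        q *= 1.0 - p
    return 1.0 - q

# Illustrative basic-event probabilities (not from the slides).
p_v1_sticks_open  = 1e-3
p_tank_leaking    = 1e-4
p_tank_not_filled = 1e-3
p_pump_p2_fails   = 2e-3
p_v2_sticks_shut  = 1e-3

p_run_out_of_b   = p_or(p_tank_leaking, p_tank_not_filled)
p_insufficient_b = p_or(p_run_out_of_b, p_pump_p2_fails, p_v2_sticks_shut)
p_too_much_a     = p_v1_sticks_open
p_top_event      = p_or(p_too_much_a, p_insufficient_b)  # Excess of Fluid A in Vat R

print(f"P(excess of Fluid A in Vat R) ≈ {p_top_event:.2e}")
```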

(c) Felix Redmill, 2011 CERN, May '11 118

THE PRINCIPLE OF THE PROTECTION SYSTEM

(Diagram, AND gate: the required event frequency of 10⁻⁷ per hour is met by combining a dangerous failure frequency of the equipment of 10⁻³ per hour with a safety function whose failure measure is 10⁻⁴)
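The arithmetic behind the AND gate, reading the 10⁻⁴ figure as the probability that the safety function fails when demanded (an assumption about the slide's units):

```python
# A dangerous event needs both a dangerous failure of the equipment and a
# simultaneous failure of the protecting safety function.
euc_dangerous_failure_rate = 1e-3   # dangerous failures of the equipment, per hour
safety_function_failure    = 1e-4   # probability the safety function fails on demand

dangerous_event_rate = euc_dangerous_failure_rate * safety_function_failure
print(f"Resulting event frequency: {dangerous_event_rate:.0e} per hour")  # 1e-07 per hour
```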

(c) Felix Redmill, 2011 CERN, May '11 119

COMPLEMENTARITY OF TECHNIQUES

• Compare results of FMEA with low-level causes from FTA

• Carry out HAZOP on a sub-system identified as risky by a high-level FMEA

• Carry out ETA on low-level items identified as risky by FTA

(c) Felix Redmill, 2011 CERN, May '11 120

A RISK MATRIX (an example of qualitative risk analysis)

Rows: likelihood or frequency (High, Medium, Low)
Columns: consequence (Negligible, Moderate, High, Catastrophic)
(The cells are filled in with the analysed hazards, as on the next slide)

(c) Felix Redmill, 2011 CERN, May '11 121

RISKS POSED BY IDENTIFIED HAZARDS

Likelihood or frequency (rows) vs. consequence (columns):

          Negligible | Moderate | High | Catastrophic
High                 | H2       |      |
Medium    H1, H5     |          | H6   | H4
Low                  |          | H3   |

(c) Felix Redmill, 2011 CERN, May '11 122

A RISK CLASS MATRIX (example only)

          Negligible | Moderate | High | Catastrophic
High      C          | B        | A    | A
Medium    D          | C        | B    | A
Low       D          | D        | C    | B
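A minimal sketch of how the two matrices might be combined programmatically: place each analysed hazard in a likelihood/consequence cell (as reconstructed from the previous slide) and read off its class from the risk class matrix above. The H1 → D and H2 → B results agree with the worked statement on the next slide; the other placements follow the reconstruction and are illustrative.

```python
# Risk class matrix (rows: likelihood; columns: consequence), as above.
RISK_CLASS = {
    "High":   {"Negligible": "C", "Moderate": "B", "High": "A", "Catastrophic": "A"},
    "Medium": {"Negligible": "D", "Moderate": "C", "High": "B", "Catastrophic": "A"},
    "Low":    {"Negligible": "D", "Moderate": "D", "High": "C", "Catastrophic": "B"},
}

# Hazard placements as reconstructed from the previous slide.
hazards = {
    "H1": ("Medium", "Negligible"),
    "H2": ("High", "Moderate"),
    "H3": ("Low", "High"),
    "H4": ("Medium", "Catastrophic"),
    "H5": ("Medium", "Negligible"),
    "H6": ("Medium", "High"),
}

for hazard, (likelihood, consequence) in hazards.items():
    print(f"{hazard}: class {RISK_CLASS[likelihood][consequence]}")
# H1: class D, H2: class B, ...
```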

(c) Felix Redmill, 2011 CERN, May '11 123

TOLERABILITY OF ANALYSED RISKS

• Refer the cell of each analysed hazard in the risk matrix to the equivalent cell in the risk class matrix to determine the class of the analysed risks

– So, Hazard 1 poses a D Class risk, Hazard 2 a B, etc.

• Risk class

– Defines the (relative) importance of risk reduction

– Provides a means of prioritising the handling of risks

– Can be equated to a defined type of action

• Risk class gives no indication of time criticality

– This must be derived from an understanding of the risk

• What is done to manage risks depends on circumstances

(c) Felix Redmill, 2011 CERN, May '11 124

THOUGHTS ON THE USE OF TECHNIQUES

• Techniques are intended to support what you have decided to do
  – They should not define what you do
• Each is useful on its own for a given purpose
  – Thorough analysis requires a combination of techniques
• Risk analysis is not achieved in a single activity
  – It should be continuous throughout a system’s life
• Early analysis (PHA) is particularly effective
• Techniques may be used to verify each other’s results
• And to expand on them and provide further detail

(c) Felix Redmill, 2011 CERN, May '11 125

IMPORTANCE OF EARLY ANALYSIS

• Too many causes to start with bottom-up analysis
  – Effort and financial cost would be too great
  – Would lead to over-engineering for small risks
  – Would miss too many important hazards
• Fault trees commence with accidents or hazards
• Carry out Preliminary Hazard Analysis early in a project

(Pyramid: root causes, 1000s; hazards, fewer than 100; accidents, fewer than 20)

(c) Felix Redmill, 2011 CERN, May '11 126

PRELIMINARY HAZARD ANALYSIS

• At the ‘Objectives’ or early specification stage
  – Potential then for greatest safety effectiveness
• Take a ‘system’ perspective
• Consider historic understanding of such systems
• Address differences between this system and historic systems (novel features)
• Consider such matters as: boundaries, operational intent, physical operational environment, assumptions
• Identify accident types, potential accident scope and consequences
• Identify system-level hazards
• If a checklist is used, review it for obsolescence
• Create a new checklist for future use

(c) Felix Redmill, 2011 CERN, May '11 127

HUMAN RELIABILITY ASSESSMENT

• Human components of systems are unreliable

• Hazards posed by them should be included in risk analysis

• Several Human Reliability Assessment (HRA) techniques have been developed for the purpose

– These will be covered later in the degree course

CERN, May '11 128

RISKS POSED BY HUMANS

• We need to

– Extend risk analysis to include HRA

– Pay more attention to ergonomic considerations in design

– Consider human cognition in interface design

– Produce guidelines on human factors in safety

• Safety is multi-dimensional, so take an interdisciplinary approach

– Include ergonomists and psychologists in development and operational teams

(c) Felix Redmill, 2011

CERN, May '11 129

RISKS POSED BY SENIOR MANAGERS

• Senior managers (should)

– Define safety policy

– Make strategic safety decisions

– Provide leadership in safety (or not)

– Define culture by design or by their behaviour

• They predispose an organisation to safety or accident

• Accident inquiry reports show that the contribution of senior management to accident causation is considerable

• Yet their contributions to risk are not included in risk analyses

• Risk analyses therefore tend to be optimistic

• Risk analyses need to include management failure

• Management should pay attention to management failure

(c) Felix Redmill, 2011

CERN, May '11 130

RISK ANALYSIS SUMMARY

• Risk analysis is the cornerstone of modern safety engineering and management

• It can be considered as a four-staged process

• The first stage, hazard identification, is the basis of all further work - risks not identified are not analysed or mitigated

• All four stages include considerable subjectivity

• Subjectivity, in the form of human creativity and judgement, is essential to the effectiveness of the process

• Subjectivity also introduces bias and error

• HRA techniques need to be included in risk analyses

• We need techniques to include management risks

• Often the greatest value of risk analysis is having to do it

(c) Felix Redmill, 2011

(c) Felix Redmill, 2011 131

SAFETY INTEGRITY LEVELS (SILs)

CERN, May '11

(c) Felix Redmill, 2011 132

SAFETY AS A WAY OF THINKING

• Reliability is not a guarantee of safety
• If risk is too great, it must be reduced
  (the beginning of 'safety thinking')
• But how do we know if the risk is too great?
  – Carry out risk analysis
  (safety engineering enters software and systems engineering)
• But even if the risk is high, does it need to be reduced?
  – Determine what level of risk is tolerable
  (this introduces the concept of safety management)

CERN, May '11

(c) Felix Redmill, 2011 133

SAFETY INTEGRITY

• If risk is not tolerable it must be reduced
• The high-level requirement of a safety function is 'to reduce the risk'
• Analysis leads to the functional requirements
• The safety function becomes part of the overall system
  – Safety depends on it
• So, will it reduce the risk to (at least) a tolerable level?
• We try to ensure that it does by defining the reliability with which it performs its safety function
  – In IEC 61508, in terms of the probability of dangerous failure
• This is referred to as 'safety integrity'

CERN, May '11

(c) Felix Redmill, 2011 134

SAFETY INTEGRITY IS A TARGET REQUIREMENT

• Safety integrity sets a target for the maximum tolerable rate of dangerous failures of a safety function

– (This is a ‘reliability-type’ attribute)

• Example: S should not fail dangerously more than once in 7 years

• Example: There should be no more than 1 dangerous failure in 1000 demands

• If the rate of dangerous failures of the safety function can be measured accurately, achievement of the defined safety integrity may be claimed

CERN, May '11

(c) Felix Redmill, 2011 135

WHY SAFETY INTEGRITY LEVELS?

• A safety integrity target could have an infinite number of values
• It is practical to divide the total possible range into bands (categories)
• The bands are 'safety integrity levels' (SILs)
  – In IEC 61508 there are four
• N.B. It is interesting that the starting point was that safety and reliability are not synonymous, and the end point is the reliance of safety on a reliability-type measure (rate of dangerous failure)

CERN, May '11

(c) Felix Redmill, 2011 136

IEC 61508 SIL VALUES

Safety Integrity Level | Low-demand mode of operation (probability of failure to perform its safety functions on demand) | Continuous/high-demand mode of operation (probability of dangerous failure per hour)

4 | ≥ 10⁻⁵ to < 10⁻⁴ | ≥ 10⁻⁹ to < 10⁻⁸
3 | ≥ 10⁻⁴ to < 10⁻³ | ≥ 10⁻⁸ to < 10⁻⁷
2 | ≥ 10⁻³ to < 10⁻² | ≥ 10⁻⁷ to < 10⁻⁶
1 | ≥ 10⁻² to < 10⁻¹ | ≥ 10⁻⁶ to < 10⁻⁵
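A small sketch showing how a target failure measure might be mapped to its SIL band from this table; the band boundaries are the ones tabulated above (low-demand: probability of failure on demand; continuous/high-demand: probability of dangerous failure per hour):

```python
# SIL bands from the IEC 61508 table above.
LOW_DEMAND_BANDS  = {4: (1e-5, 1e-4), 3: (1e-4, 1e-3), 2: (1e-3, 1e-2), 1: (1e-2, 1e-1)}
HIGH_DEMAND_BANDS = {4: (1e-9, 1e-8), 3: (1e-8, 1e-7), 2: (1e-7, 1e-6), 1: (1e-6, 1e-5)}

def sil_for(target, bands):
    """Return the SIL whose band contains the target failure measure, or None."""
    for sil, (lower, upper) in bands.items():
        if lower <= target < upper:
            return sil
    return None  # outside the tabulated range

print(sil_for(3e-4, LOW_DEMAND_BANDS))   # SIL 3 (low-demand, PFD = 3 x 10^-4)
print(sil_for(5e-8, HIGH_DEMAND_BANDS))  # SIL 3 (high-demand, PFH = 5 x 10^-8 per hour)
```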

CERN, May '11

(c) Felix Redmill, 2011 137

RELATIONSHIP (IN IEC 61508) BETWEEN

LOW-DEMAND AND CONTINUOUS MODES

• Low-demand mode: 'Frequency of demands ... no greater than one per year'
• 1 year = 8760 hours (approximately 10⁴)
• The continuous-mode and low-demand figures are related by a factor of 10⁴

• Claims for a low-demand system must be justified

CERN, May '11

(c) Felix Redmill, 2011 138

APPLYING SAFETY INTEGRITY TO DEVELOPMENT PROCESS

• When product integrity cannot be measured with confidence (particularly when systematic failures dominate)

– The target is related, in IEC 61508, to the development process

• Processes are equated to safety-integrity levels

• But

– Such equations are judgemental

– Process may be an indicator of product, but not an infallible one

CERN, May '11

(c) Felix Redmill, 2011 139

IEC 61508 DEFINITIONS

• Safety integrity:

Probability of a safety-related system satisfactorily performing the required safety functions under all the stated conditions within a stated period of time

• Safety integrity level:

Discrete level (one out of a possible four) for specifying the safety integrity requirements of the safety functions to be allocated to the E/E/PE safety-related systems, where SIL 4 has the highest level of safety integrity and SIL 1 the lowest

CERN, May '11

(c) Felix Redmill, 2011 140

SIL IN TERMS OF RISK REDUCTION (IEC 61508)

(Diagram: on a scale of increasing risk, the EUC risk lies above the tolerable risk; the gap between them is the necessary risk reduction)

• The necessary risk reduction is effected by a safety function
• Its functional requirements define what it should do
• Its safety integrity requirements define its tolerable rate of dangerous failure

CERN, May '11

(c) Felix Redmill, 2011 141

THE IEC 61508 MODEL OF RISK REDUCTION

(Diagram: the EUC with its control system and a protection system; safety functions may reside in both the control system and the protection system; the EUC delivers utility plus risks)

CERN, May '11

(c) Felix Redmill, 2011 142

PRINCIPLE OF THE PROTECTION SYSTEM

(Diagram, AND gate: an EUC dangerous failure frequency of 10⁻³ per hour combined with a safety function failure measure of 10⁻⁴ gives an event frequency of 10⁻⁷ per hour)

CERN, May '11

(c) Felix Redmill, 2011 143

BEWARE

• Note that a ‘protection system’ claim requires total independence of the safety function from the protected function

CERN, May '11

CERN, May '11 144

EACH HAZARD CARRIES A RISK

(Diagram: on a scale of increasing risk, risk 1 is shown against its tolerable level, with the necessary reduction, the actual reduction, and the residual risk marked; risk 2 has its own, different tolerable level)

CERN, May '11

(c) Felix Redmill, 2011 145

EXAMPLE OF A SAFETY FUNCTION

[Diagram: emergency release safety function: sensor -> logic -> actuator]

• The target SIL applies to the entire safety instrumentation system
  – Sensor, connections, hardware platform, software, actuator, valve, data

• All processes must accord with the SIL
  – Validation, configuration management, proof checks, maintenance, change control and revalidation ...

CERN, May '11

(c) Felix Redmill, 2011 146

PROCESS OF SIL DERIVATION

• Hazard identification

• Hazard analysis

• Risk assessment
  – Resulting in requirements for risk reduction

• Safety requirements specification
  – Functional requirements
  – Safety integrity requirements

• Allocation of safety requirements to safety functions
  – Safety functions to be performed
  – Safety integrity requirements

• Allocation of safety functions to safety-related systems
  – Safety functions to be performed
  – Safety integrity requirements

CERN, May '11

(c) Felix Redmill, 2011 147

THE IEC 61508 SIL PRINCIPLE

[Diagram: risk assessment -> necessary risk reduction -> SIL -> constraints on the development process]

CERN, May '11

(c) Felix Redmill, 2011 148

BUT NOTE

• IEC 61508 emphasises process evidence but does not exclude the need for product or analysis evidence

• It is impossible for process evidence to be conclusive

• It is unlikely that conclusive evidence can be derived for complex systems in which systematic failures dominate

• Evidence of all types should be sought and assembled

CERN, May '11

(c) Felix Redmill, 2011 149

SIL ALLOCATION - 1 (from IEC 61508)

• Functions within a system may be allocated different SILs

• The required SIL of hardware is that of the software safety function with the highest SIL

• For a safety-related system that implements safety functions of different SILs, the hardware and all the software shall be of the highest SIL unless it can be shown that the implementations of the safety functions of the different SILs are sufficiently independent

• Where software is to implement safety functions of different SILs, all the software shall be of the highest SIL unless the different safety functions can be shown to be independent
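A minimal sketch of the allocation rule for an element shared by several safety functions (the function and the names below are illustrative, not from the standard):

    # Sketch only: a shared element is treated at the highest SIL of the safety
    # functions it implements, unless sufficient independence is demonstrated.
    def required_sil(function_sils, independence_demonstrated=False):
        if independence_demonstrated:
            # Each implementation may then be developed to its own SIL
            return dict(function_sils)
        highest = max(function_sils.values())
        return {name: highest for name in function_sils}

    functions = {"overpressure trip": 3, "high-level alarm": 1}
    print(required_sil(functions))                                  # both treated as SIL 3
    print(required_sil(functions, independence_demonstrated=True))  # each keeps its own SIL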

CERN, May '11

(c) Felix Redmill, 2011 150

SIL ALLOCATION - 2 (from IEC 61508)

• Where a safety-related system is to implement both safety and non-safety functions, all the hardware and software shall be treated as safety-related unless it can be shown that the implementation of the safety and non-safety functions is sufficiently independent (i.e. that the failure of any non-safety-related functions does not affect the safety-related functions). Wherever practicable, the safety-related functions should be separated from the non-safety-related functions

CERN, May '11

(c) Felix Redmill, 2011 151

SIL — ACHIEVED OR TARGET RELIABILITY?

• SIL is initially a statement of target rate of dangerous failures

• There may be reasonable confidence that achieved reliability = or > target reliability (that the SIL has been met), when
  – Simple design
  – Simple hardware components with known fault histories in same or similar applications
  – Random failures

• But there is no confidence that the target SIL is met when
  – There is no fault history
  – Systematic failures dominate

• Then, IEC 61508 focuses on the development process

CERN, May '11

(c) Felix Redmill, 2011 152

APPLICATION OF SIL PRINCIPLE

• IEC 61508 is a meta-standard, to be used as the basis for sector-specific standards

• The SIL principle may be adapted to suit the particular conditions of the sector

• An example is the application of the principle in the Motor Industry guidelines

CERN, May '11

(c) Felix Redmill, 2011 153

EXAMPLE OF SIL BASED ON CONSEQUENCE

(Motor Industry)

• In the Motor Industry Software Reliability Association (MISRA) guidelines (1994), the allocation of a SIL is determined by 'the ability of the vehicle occupants … to control the situation following a failure'

• Steps in determining an integrity level:

a) List all hazards that result from all failures of the system

b) Assess each failure mode to determine its controllability category

c) The failure mode with the highest associated controllability category determines the integrity level of the system

CERN, May '11

(c) Felix Redmill, 2011 154

MOTOR INDUSTRY EXAMPLE

Controllability category   Acceptable failure rate   Integrity level

Uncontrollable             Extremely improbable      4
Difficult to control       Very remote               3
Debilitating               Remote                    2
Distracting                Unlikely                  1
Nuisance only              Reasonably possible       0
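A minimal sketch of steps (b) and (c) from the previous slide using the table above (the failure modes and their categories are invented for illustration):

    # Sketch only: the failure mode with the highest controllability category
    # determines the integrity level of the system (MISRA 1994 style).
    CONTROLLABILITY_TO_IL = {
        "nuisance only": 0,
        "distracting": 1,
        "debilitating": 2,
        "difficult to control": 3,
        "uncontrollable": 4,
    }

    failure_modes = {
        "loss of instrument backlight": "distracting",
        "unintended full throttle": "difficult to control",
        "loss of engine power": "debilitating",
    }

    system_il = max(CONTROLLABILITY_TO_IL[c] for c in failure_modes.values())
    print(f"System integrity level: {system_il}")  # -> 3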

CERN, May '11

(c) Felix Redmill, 2011 155

SIL CAN BE MISLEADING

• Different standards derive SILs differently

• SIL is not a synonym for overall reliability
  – Could lead to over-engineering and greater costs

• When SIL claim is based only on development process
  – What were competence, QA, safety assessment, etc.?

• Glib use of the term 'SIL'
  – No certainty of what is meant
  – Is the claim understood by those making it?
  – Be suspicious until examining the evidence

• SIL X may be intended to imply use in all circumstances
  – But safety is context-specific

• SIL says nothing about required attributes of the system
  – Must go back to the specification to identify them

CERN, May '11

(c) Felix Redmill, 2011 156

THREE QUESTIONS

• Does good process necessarily lead to good product?

• Instead of using a safety function, why not simply improve the basic system (EUC)?

• Can the SIL concept be applied to the basic system?

CERN, May '11

(c) Felix Redmill, 2011 157

DOES GOOD PROCESS LEAD TO GOOD PRODUCT?

• Adhering to “good” process does not guarantee reliability

– Link between process and product is not definitive

• Using a process is only a start

– Quality, and safety, come from professionalism

• Not only development processes, but also operation, maintenance, management, supervision, etc., throughout the life cycle

• Not only must development be rigorous but also the safety requirements must be correct

– And requirements engineering is notoriously difficult

• Meeting process requirements offers confidence, not proof

CERN, May '11

(c) Felix Redmill, 2011 158

WHY NOT SIMPLY IMPROVE THE BASIC SYSTEM?

• IEC 61508 bases safety on the addition of safety functions

• This assumes that EUC + control system are fixed

• Why not reduce risk by making EUC & control system safer?

• This is an iterative process

– Then, if all risks are tolerable, no safety functions are required

– But (according to IEC 61508) the claim cannot be lower than 10^-5 dangerous failures/hour (SIL 1)

• If there comes a time when we can't (for technological reasons) or won't (on grounds of cost) improve further

– We must add safety functions if any risks remain intolerable

• We are then back to the standard as it is

CERN, May '11

(c) Felix Redmill, 2011 159

CAN THE SIL CONCEPT BE APPLIED TO THE BASIC SYSTEM?

• Determine its tolerable rate of dangerous failure and translate this into a SIL (as in the MISRA guidelines)

• But note the limit on the claim that can be made (10^-5 dangerous failures per hour) to satisfy IEC 61508

CERN, May '11

(c) Felix Redmill, 2011 160

DOES MEETING THE SIL GUARANTEE SAFETY?

• Derivation of a SIL is based on hazard and risk analyses

• Hazard and risk analyses are subjective processes

• If the derived risk value is an accurate representation of the risk AND the selected level of risk tolerability is not too high THEN meeting the SIL may offer a good chance of safety

• But these criteria may not hold

• Also, our belief that the SIL has been met may not be justified

CERN, May '11

(c) Felix Redmill, 2011 161

SIL SUMMARY

• The SIL concept specifies the tolerable rate of dangerous failures of a safety-related system
  – It defines a "safety-reliability" target

• Evidence of achievement comes from product and process
  – IEC 61508 SIL defines constraints on development processes

• Different standards use the concept differently
  – SIL derivation may be sector-specific

• It can be misleading in numerous ways

• The SIL concept is a tool
  – It should be used only with understanding and care

CERN, May '11