
SAFETY SYSTEMS
The Safety-Critical Systems Club Newsletter
May 2016, Volume 25, Number 3
www.SCSC.org.uk

25 Years of Safety Systems and of the Safety-Critical Systems Club

This issue of Safety Systems, the 75th, marks the Club’s 25th birthday. Simultaneously, it trumpets the end of an era: the ‘old guard’ is stepping aside and making way for changes to the Club’s home and its management. It celebrates 25 years of life and effort and wishes well to what is to come. But what is to come? This Editorial offers an outline of the Club’s next step after taking a cursory look back at where we came from and what we have touched on in our journey to the present.

A Potted History

The Safety-Critical Systems Club held its first public meeting – a seminar – in Manchester on 11th July 1991. 256 delegates attended.

At that time, practitioners who knew themselves to inhabit the safety field possessed a terrific hunger for information on safety technology and practice, and many came seeking not only knowledge but, importantly, guidance on where to seek it.

Of course, there were many others who were not yet aware that they inhabited the safety domain, and a part of our remit was to find and inform them.

Through the 1980s, computers, which were rapidly and persistently decreasing in both size and cost, were finding their way into all industrial fields, and their application in what became known as ‘safety-critical systems’ attracted the attention of astute engineers. A study, sponsored by the Department of Trade and Industry (DTI) and carried out by members of the Institution of Electrical Engineers (IEE, now the Institution of Engineering and Technology (IET)) and the British Computer Society (BCS), identified numerous problems that arose from this application of software and made many recommendations to the government and both industry and academe. The study report, published in 1989, had the effect of revealing to the research establishment that there was wide scope for investigation into the field.

Following this report, the DTI, together with the Science and Engineering Research Council (SERC), invited applications for research projects, with the proviso that all had to be collaborative, with participants from both industry and academe. With about 30 projects and approximately £30 million invested, it was clear that an organisation was required to facilitate the propagation of project results, and a contract to set up and run a ‘community club’ was awarded to the BCS and the IEE jointly (though, contractually, to the BCS). These brought in the Centre for Software Reliability at Newcastle University to manage the club and me (Felix Redmill) to do the technical work – organising events, editing a newsletter, doing marketing, carrying out liaison with other bodies, and more.

The Club’s objectives were defined as being to raise awareness of safety matters and its technologies, and to facilitate the transfer of information, technology, and current and emerging practices and standards. All sectors of the safety-critical community, and both technical and managerial levels within them, would be involved. It was hoped to facilitate communication among researchers, the transfer of technology from researchers to industry and feedback from users, and the communication of experience between users. The benefits were intended to be better directed research, a more rapid and effective transfer and use of technology, aid in the identification of best practice, and the definition of requirements for education and training.

Communication between users was seen as being particularly important. Feedback on a technology from a single user to a researcher is valuable, but rapid exchange of experience between users can not only shorten learning curves but also minimise the use of unsuccessful technologies. Further, and importantly, although there were known to be many failures of software-based systems in safety-related companies, it was also known that they were often concealed rather than revealed and discussed. Involvement of the developing HSE (Health and Safety Executive) was one solution to this, but another was a club atmosphere in which community members could discuss their problems as well as their successes. Attending Club events was intended to put industrialists in touch with each other. And it was hoped that they would also be encouraged to give talks on their experiences and write articles for the Club’s newsletter.

The Club formally came into being on 1st May 1991. Its first seminar was held at Manchester University in July and the first issue of Safety Systems was published in September. At that time, and for many years after, Club operations were managed and run by three of us: Tom Anderson, at Newcastle University, who had overall responsibility and was the Club’s comptroller; Joan Atkinson, his secretary, who conducted all administrative and logistic tasks; and me.

The DTI and SERC provided funding, on a reducing scale, for three years, with a hoped-for objective of subsequent continuation. As, by the end of that time, the Club had been successful, both in attracting members and achieving its goals, it was continued, on a shoestring, and it continues still, 25 years later. Tom and Joan held their positions through all that time. After 17 years as the technical all-rounder, which included the planning and organisation of more than 70 events, including 16 annual Symposiums, I resigned from organisational and liaison work but continued to edit Safety Systems. Chris Dale became the Event Co-ordinator and, after six Symposiums, handed over in 2014 to Mike Parsons, who is still in place.

After funding ceased, our management board became an advisory Steering Group, chaired by Bob Malcolm, whom I thank for huge support. He was succeeded by Brian Jepson, who also built and maintains the Club’s web site and, latterly, by the incumbent, Graham Jolliffe.

Other Features in This Issue

• John Knight explains the need for higher education to lead the improvement of software engineers.
• Drew Rae shows how System Safety’s path diverged from its parent Safety Science and says that it needs a sounder scientific foundation.
• John Spriggs explains how the technical and commercial applications of modern technology are compromising safety.
• Roger Rivett says that to understand and apply reliability theory to the benefit of safety engineering, we need to know its history – and he offers an education in it.
• Richard Scaife advocates consideration of behaviour as well as technical competence in the selection of candidates for safety-related positions.
• Felix Redmill warns against certainty, advocates suspicion, and calls for better education in Risk.
• David Forsythe explains the application of a new risk-based management framework, intended to meet the changed role of the Sellafield nuclear site.
• Simon Brown gives an insight into the regulators’ decision-making process in the application of ALARP.
• Matthew Squair shows that the apparent arithmetic accuracy of quantitative risk assessment is seldom reflected in reality, and he proposes that risk-management strategies should be based on the type of uncertainty inherent in the situation.
• Rhys David offers a timely reminder of how we are controlled by our cognitive biases and of the subjectivity that we bring to our dealings with risk.
• Paul Bennett gives an overview of the model that he uses in system development to ensure safety, and the collection of evidence to demonstrate it.
• Odd Nordland comments on the European Union’s Common Safety Method and identifies some confusing consequences.
• Les Hatton shows that past lessons in project failures have not been learned and that software development is a long way from being engineering.
• Martyn Thomas warns of our lack of preparedness to combat cyber crime, against both citizens and nations, and of the need for urgent technical and political action.
• Phil Hall presents the third and final article in a series that offers information and advice on safety certification.
• Karen Tracey reports on the Club’s event, last November, held jointly with the Independent Safety Assurance Working Group.
• We offer a Calendar of Events on topics relevant to our community.
• The Editor makes notes on Safety Systems. And Joan, Tom and Felix bid farewell.

Development of Thinking About Risk

The 1980s decade was a period of change in the safety world. Not only was there rapid replacement of electromechanical control systems by software-based digital equipment, but there was also a major change in thinking about risk. Hitherto, safety protection had mostly been based on physical barriers, together with rules of operation and behaviour, along with un-evidenced assurances to the public of safety. But Sir Frank Layfield, in conducting his inquiry into the Sizewell B nuclear power station, pushed the Health and Safety Executive (HSE) to explain their risk-based engineering judgements to the public (and to industry and, indeed, to themselves) and this led to the HSE’s 1988 document, The Tolerability of the Risks from Nuclear Power Stations, which defined what became known as the ALARP (As Low as Reasonably Practicable) Principle. The Layfield Inquiry also led to requirements to justify claims for the tolerability of risks and to cease the practice of simply claiming that a risk was remote or an accident was incredible. Implicit in this was the admission that zero risk could not be achieved and that uncertainty was necessarily implicit in the creation and use of safety-critical systems.

Thus, when the Club came into existence, there was a need to facilitate the transfer not only of proven risk assessment techniques, but, importantly, the ways of thinking about risk and its tolerability that were new even to those in the traditional safety domains, such as chemical and nuclear.

Disseminating Information

Other changes were also afoot. New approaches were being pioneered, and these would be introduced by the publication of the standard IEC 61508. This was eagerly anticipated and the Club played a key role in establishing its understanding, through seminars, tutorials, symposium papers, and informal discussions.

There had also been increasing realisation of human influence on both safety and its opposite. One of the early Club seminars was on human factors and, in spite of floods that stopped traffic in many parts of the country, this attracted a crowd of over 100.


From its earliest days, the Club ran events, and sought symposium papers, on topics that had not yet made their way into widespread knowledge, or even thinking – topics such as safety management and safety culture, the increasing dependence of safety on security, the use of COTS (commercial off-the-shelf) systems and components in safety, the safety case, testing for safety, the safety lifecycle, safety integrity levels, legal and social aspects of safety, safety standards, new technologies, and more. As well as topic-specific events, the Club ran many that were sector-specific.

The Club and Its Newsletter

To meet its objectives, the Club was charged to run at least four events per year, including the annual Safety-critical Systems Symposium (SSS) with published Proceedings, and to publish three issues per year of a newsletter. One- and two-day events have mostly been seminars, but the Club has also run tutorials, many on topics that were not yet provided for in training anywhere else.

In the early years, the Club was the principal organ of communication of the progress and results of the DTI/SERC-sponsored research projects and, at the inaugural meeting, eighteen of these were exhibited at a poster session, which proved to be one of the highlights of the occasion.

Among other initiatives, the Club created liaisons with other bodies, importantly the IET, BCS and HSE.

Safety Systems has been published every September, January and May, and this issue, the 75th, completes 25 years of publication. In that time it has given the community news items, book reviews, press releases, a regular calendar of international events, and 438 articles. I am proud that many of these were written by authoritative authors. But I am perhaps even more proud that Safety Systems has enticed many others to take courage and express themselves. This newsletter has been a forum for members and others to raise questions, utter apprehensions, and report on experiences, as well as to tell us what they think is what. And this issue is an example. It offers articles on risk, the core of our profession; articles on education, in the key subjects of both safety and software engineering, which have not yet been seriously planned to fit our needs; articles on human factors, on our history, on current practice, and on glances into our future.

We therefore have reason to believe that the Club has contributed to the progress, over the last quarter century, in the field of safety technology, practice, and management.

Our 126 events, the knowledge, and even wisdom, accumulated in 24 volumes of SSS, and a further 22 papers in a 1993 book, all suggest a worthwhile achievement, and regular feedback confirms that many others share this view. And that is why we have been strenuously determined to ensure the Club’s continuation.

What Next?

And now, change is inevitable. Professor Tom Anderson and Joan Atkinson are both on the point of ‘passing on the baton’. And I have decided that perhaps it is the right time for me to join them in admiring, instead of creating, the results of the Club’s endeavours.

It has been agreed that the Club will move to the University of York and be under the management of Professor Tim Kelly.

York’s research and MSc programmes have furthered the education of numerous practitioners and brought many others into the field. It also started, and for a long time maintained, the email system-safety list, which now functions under Peter Ladkin’s stewardship at the University of Bielefeld. I offer Tim my very best wishes as he adds the Club to York’s portfolio.

May the Safety-Critical Systems Club continue to meet its objectives and remain useful to the safety community for another 25 years – and, perhaps, yet another.


Computing as a Profession and the Role of Higher Education
by John Knight

Introduction

Societal dependence on computing is well known. All critical infrastructure systems – transportation, defence, energy production, utilities, communication, healthcare, finance, food production, manufacturing, and so on – depend upon computing for their development and operation. Any loss of computing in infrastructure systems leads to service failures that vary in significance from inconvenient to catastrophic.

With such dependence, society can and should ask questions about the engineering that underpins the computing in critical infrastructure systems. Rather than assuming that ‘all is well’, that ‘management will ensure proper engineering’, or that ‘our government will establish adequate standards’, pertinent questions about costs, residual risk, and causes of failure should be posed and answered.

Many examples exist which indicate that posing such questions has become necessary. In January 2016, serious concerns were raised about the software in the Lockheed Martin F-35 Joint Strike Fighter, a multi-hundred-billion-dollar program [9]. Over a period of many years, there have been numerous failures in the security of information systems. Often described as ‘data breaches’, these incidents are crimes and should be described as such. One of the most prominent was the theft of records from the U.S. Office of Personnel Management [1], a crime in which the personal information of more than twenty million people was stolen.

But why are these failures so serious and so common? In most cases, they are possible because of deficiencies in the basic engineering of the subject systems. Those responsible for the defective engineering are rarely malicious; for the most part they are unaware of the mistakes being made. An important role of higher education is to prepare graduates to participate in the development of the engineered systems that society requires, including computing systems, and higher education is failing society to a significant degree in the field of computing. For example, network-accessible unguarded buffers remain in modern information systems, providing a simple attack approach. This despite the fact that the first widespread network attack occurred in 1988 with the release of the Morris worm [6]. The worm relied, in part, on a buffer-overflow vulnerability. The lesson has not been learnt in 27 years.
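To make the point concrete, here is a minimal C sketch of the kind of unguarded buffer referred to above, alongside a bounded alternative. It is my illustration, not the author's; the function and buffer names are invented, and the example is deliberately simplified.

```c
#include <stdio.h>
#include <string.h>

/* Unsafe: copies externally supplied input into a fixed buffer with no
   length check, so input longer than 15 bytes overruns 'name' and can
   corrupt adjacent memory -- the classic buffer-overflow pattern. */
void greet_unsafe(const char *input) {
    char name[16];
    strcpy(name, input);            /* no bound: overflow possible */
    printf("Hello, %s\n", name);
}

/* Safer: the copy is bounded by the destination size and the string is
   always terminated, so oversized input is truncated rather than
   overflowing the buffer. */
void greet_bounded(const char *input) {
    char name[16];
    snprintf(name, sizeof name, "%s", input);   /* bounded copy */
    printf("Hello, %s\n", name);
}

int main(void) {
    const char *request = "a-very-long-string-such-as-a-network-request-field";
    greet_bounded(request);         /* truncates safely */
    /* greet_unsafe(request);          would overrun its 16-byte buffer */
    return 0;
}
```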

Higher education offers a variety of degrees, but, with few exceptions, these degrees have the traditional goal of educating the student in the principles of computing. What is needed is for:
• Computing to become a recognized profession;
• Higher education to lead the transition to a computing profession.

Why should higher education take the lead in effecting the transition? Because the primary responsibility for educating engineers at all levels lies with higher education. Higher education defines the degree programs and thereby defines the education that graduates will receive. Failing to take the lead would be to neglect a crucial, national responsibility, and the likely consequence would be for regulating agencies to impose requirements on degree programs or to impose a licensing infrastructure.

Limitations of Bachelor’s Degrees

Bachelor’s degrees in computer science, in computer engineering, in software engineering and now even in cyber security are offered widely. The graduates of such programs are the core of the workforce building and maintaining the computing systems upon which society depends.


Why do such degrees not prepare graduates for the engineering of these computing systems? The answer has two elements: (a) degree programs are incomplete, and (b) degree programs are inconsistent.

The lack of completeness in bachelor’s degree programs is inevitable. The number of courses that can be included in a degree program is limited, faculty interests influence the courses that are included, and courses are frequently ‘elective’ options – the student has to choose from different topics.

The lack of consistency is also inevitable, because degree programs originate within individual institutions. The factors that affect completeness also affect consistency, because that which is included and that which is excluded will be different at each institution.

The lack of completeness and the lack of consistency lead to circumstances in which graduates from comparable institutions have covered vastly different material. A lot of overlap is likely to occur, but graduates typically differ considerably in their exposure to both theory and practice in important topics such as software development processes, software testing, software design, formal methods, software maintenance, hardware architecture, hardware interfacing, and so on.

Other Professions

All degrees provide graduates with a significant education. But degrees in fields such as medicine, dentistry, law and veterinary medicine are different. Degrees in these fields emphasize training that is designed to equip the graduate with skills and the ability to analyze rather than to educate the graduate about theory and principles. These degrees are referred to as professional degrees, because the goal is to prepare the graduate to practice the associated profession.

Professions such as medicine do not rely on a single level of training, nor do they require that any level of training cover the entire profession. A variety of skilled technicians operate complex medical equipment, many types of analyst interpret tests and measurements, dozens of different types of nurse provide sophisticated services, and a range of experts – surgeons, oncologists, cardiologists, and so on – plan, manage and effect care. Each of these different roles relies upon a different, tailored training program.

The preparation of the various members of any profession requires a spectrum of training levels but, in all professions, the highest level of training requires a doctoral degree. These doctoral degrees require: (a) that students prepare for the rigors of a professional degree by completing a suitable bachelor’s degree, and (b) that students apply for admission to the professional degree program. That the highest levels of training begin with a doctoral degree is a testament to the complexity and sophistication of the associated profession.

In many professions, the problem of inconsistency in training is controlled at a national level by stating expectations of degree programs precisely. In veterinary medicine, for example, the different degree-granting institutions in the United States have different degree programs, but, in order to practice their profession, graduates are required to pass the North American Veterinary Licensing Examination (NAVLE®) set by the National Board of Medical Examiners (NBME). The NBME also administers the United States Medical Licensing Examination (USMLE®), the national examination for physicians.

The Professional Engineer Qualification

The consequences of failure in fields such as civil engineering, mechanical engineering, and electrical engineering can be severe, just as they can be in computing. The training of engineers in these fields is handled in a similar but not quite the same way as the professions discussed in the previous section.


The disciplines of engineering outside computing also face the challenges of degree inconsistency and incompleteness. In the United States, they deal with these challenges primarily through the notion of the Professional Engineer (PE) qualification. A degree in most of these other disciplines allows the engineer to practice in the associated field but not to accept responsibility for the quality of the engineering. To become licensed as a PE, the engineer has to:
• Receive a four-year degree in engineering from an accredited engineering program;
• Pass the Fundamentals of Engineering examination;
• Complete four years of progressive engineering experience under a PE;
• Pass the Principles and Practice of Engineering examination;
• Participate in a recognized program of continuing professional education.

In the United States, licensing as a professional engineer is required by many states in many circumstances, including the approval of engineering designs of systems to which the public will be exposed, senior engineering management in government, and teaching engineering [8].

Related Work On Computing As A Profession

I am far from the first to suggest that computing should become a profession. For example, in 1999 Denning discussed the topic in detail [3]. Loui and Miller discuss the complex issue of ethics in the practice of computing in depth [7]. Ethical concerns are a significant element of the motivation for computing as a profession.

A column entitled ‘The Profession and Digital Technology’ was a feature of IEEE Computer from 2000 to 2011. In the final essay, Holmes presented numerous criticisms and concerns about the structure of the computing ‘profession’ [4]. One of Holmes’ points is the potential role of professional societies such as the IEEE. Whilst I agree with the value that professional societies bring to the field, I am convinced that the primary responsibility for progress lies with higher education.

Certainly there are those who disagree that computing should become a recognized profession. Canada’s Association of IT Professionals has developed an excellent summary of the issues [2]. Two of the concerns about software engineering are:
• Software engineering is still an immature discipline and there is no agreed-upon body of knowledge;
• No proof exists that a licensing mechanism would yield an improvement in the practice of software engineering.

Neither of these concerns is valid. The requisite body of knowledge does not have to be complete and comprehensive to be useful. If the only entries in a body of knowledge were comprehensive discussions of buffer-overflow vulnerabilities and SQL-injection vulnerabilities, an immense amount of harm would be prevented, because practicing engineers would at least be aware of these two problems. Obviously, a body of knowledge for software engineering suitable for acting as the basis for a computing profession would contain vastly more material. A long-term effort by the IEEE has produced an entirely suitable starting point definition [5].
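The second vulnerability named above also has a well-understood remedy that any such body of knowledge would record. The following C sketch is my illustration, not part of the article; it assumes SQLite3 is available, and the table and column names are invented. It contrasts building a query by string splicing, which is open to SQL injection, with binding the input as a parameter.

```c
#include <stdio.h>
#include <sqlite3.h>

/* Vulnerable pattern: splicing untrusted input directly into SQL text.
   An input such as  x' OR '1'='1  changes the meaning of the query. */
void lookup_unsafe(sqlite3 *db, const char *user_input) {
    char sql[256];
    snprintf(sql, sizeof sql,
             "SELECT id FROM users WHERE name = '%s';", user_input);
    sqlite3_exec(db, sql, NULL, NULL, NULL);   /* injection possible */
}

/* Safer pattern: the SQL text is fixed and the input is bound as a
   parameter, so it is always treated as data, never as SQL. */
void lookup_safe(sqlite3 *db, const char *user_input) {
    sqlite3_stmt *stmt = NULL;
    const char *sql = "SELECT id FROM users WHERE name = ?1;";
    if (sqlite3_prepare_v2(db, sql, -1, &stmt, NULL) == SQLITE_OK) {
        sqlite3_bind_text(stmt, 1, user_input, -1, SQLITE_TRANSIENT);
        while (sqlite3_step(stmt) == SQLITE_ROW) {
            printf("id = %d\n", sqlite3_column_int(stmt, 0));
        }
    }
    sqlite3_finalize(stmt);
}

int main(void) {
    sqlite3 *db = NULL;
    if (sqlite3_open(":memory:", &db) != SQLITE_OK) return 1;
    sqlite3_exec(db,
                 "CREATE TABLE users(id INTEGER, name TEXT);"
                 "INSERT INTO users VALUES (1, 'alice');",
                 NULL, NULL, NULL);
    lookup_safe(db, "alice");          /* prints id = 1 */
    lookup_safe(db, "x' OR '1'='1");   /* safely matches nothing */
    sqlite3_close(db);
    return 0;
}
```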

If the community waits for proof that a licensing mechanism would yield an improvement, then we will probably wait a long time. And the costs of current failures due to defective software engineering are so high that there is no more time to debate this issue.

A New Structure For Computing In Higher Education

Clearly, the present degrees in computer science, bachelor’s, master’s and doctoral, must be retained to serve the needs of the science. But to facilitate the transition of computing to a profession requires that higher education undertake three significant steps:
1. Define and develop a set of degrees, tailored to the future professional roles of students who will become practitioners, that cover the essential practices of computing in the associated professional roles;
2. Define and develop national competency examinations to provide adequate assurance that students have mastered the essential practices in each of the degrees;
3. Work with the professional societies to develop licensing and sanctioning mechanisms to maintain standards of practice.

Completion of these steps will require a lot of effort, and engagement of the major funding agencies, government agencies and industries is appropriate for both funding and national coordination.

What would a set of professional degrees look like? The answer to this question depends upon whether computing as a profession follows the route taken by the other professions or the route taken by the other fields of engineering. Whichever path is followed, the result in the USA will be a combination of:
• Two-year associate’s degrees corresponding roughly to the training of a technician in medicine;
• Four-year bachelor’s degrees corresponding roughly to the training of nurses in medicine;
• Doctoral degrees corresponding roughly to the training of MDs in medicine.

Conclusion

Catastrophic failures of critical systems are occurring with an unacceptable frequency and with unacceptable losses. Many of these failures are the result of defective computer system engineering. The responsibility for this situation lies, in part, with the system of higher education.

Computing is a profession and should be treated as such. To leave things as they are in most countries is to do a disservice to society, a disservice that is already having serious consequences. There must be a change from the status quo.

To promote this change, higher education must design and offer a suitable training infrastructure for computing. Whether that infrastructure is modeled after the existing professions or after existing established engineering disciplines matters little.

References
[1] Barrett, D., ‘U.S. Suspects Hackers in China Breached About 4 Million People’s Records, Officials Say’, Wall Street Journal, 5 June 2015.
[2] Canada’s Association of IT Professionals, ‘Software Engineering’, http://www.cips.ca/softeng
[3] Denning, P., ‘Computing the Profession’, Proceedings of SIGCSE ’99, the Thirtieth SIGCSE Technical Symposium on Computer Science Education, ACM, New York (1999).
[4] Holmes, N., ‘The Profession and Digital Technology’, IEEE Computer, vol. 44, no. 12 (2011).
[5] IEEE Computer Society, ‘Guide to the Software Engineering Body of Knowledge’, https://www.computer.org/web/swebok/index
[6] Litterio, F., ‘The Internet Worm of 1988’, http://www.cs.unc.edu/~jeffay/courses/nidsS05/attacks/seely-RTMworm-89.html
[7] Loui, M. and K. Miller, ‘Ethics and Professional Responsibility in Computing’, in Wiley Encyclopedia of Computer Science and Engineering, ed. B. W. Wah, pp. 1131-1142, Wiley, New York (2009).
[8] National Society of Professional Engineers, http://www.nspe.org/resources/licensure/what-pe
[9] Sweetman, W., ‘Testing Chief Warns of JSF Software Delays’, Aviation Week and Space Technology, 22 January 2016.

John Knight is a Professor of Computer Science at the University of Virginia, USA. His research interests focus on: (a) formal verification of crucial properties in safety- and security-critical systems, and (b) rigorous assurance cases. He can be reached at <[email protected]>


System Safety: Heresy, Reformation, and Reunification
by Drew Rae

The Early Church

Engineering is the application of scientific principles for the purpose of solving real-world problems. For each engineering discipline, there are corresponding scientific fields that encompass most of the science applied by engineers within that discipline. For Chemical Engineering, there are Chemistry and Thermodynamics. For Civil Engineering, there are Structural Physics and Materials Science. For Safety Engineering, there is … surprisingly not Safety Science. A review of university curricula, standards, and conference programmes indicates that far from being the application of Safety Science, Safety Engineering has concerns and practices that diverge markedly from the questions and principles addressed by Safety Scientists.

Before I explore this further, let me dismiss a common explanation given for this divergence. Safety Engineers often consider themselves a different breed from ‘Occupational Health and Safety’ (OHS) practitioners, because engineers focus on ‘systems failing’ and big accidents, whilst OHS is concerned with ‘slips, trips and falls’. This is an arbitrary and unhelpful categorisation. In order to maintain the distinction, safety engineers must ignore a large body of research that speaks directly to their own practices – because the OHS and Safety Science communities have never excluded major accidents from their research and practice.

So where – if not in the difference between big and small accidents – did Safety Science and Safety Engineering diverge? For the first half of the twentieth century there was no ‘System Safety’, and the term ‘Safety Engineering’ was synonymous with ‘Safety Management’ (the American Society of Safety Engineers, an OHS body, adopted their name in 1914). Safety Science grew beyond the early ideas of ‘accident proneness’ and ‘man failure’ towards an increased understanding of the role of design and organisations. In the 1950s the emerging science of cybernetics – which examined the ways in which control and feedback within systems led to emergent behaviours – began to be applied to organisations, and, ultimately, to the understanding of accidents.

The Schism

The move towards understanding accidents as complex emergent phenomena involved a drift away from naïve epidemiological models of injury, which necessarily entailed a shift from quantitative to qualitative analysis. Under the prevailing paradigm, accident causes could be understood and managed, but they couldn’t and shouldn’t be assigned probabilities. Meanwhile, however, there was increasing demand that engineers do exactly that. The post-war world was a scary place. Weapons – in particular, nuclear weapons – were no longer symbols of security, but potential instruments of global destruction. The public wanted reassurances that they would not be launched accidentally.

Advances in numerical processing theory and technology had made it possible to produce hardware reliability estimates based on combinatorial models of individual component failures. It was a small step to adapt these techniques to predict the likelihood of dangerous failure. The first publicised use of Fault Tree Analysis (FTA) was as part of the Minuteman Missile Launch Control System. Within a few years, FTA was a widely practiced safety technique. Early fault trees made several simplifying assumptions. Firstly, they had to assume that accident causes were well understood, such that the chance of a fault tree being wrong was significantly less than the chance of the accident it predicted. Secondly, they had to exclude unquantifiable causes, such as human error, and to assert that these causes were independent of system design. Thirdly, they discarded systems theory in favour of linear decomposition as a model for system behaviour. In other words, they rejected all of the recent advances in Safety Science in order to make the maths work.
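For readers unfamiliar with the combinatorial models referred to above, the following worked sketch (mine, with purely illustrative numbers) shows the standard gate arithmetic for independent basic events, and why the independence assumption matters so much:

\[
P_{\text{AND}} = \prod_{i=1}^{n} p_i, \qquad
P_{\text{OR}} = 1 - \prod_{i=1}^{n} \left(1 - p_i\right).
\]

For example, two independent basic events with $p_1 = p_2 = 10^{-3}$ under an AND gate give $P_{\text{AND}} = 10^{-6}$; but if the two events share a common cause and can fail together, the true probability can approach $10^{-3}$, three orders of magnitude worse than the tree predicts.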

The promoters of Fault Tree Analysis called their new discipline ‘System Safety’; an ironic choice, since their methods rejected both the ‘whole-of-system view’ and the ‘systems theory approach’ that were the hallmarks of Safety Science. System Safety was not just about Fault Tree Analysis, of course, but it focussed almost entirely on activities that assessed safety during the project-design lifecycle. It included Hazard Identification, based on control of dangerous energy, and Failure Modes and Effects Analysis (FMEA), and evolved to include other forms of hazard analysis and Quantitative Risk Assessment (QRA). System Safety spread quickly through the military and aerospace standards that mandated the practices, and was quickly adopted by other industries such as nuclear power and chemical process plants.

A New Orthodoxy

I don’t want to give the impression that the System Safety community was ignorant of the limitations of focussing almost entirely on the physical design of systems, or that they were irresponsible to do so. Many within the field pointed out the need to be cautious in presenting claims about probability, and to recognise that the causes of accidents stretched well beyond system design. However, even with a narrow scope and a simplified causal model, Quantitative Risk Assessment pushed the limits of human comprehension and computerised number crunching. Incorporating human behaviours and interactions into the models was beyond the state of the art. Before humans could be included in the model, human behaviour first had to be reduced to a simple categorisation of errors, and those errors had to be quantified. System Safety looked to Human Factors to provide quantitative predictions of human error to feed the safety models. For some, ‘human reliability prediction’ was a difficult but ultimately solvable problem. For others, even the notion of ‘human error’ was a problematic representation of complex phenomena. System Safety had sparked its first Holy War.

Most of the conflict about ‘human error’ has played out in fields adjacent to, but not at the heart of, System Safety – joint cognitive systems, human factors, and behavioural science. It is relevant to the story in this article primarily because it was the first sign of a continuing trend in System Safety – when faced with the shortcomings of its methods, the discipline responds by calling for improved techniques. The markers of progress in System Safety are increased scope, automation, formalisation, and quantification of methods. Standards even explicitly link these attributes of methods to increased integrity of the resultant products.

Unfriendly Co-existence

The emerging discipline of System Safety absorbed many practitioners and researchers who considered themselves as ‘engineers’. Those who remained within Safety Science were a mix of social scientists and psychologists, which had a mixed effect on Safety Science as a research discipline. The growing understanding of the contribution of physical design to safety shrank to a niche concern with control interfaces, and was only re-discovered through the twenty-first century ‘safety by design’ and ‘prevention through design’ movements. However, Safety Science’s understanding of the organisational contribution to accidents leapt forward, creating broad theories such as Turner and Pidgeon’s ‘Disaster Incubation’, two different waves of High Reliability Organisations, and Perrow’s ‘Normal Accidents’; as well as deeper understanding of specific mechanisms such as Vaughan’s ‘Normalisation of Deviance’ and Kewell’s work on language and reputation in safety. This wealth of literature has gone largely unnoticed and unremarked within the System Safety community – it has certainly not sparked any change in practices. Instead, the community has preferred James Reason’s ‘Swiss Cheese Model’ – which makes the same linear causality assumptions necessary for Quantitative Risk Assessment.

One of the reasons System Safety has been slow to absorb the developments in Safety Science is a perception that Safety Science has nothing new to offer. This belief is, at least in part, fed by the ‘Behavioural Safety’ or ‘Behaviour-Based Safety’ movement, as espoused by authors such as Thomas Krause and Scott Geller. Like System Safety, Behavioural Safety represents an ideological schism within safety. In contrast to System Safety, which calls for emphasis on detailed analysis of designs, Behavioural Safety views accidents as arising ultimately from the actions of individuals during live operations. When System Safety practitioners – who self-identify as engineering specialists – encounter Behavioural Safety, they recoil in horror. What do personal protective equipment, safety posters and toolbox talks have to do with their sort of safety? Unless one knows where to look, a cursory glance at the Safety Science literature suggests that it is all about managing individual behaviours.

Meanwhile, to those within the Safety Science community, System Safety was making fewer and fewer interesting contributions. As an engineering discipline, rather than a scientific field, it was absorbed in methods, notations, and tool support, and its researchers were neither offering up new theories nor engaged in rigorous empirical investigation.

Reunification?

The past fifteen years have seen a growing interest in the epistemology of System Safety – how do we know the things we think we know, and why do we do the things we do? This in turn has led practitioners and researchers alike to recognise that good engineering practices can only be built on a reliable foundation of scientific knowledge. Theories of accidents are not interesting side-notes but the sources of the basic assumptions that drive and limit our analysis techniques.

Some of the old lessons of Safety Science have been re-learned, albeit slowly and with considerable acrimony. Leveson’s Systems Theoretic Accident Model and Processes (STAMP) (2004), a family of techniques that fit neatly into the System Safety lifecycle models, is more closely aligned with cybernetics and systems theory than with the top-down decomposition of Fault Tree Analysis and Failure Modes and Effects Analysis. Safety Cases have rediscovered the knowledge that physical design and human behaviour are not separate domains, but different viewpoints of the same operational system.

System Safety does not have an official birth date, but the first ‘System Safety Conference’ in 1965 is an appropriate time to start counting. In fifty years, the discipline has made considerable progress, but it has also wasted time and effort by trying to solve foundational problems through refinement, formalisation, complication and automation; too proud to learn from its parent science.

Safety Science is now commonly available as a four-year degree, followed by a two-year postgraduate program, whilst System Safety is most often taught as a short conversion course on top of a foundation degree in a different engineering field altogether. Intellectual snobbery is no longer an option.

The future of System Safety depends on restoring the scientific foundations of the field. Safety engineers cannot afford to be ignorant of systems theory as a deep and nuanced body of knowledge rather than a jingoistic label; or to get by with only a surface knowledge of the insights offered by the many and varied theories of organisational accidents. They should not be talking about human error probabilities without having thought deeply about the critiques of the very concept, or applying techniques without deliberate and thoughtful consideration of the analytic sacrifices they are making.

To enhance readability, in-line citations have not been included within the main article. For more detail on the topics discussed, see:

‘Early church’ Safety
Alexander, F., 1949. The Accident-Prone Individual. Public Health Rep. 64, 357–362.
Heinrich, H., 1941. Industrial Accident Prevention, 2nd ed. McGraw-Hill, New York.

‘Pre-schism’ Systems Approaches
Cownie, A.R., 1966. The cybernetics of accidents. Trans. Soc. Occup. Med. 16, 111–114.
Fitts, P.M., 1947. Psychological Aspects of Instrument Display: Analysis of 270 ‘Pilot-Error’ Experiences in Reading and Interpreting Aircraft Instruments. DTIC Document.
Kerr, W., 1957. Complementary Theories of Safety Psychology. J. Soc. Psychol. 45, 3–9.
Stieglitz, W., 1966. Numerical Safety Goals – Are they Practicable? Presented at the Reliability and Maintainability Conference.

The Birth of System Safety
Ericson II, C.A., 1999. Fault Tree Analysis – A History. Presented at the 17th International System Safety Conference.
Haddon, W., 1968. The changing approach to the epidemiology, prevention, and amelioration of trauma: the transition to approaches etiologically rather than descriptively based. Am. J. Public Health Nations Health 58, 1431–1438.
Lewis, H.W., Budnitz, R.J., Rowe, W.D., Kouts, H.J.C., von Hippel, F., Loewenstein, W.B., Zachariasen, F., 1979. Risk Assessment Review Group Report to the U.S. Nuclear Regulatory Commission. IEEE Trans. Nucl. Sci. 26, 4686–4690. doi:10.1109/TNS.1979.4330198.
MIL-P-1629, 1949. Procedures for Performing a Failure Mode, Effects and Criticality Analysis. U.S. Department of Defense.
Powers, G.J., Tompkins, F.C., 1974. Fault tree synthesis for chemical processes. AIChE J. 20, 376–387.
Swain, A.D., 1964. Some Problems in the Measurement of Human Performance in Man–Machine Systems. Hum. Factors J. Hum. Factors Ergon. Soc. 6, 687–700.
United States Nuclear Regulatory Commission, 1975. Reactor Safety Study: An Assessment of Accident Risks in U.S. Commercial Nuclear Power Plants (WASH 1400).

Some Advances in Safety Science Ignored by System Safety
Kewell, B.J., 2006. Language games and tragedy: The Bristol Royal Infirmary disaster revisited. Health Risk Soc. 8, 359–377.
La Porte, T.R., 1996. High Reliability Organizations: Unlikely, Demanding and At Risk. J. Contingencies Crisis Manag. 4, 60.
Perrow, C., 1999. Normal Accidents: Living with High-Risk Technologies. Princeton University Press.
Turner, B.A., 1976. The Organizational and Interorganizational Development of Disasters. Adm. Sci. Q. 21, 378–397.
Vaughan, D., 1997. The Challenger Launch Decision: Risky Technology, Culture, and Deviance at NASA, 1st ed. University of Chicago Press, Chicago.
Weick, K.E., Sutcliffe, K.M., 2001. Managing the Unexpected: Assuring High Performance in an Age of Complexity, 1st ed. Jossey-Bass.

Dr Drew Rae is a lecturer at the Safety Science Innovation Lab of Griffith University in Brisbane, Australia. His research covers the messy interface between safety as it is conceived in courses and standards, and safety as it is experienced by practitioners. Drew can be contacted by email on <[email protected]> or on twitter as @Dioptre. His podcast is ‘DisasterCast’.


Passing the Baton
by John Spriggs

At an anniversary like this, it is tempting to look back at the achievements presented in the Safety Systems journal since its inception twenty-five years ago, but it may be more valuable to look forward. ‘Who among us would not be glad to lift the veil behind which the future lies hidden; to cast a glance at the next advances of our science?’ asked David Hilbert in Paris at the beginning of the twentieth century. Such veil lifting is still impractical, but changes in the nature of the systems that we are assuring surely require changes in the ways we assure them or, at the very least, a re-validation of existing methods in a new context. What are the emerging problems that the safety assurance professional must solve?

Some problems have crept up on us over the years. Consider safety-critical systems that go beyond the simple model (of a local controller keeping a process or equipment within defined limits) by being geographically distributed. Twenty-five years ago, such a system was likely to have used dedicated channels to bring data from remote sensors and to command remote actuators. These may have been special transmission lines or radio-relay arrangements with redundant and diverse routings. Over the years, such dedicated lines will have been decommissioned in favour of using the services of a telecommunications provider.

The assurance argument for Continuity of Service goes from one of your competent staff (on call twenty-four hours a day) following well-designed procedures for preventative and on-condition maintenance to being one of ensuring the required outcomes via a Service Level Agreement with a reputable supplier. Is it as compelling? Why would it not be, if you had a competent lawyer following a well-established procedure, appropriate engineers carrying out appropriate checks, etc.?

Then the plain old telephone system was developed into an automatically self-optimising network; previously diverse routings drift together to share the same optimum path from A to B. To fulfil the Service Level Agreement, the supplier may have to intervene manually to maintain ‘separacy’, and the costs, which were reduced by outsourcing, start to rise again. Even if you originally secured two suppliers to guarantee separate routings, you may still find your signals sharing the same fibre for parts of their journey, because both suppliers have, in turn, outsourced to the same provider.
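A back-of-the-envelope calculation (mine, with purely illustrative figures) shows why the loss of ‘separacy’ described above matters so much to a Continuity of Service argument:

\[
P(\text{loss of service}) \approx p_A \, p_B
\]

if routes $A$ and $B$ fail independently; but if both routes traverse a shared segment that fails with probability $p_S$, then

\[
P(\text{loss of service}) \geq p_S .
\]

With $p_A = p_B = 10^{-2}$, independent routes give roughly $10^{-4}$; a shared fibre with $p_S = 10^{-2}$ makes the redundancy claim two orders of magnitude optimistic.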

The situation may have developed such that the provider to whom you originally outsourced, and with whom you have the Service Level Agreement, has themselves now outsourced everything, and is nothing more than a broker. The broker will have their own agreements with suppliers down the value chain as they buy a bundle of services from others who have in turn outsourced. Your original maintenance people would have been imbued with your safety culture, and the person with whom you originally agreed levels of service would have known the consequences of loss of service, but now? How long will it be before someone along the chain proposes that it would be cheaper to pay the fines under their Service Level Agreement than to invest in maintaining Continuity of Service ‘twenty-four seven’?

PROBLEM 1: How do you assure your system or service when parts of it are the subject of Service Level Agreements, or other formal arrangements, with external parties who may then outsource all or part of their service? [Note 1] Or, to put it another way: If your policy is to assure continuing safety by performance monitoring and assuring change, how do you detect changes behind opaque interfaces with others?

It is not just communications that can be outsourced, and the problem of having to assure by formal agreement is not limited to geographically extended systems. You may have outsourced your site security or electrical installation work, for example, and the scope now exists for much more. What new ‘savings’ can be made? Look at all the maintenance costs for the fixed local infrastructure, spares holdings, twenty-four hour support, obsolescence management, etc. Why not outsource all that? Buy infrastructure as a service and run your software on a ‘cloud’ somewhere. You will still get your ‘twenty-four seven’, but via a Service Level Agreement, rather than from dedicated resources.

There is a value chain here too; you can go to a system integrator to do it all for you, but you may find the ‘cloud’ is not theirs; they are just a broker. When there is a problem, you are not talking to the trouble-shooters, but to a service desk, who will contact another service desk, and so on down the line. The service may be interrupted because one of the intermediate providers suddenly ceases trading. Is your assurance argument for Continuity of Service still as compelling? Do you have the right mitigations in place? Have you launched your own ‘cloud’, for example?

Do you have assurance for the software supporting the service desks; will it be available when needed, of sufficient integrity, and proof against leaking your service problems to all and sundry? In practice that may not be as difficult as it sounds: these are all Information Security concerns, and so may have already been addressed by the suppliers to give them a good, fashionable selling point. You will, of course, also have to assure the security of the services themselves, not just the service desks. Are the security controls compatible with the safety requirements? For example, safety concerns may require all changes of software and configuration data to be assured before deployment, whereas security requirements may specify that changes to anti-virus software and its configuration data must be done immediately the new version is available.

PROBLEM 2: How do you ensure that security requirements and safety requirements do not contradict each other, or ask for the same things in different ways (possibly leading to unnecessary duplication of effort)?

An assurance argument that will be difficult to make is one to assure that software implemented to fulfil Software Safety Requirements (or Security Controls) is not ‘interfered with’ by other software. Such interference is a hazard: the Software Safety Requirements typically specify mitigations, which interference can nullify. This can be a tricky argument to make when it is all in one computer; but hidden in a ‘cloud’, potentially sharing resources with unknown others? Can it be done at all? Well, yes it can, sometimes, but I am not going to tell you how ‘for free’; you will have to think about it ...

PROBLEM 3: How do you assure non-interference with software implementing Safety Requirements when you do not know what, or where, the potential interferers are, e.g. in a ‘cloud’?

What about software assurance in general? Twenty-five years ago, if your system contained software, you probably based your assurance on the quality of the development processes (as carried out by competent people) with lots of checking and auditing (carried out by other competent people), followed by in-depth and independent testing on representative hardware, before final testing and deployment on the actual hardware. No doubt you specified features like spare processing power and memory capacity, along with constraints on the software, forbidding dynamic allocation of storage, for example. Maybe you specified dual diverse channels and other resilience measures. Are such things now done as a matter of course in your organisation, or are they constantly challenged as being ‘old-fashioned’, or too expensive? In either case, can we improve our assurance of software; what are the emerging problems?

Two unexpected problems (at least unexpected by me) have emerged already; they concern the use of ‘levels’. A number of standards and guidance documents exist that identify how much assurance is required for software in particular usage circumstances by deriving a ‘level’. In IEC 61508, for example, there are Safety Integrity Levels numbered such that SIL1 is the least onerous, whereas RTCA/DO278 has Assurance Levels, such that AL1 is the most onerous. Once you have derived your level, you know from the standard or guidance which activities to perform and/or what evidence to collect. The level is intended to provide a convenient context for conversations with your regulators and your suppliers. Note, however, that the inexperienced supplier may panic when you allocate AL4, which is easy, because they think you mean SIL4, which is not.

There is a big, often unstated, assumption that doing what is prescribed for your level is sufficient. Those who set up the schemes further assumed that all the activities would be done, and all the evidence collected, before the software was accepted for deployment, but this has turned out not to be the case in some quarters. Timescale and other commercial pressures can result in software being accepted too early; this can mean that necessary fixes, or the provision of additional evidence, cost extra, and suddenly software assurance is the villain of the piece because it costs so much. The assurer is expected to argue away the shortfall of evidence, which is not always possible.

PROBLEM 4: If the supplier has failed to fulfil all provisions of the selected ‘level’, i.e. you do not have as much assurance as planned, how do you assess and, ideally, quantify the residual risk?

This will be a particular problem for those schemes in which the link between risk and ‘level’ is difficult to justify, such as that of EUROCAE ED-153.

The other unexpected problem was suppliers regarding the ‘level’ requirement as trumping all others. So, for example, if you were to choose a level that does not explicitly require a particular type of test-coverage analysis to be done, but you have asked for such test coverage in a separate requirement, the supplier tags that requirement as invalid and does not address it further. Similarly, a supplier may state that they cannot provide the requested non-interference argument, for example, because it would require support from evidence not listed for the level you specified. This is illogical, because for trivial levels, like AL4 of EUROCAE ED-109, the supplier could not deliver their product if they restricted themselves to producing only the artefacts listed; AL4 does not, for example, require software source code.

PROBLEM 5: How do you get the benefits from using standardised levels of assurance without the disadvantages of suppliers (and maybe colleagues) getting fixated on the level allocated, instead of on collating sufficient assurance that the risk of deploying the software is tolerable and adequately managed? Alternatively, if you give up on ‘levels’ and instead go over to one of the new approaches, like arguing confidence, how do you get what you need from your suppliers?

I have only suggested five problems, rather than David Hilbert’s twenty-three or four, and I sincerely hope that they can be solved a lot more quickly than his were. I have good ideas on how to tackle most of them, but it is time to hand over the baton – so, over to you; please publish in Safety Systems when you have solutions to share.

If you think that an SCSC Working Group could be the best way to tackle a problem (one of these, or a proposal of your own), please contact Mike Parsons ([email protected]) with your suggestion.

Note
Note 1: This problem was addressed, in part at least, in my presentation at SSS’16, ‘Assurance by Proxy’, a copy of which is available to registered users of the SCSC website.

John Spriggs is currently at NATS and may be contacted via <[email protected]>; http://www.linkedin.com/in/johnspriggs

The Theory and Practice of Reliability – A Brief and Incomplete History
by Roger Rivett

Beginnings
The word ‘reliability’ was first coined by the poet Samuel Taylor Coleridge in 1816 as a commendable attribute when speaking of his friend, the poet Robert Southey, of whom he said, ‘He inflicts none of those small pains and discomforts which irregular men scatter about them and which in the aggregate so often become formidable obstacles both to happiness and utility; while on the contrary he bestows all the pleasures, and inspires all that ease of mind on those around him or connected with him, with perfect consistency, and (if such a word might be framed) absolute reliability.’

Applying this concept of a commendable attribute to man-made machines started with trying to make machines that could be trusted to work as intended. Almost inevitably, this started with weapons of war. In the mid 18th century, the French gunsmith Honoré le Blanc, applying the ideas of Jean-Baptiste Vaquette de Gribeauval, used gauges and jigs so that handcrafted musket parts would be common enough to be interchangeable.

Due to scepticism of this approach in Europe, le Blanc spoke to the American Ambassador to France, Thomas Jefferson. This resulted in the ideas being taken forward in the newly independent America, where they contributed to what became known as the American system of manufacturing. The use of standardised interchangeable parts became most famous with its use for the mass-production of the Ford Model T car in 1913.

Standardised parts mean repeatability; with repeatability comes the opportunity for probabilistic treatment. In the 1920s, Dr. Walter A. Shewhart at Bell Labs pioneered the use of statistical quality control for product improvement, from which point onwards reliability was forever associated with statistics. In 1931 he published ‘Economic Control of Quality of Manufactured Product’ and set the founding principles of quality control.

Post World War Two
Whereas the beginnings of reliability established principles for manufacturing quality, the post World War Two (WWT) era addressed reliability as a design issue.

The development of electronics, based on the use of vacuum tubes, began in the 1900s with the invention of the diode by John Fleming in 1904 and the invention of the triode by Lee de Forest in 1906. By WWT this technology was essential for the war effort, where it was used in applications such as radio, radar and early computers.

Due to the large numbers of vacuum tubes required, the reliability of the devices became more and more of an issue. Joseph Naresky, in his 1956 book Reliability Factors in Ground Electronic Equipment, gives the following war-time figures: the US Navy was supplying a million replacement parts a year to keep a total of 160,000 pieces of equipment in operation, and the US Air Force was reporting that it was barely able to get 20 hours of trouble-free operation from the electronic gear on bombers.

After the war the US led the development of electronics reliability and many organisations were established. In 1948 the IRE (Institute of Radio Engineers) formed the Reliability Society and in 1954 started publishing ‘Transactions on Reliability and Quality Control in Electronics’; this is still being published as the ‘IEEE Transactions on Reliability’. In 1952 the US Department of Defense and the American electronics industry founded the Advisory Group on the Reliability of Electronic Equipment (AGREE).

Viewing reliability as time-to-failure, resulting from a stochastic process, became the dominant understanding, with the 1952 Aeronautical Radio Inc (ARINC) report ‘Terms of Interest in the Study of Reliability’.

Three strands of work emerged at this time: the collection and analysis of field data to improve design; the inclusion of reliability requirements in contractual specifications; and a formalised means of measuring or estimating product reliability before components and systems are built. This led to the development of models for reliability prediction based on a common set of failure rates. In 1956, RCA published TR1100 ‘Reliability Stress Analysis for Electronic Equipment’. This formed the basis for Military Handbook 217, published from 1962 until 1995.
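By way of illustration only, a minimal sketch of a ‘parts count’ style prediction of the kind such handbooks support; the part types and failure rates below are invented for the example, not taken from Military Handbook 217 or any other source.

```python
# Illustrative sketch of a generic 'parts count' reliability prediction.
# The part names, quantities and failure rates are invented, not handbook values.

# Assumed constant failure rates, in failures per million hours.
parts = [
    {"name": "resistor",   "rate_fpmh": 0.002, "quantity": 120},
    {"name": "capacitor",  "rate_fpmh": 0.010, "quantity": 45},
    {"name": "transistor", "rate_fpmh": 0.050, "quantity": 30},
]

# With the usual series-system assumption (any part failure fails the equipment),
# the predicted equipment failure rate is the sum of the part failure rates.
total_rate_fpmh = sum(p["rate_fpmh"] * p["quantity"] for p in parts)
mtbf_hours = 1e6 / total_rate_fpmh  # predicted mean time between failures

print(f"Predicted failure rate: {total_rate_fpmh:.3f} failures per million hours")
print(f"Predicted MTBF: {mtbf_hours:,.0f} hours")
```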

In 1957 AGREE published a report which suggested that systems should be constructed from replaceable electronic modules, later called Standard Electronic Modules, to allow a failed system to be quickly restored. The report used what has become the classic definition of reliability: the probability of performing without failure a specified function under given conditions for a specified period of time. Based on the assumption that the failure rate of electronic components adhered to a bathtub-type curve, the report recommended running formal demonstration tests with statistical confidence for products, and running longer and harsher environmental tests that included temperature extremes and vibration. This came to be known as ‘AGREE testing’ and eventually turned into Military Standard 781.
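To illustrate that classic definition (this is not anything prescribed by the AGREE report itself): if a constant failure rate is assumed, as on the flat ‘useful life’ portion of a bathtub curve, the probability of surviving a specified period can be sketched as follows; the rate and mission time are invented.

```python
import math

# Reliability as the probability of operating without failure for a specified
# period, assuming a constant failure rate: R(t) = exp(-lambda * t).
failure_rate_per_hour = 1.0e-5   # assumed constant failure rate (illustrative)
mission_time_hours = 1000.0      # the 'specified period of time' (illustrative)

reliability = math.exp(-failure_rate_per_hour * mission_time_hours)
print(f"Probability of surviving {mission_time_hours:.0f} h without failure: {reliability:.4f}")
```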

The 1960s
In 1962 the first Physics of Failure (PoF) in Electronics Conference was organised and was held annually until 1966. Since 1967 this has been sponsored by the IEEE and continues as an annual event now known as the Reliability Physics Symposium. The PoF approach seeks to understand why failures occur in terms of the fundamental physical and chemical behaviour of the materials out of which the components are made. Such understanding can then be used to eliminate the failures, improve manufacturing processes or provide more accurate formulations of reliability models. This is in contrast to the more empirically-based reliability prediction approaches exemplified by Military Handbook 217.

The advocates of PoF believe that while the analysis is complex and costly to apply, it provides the strongest characterisation available of reliability of components, structures and systems. Thus, by the early 1960s there were two approaches to reliability, one a quantitative approach based on predicting failure rates and the other based on identifying and modelling the physical causes of failure.

With the boom in the use of solid state devices in the 1960s, both these techniques were applied to semiconductors, e.g. in 1968 Richard Nelson of RADC (Rome Air Development Center) produced ‘Quality and Reliability Assurance Procedures for Monolithic Microcircuits’, which eventually became Mil-Std 883 Test Method Standard Microcircuits.

The work mentioned so far applies mainly at the component level. Reliability engineering also uses techniques for analysing the design of assemblies of components, e.g. Failure Mode Effects Analysis and Fault Tree Analysis.

Analysis Techniques (FMEA and FTA)
The first standard for FMEA was issued in 1949 by the US Armed Forces, MIL–P–1629 Procedures for Performing a Failure Mode, Effects and Criticality Analysis. Over the course of the next three decades, other industrial sectors adopted the use of FMEA. These included NASA in 1963 and civil aircraft design in 1967. In the late 1970s many automotive companies started using FMEA and in 1994 the Society of Automotive Engineers (SAE) published J1739, Potential Failure Mode and Effects Analysis in Design and Manufacturing, which was jointly developed by Chrysler Corporation, Ford Motor Company, and General Motors Corporation. QS9000, the automotive version of the quality standard ISO 9000, widely used until 2006, required the use of J1739. QS9000 has now been replaced by ISO/TS 16949, which also requires the use of J1739.

Fault Tree Analysis (FTA) was developed in 1962 at Bell Laboratories to evaluate the Minuteman 1 Intercontinental Ballistic Missile (ICBM) Launch Control System. Boeing began using FTA for civil aircraft design around 1966 and its use in aviation was promoted in 1970 when the U.S. Federal Aviation Administration (FAA) adopted the use of failure probability criteria for aircraft systems and equipment.

The U.S. Nuclear Regulatory Commission (NRC) advocated the use of probabilistic risk assessment (PRA), including FTA, in 1975, and in 1981 published the NRC Fault Tree Handbook. The use of FTA has been recognised as an acceptable method for process hazard analysis by the United States Department of Labor Occupational Safety and Health Administration since 1992.

A fault tree is a failure model of a top-level fault of the system being analysed. This model can be evaluated quantitatively by assigning failure rates from the predictive reliability approach to the basic events at the bottom of the tree. In this way the failure rate of the top event can be estimated, accepting that there are uncertainties associated with the values assigned to the basic events and in the fault model itself. From these estimates it can be decided what actions, if any, need to be taken to improve the product.
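As an illustration of the principle only, a minimal sketch of quantitative fault tree evaluation; the tree structure and basic-event probabilities are invented, and the basic events are assumed to be independent.

```python
# Minimal sketch of quantitative fault tree evaluation, assuming independent
# basic events. The tree and the probabilities are invented for illustration.

def or_gate(probabilities):
    """Probability that at least one input event occurs."""
    p_none = 1.0
    for p in probabilities:
        p_none *= (1.0 - p)
    return 1.0 - p_none

def and_gate(probabilities):
    """Probability that all input events occur."""
    p_all = 1.0
    for p in probabilities:
        p_all *= p
    return p_all

# Basic event probabilities over some exposure period (illustrative values).
pump_fails = 1e-3
valve_sticks = 5e-4
operator_misses_alarm = 1e-2
alarm_fails = 2e-3

# Top event: loss of cooling occurs if the pump fails or the valve sticks,
# AND the condition goes uncaught (operator misses the alarm or the alarm fails).
initiator = or_gate([pump_fails, valve_sticks])
uncaught = or_gate([operator_misses_alarm, alarm_fails])
top_event = and_gate([initiator, uncaught])

print(f"Estimated top event probability: {top_event:.2e}")
```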

FMEA takes a different approach to determining what actions, if any, need to be taken. Each failure mode identified is assessed in terms of three factors: the severity of its effect; the likelihood that it can be prevented during design; and the likelihood that it will be detected before the product is shipped. Each factor is assessed subjectively on a scale of 1 to 10. There are different guidelines to help assess each factor and different rules for both how the factors are combined and the seriousness with which the resulting outcome should be treated. The result is a relative ranking for the design being analysed rather than an absolute assessment of ‘risk’.
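As the article notes, the rules for combining the three factors vary between guidelines. One common convention, though not the only one, is to multiply the three 1-to-10 ratings into a Risk Priority Number and rank failure modes by it; a minimal sketch, with invented failure modes and scores:

```python
# Minimal sketch of an FMEA-style relative ranking. Multiplying the three 1-10
# ratings into a 'Risk Priority Number' (RPN) is one common convention, used
# here for illustration only. The failure modes and scores are invented.

failure_modes = [
    # (description, severity, likelihood it is not prevented, likelihood it escapes detection)
    ("Connector corrodes in humid environment", 7, 5, 6),
    ("Seal leaks under vibration",              9, 3, 4),
    ("Indicator LED fails",                     2, 4, 2),
]

ranked = sorted(
    ((sev * occ * det, desc) for desc, sev, occ, det in failure_modes),
    reverse=True,
)

# The output is a relative ranking used to prioritise design actions,
# not an absolute measure of risk.
for rpn, desc in ranked:
    print(f"RPN {rpn:4d}  {desc}")
```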

Alternatives to Predictive Reliability
In the automotive industry reliability is used as a measure of quality, and ‘high quality’ is necessary to compete in the marketplace and be profitable. To remove some of the subjectivity of the FMEA, one could try to use reliability predictions when assessing the likelihood that a failure mode will not occur in the field. The case has been made that a probabilistic approach to reliability is not adequate for achieving high quality.

This is because it is not thought to be possible to properly take account of all the uncertainty introduced by factors such as speeds, loads, duty-cycles of loads, temperature dynamics, humidity and corrosive environments. A vehicle is a mass-produced item used in all motorised regions of the world and there is large variation in how and where the vehicle is driven; consequently these factors will experience a wide range of values, which cannot reasonably be taken into account. There is also a lack of closed-loop feedback from units in the field as data outside the warranty period is not collectable.

Given these criticisms, during the 2000s there was a move away from reliability to an approach known, variously, as Robustness or Failure Mode Avoidance (FMA). These approaches have their roots in the work of Taguchi who, in his book Robust Engineering [Taguchi 2000], defined robustness as ‘The state where the technology, product, or process performance is minimally sensitive to factors causing variability (either in the manufacturing or user’s environment) and aging at the lowest unit manufacturing cost’.

Robustness recognises two types of quality: customer quality, i.e. the features the customer wants, and engineered quality, which concerns the things the customer does not want. Robustness is about engineered quality, i.e. removing the features that the customer does not want. Such features include failures, noise, vibrations, unwanted phenomena and pollution. The approach is not to test in all foreseeable customer usage conditions and then fix the failures as they occur until all the tests are passed. Instead, it seeks to prevent failure by having a robust definition of the function. It does this by identifying the ‘ideal function’ and then selectively choosing the best nominal values of design parameters that optimise performance reliability at lowest cost.

The classical metrics for quality/robustness, e.g. failure rate and process yield, are considered to come too late in the product development process.

The Taguchi measure for robustness is signal-to-noise ratio. The signal-to-noise ratio measures the quality of energy transformation as expressed by level of performance of the desired function or of its variability. The signal-to-noise ratio is increased by reducing variability and specifying nominal values of the design parameters such that the design is insensitive to noise factors, e.g. the customer environment, aging and wearing, and manufacturing variations. In reducing variability, the Robustness approach has some similarities with the Six Sigma approach. This was developed by Motorola in the mid 1980s and popularised when Jack Welch made it the central business strategy for General Electric (GE) in 1995. In statistical terms the purpose of Six Sigma is to reduce variation to achieve very small standard deviations.
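Several signal-to-noise formulations exist in the Taguchi literature; as an illustration only, a minimal sketch of the ‘nominal-the-best’ form, with invented repeat measurements of some performance characteristic taken under noise conditions.

```python
import math
import statistics

# One common Taguchi signal-to-noise formulation, 'nominal-the-best':
#   S/N = 10 * log10(mean^2 / variance)
# Other forms exist (e.g. smaller-the-better, larger-the-better).
# The measurements below are invented repeat readings under noise conditions.

measurements = [10.2, 9.8, 10.1, 9.9, 10.3, 9.7]

mean = statistics.mean(measurements)
variance = statistics.variance(measurements)  # sample variance

sn_ratio_db = 10.0 * math.log10(mean ** 2 / variance)
print(f"Mean response: {mean:.2f}, variance: {variance:.4f}")
print(f"Signal-to-noise ratio: {sn_ratio_db:.1f} dB (higher = more robust)")
```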

Others have suggested that reliability should be defined as ‘failure mode avoidance’, with failure being any customer-perceived deviation from the ideal condition. Two root causes of failure modes are recognised: lack of robustness (the ability of the system to avoid failure modes, even when masked as correct) and mistakes. Mistakes are seen as human failings and their prevention is the goal of quality management. This approach is currently quite common in the automotive industry.

Concluding Remarks
The above brief and incomplete history has tried to restrict itself to reliability. However, there is a lively and ongoing debate about the relationship between reliability and safety. Maybe it was the case that for simple systems, e.g. one function, with the physics well understood and intuitively obvious failure modes, a reliable system was a safe system; this has long since ceased to be the case. Now, with the widespread deployment of software-based cyber-physical systems, the role of non-stochastic systematic failures is at least as important as that of random failures.

New techniques continue to be developed, e.g. STAMP (Systems-Theoretic Accident Model and Processes) and Why-Because Analysis, and the subject continues to be debated. It is always good to frame these debates in the context of what has gone before and how we came to be where we are today. We are taught the traditional techniques and take them as given, but nothing is set in stone and everything can be questioned, though the questioning is better informed if we know its history.

Techniques arise to address problems but are then used in different circumstances; it is always a good idea to question and double-check that the technique and its underlying assumptions remain valid. Knowing history makes us better informed to do this.

George Santayana said that, ‘Those who cannot remember the past are condemned to repeat it.’ This has been paraphrased as ‘Those who do not know history’s mistakes are doomed to repeat them.’ Therefore, as engineers, let us know and understand our past.

Reference
Taguchi, G., Chowdhury, S. and Taguchi, S. (2000), Robust Engineering. McGraw-Hill.

Roger Rivett is a Functional Safety Technical Specialist at Jaguar Land Rover. He can be contacted at <[email protected]>

Behavioural Competence for Safety Leadership
by Richard Scaife

Many organisations do a good job of identifying the technical competencies required for specific roles. Training matrices can provide a clear indication of the skills required for a process safety engineer, for example.

However, in many organisations, when a new supervisor or manager position becomes available, the behavioural competencies for the role are not well defined, and it is often someone with high levels of technical competence within the relevant department that is promoted into the role.

Many organisations are now developing behavioural standards for safety that specify the behaviours that help or hinder safety at each level of the organisation. These standards are used by the organisation to set their expectations for the behaviours that they want to see performed frequently and those that they want to see eliminated. Whilst organisations strive to eliminate behaviours that hinder safety, there can be occasions when unsafe behaviours occur unintentionally (e.g. human errors). There can also be circumstances under which individuals need to behave in a way that they wouldn’t normally do in order to fix a problem (e.g. Captain of US Airways flight 1549 not following normal ditching procedure, resulting in successful evacuation of the aircraft). For these reasons many organisations are pragmatic when they find that they have reduced the frequency of behaviours that hinder safety, although not completely eliminated them.

The behaviours are developed through a triangulated and evidence-based approach, based upon the experience of personnel from all levels of the organisation.

A triangulated approach is one in which several sources of information are combined, which typically includes deriving evidence from incident investigations and examining the characteristics of people who exhibit strong and weak safety leadership behaviours. This is backed up with reference to relevant published research in the field of behavioural safety, for example the characteristics of high reliability organisations (HROs). An HRO is an organisation characterised by a high degree of mindfulness and a healthy level of unease about good safety performance. An HRO will not rest easy if there has not been an incident for a long time: they will seek to learn what they have done to keep them from having incidents so they can keep doing this, and they will constantly ask themselves ‘What could go wrong to cause an incident in the future?’. See for example [Weick and Sutcliffe, 2001].

There are several techniques that can be used to gather information from people within an organisation to help determine what these behaviours should be. They include critical incident interviewing [Flanagan, 1954] and the repertory grid technique [Kelly, 1955]. These techniques help participants in the development of the behaviour standard model to which the staff would have to conform. Both techniques help to elicit examples of behaviours from people at all levels of the organisation that have, in their experience, helped or hindered safety. The resulting behaviour standard is structured around four behavioural themes: Standards, Communication, Risk Management and Involvement. Research in behavioural safety suggests that these four themes are important to the development of a strong safety culture, and also represent examples of good safety leadership at all levels of the organisation.

Because the behaviours are specified for all levels in the organisation (e.g. management, supervisors, indeed everyone), each of the four themes described above has many potential applications and can be integrated into Human Resources and safety management systems to measure performance at all levels of the company.

For example, when selecting personnel for a specific role within the organisation, a behaviour standard can be used as a framework to determine the level of behavioural competence held by candidates for that role. This would normally be done in parallel to an assessment of technical competencies. Without such a definition of the behavioural competencies it is difficult to determine whether a person will be successful in the non-technical aspects of the role. This difficulty is often the reason why organisations promote people into roles based upon formal assessment of technical competencies and a less formal assessment against behavioural competencies. This can result in the promotion of people who can do the technical job very well, but are less successful as leaders.

Consider once more the process safety engineer who has completed the training matrix for their role and has done well on all of the technical competencies. They are now being considered for a supervisory role within their department, which will mean that they will need to be good at safely managing a team of other engineers in order to be successful. In practical terms, many organisations would promote a strong technical candidate and help them to develop the person-management skills and the behavioural competencies through mentoring or coaching on the job. However, the difficulties arise when there is not sufficient time or resources to do this mentoring and coaching effectively.

Organisations with a safety-behaviour standard can use it as a behavioural competence framework, and can develop behaviour-based interview questions, based upon the framework, for selection of new personnel or promotion of existing personnel. For example, a behaviour relevant to a supervisor position might be: ‘I will promote risk awareness within my team.’

This may translate into an evidence-based interview question asking: ‘Can you give me an example of how you have promoted risk awareness with other people in your team?’

This in turn may result in a series of more probing questions designed to gather more evidence, for example:
• ‘How did you assure yourself that your team was aware of hazards and assessed risks?’
• ‘What sorts of things did you say to your team?’
• ‘What did you do to make sure that your team did not become complacent?’
• ‘How did you know your efforts had been successful?’

If the company was hiring someone from outside the organisation, they may seek evidence that this behaviour had been demonstrated elsewhere in a supervisory role. If the company were seeking to promote someone into a supervisory role, they would need to seek evidence that the person had demonstrated the ability to behave in this way, for example by mentoring a colleague in risk awareness.

This approach could be applied to both external recruitment and promotion from within the organisation. However, for measuring performance against a standardised set of behaviours, organisations should not wait until there is a personnel change or promotion.

The frequency with which personnel display the required behaviours can be monitored proactively in a number of other ways, such as:
• Relevant behaviours can be assessed by the line manager during performance appraisals.
• Individuals can undertake 360-degree feedback, where they rate themselves on their performance of the behaviours relevant to their role and ask their manager, direct reports and peers to do the same. The results provide feedback on the individual’s perception of their performance and of those around them, which can be a powerful personal development exercise.
• Site safety tours are used to observe the behaviours of people on site and to measure how frequently the key safety behaviours are occurring. Many organisations elect to focus on different sets of behaviours every month to make sure their approach is manageable (in a similar way to how they focus on different physical hazards, such as potential trip hazards).

When behavioural and technical competencies are considered together, some very powerful insights can be gained into the performance of individuals or groups of people within an organisation. Figure 1 shows technical competence along the X-axis (in this case meeting performance targets) and behavioural competence up the Y-axis (in this case safety behaviours). This figure is for illustration only; in reality appropriate measures of technical and behavioural competence would be used. In this illustration a frequency rating between ‘Always’ and ‘Never’ has been used. Each star represents the performance of one individual on both scales – e.g. to be in the top-right quadrant therefore means the individual meets performance targets and demonstrates safe behaviour standards.

Figure 1: Technical and Behavioural Competence in the Assessment of Performance
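As a minimal sketch of the quadrant view only: the numeric 1-to-5 scales and the threshold below are invented stand-ins for the ‘Never’ to ‘Always’ ratings in the figure, and the names are hypothetical.

```python
# Minimal sketch of placing individuals into the quadrants described in Figure 1:
# technical performance (meeting targets) versus behavioural performance
# (frequency of safe behaviours). Scales, threshold and people are invented.

def quadrant(technical, behavioural, threshold=3):
    meets_targets = technical >= threshold
    behaves_safely = behavioural >= threshold
    if meets_targets and behaves_safely:
        return "top-right: role model"
    if meets_targets and not behaves_safely:
        return "lower-right: delivers targets, but not safely"
    if not meets_targets and behaves_safely:
        return "top-left: safe, needs technical development"
    return "lower-left: capability or role-fit concern"

people = {"A": (5, 5), "B": (5, 2), "C": (2, 5), "D": (2, 2)}
for name, (tech, behav) in people.items():
    print(f"{name}: {quadrant(tech, behav)}")
```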

Arguably, those in the lower-right quadrant could cause the largest problem in a company as they are meeting their performance targets well but are not doing so in a safe manner. These may be people who are taking shortcuts but are gaining positive feedback by meeting performance targets; therefore, trying to influence them is likely to be difficult if their motivation is to over-perform. Careful consideration would need to be given to this group: should they continue to be employed? Can the organisation afford to let them go on the grounds of inadequate safety performance?

The decision is significantly easier with those in the lower-left quadrant. This group is not performing well in terms of technical competence or behavioural competence. There is, therefore, either a capability issue and they should cease their employment, or they are not in appropriate roles within the organisation and perhaps they should be moved to see if they can succeed elsewhere.

Those in the top-right quadrant are clearly the role models, performing well on both axes. These people would be the best to use as coaches for those in other quadrants on the diagram.

Those in the top-left quadrant perform well in terms of behavioural competence but would benefit from either training (or possibly re-training) to improve technical competence. It may also be possible to make use of the role models from the top-right quadrant to provide coaching in some of the technical skills.

The group in the centre of the figure should receive coaching on how to improve their performance against behavioural competencies as this could put many of them into the role model group.

Many organisations have been using the type of behaviour standard described in this article to perform gap analyses to determine how large a gap there is between the behaviours of personnel at all levels of the organisation and the positive behaviours described in their behaviour standard. This level of focus on behavioural competence has paid dividends for the organisations concerned.

One global pharmaceutical company has been monitoring behavioural competence in this way since 2008. It has incorporated the behavioural competencies into its selection and appraisal processes, conducted 360-degree feedback with supervisors and managers, and incorporated the use of the competence model into incident analysis to identify behaviours that have had a bearing on the outcome of incidents. In 2014 they reported that they had experienced a 60% reduction in reportable incidents between 2008 and 2014. Their sites have used the results of gap analyses to develop their own interventions to address the gaps, and almost all of their sites globally have experienced measurable improvements in safety culture since they began using this approach.

In conclusion, there do appear to be tangible benefits to be had from focussing on the behavioural competence of personnel in addition to the well-understood technical competencies. Both have a bearing on overall performance, and understanding which set of competencies needs to be addressed in order to improve performance is important when making decisions around performance improvement.

References
Kelly, G. (1955), The Psychology of Personal Constructs. New York: Norton.
Flanagan, J. C. (1954), The critical incident technique. Psychological Bulletin, 51, 327-358.
Weick, K. and Sutcliffe, K. (2001), Managing the Unexpected: Assuring High Performance in an Age of Complexity. San Francisco: Jossey-Bass.

Richard Scaife is a Director of The Keil Centre Limited, a private practice of Chartered Psychologists and Ergonomics & Human Factors Specialists. With twenty-six years’ experience in a range of industries (including aviation, nuclear, and rail), his work involves analysing human failures to improve performance, and the ergonomic design of equipment to reduce the likelihood of errors. Richard won a Practitioner of the Year Award from the British Psychological Society in 2006. He may be contacted at: Telephone: +44 (0) 131 221 8270; Email: <[email protected]>


Always Allow That You Might Be Wrong
by Felix Redmill

This article is based on an after-dinner speech, given by the author, at SAFECOMP 2015, in Delft, The Netherlands, on 24 September 2015

Every Saturday, two old friends used to go fishing on a lake near their home. On one occasion, they invited a new friend, a young man, to accompany them. They started off from shore and, when they had gone a little way, one of the men said, ‘I forgot the beer.’ The rower adjusted the boat and threw the anchor overboard and the first man stepped overboard and ran across the surface of the water. He went to his car, got the beer, and returned with it. As he climbed into the boat, his friend said, ‘I forgot the ice to keep the beer cold,’ and he stepped over the side, ran across the surface of the water, and returned with the ice. Their new young friend watched all this and, when the second man was back in the boat, he said, ‘I promised to bring sandwiches for lunch, but I forgot them in my car,’ and he stepped out of the boat and sank to the bottom of the lake. He rose to the surface, spluttering, and, as the two men watched him swim toward the shore, one of them turned to the other and said, ‘I think we should show him where the rocks are.’

On every path, there are rocks. Occasionally, they may offer advantages, as in that story, but they are most likely to be hazardous. To achieve success in our endeavours, we need to discover where they are and how dangerous they could be. In the safety domain, we do this by hazard identification and risk assessment.

Certainty
There are many techniques for hazard identification and risk assessment. But techniques cannot take on our responsibilities; they can only act in support of us. We also need knowledge of the mechanisms that could lead to breaches of safety. And with modern systems, safety breaches can result not only from malfunctions but also from interactions between functions or parts of systems, or even from normal operation in slightly altered circumstances. The possibilities are numerous, and some are obscure or unfamiliar, so it would be folly to believe that we can identify and predict them all.

We must recognise that we make systems such that we don’t know how they might behave. Except in the simplest cases, we cannot be sure of knowing all the failure modes – or how dangerous they could be. Yet, it is often easy to believe that we’re sure – and often, too, to be certain of it.

But there’s something we should know, and keep in mind. It is that certainty is a dangerous trap. It carries huge risk. Most of us suffer from it some of the time, and many of us suffer from it much of the time. But we should take note of a quotation attributed sometimes to Mark Twain, the American author, and sometimes to Artemus Ward, the American humorist: ‘It ain’t what you don’t know that’s the problem; it’s what you know that ain’t so.’

And I’ve heard it put another way: ‘It’s not ignorance that brings you down, but the illusion of knowledge.’

So, if you feel a conviction of certainty coming on, beware. Challenge it.

If we in the safety field were to have a motto, we could hardly do better than to heed those two sayings and embrace them in Oliver Cromwell’s entreaty to King Charles I in the face of the latter’s assertion of the divine right of kings: ‘Always allow that you might be wrong.’

Indeed, this might usefully be the motto of all engineers.

Suspicion
The most powerful tool that we possess is suspicion. Without it, we wouldn’t look for rocks except where they show themselves above the surface. We wouldn’t find the deeper ones – until it’s too late.

Hazard identification and risk assessment would be easy and quick, but not challenging, not interesting, and certainly not effective.

When we show suspicion, we are likely to be unpopular: in life, with those around us, and in work, where results are expected quickly and with certainty.

But, remember, be suspicious of certainty. Moreover, there is a trade-off between speed and degrees of certainty, and our responsibility is not to be popular but to be effective – in finding and understanding those rocks, and particularly the deeper ones.

If we drop our guard and allow suspicion to diminish, we will not be worthy guardians of public safety.

The Subject of Risk
To discover where the rocks are, we base our investigations on risk. In fact, we define safety in terms of risk. Although we refer to our field of activity as ‘safety’, our subject is risk. And it’s a broad subject, with many strands. Yet, we are not taught it.

We concern ourselves with Human Factors, but we’re not taught about the huge volume of research carried out into Risk Perception, and this is a key element in understanding human cognition, human behaviour, and human error.

Those of us who carry out risk analysis are seldom those who use the results to make big decisions. Risk Communication is an important topic, but we are given no instruction in it.

We are expected to make assessments of the acceptability of risks, but we are not taught to understand the many aspects of Risk Tolerability. As well as being a crucial subject, it is also deeply interesting and one that opens many avenues to research.

We must endeavour to prevent accidents, but we do not study them, and we receive no instruction in the theory behind Causal Analysis.

The main purpose of employing risk is Decision Making, which is a subject in its own right, with topics such as Game Theory. But we receive no education in this.

In other domains, under-graduate students are given education in the subjects that are fundamental to them. But we are not taught the subject of Risk, which is fundamental to all our work and is the basis of our thinking. Consequently, we do not develop an understanding of the theory that underpins our profession. Universities do not impart it.

Only when we get to industry are we introduced to risk – but not through engineering education. Rather, we are given technicians’ training courses – in techniques, not theory.

The lack of a deep understanding not only makes our risk decisions risky, it also leaves us unaware of what we don’t know: we think we know more than we do. Indeed, we are taught so little about risk that we think we know it all. But remember those quotations: it ain’t what you don’t know that’s the problem; it’s what you know that ain’t so. It’s not ignorance that brings us down, it’s the illusion of knowledge.

Risk is a crucial subject not only for safety engineers but for all engineers, and it should be a core subject in an undergraduate curriculum that is tailored to the making of safety engineers – not alluded to, or touched on, but given a full syllabus of its own.

And Risk is essential for management and the making of policy – too many of our managers have too little knowledge of risk, and no understanding. Rather than properly defining the management of safety risks, too many managers create them.

The subject of risk is not limited to analysis techniques. Its fundamentals need to be understood by us and they should be taught thoroughly in a curriculum that is appropriate to safety engineering and management. If we are properly to seek and find the rocks, to determine their dangers, to assess their tolerability, to communicate their risk information, and to manage safety-critical matters, we need to have safety-engineering education and not merely technicians’ training.

Risk decisions are key factors in other fields as well. Safety-critical decisions are routinely made in the medical and legal professions, in the probation service, the military, and the police service. But in some cases ‘risk’ is given a very limited scope and its ‘assessment’ is no more than badly informed box ticking. For example, in the UK, the police carry out ‘risk assessments’ before they go on certain types of mission. Yet, after a risk assessment, the police break into houses, in the night, and kill unarmed occupants – in one case a naked man – saying they felt threatened. They fill in the forms but do not plan the conduct of their operations in keeping with risk management, at least not for the safety of the public. Another example: a few years ago, the author of an article in a nursing journal explained the purpose of risk management in health-care establishments as being to protect hospitals and medical staff from being found liable when their mistakes caused death or injury. Risk to patients was not considered – but the Editor accepted and published the article.

The need for risk understanding – and education – extends to numerous other domains in this modern world.

New Technology Risks
In spite of the shortcomings in our education, our profession has ensured that man-made products and processes are, in the main, as safe as is reasonably practicable. But developments in technology and a recent change in culture have been throwing new rocks into the path of safety. They are the effects of gambling with cyber security.

Safety-critical SCADA (supervisory control and data acquisition) systems are being made accessible via the internet.

Medical infusion pumps are being designed to incorporate Bluetooth.

There is the ‘internet of things’, which aims to make just about everything remotely controllable – and hackable.

Recently, hackers – fortunately, research hackers – took control, remotely, of a Chrysler vehicle. They operated sub-systems, such as windscreen wipers, and then ran it off the road. They could do this because the vehicle’s CAN (controller area network) was accessible – and controllable and hackable – via cellular networks.

Hackability is now increased by remote control not only being made possible, but also facilitated and encouraged, via smart phones. Remotely collected data is being analysed and sent to them, making the management of safety-critical systems dependent on them. In many hospitals, the monitoring of patients is done by apps and nurses’ interactions with patients are via smart phones. I wait for a nurse to report, after a night shift, that she couldn’t prevent the death of five patients on her ward because she didn’t know that the battery of her smart phone had no charge.

Our profession is based on balancing functionality against risk. But, more and more, designers and programmers are following fashion rather than safe practice. They are seeking to make everything remotely controllable – because it’s the fashion and because it’s possible to do it. They are deliberately creating systems with functions that are obviously dangerous and, in many cases, obviously unnecessary – and this most certainly does not constitute balancing. It is reckless, and it is predictable that their systems will eventually be hacked, with potentially disastrous consequences.

We are at a point when fashion renders us incapable of questioning if a function is necessary, or of stopping it being provided on the grounds that it isn’t. A part of the blame is that many decisions are made by programmers who are not engineers, and another part is that many who are engineers have little or no understanding of safety or of risk theory.

The fashion-conscious designers and programmers are one thing, but who are the senior managers who approve these systems? So many of them show no knowledge of risk and no understanding of our profession. What they’re doing (or recklessly allowing) endangers lives – and businesses – and opens doors to criminals and terrorists. Is what they do ethical? Should they, in their positions, be charged with criminal negligence?

Looking at ourselves, what ethical responsibility does each of us have for what we do? Do we give thought to what is ethical? Do we accept any individual or collective ethical responsibility for what is done in our profession? These questions need to be contemplated.

Conclusions
I will limit my conclusions to offering three reminders.

First, certainty is a dangerous trap. Beware of it. Challenge it, particularly in oneself.

Second, our most powerful tool is suspicion. Nurture it.

Third, always allow that you might be wrong.

Felix Redmill is Editor of Safety Systems and offers training in Safety and Risk Principles. He may be contacted at <[email protected]>


Risk-based Approach to Nuclear Safety
by David Forsythe

1 Background
The Sellafield Site has grown over many decades into a complex mixture of operating nuclear facilities, aging waste stores, waste treatment facilities, redundant facilities in a care and surveillance regime, and significant new-build activities. This is a result of substantial change in the role of the site in the UK and international fuel cycle. As the MAGNOX era comes to an end, and the end of any form of reprocessing of spent fuel is in sight, we find that the focus of the business is moving more and more to inventory removal from the older facilities that have supported the major reprocessing programmes, and the construction of facilities to treat and package the material for long-term safe storage. The ‘standard’ approach to nuclear safety is becoming less and less appropriate. The ‘one size fits all’ approach is becoming less and less supportive of delivering the right (and ALARP) overall solution.

2 The Problem
There are three parts to the problem statement.

2.1 At sea vs. dry dock and the need for compromise
The Sellafield site has two fundamentally different types of programme with very different aspects of safety, and in particular nuclear safety, as their main focus. We have operational nuclear facilities, our ‘ships at sea’, and we have plants under construction or without nuclear inventory, our ‘ships in dry-dock’. The ships at sea have clear command structures with a focus on making sure the ship keeps safe. All the crew understand how their roles contribute to keeping the ship afloat and the passengers safe. The ship has a clearly defined safe operating envelope and everyone understands how the systems and operating procedures relate to it. The ships in dry dock can’t sink but the performance requirements at the point of hand-over to the client are paramount. The work is schedule-driven and the safe delivery of work is, to a large part, dependent on competent people applying their core skills to comprehensively risk-assessed tasks.

The work to remove inventory from the older facilities on site is analogous to combining these two extremes. What if a ship needed an urgent repair or refit while remaining at sea and with passengers still aboard; how would the dock yard teams have to adapt their approach to take into account the fact that their work affects the systems and processes that keep the ship afloat? How does the crew adapt to enable the work on-board while maintaining a safety envelope? The dock yard teams and crew need to understand where and how compromises are appropriate. The schedule will be affected by the conditions at sea but the margins of safety may need to be challenged to facilitate the work. Can the steering systems be handed over to the dock yard team in rough seas? Is it a safety issue or a passenger comfort issue? How urgent does the refit need to be before passenger comfort can be compromised? It becomes a complex balance. How do we control and govern the necessary compromises?

2.2 ‘Keep the plane on the ground’ approach to safety
The ‘traditional’ approach to nuclear safety focuses on designing facilities to a safety case, substantiating the safety functions, proving they work as intended, and then operating them within well maintained safety envelopes, using suitably qualified and experienced people working to clearly defined procedures. We apply a stage-gate approach to delivering these outcomes, and this focuses on removing risk and uncertainty at each stage. The facility is safer in design than construction, safer in commissioning than operations; the plane is safer on the ground than in the air. The same thought process applies to safety and business processes. We focus on remaining where we are until we can demonstrate that the risk and uncertainty is removed or low enough to be acceptable to move on. It applies to safety and business risk. I call it ‘keep the plane on the ground’ mentality and it is core to our traditional approach.

2.3 Managing the business vs. managing safety
The accountabilities for business delivery and for safety drive the governance arrangements used universally across the Sellafield site. Fundamentally, the governance of the business model and business direction is different to the governance of safe delivery of work. It can be likened to the running of an airline. The leader of the business decides on the business model (what aircraft to buy, what routes to fly, economy focus or luxury focus, regional airports or major hubs, etc., etc.). However, whatever the model chosen, the operations people, the maintainers and the flight crew, have to ‘fly by the book’. Whatever aircraft are purchased, they must be flown within their safety envelopes and maintained to schedule. Routes flown are dictated by Air Traffic Control regulations and there are strict controls over the working hours of the flight crews. The safety accountability lies predominantly with the Head of Operations while the Head of the business takes accountability for delivering the right business outcome.

The site programmes are set up in a similar fashion, with similar governance. We separate the business direction from delivering the business safely. The variables impacting how we deliver work are under robust governance arrangements focussed on nuclear safety. We have assurance and verification of ‘flying by the book’ and we look at how changes affect the book and our compliance with it. The business direction governance arrangements focus on delivering value for money and spending wisely to serve strategic objectives. But what if the plane that is already in the air isn’t a modern airliner but a vintage model built half a century ago? It doesn’t have modern navigation systems and autopilot and, while there isn’t an in-flight emergency, there is uncertainty about how long it can stay airborne. Flying indefinitely ‘by the book’ is neither appropriate nor possible. We need a well-constructed landing plan with good contingency. Delivery of the landing plan doesn’t just deliver the business objectives but delivers the safe outcome. Business decisions and business governance have a direct impact on nuclear safety.

3 The Risk-based Management Framework
In answering the problem statements it was important for us to understand which are the programmes where nuclear safety is fundamentally the product of ‘flying by the book’, and where we can apply tried and tested approaches to engineering controls, administrative controls, training, standards and culture to deliver good defence in depth for nuclear safety. We need to understand, too, the other type of programme: the ones that need to deliver a landing plan, where the book that we fly by is a variable driven by the context of the landing plan. What does it look like to manage safety, especially nuclear safety, in that environment? Using criteria that would be inappropriate to discuss in detail here, we developed a framework to map the site programmes and act as a communication tool (see Figure 1).

Consider region A as programmes and facilities that can be operated indefinitely within an approved safety case. The duration of operations, within reason, is a business issue, not a safety issue.

Region B shows a typical hazard and risk reduction curve for an ageing facility. The y axis represents the risk and detriment (to workers, the public and the environment) presented by the existence of the facility and its inventory, as well as the transitory risk associated with delivering work on the plants (such as the installation and operation of inventory retrieval equipment). In reality, every type of risk and detriment has its own curve, with an extensive range of different units and magnitudes; the framework simply presents a principle to be used as a tool. Typically, for a decommissioning programme covering plants from the early atomic era and the early commercial MAGNOX programme, the timescales on the x axis are of the order of several decades.

Region D on the framework represents the level of transient risk or detriment that is not justified by the benefit, even though the benefit is a reduction of the hazard and risk associated with the particular facility; i.e. it is the upper boundary of the ‘ALARP’ case. There is also a ‘time at risk’ element: not just time at the transient risk associated with delivering work, but time for which the world has to continue to tolerate the risk and detriment presented by the facilities’ existence.

The boundary with region C represents the limit of tolerance of that time at risk.


Figure 1: The Risk-based Management Framework


In applying these two boundaries, we define an area (region B) that can be justified as ALARP. We recognise that the focus is to minimise the area under the curve within that envelope. This means that, within the ALARP position, there is room to manoeuvre. We can consider using that manoeuvring room to increase risk and detriment up to the D boundary if doing so increases the gap to region C, so as to give an overall reduction in the time at risk, and obviously vice versa. In other words, we can change the shape of the curve to reduce its area, provided we stay within the region B boundaries.
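
The trade described here – accepting a short-term rise in transient risk in exchange for a shorter overall time at risk – can be illustrated numerically. The sketch below (Python) is a hypothetical reading of the framework, not Sellafield’s method: the curves, boundary values and units are invented, and the D and C boundaries are treated as simple thresholds.

    # A minimal sketch, assuming invented curves and units: a 'landing plan' is a
    # list of (year, risk) points; plans that cross the D boundary or overrun the
    # C time-at-risk limit leave region B, otherwise they are compared by the
    # area under the curve.

    def plan_area(points, d_boundary, c_limit_years):
        """Area under a piecewise-linear risk curve, or None if outside region B."""
        if max(r for _, r in points) > d_boundary:        # crosses into region D
            return None
        landing_year = max(t for t, _ in points)          # when the risk is retired
        if landing_year > c_limit_years:                  # time at risk beyond C
            return None
        return sum(0.5 * (r0 + r1) * (t1 - t0)
                   for (t0, r0), (t1, r1) in zip(points, points[1:]))

    baseline    = [(0, 10.0), (40, 0.0)]               # slow, steady risk reduction
    accelerated = [(0, 10.0), (5, 12.0), (25, 0.0)]    # brief rise, earlier 'landing'

    for name, plan in [("baseline", baseline), ("accelerated", accelerated)]:
        print(name, plan_area(plan, d_boundary=15.0, c_limit_years=50.0))
    # accelerated gives the smaller area (175 vs 200) while staying inside region B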

Region E represents an undeliverable programme and is an important part of challenging business processes. Programmes can become undeliverable when downstream facilities fail, resources are not available, or the facility itself suffers an unrecoverable failure.

4 So What’s the Point?

The framework has proven to be a powerful communication tool. Consider how the framework would portray a disastrous event in which a modern operational (region A) facility suffered damage that degraded it to match the condition of our half-century-old legacy facilities. Public and regulatory tolerance of such a situation, and our own, would be extremely low, and recovery would immediately become a national priority. Extending a decommissioning programme for an existing legacy facility has the same day-on-day impact in terms of the risk and detriment that the world has to tolerate. In other words, the area under the curve can increase in two directions and, although the two are perceived very differently, the increase in risk and detriment is, for all intents and purposes, the same.

Once we recognise that fact, and model what it means to a particular facility or programme, we can then use the model to develop new approaches to governance of safety and the business to ensure that we maintain an ALARP envelope. We have developed new approaches in the following areas:

4.1 Balance and Calibration

Our people are accustomed to keeping the plane on the ground until risk and uncertainty are removed. They are used to flying to book standards. We need to make the transition to standards that are driven by the context. We need to learn to manage risk and uncertainty and to govern compromise.

4.2 ALARP

Fundamentally, in a traditional approach to nuclear safety, we make all the components of a task ALARP, so that all the tasks are ALARP, so that all the projects are ALARP, and we get an ALARP programme that determines the schedule. For region B programmes, the ‘landing plan’ has to be ALARP and, in turn, has to drive what ‘good enough’, ‘safe enough’ and ‘acceptably low’ risk look like for the tasks and projects that deliver the programme.

4.3 Governance

If we are to justify tasks and projects carrying more risk and detriment than would otherwise be acceptable, on the basis of their constraints as part of an ALARP programme, then we need exceptionally good governance of the programme to maintain a top-down ALARP position. Alignment of business and safety processes is essential.

4.4 People and Coaching

If we need our people to manage risk and govern compromise in a programmatic ALARP situation, then context, human performance and tactical decision making become absolutely key. A significant outcome of compromise may be a shift in the bias of the safety function from engineering to people.

5 Conclusion

The Risk-based Management Framework and the risk-based approach to nuclear safety are a means of recognising, within management arrangements, that hazard and risk reduction programmes are fundamentally different to operational facilities. Nuclear safety is as much about the work that we choose to do as it is about how we choose to do it. Standards are context driven, not book driven, and we need good governance to ensure that we do the right things right, at the right time and in the right sequence, so as to maintain a demonstrably top-down ALARP position.

David Forsythe is Head of Nuclear Safety for Decommissioning at Sellafield. He has held management positions over the last 24 years in operations, projects and safety case management. He can be contacted via email at <[email protected]> or on his direct office telephone number: 019467 76247.


Practicalities in the Application of ALARP
by Simon Brown


The subject of ALARP has been addressed many times in this newsletter. My intention here is not to further review or explain the ALARP principle, which has been covered in depth in previous articles [1]. Rather, I will review some of the practical issues around the application of ALARP from my point of view as a recently retired HSE (the Health and Safety Executive) inspector after 23 years of service. During this time I covered a wide range of activities, including general manufacturing, fairgrounds and major hazard industries, both onshore and offshore.

The then Prime Minister, in a speech in May 2005, raised concerns that Britain is becoming an increasingly risk-averse society and that this trend is having a detrimental impact on public policy [2]. He suggested that ‘we are in danger of having a disproportionate attitude to the risks we should expect to run as a normal part of life’ and that this is putting pressure on policy-makers ‘to act to eliminate risk in a way that is out of all proportion to the potential damage’. Against this background, the Select Committee on Economic Affairs conducted an inquiry into government policy on the management of risk.

The Select Committee considered HSE’s policy on reasonable practicability, as set out in ‘Reducing Risks, Protecting People’ [3]. The Committee’s report [4] stated that ‘In our view the use of ill-defined and ambiguous terms in risk management and regulatory documents is generally unhelpful. There is a danger that they can induce an excessively cautious attitude to risk. We recommend that terms such as ALARP, Gross Disproportion and the Precautionary Principle should be more clearly defined or replaced with more specific and unambiguous requirements and concepts.’ One of the witnesses to the Select Committee stated that ‘“reasonably practicable” is either meaningless or is designed not to limit in any way the scope for administrative discretion, which is, I think, in reality the way in which it is used’. The government’s response to this [5] was that ‘more prescription is unlikely to be the answer. Prescription reduces regulators’ flexibility in devising policy options and, operationally, can lead to a mechanistic and unquestioning approach, all of which can result in excessive caution … Devices like ALARP, Gross Disproportion, Tolerability of Risk and the Precautionary Principle are better suited as they have to be applied in a practical and pragmatic manner.’ In this context, it should be noted that legislation (under the Health & Safety at Work etc. Act 1974) generally requires risks to be reduced ‘so far as is reasonably practicable’ (SFAIRP). ALARP is not generally used in legislation; however, HSE’s view is that ALARP and SFAIRP are synonymous. Interestingly, David Eves, ex-HSE Deputy Director General, refers to ALARP as being the ‘questionable invention of experts’ [6].



So just how do HSE inspectors apply the ALARP (or SFAIRP) principle in their daily work in a ‘practical and pragmatic manner’, whilst maintaining a reasonable degree of consistency in enforcement decisions? HSE’s enforcement policy statement [7] says that deciding what is reasonably practicable to control risk involves the exercise of judgement: ‘Enforcing authorities considering protective measures taken by duty holders must take account of the degree of risk on one hand, and on the other the sacrifice, whether in money, or time or trouble, involved in the measures necessary to avert the risk. Unless it can be shown that there is gross disproportion between these factors and that the risk is insignificant in relation to the cost, the duty holder must take measures and incur costs to reduce the risk.’ However, it is very unlikely that an inspector would undertake such a cost-benefit analysis ‘on the spot’ during a typical inspection. Instead, for the vast majority of workplaces, reference is made to established standards or guidance. For example, in an engineering workshop it is expected that machinery guarding and protection will be provided to the various European standards referenced by the EC Machinery Directive (and the associated GB regulations). As long as the relevant standards are met, there is an assumption that risks have been reduced ALARP/SFAIRP.

The decision process that HSE inspectors are trained to follow is formalised in the Enforcement Management Model (EMM) [8]. The EMM is intended to promote enforcement consistency by confirming the parameters of, and the relationships between, the many variables in the enforcement decision-making process. The first step in the decision process is to determine whether there is a risk of serious personal injury that would warrant immediate action, such as stopping an activity by way of a prohibition notice or seizing or making safe an article or substance. This is a judgemental decision based on the inspector’s training and experience, i.e. competence. If it is decided that there is no such risk, then the next step is to determine the risk gap. This is the gap between the actual risk as observed in the inspection and the benchmark risk. Both the actual risk and the benchmark risk are defined by qualitative risk matrices, taking into account the consequence of the hazardous event being considered and its likelihood. The risk gap is thereby determined to be extreme, substantial, moderate or nominal. The combination of the risk gap and the ‘authority’ of the relevant standard determines the so-called ‘initial enforcement expectation’, i.e. an improvement notice, a letter or a verbal warning.

Standards are divided into three levels of ‘authority’. Those of the highest ‘authority’ are Acts of Parliament, Regulations, Orders and Approved Codes of Practice. These are referred to as ‘defined standards’. Next come ‘established standards’, which include standards linked to legislation (e.g. CEN or CENELEC standards harmonised under an EU Directive) and standards considered by HSE or industry as defining the levels of performance needed to meet a duty under health and safety law. The lowest level of authority is afforded to so-called ‘interpretive standards’. These are standards put forward by HSE or determined by HSE inspectors from first principles.

A combination of an extreme risk gap with any type of standard leads to an initial enforcement expectation of an Improvement Notice, with consideration of prosecution in the case of defined or established standards. A combination of a substantial risk gap with a defined or established standard gives an initial enforcement expectation of an Improvement Notice. A combination of a moderate risk gap with a defined standard results in an Improvement Notice, or, with an established or interpretive standard, a formal letter. A nominal risk gap results in a verbal warning for all types of standard.
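
The pairings described above amount to a simple lookup. The sketch below (Python) encodes only the combinations stated in the text; pairings the article does not mention are left out rather than guessed, and it is of course not an HSE tool.

    # Hypothetical encoding of the initial enforcement expectations stated above.
    # Combinations the text does not mention are deliberately absent.
    EXPECTATION = {
        ("extreme", "defined"):         "Improvement Notice (consider prosecution)",
        ("extreme", "established"):     "Improvement Notice (consider prosecution)",
        ("extreme", "interpretive"):    "Improvement Notice",
        ("substantial", "defined"):     "Improvement Notice",
        ("substantial", "established"): "Improvement Notice",
        ("moderate", "defined"):        "Improvement Notice",
        ("moderate", "established"):    "formal letter",
        ("moderate", "interpretive"):   "formal letter",
        ("nominal", "defined"):         "verbal warning",
        ("nominal", "established"):     "verbal warning",
        ("nominal", "interpretive"):    "verbal warning",
    }

    def initial_expectation(risk_gap, standard_authority):
        """Return the stated expectation, or None where the text is silent."""
        return EXPECTATION.get((risk_gap, standard_authority))

    print(initial_expectation("moderate", "established"))      # formal letter
    print(initial_expectation("substantial", "interpretive"))  # None - not stated above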


This initial enforcement expectation is then modified by a number of ‘duty-holder’ factors, including incident history, previous enforcement, inspection history and general conditions. This can result in the initial enforcement expectation being elevated (e.g. from an improvement notice to a prosecution) or downgraded. After all this, the inspector has to consider a number of ‘strategic factors’, including the overall public interest and the impact of enforcement on vulnerable groups such as children and patients. This may result in a review of the proposed enforcement action.

For permissioning regimes, such as major hazards onshore (COMAH) or offshore oil and gas installations, the EMM states that the risk gap, and hence the indicated enforcement action, is determined by the extent of deviation from the permissioning document (e.g. the COMAH safety report or the offshore installation safety case). It also states that enforcement is by way of revocation, refusal, amendment or direction of the permissioning document. However, in the author’s experience it has rarely been the case that a permissioning document that has been accepted by HSE is used as the basis for enforcement action. Rather, reference is usually made to the relevant benchmark standards and the EMM is applied as for non-major-hazard installations. This can be problematic, as the qualitative definitions of risk do not help in the determination of the risk gap when the likelihood of a hazardous event is very low, as is usually the case for major hazards.

HSE’s approach to ALARP decisions for major hazard installations is set out in internal guidance for HSE’s Hazardous Installations Directorate (HID), which is available externally [9]. Again, this refers to the basic principles set out in Reducing Risks, Protecting People [3], whereby there is a need to compare the sacrifice (in terms of money, time and trouble) involved in taking further measures to reduce risk against the benefits derived from those further measures (in terms of fatalities etc. avoided). For a measure to be not reasonably practicable, there must be gross disproportion between the costs and the benefits. Guidance is given on judging gross disproportion, but it is suggested that the proportion factor should be at least 10 when the risk is at the tolerable/unacceptable risk boundary (i.e. an individual risk of 1 in 1,000 per annum for a worker or 1 in 10,000 per annum for a member of the public who has a workplace risk imposed). Of course, this implies the need for a quantified approach to risk assessment (QRA), although, again, reference is made to the need to adopt good practice where it fully meets ALARP requirements. However, QRA should be treated with caution due to the often large uncertainties involved in estimations of both likelihood and consequence [10].
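
As a rough illustration of the comparison being described – sacrifice versus benefit, with a disproportion factor – the sketch below works through an invented case. All of the figures, including the value placed on preventing a fatality, are assumptions for illustration only and are not HSE guidance.

    # An invented gross-disproportion check: a measure is reasonably practicable
    # unless its cost exceeds the disproportion factor times its benefit. None of
    # the figures below, including the value placed on preventing a fatality,
    # are HSE figures.

    def reasonably_practicable(cost, annual_risk_reduction, exposed_people,
                               years, value_per_fatality, factor):
        benefit = annual_risk_reduction * exposed_people * years * value_per_fatality
        return cost <= factor * benefit, benefit

    ok, benefit = reasonably_practicable(
        cost=2_000_000,                 # £ to fit the additional measure
        annual_risk_reduction=5e-4,     # reduction in individual risk per year
        exposed_people=100,
        years=20,                       # remaining plant life
        value_per_fatality=2_000_000,   # hypothetical £ per statistical fatality averted
        factor=10)                      # risk near the 1-in-1,000 p.a. boundary
    print(ok, benefit)   # True, £2,000,000: a £2M cost is not grossly disproportionate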

So, it can be seen that there are many factors to be considered by an inspector in forming a judgement as to whether risks have been reduced SFAIRP or ALARP in any particular circumstances. The concerns of the Select Committee are addressed by the extensive training of HSE inspectors and by the detailed guidance that is provided. Whilst the ALARP/SFAIRP frameworks do allow the regulator to adopt a number of approaches in assessing duty holders’ compliance, it is the author’s experience that the frameworks generally facilitate constructive engagement between the regulator and the regulated, and result in adequate control of risk while not placing undue burdens on industry.

References

[1] ALARP Explored, F Redmill, Newcastle University Technical Report Series CS-TR-1197, March 2010
[2] Speech by Rt Hon Tony Blair, MP, delivered at the Institute of Public Policy Research, 26 May 2005
[3] Reducing Risks, Protecting People, Health & Safety Executive, HSE Books, 2001
[4] Government Policy on the Management of Risk, House of Lords Paper HL 183-I, 7 June 2006, The Stationery Office Ltd
[5] Government Response to the Management of Risk, House of Lords Paper HL 249, 13 October 2006, The Stationery Office Ltd
[6] History of Occupational Safety & Health, David CT Eves, RoSPA National Occupational Safety & Health Committee, 2014
[7] Enforcement Policy Statement, Health & Safety Executive, 2009
[8] Enforcement Management Model, Health & Safety Executive, Version 3.2, October 2013
[9] HID’s Approach to ALARP Decisions, www.hse.gov.uk/foi/internalops/hid_circs/permissining/spc_perm_39.htm
[10] The ALARP Principle and QRA, S Schofield, Proceedings of the 19th International OMAE Conference, February 14-17, 2000, New Orleans, Louisiana, USA

Simon Brown graduated with a first-class honours degree in Electronic Engineering in 1978. He joined HSE as a specialist inspector in 1992 and later became an Operations Manager. He was a member of the IEC working groups for IEC 61508 and IEC 61511. Simon retired from HSE last year and is now an independent technical safety consultant. He can be contacted at <[email protected]>.


Risk, Uncertainty and Unpleasant Surprises
by Matthew Squair

Introduction

‘I often say that when you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind.’ – Lord Kelvin

‘Oh well, if you can’t measure, measure it anyway.’ – Frank Knight

The formal mathematical treatment of random systems is generally agreed to have started with an exchange of letters between Pierre Fermat and Blaise Pascal in 1654 on how to fairly divide up the points from an interrupted game of chance. Fermat came up with the idea of an ensemble, a set of hypothetical parallel worlds that captured all the possible outcomes, and coined the term ‘expectation’ for the average value of this ensemble. Before long, mathematicians saw that this idea of expectation could also be used for predictive purposes, such as the setting of insurance premiums or the defining of pension annuities. Eventually, thanks to de Moivre, the mathematical expectation of loss came to be synonymous with the concept of risk.

Fast forward to the latter half of the twentieth century, a time of rapid technological change. Unfortunately, the new technologies also carried within them the potential for accidents of truly catastrophic magnitude. The problem was that eliminating altogether the possibility of such accidents, or mitigating their consequences, seemed an impossible task. So how to justify the acceptability of such catastrophic possibilities? The answer found was to take a risk-based approach: if the probability of an accident could be shown to be extremely low, then the potential consequences could be ignored. Of course, if you’re going to use risk to make decisions, then you need to quantify it, which places the mathematics of risk developed by Fermat, Pascal and de Moivre at the heart of managing technology risks. Early works, such as the Canvey, Rijnmond and WASH1400 studies, laid the foundations of this new regulatory approach. Today, even a cursory examination of the state of practice across industries where there is the potential for catastrophic accidents shows us that quantitative risk assessment is alive and well. Methodologies such as Probabilistic Risk Assessment (PRA) allow risk practitioners to build complex models of risk, standards such as IEC 61508, EN 50129 and ISO 31000 normalise the use of risk in decision making about safety, and various industry regulators require the quantification of risk to justify societal permission to operate.

In theory, a quantitative approach is fairly straightforward. We determine the set of possible accident events and their severities, calculate each accident’s probability, then multiply it by the numerical severity (normally a cost) to obtain a quantitative value of risk. We then sum the set of risks to determine the total quantitative risk. Although the mathematical certainty of such assessments might appear to satisfy Lord Kelvin’s dictum, just because something is true in the model does not necessarily make it true in real life. We are in fact assuming that one can use quantitative probability to reason about what are usually extremely unlikely events, and that’s a very big assumption. Unfortunately, when we examine how classical probabilistic risk theory is applied to real-world problems, we find that the promised mathematical certainty starts to unravel. But, if classical risk assessment is flawed, what do we replace it with? Answering this question is what motivates this piece. In the first part of the article we’ll discuss the theoretical and practical limitations of applying probabilistic risk theory in the real world; in the second we’ll introduce a broader spectrum of uncertainty and risk to consider how we might better manage the risks of technology.
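
The textbook calculation just described can be written in a couple of lines. The fragment below is only that calculation, with invented events and costs; it is not a PRA tool.

    # Total quantitative risk as the sum of probability x severity, using
    # invented events and costs purely to show the arithmetic described above.
    events = [
        ("small release", 1e-3, 5e6),     # (name, annual probability, cost in £)
        ("major release", 1e-5, 5e8),
        ("core damage",   1e-7, 5e10),
    ]
    total_risk = sum(p * cost for _, p, cost in events)
    print(f"expected loss: £{total_risk:,.0f} per year")  # each event contributes £5,000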

Problems with Probability and Risk

The idea of risk actually relies upon a mathematical sleight of hand: we assume that the average across the ensemble of parallel worlds is the same as the average of a series of events one after the other; technically, we assume that our problem is ergodic. What we are doing in actuality is assuming that our theoretical parallel-worlds model applies to the real world, where we experience life as a stream of events. This assumption works if the consequences are small, but not when they are potentially catastrophic [Peters 2009]. The problem is that events happen in a strictly linear (time series) fashion, and if a catastrophe occurs we can’t access those parallel worlds or gain any benefit from them. Our world is a determinedly non-ergodic one, and assuming otherwise is just a convenient mathematical approximation which works only some of the time. The implication is that if we assume ergodicity in risk assessments we are significantly underestimating catastrophic risks (1).
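
The ensemble-versus-time distinction can be felt in a toy simulation. The gamble below – a 50% gain or a 40% loss of wealth on each round, repeatedly reinvested – is a standard illustration of the kind discussed by Peters, not the author’s own example: the ensemble expectation of each round is positive, yet almost every individual world ends up ruined.

    # Ensemble average vs time average for a repeated multiplicative gamble:
    # each round multiplies wealth by 1.5 (heads) or 0.6 (tails). Illustrative only.
    import random
    random.seed(1)

    ROUNDS, WORLDS = 1000, 2000
    print("ensemble expectation per round:", 0.5 * 1.5 + 0.5 * 0.6)   # 1.05 - looks favourable

    finals = []
    for _ in range(WORLDS):                      # the hypothetical parallel worlds
        wealth = 1.0
        for _ in range(ROUNDS):                  # one world, experienced through time
            wealth *= 1.5 if random.random() < 0.5 else 0.6
        finals.append(wealth)

    finals.sort()
    print("median final wealth:", finals[WORLDS // 2])                # tiny, of the order of 1e-23
    print("worlds worse off:", sum(f < 1.0 for f in finals) / WORLDS) # almost all of them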

Setting aside the ergodic-assumption fallacy for a moment, there is the question of whether we can come up with a meaningful estimate of risk in the first instance. The first problem we strike is that to calculate a meaningful value we need a finite set of events over which to sum the risk. If the distribution of our events is normal (the classic bell curve), or there’s some other limit on severity, that’s easy to argue. But, if the distribution is heavy-tailed, i.e. there’s a significant probability of extreme events in the tail, it’s very hard to accurately sum up the risk because of the uncertainty of both what the worst case is and its probability. As catastrophic events also dominate the risk, getting their probability wrong, usually by underestimating it, translates into errors in the overall risk. Our problems in establishing a risk value become that much greater when we believe that we are dealing with an extremely heavy-tailed distribution, for here the risk value is simply unbounded, with all the societal risk-sharing problems that entails. Nor is this a purely theoretical problem; the Pareto distribution of reactor accident severity [Hofert & Wuthrich 2011] results in unbounded risk and a significant underwriting headache for insurers (2).
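
The ‘unbounded risk’ point can be seen directly by simulation. The sketch below samples losses from a Pareto distribution with tail index alpha = 1, a purely illustrative choice made because such a tail has no finite mean, and shows that the running average never settles down.

    # Running mean of losses drawn from a very heavy-tailed Pareto distribution
    # (tail index alpha = 1, an illustrative choice with no finite mean): the
    # sample mean keeps jumping upwards as rarer, larger losses arrive.
    import random
    random.seed(7)

    alpha, total = 1.0, 0.0
    for i in range(1, 10**6 + 1):
        total += random.paretovariate(alpha)      # Pareto-distributed severity
        if i in (10**2, 10**3, 10**4, 10**5, 10**6):
            print(f"running mean after {i:>7} samples: {total / i:10.1f}")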


The problem of how, or even whether, we can bound heavy-tailed distributions may also explain why the term ‘worst credible’ pops up again and again in risk assessments. Effectively, it’s an analytical fiddle that allows us to rescue the ideal of quantitative risk. The real remedy to this situation is to bound the actual severity, thereby allowing risk to be quantified.

Demonstrating statistically the very low probabilities we are talking about also requires extremely long durations which, as a practical matter, we are unlikely to achieve before we place a system into service [Littlewood & Strigini 1995]. Instead, we fall back on estimating the probability of more frequently occurring precursor events, e.g. component failures, and combining them into a logical model that is used to compute the probability of the accident of interest. Of course, this relies on us knowing the set of precursor events and their probabilities with enough precision to calculate a quantitative value, as well as on the correctness of the model itself. The first problem is that identifying such events for novel or complex technologies is a non-trivial and complex exercise, with many causal factors being ambiguous or debatable. For example, it certainly was not obvious before the Germanwings tragedy that combining secure cockpits with two-pilot operation increases the risk of ‘suicide by jetliner’. The second problem we face is in estimating the probability of these precursor events. Some probabilities seem relatively straightforward to estimate, e.g. simple component failures. But others, such as human error or software faults, are much more problematic. To get around these problems, a series of ‘make do’ simplifying assumptions are usually made: for example, to (erroneously) assume that people will operate the system in accordance with the rules, or that events are independent. Thirdly, our model may incorrectly capture the interactions between its elements. In the end, we end up relying upon theory-rich but fact-poor risk models, based upon contestable evidence and assumptions, which may be incomplete or incorrect in unknown ways. The evidence to date is that such approaches are unreliable [Vesely & Rasmuson 1984].
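
The ‘logical model’ being described is typically a fault or event tree. The fragment below is a generic, textbook-style fault-tree calculation with invented precursor probabilities; note that the gate algebra quietly embeds the independence assumption criticised above.

    # Generic fault-tree arithmetic with invented precursor probabilities. Both
    # gates assume the precursors are independent - one of the 'make do'
    # assumptions discussed in the text.
    from math import prod

    def and_gate(*p):     # all precursor events must occur
        return prod(p)

    def or_gate(*p):      # at least one precursor event occurs
        return 1 - prod(1 - x for x in p)

    pump_fails, valve_sticks, operator_misses = 1e-3, 5e-4, 1e-2   # per demand, invented
    cooling_lost = or_gate(pump_fails, valve_sticks)
    accident = and_gate(cooling_lost, operator_misses)
    print(f"modelled accident probability: {accident:.2e} per demand")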

A Broader View of Risk

But if quantitative risk is an uncertain guide, what do we do about it? A famous 1961 experiment by Daniel Ellsberg gives us the first hint. His experimental results showed that people overwhelmingly prefer a bet in which the odds are known to one in which the odds are unknown, even if the potential for winning might be greater in the uncertain case. Ellsberg’s experiment clearly demonstrates that humans view probability and uncertainty as distinctly different. The economist Frank Knight characterised this as the difference between the risk associated with simple randomness, such as coin tosses, and the ‘unmeasurable uncertainties’ of the real world. Given that we can’t neatly measure such uncertainties, attaching a number does not make our knowledge of the subject ‘better’. The problem is that, given the dominance of quantitative measures in discussions of risk, these numbers end up being inappropriately applied to such unmeasurable uncertainties. Stating, for example, that the likelihood of reactor core damage is precisely less than one event per 1.6 million years for a reactor design [Ramana 2011] might be reassuring, but it says nothing about what’s unknown, contested or assumed. As the philosopher Karl Popper pointed out, probability does not automatically correlate with corroboration. Perversely, such mathematical certitude may actually increase risk, because it skews our attention towards what is modelled rather than what is not, in a form of omission neglect.

So let’s engage in a small thought experiment for a moment. Imagine that there exists a continuum of uncertainty (referring to Figure 1), and that at one end we place a state of complete ignorance, or ontological uncertainty, while at the other we place a state of highly predictable randomness, or aleatory uncertainty, where we can use the familiar theories of Pascal and Fermat to fully characterise the situation. In between these two extremes lies the region of epistemic uncertainty (3). Obviously, we should match our management approach to the type of uncertainty we are dealing with. If risk is predominantly due to aleatory process variability, then traditional strategies, such as reliability engineering, would be appropriate, as would quantitative expressions of probability and risk. If the risk is more about epistemic uncertainty of knowledge, then robustness or diversity strategies would be better suited, with uncertainty expressed as plausible ranges. For those ontological risks of which we are ignorant, precautionary or possibilistic strategies (4) are optimal, while our degree of knowledge should be expressed as qualitative statements of possibility.

But in practice do we see these differing types of risk? A longitudinal study performed by Julie Go [Go 2008] on the RL10, NASA’s first hydrogen-fuelled rocket engine, demonstrates that perhaps we do. Go found that during the early life of the RL10, failure causes were dominated by epistemic uncertainty, such as how liquid propellant behaves in zero g. As these early uncertainties were resolved through design changes, the engine’s accident rate improved rapidly and then started to plateau at a residual rate dominated by process-variability-induced failures. Accidents were also found to occur immediately after major configuration or mission changes, indicating, unsurprisingly, that change also carries epistemic risk.
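
Read one way, the spectrum suggests a simple dispatch from the dominant type of uncertainty to a management strategy and a form of expression. The sketch below merely restates the article’s pairings as a lookup table; the category names are the article’s, the structure is an assumption of mine.

    # The article's pairings of dominant uncertainty type with management
    # strategy and with how knowledge should be expressed, restated as a table.
    STRATEGY = {
        "aleatory":    ("reliability engineering and other traditional strategies",
                        "quantitative probability and risk"),
        "epistemic":   ("robustness and diversity strategies",
                        "plausible ranges"),
        "ontological": ("precautionary / possibilistic strategies",
                        "qualitative statements of possibility"),
    }
    for kind, (approach, expression) in STRATEGY.items():
        print(f"{kind:11} -> {approach}; express knowledge as {expression}")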

Which brings us to the difficult question of how we can account for risks that originate in the far field of complete ignorance. While, by definition, we can’t identify them directly, we may nevertheless be able to imagine reasons for being surprised. Consider for a moment the Tacoma Narrows bridge disaster. When we compare the span-to-width ratio of Tacoma against that of its contemporaries, we find that Tacoma’s ratio is an outlier amongst the set of contemporary bridges, indicating the limits of design knowledge.


Figure 1: A Spectrum of Risk


And this limit meant, in turn, that the designers were completely unaware of the torsional flutter mode that doomed the bridge. Or we might consider the consistent underestimation of the risk of severe core-damage accidents by the nuclear industry, when compared with the reality, and conclude that it demonstrates a degree of over-confidence in its abilities (5). We could reasonably suppose that, if we see such factors, our chances of being surprised are increased, and that considering such indicators might provide us with an insight into our exposure to such risks (6).

If we’re serious about managing the entire spectrum of risks, our problem is not just how to measure the unmeasurable, but how to manage it. One approach, developed at NASA, is to establish an overall risk budget that includes both known, quantified risks and a quantitative uncertainty margin derived from an assessment of uncertainties. As the system is operated, this margin is retired or transferred into quantifiable risks as they are identified. This process of write-down reflects Carnap’s principle in that, as we use a system, we gain knowledge, and this increases the expected utility of our decisions [Good 1967] (7). Conversely, if we change the system or its use, the uncertainty margin is increased. In this modernist view of risk, there is a feedback loop between the observed system and the observer: initial assessments are highly theoretic and therefore epistemically uncertain but, as we continue to observe (and interact with) the system, our assessments become based more on evidence and are less uncertain. This approach also provides a context for traditional safety strategies, such as deliberately restraining innovation, considering design heritage, or the use of mitigating measures. The deliberate inclusion of an uncertainty margin also makes visible what we would otherwise tend to ignore, i.e. what we believe we don’t know.
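
The budget-and-margin bookkeeping described here can be pictured as a running ledger. The class below is a guess at the general shape of such a scheme, with invented names and figures; it is not NASA’s actual process.

    # A toy risk ledger: a fixed overall budget split between quantified risks
    # and an uncertainty margin. Identifying a new risk writes the margin down;
    # changing the system or its use writes it back up. Figures are invented.
    class RiskBudget:
        def __init__(self, quantified, margin):
            self.quantified = dict(quantified)     # name -> expected loss
            self.margin = margin                   # unallocated uncertainty

        def identify(self, name, expected_loss):
            """Transfer part of the margin into a newly quantified risk."""
            self.margin -= expected_loss
            self.quantified[name] = expected_loss

        def change_of_use(self, added_uncertainty):
            """A configuration or mission change increases the margin."""
            self.margin += added_uncertainty

        def total(self):
            return sum(self.quantified.values()) + self.margin

    budget = RiskBudget({"engine failure": 3.0}, margin=5.0)
    budget.identify("propellant behaviour in zero g", 1.5)   # knowledge gained in service
    budget.change_of_use(2.0)                                # new mission profile
    print(budget.quantified, budget.margin, budget.total())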

Conclusion

The idea of risk as a quantifiable property emerged from the great flowering of scientific thought of the 18th century, so it is no surprise that the application of risk to engineering has determinedly followed in those footsteps. But our understanding of uncertainty has moved on from the clockwork-universe view of the 18th century. Some risks, it turns out, are not just uncertain but unknowable before the event. Today we are slowly starting to recognise the need for tools to ‘know what we do not know’ about technology risks, and we are also learning that the strategies we use for control should be dictated by the types of risk we face. Fortunately, the work of economists, psychologists and philosophers over the last hundred years has built up a far broader view of uncertainty and risk. Perhaps it’s time we started using it.

Notes

(1) Kelly’s criterion, which incorporates the idea of time into risk decisions (i.e. recoverable loss rather than expected value), is a better way to evaluate such catastrophic risks.
(2) Dragon King events, i.e. extreme outlier events lying above a heavy tail, present an even greater challenge, e.g. the unprecedented extreme severity of the Fukushima disaster.
(3) Facts are never completely theory free, of course. Even the most well-characterised engineering knowledge retains a degree of epistemic uncertainty. Accidents can therefore proceed from well-founded, but still incorrect, beliefs.
(4) One of the earliest examples of a precautionary strategy is the insistence by Sir John Cockcroft, then head of the UK’s atomic energy research programme, that filters be fitted to the exhaust stacks of the Windscale reactors.
(5) The empirical severe core damage frequency is 1.6e-3 per year.
(6) A fuller list would include complexity, change, novel conditions, lack of knowledge, and over-confidence [Parker & Risbey 2015].
(7) Assurance standards, such as DO-178, perform a similar role.

References

[Go, S., 2008] A historical survey with success and maturity estimates of launch systems with RL10 upper stage engines. In 2008 Annual Reliability and Maintainability Symposium. IEEE, pp. 491–495.
[Good, I.J., 1967] On the Principle of Total Evidence. The British Journal for the Philosophy of Science, 17(4), pp. 319–321.
[Hofert, M. & Wuthrich, M.V., 2011] Statistical Review of Nuclear Power Accidents. SSRN Electronic Journal.
[Littlewood, B. & Strigini, L., 1995] Validation of Ultra-High Dependability for Software-based Systems. In Predictably Dependable Computing Systems. Springer Berlin Heidelberg, pp. 473–493.
[Parker, W.S. & Risbey, J.S., 2015] False precision, surprise and improved uncertainty assessment. Phil. Trans. R. Soc. A, 373, pp. 1–13.
[Peters, O., 2009] On Time and Risk. Santa Fe Institute Bulletin, 24, pp. 36–41.
[Ramana, M.V., 2011] Beyond our imagination: Fukushima and the problem of assessing risk. Available at: http://thebulletin.org/beyond-our-imagination-fukushima-and-problem-assessing-risk-0.
[Vesely, W.E. & Rasmuson, D.M., 1984] Uncertainties in Nuclear Probabilistic Risk Analyses. Risk Analysis, 4(4), pp. 313–322.

With a Bachelor’s degree in Mechanical Engineering and a Master’s in Systems Engineering, Matthew Squair is a principal consultant with Jacobs Australia. His area of practice is safety, software assurance and cyber-security, and he writes on these subjects at www.criticaluncertainties.com. He can be contacted at <[email protected]>.

Clear Thinking on Risk and the Murky Future
by Rhys David

First Thoughts

People in all cultures describe their experience of time by using imagery about physical space. Present time is universally defined in terms of where the individual currently is – ‘the here and now’. The future is usually thought of as being in front, and the past behind the person.

In their language and gestures, the Aymara people of South America refer to the future as ‘back days’, out of sight behind them, about to move forwards and become ‘front days’, or the past. Whichever analogy of time and place is used, the future can perhaps only be perceived ‘through a glass, darkly’.

Behavioural psychologists have identified ‘hindsight bias’ as a common trait of human thinking. As Daniel Kahneman said: ‘The illusion that we understand the past fosters overconfidence in our ability to predict the future.’

When I started working as an engineer in system safety more than twenty years ago, I thought of the discipline as being objectively analytical, but I have gradually come to appreciate its subjective dimensions. Recently, I have been working on information systems used for intelligence, and recognise some similarities between that field and safety. Both disciplines involve collecting information, sifting it for relevance, applying judgement to it, and communicating what it means to people making decisions about an uncertain future.

This article considers aspects not generally taught to safety practitioners, such as human cognitive limitations and biases, and how those can affect risk estimation, communication and decision making.

Thoughts on Thinking

Daniel Kahneman is well known for his work on the psychology of judgement and decision making.



With others, he established a cognitive basis for common human errors, arising from biases and heuristics (rules of thumb). That work underpins behavioural economics, for which Kahneman won a Nobel Prize, and is also particularly relevant to how people consider risks and take risk-based decisions.

Kahneman describes a model of human thinking as having two modes or ‘systems’. System 1 thinking is fast, instinctive and emotional, whereas System 2 is slower, more reasoning and more logical. The mind prefers to use System 1 thinking, as it requires less effort and is quicker, but it takes short cuts and often jumps to unjustified conclusions.

The model helps to explain why humans struggle to think probabilistically and why people often appear to make ‘irrational’ decisions when confronted with risk. Intuitive judgement is usually insensitive to the quality of evidence. People generally have only coarse settings when dealing with probability and are also prone to ‘optimism bias’, leading them to downplay the chances of something bad happening.

Many of the cognitive biases and heuristics that have been identified from experiments and real-world research are relevant to safety risk studies. For example, they affect how risks are estimated, evaluated and communicated, and how risk decisions are reached. The ‘availability heuristic’ (judging likelihood by how easily examples come to mind) and the ‘anchoring effect’ (producing estimates that are influenced up or down by irrelevant information) both affect how people estimate probability, even when they are trying to be objective and rational.

People are prone to ‘confirmation bias’ in the way that they treat information and evidence: overweighting data that supports their own beliefs or hypotheses, and downplaying contradictory evidence. As George Bernard Shaw said: ‘The moment we want to believe something, we suddenly see all the arguments for it, and become blind to the arguments against it.’

Risk studies often draw on the work of groups of people, and that can cause another layer of difficulties. At best, a well-functioning group of advisers, with wide knowledge and openness to challenge, can help a decision maker reach better decisions. But a group may also become deadlocked or be too cosy. Irving Janis identified the problem of ‘Groupthink’ and noted: ‘members of any small cohesive group tend to maintain esprit de corps by unconsciously developing a number of shared illusions and related norms that interfere with critical thinking and reality testing.’

Thinking about Risk

‘Risk’ concerns exposure to loss of something that we value, where the future is uncertain: for example, losing money, reputation or resources. It was Blaise Pascal in 1662 who first articulated the idea that risk should be seen as having two components: ‘many people … are excessively terrified when they hear thunder. … Fear of harm ought to be proportional not merely to the gravity of the harm, but also to the probability of the event.’

But why should anyone run the risk of losing something valuable? Sometimes it might be because the risk is completely unrecognised or because the individual cannot avoid it: for example the risk from a catastrophic cosmic event. Sometimes people view the possibility of loss as being so small as to be insignificant and not worth worrying about. But usually, the reason for putting up with risk exposure is that it brings the chance of some benefit, for instance in the form of thrills, or money, or business advantage.

Thoughtful Risk Decisions

Because people have different values, they will not all share the same views of which risks are worth taking for the potential benefits they bring. Practitioners in risk-based disciplines must appreciate this subjective dimension when analysing risk scenarios, when evaluating their significance, and when communicating their results and supporting risk-based decision making.

Historically, people would have taken risk decisions on an instinctive or emotional basis, often influenced much more by the potential benefits and losses than by a sound appreciation of the probabilities involved. Early progress on risk understanding and risk decision-making came from the field of gambling, and led to advances in probability theory.

All types of risk-based decision, including gambling, require subjective information on the value attached to potential benefits and losses, as well as objective information on the three common elements of risk-based decisions:
• Options (e.g. what bets are available at a gaming table);
• Outcomes (e.g. the payout for each type of winning bet);
• Uncertainties (e.g. the probability of each possible outcome).
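
As a small worked example of putting the three elements together, the sketch below evaluates two familiar European roulette bets by their expected value. The bets and payouts are the standard ones; the code is purely illustrative.

    # Expected value of two European roulette bets (37 pockets, 0-36).
    # Each option is (probability of winning, net payout on a one-unit stake).
    options = {
        "straight-up on 17": (1 / 37, 35),    # pays 35 to 1
        "red":               (18 / 37, 1),    # pays 1 to 1; 18 red pockets
    }
    for name, (p_win, net_win) in options.items():
        expected = p_win * net_win - (1 - p_win)     # the stake is lost otherwise
        print(f"{name}: expected value {expected:+.4f} per unit staked")
    # both work out at about -0.027: the uncertainties make every option a loser on average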

‘Decision theory’ is concerned with the choices that individuals make when faced with options and uncertainties. It considers how people choose on the basis of information, rather than gut feeling, and has tried to identify ‘optimal’ decisions that ‘rational’ people should reach. Various strategies exist for reaching good risk-based decisions, with greater or lesser degrees of objectivity required from the decision maker.

Because of the element of chance, not all risk decisions have a happy outcome. But it is important not to infer that a successful outcome was caused by a ‘good’ decision, nor that a bad outcome was necessarily due to a ‘bad’ decision. Some irrational long shots will pay off, but we should try to distinguish them from decisions that were ‘reasonable’, given the data, whether successful or not.

In some circumstances, the decision maker may not be the person who carries the risk of loss. This can lead to a ‘moral hazard’, with the decision maker likely to behave in a more risky way, because someone else bears the potential losses. This situation is of particular concern in financial markets, where institutions that are ‘too big to fail’ take on more and more risk, in the expectation that they will be bailed out if their gambles fail.

Progress in an Uncertain World

The word ‘risk’ comes from the early Italian risicare, which means ‘to dare’. Most risk decisions involve a desire for benefit, advantage or advancement in an uncertain future, and a willingness to risk losing something in order to achieve that benefit.

Extreme aversion to any risk exposure could prevent progress and eventually lead to stagnation. But not all change is necessarily progress, and, indeed, changes involving risk often happen without any formal decision-making about the possible benefits and losses. For example, the tide of technological revolutions, such as we have seen with the spread of the internet and may see with the introduction of autonomous road vehicles, happens with a momentum of its own and advances incrementally, without a single decision point.

In many gambling games, there is a finite set of possible outcomes: the coin can only land as heads or tails, the rolled die must stop with one of six faces upwards, and the roulette ball will end up in one of the 37 (or 38 in the USA) numbered pockets on the wheel. In such cases, the ‘sample space’ of possible outcomes of each event is well understood and the probability of each outcome can be calculated exactly, provided that the game is fair.

But in many other circumstances where chance plays a part, the range of possible consequences is not knowable, and the likelihood of any particular outcome can only be estimated, rather than calculated. John Maynard Keynes said: ‘By “uncertain knowledge” … I do not mean merely to distinguish what is known for certain from what is only probable. The game of roulette is not subject, in this sense, to uncertainty. … The sense in which I am using the term is that in which the prospect of a European war is uncertain, or the price of copper and the rate of interest twenty years hence, or the obsolescence of a new invention. … About these matters, there is no scientific basis on which to form any calculable probability whatever. We simply do not know!’

Predicting an Uncertain Future

Forecasting involves gathering relevant information and then applying statistical methods and judgements to make predictions about the future. Where there is a lack of relevant data or a completely new situation, more judgement is needed, drawing on opinions and subjective probability estimates.

Research has shown that ‘experts’ are generally overconfident in their ability to forecast the future. Philip Tetlock’s extensive study of the accuracy of forecasts in different fields has been summarised (slightly unfairly according to Tetlock) as concluding that ‘the average expert did about as well as a dart throwing chimp.’

Forecasters in many different fields express themselves qualitatively, often using language that is open to interpretation. If their forecasts are stated in probabilistic terms, then often those will be couched in vague words such as ‘a fair chance’ or ‘a real possibility’. Sherman Kent’s study of US intelligence analysis examined the intelligence assessment of the chances of the USSR invading Yugoslavia in the 1950s, which had been expressed by analysts as ‘a serious possibility’. When asked what that phrase implied, different analysts gave values ranging from 20% to 80%.

Using qualitative descriptions of likelihood in forecasts also allows ‘experts’ to hedge their bets and retrospectively alter what they meant. ‘A fair chance’ can be interpreted with hindsight to have meant something much smaller than 50% for predicted events that didn’t happen, and much greater than 50% for those that did. The forecaster who uses such slippery language appears to be a winner either way. And studies have found that forecasters frequently misremember their own unsuccessful predictions as having been better than they were.

People tend to trust forecasts that are stated confidently more than they do ones presented with qualifications and caveats. Even in complex situations, the human mind seems to be more comfortable with simple narratives. Thus, safety practitioners would be wise to heed Daniel Kahneman’s advice: ‘It is wise to take admissions of uncertainty seriously, but declarations of high confidence mainly tell you that an individual has constructed a coherent story in his mind, not necessarily that the story is true.’


Final Thoughts

A key element of competence is recognising and respecting the bounds of one’s own abilities. Some of those limits are due to lack of training and experience, but others are a natural consequence of the cognitive limitations and biases that affect us all.

The Engineering Council has published six principles to guide engineers in how they deal with risk. I suggest that safety practitioners should also follow additional principles:
• Be aware of human behavioural principles and guard against cognitive limitations and biases;
• Ensure that risk assessments and advice to decision makers are explicit about uncertainty;
• Understand group dynamics and how to apply critical thinking to test shared views;
• Understand risk-based decision making processes and recognise that values attached to potential benefits and losses are subjective;
• Be willing to give and receive challenge and to change opinion in the face of contradictory evidence.

As Richard Feynman said: ‘I would rather have questions that can’t be answered than answers that can’t be questioned.’

Rhys David is a partner in Safety Assurance Services Ltd. He may be contacted at <[email protected]>.


From Concepts to Certified Components and Systems
by Paul E Bennett

Getting from the point of having an initial concept, through the maze of challenges to be faced, to providing a fully certified component or product is not an easy path. The path can seem all the more daunting when the final component or product is part of, or the whole of, a safety-critical system. With such systems, the issues of certification and proof come into sharper focus than with non-critical systems. However, by adding systems-level thinking, the way can be made smoother and more certain.

In his 2015 Safety-critical Systems Symposium paper, Derek Fowler [1] stated three arguments, which we will use as a starting point for assessing the systems we are producing:
1. The system has been specified to be safe – for a given set of safety criteria, in the stated operational environment;
2. The resulting system design satisfies the agreed specification;
3. The implementation satisfies the system design.

Fowler further stated that such demonstration is given by provision of:
• Direct evidence – which provides actual measures of the attributes of the product (i.e. any artefact that represents the system) and is the most direct and tangible way of showing that a particular assurance objective has been achieved.
• Backing evidence – which relates to the quality of the process by which those measures of the product attributes were obtained, and provides information about the quality of the direct evidence, particularly the amount of confidence that can be placed in it.

Establishing the Process

It is necessary to employ development processes that assist in the collection of the required level and quality of evidence to support any certification claim. This in turn requires a sufficient level of review at all stages of development; a means by which information is securely held and protected, and older versions are archived; and the discipline to ensure that this robustness is always applied by the whole development team.

All projects should be conducted in accordance with the Association of Project Managers’ preferences, with three distinct phases to the process from initial concept to delivery to the client.

Big projects will, by their nature, consist of many smaller sub-projects. The creation of any safety-related or safety-critical component must be seen as a project in its own right. The three phases are:
• Analysis Phase – where initial assessments of the required tasks (both human and machine) are made, options are explored, requirement specifications are created, and initial risk assessments are performed, leading to risk reduction strategies being defined as a result. The output of the analysis phase should be a robust statement of all testable requirements.
• Design Phase – where the technical design builds on the requirement specifications provided and continues risk reduction through the design of appropriate mitigations. This phase should lead to all of the information required to manufacture the end product, including the logistics of final assembly and the correct operation and maintenance protocols.
• Build Phase – where the construction effort is deployed and the final product is produced and delivered; all system-level testing is accomplished and final acceptance is gained.

Once delivered, the way a system is operated and maintained is in the hands of the operator. It would seem prudent to include the eventual operator in some of the discussions during all of these phases, and to consider providing suitable training in the operation of the system, where practicable. Such involvement may lead to new ideas or better ways of performing some operations.

Page 44: SAFETY SYSTEMS€¦ · Development of Thinking About Risk The 1980s decade was a period of change in the safety world. Not only was there rapid replacement of electromechanical control

Page 44SCSC Newsletter

need to be able to properly manage changes that may be requested or demanded without sacrificing defined system integrity.

A development process that the author has been utilising since the early 1980s (see Figure 1), and which he presented at SSS '98 [2], has the capacity to manage change and to ensure correct component design. There are essentially four review stages built into the process, the first being within the Requirements Acceptance step, which lies outside the figure.

The whole process needs just four forms and a register that logs all of the activities being performed. The process is applied hierarchically and operates across the whole project, down to the smallest component, in a repeat-and-step pattern. The forms and register contents become the audit-trail artefacts that prove how the project development proceeded. Audit reports based on this document trail provide important backing evidence for safety cases (showing proper application of the process).

These four forms are Review Overview (f1), Change Proposal (f2), Work Instruction (f3) and Problem Report (f4). Change proposals may spawn several Work Instructions specific to each discipline involved in the change. The creation and closure of each of the forms is recorded in the register along with all the other project-related meta-data.
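As a purely illustrative sketch (the class names, fields and identifier scheme below are assumptions, not part of Bennett's actual forms), the four forms and the register could be modelled along these lines:

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Optional


class FormType(Enum):
    REVIEW_OVERVIEW = "f1"
    CHANGE_PROPOSAL = "f2"
    WORK_INSTRUCTION = "f3"
    PROBLEM_REPORT = "f4"


@dataclass
class Form:
    form_type: FormType
    form_id: str          # hypothetical identifier scheme, e.g. "f2-0042"
    subject: str
    opened: datetime = field(default_factory=datetime.utcnow)
    closed: Optional[datetime] = None


class Register:
    """Append-only log of form creation/closure and other project meta-data."""

    def __init__(self) -> None:
        self.entries = []

    def log(self, event: str) -> None:
        self.entries.append((datetime.utcnow(), event))

    def open_form(self, form: Form) -> None:
        self.log(f"OPENED {form.form_type.value} {form.form_id}: {form.subject}")

    def close_form(self, form: Form) -> None:
        form.closed = datetime.utcnow()
        self.log(f"CLOSED {form.form_type.value} {form.form_id}")


# A change proposal spawning a discipline-specific work instruction.
register = Register()
cp = Form(FormType.CHANGE_PROPOSAL, "f2-0042", "Revise over-temperature trip limit")
register.open_form(cp)
wi = Form(FormType.WORK_INSTRUCTION, "f3-0117", "Update limit constant (software discipline)")
register.open_form(wi)
register.close_form(wi)
register.close_form(cp)
```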

Whilst the above process is simple enough to manage manually, it can also form the basis of a fully version-controlled and change-managed process with electronic aids. The author has used the MKS [Mortice Kern Systems] products 'Source Integrity' and 'Track Integrity' to implement the process this way. There are other suitable versioning and change-tracking products in the market that could also be configured to this pattern.

Figure 1: Component and Document Management Process

Managing Information Flow
All projects produce large quantities of information, data and meta-data. It is incumbent on project management to ensure that all of this data is stored securely such that tampering, inadvertent editing and handling errors are minimised, or even eliminated. In some companies the post of Document Registrar is a valued position that is entrusted with keeping all contractual, intellectual property, design and formal communications safe and secure, while providing appropriate access to the development and management personnel. The Document Registrar also has to track the status of all elements of the information under his purview and to manage the official release of documentation to the appropriate members of the workforce and to the customers. Such releases of information usually require a specific form for paper-based dispatch, or they might be sent electronically.

In using a version-controlled repository, the flow of work and status of items at every stage of the process needs to be known. Hence, the repository, shown in Figure 2, has three distinct sections.

Work in Progress Repository
The least regulated of the three repository sections, the Work In Progress (WIP) repository still implements some controls to maintain some sense of order.

Approved Pre-release Repository
Once work has been reviewed and considered suitable for submission to final system testing, it is placed in the Approved Pre-release repository. Developers can still utilise this as a more certain basis for components. Only after passing the formal review and test procedures set down for approving the quality of the product under consideration will the product be permitted to be stored here.

Authorised Released Product Repository
The third repository stores the final, fully tested and approved artefacts that can be released to the final system assembly or to the customer. Only artefacts that have passed the defined review and test regimes can be stored here.

Figure 2: Triple Repository Workflow

The Work-flow
The developers, reviewers and system test and certification personnel have their own sandbox area in which to play with the system. Sand-boxing has been a well-established technique to ensure that unexpected interference with any work in progress is not encountered. It keeps others out of the work-space of individuals and ensures the work of others is not disrupted by another's own experimentation or changes.

With a formal review or test being the only way to export a component, artefact or product to the world, it is a useful means of maintaining sane development activity. Developers can always post their latest versions back to the WIP repository, making it available to others to experiment with in their own sandboxes. Only when they feel they have a product ready do they post a notification to the appointed reviewers.

Reviewers have read-only access to the three repositories. They only review the artefacts which they have been notified are ready for review. They are able to run some testing, in their own sandbox area, to ensure full interoperability is maintained across the project. The outcome of a review is either a pass into the Approved Pre-Release Repository or a list of issues to be resolved or reworked (mentioned on the review form that they must complete). Reviewers are not permitted to make changes to any artefact. They are, however, permitted to promote artefacts into the Approved Pre-Release Repository. When system components are ready for approval testing they can be released for the next step.

The Final System Testing, Reporting and Certification stage has its own sandbox area in which all testing of the components and artefacts that make up the integrated final system is performed. The system testers have read-only access to the approved pre-release repository and the already-certified components repository. Testing is in accordance with a formal test specification. This stage also gathers the component certification and other process-application evidence for compilation into the safety case as supportive evidence. Failings discovered at this stage will cause the rework route to be activated and the product will not be released to the authorised released-product repository. Testers can only flag components as ready for release, but the reviewers have to approve the release once they have examined the test evidence.

The successful outcome of all this work-flow is stored in the final Authorised Released Component and Product Repository, which contains only components, artefacts and system products that can be depended upon.
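The promotion rules just described amount to a small state machine. The following is a minimal sketch of that idea, with assumed names and simplified rules rather than any real tooling:

```python
from enum import Enum, auto


class Repo(Enum):
    WIP = auto()          # Work In Progress
    PRE_RELEASE = auto()  # Approved Pre-release
    RELEASED = auto()     # Authorised Released Product


class Artefact:
    def __init__(self, name: str) -> None:
        self.name = name
        self.repo = Repo.WIP
        self.flagged_ready = False   # set by system testing, acted on by reviewers


def review_pass(artefact: Artefact) -> None:
    """Only a passed formal review promotes an artefact out of WIP."""
    if artefact.repo is Repo.WIP:
        artefact.repo = Repo.PRE_RELEASE


def system_test_pass(artefact: Artefact) -> None:
    """Testers may only flag readiness; they cannot release anything themselves."""
    if artefact.repo is Repo.PRE_RELEASE:
        artefact.flagged_ready = True


def reviewer_release(artefact: Artefact) -> None:
    """Reviewers release only artefacts flagged by testing, after examining the evidence."""
    if artefact.repo is Repo.PRE_RELEASE and artefact.flagged_ready:
        artefact.repo = Repo.RELEASED


unit = Artefact("over-temperature trip module")
review_pass(unit)
system_test_pass(unit)
reviewer_release(unit)
assert unit.repo is Repo.RELEASED
```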

A Final Word about Certification
Certification implies, by its issue, that someone is guaranteeing that the artefact being certified is fully in accordance with the performance and limitations stated in its data-sheet. All hardware components have a data-sheet that explicitly states the properties of the component and the limitations beyond which performance is no longer guaranteed. This holds for electronic components, electrical components and even the humble nut and bolt. Why should software be any different?

It is, of course, noticeable that hardware components have surfaces you can see and touch. Taking a component-oriented viewpoint should allow you to clearly define the surfaces of your software components. Each software component should have a data-sheet that describes the interfaces, performance characteristics, methods of application and limitations beyond which it will not operate as defined.
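As a hypothetical sketch of what such a data-sheet might record for a software component (the fields and the example component are illustrative assumptions only):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SoftwareDataSheet:
    """Hypothetical fields; a real data-sheet would follow the project's own template."""
    component: str
    version: str
    interfaces: dict          # name -> signature or protocol description
    performance: dict         # e.g. worst-case execution time, memory footprint
    limitations: tuple = ()
    application_notes: tuple = ()


sheet = SoftwareDataSheet(
    component="pid_controller",
    version="2.3.1",
    interfaces={"update": "update(setpoint: float, measured: float) -> float"},
    performance={"WCET": "120 us on the reference target", "static RAM": "64 bytes"},
    limitations=("sample period fixed at 10 ms",
                 "gains outside 0.0-10.0 not guaranteed stable"),
    application_notes=("call update() exactly once per control cycle",),
)
```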

By adopting the component-oriented viewpoint, a concentration on these facets can make the software seem as real and versatile as many of its hardware counterparts. Certification effort can then focus on the surface-to-surface characteristics of the individual component and, by the fact that there is a published data-sheet for this component, it can become part of a very reusable library and can be selected and included in a system. If such components are drawn from the Authorised Released Product Repository, they are already certified against their own data-sheet and then only need evaluation for their suitability and use in the current project. This technique also aids component re-use in many other projects by providing the ability to select already-certified components.

Summary
This article has described the journey from initial concept through to the final certified product. Whether that product is a component or a full system matters little, as the process and work-flow are applied equally at every level. A development process and work-flow arrangement have been shown and described. Finally, the notion of software as another component was broached, and the need to create a data-sheet for each and every software component, as though it were a hardware one, with performance and limitations, was also described.

References
[1] Derek Fowler 2015. Functional Safety by Design – Magic or Logic? Proceedings of the 23rd Safety-critical Systems Symposium, Bristol, UK, February 2015.
[2] Paul E. Bennett 1998. Small Modules as Configuration Items in Certified Safety Critical Systems. Proceedings of the 6th Safety-critical Systems Symposium, Birmingham, UK, February 1998.

Paul E. Bennett IEng MIET is a Systems Engineering Consultant in High Integrity Distributed Embedded Control Systems (HIDECS). He may be contacted at <[email protected]>.


The Common Safety Method and CENELEC Standards – Has the wheel been reinvented?
by Odd Nordland

In 2004 the European Union issued the railway safety directive (Directive 2004/49/EC) as part of its long-term policy aimed at unifying and harmonizing railways in Europe. The directive required that 'Common safety targets (CSTs) and common safety methods (CSMs) should be gradually introduced …'. The European Railway Agency, ERA, was charged with the task of issuing recommendations concerning CSTs and CSMs. The first issue of a CSM regulation came in 2009; it was amended in 2013 and has been amended again in 2015. The amendments mainly concern making the wording more consistent and clarifying texts that could be misinterpreted; the fundamental safety method was not changed.

The CENELEC (European Committee for Electrotechnical Standardization) railway standards, which have now been around for over 10 years, are mandated by EU regulations, in particular the Technical Specifications for Interoperability (TSIs). They already contain a description of risk and safety management, so it would appear that the CSM regulation is a case of reinventing the wheel. So let’s take a closer look.

The currently valid version of the CSM regulation is 402/2013/EC, with the amendments in 2015/1136/EU. As stated earlier, the amendments do not affect the method. The regulation starts with ‘legal blah-blah’ that is, actually, quite useful! It contains definitions and explanations that facilitate understanding the requirements, and the actual method is described in Annex I. It starts with general principles, followed by a description of the risk assessment process. This is an iterative process that comprises ‘the system definition’, ‘the risk analysis including the hazard identification’ and ‘the risk evaluation’.

For hazard identification it requires 'The proposer shall systematically identify … all reasonably foreseeable hazards …' and 'All identified hazards shall be registered in the hazard record …' Safety measures that are derived from hazard identification shall also be registered in the hazard record.

For risk evaluation, three risk acceptance principles are proposed: ‘the application of codes of practice, a comparison with similar parts of the railway system, or an explicit risk estimation’. Code of practice is defined as ‘a written set of rules that, when correctly applied, can be used to control one or more specific hazards’. A reference system is defined as ‘a system proven in use to have an acceptable safety level and against which the acceptability of the risks from the system under assessment can be evaluated by comparison’. And risk estimation is defined as ‘a process … consisting of … estimation of frequency, consequence analysis and their integration’. The CSM regulation doesn’t say anything about how frequency and consequence analysis should be integrated, and its definition of risk is ‘the frequency of occurrence of accidents and incidents resulting in harm … and the degree of severity of that harm’, so it is left up to the experts to define the method that they use.
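One common way for those experts to perform that integration is a frequency–consequence matrix. The categories and mapping in the sketch below are illustrative assumptions only, not taken from the regulation:

```python
# Categories, labels and the acceptance mapping below are illustrative
# assumptions, not taken from the CSM regulation or the CENELEC standards.
RISK_MATRIX = {
    # (frequency, severity) -> risk category
    ("frequent",   "catastrophic"): "intolerable",
    ("frequent",   "critical"):     "intolerable",
    ("occasional", "catastrophic"): "undesirable",
    ("occasional", "marginal"):     "tolerable",
    ("remote",     "marginal"):     "tolerable",
    ("improbable", "insignificant"): "negligible",
    # ...remaining cells would be filled in by the project's own risk policy
}


def risk_category(frequency: str, severity: str) -> str:
    """Integrate an estimated frequency with a consequence severity."""
    return RISK_MATRIX.get((frequency, severity), "assess explicitly")


print(risk_category("occasional", "catastrophic"))   # -> undesirable
```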

Codes of practice and reference systems may be used when an analysis shows that 'several or all hazards are appropriately covered …', while an explicit risk estimation shall be performed 'if hazards are not covered by one of the two risk acceptance principles' just named.

Hazard management is based on creation and maintenance of hazard records and shall be documented with at least a description of the organisation that carried out the risk assessment, the results of the risk assessments and a list of resulting safety requirements, evidence of fulfilment of those safety requirements and any assumptions that were made during system definition, design and risk assessment. The documentation shall be assessed by an independent assessment body (‘ABo’).

The CENELEC standard, EN 50126, on the specification and demonstration of Reliability, Availability, Maintainability and Safety (RAMS), came into force in 1999. It requires that 'Risk analysis shall be performed at various phases of the system life cycle … and shall be documented'. The documentation shall contain at least the methodology, assumptions, hazard identification results, risk estimation results, trade-off results (e.g. safety vs. availability or security vs. maintainability), data and references. Risk evaluation is based on explicit estimation of risks using a 'frequency – consequence' matrix. For risk analysis it requires the analyst to 'Systematically identify … all reasonably foreseeable hazards' and the establishment of a Hazard Log that shall contain detailed information about each hazard and shall be updated throughout the system's life cycle.

The standard, EN 50129, on safety-related electronic systems for signalling, is a normative reference in EN 50126, so it is to be regarded as applicable also to other parts of a railway system where this makes sense. This is particularly the case for the requirements for a Safety Case that are described in detail in the standard. The Safety Case shall be assessed by an independent safety assessor (ISA). The standard came into force in 2003 and also contains an example of a risk analysis process that involves System Definition, Hazard Identification, Consequence Analysis, Risk Estimation, Tolerable Hazard Rate Allocation and Hazard Control.

From the above it can be seen that the processes as defined in the CENELEC standards are very similar to the process described in the CSM regulation, albeit with a greater level of detail and slightly differing terminology. So do we need the CSM regulation? For a supplier who follows the CENELEC standards, the answer would be no if it weren't for the fact that the CSM regulation is mandatory. But demonstrating fulfilment of the CSM regulation is implicitly included when compliance with the CENELEC standards is demonstrated.

Nevertheless, there are some differences. The CENELEC standards only foresee explicit risk estimation. The measures that are used to mitigate the estimated risks must be shown to have the desired effect, and this usually involves calculating the residual risk after the measure is successfully applied. In some cases, however, ‘proven in use’ measures can be applied without the need for a calculation of their effect.

The CSM regulation goes much further. It allows use of a reference system, i.e. a similar system for which the safety properties are already known from experience. It is then sufficient to demonstrate that a new system is sufficiently similar so that it can reasonably be expected to behave in a similar way. This is not necessarily a simple exercise, but it is a pragmatic approach that can be very economical when existing systems are upgraded.

Another possibility that the CSM regulation allows is the use of codes of practice. This is a debatable approach: it is based on the idea that if you follow the rules and do things in the generally accepted way, the result will be safe. This can be true, but it doesn't have to be!

The generally accepted rules are the result of many years of experience and have probably been quite successful. But a systematic analysis, to confirm that they really cover all eventualities, is unlikely ever to have been performed.

From a strategic point of view, there can be benefits in approving a code of practice. It is much easier to document compliance with a code of practice than it is to calculate frequencies and consequences of hazardous events and then calculate the effect of mitigating measures. And the burden for the assessor is substantially reduced. But without some form of empiric evidence that it actually works, the other alternatives in the CSM regulation should be preferred.

One substantial difference between the CENELEC standards and the CSM regulation concerns the safety assessor. The CENELEC standards require the safety assessor to be an independent third party who is approved or authorised by the national safety authority. The authorisation applies to the individual and requires that the national safety authority evaluate the qualification and independence of the individual assessor on a case-by-case basis. The CSM regulation uses the term 'assessment body' as a generic term for whoever performs the safety assessment, and that can be the independent safety assessor according to the CENELEC standards, an accepted internal assessor within the supplier's organisation, or even the national safety authority itself. So the 'ABo' is not a new role beside the 'NoBo' (Notified Body) and 'DeBo' (Designated Body). Indeed, the CSM regulation explicitly states that work should not be duplicated, so if there already is an assessor on the job, then that assessor should also evaluate compliance with the CSM regulation. But there is a caveat: whereas the CENELEC standards require the assessor to be individually approved by the national safety authority, the CSM regulation goes for accreditation of the assessor's organisation. Individual approval is still allowed, but the majority of safety authorities appear to ignore this possibility and happily delegate the responsibility for determining the qualification and independence of an assessor to the accreditation body that certifies an organisation rather than a person. One of the consequences is that the big certification organisations get rid of the small competitors that cannot afford the expensive accreditation process, so that a number of excellently qualified assessors are no longer eligible to do the job. And the responsibility of evaluating the suitability of an assessor is transferred from the national safety authority to the accreditation body. In the long run, transferring responsibility away from the national safety authority will make it superfluous, but then we will have a single European Rail Agency instead. It will take a long time to get there, but the EU has shown in the past that it thinks in longer time units than the electoral rhythms of the member states, so it will be interesting to see how things develop.

Odd Nordland started his professional career as an independent safety assessor for nuclear power control systems in 1984. From 1997 he worked as an ISA for railway control systems until his retirement in 2016. He can be contacted at <[email protected]>.

21st Century Software Development: Largely cloudy with occasional bright periods. Rain expected
by Les Hatton

First of all, I would like to thank Felix for inviting me to write an article but most of all for his terrific efforts of the last couple of decades in editing this newsletter and generally trying to keep the momentum going in safety-critical systems.

I paused my own scientific career around 30 years ago to study software defects. Given its impact, I might as well have studied the impact of mobile phones on beetles, but I'm not sorry, as I have observed or discovered lots of interesting things about systems generally. To my sorrow, however, the enduring lesson has been that very few people actually care about software failure until it's too late. Even with a good system, which engineers would naturally try to make better for the same price, their management want to make it 'as good' but cheaper. This is management speak for 'cheaper' – they have no idea what 'as good' even means, and I'm not entirely sure that we as software developers do either. We don't do prevention, so I've given up and gone back to science on the grounds that it is slightly less masochistic. I don't think I can stand the Groundhog Day experience of rummaging round in the bowels of yet another failed system.

You might dismiss my assertion above as the febrile wanderings of an old curmudgeon, so let me explain why I don't think we care enough and why, as a result, engineering in software is often dismal and frequently non-existent. The best engineering has always evolved by understanding its sources of defects through careful measurement and analysis and systematically eliminating them by novel designs and/or implementations. It takes time and patience, but it's not rocket science. Unfortunately, software development is in far too much of a rush to bother with this simple but reliable process and we are simply drowning in code, paradigms and processes as a result. It won't make me any more popular, but it seems to me that academic Software Engineering (as it is usually called) has had very little beneficial effect on the systems we actually have to use. Let me pick a few pressure points just as a backdrop.

The automotive industry seems hell-bent on stuffing as much software as possible into the humble car, whatever its perceived value, and there now appear to be tens of millions of lines of code squirrelled away in numerous systems (Mossinger (2010) – one of a number of articles in the Software Impact section of IEEE Software reporting gigantic systems). Not surprisingly, there are now frequent recalls (just google them) and a glance through the supporting documents of some of these, particularly the ones that reach the courts, reveals a surprising lack of awareness of important principles of engineering, and reliance on dubious and ancient software metrics such as cyclomatic complexity. This great outpouring of creativity has now sprouted new heads like a latter-day Hydra. A few months back, a car was hacked by professional hackers through its entertainment system and reversed into a ditch (they told the driver first so he could get out). At the same time, VW added the 'software cheat' to the software development lexicon, but don't seem to know where it is or who put it there. The last time I looked, there seemed to be quite a lot of software in a car which could be called safety-related, but I must have been mistaken. Silly me.

Speaking of security, at the time of writing, TalkTalk appear to have lost some 1.3 million customer records to hackers (at the latest count) through a SQL-injection attack. One of the first things we used to teach students in university when soliciting data in web-sites was to protect against these. It's not as if we haven't seen them before, and I am frankly amazed that we have this level of incompetence in the 21st century. They are, of course, not alone. Loss of personal data is essentially an epidemic. We either give it away, lose it, or have it taken away. Of course you don't need software to lose personal data, but it certainly helps when you want to lose lots of it, and most personal data loss seems to involve sloppy software development in one way or another.
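The defence against SQL injection that we used to teach is decades old: parameterised queries, which treat user input purely as data rather than as part of the SQL statement. A minimal sketch of the difference (the table and data are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, email TEXT)")
conn.execute("INSERT INTO customers VALUES ('Ada', 'ada@example.org')")

user_input = "Ada' OR '1'='1"   # a typical injection payload

# Vulnerable: string concatenation lets the attacker rewrite the query.
unsafe = "SELECT email FROM customers WHERE name = '" + user_input + "'"
print(conn.execute(unsafe).fetchall())        # returns every customer's email

# Safer: a parameterised query treats the whole input purely as data.
safe = "SELECT email FROM customers WHERE name = ?"
print(conn.execute(safe, (user_input,)).fetchall())   # returns nothing
```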

Requirements seem still to be very unevenly applied. As a result of the ubiquitous sensor-riddled smartphone, anybody can be found anywhere on the planet to be sold useless junk, but, with bitter irony, we can still lose a 350 ton aeroplane somewhere in the Indian Ocean. Perhaps all systems are doomed to be evolutionary in some sense as we never seem to know what we want at the start, and subsequent fiddling with a few million lines of code, which has incrementally sprouted in order to add a few more new features or even fix what we thought we were trying to do in the first place, remains a profoundly error-prone process. I sometimes hear talk of ‘self-healing’ systems and other wonders of the abstract, but I shouldn’t have to remind readers that a swamp, too, is self-healing.

Software has been of great benefit in the latter part of the 20th century, but we are in danger of losing really important concepts as it continues to permeate everywhere like bindweed. Perhaps the most serious loss in the long run is that of Popperian reproducibility in science. This has been a personal hobby-horse of mine for perhaps 25 years and is why I started studying software defects in the first place. The very essence of the scientific method is independent reproducibility. If a result could not be reproduced independently, there was a time when it was ruthlessly discarded, as it must be, to make sure progress is based on robust results; but not any more. The influx of massive amounts of software and computation into most sciences has added a new and unquantifiable layer of opacity. As a result, very, very few scientific results with significant amounts of computation are reproducible in the Popperian sense. Even the most famous scientific journals, such as Nature and Science, struggle with this and still fall far short of enforcing it, relying instead on rather vague and ineffective concepts such as 'code-sharing'.

I really don't believe this can be over-stated, so I give it a sentence to itself. If a scientific experiment involving significant computation is not accompanied by the complete means to reproduce its results, then it is NOT science. Anything short of this is unacceptable. Of course the problem here is that even the most basic of software development concepts, such as careful revision control and designing software to be testable, are generally not taught to scientists. Putting together a complete reproducibility package is not difficult, but it is a little time-consuming for the scientist in a hurry (Hatton 2015). That example allows each diagram, table and statistical result to be reproduced from an information-theoretic study of post-translational modification of amino acids in proteins, including regression tests, all source code and the means to run it, starting with a download of a 2.7Gb open-access protein database; open data => open source => open reproduction. I couldn't get this one published because the journal carrying the main article didn't see what the fuss was about.
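To give a flavour of what 'the complete means to reproduce its results' can look like, here is a minimal, hypothetical driver script; the URL, checksum, file names and commands are placeholders, not the contents of the actual Hatton (2015) package:

```python
"""Minimal sketch of a reproducibility driver. All names here are placeholders."""
import hashlib
import subprocess
import urllib.request
from pathlib import Path

DATA_URL = "https://example.org/protein_database.tar.gz"        # placeholder
DATA_FILE = Path("protein_database.tar.gz")
EXPECTED_SHA256 = "<published checksum goes here>"               # placeholder


def fetch_data() -> None:
    """Download the open-access input data and verify its integrity."""
    if not DATA_FILE.exists():
        urllib.request.urlretrieve(DATA_URL, DATA_FILE)
    digest = hashlib.sha256(DATA_FILE.read_bytes()).hexdigest()
    assert digest == EXPECTED_SHA256, "input data does not match the published checksum"


def rebuild_results() -> None:
    """Re-run the analysis and regenerate every figure and table from scratch."""
    subprocess.run(["python", "analysis.py", str(DATA_FILE)], check=True)
    subprocess.run(["python", "make_figures.py"], check=True)


def regression_tests() -> None:
    """Compare the regenerated statistics against the published values."""
    subprocess.run(["python", "-m", "pytest", "tests/"], check=True)


if __name__ == "__main__":
    fetch_data()
    rebuild_results()
    regression_tests()
```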

One other area where I see problems developing rapidly is the alienation of many users from the systems they have to use. As management grabs any opportunity presented by software to reduce the number of people they have to employ, those people are replaced by some simply awful systems designed by programmers who do not appear to know what a human being looks like, let alone behaves like. Today I walked into my local Barclays Bank to be greeted by two new machines which have recently replaced counter staff, but which necessarily have to be attended by one dramatically over-worked staff member whose job was to explain to the customers how to work them. One of their engaging features was that the screen buttons were offset downwards from where you pressed if, like me, you are tall. You had to press the one 'above' to get the one you thought you wanted. Unsurprisingly, the queue was gigantic for the other staff member allocated to handle the numerous things the machines couldn't do, but no less gigantic than the one attempting to check in to Easyjet at Gatwick on the way to Lisbon on holiday last week.

There, every automated check-in machine suddenly decided to refuse luggage without any explanation whatsoever, again with frantic staff running up and down trying to get the machines to function by over-riding them with swipe cards. The week before, it was the turn of BA at Heathrow on my way to Germany, where the first two online check-in machines refused to check me in, issuing the now traditional incomprehensible messages, and the third said I had already checked in and refused to issue a boarding pass. I am long past the days when I used to march up to the check-in desk and report in vivid detail exactly what went wrong and where I thought it was. Now I don't care any more, so I just slink up, say I'm jolly old and grin apologetically. Many years ago, the doyen of software testing, Glenford Myers, told me that it's not going to get any better unless we tell the supplier how awful it is. Sorry, Glen, I tried, I did and it didn't. The suppliers have neatly side-stepped this, anyway, by removing anybody human you can talk to and making you complain to another awful bit of software that their first bit of awful software is, well, awful. Sigh.

There are occasional bright periods and I still get a buzz from using a really good piece of software. I am terrifically impressed by the exemplary reliability of important communication components of the web: mail handling agents like Postfix, which I use a lot, transport layer security components, and many others too numerous to mention. And I have had nearly 20 years of what lawyers euphemistically term 'quiet ownership' in using the Linux kernel. These are all open source, which, as the flag-bearer for computational reproducibility, is, happily, also the potential saviour of the scientific method.

We and generations to come owe a debt of thanks bigger than we can possibly imagine to open source software, but how does this fit in with what we call software engineering? I really have no idea. Why do we continue trying to fool the public that software has anything to do with engineering? If it did, why do we produce so many dreadful systems, or is it time for my medicine again?

References
Hatton, L. (2015) 'Reproducibility package for pone_0125663.html', http://www.leshatton.org/pone_0125663_reproducibility.html
Mossinger, J.M. (2010) 'Software in Automotive Systems', IEEE Software, 27(2), pp. 2-4

Les Hatton is an Emeritus Professor of Computer Science at Kingston University. He also has a shed in which he applies Information Theory to try to make sense of enormous protein databases. Yes, really. <[email protected]>

Technology, Security and Politics
by Martyn Thomas CBE FREng

We live in the Age of the Computer. It was conceived in the 1930s, born in 1948 [1], and has spread worldwide, fed by the exponential hardware improvements that were forecast by Gordon Moore in the 1960s and that we now call Moore’s Law.

Computer hardware has become so cheap that we can put it anywhere. There are billions of computers in the world, processing and sharing extraordinary amounts of data. Modern computers and modern communications are fast, cheap and reliable; to have achieved all this in less than 70 years has required remarkable feats of hardware engineering.

Computer-based systems are very varied and complex, and the variety and complexity are mainly in the software, so that the hardware components can be general-purpose and cheap. So, software is complex and software systems have also become surprisingly large. Microsoft Office is 45M lines and a modern car contains around 100M lines of software. Developing such large and complex software systems so that they are acceptably reliable has required remarkable feats of engineering; unfortunately, developing them so that they are secure seems to be beyond the current state of industrial capability.

Software engineering has taken a different path from that taken by hardware engineering. Hardware engineers are producing products that will be sold in very large volumes and that cannot be easily updated in the field. Correctness and reliability are extremely important for hardware components, and if the manufacturers make a significant error in the design it will seriously damage their business. Hardware engineers therefore put a lot of effort into specifying, analysing and verifying their designs. In common with almost every other branch of engineering, their methods and tools are based on peer-reviewed science and mathematics.

In contrast, software products are sold on the basis of functionality and they are usually either so cheap and unimportant that customers are willing to discard them if they seem unsuitable, or they are updated regularly to correct defects. Because it is hard for customers to tell whether software is reliable and secure before making a decision to download or to purchase, and because software companies reject all liability for the consequences of software failure, the market does not provide strong financial incentives for most software products to be developed with any assurance of high quality. So most software developers have not adopted methods and tools based on science and mathematics – and as a consequence most software is full of errors.

A typical programmer will introduce between 10 and 30 defects in every thousand lines of program and find less than half of them before the software is released into service.
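A rough, illustrative calculation of what those rates imply at the scale of a modern car's software (the mid-range values chosen here are assumptions within the quoted ranges; the 100M-line figure comes from the article):

```python
# Back-of-the-envelope arithmetic for the figures quoted above.
kloc = 100_000            # ~100M lines, the size quoted for a modern car
injected_per_kloc = 20    # assumed mid-range of "between 10 and 30 defects"
found_fraction = 0.5      # "less than half" are found before release

injected = kloc * injected_per_kloc
residual = injected * (1 - found_fraction)
print(f"injected ~{injected:,}, residual at release > {residual:,.0f}")
# injected ~2,000,000, residual at release > 1,000,000
```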

See Table 1 for some experimental results. These are old figures [2] from an experiment by Watts Humphrey of the Software Engineering Institute at Carnegie Mellon University that involved 810 experienced programmers and 8000 programs. There is no reason to believe that things have become substantially better.

High defect densities (see Table 2) were also reported following analysis of software in a military aircraft, in a paper called 'Air Vehicle Software Static Code Analysis Lessons Learnt' by Andy German and Gavin Mooney [3], also published in Crosstalk, the journal of defence software engineering.

The main way that software developers assure the quality of their work is by running tests, even though computer scientists have been saying for the past forty years that testing can never show that software is secure or correct.

And when the project overruns, even the testing gets reduced …

It is rare for software developers to use mathematically formal methods, even when a software failure could cause injury or serious commercial damage, and even though the US National Security Agency showed more than 20 years ago that formal methods were a better, cheaper, and quicker way to build secure software than traditional test-and-fix methods [4].

All this means that the software is the weak link in our computer systems. It is more complex than the hardware – usually much more complex – yet it is developed using programming languages, methods and tools that are not fit for purpose.

Every week brings further evidence that our software infrastructure cannot bear the weight that we place on it.

Several times recently, one of the major UK banks has experienced an IT problem that stops customers accessing their money and paying their bills.

Websites are hacked and the personal details of thousands or millions of users are obtained by criminals [5].

Cyber attacks have violated privacy, caused billions of pounds of damage, destroyed marriages, and led to suicides.

Yet most successful cyber attacks are the straightforward exploitation of software errors; errors that should not have been made and that would not even have been possible if better languages and tools had been used by the developers.

Table 1: Defect Injection Ranges for 810 Experienced Software Developers

Group            Average Defects per KLOC
All              120.8
Upper Quartile    61.9
Upper 10%         28.9
Upper 1%          11.2

Table 2: Software Language Anomaly Rates

Language   Range                        Lines of Code per Anomaly   Anomalies per KLOC
C          Worst                        2                           500
C          Average                      6 - 38                      167 - 26
C          Best (auto code generated)   80                          12.5
Pascal     Worst                        6                           167
Pascal     Average/Best                 20                          50
PLM        Average                      50                          20
Ada        Worst                        20                          50
Ada        Average                      40                          25
Ada        Best (auto code generated)   210                         4.8
Lucol      Average                      80                          12.5
SPARK      Average                      250                         4

The situation is serious and it is getting rapidly worse. For many years, companies have accepted the cyber risks that they run with the argument that no one has both the means and the intent to attack them.

But circumstances have changed and continue to change rapidly. When I was a director of the Serious Organised Crime Agency, we saw the speed with which cyber-attack knowledge and methods became available to organised crime groups and then to anyone who had access to bitcoins and who knew how to use Tor (The Onion Router – free software for enabling anonymous communication) to access criminal websites. Now, these websites will sell you ‘zero day’ vulnerabilities in a range of widely-used software products [6] so that you can write malware to attack your chosen target or you can commission a hacker to write the exploit for you or to carry out the attack.

Cybercrime is already a major world industry and it seems that cyber terrorism is close behind. The UK Government has identified cybercrime and cyber terrorism as Tier One risks to national security, alongside war, major floods, and pandemic infections [7].

The Age of the Computer has brought incalculable benefits but also much greater risks, because the world's software industry has failed to meet the growing need for security and dependability. Other engineering professions have taken centuries to reach maturity: time that was available because society was slow to become dependent on each new engineering technology. We do not have that luxury with software, so we must find economic or other incentives to greatly accelerate that progress. Enhanced product liability laws – perhaps analogous to the EU General Product Safety Directive – would help, if the laws were rigorously enforced.

Cyberspace has become just as important as the physical world; we cannot afford to leave it unpoliced but – just as in physical space – the powers that citizens give to security agencies must be proportionate and democratically accountable. That can be hard to achieve, as the 300 pages of the draft Investigatory Powers Bill [8] now before parliament attest.

Few citizens – and few MPs – are expert enough in the arcana of cyberspace to be able to understand all the complexities of this Bill. There are always tensions between security, privacy and surveillance: there can be no simple solution to such a complex issue; it is a matter of finding an acceptable and workable balance of interests, with sufficient oversight and accountability. Such matters are inherently political as well as technical.

The Age of the Computer raises novel and urgent issues of technology, security and politics. Navigating these issues requires deep expertise in computer science, economics, security, politics, law and human rights. Experts in any of these topics have a professional duty to help their fellow citizens to understand the issues as well as possible, so that the decisions that will affect our future can be taken democratically by an electorate that is as well informed as possible.

This article is based on a lecture delivered at the Royal Society, London, UK, on 15 March 2016, on the occasion of the presentation of the 2015 Lovelace Medal to Professor Ross Anderson of Cambridge University [9].

References
[1] http://www.cyberliving.uk/a-very-brief-history-of-computing-1948-2015/
[2] http://resources.sei.cmu.edu/asset_files/SpecialReport/2009_003_001_15035.pdf, page 132
[3] In Felix Redmill, Tom Anderson (Eds), Aspects of Safety Management: Proceedings of the Ninth Safety-critical Systems Symposium, Bristol, UK, 2001
[4] http://www.adacore.com/sparkpro/tokeneer
[5] http://theguardian.com/technology/2015/jul/09/opm-hack-21-million-personal-information-stolen
[6] https://www.mitnicksecurity.com/S=0/shopping/absolute-zero-day-exploit-exchange
[7] https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/478933/52309_Cm_9161_NSS_SD_Review_web_only.pdf, Annex A
[8] https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/473770/Draft_Investigatory_Powers_Bill.pdf
[9] http://academy.bcs.org/content/lovelace-medal

Martyn Thomas CBE FREng is Livery Company Professor of Information Technology at Gresham College. He may be contacted at <[email protected]>

The Certification of Complex Systems and Software (Part 3 of 3)
by Phil Hall

This is the last in the current series of articles on this topic, which I hope the readership have found useful and informative. I would also like to thank Felix for his stewardship of the magazine and his help in applying the 'polish' to the finished article – for those of us who said 'yes' when canvassed to make a contribution.

I previously explained that the types of complex systems considered herein consist of electronic hardware: line replaceable units, application-specific integrated circuits (ASICs), programmable electronic devices, etc., and equipment with embedded software components, collectively referred to as programmable elements (PE) [1]. In this article the main focus is to discuss minimal compliance expectations, and to conclude with a series summary.

I am often asked to identify what constitutes a credible and acceptable approach to certification and assurance activities. In reply I would like to set out what I believe is a minimum set of key considerations and enablers: ‘building blocks’ that establish a solid foundation of core knowledge and understanding that have the ability to shape all subsequent activities. These may be the key discriminators between success and failure of a programme in the delivery and acceptance of safe and certifiable systems. There is a need for an ethos to plan to deliver safe and certifiable products from the outset, and a need to develop a common understanding of minimal certification and assurance expectations.

The Approach
The approach to safety, safety management and certification may depend on whether the end product has a military or civil role and whether this then mandates domain-bespoke requirements and standards. The compliance basis, as defined in the certification baseline [1], should include the need to comply with regulatory standards and contractual requirements.

All involved in the manufacture, supply or development of a product must be cognisant of any additional obligations, including legal and statutory requirements. The approach taken must stand scrutiny, and be justified and justifiable when compared to contemporary standards, requirements and codes of practice.

The Need to Evolve
The pace of technological change and the introduction of new and novel applications make it no longer possible to rely simply on designs, processes and practices which have been perceived as 'safe enough' in the past. Industry must evolve and respond appropriately to external developments that may affect the safety or certification assurance of products, systems or services.

Transformations may be driven by the need to cater for changes to industry best practice, professional procedures, operating paradigms, contemporary standards, and today's art of the possible [2]. Other change factors include relevant safety alerts, incidents or accidents, in-service reports, bug reports, etc., which may impact the product realisation processes or the validity of any safety or certification compliance claims. This requires proactive intelligence gathering to develop external awareness and identify new opportunities, challenges and threats.

Safety Culture and Support Networks
A key enabler to certification success is an organisation's safety culture: the setting of strategic goals and establishing support networks aligned with the need to deliver safe and certifiable systems, products and services. This requires a culture of ownership, involvement, knowledge, belief and understanding at all levels. Support networks must exist and include suitably qualified and experienced safety and certification practitioners, and experts in the base technologies and the field of application. They must also include strong internal safety and quality design authorities and technical governance provisions, with the ability and authority to intercede when a programme has difficulties in achieving safety, certification and assurance goals.

Safety Analysis Informs the Certification Basis
It is extremely important that the certification basis is underpinned and informed by current, appropriate, proportionate and adequate levels of systematic safety analysis. The outcomes of safety analysis activities should inform the product design and establish confidence in the adequacy of the architectural solution, identifying required safety properties, behaviours, functions and features and safety integrity or assurance levels. Safety properties of a system should also provide a focus for certification activities and will drive the level of rigour in development and certification effort.

Safety analysis must identify significant faults, failure modes and effects on the system, and consequential impacts on PE components and partitions. I previously discussed the need for a continuous process of involvement and interaction between safety, systems, and software and assurance functions [2]. These interactions build product confidence and should be scrutinised as part of any compliance assessment.

It is of note that the safety analysis may only be as good as the maturity of the product definition and the level of understanding of the Concept of Operation, operational context and intended usage. It is a mistake for a development to proceed without first establishing this base level of contextual understanding, as this may compromise the safety and certification basis.

Safety, Certification Planning and Product Evolution
It should be expected that effective safety-management system and certification planning activities are defined to support all product lifecycle stages. These should form an integral part of the overall engineering management planning of a product; relationships and precedence between plans must be established.

Product safety analysis activities may need to be performed iteratively, as the architectural definition matures and the product takes form. This analysis must be updated to reflect any significant changes, such as requirements and architectural refinements, functional and independence trade-offs, and required levels of redundancy and diversity. This may also have consequential impacts on certification compliance effort and assessments [1]. It is extremely important that product evolution is planned for, controlled and managed, and baselines established.

Safety analysis must be performed by the system architects and relevant engineering functions, supported by safety and certification expertise. Therefore, it is necessary to plan for these pan-discipline activities to take place, in order to coordinate activities and develop a holistic system-safety viewpoint [2].

Certification is Not Just About Process Assurance
I have identified four key areas to be examined – People, Plans, Processes and Product – and the attributes associated with each [1]. Organisations should understand and expect that all of these areas will be scrutinised during compliance assessments. Certification is not just about process assurance; delivered products are the enactment of plans by people following a number of processes and methods.

The Impact of Integrity Levels and Safety Features
The minimum effect of safety analysis outcomes is to determine assurance or integrity level(s), driving the degree of rigour and required levels of independence in PE development, certification and assurance activities.

The reason for the assignment of an assurance or integrity level to a product, system or service is that a safety analysis has determined that there are safety-related dependencies. The safety analysis outcomes may shape the architectural solution or identify functional or procedural safety 'controls' that must be in place in order for a product, system or service to be in a managed, appropriately safe state. Knowledge and identification of these safety features must inform and direct any safety or certification assessment.

Safety Sufficiency
It is not sufficient on the part of a manufacturer, supplier or developer to act on, or rely solely on, the system-level safety requirements supplied by a customer, or as specified in a contract [3]. Customer functional safety requirements, where supplied, are likely to be specified in an abstract form, agnostic of the architectural solution and implementation details of the deliverable product. The customer's safety analysis, if available, may not be complete, or may lack details necessary to gain an adequate level of contextual understanding.

Historically, the practice of providing a list of hazards with associated Tolerable Hazard Occurrence Rates typifies the provision of partial information, devoid of the required knowledge, contextual understanding, causes, accident sequences and outcomes. In such instances, it is important to have an open dialogue with the customer, seeking as much additional supporting information as possible.

All assertions and assumptions that underpin a manufacturer's, supplier's or developer's safety analysis must be documented. These details should be fed back to the customer, along with locally derived safety analysis outcomes. 'Safety sufficiency' does affect the credibility and validity of compliance assessment outcomes.

Safety and Certification Accountability
The Health and Safety at Work etc. Act 1974, Chapter 37 (HSWA) [3] in essence demands of a manufacturer, supplier or developer that they perform 'reviews', including safety analysis, as far as reasonably practicable, with a view to the discovery of anything that poses a risk to health or safety and, therefore, could cause harm. There is also a responsibility to communicate the outcomes of these 'reviews' to the end user or customer, conveying the most up-to-date information. This demands a level of currency and relevance of supplied information, with an awareness of current good practice to enable the performance of adequate research and analysis to meet the requirements of the Act.

A contract should never be seen as the 'last word' with regard to safety or certification responsibilities. A contract to supply merely sets out minimal delivery requirements between a customer and a supplier. It does not follow that strict adherence to a contractual position will be adequate to meet the requirements of the certification baseline [1], or the minimum requirements of the HSWA [3].

Safety and certification accountability extends to all parties and to any controlling authorities that direct or influence these activities. This accountability is not restricted to those individuals who are approving or authorising signatories.

Setting the Standard
Applicability, relevance and usefulness of any particular 'standard' or 'guidance' is subject to formal recognition, by a regulatory body, as an 'acceptable means of compliance' and for adoption for this purpose by industry. The concept of a certification baseline and the need for certification liaison meetings with the certification authority or regulatory delegate, to agree the certification approach, have already been discussed [1].

The certification liaison process is a critical activity in 'setting the standard' and requires adequate preparation. This preparation should include support and guidance from experts in safety and certification, developing minimum certification expectations for all parties. Certification liaison meetings are used to consolidate the certification baseline and to confirm the compatibility of development plans and strategy with the certification requirements of the industry regulator [2]. This process also helps to reaffirm customer expectations and set the acceptance criteria for the delivered product, system or service.
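As a purely hypothetical sketch of the kind of record such liaison might produce (the class, field names and example values below are my own invention, not taken from the article or from any regulator's format), the agreed baseline could be captured as structured data and revisited at each liaison meeting:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CertificationBaseline:
    """Hypothetical record of what has been agreed through certification liaison."""
    product: str
    agreed_standards: List[str]        # recognised 'acceptable means of compliance'
    acceptance_criteria: List[str]     # acceptance criteria for the delivered product/system/service
    liaison_meetings: List[str] = field(default_factory=list)  # dates or minutes references

# Invented example values; the standards named are merely illustrative.
baseline = CertificationBaseline(
    product="Example avionics display unit",
    agreed_standards=["DO-178C (software)", "ARP4754A (development assurance)"],
    acceptance_criteria=[
        "All applicable objectives satisfied, with independence where required",
        "Open problem reports assessed for safety impact before delivery",
    ],
    liaison_meetings=["2016-01-12 kick-off", "2016-04-20 baseline review"],
)
print(f"{baseline.product}: {len(baseline.agreed_standards)} agreed standards")
```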

New Technologies and the Regulator
New technologies and capabilities may provide opportunities to exploit innovative solutions that do not always fit well within existing 'governance frameworks'. In such circumstances it is important to engage early with the regulator to understand what, if any, additional certification requirements or operational constraints may apply. Additional certification requirements will have a cost and time impact. Operational constraints (e.g. permitted usage, area restrictions and performance limitations) will affect product viability, marketability and the ability to deliver products with adequate utility.

Codes of Practice Do Matter
Recommended 'codes of practice' provide advisory guidance and instruction in professional conduct, and may be used in evidence in court proceedings under the relevant laws to establish liability. This applies to all 'shall' and 'must' directives, and also to advisory wording such as should/should not and do/do not. Such guidance may take the form of regulatory, industry and domain advisory circulars, notices and bulletins, such as Certification Authorities Software Team papers and technology supplements.

For example, the code of practice for all road users in the UK is the Highway Code, first published in 1931. The current version was last updated on 29th April 2015 and contains 307 rules in the body text, plus eight annexes of additional rules, many of which are legal requirements under the current Road Traffic Acts and Regulations. It would be difficult to mount a defence based on ignorance of the current Highway Code when charged with a traffic offence.

The Haddon-Cave Nimrod report [4] identified the need to use best practice and, in particular, the need to reflect the latest thinking in safety and risk management.

The Compliance Argument
Any claim of compliance against the certification baseline must be informed by a sufficient body of relevant, tangible evidence delivered from compliance inspection activities [1]. The strengths and weaknesses of the argument must be understood, including the need to establish the maturity, completeness and correctness of the engineering and compliance inspection programme of activities [2]. The compliance arguments inform the safety management activities and the resultant PE and system-level safety case(s).
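To make this concrete, here is a minimal sketch of a compliance matrix (my own illustration under assumed field names, not the author's method): each baseline objective is linked to the inspection evidence found and a verdict, so that unsupported or open claims, i.e. the weaknesses of the argument, stand out.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ComplianceItem:
    """One row of a hypothetical compliance matrix."""
    objective: str                                      # objective from the certification baseline
    evidence: List[str] = field(default_factory=list)   # references to inspection/audit evidence
    verdict: str = "open"                               # e.g. 'compliant', 'partial', 'non-compliant', 'open'

def weaknesses(matrix: List[ComplianceItem]) -> List[str]:
    """Objectives whose claim is not yet backed by evidence, or not yet closed."""
    return [item.objective for item in matrix
            if not item.evidence or item.verdict != "compliant"]

# Invented example rows.
matrix = [
    ComplianceItem("Requirements are traceable to tests",
                   evidence=["Audit report AR-12", "Trace matrix v3"], verdict="compliant"),
    ComplianceItem("Tool qualification completed"),
]
print(weaknesses(matrix))   # ['Tool qualification completed']
```

Whatever form such a matrix takes, the underlying idea is the one in the text: a compliance verdict is only as strong as the evidence recorded against it.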

Common Failings
Common failings of safety culture, product definition, detached developments, compliance evidence shortfalls, development methodologies and environments, inadequate planning, and other capability shortcomings were described in [2] – in effect, guidance on 'what not to do'.

Summary
Part 1 [1] provided an overview of the certification of complex systems and software, the qualities required of certification practitioners, and how certification compliance is determined through a continuous process of audit, review, interview and assessment. It showed the need for a compelling case to support compliance verdicts, and the need to be pragmatic in compiling it.

Part 2 [2] discussed the importance and value added by certification and safety assurance work, as a key enabler to the trade of goods and services and as a quality discriminator in the marketplace.

In this final article, I have identified the need to evolve with improving knowledge and changing circumstances, to maintain external awareness, and to respond to contemporary thinking and best practice.

In Conclusion
I present herein some minimal expectations for competent certification, for use as a guiding framework. This may be used to test the validity and sufficiency of anyone's safety, certification and assurance activities.

References
[1] Phil Hall, The Certification of Complex Systems and Software. Safety Systems, Vol. 25, No. 1, September 2015
[2] Phil Hall, The Certification of Complex Systems and Software, Part 2. Safety Systems, Vol. 25, No. 2, January 2016
[3] Health and Safety at Work etc. Act 1974, Chapter 37 (HSWA)
[4] Charles Haddon-Cave QC, The Nimrod Review, October 2009

Phil is a Principal Consultant at Cyber and Consulting Services, working in the Systems Safety Practice; he can be reached at <[email protected]>. More information can be found at http://www.thalesgroup.com/uk

Review of Independent Safety Assurance Working Group Seminar, 24th November 2015
by Karen Tracey
The biennial seminar, held jointly by the Independent Safety Assurance Working Group (ISAWG) and the Safety-Critical Systems Club, took place on 24th November 2015. The title for this year was Maximising the Value of Your ISA. In an increasingly competitive marketplace it is essential that Independent Safety Assessors (ISAs) achieve the appropriate level of assurance to provide sufficient evidence that products, systems or services are acceptably safe, whilst balancing this against the pressures of delivering for organisations with limited budgets.

The seminar benefitted from stimulating and engaging presentations from the nuclear, rail, aviation and defence industries. This diversity of domains helped attendees to recognise how much ground the sectors share.

The keynote speaker was David Forsythe, the Head of Nuclear Safety Decommissioning at Sellafield Limited. He described a difficult journey using a series of analogies and themes intended to provoke thought and questions. He considered how individual roles might change as hazardous facilities transition into decommissioning, with a particular focus on balancing and managing risk and uncertainty. His presentation was informative, insightful and entertaining and was well received by the audience. An article based on it appears in this newsletter.

After a quick coffee break, Ali Hessami delivered a contrasting, more academic presentation entitled Smart Safety Assessment. Ali is an expert in systems assurance and safety, security, sustainability and knowledge assessment/management methodologies. His presentation reflected the results of his research and gave an overview of the principal requirements and qualities for robust, credible and systematic assessment. In particular, he argued the need for assessment to be supported by relevant processes, rules, tools, codes of practice and standards.

Nigel Peterson then gave an interesting presentation on his role as ISA support to the UK Apache programme. He explained how an ISA was appointed to the team early on, to add value primarily as an advisor and auditor throughout the project's life. He went on to give an overview of the aircraft safety assessment model and the process used to demonstrate that risk is ALARP (as low as reasonably practicable). He emphasised the need to avoid focussing on numbers, and to help the customer understand the issues by providing advice and solutions, not just problems. His presentation provided an insight into how the ISA helps to smooth a path for the customer through the complexities of safety requirements and enables a fully joined-up and coherent view of risk.

The last morning session was delivered by Roy Frost of AWE (the Atomic Weapons Establishment). Roy is an established lead internal regulator, with specific responsibility for peer review of safety cases. His presentation was based on his experience of safety-case production and review. It considered where peer review is beneficial, where it could go wrong and what would happen if we had no peer review at all. It highlighted ways of promoting industry best practice, and discussed how we can ensure support from customers in delivering value for money and other benefits.

Following a pleasant lunch, where the group took the opportunity to reaffirm old, and develop new, relationships, the afternoon session was introduced by Steve Kinnersly, who reported on the work of the ISAWG since the last seminar in 2013.

Andrew Eaton and Stephen Barker from the UK CAA (Civil Aviation Authority) then delivered a presentation on a novel systematic process for change safety-case assessment. The approach was identified as being non-domain-specific: it could be applied to any change safety case. However, the question of how to establish the correct degree of rigour had yet to be fully answered. This led seamlessly into the afternoon workshop.

The afternoon saw the group split into two, whereupon they enthusiastically debated what would be considered an appropriate scope of independent assessment for the fictitious Monstaplant and Miniplant. The aim was to establish some degree of consensus on the rigour of assessment. The output from the workshop will be useful in supporting the ISAWG, which will be looking to generate guidance in this area over the coming months.

The outgoing Chair, Simon Brown, closed the seminar by thanking all the contributors for their time and for the quality of their stimulating presentations. He also thanked participants for their enthusiastic and valuable contributions to the workshop.

Karen Tracey, Head of Independent Nuclear Safety Assessment at Sellafield, and Chair of ISAWG, can be contacted at <[email protected]>.




Events Calendar
Items in this list may be of interest to members. In most cases we provide web addresses for further information; apologies to readers when we are not able to do this. We have attempted to get the details right, but readers are advised to confirm them with event organisers; the Club cannot accept responsibility for errors in the information. All events are in 2016 except when stated otherwise.

Other Safety and Dependability Events

10th - 15th April: IEEE Int. Conf. on Software Testing, Verification and Validation (ICST 2016), Chicago, IL, USA. https://www.cs.uic.edu/~icst2016

1st - 6th May: Software Testing Conference (STAREAST), Orlando, Florida, USA. https://stareast.techwell.com

14th - 22nd May: 38th Int. Conf. on Software Engineering (ICSE 2016), Austin, Texas, USA. http://2016.icse.cs.txstate.edu

16th - 17th May: 18th Int. Conf. on Safety and Systems Engineering (ICSSE 2016), Paris, France. https://www.waset.org/conference/2016/05/paris/ICSSE/home

25th - 27th May: Australian System Safety Conference 2016 (ASSC 2016), Adelaide, Australia. http://www.assc2016.org/

6th - 9th June: IET Safety Critical Systems Course, Wyboston Lakes, UK. http://conferences.theiet.org/scs/

28th - 30th June: Int. Conf. on Reliability, Safety and Security of Railway Systems (RSSR 2016), Paris, France. http://conferences.ncl.ac.uk/rssrail/

28th June - 1st July: 46th Annual IEEE/IFIP Int. Conf. on Dependable Systems and Networks (DSN 2016), Toulouse, France. http://www.dsn.org/

16th - 20th July: Int. Symp. on Software Testing and Analysis (ISSTA 2016), Saarbrücken, Germany. https://issta2016.cispa.saarland/

24th - 28th July: Ninth Int. Conf. on Dependability (DEPEND 2016), Nice, France. http://www.iaria.org/conferences2016/DEPEND16.html

25th - 28th July: 6th Int. Conf. on Quality, Reliability, Risk, Maintenance, and Safety Engineering (QR2MSE 2016), Jiuzhaigou, China. http://qr2mse.org/default.aspx

1st - 3rd August: IEEE Int. Conf. on Software Quality, Reliability & Security (QRS 2016), Vienna, Austria. http://paris.utdallas.edu/qrs16

Club Events

9th June: Safety Arguments – the Good, the Bad and the Ugly, London, UK

14th June: SCSC Data Safety Initiative working group (DSIWG #27), Harlow, UK

29th September: Details to be confirmed, London (TBC)

1st November: High Integrity Software Conference, Bristol, UK. http://his-2016.co.uk/

6th December: Topic and date to be confirmed, London (TBC)

7th - 9th February, 2017: Safety-critical Systems Symposium 2017 (SSS'17), Bristol, UK


Other Safety and Dependability Events (continued)

8th - 12th August: Int. System Safety Conference (ISSC 2016), Orlando, Florida, USA. http://issc2016.system-safety.org/

15th - 17th August: 2nd Int. Conf. on Safety & Reliability of Ships, Offshore and Subsea Structures (SAROSS 2016), Glasgow, UK. http://www.asranet.co.uk/Content/Files/201608-SAROSS.pdf

22nd - 23rd August: 18th Int. Conf. on Reliability, Safety and Security Engineering (ICRSSE 2016), Paris, France. https://www.waset.org/conference/2016/08/paris/ICRSSE

23rd - 26th August: 8th IEEE Int. Symp. on UbiSafe Computing (UbiSafe 2016), Tianjin, China. http://trust.csu.edu.cn/conference/ubisafe2016/

29th August - 2nd September: 11th Int. Conf. on Availability, Reliability and Security (ARES 2016), Salzburg, Austria. http://www.ares-conference.eu/conference/

5th - 9th September: 12th European Dependable Computing Conference (EDCC 2016), Gothenburg, Sweden. http://edcc2016.eu/

20th - 23rd September: Int. Conf. on Computer Safety, Reliability and Security (SafeComp 2016), Trondheim, Norway. http://www.ntnu.edu/safecomp2016

25th - 29th September: 2016 European Safety and Reliability Conference (ESREL 2016), Glasgow, UK. http://esrel2016.org/

26th - 29th September: Int. Workshop on Formal Methods for Industrial Critical Systems and Automated Verification of Critical Systems (FMICS-AVoCS 2016), Pisa, Italy. http://fmics-avocs.isti.cnr.it/

26th - 29th September: 35th Symp. on Reliable Distributed Systems (SRDS 2016), Budapest, Hungary. http://srds2016.inf.mit.bme.hu/

2nd - 7th October: 13th Int. Conf. on Probabilistic Safety Assessment and Management (PSAM 13), Seoul, Korea. http://psam13.org/

11th - 13th October: IET System Safety and Cyber Security (IET SSCSC), London, UK. http://conferences.theiet.org/system-safety/

19th - 21st October: Latin-American Symp. on Dependable Computing (LADC 2016), Cali, Colombia. http://www.unicauca.edu.co/ladc2016/

23rd - 27th October: 27th Int. Symp. on Software Reliability Engineering (ISSRE 2016), Ottawa, Canada. http://issre.net/

24th - 25th October: 18th Int. Conf. on Dependability and Complex Systems (ICDCS 2016), Paris, France. https://www.waset.org/conference/2016/10/paris/ICDCS

26th - 28th October: 3rd Int. Conf. on Safety and Security in Internet of Things (SaSeIoT 2016), Paris, France. http://securityiot.eu/2016/show/home

7th - 11th November: 21st Int. Symp. on Formal Methods (FM 2016), Limassol, Cyprus. http://www.fmeurope.org/?p=527

23rd - 26th January, 2017: 63rd Reliability and Maintainability Symposium (RAMS), Orlando, Florida, USA. http://www.rams.org/



The Newsletter: Safety Systems

For the last 25 years, the Newsletter has been published three times annually, in September, January and May. It has been edited by me, Felix Redmill, and sent to paid-up members of the Safety-Critical Systems Club.

Its purpose has been to contribute to fulfilling the objectives of the Club, by raising awareness of safety in the development, certification, operation, and maintenance of technological products, and by facilitating the transfer of information, technology, and current and emerging practices and standards. It has carried articles and other information of interest to Club members – and the wider community.

Although the Editor has necessarily had to have a hand in articles, the views expressed have always been those of the authors and no attempts have been made to direct or shape them.

Over the years, articles have been contributed by Club members and others from many countries, not only in Europe but in other parts of the world as well. Indeed, this current issue carries articles by authors in Norway, Australia and the United States. Long may the Club, and Safety Systems, support an exchange of information on such a wide scale.

This issue, the last to be edited by me, is something of a 'bumper' one, being three times the usual size, both in the number of articles and in the number of pages. Its articles cover a wide range and I hope that it forms a worthy finale to the efforts of the team that initiated the Club and saw it through 25 years of service to the UK's safety-critical systems community.

Information on events and other Club activities is to be found on the Club's web site: http://www.scsc.org.uk/

Joan, Tom and Felix bid farewell to all our members, past members, friends and supporters. We hope that, as the Club continues to support you and the entire safety-critical systems community, you will continue to support the Club.