PUBLICLY AVAILABLE SPECIFICATION

PAS 77:2006

IT Service Continuity Management Code of Practice

ICS code: 35.020

NO COPYING WITHOUT BSI PERMISSION EXCEPT AS PERMITTED BY COPYRIGHT LAW


© BSI 11 August 2006

This Publicly Available Specification comes into effect on 11 August 2006


ISBN 0 580 49047 5

Amd. No. Date Comments


Contents

Foreword

Introduction

1 Scope

2 Terms and definitions

3 Abbreviations

4 IT Service Continuity management

5 IT Service Continuity strategy

6 Understanding risks and impacts within your organization

7 Conducting business criticality and risk assessments

8 IT Service Continuity plan

9 Rehearsing an IT Service Continuity plan

10 Solutions architecture and design considerations

11 Buying Continuity Services

Annex A (informative) Conducting business criticality and risk assessments

Annex B (informative) IT Architecture Considerations

Annex C (informative) Virtualization

Annex D (informative) Types of site models

Annex E (informative) High availability

Annex F (informative) Types of resilience

Bibliography


Attention is drawn to the following statutory instruments and regulations:

• Basel II: International Convergence of Capital Measurement and Capital Standards: a Revised Framework, Basel. Bank for International Settlements Press and Communications, 2005.

• The Civil Contingencies Act 2004. Cabinet Office: The Stationery Office.

• The Data Protection Act 1998. British Parliament: The Stationery Office.

• The Higgs Report on the Role of Non-Executive Directors: Department of Trade and Industry: The Stationery Office, 2001.

• The Sarbanes-Oxley Act, 107th Congress of the United States of America, 2002.

• The Turnbull Report on Corporate Governance: Department of Trade and Industry: The Stationery Office, 1998.

• The Orange Book Management of Risk – Principles and Concepts: HM Treasury, 2004.


Foreword

This Publicly Available Specification (PAS) has been prepared by the British Standards Institution (BSI) in partnership with Adam Continuity, Dell Corporation, Unisys, and SunGard. Acknowledgement is given to the following organizations that have been involved in the development of this code of practice.

• Adam Continuity

• Dell Corporation

• SunGard

• Unisys

Contributors:

• Oscar O’Connor, Lead Author

• John Pollard

• Richard Pursey

• Andrew Roles

• Brian Hayden

• Douglas Craig

• Stafford Hunt

As a code of practice, this PAS takes the form of guidance and recommendations. It should not be quoted as if it is a specification and particular care should be taken to ensure that claims of compliance are not misleading.

This Publicly Available Specification has been prepared and published by BSI, which retains its ownership and copyright. BSI reserves the right to withdraw or amend this Publicly Available Specification on receipt of authoritative advice that it is appropriate to do so. This Publicly Available Specification will be reviewed at intervals not exceeding two years, and any amendments arising from the review will be published as an amended Publicly Available Specification and publicized in Update Standards.

This Publicly Available Specification is not to be regarded as a British Standard.

This Publicly Available Specification does not purport to include all the necessary provisions of a contract. Users are responsible for its correct application.

Compliance with this Publicly Available Specification does not of itself confer immunity from legal obligations.


Introduction

This code of practice provides guidance on IT Service Continuity Management (ITSCM). It is intended to complement, rather than replace or supersede, other publications such as PAS 56, BS ISO/IEC 20000, BS ISO/IEC 17799:2005 and ISO 9001 (see Bibliography for further information).

• PAS 56 provides guidance on best practice in Business Continuity Management, and while it mentions the need for IT Service Continuity it does not provide the detailed guidelines found in this code of practice;

• BS ISO/IEC 20000 provides guidance on best practice in Service Management and, like PAS 56, mentions IT Service Continuity, but not at the level of detail presented in this code of practice;

• BS ISO/IEC 17799:2005 provides detailed guidance on best practice in information security management, which is one aspect of IT Service Continuity Management. This code of practice does not directly address information security or physical and environmental security as these areas are covered by BS ISO/IEC 17799:2005;

• ISO 9001 provides guidance on best practice in Quality Management Systems. When implementing any recommendations found within this code of practice, the reader is encouraged to apply the quality assurance and control recommendations found in ISO 9001.

Many organizations believe that a loss of systems infrastructure will not happen to them or that a loss of such infrastructure will have a relatively low impact. However, while many of those organizations might believe that they have invested in adequate systems resilience, it is often apparent that such confidence is misplaced. In an age in which information technology is becoming ever more pervasive and increasingly critical within the day-to-day operations of many organizations, it is clear that the ability to continue to operate with any degree of success is likely to be severely compromised following loss of IT services. In addition it is evident that the duration of a tolerable IT outage is becoming ever shorter.

As Figure 1 suggests, there is a continuous cycle in the relationships between several important documents. The IT Strategy defines the organization’s key policies and direction regarding information technology, systems and services. From this, the IT Service Continuity Strategy can be defined to ensure that the policies and standards for IT Service Continuity directly and explicitly support the objectives set out in the IT Strategy. This then enables the organization to define its IT Architecture based upon the requirements and objectives set out in both the IT Strategy and ITSC Strategy. Once the architecture is defined, the organization can then define IT Service Continuity Plans for each element of the architecture. Feedback from (amongst many other sources) rehearsing the ITSC Plans can subsequently be used as input to the next iteration of the IT Strategy.

[Figure 1: cycle diagram with four boxes: IT Strategy, IT Service Continuity Strategy, IT Architecture, IT Service Continuity Plan]

Figure 1 – The relationship between the IT Strategy, ITSC Strategy, IT Architecture and ITSC Plan

Whilst it is true that major events such as bombs, fires and floods make headline news, the majority of IT related incidents fall into the category of ‘quiet calamities’ that only affect an individual or a small subset of the organization. Examples of such common incidents include the theft of a mobile worker’s notebook computer, the failure of an important business application and corruption of important or confidential data. These incidents have the potential to damage an organization’s brand or public image and its reputation, not to mention its revenues and customer service. Such damage has the potential to destroy that organization unless appropriate action is taken to implement IT Service Continuity (ITSC).

In order to retain an appropriate sense of perspective, this document refers to ‘incidents’ and ‘events’ rather than ‘disasters’. Since the Asian Tsunami of 2004 and Hurricane Katrina in 2005, the phrase ‘disaster recovery’ has taken on dimensions previously unknown and the authors felt it was inappropriate to describe the failure of IT systems, however disruptive, using the same language. Throughout the document the reader may encounter terminology which is used in other standards. To avoid ambiguity the reader should refer to the definition section to understand how such terminology is used in this document, which may differ from other standards.

This document is intended to be read by a number of different audiences:

• Executive and Senior Management – to gain a high level understanding of the fundamental interdependencies between Corporate Governance, Business Continuity and IT Service Continuity in order to make better-informed investment decisions relating to ITSCM;

• Middle Management – to understand how decisions should be made regarding IT Service Continuity such that critical business processes survive disruption (ideally) or at the very least have the ability to recover from disruption in timescales required by the organization;

• IT Management – to understand the decision making processes required in order to ensure that IT Service Continuity strategies and plans fully support business priorities;

• IT Support and Operations – to gain a practical insight into how IT Service Continuity strategies should be drawn up and implemented in such a way as to add value to the organization as well as protecting it from IT-related incidents;

• Regulators, auditors, insurance and benchmarking organizations – to understand what best practice in IT Service Continuity Management implies for organizations so that these measures can be assessed as part of wider reviews of Corporate Governance and resilience.

This code of practice is designed for organizations of all shapes and sizes, whether in the private or public sectors. It should not be regarded as a step-by-step guide to implementing IT Service Continuity Management but as guidance on the aspects of ITSCM which organizations should consider when investing in this area. Not all activities described herein will be applicable or appropriate for all organizations. In particular, small organizations should aim to use this code of practice as a reference guide in order to help them make informed decisions about what level of ITSCM would be appropriate for them given their individual characteristics.

Throughout this code of practice certain terms have been used which may cause confusion. Such confusion is naturally not the intention of the authors, so the following guidance should be borne in mind when reading this document:

• The term ‘business’ is used when referring to the non-IT elements of an organization. This should not be taken to imply that this code of practice is aimed purely at private sector or commercial bodies. In each such instance, the term is used merely as convenient shorthand to avoid over-complicating the language used herein.

• This code of practice refers to ‘rehearsing’ ITSC Plans. Other publications in this field have referred to ‘testing’ and also to ‘exercising’. The authors regard these terms as largely interchangeable and have opted to use the term rehearsing in this context as it implies not only testing that ITSC Plans are accurate and capable of being implemented, but also that the people required to implement them are guided, supported and provided with feedback on their own personal performance as well as that of the Plans. The authors did not feel that either of the other terms used in other publications quite conveyed the necessary emphasis on this aspect.

• The term ‘data centre’ is used to imply any location or facility where core information technology services are housed, whether that be the ultra-modern data centres that major organizations use or under the desk where a one-person business keeps its file server. No inference should be drawn regarding the applicability of guidance or recommendations to any type of non-data centre environment.

1 Scope

This Publicly Available Specification (PAS) explains the principles and some recommended techniques for IT Service Continuity management. It is intended for use by persons responsible for implementing, delivering and managing IT Service Continuity within an organization.

This PAS provides a generic framework and guidelines for a continuity programme including the following topics.

• What the required management structure, roles and responsibilities for implementing IT Service Continuity management are.

• How business criticality, risk assessments and business impact assessments should be performed to produce useable results.

• What business continuity plans contain and the steps required to respond to, and recover from, the identified risks within the context of specified business processes.

• How the development, rehearsal and deployment of the Business Continuity plan does not have to cost more in terms of money, risk or reputation than taking no action.

• Why a framework and capability should be developed for the organization to respond effectively to unexpected disruption.

This document is not intended to be used as step-by-step instructions for conducting any of the activities described herein. It is intended to provide an overview of a complete process on the assumption that information will already exist within the organization that would be identified by activities described in this document. Where this is the case, users of this document are encouraged to review the information in their possession to ensure that it includes all of the details required, and that it is up-to-date and accurate.

2 Terms and definitions

For the purpose of this PAS, the following terms and definitions apply.

2.1 abnormal service
level of service that deviates from the levels agreed for normal operations

NOTE Usually as a result of an incident causing disruption to normal service levels.

2.2 action plan
schedule of activities, lead times and dependencies of activities in order to address a particular requirement

2.3 asynchronous replication
periodic physical replication of data from one storage system to another

NOTE Typically over a wide area network.

2.4 atomic
requirement, transaction or objective which is self-contained, i.e. cannot be broken down further

2.5 audit log shipping
automated process for transferring records of transactions (audit logs) between primary and secondary systems

2.6 business continuity management plan
document that sets out to ensure resumption of critical business functions in the event of either an incident or unforeseen event that threatens the business

2.7 clustered system
two or more computer systems configured in such a manner that in the event of failure of a system or service run on it, operation is transferred to another system within the cluster

2.8 cold back-up site
provides the space but not the infrastructure needed to resume operations quickly

2.9 continuity procedures
set of predefined procedures to be followed in the event of an incident which disrupts normal service levels

2.10 data availability
measure of a system’s ability to deliver a predetermined level of data access during a system failure

2.11 dependency modelling
activity used to determine the inter-relationships and dependencies between functions and/or processes and how they affect the system or organization as a whole

2.12 disk imaging
method of copying a complete hard disk of a computer into a single file from which the gathered image can be distributed to a single or multiple computers to minimize the time and effort for the creation of computers that will have identical software and configurations to the original

2.13 domain
logical association of a defined environment and the assets within the pre-defined environment

2.14 downtime vs. cost vs. benefit model
model which analyses the costs of downtime and of the measures required to minimize downtime in the event of an incident and compares them against the benefits available to the organization from services being resumed

2.15 duplexed
ability to simultaneously send and receive data through a medium in both directions

NOTE When used to describe disk devices or disk connectivity it implies ‘duplication’ or ‘mirroring’.

2.16 fail-back
return of service/operation from fail-over site

2.17 fail-over
ability for services offered by a component, server or system to automatically be undertaken by another component, server or system in the event of its failure so that losing that device, server or system has a minimal impact on the service or services offered

2.18 failure modes and effects analysis (FMEA)
structured quality method to identify and counter weak points in early conception phase of products and processes 1)

1) http://www.fmeainfocentre.com/

2.19 incident
event that disrupts normal IT services

NOTE This usage differs from that in ITIL [1].

2.20 incident recovery
activities required to respond effectively to an incident, with the primary objective being to ensure the resumption of normal service levels

2.21 I/O Processors
allow servers, workstations and storage subsystems to transfer data faster, reduce communication bottlenecks, and improve overall system performance by offloading I/O processing functions from the host CPU 2)

2) http://www.intel.com/design/iio/

2.22 IP Address
logical address of a system within an IP network

NOTE The IP address uniquely identifies computers on a network. An IP address can be private, for use on a Local Area Network (LAN), or public, for use on the Internet or other WAN. IP addresses can be determined statically (assigned to a computer by a system administrator) or dynamically (assigned by another device on the network on demand).

2.23 IT Architecture
overall design of an organization’s information technology and services including both physical and logical entities

2.24 IT Infrastructure
physical devices which comprise an organization’s information technology and services architecture

2.25 IT Service
set of related information technology and probably non-information technology functionality, which is provided to end-users as a service

NOTE Examples of IT services include messaging, business applications, file and print services, network services, and help desk services 3).

3) http://whatis.com

2.26 IT Service Continuity Management
supports the overall Business Continuity Management process by ensuring that the required information technology technical and services facilities (including computer systems, networks, applications, telecommunications, technical support and service desk) can be recovered within required, and agreed, business timescales

2.27 last mile telecoms provider
organization responsible for the provision of telecommunications services from the national or local telecommunications infrastructure to a specific location

2.28 latency
delay due to the time it takes to transmit data from one location to another

2.29 maintenance procedures
procedures applied by an organization to ensure that their IT Infrastructure is maintained in optimum condition through both proactive and reactive measures

2.30 Monte Carlo analysis
means of statistical evaluation of mathematical functions using random samples, often used in risk analysis of highly complex systems

2.31 Network Attached Storage (NAS)
storage device that can be attached to the network for the purpose of file sharing

NOTE In essence a NAS device is simply a file server.

2.32 network protocol
technological rules, codes, encryption, data transmission and receiving techniques which allow networks to operate
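As an illustration of the Monte Carlo analysis defined in 2.30, the following minimal Python sketch estimates yearly downtime by random sampling. The function name, the daily incident probability and the uniform outage-duration range are assumptions made for this example; they do not come from this PAS.

```python
import random


def simulate_annual_downtime(daily_incident_prob, min_hours, max_hours,
                             trials=2000, seed=42):
    """Estimate yearly downtime (hours) by repeated random sampling.

    Assumed model (hypothetical): on each of 365 days an incident occurs
    independently with probability `daily_incident_prob`, and each incident
    causes an outage drawn uniformly from [min_hours, max_hours].
    Returns (mean, 95th percentile) over all simulated years.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    totals = []
    for _ in range(trials):
        hours = 0.0
        for _day in range(365):
            if rng.random() < daily_incident_prob:
                hours += rng.uniform(min_hours, max_hours)
        totals.append(hours)
    totals.sort()
    mean = sum(totals) / trials
    p95 = totals[int(0.95 * trials) - 1]
    return mean, p95


# Illustrative figures only: 1% chance of an incident per day, 1 to 5 hours each.
mean_hours, p95_hours = simulate_annual_downtime(0.01, 1.0, 5.0)
print(f"mean downtime: {mean_hours:.1f} h, 95th percentile: {p95_hours:.1f} h")
```

The percentile is reported alongside the mean because a risk analysis is usually concerned with plausible bad years as well as the average one.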

2.33 operations bridge
central facility used for monitoring and managing systems, services and networks

2.34 paper test
mechanism for proving the hypothetical effectiveness of a process by working through scenarios in a discursive forum

2.35 point in time (PIT)
consistent copy of the data taken at the same instant in time for one or more systems

2.36 recovery procedures
procedures which result in the restoration of services following an incident

2.37 redundant routing
resilient approach to data networking in which there are a minimum of two routes from each node in the network

2.38 rehearsing
the critical testing of ITSC strategies and ITSCs, rehearsing the roles of team members and staff, and testing the recovery or continuity of an organization’s systems (e.g. technology, telephony, administration) to demonstrate ITSC competence and capability

NOTE A rehearsal may involve invoking business continuity procedures but is more likely to involve the simulation of a business continuity incident, announced or unannounced, in which participants role-play in order to assess what issues may arise, prior to a real invocation.

2.39 replication appliance
device which provides functionality to replicate data to other storage systems

2.40 risk
combination of the probability of an event and its consequence [ISO Guide 73:2002]

2.41 risk communication
exchange or sharing of information about risk between the decision-maker and other stakeholders [ISO Guide 73:2002]

2.42 risk management plan
document that sets out to define a list of activities, lead times and dependencies in order to mitigate one or more identified risks

2.43 risk mitigation
set of actions that will affect either the probability of the risk occurring or its impact should the risk occur. These are summarized as transferring, tolerating, terminating or treating the risk

2.44 risk monitoring
iterative process of the risk owner checking and reporting on any changes in status of the risk log in terms of risk proximity, impact and response

2.45 stateful/stateless
describe whether a computer or computer program is designed to note and remember one or more preceding events in a given sequence of interactions with a user, another computer or program, a device, or other outside element

NOTE Stateful means the computer or program keeps track of the state of interaction, usually by setting values in a storage field designated for that purpose. Stateless means there is no record of previous interactions and each interaction request has to be handled based entirely on information that comes with it. Stateful and stateless are derived from the usage of state as a set of conditions at a moment in time. (Computers are inherently stateful in operation, so these terms are used in the context of a particular set of interactions, not of how computers work in general).

2.46 storage array
two or more hard disk drives working in unison to improve fault tolerance and performance

2.47 synchronous replication
instantaneous physical replication of data from one storage area to another, typically over a high speed interconnect such as fibre channel

2.48 test scripts
definition of the specific tests to be enacted when proving the functionality and operation of a system or service

2.49 vulnerability report
report which identifies the specific vulnerabilities of a specific system or service

2.50 work schedule
defined set of activities and deliverables which, once completed, will result in the desired outcome of a procedure or project

2.51 Zero Data Loss (ZDL)
remote replication method that guarantees not to lose any live data

2.52 zoning
allocation of resources for device load balancing and for selectively allowing access to data only to specific systems

NOTE Zoning allows an administrator to control who can see what is in a SAN.

3 Abbreviations

For the purpose of this PAS, the following abbreviations apply.

BCM Business Continuity Management

BCMP Business Continuity Management Plan

BCMT Business Continuity Management Team

BCSG Business Continuity Steering Group

CMT Crisis Management Team

DAS Direct Attached Storage

DBMS Database Management System

IMT Incident Management Team

I/O Input/Output

IT Information Technology (also includes Information Systems (IS))

ITIL Information Technology Infrastructure Library

ITSC Information Technology Service Continuity

NAS Network Attached Storage

OS Operating System

RAID Redundant Array of Independent Disks

RPO Recovery Point Objective

RTO Recovery Time Objective

SAN Storage Area Network

UPS Uninterruptible Power Supply

WAN Wide Area Network

4 IT Service Continuity management

Information Technology Service Continuity (ITSC) is the collection of policies, standards, processes and tools through which organizations not only improve their ability to respond when major system failures occur but also improve their resilience to major incidents such that critical systems and services do not fail. It is related to a number of disciplines and should be undertaken with a complete and thorough understanding of the organization’s policies, standards, processes and supporting services for:

a) Business Continuity Management;

b) Major Incident and Crisis Management;

c) Corporate Governance and Risk Management;

d) Information Technology (IT) Governance;

e) Information Security and Data Protection.

ITSC management should also have a significant influence on IT strategy to identify information systems and services which require high levels of resilience, availability and capacity.

The purpose of risk management and ITSC management is not simply to be able to say that risk-based control mechanisms have been implemented. The management of risk can result in many tangible and intangible benefits to the organization if implemented with commitment and the right motivation. Risk management can be used to improve product quality, productivity, financial performance and working conditions. These benefits should be at the forefront of the participants’ thinking throughout the process.

Risk management can be seen as a positive approach to improving all aspects of the organization’s performance. Every stakeholder can make a significant, positive contribution by considering, on a regular basis, the ways in which the organization’s ability to achieve its objectives could be at risk. In order to do so, the organization’s objectives should be communicated clearly to everyone involved in their achievement and communication on risks should be encouraged.

ITSC management addresses risks that could cause a sudden and serious impact, such that they could immediately threaten the continuity of the business. These typically include:

a) loss, damage or denial of access to key infrastructure services;

b) failure or non-performance of critical providers, distributors or other third parties;

c) loss or corruption of key information;

d) sabotage, extortion or commercial espionage;

e) deliberate infiltration or attack on critical information systems.

Business Continuity Management (BCM) is concerned with managing risks to ensure that at all times an organization can continue operating to, at least, a pre-determined minimum level. The BCM process involves reducing the risk to an acceptable level and planning for the recovery of business processes should a risk materialize and a disruption to the business occur. In essence ITSC management should be a part of the overall Business Continuity plan and not dealt with in isolation.
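The definition of risk in 2.40 (the combination of the probability of an event and its consequence) can be illustrated with a small risk register. The sketch below is an example under stated assumptions, not a method prescribed by this PAS: it combines the two factors by multiplication, and the class name, the scales and all figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    probability: float  # assumed scale: likelihood in the review period, 0.0 to 1.0
    impact: int         # assumed scale: 1 (minor) to 5 (severe)

    @property
    def score(self) -> float:
        # Assumption: probability and consequence are combined by multiplying.
        return self.probability * self.impact


def rank_risks(risks):
    """Return the risks ordered from highest to lowest score."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


# Hypothetical register entries loosely based on the risk types listed in this clause.
register = [
    Risk("loss of access to key infrastructure services", 0.10, 5),
    Risk("failure of a critical third-party provider", 0.20, 3),
    Risk("loss or corruption of key information", 0.05, 4),
]

for risk in rank_risks(register):
    print(f"{risk.score:.2f}  {risk.name}")
```

Ranking by a single score is a simplification; an organization may prefer to review probability and impact separately, for example on a matrix.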

5 IT Service Continuity strategy

5.1 Defining an IT Service Continuity strategy

NOTE The following frameworks are well regarded and respected and can be used for additional information when creating a comprehensive IT Strategy that will assist in clarifying and defining your ITSC Strategy. For additional reading please refer to:

• CMM, Capability Maturity Model, http://www.itservicecmm.org;

• CobiT 4.0, Control Objectives for Information and related Technology, http://www.itgi.org;

• ITIL, IT Infrastructure Library, http://www.itil.co.uk.

An ITSC strategy should define the direction and high-level methods for meeting IT service level objectives. It should ensure a business is never compromised by a lack of IT availability beyond acceptable, predefined and regularly reviewed levels of uptime and performance.

The ITSC strategy should be agreed at Board level and ideally be fully endorsed by the CEO. A Board member should be accountable for the strategy and be referred to when deciding on new business initiatives including mergers and acquisitions, directional change and any decision that could have an impact on ITSC.

In devising an organization’s ITSC strategy, it is advisable to consider four discrete but linked stages in the management of a major incident, as also shown in Figure 2:

a) initial response – covering the initial actions required to ensure the safety and welfare of people affected by the incident, to activate the relevant incident management teams and determine the level of response which is appropriate to the incident;

b) service recovery – this may take place in a number of stages depending upon the needs and scale of the organization but should involve the restoration of all required services in priority order to pre-agreed (possibly degraded) levels of service;

c) service delivery in abnormal circumstances – until the organization is ready and able to resume normal service operations there is still a need to continue to operate required services at the pre-agreed service levels until the circumstances permit these ‘abnormal services’ to be failed back to ‘business as usual’ and decommissioned;

d) normal service resumption – as with service recovery, the resumption of normal service may take place in stages according to the needs and priorities of the organization. Only when each service has been validated and verified as being ‘back to normal’ should the secondary systems be decommissioned. This stage is only complete when all of the organization’s IT services are restored to normal service levels.

[Figure 2: four sequential stages: Initial response, Service recovery, Service delivery in abnormal circumstances, Resumption of normal service]

Figure 2 – Major Incident Management
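The four stages of major incident management can be sketched as a simple ordered model. This is a minimal illustration only: the enum, the function names and the service names are hypothetical, and a real incident process may overlap or repeat stages in ways this sketch does not capture.

```python
from enum import Enum


class Stage(Enum):
    INITIAL_RESPONSE = "initial response"
    SERVICE_RECOVERY = "service recovery"
    ABNORMAL_SERVICE = "service delivery in abnormal circumstances"
    NORMAL_RESUMPTION = "normal service resumption"


# Enum members iterate in definition order, which matches stages a) to d).
ORDER = list(Stage)


def next_stage(current):
    """Return the stage that follows `current`, or None after the last stage."""
    i = ORDER.index(current)
    return ORDER[i + 1] if i + 1 < len(ORDER) else None


def resumption_complete(restored_by_service):
    """Stage d) is complete only when every service is back to normal levels."""
    return all(restored_by_service.values())


# Hypothetical status during stage d): one service is still on secondary systems.
status = {"messaging": True, "order processing": False}
print(next_stage(Stage.SERVICE_RECOVERY).value)
print(resumption_complete(status))
```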


The ITSC strategy should enable the organization to plan for and rehearse the whole life cycle of a major incident from the point of initial disruption, through the recovery, to abnormal service to the point where normal service levels are once again guaranteed.

The strategy should be developed from a clear understanding of the organization’s need for IT services and the agreed service levels that are required from time to time, taking into account:

a) priority for key business units at given moments in time;

b) peak loads on business;

c) strategically important business periods, e.g. reporting periods, manufacturing deadlines etc;

d) compliance with Business Continuity Management Plans and objectives;

e) investment vs. risk;

f) impact of failure or loss;

g) recovery time objectives;

h) acceptable levels of downtime and performance;

i) system changes and upgrades;

j) new projects;

k) interdependencies;

l) compliance with legislation;

m) deadline management;

n) rehearsing recovery plans;

o) data protection;

p) data availability;

q) plan maintenance;

r) education and awareness programmes for all IT staff.

The strategy should not define the detailed tactics but should set the direction of the individual components of an ITSC plan.
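Items g) and h) in the list above are often expressed as an availability target. The following sketch, assuming a 365-day year and hypothetical target percentages, converts such a target into an annual downtime allowance and checks recorded outages against it. The function names and figures are illustrative only and are not taken from this PAS.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, service expected 24x7


def allowed_downtime_hours(availability_percent):
    """Annual downtime allowance implied by an availability target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


def within_budget(availability_percent, outage_hours):
    """True if the summed outages stay inside the annual allowance."""
    return sum(outage_hours) <= allowed_downtime_hours(availability_percent)


# Hypothetical targets: each extra "nine" cuts the allowance by a factor of ten.
for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {allowed_downtime_hours(target):.2f} hours per year")

# Hypothetical outage log for one service, in hours.
print(within_budget(99.9, [2.0, 3.5]))
```

A target agreed in this form gives the organization a measurable figure against which rehearsal results and real incidents can be compared.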

5.2 Creating an ITSC strategy

The ITSC strategy should be a by-product of a Business Continuity Management Plan (BCMP) but can be defined without one. Where a BCMP exists, those responsible for IT service levels are likely to have contributed to the plan and already be aware of the implications of that plan on the IT strategy and direction.

As shown in Figure 3, an ITSC strategy should have six main elements all of which are part of a continuous cyclical process.

Figure 3 – ITSC strategy elements (a continuous cycle: understand requirements; review strategy and update objectives; understand dependencies; instill continuity culture; rehearse, exercise and audit; monitor)


The six elements in the ITSC Strategy Process are:

a) Understanding the business requirements and agreeing service levels – The ITSC strategy should be aligned with corporate strategy and in line with the pre-defined business goals of the organization. To provide a sound base to start from and allow for continued and controlled growth of the ITSC strategy, a defined service level should be agreed. This will allow for clearly defined service levels which are specific and can be measured, analysed and improved on.

NOTE Refer to ITIL [1] for guidance on agreeing service levels.

b) Reviewing the IT strategic vision and updating objectives – Change is inevitable and allowances should be made to ensure that the ITSC strategy is always aligned with the overall IT strategy and business goals. The overall IT strategy and its ITSC strategy should be constantly assessed and updated.

c) Performing regular risk assessment and dependency modelling (internal and external) – Changes in the nature of the risks and dependencies should be expected and, in order to ensure that these variations are controlled and taken into cognisance in the ITSC strategy, they should be reassessed regularly. The quantity and quality of the regular dependency modelling exercises will depend on the nature of the environment.

d) Building and embedding an ITSC culture from new project inception – When initiating an IT project, consideration should be given to how the deliverables will support or enhance the ITSC strategy.

e) Exercising, maintaining and auditing continuity and recovery plans – Through the continued rehearsal, maintenance and auditing of the continuity and recovery plans, an organization can ensure that:

1) the plans are constantly improved on and up to date;

2) staff are familiar with the plans’ contents;

3) the plans are fit for purpose and relevant;

4) resource requirements for management of major incidents are understood and planned for;

5) visibility and governance are provided for the teams exercising the plans and the organization as a whole.

f) Monitoring performance – Constant improvements can only be made if the pre-defined service levels in the first stage are constantly monitored, measured, analysed and reported on. Monitoring performance should be central to this.

Everyone within the organization who is likely to participate should be reminded of the importance of initiating an ITSC programme and the priority it should be afforded.

NOTE Experience shows that a better quality of response is achieved if appropriately worded instructions are received from an executive level within the organization. This is particularly true within larger organizations where a variety of individuals and levels of management are likely to be involved.

A suitable mandate should be sent to the relevant department heads and others who may be involved in the programme as a matter of priority before the commencement of the programme.

The business should define the levels of service it expects from the IT department. It should clearly define priorities, allowing the Head of IT to determine which services have greater protection, resilience and redundancy over others.

IT resilience costs money. The depth and detail of an ITSC strategy will for many companies be driven by risk versus cost. A strategy should be developed that delivers improvement over time focusing on key issues and priorities where risks are high and the impact of loss or failure significant. This in itself is not a simple algorithm and will require an impact and risk analysis before budgets can be assigned and a strategy ratified.

An initial strategy should be chosen by defining a strategy around the priorities and calculating the associated costs for delivery or by allocating a budget and then building a strategy around the available budget.

NOTE It is rare that an organization will have implemented an ITSC plan from the outset of IT infrastructure implementation. Many IT environments have evolved over time and are often a combination of a number of inconsistent strategies.

5.3 The role of the Board and Executive

At Board or Executive level, business priorities should be defined focusing on key deliverables such as manufacturing deadlines, financial reporting, customer service levels etc. The impact of loss or failure should be defined, often a by-product of a BCMP. These priorities should help to define which areas of the IT infrastructure need strengthening to improve resilience.

The Head of IT should then determine which IT processes and systems the key business priorities depend upon. This dependency modelling should show the impact of individual or collective component failure. It should allow the strategist to determine vulnerable weaknesses in an IT infrastructure and measure the impact of failure from an operational level.
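As an illustration of the dependency modelling described above, the following minimal Python sketch walks a dependency map to find everything affected when one or more components fail. The service and component names, the `DEPENDS_ON` map and the `impacted_by` function are hypothetical and are not taken from this PAS. The sketch also assumes every dependency is a hard one, with no redundancy.

```python
from collections import defaultdict

# Hypothetical dependency map: each item lists what it directly depends on.
DEPENDS_ON = {
    "financial reporting": ["ledger application"],
    "customer service": ["crm application", "telephony"],
    "ledger application": ["database server", "san storage"],
    "crm application": ["database server"],
    "database server": ["san storage", "power feed A"],
}


def impacted_by(failed, depends_on):
    """Return every item that directly or indirectly depends on a failed component."""
    # Invert the map so that we can walk from a component to its dependants.
    dependants = defaultdict(set)
    for item, needs in depends_on.items():
        for need in needs:
            dependants[need].add(item)

    impacted = set()
    stack = list(failed)
    while stack:
        current = stack.pop()
        for parent in dependants[current]:
            if parent not in impacted:
                impacted.add(parent)
                stack.append(parent)
    return impacted
```

Calling `impacted_by(["san storage"], DEPENDS_ON)` returns every application and business priority that sits above the storage layer, which is one simple way of spotting a single point of failure. Passing several components at once models a collective failure.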

The Board of Directors (or equivalent) should have a vision for the organization’s future, a growth plan and a series of objectives. This information acts as a long-term steer for the ITSC strategy which will in time help steer investment decisions and corporate direction.


5.4 Identifying requirements and weaknesses

The foundation of an ITSC strategy should be to ensure an embedded resilience throughout an IT infrastructure. The Head of IT should commission an internal review of all areas of potential weakness from single points of failure through to redundancy, supply chain dependence and general IT housekeeping processes such as secure back-up and restore technology. From this review, a strategy of improved resilience can be determined.

An ITSC strategy should make use of history and trend reports highlighting downtime experience, proven areas of weakness and service level reports.

A vulnerability report is highly likely to involve expenditure. Key to implementing an ITSC strategy is to measure the costs of downtime vs. resilience expenditure, i.e. impact and risk vs. cost. This should be done by working with the Board to calculate the cost of downtime on key business functions on an hourly basis. This determines budgets and also focuses attention on service level agreements required as a result of measured downtime.

Department Heads should provide their service level requirements and the level of uptime they require within a steady state as well as the recovery time objectives after an incident. These should be measured using a downtime vs. cost vs. benefit model.
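One possible way to express a downtime vs. cost vs. benefit comparison is sketched below in Python. The formula (hourly cost multiplied by expected outage hours per year) and the function names are assumptions made for illustration. This PAS does not prescribe a particular model, and the figures in any real calculation would come from the Board and Department Heads.

```python
def expected_annual_downtime_cost(cost_per_hour, outages_per_year, hours_per_outage):
    """Expected yearly cost of downtime for one business function (assumed simple model)."""
    return cost_per_hour * outages_per_year * hours_per_outage


def annual_net_benefit(cost_per_hour, outages_per_year,
                       hours_before, hours_after, annual_spend):
    """Downtime cost avoided by a resilience measure, minus what the measure costs per year."""
    before = expected_annual_downtime_cost(cost_per_hour, outages_per_year, hours_before)
    after = expected_annual_downtime_cost(cost_per_hour, outages_per_year, hours_after)
    return (before - after) - annual_spend
```

For example, with an illustrative hourly cost of 10 000, two outages a year, and a measure costing 60 000 a year that cuts each outage from eight hours to one, `annual_net_benefit(10_000, 2, 8, 1, 60_000)` gives 80 000. A negative result would suggest that the expenditure exceeds the downtime cost it avoids.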

Advancements in technology carry an inherent demand for constant change and improvement. Enhancements to an IT infrastructure should be planned, rehearsed, and carefully managed with clear contingency plans in place should the implementation fail. The ITSC strategy should take into account a tolerance for downtime for major system upgrades and changes. Should scheduled downtime be unacceptable then plans should be put in place for duplicate environments running parallel systems. Agreement should be reached on levels of foreseen downtime prior to defining the strategy.

Due consideration and research should be undertaken on current and forthcoming legislative requirements as well as good practice guidelines on aspects of business continuity and IT resilience. The Board (or equivalent) and especially Non-Executive Directors, if present, should steer an organization towards compliance and healthy business management.

The ITSC strategy should also include an ongoing and continuous process for change management including involvement of third party suppliers as well as internal customers. An agreement should be reached at Board or Executive level on levels of investment and priorities for expenditure on IT resilience. It should agree on its policy on outsourcing risk management to third party suppliers, e.g. incident recovery companies and third party maintenance organizations (see Clause 11).

5.5 Management structure and roles

5.5.1 General

The management structure should be a standard, three-tiered structure used widely within both the private and public sectors for major incident and crisis management.

The three tiers are:

a) Bronze – operational level: Incident Management Team (IMT);

b) Silver – tactical level: Business Continuity Management Team (BCMT);

c) Gold – strategic level: Crisis Management Team (CMT).

In terms of relative size and subsidiarity, the relationship between these teams is illustrated in Figure 4:

Figure 4 – Management structure (three tiers: Gold – Crisis Management Team; Silver – Business Continuity Management Team; Bronze – Incident Management Team)


Depending upon the seriousness of the incident, ITSC should be managed at both Silver (BCMT) and Bronze (IMT) levels, with Bronze teams being established for each IT service, coordinated by a dedicated IT service Silver team. The Gold team (CMT) should provide high level direction, prioritisation and coordination and should take sole and direct responsibility for communicating with external stakeholders, including the media, emergency services and public authorities where that is appropriate or required.

Some organizations (mainly large) have internal teams specializing in Corporate Risk and Business Continuity Management from whom support and assistance should be sought when constructing, rehearsing and invoking ITSC Plans.

As these teams are most likely to comprise members of the organization’s management team, i.e. not business continuity or incident management specialists, it is possible that for both rehearsals and live activations additional specialist support may be appropriate. Consideration should also be given to the resource requirements implied by the activation of these teams and their associated ITSC Plans, since there is a possibility that more resources will be required during the initial stages than are directly available within the organization.

5.5.2 Incident Management Team

In the event of disruption, whether from a predicted source or elsewhere, the IMT for the affected IT service should be activated.

The IMT should determine whether the nature and extent of the disruption warrant the deployment of the relevant BCMP, and if so should:

a) Determine the nature of the disruption and, if necessary, coordinate with the organization’s Problem Management 4) function to adapt existing procedures for the initial response to the disruption.

b) Implement the selected procedures, securing the required resources through the BCMT where appropriate.

c) Identify, and where appropriate adapt, the relevant continuity procedures to ensure that the business continues to operate in as near to normal a manner as possible for the duration of the disruption. All such activities should be coordinated through the BCMT.

d) Identify, and where appropriate adapt, the relevant recovery procedures to ensure that the business recovers from disruption in a timely and controlled manner once the root cause of the disruption has been eliminated. All such activities should be coordinated through the BCMT.

e) Activate the BCMT, which investigates the requirement for further Bronze teams to be activated. Whilst the IMT is active, all activities should be coordinated through the BCMT to ensure that no action taken by one IMT conflicts with actions taken by others.

f) Communicate with all parts of the organization affected by the disruption on a regular basis regarding progress and the actions initiated by the IMT.

g) Organize, once recovery actions have been completed, a thorough review of its management of the disruption so all relevant lessons from the experience can be learned and incorporated into procedures and training programmes.

5.5.3 Business Continuity Management Team

The BCMT should determine the nature and extent of the disruption and should:

a) coordinate the activation and management of all relevant IMTs;

b) coordinate the allocation of resources to IMTs;

c) coordinate the management of undisrupted business;

d) manage communication with regulators, investors, the media, associates and staff;

e) ensure that the active IMTs have all the facilities, people and other resources that they require to mount effective response, continuity and recovery operations;

f) where appropriate, activate the CMT.

5.5.4 Crisis Management Team

The CMT should be activated when an incident, or a combination of incidents, has such wide ranging impact as to require organization-wide response coordination. When active, the CMT should take responsibility for the coordination of all active BCMTs and for all communication with all external stakeholders, especially customers, suppliers, regulators, the media and the public.

5.5.5 Learning lessons

In order to ensure that each real or rehearsed invocation of the ITSC management contributes to the ongoing improvement of the ITSC strategy and related plans, each team should maintain a comprehensive journal, including details of:

a) the reasons for the team being activated, including details of the disruption that occurred, and justification for the team being activated;

b) any amendments made to the ITSC strategy and related plans and procedures as a result of the actual disruption being different in character to that predicted;

c) all decisions made during the disruption, including supporting evidence;

4) http://whatis.com


d) all events transpiring during the disruption, their effects and likely causes;

e) all actions taken and evidence of their results;

f) all communication in relation to the disruption, including the other parties involved, the nature of the communication and what information was passed in each direction.

This journal should cover the period from the time the team is activated to the time it stands down. All entries in the journal should include details of the date and time the entry was made, and by whom.

The completed journals should be used to support future reviews of business and IT service continuity plans and their effectiveness. Therefore, stringent change control should be applied on these journals, and no changes of any nature should be permitted once the team has stood down.
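The journal requirements above (each entry stamped with date, time and author, and no changes once the team has stood down) could be modelled as an append-only log. The following Python sketch is illustrative only. The class names `JournalEntry` and `TeamJournal` and the `kind` values are hypothetical, and a real journal would also need durable, access-controlled storage, which is not shown.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen: an entry cannot be altered after it is recorded
class JournalEntry:
    recorded_at: datetime
    recorded_by: str
    kind: str    # assumed labels, e.g. "decision", "event", "action", "communication"
    detail: str


class TeamJournal:
    """Append-only journal covering the period from activation to stand-down."""

    def __init__(self, team, activation_reason):
        self.team = team
        self.activation_reason = activation_reason
        self._entries = []
        self._closed = False

    def record(self, recorded_by, kind, detail, recorded_at=None):
        if self._closed:
            raise RuntimeError("journal is closed: the team has stood down")
        stamp = recorded_at or datetime.now(timezone.utc)
        entry = JournalEntry(stamp, recorded_by, kind, detail)
        self._entries.append(entry)
        return entry

    def stand_down(self):
        # After this point no further entries are accepted.
        self._closed = True

    @property
    def entries(self):
        # Expose a read-only copy so that callers cannot reorder or delete entries.
        return tuple(self._entries)
```

Once `stand_down()` has been called, any further call to `record()` raises an error, which mirrors the rule that no changes should be permitted after the team has stood down.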

5.6 IT Service Continuity in a changing environment

Business is by its very nature dynamic. It changes regularly and with change comes risk; not only risk of failure but risk of destabilizing existing policies and strategies. Therefore, the ITSC strategy should be resilient to change and also adaptable.

The key factors that should be considered to ensure that the ITSC strategy and plans remain appropriate for the organization as it and its environment change include the following.

a) Board level responsibility and accountability for the ITSC strategy should help keep an ITSC strategy current as the organization changes, develops and grows. BCM and ITSCM should be a high-profile ingredient to Board level thinking and should be the most important aspect of any continuity plan.

b) The change management process should include all parties responsible for the ITSC strategy, both its compilation and its delivery. No change to the IT infrastructure should be considered until the implications of the change have been assessed and understood and contingency plans are rehearsed.

c) The procurement process for new IT systems should include sign-off that resilience has not been compromised by even the most simple of upgrades or improvements. Non-IT expenditure could still have an impact on IT resilience, such as recruitment (system overheads), marketing campaigns (web site activity) etc.

d) Due diligence on merger and acquisition (M&A) activity should include a resilience assessment. Often, M&A activity can bring perceived cost saving benefits such as branch or office closure. This can also reduce resilience through loss of fail-over sites, loss of secondary systems, and inherent redundancy.

e) Service levels (e.g. uptime statistics) should be reviewed as a Board agenda item each month. Trend analysis can show even a slight decline in service which can be an indicator of bigger problems.

f) Testing and rehearsing contingency and recovery plans should be an essential ingredient to keeping an ITSC strategy current. Ensuring a department or application can be recovered fully after failure can ensure simple errors and problems are minimized. This includes performing complete data back-ups as well as testing third party suppliers.

g) Suppliers’ ability to maintain appropriate levels of service should be regularly assessed. Including suppliers such as incident recovery and maintenance providers in the change management loop is highly recommended.

h) Remunerating staff against service levels can help ensure the relevant level of awareness reaches all levels of the organization.

i) An internal/external audit of plans.


6.1 General

Risks are prevalent within any environment. Before commencing any ITSC programme there should be an understanding of potential risks and impacts.

The loss of IT (staff, management or infrastructure) typically results in the loss of the ability to operate and manage an organization’s systems infrastructure, with the resultant degradation or loss of critical applications and data. How this affects an organization depends on what it does, its key processes and their dependence on technology and the duration of that disruption. For example, businesses in the financial sector frequently depend on financial and/or market information feeds and applications in order to manage time bound investments or transactions. An inability to manage investments and other financial vehicles would have a potentially serious impact on the organization’s balance sheet and revenues.

In order to fully understand how a disruption in IT service can affect an organization it is necessary to conduct a business criticality and risk assessment (see Annex A) which will identify critical activities and the degree to which these are dependent on IT. It should also identify the required recovery timescales (RTOs) for IT services which are vital in the implementation of those critical activities as well as the currency of the data which is used in the recovery of those IT services.

6.2 Vulnerability assessment

In parallel with an impact analysis, the potential vulnerabilities prevalent within IT service delivery which might give rise to disruption should be determined. This information can be obtained through a risk assessment which should review the IT infrastructure’s exposure in terms of:

a) system resilience and availability;

b) key suppliers and agreements;

c) documentation;

d) hardware and software assets;

e) storage;

f) back-up regimes;

g) staff exposure;

h) staff training;

i) location of buildings and facilities;

j) IT security;

k) systems monitoring;

l) power;

m) data communications;

n) archiving;

o) IT environment and monitoring;

p) telephony;

q) any other relevant exposure.

Every organization’s risk level will be different; however, the outcome of the risk assessment should provide it with sufficient information to evaluate its vulnerabilities in a rational manner and to decide how to deal with them: by eliminating the risk altogether, by investing in resources to mitigate the exposure, or by preparing beforehand for the consequences of the risk, such as having appropriate incident management in place.

By adopting this twin track approach at the start of the ITSC programme one should obtain an understanding of the organization’s dependencies on IT infrastructure in terms of the impacts of infrastructure failure (as a whole or in part) and an appreciation of the vulnerabilities present which could give rise to an incident which precipitates those impacts.

6 Understanding risks and impacts within your organization


A critical initial activity in the development of an ITSC strategy or plan is to identify all business processes and the departments or business functions responsible for their operation and to categorize each function and process according to its criticality to the business. The next step is to identify all IT services which support each business process and assess their criticality to the operation of those business processes.

NOTE 1 More detailed guidance is available in Annex A.

NOTE 2 Specific guidance on conducting risk assessments relating to information security can also be found in BS ISO/IEC 17799:2005.

ITSC management addresses the ways in which the following types of activity could be disrupted, stopped or have their performance degraded to unacceptable levels:

a) operation of IT services and processes;

b) IT service resumption following a disruption or failure;

c) new IT service or information systems development projects;

d) readiness and operation of ITSC required to comply with statutory or regulatory requirements.

The organization should be regarded in two ways:

a) Physically: an organization exists on one or more sites, each site comprising buildings, which can be broken down in a variety of ways (floor, wing, corridor, office etc.);

b) Organizationally: most organizations are structured into a number of Directorates, each of which comprises a number of functions, which comprise departments, processes and activities. Naturally this naming convention is not intended to be an accurate description of all organizations but a theme which can be readily recognized.

It is possible for each physical component to support a number of organizational components. In order to avoid duplication of effort the risk assessment process should examine the organization and its IT services from both of these perspectives. It is equally possible for a single organizational component to be situated in a number of different physical locations.

The criticality of each business process should have a direct impact on the criticality of supporting IT services. Suggested designations are shown in Table 1.

NOTE If a system or service cannot readily be assigned to any of these categories, the organization may wish to consider whether that system or service has any ongoing purpose. If however a system or service can be assigned to more than one category the organization should decide on which single category designation will be used.
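To illustrate how business process criticality might carry through to supporting IT services, the Python sketch below gives each IT service the highest category held by any process it supports. Two points are assumptions made for the example and are not stated in this PAS: the ranking of the Table 1 categories from Tactical (lowest) to Mandatory (highest), and the "take the highest" rule, which is one way of arriving at a single category designation. The process and service names are hypothetical.

```python
from enum import IntEnum


class Criticality(IntEnum):
    # Category names come from Table 1; the numeric ranking is an assumption.
    TACTICAL = 1
    STRATEGIC = 2
    CRITICAL = 3
    MANDATORY = 4


def service_criticality(process_criticality, supported_processes):
    """Assign each IT service the highest criticality of the processes it supports."""
    result = {}
    for service, processes in supported_processes.items():
        result[service] = max(process_criticality[p] for p in processes)
    return result


# Hypothetical example data.
PROCESSES = {
    "regulatory reporting": Criticality.MANDATORY,
    "order processing": Criticality.CRITICAL,
    "market research": Criticality.TACTICAL,
}
SERVICES = {
    "ledger application": ["regulatory reporting", "order processing"],
    "survey tool": ["market research"],
}
```

With this data, `service_criticality(PROCESSES, SERVICES)` designates the ledger application as Mandatory and the survey tool as Tactical.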

The process of assessing business criticality and risk should be managed to ensure that the assessment of physical risks is coordinated with, but not dominated by, the assessment of organizational risks. Neither assessment is more important than the other, but each has its part to play in ensuring that the business as a whole adopts a position in which all types of risk are managed as effectively as possible.

The inherent complexity in all organizations implies that any risk assessment method should be adaptable to the different circumstances within each part of the organization.

NOTE See Annex A for details on how to conduct business criticality and risk assessments.

7 Conducting business criticality and risk assessments

Table 1 – Business criticality categories

Category     Impact

Mandatory    Vital to enable the organization to meet statutory or other (internally or externally) imposed requirements.

Critical     Vital to the day-to-day operation of the organization.

Strategic    Important for the implementation of the long term strategy.

Tactical     Important for the achievement of the short to medium term performance objectives of the organization.


8.1 Definition of an ITSC plan

The ITSC plan is a simple, clear, unambiguous and all encompassing set of documents that define the actions required to restore IT services in the event of an incident. An ITSC plan is a series of working documents which are constantly rehearsed, updated, modified and improved.

Depending on the organization’s requirements, the ITSC plan can be one document, or a series of connected documents. It can be printed on paper or held as electronic/on-line documents. However, the ITSC plan should be readily available in the right place at the right time and to the right people when an incident occurs, which might mean having hard copies accessible.

The ITSC Plan for each service should provide detailed procedures and step-by-step guidelines for each stage in the incident management process, as described in Figure 2 in Clause 5.1.

8.2 Defining an architecture

Before building an ITSC plan the IT infrastructure should be reviewed to determine whether it has all the components and technology required to allow IT services to continue in the event of an incident. If not, then the systems should be updated to include ITSC components, such as resilient, high availability or redundant systems and data replication mechanisms. This should be done by defining an IT architecture that includes these components. Much as the architecture of an office block will include fire escapes and emergency exits, the IT architecture may include components whose sole purpose is to ensure service continuity.

There are a number of common IT models which can be adopted to facilitate ITSC. Building IT architecture for a site doesn’t have to be an onerous task; commonly accepted models for IT resiliency and ITSC can be used (see 10.3). Selection of the appropriate model(s) depends on many things including IT architecture and service continuity considerations.

8.3 Key Service Continuity Factors

There are three key factors which should be balanced prior to deciding on the IT architecture:

a) Recovery Time Objective (RTO): How quickly after an incident the IT service needs to be restored.

b) Recovery Point Objective (RPO): The point in the processing cycle where the IT service can be resumed.

NOTE This could be at some consistent point prior to the incident e.g. the time of the last back-up. It comes down to answering the question: ‘How much live data can I afford to lose?’ If the answer is none then this will have a big impact on the third factor, cost.

c) Cost: Typically the smaller the RTO and RPO values, the higher the cost of the solution. Essentially the cost of the technology increases as the time to recover and the amount of data that can be lost decrease. Since the availability and cost of technology solutions change over time, these decisions should be reviewed on a regular basis.
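The balance between the three factors above can be sketched as a simple selection: keep only the options that meet the required RTO and RPO, then take the cheapest. The Python below is a minimal illustration. The option names, recovery figures and costs are invented for the example, and a real decision would weigh many more considerations (see Annex B).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    rto_hours: float    # time needed to restore the service
    rpo_hours: float    # maximum age of the data that can be recovered
    annual_cost: float


# Hypothetical figures, for illustration only.
OPTIONS = [
    Option("tape restore", rto_hours=48, rpo_hours=24, annual_cost=20_000),
    Option("log shipping", rto_hours=8, rpo_hours=1, annual_cost=60_000),
    Option("synchronous mirroring", rto_hours=1, rpo_hours=0, annual_cost=200_000),
]


def cheapest_option(options, required_rto_hours, required_rpo_hours):
    """Return the lowest-cost option meeting both objectives, or None if none does."""
    feasible = [
        o for o in options
        if o.rto_hours <= required_rto_hours and o.rpo_hours <= required_rpo_hours
    ]
    return min(feasible, key=lambda o: o.annual_cost) if feasible else None
```

With these figures, a requirement of RTO 12 hours and RPO 2 hours selects log shipping, whereas an RPO of zero (no data loss acceptable) forces the most expensive option. This matches the observation that smaller RTO and RPO values typically mean higher cost.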

See Annex B for a more detailed discussion of IT Architecture considerations which influence service continuity. See also Annex C for a detailed discussion on virtualisation and how such technologies might be used to build resilience into the IT Architecture and also assist continuity planning.

8.4 Populating the IT Service Continuity plan

8.4.1 General

If the IT infrastructure supports multiple services, for example a bank could provide separate independent cashier and mortgage application services, then the ITSC plan should be considered in multiple ways. One aspect is total failure of a site (or sites), another is the failure of individual IT services within a site.

An ITSC plan should be part of a wider Business Continuity Management Plan and, as such, should adhere to any standards and terminology defined by that. If following an ITIL model for incident and problem management then the ITSC plan should also fall in line with ITIL processes.

The model ITSC plan should contain the procedures to follow from initial response through to resumption of normal service following an incident (see 5.1).

8.4.2 Teams to populate the ITSC plan

In order to populate each part of the plan the following should be prepared.

a) Nominate members of management to form Incident Management Teams. The main role of these teams is to manage the recovery processes for each technology platform, each IT service and all required site facilities. The members of these teams should be trained to understand their responsibilities in the event of an incident.

b) Develop escalation and process flow charts so that, once the decision to invoke has been made, the correct ITSC procedures are followed to allow recovery to commence as quickly as possible.

c) Develop detailed procedures specifying how to recover each component of the IT systems. Although operations staff will understand how to operate the systems on a day-to-day basis, they may not know how to recover the applications and databases in the event of a major incident. Make the procedures as comprehensive as possible and keep them up to date when the systems change.

8 Service Continuity plan

8.4.3 Initial response to an incident – invocation of ITSC procedures

If the organization has an IT service desk, the IMT should be contacted when the service desk is first made aware of an incident. If there are many members of the IMT, then a cascade process may be adopted. Either way the person making the initial contact should record who has been contacted and the response they received. If a person on the IMT is unavailable then their responsibilities should be fulfilled by their designated deputy within the team. In any instance, leaving phone messages is not a sufficient response to the incident.

NOTE There may be a complex set of instructions or simply instructions to contact the members of the IMT.

Depending on the seriousness of the incident the IMT may opt to activate either or both of the BCMT and CMT. If the initial assessment concludes that the IMT can manage the incident directly then the other groups can be stood down. The rules and decision making criteria for activation and escalation will be organization specific and should be developed as part of the ITSC plan.

NOTE How the organization defines an incident will impact how you escalate the problem. For example, one site defines a disaster as an event which is likely to render the whole site unavailable for a considerable length of time. A major incident is defined as an event which is likely to render a single or multiple systems as unavailable for a considerable length of time. Any event that is outside of these definitions is handled as a ‘business as usual’ event that is handled in the normal way by their service desk.
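The example definitions in the NOTE above could be turned into an explicit classification rule so that escalation decisions are consistent. The Python sketch below follows those example definitions. The `considerable_hours` threshold is an assumption, since the NOTE does not say what counts as a considerable length of time, and each organization would set its own value and criteria.

```python
from enum import Enum


class Severity(Enum):
    BUSINESS_AS_USUAL = "business as usual"
    MAJOR_INCIDENT = "major incident"
    DISASTER = "disaster"


def classify(whole_site_unavailable, systems_unavailable,
             expected_outage_hours, considerable_hours=4.0):
    """Classify an event using the example definitions from the NOTE.

    considerable_hours is a hypothetical, organization-specific threshold.
    """
    prolonged = expected_outage_hours >= considerable_hours
    if whole_site_unavailable and prolonged:
        return Severity.DISASTER
    if systems_unavailable >= 1 and prolonged:
        return Severity.MAJOR_INCIDENT
    # Anything else is handled in the normal way by the service desk.
    return Severity.BUSINESS_AS_USUAL
```

For instance, `classify(False, 2, 12)` is a major incident, while a short outage of a single system, `classify(False, 1, 0.5)`, remains business as usual.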

8.4.4 Problem AssessmentThe IMT has the responsibility of assessing the impact ofthe incident. Where possible, assessment should be madeby those with the most domain or system knowledge.Critical time can be lost to prevarication or indecision overwhether systems should be failed over. Likewise criticaltime can be lost by failing over the system to a remote sitetoo quickly when a simple local recovery or waiting for asystem to be repaired would have been sufficient. The IMTshould develop a set of detailed criteria based on pastexperience and escalate based on whether the currentsituation meets those criteria. Where the IMT identifies apotential impact upon the organization beyond IT services,it is its responsibility to activate the BCMT using theorganization’s defined processes.

8.4.5 Roles and responsibilitiesThe plan should include a full description of thepredefined teams: the IMT, and the constituent specialistrecovery teams for platforms, services and facilities. Theseshould also contain each member’s role and responsibilityand current contact information.

NOTE 1 Any documents containing phone numbers shouldconstantly be updated. Electronic documentation that is linkedto directory systems is useful for keeping the plan up-to-dateprovided the directory systems are resilient enough to withstand incidents.

NOTE 2 IMT members should be geographically dispersed inorder to withstand environmental incidents.

8.4.6 Procedures to follow

These procedures should be prepared in readiness for this document and should be frequently updated as a result of rehearsals and actual invocations. If a site has multiple IT systems, there should be multiple procedures which form part of the overall recovery procedures.

NOTE Procedures could be developed as hyperlinked electronic documentation connected via a single high level index. This has the advantage over paper that new versions of sub-documents can be released without having to replace all the documents in the set, and everyone has access to the most up-to-date information.

All procedure documentation should be readily available to all points of the enterprise, even in an incident scenario. It could be a good idea for IMT members to keep the latest copy of the plan where they will always have access to it, e.g. at home or on their company laptops. The plan(s) are likely to contain sensitive or confidential information and should always be held securely, with appropriate measures taken to ensure that the contents cannot be accessed by unauthorised personnel (see BS ISO/IEC 17799:2005 for further information on the kinds of measures which can be appropriate). Following a major incident there may not be the time or equipment necessary to print copies of plans, so any documentation you create should be either easy to read from a screen, or printed out on a regular basis.

The procedures may take many forms. One useful form is a flowchart that shows the various possible high-level steps that should be followed and decisions that should be made. A common form of this process flowcharting is known as a swim-lane diagram. Each lane represents an individual recovery process for one system.

Figure 5 shows the highest level process flow for a financial institution which runs only three financial systems: savings, mortgages and insurance. How each system is recovered depends on the chosen architecture of each system. In this example, the savings system uses synchronous remote mirroring of the savings database. Recovery takes the form of enabling the remote mirrors on the remote system, recovering the database environment and then allowing branch traffic to access the system from the remote site. The mortgage system uses a combination of tape back-ups and audit log shipping. To recover this environment, first reload the last known copy of the database from tape and then bring it up to date by reapplying the audit records read from the audit logs. The insurance system is a high availability clustered system which automatically fails over to the back-up site to provide almost uninterrupted service.

NOTE In this example there are no interdependencies between the individual systems. This may not be the case in reality. Quite often one system needs to be recovered before another can be brought on line.
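Where interdependencies do exist, the order of recovery can be derived from them rather than maintained by hand. The following is a minimal sketch using the Python standard library module `graphlib` (Python 3.9 or later). The dependency map is entirely hypothetical: the example institution above has no interdependencies, so the relationships shown here are invented purely to illustrate the technique.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each key is a system and each value is the
# set of systems that must be recovered before it can be brought on line.
depends_on = {
    "branch-network": set(),
    "savings": {"branch-network"},
    "insurance": {"branch-network"},
    "mortgages": {"branch-network", "savings"},
}


def recovery_order(dependencies):
    """Return one valid recovery sequence.

    Raises graphlib.CycleError if the dependencies are circular, which
    would indicate that no valid recovery sequence exists.
    """
    return list(TopologicalSorter(dependencies).static_order())


order = recovery_order(depends_on)
```

In this sketch `order` always begins with `branch-network`, and `savings` always appears before `mortgages`. Systems with no relationship to one another may appear in either order, which also indicates which recovery lanes could run in parallel.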

Figure 5 – Example of a high level process flow chart for service continuity management

[Figure 5: swim-lane flow chart. Box labels recovered from the figure, grouped by lane:]

Initial steps and decisions: Disaster / Major Component Event; Contact EMT members; Assess scale of disaster; Main site still usable and safe? (Yes/No); Switch all branch networks to remote site; Savings systems available? (Yes/No); Mortgage systems available? (Yes/No); Insurance systems available? (Yes/No); End failover checks.

Prepare backup site for full production running: Re-route help desk and operations calls to remote site; Notify branches of major disaster invocation; Call in Disaster Operations Team; Establish disaster operations bridge at remote site; Switch remote access ports to remote site; End site preparation.

Failover savings systems to remote backup: UP mirror of savings database packs; Short recovery of DBMS environment; Restart DBMS support runs; Allow branch traffic for savings systems; End of recovery of savings systems.

Failover mortgage systems to remote backup: Reload mortgage systems from last PIT backup tapes; Re-apply DBMS audit logs to mortgage database; Validate mortgage database for corrupted entries; Restart DBMS support runs; Allow branch traffic for mortgage systems; End of recovery of mortgage systems.

Failover insurance systems to remote backup: Cluster failover insurance systems to remote backup; Allow branch traffic for insurance systems; End of recovery of insurance systems.


Each process in this flow chart should be documented separately, with its own flowchart if necessary highlighting each task that forms the process. The documented procedures should provide detailed step-by-step instructions. The level of detail required in the plan will depend on the skill level of the intended audience.

Each task shown in the top level process flow chart should be accompanied by a summary sheet containing the items shown in Figure 6.
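If the summary sheets are held electronically, each one can be treated as a structured record, which makes it possible to check mechanically that a task is not started before its preceding tasks are complete. The sketch below is one possible representation, populated with the values of the example sheet in Figure 6. The class name `TaskSheet`, its field names and the rule in `ready_to_start` are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class TaskSheet:
    task_id: str
    title: str
    location: str                 # "Action takes place at"
    completed_by: str             # "Task completed by"
    requestor: str
    preceding_tasks: List[str] = field(default_factory=list)
    minutes_to_complete: int = 0
    signed_off: bool = False


def ready_to_start(task: TaskSheet, signed_off_ids: Set[str]) -> bool:
    """Assumed rule: a task may start once all preceding tasks are signed off."""
    return all(p in signed_off_ids for p in task.preceding_tasks)


# Values taken from the example sheet for Task A-4.
a4 = TaskSheet(
    task_id="A-4",
    title="Call-in the fail-over operations team",
    location="Wolverhampton Back-up Site",
    completed_by="Remote Site Operations Support Manager",
    requestor="Incident Management Team (BCM Manager)",
    preceding_tasks=["A-3"],
    minutes_to_complete=10,
)
```

Here `ready_to_start(a4, set())` is false until task A-3 appears in the set of signed-off tasks.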

Figure 6 – Task summary sheet

Task A-4: Call-in the fail-over operations team

Task description: Contact remote site on-call operations staff and request extra coverage at the remote site.

Essential documentation: Current remote site Operations Contact List – Contact-List.doc; Emergency Call Out Procedure – Emergency-Call-Out-.doc

Action takes place at: Wolverhampton Back-up Site

Task completed by: Remote Site Operations Support Manager

Preceding tasks: A-3

Time to complete task: 10 minutes

Requestor: Incident Management Team (BCM Manager)

Full description/reason for action: There is a need to provide full operations coverage at the remote site to augment normal skeleton staff. Thus need to invoke emergency on-call procedures for operations.

Status check: Ensure that the section below is completed and signed-off

Signature Name Time

Status and Comments:

8.4.7 Fail-back

Although it may not be possible to plan for all post fail-over scenarios, where for example there has been total devastation of the production site, basic planning should be undertaken and the high level steps understood. When returning service to the original system or site, detailed plans should be created for the fail-back process. In these circumstances it is unlikely that fail-back will be a straightforward reversal of the fail-over steps, and a separate set of procedures is likely to be required. Thus a full fail-back plan should be in place with the same quality and standard of documentation as for the fail-over. Figure 7 shows an example fail-back plan for the fictitious fail-over considered in Clause 8.4.6.


Figure 7 – Example of a high level process flow chart for fail-back

[Figure 7: swim-lane flow chart. Box labels recovered from the figure, grouped by lane:]

Start: EMT request fail-back.

Prepare production site for full production running: Re-route help desk and operations calls on production site; Establish operations bridge at production site; Switch remote access ports to production site; End site preparation.

Restore savings systems to production site: UP production mirror of savings database packs; Short recovery of DBMS environment; Restart DBMS support runs; Allow branch traffic for savings systems; End of recovery of savings systems.

Restore mortgage systems to production site: Reload mortgage systems backup tapes; Re-apply DBMS audit logs to mortgage database; Restart DBMS support runs; Allow branch traffic for mortgage systems; End of recovery of mortgage systems.

Restore insurance systems to production site: Cluster failover insurance system to production site; Allow branch traffic for insurance systems; End of recovery of insurance systems.

Remaining boxes: Switch all branch networks to production site; Orderly shutdown of backup site; End fail-back.


9 Rehearsing an IT Service Continuity plan

9.1 Introduction

The delivery of, and the feedback from, any rehearsal is one of the most interesting and fruitful parts of any business continuity programme. However, its success depends almost entirely on the way in which it is approached and developed. Good solid preparation ensures a sound delivery and everybody benefits from the exercise. Poor preparation leads to an ineffective rehearsal and the whole programme suffers. One unsatisfactory experience in an ill-conceived rehearsal will cause most participants to want to distance themselves from the whole concept of business continuity.

On the other hand, a well-prepared exercise will provide all of the participants with a profitable experience. They will be fully engaged in the opportunity to learn from practical experiences. Thus they will become more competent whilst gaining confidence in themselves as well as the plans and procedures.

It is important for the organization’s staff to be aware of and to recognize the differences between a service continuity rehearsal and an actual invocation. The main difference is the high degree of planning and preparation that is required for each rehearsal.

With any rehearsal there is a high degree of planning and preparation to ensure that there is little or no impact upon the live systems and to also ensure the rehearsal objectives are met wherever possible.

All resources identified should be booked and made available for the planning and preparation required for the rehearsals. Also during any rehearsal the live systems will still be running and therefore have to be maintained and supported.

9.2 Roles and responsibilities

The service continuity manager:

a) is responsible for service continuity;

b) is the service continuity management process owner;

c) leads the development of the service continuity recovery plan;

d) is the person who invokes the service continuity recovery plan;

e) is a senior member of the IT function;

f) does not need to be technical;

g) should understand the IT priorities of the users;

h) should not delegate responsibility;

i) should have cover during absence.

The service continuity recovery team:

a) participates in the rehearsing and invocation of the service continuity recovery plan;

b) includes technical staff for technical procedures;

c) includes users for rehearsing and during actual invocation;

d) includes departmental representatives for communication and coordination (in rehearsing and in invocation);

e) is led by the service continuity manager.

9.3 Rehearsal guidelines

Staff resources, costs and implications should be considered by the organization when planning for a rehearsal.

Staff resources are the most important element as without them the rehearsal would be difficult, if not impossible, to conduct. The staff resources should have the appropriate skills for any rehearsal, including appropriate platform knowledge, storage management knowledge and application knowledge.

NOTE These resources are not only required for the actual rehearsal but also for pre-rehearsal meetings and should allow sufficient time for preparation and planning. Senior management ‘buy-in’ to this is essential.

Costs are perhaps the most sensitive consideration of any rehearsal as they are not insignificant. Therefore, each rehearsal should be scoped in order to leverage maximum rewards/benefits and strive towards the organization’s overall continuity objectives.

There are implications to conducting rehearsals which the organization’s senior management need to be made aware of. For example, whilst preparing, planning and attending the rehearsal, staff are not doing their day job, which therefore impacts upon existing services, processes, systems and projects. Senior management should be aware of this and plan accordingly.

Whilst everything is done to minimize the disruption rehearsals can have on the business, the following should also be considered:

a) Is it possible to time this rehearsing to cause the least disruption to business functions?

b) How much will the rehearsal cost? Is this appropriate for the additional confidence gained over other forms of rehearsing, including a tabletop or scenario exercise?

c) Does the rehearsal scope continue to progress against the agreed rehearsing strategy and associated annual plans?

d) How can staff be trained to cope with the situation if they do not experience it in rehearsal-mode?

e) Once the BCMP is in operation, how will you return to normal business operations? Are there specific issues here that warrant rehearsing in their own right?

f) How different are the circumstances of an actual invocation likely to be relative to those of a rehearsal?

NOTE For example, it may be advisable to use copies of live systems and data in a rehearsal, and the emotional environment of a rehearsal is likely to be more relaxed than in a real incident.

9.4 Business user rehearsing

Whilst the organization’s technical staff perform the service continuity rehearsal, the business users should validate the recovered applications and services. Therefore, they should understand their role to allow them to prepare appropriately.

All business users who take part in service continuity rehearsals should be aware of the artificial environment, the benefits of rehearsing, and how to prepare and use rehearsal scripts and data input for validation.

The environment used for an exercise might not be identical to the live environment in an actual invocation; therefore participants should be aware of and understand the differences. For example, they might not have access to current data or logons might be different.

Business users should rehearse to validate the recovery and feel confident that their applications and services can be recovered. Rehearsing should also provide valuable feedback to the organization, ensure the recovery is achieved as expected and offer opportunities for improvement.

Business users should develop rehearsal scripts which can be followed during a rehearsal to ensure that the appropriate elements for a particular rehearsal are tested. Rehearsal scripts also provide valid input into the audit process.

As the rehearsals become more complex, they should be as realistic as possible so that data can be tracked through the various recovered systems from front office to back office. The input data should be validated and the results, when running the rehearsal scripts, transactions and batch jobs, should be checked against pre-defined expectations.

9.5 Strategy

To achieve the organization’s ITSC objectives, a combination of the following recommendations should be considered. The frequency of exercises will depend on the individual circumstances of your organization but accepted best practice is to exercise plans at least once a year.

a) ‘Callout’ rehearsals should be conducted regularly; in addition, a surprise callout rehearsal should be conducted involving all departments and the IMT.

b) Walkthrough reviews of recovery plans, emergency management plans and departmental plans.

c) Scenario-based walkthrough exercises for IMT, support teams and individual departments.

d) Component rehearsing (e.g. individual departments, business processes, IT systems, voice and data network links, etc.), for instance when new systems are implemented, when there are previous rehearsal failures, when changes occur or for previously unrehearsed components. Component testing should also be considered during periods when a more comprehensive test cannot be completed, e.g. test that network traffic can be redirected to the fail-over site, that users can connect to the fail-over site and that live data can be restored at the fail-over site.

e) Integration rehearsals (e.g. multiple systems and/or business processes). Where IT services rely upon combinations of information systems working together, the organization should reassure itself that it is capable not only of recovering the individual systems but also of recovering them in such a way that they provide the required services by interacting as expected.

f) Relocation rehearsals (technical and business recovery), whereby key parts of the business relocate to, and operate from, the recovery site, including the loss of the main facility, an IT switch or critical business processes.

g) Fail-over rehearsals of the live IT environment to the recovery site (including verification by users) and business relocation rehearsals.

h) Major incident simulations should include scenario-based role playing exercises, IT fail-over, business relocation and full fail-back rehearsals.

In all cases, results should be documented and updates to appropriate continuity plans completed within four weeks of each rehearsal.

All rehearsing should be carefully managed and coordinated to ensure low risk to the business but with maximum return on the effort put in.
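The two time limits mentioned in this subclause (plans exercised at least once a year, and plan updates completed within four weeks of each rehearsal) lend themselves to a simple automated reminder. The following sketch shows the date arithmetic only; the function names are hypothetical and the interpretation of "a year" as 365 days is an assumption.

```python
from datetime import date, timedelta

MAX_EXERCISE_INTERVAL = timedelta(days=365)  # "at least once a year" (assumed 365 days)
PLAN_UPDATE_WINDOW = timedelta(weeks=4)      # "within four weeks of each rehearsal"


def exercise_overdue(last_rehearsal: date, today: date) -> bool:
    """True if the plan has not been exercised within the last year."""
    return today - last_rehearsal > MAX_EXERCISE_INTERVAL


def plan_update_due(rehearsal_day: date) -> date:
    """Latest date by which continuity plan updates should be completed."""
    return rehearsal_day + PLAN_UPDATE_WINDOW
```

For a rehearsal held on 11 August 2006, `plan_update_due` returns 8 September 2006.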

9.6 Rehearsal programme management

To support the rehearsal programme an adequate management framework should be in place as illustrated in Figure 8.

[Figure 8: organization chart with the following boxes: Business Continuity Steering Group; Business Continuity Coordinator; Compliance/Audit Team; Business Continuity Rehearsal Group; IT Rehearsal Working Group.]

Figure 8 – Suggested Programme Management Organization

The suggested roles are as follows:

a) Business Continuity Coordinator: the key facilitator of the Business Continuity function.

b) Compliance/Audit: to oversee recovery rehearsals and exercises and to ensure they meet the regulatory requirements and satisfy external auditors.

c) Business Continuity Steering Group (BCSG): oversight committee for the entirety of the business continuity function consisting of senior representation from all business areas, to reflect the business-wide impact of business continuity planning and management.

NOTE As part of the rehearsal strategy, the organization’s Business Continuity function should maintain a rolling rehearsal schedule. The Business Continuity Steering Group should sign off the rehearsal programme as part of this document being issued.

d) IT Rehearsal Working Group: responsible for planning technical IT aspects of recovery rehearsals.

e) Business Continuity Rehearsal Group: chaired by the Business Continuity Coordinator and including representatives from the IT Support Groups and Compliance/Audit. The Business Continuity Rehearsal Group reports to the BCSG.

The Business Continuity Rehearsal Group is responsible for:

1) planning and executing all ad hoc infrastructure rehearsing, and regular full-scale service continuity simulation rehearsals;

2) agreeing the rehearsal scope and objectives with the business, via the BCSG;

3) pre-rehearsal planning and preparation;

4) production of the rehearsal plan document;

5) coordination of activities during the rehearsal;

6) post rehearsal reporting;

7) follow-up of actions arising.

The Business Continuity Rehearsal Group should be business-led, rather than IT-led. The Business Continuity Rehearsal Group should meet regularly, as required to meet the above responsibilities. Typically, this will be monthly, but increasing in frequency in the weeks before a rehearsal.

9.7 Rehearsal planning process

9.7.1 Rehearsal plan contents

An effective rehearsal should contain:

a) a body responsible for control and coordination;

b) objectives and success criteria;

c) a rehearsal plan and schedule;

d) a reversion plan allowing restoration back to live service at certain key points;

e) briefing of participants;

f) management and coordination;

g) event logs and rehearsal feedback forms;

h) independent observers;

i) post-rehearsal reporting, follow-up and action plan.

Post rehearsal reporting should draw on a variety of sources, e.g. the number of helpdesk calls for the duration of the test compared to the normal number of calls for the day and time the test was carried out, to see if there was an increased number of incidents recorded.
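The helpdesk comparison described above amounts to a simple calculation: the call count during the test is set against a baseline taken from comparable periods (same weekday and time of day). A minimal sketch follows; the function name and the sample figures are invented for illustration, and how the baseline periods are chosen is left to the organization.

```python
from statistics import mean
from typing import Sequence


def call_volume_change(calls_during_test: int, baseline_counts: Sequence[int]) -> float:
    """Relative change in helpdesk calls against the mean of comparable periods.

    A result of 0.25 means 25% more calls than normal were logged.
    """
    baseline = mean(baseline_counts)
    if baseline == 0:
        raise ValueError("baseline call volume must be non-zero")
    return (calls_during_test - baseline) / baseline


# Hypothetical figures: calls logged in the same time slot on four earlier days.
baseline_counts = [40, 44, 36, 40]
change = call_volume_change(50, baseline_counts)
```

A marked increase does not by itself show that the rehearsal caused the additional incidents, but it indicates where the post-rehearsal review should look.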

9.7.2 Rehearsal planning principles

The rehearsal process includes a number of principles, which should be applied throughout the planning process:

a) Document an overall rehearsal strategy with a desired objective to be reached within a clearly defined timeframe, which should include the move to rehearsing invocation.

b) Involve the customers in the service continuity rehearsing process.

c) Document and agree a detailed annual plan and rehearsal programme which relates to the overall rehearsing strategy.

d) Set real and achievable objectives with realistic dates.

e) Ensure that all critical daily tasks and housekeeping routines are included.

f) Include Business Continuity aspects and Business Recovery rehearsing in the plans.

g) Include scenario planning/rehearsing with a generic priority list.

h) Promote continuous improvements by following actions, suggestions and ideas from previous rehearsals.

i) Include the Service Continuity Management team in rehearsals and test their abilities.

9.7.3 The importance of rehearsing

Rehearsing is a vital part of the long term BCM lifecycle, which will prove the viability of recovery plans and highlight areas for further improvement. It also provides an ideal training opportunity for those involved in the key activities.

Rehearsals are so called so that areas of weakness can be identified and new processes implemented to improve resilience. It is crucial that rehearsals are seen as positive tasks and any internal political influences are eliminated so that the focus on business resilience and continuity is maintained.

The overall aims of the rehearsing strategy are to ensure effective crisis management and to enable live processing to be moved to the recovery site(s) on a regular basis and become part of business as usual.

NOTE Even the most comprehensive rehearsal does not cover everything. For example, in a service disruption where there has been injury or even death to colleagues, the reaction of staff to a crisis cannot be rehearsed and the plans should make allowance for this.

Rehearsals should have clearly defined objectives and critical success factors which will be used to determine the success or otherwise of the exercise as well as of the BCP itself.

A full rehearsal should replicate the invocation of all standby arrangements, including the recovery of business processes and the involvement of external parties. This should test completeness of the plans and confirm:

a) time objectives, e.g. to recover the key business processes within a certain time period;

b) staff preparedness and awareness;

c) staff duplication and potential over commitment of key resources, during invocation of the BCP;

d) the responsiveness, effectiveness and awareness of external parties.
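Item a) in the list above can be confirmed by recording, during the rehearsal, how long each key business process actually took to recover and comparing the result with its time objective. The sketch below illustrates one way of doing this; the process names and durations are invented, and treating an unrecovered process as a missed objective is an assumption of the example.

```python
from datetime import timedelta
from typing import Dict, List


def missed_objectives(
    objectives: Dict[str, timedelta],
    measured: Dict[str, timedelta],
) -> List[str]:
    """Return the processes that were not recovered within their time objective.

    A process with no measured recovery time is treated as not recovered.
    """
    missed = []
    for process, objective in objectives.items():
        actual = measured.get(process)
        if actual is None or actual > objective:
            missed.append(process)
    return missed


# Hypothetical time objectives and rehearsal measurements.
objectives = {
    "savings": timedelta(hours=1),
    "mortgages": timedelta(hours=8),
    "insurance": timedelta(minutes=5),
}
measured = {
    "savings": timedelta(minutes=45),
    "mortgages": timedelta(hours=9, minutes=30),
}
missed = missed_objectives(objectives, measured)
```

With these figures `missed` contains `mortgages` (recovered late) and `insurance` (no recovery recorded), both of which would feed into the post-rehearsal action plan.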

Rehearsals may be announced or unannounced. However, in the latter case senior management should approve the rehearsal in advance, otherwise it may be difficult to achieve commitment.

9.7.4 Rehearsal objectives

The rehearsal strategy should meet the objectives to:

a) validate emergency callout procedures and contact details contained in the recovery plans;

b) ensure key staff are familiar with their Incident Management, Business Recovery and Technical Recovery plans;

c) prove the ability to recover the technical IT and communications infrastructure;

d) prove the ability of critical staff to relocate to and work from the nominated recovery site(s);

e) validate the effectiveness and accuracy of the documented IT and Business Recovery plans.

9.7.5 Planning a rehearsal

All parts of each rehearsal should be planned in advance, as without planning and preparation the following could occur:

a) objectives will not be met and live systems could be adversely affected;

b) the rehearsal could fail, which will cause the staff involved to disassociate themselves from Business Continuity and service continuity rehearsal;

c) the identified resources (staff and other) may not be available when required or may not be appropriate, e.g. in terms of skill sets, communications links and server specification;

d) there is nothing to measure progress against and therefore no opportunities to improve the rehearsing process;

e) expectations of the organization’s staff and customers may not be met or may remain unknown.

NOTE In many ways each rehearsal can be viewed as a ‘project’ in that it has defined start and end points and should have agreed objectives and desired outcomes. For guidance on best practice in project management and planning the reader should refer to PRINCE2 [2] and/or the Project Management Institute’s ‘Project Management Body of Knowledge’ [3].


10 Solutions architecture and design considerations

10.1 General

Service continuity may be achieved in many ways ranging from replicating every single IT component to removing all known single points of failure from those components. There are many available models to choose from as illustrated in Figure 9. An organization may, however, favour one particular model but then also use components of several others to complete the IT architecture.

Figure 9 – Infrastructure Architecture Models for Business/Service Continuity

[Figure 9: layered diagram. Layer labels: Site, Application, Platform, Data. Models shown: Site recovery; Site/data centre failover; Application failover/load balancing; Redundant systems; SAN, NAS & DAS; Backup and restore; Rapid equipment replacement; High availability system features.]

If the IT architecture is changed to support ITSC then the change should be checked to ensure it does not compromise continuity or security. Thus a review of the complete environment should be undertaken to ensure security is maintained at the same level. This should include a thorough examination of alternative/back-up sites and network links between them. The following should be considered:

a) Is the replication of data exposing client data?

b) Are the service continuity rehearsal plans secure or could these be used to identify weaknesses in the IT architecture?

c) Are there unused service continuity rehearsal Internet Protocol (IP) addresses, which during normal operation a hacker could use to gain access to the network?

The classic approach to ITSC is to use a two-site model which has a back-up site that can continue to provide a service when the main site is disabled or destroyed by an incident. There are a number of ways in which this remote site model may be implemented (see Annex D), depending upon the organization’s requirements.

10.2 System resilience

Typically any system running mission critical applications should be locally resilient. This means that the central system has no known single points of failure such as power supplies, CPUs, I/O Processors. In addition, paths to multiple peripherals are duplicated or duplexed and disk devices are mirrored or part of a Redundant Array of Independent Disks (RAID) configuration. Loss of any single component should not cause an interruption to service.

Further information can be found in Annex E.
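A first, rough check for single points of failure can be made from a component inventory by listing every component type of which the system has fewer than two instances. The sketch below shows the idea; the inventory contents and the function name are hypothetical, and a count of two or more does not by itself prove independence (for example, two power supplies fed from the same circuit).

```python
from typing import Dict, List


def single_points_of_failure(inventory: Dict[str, int]) -> List[str]:
    """Return, in sorted order, component types with fewer than two instances."""
    return sorted(name for name, count in inventory.items() if count < 2)


# Hypothetical inventory for one system: component type -> installed count.
inventory = {
    "power supply": 2,
    "CPU": 4,
    "I/O processor": 1,
    "disk path": 2,
    "network switch": 1,
}
spofs = single_points_of_failure(inventory)
```

With this inventory the function reports the I/O processor and the network switch as candidates for further risk analysis.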

10.3 Application resilience

Application software may also play a part in system resilience by creating cluster systems viewed as a single system by the outside world but implemented physically as multiple independent systems with automated fail-over between hosts. There could be issues relating to the sharing of databases (see D.2). Clustering and database sharing should be implemented if there are concerns around hardware or even software stability. Any application resiliency mechanisms should ensure recovery of data to consistent points. For example, if a database has data on one volume and the indices on another, then the application should ensure that updates to the disks are either all applied or none applied, i.e. the update is ‘atomic’. Databases that are resilient in this way are said to adopt Atomic, Consistent, Isolated and Durable (ACID) properties.
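The all-or-nothing behaviour described above can be demonstrated with a small example. The sketch below uses the SQLite database bundled with Python as a stand-in: instead of data and indices on separate volumes, it uses two rows that must be updated together, which is a simplification chosen for the example. The table, account names and amounts are invented. The connection is used as a context manager, which commits the transaction if the block completes and rolls it back if an exception is raised.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("savings", 100), ("mortgage", 0)])
conn.commit()


def transfer(conn, source, target, amount):
    """Apply both updates or neither; return True if the transfer committed."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, target),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected the second update, so the first
        # update has been rolled back as well.
        return False


def balances(conn):
    """Return the current balances as a dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

Attempting `transfer(conn, "savings", "mortgage", 500)` fails on the second update and leaves both balances unchanged, even though the first update had already been executed.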

A stateless server is one that provides a service but retains no transaction state information between interactions from the client. Each transaction is atomic, i.e. self-contained, and has no relation to preceding or following interactions. An example of this type of server is a web server; web applications are typically stateless. Naturally, stateless servers are good candidates for the creation of server farms: large groups of servers that all offer the same level of service. When optimum load is exceeded, another server running the same stateless server software should be added.
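The property that makes stateless servers suitable for server farms is that any server can handle any request with the same result. The following sketch illustrates this with a hypothetical handler whose reply depends only on the contents of the request, and a simple round-robin dispatcher. The server names and request format are invented for the example and no real load balancer is implied.

```python
from itertools import cycle


def handle(request):
    """Stateless handler: the reply depends only on the request itself."""
    return {"account": request["account"], "total": sum(request["amounts"])}


def dispatch(requests, servers):
    """Assign requests to servers in round-robin order.

    Returns a list of (server, response) pairs. Because the handler keeps
    no state, the responses do not depend on which server was chosen.
    """
    return [
        (server, handle(request))
        for server, request in zip(cycle(servers), requests)
    ]


requests = [
    {"account": "A", "amounts": [10, 20]},
    {"account": "B", "amounts": [5]},
    {"account": "A", "amounts": [1, 2, 3]},
]
two_servers = dispatch(requests, ["server-1", "server-2"])
three_servers = dispatch(requests, ["server-1", "server-2", "server-3"])
```

Adding `server-3` changes which server handles each request but not the responses, which is why capacity can be increased simply by adding another server running the same software.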

10.4 Network resilience

The network should be resilient and capable of handling the fail-over approach. There should be adequate communication bandwidth between sites to allow production to switch from one site to another and for performance to remain acceptable for business needs.

Where appropriate, networks should use dual paths between critical systems, both within a site and between sites, with all components replicated (switches, network cards, etc.). Single points of failure should be identified and a risk analysis performed to identify if the risk is acceptable. Alternative network providers should be considered for inter-site links. This includes the last mile from any major trunks to the site, with cabling routed independently, following separate routes into the building and terminating at physically separate communication equipment.

10.5 Data resilience

Typically computer systems are reliant on the resilience of their disk based data storage. There are many different models that can be adopted to ensure data resilience, some of which are described in Annex F, which discusses various approaches to resilience. Organizations should select the most appropriate model or models.


11.1 GeneralBuying continuity services is not a simple process. Anyorganization that chooses to minimize its risks byoutsourcing to a third party should assess the viability andsustainability of the service it is buying. This is especiallythe case for continuity services, which may never be usedand are hard to rehearse outside of a controlled andpre-planned environment.

Paradoxically, it is quite possible that buying continuityservices from an external supplier could compromise anITSC plan if the due diligence on that supplier and itsservices has not been thorough.

An organization should understand how a continuityservices organization (supplier) makes money. Forexample, the supplier invests in resources (buildings,infrastructure, IT equipment etc.) that may be requiredby a client if an incident or failure occurs. To ensure thatthe service continuity rehearsal services are economicallyviable and thereby affordable to a client, and also toensure the supplier is profitable, it syndicates thoseresources across as many clients as possible. The supplierthen manages the chance (risks) of more than one clientinvoking the service and thereby demanding access tothose same resources simultaneously.

The implication is that, if the supplier does not managethe risk of multiple, simultaneous invocations bothprofessionally and reasonably then the buyer could, in the event of a major incident, be denied access to the very resources it has subscribed to and thereby couldstruggle to regain IT and thereby business resumption.

There are a range of questions to which satisfactoryanswers should be required when buying any service orproduct from an organization, irrespective of industry.This section is focused on the specialist due diligencerequired when buying continuity services and assumes thereader is already versed in standard purchasing practicessuch as financial due diligence and validating theaccreditations of a supplier.

Further information on best practice in this area can befound at The Chartered Institute of Purchasing and Supply5).

11.2 Syndication managementThere is a high chance that companies based in closeproximity could be affected by the same incident or eventthat can disrupt IT and Business Continuity. There are

many examples of this, notably the terrorist attacks onNew York in 2001 and on London in 2005, accidents suchas the Buncefield oil terminal explosion and naturaldisasters such as the Asian Tsunami and Hurricane Katrina.The supplier should be able to demonstrate its riskmanagement system and the methods it uses to ensurethe risks of multiple, simultaneous invocations (whichmajor incidents and natural incidents imply) are as low as possible.

It is also highly advisable to assess the method ofsyndication used by the supplier and match it against thelevels of risk that your organization will find acceptable.For example, the supplier may offer lower prices if thebuyer is prepared to accept a higher syndication rate (risk).

11.3 Syndication ratios
The supplier could quote a ratio of clients that it will allow to concurrently subscribe to a particular resource, e.g. 25 clients share one computer. However, this ratio is just one aspect of the risk level that a buyer should be aware of and should not be accepted on its own as a satisfactory indication of the chances of gaining access to the resource you have subscribed to should an incident occur.

The supplier should be able to produce automatically a risk listing of:

a) its clients;

b) their industry;

c) their location;

d) the resources under cover;

e) the number of times it has sold those same resources;

f) the speed with which the resources are to be delivered and/or made available;

g) the length of time the resources may be required after an incident.

This report should be made freely available to the buyer who can then determine if the risk of buying from the supplier is acceptable.

Risk management is a dynamic process. The buyer of continuity services should periodically request and see the syndication report from the supplier and thereby continually be able to assess its own risk position.
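As an illustration only, the risk listing in items a) to g) could be held as one record per client subscription and summarized per resource. The sketch below is not part of this specification: the class name `Subscription`, the function `syndication_report`, the field names and the sample data are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Subscription:
    """One client's cover on one syndicated resource (hypothetical record)."""
    client: str            # a) the client
    industry: str          # b) their industry
    location: str          # c) their location
    resource: str          # d) the resource under cover
    delivery_hours: float  # f) how quickly the resource must be made available
    retention_days: int    # g) how long the resource may be needed after an incident


def syndication_report(subscriptions):
    """Group subscriptions by resource; times_sold corresponds to item e)."""
    by_resource = defaultdict(list)
    for sub in subscriptions:
        by_resource[sub.resource].append(sub)

    report = {}
    for resource, subs in by_resource.items():
        report[resource] = {
            "times_sold": len(subs),
            "clients": sorted(s.client for s in subs),
            "industries": sorted({s.industry for s in subs}),
            "locations": sorted({s.location for s in subs}),
            "fastest_delivery_hours": min(s.delivery_hours for s in subs),
            "longest_retention_days": max(s.retention_days for s in subs),
        }
    return report


# Illustrative data only.
subs = [
    Subscription("Client A", "Banking", "London EC2", "server-01", 4, 30),
    Subscription("Client B", "Retail", "Leeds", "server-01", 24, 14),
    Subscription("Client C", "Banking", "London EC2", "server-01", 4, 60),
]
report = syndication_report(subs)
print(report["server-01"])
```

A buyer reading such a summary would see at a glance that two of the three subscribers to the same resource share an industry and a location, which is the kind of concentration that a bare syndication ratio does not reveal.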

11.4 Location of clients
It is important, when buying Continuity Services, to understand not only the number of clients sharing a resource but also their location. As an example, it may be unlikely for an organization to find it acceptable to share the same resource with another client of the supplier in the same building, street or close area.

11 Buying Continuity Services

5) http://www.cips.org

11.5 Risk presented by other clients
In addition to the location of other clients, it is also crucial to understand who those clients are and the industry they are in. By doing so the buyer becomes able to determine the likely threat those other clients could place on their own ITSC plans, e.g. whether their very presence could constitute a threat or they could be a target for extremists, which could have a knock-on effect on your own ITSC.

This is a dynamic equation and will often provide a range of risk positions dependent upon the current political climate. As an example, it would be appropriate to know if you are subscribing to syndicated resources that are shared with an organization that could be classed as a welfare, social or political risk, e.g. an organization that could be the target of an animal rights group, a forestry business that could be threatened by an environmental pressure group or an organization that could be known to sympathize with a particular side in an area of political unrest.

It is often difficult to know exactly what risks other clients may actually place on you. Clearly there are limits to what you can do; however, it is often worth imagining (if not actually doing) a helicopter scan over your premises and the surrounding areas of other clients. You may, for example, not be aware that your neighbour is storing gas cylinders in their work yard or is charging fuel tanks next to your building. You may be closer to a flood plain than you had originally thought or there could be building works going on that could, by accident, cut your telecoms lines etc. The Buncefield oil terminal explosion proved to be the classic example of a single incident causing direct and consequential issues for many companies.

11.6 Location and Physical Security
When there is an incident that has to be managed by the police and other emergency/security services, an area could be cordoned off for safety and on-going incident management purposes. This could mean that access to the premises is denied and buildings could be evacuated.

When buying a continuity service, the buyer should expect that its chosen supplier does not sell the same resources to any other companies within the same geographic location. The buyer should, in advance of subscribing to the service, understand the typical size of an exclusion zone enforced by the security services and determine what is a reasonable and satisfactory area to demand exclusive access to the syndicated resources. The buyer can then ask the supplier to prove that it has allocated the requisite exclusion zone.
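One way a buyer might check a supplier's exclusion-zone claim is to compute the distance between the premises of every pair of clients sharing a resource and flag any pair that falls inside the agreed radius. The following is a minimal sketch, assuming client premises are known as latitude/longitude pairs; the function names, the sample coordinates and the 1 km radius are illustrative assumptions, not values from this specification.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0


def distance_km(a, b):
    """Great-circle (haversine) distance between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def exclusion_conflicts(clients, radius_km):
    """Return pairs of clients whose premises lie within radius_km of each other."""
    conflicts = []
    for (name_a, pos_a), (name_b, pos_b) in combinations(sorted(clients.items()), 2):
        if distance_km(pos_a, pos_b) <= radius_km:
            conflicts.append((name_a, name_b))
    return conflicts


# Hypothetical clients subscribed to the same syndicated resource.
clients = {
    "Client A": (51.5155, -0.0922),
    "Client B": (51.5142, -0.0931),
    "Client C": (53.8008, -1.5491),
}
print(exclusion_conflicts(clients, radius_km=1.0))
```

With these sample coordinates, Client A and Client B are a few hundred metres apart and would be reported as a conflict, while Client C is in a different city. The radius itself should come from the buyer's own assessment of a typical cordon, as described above.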

Another consideration is that of the physical and environmental security measures which are in force in the recovery site. These should be equivalent to those for the primary location and should be regularly audited against specific and detailed requirements.

11.7 Rehearsing
An ITSC plan should always be rehearsed to ensure it is current and appropriate to meet the required ITSC service levels. It is crucial when procuring continuity services from a supplier that the services are rehearsed.

A supplier will have a finite amount of resource (both equipment and people). It is important when buying a service that the supplier's resources are known to the buyer to help it gauge the chances of service provision when an incident or interruption occurs. This information should be made readily available by the supplier; however, one way to gauge the amount of available resource is to request scheduled and unscheduled rehearsals.

If a supplier is under-resourced to meet its contractual obligations, it is unlikely to be able to honour short timescale scheduled rehearsals. Should this happen, then alarm bells should be ringing, as a lack of resource in a rehearsal, when all is relatively calm and quiet, is likely to mean over-stretched resources and over-syndicated services. If the doubt is there, the buyer should ask deeper questions to ensure its own risk management levels have not been compromised.


Annex A (informative) Conducting business criticality and risk assessments

A.1 General
The approach described here is a variant of Failure Modes and Effects Analysis (FMEA). A variant is suggested because the standard FMEA approach assumes application to a business process and concentrates on the causes and effects of disruption or failure of steps in the process.

NOTE Such an approach would not be directly applicable to a departmental management process or a project, though sufficient common ground exists for the approach to be adapted for those circumstances. The variant is also required because, since the development of FMEA, the risk management industry has widely accepted that the concept of risk includes both threats and opportunities.

Figure A.1 indicates the steps in the risk assessment process, which results in the development of an ITSC plan (see Clause 9).

Figure A.1 – Risk assessment process
[Figure: boxes labelled Process and Risk Identification; Response Selection; Response Planning; Assign Responsibility and Implementation; Rehearse and Learn lessons; IT Service Continuity Plan]

NOTE Where systems and/or IT services are involved in safety critical environments, such as on oil rigs, nuclear power plants etc., more sophisticated approaches to risk management such as Monte Carlo Analysis may be more appropriate.


A.2 Process and risk identification
At the heart of any process for assessing risk there should be a set of ‘types of risks’ that can be easily understood by those conducting the assessment. In the case of a physical risk assessment, this should involve identifying the hierarchy of IT services that will be the subject of the assessment, the owners of each service and the dependencies between them. In the case of an organizational risk assessment it involves identifying the organization’s structure and the processes for which each node in the structure is responsible, the owners of each process and the dependencies between them.

NOTE This does not mean that all risk assessments should start with a business modelling exercise, since in many cases this information will already be available. Where the information exists, common sense suggests that it would be prudent to review it to ensure continued accuracy, but under no circumstances should effort be expended in reproducing work that already exists in an acceptable form.

The object of the risk assessment should therefore be to define the possible changes, understand how likely they are and how each change would impact IT service provision.

The types of risk that can be identified include changes to:

a) business process or activity, including risks ranging from catastrophic failure through minor disruption to positive improvement in productivity;

b) dependencies, including risks ranging in effect from the collapse of a critical supplier of goods or services to the temporary failure of an information flow from another business process;

c) plant or equipment;

d) buildings and environment;

e) information technology or systems;

f) information security including confidentiality, integrity and availability;

g) projects including risks associated with not delivering the specified solution, risks associated with the solution and risks associated with its delivery.

In assessing the types of risk to which a physical or organizational component of the business could be subject, the assessment should be well informed and based on verifiable evidence. Where possible and appropriate, the views of acknowledged experts should be called upon to ensure that the assessment of the nature and likelihood of a particular risk is as realistic as possible.

All risks identified during this activity should be described in the ITSC plan. At this stage it is only necessary to record summary details for each risk including a name, which should convey something of the nature of the risk, and a one or two sentence description of the nature of the risk.

The probability (likelihood) of a risk occurring should be determined according to Table A.1.

Table A.1 – Probability of risk occurring

Probability Definition

Low The risk is not expected to occur more than once per year.

Medium The risk is not expected to occur more than once per quarter.

High The risk is expected to occur at least once per month.

Very High The risk is known to exist or is expected to occur frequently and/or regularly.


In performing a risk assessment one should identify not only the immediate effects of the risk occurring but also the impact on the business of those effects. For example, the effect of a hard disk problem could be the corruption of some data stored on that disk, whilst the business impact of corrupt data relating to customer accounts could be significant cash flow problems and could also adversely affect the organization’s reputation for excellence. In general, the assessment of each risk should consider the impact on:

a) environment;

b) financial performance of the organization;

c) health and safety of employees and the public;

d) morale of employees;

e) productivity and process efficiency;

f) product quality;

g) business controls;

h) regulatory or legislative compliance;

i) reputation of the organization with its customers, investors, staff and suppliers;

j) political impact at local, regional, national and international level.

When assessing the impact of a risk one should ensure that the assessment is well informed and based upon verifiable evidence; hence, expert opinion should be called upon where possible and appropriate to do so. Table A.2 categorizes the impact of a risk.

Table A.2 – Impact of risk

Impact Definition

Low Expected to have a minor negative impact. The damage would not be expected to have a long term detrimental effect.

Example: very short-term (less than five minutes) power failure

Medium Expected to have a moderate negative impact. The impact could be expected to have short to medium term detrimental effects.

Example: short-term (less than one hour) failure of email system

High Expected to have a significant negative impact. The impact could be expected to have significant medium to long term effects.

Example: unexpected failure of online banking system resulting from unknown cause.

Very High Expected to have an immediate and very significant negative impact. The impact could be expected to have significant long term effects and potentially catastrophic short term effects.

Example: data centre destroyed by fire or flood


A.3 Response selection
Implementing a risk response should only be done if the tangible and intangible benefits of doing so outweigh the tangible and intangible costs. In addition, the tangible and intangible costs of preparing the response and ultimately of deploying it should not outweigh the costs of taking no action. Since success in business involves a degree of risk taking, there will be risks that the business is happy to accept in the expectation that doing so will result in improved profitability, market share or other tangible benefits.

The body responsible for deciding which responses should be implemented should consider the questions listed in Table A.3.

Table A.3 – Questions

Question: Is the risk likely to result in a positive outcome?
Options: If so, a response should be devised which causes the risk to occur and maximises the benefit derived from it. If not, consideration should be given to a response which would avoid, eliminate or mitigate the risk or its impact.

Question: Is the risk sufficiently likely or its impact sufficiently significant to justify implementing the response?
Options: If so, some form of response would appear appropriate.

Question: Would a decision not to develop a response leave the organization or its officers open to civil or criminal litigation?
Options: If so, prudence would suggest that some form of appropriate response should be developed.

Question: Would the benefits (both in terms of risk mitigation and other consequential improvements) to the business from implementing the response outweigh both the costs of taking no action and the costs associated with the implementation?
Options: If not, consideration should be given to alternative approaches which cost less to implement or in some cases whether the organization is prepared to accept the risk.

In order to ensure that Risk Management represents a viable and positive investment for the future of the business, a cost-benefit analysis for each possible risk response should be conducted. The objective of this exercise is to determine whether the benefits of taking action will outweigh the costs of taking no action. This analysis is then fed into the decision making process for selecting the responses to be implemented.

From the risk profiles (see A.2), documented in the ITSC plan, obtain details of the financial costs of:

a) the estimated cost of taking no action in the event that the risk occurs, i.e. the impact cost;

b) the estimated development and implementation costs of existing and new counter-measures;

c) the estimated costs that would be prevented or averted by implementing the proposed counter measures.

In addition to these financial costs, other factors should be taken into account, such as the organization’s reputation, employee health, safety and morale, environmental protection, security and the confidence of investors, customers and regulators. In each case, an estimate of the impact on the intangible factors should be made for taking no action, for preventing the risk and for implementing the proposed counter-measures.

By examining the intangible costs in conjunction with the financial costs a broader picture is seen. This can be fed into the process of deciding whether a response should be implemented for the risk(s) in question.
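The financial part of the cost-benefit analysis described in A.3 reduces to simple arithmetic on the three estimates a), b) and c). The sketch below shows one possible way of laying that arithmetic out. It is an assumption-laden illustration, not a method prescribed here: the class `ResponseOption`, both function names and the figures are hypothetical, and the intangible factors and the probability of the risk occurring are deliberately left out and would need to be weighed separately.

```python
from dataclasses import dataclass


@dataclass
class ResponseOption:
    """Financial estimates for one proposed risk response (hypothetical record)."""
    name: str
    impact_cost: float          # a) cost of taking no action if the risk occurs
    countermeasure_cost: float  # b) development and implementation cost
    cost_averted: float         # c) cost prevented or averted by the counter-measures


def net_financial_benefit(option):
    """Averted cost minus counter-measure cost; positive means the response pays for itself."""
    return option.cost_averted - option.countermeasure_cost


def residual_cost(option):
    """Impact cost still borne after the counter-measures are in place."""
    return option.impact_cost - option.cost_averted


# Illustrative figures only.
option = ResponseOption("Second data link", impact_cost=500_000,
                        countermeasure_cost=120_000, cost_averted=400_000)
print(net_financial_benefit(option), residual_cost(option))
```

A negative net benefit does not by itself rule a response out, since reputation, safety or regulatory considerations may still justify it; the figures are one input to the decision rather than the decision.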


A.4 Response planning

A.4.1 General
Once decisions have been made regarding the risk responses that are appropriate for the circumstances, the implementation of each element of the response should be carefully planned. Risk response planning is concerned with ensuring that resources are deployed effectively and efficiently, paying particular attention to maximizing the benefit to the business from implementing the response.

For each response the most appropriate people should be involved in its development and implementation. As such, the participants may work in areas of the business other than that affected by the risk, or may indeed work for other stakeholders.

The plan for implementing each risk response should identify:

a) the scope and objectives of the response, such as RTO, RPO etc;

b) planning assumptions;

c) pre-requisites;

d) summary of resources required (people, facilities, equipment, money);

e) work breakdown, identifying the sequence of activities required to implement the response, including the resources required for each step and estimates of the effort and elapsed time required.

Based upon the work breakdown for all required risk responses, a schedule of work should be created in which the timing of each activity should be determined by the availability of the time, effort and people required to complete it.

A.4.2 Assign risk category
Based on Table A.2, the definitions of risk categories, as deduced from predictions of likelihood and business impact, have been slightly modified, as shown in Figure A.2.

Figure A.2 – Risk categories
[Figure: matrix of Business Impact (Very high, High, Medium, Low) against Risk Likelihood (Very high, High, Medium, Low), with each cell assigned to Category One, Category Two or Category Three]


Having assigned the risk category, details of the likelihood, impact and risk category should be added to the risk description in the ITSC plan. At this stage the ITSC plan should contain the risks in category grouping, with Category One risks listed first.

To interpret these categories further:

a) A Category One risk is one to which the organization should certainly respond;

b) A Category Two risk is one to which the organization should consider responding;

c) A Category Three risk is one which the organization should consider accepting.

No two organizations are the same and thus no firm guidance on interpreting these categories can be given without it being inappropriate to a significant percentage of the audience. Hence, though the guidance above is intentionally vague, it helps to frame the questions the organization should be asking itself at this stage of the process.
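The category grouping described above, with Category One risks listed first, could be kept in a simple risk register. The sketch below is illustrative only. It assumes the category has already been read off Figure A.2 by the assessor and is supplied as data; it does not attempt to reproduce the figure's matrix. The names `RiskEntry`, `order_register` and `GUIDANCE`, and the sample risks with their categories, are hypothetical.

```python
from dataclasses import dataclass

# Sort position for each category, Category One first.
CATEGORY_ORDER = {"One": 0, "Two": 1, "Three": 2}

# Interpretation of each category, paraphrasing items a) to c).
GUIDANCE = {
    "One": "should certainly respond",
    "Two": "should consider responding",
    "Three": "should consider accepting",
}


@dataclass
class RiskEntry:
    name: str
    likelihood: str  # from Table A.1
    impact: str      # from Table A.2
    category: str    # "One", "Two" or "Three", assigned by the assessor


def order_register(entries):
    """Return entries grouped by category; sorted() is stable, so the
    original order is preserved within each category."""
    return sorted(entries, key=lambda e: CATEGORY_ORDER[e.category])


# Sample entries; the categories shown are assumptions for illustration.
register = [
    RiskEntry("Office printer failure", "High", "Low", "Three"),
    RiskEntry("Data centre fire", "Low", "Very High", "One"),
    RiskEntry("Email outage", "Medium", "Medium", "Two"),
    RiskEntry("Loss of WAN link", "Medium", "High", "One"),
]
for entry in order_register(register):
    print(f"Category {entry.category}: {entry.name} ({GUIDANCE[entry.category]})")
```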

A.4.3 Develop risk profile
For each risk identified as falling into Categories One and Two, a risk profile should be developed, which defines:

a) the nature of the risk and the events likely to trigger it;

b) the probability of the risk occurring, including details of any circumstances where the likelihood of the risk could change;

c) details of the potential impact of the risk on the business, including estimates of the cost to the business of taking no action to prevent or mitigate its impact;

d) details of the symptoms likely to be displayed in the event that the risk occurs and the ways in which these symptoms could be detected;

e) an assessment of the likelihood of detecting the risk and measures that could be taken to increase that probability;

f) details of existing counter-measures designed to monitor the risk, prevent it from occurring or to mitigate its impact, including estimates of the costs of implementing and maintaining these counter-measures;

g) proposals for additional counter-measures, or changes to those in place, to prevent the risk from occurring and to mitigate its impact, including details of the facilities, equipment and personnel required, and estimates of the time, effort and cost required to implement and maintain these new counter-measures;

h) estimated savings accruing from implementing the proposed counter-measures in the event that the risk occurs;

i) estimated consequential savings likely to accrue from implementing the proposed counter-measures in the event that the risk does not occur.

This information provides the basis for a cost-benefit analysis, which should support decision making on how each risk should be addressed by risk monitoring, risk mitigation, risk communication and business continuity planning activities. Details of the risk profile are added to the ITSC plan.

A.4.4 Assess probability of detection
The probability of symptoms of the risk being detected should be determined according to Table A.4.

Table A.4 – Probability of risk detection

Probability Definition

Low The symptoms expected to be displayed when the risk occurs will not be obvious or easy to detect without specialised monitoring processes.

Example: disk hardware error causing infrequent and random errors when writing information to disk.

Medium The symptoms expected to be displayed when the risk occurs will be detectable with basic or standard monitoring processes.

Example: malicious intrusion onto corporate network, failure of online or batch process to complete successfully, etc.

High The symptoms displayed when the risk occurs will be immediately apparent.

Example: failure of email system, power failure, natural disaster etc.


A.4.5 Response selection
A basic model for determining appropriate responses is based upon risk categorization and likelihood of detection. Having categorized the identified risks and having decided whether a response ought to be implemented, the nature of that response should be influenced not only by the potential impact or likelihood of the risk occurring but also the organization’s ability to detect that it has occurred.

For example, in planning a response to a risk such as the example given for a ‘low’ probability of detection, the organization might be well advised to consider implementing specialized monitoring processes and/or equipment to make detection of the risk more likely. In the case of the medium probability example, the organization can implement one of a number of common firewall and intrusion detection tools to both identify and prevent such intrusions.

A.4.6 Assign responsibility and implement
Having determined the appropriate response to the risk, the actions implied should be planned such that resource utilization and cost information is available for cost-benefit analysis. The cost-benefit analysis is an important part of the decision making process for determining which of the potential response actions will be justified and therefore implemented. It is also important information to retain when a decision is taken not to take action in response to a risk, as it demonstrates that a formal and rigorous thought process was followed in arriving at that decision.

Details of the decisions taken on the proposed response actions should be added to the ITSC plan and summarized in an action plan and work schedule.

A.5 Rehearse and learn lessons
An ITSC plan is only likely to be effective if it is regularly rehearsed and if the lessons from these rehearsals are fed back into updated plans. Clause 9 provides guidance on how to plan and conduct such rehearsals for maximum effectiveness.


Annex B (informative) IT Architecture Considerations

When selecting an appropriate IT architecture, the following non-exclusive options should be considered.

a) Location and distance between sites: If failing over from one site to another, the network path distance between the two sites should be carefully considered. If the two sites are too close together, for example on a campus, they could be impacted by the same natural disaster. If too far apart then the cost of connecting the two sites with suitable telecommunications and/or courier services could become prohibitive. Most importantly the distance between the sites could have a negative impact on the way in which the IT systems operate. If the chosen model includes synchronous replication, then the greater the distance the greater the latency, thus introducing delays in the transfer of data between sites which could in turn impact application performance.

b) Number of sites: The number of sites used should be considered as a major factor for the IT architecture. For example a company might have corporate offices in three cities, e.g. Paris, London and Munich, with partial processing of data in each local centre. The IT architecture would typically define methods for ensuring that the London centre could carry on the processing of Paris and Munich work in the event of their total and combined loss. There could be a mutual recovery strategy for all three, so that the systems, network and storage at each site are sized to cope with the combined traffic and work of the other two in addition to its own work.

c) Site Security: For some types of organization, the security of information or premises is of paramount concern, especially when handling ‘protectively marked’ or classified information. For example, the protective marking relating to a secondary site may be at a lower level than that of the primary site, especially if it is routinely used for developing, testing or training purposes. In these circumstances, the architecture of both premises and IT should be designed and implemented in such a way as to make the up-rating of the site’s protective marking as straightforward as possible.

d) Staff access and proximity: If there is a remote site strategy, then in the event of an incident the staff could have to work from the remote site for extended periods of time. This can then lead to compounded issues due to staff being separated from their family and/or dependants or finding and paying for hotel accommodation for staff near the remote site for extended periods.

e) Remote access: An alternative solution to shipping people to the remote site is to provide them with remote access to the remote site via dial-up, or through the Internet using Virtual Private Network (VPN) or similar technology. This means that people can work from home in order to support the remote systems, but it could also introduce additional security issues.

f) Dark site vs. manned site: The IT architecture could dictate that the recovery site is to be typically unmanned during normal operation, with all operation being handled from a central operations bridge. In this case, in the event of an incident the central bridge may not be available and so an alternative bridge will also be required. Running a dark remote site could also mean that there are no operations staff at the remote site that can help recover systems in the event of an incident.

g) Skill level of staff: In the event of a major incident, key staff could be unavailable either as a result of the incident, or simply because of holidays or sickness. The ability for the IT infrastructure to continue to operate could depend on the ability of the remaining staff to handle the surviving systems. If staff are not trained to provide cover across the board, then the recovery strategy could be at risk. Specifically, recovery procedures for the IT architecture should be written without assuming any detailed knowledge so that they can be implemented by as many members of the team as possible.

h) Telecoms connectivity and redundant routing: As already stated, distance between sites can be an issue. If network connectivity is leased from a network supplier, depending on the requirements the proximity to existing high-speed trunks could present cost issues. To provide resilience, there should be a contract with different trunk providers to ensure continuity of service and redundant routing. There should also be separate contracts with different last-mile telecoms providers to ensure service continuity.

i) Level of automation required: The IT architecture may include the requirement to make fail-over and fail-back completely automated. Many sites desire a completely automated fail-over to occur when problems are detected. Some sites have found that it is impractical to automate everything and like the ability to be able to instigate fail-over manually after local recovery attempts have been ruled out. Either way, perfecting automation could involve cost and time to develop and rehearse.

j) Redundant routing of communications: The ability to communicate in a period of disruption is fundamental to the successful management of an incident. Whilst there may be multiple redundant phone lines into and out of sites, check the telephony provider is not routing all these lines through one common exchange which can be impacted by an incident at that exchange. In addition, since email systems can be impacted by an incident, it may be provident to maintain a number of independent email accounts on external Internet Service Providers (ISP) for use in case of emergency. Consideration should be given to providing multiple forms of communication, such as SMS, pagers, external (non-corporate) email systems, pre-agreed brief coded messages (to avoid overloading the networks and to speed communications) and so on.

k) Third party connectivity and external links: If the organization depends on the services of a third party provider (for example, in the financial world many companies use third party credit reference agencies), those services should be accessible from the remote site. The contract with the third party should provide a guaranteed level of service in the event of an incident.
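The point in item a) about distance and synchronous replication can be made concrete with a rough calculation. The sketch below estimates only the propagation delay of light in optical fibre, taken here as roughly 200 000 km/s (about two-thirds of the speed of light in a vacuum); that figure, the function names and the sample distances are assumptions for illustration. Real links add switching, protocol and storage delays on top, so the results are lower bounds rather than predictions.

```python
# Assumption: signals in optical fibre travel at about two-thirds of c.
FIBRE_KM_PER_SECOND = 200_000.0


def round_trip_ms(path_km):
    """Minimum round-trip propagation delay, in milliseconds, over a fibre path."""
    return 2 * path_km / FIBRE_KM_PER_SECOND * 1000


def max_serial_writes_per_second(path_km):
    """Upper bound on strictly sequential synchronous writes, assuming each
    write waits for one full round trip before the next can start."""
    return 1000 / round_trip_ms(path_km)


for km in (10, 100, 1000):
    print(f"{km} km path: at least {round_trip_ms(km):.2f} ms per round trip, "
          f"at most {max_serial_writes_per_second(km):.0f} serial writes/s")
```

Note that the input is the network path distance, as the text says, not the straight-line distance between the sites; cable routes are often considerably longer.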


C.1 GeneralVirtualization, although considered a new technology bymany people, has actually been with us since the earlymainframe days when administrators were able topartition memory, processing and disk resources to createa virtual machine. This same technology has now beenwidely adopted in three keys areas: storage, server andnetwork virtualization. Although each of the above aretechnically very different the concept of virtualizationremains the same. Take a physical resource and partitionit into multiple virtual resources or consolidate multipleresources into a single virtual resource.

The benefits of virtualization allow you to maximizeutilization of the physical resource while simplifyingmanagement through fewer physical devices.

C.2 Network virtualizationNetwork virtualization allows you to take the componentsin your network infrastructure and either consolidatesthem into fewer networks or takes an existing networkand divides it into smaller segments. For example, youcould take a single 48 port network switch and partitionit into four segments, each with 12 ports. This allows youto create 4 isolated networks and utilize all ports on thenetwork switch. It also makes managing the networkeasier as there is only one physical switch.

C.3 Storage virtualizationStorage virtualization provides a means to hide thecomplexity of a storage infrastructure behind a virtuallayer. The main advantage to doing this is simplifiedmanagement. There are three ways to implement storagevirtualization:

a) use an appliance;

b) in the network fabric;

c) locally in the storage array.

There are pros and cons to each of these methods.

There are quite a few virtualization appliances on themarket today that all more or less do the same thing.The appliance will usually sit between the storage arraysand fabric switches. This is called an in-band appliance.All data passing between the host and storage arrays alsopasses through the appliance. One concern about thisapproach is that the appliance might become a bottleneck. The second option is an appliance that sits on theedge of the SAN fabric. This is known as an out-of-bandappliance. An advantage of this model is that only a small

amount of metadata needs to be passed to the appliance,thus eliminating the bottleneck problem. Most appliancesalso support clustering so that the appliance does notbecome a point of failure. A disadvantage of theappliance approach is that adding an additional deviceincreases complexity and management of the SAN.

Fabric based virtualization places the virtualizationtechnology inside the SAN fabric switches. This increasesthe processing and memory requirements of the switchbut has the added advantage of reducing overallcomplexity. This technology is still at a relatively earlystage but there are already of number of competingproducts on the market. There is however, some cautionaround how much intelligence should be implementedat the fabric level. There also the needs to be somestandardization at the fabric level so that fabrics withmulti vendor switches are fully interoperable.

C.4 Server virtualization

Deploying virtualization software on a server allows you to partition the server into multiple virtual servers and then host an independent OS and applications on each of these virtual machines. Server virtualization abstracts the OS and applications from the underlying hardware. This helps protect applications from hardware peculiarities. It also makes it much easier to migrate applications onto new hardware platforms.

The management console allows you to configure how much memory and processing resource each virtual machine can have. It also allows you to monitor how much of the physical server's resources each virtual machine is consuming.

Replication technologies built into the virtualization software allow you to quickly clone and deploy virtual machines. By integrating with some of the major software deployment tools, it is also possible to rapidly deploy applications onto virtual machines. One version of virtualization software also allows for the relocation of virtual machines between separate physical servers. This can be policy driven, so that in the event of a server failure the virtual machines can be moved to a new physical server.

Annex C (informative) Virtualization


Annex D (informative) Types of site models

D.1 General

There are a number of basic site models that can be adopted to provide resilience. The requirements from the ITSC strategy will have a major influence on which model is selected, and this may have significant implications for the IT architecture. Thus the decision will require input and careful consideration by many areas of the organization, and the final selection is likely to be an iterative process as the costs and implications are more thoroughly understood.

D.2 Active/Contingency

This model introduces a remote or back-up site for recovery only at the time of incident. It is often referred to as a cold back-up site since at the point of incident it usually consists of either an empty computer room, or a computer room populated with inactive computers in an un-initialized state. An alternative to this static computer room is a mobile computer suite provided with generators etc. that may be set up in the parking lot of an incident-stricken company. Similarly, hotel rooms and other rented office space may be turned into incident back-up sites to temporarily house new computer equipment.

Specialist companies exist that can help ship equipment quickly to help minimize the costs and increase the viability of cold back-up sites. These companies are skilled in the rapid deployment and delivery of pre-configured systems and resources from servers and PCs through to telephone switches, structured cabling and furniture. An alternative is the potential for sharing machine room space with a supplier or business partner, providing reciprocal arrangements for computer room space. Care should be exercised here and no such arrangement should be undertaken until all the risks of co-hosting another company’s equipment are fully understood. The advantages and disadvantages associated with this model are listed in Table D.1.

Table D.1 – Advantages/Disadvantages associated with Active/Contingency model

Advantages

• Typically lower cost than active/active.

• If buying access to the contingency site from a supplier, the service will typically be treated as revenue/operational expenditure rather than capital, which can have advantages for some organizations.

• Limited investment in unused infrastructure, and removes the need to upgrade continuity equipment when upgrading production.

• Additional support skills may be available if using a third party to provide the service.

• May be possible to utilize space across other sites within the organization, reducing or removing the need for a specific cold site.

Disadvantages

• Typically a slower fail-over than other approaches.

• As systems are built at point of recovery, very rigorous change and configuration management is required to ensure fail-over procedures are up to date.

• Process is likely to require a high level of technical skill to deal with complex recovery issues.

• If using a shared recovery site, there is an additional risk that another organization may also require or be using the site.


D.3 Active/Active

At the other end of the spectrum from the Active/Contingency model is the Active/Active model. As the name implies, in normal operation both sites are up and running, accepting work and balancing the load across all computers at both sites. In the event of an incident or system failure at one site, all work is routed to the second site, which has been sized to accept the workload increase with little or no reduction in throughput. The advantages and disadvantages associated with this model are listed in Table D.2.

Table D.2 – Advantages/Disadvantages associated with Active/Active model

Advantages

• Fast recovery from an incident.

• Improved confidence in ability to fail-over as much of the resilience equipment is being actively used at each site.

• Recovery procedures can be simplified and/or automated, as much of the infrastructure will be up and running.

• May improve utilization of the infrastructure over other models.

• Less overhead on change and configuration management as sites are being continually exercised and so issues are likely to be identified more quickly than where equipment is not being used.

• Makes live fail-over rehearsals easier to implement.

Disadvantages

• Can be more difficult to implement and manage than other models.

• May require additional load balancing technology to allow services to be split across sites, for example to route Internet traffic to two separate sites.

• Complex database issues. If a database is to be active at multiple sites then a mechanism is required to externalize and manage updates so that data at the sites is kept synchronised. Some organizations approach this by running a cluster with the database active at only one of the sites at any one time.

• Limited separation between sites. To achieve the desired level of performance the parts of the Active/Active pair are often close together.


D.4 Active/Alternate (Active/Passive)

In the Active/Alternate model, production runs at one site with a warm standby mirror copy of the production system maintained at a second site. In the event of a failure, production work moves from the main site to the warm-standby site with little or no interruption to service.

This requires either synchronous (Zero Data Loss) or asynchronous (Point in Time) replication of data. The advantages and disadvantages associated with this model are listed in Table D.3.

Table D.3 – Advantages/Disadvantages associated with the Active/Alternate model

Advantages

• Either site can be nominated as production site on a scheduled basis, providing confidence in the solution.

• Makes live fail-over rehearsals easier to implement.

• Updates and maintenance can be scheduled at either site by switching service to the other site.

Disadvantages

• The fail-over to the Alternate site can have more impact on service than in the Active/Active model, though still typically better than other models.

• Systems at the alternate site must be kept in step with the Active site and, as with the other models, this will be a greater overhead than for the Active/Active model.

• Limited separation between sites. To achieve the desired level of performance the parts of the Active/Active pair are often close together.

D.5 Active/Back-up

In the Active/Back-up model two separate computer suites are maintained, but production runs at only one site; the back-up systems hosted at the remote site are enabled only when an incident strikes.

One way of exploiting the software licence issue is to utilize the back-up systems as development, test or training platforms. Many IT companies will reduce the cost of software licences if a system is only used for development work, and will allow production licences to be transferred to the back-up site when an incident strikes, although this can incur additional cost. The advantages and disadvantages associated with this model are listed in Table D.4.

Table D.4 – Advantages/Disadvantages associated with the Active/Back-up model

Advantages

• Can reduce the number of software licences required, as a warm standby system doing no productive work may still incur the cost of operating system, database, and communications software licences.

Disadvantages

• If using back-up site systems for development, test and/or training, then in the event of a major incident the facility is no longer available. If the incident is the result of, say, a faulty software release, there may not be access to the development resources, such as source files and documentation, required to provide a resolution.

• Security implications when running production from the back-up site.

• Slower to activate as existing services may have to be stopped first, before fail-over can be initiated.


D.6 Multi-site Models/Hybrids

The two-site model is fine for most companies, but some companies, especially multi-nationals, by necessity adopt a three or four site model where one or more sites can take over the work of the other sites if required. The advantages and disadvantages associated with this model are listed in Table D.5.

Table D.5 – Advantages/Disadvantages associated with Multi-site models

Advantages

• Reduced impact following a major incident at one site as production is spread across multiple sites.

• Requires less spare capacity for resilience as load is spread across multiple sites.

Disadvantages

• The approach can have complex implications, and give rise to issues such as visibility of data and scalability of systems.
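The spare-capacity advantage can be illustrated with simple arithmetic. The sketch below is not taken from this specification; it assumes that all sites are active, that load is spread evenly, and that the design goal is to survive the loss of any single site. The function name `site_sizing` is hypothetical.

```python
def site_sizing(total_load: float, sites: int) -> tuple[float, float]:
    """Return (capacity needed per site, normal utilization per site)
    so that the surviving sites can absorb the loss of any one site."""
    if sites < 2:
        raise ValueError("at least two sites are needed for fail-over")
    per_site_capacity = total_load / (sites - 1)  # survivors carry the full load
    normal_load = total_load / sites              # load per site when all are up
    return per_site_capacity, normal_load / per_site_capacity


for n in (2, 3, 4):
    capacity, utilization = site_sizing(100.0, n)
    print(f"{n} sites: capacity per site {capacity:.1f}, "
          f"normal utilization {utilization:.0%}")
```

Under these assumptions a two-site design needs each site sized for the whole workload and runs at half utilization, whereas adding sites reduces the idle capacity each one has to hold in reserve.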


Annex E (informative) High availability

E.1 General

High availability refers to the ability of a computer system and its hosted resources to withstand failures. These failures can range from component level hardware failures to complete site failures. Availability is commonly measured in 9’s with five 9’s being the highest level.

NOTE For instance a system with five 9’s availability allows for approximately five minutes of downtime per year. While five 9’s or near continuous business operation is often desired, solutions guaranteeing zero downtime are often cost-prohibitive to implement, especially after weighing all risks of failure and determining what kind of downtime is acceptable for your needs, as shown in Figure E.1.

Figure E.1 – Downtime vs. cost
[Figure: chart with axes labelled Downtime and System Availability, cost markers from $ to $$$$, availability levels 99.0%, 99.9%, 99.99% and 99.999%, and classes labelled Commercial Availability, High Availability, Fault Resilient, Fault Tolerant and Continuous Processing.]
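The downtime allowed by a given number of 9’s follows directly from the availability percentage. The short sketch below works this out for the availability levels named in Figure E.1; it assumes a 365-day year, and the function name is illustrative only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes of downtime per year permitted at the given availability."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for level in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(level)
    print(f"{level}% availability allows about {minutes:.1f} minutes per year")
```

Each additional 9 cuts the permitted downtime by a factor of ten, from roughly 3.7 days per year at 99.0% to a little over five minutes at 99.999%.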


Availability spans many discrete layers both within and outside the infrastructure, with each layer providing additional levels of fault tolerance and/or recovery. The first layer is the physical building and security level that prevents unauthorised access to the proximity of the site. Each layer also includes a subset of systems such as electrical, mechanical, cooling, and security that can be used independently or combined to provide improved levels of availability.

E.2 Platform availability

High availability at the platform layer is achieved using systems that incorporate redundant components for cooling, power, disk, memory, etc. In most instances these components are also hot swappable to reduce downtime. The majority of computer manufacturers now offer these features as standard on server level systems. As a minimum the computer system should provide for dual power inlets, redundant power supplies and cooling fans.

E.2.1 Power

Within the data centre there should also be some form of backup uninterruptible power supply (UPS) to protect against power surges and outages. The UPS should be sized according to the number of systems it needs to support and how long they need to be kept running in the event of a power failure. The two main types of UPS are standby and online. Standby UPS units are suitable for PCs and other small non-mission-critical appliances. In the event of a power failure the standby UPS will automatically switch to battery backup. Online UPS units are more suited to mission critical systems as they continuously monitor the power source to protect against line surges and brownouts, along with automatically switching to battery backup during a power outage. Battery backup UPS units are suitable for keeping systems running during short outages. To provide for extended power outages a standby generator can be used.

As an added layer of protection against power outages some companies also use dual sourced power for their data centre. This approach helps protect against power outages due to electricity grid related problems. In this scenario the dual power inlets on the server system are fed from separate power sources.

E.2.2 Cooling

Providing cooling in the data centre is also very important when building a highly available system. All systems in the data centre will generate heat. In cooling terms, effectiveness is measured by the amount of heat that can be controlled.

When sizing cooling, the total heat being generated in the data centre should be calculated, including that generated by computer and communications equipment, lights, and people.

NOTE For the computer and communications equipment it is best to use the peak load figure based on the maximum power requirements of the device. This is normally shown on the device’s power requirements label or the system specification document.

The size of the data centre, and the positioning of the systems within it, should also be taken into account as both will impact on the cooling requirements.
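The heat calculation described above amounts to summing the contributions of equipment, lighting and people. The sketch below is illustrative only: the equipment figures are invented sample values, the 100 W allowance per person is an assumed planning figure rather than one given in this specification, and the conversion uses 1 W ≈ 3.412 BTU/h. Room size and equipment positioning are not modelled.

```python
WATTS_TO_BTU_PER_HOUR = 3.412  # 1 W is approximately 3.412 BTU/h


def total_heat_load_watts(equipment_peak_watts, lighting_watts, people,
                          watts_per_person=100.0):
    """Sum heat from equipment (peak load figures), lights and people.

    watts_per_person is an assumed allowance, not a figure from the text.
    """
    return sum(equipment_peak_watts) + lighting_watts + people * watts_per_person


# Hypothetical room: four devices rated by their peak power labels.
load_w = total_heat_load_watts([450.0, 450.0, 800.0, 1200.0],
                               lighting_watts=600.0, people=2)
load_btu = load_w * WATTS_TO_BTU_PER_HOUR
print(f"Heat load: {load_w:.0f} W ({load_btu:.0f} BTU/h)")
```

Using peak rather than typical power figures gives a conservative result, in line with the NOTE above.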

E.2.3 Systems monitoring

At the platform level all equipment should be fully monitored (preferably in real-time) to give warning of any possible failures. All the large computer and communications vendors provide software utilities to monitor their systems. For ease of administration in a multi-vendor environment it can also be beneficial to look at deploying an enterprise class systems management suite. This will allow you to monitor all vendors’ systems from a single console. Some vendors also provide a feature called ‘Phone home’ where a system will send an alert via an email or similar transport mechanism to the vendor’s support function personnel, alerting them of a failing or failed component. The vendor can then dispatch an engineer to resolve the problem.

E.2.4 Warranty and support

All data centre equipment should have an appropriate level of warranty for maintenance and troubleshooting support. Most system vendors provide a tiered warranty structure ranging from next business day to 24/7x365 with 2 hour fix. The level of warranty purchased for a system is largely dependent on how mission critical the system is to the business.

E.3 Data availability

The level of data availability is governed by the underlying storage hardware and the features it provides to protect its stored data. These features can range from standard Redundant Array of Independent Disks (RAID) implementations to the more advanced data replication technologies like Snapshots and synchronous mirroring.

E.3.1 Redundant Array of Independent Disks (RAID)

RAID provides for both high availability and improved I/O performance at the disk level by using mirroring and striping techniques to duplicate data across multiple disks. At present there are around twelve different RAID levels, some of which are proprietary to specific storage vendors. The most commonly supported RAID levels in everyday use are RAID 0, 1, 5, 0+1, and 10.

Selecting which RAID type to implement is usually a trade-off between performance and cost. As a rule of thumb, RAID levels that utilise the most disks provide the highest level of redundancy and performance. Each RAID level has its own advantages and disadvantages, which are summarized in Table E.1.

Table E.1 – Advantages/Disadvantages of RAID levels

RAID Level | Description | Advantages | Disadvantages
0 | Data striped across one or more disks | Easy to implement; very good read/write performance; no parity overhead | No fault tolerance/redundancy; single disk failure causes data loss; not suitable for high availability
1 | Data mirrored between two disks; requires a minimum of two disks | 100% disk redundancy; improved read performance over RAID 0; simple design | Only 50% disk utilization
5 | Data and parity striped across multiple disks; requires a minimum of three disks | Very good read performance; parity is distributed across all disks; maximum utilization of disk resources | Limited redundancy – single disk failure; RAID function requires additional processing; write penalty due to parity calculation; slow rebuild after drive failure
0+1 | Mirrored RAID 0 segments; requires a minimum of four disks | High I/O throughput – read and write; same overhead as RAID 1 | Limited redundancy – single disk failure; very expensive to implement; only 50% disk utilization
10 | Striped RAID 1 segments; requires a minimum of four disks | Very good read/write performance; same overhead as RAID 1; can withstand single drive failures across RAID 1 segments | Limited redundancy – single disk failure; very expensive to implement; only 50% disk utilization
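The disk utilization figures in Table E.1 can be turned into a simple capacity estimate. The sketch below assumes identical disks and uses the minimum disk counts listed in the table; the function names are hypothetical and the model ignores controller overheads and hot spares.

```python
# Minimum disk counts as listed in Table E.1.
MIN_DISKS = {"0": 1, "1": 2, "5": 3, "0+1": 4, "10": 4}


def usable_fraction(level: str, disks: int) -> float:
    """Fraction of raw capacity available for data at a given RAID level."""
    if level not in MIN_DISKS:
        raise ValueError(f"unsupported RAID level: {level}")
    if disks < MIN_DISKS[level]:
        raise ValueError(f"RAID {level} needs at least {MIN_DISKS[level]} disks")
    if level == "0":
        return 1.0                   # striping only, no redundancy
    if level == "5":
        return (disks - 1) / disks   # one disk's worth of capacity holds parity
    if disks % 2:
        raise ValueError("mirrored levels need an even number of disks")
    return 0.5                       # RAID 1, 0+1 and 10 mirror every block


def usable_capacity_gb(level: str, disks: int, disk_size_gb: float) -> float:
    return usable_fraction(level, disks) * disks * disk_size_gb


for level in ("0", "5", "10"):
    print(f"RAID {level}: {usable_capacity_gb(level, 4, 500.0):.0f} GB usable "
          f"from four 500 GB disks")
```

This makes the cost side of the trade-off visible: the mirrored levels give up half of the raw capacity, while RAID 5 gives up only one disk's worth.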


E.3.2 Direct Attached Storage (DAS)

This is one of the most frequently used storage methods for both internal and external storage. It simply consists of directly attaching the storage device or disks to a computer system using a RAID controller. For internal storage the RAID controller can either be a Peripheral Component Interconnect/SUN Bus (PCI/SBUS) or integrated device. For external storage a number of options are available depending on the intelligence of the storage device. If the external device is just a bunch of disks (JBOD) these will need to be connected to an internal RAID controller. If the external storage device includes disks and a disk controller then an appropriate host bus adapter will need to be installed in the server system.

NOTE For instance the external device might be a fibre channel array, in which case a fibre channel host bus adapter will be installed in the server.

The main disadvantage with DAS is that it creates islands of isolated storage that can only be accessed by the locally attached server. Each pool of storage also needs to be managed separately. Because of its simplicity DAS usually has more single points of failure than other storage models.

E.3.3 Network Attached Storage (NAS)

Most NAS devices use an underlying proprietary Operating System (OS) and file system so that they can be used in a heterogeneous environment. A NAS device will typically present both Network File System (NFS) and Common Internet File System (CIFS) file systems to end users, but internally these file systems are usually stored as a separate file system. The majority of NAS devices operate at the file level, although some of the newer models which offer features like mirroring and snapshot technology operate at the block level.

NAS devices have become very popular due to the ease with which they can be deployed and centrally managed. Gigabit networking has also helped to expand their usability in the data centre. Most NAS devices also incorporate high availability features at the platform level to protect against disk, power and controller failure. Some of the higher end models support clustering of the NAS device to protect against unit failure.

The biggest disadvantage with NAS devices is their potential to saturate the local network during peak usage. This can limit their use for I/O intensive applications. Recent improvements in Local Area Network (LAN) connectivity speeds have helped to offset this limitation.

E.3.4 Storage Area Networks (SAN)

SAN storage has become increasingly popular over the last few years. Its centralized design helps storage administrators easily deploy and manage storage resources. The SAN fabric is responsible for carrying data between the host servers and the target storage arrays. The fabric is a dedicated fibre based network designed for high availability through the use of multiple data paths between the hosts and storage array.

The SAN array (also known as the storage array) is the subsystem which houses the power supplies, fans, disks, disk controllers and the array’s operating system. SAN arrays are designed for the high availability of mission critical data. All SAN arrays operate at the block level and are independent of the file systems they host. This makes them well suited to heterogeneous environments.

In addition to the platform redundancy built into the SAN array there are also a number of other features inherent to SAN storage. These features include Snapshot, Mirroring and Cloning functionality.

Snapshot technology allows for point in time copies of data to be created, mainly for the purpose of backup. The Cloning feature also allows for the creation of point in time copies of data. Unlike Snapshot technology, which uses disk pointers to create an instant point in time copy of the data, Cloning physically copies every block of data to a new disk. Cloning takes longer than a snapshot but it provides better availability by creating a duplicate of the source disk. Both Snapshot and Cloning store their information in the local array. For increased availability the data can be replicated to a remote array using mirroring technology.
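The difference between a pointer-based snapshot and a clone can be sketched in a few lines. The model below is a simplification that assumes a copy-on-write snapshot, which is one common way of implementing the pointer approach; actual arrays differ, and the `Volume` class and its methods are hypothetical.

```python
class Volume:
    """Toy block device supporting copy-on-write snapshots and full clones."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.snapshots = []

    def snapshot(self):
        """Instant: nothing is copied until a block is later overwritten."""
        snap = {}  # block index -> original contents preserved on first write
        self.snapshots.append(snap)
        return snap

    def clone(self):
        """Slower: physically copies every block to an independent duplicate."""
        return list(self.blocks)

    def write(self, index, data):
        for snap in self.snapshots:
            snap.setdefault(index, self.blocks[index])  # save original once
        self.blocks[index] = data

    def read_snapshot(self, snap, index):
        # Unchanged blocks are still read from the source volume.
        return snap.get(index, self.blocks[index])


vol = Volume(["a", "b", "c"])
snap = vol.snapshot()
copy = vol.clone()
vol.write(1, "B")
print(vol.read_snapshot(snap, 1), copy[1], vol.blocks[1])
```

The snapshot stores only the block that changed and still depends on the source volume for everything else, whereas the clone is a complete, independent duplicate, which is why it offers better availability at the cost of time and space.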

There are two main types of mirroring: synchronous and asynchronous. Synchronous mirroring copies each block of data to the remote array and waits for an acknowledgement before writing the block of data to the local array. This ensures that both the remote and local copies of data are always consistent. Implementing synchronous replication between two sites will typically require a high speed link such as dark fibre Dense Wavelength Division Multiplexing (DWDM). Synchronous replication is also limited to network path distances of 200 km.

Replicating data over extended distances requires asynchronous mirroring. This method allows the writes to the local array to continue as normal and then be replicated to the remote array at fixed intervals. The one disadvantage with asynchronous replication is the possibility of losing data that has not been replicated to the remote site should the local site fail.
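The two mirroring modes can be contrasted with a minimal model. This is a sketch under simplifying assumptions (in-memory arrays, no link latency, no failure handling beyond a missing acknowledgement); the class names are hypothetical and do not correspond to any vendor's product.

```python
class Array:
    """Toy storage array: stores blocks and acknowledges each write."""

    def __init__(self):
        self.blocks = {}

    def write(self, address, data):
        self.blocks[address] = data
        return True  # acknowledgement


class SynchronousMirror:
    def __init__(self, local, remote):
        self.local, self.remote = local, remote

    def write(self, address, data):
        # Wait for the remote acknowledgement before committing locally.
        if not self.remote.write(address, data):
            raise IOError("remote array did not acknowledge the write")
        self.local.write(address, data)


class AsynchronousMirror:
    def __init__(self, local, remote):
        self.local, self.remote = local, remote
        self.pending = []  # writes not yet sent to the remote array

    def write(self, address, data):
        self.local.write(address, data)  # local write completes immediately
        self.pending.append((address, data))

    def replicate(self):
        """Called at fixed intervals to ship outstanding writes."""
        for address, data in self.pending:
            self.remote.write(address, data)
        self.pending.clear()


async_mirror = AsynchronousMirror(Array(), Array())
async_mirror.write(0, "order-1")
async_mirror.replicate()          # scheduled replication interval
async_mirror.write(1, "order-2")  # written locally, not yet replicated
at_risk = set(async_mirror.local.blocks) - set(async_mirror.remote.blocks)
print("blocks lost if the local site fails now:", at_risk)
```

Any block still in the pending list when the local site fails is the data loss referred to above; the synchronous variant avoids it by making every write wait on the remote array.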


E.4 Application availability

One of the major demands on IT personnel is ensuring that business critical applications are kept online. The next step is ensuring that the application layer is not impacted by hardware failures. A common approach to achieving this goal is through the use of clustering technologies. Clustering technologies cover a number of areas from the file system through to the computer node.

E.4.1 Clustered file system

A clustered file system is a file system that is distributed between multiple computer nodes. Each node holds part of the file system but to the end user they appear as a single file system. Clustered file systems are frequently used in high performance compute environments that require very high data throughput between compute nodes. Replicating the file system between multiple nodes provides added protection against hardware failures. For performance reasons most clustered file systems tend to be hosted from fibre channel storage arrays. Some of the disadvantages of clustered file systems include the high cost of deploying and managing them in large configurations.

E.4.2 Application cluster

Application clusters are somewhat similar in design to clustered file systems. The application is scaled out across multiple compute nodes to provide for both higher availability and performance. To the host the application will appear as a single resource. In most cases application clusters use a clustered file system for shared data access.

To identify failures in the cluster a ‘heart beat’ packet is exchanged between nodes to ensure that each node in the cluster is online. If a node fails to respond to a heart beat data packet then that node will be identified as being offline and all its resources will be distributed to the remaining nodes. The application cluster can also dynamically re-assign processing tasks if threshold policies are in place.
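The heart beat mechanism can be sketched as a timeout check followed by a redistribution step. The node names, the timeout value and the round-robin redistribution policy below are assumptions made for illustration; cluster products implement their own detection and placement rules.

```python
from itertools import cycle


def detect_failed_nodes(last_heartbeat, now, timeout):
    """Nodes whose most recent heart beat is older than the timeout."""
    return sorted(node for node, seen in last_heartbeat.items()
                  if now - seen > timeout)


def fail_over(owners, nodes, failed):
    """Reassign resources owned by failed nodes to survivors, round-robin."""
    survivors = [n for n in nodes if n not in failed]
    if not survivors:
        raise RuntimeError("no surviving nodes in the cluster")
    targets = cycle(survivors)
    return {resource: (next(targets) if node in failed else node)
            for resource, node in owners.items()}


# Hypothetical cluster state: timestamps are seconds on a shared clock.
nodes = ["node-a", "node-b", "node-c"]
last_heartbeat = {"node-a": 100.0, "node-b": 91.0, "node-c": 99.5}
owners = {"db": "node-b", "web": "node-a", "queue": "node-b"}

failed = detect_failed_nodes(last_heartbeat, now=101.0, timeout=5.0)
new_owners = fail_over(owners, nodes, failed)
print(failed, new_owners)
```

Resources on healthy nodes stay where they are; only those owned by the node that missed its heart beat are moved.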

Today there are a number of application cluster products on the market for web serving and transactional databases. One of the advantages of using an application cluster is the ability to scale out your application as you require additional processing power. Deploying an application cluster is a complex task but this is slowly being addressed by software vendors.

E.4.3 Computer cluster services

All the major operating systems include some form of cluster services support which can be installed during the OS deployment or as an additional add-on. Cluster services fall into one of two categories:

a) Load balancing cluster services are frequently used for distributing network load across multiple hosts. Take for example a large web farm with fifty web servers. In order to balance web requests across all fifty servers, each server has load balancing services installed along with a virtual IP address. All fifty servers get the same virtual IP address, so each time a web request is received all fifty web servers intercept the request, but the load balancing software uses a set of rules to determine which server should process the request.

b) Fail-over cluster services can be classified as either shared everything or shared nothing. Like load balancing cluster services they can also be installed as part of the operating system. There are also a large number of third party cluster services applications that integrate with the major OS. Typically cluster services will require a shared storage resource and a network port on each cluster node for sending and receiving heart beat information.

1) The shared everything model allows all nodes in the cluster to have shared access to the cluster resources. In order to achieve this, the cluster needs to use a distributed lock manager to control node access. Some users have questioned the scalability of the shared everything model because of the overhead and complexity of managing the resource locking.

2) The shared nothing model also uses shared storage but only one node in the cluster has access to a resource at any given point in time. The shared nothing model is often referred to as Active/Passive or Active/Active. An Active/Passive cluster is when one node in the cluster manages all the resources and the second node acts as a fail-over node that takes ownership of the active resources in the event of a failure. The Active/Active model is when both nodes in the cluster are actively hosting independent resources. In the event of a failure the surviving node takes ownership of all resources.
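The load balancing arrangement in item a), where every server sees each request and a shared rule decides which one handles it, can be sketched as follows. The rule used here, a checksum of the client address taken modulo the number of servers, is an assumption chosen for simplicity; the text does not say which rules real products apply, and the function names are hypothetical.

```python
import zlib


def owner_index(client_address: str, server_count: int) -> int:
    """Deterministic rule every server can evaluate independently."""
    return zlib.crc32(client_address.encode()) % server_count


def should_process(my_index: int, client_address: str, server_count: int) -> bool:
    """Each server runs this check on every request it intercepts."""
    return owner_index(client_address, server_count) == my_index


servers = 50
request_from = "192.0.2.17"  # example client address
handlers = [i for i in range(servers)
            if should_process(i, request_from, servers)]
print(f"request from {request_from} is handled by server {handlers[0]}")
```

Because all servers evaluate the same rule on the same input, exactly one of them accepts the request without any coordination traffic between them.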

Both models of fail-over clustering are widely used for applications such as web, file and print, messaging and database servers. Some of the higher end clustering services software packages also include data replication features. These features can be used to build what are known as stretch clusters. A stretch cluster basically allows you to increase the distance between the cluster nodes.

NOTE An example of a stretch cluster is where one node might be located in London whilst the second node is located in Manchester. The replication engine ensures that the data copies at both locations are consistent. In the event of a failure the replication engine simply brings the secondary copy of data online so the surviving node can take ownership of all the cluster resources. This process can take a couple of minutes but importantly there is no outage at the application level and the fail-over is transparent to the end users.


Annex F (informative) Types of resilience

F.1 General

This annex provides a brief overview of some of the replication approaches that should be used to protect and recover data. As technology progresses other options will become viable and any selection should include a review of products available. As with the choice of a site model described in Annex D, the ITSC Strategy will be a major factor in selecting the appropriate replication mechanism(s) and the resultant choice will have an effect on the IT Architecture.

F.2 Media back-up/restore

Creating data back-ups on an alternate medium (usually tape) is still the de facto method of ensuring there is a secure Point In Time (PIT) copy of vital data, but for many companies the tape back-up has become a second level back-up which is just insurance against disk or application based replication failures. However, some legacy applications still rely on tape as the primary back-up.

NOTE 1 In this context the term ‘tape back-up’ will apply equally well to any other physical media used to create a back-up of live data which is then physically removed offsite.

NOTE 2 If tape back-up is used either as a primary or secondary back-up it is normal to create a schedule of back-ups which reflects the working pattern of the system being backed up.

F.2.1 Example 1

An order entry system is open for orders from 09:00 to 17:00 every day, Monday through Saturday, the online day. After 17:00 batch processes extract all new orders from the orders file and process them, distributing shipment orders to the warehouse and build orders to the factory floor. The order history file is then updated. On Sunday the system is unavailable.

The back-up cycle for this system may be:

a) full back-up of the entire database once per week on Sunday;

b) back-up of all new orders, daily after 17:00 but before batch processing commences;

c) back-up of all shipment orders and build orders after batch processing;

d) back-up of order history file before start of online day.

This is highly tailored to a known and fixed file-based processing cycle.

F.2.2 Example 2

Modern tape back-up systems adopt a more holistic approach to backing up the entire system, as illustrated in the following example. A software development system is available 24 x 7 for use by a group of 100+ developers writing and testing software. Source files and executables are created and updated on an ad hoc basis throughout the day and at various times throughout the night. Developers typically work eight hour shifts during the core hours of 07:00 to 19:00 but in order to meet deadlines will sometimes work through the night. The least busy period is Sunday night through early Monday morning.

The back-up cycle for this system adopts the following pattern:

a) full system back-up of all files on Sunday night;

b) full system back-up of all files changed since Sunday, at midnight every day;

c) incremental back-up of all files changed since midnight, twice per day, once at midday and once at 21:00.

In this example only those files that are unopened are backed up. If a file is left open to an application it will not be backed up, simply because its contents cannot be guaranteed to be consistent. It is possible to override this restriction and create ‘dirty’ back-ups but this should only be done with an understanding of the applications in use, as data could be in an uncertain state. Conversely, if open files are ignored, certain files may never be backed up, leaving the organization exposed to the risk of data loss.
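The selection logic in Example 2 can be sketched as a filter on modification time and open state. The sketch is illustrative only: the `FileInfo` record, the file names and the use of hours since the Sunday night full back-up as the time scale are all assumptions, and real back-up software tracks changes in its own way.

```python
from dataclasses import dataclass


@dataclass
class FileInfo:
    path: str
    modified: float        # hours since the Sunday night full back-up
    is_open: bool = False  # held open by an application


def select_for_backup(files, changed_since, include_open=False):
    """Return (paths to back up, paths skipped because they are open)."""
    selected, skipped = [], []
    for f in files:
        if f.modified <= changed_since:
            continue  # unchanged since the reference point
        if f.is_open and not include_open:
            skipped.append(f.path)  # contents may be inconsistent
        else:
            selected.append(f.path)
    return selected, skipped


files = [
    FileInfo("design.doc", modified=10.0),
    FileInfo("main.c", modified=30.0),
    FileInfo("build.log", modified=35.0, is_open=True),
]

# Daily midnight back-up: everything changed since Sunday (hour 0).
daily, _ = select_for_backup(files, changed_since=0.0)
# Tuesday midday incremental: changes since Tuesday midnight (hour 24).
incremental, skipped = select_for_backup(files, changed_since=24.0)
print(daily, incremental, skipped)
```

Reporting the skipped list is a deliberate choice in this sketch: a file that appears there on every run is one that is never being backed up, which is the exposure described above. Setting `include_open=True` corresponds to a ‘dirty’ back-up.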

F.2.3 Media Issues

Many of the issues related to tape back-up of open databases are dealt with by application or system based back-ups. Faced with the issues related to creating Zero Data Loss (ZDL) back-ups of live, permanently open databases, systems and database vendors have created their own back-up software that works in concert with the database system to enable online production databases to be saved and restored with no data loss, and with the ability to roll back updates performed by failed transactions. These back-up mechanisms rely on the creation of log files, often referred to as audit trails, which reflect every update to the database and allow databases to be restored in a consistent fashion should systems fail.

F.2.4 Media Storage
If tape-based back-up is the primary back-up mechanism, then copies of the back-ups should be taken offsite at the earliest opportunity. Many companies exist to provide secure data storage and can be contracted to collect data at multiple times during the day. If back-ups are kept onsite, they should be stored in a controlled environment or a fireproof safe, ideally at some distance from the original source. It is a common mistake to make critical back-ups but leave them sitting in the office reception or on the loading dock for hours, waiting for an offsite courier to take them to a secure store.

If the back-up copies are held at a back-up site it can be beneficial to load the files from tape to disk. This verifies the back-up (the tapes can be read) and can also reduce the recovery time.

A more sophisticated variation on the standard tape back-up mechanism is to perform remote back-ups. This is usually seen only on high-end systems with high-speed fibre links: the data is backed up to tape at the remote back-up site, providing both a back-up and a secure offsite copy at the same time.

F.3 Database management system-based replication
Many database vendors involved in writing Database Management Systems (DBMS) aimed at mission-critical applications have not been able to rely on systems-based back-up and replication software because it either doesn’t exist or is unreliable. As a consequence they have incorporated data replication functionality into the database software itself. Not only does the DBMS provide a mechanism for updating records, but it also provides a mechanism for creating a database replica on a remote system. One mechanism for this is called log shipping. Updates to the database are captured to a log file; periodically, batches of updates are shipped across a wide area network to a remote system hosting a copy of the database from the primary site. The remote system then reapplies the updates to the remote database, effectively keeping the two databases synchronized. In the event of a loss of the primary site, the remote system takes over processing.

In this method updates are grouped into batches and transferred as files across the network. This asynchronous shipment method means that the remote system is always a few updates behind the live system and, as a consequence, there is likely to be some data loss (i.e. any updates since the last log was shipped) should the main site be impacted by an incident. This time lag may or may not be an issue, depending on the RPO for the system.
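The behaviour of log shipping, including the window of potential data loss, can be illustrated with a small in-memory model. This is a sketch under simplifying assumptions: the `Primary` and `Standby` classes are hypothetical, databases are plain dictionaries, and the network transfer is reduced to passing a list between two objects.

```python
from typing import Dict, List, Tuple

Update = Tuple[str, str]  # (record key, new value)


class Primary:
    """Live database that logs every update for later shipment."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}
        self.unshipped_log: List[Update] = []

    def update(self, key: str, value: str) -> None:
        self.data[key] = value
        self.unshipped_log.append((key, value))  # captured to the log file

    def ship_log(self) -> List[Update]:
        """Hand over the current batch of updates and start a new log."""
        batch, self.unshipped_log = self.unshipped_log, []
        return batch


class Standby:
    """Remote copy of the database, kept in step by reapplying batches."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}

    def apply(self, batch: List[Update]) -> None:
        for key, value in batch:
            self.data[key] = value


primary, standby = Primary(), Standby()
primary.update("order-1", "placed")
primary.update("order-2", "placed")
standby.apply(primary.ship_log())        # periodic shipment of a batch

primary.update("order-1", "dispatched")  # made after the last shipment
# If the primary site were lost now, this update would exist only in
# primary.unshipped_log and would not be present on the standby.
lost_updates = list(primary.unshipped_log)
```

In this model the number of entries in `unshipped_log` at the moment of failure is the data loss; shipping more frequently shrinks that window, which is how the shipment interval relates to the RPO.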

F.4 Application-based replication
Alternatively, the replication can be built into an application, either as part of a standard package or as a custom development by the organization. With this approach the recovery point for the application may not be in step with other applications being recovered by the organization. Thus, if data needs to be synchronized between applications, additional recovery steps may be required to resynchronize the applications.

F.5 Host-based replication/mirroring
Host-based replication is a catch-all term referring to a technique whereby replication is achieved through the duplication of disk write requests, either via specially installed software or via the operating system itself. Essentially, each write request is captured and duplicated on an alternate disk device, providing a synchronized copy of the original database on a completely separate set of disks. Often referred to as host mirroring, this form of replication originated with mainframe systems long before RAID technology was invented. Not only did it provide a mirror copy at the device/volume level to ensure data resilience, it also provided some performance improvements by toggling read requests between the pair of disks. However, this type of replication has some inherent drawbacks:

a) Delay – as the primary reason for creating host mirroring was data security, the application is forced to wait for both writes to complete, potentially slowing it down.

b) Host loading – since the host is responsible for issuing the second write request, it also incurs a CPU overhead whilst performing it.

c) Distance – since both disk devices are seen as channel-attached disks, the second copy is typically limited to short distances.
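The mirroring behaviour described in this clause can be sketched as a toy model. The `MirroredVolume` class is hypothetical and each disk is represented by a dictionary of blocks; it only illustrates that a write completes on both disks before returning, and that reads alternate between the pair.

```python
from typing import Dict, List


class MirroredVolume:
    """Two-disk mirror: duplicate every write, toggle reads between disks."""

    def __init__(self) -> None:
        self.disks: List[Dict[int, bytes]] = [{}, {}]
        self.reads_per_disk = [0, 0]
        self._next_read = 0

    def write(self, block: int, data: bytes) -> None:
        # The host issues both writes itself (host loading) and returns
        # only when both have completed (delay).
        for disk in self.disks:
            disk[block] = data

    def read(self, block: int) -> bytes:
        # Alternate read requests between the pair to spread the load.
        index = self._next_read
        self._next_read = 1 - index
        self.reads_per_disk[index] += 1
        return self.disks[index][block]


volume = MirroredVolume()
volume.write(7, b"payroll record")
first = volume.read(7)
second = volume.read(7)
```

Because either disk holds a complete copy, the loss of one device leaves the data readable from the other.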

A more recent slant on this approach is available in the open systems world through host-based file/disk replication software. These products typically require extra resources on the server, such as network adapters, in order to function. They also use network protocols to replicate Input/Output (I/O) and are subject to the limitations of the operating system in this area.

F.6 Storage array-based replication
This method moves the job of replication down to the level of the storage controller. Firmware running in the disk control unit is responsible for the replication of data from one volume to another volume at a remote site. This form of replication can and does function over large distances, but it requires high-speed, high-bandwidth network links between the sites. The cost of implementing this mechanism can be prohibitive except for large mission-critical applications.

The advantage of this mechanism is that it is divorced from the path of the I/O, in as much as it imposes no extra load on the host system. Handling replication at the control unit level also allows the control unit to create multiple copies or snapshots of the volumes. These snapshots can be used to drive separate applications such as overnight batch processing, and their careful use can vastly improve overall system throughput. However, it should be noted that, depending on how it is used, storage array-based replication can introduce I/O latency and potentially delay I/O completion.

F.7 Storage Area Network-based replication
In a Storage Area Network (SAN), storage devices are attached to a fabric of fibre channels which operate at very high speeds.

An approach to replication that operates at the SAN level is to introduce a replication appliance into the SAN as both a disk volume and host at the same time. Through the use of special device drivers embedded in the host systems, write requests are replicated to the replication appliance. These devices can then ship the write across a Wide Area Network (WAN) to a remote location, where another set of replication appliances can distribute the write requests to a duplicate set of disk devices. The advantage of this mechanism is that standard communications lines can be used to carry the replication traffic since replication appliances may also compress the data being sent.
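The effect of compressing replication traffic before it crosses the WAN can be sketched with Python's standard `zlib` and `json` modules. This is an illustration only: the function names `pack_writes` and `unpack_writes` are hypothetical, and real appliances use their own wire formats and compression schemes.

```python
import json
import zlib
from typing import List, Tuple

Write = List  # [block number, data as text], kept simple for JSON encoding


def pack_writes(writes: List[Write]) -> Tuple[bytes, int]:
    """Serialize and compress a batch of writes for transmission.

    Returns the compressed payload and the size of the uncompressed batch.
    """
    raw = json.dumps(writes).encode("utf-8")
    return zlib.compress(raw), len(raw)


def unpack_writes(payload: bytes) -> List[Write]:
    """Reverse pack_writes at the remote appliance."""
    return json.loads(zlib.decompress(payload).decode("utf-8"))


# A batch of highly repetitive writes, which compresses well.
batch = [[block, "A" * 512] for block in range(100)]
payload, raw_size = pack_writes(batch)
restored = unpack_writes(payload)
```

The saving depends heavily on the data: repetitive blocks such as these shrink considerably, whereas data that is already compressed or encrypted may not shrink at all.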

F.8 Disk replication modes
Almost all of the mechanisms described for disk replication can operate in one of two modes, and it is worth considering which of these modes best fits the RTO, RPO and cost requirements.

a) Synchronous (Zero Data Loss) – Each write request is replicated to a remote system and the issuing system effectively waits until it receives an I/O complete status back from the remote site. This ensures that all I/O write requests are securely completed on the remote site and a back-up is guaranteed to be synchronized with the original copy. It should be noted that, in order to achieve acceptable throughput, dedicated links are required between the two sites.

b) Asynchronous (Point In Time) – The write request is issued to the remote system but there is no wait for an acknowledgement of I/O completion; rather, the local application continues without any delay. While this improves performance, it does create a window where data loss can occur if the local disk subsystem is destroyed with writes pending. This asynchronous replication is also called a Point In Time back-up because it reflects the consistent state of the disk at a specified point in time.
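The difference between the two modes can be sketched with a small in-memory model. The `ReplicatedDisk` class and its method names are hypothetical, and the remote site is represented by a dictionary; the point is only to show where an acknowledged write can still be lost.

```python
from collections import deque
from typing import Deque, Dict, List, Optional, Tuple


class ReplicatedDisk:
    """Local disk with a remote replica, in synchronous or asynchronous mode."""

    def __init__(self, synchronous: bool) -> None:
        self.synchronous = synchronous
        self.local: Dict[int, bytes] = {}
        self.remote: Dict[int, bytes] = {}
        self.pending: Deque[Tuple[int, bytes]] = deque()  # not yet at remote

    def write(self, block: int, data: bytes) -> None:
        self.local[block] = data
        if self.synchronous:
            # The caller does not regain control until the remote copy is
            # complete, so the two sites never diverge.
            self.remote[block] = data
        else:
            # The caller continues immediately; the write is sent later.
            self.pending.append((block, data))

    def drain(self, max_writes: Optional[int] = None) -> int:
        """Deliver queued writes to the remote site; return how many."""
        delivered = 0
        while self.pending and (max_writes is None or delivered < max_writes):
            block, data = self.pending.popleft()
            self.remote[block] = data
            delivered += 1
        return delivered

    def lost_if_local_destroyed(self) -> List[Tuple[int, bytes]]:
        """Writes that exist only locally at this moment."""
        return list(self.pending)


sync_disk = ReplicatedDisk(synchronous=True)
async_disk = ReplicatedDisk(synchronous=False)
for disk in (sync_disk, async_disk):
    disk.write(1, b"first")
    disk.write(2, b"second")
async_disk.drain(max_writes=1)  # only one write reaches the remote in time
```

In synchronous mode the exposure is always empty, at the cost of every write waiting on the link; in asynchronous mode the length of the pending queue at the moment of failure is the data lost.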


Bibliography

Standards publications

PAS 56:2003, Guide to Business Continuity Management

BS ISO/IEC 20000, Information Technology – Service Management

ISO/IEC 17799:2005, Code of Practice for Information Security Management

ISO Guide 73:2002, Risk management – Vocabulary – Guidelines for use in standards

Other publications

[1] IT Infrastructure Library (ITIL). Office of Government Commerce: The Stationery Office.

[2] PRINCE2 Maturity Model (P2MM). Office of Government Commerce (OGC).

[3] Project Management Body of Knowledge. Project Management Institute (PMI).

Further Reading

TR 19:2005, Technical Reference for Business Continuity Management (BCM). SPRING Singapore.

Emergency Preparedness: Guidance on Part 1 of the Civil Contingencies Act 2004, its associated Regulations and non-statutory arrangements. Home Office: The Stationery Office.

Generally Accepted Practices for Business Continuity Practitioners. Disaster Recovery Journal and DRI International, 2005.

“Business Continuity”. CBI with Computacenter, 2002.

“A Risk Management Standard”. The Institute of Risk Management, The Association of Insurance and Risk Managers and The National Forum for Risk Management in the Public Sector, 2002.

Microsoft Operations Framework, a pocket guide. Van Haren Publishing, ISBN 9077212108.

Management of Risk: Guidance for Practitioners. Office of Government Commerce: The Stationery Office.

“A Guide to Business Continuity Planning” by James C. Barnes, ISBN 0-471-53015-8.


BSI is the independent national body responsible for preparing British Standards. It presents the UK view on standards in Europe and at the international level. It is incorporated by Royal Charter.

Revisions
British Standards are updated by amendment or revision. Users of British Standards should make sure that they possess the latest amendments or editions.

We would be grateful if anyone finding an inaccuracy or ambiguity while using this Publicly Available Specification would inform Customer Services.

Tel: +44 (0)20 8996 9001
Fax: +44 (0)20 8996 7001

BSI offers members an individual updating service called PLUS which ensures that subscribers automatically receive the latest editions of standards.

Buying standards
Orders for all BSI, international and foreign standards publications should be addressed to Customer Services.

Tel: +44 (0)20 8996 9001
Fax: +44 (0)20 8996 7001

Standards are also available from the BSI website at http://www.bsi-global.com

In response to orders for international standards, it is BSI policy to supply the BSI implementation of those that have been published as British Standards, unless otherwise requested.

BSI – British Standards Institution

Information on standards
BSI provides a wide range of information on national, European and international standards through its Library and its Technical Help to Exporters Service. Various BSI electronic information services are also available which give details on all its products and services.

Contact the Information Centre
Tel: +44 (0) 20 8996 7111
Fax: +44 (0) 20 8996 7048

Subscribing members of BSI are kept up to date with standards developments and receive substantial discounts on the purchase price of standards. For details of these and other benefits contact Membership Administration.

Tel: +44 (0) 20 8996 7002
Fax: +44 (0) 20 8996 7001

Information regarding online access to British Standards via British Standards Online can be found at http://www.bsi-global.com/bsonline

Further information about BSI is available on the BSI website at http://www.bsi-global.com

Copyright
Copyright subsists in all BSI publications. BSI also holds the copyright, in the UK, of the publications of the international standardization bodies.

Except as permitted under the Copyright, Designs and Patents Act 1988 no extract may be reproduced, stored in a retrieval system or transmitted in any form or by any means – electronic, photocopying, recording or otherwise – without prior written permission from BSI.

This does not preclude the free use, in the course of implementing the standard, of necessary details such as symbols, and size, type or grade designations. If these details are to be used for any other purpose than implementation then the prior written permission of BSI must be obtained.

Details and advice can be obtained from the Copyright & Licensing Manager.

Tel: +44 (0) 20 8996 7070
Fax: +44 (0) 20 8996 7553
BSI, 389 Chiswick High Road, London W4 4AL.


British Standards Institution
389 Chiswick High Road
London W4 4AL
United Kingdom

http://www.bsi-global.com

ISBN 0 580 49047 5