xtium dr run book template idc intro

Upload: fluidux

Post on 10-Feb-2018


  • 7/22/2019 Xtium DR Run Book Template IDC Intro


    Disaster Recovery Run Book Template

    Provided by Xtium, Inc. 2013

    [YOUR LOGO HERE]

    Date of Last Update: MM/DD/YYYY

    DISASTER RECOVERY

    RUN BOOK


    DR Run Book Template provided by Xtium 2013

    Foreword

    By Laura DuBois, Program Vice President, Storage, IDC

    The Disaster Recovery Imperative

    Nearly all organizations today rely on information technology and the data it manages to

    operate. Keeping computers and networks running, and data accessible, is imperative. Without

    this information technology, customers cannot be serviced, orders taken, transactions

    completed, patients treated, and on and on.

    Disasters that create IT downtime are numerous and common, spanning the physical and logical, the man-made and natural. Organizations must be resilient to these disasters, and able to operate in a disruption of any type, whether it is a security incident, human error, device failure, or power failure.

    State of Preparedness

    Most organizations know the importance of disaster recovery, and firms of all sizes are investing to drive greater uptime. An IDC study on business continuity and disaster recovery (DR) showed that unplanned events of most concern were power, telecom, and data center failures (physical infrastructure), more so than natural events such as fire or weather. Security was considered the second most critical and extreme threat to business resiliency.

    Seventy-one percent of those surveyed had as many as 10 hours of unplanned downtime over

    a 12-month period. This underscores the importance of greater uptime and DR, which is driving

    firms to conduct DR tests more frequently. Approximately one in four firms are conducting DR

    testing quarterly or monthly, while another 45% are testing semi-annually or annually.

    This is a marked increase from previous research, which IDC conducted three years ago, where

    firms were testing annually at best. However, 25% of firms are still not doing any DR testing.

    IDC Advice

    DR planning is complex and spans three key areas: technology, people, and process. From an IT perspective, planning starts with a business impact analysis (BIA) by application/workload. Natural tiers or stages of DR begin at phase 1, infrastructure (networking, AD, DHCP, etc.), then extend to recovery by application tiers. Each application tier should have an established recovery time objective (RTO) and recovery point objective (RPO) based on business risk.
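
    As an illustration of the tiering described above, the following is a minimal Python sketch of how recovery tiers and their objectives might be recorded and checked after a DR test. The tier names, phases, and RTO/RPO values are hypothetical placeholders; real values would come out of your own business impact analysis.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryTier:
    """One DR tier with its recovery objectives (all values are illustrative)."""
    name: str
    phase: int       # 1 = infrastructure, higher numbers = application tiers
    rto: timedelta   # recovery time objective: maximum tolerable downtime
    rpo: timedelta   # recovery point objective: maximum tolerable data loss


def meets_objectives(tier, downtime, data_loss):
    """Return True when a test result stays within both the RTO and the RPO."""
    return downtime <= tier.rto and data_loss <= tier.rpo


# Hypothetical tiers, not taken from the template.
TIERS = [
    RecoveryTier("Order processing", 2, timedelta(hours=4), timedelta(hours=1)),
    RecoveryTier("Infrastructure (networking, AD, DHCP)", 1, timedelta(hours=1), timedelta(minutes=15)),
    RecoveryTier("Internal reporting", 3, timedelta(hours=24), timedelta(hours=12)),
]

# Recover lower phases first, regardless of the order they were entered in.
recovery_sequence = [t.name for t in sorted(TIERS, key=lambda t: t.phase)]
```

    Comparing measured downtime and data loss against each tier after every test gives a simple pass/fail record to carry into the next run book update.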

    DR testing is essential to adequate recovery of systems and data, but also to uncover events or conditions met during real disaster scenarios that were not previously accounted for. Examples include change management, such as the needed reconfiguration of applications or systems. Also, the recovery of systems in the right sequence is important. To ensure that DR testing, planning, and recovery are organized and effective, many organizations use a disaster recovery "run book."

    A DR run book is a working document, unique to every organization, which outlines the

    necessary steps to recover from a disaster or service interruption. It provides an instruction set

    for personnel in the event of a disaster, including both infrastructure and process information.

    Run books, or updates to run books, are the outputs of every DR test.

    However, a run book is only useful if it is up-to-date. If documented properly, it can take the confusion and uncertainty out of the recovery environment which, during an actual disaster, is often in a state of panic. Using the run book template provided here by Xtium can make the difference for an organization between two extremes: being prepared for an unexpected event and efficiently recovering, or never recovering at all.


    Your Disaster Recovery Run Book

    A disaster recovery run book is a working document unique to every organization that outlines the necessary steps to recover from a disaster or service interruption.

    Run books should be updated as part of your organization's change management practice. For instance, once a production change has been committed, run book restoration instructions should be reviewed for accuracy. In addition to synchronizing run books with corporate change management, the outcomes and action plans of each DR test should also be incorporated into run book update cycles.

    How to Use this Template

    This template outlines the critical components of your disaster recovery and business continuity practices. Disaster recovery tests should be regularly conducted and reviewed, and plans subsequently updated.

    Use this template as a guide for documenting your disaster recovery test efforts. It includes sections to specify contact information, roles and responsibilities, disaster scenarios likely to affect your business, and recovery priorities for your business IT assets.

    Keep in mind there may be more sections of a run book based on your deployment model; this template serves as a standard with all its sections applicable (and necessary) to any disaster recovery testing procedure. Similarly, your run book may look different if you are working with a managed service provider that handles most or all aspects of your disaster recovery tests.

    If you have further questions about DR tests, take a look at our disaster recovery testing guide, available for free at xtium.com (http://www.xtium.com/cloud-services/disaster-recovery/disaster-recovery-testing-frequently-asked-questions/), or contact us (http://www.xtium.com/about/contact-us/) for more information.

    DR Scenarios

    Though not part of the run book itself, we're providing this section to list some common events that would cause DR scenarios. These threats are general and could affect any business, so you might also want to list those which would threaten your business specifically.

    Research firm Forrester outlined some of the most common causes of disaster scenarios from a

    2011/2012 study. The findings illuminate the fact that your business should not just be prepared

    for the news-making types of disaster threats (hurricanes or tornados, for example). Instead,

    consider all these potential causes for disaster:

    Source: http://it.toolbox.com/blogs/managed-hosting-news/whats-your-2012-it-disaster-recovery-plan-49333


    Distribution List

    This section is also critical to the development of your run book. You must keep a clearly defined distribution list for the run book, ensuring that all key stakeholders have access to the document. Use the chart below to indicate the stakeholders to whom this run book will be distributed.

    Role                    | Name | Email | Phone
    Owner                   |      |       |
    Approver                |      |       |
    Auditor                 |      |       |
    Contributor (Technical) |      |       |
    Contributor (DBA)       |      |       |
    Contributor (Network)   |      |       |
    Contributor (Vendor)    |      |       |

    Location

    Specify the location(s) where this document may be found in electronic and/or hard copy. You may wish to include it on your company's shared drive or portal.

    If located on a shared drive or company portal, consider providing a link here so the most recent

    version is readily accessible.

    If this run book is also stored as a hard copy in one or multiple locations, list those locations

    here (along with who has access to those locations). We do recommend making your run book

    available outside of shared networks, as the document must be readily accessible at time of

    disaster in the event that primary systems like email are not accessible to employees. In other

    words, ensure your run book is accessible under any circumstances!


    Table of Contents

    Document Control
    Contact Information
    Data Center Access Control List
    Communication Structure of Plan
    Declaration Guidelines
    Alert Response Procedures
    Issue Management and Escalation
    Changes to SOP During Recovery
    Infrastructure Overview
      Data Center
      Network Layout Topology
      Access to Facilities
    Order of Restoration
    System Configuration
    Backup Configuration
    Monitors
    Roles and Responsibilities
    Data Restoration Processes


    Document Control

    Document creation and edit records should be maintained by your company's disaster recovery coordinator (DRC) or business continuity manager (BCM). If your organization does not have a DRC, consider creating that role to manage all future disaster recovery activities.

    Document Name: Disaster Recovery Run Book for [Your Company's Name Here]
    Version:
    Date created:
    Date last modified:
    Last modified by:

    Document Change History

    Version | Date       | Description                                         | Approval
    V1.0    | 11/20/2010 | Initial version                                     | Business Owner / DRC
    V1.1    | 12/30/2010 | End of year DR test action plan updates to run book | Test Manager / DRC

    Keep the most up-to-date information on your disaster recovery plan in this section, including the most recent dates your plan was accessed, used and modified. Keep a running log, with as many lines as necessary, on document changes and document reviews, as well.


    Contact Information

    This section will list your service provider's contacts (if applicable) along with those from your IT department. This is the team that will conduct ongoing disaster recovery operations and respond in the case of a true emergency. Specific roles listed below are examples of those that might comprise your team.

    All of these roles need to be in communication when in a disaster recovery mode of operation. For pending events, this same distribution list should be used to provide advance notice of potential incidents. Customer support teams should also not be overlooked, as they are the first line of communication to your customer base. Forgetting this step will cause extra work for your primary recovery team as they take time to explain what is going on.

    Your company's contacts | Title                                       | Phone                           | Email
    Name                    | Disaster Recovery Coordinator               | Primary phone / Secondary phone | Email
    Name                    | Chief Information Officer                   | Primary phone / Secondary phone | Email
    Name                    | Network Systems Administrator               | Primary phone / Secondary phone | Email
    Name                    | Database Systems Administrator              | Primary phone / Secondary phone | Email
    Name                    | Chief Security Officer                      | Primary phone / Secondary phone | Email
    Name                    | Chief Technology Officer                    | Primary phone / Secondary phone | Email
    Name                    | Business Owner                              | Primary phone / Secondary phone | Email
    Name                    | Application Development Lead (as applicable) | Primary phone / Secondary phone | Email
    Name                    | Data Center Manager                         | Primary phone / Secondary phone | Email
    Name                    | Customer Support Manager                    | Primary phone / Secondary phone | Email
    Name                    | Call Center Manager                         | Primary phone                   | Email


    And, for the situation written above, your general progression of calls might be as follows:

    Disaster Recovery Coordinator
    Head of Operations
    Director of Service Delivery
    Sr. Systems Engineer
    Network Engineer
    Systems Administrator
    CEO
    Director of Business Development
    Sales contact
    PR Representative


    Declaration Guidelines

    As you create your run book, you must consider guidelines for declaring a disaster scenario.

    Guidelines that we recommend are specified in the chart below:

    Situation | Action | Owner
    No workaround can be put in place within a time frame that avoids affecting customer SLAs | Declare application level failover and enact failover to secondary site |
    Restoration procedures cannot be completed in your production environment | Declare application level failover and enact failover to secondary site |
    A production environment no longer exists or is unable to be accessed | Declare a data center failure and enact a total failover plan from primary to secondary data center |
    Service provider issues cannot be resolved | Notify service provider and have them enact DR plans |

    The use of technology can be incorporated into the declaration steps of a DR plan. Be sure not to declare on the first instance of an event unless it is completely understood that secondary instances of the event will result in increased damage to your customer or your business systems. The table below details some standard practices to use in order to mitigate premature declarations. SLAs should be built in a manner that allows for some troubleshooting and system restoration prior to the need to declare a disaster.
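
    The declaration chart above can also be written down as a small decision function, which makes it easier to review and test. The Python sketch below is only one possible reading: the parameter names are hypothetical, and the order in which the situations are checked (most severe first) is an assumption, not something the chart specifies.

```python
from enum import Enum


class Action(Enum):
    CONTINUE_TROUBLESHOOTING = "continue troubleshooting within the SLA"
    APPLICATION_FAILOVER = "declare application level failover to secondary site"
    DATA_CENTER_FAILOVER = "declare data center failure and enact total failover"
    NOTIFY_PROVIDER = "notify service provider and have them enact DR plans"


def declaration_action(*, production_accessible, provider_issue_unresolved,
                       restorable_in_production, workaround_within_sla):
    """Map the four situations from the chart to an action.

    Assumption: situations are evaluated most severe first.
    """
    if not production_accessible:
        return Action.DATA_CENTER_FAILOVER
    if provider_issue_unresolved:
        return Action.NOTIFY_PROVIDER
    if not restorable_in_production or not workaround_within_sla:
        return Action.APPLICATION_FAILOVER
    # Nothing in the chart applies yet, so no disaster is declared.
    return Action.CONTINUE_TROUBLESHOOTING
```

    Returning an explicit "continue troubleshooting" result reflects the advice above not to declare on the first instance of an event.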

    Also use this section to outline standard monitoring procedures along with associated

    thresholds. List all system monitors, what they do, their associated thresholds, associated alerts

    when those thresholds are met or exceeded, the individual(s) who receive the alerts, and the

    remediation steps for each monitor.

    List event monitoring standards by defining thresholds for event types, durations, corrective actions to be taken once the threshold is met, and event criticality level. Use the following chart (or a derivative thereof for your monitoring standards) to specify event monitoring standards.


    The first few rows have been filled in with examples:

    Event Type | Duration of Event | Corrective Action | Event criticality
    Performance Monitoring Status = Warning Alert Level | > 2 minutes | Isolate problem device / recycle device | Critical Level
    Memory Usage > 80% | > 5 minutes | Isolate physical device / virtual machine; configure memory pool increase; clear memory cache; clear memory buffer | Critical Level
    CPU Usage > 90% | > 3 minutes | Increase compute allocation (virtual); add additional compute resources into application pool | Critical Level
    Memory | > 15 minutes | Check memory queue; clear memory cache of affected system; increase memory allocation (virtual) |
    Storage | | |
    Network | | |
    Ping Check | | |
    IP Check | | |

    These event types (memory, storage, network, ping

    check and IP check) are categories of events for

    which you should list specific examples in this chart.
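
    To show how a threshold and a duration combine into a single alert condition, here is a minimal Python sketch using the two numeric example rows from the chart (memory above 80% for more than 5 minutes, CPU above 90% for more than 3 minutes). The sample format and function names are assumptions made for illustration; a real monitoring tool will have its own configuration syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventRule:
    event_type: str
    threshold_pct: float   # metric value that counts as a breach
    min_duration_s: int    # breach must last longer than this many seconds
    criticality: str


# Values copied from the example rows of the chart.
RULES = {
    "memory": EventRule("Memory Usage", 80.0, 5 * 60, "Critical"),
    "cpu": EventRule("CPU Usage", 90.0, 3 * 60, "Critical"),
}


def breach_duration(samples, threshold_pct):
    """Seconds the metric has continuously exceeded the threshold up to the last sample.

    `samples` is a list of (timestamp_seconds, value) pairs in ascending time order.
    """
    start = None
    for ts, value in samples:
        if value > threshold_pct:
            if start is None:
                start = ts
        else:
            start = None  # the breach was interrupted, so start counting again
    if start is None:
        return 0
    return samples[-1][0] - start


def should_alert(rule, samples):
    """True when the breach has lasted longer than the rule's duration."""
    return breach_duration(samples, rule.threshold_pct) > rule.min_duration_s
```

    Requiring a sustained breach, not a single high reading, is what keeps short spikes from triggering corrective action.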


    Alert Response Procedures

    List out your step-by-step procedures for responding to service issue alerts in this section. As an example, Xtium's ticket submission and response procedures follow this general outline:

    Service interruption identified > Service Delivery Manager contacted

    1. Open a ticket with the support team (either in-house or through a third party provider's ticket creation system).

    2. Contact key stakeholders to ensure they are aware of the alert and determine if any current activity or recent changes may be responsible for the service interruption.

    3. Verify that the alert is legitimate and not an isolated single-user issue or monitoring timeout.

    4. Notify end users of ticket creation.

    5. Contact the appropriate member(s) of your operations or engineering teams to notify them of the alert and assign investigation and data restoration procedures.


    Issue Management and Escalation

    This section should list detailed procedures for issue management and escalation, when

    necessary, in the case of an unmet service objective.

    Escalation procedures will vary by levels of operation and severity of the associated activities.

    At Xtium, for example, we categorize standard operating procedure interruptions in five levels (5

    being the lowest severity, 1 the highest). Of course, these can and will differ among

    organizations. The following serves only as an example:

    1. Fatal: Functionality has ceased completely with no known workaround for all users. Impact is highest.

    2. Critical: Functionality is critically impaired but still operational for some users. Impact is high.

    3. Serious: Functionality is impaired but workarounds still exist for all users. Impact is moderate.

    4. Minor: Some functionality is impaired but there is a reasonable workaround for some users. Impact is low.

    5. Request: This is an enhancement-related service request that does not at all impact current operations or functionality.

    Depending on the severity of the service interruption, your escalation procedures will vary by parties involved, response chain, response time and target resolution.
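
    The five example levels map naturally onto an ordered enumeration. The Python sketch below encodes them and attaches a response-time target to each; the minute values are hypothetical placeholders, since the example above does not state any targets.

```python
from enum import IntEnum


class Severity(IntEnum):
    """The five example levels; a lower number means a higher severity."""
    FATAL = 1
    CRITICAL = 2
    SERIOUS = 3
    MINOR = 4
    REQUEST = 5


# Hypothetical response targets in minutes; set these from your own SLAs.
RESPONSE_TARGET_MIN = {
    Severity.FATAL: 15,
    Severity.CRITICAL: 30,
    Severity.SERIOUS: 120,
    Severity.MINOR: 480,
    Severity.REQUEST: 2880,
}


def is_emergency(severity):
    """Fatal and critical interruptions are treated as true emergencies."""
    return severity <= Severity.CRITICAL


def needs_escalation(severity, minutes_without_response):
    """Escalate once an issue has waited longer than its response target."""
    return minutes_without_response > RESPONSE_TARGET_MIN[severity]
```

    Keeping the targets in one table makes it straightforward to review them alongside the parties and response chain defined for each level.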


    Changes to SOP During Recovery

    Recovery events require that data and business process restoration take priority. At times, other non-critical standard operating procedures (SOPs) must be suspended.

    During a recovery event, recovery operations should take precedence over inbound queries or tickets. Monitors and alerts should also be reviewed for suspension until recovery is complete. This is a best practice procedure to avoid flooding your network operations center (NOC) and support teams with bogus or bad alarms.

    Change management policies should also be altered to expedite recovery procedures. For

    example, adding a new server or firewall rule in a standard environment might take one day

    once all necessary reviews and permissions are met. But during recovery operations, a

    standard firewall change should be expedited to support recovery operations.

    Ticketing of work during recovery operations should be reviewed to ensure the necessity of any requested tasks. Non-critical tickets should be deferred and addressed once recovery procedures are complete.

    Remember, the number one rule in recovery is: Recover! Get things back up and running

    whether in a workaround, failover or full restore state.

    With that in mind, use this section to identify which standard operating procedures will be suspended in the event of a true emergency scenario (one that would fall under your critical or fatal service interruption classifications). List out specifications for change management, monitors and alerts, and problem and issue resolution during recovery procedures. Certain non-critical standard operating procedures may be suspended, such as in the following situation:

    A user submits a call/ticket to your service desk stating they cannot access the company

    website. This ticket would be responded to with a message that your organization is currently in

    a recovery operations cycle and your service ticket will be addressed as soon as technicians

    have completed the restoration work.
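
    The deferral rule in the situation above can be sketched in a few lines of Python. The `Ticket` fields and the `recovery_related` flag are hypothetical; how a ticket is judged to be recovery-related is a decision for your own service desk.

```python
from dataclasses import dataclass

# Reply wording adapted from the example situation above.
DEFERRAL_MESSAGE = (
    "Our organization is currently in a recovery operations cycle. Your service "
    "ticket will be addressed as soon as technicians have completed the restoration work."
)


@dataclass(frozen=True)
class Ticket:
    ticket_id: str
    summary: str
    recovery_related: bool  # hypothetical flag set by the service desk


def triage(tickets, in_recovery):
    """Split tickets into those to work now and those to defer with a reply."""
    work_now, deferred = [], []
    for ticket in tickets:
        if in_recovery and not ticket.recovery_related:
            deferred.append((ticket, DEFERRAL_MESSAGE))
        else:
            work_now.append(ticket)
    return work_now, deferred
```

    Outside of a recovery event the function defers nothing, so normal ticket handling is unchanged.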


    Infrastructure Overview

    Provide a detailed overview of your IT environment in this section, including the location(s) of all data center(s), nature of use of those facilities (e.g. colocation, tape storage, cloud hosting), security features of your infrastructure and the hosting facilities, and procedures for access to those facilities.

    Data center

    Specify the location of all facilities in which your company's data is stored. Include an address and directions to each location.

    Example: Simple data center diagram

    Source: http://www.storageguardian.com/media/network_diagram.gif

    A data center diagram needs to be detailed enough to give your backup recovery team member the necessary information to perform his or her responsibilities if called upon.


    Example: Detailed data center diagram

    Source: http://www.routereflector.com/en/2013/05/data-center-topology-with-cisco-nexus-hp-virtual-connect-and-vmware/


    Network Layout Topology

    Source: http://sanketshukla.blogspot.com/2009/11/dhs-network-topology-diagram.html

    Access to Facilities

    Data centers and colocation facilities typically maintain strict entry protocol. Certain members of your organization will typically hold the appropriate credentials to enter the facility. Detail members of your team (and/or your IT service provider's team) who have access to all data facilities along with any requirements for access.


    Order of Restoration

    This section will include instructions for recovery personnel to follow that lay out which

    infrastructure components to restore and in which order. It should take into account application

    dependencies, authentication, middleware, database and third party elements and list

    restoration items by system or application type.

    Ensure that this order of restoration is understood before engaging in restore work. An example

    is provided below. The rest of the table should be filled out in the exact order that restoration

    procedures are to be completed.

    Order of Restoration Table:

    Server Name | Server Role                | Order of Restoration              | OS / Patch level | Application loaded
    Ws12_VF1    | Web Server Valley Forge 1  | Restore prior to db12_VF1 startup | ESX4.1           | Apache
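
    When the table grows, "restore X prior to Y" rules can be checked mechanically. The Python sketch below uses the standard library's `graphlib.TopologicalSorter` (Python 3.9 or later) to derive one valid restoration order and to detect circular rules. Only the `Ws12_VF1` before `db12_VF1` rule comes from the example row; the `dc01_VF1` server and its rule are hypothetical additions for illustration.

```python
from graphlib import CycleError, TopologicalSorter

# Each key must be restored AFTER every server in its set.
RESTORE_AFTER = {
    "db12_VF1": {"Ws12_VF1"},  # from the example row: Ws12_VF1 is restored prior to db12_VF1 startup
    "Ws12_VF1": {"dc01_VF1"},  # hypothetical: an authentication server comes first
    "dc01_VF1": set(),
}


def restoration_order(restore_after):
    """Return one restoration order that satisfies every 'restore prior to' rule."""
    try:
        return list(TopologicalSorter(restore_after).static_order())
    except CycleError as exc:
        # exc.args[1] holds the list of nodes that form the cycle.
        raise ValueError(f"circular restoration rules: {exc.args[1]}") from exc
```

    A circular rule set (A before B, B before A) raises an error here, which is far cheaper to discover during run book review than during an actual restore.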


    System Configuration

    This section should include system- and application-specific topology diagrams and an inventory of elements that comprise your overall system. Include networking, web app middleware, database and storage elements, along with third party systems that connect to and share data with this system.

    Network table:

    Device type   | Name | Primary IP | OS level | Gateway | Subnet Mask
    Firewall      |      |            |          |         |
    Load balancer |      |            |          |         |
    Switch        |      |            |          |         |
    Router        |      |            |          |         |

    Server table:

    Server Name/Priority | OS | Patch | IP Address | Subnet | Gateway | DNS | Alternate DNS | Secondary IPs | Production MAC Address

    You should lay out each of your systems

    separately and include a table for your network,

    server layout and storage layout.
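
    Because these tables are typed in by hand, a quick consistency check can catch entries that would fail during a restore. The Python sketch below uses the standard `ipaddress` module to confirm that each row's gateway lies inside the subnet implied by its primary IP and subnet mask. The function name and the sample device names are assumptions for illustration.

```python
import ipaddress


def validate_row(name, primary_ip, gateway, subnet_mask):
    """Return a list of problems found in one network-table row (empty if none)."""
    try:
        # strict=False lets a host address stand in for its network.
        network = ipaddress.ip_network(f"{primary_ip}/{subnet_mask}", strict=False)
        gateway_addr = ipaddress.ip_address(gateway)
    except ValueError as exc:
        return [f"{name}: {exc}"]
    problems = []
    if gateway_addr not in network:
        problems.append(f"{name}: gateway {gateway} is outside {network}")
    return problems
```

    For example, `validate_row("sw01", "10.0.0.5", "10.0.1.1", "255.255.255.0")` reports that the gateway is outside `10.0.0.0/24`, while a row with a matching gateway returns an empty list.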


    Storage table:

    Name | LUN Address | RAID configuration | Host name


    Backup Configuration

    Use this section to list instructions specifying the servers, directories and files from (and to)

    which backup procedures will be run. This should be the location of your last known good copy

    of production data.

    Server | Software | Version | Backup Cycle | Backup Source | Backup Target
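
    One way to keep this table useful is to treat each row as a record and check whether the last successful backup is older than its cycle, since a stale backup is no longer a reliable last known good copy. In the Python sketch below, the field names mirror the table columns, while the server, software name, version, and paths are hypothetical placeholders.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class BackupJob:
    server: str
    software: str
    version: str
    cycle: timedelta   # how often the backup is expected to run
    source: str        # directory or volume that is backed up
    target: str        # where the backup copy is written


def is_stale(job, last_success, now):
    """True when the most recent successful backup is older than one cycle."""
    return now - last_success > job.cycle


# Hypothetical row; replace every value with your own configuration.
BACKUP_JOBS = [
    BackupJob("db12_VF1", "ExampleBackup", "1.0", timedelta(hours=24),
              "/var/lib/db", "/mnt/backup/db12_VF1"),
]
```

    Running this check on a schedule, and before each DR test, shows which servers would restore from an out-of-date copy.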


    Monitors

    Be sure that the monitors listed here, by server, are put in place and activated as part of your restore activities. Restoring from a disaster should result in a mirror of your production environment (even if scaled). Monitors and alerts are a critical element of your production system.

    Server name | Monitor | Cycle | Alert
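
    Since the restored environment should mirror production, a simple comparison of monitor lists can confirm that nothing was left inactive after a restore. The Python sketch below assumes each environment's monitors are available as a mapping from server name to monitor names; the monitor names used in the usage note are hypothetical.

```python
def missing_monitors(production, restored):
    """Return {server: [monitors]} for monitors in production but not yet active after restore."""
    gaps = {}
    for server, monitors in production.items():
        missing = set(monitors) - set(restored.get(server, ()))
        if missing:
            gaps[server] = sorted(missing)
    return gaps
```

    For example, if production lists `cpu`, `memory` and `ping` for `Ws12_VF1` and the restored environment only has `ping`, the function returns `{"Ws12_VF1": ["cpu", "memory"]}`. An empty result means every listed monitor is back in place.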
