major incident handling process



Page 1: Major Incident Handling Process

Major Incident Handling
Request for Approval by ITS Senior Management Team
May 29, 2007
Presenters: Linda Rosewood, Ann Berry-Kline, John Hammond
(Process Approved)

SMT chartered a project to produce a Problem Management process. The group worked on an ITIL framework for Problem Management, but discovered that what SMT probably wanted, and what the division needed, was a major incident handling process. The model for this would be Incident Management, where the goal is to restore service as soon as possible and to manage communications and client expectations. Problem Management will be finished later.

This Major Incident Handling process is a special form of Incident Management, where coordination and communication internally and externally are emphasized. What we're asking for today is the approval of this MIH process. When we presented this for discussion several months ago, the discussion focused on "who can declare a major incident" and how the processes would work after hours.

Since our last visit, John Hammond, Eric Kiesler and I have refined the process and completed it for both 8-5 and after hours. We left the criteria for declaring a major incident to the details recorded in an OLA.

Recently, the CruzMail OLA team used this process and it worked flawlessly when the details of the service were applied to it.

We believe the process is ready to be adopted for the four highlighted services, and to become a general model for all ITS services, including those offered by the divisions. Additionally, one of the deliverables of the DDSLA program is to create a plan to complete OLAs/SLAs for all services. This plan will need to stage services. One criterion for prioritizing a service at the top of the list would be whether it qualifies as an MIH service, and in the packet you’ll find a list of services the committee recommends as the next to have MIH developed.

If approved, we will go forward with implementing the process in the four highlighted services, as well as creating training for Core Technology, Communications, and Support Center staff, where we run scenarios and understand the tasks of each role.

In advance of a complete portfolio of OLAs, we hope to get to a place where we can establish thresholds in the services provided by Core Technologies and put this process in production sooner rather than later. That is, we will become familiar with the process and be able to identify "who can declare a major incident" before OLAs are created for all major services.


Contents of MIH packet

1 This contents page

2 Major Incident Handling Process Overview
A diagram giving an overview of individual processes, showing how incident management is applied to major incidents, and identifying three types of coordination roles: Incident, Technical, and Communications.

3 MIH Process Description
An outline describing tasks, roles, and timeline. A textual annotation of the detailed process diagrams that follow.

4 MIH Process Description (After Hours)
Supplementary to MIH Process Description, showing how tasks, roles, and timeline are different if a major event occurs outside of our business day of 8 am to 5 pm Monday-Friday.

5 Implementation Notes
Assumptions, suggestions on monitoring tool coordination, "Hot Line" proposal for Data Center.

6 Recording and Classification Processes (Detail)
A process diagram for the first two stages of Incident Management, including incipient event tracking, major incident classification, and the declaration of a major incident.

7 Referral and Resolution Processes
A process diagram for stages three and four of Incident Management, showing building the team and working the incident to resolution.

8 Event Coordination and Event Communications Processes
The purpose of the MIH process is to improve coordination and communication between ITS units. This is a process diagram detailing the coordination and communication tasks and roles.

9 Closure Processes by Role
A process diagram for the last stage of Incident Management for the roles in the Support Center, Communications, and Technical teams.

10 After Hours Process
A process diagram supplementary to the MIH process.

11 Role Definitions
Detailed definition of roles and tasks performed.

12 Major Incident Declaration Template
Exactly what tool will be used to store and edit the template has not been determined yet. It is likely to be found in IT Request. This template lists what information is to be included when declaring a Major Incident.

13 Highlighted services that need MIH
List of services that need MIH processes first. The first of this list of ten will be CruzMail, CruzNet, and CruzTime. Although Desktop Services is one of the four highlighted services this summer, it will not have a major incident handling OLA.


[Diagram: ITS Major Incident Handling Process Overview. Swimlanes: Help Desk, Inc Coord, Tech Lead, PMG/Comm. Stage labels: Recording, Classification, Referral/Resolution, Diagnose, Closure. Steps 1 through 5. Boxes: Incipient Event Tracking, Major Incident Classification, MIH Declared, Create Prelim Comms, SMT/DLs Notified, Major Inc Tracking, Major Inc Coord, Assign Tech Lead, Build Team, Work Incident, Event Comms, Resolution, Closure.]

Page 4: Major Incident Handling Process

03 MIH Process Descriptions

Major Incident Processes [Role; Inputs; Outputs; Time]

1 (Recording) Incipient Event Tracking
1.1 HD staff note a pattern of tickets and start linking them with the "hot ticket" feature of IT Request. [Role: Help Desk Team; Inputs: tickets; Outputs: hot ticket; Time: 15 to 30 min]
1.2 ITS staff can use the Major Incident Hotline. [Role: Svc Providers; Inputs: phone call; Outputs: ticket; Time: ongoing]
1.2.1 Also known as the "Red Phone." This is a phone that rings in the HD office, is always answered, and is never busy. Used by Svc Providers to report major incidents and by the Tech Lead to update the HD team. The telephone number is easy to remember and is distributed orally.

2 (Classification) Major Incident Classification [Role: Inc Coord]
2.1 For each service, Major Incident criteria are consulted in the OLA. [Inputs: tickets]
2.2 Priority is assigned to tickets based on the OLA.
2.3 Process includes a quick call or IM to system administrators for a status check. Also check with the Support Center Subject Matter Expert.

3 Major Incident Declared (Investigation and Diagnosis)
3.1 If tickets indicate that the issue has reached Major Incident status:
3.1.1 Complete the MIH Template and post it as a tech-only message. [Outputs: Major Inc declaration; Time: 10 minutes]
3.2 Create preliminary communications in IT Request visible to ITS staff, visible to the public, and on the telephone greeting. [Outputs: tech-only msgs, ITR public msgs, phone msg; Time: 5 minutes]
3.3 Notify SMT/DLs that a major incident is declared. [email/cell/vmail to PMG/Comm; Time: 5 minutes]
3.3.1 The message to SMT/DLs is simple: there is a major incident, look at Messages and Ticket # for details.

4 Major Incident Tracking
4.1 Continue to collect incident information. [Role: Help Desk Team; Inputs: tickets; Outputs: tech notes; Time: ongoing]

5 Incident Coord Assigns Tech Lead (Referral and Resolution)
5.1 Consults OLA or IT Request to identify Lead Tech. [Role: Inc Coord; ITR Config; MIH]

6 Tech Lead Builds Team
6.1 Builds team. Tech Lead can assign another tech lead if needed. [Role: Tech Lead; Outputs: note in hot ticket describes team]
6.2 Agreement on communication protocol: status/tools/phone numbers/IM. [Role: Tech Lead/Inc Coord; Time: 5 minutes]

7 Team Works Incident
7.1 Outputs: ticket updates, FAQs, tech-messages, workarounds, communications within team and with Incident Coord, root cause, known errors, emergency RFCs.
7.2 Priority is placed on creating workarounds over fixes or discovering root cause.

8 Event Communications
8.1 Tech Lead's communications. [Role: Tech Lead; Outputs: updates to Inc Coord via protocol; Time: every 60 minutes or as scheduled, on the hour if possible]
8.2 Incident Coordinator's communications. [Role: Inc Coord; Inputs: workarounds, status, fixes; Outputs: updates to PMG/Comm via tech-only notes, phone calls, public web messages; Time: same as 8.1]
8.3 PMG/Communications. [Role: PMG/Comm; Inputs: from Inc Coord; Outputs: campus and others; Time: every 60 minutes or as scheduled]
8.4 PMG/Comm is responsible for keeping the ITS phone list updated with ITS staff contact information. ITR technician accounts are created by the Support Center. ITR technicians are expected to keep their tech profiles updated.


9 Event Coord
9.1 Includes implementing workarounds as created. [Role: Inc Coord]

10 Resolution
10.1 Resolutions. [Role: Tech Lead; Outputs: workarounds, fixes; Time: as available]
10.2 Resolve tickets. [Role: HD Team; Time: as possible]

11 Closure
11.1 Tech Lead closure activities. [Role: Tech Lead; Outputs: post mortem, audience=SMT/DLs/ITS; Time: within 48 hours]
11.2 HD team closure activities. [Role: HD team; Outputs: FAQs and other normal IcM closure activities; Time: within 5 working days]
11.3 PMG/Comm closure activities. [Outputs: audience=campus; Time: within a week and kept in web archive]

12 After hours process
In separate document.
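The timing targets in steps 8 and 11 can be sketched in a few lines of Python. This is only an illustration, not part of the approved process: the function names are hypothetical, the closure clocks are assumed to start when the incident is resolved (the process does not say), "a week" is taken as 7 calendar days, and working days are assumed to be Monday-Friday with holidays ignored.

```python
from datetime import datetime, timedelta


def next_status_update(now: datetime) -> datetime:
    """Next top-of-the-hour update strictly after `now` (step 8.1: on the hour if possible)."""
    top_of_hour = now.replace(minute=0, second=0, microsecond=0)
    return top_of_hour + timedelta(hours=1)


def add_working_days(start: datetime, days: int) -> datetime:
    """Advance `start` by `days` weekdays (Mon-Fri). Holidays are ignored (assumption)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday-Friday
            remaining -= 1
    return current


def closure_deadlines(resolved_at: datetime) -> dict:
    """Closure targets from step 11, measured from resolution time (assumption)."""
    return {
        "tech_lead_post_mortem": resolved_at + timedelta(hours=48),    # 11.1
        "help_desk_closure": add_working_days(resolved_at, 5),         # 11.2
        "pmg_comm_campus_report": resolved_at + timedelta(days=7),     # 11.3
    }


if __name__ == "__main__":
    resolved = datetime(2007, 5, 29, 14, 20)  # a Tuesday afternoon
    print("Next status update:", next_status_update(resolved))
    for name, due in closure_deadlines(resolved).items():
        print(name, "due", due)
```

An OLA for a particular service could override any of these intervals; the values above are only the defaults listed in the Time column.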


04 MIH Process Description After Hours

Major Incident Processes After Hours [Role; Inputs; Outputs; Time]

1 (Recording) Alarms/Monitors/Reports are received by Ops. [Role: Ops Staff; Inputs: alarms; Outputs: daily email]
Currently Ops provides a daily message about the night's events. sc.update is a list that should be added to that distribution. This alerts not only the Help Desk but also Lisa Bono and her back-ups.
As Core Technologies monitoring tools develop and align, we will align "warning event" and "critical event" alerts to the Major Incident processes.

2 (Recording) Alarm/Monitor/Report sufficient to suggest calling On-Call staff
2.1 Opens ticket. Emails ticket to sc.update. [Role: Ops staff; Inputs: alarm; Outputs: ticket; Time: immediate]
This step will require training before implementation.
2.2 Using a template, creates a tech-only message pointing at the ticket. Audience is all-techs. Expires in 24 hours.
This step will require training before implementation.
2.3 Checks SLA or other documentation: is the system/service eligible for off-hours support? If not, contacts client if known. Email ticket to [email protected] calling their attention to the ticket.
Before SLAs are in place, we require a list of services that get off-hours support. This list is partially completed.
2.4 Is there support for off-hours? If not, contact client and DL per SLA.
2.4.1 If the service does not get off-hours support, Help Desk staff perform resolution and closure on the ticket after 8 am.
2.5 If yes, contact On-Call Person.
2.5.1 If on-call doesn't respond within the defined time, moves down the list of contacts per on-call protocol.

3 (Classification) Incident Classification
3.1 For each service, Major Incident criteria are consulted in the OLA. [Role: On-Call; Inputs: ticket; Time: 30 min]

4 Major incident?
4.1 On-Call person decides if the report fits the criteria for this service. Acts as Tech Lead until relieved or delegated. [Time: 10 min]
4.2 If yes, informs Operations staff, who complete the Major Incident Template and add it to the ticket. Changes original ticket to "Hot Ticket." [Outputs: tech notes]
Operations staff act as Incident Coord during off-hours.
4.3 If not a Major Incident, Operators resolve the ticket and copy in sc.update. This updates PMG/Comm also. Record in detail what occurred and why it was not a Major Incident. [Outputs: ITR message, email; Time: 30 min]

5 (Investigation/Diagnosis)
5.1 On-Call person does what is necessary to gather information and determine team and perhaps other tech lead, if needed. [Role: On-Call; Time: depends on service, varies]

6 Referral/Resolution (if it is a Major Incident)
6.1 Assembles team, if needed; starts working the issue. [Role: On-Call; Time: varies]
6.2 Decides: is there a high probability that status will not be normal at 6:30 am? (Consulting the documentation for that service.) [Role: On-Call]
6.2.1 If yes, Operators create a tech-only message pointing at the Hot Ticket containing the Major Incident declaration. [Role: Ops Staff; Outputs: Major Inc declaration]
6.2.2 Tech Lead communicates with management per the service's protocol. [Ops Staff; phone calls/emails]
6.3 At 8 am the Major Incident Process starts with Step 3 of business hours Major Incident Handling. Incident Coordinator duties can be handed off, or stay with Operators, depending on workload and service.

7 If resolved before 8 am, begin closure procedures.
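The branching in the after-hours steps can be read as a small decision function. The sketch below is a hedged illustration only: the class, field, and enum names are hypothetical, the inputs are assumed to be simple yes/no answers that the Operator and On-Call person have already worked out, and the KEEP_WORKING outcome is an assumption for the case the process description does not spell out.

```python
from dataclasses import dataclass
from enum import Enum


class NextStep(Enum):
    CONTACT_CLIENT_AND_DL = "Contact client and DL; Help Desk handles the ticket after 8 am (2.4)"
    RESOLVE_AS_STANDARD = "Operators resolve ticket and copy sc.update (4.3)"
    BEGIN_CLOSURE = "Begin closure procedures (7)"
    DECLARE_AND_HAND_OFF = "Post declaration as tech-only message; business-hours Step 3 starts at 8 am (6.2.1, 6.3)"
    KEEP_WORKING = "On-call team keeps working the incident (assumed default)"


@dataclass
class AfterHoursReport:
    has_off_hours_support: bool      # step 2.3/2.4: per SLA or the interim list
    meets_major_criteria: bool       # step 4.1: On-Call person's decision per OLA
    resolved_before_8am: bool        # step 7
    likely_abnormal_at_630am: bool   # step 6.2


def after_hours_next_step(report: AfterHoursReport) -> NextStep:
    """Walk the after-hours branches in the order they appear in the process."""
    if not report.has_off_hours_support:
        return NextStep.CONTACT_CLIENT_AND_DL
    if not report.meets_major_criteria:
        return NextStep.RESOLVE_AS_STANDARD
    if report.resolved_before_8am:
        return NextStep.BEGIN_CLOSURE
    if report.likely_abnormal_at_630am:
        return NextStep.DECLARE_AND_HAND_OFF
    return NextStep.KEEP_WORKING


if __name__ == "__main__":
    report = AfterHoursReport(
        has_off_hours_support=True,
        meets_major_criteria=True,
        resolved_before_8am=False,
        likely_abnormal_at_630am=True,
    )
    print(after_hours_next_step(report).value)
```

The on-call escalation in step 2.5.1 (moving down the contact list) is left out here because the response time and contact order are defined per service.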


05 Implementation notes for MIH

Implementation Notes for Major Incident Handling

We already handle Major Incidents
This process builds on what we are already doing.
The goal of these processes is to document for training, consistency, and improvement.

2 Assumptions
2.1 We depend on OLAs to record the implementation details per service.
2.2 Incidents only; no MIH for service requests.
2.3 Incident Coords are assumed to be members of the Help Desk team, but could be in other parts of ITS as the process develops operationally. As the Support Center matures and becomes the single point of contact, incident coordinators are likely to be solely in the Support Center or Data Center.
The after hours process identifies DC Operators as the Incident Coords.

Groundworks and HPOV integration
The two main monitors of Core Tech already have alerts, events, and actions taken when thresholds are reached. For example, a "warning event" seen by Groundworks generates an email and a "critical event" requires operator action.
As this process is implemented, we will need to coordinate the actions and terms from the monitoring tools with the terms used in the process and keep our terminology consistent.
As our monitors mature, we will rely less on clients notifying ITS of outages.

Senior Management Buy-In
All Senior Managers will need to endorse the framework of Major Incident handling and use it as we develop OLAs for all services. We need to put energy into process development and in examining how we did after each Major Event. Core Tech is the key area here.

Major Incident Hotline
The Help Desk team's room contains a "hot line" that is used for phone conversations between the Lead Tech and the Incident Coord during a Major Incident. (During open hours of 8 am to 5 pm, generally.)
We keep the phone number a close secret.
The phone is always answered and never busy.
It is used by DLs to report incidents.
It is used by Tech Leads to report status when email is not suitable.
The MIH plan proposes installing a second hotline in the Data Center.
As with the hotline in the Help Desk, it should have a distinctive ring and always be answered.
It is to be used by DLs and ITS Service Providers to report incidents.
System Owners will use this line to give status reports to the Data Center operators during a major incident, or to report the initial discovery of what could become a major incident.
The hotlines are to be used when verified status is known, not to ask the Help Desk or Data Center staff for technical assistance or to "check to see if there is something wrong with the network."

The hotlines are intended for conversations such as "I'm the system administrator for XYZ. Users can't login and we're investigating it now. I'll call you back in 30 minutes with an update."

Closure Reports
After a Major Incident, the Tech Lead has the responsibility to produce a technical report describing the technical history of the incident, workarounds created, and a root cause, if known.
The Support Center staff work with the PMG/Communications staff to create public versions of this document.
The public reports (and technical reports, if appropriate) will be posted on the web indefinitely.


[Process diagram: Recording and Classification (Support Center). Inputs: Incident, Alarm, Monitors, email, Voice. Boxes: Record Basic Details and Acknowledge Receipt (ITRequest); IcM Recording process; Incipient Event Tracking; Major Incident Classification; Consultation with Svc Mgrs and SC SMEs; Upgrade to Hot Ticket; Preliminary Comms using Declaration template; Notify SMT/DLs; Major Incident Tracking. Decisions: Need More Detail?; Service Request? (leads to Service Request Process); Classified Major Incident? (No leads to Standard Incident Management). Assessment parameters: Urgency, Priority, Impact, from OLA/SLA. Exit: Referral & Resolution.]


[Process diagram: Referral and Resolution. Swimlanes: MIH Technical Team, Support Center. Stages: Build Team, Work Incident, Resolution. Boxes: Major Incident Coordinator Assigns Technical Lead; Tech Lead Assigns Appropriate Resources to Team; Tech Lead and MIC Agree on Internal Communication Protocol; Initial Investigation; Investigate and Diagnose; Identify Workaround; Test Workaround; Implement Workaround; Fix Root Cause; Test Root Resolution; Review Progress; Tech Lead to MIC Status; Tech notes to Hot Ticket; Publish Resolution or Workaround. Decisions: Success? (Yes/No); Determine Root Cause? (Yes/No). Exit: Closure.]


[Process diagram: Event Comms and Event Coord. Swimlanes: MIH Technical Team, Support Center, Communications. Stages: Recording and Classification, Referral and Resolution (Build Team, Work Incident, Resolution), Closure. Boxes: Major Incident Coordination; Tech Lead to MIC Status; Tech notes to Hot Ticket; Ticket updates; Tech-only messages; workarounds/FAQs; Publish Workaround; Publish Resolution; Event Comms (To DLs/SMT, To Specific Clients, To campus); Major Incident Closure; Closed. Link to Problem Management: Problem Record.]


[Process diagram: Closure. Swimlanes: MIH Tech Team, Support Center, PMG Comms. Each lane enters from Resolution. Boxes: Write Technical Report (MIH Tech Team); assist with writing; Report to Campus; Closure. Link to Problem Management: Problem Record.]


[Process diagram: After Hours MIH Process. Swimlanes: On Call Personnel, Operations. Stages: Recording, Classification, Investigation/Referral, Closure. Inputs: Incident, Alarm, Monitors, email, Voice. Boxes: Record Basic Details in ITRequest and Acknowledge Receipt using template; Check Service Availability (SLA); Contact Client and/or DLs; Contact On Call Person; On Call Investigate/Diagnose; Major Inc Criteria; Major Incident Declared; Notify Operations; tech-only message, email sc.update; Work Incident; Prelim Comms; Notify Unit Management; Message to SMT; Message to DLs; Update tech-only mesg; Update to Hot Ticket; Start Step 3 @ 8 am; Closure Processes. Decisions: Off Hour Support for Service? (Yes/No); fixed by 6:30 am? (Yes/No).]


11 Roles for MIH

Roles

1 Major Incident Tech Lead
Predetermined person for each service, recorded in the OLA and in IT Request as the "Group Manager."
Makes decisions about resources.
Authority to pull in resources from across units.
Reports status to Major Incident Coordinator per protocol as defined in OLA.
Coordinates work of technical team developing workarounds and other resolutions.
Writes Technical Post Mortem Report (audience: SMT, DLs, and ITS staff). Reports to other audiences use this as a resource.
Characteristics of Role
Senior technical person or manager.
A single person is designated the tech lead during an event. This designation can be delegated.
Characteristics of Communications
Communications with Incident Coordinator concentrate on status, workarounds, and fixes.
Although theories are discussed with the Incident Coordinator, the focus should not be on technical theories until the Problem Management process discovers the root cause of the problems behind the incidents.

2 Major Incident Coordinator
In the initial version of the process, this person is always a member of the Support Center's Help Desk Team. However, Incident Coords may be designated in Instructional Technology, Media Services, Divisional IT groups, or other ITS units with a client-facing function.
Receives reports from Lead Tech and reports incident information to Lead Tech.
Usually communicates only with the Lead Tech during an incident, and not with the rest of the technical team.
Gives status reports to the PMG Communications staff per OLA/SLA.
Sees that technical communications via tech-only messages are created and updated.
Oversees recording, classification, and initial diagnosis during incident.
Oversees complete closure of all tickets.
Oversees documentation of workarounds, FAQs, etc. during closure.

3 Support Center Subject Matter Expert
Support Center staff are assigned to be technical experts in each service of the catalog.
Consulted during incident classification process before major incident is declared.
Assists incident coordinator in writing communications and implementing workarounds.
If Incident Coordinator is not in the Support Center, then relevant SMEs should be consulted before declaring a major incident.

4 Data Center Operator
Incident Coordinator during Off-Hours process.
Notifies On-Call tech per OLA/protocol.
Creates initial ticket during off-hours events.

5 On Call Tech
Core Systems staff person who is on call for a particular service after hours.
Responsible for finding the right person to respond to the issue.
Contacted by operators, and no one else.

6 PMG Communications Staff


Communicates with campus community during event and post-mortem.
Uses all communication channels: ITS status page, mass vmail, mass email, targeted communications. In the future, RSS feeds, etc.

7 Divisional Liaisons (DLs)
Communicates with their clients as needed.
Informs ITS staff of client needs, priorities, critical calendar events.

8 Help Desk Staff
Executes IcM processes.
Performs internal and external communications.

9 Senior Management Team (SMT)
Receives early communications.
Receives communications during events and post-mortem by lead tech.
Sets priorities and policy, assigns resources.


12 Major Incident Declaration template

Major Incident Declaration Template
Exactly what tool will be used to store and edit the template has not been determined yet. It is likely to be found in IT Request.

Elements of the template
the service impacted
person who made the declaration
the condition of alarms
the impact: who is impacted by this outage or service degradation?
who should care about this
when did it begin, or when was it discovered?
how long is the condition expected to last? This may be unknown. If so, say unknown.
tech groups working on the incident
technical lead on incident
incident coordinator
PMG/Comm staff person contacted (if possible)
Consult campus calendar and maintenance calendar. Will other events be affected? Could other events be related to cause?
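Since the tool for storing the template is still undecided, the elements above could be captured in any structured form. The following Python sketch is one hypothetical way to do it, not the IT Request implementation: the class name, field names, message layout, and the sample values (other than the CruzMail service name) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MajorIncidentDeclaration:
    """Hypothetical container for the template elements; field names are illustrative."""
    service_impacted: str
    declared_by: str
    alarm_condition: str
    impact: str                      # who is impacted by the outage or degradation
    who_should_care: str
    began_or_discovered: str
    technical_lead: str
    incident_coordinator: str
    tech_groups: List[str] = field(default_factory=list)
    expected_duration: str = "unknown"        # the template says to state "unknown" if not known
    pmg_comm_contact: Optional[str] = None    # "if possible" in the template
    calendar_notes: str = ""                  # campus and maintenance calendar check

    def to_message(self) -> str:
        """Render the declaration as plain text, e.g. for a tech-only message."""
        lines = [
            f"MAJOR INCIDENT DECLARED: {self.service_impacted}",
            f"Declared by: {self.declared_by}",
            f"Condition of alarms: {self.alarm_condition}",
            f"Impact: {self.impact}",
            f"Who should care: {self.who_should_care}",
            f"Began or discovered: {self.began_or_discovered}",
            f"Expected duration: {self.expected_duration}",
            f"Tech groups working: {', '.join(self.tech_groups) or 'none yet'}",
            f"Technical lead: {self.technical_lead}",
            f"Incident coordinator: {self.incident_coordinator}",
            f"PMG/Comm contact: {self.pmg_comm_contact or 'not yet contacted'}",
            f"Calendar check: {self.calendar_notes or 'not yet checked'}",
        ]
        return "\n".join(lines)


if __name__ == "__main__":
    # All values below are made-up sample data.
    declaration = MajorIncidentDeclaration(
        service_impacted="CruzMail",
        declared_by="Incident Coordinator (sample)",
        alarm_condition="critical event on mail store (sample)",
        impact="users cannot log in (sample)",
        who_should_care="SMT, DLs, Help Desk (sample)",
        began_or_discovered="09:15 (sample)",
        technical_lead="Tech Lead (sample)",
        incident_coordinator="Help Desk lead (sample)",
        tech_groups=["Core Tech (sample)"],
    )
    print(declaration.to_message())
```

Making the expected duration default to "unknown" mirrors the template's instruction to say so explicitly rather than leave the item blank.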


13 Services for MIH

Services for MIH
The committee suggests that MIH processes be created and added to the OLAs of these services first.

1 Network
2 Cruzmail
3 Web Services, especially for the main campus and ITS servers
4 Cruztime
5 Business Systems (FIS, PPS)
6 AIS
7 CruzID and source systems
8 Unix Systems (Timeshares)
9 Telephone
10 Santa Cruz Tickets.com
11 Other Mission Critical Systems as appropriate