incident-vs-problem management white paper

Nova Corporation White Paper

Incident Management and Problem Management

DESS Task Order 1

Dan Goebel, ITIL Expert

Nova Corporation

2/14/15 FOUO


February 14, 2015

Incident vs. Problem Management

Background

For an IT shop that uses IT Service Management to deliver IT services to the business, Incident Management and Problem Management are the most frequently used processes. As a result, confusion sometimes arises as to where one ends and the other starts.

In this paper we’ll explore the processes individually and when the handoff should occur. We’ll also explore metrics, Key Performance Indicators, Critical Success Factors, and a simple out-of-the-box process flow for each one. Additionally, we will look at inputs and outputs for each of these processes.

This white paper should go a long way in helping DISA ensure that services get restored quickly and that problems are resolved to prevent further incidents.

Appendices A and B contain cut-sheets (one-page summaries) for Incident Management and Problem Management, respectively.

Incident Management

An incident is defined as an unplanned interruption to an IT service or reduction in the quality of an IT service. Failure of a configuration item that has not yet impacted service is also an incident; for example, failure of one disk from a mirror set.

The management of incidents almost always occurs via interaction with the Service Desk. The Service Desk is defined as a function that stands on its own; it is not part of the Incident Management process, though the two are tightly integrated.

The goal is to restore normal function or service operation as quickly as possible with minimal impact. This is to ensure that Service Level Agreements (SLAs) are met. The SLA may often refer to availability and this is where Incident Management helps to meet those thresholds.


What triggers an incident can be an event or series of events that disrupt service. These events may be reported via the Service Desk, via users, or through an Event Management tool. One must keep in mind that there are events that do not disrupt service and therefore do not cause an incident.

Key to successful Incident Management is an Incident Management model. This model should include:

• Necessary steps to be taken to handle an incident
• Order these steps should be taken in
• Responsibilities, possibly through a RACI chart
• Timescales and thresholds for completion of actions
• Escalation procedures
• Necessary evidence preservation activities. This is especially important in security-related events

Incidents will need to be identified, logged, categorized, prioritized, diagnosed, escalated, resolved, and finally closed. This is illustrated in the following generic flow chart:
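The sequence of stages named above can also be sketched in code. The following minimal Python sketch models the stages as a small state machine. The class and stage names are hypothetical, and it assumes that escalation is optional (an incident diagnosed at the first tier may be resolved without escalating); it is an illustration, not a definition of the ITIL process.

```python
from enum import Enum


class Stage(Enum):
    IDENTIFIED = 1
    LOGGED = 2
    CATEGORIZED = 3
    PRIORITIZED = 4
    DIAGNOSED = 5
    ESCALATED = 6
    RESOLVED = 7
    CLOSED = 8


# Allowed forward moves between stages. Assumption: escalation is optional,
# and an escalated incident may be diagnosed again by the next tier.
NEXT_STAGES = {
    Stage.IDENTIFIED: {Stage.LOGGED},
    Stage.LOGGED: {Stage.CATEGORIZED},
    Stage.CATEGORIZED: {Stage.PRIORITIZED},
    Stage.PRIORITIZED: {Stage.DIAGNOSED},
    Stage.DIAGNOSED: {Stage.ESCALATED, Stage.RESOLVED},
    Stage.ESCALATED: {Stage.DIAGNOSED, Stage.RESOLVED},
    Stage.RESOLVED: {Stage.CLOSED},
    Stage.CLOSED: set(),
}


class Incident:
    """A hypothetical incident record that tracks its current stage."""

    def __init__(self, summary):
        self.summary = summary
        self.stage = Stage.IDENTIFIED
        self.history = [Stage.IDENTIFIED]

    def advance(self, new_stage):
        """Move to new_stage, rejecting moves that skip required stages."""
        if new_stage not in NEXT_STAGES[self.stage]:
            raise ValueError(
                f"cannot move from {self.stage.name} to {new_stage.name}"
            )
        self.stage = new_stage
        self.history.append(new_stage)
```

For example, an incident cannot be closed straight after it is identified; it must first be logged, categorized, prioritized, diagnosed, and resolved.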


One of the problems at DISA is that escalation is confused with Problem Management. For instance, the DESS contract states for Task Order 1 that an incident affecting 10 or more users gets assigned to Problem Management. The idea that a threshold for escalation, in this case 10 users, is a criterion for movement to another process is not consistent with ITIL. As we shall see in Problem Management, the number of users affected does not matter so much, because it is assumed that the service being analyzed in Problem Management is up and running with a workaround. In the case of servers or network equipment whose service degrades over a short period of time, for instance, the workaround could be to reboot the equipment every three days. Operationally this is not acceptable as a permanent state; therefore, the next step is to move the issue to Problem Management to solve the underlying cause of the quickly degrading service.
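The distinction argued here can be sketched as a small decision rule. The Python below is one reading of the paragraph above, not contract or ITIL wording: the field names and return strings are hypothetical, and the default threshold of 10 users is taken from the DESS example only to show that user count drives escalation inside Incident Management rather than the handoff to Problem Management.

```python
from dataclasses import dataclass


@dataclass
class IncidentStatus:
    service_restored: bool   # service is up, possibly only via a workaround
    cause_eliminated: bool   # the underlying cause has been removed
    users_affected: int      # influences escalation, not process ownership


def next_step(status, escalation_threshold=10):
    """Return which process owns the next action (illustrative rules only)."""
    if not status.service_restored:
        # Still an incident: a high user count raises urgency and triggers
        # escalation, but the work stays in Incident Management.
        if status.users_affected >= escalation_threshold:
            return "incident management: escalate"
        return "incident management: continue"
    if not status.cause_eliminated:
        # Service is running on a workaround (e.g. periodic reboots), so the
        # remaining work is finding and fixing the cause.
        return "problem management: open problem record"
    return "incident management: close"
```

Under these assumptions, the equipment that must be rebooted every three days maps to `IncidentStatus(True, False, n)` for any `n`, and is handed to Problem Management regardless of how many users were affected.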

Metrics

The following are some possible metrics for understanding how effective the Incident Management process is for the business. They can be drawn from the following sources: Incident Management system reports, labor or HR reports, and process and tool assessment audit findings.

Ref  Metric
A    Total # of incidents
B    Avg time to resolve Severity 1 and Severity 2 incidents
C    # of incidents resolved within agreed service levels
D    # of High/Major incidents
E    # of incidents with customer impact
F    # of incidents reopened
G    Total available labor hours to work on incidents (non-Service Desk)
H    Total labor hours spent resolving incidents (non-Service Desk)
I    Incident Management tooling support level
J    Incident Management process maturity

KPI (Key Performance Indicators)

KPIs are calculations or measurements used to indicate the performance level of an operation or process. They provide input for actionable management decisions and feed the Act portion of the Deming Plan-Do-Check-Act (PDCA) cycle. KPIs are derived from the metrics above: each KPI is either a metric taken directly or a value calculated from two or more metrics, and each shows whether performance is within acceptable tolerance thresholds.
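To make the idea of a calculated KPI concrete, the following minimal Python sketch derives four ratio KPIs from Incident Management metrics keyed by the reference letters used in this paper (A, C, E, F, G, H). The function names and the sample values are hypothetical and are not taken from any DISA system.

```python
def ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def incident_kpis(m):
    """Compute ratio KPIs from a dict of metrics keyed by reference letter."""
    return {
        "incident_resolution_rate": ratio(m["C"], m["A"]),   # C/A
        "customer_impact_rate": ratio(m["E"], m["A"]),       # E/A
        "incident_reopen_rate": ratio(m["F"], m["A"]),       # F/A
        "labor_utilization_rate": ratio(m["H"], m["G"]),     # H/G
    }


# Made-up sample values for one reporting period.
sample = {"A": 200, "C": 180, "E": 20, "F": 10, "G": 1000, "H": 400}
kpis = incident_kpis(sample)
for name, value in kpis.items():
    print(f"{name}: {value:.2f}")
```

Returning None for a zero denominator (for example, a period with no incidents) is a design choice of this sketch; it avoids reporting a misleading rate of zero.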

The Incident Management KPIs, with the metric or calculation behind each (letters refer to the metric references):

Ref  KPI                                                      Calculation
1    Total # of incidents                                     A
2    # of High/Major incidents                                D
3    Incident Resolution Rate                                 C/A
4    Customer Impact Rate                                     E/A
5    Incident Reopen Rate                                     F/A
6    Avg time to resolve Severity 1 and Severity 2 incidents  B
7    Incident labor utilization rate                          H/G
8    Incident Management tooling support level                I
9    Incident Management process maturity level               J

CSF (Critical Success Factors)

Critical success factors are defined as the limited number of areas in which results, if they are satisfactory, will ensure successful competitive performance for the organization. They are the few key areas where things must go right for the business to flourish. CSFs sit one level above KPIs: each CSF is assessed through a set of KPIs, which are in turn built on metrics, as the CSF table for Incident Management shows.

CSF                                   KPIs
Quickly Resolve Incidents             5, 6, 8
Maintain IT Service Quality           1, 2, 3, 4, 8, 9
Improve IT and Business Productivity  7, 8
Maintain User Satisfaction            4, 8, 9

Problem Management

A problem is defined as the cause of one or more incidents.

Problem Management concerns itself with managing the lifecycle of problems. The objective here is to prevent future problems and resulting incidents from occurring, eliminate recurring incidents, and minimize the impact of incidents that cannot be prevented.

Problem Management’s activities include:

• Diagnose the root cause of incidents and determine the resolution to those problems
• Ensure that the resolution is implemented through Change Management and Release and Deployment Management
• Keep track of workarounds and resolutions

While separate from Incident Management, Problem Management is the process of developing workarounds, analyzing related incidents, identifying patterns of faults and solutions, and feeding this information into Change Management and Release and Deployment Management for permanent fixes to recurring problems and incidents. It is reasonable to assume that the personnel who work in Problem Management are Tier II and Tier III engineers.

Problem Management has two major processes: Reactive and Proactive. Reactive Problem Management is handled under the Service Operation stage of the ITIL lifecycle, and Proactive Problem Management falls under the Continual Service Improvement stage. Below are the Problem Management flow, metrics, KPIs, and CSFs.


Metrics

These can be found in the following sources: Incident Management system reports, Problem Management system reports, labor or HR reports, and process and tool assessment audit findings.

Ref  Metric
A    # of repeat incidents
B    # of Major Problems
C    Total # of incidents
D    Total # of problems in the pipeline
E    # of problems removed
F    # of known errors (root cause known and workaround in place)
G    # of problems reopened
H    # of problems with customer impact
I    Avg problem resolution time - Severity 1 and 2, in days
J    Total available labor hours to work on problems
K    Total labor hours spent working on and coordinating problems
L    Problem Management tooling support level
M    Problem Management process maturity
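The Problem Management metrics above combine into KPIs in the same way as the Incident Management ones. As a hedged illustration, the Python sketch below drives the calculation from a lookup table, so that adding or changing a KPI means editing one row. The table structure, function name, and sample values are hypothetical; the two assessed levels (tooling support and process maturity) are left out because they are scored rather than calculated.

```python
# KPI number -> (name, numerator metric, denominator metric).
# A denominator of None marks a KPI that is a metric taken directly.
PROBLEM_KPIS = {
    1: ("Incident repeat rate", "A", "C"),
    2: ("# of Major Problems", "B", None),
    3: ("Problem resolution rate", "E", "D"),
    4: ("Problem workaround rate", "F", "D"),
    5: ("Problem reopen rate", "G", "D"),
    6: ("Customer impact rate", "H", "D"),
    7: ("Avg problem resolution time (days)", "I", None),
    8: ("Problem labor utilization rate", "K", "J"),
}


def compute_problem_kpis(metrics):
    """Return {KPI number: value}; a ratio with a zero denominator is None."""
    results = {}
    for number, (_name, num, den) in PROBLEM_KPIS.items():
        if den is None:
            results[number] = metrics[num]
        elif metrics[den] == 0:
            results[number] = None
        else:
            results[number] = metrics[num] / metrics[den]
    return results


# Made-up sample values for one reporting period.
sample_metrics = {
    "A": 30, "B": 2, "C": 300, "D": 40, "E": 10, "F": 20,
    "G": 4, "H": 8, "I": 12.5, "J": 800, "K": 200,
}
problem_kpis = compute_problem_kpis(sample_metrics)
```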

KPI

Ref  KPI                                        Calculation
1    Incident repeat rate                       A/C
2    # of Major Problems                        B
3    Problem Resolution Rate                    E/D
4    Problem Workaround rate                    F/D
5    Problem Reopen Rate                        G/D
6    Customer Impact rate                       H/D
7    Avg Problem Resolution time                I
8    Problem Labor utilization rate             K/J
9    Problem Management Tooling support level   L
10   Problem Management process maturity level  M

CSF

CSF                                                               KPIs
Minimize impact of problems (reduce incident frequency/duration)  1, 2, 4, 6, 7
Reduce unplanned labor spent on Incidents                         1, 3, 4, 5, 8, 9
Improve quality of services being delivered                       2, 6
Resolve problems and errors efficiently and effectively           3, 4, 5, 7, 8, 9, 10

Conclusion

In order to be fully successful in implementing ITSM at DISA, these processes will need to be fully defined using sources such as the ITIL Service Operation manual and the DOD DESMF v 2.0. Process maturity will first need to be assessed via audit using ISO 15504 methods and then continually monitored via Continual Service Improvement processes as outlined in ITIL V3 and the ISO 20000 PDCA cycle. Problems cause incidents: reduce problems and you’ll reduce incidents.

Sources

1. ITIL Wiki, http://wiki.en.it-processmaps.com/index.php/Main_Page, accessed Feb. 13, 2015.

2. ITIL V3 Service Operation, Axelos, Incident and Problem Management sections, 2007.

3. Steinberg, Randy A. Measuring ITSM. Trafford Publishing, 2013.

4. Manktelow, James. "Critical Success Factors: Identifying the Things That Really Matter for Success." Mind Tools, n.d. Web. 17 Feb. 2015. <http://www.mindtools.com/pages/article/newLDR_80.htm>


Appendix A: Incident Management

Introduction

Handles all incidents, including failures, questions by users or technical staff, and events that may be automatically triggered.

Definition – an Incident is an unplanned interruption to an IT service or the reduction in the quality of an IT service. Failure of a CI that has not yet affected service is also an incident.

Objective – to resume regular state as quickly as possible to minimize impact

Scope – all incidents reported by users or tools. A service request is not an incident

Business Value

▪ Reduce downtime
▪ Align services with business priorities
▪ Establish priorities

Basic Concepts

▪ Time limits – agreed to by operational level agreements (OLAs) and Underpinning Contracts (UCs – see Supplier Management)
▪ Incident models – a way to determine the steps that are necessary to execute the process correctly, especially in terms of types and priorities of incidents
▪ Major incident – a separate procedure is required for major incidents – i.e. shorter timeframes and higher urgency – must be agreed upon beforehand

Activities, Methods, and Techniques

The incident management process has the following steps

▪ Identification – incident must be known, monitoring tools are helpful
▪ Registration – all relevant information must be registered
▪ Classification – important for later analysis, must be consistent
▪ Prioritization – establish urgency and impact
▪ Diagnoses – record greatest possible number of symptoms; if known, resolve then
▪ Escalation
  - functional escalation – tier 1 to tier 2 to tier 3; must have timeframes between
  - hierarchical escalation – call upon management to get problem solved; may follow functional escalation
▪ Investigation – each tier makes its own diagnoses and documents
▪ Resolution and recovery – when solution is found it must be tested – by user, centrally and by supplier if necessary
▪ Closing – Service desk closes ticket

Interfaces

▪ Problem Management
▪ Configuration Management
▪ Change Management
▪ Capacity Management
▪ Availability Management
▪ Service Level Management

Metrics

▪ Total # of incidents
▪ # and % of major incidents
▪ Avg cost per incident
▪ # and % of correctly allocated incidents
▪ % of incidents handled in the agreed amount of time

Implementation

▪ Detect incidents as quickly as possible
▪ All incidents must be registered (use Remedy)
▪ Previous knowledge must be available to learn from
▪ Must be integrated with the CMDB to help determine relationship between CIs
▪ Integrated with SLA; helps determine impact and priority
▪ Critical Success Factors
  - a good service desk
  - clearly defined SLA targets
  - adequate support staff
  - integrated support tools
  - OLA and UC to shape behavior of support personnel

Risks

▪ If too many incidents come in; then incidents cannot be handled in the timeframe agreed to
▪ If a support tool does not warn of a lack of progress on an incident; then it can become stagnant
▪ If no integration or lack of tools; then there will be a lack of adequate information sources
▪ If no OLAs or UCs; then no coinciding objectives and unaligned processes

Appendix B: Problem Management

Introduction

Definition – A Problem is the unknown cause of one or more incidents

Objective – to prevent problems and incidents, eliminate repeating incidents and minimize the impact of incidents that cannot be prevented

Scope – all activities needed to diagnose the underlying cause of incidents and to find a solution to these problems. Uses change and configuration management to implement any fixes

Business Value

▪ Ensure improvements in the availability and quality of the IT service provisions
▪ Resolution information is used to accelerate incident handling and identify permanent solutions
▪ Reduces # of incidents and handling time yielding shorter disruption times and fewer disruptions overall

Basic Concepts

▪ Known error – a problem that has a documented root cause and a workaround
▪ Workaround – reducing or eliminating the impact of an incident or problem for which a full resolution is not yet available
▪ A known error DB (KEDB) is used for faster diagnoses, the creation of a problem model for handling future problems
▪ Problem Model – steps that need to be taken, responsibilities of people involved and necessary timescales

Risks

▪ If too many incidents come in; then incidents cannot be handled in the timeframe agreed to
▪ If a support tool does not warn of a lack of progress on an incident; then it can become stagnant
▪ If no integration or lack of tools; then there will be a lack of adequate information sources
▪ If no OLAs or UCs; then no coinciding objectives and unaligned processes

Activities, Methods, and Techniques

2 important processes

▪ Reactive problem management – performed by service operations
  - identification – via (1) service desk identifies an unknown cause of one or more incidents = problem registration or (2) analysis of incident by support group reveals a problem or (3) automated tracing of error via tool or (4) supplier reports problem
  - registration – requires date and time stamp and historic report for control
  - classification – is the same as incident management
  - prioritization – repaired or replaced, costs?, resources needed to solve problem, time
  - investigation and diagnoses – different types of diagnoses: chronological analysis, Pain Value, Kepner-Tregoe, brainstorming, Ishikawa diagrams, Pareto
  - workaround – needs to be done as quick as possible while keeping the problem open for final solution
  - identified known errors – must be put into a known error database, this way other incidents that come up can be tied to the same problem if necessary
  - resolution – as soon as a solution is found it should be applied immediately going through proper change management to include thorough testing
  - conclusion – only after full evaluation and applied solution also applied to incidents associated with this problem
  - review – perform lessons learned
▪ Proactive problem management – a part of Service Operations, but handled mostly under CSI in conjunction with the KEDB.

Interfaces

▪ Incident Management
▪ Configuration Management
▪ Change Management
▪ Capacity Management
▪ Availability Management
▪ Service Level Management
▪ Financial Management

Implementation

▪ Highly dependent on the maturity of incident management processes and tools as these help to identify problems

Metrics

▪ Total number of problems registered within a given period
▪ % of problems that were resolved within SLA targets (and its inverse)
▪ # and % of problems for which more time was needed to resolve them
▪ Backlog of outstanding problems and trend
▪ Average cost of handling a problem
▪ # of problems outstanding according to classification
▪ % of successful major problem reviews
▪ # of known errors added to the KEDB
▪ Accuracy % of KEDB (from DB checks)
