incident-vs-problem management white paper
Nova Corporation White Paper
Incident Management and Problem Management
DESS Task Order 1
Dan Goebel, ITIL Expert
Nova Corporation
2/14/15 FOUO
February 14, 2015
Incident vs. Problem Management
Background
For an IT shop that uses IT Service Management (ITSM) to deliver IT services to the business, Incident Management and Problem Management are the most often used processes. As a result, confusion sometimes arises about where one ends and the other begins.
In this paper we’ll explore each process individually and identify when the handoff between them should occur. We’ll also cover metrics, Key Performance Indicators (KPIs), Critical Success Factors (CSFs), and a simple out-of-the-box process flow for each process. Additionally, we will look at the inputs and outputs of each process.
This white paper should go a long way in helping DISA ensure that services get restored quickly and that problems are resolved to prevent further incidents.
Appendices A and B contain cut sheets (one-page summaries) for Incident Management and Problem Management, respectively.
Incident Management
An incident is defined as an unplanned interruption to an IT service or a reduction in the quality of an IT service. Failure of a configuration item that has not yet impacted service is also an incident, for example the failure of one disk from a mirror set.
Incidents are almost always managed through interaction with the Service Desk. The Service Desk is defined as a function that stands on its own; it is not a part of the Incident Management process, though the two are tightly integrated.
The goal is to restore normal function or service operation as quickly as possible with minimal impact, so that Service Level Agreements (SLAs) are met. SLAs often specify availability targets, and this is where Incident Management helps to meet those thresholds.
What triggers an incident can be an event or series of events that disrupt service. These events may be reported via the Service Desk, via users, or through an Event Management tool. One must keep in mind that there are events that do not disrupt service and therefore do not cause an incident.
Key to successful Incident Management is having an Incident Management Model. This model should include:
• Necessary steps to be taken to handle an incident
• Order in which these steps should be taken
• Responsibilities, possibly through a RACI chart
• Timescales and thresholds for completion of actions
• Escalation procedures
• Necessary evidence preservation activities. This is especially important in security-related events
Incidents will need to be identified, logged, categorized, prioritized, diagnosed, escalated, resolved, and finally closed. This is illustrated in the following generic flow chart:
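As a rough illustration of that sequence, the sketch below models the incident lifecycle as a small state machine. The stage names come from the sentence above; the `Incident` class, the `ALLOWED` transition table, and the choice to treat escalation as optional are assumptions made for illustration, not part of any ITIL-defined interface.

```python
from enum import Enum


class Stage(Enum):
    IDENTIFIED = 1
    LOGGED = 2
    CATEGORIZED = 3
    PRIORITIZED = 4
    DIAGNOSED = 5
    ESCALATED = 6
    RESOLVED = 7
    CLOSED = 8


# Hypothetical transition table: each stage lists the stages that may follow it.
# Assumption: escalation is optional, so a diagnosed incident may be resolved
# directly, and an escalated incident may be diagnosed again by the next tier.
ALLOWED = {
    Stage.IDENTIFIED: {Stage.LOGGED},
    Stage.LOGGED: {Stage.CATEGORIZED},
    Stage.CATEGORIZED: {Stage.PRIORITIZED},
    Stage.PRIORITIZED: {Stage.DIAGNOSED},
    Stage.DIAGNOSED: {Stage.ESCALATED, Stage.RESOLVED},
    Stage.ESCALATED: {Stage.DIAGNOSED, Stage.RESOLVED},
    Stage.RESOLVED: {Stage.CLOSED},
    Stage.CLOSED: set(),
}


class Incident:
    """Minimal incident record that only tracks its lifecycle stage."""

    def __init__(self, summary):
        self.summary = summary
        self.stage = Stage.IDENTIFIED
        self.history = [Stage.IDENTIFIED]

    def advance(self, new_stage):
        """Move to new_stage, rejecting any step that skips the sequence."""
        if new_stage not in ALLOWED[self.stage]:
            raise ValueError(
                f"cannot move from {self.stage.name} to {new_stage.name}"
            )
        self.stage = new_stage
        self.history.append(new_stage)
```

A tool built this way would refuse, for example, to close an incident that was never logged, which is the kind of discipline an Incident Management Model is meant to enforce.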
One of the problems at DISA is that escalation is confused with Problem Management. For instance, the DESS contract states for Task Order 1 that an incident affecting 10 or more users gets assigned to Problem Management. Using an escalation threshold, in this case 10 users, as the criterion for moving work to another process is not consistent with the ITIL standard. As we shall see in Problem Management, the number of users affected matters little, because the service being analyzed in Problem Management is assumed to be up and running with a workaround. For instance, when servers or network equipment suffer service degradation over a short period of time, the workaround could be to reboot the equipment every three days. Operationally this is not acceptable; therefore, the next step is to move the issue to Problem Management to solve the underlying cause of the quickly degrading service.
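One way to read the distinction drawn above is as two separate decisions. The sketch below is an interpretation of this paragraph, not contract or ITIL language: the `IncidentRecord` fields and both function names are hypothetical, and the 10-user figure is kept only as an escalation threshold.

```python
from dataclasses import dataclass


@dataclass
class IncidentRecord:
    users_affected: int
    service_restored: bool   # True once service is back, even via a workaround
    root_cause_known: bool
    recurring: bool          # e.g. equipment that must be rebooted every three days


def needs_escalation(incident, user_threshold=10):
    """Escalation stays inside Incident Management.

    Assumption: the user-count threshold only raises attention while the
    service is still down; it does not move the work to another process.
    """
    return (not incident.service_restored
            and incident.users_affected >= user_threshold)


def needs_problem_record(incident):
    """Handoff to Problem Management.

    Service is restored (possibly by a workaround), but the underlying
    cause is unknown or the incident keeps recurring. The number of
    users affected plays no part in this decision.
    """
    return incident.service_restored and (
        incident.recurring or not incident.root_cause_known
    )
```

Under this reading, a large outage that is still in progress is escalated but stays in Incident Management, while a three-user issue that keeps coming back despite a workaround is a candidate for a problem record.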
Metrics
The following are some possible metrics that help show how effectively the Incident Management process is serving the business.
These can be drawn from the following sources: Incident Management system reports, labor or HR reports, and process and tool assessment audit findings.
Ref  Metric
A    Total # incidents
B    Avg time to resolve Severity 1 and Severity 2 incidents
C    # of incidents resolved within agreed service levels
D    # of High/Major incidents
E    # of incidents with customer impact
F    # of incidents reopened
G    Total available labor hours to work on incidents (non-Service Desk)
H    Total labor hours spent resolving incidents (non-Service Desk)
I    Incident Management tooling support level
J    Incident Management process maturity
KPI (Key Performance Indicators)
KPIs are calculations or measurements used to indicate the performance level of an operation or process. They provide input for actionable management decisions and feed the Act portion of the Deming Plan-Do-Check-Act (PDCA) cycle. Each KPI is derived from the metrics shown above: it is either a single metric or a calculation on two or more metrics, and it shows whether performance is within acceptable tolerance thresholds.
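As a minimal sketch of how the ratio-style Incident Management KPIs in this section (Incident Resolution Rate C/A, Customer Impact Rate E/A, Incident Reopen Rate F/A, and Incident Labor utilization rate H/G) could be computed from the lettered metrics, consider the following. The function names and the sample values are invented for illustration; they are not figures from DISA or from any tool.

```python
def ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    if denominator == 0:
        return None
    return numerator / denominator


def incident_kpis(m):
    """Compute ratio KPIs from a dict keyed by metric letter (A, C, E, F, G, H)."""
    return {
        "Incident Resolution Rate (C/A)": ratio(m["C"], m["A"]),
        "Customer Impact Rate (E/A)": ratio(m["E"], m["A"]),
        "Incident Reopen Rate (F/A)": ratio(m["F"], m["A"]),
        "Incident Labor utilization rate (H/G)": ratio(m["H"], m["G"]),
    }


# Invented sample reporting period, for illustration only.
sample = {"A": 200, "C": 180, "E": 50, "F": 10, "G": 1600, "H": 400}

for name, value in incident_kpis(sample).items():
    shown = "n/a" if value is None else f"{value:.1%}"
    print(f"{name}: {shown}")
```

Returning `None` for a zero denominator is a design choice for this sketch: a period with no incidents has no meaningful resolution rate, and reporting it as 0% or 100% would be misleading.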
Ref  Metric                                                   Calculation
1    Total # incidents                                        A
2    # of High/Major incidents                                D
3    Incident Resolution Rate                                 C/A
4    Customer Impact Rate                                     E/A
5    Incident Reopen Rate                                     F/A
6    Avg time to resolve Severity 1 and Severity 2 incidents  B
7    Incident Labor utilization rate                          H/G
8    Incident Management tooling support level                I
9    Incident Management process maturity level               J
CSF (Critical Success Factor)
A critical success factor is one of the limited number of areas in which satisfactory results will ensure successful competitive performance for the organization. CSFs are the few key areas where things must go right for the business to flourish. They sit one level above the KPIs: each CSF is built on KPIs and their underlying metrics, as the CSF table for Incident Management below shows.
CSF                                   KPI
Quickly Resolve Incidents             5,6,8
Maintain IT Service Quality           1,2,3,4,8,9
Improve IT and Business Productivity  7,8
Maintain User Satisfaction            4,8,9
Problem Management
A problem is defined as the cause of one or more incidents.
Problem Management concerns itself with managing the lifecycle of problems. The objective here is to prevent future problems and resulting incidents from occurring, eliminate recurring incidents, and minimize the impact of incidents that cannot be prevented.
Problem Management’s activities include:
• Diagnose the root cause of incidents and determine the resolution to those problems
• Ensure that the resolution is implemented through Change Management and Release and Deployment Management
• Keep track of workarounds and resolutions
While separate from Incident Management, Problem Management is the process of developing workarounds, analyzing related incidents, identifying patterns of faults and solutions, and feeding this information into Change Management and Release and Deployment Management for permanent fixes to recurring problems and incidents. It is reasonable to assume that the personnel who work in Problem Management are Tier II and Tier III engineers.
Problem Management has two major processes: Reactive and Proactive. Reactive Problem Management is handled under the Service Operation portion of the ITIL lifecycle, and Proactive Problem Management falls under the Continual Service Improvement portion. Below are the Problem Management flow, metrics, KPIs, and CSFs.
Metrics
These can be drawn from the following sources: Incident Management system reports, Problem Management system reports, labor or HR reports, and process and tool assessment audit findings.
Ref Metric
A # of repeat incidents
B # of Major Problems
C Total # of Incidents
D Total # of Problems in the pipeline
E # of problems removed
F # of known errors (Root cause known and Workaround in place)
G # of problems reopened
H # of problems with customer impact
I Avg problem resolution time - Severity 1 and 2 in days
J Total available labor hours to work on problems
K Total Labor hours spent working on and coordinating problems
L Problem Management tooling support level
M Problem Management Process maturity
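The Problem Management KPI and CSF tables in this paper are linked by reference number, and each KPI is either a single lettered metric or a ratio of two. The sketch below makes that linkage explicit in a table-driven form. Numbering the process maturity KPI as 10 is an assumption, inferred from the CSF row that references KPI 10; the function names and the sample values in the usage note are invented for illustration.

```python
# KPI reference number -> (name, calculation on metric letters), per this paper.
PROBLEM_KPIS = {
    1: ("Incident repeat rate", "A/C"),
    2: ("# of Major Problems", "B"),
    3: ("Problem Resolution Rate", "E/D"),
    4: ("Problem Workaround rate", "F/D"),
    5: ("Problem Reopen Rate", "G/D"),
    6: ("Customer Impact rate", "H/D"),
    7: ("Avg Problem Resolution time", "I"),
    8: ("Problem Labor utilization rate", "K/J"),
    9: ("Problem Management tooling support level", "L"),
    10: ("Problem Management process maturity level", "M"),  # number 10 is assumed
}

# CSF -> KPI reference numbers, per this paper.
PROBLEM_CSFS = {
    "Minimize impact of problems": [1, 2, 4, 6, 7],
    "Reduce unplanned labor spent on Incidents": [1, 3, 4, 5, 8, 9],
    "Improve quality of services being delivered": [2, 6],
    "Resolve problems and errors efficiently and effectively": [3, 4, 5, 7, 8, 9, 10],
}


def evaluate(calculation, metrics):
    """Evaluate 'X' or 'X/Y' against a dict of metric letters.

    Returns None when a ratio would divide by zero.
    """
    if "/" in calculation:
        top, bottom = calculation.split("/")
        if metrics[bottom] == 0:
            return None
        return metrics[top] / metrics[bottom]
    return metrics[calculation]


def csf_report(metrics):
    """Return {CSF: {KPI name: value}} so each CSF is reviewed with its KPIs."""
    return {
        csf: {
            PROBLEM_KPIS[ref][0]: evaluate(PROBLEM_KPIS[ref][1], metrics)
            for ref in refs
        }
        for csf, refs in PROBLEM_CSFS.items()
    }
```

For example, with 2 major problems (B), 20 problems in the pipeline (D), and 5 problems with customer impact (H), the "Improve quality of services being delivered" entry would report 2 major problems and a customer impact rate of 0.25.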
KPI
Ref  Metric                                      Calculation
1    Incident repeat rate                        A/C
2    # of Major Problems                         B
3    Problem Resolution Rate                     E/D
4    Problem Workaround rate                     F/D
5    Problem Reopen Rate                         G/D
6    Customer Impact rate                        H/D
7    Avg Problem Resolution time                 I
8    Problem Labor utilization rate              K/J
9    Problem Management tooling support level    L
10   Problem Management process maturity level   M
CSF
CSF                                                               KPI
Minimize impact of problems (reduce incident frequency/duration)  1,2,4,6,7
Reduce unplanned labor spent on Incidents                         1,3,4,5,8,9
Improve quality of services being delivered                       2,6
Resolve problems and errors efficiently and effectively           3,4,5,7,8,9,10
Conclusion
To be fully successful in implementing ITSM at DISA, these processes will need to be fully defined using sources such as the ITIL Service Operation manual and the DOD DESMF v 2.0. Process maturity will first need to be assessed via audit using ISO 15504 methods and then continually monitored via Continual Service Improvement processes, as outlined in ITIL V3 and the ISO 20000 PDCA cycle. Problems cause incidents: reduce problems, and you’ll reduce incidents.
Sources
1. ITIL Wiki, http://wiki.en.it-processmaps.com/index.php/Main_Page, accessed Feb. 13, 2015
2. ITIL V3 Service Operations, Axelos, Incident and Problem Management sections, 2007
3. Steinberg, Randy A. Measuring ITSM, Trafford Publishing, 2013
4. Manktelow, James. "Critical Success Factors: Identifying the things that really matter for success." Critical Success Factors. Mind Tools, n.d. Web. 17 Feb. 2015. <http://www.mindtools.com/pages/article/newLDR_80.htm>
Appendix A: Incident Management
Introduction
Handles all incidents, including failures, questions by users or technical staff, and events that may be automatically triggered
Definition – an Incident is an unplanned interruption to an IT service or the reduction in the quality of an IT service. Failure of a CI that has not yet affected service is also an incident.
Objective – to resume regular state as quickly as possible to minimize impact
Scope – all incidents reported by users or tools. A service request is not an incident
Business Value
▪ Reduce downtime
▪ Align services with business priorities
▪ Establish priorities
Basic Concepts
▪ Time limits – agreed to by operational level agreements (OLAs) and Underpinning Contracts (UCs – see Supplier Management)
▪ Incident models – a way to determine the steps that are necessary to execute the process correctly, especially in terms of types and priorities of incidents
▪ Major incident – a separate procedure is required for major incidents, i.e. shorter timeframes and higher urgency; must be agreed upon beforehand
Activities, Methods, and Techniques
The incident management process has the following steps
▪ Identification – incident must be known, monitoring tools are helpful
▪ Registration – all relevant information must be registered
▪ Classification – important for later analysis, must be consistent
▪ Prioritization – establish urgency and impact
▪ Diagnoses – record greatest possible number of symptoms; if known, then resolve
▪ Escalation
  - functional escalation – tier 1 to tier 2 to tier 3; must have timeframes between
  - hierarchical escalation – call upon management to get problem solved; may follow functional escalation
▪ Investigation – each tier makes its own diagnoses and documents
▪ Resolution and recovery – when solution is found it must be tested – by user, centrally and by supplier if necessary
▪ Closing – Service desk closes ticket
Interfaces
▪ Problem Management
▪ Configuration Management
▪ Change Management
▪ Capacity Management
▪ Availability Management
▪ Service Level Management
Metrics
▪ Total # of incidents
▪ # and % of major incidents
▪ Avg cost per incident
▪ # and % of correctly allocated incidents
▪ % of incidents handled in the agreed amount of time
Implementation
▪ Detect incidents as quickly as possible
▪ All incidents must be registered (use Remedy)
▪ Previous knowledge must be available to learn from
▪ Must be integrated with the CMDB to help determine relationship between CIs
▪ Integrated with SLA; helps determine impact and priority
Critical Success Factors
- a good service desk
- clearly defined SLA targets
- adequate support staff
- Integrated support tools
- OLA and UC to shape behavior of support personnel
Risks
▪ If too many incidents come in; then incidents cannot be handled in the timeframe agreed to
▪ If a support tool does not warn of a lack of progress on an incident; then it can become stagnant
▪ If no integration or lack of tools; then there will be a lack of adequate information sources
▪ If no OLAs or UCs; then no coinciding objectives and unaligned processes
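The Prioritization and Escalation steps in the Incident Management cut sheet (establish urgency and impact; move from tier 1 to tier 2 to tier 3 with timeframes between) can be sketched as two small functions. Everything specific here is an illustrative assumption: the three-level scale, the additive priority rule, and the minute values in `ESCALATE_AFTER_MIN` are hypothetical and would in practice come from the agreed SLAs and OLAs.

```python
# Assumed three-level scale; 1 is the most severe.
LEVELS = {"high": 1, "medium": 2, "low": 3}

# Hypothetical minutes an incident may sit at one tier before functional
# escalation, keyed by priority. Real values would come from SLAs and OLAs.
ESCALATE_AFTER_MIN = {1: 15, 2: 30, 3: 60, 4: 120, 5: 240}


def priority(impact, urgency):
    """Combine impact and urgency ('high', 'medium', 'low') into priority 1-5.

    Priority 1 is handled first. The additive rule is an illustrative
    assumption, not a rule taken from the cut sheet.
    """
    return LEVELS[impact] + LEVELS[urgency] - 1


def tier_after(minutes_open, prio, max_tier=3):
    """Tier that should own an unresolved incident after minutes_open.

    Functional escalation moves the incident tier 1 -> 2 -> 3, one tier
    per elapsed timeframe, and stops at max_tier.
    """
    return min(1 + minutes_open // ESCALATE_AFTER_MIN[prio], max_tier)
```

With these assumed values, a high-impact, high-urgency incident is priority 1 and would reach tier 3 after 30 minutes without resolution, while a priority 3 incident would still be with tier 2 at the 90-minute mark. Hierarchical escalation is left out of the sketch because it is a management decision rather than a timer.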
Appendix B: Problem Management
Introduction
Definition – A Problem is the unknown cause of one or more incidents
Objective – to prevent problems and incidents, eliminate repeating incidents and minimize the impact of incidents that cannot be prevented
Scope – all activities needed to diagnose the underlying cause of incidents and to find a solution to these problems. Uses change and configuration management to implement any fixes
Business Value
▪ Ensure improvements in the availability and quality of the IT service provisions
▪ Resolution information is used to accelerate incident handling and identify permanent solutions
▪ Reduces # of incidents and handling time yielding shorter disruption times and fewer disruptions overall
Basic Concepts
▪ Known error – a problem that has a documented root cause and a workaround
▪ Workaround – reducing or eliminating the impact of an incident or problem for which a full resolution is not yet available
▪ A known error DB (KEDB) is used for faster diagnoses and the creation of a problem model for handling future problems
▪ Problem Model – steps that need to be taken, responsibilities of people involved and necessary timescales
Risks
▪ If too many incidents come in; then incidents cannot be handled in the timeframe agreed to
▪ If a support tool does not warn of a lack of progress on an incident; then it can become stagnant
▪ If no integration or lack of tools; then there will be a lack of adequate information sources
▪ If no OLAs or UCs; then no coinciding objectives and unaligned processes
Activities, Methods, and Techniques
2 important processes
▪ Reactive problem management – performed by service operations
  - identification – via (1) service desk identifies an unknown cause of one or more incidents = problem registration or (2) analysis of incident by support group reveals a problem or (3) automated tracing of error via tool or (4) supplier reports problem
  - registration – requires date and time stamp and historic report for control
  - classification – is the same as incident management
  - prioritization – repaired or replaced, costs?, resources needed to solve problem, time
  - investigation and diagnoses – different types of diagnoses: chronological analysis, Pain Value, Kepner-Tregoe, brainstorming, Ishikawa diagrams, Pareto
  - workaround – needs to be done as quickly as possible while keeping the problem open for final solution
  - identified known errors – must be put into a known error database, this way other incidents that come up can be tied to the same problem if necessary
  - resolution – as soon as a solution is found it should be applied immediately going through proper change management to include thorough testing
  - conclusion – only after full evaluation and applied solution; also applied to incidents associated with this problem
  - review – perform lessons learned
▪ Proactive problem management – a part of Service Operations, but handled mostly under CSI in conjunction with the KEDB.
Interfaces
▪ Incident Management
▪ Configuration Management
▪ Change Management
▪ Capacity Management
▪ Availability Management
▪ Service Level Management
▪ Financial Management
Implementation
▪ Highly dependent on the maturity of incident management processes and tools as these help to identify problems
Metrics
▪ Total number of problems registered within a given period
▪ % of problems that were resolved within SLA targets (and its inverse)
▪ # and % of problems for which more time was needed to resolve them
▪ Backlog of outstanding problems and trend
▪ Average $ of handling a problem
▪ # of problems outstanding according to classification
▪ % of successful major problem reviews
▪ # of known errors added to the KEDB
▪ Accuracy % of KEDB (from DB checks)
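Of the diagnosis techniques listed in the Problem Management cut sheet, Pareto analysis is the easiest to show in a few lines: rank incident categories by frequency and keep the few that account for most of the volume. The sketch below assumes incidents have already been classified; the `pareto` function name, the 80% default cutoff, and the sample categories are illustrative choices, not values from the cut sheet.

```python
from collections import Counter


def pareto(categories, cutoff=0.8):
    """Return the most frequent categories that together reach the cutoff.

    Each result row is (category, count, cumulative share of all incidents).
    Rows are added from most to least frequent until the cumulative share
    is at least `cutoff`.
    """
    counts = Counter(categories)
    total = sum(counts.values())
    rows, running = [], 0
    for category, count in counts.most_common():
        running += count
        rows.append((category, count, running / total))
        if running / total >= cutoff:
            break
    return rows


# Invented classification data, for illustration only.
incidents = ["disk"] * 5 + ["network"] * 3 + ["login"] + ["printer"]
for category, count, share in pareto(incidents):
    print(f"{category}: {count} incidents, cumulative {share:.0%}")
```

In this invented data set, two of the four categories account for 80% of the incidents, which suggests where reactive or proactive problem investigation would pay off first.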