White Paper

Incident Management: A CA IT Service Management Process Map

Peter Doherty, Senior Consultant, Technical Service, CA, Inc.
Peter Waterhouse, Director, Product Marketing, Business Service Optimization, CA, Inc.

June 2006


Table of Contents

Introduction
Incident Management
    Event
    Detect
    Record
    Investigate and Diagnose
    Escalate
    Resolve
Optimizing the Incident Management Journey
Potential Issues with Incident Management
Summary
About the Authors

Introduction

CA’s IT Service Management (ITSM) Process Maps provide a clear representation of the ITIL best practice framework. We use the analogy of subway or underground transport maps to illustrate how best to navigate a journey of continuous IT service improvement. Each map details each ITIL process (track), the ITIL process activities (stations) that must be navigated to achieve ITIL process goals (your destination), and the integration points (junctions) that must be considered for process optimization.

CA has developed two maps (Service Support, Figure A; and Service Delivery, Figure B), since most ITSM discussions focus on these two critical areas. The Service Support journey is one of improving the day-to-day IT service support processes that lay the operational foundation upon which to build business value. The Service Delivery journey is more transformational in nature and shows the processes that are needed to deliver quality IT services.

Close examination of the maps shows how a continuous improvement cycle has become a ‘circle’ or ‘central’ line, with each Plan-Do-Check-Act (P-D-C-A) improvement step becoming a process integration point or ‘junction’. These junctions serve as reference points when assessing process maturity, and as a means to consider the implications of implementing a process in isolation. Each of the ITIL processes is shown as a ‘track’, located in the position most appropriate to how it supports the goal of continuous improvement. Notice, too, how major ITIL process activities become the ‘stations’ en route towards a process destination or goal.

This paper is part of a series of 10 ITSM Process Map white papers. Each paper discusses how to navigate a particular ITIL process journey, reviewing each process activity that must be addressed in order to achieve process objectives. Along each journey, careful attention is given to how technology plays a critical role in both integrating ITIL processes and automating ITIL process activities.

Figure A. Service Support.

Figure B. Service Delivery.

Incident Management

The objective of the Incident Management process is to return to a normal service level, as defined in a Service Level Agreement, as quickly as possible with minimum disruption to the business. Incident Management should also keep a record of incidents for reporting, and integrate with other processes to drive continuous improvement. ITIL® places great emphasis on the timely recording, classification, diagnosis, escalation and resolution of incidents. Within Incident Management the Service Desk plays a key role, acting as the first line of support and actively routing incidents to specialists and subject matter experts (SMEs). To be fully effective, the Service Desk has to work in unison with other supporting processes. For example, if a number of incidents are recorded at the same time, the Service Desk analyst needs sufficient information to prioritize each incident. Technology can be a key contributing factor by ranking incidents according to business impact and urgency. Today many tools enable the automatic recording of incidents within the Service Desk function, but lack the capabilities to correlate incidents and associate them with business service levels.

Let’s review the Incident Management process journey (see Figure 1), assessing each critical process activity (or station), and examine how technology can be applied to optimize every stage of the journey, ensuring arrival at the process terminus: the efficient restoration of IT services.

Event

Incident Management starts with an event that, according to ITIL, is not part of the standard operation of a service and which causes, or may cause, an interruption or reduction in service quality. Incidents can include hardware and software errors, and user service requests, which are typically not associated with IT infrastructure failures. Examples of service requests include functional questions, requests for information, or a request to have a user password reset.

Detect

The first activity along the Incident Management process journey is the mechanism to detect incidents as they occur within the operational infrastructure and result in deviations from normal service. Users of IT services are often the first to detect service deviations, yet with automated management, IT can rapidly detect incidents before they adversely affect end users and IT services. In some cases IT can use process automation tools to detect errors before they affect IT service levels and to solve problems quickly before they impact the business.

Record

In most cases incidents will be recorded by a Service Desk function, which should record all incidents to ensure that compliance with service level agreements can be reported correctly. The location of an incident will determine who or what reports it. Naturally, users should have a facility to rapidly report incidents, supplying all information to the front-line analyst, but a truly effective reporting function should also enable the system itself to automatically record incidents as they occur.


Figure 1. Incident Management Process Line.
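The automatic recording described above can be sketched in a few lines of Python. This is a minimal illustration, not a description of any CA product: the names IncidentRecord and record_from_event, and the use of a response-time threshold as the definition of "normal service", are assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IncidentRecord:
    source: str       # who or what reported the incident
    service: str      # the IT service that deviated from normal
    description: str  # what was observed


def record_from_event(service: str, response_ms: float,
                      normal_max_ms: float) -> Optional[IncidentRecord]:
    """Return an incident record when a measurement deviates from normal service.

    The threshold (normal_max_ms) is a hypothetical stand-in for whatever
    the agreed service level defines as normal.
    """
    if response_ms <= normal_max_ms:
        return None  # within normal service: nothing to record
    return IncidentRecord(
        source="monitoring",
        service=service,
        description=(f"response time {response_ms:.0f} ms "
                     f"exceeds {normal_max_ms:.0f} ms"),
    )
```

A record created this way carries the same fields a user-reported incident would, so both kinds can be counted together when reporting against service levels.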


Many Service Desk solutions provide self-help and knowledge-base capabilities, but even if users resolve the issue themselves, they should record the incident. This is important, since the IT function can proactively use an accurate base of recorded incidents to facilitate effective process improvements along other IT Service Management process lines. Also, giving end users the ability to log non-time-critical incidents through a web-enabled interface combined with a knowledge management tool greatly reduces the number of calls made to the Service Desk.

Part of the Incident Management recording function should involve effective classification (to determine the incident category) and matching (to determine whether a similar incident has occurred previously). Technology can help by providing front-line support with information pertaining to the configuration items (CIs) supporting the end user who recorded the incident. During this phase Service Desk analysts review previous incident activity to understand the reason for the incident. The analyst should also have the means to correctly classify the incident using agreed coding criteria, identifying the type of incident (e.g. IT Service=degraded) and the Service or CI affected (e.g. Order Entry Service). Many organizations mistakenly combine the IT Service / CI into the incident type. By doing so, they find that their incident classification methodology becomes far too complicated and people resort to incorrectly classifying incidents.
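A minimal sketch of this separation, in Python, keeps the incident type and the affected Service or CI as two independent codes so that the list of types stays short. The names IncidentType, Classification and classify are hypothetical and chosen only for illustration; the type values reuse the examples given in this paper.

```python
from dataclasses import dataclass
from enum import Enum


class IncidentType(Enum):
    # Hypothetical, deliberately short list of agreed incident types.
    OUTAGE = "IT Service=Outage"
    DEGRADED = "IT Service=degraded"
    SERVICE_REQUEST = "Service Request"


@dataclass(frozen=True)
class Classification:
    incident_type: IncidentType  # what kind of incident it is
    affected_item: str           # which Service or CI is affected


def classify(incident_type: IncidentType, affected_item: str) -> Classification:
    """Record the type and the affected Service/CI as two separate codes."""
    name = affected_item.strip()
    if not name:
        raise ValueError("an affected Service or CI must be named")
    return Classification(incident_type, name)


example = classify(IncidentType.DEGRADED, "Order Entry Service")
```

With this shape, adding a new service does not add a new incident type, which is the complication the combined scheme runs into.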

After classification, it is important to properly prioritize the incident. Service Desk solutions can help by automatically determining the priority based on the type of incident (e.g. IT Service=Outage) and the business services that are affected. The priority may also be determined by existing Service Level Agreements. After classification, the analyst should use incident matching to see whether a similar incident has occurred previously, and whether there is a solution, workaround or known error. If there is, then the investigation and diagnosis stages may be bypassed, and resolution and recovery procedures initiated.
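One common way to automate prioritization is a small lookup that combines business impact and urgency. The sketch below, in Python, is an assumption made for illustration rather than a rule taken from ITIL or from any particular Service Desk product: the three-level scale, the function names and the 1-to-5 priority range are all hypothetical.

```python
# Hypothetical three-level scale shared by impact and urgency.
LEVELS = ("high", "medium", "low")


def priority(impact: str, urgency: str) -> int:
    """Combine impact and urgency into a priority from 1 (highest) to 5.

    An unknown level raises ValueError (from tuple.index).
    """
    return 1 + LEVELS.index(impact) + LEVELS.index(urgency)


def rank(incidents):
    """Order incidents so the highest-priority ones come first.

    sorted() is stable, so incidents of equal priority keep arrival order.
    """
    return sorted(incidents,
                  key=lambda inc: priority(inc["impact"], inc["urgency"]))


queue = [
    {"id": "A", "impact": "low", "urgency": "high"},
    {"id": "B", "impact": "high", "urgency": "high"},
    {"id": "C", "impact": "medium", "urgency": "low"},
]
ordered_ids = [inc["id"] for inc in rank(queue)]
```

In practice the impact level would be derived from the affected business service and any Service Level Agreement attached to it, rather than entered by hand.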

If the incident has high priority and can’t be resolved immediately, the incident manager should create a linked problem record and initiate Problem Management process activities. Problem Management has a different focus from Incident Management, and the two can be in conflict. Incident Management should restore the IT service, while Problem Management should determine a root cause and update the status to a known error. In the majority of cases where there is a conflict, Incident Management should take priority, since it is more critical to restore normal service levels, even with workarounds.

Before continuing along our Incident Management process journey, it is worth considering how the effective detection, recording and classification of incidents (achieved thus far) can facilitate an “optimum” journey along other ITIL process lines. In Figure 2 we can see that after the detection and recording activities, the Incident Management process arrives at a critical point: the Check junction. Incident Management outputs derived from the timely detection and accurate reporting of incidents provide the means to be more proactive and optimize the Problem Management process. For example, the accurate recording of all incidents will assist Problem Management with the rapid identification of underlying errors. Where justified, Problem Management will strive to permanently correct these errors and reduce the number of repeat incidents. Conversely, the Check junction enables Incident Management to take inputs from Problem Management to further streamline the overall process. For example, by delivering information about known errors (from an integrated known error database), the “journey time” to the ultimate destination, service restoration, will be reduced dramatically. Naturally, technologies can play a key role, integrating both Incident and Problem Management within a single solution.


Figure 2.
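The effect of a known error database on "journey time" can be illustrated with a short Python sketch. Everything in it is hypothetical: the KNOWN_ERRORS table, its single entry, the workaround text and the function names are assumptions made for the example, not data from a real system.

```python
from typing import Optional

# Hypothetical known error database:
# (incident type, affected Service/CI) -> documented workaround.
KNOWN_ERRORS = {
    ("IT Service=degraded", "Order Entry Service"):
        "Restart the order queue worker",
}


def match_known_error(incident_type: str, affected_item: str) -> Optional[str]:
    """Return a documented workaround if this incident matches a known error."""
    return KNOWN_ERRORS.get((incident_type, affected_item))


def next_step(incident_type: str, affected_item: str) -> str:
    """Skip investigation and diagnosis when a known error already applies."""
    workaround = match_known_error(incident_type, affected_item)
    if workaround is not None:
        return f"resolve: {workaround}"
    return "investigate and diagnose"
```

The lookup only works if incidents are classified consistently, which is one reason the accuracy of recording and classification matters so much to the later stages.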

Investigate and Diagnose

If no immediate solutions are available, then the Service Desk function needs to be able to route incidents to subject matter experts (SMEs). During the investigation and diagnosis phase, support analysts will collect updated incident details and analyze all related information (especially configuration details from a CMDB linked to the Service Desk).

During this phase, the support staff must have access to comprehensive historical incident, problem and knowledge data, centralized and maintained within the Service Desk. Also critical is the capability to augment incident management records with diagnostic data supplied by SMEs or via integrated management technologies. Management technologies can play a key role here in correctly identifying and routing incidents to the appropriate SMEs. By its very nature, investigation and diagnosis of incidents is an iterative process, and may involve multiple Level 1, 2 and 3 SME groups as well as external vendors. This demands discipline and a rigorous approach to maintaining records, actions, workarounds and corresponding results. Integrated Service Desk technology can help in this process by providing:

• Flexible routing of Incident Management data according to geographic region, time, etc.

• Automatic linkage and extraction of CMDB data for the examination of failed items.

• A strong knowledge base and tools to expedite the diagnostic function.

• Management dashboards and reports to provide an overall status of Incident Management.

• Controls to ensure process conformance and provide comprehensive audit logs.

Escalate

Having conducted investigation and diagnosis, the Incident Management journey arrives at another station: Escalation. Critical here is the ability to rapidly escalate incidents according to agreed service levels and allocate more support services if necessary. Escalation can follow two paths: horizontal (functional) or vertical. Horizontal escalation is needed when the incident needs to be escalated to different SME groups better able to perform the Incident Management function. If not closely monitored, horizontal escalation can lead to incidents bouncing around the system without anyone taking ownership, increasing the likelihood of breaching service level agreements. This is why it is so important to have a proactive approach and use process automation to correctly route incidents to the appropriate SME groups. Vertical escalation is where the incident needs to gain higher levels of priority. As part of the activity, it is essential that rules are clearly in place to ensure timely escalation, and to avoid the need for support analysts to work out when to escalate, which is a recipe for disaster!

For every resolution attempt, accurate data must be attached to the incident detail to avoid repeating recovery procedures and lengthening overall resolution times. Technology can play another key role, automating the escalation process itself and pinpointing the exact source of errors. This latter capability is important since it ensures the correct incident hand-off to appropriate SME groups early in the support cycle.

At this stage of the journey, the Incident Management process line has arrived at the Act junction (see Figure 3). Here, iterative investigation and diagnosis will have determined the nature of the incident and what actions need to be initiated to resolve the problem. Customer service must be restored as quickly as possible (through workarounds if necessary), and incidents should be escalated to Problem Management to detect the underlying cause of the problem, provide resolutions and prevent incidents from recurring.


Figure 3.
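A clearly stated escalation rule removes the need for analysts to decide when to escalate. The Python sketch below shows one possible rule for vertical escalation, assuming that an incident is escalated once a fixed fraction of its service level target has elapsed. The 50 percent threshold and the name should_escalate are hypothetical choices for the example; a real rule set would come from the agreed service levels.

```python
from datetime import datetime, timedelta

# Hypothetical rule: escalate once this fraction of the SLA target has elapsed.
ESCALATION_THRESHOLD = 0.5


def should_escalate(opened_at: datetime, now: datetime,
                    sla_target: timedelta,
                    threshold: float = ESCALATION_THRESHOLD) -> bool:
    """Return True when an unresolved incident has used up enough of its SLA."""
    if sla_target <= timedelta(0):
        raise ValueError("sla_target must be positive")
    elapsed = now - opened_at
    # Dividing one timedelta by another yields a plain float ratio.
    return elapsed / sla_target >= threshold
```

For example, with a four-hour target, an incident opened at 09:00 would not be escalated at 10:00 (a quarter of the target used) but would be at 11:00 (half of the target used).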


Resolve

The final stage along the Incident Management journey is Resolution and Recovery. Here the main activities include resolving the incident with solutions or workarounds obtained from previous activities. For some solutions, a Request for Change (RFC) will need to be submitted, so it is vital that technologies support the timely and accurate transfer of incident details to a Change Management process. Once the incident is resolved by the SME groups, it is routed back to the Service Desk function, which confirms with the initiator of the incident that the error has been rectified and that the incident can be closed. During this phase, integrated technologies must support a number of service improvement functions, such as providing restricted access to the incident closing function, and ensuring that incidents are matched to known errors or problem records.

Optimizing the Incident Management Journey

Since a primary role of the Incident Management process is to ensure that users can get back to work as quickly as possible, activities should incorporate technologies that support the functions of recording, classification, routing to specialists, monitoring and resolution. Tools that help enhance the Incident Management process should at a minimum provide:

• Facilities to automate the detection, recording, tracking and monitoring of incidents.

• Capabilities to ensure the integration of an accurate CMDB that will help estimate the impact of incidents according to business priority. Integrated CMDB information also ensures the support analyst has access to accurate information during the critical diagnosis and investigation phases of the Incident Management process.

• A comprehensive Knowledge Base (available to both users and support analysts) detailing how to recognize incidents, together with what solutions and workarounds are available.

• Strong workflow capability to streamline escalation procedures and ensure timely incident hand-offs between various support groups.

• Tight integration and proactive controls between supporting processes. For example, automatic logging of incidents during unapproved changes to configuration records.

• Mechanisms to report Key Performance Indicators on the Incident Management process. At a minimum, reports and service dashboards should be capable of providing the following information:

– Total number of incidents.

– Average incident resolution time (by Customer and Priority).

– Incidents resolved within agreed Service Levels (by Customer and Priority).

– Incidents resolved by front-line support or through access to the knowledge base (without escalation and routing to subject matter experts).

– Breakdown of incidents by classification, department, business service, etc.

– Number of incidents resolved by analyst group / individual analyst / SME group, etc.
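Several of the indicators listed above can be derived from the same set of closed incident records. The following Python sketch computes the total, the average resolution time by priority, and the percentage resolved within the agreed service level. The record layout (priority, hours_to_resolve, sla_hours), the sample values and the function name kpis are all hypothetical and used only to make the calculation concrete.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical closed-incident records; values are illustrative only.
incidents = [
    {"priority": 1, "hours_to_resolve": 2.0, "sla_hours": 4.0},
    {"priority": 1, "hours_to_resolve": 6.0, "sla_hours": 4.0},
    {"priority": 3, "hours_to_resolve": 10.0, "sla_hours": 24.0},
]


def kpis(records):
    """Summarize incident counts, resolution times and SLA compliance."""
    by_priority = defaultdict(list)
    for record in records:
        by_priority[record["priority"]].append(record)

    report = {"total": len(records), "by_priority": {}}
    for prio, group in sorted(by_priority.items()):
        within = sum(1 for r in group
                     if r["hours_to_resolve"] <= r["sla_hours"])
        report["by_priority"][prio] = {
            "avg_hours": mean(r["hours_to_resolve"] for r in group),
            "pct_within_sla": 100.0 * within / len(group),
        }
    return report


report = kpis(incidents)
```

Grouping by customer, classification or analyst group follows the same pattern with a different grouping key.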

Potential Issues with Incident Management

The following is a list of issues to look out for to avoid problems in the Incident Management process:

• Incident Management Bypass. If users attempt to resolve incidents themselves, IT cannot gauge service levels and the number of errors. Technology can help by centralizing the Service Desk function, essentially acting as the clearing house for all incidents, and integrating Incident Management within a broader Incident, Problem, and Change and Configuration Management process. Incident Management bypass can also happen when users informally approach the SME groups for help. From a process perspective, however, the SME group should not take on the work until the incident has been logged in the Service Desk function.

• Holding on to Incidents. Some organizations mistakenly fuse Incident Management and Problem Management into a hybrid Incident Management process. This is detrimental from the perspective of metrics and the ability to prioritize problems properly. There should be a clear separation between the two processes, and incidents should be closed once the customer confirms that the error condition has gone away. Based on business rules, the analyst can decide whether a related problem record should be created to look for a permanent solution.

• Traffic Overload. This occurs when there is an unexpectedly large number of incidents. It may result in the incorrect recording of incidents, leading to lengthier resolution times and degradation of overall service. Technology can help by automating procedures to deploy spare capacity and resources.


• Too Many Choices. There is a temptation to classify incidents in fine detail and make the analyst navigate through many sub-levels to select the incident type. This increases the time it takes to create the incident and will often lead to incorrect classification, as the analyst gives up searching for the most correct match.

• Lack of a Service Catalog. If IT services are not clearly defined, it becomes difficult to refuse to provide help. A Service Catalog can help by clearly defining IT services, the configuration components that support each service, and the agreed service levels.

Summary

The objective of Incident Management is to rapidly restore services in support of service level agreements. Unlike Problem Management, whose focus is on finding the root cause of problems, Incident Management is essentially about getting things back up quickly, even if this means performing workarounds and quick fixes.

Technologies can play a critical role in optimizing this process, by automating the actual process activities themselves (such as incident recording and classification), and by accessing the outputs from other related processes. Integration with other processes (especially Problem, Change, Configuration and Service Level Management) is especially important to ensure that incidents are kept to a minimum and that the highest levels of availability and service are maintained.

About the Authors

Peter Doherty is a Senior Consultant with CA. He is a 15-year Service Management practitioner and holds a Manager’s Certificate in IT Service Management. A highly sought speaker for IT Service Management seminars and conferences, he won the President’s Award for best content and presented paper at the 2004 Australian itSMF National conference. Peter has published on the subject of IT Asset Management as an extension of ITIL and is a regular contributor to industry publications.

Peter Waterhouse is Director of Product Marketing in CA’s Business Service Optimization business unit. Peter has 15 years’ experience in Enterprise Systems Management, with specialization in IT Service Management, IT Governance and best practices.

Copyright © 2006 CA. All rights reserved. All trademarks, trade names, service marks and logos referenced herein belong to their respective companies. This document is for your informational purposes only. To the extent permitted by applicable law, CA provides this document “As Is” without warranty of any kind, including, without limitation, any implied warranties of merchantability or fitness for a particular purpose, or non-infringement. In no event will CA be liable for any loss or damage, direct or indirect, from the use of this document including, without limitation, lost profits, business interruption, goodwill or lost data, even if CA is expressly advised of such damages. ITIL® is a registered trademark and a registered community trademark of the UK Office of Government Commerce (OGC) and is registered in the U.S. Patent and Trademark Office. MP302670606