closed loop incident process - papers4you.at

©2010 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice

Andreas Gutzwiller, Presales Consultant, Hewlett-Packard (Schweiz)

Closed Loop Incident Process

From fault detection to closure

HP Software and Solutions

Closed Loop Incident Process Solution

The CLIP solution is:

– A highly automated fault detection-to-recovery solution

– Focused on end-to-end service availability and performance

– Designed to reduce mean time to recovery (MTTR) and improve mean time between system failures (MTBF)
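The link between these two metrics and service availability can be made concrete with the standard steady-state relation, availability = MTBF / (MTBF + MTTR), where MTTR is mean time to recovery and MTBF is mean time between failures. The following is a minimal sketch; the function name and the sample figures are illustrative assumptions and do not come from the presentation.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative numbers only (assumed, not taken from the slides):
# the same failure rate, but recovery shortened from 4 hours to 1 hour.
before = availability(mtbf_hours=500, mttr_hours=4)
after = availability(mtbf_hours=500, mttr_hours=1)
print(f"before: {before:.4%}  after: {after:.4%}")
```

Shortening recovery time or lengthening the interval between failures both raise the ratio, which is why the two metrics are targeted together.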

Agenda

1.  Event and Incident Processes

2.  Closing the Loop

3.  Architecture

4.  Why CLIP


Neither process can stand alone in today’s IT environments

ITILv3 Linkage of Event & Incident Management

–  Event: A change of state or alert that has significance for the management of a Configuration Item (CI) or IT Service.

–  Incident: Unplanned interruption, or reduction of quality, of an IT service

–  IT Service: People, processes & technology deliverable that supports a customer’s business processes

–  Event Management

•  Responsible for managing events throughout their lifecycle. Main activity of IT Operations.

•  Event -> Filtered/Correlated -> Resolve or forward to Incident -> Close

–  Incident Management

•  Includes any event which disrupts, or could disrupt, a service, whether reported by users or IT staff

•  Incident -> Categorize/Prioritize -> Diagnose -> Resolve -> Close
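The two lifecycles above can be sketched as a small state machine: an event is either filtered out, resolved within operations, or forwarded as an incident, and the incident then moves through its stages in order. This is a hedged illustration only; the class names, the `severity` and `auto_resolvable` fields, and the state names are assumptions made for the example, not part of ITIL or of any HP product.

```python
from dataclasses import dataclass, field
from enum import Enum


class IncidentState(Enum):
    OPEN = "open"
    CATEGORIZED = "categorized"   # categorize / prioritize
    DIAGNOSED = "diagnosed"
    RESOLVED = "resolved"
    CLOSED = "closed"


# Allowed forward transitions, mirroring the arrow chain on the slide.
NEXT_STATE = {
    IncidentState.OPEN: IncidentState.CATEGORIZED,
    IncidentState.CATEGORIZED: IncidentState.DIAGNOSED,
    IncidentState.DIAGNOSED: IncidentState.RESOLVED,
    IncidentState.RESOLVED: IncidentState.CLOSED,
}


@dataclass
class Incident:
    summary: str
    state: IncidentState = IncidentState.OPEN
    history: list = field(default_factory=list)

    def advance(self) -> IncidentState:
        """Move to the next lifecycle state; a closed incident cannot advance."""
        if self.state not in NEXT_STATE:
            raise ValueError("incident is already closed")
        self.history.append(self.state)
        self.state = NEXT_STATE[self.state]
        return self.state


def handle_event(event: dict):
    """Event management step: filter, resolve locally, or forward.

    Returns None when the event is filtered out or resolved in operations,
    otherwise a new Incident for the incident management process.
    """
    if event.get("severity") == "info":   # hypothetical filter rule
        return None
    if event.get("auto_resolvable"):      # hypothetical: fixed by operations
        return None
    return Incident(summary=event["message"])
```

For example, `handle_event({"severity": "critical", "message": "DB down"})` returns an open incident, and four calls to `advance()` take it to `CLOSED`.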


ITIL Areas Involved in CLIP

–  Operations Bridge (aka NOC)

•  Central coordination point

•  Manages various classes of events

•  Detects incidents

•  Manages routine operational activities

•  Reports on the status and performance

•  May provide first-level support for those events which generate an incident

“The Service Desk is not typically involved in Event Management … unless the Service Desk and Operations Bridge have been combined”

–  Service Desk

•  Single central point of contact for all users of IT

•  Logs and manages all incidents, service requests and access requests

•  Provides interface to all other Service Operation processes and activities

Traditional Incident Management

From diagnosis to resolution

Multiple un-integrated systems and data stores, manually coordinated hand-offs → inconsistent troubleshooting, high MTTR

1.  Identify service performance degradation

2.  Troubleshoot problem to isolate root cause

3.  Identify actionable condition / changes to be implemented

4.  Create TT/RFC to implement change

5.  Implement and automate change to close RFC

6.  Update CMS (Federated CMDB)

[Diagram labels: End User, CMDB, “Fire Storms”, Help Desk]

1.  Service performance notification

2. Gather data to assign SME

3. Bouncing the incident

4. Ticket is finally assigned to the correct SME

5. Impact analysis and change management

6. Update CMDB - timely & correctly?

SME: Subject Matter Expert




Closed Loop Incident Process solution for ITIL Event and Incident Management

From Fault Detection To Recovery & Closure

ITIL Process: Event Management, Incident Management

Event Generation & Detection

Event Correlation & Business Impact

Incident Submission

Investigation & Diagnosis

Resolution

Recovery & Closure



Event Generation & Detection


Operations bridge console collects events & alerts from servers, networks, apps & 3rd parties

Challenge

•  Bottom-up alert and event overload

•  Lack of qualitative cross domain “actionable” and causal event data

Solution

•  All events come to one place, correlated and enriched against an auto-updated service model

User Example – Events to single console

•  End user experience slow

•  SQL slow query performance alert

•  J2EE DB connection pool issue



Event Correlation & Business Impact

Business services, business impact relationship, and SLAs determined

Challenge

•  Struggle to link causal events to top down end-user experience and business impact

Solution

•  Proactive end-user experience linked to business process and business transaction flow to identify high revenue generating service impact

User Example - Cause from symptoms and impact

•  Oracle database is the cause, topology-based correlation

•  Critical funds transfer business service impacted
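Topology-based correlation, as in the user example above, can be illustrated with a toy dependency graph: among the CIs that currently raise events, the probable cause is the one whose own dependencies are quiet, and the business impact is everything that depends on it. The sketch below is an assumption-laden illustration; the CI names and the `DEPENDS_ON` model are hypothetical and do not represent the service model or correlation engine of any specific product.

```python
# Hypothetical service model: each CI maps to the CIs it depends on.
DEPENDS_ON = {
    "funds_transfer_service": ["web_frontend"],
    "web_frontend": ["j2ee_app_server"],
    "j2ee_app_server": ["oracle_db"],
    "oracle_db": [],
}


def all_dependencies(ci, model):
    """Return every CI reachable from `ci` by following dependencies."""
    seen = set()
    stack = list(model.get(ci, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(model.get(node, []))
    return seen


def probable_causes(alerting, model):
    """CIs with events whose own dependencies show no events."""
    alerting = set(alerting)
    return {ci for ci in alerting if not (all_dependencies(ci, model) & alerting)}


def impacted(cause, model):
    """CIs that depend, directly or indirectly, on `cause`."""
    return {ci for ci in model if cause in all_dependencies(ci, model)}


# Symptoms seen at three layers; only the database has no alerting dependency.
alerts = {"web_frontend", "j2ee_app_server", "oracle_db"}
print(probable_causes(alerts, DEPENDS_ON))
print(sorted(impacted("oracle_db", DEPENDS_ON)))
```

With these sample alerts the database is singled out as the probable cause, and the funds transfer service appears in the impacted set even though it raised no event of its own.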



Incident Submission

Automatic submission to service desk with annotations and cause area

Challenge

•  Quality and enrichment of data

•  Siloed, broken service lifecycle

•  Duplication of effort wasting time

Solution

•  Better collaboration

•  Automation and integration of the event-to-incident process lifecycle

User Example - Automatic incident ticket creation

•  Ticket visible to ops bridge

•  Assignment to subject expert
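Automatic submission can be sketched as a function that opens one ticket per causal CI, routes it to an expert group, and attaches later related events as annotations instead of opening duplicates. This is a minimal illustration under stated assumptions: `ServiceDesk`, `IncidentTicket`, the routing table, and the ticket ID format are all hypothetical and are not the API of any service desk product.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class IncidentTicket:
    ticket_id: str
    cause_ci: str
    assignment_group: str
    annotations: list = field(default_factory=list)


class ServiceDesk:
    """Toy service desk: one open ticket per causal CI."""

    def __init__(self, routing):
        self._routing = routing      # CI type -> expert group (assumed)
        self._open = {}              # cause CI -> open ticket
        self._ids = count(1)

    def submit(self, cause_ci, ci_type, annotation):
        """Create a ticket for `cause_ci`, or annotate the existing one."""
        ticket = self._open.get(cause_ci)
        if ticket is None:
            ticket = IncidentTicket(
                ticket_id=f"IM{next(self._ids):05d}",
                cause_ci=cause_ci,
                assignment_group=self._routing.get(ci_type, "service-desk"),
            )
            self._open[cause_ci] = ticket
        ticket.annotations.append(annotation)
        return ticket


desk = ServiceDesk({"database": "dba-team"})
first = desk.submit("oracle_db", "database", "SQL slow query performance alert")
second = desk.submit("oracle_db", "database", "J2EE DB connection pool issue")
```

Both calls return the same ticket, so the second symptom enriches the existing incident rather than creating duplicate work.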




Investigation & Diagnosis

Problem isolation, SME tools, and KM used to determine root cause

Challenge

•  Significant problem resolution time spent on pinpointing problem in a dynamic heterogeneous IT universe

•  Incident assigned and reassigned to multiple silos

Solution

•  Cross domain data visualization and analysis

User Example - Diving deeper to find root cause

•  Expert sees corrupt DB tables

•  Finds runbook automation fix in knowledgebase



Resolution

Change request with attached runbook automation to repair CIs

Challenge

•  Little or no automation leads to increased manual effort, impacting quality and efficiency

Solution

•  Expert created/authorized runbook automation to empower lower level teams

•  Manage change, configuration, and release process

User Example - Processing the change

•  Get change request approval

•  Use runbook to reindex database tables



Recovery & Closure

Automatically close incident & related incidents, acknowledging related events

Challenge

•  Struggle to improve speed of restoration, recovery and closure of incident, and to verify SLA/OLA compliance afterwards

Solution

•  Automate all notifications & updates, continuously monitor SLA/OLA compliance

User Example – Verify the change worked

•  User, DB and connection pool OK

•  Ticket and events closed
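The closure step can be sketched as a guard followed by a cascade: only when every verification check passes are the incident, its related incidents, and their events closed or acknowledged. The following is a hedged sketch; the `Event` and `Ticket` classes, the check names, and the `close_loop` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    ci: str
    acknowledged: bool = False


@dataclass
class Ticket:
    ticket_id: str
    events: list = field(default_factory=list)
    related: list = field(default_factory=list)
    closed: bool = False


def close_loop(ticket, health_checks):
    """Close `ticket` and its related tickets only if every check passes.

    `health_checks` maps a check name to a zero-argument callable that
    returns True when healthy. Returns the names of failed checks; an
    empty list means everything was closed and acknowledged.
    """
    failed = [name for name, check in health_checks.items() if not check()]
    if failed:
        return failed
    for t in [ticket, *ticket.related]:
        t.closed = True
        for ev in t.events:
            ev.acknowledged = True
    return []


# Assumed checks mirroring the user example: user, DB and connection pool.
checks = {
    "end_user_experience": lambda: True,
    "database": lambda: True,
    "connection_pool": lambda: True,
}
child = Ticket("IM00002", events=[Event("j2ee_app_server")])
parent = Ticket("IM00001", events=[Event("oracle_db")], related=[child])
failed_checks = close_loop(parent, checks)
```

If any check fails, nothing is closed and the failing check names are returned, so the ticket stays open for further diagnosis.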




Integrated ITIL event and incident management process optimizing MTTR and MTBF

Closed Loop Incident Process Integration Points

[Diagram labels: Service Desk, Integrated CMDB, Automation, Monitoring, connected by the numbered integration points below]

1.  Sharing CIs, topology and state information

2.  For creating and updating incidents

3.  For updating events

4.  Incident-, Problem- and Change-Mgmt

5.  Runbook automation to remediate

Integrated ITIL event and incident management process optimizing MTTR and MTBF

HP’s Closed Loop Incident Process Solution

[Diagram labels: Service Manager; UCMDB; Operations Orchestration; SA, CA, NA, SE, Other; BSM; CIs, Topo, Events, Status from Net, Ops, App, Other; connected by the numbered integration points below]

1.  CIs, topology, events, status measurements flowing into BSM

2.  Sharing events and topology

3.  For creating and updating incidents

4.  To access Business Impact View for a CI

5.  Runbook automation to enrich, diagnose and remediate

6.  Sharing CIs and state information

7.  Runbook automation to remediate




Closed-Loop Incident Mgmt Process

Incident management from diagnosis to automated resolution

•  Key processes (incident, change and configuration) need to be tightly linked

•  Seamless process linkage requires tools to be consistently service-oriented

[Diagram labels: IT service management, Business service automation, Configuration Management System (Federated CMDB), Business service management]

1.  Identify service performance issue

2.  Gather data to identify root cause

3.  Create RFC to make change

4a. Initiate change

4b. Review, assess, plan and govern change

5a. Implement change

5b. Close change request?

6.  Update Configuration Management System

[Inset flow:]

1.  Identify service performance degradation

2.  Troubleshoot problem to isolate root cause

3.  Identify changes to be implemented

4.  Create TT/RFC to implement change

5.  Implement and automate change to close RFC

6.  Update CMS (Federated CMDB)

Drive innovation value of IT

Closed Loop Incident Process Key Benefits

Cost

•  Drive efficiency through automation

•  Optimize service lifecycle process efficiency

72% lower maintenance cost

Quality

•  Eliminate error-prone manual tasks

•  Predict and prevent negative business impact

2.5x increased availability and performance

Transparency

•  The cost/value ratio of delivered services is understood by the business

•  Any service from everywhere

99.5% availability via integrated delivery

Agility

•  Saved labor can be spent on innovation

•  Measure and optimize time to develop and successfully deploy new services

30% faster time to market for new apps

Business risk

•  Reduce risk of failure when deploying changes

•  Enable compliance

70% fewer bad changes
