cisco mttr e mtbf


Upload: edson-aquino-aquino

Post on 26-Apr-2015


© 2004 Cisco Systems, Inc. All rights reserved. Printed in USA.Presentation_ID.scr


AVAILABILITY MEASUREMENT

SESSION NMS-2201


Agenda

• Introduction

• Availability Measurement Methodologies
    Trouble Ticketing
    Device Reachability: ICMP (Ping), SA Agent, COOL
    SNMP: Uptime, Ping-MIB, COOL, EEM, SA Agent Application

• Developing an Availability ‘Culture’


Associated Sessions

• NMS-1N01: Intro to Network Management

• NMS-1N02: Intro to SNMP and MIBs

• NMS-1N04: Intro to Service Assurance Agent

• NMS-1N41: Introduction to Performance Management

• NMS-2042: Performance Measurement with Cisco IOS®

• ACC-2010: Deploying Mobility in HA Wireless LANs

• NMS-2202: How Cisco Achieved HA in Its LAN

• RST-2514: HA in Campus Network Deployments

• NMS-4043: Advanced Service Assurance Agent

• RST-4312: High Availability in Routing

INTRODUCTION: WHY MEASURE AVAILABILITY?


Why Measure Availability?

1. Baseline the network

2. Identify areas for network improvement

3. Measure the impact of improvement projects


Why Should We Care About Network Availability?

• Where are we now? (baseline)

• Where are we going? (business objectives)

• How best do we get from where we are now to where we are going? (improvements)

• “What if we can’t get there from here?”


Why Should We Care About Network Availability?

Downtime Costs Too Much!

Recent studies by Sage Research determined that US-based service providers encountered:

• Percent of downtime that is unscheduled: 44%

• 18% of customers experience over 100 hours of unscheduled downtime or an availability of 98.5%

• Average cost of network downtime per year: $21.6 million or $2,169 per minute!

SOURCE: Sage Research, IP Service Provider Downtime Study: Analysis of Downtime Causes, Costs and Containment Strategies, August 17, 2001, Prepared for Cisco SPLOB


Cause of Network Outages

• User Error and Process (40%)
    Change management
    Process consistency

• Technology (20%)
    Hardware
    Links
    Design
    Environmental issues
    Natural disasters

• Software and Application (40%)
    Software issues
    Performance and load
    Scaling

Source: Gartner Group


Top Three Causes of Network Outages

• Congestive degradation

• Capacity (unanticipated peaks)

• Solutions validation

• Software quality

• Inadvertent configuration change

• Change management

• Network design

• WAN failure (e.g., major fiber cut or carrier failure)

• Power

• Critical services failure (e.g., DNS/DHCP)

• Protocol implementations and misbehavior

• Hardware fault


Method for Attaining a Highly-Available Network

• Establish a standard measurement method

• Define business goals as related to metrics

• Categorize failures, root causes, and improvements

• Take action for root cause resolution and improvement implementation

Or a Road to Five Nines


Where Are We Going? Or What Are Your Business Goals?

• Financial
    ROI
    Economic Value Added
    Revenue/Employee

• Productivity

• Time to market

• Organizational mission

• Customer perspective
    Satisfaction
    Retention
    Market Share

Define Your ‘End-State’? What Is Your Goal?


Why Availability for Business Requirements?

• Availability as a basis for productivity data
    Measurement of total-factor productivity
    Benchmarking the organization
    Overall organizational performance metric

• Availability as a basis for organizational competency
    Availability as a core competency
    Availability improvement as an innovation metric

• Resource allocation information
    Identify defects
    Identify root cause
    Measure MTTR (tied to process)


It Takes a Design Effort to Achieve HA

Hardware and Software Design

Network and Physical Plant Design

Process Design

INTRODUCTION: WHAT IS NETWORK AVAILABILITY?


What Is High Availability?

Availability    Downtime per Year (24x7x365)
99.000%         3 Days 15 Hours 36 Minutes
99.500%         1 Day 19 Hours 48 Minutes
99.900%         8 Hours 46 Minutes
99.950%         4 Hours 23 Minutes
99.990%         53 Minutes
99.999%         5 Minutes
99.9999%        30 Seconds

High Availability Means an Average End User Will Experience Less than Five Minutes Downtime per Year
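The downtime figures on this slide follow from multiplying the unavailable fraction by the hours in a year. The sketch below shows that arithmetic; the function names are illustrative and not part of the presentation, and it assumes a 365-day year as the slide does.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, the 24x7x365 basis used on the slide


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Yearly downtime, in seconds, implied by an availability percentage."""
    unavailability = 1.0 - availability_percent / 100.0
    return unavailability * HOURS_PER_YEAR * 3600


def format_duration(seconds: float) -> str:
    """Render a duration as days, hours, minutes, seconds (nearest second)."""
    total = round(seconds)
    days, rem = divmod(total, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{days}d {hours}h {minutes}m {secs}s"


for pct in (99.0, 99.5, 99.9, 99.95, 99.99, 99.999, 99.9999):
    print(f"{pct}% -> {format_duration(downtime_seconds_per_year(pct))}")
```

The slide rounds its values to whole minutes (and to 30 seconds for six nines), so the printed results differ slightly from the table.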


Availability Definition

• Availability definition is based on business objectives

    Is it the user experience you are interested in measuring?

    Are some users more important than others?

• Availability groups? Definitions of different groups

• Exceptions to the availability definition

    e.g., the CEO should never experience a ‘network’ problem


How You Define Availability

• Define availability perspective (customer, business, etc.)

• Define availability groups and levels of redundancy

• Define an outage

• Define impact to network

    Ensure SLAs are compatible with outage definition

    Understand how maintenance windows affect outage definition

    Identify how to handle DNS and DHCP within definition of Layer 3 outage

    Examine component level sparing strategy

• Define what to measure

• Define measurement accuracy requirements


Network Design: What Is Reliability?

• “Reliability” is often used as a general term that refers to the quality of a product
    Failure rate
    MTBF (Mean Time Between Failures) or MTTF (Mean Time To Failure)
    Engineered availability

• Reliability is defined as the probability of survival (or no failure) for a stated length of time


MTBF Defined

• MTBF stands for Mean Time Between Failures

• MTTF stands for Mean Time to Failure

    This is the average length of time between failures (MTBF) or to a failure (MTTF)

    More technically, it is the mean time to go from an OPERATIONAL STATE to a NON-OPERATIONAL STATE

    MTBF is usually used for repairable systems, and MTTF is used for non-repairable systems

• MTTR stands for Mean Time to Repair


One Method of Calculating Availability

• Availability = MTBF ÷ (MTBF + MTTR)

• What is the availability of a computer with MTBF = 10,000 hrs. and MTTR = 12 hrs?

    A = 10000 ÷ (10000 + 12) = 99.88%

• Annual uptime: 8,760 hrs/year × 0.9988 = 8,749.5 hrs

• Conversely, annual downtime: 8,760 hrs/year × (1 − 0.9988) = 10.5 hrs
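The worked example above can be checked with a few lines of code. This is a minimal sketch of the MTBF/MTTR availability formula from the slide; the function and variable names are illustrative.

```python
HOURS_PER_YEAR = 8760  # 24 hours x 365 days


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), as a fraction between 0 and 1."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Values from the slide: MTBF = 10,000 hours, MTTR = 12 hours
a = availability(10_000, 12)
annual_uptime = HOURS_PER_YEAR * a
annual_downtime = HOURS_PER_YEAR * (1 - a)

print(f"availability    = {a:.4%}")
print(f"annual uptime   = {annual_uptime:.1f} hours")
print(f"annual downtime = {annual_downtime:.1f} hours")
```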


Networks Consist of Series-Parallel

• Combinations of in-series and redundant components

[Figure: reliability block diagram (RBD) with blocks A, B1, B2, C, D1, D2, D3, E, and F; redundancy labels 1/2 and 2/3]
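The slide gives no formulas for combining blocks, so the following is only a sketch using the standard results for independent components: availabilities multiply in series, and a redundant group is available when enough of its members are up. It assumes the labels "1/2" and "2/3" mean 1-of-2 and 2-of-3 redundancy with identical blocks, and the 0.99 per-block availability is a made-up value.

```python
from math import comb, prod


def series(availabilities):
    """All blocks must be up: multiply the individual availabilities."""
    return prod(availabilities)


def parallel(availabilities):
    """At least one block must be up: 1 minus the chance that all are down."""
    return 1.0 - prod(1.0 - a for a in availabilities)


def k_of_n(k, n, a):
    """At least k of n identical, independent blocks (availability a) are up."""
    return sum(comb(n, i) * a**i * (1 - a) ** (n - i) for i in range(k, n + 1))


a = 0.99  # hypothetical availability of every block in the diagram

# Assumed chain: A -> (1 of B1, B2) -> C -> (2 of D1, D2, D3) -> E -> F
system = series([a, k_of_n(1, 2, a), a, k_of_n(2, 3, a), a, a])
print(f"system availability = {system:.6f}")
```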


More Complex Redundancy

• Pure active parallel: All components are on

• Standby redundant: Backup components are not operating

• Perfect switching: Switch-over is immediate and without fail

• Switch-over reliability: The probability of switchover when it is not perfect

• Load sharing: All units are on and workload is distributed


MEASURING THE PRODUCTION NETWORK


Reliability or Engineered Availability vs. Measured Availability

1. Reliability is an engineered probability of the network being available

2. Measured Availability is the actual outcome produced by physically measuring over time the engineered system

Calculations Are Similar—Both Are Based on MTBF and MTTR


Availability Choice Based on Business Goals

• Passive availability measurement
    (Without sending additional traffic on the production network, using data from problem management, fault management, or another system)

• Active availability measurement
    (With traffic being sent specifically for availability measurement, using ICMP echo, SNMP, SA agent, etc. to generate data)


Types of Availability

• Device/interface

• Path

• Users

• Application


Some Types of Availability Metrics

• Mean Time to Repair (MTTR)

• Impacted User Minutes (IUM)

• Defects per Million (DPM)

• Mean Time Between Failures (MTBF)

• Performance (e.g. latency, drops)


Back to How Availability Is Calculated?

• Availability (%) is calculated by tabulating end user outage time, typically on a monthly basis

• Some customers prefer to use DPM (Defects per Million) to represent network availability

Availability (%) = ((Total User Time − Total User Outage Time) ÷ Total User Time) × 10^2

DPM = (Total User Outage Time ÷ Total User Time) × 10^6

Total User Time = Total # of End Users × Time in Reporting Period

Total User Outage Time = Σ (# of End Users × Outage Time in Reporting Period)

Σ is over all the incidents in the reporting period

Ports or connections may be substituted for “end users”
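As a sketch of how these definitions combine, the function below computes availability (%) and DPM from a list of incidents. The incident list, user count, and reporting period are hypothetical values chosen for illustration.

```python
def user_availability(total_users, period_minutes, incidents):
    """Return (availability %, DPM) from user-weighted outage time.

    incidents: iterable of (end_users_affected, outage_minutes) pairs,
    one pair per incident in the reporting period.
    """
    total_user_time = total_users * period_minutes
    total_user_outage = sum(users * minutes for users, minutes in incidents)
    availability_pct = (total_user_time - total_user_outage) / total_user_time * 100
    dpm = total_user_outage / total_user_time * 1_000_000
    return availability_pct, dpm


# Hypothetical month: 1,000 end users, 30 days, two incidents
incidents = [(50, 120), (200, 15)]
pct, dpm = user_availability(1000, 30 * 24 * 60, incidents)
print(f"availability = {pct:.4f}%  DPM = {dpm:.1f}")
```

Both numbers express the same ratio on different scales, so a DPM target converts directly to an availability target and back.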


Defects per Million

• Started with mass produced items like toasters

• For PVCs,
    DPM = (Σ (#conns × outage minutes) ÷ Σ (#conns × total minutes)) × 10^6

• For SVCs or phone calls,
    DPM = (Σ (#existing calls lost + #new calls blocked) ÷ total calls attempted) × 10^6

• For connectionless traffic (application dependent),
    DPM = (Σ (#end users × outage minutes) ÷ Σ (#end users × total minutes)) × 10^6

NETWORK AVAILABILITY COLLECTION METHODS: TROUBLE TICKETING METHODS


Availability Improvement Process

• Step I
    Validate data collection/calculation methodology
    Establish network availability baseline
    Set high availability goals

• Step II
    Measure uptime ongoing
    Track defects per million (DPM) or IUM or availability (%)

• Step III
    Track customer impact for each ticket/MTTR
    Categorize DPM by reason code and begin trending
    Identify initiatives/areas for a focus to eliminate defects


Data Collection/Analysis Process

• Understand current data collection methodology
    Customer internal ticket database
    Manual

• Monthly collection of network performance data and export the following fields to a spreadsheet or database system:
    Outage start time (date/time)
    Service restore time (date/time)
    Problem description
    Root cause
    Resolution
    Number of customers impacted
    Equipment model
    Component/part
    Planned maintenance activity/unplanned activity
    Total customers/ports on network


Network Availability Results

• Methodology and assumptions must be documented

• Network availability should include:
    Overall % network availability (baseline/trending)
    Conversion of downtime to DPM by:
        Planned and unplanned
        Root cause
        Resolution
        Equipment type
    Overall MTTR
    MTTR by:
        Root cause
        Resolution
        Equipment type

• Results are not necessarily limited to the above but should be customized based on your network and requirements
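One way to produce the per-category DPM and MTTR results is to group exported ticket records by a field such as root cause. The sketch below assumes a simplified ticket record with only four of the exported fields; the `Ticket` class, the sample tickets, and the customer count are hypothetical, and MTTR is taken here as the mean time from outage start to service restore.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Ticket:
    outage_start: datetime
    service_restored: datetime
    root_cause: str
    customers_impacted: int


def summarize_by_root_cause(tickets, total_customers, period_minutes):
    """Return {root cause: {"dpm": ..., "mttr_minutes": ...}}."""
    total_user_minutes = total_customers * period_minutes
    outage_user_minutes = defaultdict(float)
    durations = defaultdict(list)
    for t in tickets:
        minutes = (t.service_restored - t.outage_start).total_seconds() / 60
        outage_user_minutes[t.root_cause] += t.customers_impacted * minutes
        durations[t.root_cause].append(minutes)
    return {
        cause: {
            "dpm": outage_user_minutes[cause] / total_user_minutes * 1_000_000,
            "mttr_minutes": sum(d) / len(d),
        }
        for cause, d in durations.items()
    }


# Hypothetical tickets for a 31-day month on a 1,000-customer network
tickets = [
    Ticket(datetime(2004, 5, 1, 10, 0), datetime(2004, 5, 1, 12, 0), "HW", 100),
    Ticket(datetime(2004, 5, 10, 8, 0), datetime(2004, 5, 10, 9, 0), "HW", 50),
    Ticket(datetime(2004, 5, 15, 0, 0), datetime(2004, 5, 15, 0, 30), "SW", 200),
]
result = summarize_by_root_cause(tickets, 1000, 31 * 24 * 60)
for cause, stats in result.items():
    print(cause, round(stats["dpm"], 1), stats["mttr_minutes"])
```

The same grouping works for resolution, equipment type, or planned versus unplanned activity by changing the key field.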


Availability Metrics: Reviewed

• Network has 100 customers

• Time in reporting period is one year or 24 hours x 365 days

• 8 customers have 24 hours down time per year

Availability = 1 − (8 × 24) ÷ (100 × 24 × 365) = 0.999781

DPM = (8 × 24 × 10^6) ÷ (100 × 24 × 365) = 219.2 failures for every 1 million user hours

MTBF = (24 × 365) ÷ 8 = 1095 (hours)

MTTR = 1095 × (1 − 0.999781) ÷ 0.999781 = 0.24 (hours)
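The four metrics can be recomputed directly from the three inputs on this slide. The sketch below does that with illustrative variable names; it yields an availability of about 0.999781, a DPM of about 219.2, an MTBF of 1095 hours, and an MTTR of about 0.24 hours.

```python
customers = 100
period_hours = 24 * 365        # one-year reporting period
affected_customers = 8
outage_hours_each = 24         # each affected customer is down 24 hours

user_outage_hours = affected_customers * outage_hours_each
user_hours = customers * period_hours

availability = 1 - user_outage_hours / user_hours
dpm = user_outage_hours / user_hours * 1_000_000
mtbf_hours = period_hours / affected_customers
# Rearranging Availability = MTBF / (MTBF + MTTR) for MTTR
mttr_hours = mtbf_hours * (1 - availability) / availability

print(f"availability = {availability:.6f}")
print(f"DPM          = {dpm:.1f}")
print(f"MTBF         = {mtbf_hours:.0f} hours")
print(f"MTTR         = {mttr_hours:.2f} hours")
```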


TROUBLE TICKETING METHOD: SAMPLE OUTPUT


Network Availability

[Figure (illustrative): Overall Network Availability (Planned/Unplanned) by month, July through June; vertical axis from 99.50 to 100.00]

• Key takeaways


Platform Related DPM Comparison

• Platform related DPM contributed 13% of total DPM in September

• Platform DPM includes events from:
    Backbone
    NAS
    PG
    POP
    Radius Server
    VPN Radius Server

• All other events are included in the “Other” category

DPM (illustrative)

                    June    July    Aug     Sept    Oct     Nov     Dec
Other               339.5   424.9   394.7   362.2
Platform Related    49.2    82.5    104     52.6
Total DPM           388.7   507.4   498.7   414.8
99.99% Target       100     100     100     100     100     100     100

Breakdown of Platform Related DPM

• Network Access Server (NAS) accounts for 50% of the total Platform related DPM in September

• Private Access Gateway (PG) showing significant decrease over the past 3 months

                          June    July    Aug     Sept
Backbone                  1.5     0.8     15.7    2.3
NAS                       21.7    19.4    27      26.1
PG                        26      59.6    56.8    18.9
POP                       0       3.9     0.5     1.6
Radius Server             0       0       1.2     0.3
VPN Radius                0       8.8     2.8     3.4
Total Platform Related    49.2    82.5    104     52.6

DPM by Cause

[Figure and data table (illustrative): DPM by cause per month, Dec through May; categories: Config/SW, HW, Other, Power, Environmental, Human Error, Unknown, and TOTAL; vertical axis from 0 to 2500]


MTTR Analysis: Hardware Faults

• Number of faults increased slightly in September; however, MTTR decreased

• 49% of faults resolved in < 1 hour in September

• 11% of faults resolved in > 24 hours, with an additional 3% > 100 hours

Produce for Each Fault Type

[Figures (illustrative), Router HW, Jun through Dec: MTTR in hours by month (data labels 12.42, 15.1, 8.49, 7.19); number of faults by month; and share of total faults by month. The fault charts are broken down by time to resolve: <1 Hr, 1-4 Hr, 4-12 Hr, 12-24 Hr, >24 Hr, >100]


Unplanned DPM

• Key takeaways

• Action plans
    Identify areas of focus to enable reduction of DPM to achieve network availability goal

Unplanned DPM by month (illustrative; chart axis from 0 to 1000)

           Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
SW         60    140   50    67    80    65    200   145   100   40    10
HW         90    200   80    104   180   115   385   325   245   110   5
Process    90    80    55    100   100   90    210   180   75    10    5
Other      70    100   35    79    80    80    165   110   40    10    0
TOTAL      310   520   220   350   440   350   960   760   460   170   40


Trouble Ticketing Method

• Pros
    Easy to get started
    No network overhead
    Outages can be categorized based on event

• Cons
    Some internal subjective/consistency process issues
    Outages may occur that are not included in the trouble ticketing systems
    Resources needed to scrub data and create reports
    May not work with existing trouble ticketing system/process

Network Availability Collection Methods

AUTOMATED FAULT MANAGEMENT EVENTS METHOD


Availability Improvement Process

• Step I
    Determine availability goals
    Validate fault management data collection
    Determine a calculation methodology
    Build software package to use customer event log

• Step II
    Establish network availability baseline
    Measure uptime on an ongoing basis

• Step III
    Track root cause and customer impact
    Begin trending of availability issues
    Identify initiatives and areas of focus to eliminate defects


Event Log

Event Log Example

Fri Jun 15 11:05:31 2001 Debug: Looking for message header ...
Fri Jun 15 11:05:33 2001 Debug: Message header is okay
Fri Jun 15 11:05:33 2001 Debug: $(LDT) -> "06152001110532"
Fri Jun 15 11:05:33 2001 Debug: $(MesgID) -> "100013"
Fri Jun 15 11:05:33 2001 Debug: $(NodeName) -> "ixc00asm"
Fri Jun 15 11:05:33 2001 Debug: $(IPAddr) -> "10.25.0.235"
Fri Jun 15 11:05:33 2001 Debug: $(ROCom) -> "xlr8ed!"
Fri Jun 15 11:05:33 2001 Debug: $(RWCom) -> "s39o!d%"
Fri Jun 15 11:05:33 2001 Debug: $(NPG) -> "CISCO-Large-special"
Fri Jun 15 11:05:33 2001 Debug: $(AlrmDN) -> "aSnmpStatus"
Fri Jun 15 11:05:33 2001 Debug: $(AlrmProp) -> "system"
Fri Jun 15 11:05:33 2001 Debug: $(OSN) -> "Testing"
Fri Jun 15 11:05:33 2001 Debug: $(OSS) -> "Normal"
Fri Jun 15 11:05:33 2001 Debug: $(DSN) -> "SNMP_Down"
Fri Jun 15 11:05:33 2001 Debug: $(DSS) -> "Agent_Down"
Fri Jun 15 11:05:33 2001 Debug: $(TrigName) -> "NodeStateUp"
Fri Jun 15 11:05:33 2001 Debug: $(BON) -> "nl-ping"
Fri Jun 15 11:05:33 2001 Debug: $(TrapGN) -> "-2"
Fri Jun 15 11:05:33 2001 Debug: $(TrapSN) -> "-2"

• Analysis of events received from the network devices

• Analysis of accuracy of the data
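A software package that consumes such an event log first has to pull the named fields out of each message. The sketch below is one possible way to do that with a regular expression; it assumes the `$(Name) -> "value"` layout shown in the example log, and the function name is illustrative rather than part of any product.

```python
import re

# Matches lines such as: ... Debug: $(NodeName) -> "ixc00asm"
FIELD_RE = re.compile(r'Debug: \$\((\w+)\) -> "([^"]*)"')


def parse_event(log_text: str) -> dict:
    """Collect the $(Name) -> "value" pairs of one logged event into a dict."""
    return {name: value for name, value in FIELD_RE.findall(log_text)}


sample = '''Fri Jun 15 11:05:33 2001 Debug: $(NodeName) -> "ixc00asm"
Fri Jun 15 11:05:33 2001 Debug: $(IPAddr) -> "10.25.0.235"
Fri Jun 15 11:05:33 2001 Debug: $(DSN) -> "SNMP_Down"
Fri Jun 15 11:05:33 2001 Debug: $(TrigName) -> "NodeStateUp"'''

event = parse_event(sample)
print(event["NodeName"], event["IPAddr"], event["DSN"])
```

Fields such as the node name and the state names are then available for counting incidents and checking the accuracy of the data.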


Calculation Methodology: Example

• Primary events are device down/up

• Down time is calculated based on device-type outage duration

• Availability is calculated based on the total number of device types, the total time, and the total down time

• MTTR numbers are calculated from average duration of downtime

• With MTTR the shortest and longest outage provides a simplified curve
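The calculation described above can be sketched as pairing each device-down event with the next device-up event for the same device. The event tuples, device names, and one-day period below are hypothetical; a real fault management system would supply its own event format.

```python
from datetime import datetime, timedelta


def outage_stats(events, device_count, period):
    """Summarize outages from (timestamp, device, state) events.

    state is "down" or "up"; an outage runs from a device's "down"
    event to its next "up" event.
    """
    down_since = {}
    outages = []
    for ts, device, state in sorted(events):
        if state == "down":
            down_since.setdefault(device, ts)
        elif state == "up" and device in down_since:
            outages.append(ts - down_since.pop(device))
    total_down = sum(outages, timedelta())
    return {
        "incidents": len(outages),
        "total_down": total_down,
        "mttr": total_down / len(outages) if outages else timedelta(),
        "shortest": min(outages, default=timedelta()),
        "longest": max(outages, default=timedelta()),
        "availability": 1 - total_down / (device_count * period),
    }


# Hypothetical events for two devices over one day
events = [
    (datetime(2001, 6, 15, 10, 0), "r1", "down"),
    (datetime(2001, 6, 15, 10, 30), "r1", "up"),
    (datetime(2001, 6, 15, 11, 0), "r2", "down"),
    (datetime(2001, 6, 15, 11, 10), "r2", "up"),
]
stats = outage_stats(events, device_count=2, period=timedelta(days=1))
print(stats)
```

Running the function once per device type gives the per-type rows of a report, and running it over all events gives the grand total.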


Automated Fault Management Methodology

• Pros
    Outage duration and scope can be fairly accurate
    Can be implemented within a NMS fault management system
    No additional network overhead

• Cons
    Requires an excellent change management/provisioning process
    Requires an efficient and effective fault management system
    Requires custom development
    Does not account for routing problems
    Not “true” end-to-end measure


NETWORK AVAILABILITY DATA COLLECTION: SAMPLE OUTPUT


Automated Fault Management: Example Reports

Device Type | # of Devices | Count of Incidents | Total Down Time (hhh:mm:ss) | %Down | %Up | Shortest Outage Duration | Mean Time to Repair | Longest Outage Duration | Events per Device
Host Totals | 2389 | 801 | 202:27:27 | .0673% | 99.9327% | 0:00:19 | 0:20:47 | 27:48:46 | 24.4
Network Totals | 4732 | 1673 | 430:02:03 | .1309% | 99.8691% | 0:00:24 | 0:22:36 | 09:49:35 | 14.9
Other Totals | 897 | 173 | 212:29:46 | .0509% | 99.9491% | 0:00:17 | 0:26:07 | 42:16:10 | 16.8
GRAND TOTAL | 8018 | 2647 | 844:59:16 | .0830% | 99.9170% | 0:00:20 | 0:23:10 | 26:38:11 | 18.7


Automated Fault Management: Example Reports (2)

[Figures: three pie charts by device type]

• Number of Managed Devices: Host Totals 30%, Network Totals 59%, Other Totals 11%

• Count of Incidents: Host Totals 30%, Network Totals 63%, Other Totals 7%

• Total Down Time: Host Totals 24%, Network Totals 51%, Other Totals 25%

Network Availability Collection Methods

ICMP ECHO (PING) AND SNMP AS DATA GATHERING TECHNIQUES


Data Gathering Techniques

• ICMP ping

• Link and device polling (SNMP)

• Embedded RMON

• Embedded event management

• Syslog messages

• COOL


Data Gathering Techniques

• Method definition: Central workstation or computer configured to send ping packets to the network edges (devices or ports) to determine reachability

• How: Edge interfaces and/or devices are defined and “pinged” on a determined interval

• Unavailability: Pre-defined, non-response from the interface

ICMP Reachability


Availability Measurement Through ICMP

Periodic ICMP Test

[Figure: periodic pings to network devices; periodic pings to network leaf nodes]


Data Gathering Techniques

• Pros
    Fairly accurate “network availability”
    Accounts for routing problems
    Can be implemented for fairly low network overhead

• Cons
    Point to multipoint implies not “true” end-to-end measure
    Availability granularity limited by ping frequency
    Maintenance of device database: must have a solid change management and provisioning process

ICMP Reachability


Data Gathering Techniques

• Method definition: SNMP polling and trapping on links, edge ports, or edge devices

• How: An agent is configured to SNMP poll and tabulate outage times for defined devices or links; database maintains outage times and total service time; sometimes trap information is used to augment this method by providing more accurate information on outages

• Unavailability: Pre-defined, non-redundant links, ports, or devices that are down

Link and Device Status


Polling Interval vs. Sample Size

• Polling interval is the rate at which data is collected from the network

    Polling interval = 1 ÷ Sampling rate

• The smaller the polling interval, the more detailed (granular) the data collected

    Example: polling data once every 15 minutes provides 4 times the detail (granularity) of polling once an hour

• A smaller polling interval does not necessarily provide a better margin of error

    Example: polling once every 15 minutes for one hour has the same margin of error as polling once an hour for 4 hours


Link and Device Status Method

• Method definition: SNMP polling and trapping on links, edge ports, or edge devices

• How: Utilizing existing NMS systems that are currently SNMP polling to tabulate outage times for defined devices or links

    A database maintains outage times and total service time

    SNMP Trap information is also used to augment this method by providing more accurate information on outages


Link and Device Status Method

• Pros
    Outage duration and scope can be fairly accurate
    Utilize existing NMS systems
    Low network overhead

• Cons
    No canned SW to do this; requires custom development
    Maintaining element device database challenging
    Requires an excellent change mgmt and provisioning process
    Does not account for routing problems
    Not a “true” end-to-end measure


CISCO SERVICE ASSURANCE AGENT (SA AGENT)


Service Assurance Agent

• Method definition: SA Agent is an embedded feature of Cisco IOS software and requires configuration of the feature on routers within the customer network; use of the SA Agent can provide for a rapid, cost-effective deployment without additional hardware probes

• How: A data collector creates SA Agents on the routers to monitor certain network/service performances; the data collector then collects this data from the routers, aggregates it and makes it available

• Unavailability: Pre-defined paths with reporting on non-redundant links, ports, or devices that are down within a path


Case Study: Financial Institution (Collection)

[Figure: SA Agent collectors, remote sites, DNS, and Internet web sites]


Availability Using Network-Based Probes

• DPM equations used with network-based probes as input data

    Availability = 1 − (Probes with No Response ÷ Total Probes Sent)

    DPM = (Probes with No Response ÷ Total Probes Sent) × 10^6

• Probes can be
    Simple ICMP Ping probe
    Modified Ping to test specific applications
    Cisco IOS SA Agent

• DPM will be for connectivity between 2 points on the network, the source and destination of probe

    Source of probe is usually a management system and the destinations are the devices managed

    Can calculate DPM for every device managed


Availability Using Network-Based Probes: Example

• Network probe is a ping

• 10000 probes are sent between management system and managed device

• 1 probe failed to respond

    Availability = 1 − (1 ÷ 10000) = 0.9999

    DPM = (1 ÷ 10000) × 10^6 = 100 (100 probes out of 1 million will fail)
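The same arithmetic applies to any probe counts. A minimal sketch, with an illustrative function name:

```python
def probe_metrics(probes_sent: int, probes_without_response: int):
    """Return (availability, DPM) for one source-destination probe pair."""
    loss_ratio = probes_without_response / probes_sent
    return 1 - loss_ratio, loss_ratio * 1_000_000


# Values from the example: 10,000 pings sent, 1 without a response
availability, dpm = probe_metrics(10_000, 1)
print(f"availability = {availability:.4f}, DPM = {dpm:.0f}")
```

Calling the function once per managed device gives a DPM figure for every device the management system probes.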


Sample Size

• Sample size is the number of samples that have been collected

• The more samples collected the higher the confidence that the data accurately represents the network

• Confidence (margin of error) is defined by

    m = 1 ÷ √(sample size)

• Example: data is collected from the network every 1 hour

    After one day: m = 1 ÷ √24 = 0.2041

    After one month: m = 1 ÷ √(24 × 31) = 0.0367
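The margin-of-error rule above depends only on how many samples have been collected, which a short sketch makes concrete. The function name is illustrative, and the month is taken as 31 days to match the example.

```python
from math import sqrt


def margin_of_error(sample_size: int) -> float:
    """m = 1 / sqrt(sample size), the rule of thumb given on the slide."""
    return 1 / sqrt(sample_size)


samples_per_day = 24  # one sample every hour
after_one_day = margin_of_error(samples_per_day)
after_one_month = margin_of_error(samples_per_day * 31)

# Four samples give the same margin whether they are taken every
# 15 minutes for one hour or once an hour for four hours.
four_samples = margin_of_error(4)

print(round(after_one_day, 4), round(after_one_month, 4), four_samples)
```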


Service Assurance Agent

• Pros
    Accurate “network availability” for defined “paths”
    Accounts for routing problems
    Implementation with very low network overhead

• Cons
    Requires a system to collect the SAA data
    Requires implementation in the router configurations
    Availability granularity limited by polling frequency
    Definition of the critical network paths to be measured

COMPONENT OUTAGE ONLINE MEASUREMENT (COOL)


COOL Objectives

• To automate the measurement to increase operational efficiency and reduce operational cost

• To measure the outage as close to the source of outage events as possible to pin point the cause of the outages

• To cope with a large number of network elements without causing system and network performance degradation

• To maintain measurement data reliably in the presence of element failure or network partition

• To support simplicity in deployment, configuration, and data collection (autonomous measurement)


COOL Features

[Figure: COOL embedded in the access router, between customer equipment and the NMS; labels on the NMS side: NetTools, 3rd Party Tools, C-NOTE, PNL, Outage Monitor MIB]

• COOL Embedded in Router

• Automated Real-Time Measurement

• Autonomous Measurement

• Outage Data Stored in Router

• Open access via Outage Monitor MIB

• Event Notification Filtering


COOL Features (Cont.)

• Supports NMS or tools for applications such as
  Calculation of software or hardware MTBF, MTTR, and availability per object, device, or network
  Verification of the customer’s SLA
  Troubleshooting in real time

• Two-tier framework
  Reduces performance impact on the router
  Provides scalability to the NMS
  Makes deployment easy
  Provides flexibility in availability calculation

[Figure: two-tier framework. Labels: Customer Equipment; Access Routers and Core Router, each with COOL and the Outage Monitor MIB; NMS. Tier labels: “Outage Monitoring and Measurement” and “Outage Correlation and Calculation”]


Outage Model

[Figure: an access router (RP, power, fan, etc.; physical and logical interfaces) connected by links to customer equipment, a MUX/hub/switch, a peer router, and the network management system, with the monitored object types A to D marked on the diagram]

Type | Objects Monitored | Failure Modes
A | Physical entity objects | Component hardware or software failure, including the failure of line card, power supplies, fan, switch fabric, and so on
B | Interface objects | Interface hardware or software failure, loss of signal
C | Remote objects | Failure of remote device (customer equipment or peer networking device) or link in-between
D | Software objects | Failure of software processes running on the RPs and line cards


Outage Characterization

• Data definition
  Defect threshold: a value across which the object is considered to be defective (service degradation or complete outage)
  Duration threshold: the minimum period beyond which an outage needs to be reported (given SLA)
  Start time: when the object outage starts
  End time: when the outage ends

[Figure: timeline of a measured value crossing the defect threshold. The down event marks the start time, the up event marks the end time, and the outage duration between them is compared with the duration threshold.]
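The four definitions on this slide can be illustrated with a small Python sketch. It is not the COOL implementation: the function name `find_outages`, the sample format, and the choice that a value *below* the defect threshold counts as defective are all assumptions made for the example.

```python
def find_outages(samples, defect_threshold, duration_threshold):
    """Return (start, end) pairs for outages lasting at least duration_threshold.

    samples is a time-ordered list of (time, value) pairs. A value below
    defect_threshold is treated as defective (assumed direction).
    An outage still open at the end of the samples is not reported.
    """
    outages = []
    start = None
    for t, value in samples:
        defective = value < defect_threshold
        if defective and start is None:
            start = t  # down event: the outage starts
        elif not defective and start is not None:
            if t - start >= duration_threshold:
                outages.append((start, t))  # up event: the outage ends
            start = None
    return outages


# Hypothetical samples: a 2-unit blip and a 70-unit outage.
samples = [(0, 1.0), (10, 0.2), (12, 1.0), (30, 0.0), (90, 0.1), (100, 1.0)]
outages = find_outages(samples, defect_threshold=0.5, duration_threshold=30)
print(outages)
```

The short blip is filtered out because it is shorter than the duration threshold; only the long outage is reported.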


Architecture

[Figure: COOL architecture. Recoverable labels, by layer:]

Customer interfaces: Outage Monitor MIB (SNMP polling, SNMP notification); configuration (CLI, customer authentication)

Measurement metrics: Outage Manager; data table structure (Outage Component Table, Event History Table, Event Map Table, Process Map Table, Remote Component Map Table); HA and persistent data store (time stamp, temp event data, crash reason, and outage data in NVRAM and ATA flash)

Measurement methods: Internal Component Outage Detector (event sources: Cisco IOS Fault Manager, callbacks, syslog); Remote Component Outage Detector (customer equipment detection function: ping, SAA); APIs

Optional: baseline, CPU usage detect


Outage Data: AOT and NAF

• Requirements of measurement metrics:
  Enable calculation of MTTR, MTBF, availability, and SLA assessment
  Ensure measurement efficiency in terms of resources (CPU, memory, and network bandwidth)

• Measurement metrics per object:
  AOT: Accumulated Outage Time since measurement started
  NAF: Number of Accumulated Failures since measurement started

[Figure: up/down timeline for Router 1 with two system crashes of 10 time units each, giving AOT = 20 and NAF = 2]


Outage Data: AOT and NAF

• Object containment model: Router Device, Line Card, Physical Interface, Logical Interface

• Containment independent property

[Figure: up/down timelines for Router 1 and Interface 1. Router 1 has two system crashes of 10 time units each; Interface 1 has one interface failure of 7 time units.]

Router Device: AOT = 20; NAF = 2

Interface: AOT = 7; NAF = 1

Service Affecting: AOT = 27; NAF = 3
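The AOT and NAF figures on these two slides can be reproduced from outage intervals with a short Python sketch. The helper `accumulate` and the start times are hypothetical; only the durations (10, 10, and 7) come from the slides. The reading of “service affecting” as the interface's own failure plus every outage of the router that contains it is an assumption inferred from the slide's totals.

```python
def accumulate(outages):
    """Return (AOT, NAF) for a list of non-overlapping (start, end) intervals."""
    aot = sum(end - start for start, end in outages)
    naf = len(outages)
    return aot, naf


# Start times are made up; only the durations come from the slides.
router_outages = [(100, 110), (300, 310)]  # two system crashes
interface_outages = [(200, 207)]           # one interface failure

router_aot, router_naf = accumulate(router_outages)
interface_aot, interface_naf = accumulate(interface_outages)

# Assumed meaning of "service affecting": the interface cannot carry traffic
# during its own failure or during any outage of the containing router.
service_aot, service_naf = accumulate(router_outages + interface_outages)

print(router_aot, router_naf, interface_aot, interface_naf, service_aot, service_naf)
```

Keeping the per-object counters independent of the containing object is what lets the NMS decide later which outages to combine.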


Example: MTTR

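• Find MTTR for Object i

$\mathrm{MTTR}_i = \mathrm{AOT}_i / \mathrm{NAF}_i = 14/2 = 7$ min

[Figure: up/down timeline for Object i over the measurement interval (T2 – T1), from T1 to T2, with two failures whose times to repair (TTR) are 10 min. and 4 min.]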


Example: MTBF and MTTF

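• Find MTBF and MTTF for Object i, with (T2 – T1) = 1,400,000 min

$\mathrm{MTBF}_i = (T_2 - T_1)/\mathrm{NAF}_i = 1{,}400{,}000/2 = 700{,}000$ min

$\mathrm{MTTF}_i = \mathrm{MTBF}_i - \mathrm{MTTR}_i = (T_2 - T_1 - \mathrm{AOT}_i)/\mathrm{NAF}_i = 700{,}000 - 7 = 699{,}993$ min

[Figure: up/down timeline for Object i over the measurement interval (T2 – T1), with two failures (10 min. and 4 min.), marking the time to repair (TTR), the time to failure (TTF), and the time between failures (TBF)]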
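The MTTR, MTBF, and MTTF formulas from these examples translate directly into code. The sketch below uses the slide's inputs (outages of 10 and 4 minutes, an interval of 1,400,000 minutes); the function names are illustrative, not part of any Cisco tool.

```python
def mttr(aot, naf):
    """Mean time to repair: accumulated outage time per failure."""
    return aot / naf


def mtbf(interval, naf):
    """Mean time between failures over the measurement interval."""
    return interval / naf


def mttf(interval, aot, naf):
    """Mean time to failure: MTBF minus MTTR."""
    return (interval - aot) / naf


AOT = 10 + 4          # minutes of accumulated outage
NAF = 2               # number of accumulated failures
INTERVAL = 1_400_000  # T2 - T1, in minutes

print(mttr(AOT, NAF))            # minutes
print(mtbf(INTERVAL, NAF))       # minutes
print(mttf(INTERVAL, AOT, NAF))  # minutes
```

Only two counters per object (AOT and NAF) plus the measurement interval are needed, which is why the router can keep the raw data cheaply and leave the arithmetic to the NMS.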


Example: Availability and DPM

• Find availability and DPM for Object i (measurement interval = 1,400,000 min.)

$\text{Availability (\%)} = \dfrac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}} \times 100 = \dfrac{700{,}000}{700{,}007} \times 100 = 99.999\%$

$\mathrm{DPM}_i = \dfrac{\mathrm{AOT}_i}{T_2 - T_1} \times 10^6 = 10$ DPM

[Figure: up/down timeline for Object i from T1 to T2 with two failures lasting 10 min. and 4 min.]
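As a quick check of this example, the following Python sketch computes availability and defects per million (DPM) from the same inputs. The function names are assumptions made for illustration.

```python
import math


def availability_percent(mtbf, mttr):
    """Availability (%) = MTBF / (MTBF + MTTR) * 100."""
    return mtbf / (mtbf + mttr) * 100.0


def defects_per_million(aot, interval):
    """DPM = AOT / (T2 - T1) * 10^6."""
    return aot / interval * 1_000_000


availability = availability_percent(mtbf=700_000, mttr=7)
dpm = defects_per_million(aot=14, interval=1_400_000)

print(f"availability = {availability:.3f}%")
print(f"DPM = {dpm:.1f}")
```

DPM is often easier to reason about than a string of nines: 10 DPM means 10 minutes of outage per million minutes measured.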


Planned Outage Measurement

• Captures the operational CLI commands “reload” and “forced switchover”

• A simple rule derives an upper bound on the planned outage
  If there is no “NVRAM soft crash file”, check the “reboot reason” or “switchover reason”
  If it is “reload” or “forced switchover”, the outage can be counted toward the upper bound of the planned outage

[Figure: outage categories. Labels: Send Break, Reload, Forced Switchover; Planned Outage; Operation Caused Outage; Upper Bound of the Planned Outage]


Event Filtering

• Flapping interface detection and filtering:
  A faulty interface can keep changing state between up and down
  May cause virtual network disconnection
  May cause an event storm, with hundreds of messages for each flapping event
  May make the object MTBF unreasonably low due to frequent short failures
  This unstable condition needs the operator’s attention

• COOL detects the flapping status by
  Catching very short outage events (shorter than the duration threshold)
  Incrementing the event counter
  Declaring flapping status and sending a notification when the counter goes over the flapping threshold (3 events) within a short period (1 sec)
  Declaring stable status and sending another notification when the counter falls below the threshold
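The flapping logic described on this slide can be sketched as a sliding-window counter. This is an illustration only, not COOL's code: the class `FlapDetector`, its method names, and the decision to declare flapping as soon as the count *reaches* the threshold are assumptions (the slide does not say whether the third or the fourth event trips the state).

```python
from collections import deque


class FlapDetector:
    """Counts short outage events inside a sliding time window."""

    def __init__(self, flap_threshold=3, window=1.0):
        self.flap_threshold = flap_threshold  # events needed to declare flapping
        self.window = window                  # seconds
        self.events = deque()
        self.flapping = False

    def _refresh(self, now):
        # Drop events that have aged out of the window.
        while self.events and now - self.events[0] > self.window:
            self.events.popleft()
        was_flapping = self.flapping
        self.flapping = len(self.events) >= self.flap_threshold
        if self.flapping and not was_flapping:
            return "flapping"  # first notification
        if was_flapping and not self.flapping:
            return "stable"    # second notification
        return None

    def short_outage(self, now):
        """Record an outage shorter than the duration threshold."""
        self.events.append(now)
        return self._refresh(now)

    def tick(self, now):
        """Re-evaluate the state without a new event."""
        return self._refresh(now)


detector = FlapDetector()
notifications = [detector.short_outage(t) for t in (0.0, 0.3, 0.6)]
recovered = detector.tick(5.0)
print(notifications, recovered)
```

Only two notifications leave the router for the whole episode, which is the point of filtering: the operator is alerted without an event storm.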


Data Persistency and Redundancy

• Data persistency
  To avoid data loss due to a link outage or a crash of the router itself

• Data redundancy
  To continue the outage measurement after a switchover
  To retain the outage data even if the RP is physically replaced

[Figure: a router with an active RP and a standby RP, each running COOL. On each RP, outage data held in RAM is copied to persistent outage data in NVRAM and flash. Labels between the RPs: Periodic Update, Event Driven Update.]


Outage Monitor MIB

CISCO-OUTAGE-MONITOR-MIB

iso.org.dod.internet.private.enterprise.cisco.ciscoMgmt.ciscoOutageMIB (1.3.6.1.4.1.9.9.280)

cOutageObjectTable: Object-Type; Object-Index; Object-Status; Object-AOT; Object-NAF

cOutageHistoryTable: Object-Type; Object-Index; Event-Reason-Index; Event-Time; Event-Interval

Event Reason Map Table (event description)

Remote Object Map Table (remote object description)

Process MIB Map

Object descriptions referenced from other MIBs:
  IF-MIB ifTable (interface object description)
  ENTITY-MIB entPhysicalTable (physical entity object description)
  CISCO-PROCESS-MIB cpmProcessTable (process object description)


Configuration

[Figure: COOL configuration and display within Cisco IOS. Recoverable labels:]

Config CLI: run; add; removal; filtering-enable (updates the COOL configuration and the customer equipment detection function)

Show CLI: show event-table; show object-table

MIB display: Object Table, Event Table


Enabling COOL

Obtain authorization file:

ari#dir
Directory of disk0:/

1 -rw- 19014056 Oct 29 2003 16:09:28 +00:00 gsr-k4p-mz.120-26.S.bin

128057344 bytes total (109051904 bytes free)
ari#copy tftp disk0:
Address or name of remote host []? 88.1.88.9
Source filename []? auth_file
Destination filename [auth_file]?
Accessing tftp://88.1.88.9/auth_file...
Loading auth_file from 88.1.88.9 (via FastEthernet1/2): !
[OK - 705 bytes]

705 bytes copied in 0.532 secs (1325 bytes/sec)

Enable COOL:

ari#clear cool per
ari#clear cool persist-files
ari#conf t
Enter configuration commands, one per line. End with CNTL/Z.
ari(config)#cool run
ari(config)#^Z
ari#wr mem
Building configuration...
[OK][OK][OK]


COOL

• Pros
  Accurate “network availability” for devices, components, and software
  Accounts for routing problems
  Implementation with low network overhead
  Enables correlation between active and passive availability methodologies

• Cons
  Only a few systems currently have the COOL feature
  Requires implementation in the router configurations of production devices
  Availability granularity limited by polling frequency
  New Cisco IOS feature


Network Availability Collection Methods

APPLICATION LAYER MEASUREMENT


Application Reachability: Similar to ICMP Reachability

• Method definition: a central workstation or computer is configured to send packets that mimic application packets

• How:
  Agents on client and server computers collect data
  Fire Runner, Ganymede Chariot, Gyra Research, Response Networks, Vital Signs Software, NetScout, custom application queries on customer systems
  Special probes installed on user and server subnets send, receive, and collect data; NikSun and NetScout

• Unavailability: pre-defined QoS definition


Application Reachability

• Pros
  Actual application availability can be understood
  QoS, by application, can be factored into the availability measurement

• Cons
  Depending on scale, potentially high overhead and cost can be expected

DATA COLLECTION FOR ROOT CAUSE ANALYSIS (RCA) OF NETWORK OR DEVICE DOWNTIME


Data Gathering Techniques: Cisco IOS Embedded RMON

• Alarm and event

• History and statistics

• Set thresholds in router configuration

• Configure SNMP trap to be sent when MIB variable rises above and/or falls below a given threshold

• Alleviates need for frequent polling

• Not an availability methodology by itself but can add valuable information and customization to the data collection method
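The rising/falling threshold idea behind RMON alarms can be illustrated in a few lines of Python. This is a simplified model, not IOS configuration and not the full RMON alarm state machine: the function `threshold_events`, the sample values, and the assumption that monitoring starts in the normal state are all made up for the example.

```python
def threshold_events(samples, rising, falling):
    """Return (index, kind) events using rising/falling threshold hysteresis.

    After a rising event, no further rising event is generated until the
    value has dropped to the falling threshold, and vice versa.
    """
    events = []
    armed_rising = True    # assume monitoring starts in the normal state
    armed_falling = False
    for i, value in enumerate(samples):
        if value >= rising and armed_rising:
            events.append((i, "rising"))
            armed_rising, armed_falling = False, True
        elif value <= falling and armed_falling:
            events.append((i, "falling"))
            armed_rising, armed_falling = True, False
    return events


# Hypothetical utilization samples, in percent.
utilization = [10, 50, 85, 90, 70, 95, 30, 20, 88]
events = threshold_events(utilization, rising=80, falling=40)
print(events)
```

Nine samples produce only three traps, which is why threshold-based notification reduces the need for frequent polling.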


Data Gathering Techniques: Syslog Messages

• Provide information on what the router is doing

• Categorized by feature and severity level

• User can configure Syslog logging levels

• User can configure Syslog messages to be sent as SNMP traps

• Not an availability methodology by itself but can add valuable information and customization to the data collection method


Expression and Event MIB

Expression MIB

• Allows you to create new SNMP objects based upon formulas

• MIB persistence is supported: a MIB’s SNMP data persists across reloads

• Delta and wildcard support allows you to:
  Calculate utilization for all interfaces with one expression
  Calculate errors as a percentage of traffic

Event MIB

• Allows you to create custom notifications and log them and/or send them as SNMP traps or informs

• MIB persistence is supported: a MIB’s SNMP data persists across reloads

• Can be used to test objects on other devices

• More flexible than RMON events/alarms
  RMON is tailored for use with counter objects


Data Gathering Techniques: Embedded Event Manager

• Underlying philosophy: embed intelligence in routers and switches to enable a scalable and distributed solution, with OPEN interfaces for NMS/EMS leverage of the features

• Mission statement: provide robust, scalable, powerful, and easy-to-use embedded managers to solve problems such as syslog and event management within Cisco routers and switches


Embedded Event Manager (Cont.)

• Development goal: predictable, consistent, scalable management
  Distributed
  Independent of central management system

• Control is in the customer’s hands
  Customization

• Local programmable actions:
  Triggered by specific events


Cisco IOS Embedded Event Manager: Basic Architecture (v1)

[Figure: event detectors feed EEM. The syslog event detector (syslog events), the SNMP event detector (SNMP data), and other event detectors (other events) feed the Embedded Event Manager, which uses EEM policies and network knowledge to trigger actions: notify, switch-over, reload.]


EEM Versions

• EEM Version 1
  Allows policies to be defined using the Cisco IOS CLI applet
  The following policy actions can be established:
    Generate prioritized syslog messages
    Generate a CNS event for upstream processing by Cisco CNS devices
    Reload the Cisco IOS software
    Switch to a secondary processor in a fully redundant hardware configuration

• EEM Version 2
  Adds programmable actions using the Tcl subsystem within Cisco IOS
  Includes more event detectors and capabilities


EEM Version 2 Architecture

• More event detectors!

• Define policies or “programmable local actions” using Tcl

• Register policy with EEM Server

• Events trigger policy execution

• Tcl extensions for CLI control and defined actions

[Figure: EEM Version 2 architecture. Event publishers: POSIX process manager, IOS process watchdog, syslog daemon, system manager, watchdog sysmon, HA redundancy facility, timer services, counters, interface counters and stats, SNMP. Event detectors, including an application-specific event detector, sit in the Embedded Event Manager server. Event subscribers: EEM policies (Tcl shell), which subscribe to receive events and implement policy actions, and IOS subsystems, which subscribe to receive application events and publish application events using the application-specific event detector.]


What Does This Mean to the Business?

• Better problem determination
  Widely applicable scripts from Cisco engineering and TAC
  Automated local action triggered by events
  Automated data collection

• Faster problem resolution
  Reduces the “next time it happens…please collect”
  Better diagnostic data to Cisco engineering
  Faster identification and repair

• Less downtime
  Reduce susceptibility and Mean Time to Repair (MTTR)

• Better service
  Responsiveness
  Prevent recurrence
  Higher availability

• Not an availability methodology by itself but can add valuable information and customization to the data collection method

INSTILLING AN AVAILABILITY CULTURE


Putting an Availability Program into Practice

• Track network availability

• Identify defects

• Identify root cause and implement fix

• Reduce operating expense by eliminating non-value-added work

How much does an outage cost today?

How much can I save through process and product enhancements?


How Do I Start?

1. What are you using now?
   a. Add or modify trouble ticketing analysis
   b. Add or improve active monitoring method

2. Process: analyze the data!
   a. What caused an outage?
   b. Can a root cause be identified and addressed?

3. Implement improvements or fixes

4. Measure the results

5. Back to step 1: are other metrics needed?


If You Have a Network Availability Method

• Use the current method and metric for improvement
  Don’t try to change completely
  Use incremental improvements
  Develop additional methods to gather data as identified

• Concentrate on understanding unavailability causes. All unavailability causes should be classified at a minimum under:
  Change, SW, HW, power/facility, or link

• Identify the actions to correct unavailability causes, e.g., network design, customer process change, or HW MTBF improvement


Multilayer Network Design

[Figure, repeated on four slides: a multilayer network with access, distribution, and core/backbone layers, a server farm, building block additions, and WAN, Internet, and PSTN connections. Each slide highlights where one measurement method applies:]

• SA Agent between access and distribution

• SA Agent between servers and WAN users

• COOL for high-end core devices

• Trouble ticketing methodology

AVAILABILITY MEASUREMENT SUMMARY


Summary

• The availability metric is governed by your business objectives

• The primary goals of availability measurement are:
  To provide an availability baseline (maintain)
  To help identify where to improve the network
  To monitor and control improvement projects

• Can you identify ‘Where you are now?’ for your network?

• Do you know ‘Where you are going?’ in terms of network-oriented business objectives?

• Do you have a plan to take you there?


Recommended Reading

• Performance and Fault Management

ISBN: 1-57870-180-5

• High Availability Network Fundamentals

ISBN: 1-58713-017-3

• Network Performance Baselining

ISBN: 1-57870-240-2

• The Practical Performance Analyst

ISBN: 0-07-912946-3


Recommended Reading (Cont.)

• The Visual Display of Quantitative Information, by Edward Tufte (ISBN: 0-9613921-0)

• Practical Planning for Network Growth, by John Blommers (ISBN: 0-13-206111-2)

• The Art of Computer Systems Performance Analysis, by Raj Jain (ISBN: 0-421-50336-3)

• Implementing Global Networked Systems Management: Strategies and Solutions, by Raj Ananthanpillai (ISBN: 0-07-001601-1)

• Information Systems in Organizations: Improving Business Processes, by Richard Maddison and Geoffrey Darnton (ISBN: 0-412-62530-X)

• Integrated Management of Networked Systems: Concepts, Architectures, and Their Operational Application, by Hegering, Abeck, Neumair (ISBN: 1558605711)


Appendix A: Acronyms

• AVG: Average
• ATM: Asynchronous Transfer Mode
• DPM: Defects Per Million
• FCAPS: Fault, Config, Acct, Perf, Security
• GE: Gigabit Ethernet
• HA: High Availability
• HDLC: High Level Data Link Control
• HSRP: Hot Standby Router Protocol
• IPM: Internet Performance Monitor
• IUM: Impacted User Minutes
• MIB: Management Information Base
• MTBF: Mean Time Between Failures
• MTTR: Mean Time to Repair
• RME: Resource Manager Essentials
• RMON: Remote Monitoring
• SA Agent: Service Assurance Agent
• SNMP: Simple Network Management Protocol
• SPF: Single Point of Failure; Shortest Path First (routing protocol)
• TCP: Transmission Control Protocol


BACKUP SLIDES


ADDITIONAL RELIABILITY SLIDES


Network Design: What Is Reliability?

• “Reliability” is often used as a general term that refers to the quality of a product
  Failure rate
  MTBF (Mean Time Between Failures) or MTTF (Mean Time to Failure)
  Availability


Reliability Defined

Reliability:

1. The probability of survival (or no failure) for a stated length of time

2. Or, the fraction of units that will not fail in the stated length of time

A “mission” time must be stated

Annual reliability is the probability of survival for one year


Availability Defined

Availability:

1. The probability that an item (or network, etc.) is operational, and ready-to-go, at any point in time

2. Or, the expected fraction of time it is operational. Annual uptime is the amount of time (in days, hrs., min., etc.) the item is operational in a year

Example: For 98% availability, the annual uptime is 0.98 * 365 days = 357.7 days


MTBF Defined

• MTBF stands for Mean Time Between Failures

• MTTF stands for Mean Time to Failure

This is the average length of time between failures (MTBF) or to a failure (MTTF)

More technically, it is the mean time to go from an operational state to a non-operational state

MTBF is usually used for repairable systems, and MTTF is used for non-repairable systems


How Reliable Is It?

• “MTBF reliability” is the reliability evaluated at t = MTBF:

$R = e^{-(\mathrm{MTBF}/\mathrm{MTBF})} = e^{-1} = 36.8\%$

• MTBF reliability is only about 37%; that is, about 63% of your HARDWARE fails before the MTBF!

• But remember, failures are still random!


MTTR Defined

• MTTR stands for Mean Time to Repair, or MRT (Mean Restore Time)

This is the average length of time it takes to repair an item

More technically, it is the mean time to go from a non-operational state to an operational state


One Method of Calculating Availability

• $\text{Availability} = \dfrac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}}$

• What is the availability of a computer with MTBF = 10,000 hrs. and MTTR = 12 hrs?

$A = 10000 \div (10000 + 12) = 99.88\%$


Uptime

• Annual uptime: 8,760 hrs/year × 0.9988 = 8,749.5 hrs

• Conversely, annual downtime: 8,760 hrs/year × (1 – 0.9988) = 10.5 hrs
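The availability and uptime arithmetic for the example computer (MTBF = 10,000 hrs., MTTR = 12 hrs.) can be reproduced with the sketch below. The function and variable names are illustrative. The code carries the unrounded availability through the calculation, so its results agree with the slide's figures only after rounding.

```python
HOURS_PER_YEAR = 8760


def availability(mtbf_hours, mttr_hours):
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


a = availability(10_000, 12)
annual_uptime = HOURS_PER_YEAR * a
annual_downtime = HOURS_PER_YEAR * (1 - a)

print(f"availability    = {a:.2%}")
print(f"annual uptime   = {annual_uptime:.1f} hrs")
print(f"annual downtime = {annual_downtime:.1f} hrs")
```

Expressing the result as hours of downtime per year is usually the more persuasive number: 99.88% sounds high, but it allows more than ten hours of outage annually.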


Systems

• Components “in-series”: Component 1 followed by Component 2

• Components “in-parallel” (redundant): Component 1 alongside Component 2

[Figure: reliability block diagrams (RBD) of the two arrangements]


In-Series

[Figure: up/down timelines for Part 1, Part 2, and the in-series system; the in-series system is up only while both parts are up]


In-Parallel

[Figure: up/down timelines for Part 1, Part 2, and the in-parallel system; the in-parallel system is down only while both parts are down]


In-Series MTBF

Component 1: MTBF = 2,500 hrs., MTTR = 10 hrs.

Component 2: MTBF = 2,500 hrs., MTTR = 10 hrs.

Component failure rate = 1/2500 = 0.0004

System failure rate = 0.0004 + 0.0004 = 0.0008

System MTBF = 1/(0.0008) = 1,250 hrs.
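For components in series, the failure rates add and the system MTBF is the reciprocal of the total. A minimal Python sketch of that rule follows, assuming constant failure rates as the slide does; the name `series_mtbf` is illustrative.

```python
def series_mtbf(component_mtbfs):
    """System MTBF for in-series components with constant failure rates."""
    system_failure_rate = sum(1.0 / m for m in component_mtbfs)
    return 1.0 / system_failure_rate


system_mtbf = series_mtbf([2500, 2500])
print(f"system MTBF = {system_mtbf:.0f} hrs")
```

The same function handles unequal components, for example `series_mtbf([2500, 10000])`, where the weaker component dominates the result.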


In-Series Reliability

Component 1: MTBF = 2,500 hrs., MTTR = 10 hrs.

Component 2: MTBF = 2,500 hrs., MTTR = 10 hrs.

Component ANNUAL reliability: $R = e^{-(8760/2500)} = 0.03$

System ANNUAL reliability: $R = 0.03 \times 0.03 = 0.0009$


In-Series Availability

Component 1: MTBF = 2,500 hrs., MTTR = 10 hrs.

Component 2: MTBF = 2,500 hrs., MTTR = 10 hrs.

Component availability: $A = 2500 \div (2500 + 10) = 0.996$

System availability: $A = 0.996 \times 0.996 = 0.992$
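Both in-series results (reliability and availability) follow the same rule: multiply the component values. The sketch below reproduces the slide figures; the helper names are assumptions, and the code keeps full precision, so it matches the slides after rounding.

```python
import math

HOURS_PER_YEAR = 8760


def annual_reliability(mtbf_hours):
    """Exponential reliability over one year: R = exp(-8760 / MTBF)."""
    return math.exp(-HOURS_PER_YEAR / mtbf_hours)


def availability(mtbf_hours, mttr_hours):
    return mtbf_hours / (mtbf_hours + mttr_hours)


def in_series(values):
    """A series system works only if every component works."""
    return math.prod(values)


r = annual_reliability(2500)
a = availability(2500, 10)
system_r = in_series([r, r])
system_a = in_series([a, a])

print(f"component R = {r:.2f}, system R = {system_r:.4f}")
print(f"component A = {a:.3f}, system A = {system_a:.3f}")
```

Every component added in series lowers both figures, which is why long chains of non-redundant devices are the first place to look for availability improvements.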


In-Parallel MTBF

Component 1: MTBF = 2,500 hrs.

Component 2: MTBF = 2,500 hrs.

System MTBF* = 2500 + 2500/2 = 3,750 hrs.

In general*, $\mathrm{MTBF}_{\text{system}} = \sum_{i=1}^{n} \dfrac{\mathrm{MTBF}}{i}$

*For 1-of-n redundancy of n identical components with NO repair or replacement of failed components


1-of-4 Example

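In general*, $\mathrm{MTBF}_{\text{system}} = \sum_{i=1}^{n} \dfrac{\mathrm{MTBF}}{i}$

For four components with MTBF = 2,500 hrs. each:

$\sum_{i=1}^{4} \dfrac{2500}{i} = \dfrac{2500}{1} + \dfrac{2500}{2} + \dfrac{2500}{3} + \dfrac{2500}{4} = 5{,}208$ hrs.

*For 1-of-n redundancy of n identical components with NO repair or replacement of failed components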
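The 1-of-n sum is easy to evaluate in code. This sketch applies the slide's formula under the same stated conditions (identical components, no repair or replacement); the function name `one_of_n_mtbf` is illustrative.

```python
def one_of_n_mtbf(component_mtbf, n):
    """System MTBF for 1-of-n redundancy: sum of MTBF / i for i = 1..n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return sum(component_mtbf / i for i in range(1, n + 1))


for n in (1, 2, 4):
    print(f"1-of-{n}: {one_of_n_mtbf(2500, n):,.0f} hrs")
```

Each extra component adds less than the one before it (2,500, then 1,250, then 833, then 625 hours), so redundancy without repair shows sharply diminishing returns.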


In-Parallel Reliability

Component 1: MTBF = 2,500 hrs., MTTR = 10 hrs.

Component 2: MTBF = 2,500 hrs., MTTR = 10 hrs.

Component ANNUAL reliability: $R = e^{-(8760/2500)} = 0.03$

System ANNUAL reliability: $R = 1 - [(1 - 0.03) \times (1 - 0.03)] = 1 - 0.94 = 0.06$, where $(1 - 0.03)$ is the unreliability of each component


In-Parallel Availability

Component 1: MTBF = 2,500 hrs., MTTR = 10 hrs.

Component 2: MTBF = 2,500 hrs., MTTR = 10 hrs.

Component availability: $A = 2500 \div (2500 + 10) = 0.996$

System availability: $A = 1 - [(1 - 0.996) \times (1 - 0.996)] = 1 - 0.000016 = 0.999984$, where $(1 - 0.996)$ is the unavailability of each component
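For redundant (in-parallel) components the rule is the complement of the series rule: multiply the unreliabilities or unavailabilities and subtract from one. The sketch below reproduces the two in-parallel slides; helper names are assumptions and the components are taken to fail independently.

```python
import math

HOURS_PER_YEAR = 8760


def annual_reliability(mtbf_hours):
    return math.exp(-HOURS_PER_YEAR / mtbf_hours)


def availability(mtbf_hours, mttr_hours):
    return mtbf_hours / (mtbf_hours + mttr_hours)


def in_parallel(values):
    """A parallel system fails only if every component fails."""
    return 1.0 - math.prod(1.0 - v for v in values)


r = annual_reliability(2500)
a = availability(2500, 10)
system_r = in_parallel([r, r])
system_a = in_parallel([a, a])

print(f"system annual reliability = {system_r:.2f}")
print(f"system availability       = {system_a:.6f}")
```

Redundancy barely helps annual reliability here (0.03 to 0.06) but lifts availability from 0.996 to better than four nines, because repair restores the failed component long before its partner is likely to fail.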


Complex Redundancy

[Figure: n components (1, 2, 3, ..., n) in “pure active parallel”, of which m must work (m-of-n)]

Examples: 1-of-2, 2-of-3, 2-of-4, 8-of-10
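The slide gives no formula for m-of-n redundancy. One common model, assuming n identical and independent components that are all active, is the binomial sum sketched below; the function name and the component reliability of 0.9 are assumptions for illustration.

```python
from math import comb


def m_of_n_reliability(m, n, r):
    """Probability that at least m of n identical, independent components work.

    r is the reliability of a single component over the mission time.
    """
    return sum(comb(n, k) * r**k * (1 - r) ** (n - k) for k in range(m, n + 1))


R_COMPONENT = 0.9  # hypothetical component reliability

for m, n in [(1, 2), (2, 3), (2, 4), (8, 10)]:
    print(f"{m}-of-{n}: {m_of_n_reliability(m, n, R_COMPONENT):.4f}")
```

The 1-of-2 case reduces to the familiar in-parallel result, 1 - (1 - r)^2, which is a useful sanity check on the general formula.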


More Complex Redundancy

• Pure active parallel: all components are on

• Standby redundant: backup components are not operating

• Perfect switching: switch-over is immediate and without fail

• Switchover reliability: the probability of switchover when it is not perfect

• Load sharing: all units are on and workload is distributed


Networks Consist of Series-Parallel

• Combinations of in-series and redundant components

[Figure: series-parallel block diagram with blocks A, B1 and B2 (1/2 redundancy), C, D1, D2, and D3 (2/3 redundancy), E, and F]


Failure Rate

• The number of failures per unit of time:
  Failures/hour
  Failures/day
  Failures/week
  Failures/$10^6$ hours
  Failures/$10^9$ hours ⇒ called “FITs” (“Failures in Time”)
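Converting between MTBF and FITs is a one-line calculation once a constant failure rate is assumed (failure rate = 1/MTBF). The helper names below are illustrative.

```python
HOURS_PER_BILLION = 10**9


def mtbf_to_fits(mtbf_hours):
    """Failures per 10^9 hours for a constant failure rate of 1 / MTBF."""
    return HOURS_PER_BILLION / mtbf_hours


def fits_to_mtbf(fits):
    return HOURS_PER_BILLION / fits


print(mtbf_to_fits(2_500))    # FITs for MTBF = 2,500 hrs
print(mtbf_to_fits(100_000))  # FITs for MTBF = 100,000 hrs
print(fits_to_mtbf(10_000))   # MTBF in hours for 10,000 FITs
```

FITs are convenient because series failure rates simply add: summing the FIT values of the parts gives the FIT value of the assembly.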


Approximating MTBF

• 13 units are tested in a lab for 1,000 hours with 2 failures occurring

• Another 4 units are tested for 6,000 hours with 1 failure occurring

• The failed units are repaired (or replaced)

• What is the approximate MTBF?

Approximating MTBF (Cont.)

• MTBF = (13 × 1,000 + 4 × 6,000) ÷ (1 + 2) = 37,000 ÷ 3 = 12,333 hours
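The same approximation, total unit-hours on test divided by total failures, can be written as a short Python sketch. The function name and the tuple layout are illustrative. It assumes, as the example states, that failed units are repaired or replaced so every unit accumulates its full test time, and it requires at least one failure.

```python
def approximate_mtbf(test_groups: list[tuple[int, float, int]]) -> float:
    """Total unit-hours on test divided by total failures.

    Each group is (units, hours_per_unit, failures).
    """
    total_hours = sum(units * hours for units, hours, _ in test_groups)
    total_failures = sum(failures for _, _, failures in test_groups)
    return total_hours / total_failures


groups = [(13, 1000, 2), (4, 6000, 1)]
print(f"MTBF ≈ {approximate_mtbf(groups):,.0f} hours")
```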

Modeling

Distributions
• Normal
• Log-Normal
• Weibull
• Exponential

[Figure: frequency vs. time-to-failure curves, with the MTBF marked on each]

Constant Failure Rate: The Exponential Distribution

• The exponential function: f(t) = λe^(-λt), t > 0

Failure rate, λ, IS CONSTANT

λ = 1/MTBF

• If MTBF = 2,500 hrs., what is the failure rate?

• λ = 1/2500 = 0.0004 failures/hr.
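One way to see that λ = 1/MTBF is to draw exponentially distributed times-to-failure and check that their average lands near the MTBF. This is a small simulation sketch using Python's standard library; the seed and sample size are arbitrary choices.

```python
import random
from statistics import mean

MTBF_HOURS = 2500
failure_rate = 1 / MTBF_HOURS  # λ, in failures per hour

rng = random.Random(2201)  # fixed seed so the run is repeatable
times_to_failure = [rng.expovariate(failure_rate) for _ in range(100_000)]

print(f"λ = {failure_rate} failures/hr")
print(f"sample mean time-to-failure ≈ {mean(times_to_failure):.0f} hrs")
```

The sample mean will not equal 2,500 exactly, but with this many samples it should fall within a few percent of it.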

The “Bathtub” Curve

[Figure: failure rate plotted against time, in three periods]

• Infant Mortality: DECREASING Failure Rate
• “Useful Life” Period: CONSTANT Failure Rate
• Wear-Out: INCREASING Failure Rate

The Exponential Reliability Formula

• Commonly used for electronic equipment

• The exponential reliability formula:

R(t) = e^(-λt) or R(t) = e^(-t/MTBF)

Calculating Reliability

• A certain Cisco router has an MTBF of 100,000 hrs; what is the annual reliability?

Annual reliability is the reliability for one year or 8,760 hrs

R = e^-(8760/100000) = 91.6%

• This says that the probability of no failure in one year is 91.6%; or, 91.6% of all units will survive one year
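The exponential reliability formula is easy to evaluate for any MTBF. The sketch below (function name illustrative) reproduces the 100,000-hour router example and, for contrast, the 2,500-hour component from the in-parallel examples.

```python
from math import exp

HOURS_PER_YEAR = 8760


def reliability(t_hours: float, mtbf_hours: float) -> float:
    """R(t) = e^(-t/MTBF): probability of no failure during t hours."""
    return exp(-t_hours / mtbf_hours)


for mtbf in (100_000, 2_500):
    r = reliability(HOURS_PER_YEAR, mtbf)
    print(f"MTBF {mtbf:>7,} hrs -> annual reliability {r:.1%}")
```

A 40-fold drop in MTBF takes the annual reliability from about 91.6% to about 3%, which shows how sharply the exponential term falls once t exceeds the MTBF.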

ADDITIONAL TROUBLE TICKETING SLIDES

Essential Data Elements

Parameter | Format | Description
Date | dd/mmm/yy | Date Ticket Issued
Ticket | Alphanumeric | Trouble Ticket Number
Start Date | dd/mmm/yy | Date of Fault
Start Time | hh:mm | Time of Fault
Resolution Date | dd/mmm/yy | Date of Resolution
Resolution Time | hh:mm | Time of Resolution
Customers Impacted | Integer | Number of Customers that Lost Service; Number Impacted or Names of Customers Impacted
Problem Description | String | Outline of the Problem
Root Cause | String | HW, SW, Process, Environmental, etc.
Component/Part/SW Version | Alphanumeric | For HW Problems Include Product ID; for SW Include Release Version
Type | Planned/Unplanned | Identify if the Event Was Due to Planned Maintenance Activity or Unplanned Outage
Resolution | String | Description of Action Taken to Fix the Problem

Note: The above is the minimum data set; however, if other information is captured, it should be provided.
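As an illustration of how these data elements feed an MTTR figure, the sketch below models a subset of the fields and averages the outage duration of unplanned tickets. The class, the field names, and the two sample tickets are hypothetical. The dd/mmm/yy dates are parsed with `%d/%b/%y`, which assumes English month abbreviations in the current locale.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class TroubleTicket:
    ticket: str
    start_date: str        # dd/mmm/yy
    start_time: str        # hh:mm
    resolution_date: str   # dd/mmm/yy
    resolution_time: str   # hh:mm
    customers_impacted: int
    root_cause: str
    outage_type: str       # "Planned" or "Unplanned"

    def outage(self) -> timedelta:
        """Time from the start of the fault to its resolution."""
        fmt = "%d/%b/%y %H:%M"
        start = datetime.strptime(f"{self.start_date} {self.start_time}", fmt)
        end = datetime.strptime(f"{self.resolution_date} {self.resolution_time}", fmt)
        return end - start


tickets = [  # hypothetical records
    TroubleTicket("TT-0001", "03/May/04", "22:15", "04/May/04", "01:45", 12, "HW", "Unplanned"),
    TroubleTicket("TT-0002", "10/May/04", "09:00", "10/May/04", "09:30", 3, "SW", "Unplanned"),
]
unplanned = [t.outage() for t in tickets if t.outage_type == "Unplanned"]
mttr = sum(unplanned, timedelta()) / len(unplanned)
print(f"MTTR over {len(unplanned)} unplanned tickets: {mttr}")
```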

HA Metrics/NAIS Synergy

Trouble Tickets
• Definitions
• Data accuracy
• Collection processes

Data Analysis
• Baseline availability
• Determine DPM (Defects Per Million) by:
    Planned/Unplanned
    Root Cause
    Resolution
    Equipment
• MTTR

Operational Process and Procedures Analysis
• Network reliability improvement analysis
• Problem management
• Fault management
• Resiliency assessment
• Change management
• Performance management
• Availability management

[Figure: flows between the boxes, labeled “Analyzed Trouble Ticket Data”, “Referral for Process/Procedural Improvement”, and “Referral for Analysis”]

ADDITIONAL SA AGENT SLIDES

SA Agent: How It Works

1. User configures Collectors through Mgmt Application GUI

2. Mgmt Application provisions Source routers with Collectors

3. Source router measures and stores performance data, e.g.:
    Response time
    Availability

4. Source router evaluates SLAs, sends SNMP Traps

5. Source router stores latest data point and 2 hours of aggregated points

6. Application retrieves data from Source routers once an hour

7. Data is written to a database

8. Reports are generated

[Figure: Management Application and SA Agent, connected by SNMP]

SAA Monitoring IP Core

[Figure: nodes R1, R2, and R3 and P1, P2, and P3 around the IP Core, with the Management System]

Monitoring Customer IP Reachability

P1-Pn Service Assurance Agent ICMP Polls to a Test Point in the IP Core

[Figure: P1 through PN, each attached to a customer network (Nw1 through NwN), polling test points TP1 through TPx]

Service Assurance Agent Features

• Measures Service Level Agreement (SLA) metrics
    Packet Loss
    Response time
    Throughput
    Availability
    Jitter

• Evaluates SLAs

• Proactively sends notification of SLA violations

SA Agent Impact on Devices

• Low impact on CPU utilization

• 18k memory per SA agent

• SAA rtr low-memory

Monitored Network Availability Calculation

• Not calculated:
    Already have availability baseline
    Fault type, frequency and downtime may be more useful
    Faults directly measured from management system(s)

Monitored Network Availability Assumptions

• All connections below IP are fixed

• Management systems can be notified of all fixed connection state changes

• All (L2) events impact on IP (L3) service

ADDITIONAL COOL SLIDES

CLIs

Configuration CLI Commands

[no] cool run <cr>
[no] cool interface interface-name(idb) <cr>
[no] cool physical-FRU-entity entity-index (int) <cr>
[no] cool group-interface group-objectID(string) <cr>
[no] cool add-cpu objectID threshold duration <cr>
[no] cool remote-device dest-IP(paddr) obj-descr(string) rate(int) repeat(int) [local-ip(paddr) mode(int) ]<cr>
[no] cool if-filter group-objectID (string)<cr>

Display CLI Commands

Router#show cool event-table [<number of entries>]    displays all if not specified
Router#show cool object-table [<object-type(int)>]    displays all object types if not specified
Router#show cool fru-entity

Exec CLI Commands

Router#clear cool event-table
Router#clear cool persistent-files

Measurement Example: Router Device Outage

Reload (Operational), Power Outage, or Device H/W failure

Type: interface(1), physicalEntity(2), Process(3), and remoteObject(4).
Index: the corresponding MIB table index. If it is PhysicalEntity(2), index in the ENTITY-MIB.
Status: Up (1) Down (2).
Last-change: last object status change time.
AOT: Accumulated Outage Time (sec).
NAF: Number of Accumulated Failures.
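AOT and NAF are enough to derive availability, MTTR, and MTBF for a monitored object over a known measurement period. The sketch below is one plausible way to do that, not a COOL feature: the class, the function, and the 30-day period are assumptions. MTBF is taken as uptime per failure so that the result stays consistent with A = MTBF ÷ (MTBF + MTTR).

```python
from dataclasses import dataclass


@dataclass
class OutageCounters:
    aot_seconds: int  # AOT: Accumulated Outage Time
    naf: int          # NAF: Number of Accumulated Failures


def summarize(counters: OutageCounters, period_seconds: int) -> dict[str, float]:
    """Derive availability, MTTR, and MTBF for one measurement period."""
    uptime = period_seconds - counters.aot_seconds
    result = {"availability": uptime / period_seconds}
    if counters.naf:  # MTTR and MTBF are undefined when nothing has failed
        result["mttr_seconds"] = counters.aot_seconds / counters.naf
        result["mtbf_seconds"] = uptime / counters.naf
    return result


# Hypothetical: 41 seconds of outage from one failure during a 30-day period.
summary = summarize(OutageCounters(aot_seconds=41, naf=1), period_seconds=30 * 24 * 3600)
print(summary)
```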

Measurement Example: Cisco IOS S/W Outage

Standby RP in Slot 0 Crash Using “Address Error (4) Test Crash”; AdEL Exception; It Is Caused Purely by Cisco IOS S/W

Standby RP Crash Using “Jump to Zero (5) Test Crash”; Bp Exception; It Can Be Caused by S/W, H/W, or Operation

Measurement Example: Linecard Outage

Add a Linecard

Reset the Linecard

Down Event Captured Up Event Captured

AOT and NAF Updated

Measurement Example: Interface Outage

1. Configure to Monitor All the Interfaces Whose Names Include the “ATM2/0.” String, Except ATM2/0.3

12406-R1202(config)#cool group-interface ATM2/0.
12406-R1202(config)#no cool group-interface ATM2/0.3

2. Object Table

sh cool object 1 | include ATM2/0.
33 1 1054859087 0 0 0 ATM2/0.1
35 1 1054859088 0 0 0 ATM2/0.2
39 1 1054859090 0 0 0 ATM2/0.4
41 1 1054859090 0 0 0 ATM2/0.5

3. Shut ATM2/0 Interface Down; Down Event Captured

12406-R1202(config)#interface ATM2/0
12406-R1202(config-if)#shut
show cool event-table
**** COOL Event Table ****
type index event time-stamp interval hist_id object-name
1 33 1 1054859105 18 1 ATM2/0.1
1 35 1 1054859106 18 2 ATM2/0.2
1 39 1 1054859107 17 3 ATM2/0.4
1 41 1 1054859108 18 4 ATM2/0.5

4. No Shut ATM2/0 Interface; Up Event Captured

12406-R1202(config)#interface ATM2/0
12406-R1202(config-if)#no shut
show cool event-table
**** COOL Event Table ****
type index event time-stamp interval hist_id object-name
1 33 0 1054859146 41 1 ATM2/0.1
1 35 0 1054859147 41 2 ATM2/0.2
1 39 0 1054859149 42 3 ATM2/0.4
1 41 0 1054859150 42 4 ATM2/0.5

5. Object Table Shows AOT and NAF

sh cool object 1 | include ATM2/0.
33 1 1054859087 0 41 1 ATM2/0.1
35 1 1054859088 0 41 1 ATM2/0.2
39 1 1054859090 0 42 1 ATM2/0.4
41 1 1054859090 0 42 1 ATM2/0.5
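The AOT values in the final object table can be cross-checked against the event tables: each outage is the up-event time-stamp minus the down-event time-stamp for the same object. The sketch below does that for the rows in this example. The parsing helper is illustrative and assumes the seven-column event-table layout shown in the output (type, index, event, time-stamp, interval, hist_id, object-name).

```python
def parse_events(text: str) -> dict[str, int]:
    """Map object-name to time-stamp for the rows of a COOL event table."""
    stamps = {}
    for line in text.strip().splitlines():
        fields = line.split()
        # Columns: type index event time-stamp interval hist_id object-name
        stamps[fields[6]] = int(fields[3])
    return stamps


DOWN_EVENTS = """
1 33 1 1054859105 18 1 ATM2/0.1
1 35 1 1054859106 18 2 ATM2/0.2
1 39 1 1054859107 17 3 ATM2/0.4
1 41 1 1054859108 18 4 ATM2/0.5
"""
UP_EVENTS = """
1 33 0 1054859146 41 1 ATM2/0.1
1 35 0 1054859147 41 2 ATM2/0.2
1 39 0 1054859149 42 3 ATM2/0.4
1 41 0 1054859150 42 4 ATM2/0.5
"""

down, up = parse_events(DOWN_EVENTS), parse_events(UP_EVENTS)
outage_seconds = {name: up[name] - down[name] for name in down}
for name, seconds in outage_seconds.items():
    print(f"{name}: {seconds} s")
```

The differences match both the interval column of the up events and the accumulated outage time reported for each subinterface in the final object table.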

Measurement Example: Remote Device Outage

3 Remote Devices Are Added

12406-R1202(config)#cool remote-device 1 50.1.1.2 remobj.1 30 2 50.1.1.1 1
12406-R1202(config)#cool remote-device 2 50.1.2.2 remobj.2 30 2 50.1.2.1 1
12406-R1202(config)#cool remote-device 3 50.1.3.2 remobj.3 30 2 50.1.3.1 1

Object Table

sh cool object-table 4 | include remobj
1 1 1054867061 0 0 remobj.1
2 1 1054867063 0 0 remobj.2
3 1 1054867065 0 0 remobj.3

Shut Down the Interface Link Between the Remote Device and Router

12406-R1202(config)#interface ATM2/0
12406-R1202(config-if)#shut

Down Event Captured

4 2 5 1054867105 42 2 remobj.2
4 1 5 1054867108 47 3 remobj.1
4 3 5 1054867130 65 10 remobj.3

No Shut the Interface Link

12406-R1202(config)#interface ATM2/0
12406-R1202(config-if)#no shut

Up Event Captured

4 1 4 1054867171 63 1 remobj.1
4 3 4 1054867193 63 8 remobj.3
4 2 4 1054867200 95 10 remobj.2

Object Table Shows AOT and NAF

sh cool object-table 4 | include remobj
1 1 1054867061 63 1 remobj.1
2 1 1054867063 63 1 remobj.2
3 1 1054867065 95 1 remobj.3