Fault Isolation and service quality assurance in a 10GbE redundant grid infrastructure

Nikos Trikoupis, Infrastructure and Operations, CERN IT/CS

4th TERENA NRENs and Grids Workshop



Agenda

Requirements and challenges for Monitoring in the CERN network

Fault Isolation essentials

Service Quality Assurance


Introduction

CERN’s “business processes” rely on network services availability.

Our mission is to deliver and manage an infrastructure that is reliable yet capable of sustaining a high rate of change.

Being the LCG Tier-0 network provider is a huge responsibility.


Facts and Challenges

A collaborative environment with highly complex applications

Network redesign: multi-10GbE core, 10GbE to the farms

The 10G WAN PHY standard allows for WAN connectivity at LAN speeds and the use of the same management tools.

For the first time, the barrier between Campus and WAN disappears.


CERN Network overview

[Figure: CERN network layout (Network Layout: Marc Collignon, IT/CS/IO). Labels in the diagram include the LCG, GPN, TN and CNIC networks, the Tier1s, LHCopn, the Meyrin and Prevessin sites, the LHC, the firewall, the external (EXT) and Internet connections, the farms, and the SPECTRUM primary and hot-standby servers with OneClick "Primary" and "Secondary".]


Pressures on Network Management

Chaotic, Unpredictable Traffic Patterns

Increased Demand

Restrictions in budget and personnel

New Types of network and user equipment

More and Higher-Speed Bandwidth Choices


It’s the Network’s Fault!

Physicists

Technical Services System Managers

Application Developers

Managing the Network Foundation

Copyright © 2003


Defending the Network

Faults may occur, but:

They have to be detected as quickly as possible, before users notice them.

The cause of the fault has to be identified so that corrective action may be taken.

This task has to be performed by operations on a 24x7 basis.

Time To Repair must be reduced.

Faults must be prioritized based on impact.

The size and complexity of the network infrastructure dictate the use of automated network management tools.


Requirements

Event filtering, deduplication, suppression and correlation

Automated network discovery and update

Accurate Layer 2 and Layer 3 network topology, including redundancy and routing protocols

Network Root Cause Analysis
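The deduplication requirement can be illustrated with a minimal sketch. It is not taken from the slides or from any CERN tool: the `Event` fields, the 60-second window and the device names are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds, hypothetical time base
    device: str       # hypothetical device name
    kind: str         # e.g. "link_down"


def deduplicate(events, window=60.0):
    """Keep an event only if no event with the same (device, kind)
    was kept during the previous `window` seconds."""
    last_kept = {}
    kept = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        key = (ev.device, ev.kind)
        previous = last_kept.get(key)
        if previous is None or ev.timestamp - previous >= window:
            kept.append(ev)
            last_kept[key] = ev.timestamp
    return kept


events = [
    Event(0.0, "switch-a", "link_down"),
    Event(5.0, "switch-a", "link_down"),   # repeat inside the window, dropped
    Event(10.0, "switch-b", "link_down"),
    Event(90.0, "switch-a", "link_down"),  # outside the window, kept
]
unique = deduplicate(events)
```

A fixed time window is only one possible policy; a real system might instead keep a counter on the first event and clear it when the condition ends.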


Traditional Procedure

Without Network Root Cause Analysis

[Figure: diagram with numerous "?" markers, annotated "How to prioritize?", "Which to fix?" and "Oops!"]


Pinpointing the root cause

An alarm displayed in CERN’s alarm manager is the result of a fault isolation process, Root Cause Analysis.

ONE alarm displayed for one problem

symptomatic faults suppressed

operational procedures are followed to complete the troubleshooting process
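The idea of showing one alarm and suppressing symptomatic faults can be sketched with a topology walk. This is an illustrative simplification, not the algorithm used by CERN's alarm manager: the topology, the device names and the rule "a silent device next to a responding one is the root-cause candidate" are assumptions.

```python
from collections import deque


def isolate_faults(topology, station, unreachable):
    """Split unreachable devices into root-cause candidates and symptoms."""
    unreachable = set(unreachable)
    reachable = {station}
    queue = deque([station])
    candidates = set()
    # Walk outwards from the monitoring station through responding devices only.
    while queue:
        node = queue.popleft()
        for neighbour in topology[node]:
            if neighbour in unreachable:
                # Silent device adjacent to a responding one: likely root cause.
                candidates.add(neighbour)
            elif neighbour not in reachable:
                reachable.add(neighbour)
                queue.append(neighbour)
    # Everything else that is silent sits behind a candidate: a symptom.
    symptoms = unreachable - candidates
    return candidates, symptoms


# Hypothetical topology: station -> core -> dist -> eight access switches.
access = [f"access-{i}" for i in range(1, 9)]
topology = {
    "station": ["core"],
    "core": ["station", "dist"],
    "dist": ["core"] + access,
}
for name in access:
    topology[name] = ["dist"]

candidates, symptoms = isolate_faults(topology, "station", {"dist", *access})
```

In this example nine devices stop responding, yet only `dist` is reported as a candidate; the eight access switches behind it are classified as symptoms and would be suppressed.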


A failure scenario: device fault

One problem, but 9 "device unreachable" alarms!


A failure scenario: Loss of redundancy

Undetected problem?
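A loss of redundancy can go unnoticed because every device still answers polls. One way to surface it, sketched here under assumed data structures, is to compare the uplinks each device is expected to have with the links currently up. The device names, link names and dictionaries below are hypothetical.

```python
def redundancy_alarms(expected_uplinks, link_is_up):
    """Report devices that are still connected but have lost
    one or more of their expected redundant uplinks."""
    alarms = {}
    for device, uplinks in expected_uplinks.items():
        working = [link for link in uplinks if link_is_up.get(link, False)]
        # Still reachable (at least one uplink) but no longer fully redundant.
        if 0 < len(working) < len(uplinks):
            alarms[device] = sorted(set(uplinks) - set(working))
    return alarms


expected_uplinks = {
    "farm-switch-1": ["fs1-to-core-a", "fs1-to-core-b"],
    "farm-switch-2": ["fs2-to-core-a", "fs2-to-core-b"],
}
link_is_up = {
    "fs1-to-core-a": True,
    "fs1-to-core-b": False,  # redundant path lost, device still reachable
    "fs2-to-core-a": True,
    "fs2-to-core-b": True,
}
alarms = redundancy_alarms(expected_uplinks, link_is_up)
```

The sketch relies on an accurate record of the intended topology, which is why automated discovery of Layer 2 and Layer 3 redundancy matters for this scenario.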


On Requirement 4: SPECTRUM in the device failure scenario


On Requirement 4: SPECTRUM in the loss of redundancy scenario


Distinguishing events and meaningful alarms

[Figure: Event Management System. Labels: API, Syslog, SpectroWATCHES, Traps; IMT; Event Rules; Condition Correlation; Alarm/Event DB.]

Event Management System allows configuration, creation and control of traps, events and alarms

~25.000 events a day (30 Nov 06)
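With tens of thousands of events a day, most events should never become alarms. The sketch below shows one generic kind of event rule: a "link down" event is raised as an alarm only if no matching "link up" follows within a hold time. This is not the rule syntax of the Event Management System shown on the slide; the tuple layout, the event names and the 30-second hold time are assumptions.

```python
def unresolved_link_downs(events, hold_time=30.0):
    """Return (device, time) for each link_down that is not cleared
    by a link_up on the same device within `hold_time` seconds."""
    # Index the link_up times per device.
    ups = {}
    for timestamp, device, kind in events:
        if kind == "link_up":
            ups.setdefault(device, []).append(timestamp)

    alarms = []
    for timestamp, device, kind in events:
        if kind != "link_down":
            continue
        cleared = any(
            timestamp <= up <= timestamp + hold_time
            for up in ups.get(device, [])
        )
        if not cleared:
            alarms.append((device, timestamp))
    return alarms


events = [
    (0.0, "switch-a", "link_down"),
    (4.0, "switch-a", "link_up"),     # flap: cleared within the hold time
    (10.0, "switch-b", "link_down"),  # never cleared
    (100.0, "switch-a", "link_down"),
    (200.0, "switch-a", "link_up"),   # cleared too late
]
alarms = unresolved_link_downs(events)
```

Here five events yield two alarms; the short flap on `switch-a` stays in the event log without reaching the operator.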


To do the job, our monitoring system must:

1. understand topology and relationships.
2. work across multiple-vendor and technology solutions.
3. distinguish between a plethora of events and meaningful alarms.
4. quickly pinpoint the root cause and suppress all symptomatic faults.
5. help prioritize based on impact.
6. be fault-tolerant.


Scope of the Service Quality Assurance Problem: From Silos to Services

[Figure: service dependency diagram. Components: Web Server, Application Servers, Redundant Servers, Database Server, Network Connection, DNS, Other Required Services, Request Service. Monitored items: CPU, memory and disk; the Apache.exe process; SQL processes; log file parsing and SQL log file parsing; user response time (HTTP test); response time (TCP test).]
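Moving from silos to services means combining many component checks into a single service status. A minimal sketch of such a roll-up follows; the component names, the three status values and the rule for redundant servers are assumptions for illustration, not the configuration used at CERN.

```python
def service_status(required, redundant_groups, is_up):
    """Combine component checks into "up", "degraded" or "down"."""
    # Any failed non-redundant component takes the whole service down.
    if not all(is_up[name] for name in required):
        return "down"
    status = "up"
    for group in redundant_groups:
        working = sum(1 for name in group if is_up[name])
        if working == 0:
            return "down"        # every member of a redundant group failed
        if working < len(group):
            status = "degraded"  # service runs, but redundancy is reduced
    return status


required = ["dns", "network", "database"]
redundant_groups = [["web-1", "web-2"]]
checks = {
    "dns": True,
    "network": True,
    "database": True,
    "web-1": True,
    "web-2": False,
}
status = service_status(required, redundant_groups, checks)
```

With one of the two web servers failing its check, the service is reported as degraded rather than down, which keeps the alarm priority in line with the actual user impact.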


Availability and Performance Monitoring Best Practices

Don't fall into the trap of collecting all possible data.

Focus on key metrics that are indicators of total end-to-end service quality.

Automate!


Configuring Services: The Basics


Service Dashboard

The tool to provide real time service status and statistics

Allows at-a-glance understanding of how well the services are running, and of problems and status.

Provides transparency towards users and IT management.

Status and statistics are exported to perfSONAR and MonALISA as well as other databases and alarm systems.

Summary: Current Service Status, Current "Customer" Status

General Details: MTTR, MTBF, % Uptime, Downtime, Degraded

Outage Details: Duration, Cause, Troubleshooter, Impact
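The dashboard figures MTTR, MTBF and % uptime can be derived from a list of outage intervals. The sketch below uses one common set of definitions (MTTR as total downtime divided by the number of outages, MTBF as total uptime divided by the number of outages); the slides do not say which definitions the dashboard uses, and the sample numbers are invented.

```python
def availability_stats(outages, period_start, period_end):
    """Compute MTTR, MTBF and percent uptime from (start, end) outage
    intervals, all in hours. Outages are assumed to lie inside the
    reporting period and not to overlap."""
    period = period_end - period_start
    downtime = sum(end - start for start, end in outages)
    uptime = period - downtime
    count = len(outages)
    return {
        "mttr": downtime / count if count else 0.0,
        "mtbf": uptime / count if count else float("inf"),
        "uptime_percent": 100.0 * uptime / period,
    }


# Hypothetical 30-day period (720 hours) with a 2-hour and a 4-hour outage.
stats = availability_stats([(100.0, 102.0), (500.0, 504.0)], 0.0, 720.0)
```

For this sample the result is an MTTR of 3 hours and an MTBF of 357 hours; time spent in a degraded state would need its own interval list and is left out of the sketch.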


Conclusions

Monitoring is essential for robust network operation.

Root Cause Analysis enables quick reactions to faults.

Transparent, real-time reporting and information exchange demonstrate service quality and gain the trust of users and collaborators.

Focus on collecting and storing relevant data.


Thank you!

Q & A

http://cern.ch/monitoring