Fabric monitoring for LCG-1 in the CERN Computer Center. Jan van Eldik, CERN-IT/FIO/SM. 7th GridPP Collaboration meeting (GridPP7, June 30 – July 2, 2003), July 1, 2003.

GridPP7 – June 30 – July 2, 2003 – Fabric monitoring– n° 1

Fabric monitoring for LCG-1 in the CERN Computer Center

Jan van Eldik

CERN-IT/FIO/SM

7th GridPP Collaboration meeting

July 1, 2003

Outline

• Fabric monitoring developments at CERN

• Architectural overview

• Deployment: status & plans for LCG-1

• Outlook

Fabric Monitoring at CERN

• Improved fabric management is a key part of the LCG programme

• EDG WP4 develops tools for automated installation, configuration, fabric monitoring, fault tolerance

• IT/FIO Supervision & Monitoring section: develop and deploy a monitoring solution for the LHC era

• A lot of expertise: EDG WP4 monitoring developments, PVSS SCADA studies, SNMP studies, operator alarm displays, …

• Architecture based on functional requirements gathered by the PEM project

• Important objective: fabric monitoring for LCG-1 at CERN

Requirements and architecture

[Architecture diagram. Components shown: monitored nodes with sensors, a Monitoring Sensor Agent, a cache and local consumers; a Measurement Repository with a database and global consumers.]

• Both for performance and exception monitoring

• Local and global consumers

• Scalable, extensible, robust

EDG WP4 implementation

[Architecture diagram. Components shown: monitored nodes with sensors, a Monitoring Sensor Agent (MSA), a cache and local consumers; a Measurement Repository (MR) with a database and global consumers.]

Monitoring Sensor Agent
• Calls plug-in sensors to sample configured metrics
• Stores all collected data in a local disk buffer
• Sends the collected data to the global repository

Plug-in sensors
• Programs/scripts that implement a simple sensor-agent ASCII text protocol
• A C++ interface class is provided on top of the text protocol to facilitate implementation of new sensors
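The slides do not spell out the sensor-agent text protocol, so the sketch below invents one purely for illustration: the agent sends a line `SAMPLE <metric>` and the sensor answers `OK <metric> <unix-time> <value>` or `ERR <reason>`. The request and reply syntax, the metric names and the helper functions are all assumptions, not the WP4 protocol.

```python
import os
import time
from typing import Callable, Dict, TextIO


def load_average() -> float:
    """1-minute load average, or 0.0 where the platform cannot report it."""
    try:
        return os.getloadavg()[0]
    except (AttributeError, OSError):
        return 0.0


# Hypothetical metric table: metric name -> function returning its value.
METRICS: Dict[str, Callable[[], float]] = {
    "LoadAvg": load_average,
    "NumCpus": lambda: float(os.cpu_count() or 0),
}


def handle_request(line: str, now: float) -> str:
    """Answer one request line of the assumed protocol."""
    parts = line.split()
    if len(parts) != 2 or parts[0] != "SAMPLE":
        return "ERR bad-request"
    name = parts[1]
    if name not in METRICS:
        return f"ERR unknown-metric {name}"
    return f"OK {name} {int(now)} {METRICS[name]()}"


def serve(inp: TextIO, out: TextIO) -> None:
    """Read requests line by line and write one reply line per request."""
    for line in inp:
        out.write(handle_request(line, time.time()) + "\n")
        out.flush()
```

A real plug-in would call something like `serve(sys.stdin, sys.stdout)` so that the agent can drive it over a pipe; a line-based text exchange of this kind is what makes it possible to write sensors as plain scripts.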

The local cache
• Ensures data is collected even when the node cannot connect to the network
• Allows for node autonomy for local repairs
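The sample, buffer and forward behaviour described for the agent and its local cache can be sketched roughly as follows. This is a minimal store-and-forward illustration, not the MSA implementation: the class name `BufferedAgent`, the JSON-lines spool file and the `send` callback are assumptions made for the example.

```python
import json
import os
from typing import Callable, Dict, List

Sample = Dict[str, object]


class BufferedAgent:
    """Toy agent: sample, append to a local spool file, then try to forward."""

    def __init__(
        self,
        spool_path: str,
        sensors: Dict[str, Callable[[], float]],
        send: Callable[[List[Sample]], None],
    ) -> None:
        self.spool_path = spool_path
        self.sensors = sensors   # metric name -> callable returning a value
        self.send = send         # assumed to raise OSError when the repository is unreachable
        self.sent_upto = 0       # number of spooled records already forwarded

    def sample_once(self, timestamp: float) -> None:
        # Write to disk first, so the data survives a network outage.
        with open(self.spool_path, "a", encoding="utf-8") as spool:
            for name, read in self.sensors.items():
                record = {"metric": name, "time": timestamp, "value": read()}
                spool.write(json.dumps(record) + "\n")

    def pending(self) -> List[Sample]:
        """Records that are on disk but not yet forwarded."""
        if not os.path.exists(self.spool_path):
            return []
        with open(self.spool_path, encoding="utf-8") as spool:
            lines = spool.readlines()
        return [json.loads(line) for line in lines[self.sent_upto:]]

    def flush(self) -> bool:
        """Try to forward everything pending; return False if the send failed."""
        batch = self.pending()
        if not batch:
            return True
        try:
            self.send(batch)
        except OSError:
            return False         # keep the records in the spool and retry later
        self.sent_upto += len(batch)
        return True
```

Because every sample reaches the spool before any network call is made, a local consumer on the node could still read recent data while the repository is unreachable.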

Transport
• Transport is pluggable
• Two protocols, one over UDP and one over TCP, are currently supported; only the latter can guarantee delivery
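One way to picture a pluggable transport is a small interface with a UDP and a TCP implementation behind it. The sketch below is an assumption about the shape of such a design, not the WP4 wire format: the class names, the `make_transport` factory and the 4-byte length prefix are all hypothetical.

```python
import socket
import struct
from abc import ABC, abstractmethod


class Transport(ABC):
    """What the agent needs from any transport: ship one payload."""

    @abstractmethod
    def send(self, payload: bytes) -> None:
        ...


class UdpTransport(Transport):
    """Fire-and-forget datagrams: cheap, but a lost packet is simply lost."""

    def __init__(self, host: str, port: int) -> None:
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def send(self, payload: bytes) -> None:
        self.sock.sendto(payload, self.addr)


class TcpTransport(Transport):
    """Stream connection: TCP retransmits lost segments and a broken
    connection surfaces as an exception, so the caller can retry."""

    def __init__(self, host: str, port: int, timeout: float = 5.0) -> None:
        self.sock = socket.create_connection((host, port), timeout=timeout)

    def send(self, payload: bytes) -> None:
        # Length-prefix each message so the receiver can split the byte stream.
        self.sock.sendall(struct.pack("!I", len(payload)) + payload)


# Hypothetical registry: the transport is picked by name from configuration.
TRANSPORTS = {"udp": UdpTransport, "tcp": TcpTransport}


def make_transport(kind: str, host: str, port: int) -> Transport:
    return TRANSPORTS[kind](host, port)
```

The agent code only ever calls `send`, so swapping UDP for TCP (or adding a third protocol) is a configuration change rather than a code change.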

Measurement Repository
• The data is stored in a database
• A memory cache guarantees fast access to the most recent data, which is normally what is used for fault tolerance correlations
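The combination of a database for all data with a memory cache for the most recent values might look roughly like this. It is a sketch under stated assumptions: SQLite stands in for the real backend, and the class name `MeasurementStore`, the table layout and the node name used in the usage note are invented for illustration.

```python
import sqlite3
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]  # (time, value)


class MeasurementStore:
    """All samples go to the database; the newest one per metric stays in memory."""

    def __init__(self, db_path: str = ":memory:") -> None:
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS samples "
            "(node TEXT, metric TEXT, time REAL, value REAL)"
        )
        self.latest: Dict[Tuple[str, str], Point] = {}

    def insert(self, node: str, metric: str, time: float, value: float) -> None:
        self.db.execute(
            "INSERT INTO samples VALUES (?, ?, ?, ?)", (node, metric, time, value)
        )
        self.db.commit()
        key = (node, metric)
        # A late-arriving older sample must not overwrite a newer cached one.
        if key not in self.latest or time >= self.latest[key][0]:
            self.latest[key] = (time, value)

    def most_recent(self, node: str, metric: str) -> Optional[Point]:
        """Answered from memory only; never touches the database."""
        return self.latest.get((node, metric))

    def history(self, node: str, metric: str, start: float, end: float) -> List[Point]:
        """Time-ordered samples in [start, end], read from the database."""
        rows = self.db.execute(
            "SELECT time, value FROM samples "
            "WHERE node = ? AND metric = ? AND time BETWEEN ? AND ? "
            "ORDER BY time",
            (node, metric, start, end),
        )
        return rows.fetchall()
```

For example, after `store.insert("lxb001", "LoadAvg", 10.0, 0.5)`, a fault-tolerance check can call `store.most_recent("lxb001", "LoadAvg")` without a database round trip, while `store.history(...)` serves the slower trend queries.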

Repository API
• SOAP RPC
• Query history data
• Subscription to new data
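The two access patterns listed for the repository API, querying history and subscribing to new data, can be illustrated in-process as below. The SOAP layer is deliberately left out, and the class `RepositoryApi`, its method names and the sample tuple layout are assumptions made for this sketch rather than the actual interface.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List, Tuple

Sample = Tuple[str, str, float, float]  # (node, metric, time, value)
Callback = Callable[[Sample], None]


class RepositoryApi:
    """In-process stand-in for the two consumer-facing operations."""

    def __init__(self) -> None:
        self._samples: List[Sample] = []
        self._subscribers: DefaultDict[str, List[Callback]] = defaultdict(list)

    def subscribe(self, metric: str, callback: Callback) -> None:
        """Register a consumer to be called for every new sample of a metric."""
        self._subscribers[metric].append(callback)

    def publish(self, sample: Sample) -> None:
        """Store a new sample and push it to the subscribers of its metric."""
        self._samples.append(sample)
        for callback in self._subscribers[sample[1]]:
            callback(sample)

    def query_history(self, metric: str, start: float, end: float) -> List[Sample]:
        """Pull model: stored samples of a metric with start <= time <= end."""
        return [
            s for s in self._samples
            if s[1] == metric and start <= s[2] <= end
        ]
```

An alarm display would typically use `subscribe` to be notified as data arrives, whereas a trend plot would use `query_history` over a time window.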

Database
• Proprietary flat-file database
• Oracle
• Open source interface to be developed

Deployment status in the CERN CC

• MSA with sensors for performance and exception monitoring, measuring 100-150 quantities per box

• Deployed on ~1500 RedHat Linux nodes

• 30 clusters, with specific configuration files

Cluster type      Nodes
Batch              1000
Interactive          70
Disk server         200
Tape server          80
WWW, DB, MISC       200
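As a quick consistency check on the figures above, the per-cluster node counts can be summed and multiplied by the quoted 100-150 quantities per box. The numbers are taken from the slide; only the variable names are invented.

```python
# Node counts per cluster type, as listed on the slide.
CLUSTER_NODES = {
    "Batch": 1000,
    "Interactive": 70,
    "Disk server": 200,
    "Tape server": 80,
    "WWW, DB, MISC": 200,
}

# The slide quotes 100-150 measured quantities per box.
METRICS_PER_NODE_LOW, METRICS_PER_NODE_HIGH = 100, 150

total_nodes = sum(CLUSTER_NODES.values())
low = total_nodes * METRICS_PER_NODE_LOW
high = total_nodes * METRICS_PER_NODE_HIGH

print(f"nodes: {total_nodes}")                    # nodes: 1550
print(f"quantities monitored: {low} to {high}")   # quantities monitored: 155000 to 232500
```

The sum of 1550 agrees with the "~1500 nodes" quoted for the deployment, and the product gives a rough idea of how many distinct time series the repository has to absorb.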

Status of exception monitoring

• ~50 possible alarms per monitored node
– HighLoad, DaemonDead, FileSysFull, install / config problems

• Operator alarm displays
– PVSS-based, developed as part of PVSS-tests
– WP4 alarm display under active development
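Exception monitoring of this kind amounts to turning sampled values into named alarms. The sketch below shows the idea for three of the alarm names on the slide; the thresholds, the daemon names and the `NodeState` structure are assumptions chosen for illustration, not the production configuration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Set


@dataclass
class NodeState:
    """Hypothetical snapshot of the values an agent has sampled on one node."""
    load_avg: float
    fs_used_percent: Dict[str, float] = field(default_factory=dict)
    running_daemons: Set[str] = field(default_factory=set)


def evaluate_alarms(
    state: NodeState,
    max_load: float = 10.0,            # assumed threshold
    max_fs_percent: float = 95.0,      # assumed threshold
    required_daemons: Sequence[str] = ("sshd", "crond"),  # assumed daemon list
) -> List[str]:
    """Return the alarms raised by one node snapshot."""
    alarms: List[str] = []
    if state.load_avg > max_load:
        alarms.append("HighLoad")
    for daemon in required_daemons:
        if daemon not in state.running_daemons:
            alarms.append(f"DaemonDead:{daemon}")
    for mount, used in sorted(state.fs_used_percent.items()):
        if used >= max_fs_percent:
            alarms.append(f"FileSysFull:{mount}")
    return alarms
```

A healthy node yields an empty list, so an operator display only needs to show nodes for which `evaluate_alarms` returns something.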

PVSS operator alarm display

WP4 operator alarm display

Performance monitoring

• WP4 Measurement Repository with Oracle backend is currently being deployed in the CERN CC for LCG-1

• Data access
– C-API to the repository is available, Perl and Java implementations to be done
– Simple CLI is being delivered
– GUI is being delivered

Anamon

Open issues

• Current solution is still very node-centric
• Not much experience with consumers
• No correlation engines, no corrective actions yet…
• Integration with configuration system to be done

Summary and Outlook

• Fabric monitoring infrastructure for LCG-1 at CERN is being deployed

• Monitoring Sensor Agent has been operating very well
• Measurement Repository will now be challenged
• Consumers can start consuming…
• An interesting 6-month period awaits us!