Lemon Monitoring Miroslav Siket, German Cancio, David Front, Maciej Stepniewski CERN-IT/FIO-FS LCG Operations Workshop Bologna, 24-26 May 2005

Lemon Monitoring

Miroslav Siket, German Cancio, David Front, Maciej Stepniewski
CERN-IT/FIO-FS

LCG Operations Workshop
Bologna, 24-26 May 2005


25/05/2005 LCG Operations Workshop 24-26/05/2005 Bologna


Outline

• Lemon
• Structure and design
• How it works, deployment
• Use cases, web interface
• Installation and setup
• Summary


Lemon – LHC Era Monitoring

• Lemon is a system containing tools for monitoring the status and performance of computers:
  – Distributed monitoring system scalable to ~10k nodes
  – Provides active monitoring of software and hardware in the Computer Center on centrally managed clusters
  – Facilitates early error detection and problem prevention
  – Executes corrective actions and sends notifications
  – Provides persistent storage of the monitoring data
  – Offers a framework for creating further monitoring sensors
  – Site-independent functionality

• Link: http://cern.ch/lemon
• Part of the ELFms toolsuite: http://cern.ch/elfms


Lemon Use

• It is used inside and outside CERN by:
  – System administrators, service managers, people responsible for clusters
  – Developers and service/data challenges
  – Managers and general users

• Deployments outside CERN:
  – EDG testbeds
  – Accelerator (AB) department at CERN
  – CMS online
  – GridICE
  – BARC India (development partner)


Lemon architecture

[Figure: Lemon architecture diagram. Recoverable labels: Sensors, Monitoring Agent, Nodes, TCP/UDP, Monitoring Repository, Repository backend, SOAP, Correlation Engines, Lemon CLI, RRDTool / PHP, apache, HTTP, Web browser, User]


Components

• Lemon is a typical client/server application with the following components:

  – MSA – Monitoring Sensor Agent (Lemon Agent)
    • Daemon on a client machine that spawns multiple Monitoring Sensors to measure data at defined intervals and sends the data to the Monitoring Repository
  – MS – Monitoring Sensor
    • Uses a standard C++ or Perl API – it is easy to write your own sensor
    • Several sensors exist for performance, process, hardware and software monitoring, grid VOs' job reporting, database monitoring, security, alarms (260 metrics in total)
  – MR – Monitoring Repository
    • Server application that receives samples and processes/validates them
    • Stores the full monitoring history data
    • Two implementations: flat files or Oracle DB based
  – LRF – Lemon RRD Framework
    • Pre-processes data into RRD files and creates cluster summaries
    • These are used for web graphics
    • Provides service and cluster overview in its web displays
  – LAG – Lemon Alarm Gateway
    • Generic gateway for alarms (in development)

• Gateways to MonALISA and GridICE exist
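The split between agent and sensors can be illustrated with a small Python sketch. This is not the Lemon API (the slides state that real sensors use a C++ or Perl API); the names `Sensor`, `Agent`, `load_sensor` and the sample tuple layout are hypothetical and only show the idea of measuring at defined intervals and forwarding samples to a repository.

```python
import os
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical sample layout: (host, metric name, timestamp, value).
Sample = Tuple[str, str, float, str]


@dataclass
class Sensor:
    """A named measurement taken at a fixed interval (in seconds)."""
    metric: str
    interval: float
    measure: Callable[[], str]
    next_due: float = 0.0


class Agent:
    """Polls sensors when they are due and hands samples to a transport."""

    def __init__(self, host: str, send: Callable[[Sample], None]) -> None:
        self.host = host
        self.send = send  # stands in for the link to the repository
        self.sensors: List[Sensor] = []

    def register(self, sensor: Sensor) -> None:
        self.sensors.append(sensor)

    def run_once(self, now: float) -> int:
        """Measure every sensor that is due at `now`; return how many ran."""
        sent = 0
        for sensor in self.sensors:
            if now >= sensor.next_due:
                self.send((self.host, sensor.metric, now, sensor.measure()))
                sensor.next_due = now + sensor.interval
                sent += 1
        return sent


def load_sensor() -> str:
    """Example measurement: 1-minute load average (Unix-like systems only)."""
    return f"{os.getloadavg()[0]:.2f}"
```

A caller would register `Sensor("load", 60.0, load_sensor)` on an `Agent` and call `run_once(time.time())` in a loop; passing the clock in explicitly keeps the scheduling logic easy to test.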


Lemon at CERN

• Lemon monitors about 2200 computers in ~100 clusters
• On average it collects about 70 metrics from each host
• Integrated with the Sure alarm system
• Collecting about 1.5 GB/day
• LEAF (LHC-Era Automated Fabric) for high-level intervention scheduling

[Figure labels: Node Configuration Management, Node Management]

Configuration
• Derived from the Quattor Configuration Database (CDB)
• Individual configuration per cluster/host
• Hierarchical structure

Alarm system
• Sure – legacy system receiving alarms from Lemon
• Integration with the new LASER system (LHC alarm system) via LAG is ongoing
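To put the CERN deployment figures in proportion, the short calculation below derives per-host and per-metric data volumes from the numbers quoted on this slide. The inputs are the slide's approximate values; the assumption that 1 GB means 10**9 bytes is mine, so the results are rough estimates only.

```python
HOSTS = 2200            # "about 2200 computers" (from the slide)
METRICS_PER_HOST = 70   # "about 70 metrics from each host" (from the slide)
BYTES_PER_DAY = 1.5e9   # "about 1.5 GB/day", assuming 1 GB = 10**9 bytes
SECONDS_PER_DAY = 86_400

# Number of (host, metric) pairs being collected.
metric_streams = HOSTS * METRICS_PER_HOST

bytes_per_host_per_day = BYTES_PER_DAY / HOSTS
bytes_per_stream_per_day = BYTES_PER_DAY / metric_streams
bytes_per_second = BYTES_PER_DAY / SECONDS_PER_DAY

print(f"metric streams:           {metric_streams}")
print(f"bytes per host per day:   {bytes_per_host_per_day:.0f}")
print(f"bytes per stream per day: {bytes_per_stream_per_day:.0f}")
print(f"average ingest rate:      {bytes_per_second:.0f} B/s")
```

Under these assumptions this works out to roughly 154,000 metric streams, about 0.7 MB per host per day, and an average ingest rate of about 17 kB/s.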


Web interface

• Cluster view displays accumulated statistics and status for all machines in the cluster
• Host view gives an overview of the host status with basic metrics
• Other views available:
  – Rack view
  – Hardware type view
  – Other views can be added; user-defined views are being worked on
• With the newest version (to be released soon):
  – Generic entry page displaying a status overview of the key services
  – Configurable views
• In development: database services monitoring with a database-specific view


Use(ful) case

• Kernel upgrade
  – Kernel version is “measured” when the machine boots
  – Automatic tools for upgrading the kernel on a cluster retrieve information from Lemon and schedule the reboot of a machine based on this info
  – Web interface allows monitoring of the progress

[Figure: Reboot occurrence history graph]
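The kernel upgrade use case amounts to comparing each host's measured kernel version with a target and scheduling reboots for the hosts that differ. The sketch below illustrates that logic only; the function names, the batching policy, the host names and the version strings are all made up for illustration and do not come from the actual CERN tools.

```python
from typing import Dict, Iterator, List


def hosts_needing_reboot(running: Dict[str, str], target: str) -> List[str]:
    """Hosts whose last measured kernel version differs from the target."""
    return sorted(host for host, version in running.items() if version != target)


def reboot_batches(hosts: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield hosts in fixed-size batches so the whole cluster is never down at once."""
    for start in range(0, len(hosts), batch_size):
        yield hosts[start:start + batch_size]


# Made-up sample data standing in for values retrieved from the monitoring system.
measured = {
    "node01": "2.4.21-27",
    "node02": "2.4.21-32",
    "node03": "2.4.21-27",
}

pending = hosts_needing_reboot(measured, target="2.4.21-32")
batches = list(reboot_batches(pending, batch_size=1))
```

Because the kernel version is re-measured at boot, rerunning `hosts_needing_reboot` after each batch gives the progress that the web interface displays.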


Computer Center display

• Lemon Web Interface can be interfaced with a Computer Center database of objects (racks, silos, …)
• Provides search of objects as well as listing
• Interfaced through an XML-defined geometry of the computer center
• Generic design that can be used anywhere:

    <?xml version="1.0" ?>
    <CC>
      <ROOM ID="0513-S-0034" DESCRIPTION="Tape Vault" R="0" G="0" B="0">
        <DOORS R="0" G="255" B="0">
          <DOOR X="63" Y="39" LX="64" LY="39" />
          <DOOR X="34" Y="0" LX="36" LY="0" />
        </DOORS>
        <RACKS R="0" G="0" B="203">
          <RACK ID="EA01" X="73" Y="9" LX="75" LY="10" PLANNED="0"/>
          <RACK ID="EA03" X="73" Y="8" LX="75" LY="9" PLANNED="0"/>
        </RACKS>
        <WALLS R="0" G="0" B="0">
          <WALL X="0" Y="0" LX="0" LY="60" />
          <WALL X="0" Y="0" LX="76" LY="0" />
        </WALLS>
        <STEPS R="255" G="163" B="0">
          <STEP X="47" Y="36" LX="52" LY="37" />
          <STEP X="47" Y="37" LX="52" LY="38" />
        </STEPS>
      </ROOM>
    </CC>
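The XML geometry shown on this slide can be read with Python's standard library. The sketch below uses a shortened copy of the sample and collects rack coordinates per room; the function name `racks_by_room` is hypothetical, and reading (X, Y) and (LX, LY) as two corners of each object is my assumption, since the slide does not define the attributes.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Tuple

# Shortened copy of the sample geometry from the slide (racks only).
GEOMETRY = """<?xml version="1.0" ?>
<CC>
  <ROOM ID="0513-S-0034" DESCRIPTION="Tape Vault" R="0" G="0" B="0">
    <RACKS R="0" G="0" B="203">
      <RACK ID="EA01" X="73" Y="9" LX="75" LY="10" PLANNED="0"/>
      <RACK ID="EA03" X="73" Y="8" LX="75" LY="9" PLANNED="0"/>
    </RACKS>
  </ROOM>
</CC>"""

Box = Tuple[int, int, int, int]


def racks_by_room(xml_text: str) -> Dict[str, Dict[str, Box]]:
    """Return {room ID: {rack ID: (X, Y, LX, LY)}} for every ROOM element."""
    root = ET.fromstring(xml_text)
    rooms: Dict[str, Dict[str, Box]] = {}
    for room in root.iter("ROOM"):
        racks: Dict[str, Box] = {}
        for rack in room.iter("RACK"):
            # Assumption: (X, Y) and (LX, LY) are opposite corners of the rack.
            racks[rack.get("ID")] = (
                int(rack.get("X")),
                int(rack.get("Y")),
                int(rack.get("LX")),
                int(rack.get("LY")),
            )
        rooms[room.get("ID")] = racks
    return rooms
```

Calling `racks_by_room(GEOMETRY)` yields a dictionary keyed by room ID, which is enough to support the kind of object search and listing the slide mentions.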


Service challenges, GRID VOs

• Lemon allows for

  – Virtual clusters
    • clusters defined on request by service managers
    • or defined by scripts – updated dynamically on demand
    • or defined for a specific purpose
    • Examples: Alice MDC, network challenges, …

  – Clusters defined dynamically
    • example: hosts running GRID jobs on the batch cluster belonging to a given Virtual Organization
    • hooks in Lemon for defining any dynamic grouping of hosts
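A dynamically defined cluster is, in essence, a grouping of hosts computed from current data. The sketch below groups hosts by the Virtual Organization of the jobs they are running; the function name, the input format and the host names are hypothetical and are not taken from Lemon's actual hooks.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def group_hosts_by_vo(jobs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each VO to the sorted, de-duplicated list of hosts running its jobs.

    `jobs` is an iterable of (host, vo) pairs, one per running job.
    """
    clusters: Dict[str, Set[str]] = defaultdict(set)
    for host, vo in jobs:
        clusters[vo].add(host)
    return {vo: sorted(hosts) for vo, hosts in clusters.items()}


# Made-up snapshot of running jobs on a batch cluster.
running_jobs = [
    ("lxb001", "alice"),
    ("lxb002", "cms"),
    ("lxb001", "cms"),
    ("lxb003", "alice"),
]

clusters = group_hosts_by_vo(running_jobs)
```

A host can belong to several dynamic clusters at once (here `lxb001` runs jobs for two VOs), which is one way such clusters differ from a static assignment.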


Automatic recovery actions and Alarms

• Alarm Sensor
  – For defined values of measured metrics, an actuator is called with a predefined action
  – An example: ssh daemon dead – action /sbin/service sshd start
  – Definition: metric X, field Y <op> reference value Z => call actuator
    • <op> can be ==, <, >, regexp, range, etc.
    • If the action succeeds, only log it; otherwise call the action again, up to a maximum number of times
  – Each occurrence is logged in the Monitoring Repository
  – Already about 70 predefined alarms with automatic recovery actions
  – After the first month of deployment it reduced the number of problem tickets by half

• Correlation engine (CMDaemon)
  – Allows ‘global’ correlations, and in the future client/server alarms and recovery actions

• Lemon Alarm Gateway (LAG)
  – Lemon’s LAG can be used to feed alarms into arbitrary alarm systems (under development)
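The alarm definition "metric X, field Y <op> reference value Z => call actuator" can be sketched as a small rule evaluator. This is an illustration under stated assumptions, not Lemon's alarm sensor: the names `AlarmRule`, `OPS` and `handle` are hypothetical, the `range` operator is left out, and the retry behaviour is my reading of the "up to max times" bullet. The actuator is a plain callable here so the sketch has no side effects; in the slide's example it would run /sbin/service sshd start.

```python
import operator
import re
from dataclasses import dataclass
from typing import Callable, Dict

# Supported comparison operators (a subset of those listed on the slide).
OPS: Dict[str, Callable[[str, str], bool]] = {
    "==": operator.eq,
    "<": lambda value, ref: float(value) < float(ref),
    ">": lambda value, ref: float(value) > float(ref),
    "regexp": lambda value, ref: re.search(ref, value) is not None,
}


@dataclass
class AlarmRule:
    metric: str          # metric X
    field: str           # field Y within a sample of that metric
    op: str              # one of the keys of OPS
    reference: str       # reference value Z
    max_tries: int = 3   # assumed retry limit

    def triggered(self, sample: Dict[str, str]) -> bool:
        return OPS[self.op](sample[self.field], self.reference)


def handle(
    rule: AlarmRule,
    read_sample: Callable[[], Dict[str, str]],
    actuator: Callable[[], None],
    log: Callable[[str], None],
) -> bool:
    """Run the actuator while the alarm condition holds, at most max_tries times.

    Returns True if the condition has cleared, False if it still holds.
    """
    for attempt in range(1, rule.max_tries + 1):
        if not rule.triggered(read_sample()):
            return True
        log(f"{rule.metric}: alarm raised, running actuator (try {attempt})")
        actuator()
    return not rule.triggered(read_sample())
```

For the sshd example, a rule such as `AlarmRule("sshd", "running", "==", "0")` would fire when the daemon count drops to zero, and the actuator would restart the service; every attempt goes through `log`, mirroring the slide's point that each occurrence is recorded.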


Installation and setup (I)

Lemon installation consists of three steps:
1. Server installation
2. Client installation
3. Web interface installation

1. Server installation:
  – Install the edg-fabricMonitoring-server rpm (“flat file” server)
  – Configure the receiving port in /etc/edg-fmon-server.conf
  – Start the server daemon

2. Client installation:
  – Install the edg-fabricMonitoring-agent rpm (comes with a default metric configuration)
  – Configure the server and its port in /etc/edg-fmon-agent.conf
  – Start the client daemon on all monitored hosts


Installation and setup (II)

3. Web interface installation:
  – Install and start the apache server (with php) on your server
  – Install the rrdtool and lrf (Lemon RRD Framework) rpms
  – Configure your clusters in the clusters.conf file and start the lemonmrd daemon

• Drink Champagne… you have Lemon up and running! ;-)
  – You can do all this on your laptop!

• Possible additional components:
  – Computer center synoptic view through an xml file
  – Problem tracking system integration (through a php plug-in to your DB/application)
  – Quattor CDB configuration view – through CDB xml profiles
  – Oracle-based Repository (for very large installations with high scalability and increased functionality)
  – Other, new components are easy to add

• View detailed instructions at: http://cern.ch/lemon/doc/installation/installation.html


Summary

• Lemon provides monitoring information about the farms in Computer Centers (or your laptop).
• Lemon provides a framework for recovery actions and alarms.
• Lemon is easy to install (…and it is easy to add your own metrics and visualize them).
• It is flexible with respect to your needs – you can add clusters and views, and specify your own definitions of virtual and dynamic clusters.
• It has been a useful tool for general performance monitoring and also for system administrators debugging problems.

• For more information check http://cern.ch/lemon