Lemon Monitoring
Miroslav Siket, German Cancio, David Front, Maciej Stepniewski
CERN-IT/FIO-FS
LCG Operations Workshop
Bologna, 24-26 May 2005
25/05/2005 LCG Operations Workshop 24-26/05/2005 Bologna
Outline
• Lemon
• Structure and design
• How it works, deployment
• Use cases, web interface
• Installation and setup
• Summary
Lemon – LHC Era Monitoring
• Lemon is a set of tools for monitoring the status and performance of computers:
  – Distributed monitoring system scalable to ~10k nodes
  – Provides active monitoring of software and hardware in the Computer Center on centrally managed clusters
  – Facilitates early error detection and problem prevention
  – Executes corrective actions and sends notifications
  – Provides persistent storage of the monitoring data
  – Offers a framework for creating further monitoring sensors
  – Site-independent functionality
• Link: http://cern.ch/lemon
• Part of the ELFms toolsuite: http://cern.ch/elfms
Lemon Use
• It is used inside and outside CERN by:
  – System administrators, service managers, people responsible for clusters
  – Developers and service/data challenges
  – Managers and general users
• Deployments outside CERN:
  – EDG testbeds
  – Accelerator (AB) department at CERN
  – CMS online
  – GridICE
  – BARC India (development partner)
Lemon architecture
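[Architecture diagram. Components shown: Nodes, each running a Monitoring Agent with several Sensors; Monitoring Repository with a Repository backend; Correlation Engines; Lemon CLI; RRDTool / PHP on apache; Web browser; User. Protocols shown on the links: TCP/UDP, SOAP, HTTP.]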
Components
• Lemon is a typical client/server application with the following components:
  – MSA – Monitoring Sensor Agent (Lemon Agent)
    • Daemon on a client machine that spawns multiple Monitoring Sensors to measure data at defined intervals and sends the data to the Monitoring Repository
  – MS – Monitoring Sensor
    • Uses a standard C++ or Perl API – it is easy to write your own sensor
    • Several sensors exist for performance, process, hardware and software monitoring, grid VO job reporting, database monitoring, security and alarms (260 metrics in total)
  – MR – Monitoring Repository
    • Server application that receives samples and processes/validates them
    • Stores the full monitoring history data
    • Two implementations: flat files or Oracle DB based
  – LRF – Lemon RRD Framework
    • Pre-processes data into rrd files and creates cluster summaries
    • These are used for web graphics
    • Provides service and cluster overview in its web displays
  – LAG – Lemon Alarm Gateway
    • Generic gateway for alarms (in development)
• Gateways to MonALISA and GridICE exist
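The agent/sensor split described on this slide can be pictured with a small scheduling sketch. This is not the Lemon sensor API (real sensors use the C++ or Perl API): the `Agent` class, its `register` and `run_until` methods, the sensor names and the simulated clock are all assumptions made for illustration. It only shows the idea of an agent that samples each sensor at its own interval and forwards every sample to a repository.

```python
import heapq


class Agent:
    """Hypothetical stand-in for a monitoring agent (not the real MSA)."""

    def __init__(self, send):
        self._send = send      # callable that "ships" a sample to the repository
        self._queue = []       # heap of (next_due, seq, interval, name, sensor)
        self._seq = 0          # unique tie-breaker so callables are never compared

    def register(self, name, interval, sensor):
        """Schedule `sensor` (a zero-argument callable) every `interval` seconds."""
        heapq.heappush(self._queue, (0, self._seq, interval, name, sensor))
        self._seq += 1

    def run_until(self, end_time):
        """Simulate time from 0 to end_time, sampling sensors as they fall due."""
        while self._queue and self._queue[0][0] <= end_time:
            due, seq, interval, name, sensor = heapq.heappop(self._queue)
            self._send((due, name, sensor()))
            heapq.heappush(self._queue, (due + interval, seq, interval, name, sensor))


# Usage: a list plays the role of the Monitoring Repository.
repository = []
agent = Agent(repository.append)
agent.register("load", 60, lambda: 0.5)             # hypothetical metric
agent.register("kernel", 300, lambda: "unknown")    # hypothetical metric
agent.run_until(120)
```

A real agent would sleep until the next due time and send samples over the network; the simulated clock just keeps the sketch deterministic.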
Lemon at CERN
• Lemon monitors about 2200 computers in ~100 clusters
• On average it collects about 70 metrics from each host
• Integrated with the Sure alarm system
• Collecting about 1.5 GB/day
• LEAF (LHC-Era Automated Fabric) for high-level intervention scheduling
[Diagram labels: Node Configuration Management, Node Management]
Configuration
• Derived from the Quattor Configuration Database (CDB)
• Individual configuration per cluster/host
• Hierarchical structure
Alarm system
• Sure – legacy system receiving alarms from Lemon
• Integration with the new LASER system (LHC alarm system) via LAG is ongoing
Web interface
• Cluster view displays accumulated statistics and status for all machines in the cluster
• Host view gives an overview of the host status with basic metrics
• Other views available:
  – Rack view
  – Hardware type view
  – Other views can be added; user-defined views are being worked on
• With the newest version (to be released soon):
  – Generic entry page displaying a status overview of the key services
  – Configurable views
• In development: database services monitoring with a database-specific view
Use(ful) case
• Kernel upgrade
  – Kernel version is “measured” when the machine boots
  – Automatic tools for upgrading the kernel on a cluster retrieve this information from Lemon and schedule the reboot of a machine based on it
  – Web interface allows monitoring of the progress
[Figure: reboot occurrence history graph]
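A minimal sketch of the kernel-upgrade use case, assuming the measured kernel versions have already been retrieved from Lemon into a plain dictionary. The function names, the host names, the version strings and the idea of rebooting in fixed-size batches are illustrative assumptions, not part of the tools mentioned on the slide.

```python
def hosts_needing_reboot(measured, target):
    """Return hosts whose kernel version measured at boot differs from `target`.

    `measured` maps host name -> kernel version string (hypothetical data).
    """
    return sorted(host for host, version in measured.items() if version != target)


def schedule_reboots(measured, target, batch_size):
    """Split the outdated hosts into reboot batches of at most `batch_size`."""
    pending = hosts_needing_reboot(measured, target)
    return [pending[i:i + batch_size] for i in range(0, len(pending), batch_size)]


# Hypothetical cluster state; version strings are placeholders.
measured = {
    "node001": "kernel-old",
    "node002": "kernel-new",
    "node003": "kernel-old",
    "node004": "kernel-old",
}
batches = schedule_reboots(measured, "kernel-new", batch_size=2)
```

Because the kernel version is re-measured at every boot, re-running the selection after each batch naturally shrinks the pending list, which is what the progress view would reflect.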
Computer Center display
• Lemon Web Interface can be interfaced with a Computer Center database of objects (racks, silos, …)
• Provides search of objects as well as listing
• Interfaced through an XML-defined geometry of the computer center
• Generic design that can be used anywhere:

<?xml version="1.0" ?>
<CC>
  <ROOM ID="0513-S-0034" DESCRIPTION="Tape Vault" R="0" G="0" B="0">
    <DOORS R="0" G="255" B="0">
      <DOOR X="63" Y="39" LX="64" LY="39" />
      <DOOR X="34" Y="0" LX="36" LY="0" />
    </DOORS>
    <RACKS R="0" G="0" B="203">
      <RACK ID="EA01" X="73" Y="9" LX="75" LY="10" PLANNED="0"/>
      <RACK ID="EA03" X="73" Y="8" LX="75" LY="9" PLANNED="0"/>
    </RACKS>
    <WALLS R="0" G="0" B="0">
      <WALL X="0" Y="0" LX="0" LY="60" />
      <WALL X="0" Y="0" LX="76" LY="0" />
    </WALLS>
    <STEPS R="255" G="163" B="0">
      <STEP X="47" Y="36" LX="52" LY="37" />
      <STEP X="47" Y="37" LX="52" LY="38" />
    </STEPS>
  </ROOM>
</CC>
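To illustrate how a geometry file of this kind supports listing and searching objects, here is a short sketch using Python's standard `xml.etree.ElementTree`. It is not the code used by the Lemon web interface (which is PHP based); the helper names `list_racks` and `find_rack` are assumptions, and the X/Y/LX/LY attributes are simply read as integers without assuming what they mean geometrically.

```python
import xml.etree.ElementTree as ET

# A cut-down copy of the geometry from the slide (racks only).
GEOMETRY = """<CC>
  <ROOM ID="0513-S-0034" DESCRIPTION="Tape Vault" R="0" G="0" B="0">
    <RACKS R="0" G="0" B="203">
      <RACK ID="EA01" X="73" Y="9" LX="75" LY="10" PLANNED="0"/>
      <RACK ID="EA03" X="73" Y="8" LX="75" LY="9" PLANNED="0"/>
    </RACKS>
  </ROOM>
</CC>"""


def list_racks(xml_text):
    """Return (room_id, rack_id, (X, Y, LX, LY)) for every RACK element."""
    root = ET.fromstring(xml_text)
    racks = []
    for room in root.findall("ROOM"):
        for rack in room.findall("RACKS/RACK"):
            box = tuple(int(rack.get(key)) for key in ("X", "Y", "LX", "LY"))
            racks.append((room.get("ID"), rack.get("ID"), box))
    return racks


def find_rack(xml_text, rack_id):
    """Search for one rack by ID; return (room_id, box) or None if absent."""
    for room_id, current_id, box in list_racks(xml_text):
        if current_id == rack_id:
            return room_id, box
    return None
```

For example, `find_rack(GEOMETRY, "EA03")` returns the room ID together with the four attribute values of that rack, which is the information a synoptic view would need to highlight it.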
Service challenges, GRID VOs
• Lemon allows for
  – Virtual clusters
    • clusters defined on request by service managers
    • or defined by scripts and updated dynamically on demand
    • or defined for a specific purpose
    • Examples: Alice MDC, network challenges, …
  – Clusters defined dynamically
    • example: hosts running GRID jobs on the batch cluster belonging to the given Virtual Organization
    • hooks in Lemon for defining any dynamic grouping of hosts
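The dynamic-cluster idea can be sketched as a grouping function that is re-run whenever the job list changes. The input format (pairs of host and Virtual Organization), the function name and the sample data are assumptions for illustration; the actual Lemon hooks are not described on the slide.

```python
from collections import defaultdict


def clusters_by_vo(running_jobs):
    """Group hosts into one dynamic cluster per Virtual Organization.

    `running_jobs` is an iterable of (host, vo) pairs, one per running job
    (a hypothetical input format). A host running jobs for several VOs
    appears in several clusters.
    """
    clusters = defaultdict(set)
    for host, vo in running_jobs:
        clusters[vo].add(host)
    return {vo: sorted(hosts) for vo, hosts in clusters.items()}


# Hypothetical snapshot of the batch cluster.
jobs = [
    ("batch01", "alice"),
    ("batch02", "cms"),
    ("batch01", "cms"),
    ("batch03", "alice"),
    ("batch03", "alice"),
]
dynamic_clusters = clusters_by_vo(jobs)
```

Recomputing the grouping from the current job list, rather than editing cluster membership by hand, is what makes the cluster dynamic.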
Automatic recovery actions and Alarms
• Alarm Sensor
  – For defined values of measured metrics an actuator is called with a predefined action
  – An example: ssh daemon dead – action /sbin/service sshd start
  – Definition: metric X, field Y <op> reference value Z => call actuator
    • <op> can be ==, <, >, regexp, range, etc.
    • If success log only, else call action up to max times
  – Each occurrence is logged in the Monitoring Repository
  – Already about 70 predefined alarms with automatic recovery actions
  – After the first month of deployment it reduced the number of problem tickets by half
• Correlation engine (CMDaemon)
  – Allows ‘global’ correlations, and in the future client/server alarms and recovery actions
• Lemon Alarm Gateway (LAG)
  – Lemon’s LAG can be used to feed alarms into arbitrary alarm systems (under development)
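The alarm definition "metric X, field Y <op> reference value Z => call actuator" can be sketched as follows. This is one possible reading of the slide, not the Alarm Sensor's implementation: the `AlarmRule` class, the `evaluate` function, the operator table and the return values are assumptions, and the actuator is a Python callable standing in for a command such as /sbin/service sshd start. The retry loop interprets "if success log only, else call action up to max times" as: log the alarm, run the action, stop at the first success, give up after the maximum number of tries.

```python
import operator
import re
from dataclasses import dataclass
from typing import Any, Callable

# Supported <op> values; "regexp" and "range" are assumed semantics.
OPERATORS = {
    "==": operator.eq,
    "<": operator.lt,
    ">": operator.gt,
    "regexp": lambda value, pattern: re.search(pattern, str(value)) is not None,
    "range": lambda value, bounds: bounds[0] <= value <= bounds[1],
}


@dataclass
class AlarmRule:
    metric: str                    # metric X
    field_name: str                # field Y
    op: str                        # one of the OPERATORS keys
    reference: Any                 # reference value Z
    actuator: Callable[[], bool]   # returns True if the recovery succeeded
    max_tries: int = 3


def evaluate(rule, sample, log):
    """Check one sample against a rule; run the actuator if the alarm fires."""
    value = sample[rule.metric][rule.field_name]
    if not OPERATORS[rule.op](value, rule.reference):
        return "ok"
    log.append(f"alarm: {rule.metric}.{rule.field_name} {rule.op} {rule.reference!r}")
    for attempt in range(1, rule.max_tries + 1):
        if rule.actuator():
            log.append(f"recovered on attempt {attempt}")
            return "recovered"
    log.append(f"giving up after {rule.max_tries} attempts")
    return "failed"
```

A rule for the slide's example might be `AlarmRule("sshd", "running", "==", 0, restart_sshd)`, where `restart_sshd` is a hypothetical function that runs the service command and reports whether it succeeded. The `log` list stands in for logging each occurrence to the Monitoring Repository.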
Installation and setup (I)
Lemon installation consists of three steps:
1. Server installation
2. Client installation
3. Web interface installation

1. Server installation:
  – Install the edg-fabricMonitoring-server rpm (“flat file” server)
  – Configure the receiving port in /etc/edg-fmon-server.conf
  – Start the server daemon
2. Client installation:
  – Install the edg-fabricMonitoring-agent rpm (comes with a default metric configuration)
  – Configure the server and its port in /etc/edg-fmon-agent.conf
  – Start the client daemon on all monitored hosts
Installation and setup (II)
3. Web interface installation:
  – Install and start the apache server (with php) on your server
  – Install the rrdtool and lrf (lemon rrd framework) rpms
  – Configure your clusters in the clusters.conf file and start the lemonmrd daemon
• Drink Champagne… you have Lemon up and running! ;-)
  – You can do all this on your laptop!
• Possible additional components:
  – Computer center synoptic view through an xml file
  – Problem tracking system integration (through a php plug-in to your DB/application)
  – Quattor CDB configuration view, through CDB xml profiles
  – Oracle based Repository (for very large installations with high scalability and increased functionality)
  – Other, new components are easy to add
• View detailed instructions at: http://cern.ch/lemon/doc/installation/installation.html
Summary
• Lemon provides monitoring information about the farms in Computer Centers (or your laptop).
• Lemon provides a framework for recovery actions and alarms.
• Lemon is easy to install (…and it is easy to add your own metrics and visualize them).
• It is flexible with respect to your needs: you can add clusters and views, and specify your own definitions of virtual and dynamic clusters.
• It has been a useful tool for general performance monitoring and for system administrators debugging problems.
• For more information see http://cern.ch/lemon