monitoring lhcb with nagios10cds.cern.ch/record/1215280/files/poster-2009-136.pdf · web interface...
Monitoring the LHCb Experiment Computing Infrastructure with NAGIOS
Enrico Bonaccorsi, Niko Neufeld
CERN
LHCb has a large and complex infrastructure consisting of thousands of servers and embedded computers, hundreds of network devices and many common infrastructure services such as shared storage, login and time services, databases and others. All aspects that are operationally critical are integrated into the standard Experiment Control System (ECS) based on PVSSII.
-ture, whose monitoring is based on NAGIOS. We present the design and implementation of the fabric management based on NAGIOS. Care has been taken to complement rather than duplicate functionality available in the Experiment Control System.
1044 Servers, 8421 Services
Systems under monitoring
Categories: Servers, Network, Services, Embedded / Special Systems
Monitored items: Storage Controllers, UPS, Maraton Boards, Network devices, CPU Load, Quattor, Network interface errors, Fiber Channel, Disk Space, Swap / Pagefile usage, CPU usage, Network interfaces, Trunk Status, Jumbo Frames, Error logs, Backups, Web services, Terminal services, Domain Controllers, Gateways, Farm, Databases, Castor reachability, WAN status
The simplest setup has a central server running the Nagios daemon, which runs local check scripts that check the status of services on that host and on remote hosts.
A host is a computer on the network that runs one or more services to be checked.
A service is anything on the host that you want checked. Its state can be one of: OK, Warning, Critical or Unknown.
A check is a script run on the server whose exit status determines the state of the service: 0, 1, 2 or -1.
Event handlers are optional commands that are executed whenever a host or service state change occurs. An obvious use for event handlers (especially with ser-
Architecture
Hosts, Services, Checks & Event Handlers
Nagios is a popular open source computer system and network monitoring software application. It watches hosts and services, alerting users when things go wrong and again when they get better.
When incidents are detected, the monitoring system sends out notifications to the administrative contacts (email, instant message, SMS, etc.).
Notification
[Diagram: Nagios runs checks on the Oracle databases, farm servers and control PCs, and sends notifications by SMS / mail.]
NAGIOS runs on a server as a daemon and periodically runs plugins to determine the status of hosts and services. Its main tasks are scheduling plugin executions, calculating dependencies and the alert-
Due to the modularity of the software, every action is del-
and binaries for plugins).
software can be programmed to execute, when a service or host status change occurs, any kind of external
-tions by mail, instant messaging, SMS or by a phone call using Voice over IP and a PBX such as Asterisk.
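The plugin model described above can be illustrated with a minimal sketch of a check script. It assumes the convention of the Nagios plugin guidelines, in which exit status 3 means Unknown (this poster lists -1 for that state). The thresholds, the checked path and the names `classify` and `check_disk` are hypothetical and are not taken from the LHCb setup.

```python
import shutil
import sys

# Exit statuses read by Nagios; 3 for Unknown follows the plugin guidelines (assumption).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(used_percent, warn=80.0, crit=90.0):
    """Map a usage percentage to a Nagios state using hypothetical thresholds."""
    if used_percent >= crit:
        return CRITICAL
    if used_percent >= warn:
        return WARNING
    return OK


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (state, message) for the filesystem that holds `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    used_percent = 100.0 * usage.used / usage.total
    state = classify(used_percent, warn, crit)
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[state]
    return state, f"DISK {label} - {used_percent:.1f}% used on {path}"


if __name__ == "__main__":
    state, message = check_disk()
    print(message)   # the plugin output is what the monitoring interface displays
    sys.exit(state)  # the exit status determines the service state
```

Keeping the classification separate from the measurement makes the threshold logic easy to test without touching a real filesystem.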
[Figure: LHCb Online Network Topology (SX85). The diagram shows the SX and UX switches (sw-sx-01, sw-ux-01, sw-mgmt-sx-01, sw-storecluster-c1, sw-storecluster-d1 and others), GW02, netctrl01, the NETWORK-MGMT network, the control room, and the links to the CERN GPN, CERN LCG and CERN TN networks.]
Network Devices
Every network device in LHCb implements the Simple Network Management Protocol (SNMP), allowing us to monitor not only whether a certain device is up but also CPU utilisation, port and trunk status, uplinks and event logs. WAN links are also monitored, and additionally we check the ability to process jumbo frames on the network devices that provide the network infrastructure from the TELL1 boards to the High Level Trigger (HLT).
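One possible way to test jumbo-frame handling, sketched here as an assumption rather than as the check LHCb actually uses, is to send a ping that may not be fragmented and whose size fills a 9000-byte MTU. The MTU value and the helper names are hypothetical; the `-M do`, `-s` and `-c` options are those of the Linux iputils `ping`.

```python
import subprocess

IPV4_HEADER = 20  # bytes, header without options
ICMP_HEADER = 8   # bytes


def icmp_payload_for_mtu(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def jumbo_ping_command(host, mtu=9000):
    """Build an iputils ping command with fragmentation prohibited (-M do)."""
    return ["ping", "-M", "do", "-c", "1", "-s", str(icmp_payload_for_mtu(mtu)), host]


def check_jumbo(host, mtu=9000, timeout=5):
    """Return 0 (OK) if the full-size ping is answered, 2 (Critical) otherwise."""
    try:
        result = subprocess.run(jumbo_ping_command(host, mtu),
                                capture_output=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return 2
    return 0 if result.returncode == 0 else 2
```

If any device on the path cannot forward frames of that size, the unfragmentable packet is dropped and the check reports Critical.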
Web Interface & NAGVIS
The web interface is reachable worldwide through a reverse proxy at the following URL: https://lbnagios.cern.ch. A complementary interface that displays the LHCb network map has been implemented using NagVis. This map is displayed in the control room, giving an instant overview of the infrastructure's status.
Servers
Combining local and remote execution of plugins, our system monitors CPU usage, load average, local disks,
the LDAP servers, status of the NFS mounts, SSH daemon, uptime, quattor daemons for the Linux-based servers and all
servers. In case of failure the system will try to restart the services automatically.
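The automatic restart mentioned above could be driven by a Nagios event handler. The sketch below assumes the handler is invoked with the `$SERVICESTATE$`, `$SERVICESTATETYPE$` and `$SERVICEATTEMPT$` macros as its three arguments; the restart policy and the restart command are hypothetical and do not describe the LHCb configuration.

```python
import subprocess
import sys


def should_restart(state, state_type, attempt):
    """Restart on the third soft failure or on any hard failure (assumed policy)."""
    if state != "CRITICAL":
        return False
    if state_type == "SOFT":
        return attempt == 3
    return state_type == "HARD"


def main(argv):
    state, state_type, attempt = argv[1], argv[2], int(argv[3])
    if should_restart(state, state_type, attempt):
        # Hypothetical command; a real handler would name the actual service.
        subprocess.run(["/sbin/service", "example-daemon", "restart"], check=False)
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

Acting on a late soft state gives the service a chance to recover on its own before the handler intervenes, while still trying a restart before a notification is sent.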
DNS and DHCP
NAGIOS queries the status of the DNS and DHCP services every 10 minutes by simulating real requests.
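As an illustration of what simulating a real request can mean for DNS, the following sketch resolves a name the way a client would and maps the outcome to a Nagios state. The function name, the response-time threshold and the injectable `resolver` parameter are assumptions made for this example; the DHCP side is not covered.

```python
import socket
import time


def check_dns(name, resolver=socket.gethostbyname, warn_seconds=1.0):
    """Resolve `name` as a client would and return (state, message)."""
    start = time.monotonic()
    try:
        address = resolver(name)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return 2, f"DNS CRITICAL - cannot resolve {name}: {exc}"
    elapsed = time.monotonic() - start
    if elapsed > warn_seconds:
        return 1, f"DNS WARNING - {name} -> {address} in {elapsed:.2f}s"
    return 0, f"DNS OK - {name} -> {address} in {elapsed:.2f}s"
```

Passing the resolver as a parameter lets the logic be exercised with a stub instead of a live name server.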
Uninterruptible power supplies
The UPS battery charge, output current, input and output frequency, current load, status, temperature, test date and result, and input and output voltage are monitored.
Backups
LHCb uses Amanda as its backup solution. Every night the backup consistency is checked by NAGIOS.
Web Services
All the web services are monitored; our implementation checks both the status of the TCP ports and the dynamic creation of web pages.
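The two-level web check described above can be sketched as follows: first test that the TCP port accepts connections, then fetch a page and look for text that only appears when the page was generated correctly. The function names, the marker string and the mapping to states are assumptions for illustration, not the LHCb implementation.

```python
import socket
import urllib.request


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def fetch_page(url, timeout=5.0):
    """Return the decoded body of `url`, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except OSError:  # urllib.error.URLError is a subclass of OSError
        return None


def evaluate(port_open, body, marker):
    """Combine both results into a Nagios state (0 OK, 1 Warning, 2 Critical)."""
    if not port_open or body is None:
        return 2
    if marker not in body:
        return 1
    return 0
```

A page that is served but lacks the expected marker is reported as Warning here, which separates a broken application from an unreachable server.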
Databases
The experiment itself, and in particular the data acquisition part, depends on a number of Oracle databases: listener and database consistency are checked constantly.
Storage
Storage controllers are constantly monitored via SNMP for hardware disk failures. The storage is connected via optical
publish the disks using the NFS and CIFS protocols: the status of the daemons and the free space on the storage are monitored.
We deployed our NAGIOS monitoring infrastructure six months ago and have run it successfully since. The software gives us essential insight into our network and server availability and lets us solve problems, ideally even before anyone notices
critical events are forwarded to PVSSII, where they are seen by non-expert operators. We have successfully implemented a lightweight, independent system to monitor the key infrastructure of the LHCb ECS itself; this gives us an important second level of monitoring in case of serious
The system managers are kept informed by the monitoring system about discovered problems via mail or short message to their mobile phones. When a service which has dependencies fails, NAGIOS
until the problem with the hierarchically highest service is solved (for example in case of problems on the main routers), which avoids a flood of mail or SMS and informs us only about the main problem. If the main switch connected to our monitoring server fails, NAGIOS is able to enable a backup network interface directly connected to the CERN net-
Routers, UPS, Network Devices
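The dependency behaviour described above, where only the hierarchically highest failure is reported, can be sketched with a simple parent map. Nagios expresses such relations through its own host and service dependency configuration; the data structure, the host names and the function names below are hypothetical, and the hierarchy is assumed to be a tree without cycles.

```python
def highest_failed_ancestor(host, parents, down):
    """Return the topmost failed ancestor of `host`, or None if all are up."""
    culprit = None
    current = parents.get(host)
    while current is not None:
        if current in down:
            culprit = current
        current = parents.get(current)
    return culprit


def hosts_to_notify(parents, down):
    """Failed hosts whose ancestors are all up: only these trigger alerts."""
    return sorted(h for h in down
                  if highest_failed_ancestor(h, parents, down) is None)


if __name__ == "__main__":
    # Hypothetical hierarchy: two farm nodes behind a switch behind a router.
    parents = {"farm01": "sw-a", "farm02": "sw-a", "sw-a": "router", "router": None}
    print(hosts_to_notify(parents, {"router", "sw-a", "farm01", "farm02"}))
```

With the router down, the switch and both farm nodes are also unreachable, yet only the router is reported, so the administrators receive one message instead of four.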