Nagios on Tier1 farm, Jonathan Wheeler, RAL Tier1 Fabric Team, 20th June 2008
Nagios on Tier1 farm
Jonathan Wheeler
RAL Tier1 Fabric Team
20th June 2008
Overview
• What we had before (Sure)
• Introduction to Nagios and how it is configured for the farm
• What might we do next
Sure monitoring - 1
• Consists of a server and clients
• Communication via sysreq command
• Required scripts set up for each client to run checks and report results to server
Sure monitoring - 2
3 main tasks:
a) check host alive
  • active using ping
  • passive accepting heartbeat messages
b) receive alarm messages
c) receive “backup started” and “backup finished” messages
Sure monitoring - 3
Problems:
• configuration not directly under Tier1 control
• requires locally written and locally maintained scripts
• limited view of farm alarms and state
• alarms only visible on server screen
Introduction to Nagios
• highly configurable
• under active development (Nagios 2.11 legacy, Nagios 3.0.2 latest stable)
• active user community (mailing list)
• some commercial offerings
• extensive documentation part of installation
• allows local extensions
Introduction to Nagios – basics - 1
Nagios:
• schedules test commands, for example: is space used in /var filesystem larger than permitted limit
• accepts results as return code (0 – OK, 1 – warning, 2 – critical, 3/-1 – unknown), and a single line message
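The return-code convention above can be illustrated with a minimal sketch of a test command for the /var example. This is not one of the Tier1 plugins: the function names, the 80% and 90% thresholds and the message wording are all assumptions made for illustration.

```python
import shutil

# Nagios return codes, as listed on the slide.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_usage(percent_used, warn=80.0, crit=90.0):
    """Map a usage percentage to a (return code, one-line message) pair.

    The warn/crit thresholds are illustrative defaults, not site values.
    """
    if percent_used >= crit:
        return CRITICAL, f"DISK CRITICAL - {percent_used:.1f}% used"
    if percent_used >= warn:
        return WARNING, f"DISK WARNING - {percent_used:.1f}% used"
    return OK, f"DISK OK - {percent_used:.1f}% used"


def check_filesystem(path="/var", warn=80.0, crit=90.0):
    """Measure the filesystem holding `path` and evaluate its usage."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as err:
        # Anything that prevents the measurement is reported as UNKNOWN.
        return UNKNOWN, f"DISK UNKNOWN - cannot read {path}: {err}"
    percent_used = 100.0 * usage.used / usage.total
    return evaluate_usage(percent_used, warn, crit)
```

A script built on this would print the message as its single line of output and exit with the code, for example `code, message = check_filesystem(); print(message); sys.exit(code)`.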
Introduction to Nagios – basics - 2
Nagios (continued):
• displays via Web interface to authorised users
• sends notification via e-mail, SMS, RSS, Morse code, jungle drums etc
• may run an event handler, e.g. if a test fails, then put this batch node offline
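The event handler idea can be sketched as a small decision function. Nagios can pass the service state and state type to a handler as arguments; the policy of acting only on a confirmed (HARD) CRITICAL state, and the placeholder `echo` action, are assumptions for illustration. The real command that takes a batch node offline is site-specific and is not shown here.

```python
def should_take_offline(state, state_type):
    """Assumed policy: act only on a confirmed (HARD) CRITICAL state."""
    return state == "CRITICAL" and state_type == "HARD"


def offline_command(node):
    """Return a placeholder argv; a real handler would call the batch system."""
    return ["echo", f"would mark batch node {node} offline"]


def handle_event(node, state, state_type):
    """Return the command to run for this event, or None if nothing to do."""
    if should_take_offline(state, state_type):
        return offline_command(node)
    return None
```

Keeping the decision separate from the action makes the policy easy to check without touching a live batch system.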
Introduction to Nagios – networked clients
• Nagios server can use check_nrpe command to run test on networked client
• client must be running nrpe client process to
  – accept and run check requests
  – accept results and return to server
• Nagios server can also use ssh or smtp to perform checks (little experience on Tier1)
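As a rough sketch of the server side of this exchange, the following builds a check_nrpe command line and maps its exit code to a state name. The install path of check_nrpe and the remote command name `check_var` are assumptions; the remote command must be defined in the client's nrpe configuration.

```python
import subprocess

# Assumed install location; it varies between distributions and packages.
CHECK_NRPE = "/usr/lib/nagios/plugins/check_nrpe"

STATE_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def build_nrpe_argv(host, command, timeout=10):
    """Build the argv asking the nrpe process on `host` to run `command`."""
    return [CHECK_NRPE, "-H", host, "-c", command, "-t", str(timeout)]


def run_nrpe_check(host, command, timeout=10):
    """Run the check and return (state name, first line of output)."""
    try:
        proc = subprocess.run(
            build_nrpe_argv(host, command, timeout),
            capture_output=True,
            text=True,
            timeout=timeout + 5,
        )
    except (OSError, subprocess.TimeoutExpired) as err:
        return "UNKNOWN", f"could not run check_nrpe: {err}"
    lines = proc.stdout.strip().splitlines()
    message = lines[0] if lines else ""
    return STATE_NAMES.get(proc.returncode, "UNKNOWN"), message
```

For example, `run_nrpe_check("ganglia0430", "check_var")` would ask the client to run its locally configured `check_var` command and hand back the result.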
Single server, many clients
(Diagram: one Nagios server connected to several Nagios clients)
Introduction to Nagios – slave servers
• Running scheduled checks and web server puts heavy load on Nagios server
• Tier1 uses master and slave servers:
  – master keeps all results, runs web server and sends notifications
  – slaves schedule tests, run them and return results to master (using send_nsca command to nsca daemon)
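A slave hands a result to the master by feeding send_nsca one tab-separated line per result on standard input: host name, service description, return code and plugin output. The sketch below only formats that line and the command to run; the config file path and the service name used in the usage note are assumptions.

```python
VALID_CODES = (0, 1, 2, 3)


def format_nsca_line(host, service, return_code, output):
    """Format one passive service result for send_nsca's standard input."""
    if return_code not in VALID_CODES:
        raise ValueError(f"invalid return code: {return_code}")
    # Tabs and newlines would break the field layout, so flatten them.
    clean = output.replace("\t", " ").replace("\n", " ")
    return f"{host}\t{service}\t{return_code}\t{clean}\n"


def build_send_nsca_argv(master, config="/etc/nagios/send_nsca.cfg"):
    """Build the argv that sends results to the nsca daemon on `master`.

    The config path is an assumed default, not the Tier1 location.
    """
    return ["send_nsca", "-H", master, "-c", config]
```

A slave would write `format_nsca_line("ganglia0430", "var-space", 0, "DISK OK - 50.0% used")` to the standard input of the command returned by `build_send_nsca_argv`.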
Introduction to Nagios – “freshness”
If slave server has crashed:
• master server checks whether tests have been run to schedule (freshness checking)
• if test is stale (test results not returned to schedule), master will run test (force check)
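The freshness rule amounts to comparing the age of the last result with a threshold. Nagios does this itself when the `check_freshness` and `freshness_threshold` directives are set on a service; the sketch below only illustrates the logic, and the function names and sample data are hypothetical.

```python
def is_stale(last_result_time, now, freshness_threshold):
    """True if the last result is older than the threshold (all in seconds)."""
    return (now - last_result_time) > freshness_threshold


def stale_services(last_results, now, freshness_threshold):
    """Return the names of services whose results have gone stale.

    `last_results` maps a service name to the time its last result arrived.
    """
    return sorted(
        name
        for name, received in last_results.items()
        if is_stale(received, now, freshness_threshold)
    )
```

Each name returned is a service for which the master would run the test itself.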
Master and slave servers; many clients
(Diagram: one master server above three slave servers, each slave server connected to several clients)
Introduction to Nagios – clearing alarms
If check condition has been corrected and you want to clear alarm before the next scheduled test:
• can force check (from master or slave) by issuing appropriately formatted command to server
• scripts available to do this
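One such formatted command is the Nagios external command `SCHEDULE_FORCED_SVC_CHECK`, written as a single line to the server's command file. The sketch below is not one of the Tier1 scripts; the command file location must be supplied by the caller because it depends on the installation, and the service name in the usage note is hypothetical.

```python
import time


def forced_service_check(host, service, when=None):
    """Format a SCHEDULE_FORCED_SVC_CHECK external command line."""
    if when is None:
        when = int(time.time())
    return f"[{when}] SCHEDULE_FORCED_SVC_CHECK;{host};{service};{when}\n"


def submit(command_line, command_file):
    """Write the command to the Nagios command file (a named pipe).

    Opening the pipe blocks until Nagios has it open for reading.
    """
    with open(command_file, "w") as pipe:
        pipe.write(command_line)
```

For example, `submit(forced_service_check("ganglia0430", "var-space"), command_file)` asks the server to rerun that test immediately instead of waiting for the next scheduled run.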
Introduction to Nagios - configuration
In our configuration Nagios knows about:
– hosts
– host groups
– services (for checking)
– contacts and contact groups
– time periods (when tests are valid, when to send contact messages)
Introduction to Nagios - configuration
• Configuration is made simpler by extensive use of templates, for example:
  – define a template for a generic host
  – use it to define many other hosts, only changing parameters that are different (e.g. host name, address, group to which it belongs)
  – can be recursive
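The way a host picks up values from a template, including templates that use other templates, can be sketched as a recursive merge. This is a simplified model for illustration, not how Nagios parses its files: the dictionary layout and the `resolve` function are assumptions, and only single inheritance through `use` is handled.

```python
# Directives that describe the template itself and are not passed on.
NOT_INHERITED = {"name", "register"}


def resolve(key, objects, _seen=()):
    """Return the effective settings of objects[key] after following `use`."""
    if key in _seen:
        raise ValueError(f"template loop at {key!r}")
    obj = objects[key]
    merged = {}
    parent = obj.get("use")
    if parent is not None:
        inherited = resolve(parent, objects, _seen + (key,))
        merged.update(
            {k: v for k, v in inherited.items() if k not in NOT_INHERITED}
        )
    # The object's own values override anything inherited.
    merged.update({k: v for k, v in obj.items() if k != "use"})
    return merged


objects = {
    "generic-host": {
        "name": "generic-host",
        "register": "0",
        "max_check_attempts": "10",
        "notification_period": "24x7",
    },
    "ganglia0430": {
        "use": "generic-host",
        "host_name": "ganglia0430",
        "address": "130.246.183.173",
    },
}

effective = resolve("ganglia0430", objects)
```

Here `effective` holds the host's own `host_name` and `address` together with `max_check_attempts` and `notification_period` taken from the template.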
# Generic host definition template
define host{
    name                            generic-host      ; name of host template
    notifications_enabled           1                 ; Host notifications are enabled
    event_handler_enabled           1                 ; Host event handler is enabled
    flap_detection_enabled          1                 ; Flap detection is enabled
    process_perf_data               1                 ; Process performance data
    retain_status_information       1                 ; Retain status information
    retain_nonstatus_information    1                 ; Retain non-status information
    register                        0                 ; Template definition
    check_command                   check-host-alive
    max_check_attempts              10
    notification_interval           720
    notification_period             24x7
    notification_options            d,u,r
}
define host{
    use               generic-host
    host_name         ganglia0430
    parents           swt-5530-0
    alias             Ganglia Host
    hostgroups        aux-services
    contact_groups    thorne
    address           130.246.183.173
}
define host{
    use               generic-host
    host_name         shelob
    parents           swt-4400-1
    alias             CSF Webserver
……………
Introduction to Nagios - plugins
• Test scripts are known as plugins
• Can be written in any suitable language: shell script, Perl, C, Pascal
• About 60 standard plugins (available by RPM from Dag Wieers’ repository)
• 30+ locally written plugins
• plus 14+ specially written for Castor
Nagios links
• Nagios home page: http://www.nagios.org/
• For locally written plugins: http://cvs.gridpp.rl.ac.uk/viewcvs/viewcvs.cgi/nagios/plugins/
• For GridPP information about Nagios: http://www.gridpp.ac.uk/wiki/Nagios