staying sane with nagios
DESCRIPTION
From an invited talk I did at PICC-10 (now known as LOPSA-East) about how to manage a Nagios installation without pulling your hair out. In the ensuing years, I've automated more, but still have the same kind of mindset about inheritance and so on.
TRANSCRIPT
![Page 1: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/1.jpg)
Staying Sane with Nagios
Matt Simmons
@standaloneSA
http://www.standalone-sysadmin.com
![Page 2: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/2.jpg)
Introduction & Outline
Confessions:
- I am not actually a Nagios Expert
- I do actually LIKE Nagios

Outline:
- Global Sanity
- Small & Medium Shops
- Large Scale Shops
- Add Ons
- Warnings
- Additional Resources
![Page 3: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/3.jpg)
I know what you're thinking...
Nagios?
Sane???
Unlikely!!!
Serenity Now!!!
![Page 4: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/4.jpg)
Nagios? SANE?!?
Serenity Now!!!
![Page 5: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/5.jpg)
Global Sanity
Universal advice that affects installations of all sizes:
- Documentation
- Centralized Authentication
- Plugin Development
![Page 6: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/6.jpg)
Global Sanity: Documentation
- Read the documentation
- Object Definitions: http://nagios.sourceforge.net/docs/3_0/objectdefinitions.html
- Use 3_0 when searching
- Bookmark the good ones
- Nagiosbook.org will soon be coming out with 3.x docs: http://www.nagiosbook.org/
![Page 7: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/7.jpg)
Global Sanity: Central Auth
- Centralized Authentication
  - LDAP / AD with Apache (I use Likewise Open)
- Domain users -> Nagios Contacts
  - [email protected]
- Access to CGI interface
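
The mapping from domain users to Nagios contacts could look roughly like this. The CGIs show an authenticated user the hosts and services for which a contact of the same name is listed, so the contact name should match whatever username Apache hands over. This is only a sketch: the user `jdoe`, the email address, and the `generic-contact` template are assumptions, not part of the talk.

```cfg
# contacts.cfg: contact_name matches the username Apache authenticates
define contact{
        use             generic-contact     ; assumed template holding notification settings
        contact_name    jdoe                ; hypothetical domain login
        alias           Jane Doe
        email           jdoe@example.com
        }

# cgi.cfg: require authentication, then list who may see everything
use_authentication=1
authorized_for_all_hosts=nagiosadmin,jdoe
authorized_for_all_services=nagiosadmin,jdoe
```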
![Page 8: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/8.jpg)
Global Sanity: Do Not Reinvent the Wheel...
Nagios Exchange: http://exchange.nagios.org/
Pros:
- Nearly 2000 listings
- >1600 plugins
Cons:
- Varying quality and reliability
- Old, unmaintained, code rot, etc.
![Page 9: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/9.jpg)
Global Sanity: ...unless you have to
Writing your own Nagios plugins:
- Great guide: http://nagiosplug.sourceforge.net/developer-guidelines.html
- Extended Output
- Huge Community
- Any language you want
![Page 10: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/10.jpg)
Small & Medium Shops
Not exclusively small or medium, just a non-automatic way of doing things
For people who:
- Manually edit / create entries in config files
- Don't use extensive 3rd party management software
- Have a small team of responsible admins
- Don't require large distributed monitoring networks
![Page 11: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/11.jpg)
Configuration Sanity
When:
- Creating new configs
- Working with existing configs
- Testing
- Responding to events
![Page 12: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/12.jpg)
Syntax Highlighting
This?
![Page 13: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/13.jpg)
Syntax Highlighting
Or this?
![Page 14: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/14.jpg)
Config File Hierarchy
- Default config is stupid.
- cfg_dir directive is key
  - *.cfg – recursively
- Hierarchy should resemble “real life”
- Allows for additional “group” security
- Use what makes sense to you and document it
![Page 15: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/15.jpg)
Config File Hierarchy: Example
Output of “tree -d” on my Nagios objects directory
    |-- commands
    |-- computers
    |   |-- groups
    |   |-- linux
    |   |   `-- services
    |   `-- windows
    |-- misc
    `-- network
        |-- firewalls
        |-- links
        |-- routers
        `-- switches
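
A layout like the one above can be loaded with a single directive in the main config file, since `cfg_dir` reads every `*.cfg` file in the named directory and its subdirectories. A minimal sketch; the path is an assumption and should point at wherever your objects directory actually lives.

```cfg
# nagios.cfg
# One line covers commands/, computers/, misc/, network/ and everything below them.
cfg_dir=/usr/local/nagios/etc/objects
```

New subdirectories are then picked up on the next reload without touching `nagios.cfg` again.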
![Page 16: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/16.jpg)
Regular Expressions
Not all regexes are created equal
- use_regexp_matching
  - Only when object names contain: * ?
- use_true_regexp_matching
  - 'man regex'
  - All object names
  - Caution: Unintended consequences
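
A rough sketch of the safer of the two settings in use. With only `use_regexp_matching` enabled, a name is treated as a pattern only when it contains a wildcard character, so ordinary names keep matching literally. The `linux*` pattern and the `generic-service` template are assumptions for illustration; test against your own object names before relying on it.

```cfg
# nagios.cfg
# Wildcard matching, only for names that contain * or ?
use_regexp_matching=1
# Leave this off unless every object name really should be treated as a regex
use_true_regexp_matching=0

# Object config: one service for every host whose name starts with "linux"
define service{
        use                     generic-service     ; assumed template
        host_name               linux*              ; hypothetical pattern
        service_description     SSH service check
        check_command           check_ssh
        }
```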
![Page 17: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/17.jpg)
Better Object Formatting
This?
![Page 18: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/18.jpg)
Better Object Formatting
Or this?
![Page 19: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/19.jpg)
Revision Control
- CVS/SVN/git(?)
- Simple, maintainable, recoverable
- Self-documenting (if done correctly)
![Page 20: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/20.jpg)
(ab)Use Inheritance
- Templates
  - register = 0
- Multiple Inheritance
  - Beware the spaghetti code
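
A minimal sketch of what templates and multiple inheritance look like in object config. The template names are hypothetical, and the definitions leave out some directives a real host needs (contacts, check period) to keep the example short. When a host lists several templates in `use`, values from the earlier ones win.

```cfg
# Template: "register 0" means it is only inherited from, never monitored
define host{
        name                    base-host           ; hypothetical template name
        check_command           check-host-alive
        max_check_attempts      5
        notification_interval   60
        notification_period     24x7
        register                0
        }

# A second template that only adds group membership
define host{
        name                    linux-defaults      ; hypothetical template name
        hostgroups              linuxservers
        register                0
        }

# Multiple inheritance: linux-defaults is consulted first, then base-host
define host{
        use                     linux-defaults,base-host
        host_name               linux01
        address                 192.168.0.10
        }
```

Two levels like this stay readable; much deeper and it gets hard to tell where a value came from.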
![Page 21: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/21.jpg)
Use Hostgroups
    define service{
        service_description     SSH Service Check
        check_command           check_ssh
        host_name               linux01, linux02, linux03, ... linux50
        }
![Page 22: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/22.jpg)
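Use Hostgroups

    # Define the group once
    define hostgroup{
        hostgroup_name          linuxservers
        }

    # Each host joins the group in its own definition
    define host{
        use                     generichost
        host_name               linux01
        address                 192.168.0.10
        hostgroups              linuxservers
        }

    # One service definition covers every member of the group
    define service{
        service_description     SSH service check
        check_command           check_ssh
        hostgroup_name          linuxservers
        }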
![Page 23: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/23.jpg)
Script / Automate
Automate as much as possible:
- New Hosts
- New Services
- Commands

mkhost.sh as a template
![Page 24: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/24.jpg)
Use alternate contacts file when testing new features
- Coworkers are under enough stress as it is
- No messy explanations
- Use symlinks to point to “real” contacts file
![Page 25: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/25.jpg)
Plugin Sanity
Thoughts about writing, configuring, and using Nagios plugins
![Page 26: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/26.jpg)
SNMP
Use it whenever possible. Really.
![Page 27: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/27.jpg)
NRPE vs check_by_ssh
NRPE
- Nagios Remote Plugin Executor
- Skip it when possible
  - Use SNMP
![Page 28: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/28.jpg)
When checking disk usage
- Do not specify the partitions to check
- Instead, specify the partitions to NOT check
  - Too easy to forget to add new partitions.
- If possible, use a plugin that produces statistics for graphing usage trends
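
One way to express "exclude, don't include" is to run the standard `check_disk` plugin without naming any partition and only list the filesystem types to skip. This sketch runs it remotely through `check_by_ssh`; the command name, the remote plugin path, the thresholds, and the excluded types are all assumptions, and an SNMP-based disk check would follow the same idea with a different plugin.

```cfg
define command{
        command_name    check_remote_disks      ; hypothetical command name
        # No -p option: every mounted local filesystem is checked,
        # except the types excluded with -X.
        command_line    $USER1$/check_by_ssh -H $HOSTADDRESS$ -C "/usr/local/nagios/libexec/check_disk -w $ARG1$ -c $ARG2$ -l -X tmpfs -X devtmpfs"
        }

define service{
        use                     generic-service         ; assumed template
        hostgroup_name          linuxservers
        service_description     Disk Usage
        check_command           check_remote_disks!15%!5%
        }
```

`check_disk` also emits performance data, which is what a graphing add-on would consume for usage trends.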
![Page 29: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/29.jpg)
Notification Sanity
Notifications suck. Here are some ways to make them not suck as much.
![Page 30: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/30.jpg)
Alternate Communication Method
- When the network is down, email is down too
- Have a non-email contact method
  - SMS, cell modem, smoke signals
- Test it occasionally
![Page 31: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/31.jpg)
Use parents
- Establish a path FROM THE NAGIOS SERVER
- Failure will trigger “unreachable” states
  - “u” notification flag
- Only useful for non-local-subnet hosts, typically
  - If the local switch dies, alerts don't go out anyway
  - Typically
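
A small sketch of a parent relationship, assuming a remote subnet that sits behind one router. Host names and addresses are made up; `generichost` is the template name used in the hostgroup example earlier. If the router goes down, the web server behind it is reported as UNREACHABLE rather than DOWN, and the contact's notification options decide whether anyone hears about it.

```cfg
define host{
        use             generichost
        host_name       branch-router01     ; hypothetical router between Nagios and the subnet
        address         10.1.0.1
        }

define host{
        use             generichost
        host_name       branch-web01        ; hypothetical host behind that router
        address         10.1.0.20
        parents         branch-router01
        }

define contact{
        use                         generic-contact     ; assumed template
        contact_name                jdoe                ; hypothetical contact
        # d = down, r = recovery. Add "u" to also be told about unreachable hosts.
        host_notification_options   d,r
        }
```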
![Page 32: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/32.jpg)
Use Dependencies
- Available for both hosts and services
  - The disks didn't blow up, SNMP crashed
  - What do you mean, the website is unavailable when the database crashes
- Dependencies != parents
  - Parents establish a line between the host and Nagios
  - Dependencies establish logical object relationships
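
The website-and-database case might be written like this. Host and service names are hypothetical. The website keeps being checked, but its notifications are suppressed while the database service is in a non-OK state.

```cfg
define servicedependency{
        host_name                       db01        ; the service being depended on
        service_description             MySQL
        dependent_host_name             web01       ; the service that depends on it
        dependent_service_description   Website
        execution_failure_criteria      n           ; n = never skip the dependent check
        notification_failure_criteria   w,u,c       ; stay quiet while MySQL is WARNING, UNKNOWN or CRITICAL
        }
```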
![Page 33: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/33.jpg)
Notifications are Commands
- Use Them
  - Execute what you need, when you need, where you need through extra-nagios scripts
- Your imagination is the limit
  - Electrical relays?
  - Flashing lights?
  - HALON release?
    - Please don't.
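
Because a notification is just a command definition, pointing it at your own script takes a few lines. A sketch under assumptions: the script path `/usr/local/bin/notify.sh`, the contact, and the pager number are invented; the macros are standard ones that Nagios fills in at notification time.

```cfg
define command{
        command_name    notify-service-by-script    ; hypothetical command name
        command_line    /usr/local/bin/notify.sh "$NOTIFICATIONTYPE$" "$HOSTNAME$" "$SERVICEDESC$" "$SERVICESTATE$" "$CONTACTPAGER$"
        }

define contact{
        use                             generic-contact     ; assumed template
        contact_name                    oncall              ; hypothetical contact
        pager                           5551234567
        service_notification_commands   notify-service-by-script
        }
```

What the script does with those arguments (SMS gateway, chat message, blinking light) is up to you.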
![Page 34: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/34.jpg)
Use Passive Checks (when necessary / appropriate)
- For “normal” passive checks, specify freshness checks
- Useful for SNMP traps
  - Combine with snmptrapd
- Distributed Monitoring
  - Use for capacity reasons
  - Physical separation calls for separate Nagios installs (in my opinion)
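
A sketch of a passive service with a freshness check, assuming a nightly job that reports its own result. If nothing arrives within the threshold, Nagios runs the `check_command` itself, and here that command simply returns UNKNOWN with an explanatory message. Host name, service name, and the 26-hour threshold are assumptions.

```cfg
define command{
        command_name    check_dummy
        command_line    $USER1$/check_dummy $ARG1$ $ARG2$
        }

define service{
        use                     generic-service     ; assumed template
        host_name               backup01            ; hypothetical host
        service_description     Nightly Backup
        active_checks_enabled   0                   ; results are submitted, not polled
        passive_checks_enabled  1
        check_freshness         1
        freshness_threshold     93600               ; seconds (26 hours)
        check_command           check_dummy!3!"No backup result received in 26 hours"
        }
```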
![Page 35: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/35.jpg)
Macros GOOD
- 60 bajillion available: http://nagios.sourceforge.net/docs/3_0/macrolist.html
- On Demand Macros
  - Specify “remote” macros from other hosts
  - $HOSTMACRO:SOMEHOST$
- Custom Variable Macros
  - _MACADDRESS 00:01:02:03:04:05
  - $_HOSTMACADDRESS$
- Available as environment variables in scripts
  - $NAGIOS_MACRONAME
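
A short sketch showing both kinds of macro in place. The custom variable is the one from the slide; the `wake-host` command, its script, and the `router01` host are hypothetical. Custom host variables start with an underscore in the definition and are referenced with a `_HOST` prefix.

```cfg
define host{
        use             generichost
        host_name       linux01
        address         192.168.0.10
        _MACADDRESS     00:01:02:03:04:05       ; custom variable, read back as $_HOSTMACADDRESS$
        }

define command{
        command_name    wake-host               ; hypothetical command and script
        command_line    /usr/local/bin/wake.sh $_HOSTMACADDRESS$
        }

define command{
        command_name    check_gateway_ping      ; hypothetical command
        # On-demand macro: the address of router01, whatever host is being checked
        command_line    $USER1$/check_ping -H $HOSTADDRESS:router01$ -w 100.0,20% -c 500.0,60%
        }
```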
![Page 36: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/36.jpg)
Use Flap Detection
Or not. Who wants a charged cellphone battery?
Measures state changes:
- Weighted measure of the last 21 checks
- More recent state changes count for more
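
Flap detection can be tuned per service. In this sketch the service is considered flapping once the weighted state-change percentage rises above the high threshold, and stops flapping when it falls below the low one. The threshold values are assumptions to adjust for your environment, and the global `enable_flap_detection` setting in nagios.cfg has to be on as well.

```cfg
define service{
        use                         generic-service     ; assumed template
        host_name                   linux01
        service_description         SSH service check
        check_command               check_ssh
        flap_detection_enabled      1
        low_flap_threshold          5.0         ; percent state change: stop flapping below this
        high_flap_threshold         20.0        ; percent state change: start flapping above this
        flap_detection_options      o,w,c,u     ; states that count toward the calculation
        }
```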
![Page 37: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/37.jpg)
Large Shops
Too many nodes to easily configure by hand, or too many nodes to deal with using one server
- Scaling Nagios
- Centralized Management
- Web Configurators
![Page 38: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/38.jpg)
Scaling Nagios
- use_large_installation_tweaks
  - No summary macros, memory handling is different, and processes fork() less
- Distributed monitoring
  - Assign groups of hosts to one Nagios server (reporting via NSCA / passive checks)
- Check tuning docs: http://nagios.sourceforge.net/docs/3_0/tuning.html
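
A sketch of the settings involved, following the pattern in the Nagios 3 distributed monitoring docs. The script path is an assumption; the script itself is expected to wrap `send_nsca` and forward each result to the central server, where the same services exist as passive checks.

```cfg
# nagios.cfg on a large central server
use_large_installation_tweaks=1
enable_environment_macros=0

# nagios.cfg on each distributed server:
# hand every service result to a command that forwards it
obsess_over_services=1
ocsp_command=submit_check_result

# Object config on each distributed server
define command{
        command_name    submit_check_result
        command_line    /usr/local/nagios/libexec/eventhandlers/submit_check_result $HOSTNAME$ '$SERVICEDESC$' $SERVICESTATE$ '$SERVICEOUTPUT$'
        }
```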
![Page 39: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/39.jpg)
Centralized Management
- Puppet / chef / cfengine / whatever
  - Distribute nagios user's key if necessary
  - Install nagios agents (NSCA / NRPE)
  - Automate Configuration Build
- Puppet's built-in Nagios types sound convenient...sort of
![Page 40: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/40.jpg)
Nagios Web Configuration
- Dozens, if not hundreds
- I don't know of a great one.
- May be worth building or finding one that matches your inventory system
  - Don't double-up on data if you don't have to
![Page 41: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/41.jpg)
Malproductive Practices
- Overreliance on Event Handlers
  - Please don't do anything terribly important.
  - Edge cases are scary.
- Overabuse of inheritance
  - Spaghetti code
  - Hard to trace
- Overcomplication
  - Simple is nearly always better
![Page 42: Staying Sane with Nagios](https://reader033.vdocuments.net/reader033/viewer/2022051014/54b72adc4a795916198b47cd/html5/thumbnails/42.jpg)
Learn More
- Mailing List: Nagios Users
  - https://lists.sourceforge.net/lists/listinfo/nagios-users
- LinkedIn: Nagios Users
  - http://www.linkedin.com/groupAnswers?viewQuestions=&gid=131532&forumID=3&sik=1272591931152