Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you
DESCRIPTION
Andrew Widdersheim's presentation on Nagios high availability. The presentation was given during the Nagios World Conference North America, held Sept 25-28th, 2012 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/nwcna
TRANSCRIPT
![Page 2: Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you](https://reader036.vdocuments.net/reader036/viewer/2022070304/54c901624a795961428b4578/html5/thumbnails/2.jpg)
2012 2
Nooooooooooo!!!
![Page 3: Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you](https://reader036.vdocuments.net/reader036/viewer/2022070304/54c901624a795961428b4578/html5/thumbnails/3.jpg)
Breaking News!
![Page 4: Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you](https://reader036.vdocuments.net/reader036/viewer/2022070304/54c901624a795961428b4578/html5/thumbnails/4.jpg)
Nagios High Availability Options
Merlin by op5
Classic method described in Nagios Core documentation
Some type of virtualized solution like VMware
or…
![Page 5: Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you](https://reader036.vdocuments.net/reader036/viewer/2022070304/54c901624a795961428b4578/html5/thumbnails/5.jpg)
Nagios High Availability
+
= Win
![Page 6: Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you](https://reader036.vdocuments.net/reader036/viewer/2022070304/54c901624a795961428b4578/html5/thumbnails/6.jpg)
DRBD magic
![Page 7: Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you](https://reader036.vdocuments.net/reader036/viewer/2022070304/54c901624a795961428b4578/html5/thumbnails/7.jpg)
DRBD magic
Linbit
Free
Runs in the kernel, either as a module or from the mainline code if the kernel is new enough
Each server gets its own independent storage
Able to maintain the data’s consistency between the nodes
Resource level fencing
![Page 8: Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you](https://reader036.vdocuments.net/reader036/viewer/2022070304/54c901624a795961428b4578/html5/thumbnails/8.jpg)
DRBD considerations
DRBD is as fast as the slowest node
Network latency
Replication over great distances can be done
DRBD proxy can increase performance over great distances but does cost money
Recommend using a dedicated crossover link for best performance
Protocol Choices
Protocol A: write I/O is reported as completed once it has reached the local disk and the local TCP send buffer.
Protocol B: write I/O is reported as completed once it has reached the local disk and the remote buffer cache.
Protocol C: write I/O is reported as completed once it has reached both the local and the remote disk.
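The three protocol definitions can be modelled as sets of stages a write must reach before it is acknowledged. This is only an illustrative sketch: the stage names and the `write_completed` helper are invented for this example and are not part of DRBD.

```python
from enum import Enum


class Protocol(Enum):
    A = "A"  # asynchronous
    B = "B"  # memory synchronous
    C = "C"  # synchronous


# Stages a write must reach before it is reported as completed,
# following the three definitions on the slide. Stage names are hypothetical.
REQUIRED_STAGES = {
    Protocol.A: {"local_disk", "local_tcp_send_buffer"},
    Protocol.B: {"local_disk", "remote_buffer_cache"},
    Protocol.C: {"local_disk", "remote_disk"},
}


def write_completed(protocol, stages_reached):
    """Return True if a write that has reached `stages_reached` would be acknowledged."""
    return REQUIRED_STAGES[protocol] <= set(stages_reached)


# A write that is on the local disk and queued for sending, but not yet on the peer.
in_flight = {"local_disk", "local_tcp_send_buffer"}
for proto in Protocol:
    print(proto.name, write_completed(proto, in_flight))
```

Only Protocol A acknowledges the in-flight write, which is why A is the fastest of the three and C is the safest against losing acknowledged writes if the primary fails.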
![Page 9: Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you](https://reader036.vdocuments.net/reader036/viewer/2022070304/54c901624a795961428b4578/html5/thumbnails/9.jpg)
Pacemaker
![Page 10: Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you](https://reader036.vdocuments.net/reader036/viewer/2022070304/54c901624a795961428b4578/html5/thumbnails/10.jpg)
Pacemaker + DRBD + Nagios
Resource Manager: Pacemaker
Messaging: Corosync / Heartbeat
Hardware: Node1, Node2
DRBD: Primary, Secondary
Filesystem: ext4
VIP: 192.168.1.57
Nagios Stuff: rrdcached, NSCA, NPCD, Apache, Nagios
![Page 11: Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you](https://reader036.vdocuments.net/reader036/viewer/2022070304/54c901624a795961428b4578/html5/thumbnails/11.jpg)
Pacemaker + DRBD + Nagios
Resource Manager: Pacemaker
Messaging: Corosync / Heartbeat
Hardware: Node1, Node2
DRBD: Primary, Secondary
Filesystem: ext4
VIP: 192.168.1.57
Nagios Stuff: rrdcached, NSCA, NPCD, Apache, Nagios
![Page 12: Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you](https://reader036.vdocuments.net/reader036/viewer/2022070304/54c901624a795961428b4578/html5/thumbnails/12.jpg)
Pacemaker + DRBD + Nagios
Resource Manager: Pacemaker
Messaging: Corosync / Heartbeat
Hardware: Node1, Node2
DRBD: Primary, Secondary
Filesystem: ext4
VIP: 192.168.1.57
Services: rrdcached, NSCA, NPCD, Apache, Nagios
![Page 13: Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you](https://reader036.vdocuments.net/reader036/viewer/2022070304/54c901624a795961428b4578/html5/thumbnails/13.jpg)
Pacemaker and Nagios
![Page 14: Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you](https://reader036.vdocuments.net/reader036/viewer/2022070304/54c901624a795961428b4578/html5/thumbnails/14.jpg)
Pacemaker and Nagios
primitive p_nagios lsb:nagios \
    op start interval="0" timeout="180s" \
    op stop interval="0" timeout="40s" \
    op monitor interval="30s" \
    meta target-role="Started"
primitive p_fs_nagios ocf:heartbeat:Filesystem \
    params device="/dev/drbd/by-res/r1" directory="/drbd/r1" fstype="ext4" options="noatime" \
    op start interval="0" timeout="60s" \
    op stop interval="0" timeout="180s" \
    op monitor interval="30s" timeout="40s"
group g_nagios p_fs_nagios p_nagios_ip p_nagios_bacula p_nagios_mysql \
    p_nagios_rrdcached p_nagios_npcd p_nagios_nsca p_nagios_apache \
    p_nagios_syslog-ng p_nagios \
    meta target-role="Started"
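Members of a Pacemaker group are started in the order they are listed and stopped in the reverse order, so the ordering of g_nagios matters: the DRBD-backed filesystem comes up first and Nagios itself comes up last. The sketch below only illustrates that ordering with the member names from the group definition; `start_order` and `stop_order` are hypothetical helpers, not Pacemaker calls.

```python
# Members of g_nagios in the order they appear in the group definition above.
G_NAGIOS = [
    "p_fs_nagios", "p_nagios_ip", "p_nagios_bacula", "p_nagios_mysql",
    "p_nagios_rrdcached", "p_nagios_npcd", "p_nagios_nsca", "p_nagios_apache",
    "p_nagios_syslog-ng", "p_nagios",
]


def start_order(group):
    """Group members start in the order listed."""
    return list(group)


def stop_order(group):
    """Group members stop in the reverse of the listed order."""
    return list(reversed(group))


print("first to start:", start_order(G_NAGIOS)[0])
print("first to stop: ", stop_order(G_NAGIOS)[0])
```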
![Page 15: Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you](https://reader036.vdocuments.net/reader036/viewer/2022070304/54c901624a795961428b4578/html5/thumbnails/15.jpg)
Pacemaker and Nagios
![Page 16: Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you](https://reader036.vdocuments.net/reader036/viewer/2022070304/54c901624a795961428b4578/html5/thumbnails/16.jpg)
Pacemaker considerations
Redundant communication links are a must
Recommend use of crossover to help accomplish this
Init scripts for Nagios must be LSB compliant… some are not
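One way to find a non-compliant init script before Pacemaker does is to walk it through the usual LSB sequence and compare exit codes. The sketch below is a rough illustration: `lsb_violations` and `FakeInitScript` are hypothetical names, and the fake script stands in for a real one so the example runs anywhere.

```python
# (action, expected exit code) for a service that begins in the stopped state.
LSB_SEQUENCE = [
    ("start", 0),
    ("status", 0),
    ("start", 0),   # starting an already running service must still succeed
    ("stop", 0),
    ("status", 3),  # 3 means "program is not running"
    ("stop", 0),    # stopping an already stopped service must still succeed
]


def lsb_violations(run_action):
    """Call run_action(action) -> exit code for each step and collect mismatches."""
    violations = []
    for step, (action, expected) in enumerate(LSB_SEQUENCE, start=1):
        actual = run_action(action)
        if actual != expected:
            violations.append((step, action, expected, actual))
    return violations


class FakeInitScript:
    """Stand-in for an init script; stopped_status_code lets us model a broken one."""

    def __init__(self, stopped_status_code=3):
        self.running = False
        self.stopped_status_code = stopped_status_code

    def __call__(self, action):
        if action == "start":
            self.running = True
            return 0
        if action == "stop":
            self.running = False
            return 0
        if action == "status":
            return 0 if self.running else self.stopped_status_code
        return 2  # invalid argument


print(lsb_violations(FakeInitScript()))                       # compliant script
print(lsb_violations(FakeInitScript(stopped_status_code=1)))  # wrong "stopped" status code
```

Against a real script, `run_action` could wrap something like `subprocess.run(["/etc/init.d/nagios", action]).returncode`; that path is an assumption, and the sequence should only be run on a node where it is safe to stop Nagios.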
![Page 17: Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you](https://reader036.vdocuments.net/reader036/viewer/2022070304/54c901624a795961428b4578/html5/thumbnails/17.jpg)
What to replicate?
Configuration
Host
Service
Multi check command files
Webinject command files
PNP4Nagios RRDs
Nagios log files
retention.dat
Mail Queue (eh…)
![Page 18: Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you](https://reader036.vdocuments.net/reader036/viewer/2022070304/54c901624a795961428b4578/html5/thumbnails/18.jpg)
Everything else?
Binaries and main configuration files installed using packages independently on each server
Able to update one node at a time
Easy to roll back should there be an issue
Version/change management
Consistent build process
NDO and MySQL hosted on separate HA cluster
![Page 19: Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you](https://reader036.vdocuments.net/reader036/viewer/2022070304/54c901624a795961428b4578/html5/thumbnails/19.jpg)
RPMs
Build and maintain our own RPMs
Lets us configure everything to our liking
Lets us update at our own pace
Controlled through SVN with a post-commit to automatically update our own Nagios repository with new packages/updates. Then it is as simple as doing “yum update” on your servers.
A lot of upfront work but was worth it
![Page 20: Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you](https://reader036.vdocuments.net/reader036/viewer/2022070304/54c901624a795961428b4578/html5/thumbnails/20.jpg)
How has this helped?
Have been able to repair, upgrade and move hardware with minimal downtime
Updated the OS and restarted servers with minimal downtime
Able to update to 3.4.1 and promptly patch an issue affecting Nagios downtimes that was not caught in QA
CGI pages of death
![Page 21: Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you](https://reader036.vdocuments.net/reader036/viewer/2022070304/54c901624a795961428b4578/html5/thumbnails/21.jpg)
What doesn’t this solve?
Having an HA cluster is great, but there are still things that can go wrong that a cluster does not solve
Configuration issues are probably the most prevalent thing we run into that might bring down Nagios without there being a major hardware/DC issue
We make use of NagiosQL, which does a backup when a configuration is changed. This allows us to roll back unwanted changes, but it isn't ideal.
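Since a bad configuration takes Nagios down on whichever node is active, one cheap safeguard is to run the Nagios pre-flight check (`nagios -v` against the main config file) before any restart. A minimal sketch follows; `config_is_valid` is a hypothetical helper, and the injectable `run` argument exists only so the function can be exercised without Nagios installed.

```python
import subprocess


def config_is_valid(nagios_bin, main_cfg, run=subprocess.run):
    """Run `nagios -v <main config>` and return True if it exits with status 0."""
    result = run([nagios_bin, "-v", main_cfg], capture_output=True, text=True)
    return result.returncode == 0


# Example call (both paths are assumptions about the install layout):
#   if config_is_valid("/usr/local/nagios/bin/nagios",
#                      "/usr/local/nagios/etc/nagios.cfg"):
#       ...restart through Pacemaker...
```

A wrapper like this could gate the restart step so that a broken change is rejected instead of being rolled back after the fact.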
![Page 22: Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you](https://reader036.vdocuments.net/reader036/viewer/2022070304/54c901624a795961428b4578/html5/thumbnails/22.jpg)
Two is better than one
Setting up another cluster for “development” with similar hardware and software is a great way to test things outside of production
Lets you spot potential problems before they become a problem
![Page 23: Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you](https://reader036.vdocuments.net/reader036/viewer/2022070304/54c901624a795961428b4578/html5/thumbnails/23.jpg)
Monitoring your cluster
check_crm
http://exchange.nagios.org/directory/Plugins/Clustering-and-High-2DAvailability/Check-CRM/details
check_drbd
http://exchange.nagios.org/directory/Plugins/Operating-Systems/Linux/check_drbd/details
check_heartbeat_link
http://exchange.nagios.org/directory/Plugins/Operating-Systems/Linux/check_heartbeat_link/details
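To give a feel for what a DRBD check looks at, here is a minimal sketch that parses DRBD 8.x style /proc/drbd text and maps it to the standard Nagios plugin exit codes. It is not the check_drbd plugin linked above; the function name, the sample text and the simplification that anything other than Connected and UpToDate/UpToDate is critical are all assumptions made for illustration.

```python
import re

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios plugin exit codes

# Matches a resource line in DRBD 8.x /proc/drbd output, for example:
#  0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
RESOURCE_LINE = re.compile(
    r"^[ \t]*(\d+): cs:(\S+) ro:(\S+) ds:(\S+)", re.MULTILINE
)

# Hypothetical sample of a healthy two-node resource.
SAMPLE = """version: 8.3.11 (api:88/proto:86-96)
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
"""


def check_drbd_text(proc_drbd_text):
    """Return (exit_code, message) for the given /proc/drbd contents."""
    resources = RESOURCE_LINE.findall(proc_drbd_text)
    if not resources:
        return UNKNOWN, "UNKNOWN: no DRBD resources found"
    problems = [
        f"minor {minor}: cs={cs} ds={ds}"
        for minor, cs, _ro, ds in resources
        if cs != "Connected" or ds != "UpToDate/UpToDate"
    ]
    if problems:
        return CRITICAL, "CRITICAL: " + "; ".join(problems)
    return OK, f"OK: {len(resources)} resource(s) connected and up to date"


code, message = check_drbd_text(SAMPLE)
print(code, message)
```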
![Page 24: Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you](https://reader036.vdocuments.net/reader036/viewer/2022070304/54c901624a795961428b4578/html5/thumbnails/24.jpg)
Gotchas
RPMs and symlinks in an HA solution are bad
If you symlink /usr/local/nagios/etc/ -> /drbd/r1/nagios/etc, updating the RPM while the node is secondary will blow the symlink away
Restarting services controlled by Pacemaker should be done within Pacemaker
crm resource restart p_nagios
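If symlinks into the DRBD mount cannot be avoided, a small check run after package updates can at least detect a clobbered link. This is a sketch under that assumption; `symlink_intact` is a hypothetical helper and the demo uses a temporary directory rather than the real /usr/local/nagios/etc path.

```python
import os
import tempfile


def symlink_intact(link_path, expected_target):
    """True if link_path is still a symlink that points at expected_target."""
    link_path = os.path.normpath(link_path)  # drop any trailing slash
    if not os.path.islink(link_path):
        return False
    return os.path.normpath(os.readlink(link_path)) == os.path.normpath(expected_target)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "drbd_etc")  # stands in for /drbd/r1/nagios/etc
    link = os.path.join(tmp, "etc")         # stands in for /usr/local/nagios/etc
    os.mkdir(target)
    os.symlink(target, link)
    before = symlink_intact(link + "/", target)

    # Simulate a package update replacing the symlink with a real directory.
    os.unlink(link)
    os.mkdir(link)
    after = symlink_intact(link, target)

print(before, after)
```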
![Page 25: Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you](https://reader036.vdocuments.net/reader036/viewer/2022070304/54c901624a795961428b4578/html5/thumbnails/25.jpg)
Quick Stats
Thousands of host and service checks
Average check latency ~0.300 sec
Average checks per second ~70
Mostly active checks polling every 5 minutes
DL360 G5
6 x 146GB 10k SAS drives in RAID10
2 x quad-core E5450 @ 3.00GHz
8GB Memory
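The two rates on this slide give a rough idea of how many checks are behind "thousands". The estimate below assumes every check is active and polled on the same 5-minute interval, which the slide only says is mostly true, so treat the result as an order-of-magnitude figure.

```python
checks_per_second = 70     # "~70" from the slide
check_interval_s = 5 * 60  # "every 5 minutes" from the slide

# Assumption: all checks are active and share the same 5-minute interval.
estimated_checks = checks_per_second * check_interval_s
print(estimated_checks)  # 21000
```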
![Page 26: Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you](https://reader036.vdocuments.net/reader036/viewer/2022070304/54c901624a795961428b4578/html5/thumbnails/26.jpg)
Tuning
RAM disk for check results queue, NPCD queue, objects.cache and status.dat
NDOUtils with async patch
Built in since version 1.5
Limit what you send to NDOUtils
Bulk Mode with npcdmod
rrdcached
Restarting Nagios through external command eventually resulted in higher latencies for some reason
Large installation tweaks
Disable environment macros
A lot of trial and error with scheduling and reaper frequencies
Small amount of check optimization
Measuring Nagios performance using PNP4Nagios is a must
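When moving the check results queue, objects.cache and status.dat onto a RAM disk, it is worth confirming that the paths really resolve to tmpfs and not to the DRBD-backed ext4 volume. The sketch below does that by matching a path against /proc/mounts style text; the function name, the sample mount table, the /dev/drbd1 device and the /var/nagios/ramdisk path are all hypothetical.

```python
def filesystem_type(path, mounts_text):
    """Return the fstype of the mount that contains `path` (longest mount point wins)."""
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fstype = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        inside = path == mount_point or path.startswith(prefix)
        # ">=" lets a later mount on the same mount point override an earlier one.
        if inside and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fstype
    return best_type


# Hypothetical mount table in /proc/mounts format.
SAMPLE_MOUNTS = """/dev/sda1 / ext4 rw 0 0
/dev/drbd1 /drbd/r1 ext4 rw,noatime 0 0
tmpfs /var/nagios/ramdisk tmpfs rw,size=512m 0 0
"""

print(filesystem_type("/var/nagios/ramdisk/status.dat", SAMPLE_MOUNTS))
print(filesystem_type("/drbd/r1/nagios/etc/nagios.cfg", SAMPLE_MOUNTS))
```

On a live system the second argument would be the contents of /proc/mounts.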
![Page 27: Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you](https://reader036.vdocuments.net/reader036/viewer/2022070304/54c901624a795961428b4578/html5/thumbnails/27.jpg)
RAM disk + ndo-async + rrdcached
![Page 28: Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you](https://reader036.vdocuments.net/reader036/viewer/2022070304/54c901624a795961428b4578/html5/thumbnails/28.jpg)
non-external command file restarts
![Page 29: Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you](https://reader036.vdocuments.net/reader036/viewer/2022070304/54c901624a795961428b4578/html5/thumbnails/29.jpg)
nsca-2.9
![Page 30: Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you](https://reader036.vdocuments.net/reader036/viewer/2022070304/54c901624a795961428b4578/html5/thumbnails/30.jpg)
One Year’s Progress
![Page 31: Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you](https://reader036.vdocuments.net/reader036/viewer/2022070304/54c901624a795961428b4578/html5/thumbnails/31.jpg)
How we run today
![Page 32: Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you](https://reader036.vdocuments.net/reader036/viewer/2022070304/54c901624a795961428b4578/html5/thumbnails/32.jpg)
Quick Stats
Questions?