Revamping Your 10 Year Old Nagios Installation
By Mike Guthrie

Post on 15-Apr-2017


Page 1: Mike Guthrie - Revamping Your 10 Year Old Nagios Installation

Revamping Your 10 Year Old Nagios Installation

By Mike Guthrie

Page 2

Case Study: Red Ventures

• Digital Marketing Company

• Acquire customers for our partners

– Optimize SEO for websites

– Take inbound call volume for sales calls

Page 3

RV Technology Notes

• LAMP Environment - PHP and JS

• We LOVE data – Many TB of DB storage

• We move fast…think Agile development on steroids.

• 50-60 in-house developers

• Redundancy – CLT and ATL datacenters

• Almost everything is clustered

• Our speed often creates technical debt

Page 4

March 2015 – Nagios Profile

• 2 Nagios Installations – CLT and ATL

• 1100 Hosts/8000 Services (Now 1500/13000)

– Linux servers (web, mysql, cron, load balancers)

– Windows servers (phone, terminal)

– Network (Routers, UPS, PDU)

• PNP4Nagios for Performance Data

• Thruk UI

Page 5

Key Problems

• No system to the configs whatsoever

• No consistency between ATL and CLT in setup

• Adding one check to a server type meant touching hundreds of files

• Terrible alert storms

• Misdirected or missing alerts

• Lots of hosts not being monitored at all

• ATL latency problems

• Broken escalations

• No effective historical reporting

Page 6

What Every Engineer Wants To Hear

Page 7

Goals

• Manageable configuration

• Minimize time spent on maintenance

• Reporting / Dashboards / Visualization

• Scalability

• Noise reduction

Page 8

Step 1: Fix Config Management

• Version controlled and synced:

– Contacts

– Templates

– Commands

– Hostgroups

– Escalations

– Dependencies

• Decoupled:

– Hosts

– One-off escalations and dependencies

Page 9

Step 1: Fix Config Management

• All hosts / services use templates

• Almost all service checks are applied through Service -> Hostgroup relationships

• Hostgroup = roles / attributes

– linux-server (Load, Memory, Disk, Procs, etc)

– mysql-server (Mysql, Slaving, Storage partition)

– supervisord-server (supervisord procs running)

• Use host variables for differing ports, SNMP strings, active disk partitions, etc
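The role idea above can be modeled in a few lines of Python. This is only an illustration of the relationship, not Nagios code: the `ROLE_SERVICES` table and the `services_for` helper are hypothetical names, and the service lists are copied from the hostgroup examples on this slide.

```python
# Hypothetical model: each hostgroup (role) carries its own list of service checks.
ROLE_SERVICES = {
    "linux-server": ["Load", "Memory", "Disk", "Procs"],
    "mysql-server": ["Mysql", "Slaving", "Storage partition"],
    "supervisord-server": ["supervisord procs running"],
}


def services_for(hostgroups):
    """Return the checks a host inherits from its hostgroups, in order, without duplicates."""
    services = []
    for group in hostgroups:
        for service in ROLE_SERVICES.get(group, []):
            if service not in services:
                services.append(service)
    return services


if __name__ == "__main__":
    # A MySQL box is just a Linux server that also has the mysql-server role.
    for name in services_for(["linux-server", "mysql-server"]):
        print(name)
```

In this model, adding one check to a role means editing a single list, which is the property that replaces "touching hundreds of files".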

Page 10

Host Config Before

define host {
    host_name               rv-atl-serverl01
    alias                   rv-atl-serverl01
    use                     rv-routers
    hostgroups              MsSQL
    contact_groups          phoneops
}

define service {
    host_name               rv-atl-serverl01
    use                     rv-routers-service
    service_description     PING
    contact_groups          phoneops
    check_command           check_ping!200.0,30%!300.0,70%
}

define service {
    host_name               rv-atl-serverl01
    use                     rv-windows-service
    service_description     DISK
    contact_groups          phoneops
    check_command           check_windows_disk!public!CHIJKM!80!85
}

define service {
    host_name               rv-atl-serverl01
    use                     rv-windows-service
    service_description     CPU LOAD
    check_command           check_snmp_load_windows!public!50!80
    contact_groups          phoneops
    check_period            24x7MinusSQLBackup
}

define service {
    host_name               rv-atl-serverl01
    use                     rv-windows-service
    service_description     SQL Service
    check_command           check_windows_service!public!SQL Server \\(MSSQLSERVER\\)
    contact_groups          phoneops
}

define service {
    host_name               rv-atl-serverl01
    use                     rv-windows-service
    service_description     VIRTUAL MEMORY USAGE
    check_command           check_snmp_misc!public!Virtual Memory!90!95
    contact_groups          phoneops
}

Host Config After

define host {
    host_name               rv-atl-serverl01
    alias                   rv-atl-serverl01
    use                     mssql-server
    hostgroups              windows-server,mssql-server
    _SNMP                   public
}

Page 11

Step 2: Config Automation

• Most of our servers are puppet managed (transitioning to Salt)

• Linux machines need to be self-aware of what they need to have monitored

• Linux servers use passive checks to propagate themselves up to Nagios
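The slides do not say how the passive results travel to Nagios (NSCA, NRDP, or the webhook shown in the next diagram are all possibilities), so the sketch below covers only the part that is standard: building a `PROCESS_SERVICE_CHECK_RESULT` external command line. The function name and the host `rv-atl-web01` are hypothetical.

```python
import time


def passive_service_result(host, service, return_code, output, now=None):
    """Build one Nagios external command line for a passive service check result."""
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return_code must be 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN)")
    timestamp = int(time.time() if now is None else now)
    # A newline would end the command early, so flatten the plugin output to one line.
    clean = " ".join(output.splitlines())
    return f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{return_code};{clean}"


if __name__ == "__main__":
    # A cron job on the monitored host could emit a line like this for each local check.
    print(passive_service_result("rv-atl-web01", "Load", 0, "OK - load average: 0.10"))
```

Because the host reports on itself, a result for a host that Nagios does not know yet is a natural trigger for the "Does this host exist?" step.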

Page 12

[Diagram: config automation flow. Arrow labels: Generate, Generate, Enforces, Process, Result]

Config Manager

• Puppet

• Salt

Remote Host

• NRPE Config

• Passive Crontab

Nagios / Webhook

• Does this host exist?

• Add Host

• Verify

• Restart

• Notify
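One way to read the "Does this host exist?" and "Add Host" steps is sketched below in Python. The slides do not show the actual webhook, so everything here is an assumption: the function names, the one-file-per-host layout, and the idea that roles map directly to hostgroups. The generated block mirrors the "Host Config After" example.

```python
from pathlib import Path


def render_host(host_name, template, hostgroups, custom_vars=None):
    """Render a Nagios host definition from a role list and optional custom variables."""
    lines = [
        "define host {",
        f"    host_name    {host_name}",
        f"    alias        {host_name}",
        f"    use          {template}",
        f"    hostgroups   {','.join(hostgroups)}",
    ]
    for name, value in sorted((custom_vars or {}).items()):
        # Nagios custom variables start with an underscore.
        lines.append(f"    _{name.upper()}    {value}")
    lines.append("}")
    return "\n".join(lines) + "\n"


def ensure_host(config_dir, host_name, template, hostgroups, custom_vars=None):
    """Write <config_dir>/<host_name>.cfg if it is missing; return True when a host was added."""
    path = Path(config_dir) / f"{host_name}.cfg"
    if path.exists():
        return False  # host already known, nothing to do
    path.write_text(render_host(host_name, template, hostgroups, custom_vars))
    return True


if __name__ == "__main__":
    print(render_host("rv-atl-serverl01", "mssql-server",
                      ["windows-server", "mssql-server"], {"snmp": "public"}))
```

The Verify, Restart, and Notify steps are left out of the sketch. A typical verification is running `nagios -v` against the main config file before restarting, so that a bad generated file never takes the daemon down.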

Page 13

Result

• CONSISTENCY!

• All Linux configs are now auto-generated

• Everything else is either cloned or generated from a custom webtool

• Maintenance time went from 10-20 hours per week to less than 1 hour most weeks

Page 14

Step 3: Reporting / Visualization

• Need NOC-level visibility

• Need cluster-level views of performance data

• Historical view of state changes and notifications

Page 15

[Diagram: reporting architecture]

• Nagios Core (CLT), Nagios Core (ATL)

• Perfdata (Carbon), Ndoutils (MySQL), Event Server? (TODO)

• Grafana Dashboards, Custom NOC Dashboard, Thruk UI

Page 16

NOC Dashboard

Page 17
Page 18

Graphite + Grafana = Awesome

• Opted not to use Graphios

service_perfdata_file_template=$TIMET$\t$HOSTNAME$\t$SERVICEDESC$\t$SERVICEPERFDATA$

service_perfdata_file_processing_command=<customScript>

• Nagios writes to buffer file

• Custom script grabs the buffer and flushes it to carbon

• Multiline socket write over UDP

• Will send 1000 data points in less than 0.03 seconds

• Carbon can scale far beyond anything we can throw at it
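The custom script itself is not shown in the slides, so the following is only a sketch of what such a flush could look like in Python. It assumes the tab-separated buffer format from the template above, Carbon's plaintext protocol (`path value timestamp`), a Carbon UDP listener enabled on port 2003, and a `datacenter.host.service.label` naming scheme inferred from the Grafana examples. The function names and the batch size are hypothetical.

```python
import re
import socket

# Perfdata looks like: label=value[UOM];warn;crit;min;max (label may be single-quoted).
# Only the label and the numeric value are kept in this sketch.
PERF_RE = re.compile(r"(?:'([^']+)'|([^\s=']+))=(-?\d+(?:\.\d+)?)")


def sanitize(part):
    """Make a string safe to use as a single node in a Graphite metric path."""
    return re.sub(r"[^A-Za-z0-9_-]", "_", part.strip())


def perfdata_to_metrics(line, prefix):
    """Convert one buffer line (time, host, service, perfdata) into Carbon plaintext lines."""
    timet, host, service, perfdata = line.rstrip("\n").split("\t", 3)
    metrics = []
    for quoted, bare, value in PERF_RE.findall(perfdata):
        label = quoted or bare
        path = ".".join([prefix, sanitize(host), sanitize(service), sanitize(label)])
        metrics.append(f"{path} {value} {timet}")
    return metrics


def flush_to_carbon(metrics, host="127.0.0.1", port=2003, batch_size=50):
    """Send metrics as multi-line UDP datagrams; return the number of datagrams sent."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sent = 0
    try:
        for i in range(0, len(metrics), batch_size):
            payload = "\n".join(metrics[i:i + batch_size]) + "\n"
            sock.sendto(payload.encode("utf-8"), (host, port))
            sent += 1
    finally:
        sock.close()
    return sent


if __name__ == "__main__":
    sample = "1429000000\trv-atl-server01\tCPU Load\tload1=0.52;5;10;0 load5=0.40;4;8;0"
    for metric in perfdata_to_metrics(sample, "atl"):
        print(metric)
```

Packing many lines into each datagram keeps the number of system calls low, which is consistent with the throughput figure quoted above. The trade-off is that UDP gives no delivery guarantee, so a dropped datagram means a gap in the graphs rather than an error.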

Page 19

Grafana

• Makes combining and templating graphs EASY

• Can combine all sorts of metrics on a graph and perform a variety of mathematical functions on them

atl.rv-atl-server*.CPU_Load.load5

*.rv-{atl,clt}-server*.CPU_Load.load

• Can set up new NOC dashboards in minutes

• Also using this for application monitoring data

• Allows us to easily spot performance anomalies

• Helps with event correlation
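The two metric paths on this slide use Graphite's wildcard syntax: `*` matches within a single dot-separated node, and `{a,b}` matches any one of the listed alternatives. The Python sketch below is a simplified illustration of those two rules only. It is not Graphite's implementation, the function name is hypothetical, and the sample metric names are made up to fit the slide's naming scheme.

```python
import re


def graphite_glob_to_regex(pattern):
    """Translate a Graphite-style path pattern (only * and {a,b}) into a compiled regex."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "*":
            out.append(r"[^.]*")  # a wildcard never crosses a dot
        elif ch == "{":
            end = pattern.index("}", i)
            choices = pattern[i + 1:end].split(",")
            out.append("(?:" + "|".join(re.escape(c) for c in choices) + ")")
            i = end  # skip past the closing brace
        else:
            out.append(re.escape(ch))
        i += 1
    return re.compile("^" + "".join(out) + "$")


if __name__ == "__main__":
    matcher = graphite_glob_to_regex("*.rv-{atl,clt}-server*.CPU_Load.load5")
    for name in ("atl.rv-atl-server01.CPU_Load.load5",
                 "clt.rv-clt-server07.CPU_Load.load5",
                 "atl.rv-atl-db01.CPU_Load.load5"):
        print(name, bool(matcher.match(name)))
```

This is why a consistent naming scheme matters: one pattern can pull the same metric from every server in both datacenters onto a single graph.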

Page 20

This is OK

This is not OK

Page 21

TODO

• Event Automation – create automatic response tasks to known issues with common fixes

• Better connectivity to application monitoring

• Adaptive monitoring for situations like this:

Page 22

Implementation

• Left the old servers alone and running

• Spun up new servers with notifications and event handling disabled

• Migrated 600+ configs by hand

• 500+ generated automatically

• Problem states were perfect for identifying what wasn’t set up yet

• Took about 6 weeks to migrate configs to new machines

• Launch day consisted of changing Thruk’s backend config to point to the new servers and switching over notifications

• Audit, design, migration, and stable implementation took about 90 days

Page 23

Things I Learned

• Take the time to understand what you’re monitoring

• Lack of understanding will produce alert noise, which is ineffective monitoring

• In a complex system, log everything

• Small changes do the most damage

• Automation is cool except for when it automatically sends 60% of your environment into a CPU death spiral

• I wish Nagios Core allowed hostgroup exclusions in service definitions (hint, hint)

• Nagios is still the best monitoring tool out there

Page 24

Thank you!

Any Questions?
