A Deep Dive into Nagios Analytics
Alexis Lê-Quôc (@alq)
http://datadoghq.com
@alq: Dev & Ops, Nagios user since 2008, Datadog co-founder
Top 3 failed checks
That I responded to last week
That woke me up
That most of my team responded to at least once
That impact our business the most?
That I responded to 5 weeks ago
Using memory to prioritize remediation...
At best, finding local optima
At worst, Brownian motion
Performance Metrics
Nagios Traffic
Other Sources
In the “Cloud”
Nagios: a “chatty” source, out of the 40+ that Datadog supports
Almost 13,000 Nagios “events” over the past week
86 notifications!
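The gap between events and notifications can be measured straight from a Nagios log. The following is a minimal sketch, assuming the usual `[epoch] TYPE: field;field;...` log line layout; the function name `count_event_types` and the sample lines are made up for illustration and are not data from the talk.

```python
import re
from collections import Counter

# Assumed Nagios log layout: "[<epoch>] <TYPE>: <semicolon-separated fields>".
LINE_RE = re.compile(r"^\[(\d+)\] ([A-Z ]+): (.*)$")

def count_event_types(lines):
    """Count log entries per event type (e.g. 'SERVICE ALERT')."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[match.group(2)] += 1
    return counts

# Hypothetical sample lines, for illustration only.
sample = [
    "[1339000000] SERVICE ALERT: web1;HTTP;CRITICAL;SOFT;1;Connection refused",
    "[1339000060] SERVICE ALERT: web1;HTTP;CRITICAL;HARD;3;Connection refused",
    "[1339000061] SERVICE NOTIFICATION: oncall;web1;HTTP;CRITICAL;notify-by-email;Connection refused",
    "[1339000300] HOST ALERT: db1;DOWN;SOFT;1;CRITICAL - Host Unreachable",
]

counts = count_event_types(sample)
alerts = counts["SERVICE ALERT"] + counts["HOST ALERT"]
notifications = counts["SERVICE NOTIFICATION"] + counts["HOST NOTIFICATION"]
print(f"{alerts} alerts, {notifications} notifications")
```

Lines that do not follow the assumed layout (free-form status messages, for instance) are simply skipped.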
More data? More questions.
A dialog with data. Not a scientific study.
Population
25%   50%   75%   100%
20    93    322   904
Does size matter?
Weekly count per host, split by quartile
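One way to produce a per-host weekly count and assign each host to a quartile is sketched below. It assumes alerts are available as `(host, check)` pairs for a single week; the helper names and the sample counts are hypothetical, and the cut points come from Python's `statistics.quantiles`, not from the method used in the talk.

```python
from collections import Counter
from statistics import quantiles

def weekly_count_per_host(alerts):
    """alerts: iterable of (host, check) pairs seen during one week."""
    return Counter(host for host, _check in alerts)

def quartile_of(value, cuts):
    """Return 1..4: the quartile a value falls into, given three cut points."""
    return 1 + sum(value > cut for cut in cuts)

# Hypothetical per-host counts, expanded into individual alert records.
sample_counts = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 10, "f": 20, "g": 40, "h": 80}
alerts = [(host, "PING") for host, n in sample_counts.items() for _ in range(n)]

per_host = weekly_count_per_host(alerts)
cuts = quantiles(per_host.values(), n=4)  # three cut points between quartiles
by_quartile = {host: quartile_of(count, cuts) for host, count in per_host.items()}
print(by_quartile)
```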
Outliers: sick hosts, silenced checks
Notifications: 1-3% of alerts notify
Little difference per quartile
Does time of day matter?
Mean about the same across quartiles
Time-based deviation?
Does the day of week matter?
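Both time questions reduce to bucketing alert timestamps. A minimal sketch, assuming each alert carries a Unix epoch timestamp (as Nagios log lines do) and that UTC is an acceptable reference; the function name and sample timestamps are illustrative only.

```python
from collections import Counter
from datetime import datetime, timezone

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def time_histograms(timestamps):
    """Bucket epoch timestamps by hour of day and by day of week (UTC)."""
    by_hour = Counter()
    by_weekday = Counter()
    for ts in timestamps:
        moment = datetime.fromtimestamp(ts, tz=timezone.utc)
        by_hour[moment.hour] += 1
        by_weekday[WEEKDAYS[moment.weekday()]] += 1
    return by_hour, by_weekday

# Hypothetical timestamps: the epoch itself (a Thursday), one hour later,
# and exactly one day later.
by_hour, by_weekday = time_histograms([0, 3600, 86400])
print(dict(by_hour), dict(by_weekday))
```

Using the on-call team's local time zone instead of UTC would make the hour buckets line up with "woke me up" hours.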
Squeaky wheels? (checks)
Outlier in more detail
Squeaky wheels? (hosts)
Similar pattern as checks
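Finding the squeaky wheels amounts to ranking checks and hosts by alert count. A minimal sketch, again assuming `(host, check)` pairs; the function name and sample data are hypothetical.

```python
from collections import Counter

def squeaky_wheels(alerts, top=3):
    """Return the most frequent checks and hosts as (name, count) lists."""
    by_check = Counter(check for _host, check in alerts)
    by_host = Counter(host for host, _check in alerts)
    return by_check.most_common(top), by_host.most_common(top)

# Hypothetical alert records, for illustration only.
alerts = [("web1", "HTTP")] * 5 + [("web2", "HTTP")] * 2 + [("db1", "DISK")]

top_checks, top_hosts = squeaky_wheels(alerts)
print(top_checks)
print(top_hosts)
```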
Young / Old
Seldom happens / Happens often
Happen once in a while
Occur often, for a long time: tolerated
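The age and frequency axes can be turned into a simple label per check. This is only a sketch: the thresholds (30 days, one alert per day) and the function name `classify` are assumptions chosen for illustration, not values from the talk.

```python
SECONDS_PER_DAY = 86400

def classify(first_seen, now, occurrences,
             age_threshold_days=30, rate_threshold_per_day=1.0):
    """Label an alert as (young|old, seldom|often).

    first_seen and now are epoch seconds; both thresholds are hypothetical.
    """
    age_days = (now - first_seen) / SECONDS_PER_DAY
    # Avoid dividing by a tiny age for alerts first seen moments ago.
    rate = occurrences / max(age_days, 1.0)
    age = "old" if age_days >= age_threshold_days else "young"
    frequency = "often" if rate >= rate_threshold_per_day else "seldom"
    return age, frequency

# Hypothetical examples: a check firing twice a day for 60 days,
# and one that fired once in the last two days.
tolerated = classify(first_seen=0, now=60 * SECONDS_PER_DAY, occurrences=120)
recent = classify(first_seen=0, now=2 * SECONDS_PER_DAY, occurrences=1)
print(tolerated, recent)
```

Checks that come out as old and often are the ones the slide calls tolerated.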
More data? More questions.
Find out tomorrow!
Awk
Postgres
R
d3
Presentation matters
Take-aways
• Don’t rely on your memory
• Your Nagios logs are a treasure trove
• Have a dialog with your data
• Presentation matters