from zero to visibility
DESCRIPTION
Monitorama Portland 2014, Portland, OR, 2014-05-05 to 2014-05-07
When I joined a startup already in progress as their first ops hire, what monitoring existed was a twisty maze of half-measures. The devteam dreaded oncall, and our Mean Time To Lost Sleep was way too low. Improving visibility into our infrastructure and application performance required trying new tools and changing how we thought about what we were measuring. Join me for a tragicomic journey from the vale of blissful ignorance through the straits of Nagios and into the mountains of Graphite. Thrill! to the victories. Cringe! at the rewards of hubris. Share! your own insights, because this tale never really ends.
TRANSCRIPT
From Zero To Visibility
Bridget Kromhout
8thbridge.com
small social commerce startup
acquired in the last month by Fluid, Inc.
small devteam
I am the ops team
http://www.thedirtbox.com/wp-content/uploads/2013/01/ping-pongart.jpg
twisty maze of little shell scripts
http://www.pcgameshardware.de/screenshots/1280x1024/2007/07/CA01.jpg
time-consuming to understand
difficult to modify
doesn’t scale
artisanal monitoring?!
http://shop.bespokebacon.com/images/bespoke-logo.final(3).png
New Relic
pros:
nice graphs
application-level view
good error analysis
cons:
slow to update
many false-positive alerts
high prices (better now)
motivating change
http://99designs.com/illustrations/contests/illustration-pagerduty-161025/entries
as hideous as you remember
“Horrendous interface”
“Well, it’s more “old” than anything else. At least everything is in the same place as you left it because it’s been the same since 1912.”
https://laur.ie/blog/2014/02/why-ill-be-letting-nagios-live-on-a-bit-longer-thank-you-very-much/
not alone!
“Sensu has so many moving parts that I wouldn’t be able to sleep at night unless I set up a Nagios instance to make sure they were all running.”
who watches the RabbitMQ?
-- @murphy_slaw (via @lozzd)
http://images.sodahead.com/profiles/0/0/0/5/1/6/6/3/9/Watchmen-trademark-symbol-62141795529.jpeg
http://portertech.ca/images/2011-11-01/sensu-diagram.png
hating on nagios: the middle years
“hadoop does not suffer from a paucity of configuration options”
http://jaganesundar.wordpress.com/2011/12/05/installing-and-configuring-hadoop-0-20-205-using-it-rpm/
monitor all the ports?!
best way to monitor HBase:
hbck: the HBase consistency checker
nagios -> bash script -> parsing output of hbck
http://www.ymc.ch/en/how-to-monitor-hbase-health-by-nagios
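The nagios -> script -> hbck chain can be sketched in Python as a Nagios-style plugin. This is a minimal sketch under stated assumptions, not the script from the linked article: it assumes `hbase hbck` is on the PATH and that its summary contains an "N inconsistencies detected." line and a "Status: ..." line. The function names are hypothetical.

```python
import re
import subprocess
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def parse_hbck(output):
    """Map hbck output text to a (nagios_code, message) pair.

    Assumes the summary has an "N inconsistencies detected." line
    and a "Status: OK" / "Status: INCONSISTENT" line.
    """
    count = re.search(r"(\d+) inconsistencies detected", output)
    status = re.search(r"^Status:\s*(\w+)", output, re.MULTILINE)
    if status is None:
        return UNKNOWN, "UNKNOWN: no Status line in hbck output"
    if status.group(1) == "OK":
        return OK, "OK: hbck reports a consistent cluster"
    found = count.group(1) if count else "an unknown number of"
    return CRITICAL, "CRITICAL: hbck status %s (%s inconsistencies)" % (
        status.group(1), found)


def main():
    """Run hbck, print one status line, and return the exit code."""
    try:
        proc = subprocess.run(["hbase", "hbck"], capture_output=True,
                              text=True, timeout=300)
    except (OSError, subprocess.TimeoutExpired) as exc:
        print("UNKNOWN: could not run hbck: %s" % exc)
        return UNKNOWN
    code, message = parse_hbck(proc.stdout)
    print(message)
    return code

# As a plugin entry point, the script would end with: sys.exit(main())
```

Keeping the parsing in its own function means it can be tested against saved hbck output without a running cluster.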
http://modiinhub.com/wp-content/uploads/2014/02/logo-mongodb-tagline.png
“Cyber” monday: 1988 called; wants its word back.
wow. such nosql. very webscale.
“a single write operation holds the lock exclusively, and no other read or write operations may share the lock.”
“If it moves, we track it. Sometimes we’ll draw a graph of something that isn’t moving yet, just in case it decides to make a run for it.” Ian Malpass, Etsy
http://codeascraft.com/2011/02/15/measure-anything-measure-everything/
the (former) state of our graphite & statsd
● Graphite 0.9.9
○ hand-rolled
○ over 2 years old
○ missing new features (Consolidate by!)
● StatsD was newish, but…
○ hand-rolled
○ running in a screen session
○ on a special snowflake box
http://media-cache-ec0.pinimg.com/736x/68/c2/9d/68c29deb72bad94cd4e3c1aa0f3cdcd8.jpg
this is wrong tool. never use this.
Community cookbooks?
● StatsD
○ https://github.com/librato/statsd-cookbook
● Graphite ones good, but…
○ focus on Apache (we use nginx)
○ we haven’t moved to Chef 11 (gasp!)
when in doubt: tcpdump is your friend
http://blog.johngoulah.com/2012/10/looking-under-the-covers-of-statsd/
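What tcpdump shows on the StatsD port is plain text over UDP. A minimal sketch of that wire format, assuming a StatsD daemon listening on its default UDP port 8125; the metric names and helper functions here are hypothetical.

```python
import socket


def format_metric(name, value, metric_type):
    """Build one StatsD line: <name>:<value>|<type>.

    Common types: "c" (counter), "ms" (timer), "g" (gauge).
    """
    return "%s:%s|%s" % (name, value, metric_type)


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget one metric line over UDP (no reply is expected)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()


# Hypothetical metric names, for illustration only.
counter = format_metric("checkout.errors", 1, "c")
timer = format_metric("checkout.render_time", 320, "ms")
print(counter)
print(timer)
```

Calling `send_metric(counter)` sends the line to the local daemon. Because UDP is connectionless, the call succeeds even when nothing is listening, which is one reason a packet capture is a quick way to confirm metrics are actually arriving.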
carbon-aggravator (between 0.9.10 & 0.9.12)
# If set true, metric received will be forwarded to
# DESTINATIONS in addition to
# the output of the aggregation rules. If set false
# the carbon-aggregator will
# only ever send the output of aggregation.
FORWARD_ALL = True
carbonate: A+++ would clone again
whisper-fill.py
backfill datapoints between whisper files
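The backfill idea can be sketched on plain Python lists. This illustrates the concept only and is not the whisper-fill.py code or the whisper API: it assumes both series are aligned on the same timestamps and that a missing datapoint is represented as None.

```python
def backfill(src, dst):
    """Return dst with its gaps (None) filled from src.

    src and dst are lists of (timestamp, value) pairs covering the
    same timestamps. Existing dst values are never overwritten.
    """
    src_values = dict(src)
    filled = []
    for timestamp, value in dst:
        if value is None:
            value = src_values.get(timestamp)
        filled.append((timestamp, value))
    return filled


# Hypothetical data: an old node with full history, a new node with gaps.
old_node = [(60, 1.0), (120, 2.0), (180, 3.0)]
new_node = [(60, None), (120, 2.5), (180, None)]
print(backfill(old_node, new_node))
# [(60, 1.0), (120, 2.5), (180, 3.0)]
```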
life as a third wheel party
thresholds: because not every outage is abrupt
normal traffic
decision to turn off
decision to turn back on
accidental removal
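A threshold check for a slow decline, as opposed to an abrupt outage, can be sketched as follows. This is a minimal sketch under assumptions: the floors, the window contents, and the choice to treat an empty window as critical are all hypothetical, not values from the talk.

```python
OK, WARNING, CRITICAL = 0, 1, 2


def check_floor(datapoints, warn_below, crit_below):
    """Compare the mean of recent datapoints against two floors.

    None entries (gaps in the series) are ignored. A window with no
    data at all is treated as CRITICAL (an assumption of this sketch).
    """
    values = [v for v in datapoints if v is not None]
    if not values:
        return CRITICAL
    mean = sum(values) / len(values)
    if mean < crit_below:
        return CRITICAL
    if mean < warn_below:
        return WARNING
    return OK


# Hypothetical requests-per-minute windows.
print(check_floor([950, 1010, 990], warn_below=800, crit_below=400))       # 0
print(check_floor([700, 650, None, 600], warn_below=800, crit_below=400))  # 1
```

Averaging over a window rather than alerting on a single datapoint is what lets a gradual drop in traffic trip the warning before it becomes an outage.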
open-source error reporting
all the things
StatsD
Application-level error analysis
Alarms for autoscaling
Timers & counters
Log & host-level
Hadoop & HBase visualization
MongoDB Graphs
Time-series data graphing
client-side plugins
Threshold-based alarms
Dashboard
external checks
What’s next?
http://blog.xebia.fr/wp-content/uploads/2013/12/file-logstash-es-kibana.png
what even is ideal monitoring solution
http://www.quickmeme.com/img/f5/f512ff9bee084263df5571d3c81388019dcb063173e1dbcfa2babac9274576b6.jpg
❏ finds real problems
❏ actionable alerting
❏ usable by all
❏ …?
questions; comments; whatnot
Twitter: @bridgetkromhout
Email: [email protected]
In person: DevOps Days Minneapolis (devopsdays.org)