Webinar: Five MMS Monitoring Alerts to Keep Your MongoDB Deployment on Track
DESCRIPTION
MongoDB Management Service (MMS) is a cloud-based suite of services for managing MongoDB deployments, providing both monitoring and backup capabilities. In this webinar we'll outline five alerts you should set up in MMS to keep your MongoDB deployment on track. We’ll explore what each alert means for a MongoDB instance, as well as how to calibrate the alert triggers to be relevant to your environment.
TRANSCRIPT
Five MMS Monitoring Alerts to Keep Your MongoDB Deployment on Track
Angshuman Bagchi
Technical Services Engineer
Agenda
• What is MMS Monitoring?
• What are Alerts?
• How to pick an Alert?
• Five recommended Alerts
• Wrap up
What is MMS Monitoring?
Who uses MMS?
What are MMS alerts?
Source:http://www.cleanfunnypics.com/no-its-not-empty/#axzz2pqknJJbC
How to pick an Alert?
• Is there an absolute limit to alert on?
• What is normal (baseline)?
• What is worrying (warning)?
• What is a definite problem (critical)?
• Likelihood of false positives?
... there is no magic formula
Five recommended alerts
• Host Recovering (All, but by definition Secondary)
• Replication Lag (Secondary)
• Connections (All mongos, mongod)
• Lock % (Primary, Secondary)
• Replica (Primary, Secondary)
Host Recovering
• General alert triggered if any instance enters RECOVERING mode
• Required for all use cases
• All Replica Sets should have this
• Sometimes, during maintenance, this may be expected
Host Recovering
Replication Lag
• No secondary should be behind
• Secondary reads affected
• All Replica Sets should have this
• Only exception is configured slaveDelay
Replication Lag
Absolute Limit? Yes, about 1 or 2 s. To prevent false positives, alert on an absolute threshold of > 240 s
Normal: Lag is ideally 0 s
Worrying: < 60 s, some false positives
Critical: > 240 s
False positives: Above 240 s, likelihood is low
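The thresholds above can be sketched as a small classifier. This is only an illustration, not how MMS evaluates alerts: the function name and parameters are hypothetical, the slide's "Worrying < 60s" band is ambiguous and is assumed here to mean a warning once lag exceeds 60 s, and subtracting a configured slaveDelay from the observed lag is likewise an assumption about how to handle that exception.

```python
def classify_replication_lag(lag_seconds, slave_delay_seconds=0.0,
                             warning_seconds=60.0, critical_seconds=240.0):
    """Return 'ok', 'warning', or 'critical' for one secondary (hypothetical helper)."""
    # Assumption: an intentionally delayed member is judged on the lag
    # beyond its configured slaveDelay, not on the raw lag.
    effective_lag = max(0.0, lag_seconds - slave_delay_seconds)
    if effective_lag > critical_seconds:
        return "critical"
    if effective_lag > warning_seconds:
        return "warning"
    return "ok"


if __name__ == "__main__":
    for lag in (0, 100, 150_000):
        print(lag, classify_replication_lag(lag))
```

Keeping the two thresholds as parameters makes it easy to tighten them once a deployment's normal lag is known.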
Example: replication lag
150,000s of lag ~ almost 2 days of lag!
Example: replication lag
• Secondaries under-specified relative to primaries
• Different access patterns between primary and secondaries
• Insufficient bandwidth
• Foreground index builds on secondaries
“…when you have eliminated the impossible, whatever remains, however improbable, must be the truth…” -- Sherlock Holmes
Sir Arthur Conan Doyle, The Sign of the Four
Example: replication lag
Example:
• ~1500 ops per minute (opcounters)
• 0.1 MB per object (average object size, local db)
~1500 ops/min / 60 s/min * 0.1 MB/op * 8 bits/byte =~ 20 Mbps required bandwidth
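The same arithmetic as a short sketch, so the estimate can be repeated with a deployment's own numbers. The function name is hypothetical, and the calculation is the slide's rough approximation (every operation assumed to replicate one average-sized object), not a precise model of oplog traffic.

```python
def required_bandwidth_mbps(ops_per_minute, avg_object_mb):
    """Rough replication bandwidth estimate in megabits per second."""
    ops_per_second = ops_per_minute / 60           # ops/min -> ops/s
    megabytes_per_second = ops_per_second * avg_object_mb
    return megabytes_per_second * 8                # megabytes -> megabits


if __name__ == "__main__":
    # Figures from the slide: ~1500 ops/min, 0.1 MB average object size.
    print(round(required_bandwidth_mbps(1500, 0.1), 1), "Mbps")
```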
Connections
• Each connection consumes ~1 MB and a file descriptor
• 5000 connections => 5 GB of RAM
• Stability and predictability are key
Pro-Tip: know thyself
You have to recognize normal to know when it isn’t.
Source: http://www.flickr.com/photos/skippy/6853920/
Connections
Absolute Limit? Yes, but this is too high. We need to alert before that
Normal: TBD based on deployment, number of nodes, connection pool settings, app servers, load, etc. Say, X during peak load
Worrying: 50% increase, so 1.5X
Critical: Double, so 2X
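A minimal sketch of deriving the connection thresholds from a measured peak baseline X, together with the ~1 MB per connection memory estimate from the earlier slide. All names are hypothetical, and the baseline value has to come from observing your own deployment.

```python
MB_PER_CONNECTION = 1  # approximate per-connection cost quoted in the slides


def connection_thresholds(peak_baseline):
    """Warning at 1.5X and critical at 2X the normal peak connection count."""
    return {"warning": round(peak_baseline * 1.5),
            "critical": peak_baseline * 2}


def estimated_memory_gb(connections):
    """Approximate RAM used by connections, using 1 GB = 1000 MB as the slide does."""
    return connections * MB_PER_CONNECTION / 1000


def classify_connections(current, peak_baseline):
    """Return 'ok', 'warning', or 'critical' for the current connection count."""
    limits = connection_thresholds(peak_baseline)
    if current >= limits["critical"]:
        return "critical"
    if current >= limits["warning"]:
        return "warning"
    return "ok"


if __name__ == "__main__":
    baseline = 1000  # hypothetical peak-load baseline
    print(connection_thresholds(baseline))
    print(estimated_memory_gb(5000), "GB for 5000 connections")
```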
Lock %
• Lock contention degrades performance
• High lock % starves replication and reads
• Bounds need to be determined
Lock %
Absolute Limit? Yes, > 80% occasional degraded performance, 90% major impact regularly
Normal: TBD. Write-heavy loads see higher values. Normal, say X% during peak load
Worrying: Double, so approximately 2X%
Critical: TBD. For Prod > 80%
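One way to combine the relative and absolute bounds for lock %, sketched under assumptions: the warning level is twice the measured baseline, but capped at the critical level so that a write-heavy deployment with a high baseline never ends up with a warning threshold above its critical one. The cap and the function name are illustrative choices, not something the slides specify.

```python
def lock_thresholds(baseline_pct, critical_pct=80.0):
    """Return (warning_pct, critical_pct) for lock percentage alerts."""
    # Assumption: never let the warning threshold exceed the critical one.
    warning_pct = min(2 * baseline_pct, critical_pct)
    return warning_pct, critical_pct


if __name__ == "__main__":
    for baseline in (10, 25, 50):
        print(baseline, lock_thresholds(baseline))
```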
Replica
• Represents oplog window
• Depends on
– Rate of operations inserted into oplog
– Size of operations
– Size of oplog capped collection
• Rule of thumb: normal maintenance window x 3
• Resizing the oplog is non-trivial
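The three factors above give a back-of-the-envelope estimate of the oplog window, which can then be checked against the "maintenance window x 3" rule of thumb. This is a rough sketch with hypothetical function names and example numbers; real oplog entries vary in size, so the actual window observed in monitoring should take precedence.

```python
def oplog_window_hours(oplog_size_mb, ops_per_second, avg_op_size_mb):
    """Estimate how many hours of operations fit in the oplog."""
    mb_written_per_second = ops_per_second * avg_op_size_mb
    seconds_until_overwrite = oplog_size_mb / mb_written_per_second
    return seconds_until_overwrite / 3600


def meets_maintenance_rule(window_hours, maintenance_hours, factor=3):
    """True if the oplog window covers `factor` times the maintenance window."""
    return window_hours >= factor * maintenance_hours


if __name__ == "__main__":
    # Hypothetical deployment: 10 GB oplog, 25 ops/s, 0.1 MB per operation.
    window = oplog_window_hours(10 * 1024, 25, 0.1)
    print(round(window, 2), "hours")
    print(meets_maintenance_rule(window, maintenance_hours=2))
```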
Replica
Absolute Limit? 50% below Normal
Normal: TBD. Say X hours during peak
Worrying: 25% below Normal
Critical: 50% below Normal
Summary
• Use a similar approach for other metrics
• Different audiences for alerts
– Worrying alerts go to the ops team
– Critical alerts go out to a wider audience
• Get started with MMS Monitoring and alerts!
I got alerted … now what?
mms.mongodb.com