mongodb indexing constraints and creative schemas
TRANSCRIPT
My Background
• For the past year, I’ve looked at MongoDB logs at least once every day.
• We routinely answer the question “how can I improve performance?”
Thursday, June 27, 13
Who’s this talk for?
• New to MongoDB
• Seeing some slow operations, and need help debugging
• Running database operations on a sizeable deploy
• I have a MongoDB deployment, and I’ve hit a performance wall
Thursday, June 27, 13
What should you learn?Know where to look on a running MongoDBto uncover slowness, and discuss solutions.
MongoDB has performance “patterns”.
How to think about improving performance.
And . . .
Thursday, June 27, 13
Schema Design
Design with the end in mind.
Thursday, June 27, 13
MongoDB Indexing Constraints
• One index per query *
• One range operator per query ($)
• Range operator must be last field in index
• Using RAM well
* except $or, but the sin with $or is appending a sort to the query.
Thursday, June 27, 13
The Tools
• `mongostat` for MongoDB Behavior
• `tail` the logs for current options
• `iostat` for disk util
• `top -c` for CPU usage
Thursday, June 27, 13
First, a Simple One
query getmore command res faults locked db ar|aw netIn netOut conn time 129 4 7 126m 2 my_db:0.0% 3|0 27k 445k 42 15:36:54 64 4 3 126m 0 my_db:0.0% 5|0 12k 379k 42 15:36:55 65 7 8 126m 0 my_db:0.1% 3|0 15k 230k 42 15:36:56 65 3 3 126m 1 my_db:0.0% 3|0 13k 170k 42 15:36:57 66 1 6 126m 1 my_db:0.0% 0|0 14k 262k 42 15:36:58 32 8 5 126m 0 my_db:0.0% 5|0 5k 445k 42 15:36:59
a truncated mongostat
Alerted due to high CPU
Thursday, June 27, 13
log
[conn73454] query my_db.my_collection query: { $query: { publisher: "US Weekly" }, orderby: { publishAt: -1 } } ntoreturn:5 ntoskip:0 nscanned:33236 scanAndOrder:1 keyUpdates:0 numYields: 21 locks(micros) r:317266 nreturned:5 reslen:3127 178ms
Thursday, June 27, 13
Solution
{ $query: { publisher: "US Weekly" }, orderby: { publishedAt: -1 } }
db.my_collection.ensureIndex({“publisher”: 1, publishedAt: -1}, {background: true})
We are fixing this query
With this index
I would show you the logs, but now they are silent.
Thursday, June 27, 13
The Pattern
Inefficient Read Queries from in-memory table scans cause high CPU load
Caused by not matching indexes to queries.
Thursday, June 27, 13
Example 2
MongoDB Twitter-ish Feed
Customer was building a network graph of users.
Thursday, June 27, 13
Naive Method
{ creator_id: ObjectId(), status: “This is so awesome!”}
Statuses Users
{ _id: ObjectId(), friends: [array-o-friends]}
db.status.find({creator_id: {$in: [array-o-friends]}}).sort({_id: -1})
Query
Thursday, June 27, 13
Solution
{ creator_id: ObjectId(), friends_of_creator: [array-of-viewers], status: “This is so awesome!”}
Statuses Users
{ _id: ObjectId(), friends: [array-o-friends]}
db.statuses.find({friends_of_creator: ObjectId()}).sort({_id: -1})
Query
Thursday, June 27, 13
The Pattern
With graphs, query on viewable by.
What worked with minimal documents was not scaling.
Thursday, June 27, 13
Similar Issues - Messages
{ sender_id: ObjectId(), recipient_id: ObjectId(), message: “This is so awesome!”}
Naive{ sender_id: ObjectId(), recipient_id: ObjectId(), participants: [ObjectId(), ObjectId()], thread_id: ObjectId(), message: “This is so awesome!”}
Solution
db.messages.find({participants: ObjectId()}).sort({_id: -1})
Query
db.messages.find({$or: [{sender_id: ObjectId()}, {recipient_id: ObjectId()]}).sort({_id: -1})
Naive Query
Thursday, June 27, 13
Example 3
insert query update delete getmore command faults locked % idx miss % qr|qw ar|aw *0 *0 *0 *0 0 1|0 1422 0 0 0|0 50|0 *0 6 *0 *0 0 6|0 575 0 0 0|0 51|0 *0 3 *0 *0 0 1|0 1047 0 0 0|0 50|0 *0 2 *0 *0 0 3|0 1660 0 0 0|0 50|0
a truncated mongostat
Alerted on high CPU
Thursday, June 27, 13
tail
[initandlisten] connection accepted from ....[conn4229724] authenticate: { authenticate: ....[initandlisten] connection accepted from ....[conn4229725] authenticate: { authenticate: .....[conn4229717] query ..... 102ms[conn4229725] query ..... 140ms
amazingly quietThursday, June 27, 13
currentOp> db.currentOP(){ "inprog" : [ { "opid" : 66178716, "lockType" : "read", "secs_running" : 760, "op" : "query", "ns" : "my_db.my_collection", "query" : {
keywords: $in: [“keyword1”, “keyword2”],tags: $in: [“tags1”, “tags2”]
},orderby: {
“created_at”: -1},
"numYields" : 21 }
]}
Thursday, June 27, 13
Solution
> db.currentOP().inprog.filter(function(row) { return row.secs_running > 100 && row.op == "query"
}).forEach(function(row) { db.killOp(row.opid)
})
Return Stability to Database
Disable query, and refactor schema.
Thursday, June 27, 13
Refactoring
I have one word for you, “Schema”
Thursday, June 27, 13
Example 4
A map reduce has gradually runslower and slower.
Thursday, June 27, 13
Finding Offenders
Find the time of the slowest query of the day:grep '[0-9]\{3,100\}ms$' $MONGODB_LOG | awk '{print $NF}' | sort -n
Thursday, June 27, 13
Slowest Map Reducemy_db.$cmd command: {
mapreduce: "my_collection", map: function() {}, query: { $or: [
{ object.type: "this" }, { object.type: "that" } ],time: { $lt: new Date(1359025311290), $gt: new Date(1358420511290) }, object.ver: 1, origin: "tnh"
},out: "my_new_collection", reduce: function(keys, vals) { ....}
} ntoreturn:1 keyUpdates:0 numYields: 32696 locks(micros) W:143870 r:511858643 w:6279425 reslen:140 421185ms
Thursday, June 27, 13
Solution
Query is slow because it has multiple multi-value operators: $or, $gte, and $lte
Problem
Solution Change schema to use an “hour_created” attribute:
hour_created: “%Y-%m-%d %H”
Create an index on “hour_created” with followed by “$or” values. Query using the new “hour_created.”
Thursday, June 27, 13
Words of caution
2 / 4 solutions were to add an index.
New indexes as a solution scales poorly.
Thursday, June 27, 13
Sometimes . . .
It is best to do nothing, except add shards / add hardware.
Go back to the drawing board on the design.
Thursday, June 27, 13
Bad things happen to good databases?
• ORMs
• Manage your indexes and queries.
• Constraints will set you free.
Thursday, June 27, 13
Road Map for Refactoring
• Measure, measure, measure.
• Find your slowest queries and determine if they can be indexed
• Rephrase the problem you are solving by asking “How do I want to query my data?”
Thursday, June 27, 13