Webinar: General Technical Overview of MongoDB
DESCRIPTION
MongoDB is the leading open-source document database. In this webinar we'll dive into the technical details of MongoDB by first mapping it from relational concepts. Next we'll discuss an example data model and associated query functionality using commands pulled straight from the MongoDB shell. Finally, we'll delve into some of the deployment functionality provided by MongoDB, including solutions for data redundancy, node failover and auto-sharding.
TRANSCRIPT
MongoDB Technical Overview
Sandeep Parikh
Solutions Architect, 10gen
Agenda
Relational Databases
MongoDB Features
MongoDB Functionality
Scaling and Deployment
Aggregates, Statistics, Analytics
Advanced Topics
About 10gen
• Background – Founded in 2007 – First release of MongoDB in 2009 – 74M+ in funding
• MongoDB – Core server – Native drivers
• Subscriptions, Consulting, Training
• Monitoring
Relational Databases
[Diagram: example relational schema]
• User: Name, Email address
• Category: Name, URL
• Comment: Comment, Date, Author
• Article: Name, Slug, Publish date, Text
• Tag: Name, URL
RDBMS Strengths
• Data stored is very compact
• Rigid schemas have led to powerful query capabilities
• Data is optimized for joins and storage
• Robust ecosystem of tools, libraries, integrations
• 40+ years old!
Enter “Big Data”
• Gartner defines it with 3Vs
• Volume – Vast amounts of data being collected
• Variety – Evolving data – Uncontrolled formats, no single schema – Unknown at design time
• Velocity – Inbound data speed – Fast read/write operations – Low latency
Mapping Big Data to RDBMS
• Difficult to store uncontrolled data formats
• Scaling via big iron or custom data marts/partitioning schemes
• Schema must be known at design time
• Impedance mismatch with agile development and deployment techniques
• Doesn’t map well to native language constructs
MongoDB Features
Goals
• Scale horizontally over commodity systems
• Incorporate what works for RDBMSs – Rich data models, ad-hoc queries, full indexes
• Drop what doesn’t work well – Multi-row transactions, complex joins
• Do not homogenize APIs
• Match agile development and deployment workflows
Key Features
• Data stored as documents (JSON) – Flexible-schema
• Full CRUD support (Create, Read, Update, Delete) – Atomic in-place updates – Ad-hoc queries: Equality, RegEx, Ranges, Geospatial
• Secondary indexes
• Replication – redundancy, failover
• Sharding – partitioning for read/write scalability
Document Oriented, Dynamic Schema
{name: "jeff", eyes: "blue", height: 72, boss: "ben"}
{name: "brendan", aliases: ["el diablo"]}
{name: "ben", hat: "yes"}
{name: "matt", pizza: "DiGiorno", height: 72, boss: "555.555.1212"}
{name: "will", eyes: "blue", birthplace: "NY", aliases: ["bill", "la ciacco"], gender: "???", boss: "ben"}
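To make the dynamic-schema point concrete, here is a minimal plain-JavaScript sketch (no MongoDB server involved) that treats the documents above as an in-memory array. The helper names `namesWithField` and `allFields` are hypothetical and exist only for this illustration.

```javascript
// Documents from the slide; no two share exactly the same set of fields.
const people = [
  { name: "jeff", eyes: "blue", height: 72, boss: "ben" },
  { name: "brendan", aliases: ["el diablo"] },
  { name: "ben", hat: "yes" },
  { name: "matt", pizza: "DiGiorno", height: 72, boss: "555.555.1212" },
  { name: "will", eyes: "blue", birthplace: "NY", aliases: ["bill", "la ciacco"], gender: "???", boss: "ben" }
];

// Hypothetical helper: names of the documents that contain a given field.
function namesWithField(docs, field) {
  return docs.filter(doc => field in doc).map(doc => doc.name);
}

// Hypothetical helper: union of all field names seen across the documents.
function allFields(docs) {
  return [...new Set(docs.flatMap(doc => Object.keys(doc)))];
}

console.log(namesWithField(people, "aliases"));
console.log(allFields(people));
```

Each document carries only the fields it needs, so code reading the collection has to check for a field's presence rather than assume it.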
Disk seeks and data locality
Seek = 5+ ms; Read = really really fast
[Diagram: User, Comment, and Article stored in separate locations on disk]
Disk seeks and data locality
[Diagram: Article stored together with its User and five Comments]
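A rough back-of-the-envelope reading of the two diagrams, as a sketch only. It assumes (an illustrative assumption, not a measurement) that every separately stored record costs one 5 ms seek, that read time is negligible, and that the co-located layout needs a single seek.

```javascript
const SEEK_MS = 5; // "Seek = 5+ ms" from the slide, treated as a lower bound

// Assumption: one seek per separately stored record; read time is ignored.
function seekCostMs(recordsFetchedSeparately) {
  return recordsFetchedSeparately * SEEK_MS;
}

// Scattered layout: 1 article + 1 user + 5 comments, each fetched separately.
const scatteredMs = seekCostMs(1 + 1 + 5);
// Co-located layout: the article with its user and comments in one place.
const colocatedMs = seekCostMs(1);

console.log(scatteredMs, colocatedMs);
```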
MongoDB Security
• SSL – Between your app and MongoDB – Between nodes in MongoDB cluster
• Authorization at the database level – Read Only / Read + Write / Administrator
• Roadmap – 2.4: SASL, Kerberos authentication – 2.6: Pluggable authentication
Use Cases
• Content Management
• Operational Intelligence
• High Volume Data Feeds
• E-Commerce
• User Data Management
MongoDB Functionality
> var new_article = {
    author: "roger",
    date: new Date(),
    title: "My Favorite 2012 Movies",
    body: "Here are my favorite movies from 2012…",
    tags: ["horror", "action", "independent"]
  }
> db.articles.save(new_article)
Documents
> db.articles.find()
{
  _id: ObjectId("4c4ba5c0672c685e5e8aabf3"),
  author: "roger",
  date: ISODate("2013-01-08T22:10:19.880Z"),
  title: "My Favorite 2012 Movies",
  body: "Here are my favorite movies from 2012…",
  tags: ["horror", "action", "independent"]
}
// _id is unique but can be anything you like
Querying
// create an ascending index on "author"
> db.articles.ensureIndex({author: 1})
> db.articles.find({author: "roger"})
{
  _id: ObjectId("4c4ba5c0672c685e5e8aabf3"),
  author: "roger",
  …
}
Indexes
// Query Operators:
// $all, $exists, $mod, $ne, $in, $nin, $nor, $or,
// $size, $type, $lt, $lte, $gt, $gte
// find articles with any tags
> db.articles.find({tags: {$exists: true}})
// find articles whose author starts with "rog" (case-insensitive regular expression)
> db.articles.find({author: /^rog/i})
// count articles by author
> db.articles.find({author: "roger"}).count()
Ad-Hoc Queries
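To illustrate what a query document asks for, here is a minimal in-memory sketch in plain JavaScript. It is not how the server evaluates queries. The `matches` and `find` helpers are hypothetical, handle only top-level fields, and support just equality, regular expressions, `$exists`, `$gt`, `$lt` and `$in`. The `votes` field and the second and third sample documents are made up for the example.

```javascript
// Hypothetical helper: does a document satisfy a (very small subset of) query?
function matches(doc, query) {
  return Object.entries(query).every(([field, cond]) => {
    const value = doc[field];
    if (cond instanceof RegExp) {
      return typeof value === "string" && cond.test(value);
    }
    if (cond !== null && typeof cond === "object" && !Array.isArray(cond)) {
      // Operator object such as {$exists: true} or {$gt: 2}
      return Object.entries(cond).every(([op, arg]) => {
        switch (op) {
          case "$exists": return (field in doc) === arg;
          case "$gt": return value > arg;
          case "$lt": return value < arg;
          case "$in": return arg.includes(value);
          default: throw new Error("unsupported operator: " + op);
        }
      });
    }
    return value === cond; // plain equality
  });
}

const sampleArticles = [
  { author: "roger", title: "My Favorite 2012 Movies", tags: ["horror", "action", "independent"], votes: 3 },
  { author: "Rogelio", title: "Untagged note", votes: 7 },
  { author: "fred", title: "Reply", tags: [], votes: 1 }
];

// Hypothetical stand-in for db.articles.find(query) over the in-memory array.
const find = query => sampleArticles.filter(doc => matches(doc, query));

console.log(find({ tags: { $exists: true } }).length);
console.log(find({ author: /^rog/i }).map(doc => doc.author));
console.log(find({ author: "roger" }).length);
```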
// Update Modifiers
// $set, $unset, $inc, $push, $pushAll, $pull,
// $pullAll, $bit
> comment = {
    author: "Fred",
    date: new Date(),
    text: "Best list ever!"
  }
> db.articles.update({ _id: "..." }, {
    $push: {comments: comment}
  });
Atomic Updates
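As a rough illustration of what a few of the update modifiers do to a single document, here is a plain-JavaScript sketch. It says nothing about how the server achieves atomicity. The `applyUpdate` helper is hypothetical, covers only `$set`, `$unset`, `$inc` and `$push` on top-level fields, and the `views` field is invented for the example.

```javascript
// Hypothetical helper: apply a small subset of update modifiers in place.
function applyUpdate(doc, update) {
  for (const [op, fields] of Object.entries(update)) {
    for (const [field, arg] of Object.entries(fields)) {
      switch (op) {
        case "$set":
          doc[field] = arg;
          break;
        case "$unset":
          delete doc[field];
          break;
        case "$inc":
          // A missing field is treated as 0 before incrementing.
          doc[field] = (doc[field] || 0) + arg;
          break;
        case "$push":
          // A missing field becomes a new array; a non-array field is an error.
          if (doc[field] === undefined) doc[field] = [];
          if (!Array.isArray(doc[field])) throw new Error(field + " is not an array");
          doc[field].push(arg);
          break;
        default:
          throw new Error("unsupported modifier: " + op);
      }
    }
  }
  return doc;
}

const article = { author: "roger", title: "My Favorite 2012 Movies" };
applyUpdate(article, { $push: { comments: { author: "Fred", text: "Best list ever!" } } });
applyUpdate(article, { $inc: { views: 1 }, $unset: { title: "" } });
console.log(article);
```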
{
  _id: ObjectId("4c4ba5c0672c685e5e8aabf3"),
  author: "roger",
  date: ISODate("2013-01-08T22:10:19.880Z"),
  title: "My Favorite 2012 Movies",
  body: "Here are my favorite movies from 2012…",
  tags: ["horror", "action", "independent"],
  comments: [
    { author: "Fred",
      date: ISODate("2013-01-08T23:44:15.458Z"),
      text: "Best list ever!" }
  ]
}
Nested Documents
// Index nested documents
> db.articles.ensureIndex({"comments.author": 1})
> db.articles.find({"comments.author": "Fred"})
// Index on tags
> db.articles.ensureIndex({tags: 1})
> db.articles.find({tags: "Manga"})
// Geospatial indexes
> db.articles.ensureIndex({location: "2d"})
> db.articles.find({location: {$near: [22, 42]}})
Secondary Indexes
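An index on an array field such as `tags` holds one entry per array element (the MongoDB documentation calls this a multikey index). The plain-JavaScript sketch below mimics that idea with a `Map`. It is a conceptual model only, not the server's B-tree. The `buildArrayIndex` helper, the numeric `_id` values, and the sample documents are hypothetical.

```javascript
// Hypothetical helper: map each distinct array element to the _ids of the
// documents that contain it (one index entry per element per document).
function buildArrayIndex(docs, field) {
  const index = new Map();
  for (const doc of docs) {
    for (const value of new Set(doc[field] || [])) {
      if (!index.has(value)) index.set(value, []);
      index.get(value).push(doc._id);
    }
  }
  return index;
}

const taggedArticles = [
  { _id: 1, tags: ["horror", "action", "independent"] },
  { _id: 2, tags: ["Manga", "action"] },
  { _id: 3, tags: ["Manga"] },
  { _id: 4 } // no tags field at all
];

const tagIndex = buildArrayIndex(taggedArticles, "tags");
console.log(tagIndex.get("Manga"));
console.log(tagIndex.get("action"));
```

A lookup by tag then touches only the matching entries instead of scanning every document.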
Scaling MongoDB
• Replica Sets – Redundancy, failover, read scalability
• Sharding – Auto-partitions data, read/write scalability
• Multi-datacenter deployments
• Tunable consistency
• Engineering for zero downtime
Replica Sets
[Diagram: Client Application with Driver, one Primary and two Secondaries; labels: Write, Read]
Replica Set – Initialize
• Node 1: Secondary
• Node 2: Secondary
• Node 3: Primary
[Diagram labels: Replication, Heartbeat]
Replica Set – Failure
• Node 1: Secondary
• Node 2: Secondary
• Node 3: no role shown
[Diagram labels: Heartbeat, Primary Election]
Replica Set – Failover
• Node 1: Secondary
• Node 2: Primary
• Node 3: no role shown
[Diagram labels: Replication, Heartbeat]
Replica Set – Recovery
• Node 1: Secondary
• Node 2: Primary
• Node 3: Recovery
[Diagram labels: Replication, Heartbeat]
Replica Set – Recovered
• Node 1: Secondary
• Node 2: Primary
• Node 3: Secondary
[Diagram labels: Replication, Heartbeat]
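The failure and failover slides can be read as a small election. The sketch below is a deliberately simplified model in plain JavaScript, not MongoDB's actual election protocol. It assumes only that a strict majority of members must be reachable and that the most up-to-date reachable member wins. The `up` and `lastOpTime` fields and the `electPrimary` helper are hypothetical.

```javascript
// Hypothetical helper: pick a primary among reachable members, or null if
// fewer than a strict majority of all members are reachable.
function electPrimary(members) {
  const reachable = members.filter(m => m.up);
  if (reachable.length * 2 <= members.length) return null;
  // Prefer the member with the most recent replicated operation.
  return reachable.reduce((best, m) => (m.lastOpTime > best.lastOpTime ? m : best)).name;
}

// Node 3 (the old primary) has failed; Node 2 is the most up to date.
const afterFailure = [
  { name: "Node 1", up: true, lastOpTime: 100 },
  { name: "Node 2", up: true, lastOpTime: 105 },
  { name: "Node 3", up: false, lastOpTime: 105 }
];

// Only one of three members is reachable: no majority, so no primary.
const noMajority = [
  { name: "Node 1", up: true, lastOpTime: 100 },
  { name: "Node 2", up: false, lastOpTime: 105 },
  { name: "Node 3", up: false, lastOpTime: 105 }
];

console.log(electPrimary(afterFailure));
console.log(electPrimary(noMajority));
```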
Scaling Reads
[Diagram: Client Application with Driver, one Primary and two Secondaries; labels: Write, Read, Read]
Sharding
[Diagram: three App Servers, each with a Mongos; three Config Servers; three Shards]
Data stored in shard
• Shard is a node of the cluster
• For production deployments a shard is a replica set
[Diagram: a Shard is either a replica set (Primary, Secondary, Secondary) or a single Mongod]
Config server stores metadata
• Config Server – Stores cluster chunk ranges and locations – Production deployments need 3 nodes – Two phase commit (not a replica set)
[Diagram: one Config Server or three Config Servers]
Mongos manages the data
• Mongos – Acts as a router / balancer – No local data (persists to config database) – Can have 1 or many
[Diagram: App Servers connecting through one Mongos or through many Mongos instances]
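The routing role described above can be sketched as a lookup against the chunk ranges held in the config metadata. This is a conceptual plain-JavaScript model, not mongos internals. The chunk table, the shard names, the numeric shard key, and both helpers are hypothetical. Each range includes its lower bound and excludes its upper bound.

```javascript
// Hypothetical chunk metadata: [min, max) ranges of the shard key per shard.
const chunks = [
  { min: -Infinity, max: 1000, shard: "shard0" },
  { min: 1000, max: 5000, shard: "shard1" },
  { min: 5000, max: Infinity, shard: "shard2" }
];

// Route a single shard-key value to the shard owning its chunk.
function routeByShardKey(chunkTable, key) {
  const chunk = chunkTable.find(c => key >= c.min && key < c.max);
  if (!chunk) throw new Error("no chunk covers key " + key);
  return chunk.shard;
}

// Shards that must be contacted for a range query over [low, high).
function shardsForRange(chunkTable, low, high) {
  const hits = chunkTable.filter(c => c.min < high && c.max > low);
  return [...new Set(hits.map(c => c.shard))];
}

console.log(routeByShardKey(chunks, 999));
console.log(routeByShardKey(chunks, 1000));
console.log(shardsForRange(chunks, 900, 1100));
```

A query that includes the shard key can be sent to one shard or a few; a query without it would have to go to every shard.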
Aggregates, Statistics, Analytics
Analyzing Data in MongoDB
• Custom application code – Run your queries, compute your results
• Aggregation framework – Declarative, pipeline-based approach
• Native Map/Reduce in MongoDB – JavaScript functions distributed across cluster
• Hadoop – Offline batch processing/computation
db.article.aggregate(
  { $project: {
      author: 1,
      tags: 1
  }},
  { $unwind: "$tags" },
  { $group: {
      _id: "$tags",
      authors: {
        $addToSet: "$author"
      }
  }}
);
Aggregation Framework
{
  title: "this is my title",
  author: "bob",
  posted: new Date(),
  tags: ["fun", "good", "fun"],
  comments: [
    { author: "joe",
      text: "this is cool" },
    { author: "sam",
      text: "this is bad" }
  ],
  other: { foo: 5 }
}
// Operations: $project, $match, $limit, $skip, $unwind, $group, $sort
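To show what the `$unwind` and `$group` with `$addToSet` stages of the pipeline compute, here is an in-memory approximation in plain JavaScript. It is a sketch of the semantics, not of the server's implementation. The `unwind` and `groupAddToSet` helpers and the second sample document are hypothetical, and the output order here simply follows insertion order.

```javascript
const pipelineInput = [
  { title: "this is my title", author: "bob", tags: ["fun", "good", "fun"] },
  { title: "a second title", author: "sam", tags: ["fun"] } // hypothetical extra document
];

// $unwind: one output document per element of the array field.
function unwind(docs, field) {
  return docs.flatMap(doc => doc[field].map(value => ({ ...doc, [field]: value })));
}

// $group with $addToSet: distinct values of valueField for each keyField value.
function groupAddToSet(docs, keyField, valueField, outField) {
  const groups = new Map();
  for (const doc of docs) {
    if (!groups.has(doc[keyField])) groups.set(doc[keyField], new Set());
    groups.get(doc[keyField]).add(doc[valueField]);
  }
  return [...groups].map(([key, values]) => ({ _id: key, [outField]: [...values] }));
}

const authorsByTag = groupAddToSet(unwind(pipelineInput, "tags"), "tags", "author", "authors");
console.log(JSON.stringify(authorsByTag));
```

The duplicate "fun" tag on the first document produces two unwound documents, but `$addToSet` keeps "bob" only once per tag.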
Mapping SQL to Aggregation
Each SQL statement is followed by the equivalent MongoDB command.
SQL: SELECT COUNT(*) FROM users
MongoDB: db.users.aggregate([ { $group: {_id: null, count: {$sum: 1}} } ])
SQL: SELECT SUM(price) FROM orders
MongoDB: db.orders.aggregate([ { $group: {_id: null, total: {$sum: "$price"}} } ])
SQL: SELECT cust_id, SUM(price) FROM orders GROUP BY cust_id
MongoDB: db.orders.aggregate([ { $group: {_id: "$cust_id", total: {$sum: "$price"}} } ])
SQL: SELECT cust_id, SUM(price) FROM orders WHERE active=true GROUP BY cust_id
MongoDB: db.orders.aggregate([ { $match: {active: true} }, { $group: {_id: "$cust_id", total: {$sum: "$price"}} } ])
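The last mapping (filter, then group and sum) can be checked by hand with a small in-memory equivalent. This is a plain-JavaScript sketch under made-up data: the `orders` array and the `sumBy` helper are hypothetical and stand in for the collection and the `$group` stage.

```javascript
// Hypothetical sample data standing in for the orders collection.
const orders = [
  { cust_id: "A", price: 10, active: true },
  { cust_id: "A", price: 5, active: false },
  { cust_id: "B", price: 7, active: true },
  { cust_id: "A", price: 3, active: true }
];

// Equivalent of { $group: { _id: "$<keyField>", total: { $sum: "$<valueField>" } } }
function sumBy(docs, keyField, valueField) {
  const totals = new Map();
  for (const doc of docs) {
    totals.set(doc[keyField], (totals.get(doc[keyField]) || 0) + doc[valueField]);
  }
  return [...totals].map(([key, total]) => ({ _id: key, total }));
}

// Equivalent of { $match: { active: true } } followed by the $group stage.
const activeTotals = sumBy(orders.filter(order => order.active === true), "cust_id", "price");
console.log(JSON.stringify(activeTotals));
```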
Native Map/Reduce
• More complex aggregation tasks
• Map and Reduce functions written in JS
• Can be distributed across sharded cluster for increased parallelism
var map = function() {
emit(this.author, {votes: this.votes});
};
var reduce = function(key, values) {
var sum = 0;
values.forEach(function(doc) {
sum += doc.votes;
});
return {votes: sum};
};
Map/Reduce Functions
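To see what the two functions above produce, they can be driven by a tiny in-memory harness in plain JavaScript. This is a single-process sketch, not the server's map/reduce engine. The `emit` collector, the `runMapReduce` driver, and the sample documents are hypothetical; `map` and `reduce` are copied from the slide. As in MongoDB, `reduce` is skipped for a key that has only one emitted value.

```javascript
let emitted; // key -> array of emitted values, reset on every run

// Hypothetical stand-in for the emit() that MongoDB provides to map functions.
function emit(key, value) {
  if (!emitted.has(key)) emitted.set(key, []);
  emitted.get(key).push(value);
}

var map = function() {
  emit(this.author, {votes: this.votes});
};

var reduce = function(key, values) {
  var sum = 0;
  values.forEach(function(doc) {
    sum += doc.votes;
  });
  return {votes: sum};
};

// Hypothetical driver: call map with each document as `this`, then reduce per key.
function runMapReduce(docs, mapFn, reduceFn) {
  emitted = new Map();
  docs.forEach(doc => mapFn.call(doc));
  const results = [];
  for (const [key, values] of emitted) {
    results.push({ _id: key, value: values.length > 1 ? reduceFn(key, values) : values[0] });
  }
  return results;
}

const votedArticles = [
  { author: "roger", votes: 3 },
  { author: "fred", votes: 2 },
  { author: "roger", votes: 4 }
];

const votesByAuthor = runMapReduce(votedArticles, map, reduce);
console.log(JSON.stringify(votesByAuthor));
```

Note that `reduce` returns the same shape that `map` emits, which is what allows partial results to be reduced again.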
Hadoop and MongoDB
• MongoDB-Hadoop adapter
• 1.0 released, 1.1 in development
• Supports Hadoop – Map/Reduce, Streaming, Pig
• MongoDB as input/output storage for Hadoop jobs – No need to go through HDFS
• Leverage power of Hadoop ecosystem against operational data in MongoDB
MongoDB Resources
• Presentations, Webinars – www.10gen.com/presentations
• MongoDB documentation – docs.mongodb.org
• Community – groups.google.com/group/mongodb-user – stackoverflow.com/questions/tagged/mongodb
Questions