Real-Time Analytics with MongoDB
Posted on 15-Jan-2015
©Yottaa Confidential. Do Not Distribute.
a better internet experience
Scaling Rails @ Yottaa
September 20th 2010
Jared Rosoff (@forjared)
jrosoff@yottaa.com
From zero to humongous
• About our application
• How we chose MongoDB
• How we use MongoDB
About our application
• We collect lots of data
  – 6,000+ URLs
  – 300 samples per URL per day
  – Some samples are >1 MB (Firebug)
  – Missing a sample isn’t a big deal
• We visualize data in real time
  – No delay when showing data
  – “On-demand” samples
  – The “check now” button
The Yottaa Network
How we chose MongoDB
Requirements
• Our data set is going to grow very quickly
  – Scalable by default
• We have a very small team
  – Focus on the application, not infrastructure
• We are a startup
  – Requirements change hourly
• Operations
  – We’re 100% in the cloud
Rails default architecture

Data Source → Collection Server → MySQL → Reporting Server → User
“Just” a Rails app

Performance bottleneck: too much load on a single database
Let’s add replication!

Data Source → Collection Server → MySQL master → replicated slaves → Reporting Server → User

Off the shelf! Scalable reads!
Performance bottleneck: still can’t scale writes
What about sharding?

Data Source → Collection Server → sharding layer → multiple MySQL masters → sharding layer → Reporting Server → User

Scalable writes!
Development bottleneck: need to write custom sharding code
Key-value stores to the rescue?

Data Source → Collection Server → Cassandra or Voldemort → Reporting Server → User

Scalable writes!
Development bottleneck: reporting is limited / hard
Can I Hadoop my way out of this?

Data Source → Collection Server → Cassandra or Voldemort → Hadoop → MySQL master/slaves → Reporting Server → User
“Just” a Rails app

Scalable writes! Flexible reports!
Development bottleneck: too many systems!
MongoDB!

Data Source → Collection Server → MongoDB → Reporting Server → User

Scalable writes! Flexible reporting! “Just” a Rails app
[Architecture diagram: Data Source (collection) and User (reporting) traffic → Load Balancer → App Servers running Nginx + Passenger → mongos → three mongod shards]

Sharding! High concurrency, scale-out
Sharding is critical
• Distribute write load across servers
• Decentralize data storage
Scale out!
Before Sharding
App Servers → a single database

Need higher write volume? Buy a bigger database.
Need more storage volume? Buy a bigger database.
After Sharding
App Servers → many shards

Need higher write volume? Add more servers.
Need more storage volume? Add more servers.
Scale out is the new scale up
How we’re using MongoDB
Our Data Model
• One document per URL we track
  – Metadata
  – Summary data
  – Most recent measurements
• One document per URL per day
  – Detailed metrics
  – Pre-aggregated data
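As a rough sketch, the two document shapes above might look like this (the exact field names are assumptions, not Yottaa’s actual schema):

```javascript
// One document per tracked URL: metadata, summary data, most recent measurements.
const urlDoc = {
  url: 'www.google.com',
  summary: { connect: { sum: 2312, count: 12 } },  // running totals, so averages need no scan
  recent: [
    { location: 'SFO', connect: 23, first_byte: 123, last_byte: 245, timestamp: 1234 }
  ]
};

// One document per URL per day: detailed, pre-aggregated metrics.
const dailyDoc = {
  url: 'www.google.com',
  day: '9/20/2010',
  connect: { sum: 2312, count: 12, sfo: { sum: 1200, count: 5 } }
};

console.log(urlDoc.recent[0].connect, dailyDoc.connect.count);
```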
Thinking in rows
Each sample is one row: URL | Location | Connect | First Byte | Last Byte | Timestamp

{ url: 'www.google.com', location: 'SFO', connect: 23, first_byte: 123, last_byte: 245, timestamp: 1234 }
{ url: 'www.google.com', location: 'NYC', connect: 23, first_byte: 123, last_byte: 245, timestamp: 2345 }
Thinking in rows
Columns: URL | Location | Connect | First Byte | Last Byte | Timestamp

What was the average connect time for Google on Friday?
From SFO? From NYC? Between 1 AM and 2 AM?
Thinking in rows
To chart an average, the query has to scan every row and AVG them per day (Day 1, Day 2, Day 3 …) to build the result.

Up to 100s of samples per URL per day!
A 30-day average query range meant an “average” chart had to hit 600 rows.
Thinking in Documents
This document contains all data for www.google.com collected during 9/20/2010:

{ url: 'www.google.com',
  day: '9/20/2010',
  last_byte: {
    sum: 2312, count: 12,          // sum / count gives the average for this URL / time period
    sfo: { sum: 1200, count: 5 },  // average value from SFO
    nyc: { sum: 1112, count: 7 }   // average value from NYC
  } }
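Since each metric stores a sum and a count, any average falls out of one division; a minimal sketch, assuming the document shape above:

```javascript
// Compute an average from a pre-aggregated { sum, count } pair.
function average(metric) {
  return metric.count === 0 ? null : metric.sum / metric.count;
}

// Values from the slide's example document.
const lastByte = {
  sum: 2312, count: 12,
  sfo: { sum: 1200, count: 5 },
  nyc: { sum: 1112, count: 7 }
};

console.log(average(lastByte.sfo));  // 240, the average last-byte time from SFO
console.log(average(lastByte));      // overall average for the day
```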
Storing a sample
A single atomic, upserted update stores each sample:

db.metrics.dailies.update(
  // which document we're updating
  { url: 'www.google.com', day: '9/20/2010' },
  // atomically update the document: the aggregate value and the location-specific value
  { '$inc': { 'connect.sum': 1234,
              'connect.count': 1,
              'connect.sfo.sum': 1234,
              'connect.sfo.count': 1 } },
  // create the document if it doesn't already exist
  { upsert: true } );
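A collection server can build that upsert mechanically from an incoming sample; a sketch with a hypothetical helper name, not Yottaa’s actual code:

```javascript
// Turn one sample into the { filter, update } pair for the daily upsert.
function dailyUpsert(sample) {
  const loc = sample.location.toLowerCase();
  return {
    filter: { url: sample.url, day: sample.day },
    update: {
      $inc: {
        'connect.sum': sample.connect,                // aggregate value
        'connect.count': 1,
        ['connect.' + loc + '.sum']: sample.connect,  // location-specific value
        ['connect.' + loc + '.count']: 1
      }
    }
  };
}

const { filter, update } = dailyUpsert({
  url: 'www.google.com', day: '9/20/2010', location: 'SFO', connect: 1234
});
// Would be sent as: db.metrics.dailies.update(filter, update, { upsert: true });
console.log(filter, update);
```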
Putting it together
{ url: 'www.google.com', location: 'SFO', connect: 23, first_byte: 123, last_byte: 245, timestamp: 1234 }

For each incoming sample:
1. Atomically update the daily data
2. Atomically update the weekly data
3. Atomically update the monthly data
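The three updates differ only in which time bucket they target; one way to derive the bucket keys from a sample’s Unix timestamp (the key formats here are assumptions):

```javascript
// Map a Unix timestamp (seconds) to daily / weekly / monthly bucket keys,
// so the same $inc can be applied to all three pre-aggregated documents.
function periodKeys(ts) {
  const d = new Date(ts * 1000);
  const day = d.toISOString().slice(0, 10);              // e.g. '2010-09-20'
  const month = day.slice(0, 7);                         // e.g. '2010-09'
  const weekStart = new Date(d);
  weekStart.setUTCDate(d.getUTCDate() - d.getUTCDay());  // back up to Sunday
  return { day, week: weekStart.toISOString().slice(0, 10), month };
}

console.log(periodKeys(1284940800));  // 2010-09-20 00:00:00 UTC, a Monday
```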
Drawing the connect-time graph

// Compound index to make this query fast
db.metrics.dailies.ensureIndex({ url: 1, day: -1 })

db.metrics.dailies.find(
  // data for Google, over the range of dates for the chart
  { url: 'www.google.com',
    day: { '$gte': '9/1/2010', '$lte': '9/20/2010' } },
  // we just want connect time data
  { 'connect': true });
More efficient charts
One document per URL per day: URL | Day | <data>
The chart AVGs one document per day (Day 1, Day 2, Day 3 …) to build the result.

30 days == 30 documents
The average chart hits 30 documents: 20x fewer than before.
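Rendering the chart is then a straight map over the returned documents; a sketch under the same assumed document shape:

```javascript
// One chart point per daily document: (day, average connect time).
function connectChart(dailyDocs) {
  return dailyDocs.map(doc => ({
    day: doc.day,
    avg: doc.connect.sum / doc.connect.count
  }));
}

const points = connectChart([
  { day: '9/19/2010', connect: { sum: 500, count: 5 } },
  { day: '9/20/2010', connect: { sum: 900, count: 3 } }
]);
console.log(points);  // averages 100 and 300
```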
Real Time Updates
A single query fetches all metric data for a URL, including its most recent samples. It is fast enough that the browser can poll constantly for updated data without impacting the server.
Final thoughts
• MongoDB has been a great choice
• 80 GB of data and counting
  – Majorly compressed after moving from a table-oriented to a document-oriented data model
• 100s of updates per second, 24x7
• Not using sharding in production yet, but planning on it soon
• You are using replication, right?