Real-Time Analytics with MongoDB
Posted on 15-Jan-2015
©Yottaa Confidential. Do Not Distribute.
a better internet experience
Scaling Rails @ Yottaa
September 20th 2010
Jared Rosoff (@forjared)
jrosoff@yottaa.com
From zero to humongous
• About our application
• How we chose MongoDB
• How we use MongoDB
About our application
• We collect lots of data
  – 6,000+ URLs
  – 300 samples per URL per day
  – Some samples are >1 MB (Firebug)
  – Missing a sample isn’t a big deal
• We visualize data in real time
  – No delay when showing data
  – “On-demand” samples
  – The “check now” button
The Yottaa Network
How we chose MongoDB
Requirements
• Our data set is going to grow very quickly
  – Scalable by default
• We have a very small team
  – Focus on the application, not infrastructure
• We are a startup
  – Requirements change hourly
• Operations
  – We’re 100% in the cloud
Rails default architecture

Data Source → Collection Server → MySQL → Reporting Server → User
“Just” a Rails app

Performance bottleneck: too much load on a single database
Let’s add replication!

Data Source → Collection Server → MySQL master → replicated slaves → Reporting Server → User

Off the shelf! Scalable reads!
Performance bottleneck: still can’t scale writes
What about sharding?

Data Source → Collection Server → sharding layer → multiple MySQL masters → sharding layer → Reporting Server → User

Scalable writes!
Development bottleneck: need to write custom sharding code
Key-value stores to the rescue?

Data Source → Collection Server → Cassandra or Voldemort → Reporting Server → User

Scalable writes!
Development bottleneck: reporting is limited / hard
Can I Hadoop my way out of this?

Data Source → Collection Server → Cassandra or Voldemort → Hadoop → MySQL master/slaves → Reporting Server → User
“Just” a Rails app

Scalable writes! Flexible reports!
Development bottleneck: too many systems!
MongoDB!

Data Source → Collection Server → MongoDB → Reporting Server → User

Scalable writes! Flexible reporting! “Just” a Rails app
[Architecture diagram: Data Source (collection) and User (reporting) traffic → Load Balancer → App Servers running Nginx + Passenger → mongos → three mongod shards]

Sharding! High concurrency, scale-out
Sharding is critical
• Distribute write load across servers
• Decentralize data storage
Scale out!
Before Sharding
App Servers → a single database

Need higher write volume? Buy a bigger database.
Need more storage volume? Buy a bigger database.
After Sharding
App Servers → many shards

Need higher write volume? Add more servers.
Need more storage volume? Add more servers.
Scale out is the new scale up
How we’re using MongoDB
Our Data Model
• One document per URL we track
  – Metadata
  – Summary data
  – Most recent measurements
• One document per URL per day
  – Detailed metrics
  – Pre-aggregated data
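As a rough sketch, the two document shapes above might look like this (the exact field names are assumptions, not Yottaa’s actual schema):

```javascript
// One document per tracked URL: metadata, summary data, most recent measurements.
const urlDoc = {
  url: 'www.google.com',
  summary: { connect: { sum: 2312, count: 12 } },  // running totals, so averages need no scan
  recent: [
    { location: 'SFO', connect: 23, first_byte: 123, last_byte: 245, timestamp: 1234 }
  ]
};

// One document per URL per day: detailed, pre-aggregated metrics.
const dailyDoc = {
  url: 'www.google.com',
  day: '9/20/2010',
  connect: { sum: 2312, count: 12, sfo: { sum: 1200, count: 5 } }
};

console.log(urlDoc.recent[0].connect, dailyDoc.connect.count);
```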
Thinking in rows
Each sample is one row: URL | Location | Connect | First Byte | Last Byte | Timestamp

{ url: 'www.google.com', location: 'SFO', connect: 23, first_byte: 123, last_byte: 245, timestamp: 1234 }
{ url: 'www.google.com', location: 'NYC', connect: 23, first_byte: 123, last_byte: 245, timestamp: 2345 }
Thinking in rows
Columns: URL | Location | Connect | First Byte | Last Byte | Timestamp

What was the average connect time for Google on Friday?
From SFO? From NYC? Between 1 AM and 2 AM?
Thinking in rows
To chart an average, the query has to scan every row and AVG them per day (Day 1, Day 2, Day 3 …) to build the result.

Up to 100s of samples per URL per day!
A 30-day average query range meant an “average” chart had to hit 600 rows.
Thinking in Documents
This document contains all data for www.google.com collected during 9/20/2010:

{ url: 'www.google.com',
  day: '9/20/2010',
  last_byte: {
    sum: 2312, count: 12,          // sum / count gives the average for this URL / time period
    sfo: { sum: 1200, count: 5 },  // average value from SFO
    nyc: { sum: 1112, count: 7 }   // average value from NYC
  } }
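Since each metric stores a sum and a count, any average falls out of one division; a minimal sketch, assuming the document shape above:

```javascript
// Compute an average from a pre-aggregated { sum, count } pair.
function average(metric) {
  return metric.count === 0 ? null : metric.sum / metric.count;
}

// Values from the slide's example document.
const lastByte = {
  sum: 2312, count: 12,
  sfo: { sum: 1200, count: 5 },
  nyc: { sum: 1112, count: 7 }
};

console.log(average(lastByte.sfo));  // 240, the average last-byte time from SFO
console.log(average(lastByte));      // overall average for the day
```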
Storing a sample
A single atomic, upserted update stores each sample:

db.metrics.dailies.update(
  // which document we're updating
  { url: 'www.google.com', day: '9/20/2010' },
  // atomically update the document: the aggregate value and the location-specific value
  { '$inc': { 'connect.sum': 1234,
              'connect.count': 1,
              'connect.sfo.sum': 1234,
              'connect.sfo.count': 1 } },
  // create the document if it doesn't already exist
  { upsert: true } );
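A collection server can build that upsert mechanically from an incoming sample; a sketch with a hypothetical helper name, not Yottaa’s actual code:

```javascript
// Turn one sample into the { filter, update } pair for the daily upsert.
function dailyUpsert(sample) {
  const loc = sample.location.toLowerCase();
  return {
    filter: { url: sample.url, day: sample.day },
    update: {
      $inc: {
        'connect.sum': sample.connect,                // aggregate value
        'connect.count': 1,
        ['connect.' + loc + '.sum']: sample.connect,  // location-specific value
        ['connect.' + loc + '.count']: 1
      }
    }
  };
}

const { filter, update } = dailyUpsert({
  url: 'www.google.com', day: '9/20/2010', location: 'SFO', connect: 1234
});
// Would be sent as: db.metrics.dailies.update(filter, update, { upsert: true });
console.log(filter, update);
```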
Putting it together
{ url: 'www.google.com', location: 'SFO', connect: 23, first_byte: 123, last_byte: 245, timestamp: 1234 }

For each incoming sample:
1. Atomically update the daily data
2. Atomically update the weekly data
3. Atomically update the monthly data
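The three updates differ only in which time bucket they target; one way to derive the bucket keys from a sample’s Unix timestamp (the key formats here are assumptions):

```javascript
// Map a Unix timestamp (seconds) to daily / weekly / monthly bucket keys,
// so the same $inc can be applied to all three pre-aggregated documents.
function periodKeys(ts) {
  const d = new Date(ts * 1000);
  const day = d.toISOString().slice(0, 10);              // e.g. '2010-09-20'
  const month = day.slice(0, 7);                         // e.g. '2010-09'
  const weekStart = new Date(d);
  weekStart.setUTCDate(d.getUTCDate() - d.getUTCDay());  // back up to Sunday
  return { day, week: weekStart.toISOString().slice(0, 10), month };
}

console.log(periodKeys(1284940800));  // 2010-09-20 00:00:00 UTC, a Monday
```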
Drawing the connect-time graph

// Compound index to make this query fast
db.metrics.dailies.ensureIndex({ url: 1, day: -1 })

db.metrics.dailies.find(
  // data for Google, over the range of dates for the chart
  { url: 'www.google.com',
    day: { '$gte': '9/1/2010', '$lte': '9/20/2010' } },
  // we just want connect time data
  { 'connect': true });
More efficient charts
One document per URL per day: URL | Day | <data>
The chart AVGs one document per day (Day 1, Day 2, Day 3 …) to build the result.

30 days == 30 documents
The average chart hits 30 documents: 20x fewer than before.
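Rendering the chart is then a straight map over the returned documents; a sketch under the same assumed document shape:

```javascript
// One chart point per daily document: (day, average connect time).
function connectChart(dailyDocs) {
  return dailyDocs.map(doc => ({
    day: doc.day,
    avg: doc.connect.sum / doc.connect.count
  }));
}

const points = connectChart([
  { day: '9/19/2010', connect: { sum: 500, count: 5 } },
  { day: '9/20/2010', connect: { sum: 900, count: 3 } }
]);
console.log(points);  // averages 100 and 300
```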
Real Time Updates
A single query fetches all metric data for a URL, including its most recent samples. It is fast enough that the browser can poll constantly for updated data without impacting the server.
Final thoughts
• MongoDB has been a great choice
• 80 GB of data and counting
  – Majorly compressed after moving from a table-oriented to a document-oriented data model
• 100s of updates per second, 24x7
• Not using sharding in production yet, but planning on it soon
• You are using replication, right?