20131017 - en - presentation damn data

Moving Forwards with Cassandra!

Storing traffic data historically

21/10/2013 Pieter Callewaert

Be-Mobile│1

1. Detection 2. Aggregation 3. Distribution

Detection (data sources):

• Road sensor data

• Probe vehicle data

• Public sourcing

• Traffic operators

• Other mobility data

Source channels: Phone, SMS, Cameras; Apps, Web; Traffic centers, Public transport, Fuel, Parking

Aggregation: Be-Mobile SmartMove Mobility Database

• Sensor: Input API for sensors

• Icarus: Floating car data

• Editor: Manual data entry

• Connector: External data

Distribution:

• Navigation traffic services

• Traveler information services: Smartphone & (mobile) web; Radio & TV traffic services

• Traffic management services: Road traffic mgt services; Fleet & logistics traffic mgt

• Traffic analysis & consulting


Be-Mobile is hiring!

http://www.be-mobile.be/about/careers

Be-Mobile NV Technologiepark 12b 9052 Ghent Belgium

www.be-mobile.be


Requirements…

February 2010


What data do we want to store?

Green dots: Nodes

Red dots: Super Nodes (connecting Links)

Blue lines: Segments

Orange lines: Links


2010: New project

We wanted to store our raw traffic data in a database so that it would be easy to query and to generate reports from.

Requirements (February 2010):

• 50 000 links stored every 15 minutes (about 4.8 million records each day)

• Low cost

• C# .NET

No problem, we already had a Microsoft SQL Server database, and the needed experience to do this.


But requirements change…

Requirements (October 2010):

• 520 000 links (previously 50 000) stored every 5 minutes (previously 15): about 150 million records each day, so the data size is 31x larger

• Very Low cost

• C#
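The jump in volume can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes one record per link per storage interval, and the function name `records_per_day` is made up for the illustration.

```python
MINUTES_PER_DAY = 24 * 60


def records_per_day(links: int, interval_minutes: int) -> int:
    """Records written per day when every link is stored once per interval."""
    return links * (MINUTES_PER_DAY // interval_minutes)


# February 2010: 50 000 links every 15 minutes
february = records_per_day(50_000, 15)

# October 2010: 520 000 links every 5 minutes
october = records_per_day(520_000, 5)

# How much more data per day than in the February requirements
growth = october / february

print(f"February 2010: {february:,} records/day")
print(f"October 2010:  {october:,} records/day")
print(f"Growth factor: {growth:.1f}x")
```

Under that assumption the October figure comes out just under 150 million records per day, which matches the rounded number on the slide.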

Be-Mobile│7

Back to the drawing board

Preselection of possible contenders:

• Microsoft SQL Server : Relational database

• MongoDB : Document data store

• Apache Cassandra : Column family data store

Be-Mobile│8

Microsoft SQL Server

The current approach didn’t work. Two options:

• Buy high end hardware

• Distribute load over multiple servers?

Pros:

• We have experience with SQL Server

• (Compression)

Cons:

• High hardware costs or high license costs…

• No experience with this volume of data in SQL Server

• Partitioning data over multiple servers can be tricky

Be-Mobile│9

MongoDB

MongoDB 1.6.x

Proof of concept ready in less than a day.

Pros:

• Backed by a company (10gen, now MongoDB Inc.),

• Open Source,

• Official C# driver,

• EASY!

Cons:

• Easier to scale beyond 1 server, but still not that straightforward,

• No (native) compression,

• 16 MB document limit forced us to make a more complex data model.

Be-Mobile│10

Apache Cassandra

Apache Cassandra 0.8/1.0

Took about 10 days to create a proof of concept.

Pros:

• Backed by a company (DataStax),

• Open Source,

• Scales automatically,

• Configuring replication is also easy,

• Compression (since 1.0).

Cons:

• Not easy to learn,

• Thrift interface,

• Data modeling was not easy,

• No official C# driver.

Be-Mobile│11

How does Apache Cassandra work?

Introduction to Apache Cassandra

Be-Mobile│12

The basics

• Open-sourced by Facebook in 2008

• Marriage between Amazon Dynamo and Google BigTable

• No single point of failure (Dynamo)

• Consistent hashing for data distribution (Dynamo)

• BigTable data model

• A cluster is represented as a ring (of nodes)

• When a new node is added, it takes its place in the ring where needed.

• Other cool stuff:

  • Multi datacenter setups

  • Able to define how your replicated data is spread

  • Native Hadoop/Pig support

  • Able to define a time to live

• Terminology:

  • Keyspace = (SQL) database

  • Column family/table = (SQL) table
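The ring and consistent-hashing bullets above can be illustrated with a small in-memory model. This is a minimal sketch of the general technique, not Cassandra's implementation: the `Ring` class, the `token` function, the node names and the choice of an MD5-based token are all assumptions made for the example, and replication is left out entirely.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a string onto the ring (illustrative choice: first 8 bytes of MD5)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    """Toy ring: every node owns the keys between its predecessor and itself."""

    def __init__(self, nodes):
        self._tokens = []   # node tokens, kept sorted
        self._owners = {}   # token -> node name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        """A new node takes its place on the ring at the position of its token."""
        t = token(node)
        bisect.insort(self._tokens, t)
        self._owners[t] = node

    def owner(self, key: str) -> str:
        """First node at or after the key's token, wrapping around the ring."""
        i = bisect.bisect_left(self._tokens, token(key))
        return self._owners[self._tokens[i % len(self._tokens)]]


ring = Ring(["node-a", "node-b", "node-c"])
keys = [f"link-{i}" for i in range(1000)]   # hypothetical row keys

before = {k: ring.owner(k) for k in keys}
ring.add("node-d")
after = {k: ring.owner(k) for k in keys}

# Keys whose owner changed when the fourth node joined
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved, all of them to node-d")
```

The point of the sketch is the last step: adding a node only moves the keys that fall into the new node's range, while every other key stays where it was.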

Be-Mobile│13

Data modeling was hard, until CQL3 came along…

Before:

Complex data models with column families.

Connect through the Thrift interface.

Hard to model your problem correctly.

After:

CQL3 is available over both Thrift and the native transport.

Easy to query (SELECT, INSERT, UPDATE,…)

You have to define a ‘static’ model, but you can use maps, sets and lists as column types to add dynamic columns.

Be-Mobile│14

Consistency? Replication?

Replication

Define the replication factor when creating the keyspace.

Consistency

With every read or write you can define a consistency level.

• ONE

• TWO

• THREE

• QUORUM

• LOCAL_QUORUM

• EACH_QUORUM

• ALL

(QUORUM: floor(replication_factor / 2) + 1)

* Example shown with virtual nodes set on 1
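The quorum formula can be made concrete with a short calculation. This is a sketch under stated assumptions: the function names are invented for the illustration, and the overlap check uses the general rule that a read is guaranteed to touch at least one replica holding the latest write when the number of replicas written plus the number of replicas read exceeds the replication factor.

```python
def quorum(replication_factor: int) -> int:
    """Replicas needed for QUORUM: floor(replication_factor / 2) + 1."""
    return replication_factor // 2 + 1


def read_overlaps_write(replication_factor: int,
                        write_replicas: int,
                        read_replicas: int) -> bool:
    """True when every read set must share at least one replica with the write set."""
    return write_replicas + read_replicas > replication_factor


for rf in (1, 2, 3, 5):
    print(f"replication factor {rf}: QUORUM needs {quorum(rf)} replica(s)")

rf = 3
# QUORUM writes combined with QUORUM reads always overlap
quorum_both = read_overlaps_write(rf, quorum(rf), quorum(rf))
# ONE write combined with ONE read may miss the latest value
one_both = read_overlaps_write(rf, 1, 1)
print(quorum_both, one_both)
```

With a replication factor of 3, QUORUM on both reads and writes still tolerates one unavailable replica, which is why it is a common middle ground between ONE and ALL.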

Be-Mobile│15

Awesome tools: cqlsh

Bundled with Apache Cassandra

Run your own queries on the data.

• Tab completion

• Colored view

• Perfect for your first steps with Apache Cassandra

• Allows tracing!

Be-Mobile│16

Awesome tools: nodetool

Bundled with Apache Cassandra

Nodetool: CLI-based administration tool

THE tool to use when operating a Cassandra cluster

• Lets you manage your cluster and see metrics, status,…

• See internals of your cluster

• Show off your stats…

Be-Mobile│17

Awesome tools: DataStax OpsCenter (Community)

Web-based front-end to monitor your cluster

Be-Mobile│18

Does it scale?

• In November 2011 Netflix published a blog post about a benchmark to test scalability.

• This was done with Apache Cassandra 0.8.6 on Amazon EC2 instances.

• Test was run on 48, 96, 144 and 288 nodes

http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html











Be-Mobile│19

Apache Cassandra at Be-Mobile

Implementation

Be-Mobile│20

Current situation

In the meantime, the requirements changed again.

Requirements (September 2013):

• 520 000 links (previously 50 000) stored every 5 minutes (previously 15): about 150 million records each day, so the data size is 31x larger

• Very Low cost

• C#

• On average 1.2 million segments stored every minute (about 1.73 billion records each day), kept for a maximum of 31 days.

Be-Mobile│21

Implementation

Data model v3

Thanks to CQL3 we were able to create an easy to understand data model.

Two almost identical data models for our segments and links.

• Data is partitioned by “id” and “date”

• “datetime” is the clustering key

• Data is sorted by “datetime” descending, so the newest data is always first

The segments table is defined with a default time to live for data.
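The shape of this data model can be mimicked in memory to show what the partitioning and ordering buy you. This is only an illustrative sketch, not the actual table definition: the `SegmentStore` class, the `value` payload and the sample ids are hypothetical, the 31-day time to live is taken from the retention requirement, and expiry is simplified by treating each row's “datetime” as its write time.

```python
from collections import defaultdict
from datetime import date, datetime, timedelta


class SegmentStore:
    """Toy model: rows grouped per (id, date) partition, newest datetime first."""

    def __init__(self, ttl: timedelta):
        self.ttl = ttl
        # partition key (id, date) -> list of (datetime, value), sorted descending
        self.partitions = defaultdict(list)

    def insert(self, segment_id: int, when: datetime, value: int) -> None:
        rows = self.partitions[(segment_id, when.date())]
        rows.append((when, value))
        # Keep the clustering order: "datetime" descending
        rows.sort(key=lambda row: row[0], reverse=True)

    def latest(self, segment_id: int, day: date, now: datetime, limit: int = 1):
        """Return the newest rows of one partition that have not expired yet."""
        rows = self.partitions.get((segment_id, day), [])
        live = [row for row in rows if now - row[0] < self.ttl]
        return live[:limit]


store = SegmentStore(ttl=timedelta(days=31))
store.insert(42, datetime(2013, 9, 1, 8, 0), 63)
store.insert(42, datetime(2013, 9, 1, 8, 2), 58)
store.insert(42, datetime(2013, 9, 1, 8, 1), 61)

# Reading "the current state" touches one partition and takes the first row
newest = store.latest(42, date(2013, 9, 1), now=datetime(2013, 9, 1, 9, 0))

# More than 31 days later the rows have expired
expired = store.latest(42, date(2013, 9, 1), now=datetime(2013, 10, 15))
print(newest, expired)
```

Splitting each id over one partition per date keeps partitions bounded in size, and the descending order means the most recent measurement is always the first row read.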

Be-Mobile│22

Our cluster (since September 2013)

12 nodes (commodity hardware!)

• Intel Core i7-4770,

• 32GB RAM,

• 240 GB SSD,

• 2 x 2TB 7200 RPM HDD

Running Ubuntu Linux 12.04 with Apache Cassandra 2.0.1

Clients connect through our own API running on each node, developed in C# with ServiceStack.

Cluster data size: 12 TB

Every minute, 1.2 million records are written in 5 seconds
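These figures translate into a write rate and a retained record count as follows. It is a rough sketch only: the per-node number assumes writes are spread evenly over the 12 nodes and ignores replication (the replication factor is not stated here), and the constant names are made up for the calculation.

```python
RECORDS_PER_MINUTE = 1_200_000   # segments stored every minute
WRITE_WINDOW_SECONDS = 5         # time taken to write one batch
NODES = 12                       # cluster size
RETENTION_DAYS = 31              # maximum time segment data is kept

# Sustained rate during the 5-second write window
cluster_rate = RECORDS_PER_MINUTE // WRITE_WINDOW_SECONDS

# Assumption: even spread over all nodes, replication not counted
per_node_rate = cluster_rate // NODES

# Daily volume and the upper bound on rows alive at any time
records_per_day = RECORDS_PER_MINUTE * 24 * 60
retained_records = records_per_day * RETENTION_DAYS

print(f"Cluster write rate: {cluster_rate:,} records/s")
print(f"Per node (no replication): {per_node_rate:,} records/s")
print(f"Per day: {records_per_day:,} records")
print(f"Retained at most: {retained_records:,} records")
```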

Be-Mobile│23

Results

11th of March 2013

12th of March 2013

Be-Mobile│24

Thanks! Questions?
