20131017 - en - presentation damn data

Moving Forwards with Cassandra! Storing traffic data historically 21/10/2013 Pieter Callewaert

Upload: pietercalle

Post on 04-Jul-2015

Page 1: 20131017 - en - presentation damn data

Moving Forwards with Cassandra!

Storing traffic data historically

21/10/2013 Pieter Callewaert

Page 2: 20131017 - en - presentation damn data

Be-Mobile│1

1. Detection 2. Aggregation 3. Distribution

Be-Mobile

SmartMove Mobility Database

Input modules:
• Sensor: Input API for sensors
• Icarus: Floating car data
• Editor: Manual data entry
• Connector: External data

Data sources:
• Road sensor data
• Probe vehicle data
• Public sourcing
• Traffic operators
• Other mobility data
• Phone, SMS, Cameras, Apps, Web
• Traffic centers, Public transport, Fuel, Parking

Services:
• Navigation traffic services
• Traveler information services
• Smartphone & (mobile) web
• Radio & TV traffic services
• Traffic management services
• Road traffic mgt services
• Fleet & logistics traffic mgt
• Traffic analysis & consulting

Page 3: 20131017 - en - presentation damn data

Be-Mobile is hiring!

http://www.be-mobile.be/about/careers

Be-Mobile NV Technologiepark 12b 9052 Ghent Belgium

www.be-mobile.be

Be-Mobile

Page 4: 20131017 - en - presentation damn data

Requirements…

February 2010

Page 5: 20131017 - en - presentation damn data

What data do we want to store?

Green dots: Nodes

Red dots: Super Nodes (connecting Links)

Blue lines: Segments

Orange lines: Links

Page 6: 20131017 - en - presentation damn data

2010: New project

We wanted to store our raw traffic data into a database so it would be easy to query and generate reports.

Requirements (February 2010):

• 50 000 links stored every 15 minutes (approx. 4.8 million records each day)

• Low cost

• C# .NET

No problem: we already had a Microsoft SQL Server database and the experience needed to do this.

Page 7: 20131017 - en - presentation damn data

But requirements change…

Requirements (October 2010):

• 520 000 links (previously 50 000) stored every 5 minutes (previously every 15): approx. 150 million records each day, so the data size is 31x larger

• Very Low cost

• C#
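
The record counts on these two requirement slides follow directly from the number of links and the storage interval. A minimal sketch of that arithmetic, assuming one record per link per interval (the function name is illustrative, not from the original project):

```python
def records_per_day(links: int, interval_minutes: int) -> int:
    """Records written per day if every link is stored once per interval."""
    minutes_per_day = 24 * 60
    return links * (minutes_per_day // interval_minutes)


# Figures taken from the February 2010 and October 2010 requirement slides.
feb_2010 = records_per_day(links=50_000, interval_minutes=15)
oct_2010 = records_per_day(links=520_000, interval_minutes=5)
growth = oct_2010 / feb_2010

print(f"February 2010: {feb_2010:,} records/day")
print(f"October 2010:  {oct_2010:,} records/day")
print(f"Growth factor: {growth:.1f}x")
```

The growth comes from two multipliers at once: 10.4 times more links and 3 times more samples per link per day.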

Page 8: 20131017 - en - presentation damn data

Back to the drawing board

Preselection of possible contenders:

• Microsoft SQL Server: Relational database

• MongoDB: Document data store

• Apache Cassandra: Column family data store

Page 9: 20131017 - en - presentation damn data

Microsoft SQL Server

Current approach didn’t work. 2 options:

• Buy high end hardware

• Distribute load over multiple servers?

Pros:

• We have experience with SQL Server

• (Compression)

Cons:

• High hardware costs or high license costs…

• No experience with this volume of data in SQL Server

• Partitioning data over multiple servers can be tricky

Page 10: 20131017 - en - presentation damn data

MongoDB

MongoDB 1.6.x

Proof of concept ready in less than a day.

Pros:

• Backed by a company (10gen, now MongoDB Inc.)

• Open source

• Official C# driver

• EASY!

Cons:

• Easier to scale beyond 1 server, but still not that straightforward

• No (native) compression

• 16 MB document limit forced us to make a more complex data model

Page 11: 20131017 - en - presentation damn data

Apache Cassandra

Apache Cassandra 0.8/1.0

Took about 10 days to create a proof of concept.

Pros:

• Backed by a company (DataStax)

• Open source

• Scales automatically

• Configuring replication is also easy

• Compression (since 1.0)

Cons:

• Not easy to learn

• Thrift interface

• Data modeling was not easy

• No official C# driver

Page 12: 20131017 - en - presentation damn data

How does Apache Cassandra work?

Introduction to Apache Cassandra

Page 13: 20131017 - en - presentation damn data

The basics

• Open-sourced by Facebook in 2008

• Marriage between Amazon Dynamo and Google BigTable

• No single point of failure (Dynamo)

• Consistent hashing for data distribution (Dynamo)

• BigTable data model

• A cluster is represented in a ring (of nodes)

• When a new node is added, it takes its place in the ring where needed.

• Other cool stuff:

• Multi datacenter setups

• Able to define how your replicated data is spread

• Native Hadoop/Pig support

• Able to define a time to live

• Terminology:

• Keyspace = (SQL) database

• Column family/table = (SQL) table
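
To make the ring and consistent hashing bullets concrete, here is a minimal sketch, not Cassandra's actual implementation. It assumes one token per node, uses MD5 only as a convenient stable hash, and places replicas on the next nodes clockwise on the ring. The `Ring` class and the node and key names are hypothetical.

```python
import bisect
import hashlib


class Ring:
    """Toy consistent-hashing ring with a single token per node."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = replication_factor
        self._tokens = []  # sorted list of (token, node name)
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _token(value: str) -> int:
        # Stable hash of a string, interpreted as a position on the ring.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node: str) -> None:
        # A new node simply takes the position its token dictates.
        bisect.insort(self._tokens, (self._token(node), node))

    def replicas(self, partition_key: str):
        """Return the nodes holding a key: its owner plus the next nodes clockwise."""
        if not self._tokens:
            return []
        start = bisect.bisect_right(self._tokens, (self._token(partition_key), ""))
        count = min(self.replication_factor, len(self._tokens))
        return [
            self._tokens[(start + i) % len(self._tokens)][1] for i in range(count)
        ]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
print(ring.replicas("link-42:2013-10-21"))
```

The useful property is that adding a node only moves the keys that fall into the new node's token range; every other key keeps its owner, which is what makes growing the cluster cheap.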

Page 14: 20131017 - en - presentation damn data

Data modeling was hard, until CQL3 came along…

Before:

Complex data models with column families

Connect through the Thrift interface.

Hard to correctly model your problem.

After:

CQL3 is available over both Thrift and the native transport.

Easy to query (SELECT, INSERT, UPDATE,…)

You have to define a ‘static’ model, but can use maps, sets and lists as column types to add dynamic columns.

Page 15: 20131017 - en - presentation damn data

Consistency? Replication?

Replication

Define the replication factor when creating the keyspace.

Consistency

With every read or write you can define a consistency level.

• ONE

• TWO

• THREE

• QUORUM

• LOCAL_QUORUM

• EACH_QUORUM

• ALL

(QUORUM: (replication_factor / 2) + 1)

* Example shown with virtual nodes set to 1
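
The QUORUM formula above uses integer division. A small sketch of it, together with the commonly cited rule of thumb that a read is guaranteed to see the latest write when the read and write replica counts together exceed the replication factor. The function names are illustrative.

```python
def quorum(replication_factor: int) -> int:
    """QUORUM = (replication_factor / 2) + 1, using integer division."""
    return replication_factor // 2 + 1


def read_sees_latest_write(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when the read set and the write set must share at least one replica."""
    return read_replicas + write_replicas > replication_factor


for rf in (1, 2, 3, 5):
    q = quorum(rf)
    print(f"RF={rf}: QUORUM={q}, "
          f"QUORUM reads + QUORUM writes overlap: {read_sees_latest_write(q, q, rf)}")

# Reading and writing at ONE with three replicas gives no such guarantee.
print(read_sees_latest_write(1, 1, 3))
```

With a replication factor of 3, QUORUM means 2 replicas, so one node can be down while QUORUM reads and writes still succeed.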

Page 16: 20131017 - en - presentation damn data

Awesome tools: cqlsh

Bundled with Apache Cassandra

Run your own queries on the data.

• Tab completion

• Colored output

• Perfect for your first steps with Apache Cassandra

• Allows tracing!

Page 17: 20131017 - en - presentation damn data

Awesome tools: nodetool

Bundled with Apache Cassandra

Nodetool: CLI-based administration tool

THE tool to use when operating a Cassandra cluster

• Lets you manage your cluster and see metrics, status,…

• See internals of your cluster

• Show off your stats…

Page 18: 20131017 - en - presentation damn data

Awesome tools: DataStax OpsCenter (Community)

Web-based front-end to monitor your cluster

Page 20: 20131017 - en - presentation damn data

Apache Cassandra at Be-Mobile

Implementation

Page 21: 20131017 - en - presentation damn data

Current situation

In the meantime, the requirements changed again.

Requirements (September 2013):

• 520 000 links (previously 50 000) stored every 5 minutes (previously every 15): approx. 150 million records each day, so the data size is 31x larger

• Very Low cost

• C#

• On average 1.2 million segments stored every minute (approx. 1.73 billion records each day), kept for a maximum of 31 days.
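
The segment figures imply an upper bound on how many rows are live at once. A rough sketch of that estimate, assuming a constant rate of 1.2 million segment records per minute and that every record expires after exactly 31 days (both simplifications of the averages on the slide):

```python
segments_per_minute = 1_200_000  # average from the slide
retention_days = 31              # maximum retention from the slide

records_per_day = segments_per_minute * 24 * 60
max_live_records = records_per_day * retention_days

print(f"Segment records per day:  {records_per_day:,}")
print(f"Live records at 31 days:  {max_live_records:,}")
```

Because old rows expire, the segment data stops growing once the retention window is full, which keeps storage needs predictable.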

Page 22: 20131017 - en - presentation damn data

Implementation

Data model v3

Thanks to CQL3 we were able to create an easy to understand data model.

2 almost identical data models for our segments and links.

• Data is partitioned by “id” and “date”

• “datetime” is the clustering key

• Data is sorted by “datetime” descending, so the newest data is always first

The segments table is defined with a default time to live for data.
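
The slide does not show the actual schema, so the following is only a sketch of what such a CQL3 table definition could look like, generated from Python for illustration. The columns `id`, `date` and `datetime` come from the slide; their types, the `speed` value column, the table names and the 31-day TTL on segments are assumptions.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def create_table_cql(table: str, ttl_days=None) -> str:
    """Build a CREATE TABLE statement partitioned by (id, date), clustered by datetime."""
    options = ["CLUSTERING ORDER BY (datetime DESC)"]  # newest rows first
    if ttl_days is not None:
        # default_time_to_live is expressed in seconds.
        options.append(f"default_time_to_live = {ttl_days * SECONDS_PER_DAY}")
    return (
        f"CREATE TABLE {table} (\n"
        "    id int,\n"              # assumed type
        "    date text,\n"           # assumed type, e.g. '2013-10-21'
        "    datetime timestamp,\n"
        "    speed int,\n"           # hypothetical measurement column
        "    PRIMARY KEY ((id, date), datetime)\n"
        f") WITH {' AND '.join(options)};"
    )


links_cql = create_table_cql("links")
segments_cql = create_table_cql("segments", ttl_days=31)
print(links_cql)
print(segments_cql)
```

Partitioning on both “id” and “date” keeps each partition bounded to one day of data per link or segment, and the descending clustering order means a query with a small LIMIT returns the most recent measurements without scanning the whole partition.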

Page 23: 20131017 - en - presentation damn data

Our cluster (since September 2013)

12 nodes (commodity hardware!)

• Intel Core i7-4770,

• 32GB RAM,

• 240 GB SSD,

• 2 x 2TB 7200 RPM HDD

Running Ubuntu Linux 12.04 with Apache Cassandra 2.0.1

Clients connect through our own API on each node, developed in C# with ServiceStack.

Cluster data size: 12 TB

Every minute, 1.2 million records are written in 5 seconds
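
The last figure translates into a write rate as follows. This is a back-of-the-envelope sketch: it assumes the writes are spread evenly over the 12 nodes and ignores replication, since the replication factor is not stated on the slide.

```python
records_per_batch = 1_200_000  # records written every minute
write_window_seconds = 5       # time the batch takes to write
nodes = 12

burst_rate = records_per_batch / write_window_seconds  # records/s during the batch
burst_rate_per_node = burst_rate / nodes               # assuming an even spread
sustained_rate = records_per_batch / 60                # averaged over the full minute

print(f"Burst write rate:     {burst_rate:,.0f} records/s")
print(f"Per node (12 nodes):  {burst_rate_per_node:,.0f} records/s")
print(f"Sustained average:    {sustained_rate:,.0f} records/s")
```

With replication, each node handles more writes than the per-node figure suggests, because every record is stored on several replicas.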

Page 24: 20131017 - en - presentation damn data

Results

11th of March 2013

12th of March 2013

Page 25: 20131017 - en - presentation damn data

Thanks! Questions?