Cassandra 2.0 and timeseries

©2013 DataStax Confidential. Do not distribute without consent. @PatrickMcFadin Patrick McFadin Chief Evangelist/Solution Architect - DataStax Cassandra 2.0: Intro + Time Series Friday, October 11, 13

Upload: patrick-mcfadin

Post on 15-Jan-2015


DESCRIPTION

At this meetup Patrick McFadin, Solutions Architect at DataStax, will be discussing the most recently added features in Apache Cassandra 2.0, including: Lightweight transactions, eager retries, improved compaction, triggers, and CQL cursors. He'll also be touching on time series data with Apache Cassandra.

TRANSCRIPT

Page 1: Cassandra 2.0 and timeseries

©2013 DataStax Confidential. Do not distribute without consent.

@PatrickMcFadin

Patrick McFadin
Chief Evangelist/Solution Architect - DataStax

Cassandra 2.0: Intro + Time Series

Friday, October 11, 13

Page 2: Cassandra 2.0 and timeseries

Who I am

• Patrick McFadin
• Solution Architect at DataStax
• Cassandra MVP
• User for years
• Follow me for more: @PatrickMcFadin

I talk about Cassandra and building scalable, resilient apps ALL THE TIME!

Dude. Uptime == $$

Page 3: Cassandra 2.0 and timeseries

Cassandra - An introduction


Page 4: Cassandra 2.0 and timeseries

Cassandra - Intro

• Based on the Amazon Dynamo and Google BigTable papers
• Shared nothing
• Data as safe as possible
• Predictable scaling

[Figure labels: Dynamo, BigTable]

Page 5: Cassandra 2.0 and timeseries

Cassandra - More than one server

• All nodes participate in a cluster
• Shared nothing
• Add or remove as needed
• More capacity? Add a server

• Each node owns a token
• Tokens denote a range of keys

• 4 nodes? -> Key range/4
• Each node owns 1/4 the data
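The even split of the key range can be sketched in a few lines. This is a minimal illustration only: it assumes a toy token space of 0..99 instead of a real partitioner's range, and `assign_tokens` and `owner_of` are hypothetical helper names, not Cassandra APIs.

```python
from bisect import bisect_left

TOKEN_SPACE = 100  # assumption: toy token space, far smaller than a real partitioner's


def assign_tokens(node_count):
    """Give each node one token, spaced evenly around the ring."""
    step = TOKEN_SPACE // node_count
    return [(i + 1) * step - 1 for i in range(node_count)]


def owner_of(key_token, tokens):
    """Return the index of the node whose range contains key_token.

    Node i owns every token after the previous node's token, up to and
    including its own token. Tokens past the last node wrap to node 0.
    """
    return bisect_left(tokens, key_token) % len(tokens)


tokens = assign_tokens(4)
keys_per_node = [0] * len(tokens)
for key_token in range(TOKEN_SPACE):
    keys_per_node[owner_of(key_token, tokens)] += 1

print(tokens)         # one token per node
print(keys_per_node)  # how many of the 100 toy keys each node owns
```

With four nodes, each node ends up owning a quarter of the toy key range, which is the "Key range/4" point on the slide.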

Page 6: Cassandra 2.0 and timeseries

Cassandra - Locally Distributed

• Client writes to any node
• Node coordinates with others
• Data replicated in parallel
• Replication factor: How many copies of your data?
• RF = 3 here

Each node stores 3/4 of the cluster's total data.
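The 3/4 figure follows from RF = 3 on a 4-node ring. A minimal sketch, assuming the simplest placement rule (the primary owner plus the next nodes clockwise) and hypothetical function names:

```python
def replicas_for(primary, node_count, rf):
    """Primary owner plus the next rf - 1 nodes clockwise on the ring.

    Assumption: simple ring placement with rf <= node_count.
    """
    return [(primary + i) % node_count for i in range(rf)]


def share_of_data(node, node_count, rf):
    """Fraction of the key ranges for which `node` holds a copy."""
    held = sum(
        1 for primary in range(node_count)
        if node in replicas_for(primary, node_count, rf)
    )
    return held / node_count


print(replicas_for(3, node_count=4, rf=3))   # replicas wrap around the ring
print(share_of_data(0, node_count=4, rf=3))  # share held by one node
```

Each of the 4 ranges is stored on 3 nodes, so every node holds copies of 3 of the 4 ranges.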

Page 7: Cassandra 2.0 and timeseries

Cassandra - Geographically Distributed

• Client writes locally
• Data syncs across WAN
• Replication Factor per DC

[Figure label: Single coordinator]

Page 8: Cassandra 2.0 and timeseries

Cassandra - Consistency

• Consistency Level (CL)
• Client specifies per read or write

• ALL = All replicas ack
• QUORUM = A majority (more than 50%) of replicas ack
• LOCAL_QUORUM = A majority of replicas in the local DC ack
• ONE = Only one replica acks
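The number of acknowledgements each level needs can be written down directly. A minimal sketch; `required_acks` is a hypothetical helper, and the level names are passed as plain strings for illustration:

```python
def required_acks(level, rf, local_rf=None):
    """Replica acknowledgements needed for a request to succeed.

    rf is the total replication factor; local_rf is the replication
    factor of the local data center (needed only for LOCAL_QUORUM).
    """
    if level == "ALL":
        return rf
    if level == "QUORUM":
        return rf // 2 + 1  # strict majority of all replicas
    if level == "LOCAL_QUORUM":
        if local_rf is None:
            raise ValueError("LOCAL_QUORUM needs the local DC's replication factor")
        return local_rf // 2 + 1  # strict majority within the local DC
    if level == "ONE":
        return 1
    raise ValueError(f"unknown consistency level: {level}")


for level in ("ALL", "QUORUM", "ONE"):
    print(level, required_acks(level, rf=3))
```

A majority is `rf // 2 + 1`, so RF = 3 needs 2 acks and RF = 4 needs 3.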

Page 9: Cassandra 2.0 and timeseries

Cassandra - Transparent to the application

• A single node failure shouldn’t bring failure
• Replication Factor + Consistency Level = Success
• This example: RF = 3, CL = QUORUM

More than 50% ack (2 of 3), so we are good!
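The "RF + CL = Success" rule can be checked with simple arithmetic. A minimal sketch with hypothetical helper names, assuming a request succeeds whenever enough replicas are still up to reach the required ack count:

```python
def quorum(rf):
    """Strict majority of rf replicas."""
    return rf // 2 + 1


def request_succeeds(rf, acks_needed, replicas_down):
    """True if the replicas still up can supply the required acks."""
    return (rf - replicas_down) >= acks_needed


def tolerated_failures(rf, acks_needed):
    """How many replicas can be down before requests start failing."""
    return rf - acks_needed


print(request_succeeds(rf=3, acks_needed=quorum(3), replicas_down=1))
print(tolerated_failures(rf=3, acks_needed=quorum(3)))
```

With RF = 3 and QUORUM, one replica can be down and the application never notices; with ALL, none can.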

Page 10: Cassandra 2.0 and timeseries

Cassandra Applications - Drivers

• DataStax Drivers for Cassandra
• Java
• C#
• Python
• More on the way

Page 11: Cassandra 2.0 and timeseries

Application Example - Layout

• Active-Active
• Service based DNS routing

Cassandra Replication

Page 12: Cassandra 2.0 and timeseries

Application Example - Uptime

• Normal server maintenance
• Application is unaware

Cassandra Replication

Page 13: Cassandra 2.0 and timeseries

Application Example - Failure

• Data center failure
• Data is safe. Route traffic.

Another happy user!

Page 14: Cassandra 2.0 and timeseries

Cassandra 2.0 - Big new features


Page 15: Cassandra 2.0 and timeseries

Five Years of Cassandra

[Timeline figure: axis ticks Jul-08, Jul-09, May-10, Feb-11, Dec-11, Oct-12, Jul-13; releases 0.1, 0.3, 0.6, 0.7, 1.0, 1.2 ... 2.0; DSE]

Page 16: Cassandra 2.0 and timeseries

Lightweight transactions: the problem

Session 1

SELECT * FROM users WHERE username = 'jbellis';

[empty resultset]

INSERT INTO users (username, password) VALUES ('jbellis', 'xdg44hh');

Session 2

SELECT * FROM users WHERE username = 'jbellis';

[empty resultset]

INSERT INTO users (username, password) VALUES ('jbellis', '8dhh43k');

It’s a Race!

Who wins?
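The race on this slide can be replayed deterministically. This is a toy model only: a Python dict stands in for the users table, and `select_user` and `insert_user` are hypothetical helpers, not driver calls.

```python
users = {}  # toy stand-in for the users table, keyed by username


def select_user(username):
    """Read-before-write check: returns None when no row exists."""
    return users.get(username)


def insert_user(username, password):
    """Plain insert: it never fails, it just overwrites."""
    users[username] = password


# Interleave the two sessions: both read before either writes.
seen_by_session_1 = select_user("jbellis")
seen_by_session_2 = select_user("jbellis")

if seen_by_session_1 is None:
    insert_user("jbellis", "xdg44hh")
if seen_by_session_2 is None:
    insert_user("jbellis", "8dhh43k")

print(users)
```

Both sessions saw an empty result, both inserted, and neither was told that the other's write replaced or was replaced by its own. That silent overwrite is the problem lightweight transactions address.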

Page 17: Cassandra 2.0 and timeseries

Why Locking Doesn’t Work

• Client locks
• Write times out
• Lock released
• Hint is replayed!!

[Diagram labels: Client (locks), request, Coordinator, internal request, Replica]

Page 18: Cassandra 2.0 and timeseries

Why Locking Doesn’t Work

[Diagram build, labels: Client (locks), request, Coordinator, internal request, Replica, X]

Page 19: Cassandra 2.0 and timeseries

Why Locking Doesn’t Work

[Diagram build, labels: Client (locks), request, Coordinator, internal request, Replica, hint, X]

Page 20: Cassandra 2.0 and timeseries

Why Locking Doesn’t Work

[Diagram build, labels: Client (locks), request, Coordinator, internal request, Replica, hint, timeout response, X]

Page 21: Cassandra 2.0 and timeseries

Paxos

• Consensus algorithm
• All operations are quorum-based
• Each replica sends information about unfinished operations to the leader during prepare
• Paxos Made Simple

Page 22: Cassandra 2.0 and timeseries

LWT: details

• 4 round trips vs 1 for normal updates
• Paxos state is durable
• Immediate consistency with no leader election or failover
• ConsistencyLevel.SERIAL
• http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0

Page 24: Cassandra 2.0 and timeseries

Using LWT

• Don’t overwrite an existing record

INSERT INTO USERS (username, email, ...)
VALUES ('jbellis', '[email protected]', ... )
IF NOT EXISTS;

• Only update record if condition is met

UPDATE USERS SET email = '[email protected]', ...
WHERE username = 'jbellis'
IF email = '[email protected]';
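The semantics of the two conditional statements can be modelled in memory. A minimal sketch, not how Cassandra implements them (that is Paxos across replicas): `UsersTable` is a hypothetical class, the email addresses are made-up placeholders, and the boolean return value plays the role of the applied flag a conditional statement reports back.

```python
class UsersTable:
    """Single-process model of IF NOT EXISTS and IF <condition>."""

    def __init__(self):
        self._rows = {}

    def insert_if_not_exists(self, username, **columns):
        """Insert only when no row exists; report whether it was applied."""
        if username in self._rows:
            return False
        self._rows[username] = dict(columns)
        return True

    def update_if(self, username, expected, **changes):
        """Update only when every expected column currently matches."""
        row = self._rows.get(username)
        if row is None:
            return False
        if any(row.get(name) != value for name, value in expected.items()):
            return False
        row.update(changes)
        return True

    def get(self, username):
        return self._rows.get(username)


table = UsersTable()
first = table.insert_if_not_exists("jbellis", email="old@example.com")
second = table.insert_if_not_exists("jbellis", email="other@example.com")
updated = table.update_if(
    "jbellis", {"email": "old@example.com"}, email="new@example.com"
)
stale = table.update_if(
    "jbellis", {"email": "old@example.com"}, email="late@example.com"
)
print(first, second, updated, stale)
```

Only the first insert and the first conditional update are applied; the losers are told so instead of silently overwriting.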

Page 25: Cassandra 2.0 and timeseries

CQL Improvements

• Cursors
• Large result sets now have ->next() functionality
• Prevents massive result sets from causing OOM errors
• No more client side hacks with LIMIT
• Warning: Not isolated
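The cursor idea is that the client pulls one page at a time instead of materialising the whole result. A minimal sketch under stated assumptions: a Python list stands in for the server-side result, and `fetch_page` and `iterate_rows` are hypothetical names rather than driver APIs.

```python
def fetch_page(rows, start, page_size):
    """Stand-in for one round trip: return a page and the next offset.

    The next offset is None once the result set is exhausted.
    """
    page = rows[start:start + page_size]
    next_start = start + len(page)
    has_more = next_start < len(rows)
    return page, (next_start if has_more else None)


def iterate_rows(rows, page_size, stats):
    """Yield rows one by one, fetching a new page only when needed."""
    start = 0
    while start is not None:
        page, start = fetch_page(rows, start, page_size)
        stats["round_trips"] += 1
        yield from page


stats = {"round_trips": 0}
result = list(iterate_rows(list(range(10)), page_size=3, stats=stats))
print(result, stats)
```

At most `page_size` rows are held per fetch. Because each page is a separate request, rows written between fetches may or may not show up, which is one way to read the "Not isolated" warning.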

Page 26: Cassandra 2.0 and timeseries

CQL Improvements

• ALTER DROP
• Remove a field from a CQL table.

ALTER TABLE users DROP address3;

• Conditional schema changes
• Only execute if condition met

CREATE KEYSPACE IF NOT EXISTS ks
WITH replication = { 'class': 'SimpleStrategy', 'replication_factor' : 3 };

CREATE TABLE IF NOT EXISTS test (k int PRIMARY KEY);

DROP KEYSPACE IF EXISTS ks;

Page 27: Cassandra 2.0 and timeseries

CQL Improvements

• Aliases in SELECT

SELECT event_id,
       dateOf(created_at) AS creation_date,
       blobAsText(content) AS content
FROM timeline;

 event_id                | creation_date            | content
-------------------------+--------------------------+----------------------
 550e8400-e29b-41d4-a716 | 2013-07-26 10:44:33+0200 | Something happened!?

• Limit and TTL in prepared statements

SELECT * FROM myTable LIMIT ?;

UPDATE myTable USING TTL ? SET v = 2 WHERE k = 'foo';

Page 28: Cassandra 2.0 and timeseries

Triggers

CREATE TRIGGER <name> ON <table> USING <classname>;

DROP TRIGGER <name> ON [<keyspace>.]<table>;

• Executed on the coordinator before mutation
• Takes the original mutation and adds any new mutations
• Jars deployed per server

Page 29: Cassandra 2.0 and timeseries

Trigger implementation

class MyTrigger implements ITrigger
{
    public Collection<RowMutation> augment(ByteBuffer key, ColumnFamily update)
    {
        ...
    }
}

• You have to implement your own ITrigger (for now)
• Compile and deploy to each server
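The augment idea is that a trigger receives the incoming write and returns extra mutations to apply with it. The real interface is the Java `ITrigger` shown on the slide; the Python below is only a conceptual sketch with hypothetical names (`audit_trigger`, `coordinate_write`) and tuples standing in for mutations.

```python
def audit_trigger(key, update):
    """Hypothetical trigger: record which columns a write touched."""
    return [("audit_log", key, {"changed": sorted(update)})]


def coordinate_write(table, key, update, triggers):
    """Model of the coordinator: original mutation plus trigger output."""
    mutations = [(table, key, update)]
    for trigger in triggers:
        mutations.extend(trigger(key, update))
    return mutations


mutations = coordinate_write(
    "users", "jbellis", {"email": "new@example.com"}, [audit_trigger]
)
for mutation in mutations:
    print(mutation)
```

The original write is kept as is, and the trigger's additional mutations are appended to it before anything is applied.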

Page 30: Cassandra 2.0 and timeseries

Experimental!

• Relies on internal RowMutation, ColumnFamily classes
• Not sandboxed. Be careful!
• Expect changes in 2.1

Page 31: Cassandra 2.0 and timeseries

Cassandra and Time Series


Page 32: Cassandra 2.0 and timeseries

Time Series - Taming the beast

• Peter Higgs and François Englert. Nobel Prize in Physics
• Theorized the existence of the Higgs boson
• Found using ATLAS
• Data stored in P-BEAST
• Time series running on Cassandra

Page 33: Cassandra 2.0 and timeseries

Use Cassandra for time series


Page 34: Cassandra 2.0 and timeseries

Use Cassandra for time series

Get a nobel prize


Page 35: Cassandra 2.0 and timeseries

Time Series - Why

• Storage model from BigTable is perfect
• One row key and tons of (variable) columns
• Single layout on disk

Row Key | Column Name  | Column Name
        | Column Value | Column Value

Page 36: Cassandra 2.0 and timeseries

Time Series - Example

• Storing weather data
• One weather station
• Temperature measurements every minute

WeatherStation ID | 2013-10-09 10:00 AM | 2013-10-09 10:00 AM | 2013-10-10 11:00 AM
                  | 72 Degrees          | 72 Degrees          | 65 Degrees

Page 37: Cassandra 2.0 and timeseries

Time Series - Example

• Query data
• Weather Station ID = Locality of single node

WeatherStation ID 100 | 2013-10-09 10:00 AM | 2013-10-09 10:00 AM | 2013-10-10 11:00 AM
                      | 72 Degrees          | 72 Degrees          | 65 Degrees

Date query

weatherStationID = 100 AND date = 2013-10-09 10:00 AM

OR

Date Range

weatherStationID = 100 AND date > 2013-10-09 10:00 AM AND date < 2013-10-10 11:01 AM
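Both query shapes fall out naturally when one station's readings are kept sorted by time. A minimal in-memory sketch, assuming a hypothetical `WideRow` class in which timestamps act as column names; the 10:01 reading is made up for illustration.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime


class WideRow:
    """One station's readings: timestamps as sorted, unique column names."""

    def __init__(self):
        self._times = []
        self._values = []

    def insert(self, event_time, value):
        i = bisect_left(self._times, event_time)
        if i < len(self._times) and self._times[i] == event_time:
            self._values[i] = value  # same column name: overwrite
        else:
            self._times.insert(i, event_time)
            self._values.insert(i, value)

    def get(self, event_time):
        """Date query: exact match on one timestamp, or None."""
        i = bisect_left(self._times, event_time)
        if i < len(self._times) and self._times[i] == event_time:
            return self._values[i]
        return None

    def slice(self, after, before):
        """Date range: readings with after < time < before."""
        lo = bisect_right(self._times, after)
        hi = bisect_left(self._times, before)
        return list(zip(self._times[lo:hi], self._values[lo:hi]))


station_100 = WideRow()
station_100.insert(datetime(2013, 10, 10, 11, 0), 65)
station_100.insert(datetime(2013, 10, 9, 10, 0), 72)
station_100.insert(datetime(2013, 10, 9, 10, 1), 71)  # hypothetical reading

print(station_100.get(datetime(2013, 10, 9, 10, 0)))
print(station_100.slice(datetime(2013, 10, 9, 10, 0),
                        datetime(2013, 10, 10, 11, 1)))
```

Because the columns are already ordered, a range query is two binary searches and one contiguous read, which is why a single station's data living on one node makes these queries cheap.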

Page 38: Cassandra 2.0 and timeseries

Time Series - How

• CQL expresses this well
• Data partitioned by weather station ID and ordered by time
• Easy to insert data
• Easy to query

CREATE TABLE temperature (
    weatherstation_id text,
    event_time timestamp,
    temperature text,
    PRIMARY KEY (weatherstation_id, event_time)
);

INSERT INTO temperature (weatherstation_id, event_time, temperature)
VALUES ('1234ABCD', '2013-04-03 07:01:00', '72F');

SELECT temperature
FROM temperature
WHERE weatherstation_id = '1234ABCD'
AND event_time > '2013-04-03 07:01:00'
AND event_time < '2013-04-03 07:04:00';

Page 39: Cassandra 2.0 and timeseries

Time Series - Further partitioning

• Writing every minute, you will eventually run out of columns in a single storage row
• 2 billion columns per storage row
• Data partitioned by weather station ID and date
• Use the partition key to split things up

CREATE TABLE temperature_by_day (
    weatherstation_id text,
    date text,
    event_time timestamp,
    temperature text,
    PRIMARY KEY ((weatherstation_id, date), event_time)
);
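How soon a single row fills up depends on the write rate, and a day bucket caps it regardless. A minimal sketch with hypothetical helper names; the 2 billion limit is the figure quoted on the slide, and the per-millisecond rate is an illustrative assumption.

```python
from datetime import datetime

MAX_COLUMNS_PER_ROW = 2_000_000_000  # limit quoted on the slide
MINUTES_PER_DAY = 24 * 60


def days_until_full(columns_per_day):
    """Whole days before one unbucketed row reaches the column limit."""
    return MAX_COLUMNS_PER_ROW // columns_per_day


def partition_key(weatherstation_id, event_time):
    """Composite key (station, day): one storage row per station per day."""
    return (weatherstation_id, event_time.strftime("%Y-%m-%d"))


print(days_until_full(MINUTES_PER_DAY))        # one reading per minute
print(days_until_full(24 * 60 * 60 * 1000))    # hypothetical: one per millisecond
print(partition_key("1234ABCD", datetime(2013, 4, 3, 7, 1)))
```

At one reading per minute the limit is a long way off, but at much higher rates it arrives within weeks. With the date in the partition key, a per-minute station never stores more than 1,440 readings in one row.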

Page 40: Cassandra 2.0 and timeseries

Time Series - Further Partitioning

• Still easy to insert
• Still easy to query

INSERT INTO temperature_by_day (weatherstation_id, date, event_time, temperature)
VALUES ('1234ABCD', '2013-04-03', '2013-04-03 07:01:00', '72F');

SELECT temperature
FROM temperature_by_day
WHERE weatherstation_id = '1234ABCD'
AND date = '2013-04-03'
AND event_time > '2013-04-03 07:01:00'
AND event_time < '2013-04-03 07:04:00';
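Once the date is part of the partition key, a time range that crosses midnight touches more than one partition. A minimal sketch of working out which (station, date) partitions a range needs; `day_buckets` and `partitions_to_query` are hypothetical helper names.

```python
from datetime import datetime, timedelta


def day_buckets(start, end):
    """Date strings for every day touched by the range start..end."""
    days = []
    current = start.date()
    while current <= end.date():
        days.append(current.isoformat())
        current += timedelta(days=1)
    return days


def partitions_to_query(weatherstation_id, start, end):
    """One (station, date) partition key per day in the range."""
    return [(weatherstation_id, day) for day in day_buckets(start, end)]


same_day = partitions_to_query(
    "1234ABCD", datetime(2013, 4, 3, 7, 1), datetime(2013, 4, 3, 7, 4)
)
three_days = partitions_to_query(
    "1234ABCD", datetime(2013, 4, 3, 23, 0), datetime(2013, 4, 5, 1, 0)
)
print(same_day)
print(three_days)
```

The slide's query stays inside one day, so it hits a single partition. A multi-day range would issue one query per key and merge the results on the client.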

Page 41: Cassandra 2.0 and timeseries

Time Series - Use cases

• Logging
• Thing Tracking (IoT)
• Sensor Data
• User Tracking
• Fraud Detection
• Nobel prizes!

Page 42: Cassandra 2.0 and timeseries

Thank you!

Apache Cassandra 2.0 - Data model on fire

Next talk in my data model series!

