Cassandra's Sweet Spot - an introduction to Apache Cassandra

Cassandra’s sweet spot Dave Gardner @davegardnerisme

Upload: dave-gardner

Post on 11-May-2015


DESCRIPTION

Slides from my NoSQL Exchange 2011 talk introducing Apache Cassandra. This talk explained the fundamental concepts of Cassandra and then demonstrated how to build a simple ad-targeting application using PHP, with a focus on data modeling. Video of talk: http://skillsmatter.com/podcast/home/cassandra/js-2880

TRANSCRIPT

Page 1: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Cassandra’s sweet spot

Dave Gardner @davegardnerisme

Page 2: Cassandra's Sweet Spot - an introduction to Apache Cassandra

jobs.hailocab.com

Looking for an expert backend Java dev – speak to me!

meetup.com/Cassandra-London

Next event 21st November

Page 3: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Building applications with Cassandra

• Key features

• Creating an application

• Data modeling

Page 4: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Comparing Cassandra with X

“Can someone quickly explain the differences between the two? Other than the fact that MongoDB supports ad-hoc querying I don't know whats different. It also appears (using google trends) that MongoDB seems to be growing while Cassandra is dying off. Is this the case?”

27th July 2010

http://comments.gmane.org/gmane.comp.db.cassandra.user/7773

Page 5: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Comparing Cassandra with X

“They have approximately nothing in common. And, no, Cassandra is definitely not dying off.”

28th July 2010

http://comments.gmane.org/gmane.comp.db.cassandra.user/7773

Page 6: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Top Tip #1

To use a NoSQL solution effectively, we need to identify its sweet spot.

Page 7: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Top Tip #1

To use a NoSQL solution effectively, we need to identify its sweet spot.

This means learning about each solution: how is it designed? What algorithms does it use?

http://www.alberton.info/nosql_databases_what_when_why_phpuk2011.html

Page 8: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Comparing Cassandra with X

“they say … I can’t decide between this project and this project even though they look nothing like each other. And the fact that you can’t decide indicates that you don’t actually have a problem that requires them.”

Benjamin Black – NoSQL Tapes (at 30:15)

http://nosqltapes.com/video/benjamin-black-on-nosql-cloud-computing-and-fast_ip

Page 9: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Headline features

1. Elastic

Read and write throughput increases linearly as new machines are added

http://cassandra.apache.org/

Page 10: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Headline features

2. Decentralised

Fault tolerant with no single point of failure; no “master” node

http://cassandra.apache.org/

Page 11: Cassandra's Sweet Spot - an introduction to Apache Cassandra

The dynamo paper

• Consistent hashing

• Vector clocks

• Gossip protocol

• Hinted handoff

• Read repair

http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf

Page 12: Cassandra's Sweet Spot - an introduction to Apache Cassandra

The dynamo paper

[Diagram: nodes #1 to #6, a Client and a Coordinator, with RF = 3]
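The consistent hashing idea behind the diagram above can be sketched in a few lines of Python. This is only an illustration, not Cassandra's implementation: the `Ring` class, the `token` helper, the use of MD5 and one token per node are all assumptions made for the example.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string onto the ring (MD5 is an illustrative choice)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Hypothetical ring: each node owns one token, keys walk clockwise."""

    def __init__(self, nodes, replication_factor=3):
        self.rf = replication_factor
        # Sort (token, node) pairs so the ring can be searched with bisect.
        self.tokens = sorted((token(node), node) for node in nodes)

    def replicas(self, row_key: str):
        """Return the nodes that hold row_key: the first node clockwise
        from the key's token, plus the next rf - 1 nodes on the ring."""
        positions = [t for t, _ in self.tokens]
        start = bisect.bisect_right(positions, token(row_key)) % len(self.tokens)
        count = min(self.rf, len(self.tokens))
        return [self.tokens[(start + i) % len(self.tokens)][1] for i in range(count)]


ring = Ring(["#1", "#2", "#3", "#4", "#5", "#6"], replication_factor=3)
owners = ring.replicas("f97be9cc-5255-4578-8813-76701c0945bd")
print(owners)  # three distinct nodes; which ones depends on the hash
```

Any node can act as coordinator for a request, because every node can compute the same replica list from the row key alone.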

Page 13: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Headline features

3. Rich data model

Column based, range slices, column slices, secondary indexes, counters, expiring columns

http://cassandra.apache.org/

Page 14: Cassandra's Sweet Spot - an introduction to Apache Cassandra

The big table paper

• Sparse "columnar" data model

• SSTable disk storage

• Append-only commit log

• Memtable (buffer and sort)

• Immutable SSTable files

• Compaction

http://labs.google.com/papers/bigtable-osdi06.pdf

http://www.slideshare.net/geminimobile/bigtable-4820829
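To make the storage terms in the list above concrete, here is a toy Python model of the write path. It is a sketch under heavy simplification: the `ColumnFamilyStore` class and its `memtable_limit` parameter are invented for illustration, and "files" are just in-memory tuples.

```python
class ColumnFamilyStore:
    """Toy model of the Bigtable-style write path (illustrative only)."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # in-memory buffer, sorted when flushed
        self.sstables = []     # immutable sorted tables, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))   # sequential append
        self.memtable[key] = value             # buffer in memory
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Sort the buffer and freeze it as an immutable SSTable.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest table wins
            for k, v in table:
                if k == key:
                    return v
        return None

    def compact(self):
        # Merge all SSTables into one; newer values overwrite older ones.
        merged = {}
        for table in self.sstables:
            merged.update(table)
        self.sstables = [tuple(sorted(merged.items()))]


store = ColumnFamilyStore(memtable_limit=2)
for key, value in [("zebra", 1), ("badger", 2), ("zebra", 3), ("aaa", 4)]:
    store.write(key, value)
store.compact()
print(store.sstables)
```

Writes never modify an existing table in place, which is one reason this design is associated with high write throughput.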

Page 15: Cassandra's Sweet Spot - an introduction to Apache Cassandra

The big table paper

[Diagram: a Column Family row, identified by a Row Key, holding three Columns, each with a Name and a Value]

Page 16: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Headline features

4. You're in control

Tunable consistency, per operation

http://cassandra.apache.org/

Page 17: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Consistency levels

How many replicas must respond to declare success?

Page 18: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Consistency levels: write operations

Level          Description
ANY            One node, including hinted handoff
ONE            One node
QUORUM         N/2 + 1 replicas
LOCAL_QUORUM   N/2 + 1 replicas in local data centre
EACH_QUORUM    N/2 + 1 replicas in each data centre
ALL            All replicas

http://wiki.apache.org/cassandra/API#Write

Page 19: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Consistency levels: read operations

Level          Description
ONE            1st Response
QUORUM         N/2 + 1 replicas
LOCAL_QUORUM   N/2 + 1 replicas in local data centre
EACH_QUORUM    N/2 + 1 replicas in each data centre
ALL            All replicas

http://wiki.apache.org/cassandra/API#Read
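The "N/2 + 1" arithmetic in the tables above uses integer division. A small Python sketch, assuming N is the replication factor; the `overlaps` helper expresses the commonly cited rule that a read is guaranteed to see the latest write when the read and write replica counts together exceed N. Both function names are made up for this example.

```python
def quorum(replication_factor: int) -> int:
    """Replicas needed for QUORUM: N/2 + 1 with integer division."""
    return replication_factor // 2 + 1


def overlaps(n: int, write_replicas: int, read_replicas: int) -> bool:
    """True when every read set must intersect every write set."""
    return write_replicas + read_replicas > n


for n in (1, 2, 3, 5):
    print(f"RF={n}: QUORUM needs {quorum(n)} replicas")

# QUORUM writes + QUORUM reads at RF = 3 always overlap; ONE + ONE does not.
print(overlaps(3, quorum(3), quorum(3)))
print(overlaps(3, 1, 1))
```

Because the level is chosen per operation, an application can mix, for example, cheap ONE writes for impressions with QUORUM reads and writes for data that must be read back consistently.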

Page 20: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Headline features

5. Performant

Well known for high write performance

http://www.datastax.com/docs/1.0/introduction/index#core-strengths-of-cassandra

Page 21: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Benchmark*

http://blog.cubrid.org/dev-platform/nosql-benchmarking/

* Add pinch of salt

Page 22: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Recap: headline features

1. Elastic

2. Decentralised

3. Rich data model

4. You’re in control (tunable consistency)

5. Performant

Page 23: Cassandra's Sweet Spot - an introduction to Apache Cassandra

A simple ad-targeting application

Some ads

Our user knowledge

Choose which ad to show

Page 24: Cassandra's Sweet Spot - an introduction to Apache Cassandra

A simple ad-targeting application

Allow us to capture user behaviour/data via “pixels” - placing users into segments (different buckets)

http://pixel.wehaveyourkidneys.com/add.php?add=foo

Page 25: Cassandra's Sweet Spot - an introduction to Apache Cassandra

A simple ad-targeting application

Record clicks and impressions of each ad; storing data per-ad and per-segment

http://pixel.wehaveyourkidneys.com/adImpression.php?ad=1

http://pixel.wehaveyourkidneys.com/adClick.php?ad=1

Page 26: Cassandra's Sweet Spot - an introduction to Apache Cassandra

A simple ad-targeting application

Real-time ad performance analytics, broken down by segment (which segments are performing well?)

http://www.wehaveyourkidneys.com/adPerformance.php?ad=1

Page 27: Cassandra's Sweet Spot - an introduction to Apache Cassandra

A simple ad-targeting application

Recommendations based on best-performing ads

(this is left as an exercise for the reader)

Page 28: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Additional requirements

• Large number of users

• High volume of impressions

• Highly available – downtime is money

Page 29: Cassandra's Sweet Spot - an introduction to Apache Cassandra

A good fit for Cassandra?

Yes!

Big data, high availability and lots of writes are all good signs that Cassandra will fit well.

http://www.nosqldatabases.com/main/2010/10/19/what-is-cassandra-good-for.html

Page 30: Cassandra's Sweet Spot - an introduction to Apache Cassandra

A good fit for Cassandra?

That said, people are using Cassandra for many other things, too.

Highly available HTTP request routing (tiny data!)

http://blip.tv/datastax/highly-available-http-request-routing-dns-using-cassandra-5501901

Page 31: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Top Tip #2

Cassandra is an excellent fit where availability matters, where there is a lot of data or where you have a large number of write operations.

Page 32: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Demo

Live demo before we start

Page 33: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Data modeling

Start from your queries, work backwards

http://www.slideshare.net/mattdennis/cassandra-data-modeling

http://blip.tv/datastax/data-modeling-workshop-5496906

Page 34: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Data model basics: conflict resolution

Per-column timestamp-based conflict resolution

http://cassandra.apache.org/

{ column: foo, value: bar, timestamp: 1000}

{ column: foo, value: zing, timestamp: 1001}
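The two writes above conflict on column foo; the one with the higher timestamp wins. A minimal Python sketch of that rule, with a hypothetical `resolve` helper. It deliberately ignores what happens when timestamps are equal.

```python
def resolve(a: dict, b: dict) -> dict:
    """Keep whichever version of a column carries the higher timestamp.

    Illustrative only: ties are not handled specially here.
    """
    return a if a["timestamp"] > b["timestamp"] else b


first = {"column": "foo", "value": "bar", "timestamp": 1000}
second = {"column": "foo", "value": "zing", "timestamp": 1001}

winner = resolve(first, second)
print(winner["value"])  # zing
```

The order in which replicas receive the two writes does not matter, which is why a column can be updated without locking, provided client clocks are reasonably in sync.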

Page 36: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Data model basics: column ordering

Columns ordered at time of writing, according to Column Family schema

http://cassandra.apache.org/

{ column: zebra, value: foo, timestamp: 1000}

{ column: badger, value: foo, timestamp: 1001}

Page 37: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Data model basics: column ordering

Columns ordered at time of writing, according to Column Family schema

http://cassandra.apache.org/

{ badger: foo, zebra: foo}

with AsciiType column schema
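A short Python sketch of the effect described above: columns written as zebra then badger are stored as badger then zebra. Sorting by the ASCII bytes of the name is used here as a stand-in for an AsciiType comparator; the variable names are illustrative.

```python
# Writes arrive in this order: (column name, value, timestamp).
writes = [("zebra", "foo", 1000), ("badger", "foo", 1001)]

row = {}
for name, value, _timestamp in writes:
    row[name] = value

# Order columns by the byte value of their names, not by arrival order.
ordered = sorted(row.items(), key=lambda item: item[0].encode("ascii"))
print(ordered)  # [('badger', 'foo'), ('zebra', 'foo')]

# Byte ordering is case sensitive: upper case sorts before lower case.
print(sorted(["zebra", "Badger", "badger"]))  # ['Badger', 'badger', 'zebra']
```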

Page 38: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Data modeling: user segments

Add user to bucket X, with expiry time Y

Which buckets is user X in?

["user"][<uuid>][<bucketId>] = 1

[CF] [rowKey] [columnName] = value

Page 39: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Data modeling: user segments

user Column Family:

[f97be9cc-5255-4578-8813-76701c0945bd][bar] = 1
[f97be9cc-5255-4578-8813-76701c0945bd][foo] = 1
[06a6f1b0-fcf2-41d9-8949-fe2d416bde8e][baz] = 1
[06a6f1b0-fcf2-41d9-8949-fe2d416bde8e][zoo] = 1
[503778bc-246f-4041-ac5a-fd944176b26d][aaa] = 1

Q: Is user in segment X?
A: Single column fetch

Page 40: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Data modeling: user segments

user Column Family:

[f97be9cc-5255-4578-8813-76701c0945bd][bar] = 1
[f97be9cc-5255-4578-8813-76701c0945bd][foo] = 1
[06a6f1b0-fcf2-41d9-8949-fe2d416bde8e][baz] = 1
[06a6f1b0-fcf2-41d9-8949-fe2d416bde8e][zoo] = 1
[503778bc-246f-4041-ac5a-fd944176b26d][aaa] = 1

Q: Which segments is user X in?
A: Column slice fetch

Page 41: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Top Tip #3

With column slices, we get the columns back ordered, according to our schema

We cannot do the same for rows, however, unless we use the Order Preserving Partitioner

Page 42: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Top Tip #4

Don’t use the Order Preserving Partitioner unless you absolutely have to

http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/

Page 43: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Data modeling: user segments

Add user to bucket X, with expiry time Y

Which buckets is user X in?

["user"][<uuid>][<bucketId>] = 1

[CF] [rowKey] [columnName] = value

Page 44: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Expiring columns

An expiring column will be automatically deleted after n seconds

http://cassandra.apache.org/
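The behaviour of an expiring column can be modelled in Python as follows. This is a sketch, not how Cassandra stores or removes data: the `ExpiringRow` class and its injectable `clock` are assumptions for the example, and expired columns are simply hidden at read time.

```python
import time


class ExpiringRow:
    """Toy row whose columns may carry a time-to-live in seconds."""

    def __init__(self, clock=time.time):
        self.clock = clock
        self.columns = {}  # name -> (value, expiry time or None)

    def insert(self, name, value, ttl=None):
        expires_at = None if ttl is None else self.clock() + ttl
        self.columns[name] = (value, expires_at)

    def live_columns(self):
        """Return columns that have not expired, ordered by name."""
        now = self.clock()
        return {
            name: value
            for name, (value, expires_at) in sorted(self.columns.items())
            if expires_at is None or expires_at > now
        }


# A fake clock makes the example deterministic.
now = [1000.0]
row = ExpiringRow(clock=lambda: now[0])
row.insert("foo", 1, ttl=3600)  # user is in segment foo for one hour
row.insert("bar", 1)            # no expiry

now[0] += 3599
print(row.live_columns())       # foo and bar are both visible
now[0] += 2
print(row.live_columns())       # only bar remains
```

For the segment model this means "add user to bucket X, with expiry time Y" is a single write; no clean-up job is needed to remove users from buckets later.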

Page 45: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Data modeling: user segments

$pool = new ConnectionPool(
    'whyk',
    array('localhost')
);
$users = new ColumnFamily($pool, 'users');
$users->insert(
    $userUuid,
    array($segment => 1),
    NULL,     // default TS
    $expires
);

Using phpcassa client: https://github.com/thobbs/phpcassa

Page 46: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Data modeling: user segments

UPDATE users USING TTL = 3600
SET 'foo' = 1
WHERE KEY = 'f97be9cc-5255-4578-8813-76701c0945bd'

Using CQL http://www.datastax.com/dev/blog/what%E2%80%99s-new-in-cassandra-0-8-part-1-cql-the-cassandra-query-language

http://www.datastax.com/docs/1.0/references/cql

Page 47: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Top Tip #5

Try to exploit Cassandra’s columnar data model; avoid read-before-write and locking by safely mutating individual columns

Page 48: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Data modeling: ad performance

Track overall ad performance; how many clicks/impressions per ad?

["ads"][<adId>][<stamp>]["click"] = #
["ads"][<adId>][<stamp>]["impression"] = #

[CF] [Row] [S.Col] [Col] = value

Using super columns

Page 49: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Top Tip #6

Friends don’t let friends use Super Columns.

http://rubyscale.com/2010/beware-the-supercolumn-its-a-trap-for-the-unwary/

Page 50: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Data modeling: ad performance

Try again using regular columns:

["ads"][<adId>][<stamp>-"click"] = #
["ads"][<adId>][<stamp>-"impression"] = #

[CF] [Row] [Col] = value

Page 51: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Data modeling: ad performance

ads Column Family:

[1][2011103015-click] = 1
[1][2011103015-impression] = 3434
[1][2011103016-click] = 12
[1][2011103016-impression] = 5411
[1][2011103017-click] = 2
[1][2011103017-impression] = 345

Q: Get performance of ad X between two date/times
A: Column slice against a single row, specifying a start stamp and end stamp + 1
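Why "end stamp + 1"? Column names such as 2011103016-click sort after the bare stamp 2011103016, so a slice that finishes at the end hour itself would miss that hour's columns. The Python sketch below models a slice over sorted column names using the sample data above; `column_slice` is a hypothetical stand-in for a client call, with both bounds inclusive.

```python
import bisect

# Columns of row "1" in the ads Column Family, kept sorted by name.
ads_row = [
    ("2011103015-click", 1),
    ("2011103015-impression", 3434),
    ("2011103016-click", 12),
    ("2011103016-impression", 5411),
    ("2011103017-click", 2),
    ("2011103017-impression", 345),
]


def column_slice(columns, start, finish):
    """Return the columns whose names fall between start and finish."""
    names = [name for name, _ in columns]
    low = bisect.bisect_left(names, start)
    high = bisect.bisect_right(names, finish)
    return columns[low:high]


# Performance for hours 15 and 16: start at 15, finish at 16 + 1.
result = column_slice(ads_row, "2011103015", "2011103017")
clicks = sum(v for name, v in result if name.endswith("-click"))
impressions = sum(v for name, v in result if name.endswith("-impression"))
print(clicks, impressions)  # 13 8845
```

Finishing at "2011103016" instead would return only the hour 15 columns.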

Page 52: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Think carefully about your data

This scheme works because I’m assuming each ad has a relatively short lifespan. This means that there are lots of rows and hence the load is spread.

Other options:

http://rubyscale.com/2011/basic-time-series-with-cassandra/

Page 53: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Counters

• Distributed atomic counters

• Easy to use

• Not idempotent

http://www.datastax.com/dev/blog/whats-new-in-cassandra-0-8-part-2-counters
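"Not idempotent" matters when a client retries after a timeout. The Python sketch below contrasts setting a column with incrementing a counter; the `send_with_retry` helper simulates a client that cannot tell whether its first attempt was applied. All names are illustrative.

```python
def send_with_retry(operation, attempts=2):
    """Simulate a blind retry: the operation is applied every time."""
    for _ in range(attempts):
        operation()


row = {}
counters = {"2011103015-impression": 0}


def set_segment():
    row["foo"] = 1  # writing the same value twice changes nothing


def count_impression():
    counters["2011103015-impression"] += 1  # each retry adds again


send_with_retry(set_segment)
send_with_retry(count_impression)

print(row)       # {'foo': 1}
print(counters)  # {'2011103015-impression': 2}
```

For ad statistics a small over-count after a retry may be acceptable; for anything that must be exact, that trade-off needs more thought.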

Page 54: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Data modeling: ad performance

$stamp = date('YmdH');
$ads->add(
    $adId,               // row key
    "$stamp-impression", // column
    1                    // increment
);

We’ll store performance metrics in hour buckets for graphing.

Page 55: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Data modeling: ad performance

UPDATE ads
SET '2011103015-impression' = '2011103015-impression' + 1
WHERE KEY = '1'

Page 56: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Data modeling: performance/segment

We can add another dimension to our stats so we can break them down by segment.

["ads"][<adId>] [<stamp>-<segment>-"click"] = #

[CF] [Row] [Col] = value

Page 57: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Data modeling: performance/segment

ads Column Family:

[1][2011103015-bar-click] = 1
[1][2011103015-bar-impression] = 3434
[1][2011103015-foo-click] = 12
[1][2011103015-foo-impression] = 5411
[1][2011103016-bar-click] = 2

Q: Get performance of ad X between two date/times, split by segment
A: Column slice against a single row, specifying a start stamp and end stamp + 1
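Once the slice comes back, the client splits each column name into stamp, segment and metric to build the per-segment breakdown. A Python sketch using the sample data above; it assumes the stamp contains no hyphen and that the metric is the last hyphen-separated part.

```python
from collections import defaultdict

ads_row = {
    "2011103015-bar-click": 1,
    "2011103015-bar-impression": 3434,
    "2011103015-foo-click": 12,
    "2011103015-foo-impression": 5411,
    "2011103016-bar-click": 2,
}


def by_segment(columns, start, finish):
    """Total each metric per segment for names in [start, finish]."""
    totals = defaultdict(lambda: defaultdict(int))
    for name in sorted(columns):
        if not (start <= name <= finish):
            continue
        _stamp, rest = name.split("-", 1)
        segment, metric = rest.rsplit("-", 1)
        totals[segment][metric] += columns[name]
    return {segment: dict(metrics) for segment, metrics in totals.items()}


# Hour 15 only: start at 15, finish at 15 + 1.
report = by_segment(ads_row, "2011103015", "2011103016")
print(report)
```

Here the filtering is done in memory for brevity; against a real cluster the start and finish bounds would be passed to the slice so that only the matching columns are read.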

Page 58: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Data modeling: performance/segment

$stamp = date('YmdH');
$ads->add(
    "$adId-segments",             // row key
    "$stamp-$segment-impression", // column
    1                             // incr
);

We’ll store performance metrics in hour buckets for graphing.
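The hour-bucket column names used throughout this section can be built as below. This is a Python equivalent of the PHP `date('YmdH')` call; the `column_name` helper is hypothetical.

```python
from datetime import datetime


def column_name(when: datetime, metric: str, segment: str = None) -> str:
    """Build '<stamp>-<metric>' or '<stamp>-<segment>-<metric>'.

    The stamp is year, month, day and hour, so all events within the
    same hour land in the same column.
    """
    stamp = when.strftime("%Y%m%d%H")
    parts = [stamp, metric] if segment is None else [stamp, segment, metric]
    return "-".join(parts)


seen_at = datetime(2011, 10, 30, 15, 42)
print(column_name(seen_at, "impression"))         # 2011103015-impression
print(column_name(seen_at, "impression", "foo"))  # 2011103015-foo-impression
```

Because the stamp is fixed width and most significant first, sorting the names as text also sorts them by time, which is what makes the column slices work.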

Page 59: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Data modeling: segment stats

Track overall clicks/impressions per bucket; which buckets are most clicky?

["segments"][<adId>-"segments"] [<stamp>-<segment>-"click"] = #

[CF] [Row] [Col] = value

Page 60: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Recap: Data modeling

• Think about the queries, work backwards

• Don’t overuse single rows; try to spread the load

• Don’t use super columns

• Ask on IRC! #cassandra

Page 61: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Recap: Common data modeling patterns

1. Using column names with no value

[cf][rowKey][columnName] = 1

Page 62: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Recap: Common data modeling patterns

2. Counters

[cf][rowKey][columnName]++

Page 63: Cassandra's Sweet Spot - an introduction to Apache Cassandra

And also…

3. Serialising a whole object

[cf][rowKey][columnName] = { foo: 3, bar: 11 }
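A Python sketch of this third pattern, using JSON as the serialisation format (an assumption; the slide does not name one). The row is modelled as a plain dict and the column name "profile" is made up.

```python
import json

row = {}

# Store the whole object as the value of a single column.
row["profile"] = json.dumps({"foo": 3, "bar": 11}, sort_keys=True)
print(row["profile"])  # {"bar": 11, "foo": 3}

# Changing one field means reading, decoding, modifying and rewriting.
profile = json.loads(row["profile"])
profile["foo"] += 1
row["profile"] = json.dumps(profile, sort_keys=True)
print(row["profile"])  # {"bar": 11, "foo": 4}
```

The trade-off is visible in the second half: unlike the one-column-per-value patterns, updating a field here needs a read before the write.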

Page 64: Cassandra's Sweet Spot - an introduction to Apache Cassandra

There’s more: Brisk

Integrated Hadoop distribution (without HDFS installed). Run Hive and Pig queries directly against Cassandra

DataStax now offer this functionality in their “Enterprise” product

http://www.datastax.com/products/enterprise

Page 65: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Hive

CREATE EXTERNAL TABLE tempUsers
  (userUuid string, segmentId string, value string)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES (
  "cassandra.columns.mapping" = ":key,:column,:value",
  "cassandra.cf.name" = "users"
);

SELECT segmentId, count(1) AS total
FROM tempUsers
GROUP BY segmentId
ORDER BY total DESC;

Page 66: Cassandra's Sweet Spot - an introduction to Apache Cassandra

There’s more: Supercharged Cassandra

Acunu have reengineered the entire Unix storage stack, optimised specifically for Big Data workloads

Includes instant snapshot of CFs

http://www.acunu.com/products/choosing-cassandra/

Page 67: Cassandra's Sweet Spot - an introduction to Apache Cassandra

In conclusion

Cassandra is founded on sound design principles

Page 68: Cassandra's Sweet Spot - an introduction to Apache Cassandra

In conclusion

The Cassandra data model, sometimes mentioned as a weakness, is incredibly powerful

Page 69: Cassandra's Sweet Spot - an introduction to Apache Cassandra

In conclusion

The clients are getting better; CQL is a step forward

Page 70: Cassandra's Sweet Spot - an introduction to Apache Cassandra

In conclusion

Hadoop integration means we can analyse data directly from a Cassandra cluster

Page 71: Cassandra's Sweet Spot - an introduction to Apache Cassandra

In conclusion

Cassandra’s sweet spot is highly available “big data” (especially time-series) with large numbers of writes

Page 72: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Thanks

Learn more about Cassandra: meetup.com/Cassandra-London

Check out the code: https://github.com/davegardnerisme/we-have-your-kidneys

Watch videos from Cassandra SF 2011: http://www.datastax.com/events/cassandrasf2011/presentations