Cassandra's Sweet Spot - An Introduction to Apache Cassandra

Post on 11-May-2015


DESCRIPTION

Slides from my NoSQL Exchange 2011 talk introducing Apache Cassandra. This talk explained the fundamental concepts of Cassandra and then demonstrated how to build a simple ad-targeting application using PHP, with a focus on data modeling. Video of talk: http://skillsmatter.com/podcast/home/cassandra/js-2880

TRANSCRIPT

Cassandra’s sweet spot

Dave Gardner (@davegardnerisme)

jobs.hailocab.com

Looking for an expert backend Java dev – speak to me!

meetup.com/Cassandra-London

Next event 21st November

Building applications with Cassandra

• Key features

• Creating an application

• Data modeling

Comparing Cassandra with X

“Can someone quickly explain the differences between the two? Other than the fact that MongoDB supports ad-hoc querying I don't know whats different. It also appears (using google trends) that MongoDB seems to be growing while Cassandra is dying off. Is this the case?”

27th July 2010
http://comments.gmane.org/gmane.comp.db.cassandra.user/7773

Comparing Cassandra with X

“They have approximately nothing in common. And, no, Cassandra is definitely not dying off.”

28th July 2010
http://comments.gmane.org/gmane.comp.db.cassandra.user/7773

Top Tip #1

To use a NoSQL solution effectively, we need to identify its sweet spot.

This means learning about each solution: how is it designed? What algorithms does it use?
http://www.alberton.info/nosql_databases_what_when_why_phpuk2011.html

Comparing Cassandra with X

“they say … I can’t decide between this project and this project even though they look nothing like each other. And the fact that you can’t decide indicates that you don’t actually have a problem that requires them.”

Benjamin Black – NoSQL Tapes (at 30:15)

http://nosqltapes.com/video/benjamin-black-on-nosql-cloud-computing-and-fast_ip

Headline features

1. Elastic

Read and write throughput increases linearly as new machines are added

http://cassandra.apache.org/

Headline features

2. Decentralised

Fault tolerant with no single point of failure; no “master” node

http://cassandra.apache.org/

The dynamo paper

• Consistent hashing
• Vector clocks
• Gossip protocol
• Hinted handoff
• Read repair

http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
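The first of these, consistent hashing, is what underpins the elastic scaling: nodes and row keys hash onto the same ring, and a key's replicas are found by walking clockwise from its position. A minimal sketch in Python (illustrative only; the `Ring` class and the MD5 ring size are assumptions for the example, not Cassandra's actual implementation):

```python
import hashlib
from bisect import bisect_right

def ring_position(key, ring_size=2**32):
    """Hash a key to a position on the ring (MD5-based, RandomPartitioner-style)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % ring_size

class Ring:
    def __init__(self, nodes, replication_factor=3):
        self.rf = replication_factor
        # Each node owns the arc of the ring ending at its token.
        self.tokens = sorted((ring_position(n), n) for n in nodes)

    def replicas(self, row_key):
        """Walk clockwise from the key's position, collecting RF distinct nodes."""
        pos = ring_position(row_key)
        positions = [t for t, _ in self.tokens]
        start = bisect_right(positions, pos)  # first token past the key
        out = []
        for i in range(len(self.tokens)):
            node = self.tokens[(start + i) % len(self.tokens)][1]
            if node not in out:
                out.append(node)
            if len(out) == self.rf:
                break
        return out

ring = Ring(["node1", "node2", "node3", "node4", "node5", "node6"])
print(ring.replicas("f97be9cc-5255-4578-8813-76701c0945bd"))
```

Adding a node only moves the keys on the arc it takes over, which is why throughput can grow linearly as machines are added.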

The dynamo paper

[Diagram: a ring of six nodes (#1 to #6) with RF = 3. The client sends its request to one node, the coordinator, which forwards it to the three replicas responsible for the key.]

Headline features

3. Rich data model

Column based, range slices, column slices, secondary indexes, counters, expiring columns

http://cassandra.apache.org/

The big table paper

• Sparse "columnar" data model
• SSTable disk storage
• Append-only commit log
• Memtable (buffer and sort)
• Immutable SSTable files
• Compaction

http://labs.google.com/papers/bigtable-osdi06.pdf
http://www.slideshare.net/geminimobile/bigtable-4820829
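The write path those bullets describe can be sketched as a toy in Python (illustrative only; `TinyStore` and its flush threshold are invented for the example, not Cassandra's actual storage engine):

```python
class TinyStore:
    """Toy BigTable-style write path: append to a commit log, buffer writes
    in a memtable, flush sorted immutable SSTables to 'disk'."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []  # append-only: every write lands here first
        self.memtable = {}    # in-memory buffer, sorted at flush time
        self.sstables = []    # immutable sorted tables, newest last
        self.limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.flush()

    def flush(self):
        # An SSTable is a sorted, immutable snapshot of the memtable
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def read(self, key):
        # Check the memtable first, then SSTables newest-to-oldest
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None

store = TinyStore()
for k, v in [("zebra", 1), ("badger", 2), ("aardvark", 3), ("mole", 4)]:
    store.write(k, v)
print(store.sstables)      # one flushed, sorted SSTable
print(store.read("mole"))  # still in the memtable
```

Writes never touch existing files, only the log and the memtable, which is a big part of why write throughput is so high; compaction later merges the accumulated SSTables.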

The big table paper

[Diagram: a row key maps into a Column Family, which holds a sequence of columns; each column is a name/value pair.]

Headline features

4. You're in control

Tunable consistency, per operation

http://cassandra.apache.org/

Consistency levels

How many replicas must respond to declare success?

Consistency levels: write operations

ANY: one node, including hinted handoff
ONE: one node
QUORUM: N/2 + 1 replicas
LOCAL_QUORUM: N/2 + 1 replicas in the local data centre
EACH_QUORUM: N/2 + 1 replicas in each data centre
ALL: all replicas

http://wiki.apache.org/cassandra/API#Write

Consistency levels: read operations

ONE: the first response received
QUORUM: N/2 + 1 replicas
LOCAL_QUORUM: N/2 + 1 replicas in the local data centre
EACH_QUORUM: N/2 + 1 replicas in each data centre
ALL: all replicas

http://wiki.apache.org/cassandra/API#Read
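The quorum arithmetic is worth spelling out: with QUORUM on both writes and reads, the write set and the read set must overlap in at least one replica (W + R > N), so a quorum read always sees the latest quorum-acknowledged write. A quick sketch:

```python
def quorum(replication_factor):
    # QUORUM = N/2 + 1 replicas (integer division), where N is the
    # replication factor
    return replication_factor // 2 + 1

# QUORUM writes + QUORUM reads overlap: W + R > N holds for every RF
for rf in (1, 3, 5):
    w = r = quorum(rf)
    print("RF", rf, "-> quorum", w, "overlap guaranteed:", w + r > rf)
```

This is the per-operation dial from feature 4: drop to ONE for speed when stale reads are acceptable, raise to QUORUM or ALL when they are not.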

Headline features

5. Performant

Well known for high write performance

http://www.datastax.com/docs/1.0/introduction/index#core-strengths-of-cassandra

Benchmark*

http://blog.cubrid.org/dev-platform/nosql-benchmarking/

* Add pinch of salt

Recap: headline features

1. Elastic

2. Decentralised

3. Rich data model

4. You’re in control (tunable consistency)

5. Performant

A simple ad-targeting application

Some ads

Our user knowledge

Choose which ad to show

A simple ad-targeting application

Allow us to capture user behaviour/data via “pixels” - placing users into segments (different buckets)

http://pixel.wehaveyourkidneys.com/add.php?add=foo

A simple ad-targeting application

Record clicks and impressions of each ad; storing data per-ad and per-segment

http://pixel.wehaveyourkidneys.com/adImpression.php?ad=1
http://pixel.wehaveyourkidneys.com/adClick.php?ad=1

A simple ad-targeting application

Real-time ad performance analytics, broken down by segment (which segments are performing well?)

http://www.wehaveyourkidneys.com/adPerformance.php?ad=1

A simple ad-targeting application

Recommendations based on best-performing ads

(this is left as an exercise for the reader)

Additional requirements

• Large number of users

• High volume of impressions

• Highly available – downtime is money

A good fit for Cassandra?

Yes!

Big data, high availability and lots of writes are all good signs that Cassandra will fit well.

http://www.nosqldatabases.com/main/2010/10/19/what-is-cassandra-good-for.html

A good fit for Cassandra?

That said, people are using Cassandra for many other things too.

Highly available HTTP request routing (tiny data!)

http://blip.tv/datastax/highly-available-http-request-routing-dns-using-cassandra-5501901

Top Tip #2

Cassandra is an excellent fit where availability matters, where there is a lot of data or where you have a large number of write operations.

Demo

Live demo before we start

Data modeling

Start from your queries, work backwards

http://www.slideshare.net/mattdennis/cassandra-data-modeling
http://blip.tv/datastax/data-modeling-workshop-5496906

Data model basics: conflict resolution

Per-column timestamp-based conflict resolution

http://cassandra.apache.org/

{ column: foo, value: bar, timestamp: 1000}

{ column: foo, value: zing, timestamp: 1001}
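The rule can be sketched in a few lines: for any two versions of the same column, the one carrying the higher timestamp wins. Illustrative Python only; the `resolve` helper is invented for the example:

```python
def resolve(*versions):
    """Last-write-wins: the column version with the highest timestamp
    survives. Cassandra applies this per column, using client-supplied
    timestamps."""
    return max(versions, key=lambda v: v["timestamp"])

winner = resolve(
    {"column": "foo", "value": "bar",  "timestamp": 1000},
    {"column": "foo", "value": "zing", "timestamp": 1001},
)
print(winner["value"])  # zing
```

Because resolution is per column, two clients updating different columns of the same row never conflict at all.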

Data model basics: column ordering

Columns ordered at time of writing, according to Column Family schema

http://cassandra.apache.org/

{ column: zebra, value: foo, timestamp: 1000}

{ column: badger, value: foo, timestamp: 1001}

Data model basics: column ordering

Columns ordered at time of writing, according to Column Family schema

http://cassandra.apache.org/

{ badger: foo, zebra: foo}

with AsciiType column schema
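Keeping columns sorted on write can be sketched like this (illustrative Python; the `Row` class is invented, standing in for comparator-ordered storage under an AsciiType-style schema):

```python
from bisect import insort

class Row:
    """Columns kept sorted by name at write time, per the Column Family
    comparator (here: plain string order, like AsciiType)."""

    def __init__(self):
        self.columns = []  # list of (name, value), always sorted by name

    def insert(self, name, value):
        insort(self.columns, (name, value))  # binary-search insertion

row = Row()
row.insert("zebra", "foo")
row.insert("badger", "foo")
print(row.columns)  # [('badger', 'foo'), ('zebra', 'foo')]
```

Paying the sort cost at write time is what makes range reads over column names cheap later.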

Data modeling: user segments

Add user to bucket X, with expiry time Y
Which buckets is user X in?

["user"][<uuid>][<bucketId>] = 1

[CF] [rowKey] [columnName] = value

Data modeling: user segments

user Column Family:

[f97be9cc-5255-4578-8813-76701c0945bd][bar] = 1
[f97be9cc-5255-4578-8813-76701c0945bd][foo] = 1
[06a6f1b0-fcf2-41d9-8949-fe2d416bde8e][baz] = 1
[06a6f1b0-fcf2-41d9-8949-fe2d416bde8e][zoo] = 1
[503778bc-246f-4041-ac5a-fd944176b26d][aaa] = 1

Q: Is user in segment X?
A: Single column fetch

Data modeling: user segments

user Column Family:

[f97be9cc-5255-4578-8813-76701c0945bd][bar] = 1
[f97be9cc-5255-4578-8813-76701c0945bd][foo] = 1
[06a6f1b0-fcf2-41d9-8949-fe2d416bde8e][baz] = 1
[06a6f1b0-fcf2-41d9-8949-fe2d416bde8e][zoo] = 1
[503778bc-246f-4041-ac5a-fd944176b26d][aaa] = 1

Q: Which segments is user X in?
A: Column slice fetch
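A column slice over sorted column names can be simulated with two binary searches (illustrative Python; `column_slice` is invented for the sketch, not a client API):

```python
from bisect import bisect_left, bisect_right

def column_slice(columns, start, finish):
    """Return columns whose names fall in [start, finish]. Cheap because
    the row's columns are already stored sorted by name."""
    names = [name for name, _ in columns]
    return columns[bisect_left(names, start):bisect_right(names, finish)]

# One row from the users CF: segment names as column names, value 1
user_row = [("bar", 1), ("baz", 1), ("foo", 1), ("zoo", 1)]
print(column_slice(user_row, "baz", "foo"))  # [('baz', 1), ('foo', 1)]
```

Asking for all of a user's segments is just a slice over the whole row; asking whether they are in one segment is a single-column fetch.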

Top Tip #3

With column slices, we get the columns back ordered according to our schema

We cannot do the same for rows, however, unless we use the Order Preserving Partitioner

Top Tip #4

Don’t use the Order Preserving Partitioner unless you absolutely have to

http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/

Data modeling: user segments

Add user to bucket X, with expiry time Y

Which buckets is user X in?

["user"][<uuid>][<bucketId>] = 1

[CF] [rowKey] [columnName] = value

Expiring columns

An expiring column will be automatically deleted after n seconds

http://cassandra.apache.org/
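The TTL semantics can be sketched as follows (illustrative Python; the `ExpiringColumns` class is invented, and real Cassandra purges expired columns lazily at compaction rather than on read):

```python
import time

class ExpiringColumns:
    """Sketch of expiring columns: a column written with a ttl of n seconds
    becomes invisible once its deadline passes."""

    def __init__(self):
        self.data = {}  # column name -> (value, expires_at or None)

    def insert(self, name, value, ttl=None):
        deadline = time.time() + ttl if ttl else None
        self.data[name] = (value, deadline)

    def get(self, name, now=None):
        now = now if now is not None else time.time()
        value, deadline = self.data.get(name, (None, None))
        if deadline is not None and now >= deadline:
            return None  # expired: reads no longer see it
        return value

cols = ExpiringColumns()
cols.insert("foo", 1, ttl=3600)
print(cols.get("foo"))                          # 1 (still live)
print(cols.get("foo", now=time.time() + 7200))  # None (expired)
```

For the segments use case this means stale bucket memberships clean themselves up with no batch-delete job.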

Data modeling: user segments

$pool  = new ConnectionPool('whyk', array('localhost'));
$users = new ColumnFamily($pool, 'users');
$users->insert(
    $userUuid,
    array($segment => 1),
    NULL,      // default TS
    $expires
);

Using phpcassa client: https://github.com/thobbs/phpcassa

Data modeling: user segments

UPDATE users USING TTL = 3600
SET 'foo' = 1
WHERE KEY = 'f97be9cc-5255-4578-8813-76701c0945bd'

Using CQL http://www.datastax.com/dev/blog/what%E2%80%99s-new-in-cassandra-0-8-part-1-cql-the-cassandra-query-language

http://www.datastax.com/docs/1.0/references/cql

Top Tip #5

Try to exploit Cassandra’s columnar data model; avoid read-before-write and locking by safely mutating individual columns

Data modeling: ad performance

Track overall ad performance; how many clicks/impressions per ad?

["ads"][<adId>][<stamp>]["click"] = #
["ads"][<adId>][<stamp>]["impression"] = #

[CF] [Row] [S.Col] [Col] = value

Using super columns

Top Tip #6

Friends don’t let friends use Super Columns.

http://rubyscale.com/2010/beware-the-supercolumn-its-a-trap-for-the-unwary/

Data modeling: ad performance

Try again using regular columns:

["ads"][<adId>][<stamp>-"click"] = #
["ads"][<adId>][<stamp>-"impression"] = #

[CF] [Row] [Col] = value

Data modeling: ad performance

ads Column Family:

[1][2011103015-click] = 1
[1][2011103015-impression] = 3434
[1][2011103016-click] = 12
[1][2011103016-impression] = 5411
[1][2011103017-click] = 2
[1][2011103017-impression] = 345

Q: Get performance of ad X between two date/times
A: Column slice against a single row, specifying a start stamp and end stamp + 1
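The hour-bucket scheme can be simulated in a few lines: because column names sort lexicographically, a date/time range becomes a single slice from the start stamp to end stamp + 1. Illustrative Python (the `performance` helper is invented for the sketch):

```python
# Hour buckets as column names: "<YYYYMMDDHH>-<metric>". Columns are stored
# sorted by name, so a date/time range is one contiguous column slice.
row = {
    "2011103015-click": 1,  "2011103015-impression": 3434,
    "2011103016-click": 12, "2011103016-impression": 5411,
    "2011103017-click": 2,  "2011103017-impression": 345,
}

def performance(row, start_stamp, end_stamp):
    # Equivalent of a column slice from "<start>" to "<end + 1>"
    lo, hi = str(start_stamp), str(end_stamp + 1)
    return {name: count
            for name, count in sorted(row.items())
            if lo <= name < hi}

print(performance(row, 2011103015, 2011103016))
```

The slice touches exactly the columns in range; no scan over the whole row is needed.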

Think carefully about your data

This scheme works because I’m assuming each ad has a relatively short lifespan. This means that there are lots of rows and hence the load is spread.

Other options:http://rubyscale.com/2011/basic-time-series-with-cassandra/

Counters

• Distributed atomic counters

• Easy to use

• Not idempotent

http://www.datastax.com/dev/blog/whats-new-in-cassandra-0-8-part-2-counters
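Why non-idempotence matters: a counter increment that times out may or may not have been applied, so retrying it can double-count, whereas retrying a regular column write is harmless. A toy illustration in Python:

```python
# Counters are relative increments, not absolute writes, so a replayed
# increment counts the same event twice. A regular column write of an
# absolute value is idempotent: retrying it changes nothing.
counter = 0
column = {}

def incr():           # counter add: NOT idempotent
    global counter
    counter += 1

def set_value(v):     # regular column write: idempotent
    column["foo"] = v

incr(); incr()                 # e.g. a retry after an ambiguous timeout
print(counter)                 # 2: the same event may now be counted twice

set_value(42); set_value(42)   # retrying the same write is safe
print(column["foo"])           # 42
```

This is the trade-off behind "easy to use, not idempotent": fine for stats like clicks and impressions, risky for anything that must be exact.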

Data modeling: ad performance

$stamp = date('YmdH');
$ads->add(
    $adId,                // row key
    "$stamp-impression",  // column
    1                     // increment
);

We’ll store performance metrics in hour buckets for graphing.

Data modeling: ad performance

UPDATE ads
SET '2011103015-impression' = '2011103015-impression' + 1
WHERE KEY = '1'

Data modeling: performance/segment

We can add another dimension to our stats so we can break them down by segment.

["ads"][<adId>] [<stamp>-<segment>-"click"] = #

[CF] [Row] [Col] = value

Data modeling: performance/segment

ads Column Family:

[1][2011103015-bar-click] = 1
[1][2011103015-bar-impression] = 3434
[1][2011103015-foo-click] = 12
[1][2011103015-foo-impression] = 5411
[1][2011103016-bar-click] = 2

Q: Get performance of ad X between two date/times, split by segment
A: Column slice against a single row, specifying a start stamp and end stamp + 1

Data modeling: performance/segment

$stamp = date('YmdH');
$ads->add(
    "$adId-segments",              // row key
    "$stamp-$segment-impression",  // column
    1                              // increment
);

We’ll store performance metrics in hour buckets for graphing.

Data modeling: segment stats

Track overall clicks/impressions per bucket; which buckets are most clicky?

["segments"][<adId>-"segments"] [<stamp>-<segment>-"click"] = #

[CF] [Row] [Col] = value

Recap: Data modeling

• Think about the queries, work backwards

• Don’t overuse single rows; try to spread the load

• Don’t use super columns

• Ask on IRC! #cassandra

Recap: Common data modeling patterns

1. Using column names with no value

[cf][rowKey][columnName] = 1

Recap: Common data modeling patterns

2. Counters

[cf][rowKey][columnName]++

And also…

3. Serialising a whole object

[cf][rowKey][columnName] = { foo: 3, bar: 11 }
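Pattern 3 can be sketched with JSON as the serialisation (illustrative Python; `store_object` and `load_object` are invented helpers, not a client API):

```python
import json

def store_object(row, column_name, obj):
    """Pattern 3: serialise a whole object into one column value, so it is
    read and written atomically as a single column."""
    row[column_name] = json.dumps(obj)

def load_object(row, column_name):
    return json.loads(row[column_name])

row = {}
store_object(row, "profile", {"foo": 3, "bar": 11})
print(load_object(row, "profile"))  # {'foo': 3, 'bar': 11}
```

The trade-off versus patterns 1 and 2 is that the object must be rewritten whole, so concurrent updates to different fields can clobber each other.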

There’s more: Brisk

Integrated Hadoop distribution (without HDFS installed). Run Hive and Pig queries directly against Cassandra

DataStax now offer this functionality in their “Enterprise” product

http://www.datastax.com/products/enterprise

Hive

CREATE EXTERNAL TABLE tempUsers (userUuid string, segmentId string, value string)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES (
    "cassandra.columns.mapping" = ":key,:column,:value",
    "cassandra.cf.name" = "users"
);

SELECT segmentId, count(1) AS total
FROM tempUsers
GROUP BY segmentId
ORDER BY total DESC;

There’s more: Supercharged Cassandra

Acunu have reengineered the entire Unix storage stack, optimised specifically for Big Data workloads

Includes instant snapshot of CFs

http://www.acunu.com/products/choosing-cassandra/

In conclusion

Cassandra is founded on sound design principles

In conclusion

The Cassandra data model, sometimes mentioned as a weakness, is incredibly powerful

In conclusion

The clients are getting better; CQL is a step forward

In conclusion

Hadoop integration means we can analyse data directly from a Cassandra cluster

In conclusion

Cassandra’s sweet spot is highly available “big data” (especially time-series) with large numbers of writes

Thanks

Learn more about Cassandra: meetup.com/Cassandra-London

Check out the code: https://github.com/davegardnerisme/we-have-your-kidneys

Watch videos from Cassandra SF 2011: http://www.datastax.com/events/cassandrasf2011/presentations
