cassandra introduction apache con 2014 budapest

Introduction to Cassandra DuyHai DOAN, Technical Advocate

@doanduyhai

About me!

Duy Hai DOAN Cassandra technical advocate •  talks, meetups, confs •  open-source devs (Achilles, …) •  OSS technical point of contact in Europe

☞ duy_hai.doan@datastax.com •  production troubleshooting

@doanduyhai 3

Datastax!•  Founded in April 2010

•  We drive Apache Cassandra™

•  Home to Cassandra chair & most committers (≈80%)

•  Headquarter in San Francisco Bay area

•  EU headquarter in London, offices in France and Germany

•  Datastax Enterprise = OSS Cassandra + features

@doanduyhai

Agenda!

•  Architecture •  Replication •  Consistency •  Read/write path •  Data Model •  Failure handling

What is Cassandra!

@doanduyhai

Cassandra history!

NoSQL database •  created at Facebook •  open-sourced since 2008 •  current version = 2.1 •  column-oriented ☞ distributed table

@doanduyhai

Cassandra 5 key facts!

Linear scalability Small & « huge » scale •  3 à1k+ nodes cluster •  3Gb à Pb+

@doanduyhai

Continuous availability (≈100% up-time) •  resilient architecture (Dynamo) •  rolling upgrades •  data backward compatible n/n+1 versions

@doanduyhai

Multi-data centers •  out-of-the-box (config only) •  AWS conf for multi-region DCs •  GCE/CloudStack support •  resilience, work-load segregation •  virtual data-centers

@doanduyhai

Operational simplicity •  1 node = 1 process + 1 config file •  deployment automation •  OpsCenter for monitoring

@doanduyhai

Analytics combo •  Cassandra + Spark = awesome ! •  realtime streaming

Cassandra architecture!

@doanduyhai

Cassandra architecture!

DATA STORE (BIG TABLES)

CLUSTER (DYNAMODB)

API (CQL & RPC)

Client request

CLUSTER (DYNAMODB)

API (CQL & RPC)

DATA STORE (BIG TABLES)

@doanduyhai

Data distribution!

Random: hash of #partition → token = hash(#p) Hash: ]-X, X] X = huge number (264/2)

@doanduyhai

Token Ranges!

A: ]0, X/8] B: ] X/8, 2X/8] C: ] 2X/8, 3X/8] D: ] 3X/8, 4X/8] E: ] 4X/8, 5X/8] F: ] 5X/8, 6X/8] G: ] 6X/8, 7X/8] H: ] 7X/8, X]

@doanduyhai

Linear scalability!

8 nodes 10 nodes

Replication & Consistency Model!

@doanduyhai

Failure tolerance by replication!

Replication Factor (RF) = 3

3{B, A, H}

{C, B, A}

{D, C, B}

@doanduyhai

Coordinator node!Incoming requests (read/write) Coordinator node handles the request

Every node can be coordinator àmasterless

coordinator Client request

@doanduyhai

Consistency!

Tunable at runtime •  ONE •  QUORUM (strict majority w.r.t. RF) •  ALL Apply both to read & write

@doanduyhai

Write consistency!Write ONE •  write request to all replicas in //

coordinator

@doanduyhai

Write consistency!Write ONE •  write request to all replicas in // •  wait for ONE ack before returning to

client

coordinator

@doanduyhai

Write consistency!Write ONE •  write request to all replicas in // •  wait for ONE ack before returning to

client •  other acks later, asynchronously

coordinator

10 µs

120 µs

@doanduyhai

Write consistency!Write QUORUM •  write request to all replicas in // •  wait for QUORUM acks before

returning to client •  other acks later, asynchronously

coordinator

10 µs

120 µs

@doanduyhai

Read consistency!Read ONE •  read from one node among all replicas

coordinator

@doanduyhai

Read consistency!Read ONE •  read from one node among all replicas •  contact the fastest node (stats)

coordinator

@doanduyhai

Read consistency!Read QUORUM •  read from one fastest node

coordinator

@doanduyhai

Read consistency!Read QUORUM •  read from one fastest node •  AND request digest from other

replicas to reach QUORUM

coordinator

@doanduyhai

replicas to reach QUORUM •  return most up-to-date data to client

coordinator

@doanduyhai

replicas to reach QUORUM •  return most up-to-date data to client •  repair if digest mismatch n1

coordinator

@doanduyhai

Consistency in action!

RF = 3, Write ONE, Read ONE

Read ONE: A

data replication in progress …

Write ONE: B

@doanduyhai

RF = 3, Write ONE, Read QUORUM

Write ONE: B

Read QUORUM: A

@doanduyhai

RF = 3, Write ONE, Read ALL

Read ALL: B

Write ONE: B

@doanduyhai

RF = 3, Write QUORUM, Read ONE

Write QUORUM: B

Read ONE: A

@doanduyhai

RF = 3, Write QUORUM, Read QUORUM

Read QUORUM: B

Write QUORUM: B

@doanduyhai

Consistency level!

ONE Fast, may not read latest written value

@doanduyhai

Consistency level!

QUORUM Strict majority w.r.t. Replication Factor

Good balance

@doanduyhai

Consistency level!

ALL Paranoid

Slow, no high availability

@doanduyhai 40

Consistency summary!

ONERead + ONEWrite

☞ available for read/write even (N-1) replicas down

QUORUMRead + QUORUMWrite

☞ available for read/write even 1+ replica down

Read & Write Path!

@doanduyhai

Cassandra Write Path!

Commit log1

Commit log2

Commit logn

Memory

@doanduyhai

Memory

Commit log1

Commit log2

Commit logn

MemTable Table1

MemTable Table2

MemTable TableN

@doanduyhai

Commit log1

Commit log2

Commit logn

Table1

SStable1

Table2 Table3

SStable2 SStable3 3

Memory

@doanduyhai

Commit log1

Commit log2

Commit logn

Table1

SStable1

Table2 Table3

SStable2 SStable3

Memory . . . MemTable Table1

MemTable Table2

MemTable TableN

@doanduyhai

Commit log1

Commit log2

Commit logn

Table1

SStable1

Table2 Table3

SStable2 SStable3

Memory

SStable1

SStable2

SStable3 . . .

@doanduyhai

Cassandra Read Path!

Row Cache 1

JVM Heap

Off Heap

SELECT .. FROM … WHERE #partition =…;

SStable3 SStable2 SStable1

@doanduyhai

Bloom Filter Bloom Filter

Row Cache 1

JVM Heap

Off Heap

Bloom Filter 2

@doanduyhai

Bloom filters in action!

1 0 0 1* 0 0 1 0 1 1

#partition = foo

h1 h2 h3

#partition = bar Write

Read #partition = qux

@doanduyhai

Partition Key Cache

Row Cache 1

JVM Heap

Off Heap

Bloom Filter 2

× ✓ ×

@doanduyhai

Partition Key Cache!

#partition001 data data

…. #partition002

#Partition Offset …

#partition001 0x0 …

Partition Key Cache SStable

@doanduyhai

Row Cache 1

JVM Heap

Off Heap

Partition Key Cache 3

Key Index Sample 4

SStable2

Bloom Filter 2

@doanduyhai

Key Index Sample!

…. #partition128

SStable

Sample Offset

#partition001 0x0

#partition128 0x4500

Key Index Sample

@doanduyhai

Row Cache 1

JVM Heap

Off Heap

Key Index Sample 4

Compression Offset 5

SStable2

Bloom Filter 2

@doanduyhai

Compression Offset!

Compression Offset

Normal offset Compressed offset

0x0 0x0

0x4500 0x100

0x851513 0x1353

0x5464321 0x21245

@doanduyhai

Row Cache 1

JVM Heap

Off Heap

Key Index Sample 4

Compression Offset 5

SStable2 6

MemTable 7

Bloom Filter 2

Data model!

Last Write Win!CQL basics!

From SQL to CQL!

@doanduyhai

Last Write Win (LWW)!

jdoe age name

33 John DOE

INSERT INTO users(login, name, age) VALUES(‘jdoe’, ‘John DOE’, 33);

#partition

@doanduyhai

jdoe age (t1) name (t1)

33 John DOE

auto-generated timestamp (μs)

@doanduyhai

UPDATE users SET age = 34 WHERE login = jdoe;

33 John DOE jdoe

age (t2)

SSTable1 SSTable2

@doanduyhai

DELETE age FROM users WHERE login = jdoe;

jdoe age (t3)

tombstone

33 John DOE jdoe

age (t2)

SSTable1 SSTable2 SSTable3

@doanduyhai

SELECT age FROM users WHERE login = jdoe;

jdoe age (t3)

ý jdoe

age (t1) name (t1)

33 John DOE jdoe

age (t2)

@doanduyhai

✓ ✕ ✕

jdoe age (t3)

ý jdoe

age (t1) name (t1)

33 John DOE jdoe

age (t2)

@doanduyhai

Compaction!

jdoe age (t3)

ý jdoe

age (t1) name (t1)

33 John DOE jdoe

age (t2)

New SSTable

ý John DOE

@doanduyhai

CRUD operations!

UPDATE users SET age = 34 WHERE login = jdoe;

DELETE age FROM users WHERE login = jdoe;

@doanduyhai

DDL Statements!

Keyspace (tables container) •  CREATE/ALTER/DROP Table •  CREATE/ALTER/DROP Type (custom type) •  CREATE/ALTER/DROP

@doanduyhai

User management & permissions!

User •  CREATE/ALTER/DROP Permissions •  GRANT <xxx> PERMISSION ON <resource> TO <user_name> •  REVOKE <xxx> PERMISSION ON <resource> FROM <user_name>

@doanduyhai

Simple Table!

CREATE TABLE users ( login text, name text, age int, … PRIMARY KEY(login));

partition key (#partition)

@doanduyhai

Clustered table!

CREATE TABLE mailbox ( login text, message_id timeuuid, interlocutor text, message text, PRIMARY KEY((login), message_id));

partition key clustering column unicity

@doanduyhai

Queries!

Get message by user and message_id (date)

SELECT * FROM mailbox WHERE login = jdoe and message_id = ‘2014-09-25 16:00:00’;

Get message by user and date interval

SELECT * FROM mailbox WHERE login = jdoe and message_id <= ‘2014-09-25 16:00:00’ and message_id >= ‘2014-09-20 16:00:00’;

@doanduyhai

Queries!

Get message by message_id only (#partition not provided)

SELECT * FROM mailbox WHERE message_id = ‘2014-09-25 16:00:00’;

Get message by date interval (#partition not provided)

SELECT * FROM mailbox WHERE and message_id <= ‘2014-09-25 16:00:00’ and message_id >= ‘2014-09-20 16:00:00’;

@doanduyhai

Queries!

SELECT * FROM mailbox WHERE login >= hsue and login <= jdoe;

Get message by user range (range query on #partition)

SELECT * FROM mailbox WHERE login like ‘%doe%‘;

Get message by user pattern (non exact match on #partition)

@doanduyhai

WHERE clause restrictions!

All queries (INSERT/UPDATE/DELETE/SELECT) must provide #partition Only exact match (=) on #partition, range queries (<, ≤, >, ≥) not allowed •  ☞ full cluster scan On clustering columns, only range queries (<, ≤, >, ≥) and exact match WHERE clause only possible on columns defined in PRIMARY KEY

@doanduyhai

Collections & maps!

CREATE TABLE users ( login text, name text, age int, friends set<text>, hobbies list<text>, languages map<int, text>, … PRIMARY KEY(login));

@doanduyhai

Collections & maps!

All elements loaded at each access •  ☞ low cardinality only (≤ 1000 elements)

Should not be used use for 1-N, N-N relations

@doanduyhai

User Defined Type (UDT)!

CREATE TABLE users ( login text, … street_number int, street_name text, postcode int, country text, … PRIMARY KEY(login));

Instead of

@doanduyhai

User Defined Type (UDT)!

CREATE TYPE address ( street_number int, street_name text, postcode int, country text);

CREATE TABLE users ( login text, … location frozen <address>, … PRIMARY KEY(login));

@doanduyhai

UDT in action!

INSERT INTO users(login,name, location) VALUES ( ‘jdoe’, ’John DOE’, { ‘street_number’: 124, ‘street_name’: ‘Congress Avenue’, ‘postcode’: 95054, ‘country’: ‘USA’ });

@doanduyhai

From SQL to CQL!

Remember…

@doanduyhai

From SQL to CQL!

Remember…

CQL is not SQL

@doanduyhai

From SQL to CQL!

Remember…

there is no join

(do you want to scale ?)

@doanduyhai

From SQL to CQL!

Remember…

there is no integrity constraint

(do you want to read-before-write ?)

@doanduyhai

From SQL to CQL!

1-1, N-1 : normalized

Comment

CREATE TABLE comments ( article_id uuid, comment_id timeuuid, author_id uuid, // typical join id content text, PRIMARY KEY((article_id), comment_id));

@doanduyhai

From SQL to CQL!

1-1, N-1 : de-normalized

Comment

CREATE TABLE comments ( article_id uuid, comment_id timeuuid, author person, // person is UDT content text, PRIMARY KEY((article_id), comment_id));

@doanduyhai

From SQL to CQL!

N-1, N-N : normalized

CREATE TABLE friends( login text, friend_login text, // typical join id PRIMARY KEY((login), friend_login));

@doanduyhai

From SQL to CQL!

N-1, N-N : de-normalized

CREATE TABLE friends( login text, friend_login text, friend person, // person is UDT PRIMARY KEY((login), friend_login));

@doanduyhai

Data modeling best practices!

Start by queries •  identify core functional read paths •  1 read scenario ≈ 1 SELECT

@doanduyhai

Start by queries •  identify core functional read paths •  1 read scenario ≈ 1 SELECT Denormalize •  wisely, only duplicate necessary & immutable data •  functional/technical trade-off

@doanduyhai

Person UDT - firstname/lastname - date of birth - sexe - mood - location

@doanduyhai

John DOE, male birthdate: 21/02/1981 subscribed since 03/06/2011 ☉ San Mateo, CA

’’Impossible is not John DOE’’

Full detail read from User table on click

Failure Handling!

@doanduyhai

Failure Handling!Failure occurs when •  node hardware failure (disk, …) •  network issues •  heavy load, node flapping •  JVM long GCs

@doanduyhai

Hinted Handoff!When 1 replica down •  coordinator stores Hints

coordinator ×

@doanduyhai

Hinted Handoff!When replica up •  coordinator forward stored Hints

coordinator

@doanduyhai

Consistent read!Read at QUORUM or more •  read from one fastest node •  AND request digest from other

replicas to reach QUORUM •  return most up-to-date data to client •  repair if digest mismatch n1

coordinator

@doanduyhai

Read Repair!Read at CL < QUORUM •  read from one least-loaded node •  every x reads, request digests from

other replicas •  compare digest & repair

asynchronously •  read_repair_chance (10%)

coordinator

@doanduyhai

Manual repair!Should be part of cluster operations •  scheduled •  I/O intensive Use incremental repair of 2.1

@doanduyhai Company Confidential 101

Training Day | December 3rd Beginner Track •  Introduction to Cassandra •  Introduction to Spark, Shark, Scala and

Cassandra

Advanced Track •  Data Modeling •  Performance Tuning

Conference Day | December 4th Cassandra Summit Europe 2014 will be the single largest gathering of Cassandra users in Europe. Learn how the world's most successful companies are transforming their businesses and growing faster than ever using Apache Cassandra. http://bit.ly/cassandrasummit2014

Thank You @doanduyhai

duy_hai.doan@datastax.com

https://academy.datastax.com/

cassandra introduction apache con 2014 budapest

Technology

cassandra day denver 2014: introduction to apache cassandra

apache cassandra - einführung

apache cassandra 2.0

apache cassandra overview

apache cassandra

apache cassandra from the ground up -...

apache cassandra in action - o'reilly...

diaposotivas apache-cassandra

apache cassandra, part 3 – machinery, work with cassandra

cassandra summit 2014: apache cassandra at telefonica cbs

devcenter apache cassandra

amazon managed apache cassandra service - developer...

apache cassandra at wayin

nosql apache cassandra

introduction to cassandra • why spark - apache cassandra |...

apache cassandra in bangalore - cassandra internals and...

cassandra community webinar - introduction to apache...

pycon 2012 apache cassandra

cassandra day atlanta 2015: troubleshooting with apache...

cassandra summit 2014: apache cassandra on pivotal...