Cassandra Data Structures and Algorithms
DESCRIPTION
CRDT, Bloom Filter, Merkle Tree & HyperLogLog in Cassandra
TRANSCRIPT
@doanduyhai
Cassandra data structures & algorithms. DuyHai DOAN, Technical Advocate
Shameless self-promotion!
DuyHai DOAN, Cassandra technical advocate
• talks, meetups, confs
• open-source dev (Achilles, …)
• Cassandra technical point of contact
• Cassandra troubleshooting
Agenda!
Data structures
• CRDT
• Bloom filter
• Merkle tree
Algorithms
• HyperLogLog
Why Cassandra ?!
Linear scalability ≈ unbounded extensibility
• 1000+ node clusters
Continuous availability (≈ 100% up-time)
• resilient architecture (Dynamo)
• rolling upgrades
• data backward-compatible across n/n+1 versions
Multi-data centers
• out-of-the-box (config only)
• AWS conf for multi-region DCs
Operational simplicity
• 1 node = 1 process + 1 config file
Cassandra architecture!
Data-store layer
• Google BigTable paper
• columns / column families
Cluster layer
• Amazon Dynamo paper
• masterless architecture
[Diagram: a client request reaches a node; each node stacks API (CQL & RPC) on top of the cluster layer (Dynamo) on top of the data store (BigTable) on top of disks]
Data access!
By CQL query via the native protocol
• INSERT, UPDATE, DELETE, SELECT
• CREATE/ALTER/DROP TABLE
Always by partition key (#partition)
• partition == physical row
Data distribution!
Random: hash of #partition → token = hash(#partition)
Hash range: ]0, 2^127−1]
Each node (n1 … n8) owns 1/8 of ]0, 2^127−1]
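The token mapping can be sketched in a few lines (a hypothetical layout with 8 nodes n1 … n8 owning equal slices of the token range; an illustration, not Cassandra source code):

```python
import hashlib

RANGE = 2 ** 127                          # token space ]0, 2^127 - 1]
NODES = [f"n{i}" for i in range(1, 9)]    # n1 … n8, 1/8 of the ring each

def token(partition_key):
    # RandomPartitioner-style token: a 127-bit MD5-derived hash of the key.
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest, "big") % RANGE

def owner(partition_key):
    # Each node owns one contiguous 1/8 slice of the token range.
    return NODES[token(partition_key) * 8 // RANGE]
```

With a well-distributed hash, keys spread near-uniformly over the 8 nodes.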
Data replication!
Replication Factor = 3
[Diagram: the ring n1 … n8; each partition (1, 2, 3) is stored on 3 replicas]
Coordinator node!
Incoming requests (read/write): the coordinator node handles the request
Every node can be coordinator → masterless
[Diagram: a client request reaches a coordinator node, which forwards it to the 3 replicas]
CRDT!
by Marc Shapiro, 2011
INSERT!
Table « users »: ddoan | age: 33 | name: DuyHai DOAN
INSERT INTO users(login, name, age) VALUES('ddoan', 'DuyHai DOAN', 33);
'ddoan' is the #partition; age and name are column names
The columns get an auto-generated timestamp (μs): ddoan | age (t1): 33 | name (t1): DuyHai DOAN
UPDATE!
Table « users »
UPDATE users SET age = 34 WHERE login = 'ddoan';
File1: ddoan | age (t1): 33 | name (t1): DuyHai DOAN
File2: ddoan | age (t2): 34
DELETE!
Table « users »
DELETE age FROM users WHERE login = 'ddoan';
File1: ddoan | age (t1): 33 | name (t1): DuyHai DOAN
File2: ddoan | age (t2): 34
File3: ddoan | age (t3): ✕ (tombstone)
SELECT!
Table « users »
SELECT age FROM users WHERE login = 'ddoan';
Which file holds the right value ? ? ?
File3: ddoan | age (t3): ✕ (tombstone)
File1: ddoan | age (t1): 33 | name (t1): DuyHai DOAN
File2: ddoan | age (t2): 34
SELECT age FROM users WHERE login = 'ddoan';
The highest timestamp wins:
File3: ddoan | age (t3): ✕ (tombstone) ✓
File1: ddoan | age (t1): 33 | name (t1): DuyHai DOAN ✕
File2: ddoan | age (t2): 34 ✕
Cassandra columns!
look very similar to …
CRDT
CRDT Recap!
CRDT = Convergent Replicated Data Types (the state-based CvRDT flavour of Conflict-free Replicated Data Types)
Useful in distributed systems
Formal proof of strong « eventual convergence » of replicated data
A join semilattice (or just semilattice hereafter) is a partial order ≤v equipped with a least upper bound (LUB) ⊔v, defined as follows:
Definition 2.4. m = x ⊔v y is a Least Upper Bound of {x, y} under ≤v iff
• x ≤v m, and
• y ≤v m, and
• there is no m′ <v m such that x ≤v m′ and y ≤v m′
It follows from the definition that ⊔v is:
• commutative: x ⊔v y =v y ⊔v x
• idempotent: x ⊔v x =v x
• associative: (x ⊔v y) ⊔v z =v x ⊔v (y ⊔v z)
Definition 2.5 (Join Semilattice). An ordered set (S, ≤v) is a Join Semilattice iff ∀ x, y ∈ S, x ⊔v y exists.
Let's define S^t_{k,n} = the set of Cassandra columns identified by
• partition key k
• column name n
• assigned a timestamp t
The ordered set (S^t_{k,n}, max_t) is a Join Semilattice
Cassandra column as CRDT!
Proof:
• S1_{k,n} ≤ max_t(S1_{k,n}, S2_{k,n})
• S2_{k,n} ≤ max_t(S1_{k,n}, S2_{k,n})
• there is no Sx_{k,n} < max_t(S1_{k,n}, S2_{k,n}) such that S1_{k,n} ≤ Sx_{k,n} and S2_{k,n} ≤ Sx_{k,n}
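The semilattice laws can be checked mechanically with a tiny last-write-wins model of a column (an illustration of the max-timestamp merge, not Cassandra code):

```python
from collections import namedtuple

# A column value tagged with its write timestamp (microseconds).
Column = namedtuple("Column", ["timestamp", "value"])

def merge(a, b):
    """LUB under the timestamp order: the highest timestamp wins
    (namedtuples compare (timestamp, value) lexicographically)."""
    return max(a, b)

c1, c2, c3 = Column(1, 33), Column(2, 34), Column(3, 35)

assert merge(c1, c1) == c1                                   # idempotent
assert merge(c1, c2) == merge(c2, c1) == c2                  # commutative
assert merge(merge(c1, c2), c3) == merge(c1, merge(c2, c3))  # associative
```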
Idempotent ! [Diagram: node1, node2 and node3 each hold ddoan age (t2): 33; the coordinator's merge still yields age (t2): 33]
Commutative ! [Diagram: merging node1's age (t1): 33 with node2's age (t2): 34 yields age (t2): 34 at the coordinator, in either merge order]
Associative ! [Diagram: merging age (t1): 33, age (t2): 34 and age (t3): 35 from node1, node2 and node3 yields the same result at the coordinator whatever the grouping]
[Diagram: the SSTable files for partition ddoan; File1: address (t1): 12 rue de.., age (t1): 33; File2: age (t2): 34; File3: age (t3): 35; File4: address (t7): 17 avenue..]
S^t_{ddoan,age} = { age (t1): 33, age (t2): 34, age (t3): 35 }, ordered by t
S^t_{ddoan,address} = { address (t1): 12 rue de.., address (t7): 17 avenue.. }, ordered by t
Eventual convergence!
Proposition 2.1. Any two object replicas of a CvRDT eventually converge, assuming the system transmits payload infinitely often between pairs of replicas over eventually-reliable point-to-point channels.
!! « eventually-reliable point-to-point channels » → there is a network cable connecting 2 nodes …
« The system transmits payload infinitely often between pairs of replicas » → Repair
Strong hypothesis in the case of Cassandra CRDT: max_timestamp as the merge function !
Time is reliable … isn't it ? NTP server-side is mandatory
Q & A
Bloom filters!
by Burton Howard Bloom, 1970
Cassandra Write Path!
[Diagram, step 1: the mutation is appended to a commit log on disk (Commit log1 … Commit logn)]
[Diagram, step 2: the mutation is also written to the in-memory MemTable of its table (MemTable Table1 … MemTable TableN)]
[Diagram, step 3: when a MemTable is full, it is flushed to disk as an immutable SSTable (SSTable1, SSTable2, SSTable3 …); over time each table accumulates many SSTables]
Cassandra Read Path!
Either in memory, or hit disk (many SSTables)
How to optimize disk seeks ? Only read the necessary SSTables !
Bloom filters !
Bloom filters recap!
Space-efficient probabilistic data structure
Used for membership tests
No false negatives, possible false positives
Bloom filters in Cassandra!
For each SSTable, create a Bloom filter
Upon data insertion, populate it
Upon data retrieval, ask the Bloom filter whether the SSTable can be skipped
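A minimal sketch of such a per-SSTable Bloom filter (illustration only; deriving the k hash functions from salted MD5 is our choice here, not Cassandra's):

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k        # m bits, k hash functions
        self.bits = bytearray(m)     # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive k positions by hashing the key with k different salts.
        for i in range(self.k):
            h = hashlib.md5(b"%d:%s" % (i, key.encode())).hexdigest()
            yield int(h, 16) % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False: definitely absent; True: possibly present.
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter(m=1000, k=3)
bf.add("foo")
bf.add("bar")
assert bf.might_contain("foo")       # no false negatives
```

A read for a partition the filter reports absent can skip this SSTable entirely.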
Bloom filters in action!
Write #partition = foo: h1, h2, h3 each set one bit → 1 0 0 1 0 0 1 0 0 0
Write #partition = bar: h1, h2, h3 set three more bits (1* marks a bit already set by foo) → 1 0 0 1* 0 0 1 0 1 1
Read #partition = qux: check the bits at positions h1(qux), h2(qux), h3(qux)
Bloom filters maths!
After inserting one element with one hash function into an array of m bits (1 0 0 1 0 0 1 0 0 0):
• probability that a given bit is set to 1: 1/m
• probability that a given bit is still 0: 1 − 1/m
• probability with k hash functions that a given bit is still 0: (1 − 1/m)^k
• probability with k hash functions and n elements inserted that a given bit is still 0: (1 − 1/m)^kn
• probability with k hash functions and n elements inserted that a given bit is set to 1: 1 − (1 − 1/m)^kn
But why do we need to calculate the probability of a bit to be set to 1, then to be set to 0, then back to 1 again ? Because with many hash functions (k) and many elements (n), distinct elements collide on the same 1 bits !
For an element not in the SSTable, the probability that all k hash functions return 1 (the false positive chance, fpc):

fpc = (1 − (1 − 1/m)^kn)^k ≈ (1 − e^(−kn/m))^k

To minimize fpc: k_optimal ≈ (m/n)·ln(2)
Substituting k_optimal:

fpc = (1 − e^(−(m/n)·ln(2)·(n/m)))^((m/n)·ln(2)) = (1 − e^(ln(1/2)))^((m/n)·ln(2)) = (1/2)^((m/n)·ln(2))

ln(fpc) = (m/n)·ln(2)·ln(1/2) = −(m/n)·ln(2)²

m = n·ln(1/fpc) / ln(2)²
For n = 10^9 #partitions:
• fpc = 10% → m ≈ 500 MB
• fpc = 1% → m ≈ 1.2 GB
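The sizing formula is easy to check numerically (a small helper; the names are ours):

```python
import math

def bloom_sizing(n, fpc):
    """Return (m_bits, k) for n expected elements and target fpc."""
    m_bits = n * math.log(1 / fpc) / math.log(2) ** 2   # m = n*ln(1/fpc)/ln(2)^2
    k = (m_bits / n) * math.log(2)                      # k_optimal = (m/n)*ln(2)
    return m_bits, k

m_bits, k = bloom_sizing(n=10**9, fpc=0.01)
# m_bits / 8 is roughly 1.2e9 bytes, matching the 1.2 GB quoted above,
# with k close to 7 hash functions.
```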
Bloom filters (notes)!
Cannot remove elements once inserted (a 1 bit may be shared by several elements)
• cannot resize
• false positives increase with load
Q & A
Merkle tree!
by Ralph Merkle, 1987
Repairing data!
Repair: « the system transmits payload infinitely often between pairs of replicas »
Why repair ? Data diverges between replicas because of:
• writes at low consistency level for perf
• nodes down
• network down
• dropped writes
Compare full data ?
• read all data
• I/O intensive
• network intensive (streaming is expensive)
Compare digests ?
• read all data
• I/O intensive
• network intensive (streaming is expensive)
Merkle tree!
Tree of digests
• leaf nodes: digest of data
• non-leaf nodes: digest of the children's digests
• tree resolution = number of leaf nodes = 2^depth
Merkle tree in action!
Depth = 15, resolution = 32 768 leaf nodes
[Diagram: root → internal nodes → leaf1, leaf2, leaf3 …; each leaf covers an n-partition bucket]
Repair process:
• send the tree to replicas
• compare digests, starting from the root node
• if mismatch, stream the partition bucket(s) that differ
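The build-and-compare idea can be sketched as follows (an illustration, not Cassandra's repair code; a real diff walks down the tree instead of scanning all leaves, and the bucket count is assumed to be a power of two):

```python
import hashlib

def digest(data):
    return hashlib.sha256(data).digest()

def build_tree(buckets):
    """Leaves digest partition buckets; parents digest their children."""
    level = [digest(b) for b in buckets]
    tree = [level]
    while len(level) > 1:
        level = [digest(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        tree.append(level)
    return tree                      # tree[-1][0] is the root digest

def diff_leaves(t1, t2):
    """Indices of the leaf buckets whose digests differ."""
    if t1[-1][0] == t2[-1][0]:
        return []                    # identical roots: nothing to stream
    return [i for i, (a, b) in enumerate(zip(t1[0], t2[0])) if a != b]

replica1 = [b"p0 p1", b"p2 p3", b"p4 p5", b"p6 p7"]
replica2 = [b"p0 p1", b"p2 XX", b"p4 p5", b"p6 p7"]   # bucket 1 diverged
assert diff_leaves(build_tree(replica1), build_tree(replica2)) == [1]
```

Only the buckets whose digests differ need to be streamed between replicas.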
If mismatch, stream the partition bucket(s) that differ. Example:
• 327 680 partitions
• resolution = 32 768 → 10 partitions per leaf bucket
• 1 column differs in 1 partition → 10 partitions streamed
Over-streaming nightmare!
Improve tree resolution by increasing depth (dynamically)
[Diagram: the same tree before and after one extra level of depth; twice as many leaves, so each leaf bucket covers fewer partitions]
Improve tree resolution by repairing by partition ranges
[Diagram: one Merkle tree per partition range; with the same number of leaves over a smaller range, each leaf bucket holds fewer partitions]
Q & A
HyperLogLog!
by the late Philippe Flajolet, 2007
Cassandra Read Path!
Remember that ? [Diagram: each table's data spread across many SSTables on disk]
Even Bloom filters can't save you if data spills over many SSTables
Compaction !
Compaction!
Algorithm:
• take n SSTables
• load data in memory
• for each S^t_{k,n}, apply the merge function (max_timestamp)
• remove (when applicable) tombstones
• build a new SSTable
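The merge-then-drop-tombstones steps can be sketched like this (a toy model keyed by (partition, column); Cassandra's real compaction is streaming and far more involved):

```python
TOMBSTONE = object()   # marker for a deleted cell

def compact(sstables):
    """sstables: list of dicts {(partition, column): (timestamp, value)}."""
    merged = {}
    for sstable in sstables:
        for key, (ts, value) in sstable.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)          # last write wins
    # Drop cells whose winning value is a tombstone.
    return {k: v for k, v in merged.items() if v[1] is not TOMBSTONE}

file1 = {("ddoan", "age"): (1, 33), ("ddoan", "name"): (1, "DuyHai DOAN")}
file2 = {("ddoan", "age"): (2, 34)}
file3 = {("ddoan", "age"): (3, TOMBSTONE)}
result = compact([file1, file2, file3])
assert result == {("ddoan", "name"): (1, "DuyHai DOAN")}   # age fully removed
```

This mirrors the earlier SELECT example: age (t3) is the tombstone with the highest timestamp, so the age cell disappears from the new SSTable.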
Build a new SSTable → allocate memory for a new Bloom filter
But how large should the new Bloom filter be ?
[Diagram: the Bloom filters of SSTable1 and SSTable2; double the size ? in between ? the same size ?]
Bloom filter size depends on … element cardinality (fpc held constant)
If we can count the distinct elements in SSTable1 & SSTable2, we can allocate the new Bloom filter
[Diagram: SSTable1 with cardinality C1, SSTable2 with cardinality C2; given constant fpc, if cardinality = C1 + C2, then m = …]
But counting exact cardinality is memory-expensive …
Can't we have a cardinality estimate ?
Cardinality estimators!
Counter                        Bytes used   Error
Java HashSet                   10 447 016   0%
Linear Probabilistic Counter        3 384   1%
HyperLogLog                           512   3%
credits: http://highscalability.com/
LogLog intuition!
1) given a well-distributed hash function h
2) given a sufficiently high number of elements n
For a set of n elements, look at the bit patterns of h(element_i), ∀ i ∈ [1, n]:
• 0xxxxx… ≈ n/2, 1xxxxx… ≈ n/2
• 00xxxx…, 01xxxx…, 10xxxx…, 11xxxx…: ≈ n/4 each
• 000xxx…, 001xxx…, …, 111xxx…: ≈ n/8 each
Flip the reasoning around: if we see a hash like 000 000 000 1…, then since the hash distribution is uniform we should also have seen 000 000 001 0…, 000 000 001 1…, 000 000 010 0…, …, 111 111 111 1…, hence an estimated cardinality of n ≈ 2^10 elements.
Toy example: n = 8, hashes of 8 elements, 3 bits long: 000, 001, 010, 011, 100, 101, 110, 111
Uniform hash → equi-probability of each combination
If I observed 001, I should have seen 000 too, and 010 too … in fact the 7 other combinations
If I observed 001, n ≈ 8 (2^3)
1) given a well-distributed hash function h
2) given a sufficiently high number of elements n
If I find a hash starting with
• 01…, it's likely that there are 2^2 distinct elements (n = 2^2)
• 001…, it's likely that there are 2^3 distinct elements (n = 2^3)
• 0001…, it's likely that there are 2^4 distinct elements (n = 2^4)
• …
• 0000000001…, it's likely that there are 2^r distinct elements (n = 2^r), r being the position of the leftmost 1 bit
max(r) = the largest such position observed among all hash values
n ≈ 2^max(r)
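The observable behind this estimate can be sketched directly (the rank function and the hash choice here are ours, for illustration):

```python
import hashlib

def rho(x, width=32):
    """1-based position of the leftmost 1 bit in a width-bit value."""
    for pos in range(1, width + 1):
        if x & (1 << (width - pos)):
            return pos
    return width + 1                 # all-zero value

def h32(item):
    # 32-bit hash derived from sha1 (any well-distributed hash would do).
    return int.from_bytes(hashlib.sha1(item.encode()).digest()[:4], "big")

assert rho(1 << 28, 32) == 4         # a hash starting 0001… has rank 4

# Naive single-observable estimate: noisy, as the next slides show.
max_r = max(rho(h32(f"element-{i}")) for i in range(1024))
estimate = 2 ** max_r
```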
Still, it's a terrible estimate … What if we have these hash values for n = 16:
• 10 × 010…
• 5 × 100…
• 1 × 000 000 001…
n ≈ 2^max(r) ≈ 2^9 ≈ 512 ?
→ sensitive to outliers & skewed distributions
HyperLogLog intuition!
To eliminate outliers … use harmonic mean !
credits: http://economistatlarge.com
Harmonic mean definition (thank you Wikipedia):
H = m / (1/x1 + 1/x2 + … + 1/xm)
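A quick numeric check of why the harmonic mean helps (hypothetical bucket estimates, with one outlier):

```python
# One outlier estimate (512) dominates the arithmetic mean but
# barely moves the harmonic mean.
bucket_estimates = [2, 2, 2, 512]

arithmetic = sum(bucket_estimates) / len(bucket_estimates)
harmonic = len(bucket_estimates) / sum(1 / x for x in bucket_estimates)

assert arithmetic == 129.5     # dragged up by the outlier
assert harmonic < 3            # stays close to the typical value
```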
First, split the set into m = 2^b buckets; the bucket number is determined by the first b bits of the hash. With b = 6 that gives m = 64 buckets B1, B2, … B64 (index is 1-based): h(element) = 001001 0100… lands in bucket ⟨001001⟩2 + 1 = B10.
Example with m = 8 (2^3) buckets:
000xxx… → B1, 001xxx… → B2, 010xxx… → B3, 011xxx… → B4, 100xxx… → B5, 101xxx… → B6, 110xxx… → B7, 111xxx… → B8
New intuition:
• in each bucket j, there are ≈ M_j elements
• harmonic mean H(M_j) ≈ n/m
• so n ≈ m·H(M_j)
But how do we calculate each M_j ? Use LogLog !
On each hash value 001100 0000001…, the first bits (001100) choose the bucket B_j and the remaining bits (0000001…) feed LogLog.
HyperLogLog improvement!
Greater precision compared to LogLog Computation can be distributed (each bucket processed separately)
HyperLogLog the maths!
HyperLogLog formal definition!
Let h : D → [0, 1] ≡ {0, 1}^∞ hash data from domain D to the binary domain
Let ρ(s), for s ∈ {0, 1}^∞, be the position of the leftmost 1-bit (ρ(0001···) = 4): the rank of the observed 0000…1 sequence
Let m = 2^b with b ∈ ℤ>0: m = number of buckets
Let M be a multiset of items from domain D: the set of elements whose cardinality we estimate
Algorithm HYPERLOGLOG
Initialize a collection of m registers, M1, …, Mm, to −∞
for each element v ∈ M do
• set x := h(v) //hash of v in binary form
• set j := 1 + ⟨x1x2···xb⟩2 //bucket number (1-based)
• set w := x(b+1)x(b+2)··· //bits for LogLog
• set Mj := max(Mj, ρ(w)) //longest 0000…1 position observed in bucket Bj
Compute Z = (Σ_{j=1..m} 2^(−Mj))^(−1) //what is that Z ?
Return n ≈ α_m·m²·Z, with α_m as given by Equation (3) //what is that α_m ? !
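Putting the pieces together, a compact sketch of the whole algorithm (our own constants and hash choice; the paper's Equation (3) gives the exact α_m, here approximated by the usual closed form for large m):

```python
import hashlib

B = 10                              # bucket bits b
M = 1 << B                          # m = 1024 registers
ALPHA = 0.7213 / (1 + 1.079 / M)    # usual alpha_m approximation for m >= 128

def rho(w, width):
    """1-based position of the leftmost 1 bit of a width-bit value."""
    return width - w.bit_length() + 1 if w else width + 1

def estimate(items):
    # Registers start at 0, standing in for the paper's -inf (rho >= 1).
    registers = [0] * M
    for item in items:
        x = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        j = x >> (64 - B)                 # first B bits pick the bucket
        w = x & ((1 << (64 - B)) - 1)     # remaining bits feed LogLog
        registers[j] = max(registers[j], rho(w, 64 - B))
    z = 1.0 / sum(2.0 ** -r for r in registers)
    return ALPHA * M * M * z

n = estimate(f"element-{i}" for i in range(100_000))
# Relative error should be around 1.04 / sqrt(1024), i.e. about 3%
```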
HyperLogLog maths workout!
Remember our intuition n ≈ m·H(M_j) ?
M_j = longest 0000…1 position observed for bucket j; H(M_j) ≈ n/m
Harmonic mean definition: H = m / (1/x1 + 1/x2 + … + 1/xm)
H = m / (1/x1 + 1/x2 + … + 1/xm) = m·(Σ_{j=1..m} 1/xj)^(−1) = m·(Σ_{j=1..m} xj^(−1))^(−1)
Compare H = m·(Σ_{j=1..m} xj^(−1))^(−1) with Z = (Σ_{j=1..m} 2^(−Mj))^(−1):
with xj = 2^(Mj), the cardinality estimate for bucket Bj, we get H = m·Z
Remember our intuition n ≈ m·H(M_j) ? n ≈ m·H ≈ m²·Z ☞ α_m·m²·Z
HyperLogLog harder maths!
What about the α_m constant ? You don't want to dig into that, trust me …
8 pages full of this, and this, and this … [equation-heavy excerpts omitted]
Compaction!
[Diagram, recap: SSTable1 (cardinality C1) and SSTable2 (cardinality C2) with their Bloom filters; given constant fpc, if cardinality = C1 + C2, then m = …]
Q & A
Thank You @doanduyhai