Cassandra Data Structures and Algorithms
DESCRIPTION
CRDT, Bloom Filter, Merkle Tree & HyperLogLog in Cassandra
TRANSCRIPT
@doanduyhai
Cassandra data structures & algorithms. DuyHai DOAN, Technical Advocate
Shameless self-promotion!
DuyHai DOAN, Cassandra technical advocate
• talks, meetups, confs
• open-source dev (Achilles, …)
• Cassandra technical point of contact
• Cassandra troubleshooting
Agenda!
Data structures
• CRDT
• Bloom filter
• Merkle tree
Algorithms
• HyperLogLog
Why Cassandra ?!
Linear scalability ≈ unbounded extensibility
• 1000+ node clusters
Continuous availability (≈ 100% up-time)
• resilient architecture (Dynamo)
• rolling upgrades
• data backward-compatible across n/n+1 versions
Multi-data centers
• out-of-the-box (config only)
• AWS conf for multi-region DCs
Operational simplicity
• 1 node = 1 process + 1 config file
Cassandra architecture!
Data-store layer
• Google BigTable paper
• columns / column families
Cluster layer
• Amazon Dynamo paper
• masterless architecture
[Diagram: a client request reaches a node; each node stacks API (CQL & RPC) on top of the cluster layer (Dynamo) on top of the data store (BigTable) on top of disks]
Data access!
By CQL query via the native protocol
• INSERT, UPDATE, DELETE, SELECT
• CREATE/ALTER/DROP TABLE
Always by partition key (#partition)
• partition == physical row
Data distribution!
Random: hash of #partition → token = hash(#partition)
Hash range: ]0, 2^127−1]
Each node (n1 … n8) owns 1/8 of ]0, 2^127−1]
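The token mapping can be sketched in a few lines (a hypothetical layout with 8 nodes n1 … n8 owning equal slices of the token range; an illustration, not Cassandra source code):

```python
import hashlib

RANGE = 2 ** 127                          # token space ]0, 2^127 - 1]
NODES = [f"n{i}" for i in range(1, 9)]    # n1 … n8, 1/8 of the ring each

def token(partition_key):
    # RandomPartitioner-style token: a 127-bit MD5-derived hash of the key.
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest, "big") % RANGE

def owner(partition_key):
    # Each node owns one contiguous 1/8 slice of the token range.
    return NODES[token(partition_key) * 8 // RANGE]
```

With a well-distributed hash, keys spread near-uniformly over the 8 nodes.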
Data replication!
Replication Factor = 3
[Diagram: the ring n1 … n8; each partition (1, 2, 3) is stored on 3 replicas]
Coordinator node!
Incoming requests (read/write): the coordinator node handles the request
Every node can be coordinator → masterless
[Diagram: a client request reaches a coordinator node, which forwards it to the 3 replicas]
CRDT!
by Marc Shapiro, 2011
INSERT!
Table « users »: ddoan | age: 33 | name: DuyHai DOAN
INSERT INTO users(login, name, age) VALUES('ddoan', 'DuyHai DOAN', 33);
'ddoan' is the #partition; age and name are column names
The columns get an auto-generated timestamp (μs): ddoan | age (t1): 33 | name (t1): DuyHai DOAN
UPDATE!
Table « users »
UPDATE users SET age = 34 WHERE login = 'ddoan';
File1: ddoan | age (t1): 33 | name (t1): DuyHai DOAN
File2: ddoan | age (t2): 34
DELETE!
Table « users »
DELETE age FROM users WHERE login = 'ddoan';
File1: ddoan | age (t1): 33 | name (t1): DuyHai DOAN
File2: ddoan | age (t2): 34
File3: ddoan | age (t3): ✕ (tombstone)
SELECT!
Table « users »
SELECT age FROM users WHERE login = 'ddoan';
Which file holds the right value ? ? ?
File3: ddoan | age (t3): ✕ (tombstone)
File1: ddoan | age (t1): 33 | name (t1): DuyHai DOAN
File2: ddoan | age (t2): 34
SELECT age FROM users WHERE login = 'ddoan';
The highest timestamp wins:
File3: ddoan | age (t3): ✕ (tombstone) ✓
File1: ddoan | age (t1): 33 | name (t1): DuyHai DOAN ✕
File2: ddoan | age (t2): 34 ✕
Cassandra columns!
look very similar to …
CRDT
CRDT Recap!
CRDT = Convergent Replicated Data Types (the state-based CvRDT flavour of Conflict-free Replicated Data Types)
Useful in distributed systems
Formal proof of strong « eventual convergence » of replicated data
A join semilattice (or just semilattice hereafter) is a partial order ≤v equipped with a least upper bound (LUB) ⊔v, defined as follows:
Definition 2.4. m = x ⊔v y is a Least Upper Bound of {x, y} under ≤v iff
• x ≤v m, and
• y ≤v m, and
• there is no m′ <v m such that x ≤v m′ and y ≤v m′
It follows from the definition that ⊔v is:
• commutative: x ⊔v y =v y ⊔v x
• idempotent: x ⊔v x =v x
• associative: (x ⊔v y) ⊔v z =v x ⊔v (y ⊔v z)
Definition 2.5 (Join Semilattice). An ordered set (S, ≤v) is a Join Semilattice iff ∀ x, y ∈ S, x ⊔v y exists.
Let's define S^t_{k,n} = the set of Cassandra columns identified by
• partition key k
• column name n
• assigned a timestamp t
The ordered set (S^t_{k,n}, max_t) is a Join Semilattice
Cassandra column as CRDT!
Proof:
• S1_{k,n} ≤ max_t(S1_{k,n}, S2_{k,n})
• S2_{k,n} ≤ max_t(S1_{k,n}, S2_{k,n})
• there is no Sx_{k,n} < max_t(S1_{k,n}, S2_{k,n}) such that S1_{k,n} ≤ Sx_{k,n} and S2_{k,n} ≤ Sx_{k,n}
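The semilattice laws can be checked mechanically with a tiny last-write-wins model of a column (an illustration of the max-timestamp merge, not Cassandra code):

```python
from collections import namedtuple

# A column value tagged with its write timestamp (microseconds).
Column = namedtuple("Column", ["timestamp", "value"])

def merge(a, b):
    """LUB under the timestamp order: the highest timestamp wins
    (namedtuples compare (timestamp, value) lexicographically)."""
    return max(a, b)

c1, c2, c3 = Column(1, 33), Column(2, 34), Column(3, 35)

assert merge(c1, c1) == c1                                   # idempotent
assert merge(c1, c2) == merge(c2, c1) == c2                  # commutative
assert merge(merge(c1, c2), c3) == merge(c1, merge(c2, c3))  # associative
```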
Idempotent ! [Diagram: node1, node2 and node3 each hold ddoan age (t2): 33; the coordinator's merge still yields age (t2): 33]
Commutative ! [Diagram: merging node1's age (t1): 33 with node2's age (t2): 34 yields age (t2): 34 at the coordinator, in either merge order]
Associative ! [Diagram: merging age (t1): 33, age (t2): 34 and age (t3): 35 from node1, node2 and node3 yields the same result at the coordinator whatever the grouping]
[Diagram: the SSTable files for partition ddoan; File1: address (t1): 12 rue de.., age (t1): 33; File2: age (t2): 34; File3: age (t3): 35; File4: address (t7): 17 avenue..]
S^t_{ddoan,age} = { age (t1): 33, age (t2): 34, age (t3): 35 }, ordered by t
S^t_{ddoan,address} = { address (t1): 12 rue de.., address (t7): 17 avenue.. }, ordered by t
Eventual convergence!
Proposition 2.1. Any two object replicas of a CvRDT eventually converge, assuming the system transmits payload infinitely often between pairs of replicas over eventually-reliable point-to-point channels.
!! « eventually-reliable point-to-point channels » → there is a network cable connecting 2 nodes …
« The system transmits payload infinitely often between pairs of replicas » → Repair
Strong hypothesis in the case of Cassandra CRDT: max_timestamp as the merge function !
Time is reliable … isn't it ? NTP server-side is mandatory
Q & A
Bloom filters!
by Burton Howard Bloom, 1970
Cassandra Write Path!
[Diagram, step 1: the mutation is appended to a commit log on disk (Commit log1 … Commit logn)]
[Diagram, step 2: the mutation is also written to the in-memory MemTable of its table (MemTable Table1 … MemTable TableN)]
[Diagram, step 3: when a MemTable is full, it is flushed to disk as an immutable SSTable (SSTable1, SSTable2, SSTable3 …); over time each table accumulates many SSTables]
Cassandra Read Path!
Either in memory, or hit disk (many SSTables)
How to optimize disk seeks ? Only read the necessary SSTables !
Bloom filters !
Bloom filters recap!
Space-efficient probabilistic data structure
Used for membership tests
No false negatives, possible false positives
Bloom filters in Cassandra!
For each SSTable, create a Bloom filter
Upon data insertion, populate it
Upon data retrieval, ask the Bloom filter whether the SSTable can be skipped
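A minimal sketch of such a per-SSTable Bloom filter (illustration only; deriving the k hash functions from salted MD5 is our choice here, not Cassandra's):

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k        # m bits, k hash functions
        self.bits = bytearray(m)     # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive k positions by hashing the key with k different salts.
        for i in range(self.k):
            h = hashlib.md5(b"%d:%s" % (i, key.encode())).hexdigest()
            yield int(h, 16) % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False: definitely absent; True: possibly present.
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter(m=1000, k=3)
bf.add("foo")
bf.add("bar")
assert bf.might_contain("foo")       # no false negatives
```

A read for a partition the filter reports absent can skip this SSTable entirely.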
Bloom filters in action!
Write #partition = foo: h1, h2, h3 each set one bit → 1 0 0 1 0 0 1 0 0 0
Write #partition = bar: h1, h2, h3 set three more bits (1* marks a bit already set by foo) → 1 0 0 1* 0 0 1 0 1 1
Read #partition = qux: check the bits at positions h1(qux), h2(qux), h3(qux)
Bloom filters maths!
After inserting one element with one hash function into an array of m bits (1 0 0 1 0 0 1 0 0 0):
• probability that a given bit is set to 1: 1/m
• probability that a given bit is still 0: 1 − 1/m
• probability with k hash functions that a given bit is still 0: (1 − 1/m)^k
• probability with k hash functions and n elements inserted that a given bit is still 0: (1 − 1/m)^kn
• probability with k hash functions and n elements inserted that a given bit is set to 1: 1 − (1 − 1/m)^kn
But why do we need to calculate the probability of a bit to be set to 1, then to be set to 0, then back to 1 again ? Because with many hash functions (k) and many elements (n), distinct elements collide on the same 1 bits !
For an element not in the SSTable, the probability that all k hash functions return 1 (the false positive chance, fpc):

fpc = (1 − (1 − 1/m)^kn)^k ≈ (1 − e^(−kn/m))^k

To minimize fpc: k_optimal ≈ (m/n)·ln(2)
Substituting k_optimal:

fpc = (1 − e^(−(m/n)·ln(2)·(n/m)))^((m/n)·ln(2)) = (1 − e^(ln(1/2)))^((m/n)·ln(2)) = (1/2)^((m/n)·ln(2))

ln(fpc) = (m/n)·ln(2)·ln(1/2) = −(m/n)·ln(2)²

m = n·ln(1/fpc) / ln(2)²
For n = 10^9 #partitions:
• fpc = 10% → m ≈ 500 MB
• fpc = 1% → m ≈ 1.2 GB
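The sizing formula is easy to check numerically (a small helper; the names are ours):

```python
import math

def bloom_sizing(n, fpc):
    """Return (m_bits, k) for n expected elements and target fpc."""
    m_bits = n * math.log(1 / fpc) / math.log(2) ** 2   # m = n*ln(1/fpc)/ln(2)^2
    k = (m_bits / n) * math.log(2)                      # k_optimal = (m/n)*ln(2)
    return m_bits, k

m_bits, k = bloom_sizing(n=10**9, fpc=0.01)
# m_bits / 8 is roughly 1.2e9 bytes, matching the 1.2 GB quoted above,
# with k close to 7 hash functions.
```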
Bloom filters (notes)!
Cannot remove elements once inserted (a 1 bit may be shared by several elements)
• cannot resize
• false positives increase with load
Q & A
Merkle tree!
by Ralph Merkle, 1987
Repairing data!
Repair: « the system transmits payload infinitely often between pairs of replicas »
Why repair ? Data diverges between replicas because of:
• writes at low consistency level for perf
• nodes down
• network down
• dropped writes
Compare full data ?
• read all data
• I/O intensive
• network intensive (streaming is expensive)
Compare digests ?
• read all data
• I/O intensive
• network intensive (streaming is expensive)
Merkle tree!
Tree of digests
• leaf nodes: digest of data
• non-leaf nodes: digest of the children's digests
• tree resolution = number of leaf nodes = 2^depth
Merkle tree in action!
Depth = 15, resolution = 32 768 leaf nodes
[Diagram: root → internal nodes → leaf1, leaf2, leaf3 …; each leaf covers an n-partition bucket]
Repair process:
• send the tree to replicas
• compare digests, starting from the root node
• if mismatch, stream the partition bucket(s) that differ
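The build-and-compare idea can be sketched as follows (an illustration, not Cassandra's repair code; a real diff walks down the tree instead of scanning all leaves, and the bucket count is assumed to be a power of two):

```python
import hashlib

def digest(data):
    return hashlib.sha256(data).digest()

def build_tree(buckets):
    """Leaves digest partition buckets; parents digest their children."""
    level = [digest(b) for b in buckets]
    tree = [level]
    while len(level) > 1:
        level = [digest(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        tree.append(level)
    return tree                      # tree[-1][0] is the root digest

def diff_leaves(t1, t2):
    """Indices of the leaf buckets whose digests differ."""
    if t1[-1][0] == t2[-1][0]:
        return []                    # identical roots: nothing to stream
    return [i for i, (a, b) in enumerate(zip(t1[0], t2[0])) if a != b]

replica1 = [b"p0 p1", b"p2 p3", b"p4 p5", b"p6 p7"]
replica2 = [b"p0 p1", b"p2 XX", b"p4 p5", b"p6 p7"]   # bucket 1 diverged
assert diff_leaves(build_tree(replica1), build_tree(replica2)) == [1]
```

Only the buckets whose digests differ need to be streamed between replicas.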
If mismatch, stream the partition bucket(s) that differ. Example:
• 327 680 partitions
• resolution = 32 768 → 10 partitions per leaf bucket
• 1 column differs in 1 partition → 10 partitions streamed
Over-streaming nightmare!
Improve tree resolution by increasing depth (dynamically)
[Diagram: the same tree before and after one extra level of depth; twice as many leaves, so each leaf bucket covers fewer partitions]
Improve tree resolution by repairing by partition ranges
[Diagram: one Merkle tree per partition range; with the same number of leaves over a smaller range, each leaf bucket holds fewer partitions]
Q & A
HyperLogLog!
by the late Philippe Flajolet, 2007
Cassandra Read Path!
Remember that ? [Diagram: each table's data spread across many SSTables on disk]
Even Bloom filters can't save you if data spills over many SSTables
Compaction !
Compaction!
Algorithm:
• take n SSTables
• load data in memory
• for each S^t_{k,n}, apply the merge function (max_timestamp)
• remove (when applicable) tombstones
• build a new SSTable
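The merge-then-drop-tombstones steps can be sketched like this (a toy model keyed by (partition, column); Cassandra's real compaction is streaming and far more involved):

```python
TOMBSTONE = object()   # marker for a deleted cell

def compact(sstables):
    """sstables: list of dicts {(partition, column): (timestamp, value)}."""
    merged = {}
    for sstable in sstables:
        for key, (ts, value) in sstable.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)          # last write wins
    # Drop cells whose winning value is a tombstone.
    return {k: v for k, v in merged.items() if v[1] is not TOMBSTONE}

file1 = {("ddoan", "age"): (1, 33), ("ddoan", "name"): (1, "DuyHai DOAN")}
file2 = {("ddoan", "age"): (2, 34)}
file3 = {("ddoan", "age"): (3, TOMBSTONE)}
result = compact([file1, file2, file3])
assert result == {("ddoan", "name"): (1, "DuyHai DOAN")}   # age fully removed
```

This mirrors the earlier SELECT example: age (t3) is the tombstone with the highest timestamp, so the age cell disappears from the new SSTable.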
Build a new SSTable → allocate memory for a new Bloom filter
But how large should the new Bloom filter be ?
[Diagram: the Bloom filters of SSTable1 and SSTable2; double the size ? in between ? the same size ?]
Bloom filter size depends on … element cardinality (fpc held constant)
If we can count the distinct elements in SSTable1 & SSTable2, we can allocate the new Bloom filter
[Diagram: SSTable1 with cardinality C1, SSTable2 with cardinality C2; given constant fpc, if cardinality = C1 + C2, then m = …]
But counting exact cardinality is memory-expensive …
Can't we have a cardinality estimate ?
Cardinality estimators!
Counter                        Bytes used   Error
Java HashSet                   10 447 016   0%
Linear Probabilistic Counter        3 384   1%
HyperLogLog                           512   3%
credits: http://highscalability.com/
LogLog intuition!
1) given a well-distributed hash function h
2) given a sufficiently high number of elements n
For a set of n elements, look at the bit patterns of h(element_i), ∀ i ∈ [1, n]:
• 0xxxxx… ≈ n/2, 1xxxxx… ≈ n/2
• 00xxxx…, 01xxxx…, 10xxxx…, 11xxxx…: ≈ n/4 each
• 000xxx…, 001xxx…, …, 111xxx…: ≈ n/8 each
Flip the reasoning around: if we see a hash like 000 000 000 1…, then since the hash distribution is uniform we should also have seen 000 000 001 0…, 000 000 001 1…, 000 000 010 0…, …, 111 111 111 1…, hence an estimated cardinality of n ≈ 2^10 elements.
Toy example: n = 8, hashes of 8 elements, 3 bits long: 000, 001, 010, 011, 100, 101, 110, 111
Uniform hash → equi-probability of each combination
If I observed 001, I should have seen 000 too, and 010 too … in fact the 7 other combinations
If I observed 001, n ≈ 8 (2^3)
1) given a well-distributed hash function h
2) given a sufficiently high number of elements n
If I find a hash starting with
• 01…, it's likely that there are 2^2 distinct elements (n = 2^2)
• 001…, it's likely that there are 2^3 distinct elements (n = 2^3)
• 0001…, it's likely that there are 2^4 distinct elements (n = 2^4)
• …
• 0000000001…, it's likely that there are 2^r distinct elements (n = 2^r), r being the position of the leftmost 1 bit
max(r) = the largest such position observed among all hash values
n ≈ 2^max(r)
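The observable behind this estimate can be sketched directly (the rank function and the hash choice here are ours, for illustration):

```python
import hashlib

def rho(x, width=32):
    """1-based position of the leftmost 1 bit in a width-bit value."""
    for pos in range(1, width + 1):
        if x & (1 << (width - pos)):
            return pos
    return width + 1                 # all-zero value

def h32(item):
    # 32-bit hash derived from sha1 (any well-distributed hash would do).
    return int.from_bytes(hashlib.sha1(item.encode()).digest()[:4], "big")

assert rho(1 << 28, 32) == 4         # a hash starting 0001… has rank 4

# Naive single-observable estimate: noisy, as the next slides show.
max_r = max(rho(h32(f"element-{i}")) for i in range(1024))
estimate = 2 ** max_r
```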
Still, it's a terrible estimate … What if we have these hash values for n = 16:
• 10 × 010…
• 5 × 100…
• 1 × 000 000 001…
n ≈ 2^max(r) ≈ 2^9 ≈ 512 ?
→ sensitive to outliers & skewed distributions
HyperLogLog intuition!
To eliminate outliers … use harmonic mean !
credits: http://economistatlarge.com
Harmonic mean definition (thank you Wikipedia):
H = m / (1/x1 + 1/x2 + … + 1/xm)
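A quick numeric check of why the harmonic mean helps (hypothetical bucket estimates, with one outlier):

```python
# One outlier estimate (512) dominates the arithmetic mean but
# barely moves the harmonic mean.
bucket_estimates = [2, 2, 2, 512]

arithmetic = sum(bucket_estimates) / len(bucket_estimates)
harmonic = len(bucket_estimates) / sum(1 / x for x in bucket_estimates)

assert arithmetic == 129.5     # dragged up by the outlier
assert harmonic < 3            # stays close to the typical value
```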
First, split the set into m = 2^b buckets; the bucket number is determined by the first b bits of the hash. With b = 6 that gives m = 64 buckets B1, B2, … B64 (index is 1-based): h(element) = 001001 0100… lands in bucket ⟨001001⟩2 + 1 = B10.
Example with m = 8 (2^3) buckets:
000xxx… → B1, 001xxx… → B2, 010xxx… → B3, 011xxx… → B4, 100xxx… → B5, 101xxx… → B6, 110xxx… → B7, 111xxx… → B8
New intuition:
• in each bucket j, there are ≈ M_j elements
• harmonic mean H(M_j) ≈ n/m
• so n ≈ m·H(M_j)
But how do we calculate each M_j ? Use LogLog !
On each hash value 001100 0000001…, the first bits (001100) choose the bucket B_j and the remaining bits (0000001…) feed LogLog.
HyperLogLog improvement!
Greater precision compared to LogLog Computation can be distributed (each bucket processed separately)
HyperLogLog the maths!
HyperLogLog formal definition!
Let h : D → [0, 1] ≡ {0, 1}^∞ hash data from domain D to the binary domain
Let ρ(s), for s ∈ {0, 1}^∞, be the position of the leftmost 1-bit (ρ(0001···) = 4): the rank of the observed 0000…1 sequence
Let m = 2^b with b ∈ ℤ>0: m = number of buckets
Let M be a multiset of items from domain D: the set of elements whose cardinality we estimate
Algorithm HYPERLOGLOG
Initialize a collection of m registers, M1, …, Mm, to −∞
for each element v ∈ M do
• set x := h(v) //hash of v in binary form
• set j := 1 + ⟨x1x2···xb⟩2 //bucket number (1-based)
• set w := x(b+1)x(b+2)··· //bits for LogLog
• set Mj := max(Mj, ρ(w)) //longest 0000…1 position observed in bucket Bj
Compute Z = (Σ_{j=1..m} 2^(−Mj))^(−1) //what is that Z ?
Return n ≈ α_m·m²·Z, with α_m as given by Equation (3) //what is that α_m ? !
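Putting the pieces together, a compact sketch of the whole algorithm (our own constants and hash choice; the paper's Equation (3) gives the exact α_m, here approximated by the usual closed form for large m):

```python
import hashlib

B = 10                              # bucket bits b
M = 1 << B                          # m = 1024 registers
ALPHA = 0.7213 / (1 + 1.079 / M)    # usual alpha_m approximation for m >= 128

def rho(w, width):
    """1-based position of the leftmost 1 bit of a width-bit value."""
    return width - w.bit_length() + 1 if w else width + 1

def estimate(items):
    # Registers start at 0, standing in for the paper's -inf (rho >= 1).
    registers = [0] * M
    for item in items:
        x = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        j = x >> (64 - B)                 # first B bits pick the bucket
        w = x & ((1 << (64 - B)) - 1)     # remaining bits feed LogLog
        registers[j] = max(registers[j], rho(w, 64 - B))
    z = 1.0 / sum(2.0 ** -r for r in registers)
    return ALPHA * M * M * z

n = estimate(f"element-{i}" for i in range(100_000))
# Relative error should be around 1.04 / sqrt(1024), i.e. about 3%
```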
HyperLogLog maths workout!
Remember our intuition n ≈ m·H(M_j) ?
M_j = longest 0000…1 position observed for bucket j; H(M_j) ≈ n/m
Harmonic mean definition: H = m / (1/x1 + 1/x2 + … + 1/xm)
H = m / (1/x1 + 1/x2 + … + 1/xm) = m·(Σ_{j=1..m} 1/xj)^(−1) = m·(Σ_{j=1..m} xj^(−1))^(−1)
Compare H = m·(Σ_{j=1..m} xj^(−1))^(−1) with Z = (Σ_{j=1..m} 2^(−Mj))^(−1):
with xj = 2^(Mj), the cardinality estimate for bucket Bj, we get H = m·Z
Remember our intuition n ≈ m·H(M_j) ? n ≈ m·H ≈ m²·Z ☞ α_m·m²·Z
HyperLogLog harder maths!
What about the α_m constant ? You don't want to dig into that, trust me …
8 pages full of this, and this, and this … [equation-heavy excerpts omitted]
Compaction!
[Diagram, recap: SSTable1 (cardinality C1) and SSTable2 (cardinality C2) with their Bloom filters; given constant fpc, if cardinality = C1 + C2, then m = …]
Q & A
Thank You @doanduyhai