cassandra data structures and algorithms

@doanduyhai Cassandra data structures & algorithms DuyHai DOAN, Technical Advocate

Upload: duyhai-doan

Post on 01-Jul-2015


DESCRIPTION

CRDT, Bloom Filter, Merkle Tree & HyperLogLog in Cassandra

TRANSCRIPT

Page 1: Cassandra data structures and algorithms

@doanduyhai

Cassandra data structures & algorithms DuyHai DOAN, Technical Advocate

Page 2: Cassandra data structures and algorithms

@doanduyhai

Shameless self-promotion!

2

Duy Hai DOAN Cassandra technical advocate •  talks, meetups, confs •  open-source devs (Achilles, …) •  Cassandra technical point of contact •  Cassandra troubleshooting

Page 3: Cassandra data structures and algorithms

@doanduyhai

Agenda!

3

Data structures •  CRDT •  Bloom filter •  Merkle tree Algorithms •  HyperLogLog

Page 4: Cassandra data structures and algorithms

@doanduyhai

Why Cassandra ?!

4

Linear scalability ≈ unbounded extensivity •  1k+ nodes cluster

Page 5: Cassandra data structures and algorithms

@doanduyhai

Why Cassandra ?!

5

Continuous availability (≈100% up-time) •  resilient architecture (Dynamo) •  rolling upgrades •  data backward compatible n/n+1 versions

Page 6: Cassandra data structures and algorithms

@doanduyhai

Why Cassandra ?!

6

Multi-data centers •  out-of-the-box (config only) •  AWS conf for multi-region DCs Operational simplicity •  1 node = 1 process + 1 config file

Page 7: Cassandra data structures and algorithms

@doanduyhai

Cassandra architecture!

7

Data-store layer •  Google Bigtable paper •  Columns / Column Families Cluster layer •  Amazon Dynamo paper •  masterless architecture

Page 8: Cassandra data structures and algorithms

@doanduyhai

Cassandra architecture!

8

[Diagram: a client request arriving at a two-node cluster (Node1, Node2). Each node has four layers: API (CQL & RPC), CLUSTER (Dynamo), DATA STORE (Bigtable), DISKS]

Page 9: Cassandra data structures and algorithms

@doanduyhai

Data access!

9

By CQL query via native protocol •  INSERT, UPDATE, DELETE, SELECT •  CREATE/ALTER/DROP TABLE

Always by partition key (#partition) •  partition == physical row

Page 10: Cassandra data structures and algorithms

@doanduyhai

Data distribution!

10

Random: hash of #partition → token = hash(#p)
Hash: ]0, $2^{127}-1$]
Each node: 1/8 of ]0, $2^{127}-1$]

n1

n2

n3

n4

n5

n6

n7

n8

Page 11: Cassandra data structures and algorithms

@doanduyhai

Data replication!

11

Replication Factor = 3

n1

n2

n3

n4

n5

n6

n7

n8

1

2

3
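The two slides above (token assignment, then replication factor 3) can be sketched in a few lines. This is only an illustration under stated assumptions: MD5 stands in for the partitioner's hash, the 8 nodes own equal slices of the ring, and replicas are simply the next nodes clockwise. The names `token`, `replicas`, `NODES` and `NODE_TOKENS` are hypothetical and not Cassandra APIs.

```python
import hashlib
from bisect import bisect_left

RING_SIZE = 2**127  # tokens fall in the range shown on the slide

def token(partition_key):
    """Map a partition key to a token (MD5 is an illustrative stand-in hash)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE

# Hypothetical 8-node ring: node i owns the slice ending at NODE_TOKENS[i].
NODES = [f"n{i}" for i in range(1, 9)]
NODE_TOKENS = [(i * RING_SIZE) // 8 for i in range(1, 9)]

def replicas(partition_key, replication_factor=3):
    """Owner of the token, followed by the next nodes clockwise on the ring."""
    owner = bisect_left(NODE_TOKENS, token(partition_key)) % len(NODES)
    return [NODES[(owner + i) % len(NODES)] for i in range(replication_factor)]

print(replicas("ddoan"))  # three consecutive nodes on the ring
```

Any node can compute this mapping locally, which is what lets every node act as a coordinator.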

Page 12: Cassandra data structures and algorithms

@doanduyhai

Coordinator node!

Incoming requests (read/write)
Coordinator node handles the request

Every node can be coordinator → masterless

n1

n2

n3

n4

n5

n6

n7

n8

1

2

3

coordinator request

12

Page 13: Cassandra data structures and algorithms

CRDT!

by Marc Shapiro, 2011

Page 14: Cassandra data structures and algorithms

@doanduyhai

INSERT!

14

Table « users »

ddoan age name

33 DuyHai DOAN

INSERT INTO users(login, name, age) VALUES('ddoan', 'DuyHai DOAN', 33);

Page 15: Cassandra data structures and algorithms

@doanduyhai

INSERT!

15

Table « users »

ddoan age name

33 DuyHai DOAN

INSERT INTO users(login, name, age) VALUES('ddoan', 'DuyHai DOAN', 33);

#partition column names

Page 16: Cassandra data structures and algorithms

@doanduyhai

INSERT!

ddoan age (t1) name (t1)

33 DuyHai DOAN

16

Table « users »

INSERT INTO users(login, name, age) VALUES('ddoan', 'DuyHai DOAN', 33);

auto-generated timestamp (μs)

.

Page 17: Cassandra data structures and algorithms

@doanduyhai

UPDATE!

17

Table « users »

UPDATE users SET age = 34 WHERE login = 'ddoan';

File1: ddoan → age (t1) = 33, name (t1) = DuyHai DOAN
File2: ddoan → age (t2) = 34

Page 18: Cassandra data structures and algorithms

@doanduyhai

DELETE!

18

Table « users »

DELETE age FROM users WHERE login = 'ddoan';

File1: ddoan → age (t1) = 33, name (t1) = DuyHai DOAN
File2: ddoan → age (t2) = 34
File3: ddoan → age (t3) = ✕ (tombstone)

Page 19: Cassandra data structures and algorithms

@doanduyhai

SELECT!

19

Table « users »

SELECT age FROM users WHERE login = 'ddoan';

File1: ddoan → age (t1) = 33, name (t1) = DuyHai DOAN ?
File2: ddoan → age (t2) = 34 ?
File3: ddoan → age (t3) = ✕ (tombstone) ?

Page 20: Cassandra data structures and algorithms

@doanduyhai

SELECT!

20

Table « users »

SELECT age FROM users WHERE login = 'ddoan';

File1: ddoan → age (t1) = 33, name (t1) = DuyHai DOAN ✕
File2: ddoan → age (t2) = 34 ✕
File3: ddoan → age (t3) = ✕ (tombstone) ✓
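The read resolution shown on the SELECT slides can be sketched as a last-write-wins merge over the three files. This is a simplified model, not Cassandra's code: `Cell`, `merge` and `read` are hypothetical names, timestamps are small integers standing for t1 < t2 < t3, and timestamp ties are not handled the way a real implementation would need to.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Cell:
    value: Optional[object]   # None when the cell is a tombstone
    timestamp: int            # write timestamp (microseconds in Cassandra)
    tombstone: bool = False

def merge(a, b):
    """Last write wins: keep the cell with the highest timestamp."""
    return a if a.timestamp >= b.timestamp else b

def read(cells):
    """Reconcile all versions of one column; a winning tombstone reads as 'no value'."""
    winner = cells[0]
    for cell in cells[1:]:
        winner = merge(winner, cell)
    return None if winner.tombstone else winner.value

file1 = Cell(33, timestamp=1)                      # age (t1) = 33
file2 = Cell(34, timestamp=2)                      # age (t2) = 34
file3 = Cell(None, timestamp=3, tombstone=True)    # age (t3) = tombstone

print(read([file1, file2, file3]))  # None: the tombstone at t3 shadows older values
print(read([file1, file2]))         # 34
```

The order in which the files are visited does not matter, which is the property the following CRDT slides formalize.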

Page 21: Cassandra data structures and algorithms

@doanduyhai

Cassandra columns!

21

look very similar to …

CRDT

Page 22: Cassandra data structures and algorithms

@doanduyhai

CRDT Recap!

22

CRDT = Conflict-free Replicated Data Type; the state-based variant discussed here is the Convergent Replicated Data Type (CvRDT)
Useful in distributed systems
Formal proof of strong « eventual convergence » of replicated data

Page 23: Cassandra data structures and algorithms

@doanduyhai

CRDT Recap!

23

A join semilattice (or just semilattice hereafter) is a partial order $\le_v$ equipped with a least upper bound (LUB) $\sqcup_v$, defined as follows:

Definition 2.4. $m = x \sqcup_v y$ is a Least Upper Bound of $\{x, y\}$ under $\le_v$ iff
•  $x \le_v m$ and
•  $y \le_v m$ and
•  there is no $m' \le_v m$ such that $x \le_v m'$ and $y \le_v m'$

It follows from the definition that $\sqcup_v$ is:
•  commutative: $x \sqcup_v y =_v y \sqcup_v x$
•  idempotent: $x \sqcup_v x =_v x$
•  associative: $(x \sqcup_v y) \sqcup_v z =_v x \sqcup_v (y \sqcup_v z)$

Page 24: Cassandra data structures and algorithms

@doanduyhai

CRDT Recap!

24

Definition 2.5 (Join Semilattice). An ordered set (S, ≤v) is a Join Semilattice iff ∀x,y ∈ S, x ⊔v y exists.

Let's define $S^t_{k,n}$ = set of Cassandra columns identified by

•  partition key k
•  column name n
•  assigned a timestamp t

The ordered set ($S^t_{k,n}$, $\max_t$) is a Join Semilattice

#partition column name(t1)

… #partition

column name(t2)

Page 25: Cassandra data structures and algorithms

@doanduyhai

Cassandra column as CRDT!

25

Proof:
•  $S^1_{k,n} \le \max_t(S^1_{k,n}, S^2_{k,n})$
•  $S^2_{k,n} \le \max_t(S^1_{k,n}, S^2_{k,n})$
•  there is no $S^x_{k,n} \le \max_t(S^1_{k,n}, S^2_{k,n})$ such that $S^1_{k,n} \le S^x_{k,n}$ and $S^2_{k,n} \le S^x_{k,n}$
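The three semilattice properties can also be checked mechanically on a small sample. The sketch below assumes a column version is modelled as a (timestamp, value) pair and that the least upper bound is the pair with the larger timestamp; `lub` and `cells` are illustrative names only, and an exhaustive check on three cells is a sanity test, not a proof.

```python
from itertools import product

def lub(x, y):
    """Least upper bound of two (timestamp, value) pairs: the newer one wins."""
    return max(x, y)  # tuples compare on timestamp first

cells = [(1, 33), (2, 34), (3, 35)]  # age (t1), age (t2), age (t3)

commutative = all(lub(x, y) == lub(y, x) for x, y in product(cells, repeat=2))
idempotent = all(lub(x, x) == x for x in cells)
associative = all(
    lub(lub(x, y), z) == lub(x, lub(y, z)) for x, y, z in product(cells, repeat=3)
)

print(commutative, idempotent, associative)  # True True True
```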

Page 26: Cassandra data structures and algorithms

@doanduyhai

Cassandra column as CRDT!

26

Idempotent ! ddoan

age (t2)

33 ddoan

age (t2)

33 ddoan

age (t2)

33

ddoan age (t2)

33

node1 node2 node3

coordinator

Page 27: Cassandra data structures and algorithms

@doanduyhai

Cassandra column as CRDT!

27

Commutative !ddoan

age (t1)

33 ddoan

age (t2)

34

ddoan age (t2)

34

node1 node2

coordinator

ddoan age (t2)

34 ddoan

age (t1)

33

node2 node1

ddoan age (t2)

34

coordinator

Page 28: Cassandra data structures and algorithms

@doanduyhai

Cassandra column as CRDT!

28

Associative !ddoan

age (t1)

33 ddoan

age (t2)

34

ddoan age (t3)

35 ddoan

age (t2)

34

node1 node2

node3

coordinator

Page 29: Cassandra data structures and algorithms

@doanduyhai

Cassandra column as CRDT!

29

ddoan address(t1) age (t1)

12 rue de.. 33 ddoan

age (t2)

34

ddoan age (t3)

35 ddoan

address(t7)

17 avenue..

File1 File2

File3 File4

Page 30: Cassandra data structures and algorithms

@doanduyhai

Cassandra column as CRDT!

30

$S^t_{ddoan,age}$ = { age (t1) = 33, age (t2) = 34, age (t3) = 35 }, ordered by t

$S^t_{ddoan,address}$ = { address (t1) = 12 rue de.., address (t7) = 17 avenue.. }, ordered by t

Page 31: Cassandra data structures and algorithms

@doanduyhai

Eventual convergence!

31

Proposition 2.1. Any two object replicas of a CvRDT eventually converge, assuming the system transmits payload infinitely often between pairs of replicas over eventually-reliable point-to-point channels.

Page 32: Cassandra data structures and algorithms

@doanduyhai

Eventual convergence!

32

Proposition 2.1. Any two object replicas of a CvRDT eventually converge, assuming the system transmits payload infinitely often between pairs of replicas over eventually-reliable point-to-point channels.

Eventually-reliable point-to-point channels → there is a network cable connecting 2 nodes …

Page 33: Cassandra data structures and algorithms

@doanduyhai

Eventual convergence!

33

The system transmits payload infinitely often between pairs of replicas

Page 34: Cassandra data structures and algorithms

@doanduyhai

Eventual convergence!

34

Repair

The system transmits payload infinitely often between pairs of replicas

Page 35: Cassandra data structures and algorithms

@doanduyhai

Eventual convergence!

35

Strong hypothesis in the case of Cassandra CRDT !

Page 36: Cassandra data structures and algorithms

@doanduyhai

Eventual convergence!

36

$\max_{timestamp}$ as merge function !

Strong hypothesis in the case of Cassandra CRDT !

Page 37: Cassandra data structures and algorithms

@doanduyhai

Eventual convergence!

37

Time is reliable … isn’t it ? !

Page 38: Cassandra data structures and algorithms

@doanduyhai

Eventual convergence!

38

NTP server-side mandatory
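Why clock synchronization matters for a max-timestamp merge can be shown with a tiny scenario. The numbers are hypothetical: one node's clock is assumed to lag 5 seconds behind, and `merge` is the same illustrative last-write-wins function as in the slides, not a Cassandra API.

```python
def merge(a, b):
    """Last write wins on (timestamp_us, value) pairs."""
    return max(a, b, key=lambda cell: cell[0])

SKEW_US = 5_000_000            # assumed: node B's clock runs 5 s behind node A's
real_time_us = 1_000_000_000   # arbitrary wall-clock instant of the first write

# First write, timestamped by node A (correct clock).
first_write = (real_time_us, "age=33")
# Second write happens 1 s later in real time, but is timestamped by lagging node B.
second_write = (real_time_us + 1_000_000 - SKEW_US, "age=34")

winner = merge(first_write, second_write)
print(winner[1])  # age=33: the later write is silently lost
```

The merge function is still a valid semilattice join; it simply orders writes by a timestamp that no longer reflects real time.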

Page 39: Cassandra data structures and algorithms

Q & A

Page 40: Cassandra data structures and algorithms

Bloom filters!

by Burton Howard Bloom, 1970

Page 41: Cassandra data structures and algorithms

@doanduyhai

Cassandra Write Path!

41

Commit log1

. . .

1

Commit log2

Commit logn

Memory

Page 42: Cassandra data structures and algorithms

@doanduyhai

Cassandra Write Path!

42

Memory

Commit log1

. . .

1

Commit log2

Commit logn

MemTable Table1

MemTable Table2

MemTable TableN

2

. . .

Page 43: Cassandra data structures and algorithms

@doanduyhai

Cassandra Write Path!

43

Commit log1

Commit log2

Commit logn

Table1

SStable1

Table2 Table3

SStable2 SStable3 3

Memory

. . .

Page 44: Cassandra data structures and algorithms

@doanduyhai

Cassandra Write Path!

44

Commit log1

Commit log2

Commit logn

Table1

SStable1

Table2 Table3

SStable2 SStable3

Memory . . . MemTable Table1

MemTable Table2

MemTable TableN

. . .

Page 45: Cassandra data structures and algorithms

@doanduyhai

Cassandra Write Path!

45

Commit log1

Commit log2

Commit logn

Table1

SStable1

Table2 Table3

SStable2 SStable3

Memory

SStable1

SStable2

SStable3 . . .

Page 46: Cassandra data structures and algorithms

@doanduyhai

Cassandra Read Path!

46

Either in memory

Page 47: Cassandra data structures and algorithms

@doanduyhai

Cassandra Read Path!

47

Either in memory

or hit disk (many SSTables)

Page 48: Cassandra data structures and algorithms

@doanduyhai

Cassandra Read Path!

48

How to optimize disk seeks ?

Page 49: Cassandra data structures and algorithms

@doanduyhai

Cassandra Read Path!

49

Only read necessary SSTables !

Page 50: Cassandra data structures and algorithms

@doanduyhai

Cassandra Read Path!

50

Bloom filters !

Page 51: Cassandra data structures and algorithms

@doanduyhai

Bloom filters recap!

51

Space-efficient probabilistic data structure. Used for membership tests. No false negatives, but false positives are possible

Page 52: Cassandra data structures and algorithms

@doanduyhai

Bloom filters in Cassandra!

52

For each SSTable, create a Bloom filter. Upon data insertion, populate it. Upon data retrieval, ask the Bloom filter whether the SSTable can be skipped

Page 53: Cassandra data structures and algorithms

@doanduyhai

Bloom filters in action!

53

1 0 0 1 0 0 1 0 0 0

#partition =  foo 

h2 h3

Write

h1

Page 54: Cassandra data structures and algorithms

@doanduyhai

Bloom filters in action!

54

1 0 0 1* 0 0 1 0 1 1

#partition =  foo 

h1 h2 h3

#partition =  bar  Write

Page 55: Cassandra data structures and algorithms

@doanduyhai

Bloom filters in action!

55

1 0 0 1* 0 0 1 0 1 1

#partition =  foo 

h1 h2 h3

#partition =  bar  Write

Read #partition =  qux 
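The write and read steps pictured above can be sketched as a minimal Bloom filter. This is an illustration under assumptions, not Cassandra's implementation: the k hash functions are derived by salting SHA-256 with an index, and `BloomFilter`, `add` and `might_contain` are hypothetical names.

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits, k_hashes):
        self.m = m_bits
        self.k = k_hashes
        self.bits = [0] * m_bits

    def _positions(self, key):
        # Derive k bit positions by salting one hash function with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True only means "maybe present".
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter(m_bits=1024, k_hashes=3)
bf.add("foo")
bf.add("bar")
print(bf.might_contain("foo"))  # True: inserted keys are always found
print(bf.might_contain("qux"))  # usually False, unless its 3 bits happen to collide
```

On a read, a `False` answer lets the SSTable be skipped entirely; a `True` answer still requires looking in the SSTable.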

Page 56: Cassandra data structures and algorithms

@doanduyhai

Bloom filters maths!

56

Page 57: Cassandra data structures and algorithms

@doanduyhai

Bloom filters maths!

57

1 0 0 1 0 0 1 0 0 0 (m bits)

probability of a bit to be set to 1: $\frac{1}{m}$

probability of a bit to be set to 0: $1 - \frac{1}{m}$

Page 58: Cassandra data structures and algorithms

@doanduyhai

Bloom filters maths!

58

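probability, with k hash functions, of the bit to be set to 0: $\left(1 - \frac{1}{m}\right)^{k}$

probability, with k hash functions and n elements inserted, of the bit to still be 0: $\left(1 - \frac{1}{m}\right)^{kn}$

probability, with k hash functions and n elements inserted, of the bit to be set to 1: $1 - \left(1 - \frac{1}{m}\right)^{kn}$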

Page 59: Cassandra data structures and algorithms

@doanduyhai

Bloom filters maths!

59

But why do we need to calculate probability of a bit: •  to be set to 1 •  then to be set to 0 •  then back to 1 again ?

Page 60: Cassandra data structures and algorithms

@doanduyhai

Bloom filters maths!

60

Because of bits colliding on 1 when applying many k & n !

Page 61: Cassandra data structures and algorithms

@doanduyhai

Bloom filters maths!

61

For an element not in the SSTable, probability that all k hash functions return 1 (false positive chance, fpc):

$fpc = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k} \approx \left(1 - e^{-kn/m}\right)^{k}$

To minimize fpc: $k_{optimal} \approx \frac{m}{n}\ln(2)$

Page 62: Cassandra data structures and algorithms

@doanduyhai

Bloom filters maths!

62

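$fpc = \left(1 - e^{-\frac{m}{n}\ln(2)\cdot\frac{n}{m}}\right)^{\frac{m}{n}\ln(2)} = \left(1 - e^{\ln(\frac{1}{2})}\right)^{\frac{m}{n}\ln(2)} = \left(\frac{1}{2}\right)^{\frac{m}{n}\ln(2)}$

$\ln(fpc) = \frac{m}{n}\ln\left(\frac{1}{2}\right)\ln(2) = -\frac{m}{n}\ln(2)^2$

$m = \frac{n\ln\left(\frac{1}{fpc}\right)}{\ln(2)^2}$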

Page 63: Cassandra data structures and algorithms

@doanduyhai

Bloom filters maths!

63

$m = \frac{n\ln\left(\frac{1}{fpc}\right)}{\ln(2)^2}$

For n = $10^9$ #partitions
•  fpc = 10%, m ≈ 4.8 × 10^9 bits ≈ 600 MB
•  fpc = 1%, m ≈ 9.6 × 10^9 bits ≈ 1.2 GB
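The sizing formula is easy to evaluate directly. The sketch below plugs n = 10^9 into m = n ln(1/fpc) / ln(2)^2 and also reports the matching optimal number of hash functions; `bloom_bits` and `optimal_hashes` are illustrative helper names.

```python
import math

def bloom_bits(n, fpc):
    """Filter size in bits: m = n * ln(1/fpc) / ln(2)^2."""
    return n * math.log(1 / fpc) / math.log(2) ** 2

def optimal_hashes(m, n):
    """Number of hash functions minimizing fpc: k = (m / n) * ln(2)."""
    return (m / n) * math.log(2)

n = 10**9
for fpc in (0.10, 0.01):
    m = bloom_bits(n, fpc)
    gigabytes = m / 8 / 1e9
    print(f"fpc={fpc:.0%}: m = {m:.3e} bits ({gigabytes:.2f} GB), k = {optimal_hashes(m, n):.1f}")
```

For one billion partitions this gives roughly 0.6 GB at 10% and 1.2 GB at 1%, so dividing the false positive chance by ten costs about twice the memory.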

Page 64: Cassandra data structures and algorithms

@doanduyhai

Bloom filters (notes)!

64

Cannot remove elements once inserted (1-bit colliding) •  cannot resize •  collision increases with load

Page 65: Cassandra data structures and algorithms

Q & A

Page 66: Cassandra data structures and algorithms

Merkle tree!

by Ralph Merkle, 1987

Page 67: Cassandra data structures and algorithms

@doanduyhai

Repairing data!

67

Repair

The system transmits payload infinitely often between pairs of replicas

Page 68: Cassandra data structures and algorithms

@doanduyhai

Why repair ?!

68

Data diverge between replicas because: •  writing with low consistency for perf •  nodes down •  network down •  dropped writes

Page 69: Cassandra data structures and algorithms

@doanduyhai

Repairing data!

69

Compare full data ? •  read all data •  I/O intensive •  network intensive (streaming is expensive)

Page 70: Cassandra data structures and algorithms

@doanduyhai

Repairing data!

70

Compare full data ? •  read all data •  I/O intensive •  network intensive (streaming is expensive)

Compare digests ? •  read all data •  I/O intensive •  network intensive (streaming is expensive)

Page 71: Cassandra data structures and algorithms

@doanduyhai

Merkle tree!

71

Tree of digests •  leaf nodes: digest of data •  non-leaf nodes: digest of the child nodes' digests •  tree resolution = number of leaf nodes = $2^{depth}$

Page 72: Cassandra data structures and algorithms

@doanduyhai

Merkle tree in action!

72

Depth = 15, resolution = 32 768 leaf nodes

leaf1 leaf2 leaf3

node node

root

n-partitions bucket n-partitions bucket n-partitions bucket

Page 73: Cassandra data structures and algorithms

@doanduyhai

Merkle tree in action!

73

Repair process •  send the tree to replicas •  compare digests, starting from root node •  if mismatch, stream partition bucket(s) that differ

Page 74: Cassandra data structures and algorithms

@doanduyhai

Merkle tree in action!

74

If mismatch, stream partition bucket(s) that differ Example •  327 680 partitions •  resolution = 32 768 → 10 partitions/bucket •  1 column differs in 1 partition → 10 partitions streamed

leaf

10-partitions
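The repair comparison described above can be sketched with a small Merkle tree over partition buckets. Everything here is a simplified assumption: SHA-256 as the digest, 4 buckets of 2 partitions instead of 32 768 buckets, and hypothetical helper names `build_tree` and `differing_buckets`.

```python
import hashlib

def digest(data):
    return hashlib.sha256(data).digest()

def build_tree(buckets):
    """Return the tree as a list of levels, leaves first; len(buckets) must be a power of two."""
    level = [digest(repr(sorted(b.items())).encode("utf-8")) for b in buckets]
    levels = [level]
    while len(level) > 1:
        level = [digest(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def differing_buckets(tree_a, tree_b):
    """Walk down from the root, descending only into subtrees whose digests differ."""
    top = len(tree_a) - 1
    suspects = [0] if tree_a[top][0] != tree_b[top][0] else []
    for depth in range(top - 1, -1, -1):
        children = [c for i in suspects for c in (2 * i, 2 * i + 1)]
        suspects = [c for c in children if tree_a[depth][c] != tree_b[depth][c]]
    return suspects

# Hypothetical replicas: 4 buckets of 2 partitions each.
replica_a = [{f"p{2 * i}": "v", f"p{2 * i + 1}": "v"} for i in range(4)]
replica_b = [dict(bucket) for bucket in replica_a]
replica_b[2]["p5"] = "stale"   # a single partition differs

to_stream = differing_buckets(build_tree(replica_a), build_tree(replica_b))
print(to_stream)  # [2]: the whole bucket holding p4 and p5 gets streamed
```

One differing partition causes its entire bucket to be streamed, which is the over-streaming effect the slide's 10-partition example points at.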

Page 75: Cassandra data structures and algorithms

@doanduyhai

Over-streaming nightmare!

75

Page 76: Cassandra data structures and algorithms

@doanduyhai

Merkle tree in action!

76

Improve tree resolution by increasing depth (dynamically)

[Figure: two Merkle trees (root, inner nodes, leaf1 … leafN); the second tree has additional levels of nodes and therefore more leaves]

Page 77: Cassandra data structures and algorithms

@doanduyhai

Merkle tree in action!

77

Improve tree resolution by repairing by partition ranges

[Figure: three separate Merkle trees (root, inner nodes, leaf1 … leafN), one per partition range]

Page 78: Cassandra data structures and algorithms

Q & A

Page 79: Cassandra data structures and algorithms

HyperLogLog!

by the late Philippe Flajolet, 2007

Page 80: Cassandra data structures and algorithms

@doanduyhai

Cassandra Read Path!

80

Remember that ?

Table1

SStable1

Table2 Table3

SStable2 SStable3

SStable1

SStable2

SStable3

Page 81: Cassandra data structures and algorithms

@doanduyhai

Cassandra Read Path!

81

Even Bloom filters can't save you if data is spread across many SSTables

Page 82: Cassandra data structures and algorithms

@doanduyhai

Cassandra Read Path!

82

Compaction !

Page 83: Cassandra data structures and algorithms

@doanduyhai

Compaction!

83

Algorithm:
•  take n SSTables
•  load data in memory
•  for each $S^t_{k,n}$ apply the merge function ($\max_{timestamp}$)
•  remove (when applicable) tombstones
•  build a new SSTable
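The compaction steps listed above can be sketched as a dictionary merge. This is a toy model under assumptions: each SSTable is a dict keyed by (partition key, column name), cells are small dicts with a `ts` field, and the slide's "when applicable" condition for dropping tombstones is reduced to a boolean flag. `compact` is a hypothetical name.

```python
def compact(sstables, purge_tombstones=True):
    """Merge SSTables into one, keeping the newest cell per (partition, column)."""
    merged = {}
    for sstable in sstables:
        for key, cell in sstable.items():            # key = (partition key, column name)
            current = merged.get(key)
            if current is None or cell["ts"] > current["ts"]:
                merged[key] = cell                   # max-timestamp merge
    if purge_tombstones:
        merged = {k: c for k, c in merged.items() if not c.get("tombstone")}
    return merged

sstable1 = {("ddoan", "age"): {"ts": 1, "value": 33},
            ("ddoan", "name"): {"ts": 1, "value": "DuyHai DOAN"}}
sstable2 = {("ddoan", "age"): {"ts": 2, "value": 34}}
sstable3 = {("ddoan", "age"): {"ts": 3, "tombstone": True}}

new_sstable = compact([sstable1, sstable2, sstable3])
print(new_sstable)  # only the name column survives; age was deleted at t3
```

The number of keys in the result is what determines how large the new SSTable's Bloom filter needs to be.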

Page 84: Cassandra data structures and algorithms

@doanduyhai

Compaction!

84

Build a new SSTable → allocate memory for new Bloom filter

Page 85: Cassandra data structures and algorithms

@doanduyhai

Compaction!

85

But how large is the new Bloom filter ?

Page 86: Cassandra data structures and algorithms

@doanduyhai

Compaction!

86

SStable1 SStable2

Bloom filters

double size?

in between ?

same size ?

Page 87: Cassandra data structures and algorithms

@doanduyhai

Compaction!

87

Bloom filter size depends on … elements cardinality (fpc constant)

Page 88: Cassandra data structures and algorithms

@doanduyhai

Compaction!

88

Bloom filter size depends on … elements cardinality (fpc constant) If we can count distinct elements in SSTable1 & SSTable2, we can allocate new Bloom filter

Page 89: Cassandra data structures and algorithms

@doanduyhai

Compaction!

89

Bloom filters

Given constant fpc, if cardinality = C1+C2, then m = …

SStable1 SStable2

Cardinality: C1 Cardinality: C2

Page 90: Cassandra data structures and algorithms

@doanduyhai

Compaction!

90

But counting exact cardinality is memory-expensive ...

Page 91: Cassandra data structures and algorithms

@doanduyhai

Compaction!

91

Can’t we have a cardinality estimate ?

Page 92: Cassandra data structures and algorithms

@doanduyhai

Cardinality estimators!

92

Counter                         Bytes used    Error
Java HashSet                    10 447 016    0%
Linear Probabilistic Counter         3 384    1%
HyperLogLog                            512    3%

credits: http://highscalability.com/

Page 93: Cassandra data structures and algorithms

@doanduyhai

LogLog intuition!

93

1)  given a well distributed hash function h
2)  given a sufficiently high number of elements n
For a set of n elements, look at the bit pattern of the hashes:

∀ i ∈ [1,n], h(element_i)

0xxxxx… : ≈ n/2
1xxxxx… : ≈ n/2

Page 94: Cassandra data structures and algorithms

@doanduyhai

LogLog intuition!

94

∀ i ∈ [1,n], h(element_i)

00xxxx… 01xxxx… 10xxxx… 11xxxx…
≈ n/4 ≈ n/4 ≈ n/4 ≈ n/4

000xxx… 001xxx… 010xxx… 011xxx… 100xxx… 101xxx… 110xxx… 111xxx…
≈ n/8 ≈ n/8 ≈ n/8 ≈ n/8 ≈ n/8 ≈ n/8 ≈ n/8 ≈ n/8

Page 95: Cassandra data structures and algorithms

@doanduyhai

LogLog intuition!

95

Reverse the reasoning. If we see a hash like this: 000 000 000 1… then, since the hash distribution is uniform, we should also have seen: 000 000 001 0… and 000 000 001 1… and 000 000 010 0… and … 111 111 111 1… Thus an estimated cardinality of $2^{10}$ elements for n

Page 96: Cassandra data structures and algorithms

@doanduyhai

LogLog intuition!

96

Toy example: n = 8, hash of 8 elements, 3 bit long:

000, 001, 010, 011, 100, 101, 110, 111
Uniform hash → equi-probability of each combination
If I observed 001, I should have seen 000 too, and 010 too …
If I observed 001, I should have seen 7 other combinations
If I observed 001, n ≈ 8 ($2^3$)

Page 97: Cassandra data structures and algorithms

@doanduyhai

LogLog intuition!

97

1)  given a well distributed hash function h
2)  given a sufficiently high number of elements n
If I find a hash starting with
•  01…, it's likely that there are $2^2$ distinct elements (n = $2^2$)
•  001…, it's likely that there are $2^3$ distinct elements (n = $2^3$)
•  0001…, it's likely that there are $2^4$ distinct elements (n = $2^4$)
•  …
•  00000000001… (first 1-bit at position r), it's likely that there are $2^r$ distinct elements (n = $2^r$)

Page 98: Cassandra data structures and algorithms

@doanduyhai

LogLog intuition!

98

max(r) = longest 0000…1 position observed among all hash values

$n \approx 2^{\max(r)}$
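The naive estimate n ≈ 2^max(r) fits in a few lines. This is a sketch under assumptions: hashes are 32 bits taken from SHA-1, and `rho`, `hash32` and `naive_estimate` are illustrative names. The result is always a power of two, which already hints at how coarse the estimate is.

```python
import hashlib

HASH_BITS = 32

def rho(h, bits=HASH_BITS):
    """1-based position of the leftmost 1-bit in a fixed-width hash (bits + 1 if h == 0)."""
    return bits - h.bit_length() + 1

def hash32(element):
    """First 32 bits of SHA-1, standing in for a well distributed hash function."""
    return int.from_bytes(hashlib.sha1(element.encode("utf-8")).digest()[:4], "big")

def naive_estimate(elements):
    """n ≈ 2^max(r), where r is the rank of the first 1-bit of each hash."""
    return 2 ** max(rho(hash32(e)) for e in elements)

print(rho(0b0001, bits=4))   # 4, matching rho(0001...) = 4
print(naive_estimate(f"user-{i}" for i in range(1000)))  # some power of two
```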

Page 99: Cassandra data structures and algorithms

@doanduyhai

LogLog intuition!

99

Still, it’s a very terrible estimation … What if we have these hash values for n = 16: 10 x 010….. 5 x 100…. 1 x 000 000 001…

Page 100: Cassandra data structures and algorithms

@doanduyhai

LogLog intuition!

100

Still, it’s a very terrible estimation … What if we have these hash values for n = 16: 10 x 010….. 5 x 100…. 1 x 000 000 001…

$n \approx 2^{\max(r)} = 2^9 = 512$ ?

Page 101: Cassandra data structures and algorithms

@doanduyhai

LogLog intuition!

101

Still, it’s a very terrible estimation … What if we have these hash values for n = 16: 10 x 010….. 5 x 100…. 1 x 000 000 001…

$n \approx 2^{\max(r)} = 2^9 = 512$ ?

outlier & skewed distribution sensitivity

Page 102: Cassandra data structures and algorithms

@doanduyhai

HyperLogLog intuition!

102

To eliminate outliers … use harmonic mean !

credits: http://economistatlarge.com

Page 103: Cassandra data structures and algorithms

@doanduyhai

HyperLogLog intuition!

103

Harmonic means definition (thank you Wikipedia)

$H = \frac{m}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_m}}$
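To see why the harmonic mean tames outliers, it can be applied to the skewed n = 16 example from the earlier slides, treating each hash's own estimate $2^r$ as a sample. This only illustrates the averaging effect and is not the HyperLogLog estimator itself; the sample list is built from the slide's three hash patterns.

```python
from statistics import harmonic_mean, mean

# Ten hashes starting 010… (r = 2), five starting 100… (r = 1),
# one starting 000 000 001… (r = 9): the outlier.
estimates = [2**2] * 10 + [2**1] * 5 + [2**9]

arithmetic = mean(estimates)
harmonic = harmonic_mean(estimates)

print(arithmetic)  # dominated by the single 512 outlier
print(harmonic)    # stays close to the typical values 2 and 4
```

Because the harmonic mean sums reciprocals, a single huge value contributes almost nothing to the denominator.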

Page 104: Cassandra data structures and algorithms

@doanduyhai

HyperLogLog intuition!

104

First, split the set into m = $2^b$ buckets. The bucket number is determined by the first b bits.

b = 6, m = $2^6$ = 64 buckets

Buckets list: B1, B2, … B64 (index is 1-based)

h(element) = 001001 0100…

Page 105: Cassandra data structures and algorithms

@doanduyhai

HyperLogLog intuition!

105

Example m = 8 ($2^3$) buckets

000xxx… 001xxx… 010xxx… 011xxx… 100xxx… 101xxx… 110xxx… 111xxx…

B1 B2 B3 B4 B5 B6 B7 B8

Page 106: Cassandra data structures and algorithms

@doanduyhai

HyperLogLog intuition!

106

New intuition:
•  in each bucket j, there are ≈ $M_j$ elements
•  harmonic mean of the $M_j$: $H(M_j) \approx n/m$

$n \approx m H(M_j)$

Page 107: Cassandra data structures and algorithms

@doanduyhai

HyperLogLog intuition!

107

But how do we calculate each Mj ?

Page 108: Cassandra data structures and algorithms

@doanduyhai

HyperLogLog intuition!

108

Use LogLog !

Page 109: Cassandra data structures and algorithms

@doanduyhai

HyperLogLog intuition!

109

How to solve a big hard problem ?

Page 110: Cassandra data structures and algorithms

@doanduyhai

HyperLogLog intuition!

110

So on each hash value

001100 0000001…
•  first bits (001100): bits for choosing bucket $B_j$
•  remaining bits (0000001…): bits for LogLog

Page 111: Cassandra data structures and algorithms

@doanduyhai

HyperLogLog improvement!

111

Greater precision compared to LogLog Computation can be distributed (each bucket processed separately)

Page 112: Cassandra data structures and algorithms

@doanduyhai

HyperLogLog the maths!

112

Page 113: Cassandra data structures and algorithms

@doanduyhai

HyperLogLog formal definition!

113

Let $h : D \to [0, 1] \equiv \{0, 1\}^\infty$ hash data from domain $D$ to the binary domain.
Let $\rho(s)$, for $s \in \{0, 1\}^\infty$, be the position of the leftmost 1-bit, e.g. $\rho(0001\cdots) = 4$. It is the rank of the observed 0000..1 sequence.
Let $m = 2^b$ with $b \in \mathbb{Z}_{>0}$; $m$ is the number of buckets.
Let $M$ be a multiset of items from domain $D$; $M$ is the set of elements whose cardinality we want to estimate.

Page 114: Cassandra data structures and algorithms

@doanduyhai

HyperLogLog formal definition!

114

Algorithm HYPERLOGLOG

Initialize a collection of $m$ registers, $M_1, \ldots, M_m$, to $-\infty$
for each element $v \in M$ do
•  set $x := h(v)$ // hash of v in binary form
•  set $j = 1 + \langle x_1 x_2 \cdots x_b \rangle_2$ // bucket number (1-based)
•  set $w := x_{b+1} x_{b+2} \cdots$ // bits for LogLog
•  set $M_j := \max(M_j, \rho(w))$ // take the longest 0000..1 position observed in bucket $B_j$

Page 115: Cassandra data structures and algorithms

@doanduyhai

HyperLogLog formal definition!

115

Compute $Z = \left( \sum_{j=1}^{m} 2^{-M_j} \right)^{-1}$ // what is that Z ?

Return $n \approx \alpha_m m^2 Z$
•  $\alpha_m$ as given by Equation (3) // what is that $\alpha_m$ ?
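The algorithm above can be sketched end to end. Several choices below are assumptions rather than part of the slides: a 64-bit hash taken from SHA-1, registers initialized to 0 instead of $-\infty$ (a common practical choice), 0-based register indices, no small-range or large-range corrections, and the closed-form approximation $\alpha_m \approx 0.7213 / (1 + 1.079/m)$ that the HyperLogLog paper gives for m ≥ 128. The class name and methods are hypothetical.

```python
import hashlib

class HyperLogLog:
    def __init__(self, b=10):
        self.b = b
        self.m = 1 << b                       # number of registers (buckets), m = 2^b
        self.registers = [0] * self.m         # assumption: 0 instead of -infinity
        self.alpha = 0.7213 / (1 + 1.079 / self.m)   # valid for m >= 128

    def add(self, element):
        digest = hashlib.sha1(element.encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")            # 64-bit hash
        j = x >> (64 - self.b)                           # first b bits choose the register
        w = x & ((1 << (64 - self.b)) - 1)               # remaining bits feed LogLog
        rank = (64 - self.b) - w.bit_length() + 1        # rho(w): leftmost 1-bit position
        self.registers[j] = max(self.registers[j], rank)

    def estimate(self):
        z = 1.0 / sum(2.0 ** -r for r in self.registers)   # Z = (sum of 2^-Mj)^-1
        return self.alpha * self.m * self.m * z            # n ≈ alpha_m * m^2 * Z

hll = HyperLogLog(b=10)
for i in range(100_000):
    hll.add(f"user-{i}")
print(round(hll.estimate()))  # close to 100000; expected error is a few percent
```

With 1 024 registers the state stays tiny regardless of n, and re-adding the same elements leaves the registers unchanged, so duplicates do not inflate the count.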

Page 116: Cassandra data structures and algorithms

@doanduyhai

HyperLogLog maths workout!

116

$M_j$ = longest 0000...1 observed for bucket $j$, and $H(M_j) \approx n/m$

Remember our intuition $n \approx m H(M_j)$ ?

Harmonic mean definition: $H = \frac{m}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_m}}$

Page 117: Cassandra data structures and algorithms

@doanduyhai

HyperLogLog maths workout!

117

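$H = \frac{m}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_m}} = \frac{m}{\sum_{j=1}^{m} \frac{1}{x_j}}$

$H = m \left( \sum_{j=1}^{m} \frac{1}{x_j} \right)^{-1} = m \left( \sum_{j=1}^{m} x_j^{-1} \right)^{-1}$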

Page 118: Cassandra data structures and algorithms

@doanduyhai

HyperLogLog maths workout!

118

$H = m \left( \sum_{j=1}^{m} x_j^{-1} \right)^{-1}$

compare it with

$Z = \left( \sum_{j=1}^{m} 2^{-M_j} \right)^{-1}$

let $x_j = 2^{M_j}$, the cardinality estimate for bucket $B_j$

$H = mZ$

Page 119: Cassandra data structures and algorithms

@doanduyhai

HyperLogLog maths workout!

119

Remember our intuition n ≈ mH(Mj) ?

$n \approx mH \approx m^2 Z$ ☞ $\alpha_m m^2 Z$

Page 120: Cassandra data structures and algorithms

@doanduyhai

HyperLogLog harder maths!

120

What about the $\alpha_m$ constant ?

Page 121: Cassandra data structures and algorithms

@doanduyhai

HyperLogLog harder maths!

121

You don’t want to dig into that, trust me …

Page 122: Cassandra data structures and algorithms

@doanduyhai

HyperLogLog harder maths!

122

8 pages full of this:

Page 123: Cassandra data structures and algorithms

@doanduyhai

HyperLogLog harder maths!

123

and this

Page 124: Cassandra data structures and algorithms

@doanduyhai

HyperLogLog harder maths!

124

and this…

Page 125: Cassandra data structures and algorithms

@doanduyhai

Compaction!

125

Bloom filters

Given constant fpc, if cardinality = C1+C2, then m = …

SStable1 SStable2

Cardinality: C1 Cardinality: C2

Page 126: Cassandra data structures and algorithms

Q & A

Page 127: Cassandra data structures and algorithms

Thank You @doanduyhai

[email protected]