cassandra & the acunu data platform

Tom Wilkie Founder & VP Engineering @tom_wilkie Cassandra & the Acunu Data Platform

Upload: acunu

Post on 24-Jan-2015


Page 1: Cassandra & the Acunu Data Platform

Tom Wilkie, Founder & VP Engineering

@tom_wilkie

Cassandra & the Acunu Data Platform

Page 2: Cassandra & the Acunu Data Platform

Before the Flood

Old hardware

1990

BTree File systems

RAID

Small databases

BTree indexes

Page 3: Cassandra & the Acunu Data Platform

Two Revolutions

2010

Distributed, shared-nothing databases

Each node: Write-optimised indexes, BTree file systems, RAID, New hardware

...

Page 4: Cassandra & the Acunu Data Platform

Bridging the Gap

2011

Distributed, shared-nothing databases

Each node: Castle, New hardware

...

Page 5: Cassandra & the Acunu Data Platform

Why?

Page 6: Cassandra & the Acunu Data Platform
Page 7: Cassandra & the Acunu Data Platform

SNAPSHOTS*

* And clones!

Page 8: Cassandra & the Acunu Data Platform

Small random inserts: inserting 3 billion rows

[Chart: Acunu-powered Cassandra vs. ‘standard’ Cassandra]

Page 9: Cassandra & the Acunu Data Platform

Insert latency: while inserting 3 billion rows

[Chart: Acunu-powered Cassandra (x) vs. ‘standard’ Cassandra (+)]

Page 10: Cassandra & the Acunu Data Platform

Small random range queries: performed immediately after inserts

[Chart: Acunu-powered Cassandra vs. ‘standard’ Cassandra]

Page 11: Cassandra & the Acunu Data Platform

Performance summary

                               Standard    Acunu     Benefits
Inserts          rate          ~32k/s      ~45k/s    >1.4x
                 95% latency   ~32s        ~0.3s     >100x
Gets             rate          ~100/s      ~350/s    >3.5x
                 95% latency   ~2s         ~0.5s     >4x
Range queries    rate          ~0.4/s      ~40/s     >100x
                 95% latency   ~15s        ~2s       >7.5x

Page 12: Cassandra & the Acunu Data Platform

How?

Page 13: Cassandra & the Acunu Data Platform

[Figure: Castle architecture diagram. Domains: Userspace, Acunu Kernel, Linux Kernel. Layers: userspace interface; kernelspace interface; doubling array mapping layer; modlist btree mapping layer; block mapping & caching layer; Linux's block & MM layers. Components: Shared memory interface (keys, values; shared buffers; async, shared memory ring); Streaming interface (key insert, key get, buffered value get, buffered value insert, range queries); In-kernel workloads; Doubling Arrays (arrays, range queries, key insert, insert queues, Bloom filters); Arrays (value arrays, btree, key get, arrays management, merges); btree (range queries, key get, key insert); Version tree; Cache (flusher, extent block cache, page cache, prefetcher); "Extent" layer (extent allocator & mapper, freespace manager); Memory manager; Block layer.]

Page 14: Cassandra & the Acunu Data Platform

[Figure: Castle architecture diagram, repeated]

• Open source (GPLv2, MIT for user libraries)

• http://bitbucket.org/acunu

• Loadable Kernel Module, targeting CentOS’s 2.6.18

• http://www.acunu.com/blogs/andy-twigg/why-acunu-kernel/

Castle

Page 15: Cassandra & the Acunu Data Platform

[Figure: Castle architecture diagram, repeated]

The Interface

castle_{back,objects}.c

Page 16: Cassandra & the Acunu Data Platform

The Interface

[Figure: version tree over versions v0 to v6]

Page 17: Cassandra & the Acunu Data Platform

[Figure: Castle architecture diagram, repeated]

Doubling Array

castle_{da,bloom}.c

Page 18: Cassandra & the Acunu Data Platform

B-Tree

[Figure: B-tree with nodes of size B and height log_B N]

• If node is full, split and insert new node into parent (recurse)

• For random inserts, nodes placed randomly on disk

Page 19: Cassandra & the Acunu Data Platform

            Update                    Range Query (Size Z)
B-Tree      O(log_B N) random IOs     O(Z/B) random IOs

B = “block size”, say 8KB at 100 bytes/entry ~= 100 entries

Page 20: Cassandra & the Acunu Data Platform

Doubling Array: Inserts

[Figure: the single-element arrays [2] and [9] are merged into the sorted array [2 9]]

Buffer arrays in memory until we have > B of them

Page 21: Cassandra & the Acunu Data Platform

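Doubling Array: Inserts

[Figure: 11 is inserted; then 8 is inserted and merged with [11] to give [8 11]; [2 9] and [8 11] are then merged into [2 8 9 11]]

etc...

Similar to log-structured merge trees (LSM), cache-oblivious lookahead array (COLA), ...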
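The insert-and-merge pattern on these two slides can be sketched in a few lines. This is a toy, in-memory illustration only, not Castle's implementation: the class name `DoublingArray` and its `levels` list are hypothetical, and the sketch ignores the buffering of small arrays in memory and all on-disk layout.

```python
from heapq import merge


class DoublingArray:
    """Toy doubling array: level i is either None or a sorted list of 2**i keys."""

    def __init__(self):
        self.levels = []

    def insert(self, key):
        carry = [key]  # every insert starts life as an array of size 1
        for i, level in enumerate(self.levels):
            if level is None:
                # Free slot at this level: store the carried array and stop.
                self.levels[i] = carry
                return
            # Two arrays of equal size: merge them in one sequential pass
            # and carry the doubled array up to the next level.
            carry = list(merge(level, carry))
            self.levels[i] = None
        self.levels.append(carry)


da = DoublingArray()
for k in (2, 9, 11, 8):
    da.insert(k)
print(da.levels)  # [None, None, [2, 8, 9, 11]]
```

Each key is rewritten once per level it passes through, and every rewrite is part of a sequential merge, which is where the O((log N)/B) amortised update cost comes from.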

Page 23: Cassandra & the Acunu Data Platform

                  Update                          Range Query (Size Z)
B-Tree            O(log_B N) random IOs           O(Z/B) random IOs
Doubling Array    O((log N)/B) sequential IOs

B = “block size”, say 8KB at 100 bytes/entry ~= 100 entries

Page 24: Cassandra & the Acunu Data Platform

Doubling Array: Queries

• Add an index to each array to do lookups

• query(k) searches each array independently

query(k)

Page 25: Cassandra & the Acunu Data Platform

Doubling Array: Queries

• Bloom Filters can help exclude arrays from search

• ... but don’t help with range queries

[Figure: query(k)]
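A point query over several arrays, with a Bloom filter per array used to skip arrays that cannot contain the key, might look like the sketch below. Everything here is illustrative: the `BloomFilter` class, its sizes, and the `make_array` and `query` helpers are hypothetical names, and a binary search stands in for the per-array index mentioned on the previous slide.

```python
import hashlib
from bisect import bisect_left


class BloomFilter:
    """Small Bloom filter backed by a Python int used as a bit set."""

    def __init__(self, nbits=1024, nhashes=3):
        self.nbits, self.nhashes, self.bits = nbits, nhashes, 0

    def _positions(self, key):
        for i in range(self.nhashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.nbits

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # False means "definitely absent"; True may be a false positive.
        return all((self.bits >> p) & 1 for p in self._positions(key))


def make_array(keys):
    """Build one sorted array together with its Bloom filter."""
    arr = sorted(keys)
    bf = BloomFilter()
    for k in arr:
        bf.add(k)
    return arr, bf


def query(arrays, key):
    """Search arrays in the order given; return (found, arrays_searched)."""
    searched = 0
    for arr, bf in arrays:
        if not bf.might_contain(key):
            continue  # excluded by the filter without touching the array
        searched += 1
        i = bisect_left(arr, key)
        if i < len(arr) and arr[i] == key:
            return True, searched
    return False, searched


arrays = [make_array([11]), make_array([2, 9]), make_array([1, 3, 5, 7])]
print(query(arrays, 9))
```

A range query cannot use the filters this way, because a Bloom filter only answers membership for a single key; every array overlapping the range still has to be scanned.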

Page 26: Cassandra & the Acunu Data Platform

B = “block size”, say 8KB at 100 bytes/entry ~= 100 entries

                  Update                          Range Query (Size Z)
B-Tree            O(log_B N) random IOs           O(Z/B) random IOs
Doubling Array    O((log N)/B) sequential IOs     O(Z/B) sequential IOs

B-Tree: ~ log(2^30)/log 100 = 5 IOs/update; 8KB @ 100MB/s, w/ 8ms seek = 100 IOs/s; 100 / 5 = 20 updates/s

Doubling Array: ~ log(2^30)/100 = 0.2 IOs/update; 8KB @ 100MB/s = 13k IOs/s; 13k / 0.2 = 65k updates/s
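The back-of-envelope numbers on this slide can be reproduced directly. The sketch below assumes natural logarithms (which is what gives roughly 0.2 IOs/update for the doubling array) and treats MB and KB as binary units; both are assumptions, and the unrounded results differ slightly from the slide's rounded figures while keeping the same orders of magnitude.

```python
import math

N = 2 ** 30                     # number of entries
B = 100                         # entries per 8KB block, as on the slide
block = 8 * 1024                # bytes per IO
bandwidth = 100 * 1024 * 1024   # 100MB/s sequential throughput
seek = 0.008                    # 8ms per random IO

# IOs per update for each structure.
btree_ios = math.log(N) / math.log(B)   # O(log_B N), about 4.5
da_ios = math.log(N) / B                # O((log N)/B), about 0.21

# IOs per second the disk can deliver.
seq_ios_per_sec = bandwidth / block                # 12800, i.e. ~13k
rand_ios_per_sec = 1 / (seek + block / bandwidth)  # about 124, i.e. ~100

btree_updates = rand_ios_per_sec / btree_ios
da_updates = seq_ios_per_sec / da_ios

print(f"B-tree:         {btree_ios:.1f} IOs/update, {btree_updates:.0f} updates/s")
print(f"Doubling array: {da_ios:.2f} IOs/update, {da_updates:.0f} updates/s")
```

The gap of three orders of magnitude comes from two independent factors: fewer IOs per update, and sequential rather than random IOs.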

Page 27: Cassandra & the Acunu Data Platform

[Figure: Castle architecture diagram, repeated]

Doubling Array

castle_{da,bloom}.c

Page 28: Cassandra & the Acunu Data Platform

[Figure: Castle architecture diagram, repeated]

“Mod-list” B-Tree

castle_{btree,versions}.c

Page 29: Cassandra & the Acunu Data Platform

Copy-on-Write BTree

Idea:

• Apply path-copying [DSST] to the B-tree

Problems:

• Space blowup: Each update may rewrite an entire path

• Slow updates: as above

A log file system makes updates sequential, but relies on random access and garbage collection (its Achilles heel!)
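To make the cost of path-copying concrete, here is a minimal sketch on a plain binary search tree rather than a B-tree (a deliberate simplification; the `Node` type and `insert` function are hypothetical). Each update copies every node on the root-to-leaf path and shares the rest, so old roots stay usable as snapshots.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root, key):
    """Return (new_root, nodes_written); the old root is left untouched."""
    if root is None:
        return Node(key), 1
    if key < root.key:
        child, n = insert(root.left, key)
        return Node(root.key, child, root.right), n + 1  # copy this node
    if key > root.key:
        child, n = insert(root.right, key)
        return Node(root.key, root.left, child), n + 1   # copy this node
    return root, 0  # key already present at this node


v1 = None
for k in (5, 2, 8, 1):
    v1, _ = insert(v1, k)

v2, written = insert(v1, 3)   # v1 remains a valid snapshot
print(written)                # 3: the new leaf plus copies of nodes 2 and 5
print(v2.right is v1.right)   # True: the untouched subtree is shared
```

In a B-tree each copied node is a whole block, so one small update rewrites a block per level, which is the space blowup and slow-update problem listed above.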

Page 30: Cassandra & the Acunu Data Platform

N_v = #keys live (accessible) at version v

              Update                      Range Query          Space
CoW B-Tree    O(log_B N_v) random IOs     O(Z/B) random IOs    O(N B log_B N_v)

Page 31: Cassandra & the Acunu Data Platform

“BigTable” snapshots

• Inserts produce arrays

[Figure: version v1 references arrays a and b, each with ref count 1]

Page 32: Cassandra & the Acunu Data Platform

“BigTable” snapshots

• Inserts produce arrays

• Snapshots increment ref counts on arrays

• Merges produce more arrays, decrement ref count on old arrays

[Figure: versions v1 and v2; arrays a and b each have ref count 2, and a new array c has ref count 1]

Page 33: Cassandra & the Acunu Data Platform

“BigTable” snapshots

• Inserts produce arrays

• Snapshots increment ref counts on arrays

• Merges produce more arrays, decrement ref count on old arrays

[Figure: versions v1 and v2; arrays a and b each have ref count 1, and the merged array "a b c" has ref count 1]

Page 34: Cassandra & the Acunu Data Platform

“BigTable” snapshots

• Inserts produce arrays

• Snapshots increment ref counts on arrays

• Merges produce more arrays, decrement ref count on old arrays

• Space blowup

[Figure: versions v1 and v2; arrays a and b each have ref count 1, and the merged array "a b c" has ref count 1]
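The reference-counting scheme in these bullets, and the space blowup it leads to, can be sketched as follows. This is a toy model with hypothetical names (`Array`, `Store`, `stored_keys`); it is not how BigTable, LevelDB, or Castle store data, only an illustration of why merging under a snapshot duplicates keys.

```python
class Array:
    """An immutable sorted run of keys with a reference count."""

    def __init__(self, keys):
        self.keys = tuple(sorted(keys))
        self.refs = 0


class Store:
    def __init__(self):
        self.versions = {"v1": []}  # version name -> arrays it references
        self.live = set()           # arrays with refs > 0 (still on disk)

    def _ref(self, arr):
        arr.refs += 1
        self.live.add(arr)

    def _unref(self, arr):
        arr.refs -= 1
        if arr.refs == 0:
            self.live.discard(arr)  # nobody needs it: space is reclaimed

    def insert(self, version, keys):
        arr = Array(keys)
        self.versions[version].append(arr)
        self._ref(arr)
        return arr

    def snapshot(self, src, dst):
        # A snapshot shares every array and bumps its ref count.
        self.versions[dst] = list(self.versions[src])
        for arr in self.versions[dst]:
            self._ref(arr)

    def merge(self, version):
        old = self.versions[version]
        merged = Array(k for arr in old for k in arr.keys)
        for arr in old:
            self._unref(arr)
        self.versions[version] = [merged]
        self._ref(merged)
        return merged

    def stored_keys(self):
        return sum(len(arr.keys) for arr in self.live)


s = Store()
a = s.insert("v1", [1, 2])
b = s.insert("v1", [3, 4])
s.snapshot("v1", "v2")        # a and b now have ref count 2
c = s.insert("v2", [5, 6])
print(s.stored_keys())        # 6 keys stored for 6 distinct keys
m = s.merge("v2")             # v2 now points at the merged array only
print(s.stored_keys())        # 10: a and b survive for v1, and m repeats them
```

After the merge, a and b cannot be freed because v1 still references them, so their keys exist twice. With V versions this repeats per version, which matches the O(VN) space bound quoted for this approach.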

Page 35: Cassandra & the Acunu Data Platform

N_v = #keys live (accessible) at version v

                       Update                          Range Query              Space
CoW B-Tree             O(log_B N_v) random IOs         O(Z/B) random IOs        O(N B log_B N_v)
“BigTable” style DA    O((log N)/B) sequential IOs     O(Z/B) sequential IOs    O(VN)

Page 36: Cassandra & the Acunu Data Platform

“Mod-list” BTree

Idea:

• Apply fat-nodes [DSST] to the B-tree

• i.e. insert (key, version, value) tuples, with special operations

Problems:

• Similar performance to a BTree

If you limit the #versions, can be constructed sequentially, and embedded into a DA
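One way to picture (key, version, value) tuples is as a versioned dictionary in which a lookup at version v falls back to the nearest ancestor of v that wrote the key. The sketch below assumes exactly that semantics; the `parent` map, the `entries` dict, and both functions are hypothetical, and the slide's "special operations" are not modelled.

```python
# Hypothetical version tree: v1 and v2 are both snapshots/clones of v0.
parent = {"v0": None, "v1": "v0", "v2": "v0"}

# One entry per (key, version) that actually modified the key.
entries = {}


def insert(key, version, value):
    entries[(key, version)] = value


def lookup(key, version):
    """Return the value from `version` or its nearest ancestor, else None."""
    v = version
    while v is not None:
        if (key, v) in entries:
            return entries[(key, v)]
        v = parent[v]  # not written here: inherit from the parent version
    return None


insert("k1", "v0", "a")
insert("k1", "v2", "b")
print(lookup("k1", "v1"))  # a (inherited from v0)
print(lookup("k1", "v2"))  # b (overridden in v2)
```

Because a key is stored only for the versions that modify it, unchanged keys are shared between versions rather than copied, which is the intuition behind the O(N) space figure for this approach.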

Page 37: Cassandra & the Acunu Data Platform

N_v = #keys live (accessible) at version v

                       Update                          Range Query              Space
CoW B-Tree             O(log_B N_v) random IOs         O(Z/B) random IOs        O(N B log_B N_v)
“BigTable” style DA    O((log N)/B) sequential IOs     O(Z/B) sequential IOs    O(VN)
“Mod-list” in a DA     O((log N)/B) sequential IOs     O(Z/B) sequential IOs    O(N)

[Slide labels: CASTLE, LevelDB]

Page 38: Cassandra & the Acunu Data Platform

Stratified BTree

Problem: Embedded “Mod-list” #versions limit

Solution: Version-split arrays during merges

[Figure: version tree with v0 above v1 and v2. A newer and an older array of (key, version) entries over keys k1 to k5 are merged, with duplicates removed. The merged array is then version-split (v-split) into a {v2} array and a {v1,v0} array; the v0 entries in the {v2} array are duplicates.]

Page 39: Cassandra & the Acunu Data Platform

[Figure: Castle architecture diagram, repeated]

“Mod-list” B-Tree

castle_{btree,versions}.c

Page 40: Cassandra & the Acunu Data Platform

[Figure: Castle architecture diagram, repeated]

Disk Layout: RDA

castle_{cache,extent,freespace,rebuild}.c

Page 41: Cassandra & the Acunu Data Platform

Disk Layout: RDA

random duplicate allocation

[Figure: numbered chunks 1 to 16 spread across several disks]
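As a rough illustration of random duplicate allocation in the sense of [RDA], the sketch below stores each chunk on two distinct, randomly chosen disks and serves a read from whichever replica currently has the shorter queue. The function names, the two-copy default, and the queue-length heuristic are assumptions made for this example; Castle's actual allocator lives in the castle_{extent,freespace,rebuild}.c files named earlier and is not reproduced here.

```python
import random


def rda_place(num_chunks, num_disks, copies=2, seed=0):
    """Map each chunk number to `copies` distinct, randomly chosen disks."""
    rng = random.Random(seed)
    return {
        chunk: rng.sample(range(num_disks), copies)
        for chunk in range(1, num_chunks + 1)
    }


def pick_disk(replicas, queue_len):
    """Serve a read from the replica whose disk has the shortest queue."""
    return min(replicas, key=lambda d: queue_len[d])


placement = rda_place(num_chunks=16, num_disks=4)
for chunk, disks in placement.items():
    print(chunk, disks)

# With disk 0 busy (queue of 5) and disk 2 nearly idle, read from disk 2.
print(pick_disk([0, 2], queue_len=[5, 0, 1, 0]))  # 2
```

Keeping two copies gives both redundancy and a choice at read time, so load can be balanced without any fixed striping pattern.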

Page 42: Cassandra & the Acunu Data Platform

Future

Page 43: Cassandra & the Acunu Data Platform

Memcache + Cassandra

[Figure: several nodes, each running Cassandra and memcache on top of Castle and hardware (H/W). A Cassandra client issues get/insert; a memcached client issues get/put.]

100k random inserts/sec!

Page 44: Cassandra & the Acunu Data Platform

[Figure: four copies of a version tree with nodes v1, v12, v13, v15, v16, v24]

Page 45: Cassandra & the Acunu Data Platform

• Castle: like BDB, but for Big Data

• 2 orders of magnitude better performance and predictability

• Part of the Acunu Data Platform

Page 47: Cassandra & the Acunu Data Platform

References

[LSM] The Log-Structured Merge-Tree (LSM-Tree), Patrick O'Neil, Edward Cheng, Dieter Gawlick, Elizabeth O'Neil

http://staff.ustc.edu.cn/~jpq/paper/flash/1996-The%20Log-Structured%20Merge-Tree%20%28LSM-Tree%29.pdf

[COLA] Cache-Oblivious Streaming B-trees, Michael A. Bender et al

http://www.cs.sunysb.edu/~bender/newpub/BenderFaFi07.pdf

[DSST] Making Data Structures Persistent, J. R. Driscoll, N. Sarnak, D. D. Sleator, R. E. Tarjan, Journal of Computer and System Sciences, Vol. 38, No. 1, 1989

http://www.cs.cmu.edu/~sleator/papers/making-data-structures-persistent.pdf

Stratified B-trees and versioned dictionaries, Andy Twigg, Andrew Byde, Grzegorz Miłoś, Tim Moreton, John Wilkes, Tom Wilkie, HotStorage’11

http://www.usenix.org/event/hotstorage11/tech/final_files/Twigg.pdf

[RDA] Random duplicate storage strategies for load balancing in multimedia servers, Joep Aerts, Jan Korst, Sebastian Egner, 2000

http://www.win.tue.nl/~joep/IPL.ps

Apache, Apache Cassandra, Cassandra, Hadoop, and the eye and elephant logos are trademarks of the Apache Software Foundation.