In the Brain of Tom Wilkie

Posted on 29-Nov-2014


DESCRIPTION

A repeat of my OSCON talk, with some more details. Check out the video at http://skillsmatter.com/podcast/nosql/castle-big-data

TRANSCRIPT

Tom Wilkie, Founder & VP Engineering

@tom_wilkie

Castle: Reinventing Storage for Big Data

Before the Flood

1990: old hardware

• BTree file systems
• RAID
• Small databases
• BTree indexes

Two Revolutions

2010: new hardware

• BTree file systems
• RAID
• Write-optimised indexes
• Distributed, shared-nothing databases

Bridging the Gap

2011: Castle sits between the distributed, shared-nothing databases and the new hardware.

SNAPSHOTS*

* And clones!

What’s in the Castle?

[Architecture diagram: Castle’s layers, top to bottom — this slide is repeated throughout the deck]

Userspace
• Shared memory interface: keys, values; shared buffers; async, shared memory ring

Acunu Kernel
• Userspace interface / kernelspace interface
• Doubling array mapping layer: doubling arrays (insert queues, Bloom filters; key insert, range queries) and arrays (value arrays, btree; key get, arrays management, merges)
• Modlist btree mapping layer: btree range queries, key get, key insert, version tree; streaming interface (key insert, key get, buffered value get, buffered value insert, range queries)
• Block mapping & caching layer: “Extent” layer (extent allocator & mapper, freespace manager), memory manager; cache (flusher, extent block cache, page cache, prefetcher); in-kernel workloads; block layer

Linux Kernel
• Linux’s block & MM layers

• Open source (GPLv2; MIT for user libraries)

• http://bitbucket.org/acunu

• Loadable kernel module, targeting CentOS’s 2.6.18 kernel

• http://www.acunu.com/blogs/andy-twigg/why-acunu-kernel/

Castle


The Interface

castle_{back,objects}.c

The Interface

[Version tree diagram: versions v0 through v24]


Doubling Array

castle_{da,bloom}.c

B-Tree

[B-tree diagram: fan-out B, depth log_B N]

• If a node is full, split it and insert the new node into the parent (recurse)

• For random inserts, nodes are placed randomly on disk

                 Update                  Range query (size Z)
B-Tree           O(log_B N) random IOs   O(Z/B) random IOs

B = “block size”: an 8KB block at ~100 bytes/entry holds ~100 entries

Doubling Array

Inserts: buffer entries in memory until there are B of them, then write them out as a sorted array. Whenever two arrays have the same size, merge them into one array of twice the size:

insert 2, 9 → [2 9]
insert 8, 11 → [8 11]; merge → [2 8 9 11]
etc...

Similar to log-structured merge trees (LSM), cache-oblivious lookahead arrays (COLA), ...
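The buffer-and-merge scheme above can be sketched in a few lines of Python. This is a toy model for illustration only, not Castle’s in-kernel C implementation; the class and method names are invented:

```python
import bisect

class DoublingArray:
    """Toy doubling array: sorted arrays of sizes B, 2B, 4B, ...
    Equal-sized arrays are merged, so every write is sequential and
    each entry is rewritten O(log N) times."""

    def __init__(self, buffer_size=4):
        self.buffer_size = buffer_size  # "B": in-memory buffer capacity
        self.buffer = {}                # in-memory key -> value buffer
        self.arrays = []                # sorted (key, value) arrays, newest first

    def insert(self, key, value):
        self.buffer[key] = value
        if len(self.buffer) >= self.buffer_size:
            self._flush()

    def _flush(self):
        new = sorted(self.buffer.items())
        self.buffer = {}
        # merge equal-sized arrays, like carries in binary addition
        while self.arrays and len(self.arrays[0]) == len(new):
            new = self._merge(self.arrays.pop(0), new)
        self.arrays.insert(0, new)

    @staticmethod
    def _merge(old, new):
        merged = dict(old)
        merged.update(new)  # newer values win on duplicate keys
        return sorted(merged.items())

    def get(self, key):
        if key in self.buffer:
            return self.buffer[key]
        # search each array independently, newest first
        for arr in self.arrays:
            i = bisect.bisect_left(arr, (key,))
            if i < len(arr) and arr[i][0] == key:
                return arr[i][1]
        return None
```

With buffer_size=2, eight inserts leave a single array of eight entries, having touched each entry only through sequential merges.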

                 Update                        Range query (size Z)
B-Tree           O(log_B N) random IOs         O(Z/B) random IOs
Doubling Array   O((log N)/B) sequential IOs

B = “block size”: an 8KB block at ~100 bytes/entry holds ~100 entries

Doubling Array: Queries

• Add an index to each array to do lookups
• query(k) searches each array independently

Doubling Array: Queries

• Bloom filters can help exclude arrays from a search
• ... but don’t help with range queries
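A minimal Bloom filter along these lines (an illustrative sketch, not Castle’s castle_bloom code; the bit count and hash scheme are arbitrary): it may answer “maybe present” falsely, but never “absent” falsely, so a lookup can safely skip any array whose filter says no.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: set nhashes bits per key; a key 'might be
    present' only if all of its bits are set."""

    def __init__(self, nbits=1024, nhashes=3):
        self.nbits, self.nhashes = nbits, nhashes
        self.bits = bytearray(nbits // 8)

    def _positions(self, key):
        # derive nhashes bit positions from salted SHA-256 digests
        for i in range(self.nhashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.nbits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))
```

Range queries get no such help: a range covers many keys, so every array must still be consulted.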

                 Update                        Range query (size Z)
B-Tree           O(log_B N) random IOs         O(Z/B) random IOs
Doubling Array   O((log N)/B) sequential IOs   O(Z/B) sequential IOs

B = “block size”: an 8KB block at ~100 bytes/entry holds ~100 entries

B-Tree: ~log(2^30)/log(100) ≈ 5 random IOs/update; 8KB reads with an 8ms seek give ~100 IOs/s, so 100/5 = 20 updates/s

Doubling Array: ~log(2^30)/100 ≈ 0.2 sequential IOs/update; 8KB @ 100MB/s gives ~13k IOs/s, so 13k/0.2 = 65k updates/s
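These back-of-envelope figures can be reproduced directly (natural logs; the slide rounds the final numbers):

```python
import math

N = 2 ** 30   # total entries
B = 100       # entries per 8KB block at ~100 bytes/entry

# B-Tree: O(log_B N) random IOs per update
btree_ios = math.log(N) / math.log(B)    # ~4.5, rounded to 5 above
# Doubling Array: O((log N)/B) sequential IOs per update
da_ios = math.log(N) / B                 # ~0.21, rounded to 0.2 above

random_iops = 100            # 8KB read with ~8ms seek -> ~100 IOs/s
seq_iops = 100e6 / 8192      # 100MB/s in 8KB blocks -> ~13k IOs/s

btree_updates = random_iops / btree_ios  # ~22 updates/s (~20 above)
da_updates = seq_iops / da_ios           # ~59k updates/s (~65k above)
```

The three-orders-of-magnitude gap comes entirely from trading a few random IOs for a fraction of a sequential IO per update.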



“Mod-list” B-Tree

castle_{btree,versions}.c

Copy-on-Write BTree

Idea:

• Apply path-copying [DSST] to the B-tree

Problems:

• Space blowup: Each update may rewrite an entire path

• Slow updates: as above

A log file system makes updates sequential, but relies on random access and garbage collection (its Achilles heel!)

Nv = #keys live (accessible) at version v

                 Update                    Range query         Space
CoW B-Tree       O(log_B Nv) random IOs    O(Z/B) random IOs   O(N B log_B Nv)

“BigTable” snapshots

• Inserts produce arrays
• Snapshots increment ref counts on arrays
• Merges produce more arrays, and decrement the ref count on old arrays
• Space blowup

[Animation: v1 holds arrays a and b, each with ref count 1; a snapshot to v2 bumps both counts to 2; a merge in v1 produces array c and decrements a and b, which v2 still pins, so the v1 entries in them are now duplicates]
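The ref-counting scheme in the bullets can be sketched as follows (a toy model; the class and method names are invented for illustration, and arrays are reduced to opaque ids):

```python
class ArrayStore:
    """Toy 'BigTable-style' snapshots: versions share immutable
    arrays by reference counting; a merge swaps old arrays for a
    new one in a single version only."""

    def __init__(self):
        self.refs = {}       # array id -> ref count
        self.versions = {}   # version -> list of array ids

    def insert_array(self, version, array_id):
        self.versions.setdefault(version, []).append(array_id)
        self.refs[array_id] = self.refs.get(array_id, 0) + 1

    def snapshot(self, version, new_version):
        # the snapshot shares every array, bumping each ref count
        self.versions[new_version] = list(self.versions.get(version, []))
        for a in self.versions[new_version]:
            self.refs[a] += 1

    def merge(self, version, old_ids, new_id):
        # replace old arrays with the merged one in this version
        arrs = self.versions[version]
        self.versions[version] = [a for a in arrs if a not in old_ids] + [new_id]
        self.refs[new_id] = self.refs.get(new_id, 0) + 1
        for a in old_ids:
            self.refs[a] -= 1
            if self.refs[a] == 0:
                del self.refs[a]   # no version pins it: free the array

    def live_space(self):
        # arrays still pinned by some version -> the "space blowup"
        return len(self.refs)
```

After a snapshot and a merge, the old arrays survive because the snapshot still pins them, which is exactly the space blowup the slide calls out.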

Nv = #keys live (accessible) at version v

                      Update                        Range query             Space
CoW B-Tree            O(log_B Nv) random IOs        O(Z/B) random IOs       O(N B log_B Nv)
“BigTable” style DA   O((log N)/B) sequential IOs   O(Z/B) sequential IOs   O(VN)

“Mod-list” BTree

Idea:

• Apply fat nodes [DSST] to the B-tree
• i.e. insert (key, version, value) tuples, with special operations

Problems:

• Similar performance to a BTree

If you limit the #versions, it can be constructed sequentially and embedded into a DA
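A fat-node lookup, resolving a key through the version tree, might look like this (an illustrative toy, not the castle_{btree,versions}.c implementation; the class and names are invented):

```python
class VersionedStore:
    """Toy 'fat node' versioned dictionary: store (key, version) ->
    value tuples and resolve a get by walking up the version tree to
    the nearest ancestor that wrote the key."""

    def __init__(self):
        self.parent = {0: None}   # version tree: child -> parent, root is 0
        self.entries = {}         # (key, version) -> value

    def snapshot(self, version, new_version):
        # a snapshot (or clone) just adds a child version: no copying
        self.parent[new_version] = version

    def insert(self, key, version, value):
        self.entries[(key, version)] = value

    def get(self, key, version):
        v = version
        while v is not None:
            if (key, v) in self.entries:
                return self.entries[(key, v)]
            v = self.parent[v]    # fall back to the ancestor version
        return None
```

Because every update is just a (key, version, value) tuple, the whole structure can be written out as sorted runs and embedded in a doubling array, as the slide notes.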

Nv = #keys live (accessible) at version v

                      Update                        Range query             Space
CoW B-Tree            O(log_B Nv) random IOs        O(Z/B) random IOs       O(N B log_B Nv)
“BigTable” style DA   O((log N)/B) sequential IOs   O(Z/B) sequential IOs   O(VN)   (LevelDB)
“Mod-list” in a DA    O((log N)/B) sequential IOs   O(Z/B) sequential IOs   O(N)    (CASTLE)

Stratified BTree

Problem: the embedded “Mod-list” limits the #versions

Solution: version-split arrays during merges

[Figure: a merge of a newer and an older array over keys k1–k5 removes duplicate entries; a v-split then partitions the merged entries into a {v2} array and a {v1, v0} array, duplicating the v0 entries across the two]



Disk Layout: RDA

castle_{cache,extent,freespace,rebuild}.c

[Figure: blocks 1–16 of an extent laid out across several disks, each block duplicated on randomly chosen disks]

Disk Layout: RDA (random duplicate allocation)
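Random duplicate allocation can be sketched as follows (an illustrative toy; the function names and the two-copy default are assumptions, not Castle’s actual layout code):

```python
import random

def rda_place(block_id, disks, copies=2, seed=0):
    """Place a block on `copies` distinct disks chosen pseudo-randomly
    but deterministically from its id, so the placement can be
    recomputed at read time."""
    rng = random.Random(seed * 1_000_003 + block_id)
    return rng.sample(disks, copies)

def rda_read_disk(block_id, disks, queue_depth, copies=2, seed=0):
    """Read from whichever replica's disk currently has the shortest
    queue: duplication turns placement randomness into load balancing."""
    replicas = rda_place(block_id, disks, copies, seed)
    return min(replicas, key=lambda d: queue_depth[d])
```

Because each block lives on two random disks, reads can dodge hot spots, and rebuilding a failed disk streams from every surviving disk at once rather than from a single mirror.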

Performance Comparison

[Graphs, Acunu-powered Cassandra vs ‘standard’ Cassandra:]

• Small random inserts: inserting 3 billion rows
• Insert latency: while inserting 3 billion rows
• Small random range queries: performed immediately after the inserts

Performance summary

                                   Standard         Acunu            Benefit
inserts: rate / 95% latency        ~32k/s / ~32s    ~45k/s / ~0.3s   >1.4x / >100x
gets: rate / 95% latency           ~100/s / ~2s     ~350/s / ~0.5s   >3.5x / >4x
range queries: rate / 95% latency  ~0.4/s / ~15s    ~40/s / ~2s      >100x / >7.5x

Future

Memcache + Cassandra

[Stack diagram: on each node, Cassandra and memcache both run on top of Castle, which runs on the hardware; clients issue get/insert to Cassandra and get/put to memcached]

100k random inserts/sec!


• Castle: like BDB, but for Big Data

• DA: transforms random IO into sequential IO

• Snapshots & Clones: addressing real problems with new workloads

• 2 orders of magnitude better performance and predictability

References

[LSM] The Log-Structured Merge-Tree (LSM-Tree), Patrick O'Neil, Edward Cheng, Dieter Gawlick, Elizabeth O'Neil

http://staff.ustc.edu.cn/~jpq/paper/flash/1996-The%20Log-Structured%20Merge-Tree%20%28LSM-Tree%29.pdf

[COLA] Cache-Oblivious Streaming B-trees, Michael A. Bender et al

http://www.cs.sunysb.edu/~bender/newpub/BenderFaFi07.pdf

[DSST] Making Data Structures Persistent, J. R. Driscoll, N. Sarnak, D. D. Sleator, R. E. Tarjan, Journal of Computer and System Sciences, Vol. 38, No. 1, 1989

http://www.cs.cmu.edu/~sleator/papers/making-data-structures-persistent.pdf

Stratified B-trees and versioned dictionaries, Andy Twigg, Andrew Byde, Grzegorz Miłoś, Tim Moreton, John Wilkes, Tom Wilkie, HotStorage’11

http://www.usenix.org/event/hotstorage11/tech/final_files/Twigg.pdf

[RDA] Random duplicate storage strategies for load balancing in multimedia servers, Joep Aerts, Jan Korst, Sebastian Egner, 2000

http://www.win.tue.nl/~joep/IPL.ps

Apache, Apache Cassandra, Cassandra, Hadoop, and the eye and elephant logos are trademarks of the Apache Software Foundation.
