TRANSCRIPT
TokuMX Internals
The “What”, “Why”, and
“How” of Fractal Tree
Indexing for MongoDB
Zardosht Kasheff
@zkasheff
@tokutek
What is TokuMX?
• TokuMX = MongoDB with improved storage (Fractal
Tree Indexes!)
• Drop-in replacement for MongoDB v2.2 applications
o Including replication and sharding
o Same data model
o Same query language
o Drivers just work
o 2.4 compatibility soon
• Open source: https://github.com/Tokutek/mongo/
TokuMX Benefits
Top 5 benefits to TokuMX
are…
TokuMX Benefit #1
Improved write performance on large data
TokuMX Benefit #2
Compression! (up to 25x)
TokuMX achieved
11.6:1 compression
TokuMX Benefit #3
No Fragmentation.
TokuMX Benefit #4
Scale up
• No global read/write lock
• Document-level locking
• Sysbench benchmark on data > RAM
TokuMX Benefit #4
Scale up
• No global read/write lock
• Document-level locking
• Sysbench benchmark on data < RAM
TokuMX Benefit #5
Transactions: MVCC + multi-statement on single servers
TokuMX Top 5 Benefits Recap
• Improved write performance on large data
• Compression! (up to 25x)
• No fragmentation (Deprecated compact!)
• Scale up
• Transactions (MVCC + multi-statement)
Bottom line: TokuMX makes MongoDB applications stable
and fast for large databases.
TokuMX: How?
Built a storage core from the ground up, with Fractal Tree
indexes, a data structure designed with large data in
mind.
• Some benefits thanks to Fractal Tree indexes
• Some benefits thanks to good old fashioned engineering
Benefits:
• Improved write performance on large data
• Compression! (up to 10x)
• No fragmentation (Deprecated compact!)
• Scale up
• Transactions (MVCC + multi-statement)
Thanks to Fractal Trees
Good old fashioned engineering
Agenda
• Focus on how TokuMX brings the benefits that Fractal
Trees are responsible for. (We won’t focus on scale up
and transactions).
• Compare side-by-side the B-Tree (what many databases
use) and the Fractal Tree. Understand the differences.
• Use differences to show, one by one, how TokuMX’s
Fractal Trees enable:
– Fast writes on big data
– Compression
– No fragmentation
But first, a spoiler…
Spoiler!!
• MySQL customer I/O utilization graph: [without Fractal Trees vs. with Fractal Trees]
It’s all about I/O!!
Fractal Trees v. B-Trees Contrast and Compare
Fractal Trees v. B-Trees
What is a B-Tree?
• Traditional data structure used in databases for over 40
years.
• Used in NEARLY ALL databases, such as MongoDB,
MySQL, BerkeleyDB, etc…
Fractal Trees v. B-Trees
What is a B-Tree?
Simple and elegant data structure:
• Internal nodes store as many pivots and pointers as fit.
• Leaf nodes store data.
Leaf Nodes
Internal Nodes
Fractal Trees v. B-Trees
What is a Fractal Tree?
Another simple and elegant data structure:
• Internal nodes store pivots, pointers, and buffers.
• Leaf nodes store data.
Pointers and pivots
Buffer
Leaf node
Fractal Trees v. B-Trees
What is a Fractal Tree?
Buffers are important:
• Batch up writes
• Will dig into what this means soon.
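The structural difference between the two node types can be sketched in a few lines of toy code (illustrative only; the class and field names are mine, not TokuMX’s):

```python
# Toy sketch: the only structural difference is that a Fractal Tree
# internal node carries message buffers alongside its pivots and pointers.

class BTreeInternalNode:
    def __init__(self):
        self.pivots = []    # keys that partition the key space
        self.children = []  # pointers to child nodes

class FractalTreeInternalNode:
    def __init__(self):
        self.pivots = []
        self.children = []
        self.buffers = []   # one message buffer per child: batches pending writes

class LeafNode:
    def __init__(self):
        self.data = {}      # in both trees, leaves store the actual documents
```

The buffers are the entire story: everything else about the two trees looks the same.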
Fractal Trees v. B-Trees
Characteristics of B-Trees and Fractal Trees for large data:
• Very high percentage of leaf nodes do not fit in memory
• Therefore, accessing a random leaf node likely requires
I/O
On disk, not in memory
Understanding TokuMX’s Fractal
Tree Benefit #1:
Write performance
Write performance. How…
100mm inserts into a collection with 3 secondary indexes
With Less I/O!
100mm inserts into a collection with 3 secondary indexes
Fractal Tree v B-Tree for write I/O
Fractal Trees have significantly better write performance
than B-Trees when data > RAM
– B-Trees become I/O bound. (Disks do < 500 I/O per second)
– Fractal Trees are not I/O bound
This is why B-Tree insertion performance “falls off a cliff”.
[Graph: MySQL and MongoDB insertion rates falling off the cliff]
Conventional Wisdom
This also leads to the following conventional wisdom:
• Keep indexes in memory.
• Keep “working set” in memory.
• Have a “right-most insertion pattern” on indexes
All of these tips are designed to work around the fact that B-Trees become I/O bound when writing to large databases.
Now let’s understand why…
How a B-Tree does writes
Random Writes require I/O
B-Trees algorithm for doing a write:
• Find the appropriate leaf node where the write belongs
• Bring the leaf node into memory EXPENSIVE!
• Modify the leaf node
For large data, nearly all B-Tree leaf nodes are not in memory, so the algorithm requires practically one I/O per write.
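The per-write cost can be illustrated with a toy simulation (all numbers and names are hypothetical, and a real B-Tree cache is far more sophisticated; the point is only that random writes miss the cache almost every time):

```python
import random

# Toy model: only ~1% of leaves fit in memory, so a random write
# almost always triggers a disk read of the target leaf first.
NUM_LEAVES = 100_000
cached = set(random.sample(range(NUM_LEAVES), 1_000))  # leaves in RAM

io_count = 0

def btree_write(key):
    global io_count
    leaf = key % NUM_LEAVES      # find the leaf where the write belongs
    if leaf not in cached:
        io_count += 1            # bring the leaf into memory: EXPENSIVE
        cached.add(leaf)         # (toy cache: no eviction)
    # ...modify the leaf in memory...

for k in random.sample(range(10_000_000), 5_000):
    btree_write(k)

print(io_count)  # close to 5,000: practically one I/O per write
```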
How a Fractal Tree does Writes
• Writes are batched in buffers as messages
• When a buffer is full, its messages spill into the buffers of the child node (which also spill when they get full)
• Through spilling, messages eventually make it to leaf nodes.
Let’s zoom in here for the next slide
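A minimal sketch of the buffer-and-spill algorithm described above (toy code, not TokuMX’s implementation; the buffer size, fanout, and bit-shift routing are simplifying assumptions):

```python
# Toy buffer-and-spill: writes land in the root's buffer, and one "I/O"
# (a flush) moves a whole buffer of messages one level down the tree.
BUFFER_SIZE = 64
io_count = 0

class Node:
    def __init__(self, depth=0, children=None):
        self.depth = depth
        self.children = children or []
        self.buffer = []   # pending insert messages
        self.data = []     # only used by leaves

    def insert(self, msg):
        self.buffer.append(msg)
        if len(self.buffer) >= BUFFER_SIZE:
            self.flush()

    def flush(self):
        global io_count
        io_count += 1  # one I/O writes out a whole buffer of messages
        if self.children:
            for msg in self.buffer:
                # toy routing: pick a child by key bits at this depth
                child = self.children[(msg >> (2 * self.depth)) % len(self.children)]
                child.insert(msg)  # spill; the child may flush in turn
        else:
            self.data.extend(self.buffer)  # messages reached a leaf
        self.buffer = []

# A 3-level tree: root -> 4 internal nodes -> 16 leaves
root = Node(0, [Node(1, [Node(2) for _ in range(4)]) for _ in range(4)])
for k in range(10_000):
    root.insert(k)
print(io_count)  # a few hundred I/Os for 10,000 writes
```

Each I/O moves `BUFFER_SIZE` messages instead of one, which is exactly the amortization the slides describe.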
When does a Fractal Tree do I/O for Writes?
– When flushing a buffer’s worth of writes.
Here we see the BIG difference in I/O performance for
Fractal Trees v. B-Trees:
B-Trees do an I/O to write one measly document.
Fractal Trees do an I/O to write a buffer’s worth of
documents. This is why I/O is drastically reduced!
Fractal Tree Wisdom
This also leads to the following wisdom for Fractal Trees:
• Indexes don’t need to fit in memory.
• “Working set” does not need to be in memory.
• Indexes don’t need to worry about their “insertion
pattern”.
These capabilities reduce complexity of database design,
and enable rich indexes and queries that B-trees cannot
support.
Understanding TokuMX’s Fractal
Tree Benefit #2:
Compression
• BitTorrent Peer Snapshot Data (~31 million documents), 3 indexes
• http://cs.brown.edu/~pavlo/torrent/
What Compression?
TokuMX achieved
11.6:1 compression
TokuMX compression algorithm is simple!
1. Take large chunks of data
2. Use standard compression algorithms (zlib, lzma, or
quicklz) and compress them
3. There is no step 3!
Effectiveness of these compression algorithms depends on how much data you give them. TokuMX gives them lots of data, so TokuMX compresses well.
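The chunk-size effect is easy to demonstrate with zlib, one of the algorithms named above (the data here is synthetic and the exact ratios will vary with real documents):

```python
import zlib

# Same data, compressed in small chunks vs. one large chunk.
# Bigger inputs give the compressor more context to exploit.
doc = b'{"user": "alice", "action": "seed", "file": "ubuntu.iso"}\n'
data = doc * 10_000  # ~600KB of repetitive, JSON-like records

# Sum of per-chunk compressed sizes for 8KB "B-Tree-sized" chunks:
small_chunks = sum(len(zlib.compress(data[i:i + 8 * 1024]))
                   for i in range(0, len(data), 8 * 1024))
# One large chunk, as a Fractal Tree node would give the compressor:
large_chunk = len(zlib.compress(data))

print(len(data) / small_chunks)  # lower compression ratio
print(len(data) / large_chunk)   # higher compression ratio
```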
The secret is…
Compression: How?
TokuMX node sizes (4MB) are much larger than B-Tree node sizes
Compression: The Secret
• B-Tree nodes are small: 8KB or 16KB
• Fractal Tree nodes are large: 4MB
Larger node size leads to better compression
So the question is, why do Fractal Trees have such large
node sizes?
Again, it’s all about the I/O.
For writes:
– B-Trees: reading a large node to write one measly row is painful
– Fractal Trees: reading a large node to write a proportionally large buffer is not painful. In fact, it’s better: reading larger nodes means you pay more disk bandwidth cost than disk seek cost.
Conclusion: Fractal Trees should use large nodes for
writes, for better performance AND compression.
Fractal Trees: Why Large Nodes?
What about reading a single document?
The problem:
• For point query, we are reading one
measly document
• Just as B-Trees don’t want to do a large
I/O to write one measly document, Fractal
Trees should not read 4MB to read one
measly document.
Fractal Trees: Large Nodes + Reads
What about reading a single document?
The solution:
• Partition the 4MB leaf node into 64KB “basement nodes” (the value of 64KB is configurable)
• 64KB chunks are individually compressed, concatenated, and written to disk to represent a leaf node
• When flushing data for writes, read the full 4MB node
• When reading “one measly document”, read only the appropriate 64KB chunk of data
64KB chunks are a nice sweet spot for getting good compression and point-query performance
Fractal Trees: Large Nodes + Reads
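The basement-node scheme can be sketched as follows (a simplification: the 64KB chunk size matches the slides, but the function names and on-disk layout here are illustrative, not TokuMX’s actual format):

```python
import zlib

BASEMENT_SIZE = 64 * 1024  # 64KB basement nodes; configurable in TokuMX

def compress_leaf(leaf):
    """Split a large leaf into 64KB basements, compressed individually."""
    return [zlib.compress(leaf[i:i + BASEMENT_SIZE])
            for i in range(0, len(leaf), BASEMENT_SIZE)]

def read_one(basements, offset):
    """Point query: decompress only the basement holding `offset`."""
    chunk = zlib.decompress(basements[offset // BASEMENT_SIZE])
    return chunk[offset % BASEMENT_SIZE]

def read_all(basements):
    """Flush path: reconstruct the whole uncompressed leaf."""
    return b"".join(zlib.decompress(c) for c in basements)

leaf = bytes(range(256)) * (4 * 1024 * 1024 // 256)  # a 4MB leaf
basements = compress_leaf(leaf)           # 64 individually compressed chunks
assert read_one(basements, 1_000_000) == leaf[1_000_000]
assert read_all(basements) == leaf
```

A point query pays for one 64KB decompression instead of 4MB, while a flush still gets the large sequential read the write path wants.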
Summary:
• Use large nodes: 4MB
• Partition leaf nodes into 64KB contiguous chunks
• Compress 64 KB chunks individually with standard
compression algorithms (zlib, lzma, or quicklz), getting
good compression
• Concatenate compressed chunks to make large
compressed leaf node.
Fractal Trees: Compression
Understanding TokuMX’s Fractal
Tree Benefit #3:
No Fragmentation
Fragmentation happens when nodes on disk get
rearranged in random order, with wasted space
accumulating between nodes.
Why MongoDB Users care about fragmentation:
• Wasted space between blocks makes keeping
working set in memory more difficult, leads to
disk bloat
• Blocks of data rearranged in random order leads
to performance degradation
What is Fragmentation?
Workarounds for Fragmentation
MongoDB workarounds:
– Pad inserted documents
with some additional space
to account for future
updates
– Occasionally bring the
database down and run
compact. This correctly
rearranges blocks and
removes wasted space
– Aggressively preallocate
files to reserve space
TokuMX workarounds: none needed!
Why TokuMX Users don’t care about fragmentation:
• On wasted space between blocks:
– Compression greatly mitigates impact of wasted space on disk
usage
– Write performance allows working set to exceed memory
• On blocks of data being rearranged in random order:
– Short answer: large leaf nodes practically eliminate the I/O
impact of rearranged data blocks (once again, it’s all about the
I/O)
– Long answer: let’s do some analysis…
Why TokuMX does not Fragment
First, let’s assume the following costs of disk access:
• Disk seek time: 10ms (i.e., 100 I/Os per second)
• Disk bandwidth time: 100MB/s
Numbers are meant to be nice estimates to make math
simple.
Question to ask ourselves that shows the impact of
fragmentation:
At what rate (determined in bytes/second) can I read an
entire B-Tree?
Impact of Rearranged blocks
At what rate (determined in bytes/second) can I read an
entire B-Tree?
Non-fragmented B-Trees:
– all data sequentially arranged, therefore sequentially accessed
– Effective rate: 100 MB/s (at most). Great performance!
Fragmented B-Tree:
– Suppose node size 8KB, accessing leaf node requires I/O
– Cost of reading block of data is seek time + bandwidth time
– seek time: 10ms, bandwidth time: ~100µs, so cost is dominated by seek
– Effective rate: 8KB/10ms = 800 KB/s. Poor performance!
This is the poor performance one sees with
fragmentation, and why users want to compact
Impact of Rearranged blocks
At what rate (determined in bytes/second) can I read an
entire Fractal Tree?
Non-fragmented Fractal Tree:
– Effective rate: 100 MB/s (at most). Great performance!
“Fragmented” Fractal Tree:
– Suppose node size 1MB compressed, 4MB uncompressed
– Cost of reading block of data is seek time + bandwidth time
– seek time: 10ms, bandwidth time: 10ms
– Effective rate: 1MB / 20ms = 50 MB/s. Great performance!
Large Fractal Tree nodes mitigate I/O seek cost of a
fragmented collection!
Impact of Rearranged blocks
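The arithmetic from the last two slides, worked in one place (using the slides’ round numbers; real disks vary):

```python
# Effective read rate of a fully fragmented tree, where every node
# costs one seek plus its transfer time.
SEEK = 0.010         # 10ms per seek (100 I/Os per second)
BANDWIDTH = 100e6    # 100 MB/s of disk bandwidth

def read_rate(node_bytes):
    """Bytes per second when each node pays a full seek."""
    return node_bytes / (SEEK + node_bytes / BANDWIDTH)

btree = read_rate(8 * 1024)        # 8KB B-Tree nodes
fractal = read_rate(1024 * 1024)   # 1MB compressed Fractal Tree nodes

print(btree / 1e3)   # ~800 KB/s: seek-dominated, poor
print(fractal / 1e6) # ~50 MB/s: bandwidth-dominated, great
```

Large nodes shift the cost from seeks (which fragmentation multiplies) to bandwidth (which it doesn’t), which is why a “fragmented” Fractal Tree still reads fast.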
• Don’t worry about fragmentation.
Summary on Fragmentation
TokuMX Resources
tokutek.com/products/downloads
For evaluations or enterprise support:
[email protected], @zkasheff on twitter