TRANSCRIPT
TokuMX Internals
The “What”, “Why”, and
“How” of Fractal Tree
Indexing for MongoDB
Zardosht Kasheff
@zkasheff
@tokutek
What is TokuMX?
• TokuMX = MongoDB with improved storage (Fractal
Tree Indexes!)
• Drop-in replacement for MongoDB v2.2 applications
o Including replication and sharding
o Same data model
o Same query language
o Drivers just work
o 2.4 compatibility soon
• Open source: https://github.com/Tokutek/mongo/
TokuMX Benefits
Top 5 benefits to TokuMX
are…
TokuMX Benefit #1
Improved write performance on large data
TokuMX Benefit #2
Compression! (up to 25x)
TokuMX achieved
11.6:1 compression
TokuMX Benefit #3
No Fragmentation.
TokuMX Benefit #4
Scale up
• No global read/write lock
• Document-level locking
• Sysbench benchmark on data > RAM
TokuMX Benefit #4
Scale up
• No global read/write lock
• Document-level locking
• Sysbench benchmark on data < RAM
TokuMX Benefit #5
Transactions: MVCC + multi-statement on single servers
TokuMX Top 5 Benefits Recap
• Improved write performance on large data
• Compression! (up to 25x)
• No fragmentation (Deprecated compact!)
• Scale up
• Transactions (MVCC + multi-statement)
Bottom line: TokuMX makes MongoDB applications stable
and fast for large databases.
TokuMX: How?
Built a storage core from the ground up, with Fractal Tree
indexes, a data structure designed with large data in
mind.
• Some benefits thanks to Fractal Tree indexes
• Some benefits thanks to good old fashioned engineering
Benefits:
• Improved write performance on large data
• Compression! (up to 10x)
• No fragmentation (Deprecated compact!)
• Scale up
• Transactions (MVCC + multi-statement)
Thanks to Fractal Trees
Good old fashioned engineering
Agenda
• Focus on how TokuMX brings the benefits that Fractal
Trees are responsible for. (We won’t focus on scale up
and transactions).
• Compare side-by-side the B-Tree (what many databases
use) and the Fractal Tree. Understand the differences.
• Use differences to show, one by one, how TokuMX’s
Fractal Trees enable:
– Fast writes on big data
– Compression
– No fragmentation
But first, a spoiler…
Spoiler!!
• MySQL customer I/O utilization graph: [without Fractal Trees vs. with Fractal Trees]
It’s all about I/O!!
Fractal Trees v. B-Trees Contrast and Compare
Fractal Trees v. B-Trees
What is a B-Tree?
• Traditional data structure used in databases for over 40
years.
• Used in NEARLY ALL databases, such as MongoDB,
MySQL, BerkeleyDB, etc…
Fractal Trees v. B-Trees
What is a B-Tree?
Simple and elegant data structure:
• Internal nodes store as many pivots and pointers as fit.
• Leaf nodes store data.
Leaf Nodes
Internal Nodes
Fractal Trees v. B-Trees
What is a Fractal Tree?
Another simple and elegant data structure:
• Internal nodes store pivots, pointers, and buffers.
• Leaf nodes store data.
Pointers and pivots
Buffer
Leaf node
Fractal Trees v. B-Trees
What is a Fractal Tree?
Buffers are important:
• Batch up writes
• Will dig into what this means soon.
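The structural difference between the two node types can be sketched in a few lines of toy code (illustrative only; the class and field names are mine, not TokuMX’s):

```python
# Toy sketch: the only structural difference is that a Fractal Tree
# internal node carries message buffers alongside its pivots and pointers.

class BTreeInternalNode:
    def __init__(self):
        self.pivots = []    # keys that partition the key space
        self.children = []  # pointers to child nodes

class FractalTreeInternalNode:
    def __init__(self):
        self.pivots = []
        self.children = []
        self.buffers = []   # one message buffer per child: batches pending writes

class LeafNode:
    def __init__(self):
        self.data = {}      # in both trees, leaves store the actual documents
```

The buffers are the entire story: everything else about the two trees looks the same.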
Fractal Trees v. B-Trees
Characteristics of B-Trees and Fractal Trees for large data:
• Very high percentage of leaf nodes do not fit in memory
• Therefore, accessing a random leaf node likely requires
I/O
On disk, not in memory
Understanding TokuMX’s Fractal
Tree Benefit #1:
Write performance
Write performance. How…
100mm inserts into a collection with 3 secondary indexes
With Less I/O!
100mm inserts into a collection with 3 secondary indexes
Fractal Tree v B-Tree for write I/O
Fractal Trees have significantly better write performance
than B-Trees when data > RAM
– B-Trees become I/O bound. (Disks do < 500 I/O per second)
– Fractal Trees are not I/O bound
This is why B-Tree insertion performance “falls off a cliff”.
[Graph: MySQL and MongoDB insertion rates falling off the cliff]
Conventional Wisdom
This also leads to the following conventional wisdom:
• Keep indexes in memory.
• Keep “working set” in memory.
• Have a “right-most insertion pattern” on indexes
All of these tips are designed to work around the fact that B-Trees become I/O bound when writing to large databases.
Now let’s understand why…
How a B-Tree does writes
Random Writes require I/O
B-Trees algorithm for doing a write:
• Find the appropriate leaf node where the write belongs
• Bring the leaf node into memory EXPENSIVE!
• Modify the leaf node
For large data, nearly all B-Tree leaf nodes are not in memory, so the algorithm requires practically one I/O per write.
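The per-write cost can be illustrated with a toy simulation (all numbers and names are hypothetical, and a real B-Tree cache is far more sophisticated; the point is only that random writes miss the cache almost every time):

```python
import random

# Toy model: only ~1% of leaves fit in memory, so a random write
# almost always triggers a disk read of the target leaf first.
NUM_LEAVES = 100_000
cached = set(random.sample(range(NUM_LEAVES), 1_000))  # leaves in RAM

io_count = 0

def btree_write(key):
    global io_count
    leaf = key % NUM_LEAVES      # find the leaf where the write belongs
    if leaf not in cached:
        io_count += 1            # bring the leaf into memory: EXPENSIVE
        cached.add(leaf)         # (toy cache: no eviction)
    # ...modify the leaf in memory...

for k in random.sample(range(10_000_000), 5_000):
    btree_write(k)

print(io_count)  # close to 5,000: practically one I/O per write
```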
How a Fractal Tree does Writes
• Writes are batched in buffers as messages
• When a buffer is full, its messages spill into the buffers of the child node (which also spill when they get full)
• Through spilling, messages eventually make it to leaf nodes.
Let’s zoom in here for the next slide
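A minimal sketch of the buffer-and-spill algorithm described above (toy code, not TokuMX’s implementation; the buffer size, fanout, and bit-shift routing are simplifying assumptions):

```python
# Toy buffer-and-spill: writes land in the root's buffer, and one "I/O"
# (a flush) moves a whole buffer of messages one level down the tree.
BUFFER_SIZE = 64
io_count = 0

class Node:
    def __init__(self, depth=0, children=None):
        self.depth = depth
        self.children = children or []
        self.buffer = []   # pending insert messages
        self.data = []     # only used by leaves

    def insert(self, msg):
        self.buffer.append(msg)
        if len(self.buffer) >= BUFFER_SIZE:
            self.flush()

    def flush(self):
        global io_count
        io_count += 1  # one I/O writes out a whole buffer of messages
        if self.children:
            for msg in self.buffer:
                # toy routing: pick a child by key bits at this depth
                child = self.children[(msg >> (2 * self.depth)) % len(self.children)]
                child.insert(msg)  # spill; the child may flush in turn
        else:
            self.data.extend(self.buffer)  # messages reached a leaf
        self.buffer = []

# A 3-level tree: root -> 4 internal nodes -> 16 leaves
root = Node(0, [Node(1, [Node(2) for _ in range(4)]) for _ in range(4)])
for k in range(10_000):
    root.insert(k)
print(io_count)  # a few hundred I/Os for 10,000 writes
```

Each I/O moves `BUFFER_SIZE` messages instead of one, which is exactly the amortization the slides describe.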
When does a Fractal Tree do I/O for Writes?
– When flushing a buffer’s worth of writes.
Here we see the BIG difference in I/O performance for
Fractal Trees v. B-Trees:
B-Trees do an I/O to write one measly document.
Fractal Trees do an I/O to write a buffer’s worth of
documents. This is why I/O is drastically reduced!
Fractal Tree Wisdom
This also leads to the following wisdom for Fractal Trees:
• Indexes don’t need to fit in memory.
• “Working set” does not need to be in memory.
• Indexes don’t need to worry about their “insertion
pattern”.
These capabilities reduce complexity of database design,
and enable rich indexes and queries that B-trees cannot
support.
Understanding TokuMX’s Fractal
Tree Benefit #2:
Compression
• BitTorrent Peer Snapshot Data (~31 million documents), 3 indexes
• http://cs.brown.edu/~pavlo/torrent/
What Compression?
TokuMX achieved
11.6:1 compression
TokuMX compression algorithm is simple!
1. Take large chunks of data
2. Use standard compression algorithms (zlib, lzma, or
quicklz) and compress them
3. There is no step 3!
Effectiveness of these compression algorithms depends on how much data you give them. TokuMX gives them lots of data, so TokuMX compresses well.
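The chunk-size effect is easy to demonstrate with zlib, one of the algorithms named above (the data here is synthetic and the exact ratios will vary with real documents):

```python
import zlib

# Same data, compressed in small chunks vs. one large chunk.
# Bigger inputs give the compressor more context to exploit.
doc = b'{"user": "alice", "action": "seed", "file": "ubuntu.iso"}\n'
data = doc * 10_000  # ~600KB of repetitive, JSON-like records

# Sum of per-chunk compressed sizes for 8KB "B-Tree-sized" chunks:
small_chunks = sum(len(zlib.compress(data[i:i + 8 * 1024]))
                   for i in range(0, len(data), 8 * 1024))
# One large chunk, as a Fractal Tree node would give the compressor:
large_chunk = len(zlib.compress(data))

print(len(data) / small_chunks)  # lower compression ratio
print(len(data) / large_chunk)   # higher compression ratio
```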
The secret is…
Compression: How?
TokuMX node sizes (4MB) are much larger than B-Tree node sizes
Compression: The Secret
• B-Tree nodes are small: 8KB or 16KB
• Fractal Tree nodes are large: 4MB
Larger node size leads to better compression
So the question is, why do Fractal Trees have such large
node sizes?
Again, it’s all about the I/O.
For writes:
– B-Trees: reading a large node to write one measly row is painful
– Fractal Trees: reading a large node to write a proportionally large buffer is not painful. In fact, it’s better: reading larger nodes means you pay more disk bandwidth cost than disk seek cost.
Conclusion: Fractal Trees should use large nodes for
writes, for better performance AND compression.
Fractal Trees: Why Large Nodes?
What about reading a single document?
The problem:
• For point query, we are reading one
measly document
• Just as B-Trees don’t want to do a large
I/O to write one measly document, Fractal
Trees should not read 4MB to read one
measly document.
Fractal Trees: Large Nodes + Reads
What about reading a single document?
The solution:
• Partition the 4MB leaf node into 64KB “basement nodes” (the value of 64KB is configurable)
• 64KB chunks are individually compressed, concatenated, and written to disk to represent a leaf node
• When flushing data for writes, read the full 4MB node
• When reading “one measly document”, read only the appropriate 64KB chunk of data
64KB chunks are a nice sweet spot for getting good compression and point-query performance
Fractal Trees: Large Nodes + Reads
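The basement-node scheme can be sketched as follows (a simplification: the 64KB chunk size matches the slides, but the function names and on-disk layout here are illustrative, not TokuMX’s actual format):

```python
import zlib

BASEMENT_SIZE = 64 * 1024  # 64KB basement nodes; configurable in TokuMX

def compress_leaf(leaf):
    """Split a large leaf into 64KB basements, compressed individually."""
    return [zlib.compress(leaf[i:i + BASEMENT_SIZE])
            for i in range(0, len(leaf), BASEMENT_SIZE)]

def read_one(basements, offset):
    """Point query: decompress only the basement holding `offset`."""
    chunk = zlib.decompress(basements[offset // BASEMENT_SIZE])
    return chunk[offset % BASEMENT_SIZE]

def read_all(basements):
    """Flush path: reconstruct the whole uncompressed leaf."""
    return b"".join(zlib.decompress(c) for c in basements)

leaf = bytes(range(256)) * (4 * 1024 * 1024 // 256)  # a 4MB leaf
basements = compress_leaf(leaf)           # 64 individually compressed chunks
assert read_one(basements, 1_000_000) == leaf[1_000_000]
assert read_all(basements) == leaf
```

A point query pays for one 64KB decompression instead of 4MB, while a flush still gets the large sequential read the write path wants.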
Summary:
• Use large nodes: 4MB
• Partition leaf nodes into 64KB contiguous chunks
• Compress 64 KB chunks individually with standard
compression algorithms (zlib, lzma, or quicklz), getting
good compression
• Concatenate compressed chunks to make large
compressed leaf node.
Fractal Trees: Compression
Understanding TokuMX’s Fractal
Tree Benefit #3:
No Fragmentation
Fragmentation happens when nodes on disk get
rearranged in random order, with wasted space
accumulating between nodes.
Why MongoDB Users care about fragmentation:
• Wasted space between blocks makes keeping
working set in memory more difficult, leads to
disk bloat
• Blocks of data rearranged in random order leads
to performance degradation
What is Fragmentation?
Workarounds for Fragmentation
MongoDB workarounds:
– Pad inserted documents
with some additional space
to account for future
updates
– Occasionally bring the
database down and run
compact. This correctly
rearranges blocks and
removes wasted space
– Aggressively preallocate
files to reserve space
TokuMX workarounds: none needed!
Why TokuMX Users don’t care about fragmentation:
• On wasted space between blocks:
– Compression greatly mitigates impact of wasted space on disk
usage
– Write performance allows working set to exceed memory
• On blocks of data being rearranged in random order:
– Short answer: large leaf nodes practically eliminate the I/O
impact of rearranged data blocks (once again, it’s all about the
I/O)
– Long answer: let’s do some analysis…
Why TokuMX does not Fragment
First, let’s assume the following costs of disk access:
• Disk seek time: 10ms (i.e., 100 I/Os per second)
• Disk bandwidth time: 100MB/s
Numbers are meant to be nice estimates to make math
simple.
Question to ask ourselves that shows the impact of
fragmentation:
At what rate (determined in bytes/second) can I read an
entire B-Tree?
Impact of Rearranged blocks
At what rate (determined in bytes/second) can I read an
entire B-Tree?
Non-fragmented B-Trees:
– all data sequentially arranged, therefore sequentially accessed
– Effective rate: 100 MB/s (at most). Great performance!
Fragmented B-Tree:
– Suppose node size 8KB, accessing leaf node requires I/O
– Cost of reading block of data is seek time + bandwidth time
– seek time: 10ms, bandwidth time: ~100µs, so cost is dominated by seek
– Effective rate: 8KB/10ms = 800 KB/s. Poor performance!
This is the poor performance one sees with
fragmentation, and why users want to compact
Impact of Rearranged blocks
At what rate (determined in bytes/second) can I read an
entire Fractal Tree?
Non-fragmented Fractal Tree:
– Effective rate: 100 MB/s (at most). Great performance!
“Fragmented” Fractal Tree:
– Suppose node size 1MB compressed, 4MB uncompressed
– Cost of reading block of data is seek time + bandwidth time
– seek time: 10ms, bandwidth time: 10ms
– Effective rate: 1MB / 20ms = 50 MB/s. Great performance!
Large Fractal Tree nodes mitigate I/O seek cost of a
fragmented collection!
Impact of Rearranged blocks
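The arithmetic from the last two slides, worked in one place (using the slides’ round numbers; real disks vary):

```python
# Effective read rate of a fully fragmented tree, where every node
# costs one seek plus its transfer time.
SEEK = 0.010         # 10ms per seek (100 I/Os per second)
BANDWIDTH = 100e6    # 100 MB/s of disk bandwidth

def read_rate(node_bytes):
    """Bytes per second when each node pays a full seek."""
    return node_bytes / (SEEK + node_bytes / BANDWIDTH)

btree = read_rate(8 * 1024)        # 8KB B-Tree nodes
fractal = read_rate(1024 * 1024)   # 1MB compressed Fractal Tree nodes

print(btree / 1e3)   # ~800 KB/s: seek-dominated, poor
print(fractal / 1e6) # ~50 MB/s: bandwidth-dominated, great
```

Large nodes shift the cost from seeks (which fragmentation multiplies) to bandwidth (which it doesn’t), which is why a “fragmented” Fractal Tree still reads fast.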
• Don’t worry about fragmentation.
Summary on Fragmentation
TokuMX Resources
tokutek.com/products/downloads
For evaluations or enterprise support:
[email protected], @zkasheff on twitter