Fractal Tree Indexes: From Theory to Practice
DESCRIPTION
Fractal Tree Indexes are compared to the indexing incumbent, B-trees, followed by the capabilities they bring to MySQL (in TokuDB) and MongoDB (in TokuMX). Presented at Percona Live London 2013.
TRANSCRIPT
![Page 1: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/1.jpg)
Fractal Tree® Indexes: Theory to Practice
Percona Live London 2013
Tim Callaghan, [email protected]
@tmcallaghan
Tuesday, November 12, 13
![Page 2: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/2.jpg)
Ever seen this?
IO Utilization Graph, performance is IO limited
![Page 3: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/3.jpg)
Who is Tokutek?
Tokutek builds high-performance database software!
TokuDB - storage engine for MySQL and MariaDB
TokuMX - storage engine for MongoDB
HDD & SSD storage
Storage Engine
Developer Interface
![Page 4: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/4.jpg)
Who am I?
• 17 year database consumer
  – schema design, development, deployment
  – database administration + infrastructure
  – mostly Oracle
• 5 year database producer
  – 2 years @ VoltDB
  – 2+ years @ Tokutek
![Page 5: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/5.jpg)
Housekeeping
• Feedback is important to me
• Ideas for Webinars or Presentations?
• Who’s using MongoDB?
• Anyone using TokuDB or TokuMX?
• Please ask questions
![Page 6: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/6.jpg)
Agenda
• Why Fractal Tree indexes are cool
• What they enable in MySQL® (TokuDB)
• What they enable in MongoDB® (TokuMX)
• Q+A
![Page 7: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/7.jpg)
Indexing:
B-trees and Fractal Tree Indexes
![Page 8: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/8.jpg)
B-trees
![Page 9: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/9.jpg)
B-tree Overview - vocabulary
Internal Nodes - Path to data
Leaf Nodes - Actual Data - Sorted
Pointers
Pivots
![Page 10: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/10.jpg)
B-tree Overview - example
22
10 99
2, 3, 4 10,20 22,25 99
* Pivot Rule is >=
![Page 11: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/11.jpg)
B-tree Overview - search
22
10 99
2, 3, 4 10,20 22,25 99
“Find 25”
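The "Find 25" walk can be sketched in a few lines of Python. This is a toy model of the slide's tree, not any engine's implementation; the `Leaf`, `Internal`, and `search` names are assumptions made for illustration. The `>=` pivot rule is expressed with `bisect_right`, which sends a key equal to a pivot into the right-hand child.

```python
from bisect import bisect_right


class Leaf:
    def __init__(self, keys):
        self.keys = keys  # sorted keys (the actual data)


class Internal:
    def __init__(self, pivots, children):
        self.pivots = pivots      # sorted pivot keys
        self.children = children  # len(children) == len(pivots) + 1


def search(node, key):
    """Return (found, nodes_visited) for a root-to-leaf lookup."""
    visited = 1
    while isinstance(node, Internal):
        # Pivot rule is >=: a key equal to a pivot goes to the right child.
        node = node.children[bisect_right(node.pivots, key)]
        visited += 1
    return key in node.keys, visited


# The example tree from the slide.
tree = Internal([22], [
    Internal([10], [Leaf([2, 3, 4]), Leaf([10, 20])]),
    Internal([99], [Leaf([22, 25]), Leaf([99])]),
])

found, visited = search(tree, 25)
```

Every lookup touches one node per level, so the cost of a search is the height of the tree; only the nodes that are not already in RAM cost an IO.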
![Page 12: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/12.jpg)
B-tree Overview - insert
22
10 99
2, 3, 4 10,15,20 22,25 99
“Insert 15”
![Page 13: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/13.jpg)
B-tree Overview - performance
22 (RAM)
10 99 (RAM)
2, 3, 4 10,20 22,25 99 (DISK)
Performance is IO limited when data > RAM: one IO is needed for each insert/update
(actually it’s one IO for every index on the table)
![Page 14: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/14.jpg)
Fractal Tree Indexes
![Page 15: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/15.jpg)
Fractal Tree Indexes
similar to B-trees
• store data in leaf nodes
• use index key for ordering
different than B-trees
• message buffers
• big nodes (4MB vs. ~16KB)
All internal nodes have message buffers
As buffers overflow, they cascade down the tree
Messages are eventually applied to leaf nodes
![Page 16: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/16.jpg)
Fractal Tree Indexes - sample data
25
10 99
2,3,4 10,20 22,25 99
Looks a lot like a B-tree!
![Page 17: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/17.jpg)
insert 15;
Fractal Tree Indexes - insert
25
10 99
2,3,4 10,20 22,25 99
insert (15)
• search operations must consider messages along the way
• messages cascade down the tree as buffers fill up
• they are eventually applied to the leaf nodes, hundreds or thousands of operations for a single IO
• CPU and cache are conserved as important data is not ejected
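The buffering behaviour in these bullets can be sketched as a toy model. The version below is a deliberate simplification and not TokuDB's implementation: it has a single internal node with one buffer (a real tree has a buffer in every internal node and flushes one level at a time), and the `BufferedTree` name, the `buffer_capacity` parameter, and the `leaf_writes` counter (a stand-in for disk IOs) are all assumptions made for illustration.

```python
from bisect import bisect_right


class BufferedTree:
    """One internal node with a message buffer sitting above sorted leaves."""

    def __init__(self, pivots, leaves, buffer_capacity):
        self.pivots = pivots            # sorted pivot keys
        self.leaves = leaves            # one sorted key list per leaf
        self.buffer = []                # pending ("insert", key) messages
        self.capacity = buffer_capacity
        self.leaf_writes = 0            # counts leaves rewritten by flushes

    def insert(self, key):
        # An insert only appends a message; no leaf is touched yet.
        self.buffer.append(("insert", key))
        if len(self.buffer) > self.capacity:
            self.flush()

    def flush(self):
        # Apply every buffered message, then charge one write per touched leaf.
        touched = set()
        for _, key in self.buffer:
            i = bisect_right(self.pivots, key)
            if key not in self.leaves[i]:
                self.leaves[i].append(key)
                self.leaves[i].sort()
            touched.add(i)
        self.leaf_writes += len(touched)
        self.buffer.clear()

    def contains(self, key):
        # Searches must consider messages that have not reached a leaf yet.
        if any(k == key for _, k in self.buffer):
            return True
        return key in self.leaves[bisect_right(self.pivots, key)]


demo = BufferedTree([10, 22, 99], [[2, 3, 4], [10, 20], [22, 25], [99]],
                    buffer_capacity=4)
demo.insert(15)  # buffered only: demo.contains(15) is True, no leaf written
```

In this sketch, inserting 16 through 19 afterwards overflows the buffer and moves all five keys into the same leaf with a single leaf write, which is the amortization the slide describes.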
![Page 18: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/18.jpg)
Fractal Tree Indexes - other operations
25
10 99
2,3,4 10,20 22,25 99
add_column(c4 bigint)
delete(99)
increment(22,+5)
...
insert (100)
delete(8)
delete(2)
insert (8)
Lots of operations can be messages!
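One way to picture "operations as messages" is a function that replays a buffer against a leaf, oldest message first. This is an illustrative sketch only: the tuple layout, the `apply_messages` name, and modelling a leaf as a Python dict are assumptions, and schema messages such as add_column are left out.

```python
def apply_messages(leaf, messages):
    """Apply buffered messages in arrival order to a leaf (dict of key -> value)."""
    for message in messages:
        op, key = message[0], message[1]
        if op == "insert":
            leaf[key] = message[2]
        elif op == "delete":
            leaf.pop(key, None)          # deleting a missing key is a no-op
        elif op == "increment":
            if key in leaf:
                leaf[key] += message[2]  # no read needed when the message is queued
        else:
            raise ValueError(f"unknown message type: {op}")
    return leaf


leaf = {22: 100, 25: 7}
pending = [
    ("increment", 22, 5),
    ("delete", 25),
    ("insert", 8, 0),
    ("delete", 8),
    ("insert", 8, 1),
]
result = apply_messages(leaf, pending)
```

Order matters: the delete(8) that sits between the two insert(8) messages only removes the first version, so the key ends up present with the later value.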
![Page 19: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/19.jpg)
TokuDB
Fractal Tree Indexing + MySQL/MariaDB
![Page 20: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/20.jpg)
What is TokuDB?
• Transactional MySQL Storage Engine - think InnoDB
• Available for MySQL 5.5 and MariaDB 5.5
• ACID and MVCC
• Free/OSS Community Edition
  – http://github.com/Tokutek/ft-engine
• Enterprise Edition
  – Commercial support + hot backup
Performance + Compression + Agility
![Page 21: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/21.jpg)
TokuDB Performance
Warning - Benchmarks Ahead!
![Page 22: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/22.jpg)
Indexed Insertion Performance
• High-performance insert/update/delete for large databases (> RAM) while maintaining indexes
* old numbers, now > 25K/sec
![Page 23: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/23.jpg)
Sysbench Performance
Sysbench read/write workload, > RAM
The fastest IO is the one you never have to do (compression)
![Page 24: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/24.jpg)
Performance Advantages
• Efficient index maintenance, especially secondary indexes
• Clustered secondary indexes
  – Additional copy of the row is stored in the index
  – No additional IO to get row data from primary key
  – Think of a better covering index (it includes all non-indexed columns)
  – Compression eliminates size concerns
• Big blocks = sequential IO for range scans
  – Basement nodes are always co-located
• Multi-threaded bulk loader
![Page 25: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/25.jpg)
TokuDB Compression
![Page 26: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/26.jpg)
Compression: TokuDB vs. InnoDB
• InnoDB compression misses force node splits, which greatly reduces performance
  – MySQL 5.6 “dynamic padding” (from FB), less cache
• Larger block size and flexible on-disk size wins!
• Multiple compression algorithms (lzma, quicklz, zlib)
• Larger, less frequent writes (much less IO)
• Why it matters on spinning disks:
  – Compressed reads and amortized compressed writes overcome IO limitations
• Why it matters on flash/SSD:
  – Buy less (250GB * 10x compression holds as much as 2.5TB)
  – Large/less frequent writes are flash friendly
![Page 27: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/27.jpg)
Compression + IO Reduction
• Server was at 90% IO utilization with InnoDB, 10% IO utilization with TokuDB
![Page 28: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/28.jpg)
Compression Performance
• iiBench benchmark
![Page 29: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/29.jpg)
Compression Achieved
• log data (extremely compressible)
![Page 30: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/30.jpg)
TokuDB Agility
![Page 31: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/31.jpg)
The Challenge of MySQL Schema Changes
• Common schema changes can take hours in MySQL
  – Adding, dropping, or expanding a column
  – Adding an index
• And the table is unavailable for writes during the process
• As a workaround, people generally
  – Use a replication slave, then swap with master
  – Use helper tools: Percona OSC, MySQL 5.6
    o These have IO, CPU, RAM consequences
![Page 32: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/32.jpg)
Schema Changes Without Downtime
• In TokuDB, column add/drop/expand is instantaneous
  – “it’s just a message”
• Indexes can be created in the background while the table is fully available
  – TokuDB just builds the index; it does not rebuild the table (MySQL getting better)
![Page 33: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/33.jpg)
TokuMX
Fractal Tree Indexing + MongoDB
![Page 34: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/34.jpg)
What is TokuMX?
• TokuMX = MongoDB with improved storage (Fractal Tree indexes)
• Drop in replacement for MongoDB v2.2 applications
  – Including replication and sharding
  – Same data model
  – Same query language
  – Drivers just work
• Open Source
  – http://github.com/Tokutek/mongo
Performance + Compression + Transactions
![Page 35: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/35.jpg)
MongoDB Storage
PK index (_id + pointer)
18
4 5555
(1,ptr5) (4,ptr1),(12,ptr8) (19,ptr7) (10000,ptr2)
Secondary index (foo + pointer)
85
40 120
(2,ptr5), (22,ptr6) (50,ptr4) (100,ptr7) (222,ptr3)
memory mapped heap
db.test.insert({foo:55})
db.test.ensureIndex({foo:1})
The “pointer” tells MongoDB where to look in the heap for the requested document (another IO)
![Page 36: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/36.jpg)
TokuMX Storage
PK index (_id + document)
18
4 5555
(1,doc) (4,doc),(12,doc) (19,doc) (10000,doc)
Secondary index (foo + _id)
85
40 120
(2,4), (22,12) (50,19) (100,10000) (222,1)
db.test.insert({foo:55})
db.test.ensureIndex({foo:1})
memory mapped heap
One less IO per _id lookup, document is clustered in the index
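The difference between the two layouts can be sketched by counting store reads. This is a rough model, not either product's storage code: `CountingStore` and the three lookup functions are hypothetical names, each read stands in for one potential IO, and the cost of walking the index tree itself is ignored.

```python
class CountingStore:
    """A key/value store that counts reads; one read ~ one potential IO."""

    def __init__(self, entries):
        self.entries = dict(entries)
        self.reads = 0

    def get(self, key):
        self.reads += 1
        return self.entries[key]


def mongodb_find_by_id(pk_index, heap, doc_id):
    pointer = pk_index.get(doc_id)  # read 1: index entry holds a pointer
    return heap.get(pointer)        # read 2: follow the pointer into the heap


def tokumx_find_by_id(pk_index, doc_id):
    return pk_index.get(doc_id)     # read 1: the document is in the index


def tokumx_find_by_foo(secondary, pk_index, foo):
    doc_id = secondary.get(foo)     # read 1: secondary index holds the _id
    return pk_index.get(doc_id)     # read 2: fetch the document by _id


docs = {1: {"_id": 1, "foo": 222}, 4: {"_id": 4, "foo": 2}}

mongo_pk = CountingStore({1: "ptr5", 4: "ptr1"})
mongo_heap = CountingStore({"ptr5": docs[1], "ptr1": docs[4]})

toku_pk = CountingStore(docs)
toku_secondary = CountingStore({222: 1, 2: 4})
```

Looking up `_id` 4 costs two reads in the pointer layout and one in the clustered layout; a lookup through the non-clustered secondary index still costs two reads in this model.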
![Page 37: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/37.jpg)
TokuMX Performance
![Page 38: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/38.jpg)
Performance - Indexed Insertion
• 100 million inserts into a collection with 3 secondary indexes
![Page 39: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/39.jpg)
Performance - Inserts on Indexed Arrays
• Indexed Insertion : Multikey (100 inserts per doc)
![Page 40: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/40.jpg)
Performance - Replication
• TokuMX replication allows secondary servers to process replication without IO
  – Simply injecting messages into the Fractal Tree Indexes on the secondary server
  – The “Hard Work” was done on the primary
    o Uniqueness checking
    o Transactional locking
    o Update effort (read-before-write)
  – Elimination of replication lag
• Your secondaries are fully available for read scaling!
  – Wasn’t that the point?
![Page 41: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/41.jpg)
Performance - Lock Refinement
• TokuMX performs locking at the document level
  – Extreme concurrency!
[Diagram: lock hierarchy of instance > database > collection > document. MongoDB v2.0 locks the instance, MongoDB v2.2 locks the database, TokuMX locks the document.]
![Page 42: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/42.jpg)
Performance - Lock Refinement
![Page 43: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/43.jpg)
Performance - Lock Refinement + Reduced IO
• Sysbench benchmark (> RAM)
![Page 44: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/44.jpg)
Performance - Reduced IO
• Indexed insertion benchmark
![Page 45: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/45.jpg)
Performance - Clustered Indexes
• Clustered secondary indexes
  – Additional copy of the document is stored in the index
  – No additional IO to get document data from primary key
  – Think better covered index (all non-indexed fields)
  – Good for point queries, great for range scans
  – Compression eliminates size concerns
![Page 46: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/46.jpg)
Performance - Memory Management
• Two approaches to memory management
  – MongoDB = memory-mapped files
    o Operating system determines what data is important
  – TokuMX = managed cache
    o User defined size
    o TokuMX determines what data is important
• Run multiple TokuMX instances on a single server
  – Each has its own fixed cache size
![Page 47: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/47.jpg)
TokuMX Compression
![Page 48: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/48.jpg)
Compression
• MongoDB does not offer compression
  – Compressed file systems?
  – Shortened field names?
    o Remember: each field name is stored in every single document
• TokuMX easily achieves 5x-10x compression
  – Buy less disk or flash
  – Compressed reads and writes reduce overall IO
• TokuMX supports 3 compression types
  – zlib, quicklz, lzma (size vs. speed)
  – all data is compressed
• Use descriptive field names!
  – They are easy to compress
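The claim that descriptive field names are cheap once compressed can be checked with Python's standard `zlib` and `lzma` modules (quicklz is not in the standard library, so it is left out). This is only a sketch on synthetic data: the field names, the JSON-lines encoding, and the `encode_docs` and `ratio` helpers are assumptions, and real documents stored by TokuMX will compress differently.

```python
import json
import lzma
import zlib


def encode_docs(field_names, count=1000):
    """Serialize `count` small documents as JSON lines (field names repeat in each)."""
    docs = ({name: i for name in field_names} for i in range(count))
    return "\n".join(json.dumps(doc) for doc in docs).encode("utf-8")


def ratio(raw, compress):
    """Compression ratio: uncompressed size divided by compressed size."""
    return len(raw) / len(compress(raw))


short = encode_docs(["a", "b", "c"])
descriptive = encode_docs(
    ["customer_identifier", "order_total_cents", "shipping_region_code"]
)

for label, raw in (("short", short), ("descriptive", descriptive)):
    print(
        label,
        "raw bytes:", len(raw),
        "zlib ratio:", round(ratio(raw, zlib.compress), 1),
        "lzma ratio:", round(ratio(raw, lzma.compress), 1),
    )
```

Because the long names repeat in every document, the compressor replaces them with short back-references, so the descriptive data set shows the higher compression ratio even though its raw size is several times larger.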
![Page 49: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/49.jpg)
Compression
• 31 million documents, BitTorrent peer data
  – http://cs.brown.edu/~pavlo/torrent/
![Page 50: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/50.jpg)
TokuMX Transactions
![Page 51: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/51.jpg)
ACID + MVCC
• ACID
  – In MongoDB, multi-insertion operations allow for partial success
    o Asked to store 5 documents, 3 succeeded
  – We offer “all or nothing” behavior
  – Document level locking
• MVCC
  – In MongoDB, queries can be interrupted by writers.
    o The effects of these writers are visible to the reader
  – TokuMX offers MVCC
    o Reads are consistent as of the operation start
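"Reads are consistent as of the operation start" can be illustrated with a minimal multi-version store. This is a generic MVCC sketch, not how TokuMX implements it; the `VersionedStore` class, its logical clock, and the method names are assumptions made for illustration.

```python
class VersionedStore:
    """Keeps every committed version of a key, tagged with a commit timestamp."""

    def __init__(self):
        self.clock = 0
        self.versions = {}  # key -> list of (commit_ts, value), oldest first

    def write(self, key, value):
        # Each write commits at the next tick of a logical clock.
        self.clock += 1
        self.versions.setdefault(key, []).append((self.clock, value))

    def snapshot(self):
        # A reader remembers the clock value at the start of its operation.
        return self.clock

    def read(self, key, snapshot_ts):
        # Return the newest version committed at or before the snapshot.
        visible = [v for ts, v in self.versions.get(key, []) if ts <= snapshot_ts]
        return visible[-1] if visible else None


store = VersionedStore()
store.write("a", 1)
reader_snapshot = store.snapshot()  # a long-running query starts here
store.write("a", 2)                 # a writer commits while the query runs
store.write("b", 9)
```

A read through `reader_snapshot` still returns the old value of "a" and does not see "b" at all, while a reader that takes a fresh snapshot sees both writes.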
![Page 52: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/52.jpg)
Multi-statement Transactions
• TokuMX brings the following to MongoDB
  – db.runCommand({"beginTransaction": 1, "isolation": "mvcc"})
  – ... perform 1 or more operations
  – db.runCommand("rollbackTransaction") | db.runCommand("commitTransaction")
• Not allowed in sharded environments
  – mongos will reject
![Page 53: Fractal Tree Indexes : From Theory to Practice](https://reader034.vdocuments.net/reader034/viewer/2022052301/554f81e7b4c905435d8b4a31/html5/thumbnails/53.jpg)
Tim Callaghan
VP/Engineering, Tokutek
[email protected]
@tmcallaghan
Questions?