Percona FT / TokuDB
![Page 1: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/1.jpg)
Vadim Tkachenko
Percona
April ’16
Percona Fractal Tree / TokuDB
![Page 2: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/2.jpg)
Agenda
Why new data structure
Fractal Tree & LSM tree
Internals of Fractal Tree
When it is useful
How to use it
![Page 3: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/3.jpg)
Why new data structure
![Page 4: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/4.jpg)
Before it was B-Tree
• “Traditional” data structure
• In use since the 1970s
![Page 5: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/5.jpg)
Before there was B-Tree
![Page 6: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/6.jpg)
When B-Tree is good
• When the data size doesn’t exceed memory limits
• When the application mostly performs read (SELECT) operations, or when read performance is more important than write performance
![Page 7: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/7.jpg)
When B-Tree is not good
• As soon as the data size exceeds available memory, performance drops rapidly
• Choosing flash-based storage helps performance, but only to a certain extent; in the long run, memory limits still cause performance to suffer
![Page 8: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/8.jpg)
To summarize
• B-tree was designed to provide optimal data retrieval performance, but not data updates (insert, delete, update)
• This shortcoming created a need for data structures that provide better performance for data storage.
![Page 9: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/9.jpg)
Cases when B-Tree is not optimal
• accepting and storing event logs
• storing measurements from a high-frequency sensor
• tracking user clicks, and so on
• For such cases, two new data structures were created: log structured merge (LSM) trees and Fractal Trees®.
![Page 10: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/10.jpg)
LSM & Fractal Tree
![Page 11: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/11.jpg)
LSM tree & Fractal tree
• Shift balance from optimal reads toward faster writes
![Page 12: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/12.jpg)
Fractal Trees
![Page 13: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/13.jpg)
Fractal Trees
• Invented ~2007
• Tokutek and TokuDB as commercial engine
• 2015 – part of Percona
![Page 14: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/14.jpg)
Fractal Tree
• Delay writes (send messages)
• Combine multiple delayed writes into a single IO
• => SELECTs have more work to do
• They must walk through all pending messages
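The trade-off above can be illustrated with a toy model. This is only a sketch and not how PerconaFT is implemented: the class `BufferedTree`, its `flush_threshold` parameter, and the `write_ios` counter are hypothetical names. They exist only to show writes being batched while reads have to consult the pending messages.

```python
class BufferedTree:
    """Toy model: writes are queued as messages and applied in batches."""

    def __init__(self, flush_threshold=4):
        self.leaf = {}            # data already applied "on disk"
        self.messages = []        # delayed writes, oldest first
        self.flush_threshold = flush_threshold  # hypothetical knob
        self.write_ios = 0        # number of simulated leaf writes

    def insert(self, key, value):
        # A write only appends a message; no leaf I/O happens yet.
        self.messages.append((key, value))
        if len(self.messages) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Many delayed writes are applied with one simulated I/O.
        for key, value in self.messages:
            self.leaf[key] = value
        self.messages.clear()
        self.write_ios += 1

    def select(self, key):
        # A read must walk pending messages (newest wins) before the leaf.
        for k, v in reversed(self.messages):
            if k == key:
                return v
        return self.leaf.get(key)


tree = BufferedTree(flush_threshold=4)
for i in range(10):
    tree.insert(i, i * i)

print(tree.write_ios)       # simulated leaf writes performed for 10 inserts
print(len(tree.messages))   # messages still waiting in the buffer
print(tree.select(9))       # answered from the message buffer, not the leaf
```

Ten inserts cost far fewer leaf writes than ten, but `select` has to scan the buffer first, which is the cost the slide points at.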
![Page 15: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/15.jpg)
![Page 16: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/16.jpg)
Fractal tree benefits
• Tables that have a lot of indexes (preferably non-unique indexes)
• Heavy write workload into the tables
• Systems with slow storage
• Saving space when the storage is fast but expensive
![Page 17: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/17.jpg)
From idea to reality
• Need concurrency-control mechanisms
• Need crash safety
• Need transactions, logging + recovery
• Need to support multithreading
• Need to integrate with the MySQL API layer
• Not everything is perfect yet
![Page 18: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/18.jpg)
Fractal Tree Internals
![Page 19: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/19.jpg)
On MySQL Level:
CREATE TABLE metrics (
  ts timestamp,
  device_id int,
  metric_id int,
  cnt int,
  val double,
  PRIMARY KEY (ts, device_id, metric_id),
  KEY metric_id (metric_id, ts),
  KEY device_id (device_id, ts)
)
![Page 20: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/20.jpg)
Internally 3 trees
• Primary Key (ts, device_id, metric_id) => data
• Key (metric_id, ts) => PK (ts, device_id, metric_id)
• Key (device_id, ts) => PK (ts, device_id, metric_id)
• Notice – a long PK adds overhead
![Page 21: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/21.jpg)
Root Node
F – tokudb_fanout (default 16)
tokudb_block_size (default 4MiB)
![Page 22: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/22.jpg)
Basement node (leaf)
![Page 23: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/23.jpg)
• tokudb_read_block_size (default 64KiB)
• Chunk used for compression/decompression
• Smaller size is better for point lookups
![Page 24: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/24.jpg)
Shape your tree (settings per TABLE)
• tokudb_block_size (default 4MiB)
  • size of a node in memory (on disk it will be compressed)
• tokudb_read_block_size (default 64KiB)
  • size of a basement node: the minimal read block size, and also the block size for compression
  • Balance: a smaller tokudb_read_block_size is better for point reads, but leads to more random IO
• tokudb_fanout (default 16)
  • defines the maximum number of pivots per non-leaf node (number of pivots = tokudb_fanout - 1)
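The relationship between these settings can be worked out with simple arithmetic. The sketch below is only illustrative: `tree_shape` is a hypothetical helper, and treating `tokudb_block_size / tokudb_read_block_size` as the approximate number of basement nodes per leaf is an assumption drawn from the definitions above, not a documented formula.

```python
MIB = 1024 * 1024
KIB = 1024


def tree_shape(block_size, read_block_size, fanout):
    """Derive rough per-node figures from the three per-table settings."""
    return {
        # Assumption: an uncompressed leaf holds about this many basement nodes.
        "basement_nodes_per_leaf": block_size // read_block_size,
        # From the slide: number of pivots = tokudb_fanout - 1.
        "max_pivots": fanout - 1,
    }


defaults = tree_shape(4 * MIB, 64 * KIB, 16)
print(defaults)   # {'basement_nodes_per_leaf': 64, 'max_pivots': 15}

# A smaller node and a smaller basement node, e.g. for point reads on SSD.
point_read = tree_shape(1 * MIB, 16 * KIB, 16)
print(point_read)
```

Shrinking both sizes by the same factor keeps the number of basement nodes per leaf unchanged while reducing how much data a point read has to decompress.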
![Page 25: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/25.jpg)
Recommendations
tokudb_block_size:
4MiB block size is good for spinning disk.
For SSD, a smaller block size might be beneficial; I often use 1MiB.
In reality, 64-128KiB should be even better, but TokuDB does not handle these sizes properly (performance bug: linear search for a free block in fragmented storage)
![Page 26: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/26.jpg)
Recommendations
tokudb_read_block_size:
Recommended to set 16KiB if you expect point queries (again, too bad this setting is per-table, not per-index)
![Page 27: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/27.jpg)
How to see the shape of the tree
tokuftdump --summary
![Page 28: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/28.jpg)
![Page 29: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/29.jpg)
tokuftdump --summary
leaf nodes: 6797
non-leaf nodes: 97
Leaf size: 4,278,632,448
Total size: 4,286,052,352
Total uncompressed size: 6,231,518,882
Messages count: 70155
Messages size: 10,535,155
Records count: 30000000
Tree height: 2
height: 0, nodes count: 6797; avg children/node: 59.364131
    basement nodes: 403498; msg size: 0; disksize: 4,278,632,448; uncompressed size: 6,220,381,082; ratio: 1.453825
height: 1, nodes count: 96; avg children/node: 70.802083
    msg cnt: 65001; msg size: 9,756,907; disksize: 6,907,904; uncompressed size: 10,334,469; ratio: 1.496035
height: 2, nodes count: 1; avg children/node: 96.000000
    msg cnt: 5154; msg size: 778,248; disksize: 512,000; uncompressed size: 803,331; ratio: 1.569006
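As a quick sanity check on how to read this summary, the derived figures can be recomputed from the raw counters. The numbers below are copied from the output above; the dictionary layout and variable names are just a convenient assumption for this sketch, not a tokuftdump format.

```python
# Figures copied from the tokuftdump --summary output above.
levels = {
    0: {"nodes": 6797, "disksize": 4_278_632_448, "uncompressed": 6_220_381_082},
    1: {"nodes": 96, "disksize": 6_907_904, "uncompressed": 10_334_469},
    2: {"nodes": 1, "disksize": 512_000, "uncompressed": 803_331},
}
basement_nodes = 403_498
records = 30_000_000

# "ratio" in the dump is uncompressed size divided by on-disk size.
for height, lv in levels.items():
    ratio = lv["uncompressed"] / lv["disksize"]
    print(f"height {height}: compression ratio {ratio:.6f}")

# "avg children/node" at height 0 is basement nodes per leaf node.
print(f"basement nodes per leaf: {basement_nodes / levels[0]['nodes']:.2f}")

# The per-level sizes add up to the totals reported at the top.
total_disk = sum(lv["disksize"] for lv in levels.values())
print(f"total on-disk size: {total_disk:,}")
print(f"on-disk bytes per record: {levels[0]['disksize'] / records:.1f}")
```

Almost all of the space sits in the leaf level, and the non-leaf levels hold the buffered messages (70155 in this dump).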
![Page 30: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/30.jpg)
FT properties
• “Delay writes” for as long as possible =>
  • writes are amortized into a single big write instead of N random writes
  • May result in a serious liability: a huge number of messages not merged into leaf nodes
  • SELECT will require traversing all messages
  • Especially bad for point SELECT queries
• Remember: Primary Key or Unique Key constraints REQUIRE a HIDDEN POINT SELECT lookup
  • UNIQUE KEY - Performance Killer for TokuDB
  • non-sequential PRIMARY KEY - Performance Killer for TokuDB
![Page 31: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/31.jpg)
Implication of slow selects
• Unique keys – background checks – implicit reads
• Foreign Keys – background checks (not supported in TokuDB)
• Select by index – requires two lookups
![Page 32: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/32.jpg)
Covering indexes
• SELECT user_name FROM users WHERE user_email=’[email protected]’
• Instead of INDEX (user_email) =>
• INDEX (user_email, user_name)
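Why the wider index helps can be shown with plain dictionaries standing in for the trees. This is a minimal sketch under stated assumptions: the rows, the `user_id` primary key, and the example.com addresses are made up, and a real index is an ordered tree rather than a hash map.

```python
# Hypothetical rows keyed by primary key (user_id); all data is made up.
primary = {
    1: {"user_email": "a@example.com", "user_name": "alice"},
    2: {"user_email": "b@example.com", "user_name": "bob"},
}

# INDEX (user_email): the secondary key maps only to the primary key.
idx_email = {row["user_email"]: pk for pk, row in primary.items()}

# INDEX (user_email, user_name): the index entry itself carries user_name.
idx_email_name = {
    row["user_email"]: (row["user_name"], pk) for pk, row in primary.items()
}


def name_via_plain_index(email):
    pk = idx_email[email]              # lookup 1: secondary index tree
    name = primary[pk]["user_name"]    # lookup 2: primary key tree
    return name, 2


def name_via_covering_index(email):
    name, _pk = idx_email_name[email]  # single lookup, no PK tree visit
    return name, 1


print(name_via_plain_index("b@example.com"))     # name plus lookups used
print(name_via_covering_index("b@example.com"))
```

Both calls return the same name; the covering variant simply never touches the primary key tree, which matters more when each lookup may have to walk pending messages.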
![Page 33: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/33.jpg)
When to use Fractal Tree?
• Table with many indexes (better if not UNIQUE) and intensive writes into this table
• Slow storage
• Saving space on fast, expensive storage
• Less write amplification (good for SSD health)
• Cloud instances are often a good fit: storage is either slow, or expensive when fast
![Page 34: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/34.jpg)
Benchmarks
![Page 35: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/35.jpg)
![Page 36: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/36.jpg)
![Page 37: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/37.jpg)
Stories on PerconaFT internals
![Page 38: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/38.jpg)
Eviction
• Algorithm to maintain cached nodes within limit
![Page 39: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/39.jpg)
Eviction
• tokudb_cache_size - amount of memory TokuDB allocates for nodes in memory.
• TokuDB’s term is “CACHETABLE”; see the status variables:
  • show global status like '%CACHETABLE%';
• Eviction - background process to keep memory consumption <= tokudb_cache_size.
  • It kicks in only when size_of(nodes_in_memory) > tokudb_cache_size
• TokuDB may use more memory than tokudb_cache_size
  • User threads will be stopped if used memory > tokudb_cache_size*1.2
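The two thresholds can be put side by side in a few lines. This is only a sketch of the rule as stated on the slide: `cache_state` is a hypothetical helper and the 8 GiB cache size is an example value, not a recommendation.

```python
GIB = 1024 ** 3


def cache_state(used_bytes, cache_size_bytes):
    """Classify memory pressure using the two thresholds from the slide."""
    if used_bytes > cache_size_bytes * 1.2:
        return "user threads stalled"
    if used_bytes > cache_size_bytes:
        return "eviction running"
    return "below limit"


cache_size = 8 * GIB  # example value for tokudb_cache_size
for used_gib in (6, 9, 10):
    print(used_gib, "GiB ->", cache_state(used_gib * GIB, cache_size))
```

With an 8 GiB cache, eviction runs between 8 and 9.6 GiB of used memory, and user threads are stopped above 9.6 GiB.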
![Page 40: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/40.jpg)
Eviction algorithm
CACHETABLE uses the GCLOCK algorithm (not LRU) to manage nodes in memory.
Eviction algorithm in simple steps:
• If size_of(nodes_in_memory) > tokudb_cache_size
  • Find a victim to remove from memory
  • The node with the smallest access_count is removed (evicted)
  • If the node is DIRTY, it is sent to a background process to be written to disk
  • Tokudb_CACHETABLE_SIZE_WRITING - size of nodes in the background write queue
• Potential memory consumption is tokudb_cache_size + Tokudb_CACHETABLE_SIZE_WRITING
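For readers unfamiliar with GCLOCK, here is a textbook-style sketch of a generalized CLOCK cache. It is not PerconaFT's code: the class `GClockCache`, counting capacity in nodes instead of bytes, and the plain list used as the write queue are all simplifying assumptions.

```python
class GClockCache:
    """Toy generalized-CLOCK cache; capacity is counted in nodes, not bytes."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.nodes = []          # circular list of [key, access_count, dirty]
        self.hand = 0            # clock hand position
        self.write_queue = []    # dirty victims handed to a background writer

    def _find(self, key):
        for node in self.nodes:
            if node[0] == key:
                return node
        return None

    def access(self, key, dirty=False):
        node = self._find(key)
        if node is not None:
            node[1] += 1                   # a hit bumps the counter
            node[2] = node[2] or dirty
            return
        self.nodes.append([key, 1, dirty])
        while len(self.nodes) > self.capacity:
            self._evict_one()

    def _evict_one(self):
        # Sweep the hand, decrementing counters until one is found at zero.
        while True:
            self.hand %= len(self.nodes)
            node = self.nodes[self.hand]
            if node[1] == 0:
                self.nodes.pop(self.hand)
                if node[2]:
                    # Dirty victims must be written before they are dropped.
                    self.write_queue.append(node[0])
                return
            node[1] -= 1
            self.hand += 1


cache = GClockCache(capacity=3)
accesses = [("a", False), ("b", True), ("c", False),
            ("a", False), ("a", False), ("d", False)]
for key, dirty in accesses:
    cache.access(key, dirty)

print([n[0] for n in cache.nodes])  # nodes still cached
print(cache.write_queue)            # dirty nodes queued for writing
```

The frequently accessed node survives the sweep, while the dirty victim lands in the write queue, which is why real memory use can exceed the configured cache size by the size of that queue.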
![Page 41: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/41.jpg)
Partial eviction
• For non-leaf, non-dirty nodes the evictor may choose to perform partial eviction
• 2 stages of partial eviction:
  • Compress a part of the node
  • If still not used, remove from memory
• Variables to control this:
  • tokudb_enable_partial_eviction
  • tokudb_compress_buffers_before_eviction
![Page 42: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/42.jpg)
Partial eviction
![Page 43: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/43.jpg)
![Page 44: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/44.jpg)
TokuDB Compression
• Only non-compressed data is stored in memory (except for partially compressed parts of non-leaf nodes).
• It seems beneficial to use the OS cache as a secondary cache for compressed nodes. For this:
  • tokudb_directio=OFF
  • Use cgroups to limit total memory usage by the mysqld process
![Page 45: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/45.jpg)
Checkpointing
• Checkpointing is the periodic process of getting datafiles in sync with transactional redo log files.
  • show global status like '%CHECKPOINT%';
• In TokuDB checkpointing is time-based; in InnoDB it is based on log file size.
• In InnoDB checkpointing is fuzzy. In TokuDB it starts by timer and runs until it is done.
• Checkpointing interval in TokuDB:
  • tokudb_checkpointing_period=N sec
![Page 46: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/46.jpg)
![Page 47: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/47.jpg)
![Page 48: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/48.jpg)
Checkpoint algorithm
• START CHECKPOINT;
  • begin_checkpoint; ← all transactions are stalled
  • mark all nodes in memory as PENDING;
  • end_begin_checkpoint;
• Checkpoint thread: go through all PENDING nodes; if dirty - write to disk
• User threads: if a user query hits a PENDING node, the node is CLONED and put into the background checkpoint thread pool
• By default the checkpoint thread pool size (number of threads) = CPU CORES / 4.
  • That is 4 threads on a 16-core server.
  • In a CPU-bound workload it takes 25% of CPU power from user threads!
  • Variable: tokudb_checkpoint_pool_threads=N
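The steps above can be sketched as a small single-threaded simulation. It is a rough model, not the PerconaFT implementation: the `Node` class and all function names are hypothetical, cloning only dirty nodes is an assumption of this sketch, and so is the minimum of one thread in the pool.

```python
def default_checkpoint_threads(cpu_cores):
    # Slide: default pool size = CPU cores / 4.
    # Assumption: never fewer than one thread.
    return max(1, cpu_cores // 4)


class Node:
    def __init__(self, name, dirty):
        self.name, self.dirty, self.pending = name, dirty, False


def begin_checkpoint(nodes):
    # Runs while transactions are stalled: mark every in-memory node PENDING.
    for node in nodes:
        node.pending = True


def user_write(node, clone_queue):
    # A user thread that hits a PENDING node hands a clone to the checkpoint
    # pool (only if dirty, by assumption), then modifies the live copy.
    if node.pending:
        if node.dirty:
            clone_queue.append(node.name)
        node.pending = False
    node.dirty = True


def checkpoint_thread(nodes, written):
    # Walk the remaining PENDING nodes and write the dirty ones to disk.
    for node in nodes:
        if node.pending:
            if node.dirty:
                written.append(node.name)
                node.dirty = False
            node.pending = False


nodes = [Node("n1", True), Node("n2", False), Node("n3", True)]
clone_queue, written = [], []

begin_checkpoint(nodes)
user_write(nodes[2], clone_queue)   # n3 is modified while the checkpoint runs
checkpoint_thread(nodes, written)

print(default_checkpoint_threads(16))  # 4, as on the slide
print(clone_queue, written)
```

The clone lets the user thread keep working on `n3` while the checkpoint still captures its state as of the start of the checkpoint; the work of writing those clones is what the thread pool spends CPU on.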
![Page 49: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/49.jpg)
![Page 50: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/50.jpg)
A few words on LSM
![Page 51: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/51.jpg)
LSM tree
• Older than Fractal Tree
• Google BigTable as primary driver of interest
• Cassandra
• RocksDB
• MongoRocks
• MyRocks
![Page 52: Percona FT / TokuDB](https://reader033.vdocuments.net/reader033/viewer/2022051123/587146c11a28ab55588b59d1/html5/thumbnails/52.jpg)
Instead of final summary
• Alternative data structures have their place
• Use wisely, know limitations
• A lot of work ahead