Learning Data Systems Components
![Page 1: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/1.jpg)
Tim Kraska <[email protected]> [Disclaimer: I am NOT talking on behalf of Google]
Learning Data Systems Components
Work partially done at
![Page 2: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/2.jpg)
Comments on Social Media
“Machine Learning Just Ate Algorithms In One Large Bite….” [Christopher Manning, Professor at Stanford]
HashMaps
Sorting
Joins
BloomFilter
Tree
![Page 3: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/3.jpg)
Disclaimer
HashMaps
Sorting
Joins
BloomFilter
Tree
![Page 4: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/4.jpg)
Fundamental Building Blocks
Sorting
B-Tree
Hash-Map
Scheduling
Join
Priority Queue
Bloom Filter
Caching
Range Filter
![Page 5: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/5.jpg)
Databases as an Example:
B-Trees
![Page 6: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/6.jpg)
![Page 7: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/7.jpg)
![Page 8: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/8.jpg)
![Page 9: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/9.jpg)
[Figure: key ranges B-C, C-G, G-J, K-N, N-R, Q-S, S-U, U-V, V-X, X-@]
![Page 10: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/10.jpg)
[Figure: Key lookup via an index level (A-B, B-C, C-G, …) over the key ranges B-C, C-G, G-J, K-N, N-R]
![Page 11: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/11.jpg)
[Figure: Key lookup via two index levels (A-B, B-C, C-G, …; then AA-AL, AL-AK, AK-AP, …, BA-BE, BI-BL, BL-BR, …) over the key ranges B-C, C-G, G-J, K-N, N-R]
![Page 12: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/12.jpg)
[Figure: Key lookup via a multi-level index (A-B, B-C, C-G, …; then AA-AL, AL-AK, AK-AP, …, BA-BE, BI-BL, BL-BR, …; further levels elided)]
![Page 13: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/13.jpg)
[Figure: key ranges B-C, C-G, G-J, K-N, N-R]
![Page 14: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/14.jpg)
The Librarian
![Page 15: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/15.jpg)
Harry Potter
Children's Books
Curious George
O’Reilly Books
Travel Books
Da Vinci Code
The Girl on the Train
Bill Bryson
The Source
The Gruffalo
The Gruffalo
A Day in the Life of Marlon Bundo
ML With Python
![Page 16: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/16.jpg)
Harry Potter
Children's Books
Curious George
O’Reilly Books
Travel Books
Da Vinci Code
The Girl on the Train
Bill Bryson
The Source
The Gruffalo
Make Way for Ducklings
A Day in the Life of Marlon Bundo
ML With Python
![Page 17: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/17.jpg)
[Figure: key ranges B-C, C-G, G-J, K-N, N-R]
![Page 18: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/18.jpg)
[Figure: Key lookup via a multi-level index (A-B, B-C, C-G, …; then AA-AL, AL-AK, AK-AP, …, BA-BE, BI-BL, BL-BR, …; further levels elided) over the key ranges B-C, C-G, G-J, K-N, N-R]
![Page 19: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/19.jpg)
[Figure: Key → Model → key ranges B-C, C-G, G-J, K-N, N-R]
![Page 20: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/20.jpg)
Fundamental Algorithms & Data Structures
Hash-Map
Tree
Sorting
Join
Range-Filter
Priority Queue
Scheduling
Cache Policy
Bloom-Filter
…
![Page 21: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/21.jpg)
Not convinced yet?
![Page 22: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/22.jpg)
Another Example: Index All Integers from 900 to 800M
900 901 902 903 904 905 906 907 908 909 … 800M
[Figure: multi-level tree of index nodes over the keys]
B-Tree?
![Page 23: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/23.jpg)
A More Concrete Example: Index All Integers from 900 to 800M
900 901 902 903 904 905 906 907 908 909 … 800M
data_array[lookup_key - 900]
![Page 24: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/24.jpg)
Goal: Index All Integers from 900 to 800M
900 901 902 903 904 905 906 907 908 909 … 800M
Index All Even Integers from 900 to 800M
900 902 904 906 908 910 912 914 916 918 … 800M
data_array[(lookup_key - 900) / 2]
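The two lookup formulas on these slides can be sketched in Python as follows. This is a minimal illustration, not the speaker's code: the arrays are tiny stand-ins for the 900 to 800M range, and the names `BASE`, `dense`, `even`, `lookup_dense`, and `lookup_even` are assumptions introduced here.

```python
BASE = 900  # smallest key, as on the slide

# Tiny stand-ins for the slide's key range; the payload is just the key itself.
dense = list(range(BASE, BASE + 20))      # 900, 901, 902, ...
even = list(range(BASE, BASE + 40, 2))    # 900, 902, 904, ...


def lookup_dense(data_array, lookup_key):
    # The position is a pure function of the key: no tree traversal is needed.
    return data_array[lookup_key - BASE]


def lookup_even(data_array, lookup_key):
    # Only even keys are stored, so the offset is halved (integer division).
    return data_array[(lookup_key - BASE) // 2]
```

Both lookups cost a subtraction (plus a division for the even case) and one array access, independent of how many keys are stored.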
![Page 25: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/25.jpg)
Still holds for other data distributions
![Page 26: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/26.jpg)
Key Insight
Traditional data structures (typically) make no assumptions about the data
But knowing the data distribution might allow for significant performance gains and might even change the complexity of data structures (e.g., O(log n) → O(1) for lookups or O(n) → O(1) for storage)
![Page 27: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/27.jpg)
Building A System From Scratch For Every Use Case Is Not Economical
![Page 28: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/28.jpg)
Conceptually a B-Tree maps a key to a page
B-Tree
key
page
For simplicity assume all pages are stored contiguously in main memory
![Page 29: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/29.jpg)
Alternative View
B-Tree maps a key to a position with a fixed min/max error
For simplicity assume all pages are stored contiguously in main memory
1. B-Tree: key → pos
2. Binary search within min/max error
[Figure: key → B-Tree → position in the Sorted Array; the search range spans pos to pos + page-size]
![Page 30: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/30.jpg)
A B-Tree Is A Model
[Figure: key → Model → position in the Sorted Array; the search range spans pos to pos + page-size]
![Page 31: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/31.jpg)
A B-Tree Is A Model
Finding an item
1. Any model: key → pos estimate
2. Binary search in [pos - err_min, pos + err_max]
err_min and err_max are known from the training process
[Figure: key → Model → position in the Sorted Array; the search range spans pos to pos + page-size]
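The two-step lookup (model estimate, then binary search inside the error window) might look like the sketch below. The key set, the hand-picked linear `model`, and the helper names `train_error_bounds` and `lookup` are hypothetical; the slides only state that err_min and err_max come out of the training process.

```python
from bisect import bisect_left


def train_error_bounds(keys, model):
    """Return (err_min, err_max): how far the true position can lie
    below / above the model's estimate over the training keys."""
    err_min = err_max = 0
    for pos, key in enumerate(keys):
        diff = pos - model(key)
        err_min = max(err_min, -diff)  # true position is left of the estimate
        err_max = max(err_max, diff)   # true position is right of the estimate
    return err_min, err_max


def lookup(keys, model, err_min, err_max, key):
    """Find key in the sorted list keys; return its position or None."""
    est = model(key)
    lo = min(len(keys), max(0, est - err_min))
    hi = min(len(keys), max(lo, est + err_max + 1))
    i = bisect_left(keys, key, lo, hi)  # binary search only inside the window
    return i if i < len(keys) and keys[i] == key else None


# Hypothetical data and a hand-picked (not trained) linear model.
keys = [3, 8, 9, 15, 21, 22, 30, 41, 47, 50]
model = lambda k: int(k * len(keys) / 51)
err_min, err_max = train_error_bounds(keys, model)
```

Because the bounds are computed over every stored key, each stored key is guaranteed to fall inside its window; a key that is not stored simply fails the final equality check.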
![Page 32: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/32.jpg)
A B-Tree Is A Model
A form of a regression model
key → pos is equivalent to modeling the CDF of the (observed) key distribution:
Pos-estimate = P(X ≤ key) * #keys
[Figure: key → Model → position in the Sorted Array; the search range spans pos to pos + page-size]
![Page 33: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/33.jpg)
A B-Tree Is A Model
Pos-estimate = F(key) * #keys
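As a rough illustration of Pos-estimate = F(key) * #keys, the sketch below draws keys from an exponential distribution and uses its closed-form CDF as the "model". The distribution, the rate, the seed, and the names `cdf` and `pos_estimate` are assumptions chosen for the example, not something the slides specify.

```python
import math
import random

rng = random.Random(0)  # fixed seed for repeatability
RATE = 1.0              # assumed rate of the exponential distribution
keys = sorted(rng.expovariate(RATE) for _ in range(10_000))


def cdf(x):
    # Closed-form CDF of the assumed exponential distribution.
    return 1.0 - math.exp(-RATE * x)


def pos_estimate(key):
    # Pos-estimate = F(key) * #keys
    return cdf(key) * len(keys)


# Largest gap between estimated and actual position over all stored keys.
max_err = max(abs(pos_estimate(k) - i) for i, k in enumerate(keys))
```

When the model matches the distribution the keys were drawn from, `max_err` stays a small fraction of the array length, which is what makes the subsequent local search cheap.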
![Page 34: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/34.jpg)
B-Trees Are Regression Trees
B-Tree
Sorted Array
key
position
![Page 35: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/35.jpg)
What Does This Mean
![Page 36: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/36.jpg)
What Does This Mean
Database people were the first to do large scale machine learning :)
![Page 37: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/37.jpg)
Potential Advantages of Learned B-Tree Models
• Smaller indexes → less (main-memory) storage
• Faster Lookups?
• More parallelism → Sequential if-statements are exchanged for multiplications
• Hardware accelerators → Lower power, better $/compute….
• Cheaper inserts? → more on that later. For the moment, assume read-only
![Page 38: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/38.jpg)
A First Attempt
• 200M web-server log records, sorted by timestamp
• 2-layer NN, 32 width, ReLU activated
• Prediction task: timestamp → position within sorted array
![Page 39: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/39.jpg)
A First Attempt
Cache-Optimized B-Tree: ≈250ns
Neural network: ???
![Page 40: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/40.jpg)
A First Attempt
Cache-Optimized B-Tree: ≈250ns
Neural network: ≈80,000ns
![Page 41: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/41.jpg)
Reasons
Problem I: Tensorflow is designed for large models
Problem II: B-Trees are great for overfitting
Problem III: B-Trees are cache-efficient
Problem IV: Search does not take advantage of the prediction
![Page 42: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/42.jpg)
Problem I: The Learning Index Framework (LIF)
• An index synthesis system
• Given an index configuration generate the best possible code
• Uses ideas from Tupleware [VLDB15]
• Simple models are trained “on-the-fly”, whereas for complex models we use Tensorflow and extract weights afterwards (i.e., no Tensorflow during inference time)
• Best index configuration is found using auto-tuning (e.g., see TuPAQ [SOCC15])
![Page 43: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/43.jpg)
Problem II + III: Precision Gain per Node
Index over 100M records. Page-size: 100
[Figure: B-Tree levels over 100M records (i.e., 1M pages)]
Precision Gain: 100M → 1M (Min/Max-Error: 1M)
Precision Gain: 1M → 10k
Precision Gain: 10k → 100
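The precision-gain numbers can be reproduced with a few lines of arithmetic. The sketch assumes that every tree level narrows the candidate range by the same factor of 100, which is what the three steps on the slide amount to; `fanout` and `levels` are names introduced here.

```python
records = 100_000_000  # index over 100M records
fanout = 100           # assumed: each level narrows the range by a factor of 100

remaining = records
levels = []            # remaining min/max error after each level
while remaining > fanout:
    remaining //= fanout
    levels.append(remaining)
```

Three levels are enough to get from 100M candidate positions down to a single page of 100 records, so a learned model only competes if it delivers a comparable precision gain per unit of work.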
![Page 44: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/44.jpg)
The Last Mile Problem
[Figure: plot of Pos over Key]
![Page 45: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/45.jpg)
Solution: Recursive Model Index (RMI)
![Page 46: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/46.jpg)
What Does The Lookup Code Look Like
Model on stage 1: f0(key_type key)
Models on stage two: f1[] (e.g., the first model in the second stage is f1[0](key_type key))
Lookup Code:
pos_estimate ← f1[f0(key)](key)
pos ← exp_search(key, pos_estimate, data);
Number of operations with linear regression models:
offset ← a + b * key
weights2 ← weights_stage2[offset]
pos_estimate ← weights2.a + weights2.b * key
pos ← exp_search(key, pos_estimate, data)
2x multiplies, 2x additions, 1x array-lookup
![Page 47: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/47.jpg)
What Does The Lookup Code Look Like
Model on stage 1: f0(key_type key)
Models on stage two: f1[] (e.g., the first model in the second stage is f1[0](key_type key))
Lookup Code for a 2-stage RMI:
pos_estimate ← f1[f0(key)](key)
pos ← exp_search(key, pos_estimate, data);
Operations with a 2-stage RMI with linear regression models:
offset ← a + b * key
weights2 ← weights_stage2[offset]
pos_estimate ← weights2.a + weights2.b * key
pos ← exp_search(key, pos_estimate, data)
2x multiplies, 2x additions, 1x array-lookup
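A runnable sketch of a 2-stage RMI with linear regression models is given below. It is an illustration under assumptions, not the generated code of the Learning Index Framework: the training procedure (`fit_linear`, `build_rmi`), the clamping, and the concrete `exp_search` implementation are choices made here, and only the lookup shape (two multiply-adds plus one array lookup, followed by `exp_search`) is taken from the slide.

```python
from bisect import bisect_left


def fit_linear(xs, ys):
    """Least-squares fit ys ~ a + b * xs; returns (a, b)."""
    n = len(xs)
    if n == 0:
        return 0.0, 0.0
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    if var == 0:
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    return my - b * mx, b


def clamp(v, lo, hi):
    return max(lo, min(hi, v))


def build_rmi(keys, num_stage2):
    """Train one stage-1 model and num_stage2 stage-2 models on sorted keys."""
    n = len(keys)
    # Stage 1 predicts the stage-2 model index (position scaled to model count).
    a0, b0 = fit_linear(keys, [p * num_stage2 / n for p in range(n)])
    buckets = [([], []) for _ in range(num_stage2)]
    for p, k in enumerate(keys):
        j = clamp(int(a0 + b0 * k), 0, num_stage2 - 1)
        buckets[j][0].append(k)
        buckets[j][1].append(p)
    return (a0, b0), [fit_linear(ks, ps) for ks, ps in buckets]


def rmi_estimate(rmi, key, n):
    (a0, b0), stage2 = rmi
    offset = clamp(int(a0 + b0 * key), 0, len(stage2) - 1)  # 1 multiply, 1 add
    a, b = stage2[offset]                                   # 1 array lookup
    return clamp(int(a + b * key), 0, max(n - 1, 0))        # 1 multiply, 1 add


def exp_search(key, pos_estimate, data):
    """Leftmost index i with data[i] >= key, searching outward from the estimate."""
    n = len(data)
    if n == 0:
        return 0
    pos = clamp(pos_estimate, 0, n - 1)
    step = 1
    if data[pos] < key:
        lo, hi = pos, pos + step
        while hi < n and data[hi] < key:   # double the step to the right
            lo, step = hi, step * 2
            hi = pos + step
        return bisect_left(data, key, lo + 1, min(hi, n))
    hi, lo = pos, pos - step
    while lo >= 0 and data[lo] >= key:     # double the step to the left
        hi, step = lo, step * 2
        lo = pos - step
    return bisect_left(data, key, max(lo + 1, 0), hi)


def lookup(rmi, data, key):
    i = exp_search(key, rmi_estimate(rmi, key, len(data)), data)
    return i if i < len(data) and data[i] == key else None
```

For example, `rmi = build_rmi(keys, 10)` followed by `lookup(rmi, keys, some_key)` returns the position of `some_key` or `None`. The search step makes the result correct regardless of model quality; a better model only shortens the search.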
![Page 48: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/48.jpg)
Hybrid RMI
Worst-case performance is that of a B-Tree
![Page 49: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/49.jpg)
Problem IV: Min-/Max-Error vs Average Error
[Figure: position axis from 0 to N marking the Actual Position, the Predicted Position, and the Min-Model Error and Max-Model Error bounds]
![Page 50: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/50.jpg)
Binary Search
[Figure: position axis from 0 to N with the Actual Position and Predicted Position; search markers Left, Middle, Right]
![Page 51: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/51.jpg)
![Page 52: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/52.jpg)
Binary Search
[Figure: position axis from 0 to N with the Actual Position and Predicted Position; search markers Left, Middle, Right]
![Page 53: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/53.jpg)
Quaternary Search
[Figure: position axis from 0 to N with the Actual Position and Predicted Position; search markers Left, Q2, Right]
![Page 54: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/54.jpg)
Quaternary Search
[Figure: position axis from 0 to N with the Actual Position and Predicted Position; search markers Left, Q2, Right]
Q1: Prediction - 2x std err
Q3: Prediction + 2x std err
![Page 55: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/55.jpg)
Quaternary Search
[Figure: position axis from 0 to N with the Actual Position and Predicted Position; search markers Left, Q1, Q2, Q3, Right]
![Page 56: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/56.jpg)
Exponential Search
[Figure: position axis from 0 to N with the Actual Position and Predicted Position]
![Page 57: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/57.jpg)
Does it have to be
![Page 58: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/58.jpg)
Does It Work?
| Type | Config | Lookup time | Speedup vs. BTree | Size (MB) | Size vs. BTree |
|---|---|---|---|---|---|
| BTree | page size: 128 | 260 ns | 1.0X | 12.98 | 1.0X |
| Learned index | 2nd stage size: 10000 | 222 ns | 1.17X | 0.15 | 0.01X |
| Learned index | 2nd stage size: 50000 | 162 ns | 1.60X | 0.76 | 0.05X |
| Learned index | 2nd stage size: 100000 | 144 ns | 1.67X | 1.53 | 0.12X |
| Learned index | 2nd stage size: 200000 | 126 ns | 2.06X | 3.05 | 0.23X |

60% faster at 1/20th the space, or 17% faster at 1/100th the space
200M records of map data (e.g., restaurant locations), index on longitude
Intel-E5 CPU with 32GB RAM, without GPU/TPUs
No special SIMD optimization (there is a lot of potential)
![Page 59: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/59.jpg)
![Page 60: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/60.jpg)
You Might Have Seen Certain Blog Posts
![Page 61: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/61.jpg)
[Figure: Size (MB; ticks 0.5, 4, 32, 256) plotted against Lookup-Time (ns; ticks 0 to 350) for FAST, Lookup Table, Fixed-Size Read-Optimized B-Tree w/ interpolation search, and Learned Index; both axes are annotated from Better to Worse]
![Page 62: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/62.jpg)
My Own Comparison
![Page 63: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/63.jpg)
A Comparison To ARTful Indexes (Radix-Tree)
Experimental setup:
• Dense: continuous keys from 0 to 256M
• Sparse: 256M keys where each bit is equally likely 0 or 1.
Viktor Leis, Alfons Kemper, Thomas Neumann: The Adaptive Radix Tree: ARTful Indexing for Main-Memory Databases. ICDE 2013
![Page 64: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/64.jpg)
A Comparison To ARTful Indexes (Radix-Tree)
Experimental setup: continuous keys from 0 to 256M
Reported lookup throughput: 10M/s ≈ 100ns (1)
Size: not measured, but paper says overhead of ≈8 Bytes per key (dense, best case): 256M * 8 Byte ≈ 1953MB
(1) Numbers from the paper
Viktor Leis, Alfons Kemper, Thomas Neumann: The Adaptive Radix Tree: ARTful Indexing for Main-Memory Databases. ICDE 2013
![Page 65: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/65.jpg)
Learned Index
Generate Code:
Record lookup(key) {
  return data[0 + 1 * key];
}
![Page 66: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/66.jpg)
Learned Index
Generate Code:
Record lookup(key) {
  return data[key];
}
![Page 67: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/67.jpg)
Learned Index
Generate Code:
Record lookup(key) {
  return data[key];
}
Lookup Latency: 10ns (learned index) vs 100ns* (ARTful), or one order of magnitude better
Space: 0MB vs 1953MB. Infinitely better :)
![Page 68: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/68.jpg)
?
![Page 69: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/69.jpg)
What about Updates and Inserts?
![Page 70: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/70.jpg)
What about Updates and Inserts?
Alex Galakatos, Michael Markovitch, Carsten Binnig, Rodrigo Fonseca, Tim Kraska: A-Tree: A Bounded Approximate Index Structure
https://arxiv.org/abs/1801.10207
![Page 71: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/71.jpg)
The Simple Approach: Delta Indexing
updates
Training a simple Multi-Variate Regression Model Can be done in one pass over the data
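To make the one-pass claim concrete, the sketch below fits a linear model from running sums collected in a single scan. It is simplified to one feature (the key itself) rather than the multi-variate model named on the slide, and the function name `fit_one_pass` is an assumption.

```python
def fit_one_pass(pairs):
    """Fit pos ~ a + b * key with a single pass over (key, pos) pairs."""
    n = 0
    sum_x = sum_y = sum_xx = sum_xy = 0.0
    for x, y in pairs:  # one pass: only five running sums are kept
        n += 1
        sum_x += x
        sum_y += y
        sum_xx += x * x
        sum_xy += x * y
    if n == 0:
        return 0.0, 0.0
    denom = n * sum_xx - sum_x * sum_x
    if denom == 0:      # all keys identical: fall back to a constant model
        return sum_y / n, 0.0
    b = (n * sum_xy - sum_x * sum_y) / denom
    a = (sum_y - b * sum_x) / n
    return a, b
```

On dense keys 0, 1, 2, … stored at positions 0, 1, 2, … this yields an intercept of 0 and a slope of 1. For very large keys the plain running sums can lose floating-point precision, so a production version would likely centre or rescale the keys first.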
![Page 72: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/72.jpg)
Leverage the Distribution
![Page 73: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/73.jpg)
Leverage the Distribution for Appends
[Figure: Inserts (e.g., Timestamps) plotted over Time, with the region of New Inserts marked]
If the Learned Model Can Generalize to Inserts
Insert complexity is O(1) not O(Log N)
![Page 74: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/74.jpg)
Updates/Inserts
• Less beneficial as the data still has to be stored sorted
• Idea: Leave space in the array where more updates/inserts are expected
• Can also be done with traditional trees.
• But the error of learned indexes should increase with √N per node in RMI, whereas for traditional indexes it increases with N
![Page 75: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/75.jpg)
Still at the Beginning!
• Can we provide bounds for inserts?
• When to retrain?
• How to retrain models on the fly?
• …
![Page 76: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/76.jpg)
Fundamental Algorithms & Data Structures
Hash-Map
Tree
Sorting
Join
Range-Filter
Priority Queue
Scheduling
Cache Policy
Bloom-Filter
…
![Page 77: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/77.jpg)
Fundamental Algorithms & Data Structures
Tree
Sorting
Join
Range-Filter
Priority Queue
Scheduling
Cache Policy
Bloom-Filter
Hash-Map
…
![Page 78: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/78.jpg)
Hash Map
[Figure: Key → Hash Function (traditional) compared with Key → Model (learned)]
Goal: Reduce Conflicts
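One way to read "model as hash function" is to scale the modeled CDF to the table size, slot = F(key) * M. The sketch below uses a made-up key set (perfect squares) whose CDF is known in closed form, so no training is needed; a real learned hash would plug in a trained model instead. `count_conflicts`, `model_hash`, and `standard_hash` are names introduced for this example.

```python
import math


def count_conflicts(keys, slot_of, num_slots):
    """Number of keys that land in an already occupied slot."""
    used = [False] * num_slots
    conflicts = 0
    for k in keys:
        s = slot_of(k)
        if used[s]:
            conflicts += 1
        used[s] = True
    return conflicts


N = 1000                          # number of keys
M = N                             # number of hash-map slots
keys = [i * i for i in range(N)]  # hypothetical skewed key set


def model_hash(key):
    # For this key set F(key) is roughly sqrt(key) / N, so slot = F(key) * M.
    return min(M - 1, math.isqrt(key) * M // N)


def standard_hash(key):
    # Knuth-style multiplicative hash as a stand-in for a "random" hash.
    return (key * 2654435761 % 2**32) % M


model_conflicts = count_conflicts(keys, model_hash, M)
standard_conflicts = count_conflicts(keys, standard_hash, M)
```

Because the model spreads the keys evenly over the slots in key order, `model_conflicts` is 0 here, while a hash that ignores the distribution behaves like random placement. With an imperfect trained model some conflicts remain, which matches the slide's goal of reducing rather than eliminating them.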
![Page 79: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/79.jpg)
Hash Map - Results
25% - 70% Reduction in Hash-Map Conflicts
![Page 80: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/80.jpg)
You Might Have Seen Certain Blog Posts
![Page 81: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/81.jpg)
Independent of Hash-Map Architecture
![Page 82: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/82.jpg)
Hash Map – Example Results
| Type | Time (ns) | Utilization |
|---|---|---|
| Stanford AVX Cuckoo, 4 Byte value | 31 | 99% |
| Stanford AVX Cuckoo, 20 Byte record - Standard Hash | 43 | 99% |
| Commercial Cuckoo, 20 Byte record - Standard Hash | 90 | 95% |
| In-place chained Hash-map, 20 Byte record, learned hash functions | 35 | 100% |
![Page 83: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/83.jpg)
Fundamental Algorithms & Data Structures
Hash-Map
Tree
Sorting
Join
Range-Filter
Priority Queue
Scheduling
Cache Policy
Bloom-Filter
…
![Page 84: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/84.jpg)
![Page 85: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/85.jpg)
Bloom Filter - Approach 1
[Figure: flow diagram asking "Is This Key In My Set?"; a Model answers Maybe Yes or No, and a No is passed on to a second "Is This Key In My Set?" check that answers Maybe Yes or No]
36% Space Improvement over Bloom Filter at Same False Positive Rate
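A minimal sketch of this first approach: a model answers first, and every key the model rejects is stored in a small conventional Bloom filter so that no stored key is ever reported as absent. The `BloomFilter` class, the SHA-256 based hashing, the sizes, and the stand-in `model` (a hard-coded rule rather than a trained classifier) are all assumptions made for the illustration.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        return all(self.bits[p] for p in self._positions(key))


class LearnedBloomFilter:
    def __init__(self, keys, model, num_bits, num_hashes):
        self.model = model
        self.overflow = BloomFilter(num_bits, num_hashes)
        for k in keys:
            if not model(k):  # the model's false negatives go to the overflow filter
                self.overflow.add(k)

    def might_contain(self, key):
        # "Maybe yes" if the model says so; otherwise ask the overflow filter.
        return bool(self.model(key)) or self.overflow.might_contain(key)


# Hypothetical key set: mostly even numbers below 1000, plus a few outliers.
keys = list(range(0, 1000, 2)) + [77, 1001, 1337]
model = lambda k: k % 2 == 0 and k < 1000  # stand-in for a trained classifier
lbf = LearnedBloomFilter(keys, model, num_bits=64, num_hashes=3)
```

Only the three outliers end up in the overflow filter, which is why it can be much smaller than a Bloom filter over all keys. The false positive rate now depends on both the model and the overflow filter.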
![Page 86: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/86.jpg)
Bloom Filter - Approach 2 (Future Work)
[Figure: Key → Hash Function 1, Hash Function 2, Hash Function 3 (traditional) compared with Key → Model (learned)]
![Page 87: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/87.jpg)
Fundamental Algorithms & Data Structures
Hash-Map
Tree
Sorting
Join
Range-Filter
Priority Queue
Scheduling
Cache Policy
Bloom-Filter
…
![Page 88: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/88.jpg)
Future Work
CDF
How Would You Design Your Algorithms/Data Structure If You Have a Model for the Empirical Data Distribution?
![Page 89: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/89.jpg)
Future Work
Hash-Map
Tree
Sorting
Join
Range-Filter
Priority Queue
Scheduling
Cache Policy
Bloom-Filter
…
![Page 90: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/90.jpg)
Future work: Multi-Dim Indexes
![Page 91: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/91.jpg)
Future work: Data Cubes
![Page 92: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/92.jpg)
Other Database Components
• Cardinality Estimation
• Cost Model
• Query Scheduling
• Storage Layout
• Query Optimizer
• …
How Would You Design Your Algorithms/Data Structure If You Have a Model for the Empirical Data Distribution?
![Page 93: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/93.jpg)
![Page 94: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/94.jpg)
Related Work
• Succinct Data Structures → Most related, but succinct data structures usually are carefully, manually tuned for each use case
• B-Trees with Interpolation search → Arbitrary worst-case performance
• Perfect Hashing → Connection to our Hash-Map approach, but they usually increase in size with N
• Mixture of Expert Models → Used as part of our solution
• Adaptive Data Structures / Cracking → orthogonal problem
• Locality-Sensitive Hashing (LSH) (e.g., learned by NN) → Has nothing to do with Learned Structures
![Page 95: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/95.jpg)
Locality-Sensitive Hashing (LSH)
Thanks Alkis for the analogy
![Page 96: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/96.jpg)
Summarize
CDF
How Would You Design Your Algorithms/Data Structure If You Have a Model for the Empirical Data Distribution?
![Page 97: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/97.jpg)
Adapts To Your Data
![Page 98: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/98.jpg)
Big Potential For TPUs/GPUs
![Page 99: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/99.jpg)
Can Lower the Complexity Class
[Figure: Time or Space plotted over N for O(N²), O(N), O(Log N), and O(1)]
data_array[(lookup_key - 900)]
![Page 100: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/100.jpg)
Warning: Not An Almighty Solution
![Page 101: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/101.jpg)
Data System for AI Lab DSAIL@CSAIL
[Figure: lab overview with the sections Research Area, System Faculty, Founding Sponsors, and ML Faculty]
![Page 102: Learning Data Systems Components](https://reader033.vdocuments.net/reader033/viewer/2022051510/627e9639db753828802eca63/html5/thumbnails/102.jpg)
Tim Kraska <[email protected]>
Special thanks to:
• A new approach to indexing
• Framework to rethink many existing data structures/algorithms
• Under certain conditions, it might allow changing the complexity class of data structures
• The idea might have implications within and outside of database systems
Technical Report: Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, Neoklis Polyzotis: The Case for Learned Index Structures