
  • Learned Index Structures

Bigtable Research Review Meeting
    Presented by Deniz Altinbuken

    January 29, 2018

    paper by Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, Neoklis Polyzotis

    go/learned-index-structures-presentation


  • Objectives

    1. Show that all index structures can be replaced with deep learning models: learned indexes.

    2. Analyze under which conditions learned indexes outperform traditional index structures, and describe the main challenges in designing learned index structures.

    3. Show that the idea of replacing core components of a data management system with learned models can be very powerful.

  • Claims

    ● Traditional indexes assume a worst-case data distribution so that they can be general purpose.
      ○ They do not take advantage of patterns in the data.

    ● Knowing the exact data distribution makes it possible to highly optimize any index the database system uses.

    ● ML opens up the opportunity to learn a model that reflects the patterns and correlations in the data, and thus enables the automatic synthesis of specialized index structures: learned indexes.

  • Main Idea

    A model can learn the sort order or structure of lookup keys and use this signal to effectively predict the position or existence of records.

  • Background

    Learned Index Structures

    Results

    Conclusion

  • Background

  • Neural Networks: An Example

    Recognizing handwriting:

    ● Very difficult to express our intuitions, such as "9 has a loop at the top, and a vertical stroke in the bottom right".
    ● Very difficult to create precise rules and solve this algorithmically.
      ○ Too many exceptions and special cases.


  • Neural Networks: An Example

    Neural networks approach the problem in a different way.

    ● Take a large number of handwritten digits: training data.
    ● Develop a system which can learn from the training data.


  • Neural Networks: An Example

    Neural networks approach the problem in a different way: automatically infer rules for recognizing handwritten digits by going through examples!


  • Neural Networks: An Example

    Create a network of neurons that can learn! :)


  • Neurons: Perceptron

    A perceptron takes several binary inputs, x1, x2, …, and produces a single binary output.

    The output is computed as a function of the inputs, where weights w1, w2, … express the importance of the inputs to the output.

    [Figure: a perceptron with inputs x1, x2, x3, weights w1, w2, w3, and a single output.]


  • Neurons: Perceptron

    The output is determined by whether the weighted sum ∑j wj·xj is less than or greater than some threshold value.

    Just like the weights, the threshold is a number which is a parameter of the neuron. If the threshold is reached, the neuron fires.

    [Figure: the same perceptron with a threshold t inside the neuron.]


  • Neurons: Perceptron

    The output is determined by whether the weighted sum ∑j wj·xj is less than or greater than some threshold value. Just like the weights, the threshold is a number which is a parameter of the neuron. If the threshold is reached, the neuron fires.

    output = 0 if ∑j wj·xj ≤ threshold
             1 if ∑j wj·xj > threshold


  • Neurons: Perceptron

    A more common way to describe a perceptron writes the weighted sum ∑j wj·xj as the dot product w⋅x and replaces the threshold with a bias, where bias = −threshold:

    output = 0 if w⋅x + bias ≤ 0
             1 if w⋅x + bias > 0

    The bias describes how easy it is to get the neuron to fire.

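    To make the decision rule concrete, a minimal Python sketch of a perceptron (the weights, bias, and input values are illustrative, not from the slides):

        import numpy as np

        def perceptron(x, w, bias):
            # Fire (output 1) iff the weighted sum plus the bias is positive.
            return 1 if np.dot(w, x) + bias > 0 else 0

        x = np.array([1, 0, 1])             # three binary inputs
        w = np.array([0.6, 0.2, 0.3])       # importance of each input
        print(perceptron(x, w, bias=-0.5))  # 1, since 0.6 + 0.3 - 0.5 > 0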

  • Neurons: Perceptron

    ● By varying the weights and the threshold, we get different models of decision-making.
    ● A complex network of perceptrons that uses layers can make quite subtle decisions.

    [Figure: inputs feeding a multi-layer network of perceptrons with a single output; the first layer is the input layer, the last is the output layer, and the layers in between are hidden layers.]

  • Neurons: Perceptron

    Perceptrons are great for decision making.


  • Neurons: Perceptron

    How about learning?


  • Neurons: Perceptron

    Earlier: automatically infer rules for recognizing handwritten digits by going through examples!


  • Learning

    ● A neural network goes through examples to learn weights and biases so that the output from the network correctly classifies a given digit.
    ● If a small change in some weight or bias in the network causes only a small corresponding change in the output, the network can learn.

    We are trying to create the right mapping for all cases.

  • Learning

    The neural network is "trained" by adjusting weights and biases to find the perfect model that would generate the expected output for the "training data".


  • Learning

    Through training you minimize the prediction error.

    (But having perfect output is difficult.)


  • Neurons: Sigmoid

    ● Sigmoid neurons are similar to perceptrons, but modified so that small changes in their weights and bias cause only a small change in their output.

    [Figure: a network where changing a weight by Δw changes the output by Δoutput.]

    Small Δ in any weight or bias causes a small Δ in the output!


  • Neurons: Sigmoid

    ● A sigmoid neuron takes several inputs, x1, x2, …, which can be any real number between 0 and 1 (e.g. 0.256), and produces a single output, which can also be any real number between 0 and 1.

    output = σ(w⋅x + bias)

    where σ is the sigmoid function:

    σ(z) = 1 / (1 + e^(−z))

    Great for representing probabilities!

  • Neurons: ReLU (Rectified Linear Unit)

    ● Better for deep learning because it preserves the information from earlier layers better as it goes through hidden layers.

    output = 0 if x ≤ 0
             x if x > 0


  • Activation Functions (Transfer Functions)

    To get an intuition about the neurons, it helps to see the shape of the activation function.

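    As a quick companion to the shapes discussed above, a minimal sketch of the three activation functions in Python (NumPy), vectorized so they can be plotted directly:

        import numpy as np

        def step(z):
            # Perceptron activation: a hard threshold at zero.
            return np.where(z > 0, 1, 0)

        def sigmoid(z):
            # Smoothed step: sigma(z) = 1 / (1 + e^(-z)), output in (0, 1).
            return 1.0 / (1.0 + np.exp(-z))

        def relu(z):
            # Rectified linear unit: passes positive values through, else 0.
            return np.maximum(z, 0)

        z = np.linspace(-5, 5, 11)
        print(step(z), sigmoid(z), relu(z), sep="\n")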

  • Learned Index Structures

  • Index Structures as Neural Network Models

    ● Indexes are already, to a large extent, learned models like neural networks.
    ● Indexes predict the location of a value given a key.
      ○ A B-tree is a model that takes a key as an input and predicts the position of a data record.
      ○ A Bloom filter is a binary classifier which, given a key, predicts if the key exists in a set or not.


  • B-tree

    The B-tree provides a mapping from a lookup key to a position inside the sorted array of records.

    ● For efficiency, it indexes at page granularity.
    ● It maps a key to a position with a min and max error.

  • Replace B-trees with ML Models!

    ● We can replace the index with ML models that provide similar strong guarantees about the min and max error.
    ● The B-tree only provides this guarantee over the stored data, not for all possible data.
      ○ The min and max error is the maximum error of the model over the training data.
      ○ Execute the model for every key and remember the worst over- and under-prediction of a position. (A sketch follows below.)

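    A minimal sketch of how these bounds could be computed (the model is any trained regressor mapping a key to a predicted position; the names and interface are illustrative, not the paper's code):

        def error_bounds(model, sorted_keys):
            # Execute the model for every stored key and remember the worst
            # over- and under-prediction of a position. Like a B-tree, the
            # guarantee holds only for the stored (training) data.
            min_err, max_err = 0, 0
            for true_pos, key in enumerate(sorted_keys):
                err = model.predict(key) - true_pos
                min_err = min(min_err, err)   # worst under-prediction
                max_err = max(max_err, err)   # worst over-prediction
            return min_err, max_err

    At lookup time the record is then guaranteed to lie in [pred + min_err, pred + max_err], which bounds the local search.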

  • Challenges

    ● B-trees have a bounded cost for inserts and lookups and are good at taking advantage of the cache.
    ● B-trees can map keys to pages which are not continuously mapped to memory or disk.
    ● If a lookup key does not exist in the set, models that are not monotonically increasing might return positions outside the min/max error range.


  • Advantages

    ● Using ML models has the potential to transform the log n cost of a B-tree lookup into a constant-time operation (in the best case).
    ● Neural networks are able to learn a wide variety of data distributions, mixtures, and other data peculiarities and patterns, and make use of them.
      ○ We have to balance the complexity of the model with its accuracy.


  • A First, Naïve Learned Index

    ● Use 200M web-server log records to build a secondary index over the timestamps using Tensorflow.
      ○ Two-layer fully-connected NN with 32 neurons per layer using ReLU activation functions; the timestamps are the inputs and the positions are the outputs.
      ○ Lookup time ≈ 80,000 ns (model execution only).


  • A First, Naïve Learned Index

    ● It is CPU- and space-efficient to narrow down the position for an item from the entire data set to a region of thousands, but inefficient for the "last mile". (A sketch of the model follows below.)

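    The slides do not include the TensorFlow code; the following is a minimal Keras sketch of such a two-layer, 32-neuron ReLU network (the layer sizes come from the slide; the toy data, optimizer, and epoch count are assumptions for illustration):

        import numpy as np
        import tensorflow as tf

        # Toy stand-in for the timestamps: sorted keys -> positions 0..n-1.
        keys = np.sort(np.random.lognormal(size=10_000)).astype(np.float32).reshape(-1, 1)
        positions = np.arange(len(keys), dtype=np.float32)

        model = tf.keras.Sequential([
            tf.keras.layers.Dense(32, activation="relu", input_shape=(1,)),
            tf.keras.layers.Dense(32, activation="relu"),
            tf.keras.layers.Dense(1),  # predicted position in the sorted array
        ])
        model.compile(optimizer="adam", loss="mse")
        model.fit(keys, positions, epochs=5, batch_size=1024, verbose=0)

        pred = model.predict(keys[:1])  # approximate position of the first key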

  • A First, Naïve Learned Index

    For every key in 100M keys, we want to map it to a position in a sorted array. With one single model, it has to be "complex enough" to figure out an accurate mapping for every key.


  • The Recursive Model Index

    It is much easier to have a model that can say that a given key from 100M keys maps to the first 10k, second 10k, etc. positions!


  • The Learning Index Framework (LIF)

    ● The LIF can be regarded as an index synthesis system; given an index specification, LIF generates different index configurations, optimizes them, and tests them automatically.
    ● Given a trained Tensorflow model, LIF automatically extracts all weights from the model and generates efficient index structures in C++ based on the model specification.


  • The Recursive Model Index

    ● Improve last-mile accuracy.
      ○ Reducing the min/max error to 100 over 100M records using a single model is very hard.
      ○ Reducing the error from 100M to 10k is much easier to achieve, even with simple models.
      ○ Reducing the error from 10k to 100 is simpler still, as the model can then focus on only a subset of the data.


  • The Recursive Model Index

    💡 Use a hierarchical approach where models can focus on smaller subsets of the data.


  • The Recursive Model Index

    Take a layered approach and have each model focus on a limited subset of the data:

    Reduce from 100M to 1M
    Reduce from 1M to 10k
    Reduce from 10k to 100
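    A minimal sketch of a two-stage RMI lookup under these assumptions (stage1 and each stage2[i] are trained predictors, err[i] holds the i-th model's recorded (min, max) error; illustrative, not the paper's code):

        import bisect

        def rmi_lookup(key, stage1, stage2, err, data):
            # Stage 1 predicts a rough position, which picks the stage-2
            # model that "owns" this key.
            i = int(stage1.predict(key) * len(stage2) / len(data))
            i = min(max(i, 0), len(stage2) - 1)
            # Stage 2 predicts the position; search only inside that model's
            # guaranteed min/max error range (the "last mile").
            pos = int(stage2[i].predict(key))
            lo = max(pos + err[i][0], 0)
            hi = min(pos + err[i][1] + 1, len(data))
            j = bisect.bisect_left(data, key, lo, hi)
            return j if j < len(data) and data[j] == key else None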

  • The Recursive Model Index

    Check out the math in the paper if you're interested in the details! :)


  • Hybrid End-to-End Training

    With a layered approach we can build mixtures of models!

    [Figure: a three-stage RMI. A small ReLU NN reduces from 100M to 1M, a stage of linear regression models reduces from 1M to 10k, and a final stage of linear regression models and B-trees reduces from 10k to 100.]


  • Hybrid End-to-End Training

    Starting from the entire dataset (line 3), the algorithm first trains the top-node model. Based on the prediction of this model, it then picks the model from the next stage (lines 9 and 10) and adds all keys which fall into that model (line 10). Finally, in the case of hybrid indexes, the index is optimized by replacing NN models with B-trees if the absolute min-/max-error is above a predefined threshold (lines 11-14). (A Python sketch of this procedure follows below.)

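    A hedged Python sketch of that procedure; make_nn and make_btree are assumed model factories, and the real algorithm is Algorithm 1 in the paper:

        def train_hybrid_rmi(keys, positions, n_stage2, make_nn, make_btree,
                             err_threshold):
            # 1. Train the top-node model on the entire dataset.
            top = make_nn().fit(keys, positions)
            # 2. Partition the keys according to the top model's predictions.
            buckets = [[] for _ in range(n_stage2)]
            for k, p in zip(keys, positions):
                i = int(top.predict(k) * n_stage2 / len(keys))
                buckets[min(max(i, 0), n_stage2 - 1)].append((k, p))
            # 3. Train one stage-2 model per partition; replace any model whose
            #    absolute error exceeds the threshold with a B-tree.
            stage2 = []
            for bucket in buckets:
                if not bucket:
                    stage2.append(None)
                    continue
                ks, ps = zip(*bucket)
                m = make_nn().fit(ks, ps)
                if max(abs(m.predict(k) - p) for k, p in bucket) > err_threshold:
                    m = make_btree(ks, ps)  # hybrid fallback: worst case a B-tree
                stage2.append(m)
            return top, stage2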

  • Hybrid End-to-End Training

    Worst case is a B-tree!


  • Search Strategies

    To find the record, either binary search or scanning is used. Models might generate more information than just the page location.

    ● Model Binary Search
      ○ Set the first middle point to the position pos predicted by the model.
    ● Biased Search
      ○ Use the standard deviation σ of the last-stage model to set the middle point.
    ● Biased Quaternary Search
      ○ Pick three middle points: pos − σ, pos, pos + σ. (A sketch follows below.)

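    A sketch of the biased quaternary variant (pos is the model's predicted position, sigma its standard error, and lo/hi bound the search; illustrative, not the paper's code):

        import bisect

        def biased_quaternary_search(data, key, pos, sigma, lo, hi):
            # Seed the search with three pivots at pos - sigma, pos, pos + sigma,
            # narrow [lo, hi) to the bracketing interval, then finish with
            # ordinary binary search.
            pivots = sorted({max(lo, min(hi - 1, int(p)))
                             for p in (pos - sigma, pos, pos + sigma)})
            for p in pivots:
                if data[p] == key:
                    return p
                if data[p] < key:
                    lo = p + 1
                else:
                    hi = p
                    break
            i = bisect.bisect_left(data, key, lo, hi)
            return i if i < len(data) and data[i] == key else None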

  • Indexing Strings

    ● Turn strings into inputs the NN model can use.
      ○ Represent the string as a vector, where each element is the decimal ASCII value of a character. (A sketch follows below.)
      ○ Limit the size of the vector to N to have equally-sized inputs.
    ● Vector inputs slow the model down significantly.
    ● Further research is needed to speed this case up :)

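    A minimal sketch of that encoding (the length N = 16 and the zero padding are assumptions):

        import numpy as np

        def string_to_vector(s, n=16):
            # Fixed-length vector of decimal ASCII codes, truncated or
            # zero-padded to n so all inputs have equal size.
            codes = [ord(c) for c in s[:n]]
            return np.array(codes + [0] * (n - len(codes)), dtype=np.float32)

        print(string_to_vector("key"))  # [107. 101. 121. 0. ... 0.]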

  • Inserts and Updates

    ● Appends
      ○ No need to relearn if the model can learn the key trend for the new items.
    ● Inserts in the middle
      ○ If inserts follow roughly a similar pattern as the learned CDF, retraining is not needed, since the index "generalizes" over the new items and inserts become an O(1) operation.


  • Inserts and Updates

    If we have a model that is more general, it is cheaper to insert new values, since they will follow the trend.


  • Hashmap

    Hashmaps use a hash function to deterministically map keys to random positions inside an array.


  • Hashmap

    The main challenge is to reduce conflicts.

    ● Use a linked list to handle the "overflow".
    ● Use linear or quadratic probing.
    ● Most solutions allocate significantly more memory than records and combine it with additional data structures.
      ○ Dense hashmap: typical overhead of 78% memory.
      ○ Sparse hashmap: only 4 bits of overhead, but up to 3-7 times slower because of its search and data placement strategy.


  • Hashmap

    ● If we could learn a model which uniquely maps every key to a unique position inside the array, we could avoid conflicts.
    ● Learned models are capable of reaching higher utilization of the hashmap, depending on the data distribution.
    ● Scale the distribution by the targeted size M of the hashmap and use h(K) = F(K) ∗ M as the hash function, where F is the learned CDF of the keys. (A sketch follows below.)
    ● If the model F perfectly learned the distribution, no conflicts would exist.

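    A sketch of that construction, using an empirical CDF as a stand-in for the learned model F (illustrative; any trained CDF model would take its place):

        import numpy as np

        def learned_hash(key, F, M):
            # h(K) = F(K) * M: scaling the learned CDF by the table size
            # spreads keys evenly if F matches the true distribution.
            return int(F(key) * (M - 1))

        sample = np.sort(np.random.lognormal(size=100_000))
        F = lambda k: np.searchsorted(sample, k) / len(sample)  # empirical CDF
        slot = learned_hash(3.7, F, M=1_000_000)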

  • Bloom filter

    Bloom filters are probabilistic data structures used to test whether an element is a member of a set.

    [Figure: Bloom filter insertion vs. learned Bloom filter insertion.]


  • Bloom filter

    ● A Bloom filter index needs to learn a function that separates keys from everything else.
      ○ A good hash function for a Bloom filter should have lots of collisions among keys and lots of collisions among non-keys, but few collisions of keys and non-keys.
    ● As a classification problem: learn a model f that can predict if an input x is a key or a non-key.


  • Bloom filter

    ● As a classification problem: learn a model f that can predict if an input x is a key or a non-key. (A sketch follows below.)
      ○ Use sigmoid neurons to produce a probability between 0 and 1.
      ○ The output of the NN is the probability that input x is a key in our database.
      ○ Choose a threshold t above which we assume the key exists in our database.
      ○ Tune the threshold t to achieve the desired false positive rate.
      ○ To prevent false negatives, use an overflow Bloom filter.

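    A sketch of that two-part structure; model, tau, and make_bloom are assumed stand-ins, and the overflow filter is what rules out false negatives:

        class LearnedBloomFilter:
            def __init__(self, model, tau, keys, make_bloom):
                self.model, self.tau = model, tau
                # Keys the classifier misses go into a (small) overflow
                # Bloom filter, so a stored key can never be reported absent.
                self.overflow = make_bloom([k for k in keys
                                            if model.predict(k) <= tau])

            def might_contain(self, x):
                # Accept if the model is confident, else fall back to the
                # overflow filter; false positives remain possible.
                return self.model.predict(x) > self.tau or x in self.overflow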

  • Results

  • B-tree Results

    ● 4 datasets to compare the performance of learned index structures with B-trees.
      ○ Compare lookup time (model execution time + local search time).
      ○ Compare index structure size.
      ○ Compare model error and error variance.
    ● These results focus on read performance only; loading and insertion time are not included.
      ○ A model without hidden layers can be trained on over 200M records in just a few seconds.


  • Web Log Dataset

    200M log entries for requests to a major university website. Index over all unique timestamps.


  • Web Log Dataset

    The model error is the averaged standard error over all models on the last stage, whereas the error variance indicates how much this standard error varies between the models.


  • Web Log Dataset

    The model is 3× faster and up to an order of magnitude smaller.


  • Web Log Dataset

    Quaternary search only helps a little bit.


  • Web Log Dataset

    The error is high, which influences the search time.


  • Maps Dataset

    Index of the longitude of ≈ 200M user-maintained features across the world. Relatively linear.


  • Maps Dataset


The model is 3× faster and up to an order of magnitude smaller.

  • Maps Dataset

    Quaternary search does not help.


  • Lognormal Dataset

    Synthetic dataset of 190M unique values to test how the index works on heavy-tailed distributions. Highly non-linear, making the distribution more difficult to learn.


  • Lognormal Dataset


    The error is high, which influences the search time.

  • Important Observations

    ● The learned index is 3× faster and up to an order of magnitude smaller.
    ● Quaternary search only helps for some datasets.
    ● The model accuracy varies widely; most noticeably, for the synthetic dataset and the weblog data the error is much higher.
    ● The second-stage size has a significant impact on the index size and lookup performance.
      ○ This is not surprising, as the second stage determines how many models have to be stored. Worth noting is that the second stage uses 10,000 or more models.


  • Web Document Dataset

    The web-document dataset consists of the 10M non-continuous document-ids of a large web index used as part of a real product at a large internet company.


  • Web Document Dataset

    Speedups for learned indexes are not as prominent here, so hybrid indexes, which replace badly performing models with B-trees, actually help to improve performance.


  • Web Document Dataset

    Because the cost of searching is higher, the different search strategies make a bigger difference. Biased search and quaternary search perform better because they can take the standard error into account.


  • Hashmap Results

    ● Use 3 integer datasets.
    ● The model hash has similar performance and utilizes the memory better.
    ● When there are extra slots, the improvement disappears.


  • Bloom filter Results

    ● Blacklisted phishing URLs dataset: 1.7M unique URLs.
    ● The more accurate the model is, the better the savings in Bloom filter size.


  • Bloom filter Results

    ● A normal Bloom filter with a desired 1% false positive rate requires 2.04MB.
    ● For a 16-dimensional GRU with a 32-dimensional embedding for each character, the model is 0.0259MB; with the spillover Bloom filter, the total is 1.07MB.


  • Conclusion

  • Conclusion and Future Work

    ● Multi-Dimensional Indexes: Extend learned indexes to multi-dimensional index structures. Models, especially neural nets, are extremely good at capturing complex high-dimensional relationships.
    ● Learned Algorithms: A model can also speed up sorting and joins, not just indexes.
    ● GPUs/TPUs: GPUs/TPUs will make the idea of learned indexes even more viable.


  • Next time

    ● Is this a good idea?
    ● Related work
    ● "Some Notes on Learned Bloom Filters"
    ● "Don't Throw Out Your Algorithms Book Just Yet"


    https://mybiasedcoin.blogspot.com/2018/01/some-notes-on-learned-bloom-filters.html?m=1
    http://dawn.cs.stanford.edu/2018/01/11/index-baselines/