Approximate Nearest Neighbor Methods and Vector Models – NYC ML Meetup
TRANSCRIPT
Approximate nearest neighbors & vector models
I’m Erik
• @fulhack
• Author of Annoy, Luigi
• Currently CTO of Better
• Previously 5 years at Spotify
What’s nearest neighbor(s)?
• Let’s say you have a bunch of points
Grab a bunch of points
5 nearest neighbors
20 nearest neighbors
100 nearest neighbors
…But what’s the point?
• Vector models are everywhere
• Lots of applications (language processing, recommender systems, computer vision)
MNIST example
• 28x28 = 784-dimensional dataset
• Define distance in terms of pixels, e.g. Euclidean distance: $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$
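A minimal brute-force sketch of this pixel-space distance, assuming the images have already been flattened into a NumPy array of shape (n, 784):

import numpy as np

def pixel_neighbors(query, images, k=5):
    """Brute-force k nearest neighbors under Euclidean pixel distance.

    query:  one flattened image, shape (784,)
    images: the dataset, shape (n, 784)
    """
    dists = np.linalg.norm(images - query, axis=1)
    return np.argsort(dists)[:k]  # indices of the k closest images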
MNIST neighbors
…Much better approach
1. Start with high dimensional data
2. Run dimensionality reduction to 10-1000 dims
3. Do stuff in a small dimensional space
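One common way to do step 2 (not necessarily what was used in the talk) is PCA, e.g. via scikit-learn:

from sklearn.decomposition import PCA

# images: np.ndarray of shape (n, 784), e.g. flattened MNIST as above.
# 64 output dimensions is an arbitrary choice for illustration.
pca = PCA(n_components=64)
reduced = pca.fit_transform(images)  # shape (n, 784) -> (n, 64)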
Deep learning for food
• Deep model trained on a GPU on 6M random pics downloaded from Yelp
[Architecture diagram: stacked 3x3 convolutions with 2x2 max-pooling (156x156x32 → 154x154x32 → 152x152x32 → 76x76x64 → 74x74x64 → 72x72x64 → 36x36x128 → 34x34x128 → 32x32x128 → 16x16x256 → 14x14x256 → 12x12x256 → 6x6x512 → 4x4x512 → 2x2x512), fully connected layers with dropout (2048 → 2048), a 128-dimensional bottleneck layer, and a 1244-unit output layer]
Distance in smaller space
1. Run image through the network
2. Use the 128-dimensional bottleneck layer as an item vector
3. Use cosine distance in the reduced space
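A sketch of step 3, assuming a and b are the 128-dimensional bottleneck activations for two images:

import numpy as np

def cosine_distance(a, b):
    # cosine distance = 1 - cosine similarity
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))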
Nearest food pics
Vector methods for text
• TF-IDF (old) – no dimensionality reduction
• Latent Semantic Analysis (1988)
• Probabilistic Latent Semantic Analysis (2000)
• Semantic Hashing (2007)
• word2vec (2013), RNN, LSTM, …
Represent documents and/or words as f-dimensional vectors
[Plot: example word vectors ("banana", "apple", "boat") in a 2-D space of latent factor 1 vs. latent factor 2]
Vector methods for collaborative filtering
• Supervised methods: See everything from the Netflix Prize
• Unsupervised: Use NLP methods
CF vectors – examples

IPMF item-item:
$P(i \to j) = \exp(b_j^T b_i)/Z_i = \frac{\exp(b_j^T b_i)}{\sum_k \exp(b_k^T b_i)}$

Vectors:
$p_{ui} = a_u^T b_i$
$\mathrm{sim}_{ij} = \cos(b_i, b_j) = \frac{b_i^T b_j}{|b_i||b_j|}$, computed in $O(f)$
i                       j                       sim(i, j)
2pac                    2pac                    1.0
2pac                    Notorious B.I.G.        0.91
2pac                    Dr. Dre                 0.87
2pac                    Florence + the Machine  0.26
Florence + the Machine  Lana Del Rey            0.81
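A worked sketch of the item-item softmax above. The function name is mine; B is assumed to be a NumPy matrix whose rows are the item vectors b_k:

import numpy as np

def transition_probs(B, i):
    """P(i -> j) = exp(b_j . b_i) / sum_k exp(b_k . b_i), for all j."""
    logits = B @ B[i]
    logits -= logits.max()       # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()           # the Z_i normalization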
IPMF item-item MDS:
$P(i \to j) = \exp(-\|b_j - b_i\|^2)/Z_i = \frac{\exp(-\|b_j - b_i\|^2)}{\sum_k \exp(-\|b_k - b_i\|^2)}$
$\mathrm{sim}_{ij} = -\|b_j - b_i\|^2$
Trained on (u, i, count) tuples, with gradient updates $\partial L / \partial a_u$
Geospatial indexing
• Ping the world: https://github.com/erikbern/ping
• k-NN regression using Annoy
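The regression itself is simple: predict the average target value of the k nearest points. A brute-force sketch (the project above uses Annoy for the neighbor lookup; names are mine):

import numpy as np

def knn_regress(query, X, y, k=10):
    """Predict by averaging the targets of the k nearest points.

    X: coordinates, shape (n, d); y: observed values, shape (n,)
    """
    idx = np.argsort(np.linalg.norm(X - query, axis=1))[:k]
    return y[idx].mean()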
Nearest neighbors the brute force way
• We can always do an exhaustive search to find the nearest neighbors
• Imagine MySQL doing a linear scan for every query…
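This is all an exhaustive search amounts to, assuming unit-normalized vectors so that cosine similarity reduces to a dot product:

import numpy as np

def brute_force_nns(query, vectors, n=10):
    """Exhaustive scan: one dot product per item, then a sort.

    vectors: unit-normalized matrix of shape (num_items, f)
    """
    sims = vectors @ query        # cosine similarity, O(num_items * f)
    return np.argsort(-sims)[:n]  # indices of the n most similar items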
Using word2vec’s brute force search
$ time echo -e "chinese river\nEXIT\n" | ./distance GoogleNews-vectors-negative300.bin
Qiantang_River    0.597229
Yangtse           0.587990
Yangtze_River     0.576738
lake              0.567611
rivers            0.567264
creek             0.567135
Mekong_river      0.550916
Xiangjiang_River  0.550451
Beas_river        0.549198
Minjiang_River    0.548721

real    2m34.346s
user    1m36.235s
sys     0m16.362s
Introducing Annoy
• https://github.com/spotify/annoy
• mmap-based ANN library
• Written in C++, with Python and R bindings
• 585 stars on GitHub
Using Annoy’s search
$ time echo -e "chinese river\nEXIT\n" | python nearest_neighbors.py ~/tmp/word2vec/GoogleNews-vectors-negative300.bin 100000
Yangtse           0.907756
Yangtze_River     0.920067
rivers            0.930308
creek             0.930447
Mekong_river      0.947718
Huangpu_River     0.951850
Ganges            0.959261
Thu_Bon           0.960545
Yangtze           0.966199
Yangtze_river     0.978978

real    0m0.470s
user    0m0.285s
sys     0m0.162s
Using Annoy’s search
$ time echo -e "chinese river\nEXIT\n" | python nearest_neighbors.py ~/tmp/word2vec/GoogleNews-vectors-negative300.bin 1000000
Qiantang_River    0.897519
Yangtse           0.907756
Yangtze_River     0.920067
lake              0.929934
rivers            0.930308
creek             0.930447
Mekong_river      0.947718
Xiangjiang_River  0.948208
Beas_river        0.949528
Minjiang_River    0.950031

real    0m2.013s
user    0m1.386s
sys     0m0.614s
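For reference, a minimal sketch of Annoy's Python API (the trailing 100000 / 1000000 arguments above are, presumably, the search_k knob that trades speed for accuracy):

from annoy import AnnoyIndex

f = 300                           # dimensionality of the word vectors
index = AnnoyIndex(f, 'angular')  # 'angular' ~ cosine distance
for i, v in enumerate(vectors):   # vectors: your (num_items, 300) data
    index.add_item(i, v)
index.build(10)                   # build 10 trees
index.save('vectors.ann')         # the saved index is mmap-ed on load

# 10 nearest neighbors of item 0; larger search_k = slower but more accurate
print(index.get_nns_by_item(0, 10, search_k=100000))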
(performance)
1. Building an Annoy index
Start with the point set
Split it in two halves
Split again
Again…
…more iterations later
Side note: making trees small
• Split until there are at most K items in each leaf (K ≈ 100)
• Takes O(n/K) memory for the tree instead of O(n)
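A rough Python sketch of this construction, assuming (as in Annoy's splits) that each split plane is the one equidistant to two randomly chosen points:

import numpy as np

rng = np.random.default_rng(0)

def build_tree(X, indices, K=100):
    """Recursively split the point set with random hyperplanes.

    Usage: tree = build_tree(X, np.arange(len(X)))
    """
    if len(indices) <= K:                  # small enough: make a leaf
        return {"leaf": indices}
    i, j = rng.choice(indices, size=2, replace=False)
    normal = X[i] - X[j]                   # plane equidistant to X[i], X[j]
    offset = normal @ (X[i] + X[j]) / 2.0
    margins = X[indices] @ normal - offset
    left, right = indices[margins <= 0], indices[margins > 0]
    if len(left) == 0 or len(right) == 0:  # degenerate split: give up
        return {"leaf": indices}
    return {"normal": normal, "offset": offset,
            "left": build_tree(X, left, K),
            "right": build_tree(X, right, K)}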
Binary tree
2. Searching
Nearest neighbors
Searching the tree
Problemo
• The point that’s the closest isn’t necessarily in the same leaf of the binary tree
• Two points that are really close may end up on different sides of a split
• Solution: go to both sides of a split if it’s close
Trick 1: Priority queue
• Traverse the tree using a priority queue
• Sort by min(margin) over the path from the root
Trick 2: many trees
• Construct trees randomly many times
• Use the same priority queue to search all of them at the same time
heap + forest = best
• Since we use a priority queue, we always descend into the most promising splits first (the ones with the biggest margin)
• More trees always helps!
• The only constraint is that more trees require more RAM
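Putting both tricks together, a sketch (names are mine) that searches a forest of trees built with the build_tree sketch above, using one priority queue keyed on the smallest margin along the path:

import heapq, itertools

def get_candidates(roots, q, num_candidates):
    """Pop the path with the largest min-margin first; collect leaf items."""
    tie = itertools.count()  # tie-breaker, since dict nodes don't compare
    # heapq is a min-heap, so priorities are stored negated
    heap = [(-float("inf"), next(tie), root) for root in roots]
    heapq.heapify(heap)
    candidates = set()
    while heap and len(candidates) < num_candidates:
        neg_priority, _, node = heapq.heappop(heap)
        if "leaf" in node:
            candidates.update(node["leaf"].tolist())
            continue
        margin = node["normal"] @ q - node["offset"]
        # each child's priority is the smallest margin seen on its path:
        # the near side keeps the parent's priority, the far side is penalized
        heapq.heappush(heap, (max(neg_priority, -margin), next(tie), node["right"]))
        heapq.heappush(heap, (max(neg_priority, margin), next(tie), node["left"]))
    return candidates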
Annoy query structure
1. Use a priority queue to search all trees until we’ve found k candidate items
2. Take the union and remove duplicates (there are a lot)
3. Compute the distance for the remaining items
4. Return the nearest n items
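Combining the pieces sketched earlier into the full query (steps 1–4):

import numpy as np

def query(roots, X, q, n, num_candidates):
    # steps 1-2: gather the union of candidate items across all trees
    cand = np.array(sorted(get_candidates(roots, q, num_candidates)))
    # step 3: compute exact distances only for the candidates
    dists = np.linalg.norm(X[cand] - q, axis=1)
    # step 4: return the n closest
    return cand[np.argsort(dists)[:n]]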
Find candidates
Take union of all leaves
Compute distances
Return nearest neighbors
“Curse of dimensionality”
Are we screwed?
• Would be nice if the data has a much smaller “intrinsic dimension”!
Improving the algorithm
[Plot: queries per second vs. 1-NN accuracy; up = faster, right = more accurate]
• https://github.com/erikbern/ann-benchmarks
ann-benchmarks
perf/accuracy tradeoffs
[Plot: queries per second vs. 1-NN accuracy, annotated with the effect of searching more nodes and of using more trees]
Things that work
• Smarter plane splitting
• Priority queue heuristics
• Search more nodes than number of results
• Align nodes closer together in memory
Things that don’t work
• Use lower-precision arithmetic
• Priority queue by other heuristics (number of trees)
• Precompute vector norms
Things for the future
• Use an optimization scheme for tree building
• Add more distance functions (e.g. edit distance)
• Use a proper KV store as a backend (e.g. LMDB) to support incremental adds, out-of-core operation, and arbitrary keys: https://github.com/Houzz/annoy2
Thanks!
• https://github.com/spotify/annoy
• https://github.com/erikbern/ann-benchmarks
• https://github.com/erikbern/ann-presentation
• erikbern.com
• @fulhack