Approximate Nearest Neighbor Methods and Vector Models – NYC ML Meetup
TRANSCRIPT
Approximate nearest neighbors & vector models
I’m Erik
• @fulhack
• Author of Annoy, Luigi
• Currently CTO of Better
• Previously 5 years at Spotify
What’s nearest neighbor(s)?
• Let’s say you have a bunch of points
Grab a bunch of points
5 nearest neighbors
20 nearest neighbors
100 nearest neighbors
…But what’s the point?
• Vector models are everywhere
• Lots of applications (language processing, recommender systems, computer vision)
MNIST example
• 28x28 = 784-dimensional dataset
• Define distance in terms of pixels, e.g. Euclidean distance: $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$
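A minimal brute-force sketch of this pixel-space distance, assuming the images have already been flattened into a NumPy array of shape (n, 784):

import numpy as np

def pixel_neighbors(query, images, k=5):
    """Brute-force k nearest neighbors under Euclidean pixel distance.

    query:  one flattened image, shape (784,)
    images: the dataset, shape (n, 784)
    """
    dists = np.linalg.norm(images - query, axis=1)
    return np.argsort(dists)[:k]  # indices of the k closest images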
MNIST neighbors
…Much better approach
1. Start with high dimensional data
2. Run dimensionality reduction to 10-1000 dims
3. Do stuff in a small dimensional space
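One common way to do step 2 (not necessarily what was used in the talk) is PCA, e.g. via scikit-learn:

from sklearn.decomposition import PCA

# images: np.ndarray of shape (n, 784), e.g. flattened MNIST as above.
# 64 output dimensions is an arbitrary choice for illustration.
pca = PCA(n_components=64)
reduced = pca.fit_transform(images)  # shape (n, 784) -> (n, 64)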
Deep learning for food
• Deep model trained on a GPU on 6M random pics downloaded from Yelp
[Architecture diagram: stacked 3x3 convolutions with 2x2 max-pooling (156x156x32 → 154x154x32 → 152x152x32 → 76x76x64 → 74x74x64 → 72x72x64 → 36x36x128 → 34x34x128 → 32x32x128 → 16x16x256 → 14x14x256 → 12x12x256 → 6x6x512 → 4x4x512 → 2x2x512), fully connected layers with dropout (2048 → 2048), a 128-dimensional bottleneck layer, and a 1244-unit output layer]
Distance in smaller space
1. Run image through the network
2. Use the 128-dimensional bottleneck layer as an item vector
3. Use cosine distance in the reduced space
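A sketch of step 3, assuming a and b are the 128-dimensional bottleneck activations for two images:

import numpy as np

def cosine_distance(a, b):
    # cosine distance = 1 - cosine similarity
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))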
Nearest food pics
Vector methods for text
• TF-IDF (old) – no dimensionality reduction
• Latent Semantic Analysis (1988)
• Probabilistic Latent Semantic Analysis (2000)
• Semantic Hashing (2007)
• word2vec (2013), RNN, LSTM, …
Represent documents and/or words as f-dimensional vectors
[Plot: example word vectors ("banana", "apple", "boat") in a 2-D space of latent factor 1 vs. latent factor 2]
Vector methods for collaborative filtering
• Supervised methods: See everything from the Netflix Prize
• Unsupervised: Use NLP methods
CF vectors – examples

IPMF item-item:
$P(i \to j) = \exp(b_j^T b_i)/Z_i = \frac{\exp(b_j^T b_i)}{\sum_k \exp(b_k^T b_i)}$

Vectors:
$p_{ui} = a_u^T b_i$
$\mathrm{sim}_{ij} = \cos(b_i, b_j) = \frac{b_i^T b_j}{|b_i||b_j|}$, computed in $O(f)$
i                       j                       sim(i, j)
2pac                    2pac                    1.0
2pac                    Notorious B.I.G.        0.91
2pac                    Dr. Dre                 0.87
2pac                    Florence + the Machine  0.26
Florence + the Machine  Lana Del Rey            0.81
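A worked sketch of the item-item softmax above. The function name is mine; B is assumed to be a NumPy matrix whose rows are the item vectors b_k:

import numpy as np

def transition_probs(B, i):
    """P(i -> j) = exp(b_j . b_i) / sum_k exp(b_k . b_i), for all j."""
    logits = B @ B[i]
    logits -= logits.max()       # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()           # the Z_i normalization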
IPMF item-item MDS:
$P(i \to j) = \exp(-\|b_j - b_i\|^2)/Z_i = \frac{\exp(-\|b_j - b_i\|^2)}{\sum_k \exp(-\|b_k - b_i\|^2)}$
$\mathrm{sim}_{ij} = -\|b_j - b_i\|^2$
Trained on (u, i, count) tuples, with gradient updates $\partial L / \partial a_u$
Geospatial indexing
• Ping the world: https://github.com/erikbern/ping
• k-NN regression using Annoy
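The regression itself is simple: predict the average target value of the k nearest points. A brute-force sketch (the project above uses Annoy for the neighbor lookup; names are mine):

import numpy as np

def knn_regress(query, X, y, k=10):
    """Predict by averaging the targets of the k nearest points.

    X: coordinates, shape (n, d); y: observed values, shape (n,)
    """
    idx = np.argsort(np.linalg.norm(X - query, axis=1))[:k]
    return y[idx].mean()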
Nearest neighbors the brute force way
• We can always do an exhaustive search to find the nearest neighbors
• Imagine MySQL doing a linear scan for every query…
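This is all an exhaustive search amounts to, assuming unit-normalized vectors so that cosine similarity reduces to a dot product:

import numpy as np

def brute_force_nns(query, vectors, n=10):
    """Exhaustive scan: one dot product per item, then a sort.

    vectors: unit-normalized matrix of shape (num_items, f)
    """
    sims = vectors @ query        # cosine similarity, O(num_items * f)
    return np.argsort(-sims)[:n]  # indices of the n most similar items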
Using word2vec’s brute force search
$ time echo -e "chinese river\nEXIT\n" | ./distance GoogleNews-vectors-negative300.bin
Qiantang_River    0.597229
Yangtse           0.587990
Yangtze_River     0.576738
lake              0.567611
rivers            0.567264
creek             0.567135
Mekong_river      0.550916
Xiangjiang_River  0.550451
Beas_river        0.549198
Minjiang_River    0.548721

real    2m34.346s
user    1m36.235s
sys     0m16.362s
Introducing Annoy
• https://github.com/spotify/annoy
• mmap-based ANN library
• Written in C++, with Python and R bindings
• 585 stars on GitHub
Using Annoy’s search
$ time echo -e "chinese river\nEXIT\n" | python nearest_neighbors.py ~/tmp/word2vec/GoogleNews-vectors-negative300.bin 100000
Yangtse           0.907756
Yangtze_River     0.920067
rivers            0.930308
creek             0.930447
Mekong_river      0.947718
Huangpu_River     0.951850
Ganges            0.959261
Thu_Bon           0.960545
Yangtze           0.966199
Yangtze_river     0.978978

real    0m0.470s
user    0m0.285s
sys     0m0.162s
Using Annoy’s search
$ time echo -e "chinese river\nEXIT\n" | python nearest_neighbors.py ~/tmp/word2vec/GoogleNews-vectors-negative300.bin 1000000
Qiantang_River    0.897519
Yangtse           0.907756
Yangtze_River     0.920067
lake              0.929934
rivers            0.930308
creek             0.930447
Mekong_river      0.947718
Xiangjiang_River  0.948208
Beas_river        0.949528
Minjiang_River    0.950031

real    0m2.013s
user    0m1.386s
sys     0m0.614s
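For reference, a minimal sketch of Annoy's Python API (the trailing 100000 / 1000000 arguments above are, presumably, the search_k knob that trades speed for accuracy):

from annoy import AnnoyIndex

f = 300                           # dimensionality of the word vectors
index = AnnoyIndex(f, 'angular')  # 'angular' ~ cosine distance
for i, v in enumerate(vectors):   # vectors: your (num_items, 300) data
    index.add_item(i, v)
index.build(10)                   # build 10 trees
index.save('vectors.ann')         # the saved index is mmap-ed on load

# 10 nearest neighbors of item 0; larger search_k = slower but more accurate
print(index.get_nns_by_item(0, 10, search_k=100000))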
(performance)
1. Building an Annoy index
Start with the point set
Split it in two halves
Split again
Again…
…more iterations later
Side note: making trees small
• Split until there are at most K items in each leaf (K ≈ 100)
• Takes O(n/K) memory for the tree instead of O(n)
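A rough Python sketch of this construction, assuming (as in Annoy's splits) that each split plane is the one equidistant to two randomly chosen points:

import numpy as np

rng = np.random.default_rng(0)

def build_tree(X, indices, K=100):
    """Recursively split the point set with random hyperplanes.

    Usage: tree = build_tree(X, np.arange(len(X)))
    """
    if len(indices) <= K:                  # small enough: make a leaf
        return {"leaf": indices}
    i, j = rng.choice(indices, size=2, replace=False)
    normal = X[i] - X[j]                   # plane equidistant to X[i], X[j]
    offset = normal @ (X[i] + X[j]) / 2.0
    margins = X[indices] @ normal - offset
    left, right = indices[margins <= 0], indices[margins > 0]
    if len(left) == 0 or len(right) == 0:  # degenerate split: give up
        return {"leaf": indices}
    return {"normal": normal, "offset": offset,
            "left": build_tree(X, left, K),
            "right": build_tree(X, right, K)}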
Binary tree
2. Searching
Nearest neighbors
Searching the tree
Problemo
• The point that’s the closest isn’t necessarily in the same leaf of the binary tree
• Two points that are really close may end up on different sides of a split
• Solution: go to both sides of a split if it’s close
Trick 1: Priority queue
• Traverse the tree using a priority queue
• Sort by min(margin) over the path from the root
Trick 2: many trees
• Construct trees randomly many times
• Use the same priority queue to search all of them at the same time
heap + forest = best
• Since we use a priority queue, we always descend into the most promising splits first (the ones with the biggest margin)
• More trees always helps!
• The only constraint is that more trees require more RAM
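Putting both tricks together, a sketch (names are mine) that searches a forest of trees built with the build_tree sketch above, using one priority queue keyed on the smallest margin along the path:

import heapq, itertools

def get_candidates(roots, q, num_candidates):
    """Pop the path with the largest min-margin first; collect leaf items."""
    tie = itertools.count()  # tie-breaker, since dict nodes don't compare
    # heapq is a min-heap, so priorities are stored negated
    heap = [(-float("inf"), next(tie), root) for root in roots]
    heapq.heapify(heap)
    candidates = set()
    while heap and len(candidates) < num_candidates:
        neg_priority, _, node = heapq.heappop(heap)
        if "leaf" in node:
            candidates.update(node["leaf"].tolist())
            continue
        margin = node["normal"] @ q - node["offset"]
        # each child's priority is the smallest margin seen on its path:
        # the near side keeps the parent's priority, the far side is penalized
        heapq.heappush(heap, (max(neg_priority, -margin), next(tie), node["right"]))
        heapq.heappush(heap, (max(neg_priority, margin), next(tie), node["left"]))
    return candidates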
Annoy query structure
1. Use a priority queue to search all trees until we’ve found k candidate items
2. Take the union and remove duplicates (there are a lot)
3. Compute the distance for the remaining items
4. Return the nearest n items
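Combining the pieces sketched earlier into the full query (steps 1–4):

import numpy as np

def query(roots, X, q, n, num_candidates):
    # steps 1-2: gather the union of candidate items across all trees
    cand = np.array(sorted(get_candidates(roots, q, num_candidates)))
    # step 3: compute exact distances only for the candidates
    dists = np.linalg.norm(X[cand] - q, axis=1)
    # step 4: return the n closest
    return cand[np.argsort(dists)[:n]]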
Find candidates
Take union of all leaves
Compute distances
Return nearest neighbors
“Curse of dimensionality”
Are we screwed?
• Would be nice if the data has a much smaller “intrinsic dimension”!
Improving the algorithm
[Plot: queries per second vs. 1-NN accuracy; up = faster, right = more accurate]
• https://github.com/erikbern/ann-benchmarks
ann-benchmarks
perf/accuracy tradeoffs
[Plot: queries per second vs. 1-NN accuracy, annotated with the effect of searching more nodes and of using more trees]
Things that work
• Smarter plane splitting
• Priority queue heuristics
• Search more nodes than number of results
• Align nodes closer together in memory
Things that don’t work
• Use lower-precision arithmetic
• Priority queue by other heuristics (number of trees)
• Precompute vector norms
Things for the future
• Use an optimization scheme for tree building
• Add more distance functions (e.g. edit distance)
• Use a proper KV store as a backend (e.g. LMDB) to support incremental adds, out-of-core operation, and arbitrary keys: https://github.com/Houzz/annoy2
Thanks!
• https://github.com/spotify/annoy
• https://github.com/erikbern/ann-benchmarks
• https://github.com/erikbern/ann-presentation
• erikbern.com
• @fulhack