openlsh - a framework for locality sensitive hashing
Going Beyond k-means
Developments in the ≈60 years since its publication
J Singh and Teresa Brooks
March 17, 2015
Hello Bulgaria
• A website with thousands of pages...
– Some pages identical to other pages
– Some pages nearly identical to other pages
• We want smart indexing of the collection
– Save just one copy of the duplicate pages
– Save one copy of the nearly duplicate pages
– Filter out similar documents when returning search results
• And we want to keep the index up to date
– Detect content changes quickly, possibly without reading old copies from slow storage
The Naïve Way to Address this Challenge
• Represent each document as a dot in d-dimensional space
• Run a k-means algorithm on the document set
– Resulting in k clusters
• When presented with a new document
– Find the “nearest cluster”
– Find the documents within the nearest cluster that are nearest to the document in question
• Can be skipped if the cluster is small enough
• i.e., k is large enough that everything in the cluster is close!
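A minimal sketch of this naive pipeline, assuming scikit-learn; the toy documents, the vectorizer, and the choice k = 2 are illustrative placeholders, not part of the original deck:

```python
# Naive approach: embed documents as vectors, cluster with k-means,
# then search only within the nearest cluster of a new document.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_distances

docs = [
    "pricing page for the blue widget",
    "pricing page for the blue widget, updated",
    "company history and founders",
]
new_doc = "pricing page for the blue widget, minor edit"

k = 2                                        # choosing k well is the hard part
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)           # each document becomes a dot in d-dimensional space
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

# New document: find the nearest cluster, then the nearest documents inside it.
q = vectorizer.transform([new_doc])
cluster = km.predict(q)[0]
members = [i for i, label in enumerate(km.labels_) if label == cluster]
dists = cosine_distances(q, X[members]).ravel()
print(sorted(zip(dists, members)))           # documents in the nearest cluster, closest first
```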
The Naïve Way has conceptual problems
• No good way to decide optimal k
• All documents have to be re-clustered if we want to change k
• A document may “belong” to multiple clusters
• All clusters are roughly the same size
– In practice, this terrain is lumpy – some documents are one-of-a-kind and others are similar to many others.
The Naïve Way has technical problems
• End result is subject to initial choice of centroids
– Leads to results not being repeatable
• Performance is O(nk), or worse!
– Especially unfortunate because we want k to be large
• Algorithm is not easily adapted to map/reduce
– We need a pipeline of map/reduce jobs to compute it
Any Evolutionary Alternatives?
• Clustering has been explored thoroughly, thanks to its combination of interesting math and wide applicability
• Two dominant types have emerged:
– Hierarchical clustering
– Partitional clustering (e.g., k-means)
• k-Means Variations based on
– Choice of Initial Centroids
– Choice of k
– Parameters at each iteration
Another line of inquiry: Nearest Neighbor
• Based on partitioning the search space
– Quad Trees
– kd-Trees
– Locality-Sensitive Hashing
• Hash functions are locality-sensitive if, for a random hash function h and any pair of points p, q:
– Pr[h(p)=h(q)] is “high” if p is “close” to q
– Pr[h(p)=h(q)] is “low” if p is “far” from q
More on Nearest Neighbor…
• Locality-Sensitive Hashing†
– Hash functions are locality-sensitive if, for a random hash function h and any pair of points p, q, we have:
• Pr[h(p)=h(q)] is “high” if p is “close” to q
• Pr[h(p)=h(q)] is “low” if p is “far” from q
†Indyk-Motwani’98
The LSH Idea
• Treat items as vectors in d-dimensional space.
• Draw k random hyper-planes in that space.
• For each hyper-plane:
– Is each vector on the (0) side of the hyperplane or the (1) side?
• Hash(Item1) = 000
• Hash(Item3) = 101
• Hashes each item into a number
• The magic is in choosing h1, h2, …
[Figure: items 1–7 split into 0/1 sides by random hyperplanes h1, h2, h3]
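A minimal sketch of the hyperplane hashing itself, assuming NumPy; the dimensionality, the three hyperplanes (h1, h2, h3), and the example items are illustrative:

```python
# Random-hyperplane hashing: each hyperplane is a random normal vector, and each
# bit of the hash records which side of that hyperplane an item falls on.
import numpy as np

rng = np.random.default_rng(42)
d, k = 100, 3                              # dimensionality, number of hyperplanes (h1, h2, h3)
hyperplanes = rng.standard_normal((k, d))

def lsh_hash(item):
    """Return a k-bit code such as '101', one bit per hyperplane."""
    bits = (hyperplanes @ item) >= 0
    return "".join("1" if b else "0" for b in bits)

item1 = rng.standard_normal(d)
item2 = item1 + 0.01 * rng.standard_normal(d)   # near-duplicate of item1
item3 = rng.standard_normal(d)                  # unrelated item
print(lsh_hash(item1), lsh_hash(item2), lsh_hash(item3))
# Near-duplicates usually get the same code; unrelated items usually do not.
```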
The LSH Hash Code Idea…
• …Breaks d-dimensional space into proximity-polyhedra.
• Each purple block represents a document
– Each Bucket represents a group of alike docs
• Docs within each bucket still need to be compared to see which ones are the “closest”
A Brief History of LSH
• Origins at Stanford (1998)
• Continuing research in universities
– Stanford, MIT, Rutgers, Cornell, …
• Continuing research in Industry
– Intel, Microsoft, Google, …
• Textbook:
– A. Rajaraman and J. Ullman, Mining of Massive Datasets (2010). (http://goo.gl/8AJDgI)
• Our contribution:
– An extensible implementation for large datasets
Choosing hash functions
• Introducing minhash
1. Sample each document to get its “shingles” – small fragments
• “Mary had a “ → “mary”, “ary “, “ry h”, “y ha”, “ had”, …
• “CTAGTATAAA” → “CTAGTATA”, “TAGTATAA”, “AGTATAAA”, …
• “now is the time” → “now is”, “is the”, “the time”
2. Calculate the hash value for every shingle.
3. Store the minimum hash value found in step 2.
4. Repeat steps 2 and 3 with different hash algorithms 199 more times to get a total of 200 minhash values.
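A minimal sketch of these four steps, assuming 4-character shingles, CRC32 as the base hash, and a simple (a·x + b) mod p family to generate the 200 hash functions; all of these are illustrative choices:

```python
# Minhash: shingle the document, hash every shingle with each of 200 hash
# functions, and keep the minimum value per function.
import random
import zlib

NUM_HASHES = 200
PRIME = (1 << 61) - 1                       # a large prime for the hash family
random.seed(0)
HASH_FUNCS = [(random.randrange(1, PRIME), random.randrange(0, PRIME))
              for _ in range(NUM_HASHES)]

def shingles(text, length=4):
    """Step 1: slide a window over the text to get its shingles."""
    text = text.lower()
    return {text[i:i + length] for i in range(len(text) - length + 1)}

def minhash_signature(text):
    """Steps 2-4: for each hash function, keep the minimum hash over all shingles."""
    shingle_hashes = [zlib.crc32(s.encode()) for s in shingles(text)]
    return [min((a * h + b) % PRIME for h in shingle_hashes)
            for (a, b) in HASH_FUNCS]

sig = minhash_signature("Mary had a little lamb")
print(len(sig), sig[:3])                    # 200 minhash values; show the first three
```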
Interesting thing about minhashes
• The resulting minhashes are 200 integer values representing a random selection of shingles.
– Property of minhashes: If the minhashes for two docs are the same, their shingles are likely to be the same
– If the shingles for two docs are the same, the docs themselves are likely to be the same
• Beware…
– Minhash is specific to a particular similarity measure – Jaccard similarity
– Other hash families exist for other similarity measures
All 200 minhashes must match?
• If all minhashes match, it implies a strong similarity between docs.
• To catch most cases with weaker similarity
– Don’t compare all minhashes at once; compare them in bands. Candidate pairs are those that hash to the same bucket for ≥ 1 band.
– Sometimes one band will reject a pair and another band will consider it a candidate.
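A sketch of the banding step, assuming the 200-value signatures from the minhash sketch above and an illustrative split into 20 bands of 10 rows:

```python
# Banding: split each 200-value signature into 20 bands of 10 rows, bucket each
# band, and call any two docs sharing a bucket in >= 1 band a candidate pair.
from collections import defaultdict
from itertools import combinations

BANDS, ROWS = 20, 10                         # BANDS * ROWS must equal the signature length

def candidate_pairs(signatures):
    """signatures: dict mapping doc_id -> list of 200 minhash values."""
    candidates = set()
    for band in range(BANDS):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            band_slice = tuple(sig[band * ROWS:(band + 1) * ROWS])
            buckets[band_slice].append(doc_id)   # the band's values serve as the bucket key
        for docs in buckets.values():
            candidates.update(combinations(sorted(docs), 2))
    return candidates
```

Only the candidate pairs need a full similarity check afterward; everything else is never compared.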
LSH Involves a Tradeoff
• Pick the number of minhashes, the number of bands, and the number of rows per band to balance false positives/negatives.
– False positives ⇒ need to examine more pairs that are not really similar. More processing resources, more time.
– False negatives ⇒ failed to examine pairs that were similar; didn’t find all similar results. But got done faster!
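The standard analysis of this tradeoff (covered in the Rajaraman–Ullman textbook cited earlier) says a pair with Jaccard similarity s becomes a candidate with probability 1 - (1 - s^r)^b for b bands of r rows each; a small sketch, with b = 20 and r = 10 as illustrative values:

```python
# Probability that a pair with Jaccard similarity s becomes a candidate
# when using b bands of r rows each: 1 - (1 - s**r)**b.
def candidate_probability(s, bands=20, rows=10):
    return 1 - (1 - s ** rows) ** bands

for s in (0.2, 0.4, 0.6, 0.8, 0.9):
    print(f"similarity {s:.1f} -> candidate probability {candidate_probability(s):.3f}")
# More bands (or fewer rows per band) catch weaker similarities but raise false positives;
# fewer bands do the opposite.
```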
Summary
• Mine the data and place members into hash buckets
• When you need to find a match, hash it and possible nearest neighbors will be in one of b buckets.
• Algorithm performance is O(n)
Peerbelt Results Example
[Figure: a sample of Peerbelt document IDs, with some entries flagged as near-duplicate matches]
Database Architecture Requirements
• Need a very large range of bucket numbers
– Bucket numbers in our implementation range from -2^31 to +2^31-1
• Most buckets are empty
– Empty buckets must not take any space in the database
– Some buckets have a lot of documents in them; we need to be able to locate all of them
• To find documents similar to a given document,
– Bucketize the document, then find other documents in the same buckets
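A minimal in-memory sketch of a bucket index meeting these requirements; a plain dictionary stands in for whatever key-value or wide-column store is actually used, so empty buckets never take any space:

```python
# Sparse bucket index: only non-empty buckets exist, keyed by (band, bucket number).
from collections import defaultdict

index = defaultdict(set)                      # (band, bucket_number) -> set of doc ids

def add_document(doc_id, band_buckets):
    """band_buckets: one bucket number per band, each in -2**31 .. 2**31 - 1."""
    for band, bucket in enumerate(band_buckets):
        index[(band, bucket)].add(doc_id)

def similar_candidates(band_buckets):
    """Bucketize a query document the same way, then collect every doc sharing a bucket."""
    candidates = set()
    for band, bucket in enumerate(band_buckets):
        candidates |= index.get((band, bucket), set())
    return candidates
```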
Implementation: OpenLSH
• We started OpenLSH to provide a framework for LSH
• Factor out the database
– Started on Google App Engine
– Virtualized interface to make it work on Cassandra
• Factor out the calculation engine
– Started on Google App Engine
– Can plug in Google MapReduce
– Ported to run in Batch mode on Cassandra
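As a hypothetical illustration of what "factoring out the database" means (these class and method names are made up for the sketch, not OpenLSH's actual API), the pipeline can be written against an abstract store that each backend implements:

```python
# Hypothetical illustration only: the LSH pipeline talks to an abstract store,
# and concrete backends (App Engine Datastore, Cassandra, an in-memory dict, ...)
# each implement the same small interface.
from abc import ABC, abstractmethod

class BucketStore(ABC):
    @abstractmethod
    def put(self, band, bucket, doc_id): ...
    @abstractmethod
    def get(self, band, bucket): ...

class InMemoryStore(BucketStore):
    def __init__(self):
        self._data = {}
    def put(self, band, bucket, doc_id):
        self._data.setdefault((band, bucket), set()).add(doc_id)
    def get(self, band, bucket):
        return self._data.get((band, bucket), set())
```

Swapping in a backend that implements the same two methods against Cassandra or the App Engine datastore leaves the minhashing and banding code untouched.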
Using OpenLSH
• We’re looking for one or two interesting use cases
– Application areas:
• Near de-duplication (covered with Peerbelt’s data)
• Stocks that move independently of the herd
• Filtering “unique stories” from the News
• Contact us to discuss
What you can do
• For more information: http://openlsh.datathinks.org/
– Links to code and data set are included
• Run on App Engine
– Minimum setup required
• Adapt it to your environment and need
• If you need help, send email or create a GitHub issue.
• Send us a pull request for any improvements you make.
Thank you
• J Singh
– Principal, DataThinks
• Algorithms for big data
• @datathinks, @singh_j
• j . singh @ datathinks . org
– Adj. Prof, Computer Science, WPI
• Teresa Brooks
– Senior Software Engineer @ Xero
• teresa.brooks@xero.com
• @VaderGirl13