openlsh - a framework for locality sensitive hashing
Going Beyond k-means
Developments in the ≈60 years since its publication
J Singh and Teresa Brooks
March 17, 2015
Hello Bulgaria
• A website with thousands of pages...
– Some pages identical to other pages
– Some pages nearly identical to other pages
• We want smart indexing of the collection
– Save just one copy of the duplicate pages
– Save one copy of the nearly duplicate pages
– Filter out similar documents when returning search results
• And we want to keep the index up to date
– Detect content changes quickly, possibly without reading old copies from slow storage
The Naïve Way to Address this Challenge
• Represent each document as a dot in d-dimensional space
• Run a k-means algorithm on the document set
– Resulting in k clusters
• When presented with a new document
– Find the “nearest cluster”
– Find the documents within the nearest cluster that are nearest to the document in question
• Can be skipped if the cluster is small enough
• i.e., k is large enough that everything in the cluster is close!
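A minimal sketch of this naive pipeline, assuming scikit-learn; the toy documents, the vectorizer, and the choice k = 2 are illustrative placeholders, not part of the original deck:

```python
# Naive approach: embed documents as vectors, cluster with k-means,
# then search only within the nearest cluster of a new document.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_distances

docs = [
    "pricing page for the blue widget",
    "pricing page for the blue widget, updated",
    "company history and founders",
]
new_doc = "pricing page for the blue widget, minor edit"

k = 2                                        # choosing k well is the hard part
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)           # each document becomes a dot in d-dimensional space
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

# New document: find the nearest cluster, then the nearest documents inside it.
q = vectorizer.transform([new_doc])
cluster = km.predict(q)[0]
members = [i for i, label in enumerate(km.labels_) if label == cluster]
dists = cosine_distances(q, X[members]).ravel()
print(sorted(zip(dists, members)))           # documents in the nearest cluster, closest first
```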
The Naïve Way has conceptual problems
• No good way to decide optimal k
• All documents have to be re-clustered if we want to change k
• A document may “belong” to multiple clusters
• All clusters are roughly the same size
– In practice, this terrain is lumpy – some documents are one-of-a-kind and others are similar to many others.
The Naïve Way has technical problems
• End result is subject to initial choice of centroids
– Leads to results not being repeatable
• Performance is O(nk), or worse!
– Especially unfortunate because we want k to be large
• Algorithm is not easily adapted to map/reduce
– We need a pipeline of map/reduce jobs to compute it
Any Evolutionary Alternatives?
• Clustering has been explored thoroughly, thanks to its combination of interesting math and wide applicability
• Two dominant types have emerged:
– Hierarchical clustering
– Partitional clustering (e.g., k-means)
• k-Means Variations based on
– Choice of Initial Centroids
– Choice of k
– Parameters at each iteration
Another line of inquiry: Nearest Neighbor
• Based on partitioning the search space
– Quad Trees
– kd-Trees
– Locality-Sensitive Hashing
• Hash functions are locality-sensitive if, for a random hash function h and any pair of points p, q:
– Pr[h(p)=h(q)] is “high” if p is “close” to q
– Pr[h(p)=h(q)] is “low” if p is “far” from q
More on Nearest Neighbor…
• Locality-Sensitive Hashing†
– Hash functions are locality-sensitive if, for a random hash function h and any pair of points p, q, we have:
• Pr[h(p)=h(q)] is “high” if p is “close” to q
• Pr[h(p)=h(q)] is “low” if p is “far” from q
†Indyk-Motwani’98
The LSH Idea
• Treat items as vectors in d-dimensional space.
• Draw k random hyper-planes in that space.
• For each hyper-plane:
– Is each vector on the (0) side of the hyperplane or the (1) side?
• Hash(Item1) = 000
• Hash(Item3) = 101
• Hashes each item into a number
• The magic is in choosing h1, h2, …
[Figure: items 1–7 split into 0/1 sides by random hyperplanes h1, h2, h3]
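A minimal sketch of the hyperplane hashing itself, assuming NumPy; the dimensionality, the three hyperplanes (h1, h2, h3), and the example items are illustrative:

```python
# Random-hyperplane hashing: each hyperplane is a random normal vector, and each
# bit of the hash records which side of that hyperplane an item falls on.
import numpy as np

rng = np.random.default_rng(42)
d, k = 100, 3                              # dimensionality, number of hyperplanes (h1, h2, h3)
hyperplanes = rng.standard_normal((k, d))

def lsh_hash(item):
    """Return a k-bit code such as '101', one bit per hyperplane."""
    bits = (hyperplanes @ item) >= 0
    return "".join("1" if b else "0" for b in bits)

item1 = rng.standard_normal(d)
item2 = item1 + 0.01 * rng.standard_normal(d)   # near-duplicate of item1
item3 = rng.standard_normal(d)                  # unrelated item
print(lsh_hash(item1), lsh_hash(item2), lsh_hash(item3))
# Near-duplicates usually get the same code; unrelated items usually do not.
```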
The LSH Hash Code Idea…
• …Breaks d-dimensional space into proximity-polyhedra.
• Each purple block represents a document
– Each Bucket represents a group of alike docs
• Docs within each bucket still need to be compared to see which ones are the “closest”
A Brief History of LSH
• Origins at Stanford (1998)
• Continuing research in universities
– Stanford, MIT, Rutgers, Cornell, …
• Continuing research in Industry
– Intel, Microsoft, Google, …
• Textbook:
– A. Rajaraman and J. Ullman, Mining of Massive Datasets (2010). (http://goo.gl/8AJDgI)
• Our contribution:
– An extensible implementation for large datasets
Choosing hash functions
• Introducing minhash
1. Sample each document to get its “shingles” – small fragments
• “Mary had a “ → “mary”, “ary “, “ry h”, “y ha”, “ had”, …
• “CTAGTATAAA” → “CTAGTATA”, “TAGTATAA”, “AGTATAAA”, …
• “now is the time” → “now is”, “is the”, “the time”
2. Calculate the hash value for every shingle.
3. Store the minimum hash value found in step 2.
4. Repeat steps 2 and 3 with different hash algorithms 199 more times to get a total of 200 minhash values.
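A minimal sketch of these four steps, assuming 4-character shingles, CRC32 as the base hash, and a simple (a·x + b) mod p family to generate the 200 hash functions; all of these are illustrative choices:

```python
# Minhash: shingle the document, hash every shingle with each of 200 hash
# functions, and keep the minimum value per function.
import random
import zlib

NUM_HASHES = 200
PRIME = (1 << 61) - 1                       # a large prime for the hash family
random.seed(0)
HASH_FUNCS = [(random.randrange(1, PRIME), random.randrange(0, PRIME))
              for _ in range(NUM_HASHES)]

def shingles(text, length=4):
    """Step 1: slide a window over the text to get its shingles."""
    text = text.lower()
    return {text[i:i + length] for i in range(len(text) - length + 1)}

def minhash_signature(text):
    """Steps 2-4: for each hash function, keep the minimum hash over all shingles."""
    shingle_hashes = [zlib.crc32(s.encode()) for s in shingles(text)]
    return [min((a * h + b) % PRIME for h in shingle_hashes)
            for (a, b) in HASH_FUNCS]

sig = minhash_signature("Mary had a little lamb")
print(len(sig), sig[:3])                    # 200 minhash values; show the first three
```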
Interesting thing about minhashes
• The resulting minhashes are 200 integer values representing a random selection of shingles.
– Property of minhashes: If the minhashes for two docs are the same, their shingles are likely to be the same
– If the shingles for two docs are the same, the docs themselves are likely to be the same
• Beware…
– Minhash is specific to a particular similarity measure – Jaccard similarity
– Other hash families exist for other similarity measures
All 200 minhashes must match?
• If all minhashes match, it implies a strong similarity between docs.
• To catch most cases with weaker similarity
– Don’t compare all minhashes at once; compare them in bands. Candidate pairs are those that hash to the same bucket for ≥ 1 band.
– Sometimes one band will reject a pair and another band will consider it a candidate.
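A sketch of the banding step, assuming the 200-value signatures from the minhash sketch above and an illustrative split into 20 bands of 10 rows:

```python
# Banding: split each 200-value signature into 20 bands of 10 rows, bucket each
# band, and call any two docs sharing a bucket in >= 1 band a candidate pair.
from collections import defaultdict
from itertools import combinations

BANDS, ROWS = 20, 10                         # BANDS * ROWS must equal the signature length

def candidate_pairs(signatures):
    """signatures: dict mapping doc_id -> list of 200 minhash values."""
    candidates = set()
    for band in range(BANDS):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            band_slice = tuple(sig[band * ROWS:(band + 1) * ROWS])
            buckets[band_slice].append(doc_id)   # the band's values serve as the bucket key
        for docs in buckets.values():
            candidates.update(combinations(sorted(docs), 2))
    return candidates
```

Only the candidate pairs need a full similarity check afterward; everything else is never compared.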
LSH Involves a Tradeoff
• Pick the number of minhashes, the number of bands, and the number of rows per band to balance false positives/negatives.
– False positives ⇒ need to examine more pairs that are not really similar. More processing resources, more time.
– False negatives ⇒ failed to examine pairs that were similar; didn’t find all similar results. But got done faster!
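The standard analysis of this tradeoff (covered in the Rajaraman–Ullman textbook cited earlier) says a pair with Jaccard similarity s becomes a candidate with probability 1 - (1 - s^r)^b for b bands of r rows each; a small sketch, with b = 20 and r = 10 as illustrative values:

```python
# Probability that a pair with Jaccard similarity s becomes a candidate
# when using b bands of r rows each: 1 - (1 - s**r)**b.
def candidate_probability(s, bands=20, rows=10):
    return 1 - (1 - s ** rows) ** bands

for s in (0.2, 0.4, 0.6, 0.8, 0.9):
    print(f"similarity {s:.1f} -> candidate probability {candidate_probability(s):.3f}")
# More bands (or fewer rows per band) catch weaker similarities but raise false positives;
# fewer bands do the opposite.
```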
Summary
• Mine the data and place members into hash buckets
• When you need to find a match, hash it and possible nearest neighbors will be in one of b buckets.
• Algorithm performance is O(n)
Peerbelt Results Example
[Figure: a sample of Peerbelt document IDs, with some entries flagged as near-duplicate matches]
Database Architecture Requirements
• Need a very large range of bucket numbers
– Bucket numbers in our implementation range from -2^31 to +2^31-1
• Most buckets are empty
– Empty buckets must not take any space in the database
– Some buckets have a lot of documents in them; we need to be able to locate all of them
• To find documents similar to a given document,
– Bucketize the document, then find other documents in the same buckets
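A minimal in-memory sketch of a bucket index meeting these requirements; a plain dictionary stands in for whatever key-value or wide-column store is actually used, so empty buckets never take any space:

```python
# Sparse bucket index: only non-empty buckets exist, keyed by (band, bucket number).
from collections import defaultdict

index = defaultdict(set)                      # (band, bucket_number) -> set of doc ids

def add_document(doc_id, band_buckets):
    """band_buckets: one bucket number per band, each in -2**31 .. 2**31 - 1."""
    for band, bucket in enumerate(band_buckets):
        index[(band, bucket)].add(doc_id)

def similar_candidates(band_buckets):
    """Bucketize a query document the same way, then collect every doc sharing a bucket."""
    candidates = set()
    for band, bucket in enumerate(band_buckets):
        candidates |= index.get((band, bucket), set())
    return candidates
```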
Implementation: OpenLSH
• We started OpenLSH to provide a framework for LSH
• Factor out the database
– Started on Google App Engine
– Virtualized interface to make it work on Cassandra
• Factor out the calculation engine
– Started on Google App Engine
– Can plug in Google MapReduce
– Ported to run in Batch mode on Cassandra
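As a hypothetical illustration of what "factoring out the database" means (these class and method names are made up for the sketch, not OpenLSH's actual API), the pipeline can be written against an abstract store that each backend implements:

```python
# Hypothetical illustration only: the LSH pipeline talks to an abstract store,
# and concrete backends (App Engine Datastore, Cassandra, an in-memory dict, ...)
# each implement the same small interface.
from abc import ABC, abstractmethod

class BucketStore(ABC):
    @abstractmethod
    def put(self, band, bucket, doc_id): ...
    @abstractmethod
    def get(self, band, bucket): ...

class InMemoryStore(BucketStore):
    def __init__(self):
        self._data = {}
    def put(self, band, bucket, doc_id):
        self._data.setdefault((band, bucket), set()).add(doc_id)
    def get(self, band, bucket):
        return self._data.get((band, bucket), set())
```

Swapping in a backend that implements the same two methods against Cassandra or the App Engine datastore leaves the minhashing and banding code untouched.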
Using OpenLSH
• We’re looking for one or two interesting use cases
– Application areas:
• Near de-duplication (covered with Peerbelt’s data)
• Stocks that move independently of the herd
• Filtering “unique stories” from the News
• Contact us to discuss
What you can do
• For more information: http://openlsh.datathinks.org/
– Links to code and data set are included
• Run on App Engine
– Minimum setup required
• Adapt it to your environment and need
• If you need help, send email or create a GitHub issue.
• Send us a pull request for any improvements you make.
Thank you
• J Singh
– Principal, DataThinks
• Algorithms for big data
• @datathinks, @singh_j
• j . singh @ datathinks . org
– Adj. Prof, Computer Science, WPI
• Teresa Brooks
– Senior Software Engineer @ Xero
• teresa.brooks@xero.com
• @VaderGirl13