



Multidimensional Spectral Hashing

Anonymous CVPR submission

Paper ID 1312

Abstract

With the growing availability of very large image databases, there has been a surge of interest in methods based on “semantic hashing”, i.e. compact binary codes of datapoints such that the Hamming distance between codewords correlates with similarity. We define the problem of finding binary codes that reconstruct the affinity between datapoints, rather than their distances. We show that this criterion is still intractable to solve exactly, but a spectral relaxation gives an algorithm where the bits correspond to thresholded eigenvectors of the affinity matrix, and as the number of datapoints goes to infinity these eigenvectors converge to eigenfunctions of Laplace-Beltrami operators, similar to the recently proposed Spectral Hashing (SH) method. Unlike SH, which uses only single-dimension eigenfunctions, our formulation requires multidimensional eigenfunctions, but we show how to use a “kernel trick” that allows us to compute with an exponentially large number of bits using only memory and computation that grow linearly with dimension. Experiments show that MDSH outperforms the state of the art, especially in the challenging regime of small distances.

1. Introduction

Perhaps the most salient aspect of computer vision and machine learning in recent years is the widespread availability of gigantic datasets with millions or trillions of items. This effect, sometimes called the “data deluge”, creates a tremendous opportunity for machine learning applied to vision, but it also requires new algorithms that can efficiently work with such large datasets, where even a single, linear scan through the data may be prohibitively expensive.

In theoretical CS, sublinear-time search methods are achieved through techniques called Locality Sensitive Hashing (LSH) [5]. In the simplest version of LSH, called E2-LSH [1], the ith bit in the code of an item x is simply given by sign(w_i^T x − b_i), where w_i is a randomly chosen vector and b_i a randomly chosen threshold. As the number of bits goes to infinity, it can be shown that the Hamming distance in an E2-LSH code will be a monotonic function of the original Euclidean distance. In [12] it was shown how to construct LSH codes that provably approximate a kernel function of the distance, rather than the raw distance.
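For concreteness, here is a minimal NumPy sketch of this random-projection scheme (our own helper names; the scale of the thresholds is an illustrative assumption, not a prescription from [1]):

```python
import numpy as np

def e2lsh_code(X, n_bits, threshold_scale=1.0, seed=0):
    """E2-LSH-style codes: bit i of x is sign(w_i^T x - b_i)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(size=(d, n_bits))                            # random directions w_i
    b = rng.uniform(-threshold_scale, threshold_scale, n_bits)  # random thresholds b_i
    return np.sign(X @ W - b)                                   # (n, n_bits) array in {-1, +1}

# Example: 1000 points in 32 dimensions, 32-bit codes.
codes = e2lsh_code(np.random.randn(1000, 32), n_bits=32)
```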

Despite the theoretical guarantees of such randomized approaches in the limit of many bits, their performance with a limited bit budget is often unsatisfactory, and this has led to interest in approaches that learn the hash codes from data. Semantic Hashing, proposed by Salakhutdinov and Hinton [13], trains a multi-layer neural network to learn the binary code. Codes trained in this fashion can outperform random codes in both text and image applications [16].

Due to the complexity of training the multilayer neural net, in recent years there has been much interest in other training algorithms for short binary codes. Weiss et al. [18] formulated the problem as an optimization problem and searched for bits that are independent and also minimize the expected Hamming distance between similar datapoints. They showed that this criterion is intractable, but a spectral relaxation leads to codes where each bit corresponds to a thresholded eigenfunction of the Laplace-Beltrami operator, and suggested approximating the distribution of the data with simple parametric families. Kumar et al. [17] proposed an alternative criterion where one seeks to maximize the variance of the bits, and showed that this criterion leads to a code where each bit is the sign of a PCA component. Kulis and Darrell [7] suggested optimizing the L2 loss between the Hamming distances and the original Euclidean distances, and showed that optimizing one bit given all the others can be done in closed form. In [10] a hinge loss rather than an L2 loss is used, and it is optimized using techniques from structured prediction. Lin et al. [11] suggested an algorithm that incrementally adds bits to increase the Hamming distance between dissimilar objects while keeping the Hamming distance between similar objects small. Gong and Lazebnik [6] also suggested looking at the difference between Hamming distances and Euclidean distances, and proposed an algorithm that finds a rotation of the PCA vectors so that, after rotation, the thresholding operation causes fewer quantization errors. They also showed that the sign of a random rotation of the PCA coefficients often performs better than the sign


[Figure 1: two precision-recall plots (axes: Recall vs. Precision) comparing LSH, SH, ITQ and BRE, evaluated with ground-truth thresholds T = δ/4 (left) and T = δ (right), alongside 2D visualizations of the ground-truth affinity and of the Hamming similarities produced by each code.]

Figure 1. The puzzle of previous work. The exact same algorithms with the exact same data (a 32-dimensional Gaussian) give a very different ranking of performance depending on our definition of ground-truth neighbors. When the neighborhood for the ground truth is small (left), SH outperforms all other algorithms; when it is large (right), it performs the worst. The rightmost images show the Hamming similarities of the different codes, projected down to 2D.

of the original PCA coefficients.

Given that the recent approaches to learning binary codes use different optimization criteria, it is not surprising that they perform differently on different datasets. What is surprising, however, is that when the performance of the algorithms is evaluated using standard precision-recall curves, the ranking of the algorithms can change dramatically even on the same dataset. In order to use precision-recall curves, one first defines a threshold on the original Euclidean distance; two datapoints are defined as similar if their Euclidean distance is less than the threshold. Given a query and a threshold on Hamming distance, the retrieved items for the query are all datapoints whose Hamming distance is below the threshold. Precision measures the proportion of retrieved points that are indeed similar, while recall measures the proportion of similar retrieved points out of all similar points.
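To make this evaluation protocol concrete, here is a minimal sketch of how one precision-recall point could be computed for a single query (our own helper, not the evaluation code used in the paper):

```python
import numpy as np

def precision_recall_point(hamming_dist, euclid_dist, ham_thresh, dist_thresh):
    """One precision/recall point for a single query.

    hamming_dist : (n,) Hamming distances from the query code to all database codes.
    euclid_dist  : (n,) Euclidean distances from the query to all database points.
    """
    retrieved = hamming_dist <= ham_thresh     # what the code returns
    similar   = euclid_dist <= dist_thresh     # ground-truth neighbors
    hits = np.sum(retrieved & similar)
    precision = hits / max(np.sum(retrieved), 1)
    recall    = hits / max(np.sum(similar), 1)
    return precision, recall
```

Sweeping the Hamming threshold over all possible values traces out the (discrete) precision-recall curve for that query.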

To illustrate the different rankings of the algorithms, consider Figure 1, where we show precision-recall curves¹ for Spectral Hashing [18] (SH), LSH [1], Gong and Lazebnik's ITQ [6] and Kulis and Darrell's BRE [7]. The data are sampled from a 32-dimensional Gaussian, where the standard deviation of the dth dimension is given by (1/d)², and all codes are 32 bits. In both figures, the training data, test data and algorithm results are identical. The only difference is the threshold that defines a similar neighbor. In the right column this threshold T is set to δ, the average distance between any two points in the training set, while in the left column it is set to δ/4. It can be seen that this threshold makes a huge difference in the ranking of the algorithms. In particular, for the smaller neighborhood SH greatly outperforms the other techniques, while for the larger neighborhood it is the worst algorithm.

¹ Note that since the Hamming distances are discrete, the precision-recall curves are also discrete. We connect the points with lines for ease of visualization, but intermediate points on the lines are not achievable by the algorithms.

We can understand this difference in behavior by visualizing the Hamming distances from a query point to all other points in the dataset. Although the data is 32-dimensional, we visualize it in two dimensions using the first two PCA components (high values indicate a high dot product between the binary code of the query point and that of a test point). It can be seen that Spectral Hashing does a very bad job of approximating the distance to far-away points and collapses all of them to a similar distance, while the other methods do much better at capturing the far distances but not the small distances.

Which of the algorithms shown in Figure 1 is therefore better? Obviously the answer is application dependent, as it depends on the scale of Euclidean distances that we want our algorithm to recover. Ideally, we would want an algorithm that learns different codes depending on the desired scale of distances. In this paper we present such an algorithm. Our algorithm seeks a binary code such that the inner product between codes approximates the affinity between datapoints, i.e. W(i,j) = exp(−‖x_i − x_j‖²/2σ²). Since the affinity is a nonlinear function of distance, approximating the affinity uniformly leads to approximating the distances with nonuniform quality: the approximation is good for distances around the value σ and worse for distances significantly larger than σ. We show that approximating the affinity matrix can be formalized as a binary matrix factorization problem, which is NP-hard, but a simple relaxation yields a problem whose solution is simply given by the eigenvectors of the affinity matrix. A simple way to learn a binary code is therefore to threshold these eigenvectors, but we discuss two problems with this simple approach: (1) out-of-sample extension (i.e. calculating the code for a new datapoint) and (2) the curse of dimensionality, whereby the number of eigenvectors needed to achieve good performance grows exponentially with dimension. By making a separability assumption on the data distribution and taking the limit of the eigenvectors as the number of datapoints goes to infinity, we can avoid these two problems. In particular, we introduce a “kernel trick” that allows us to compute the factorization


with an exponentially large number of eigenvectors, using only time and memory that grow linearly with dimension. We compare our algorithm to the state of the art on real and synthetic datasets.

2. Learning Binary Codes using Matrix Factorization

Typically, the cost function for learning binary codes is given in terms of the Hamming distance ‖y_i − y_j‖ between the binary codes of datapoints i and j, where y_i is a binary vector of length k. But since the elements of y_i are constrained to be plus or minus 1, the Hamming distance is a simple function of the dot product, or Hamming affinity, y_i^T y_j, since ‖y_i − y_j‖² = 2k − 2 y_i^T y_j. Thus instead of requiring that Hamming distances match some target value, we can equivalently require that the Hamming affinity match some target value. A natural choice for this target value is the affinity W(i,j) = exp(−‖x_i − x_j‖²/2σ²), where, as noted above, the scale parameter σ is an explicit input set by the user. Another small modification we make to the standard formulation of learning binary codes is that we allow a weight λ_k for each bit. The code for each datapoint is still binary valued (requiring very little memory and computation per codeword), but the weights can be real-valued. Using these weights, the Hamming affinity is a linear combination of the agreements between the codes: y_i^T Λ y_j = Σ_k λ_k y_i(k) y_j(k), where Λ is a diagonal matrix whose diagonal values give the weights. We seek codes such that this weighted Hamming affinity equals the original affinity, y_i^T Λ y_j ≈ W(i,j):

    (Y*, Λ*) = arg min_{y_i ∈ {−1,1}^k, Λ}  Σ_{i,j} (y_i^T Λ y_j − W(i,j))²     (1)

Let Y be an n × k matrix whose ith row gives the binary code of the ith datapoint; then the cost function can be rewritten as minimizing the Frobenius norm between the affinity matrix W and the rank-k matrix Y Λ Y^T:

    (Y*, Λ*) = arg min_{y_i ∈ {−1,1}^k, Λ}  ‖W − Y Λ Y^T‖²_F     (2)

Under this formalization, finding the best binary code for a given dataset is equivalent to performing binary matrix factorization of the affinity matrix. Unfortunately, as in many constrained matrix factorization problems [15], the discrete constraint that the elements of the matrix lie in {−1, 1} makes this problem computationally intractable.

We notice that removing the binary constraint yields a standard matrix factorization problem for which the optimal solution is easily obtained by taking the top k eigenvectors of W. Denote by U an n × k matrix whose columns are the top k eigenvectors and by Λ a diagonal matrix with the k corresponding eigenvalues of W; then UΛU^T is the best rank-k approximation of W. If we are lucky, and the eigenvectors are already binary valued, then we have solved the binary matrix factorization problem. As we now discuss, even if the eigenvectors are not binary valued, they (1) give an upper bound on the performance of any code and (2) can be thresholded to give an approximate solution to the binary matrix factorization problem.

The preceding discussion suggests an extremely simple method for learning binary codes that focuses on a specific distance given by σ: use σ to define an affinity matrix and threshold the eigenvectors of that affinity matrix. But this approach will not give a practical code. In the case of the state-of-the-art algorithms (SH, LSH, ITQ) there is a simple function that allows us to calculate the code of a new datapoint x (in LSH and ITQ we simply perform k dot products of x with known vectors and threshold the results). But to calculate the binary code using eigenvectors of the affinity matrix, we would need to compute the affinity between x and all other datapoints, form the (n+1) × (n+1) affinity matrix and calculate its eigenvectors. Alternatively, we can use methods such as the Nystrom approximation [8] to approximate the eigenvectors of the affinity matrix from a smaller matrix, but this approach still requires prohibitive computation for large datasets. We now show how to calculate the limit of the eigenvectors as the number of points approaches infinity, and how this leads to a practical, yet inefficient, algorithm.
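As a sanity check, the relaxed problem and the naive thresholding described above fit in a few lines of NumPy (a sketch of the in-sample construction only, with our own helper names; this is not the practical MDSH algorithm developed below):

```python
import numpy as np

def affinity_matrix(X, sigma):
    """W(i, j) = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def thresholded_spectral_code(X, sigma, k):
    """Relaxation of eq. (2): take the top-k eigenvectors of W, then threshold."""
    W = affinity_matrix(X, sigma)
    vals, vecs = np.linalg.eigh(W)        # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:k]      # indices of the k largest eigenvalues
    Lambda = vals[top]                    # per-bit weights (diagonal of Lambda)
    Y = np.sign(vecs[:, top])             # thresholded eigenvectors -> binary codes
    return Y, Lambda
```

This already illustrates the two problems discussed above: the code of a new point is not defined without recomputing the eigenvectors, and the construction requires forming and decomposing the full n × n affinity matrix.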

2.1. The limit of matrix factorization

To analyze the limit of the eigenvectors, we need to take a slightly different viewpoint (cf. [3, 18, 4, 2, 9]): rather than thinking of the n-dimensional space defined by the n points, we consider the d-dimensional space R^d in which these points lie. We define an operator W(s,t) = exp(−‖s − t‖²/2σ²), which simply measures the affinity between any two points s and t in R^d (unlike W(i,j), which only measures the affinity between points in the dataset). It is easy to see that when the points in the dataset are IID samples from a distribution Pr(x), the matrix factorization problem approaches a different problem, which we will call the operator factorization problem.

Claim 1: Let x_i be IID samples from Pr(x), and W(i,j) the affinity between x_i and x_j. For any function y(x), define y_i = y(x_i) as the code corresponding to x_i. Then, as the number of samples N goes to infinity, (1/N²) ‖W − Y Λ Y^T‖²_F approaches:

    J(y, Λ) = ∫_{s,t} (W(s,t) − y^T(s) Λ y(t))² Pr(s) Pr(t) ds dt     (3)

The proof follows directly from the law of large numbers.

In the operator factorization problem, we seek a code y(s) so that for any two points s and t the affinity


between the codes, y^T(t) Λ y(s), equals the original affinity W(s,t). The probability Pr(x) defines a weighting on the points, giving more weight to pairs of points (s,t) with higher density.

Just as the solution to the matrix factorization problem is given by the top k eigenvectors of W, the solution to the operator factorization problem can be shown to be given by k eigenfunctions of the operator WP. We define the operator (Wf)(s) = ∫_t W(s,t) f(t) dt and the operator (Pf)(s) = Pr(s) f(s). A function f is an eigenfunction of WP if WPf = λf.

Claim 2: The solution to the operator factorization problem, (y*, Λ*) = arg min J(y, Λ), with J as in equation 3 and y(s) constrained to be a k-dimensional vector, is given by the k eigenfunctions of WP with maximal eigenvalue.

Proof: For the case that Pr(x) is uniform on a region in R^d, the result follows from the standard matrix factorization result. When Pr(x) is non-uniform, a change of variables z = P^{1/2} y converts the problem into one with uniform Pr(x) and with the operator replaced by √P W √P. This means that the optimal z is given by the top k eigenfunctions of √P W √P, or alternatively WPy = Λy.

Taken together, Claims 1 and 2 suggest a solution to the binary matrix factorization problem: we simply solve for the top k eigenfunctions of the operator WP, i.e. we solve for both the eigenfunctions Φ_i(s) and their eigenvalues λ_i. Given these solutions, we can compute the code for each point x simply by thresholding at zero: y(x) = {sign(Φ_i(x))}_{i=1..k}. Note that this gives an analytical formula for the code of a new datapoint, so the algorithm is practical. Also, when the eigenfunctions happen to be binary valued, this code is optimal in the limit of many datapoints: no binary code can do a better job of approximating the affinity. However, this code may still be very inefficient due to the curse of dimensionality.

2.2. Curse of Dimensionality

How many bits will we need in order to faithfully approximate the original affinity? We can answer this question by analyzing the eigenspectrum of WP. If only a small number of eigenvalues dominate the spectrum, we can expect a good approximation with a small code. But if the number of large eigenvalues is very large, we cannot expect any small code to give a good approximation.

To illustrate this, consider the multidimensional uniform distribution that was analyzed in [18]. For this case the eigenfunctions of WP are simply the eigenfunctions of W (since P is uniform) and correspond to the fundamental modes of vibration of a metallic plate.² The eigenfunctions are all sinusoidal and can be divided into “single-dimension” eigenfunctions and “outer-product” eigenfunctions. All single-dimension eigenfunctions are of the form:

    Φ_j(x(i)) = sin(π/2 + (jπ/(b_i − a_i)) x(i))     (4)

    λ_j = exp(−(σ²/2) |jπ/(b_i − a_i)|²)     (5)

where x(i) refers to the ith coordinate of x and x(i) is uniformly distributed in [a_i, b_i].

² In [18] these eigenfunctions arose from relaxing the balanced-cut problem, while here the same eigenfunctions arise from binary matrix factorization. For a uniform distribution, both problems give the same eigenfunctions, with the eigenvalue λ replaced by 1 − λ.
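Under the uniform assumption, equations 4 and 5 are cheap to evaluate; a small sketch with our own helper names (the index j starting at 1 is an illustrative convention):

```python
import numpy as np

def single_dim_eigenfunction(x_i, j, a_i, b_i):
    """Phi_j evaluated at coordinate x(i), uniform on [a_i, b_i] (eq. 4)."""
    return np.sin(np.pi / 2.0 + j * np.pi / (b_i - a_i) * x_i)

def single_dim_eigenvalue(j, a_i, b_i, sigma):
    """Eigenvalue lambda_j of that eigenfunction (eq. 5)."""
    return np.exp(-(sigma ** 2) / 2.0 * (j * np.pi / (b_i - a_i)) ** 2)
```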

The “outer-product” eigenfunctions are simply products of eigenfunctions along different dimensions, and their eigenvalue is simply the product of the eigenvalues along each dimension. Figure 2 shows the eigenfunctions for a uniform rectangular distribution in R² (replotted from [18]), with the “outer-product” eigenfunctions shown with a red frame.

In the ordinary spectral hashing (SH) algorithm, these “outer-product” eigenfunctions were discarded, since the goal of SH is to search for codes whose bits are independent, and the outer-product eigenfunctions lead to bits that are deterministic functions of the other bits. But if our goal is to approximate the affinity matrix, then discarding these eigenfunctions makes no sense; in fact, the optimal factorization will often contain a very large number of these outer-product eigenfunctions.

In particular, when the uniform distribution is not over two dimensions, as in Figure 2, but rather over d dimensions, the number of outer-product eigenfunctions with large eigenvalue grows exponentially with the dimension d. For example, if σ << |b_i − a_i| for all i, then each dimension will have at least two eigenfunctions with nearly maximal eigenvalue λ ≈ 1. In this case, the number of bits needed to faithfully approximate the affinity grows exponentially with dimension. Since the eigenfunctions give the optimal low-rank factorization of the affinity, this means that any code (binary or continuous) that does a good job of approximating the affinity must have a length that grows exponentially with the dimension.

Although we have focused above on the uniform distribution, it can be shown that these “outer-product” eigenfunctions exist whenever the distribution Pr(x) is a product of distributions along single dimensions. The bad news is that this means we will require an exponentially large number of bits to faithfully approximate the affinity. The good news, however, is that we do not need to store the bits corresponding to outer-product eigenfunctions, since we can calculate their values directly from the bits of the single-dimension eigenfunctions. This insight leads to the following algorithm, which we call inefficient multidimensional spectral hashing:

1. Calculate the single-dimension eigenfunctions. We denote by Φ_ij(x(i)) the jth eigenfunction of the ith coordinate and by λ_ij the corresponding eigenvalue.

2. Sort the λ_ij and find a set of k index pairs (I, J) for which λ_ij is maximal.

3. Encode each datapoint x with y(x) = sign(Φ_ij(x(i))) for all (i, j) ∈ (I, J).

4. Expand the code of x to include all outer-product eigenfunctions. For any subset (I_l, J_l) ⊂ (I, J) such that no index i appears in I_l more than once, add an additional bit given by y_l(x) = ∏_{(i,j)∈(I_l,J_l)} sign(Φ_ij(x(i))). The weight of this bit is λ_l = ∏_{(i,j)∈(I_l,J_l)} λ_ij.

5. The Hamming affinity between x_i and x_j is given by

    H(i, j) = Σ_l y_l(x_i) λ_l y_l(x_j)     (6)

where the index l goes over the single-dimension bits as well as the outer-product bits.

Figure 2. Left: eigenfunctions for a uniform rectangular distribution in 2D. Right: binary codes obtained by thresholding these eigenfunctions. The eigenfunctions can be divided into single-dimension eigenfunctions and “outer-product” eigenfunctions, indicated by a red frame. In the standard SH algorithm, bits from the outer-product eigenfunctions are ignored, while here we show how to efficiently compute Hamming affinities with an exponential number of such bits using a “kernel trick”.

We call this algorithm “multidimensional spectral hashing” (MDSH) because the first three steps (calculating the single-dimension eigenfunctions, sorting the eigenvalues, and encoding using the sign of the eigenfunctions) are exactly the same as in the original SH algorithm. The difference is that whereas the original SH used only single-dimension eigenfunctions during retrieval, here we expand the code to include the outer-product eigenfunctions as well. From the standpoint of approximating the affinity, this makes a huge difference. Suppose that instead of thresholding the eigenfunctions (in the third step of the algorithm) we encoded each point with the values of the top k single-dimension eigenfunctions. It is easy to show that in this case the approximate affinity calculated by the algorithm, H(i,j), will be better than any other factorization that uses continuous codes of length k2, where k2 is the number of eigenfunctions (single-dimension and outer-product) with eigenvalues greater than λ_k. In general k2 will be much larger than k (it grows exponentially with dimension), but the MDSH algorithm guarantees a factorization that is optimal at rank k2 even though we only encode each point with a code of length k.
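To make steps 4 and 5 of the inefficient algorithm concrete, here is a brute-force sketch that enumerates the outer-product bits explicitly (our own data layout: the selected eigenfunction signs and eigenvalues are passed as dictionaries keyed by (i, j)); its cost is exponential in the number of dimensions, which is exactly what the kernel trick below avoids:

```python
import itertools
import numpy as np

def inefficient_mdsh_affinity(signs_a, signs_b, lam):
    """Weighted Hamming affinity of eq. (6), including all outer-product bits.

    signs_a, signs_b : dicts (i, j) -> sign(Phi_ij(x(i))) in {-1, +1} for the two points.
    lam              : dict (i, j) -> eigenvalue lambda_ij of the selected eigenfunctions.
    """
    keys = list(lam.keys())
    H = 0.0
    # Subsets of size 1 are the single-dimension bits; larger subsets in which each
    # dimension i appears at most once are the outer-product bits of step 4.
    for r in range(1, len(keys) + 1):
        for subset in itertools.combinations(keys, r):
            dims = [i for (i, j) in subset]
            if len(dims) != len(set(dims)):
                continue
            weight = np.prod([lam[k] for k in subset])
            bit_a = np.prod([signs_a[k] for k in subset])
            bit_b = np.prod([signs_b[k] for k in subset])
            H += weight * bit_a * bit_b
    return H
```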

As an example, consider a uniform distribution on [0, 1]^10 and assume σ = 0.1. Along each dimension the eigenvalues are the same, and the top four eigenvalues are (0.95, 0.82, 0.64, 0.45) (equation 5). If we choose four eigenfunctions per dimension, i.e. k = 40, then the preceding argument shows that we obtain a better factorization than any method that uses k2 eigenvectors, where k2 is the number of eigenvectors whose eigenvalue is greater than 0.45. But since 0.95^10 > 0.45, there are at least 2^10 = 1024 such eigenvectors, so k2 > 1024. Thus even though we only store 40 numbers per datapoint, we are guaranteed to do better than all factorizations of rank 1024.
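These numbers follow directly from equation 5 with σ = 0.1 and [a_i, b_i] = [0, 1]; a quick check in plain NumPy (our own verification, not from the paper):

```python
import numpy as np

sigma = 0.1
lam = np.exp(-(sigma ** 2) / 2.0 * (np.arange(1, 5) * np.pi) ** 2)
print(np.round(lam, 2))        # [0.95 0.82 0.64 0.45], as quoted in the text

# Taking the top eigenfunction on any subset of the 10 dimensions gives an
# outer-product eigenvalue of at least 0.95**10, which still exceeds 0.45,
# so all 2**10 = 1024 such eigenvectors clear the threshold:
print(lam[0] ** 10 > lam[3])   # True
```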

As its name suggests, the inefficient MDSH algorithm requires computation that scales exponentially with dimension (equation 6). However, we now show how to compute the same Hamming affinity with computation that is linear in dimension.

Observation (Kernel Trick): Let H(i,j) be the weighted Hamming affinity between two points, using all the bits of the inefficient MDSH algorithm (equation 6). Let H_d(i,j) be the weighted Hamming affinity between these two points using only the bits from eigenfunctions of dimension d, i.e. H_d(i,j) = Σ_{(d,l)∈(I,J)} λ_dl sign(Φ_dl(x_i(d))) sign(Φ_dl(x_j(d))). Then:

    H(i, j) = −1 + ∏_d (1 + H_d(i, j))     (7)

Proof: This follows directly by expanding the product in equation 7.

Note that this computation gives exactly the same Hamming affinity as the inefficient algorithm, but both storage and computation grow linearly with dimension. When the input distribution Pr(x) is separable and the eigenfunctions happen to be binary valued, this algorithm gives the optimal rank-k2 binary factorization of the affinity matrix, where k2 >> k is the number of eigenfunctions with eigenvalue greater than λ_k.
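A sketch of how equation 7 might be used at query time, assuming the per-dimension bits and their eigenvalues are stored as (d, m) arrays (our own layout; not the authors' released code):

```python
import numpy as np

def mdsh_affinity(bits_a, bits_b, lam):
    """Weighted Hamming affinity of eq. (6), computed via the kernel trick of eq. (7).

    bits_a, bits_b : (d, m) arrays in {-1, +1}; row d holds the m bits obtained from
                     single-dimension eigenfunctions of coordinate d.
    lam            : (d, m) array of the corresponding eigenvalues (bit weights).
    """
    # H_d(i, j): weighted agreement restricted to the bits of dimension d
    H_d = np.sum(lam * bits_a * bits_b, axis=1)
    # The product over dimensions implicitly sums over all outer-product bits,
    # yet costs only O(d * m) time and memory.
    return -1.0 + np.prod(1.0 + H_d)
```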

For real problems, of course, the input distribution is not guaranteed to be separable, nor are the eigenfunctions guaranteed to be binary valued. To deal with the separability assumption, we first apply PCA to the data so that the different dimensions are at least uncorrelated (for Gaussian distributions this also makes the distribution separable).


[Figure 3 panels: precision-recall curves (Recall vs. Precision) at 32 bits for thresholds T = δ/4 and T = δ; mean precision vs. distance threshold (as a fraction of δ) at 32 bits; and mean precision vs. number of bits (8-128) at threshold δ/4. Methods: MDSH, LSH, SH, ITQ, BRE, MLH.]

Figure 3. Comparison between different algorithms on toy data (the 32D Gaussian from Figure 1). We compare MDSH, SH [18], LSH [1], Gong and Lazebnik's ITQ [6], Kulis and Darrell's BRE [7] and MLH [10]. Left: performance as a function of distance threshold. The ranking of the existing algorithms depends on the distance, but MDSH consistently works best. Middle: precision-recall curves for two different thresholds. Right: performance as a function of the number of bits. SH performance can decrease with more bits, but MDSH continues to improve.

[Figure 4 panel: precision-recall curve for MDSH, LSH, SH and ITQ, evaluated using a neighborhood of 0.47.]

Figure 4. A case where the data density is extremely nonseparable by construction. MDSH still outperforms the other methods, since it is approximating the correct operator but with an incorrect weighting on the error.

Note, however, that even if the original distribution is nonseparable, the eigenfunctions that assume separability are still the optimal solution to an operator factorization problem (eq. 3) with the correct operator but the wrong density. As long as the separable density gives nonzero weight to points that had nonzero weight in the original density, the eigenfunction expansion will converge to the correct affinity as k increases. In other words, if the eigenfunctions happen to be binary, the MDSH algorithm will converge to the correct affinity even if the distribution is nonseparable.

The binary assumption is more problematic, as can be expected from the fact that binary matrix factorization is intractable. We follow the classic approach of spectral methods (e.g. [14]) and threshold the nonbinary eigenfunctions to obtain an approximate solution to the binary problem. The only exception is when the number of eigenfunctions with significant eigenvalue is less than the number of desired bits. In such cases (which we detect by counting the number of eigenvalues greater than 0.1 times the maximal eigenvalue), MDSH cannot create enough bits with significant weight, so we concatenate to the current code a second code that uses the exact same eigenfunctions but thresholds them at another value instead of zero. We repeat this concatenation until the number of bits equals the bit budget.

We use the code from [4] to numerically calculate the eigenfunctions for a single dimension, using the fact that the generalized eigenfunctions of the graph Laplacian (which are what [4] solves for) correspond to eigenfunctions of a reweighted affinity operator. We will release Matlab code for the full algorithm upon publication of the paper.

3. Results

We first revisit the toy data shown in the introduction. This data is sampled from a 32D Gaussian with the standard deviation along the ith dimension equal to 1/i². We compare MDSH to five existing approaches [18, 1, 6, 7, 10].


[Figure 5 panels: for two query images, the true nearest neighbors (1-10) and those retrieved by MDSH, LSH, SH and ITQ; precision-recall curves at thresholds T = δ/4 and T = δ; and mean precision vs. number of bits (8-128) at threshold δ/4.]

Figure 5. Results on the CIFAR dataset, 64K images. Left: two randomly selected test images and the retrieved neighbors. Middle: precision-recall curves for different thresholds. Right: mean precision as a function of the number of bits.

[Figure 6 panels: precision-recall curves at thresholds δ/4 and δ/20 using 32 bits; expected rank of the kth retrieved neighbor vs. k (up to 500); expected rank at 100 vs. number of bits (16-256); and, for two query images, the true nearest neighbors and those retrieved by MDSH, LSH, SH and ITQ.]

Figure 6. Results on 1 million Tiny Images with 32 bits. Top and middle left: precision-recall curves with T = δ/4 and T = δ/20. Bottom left: the average rank, in terms of Euclidean similarity, of the kth neighbor retrieved by each code (lower curves are better). Bottom middle: expected rank at 100 as a function of the number of bits (lower curves are better). Right: two randomly selected test images and the retrieved neighbors.

In all comparisons we use the code made available by the authors of each paper. Figure 3 shows that MDSH, which explicitly tries to model the affinity with a given σ, outperforms the other algorithms across the different regimes. Note also that the performance of SH, which ignores the “outer-product” eigenfunctions, may decrease with more bits (cf. [12]), while MDSH, which does not ignore these eigenfunctions, continues to improve.

To check robustness to the separability assumption, we take the 32D synthetic data and set x(i) = x(i−1)² for all even dimensions. This creates strong, nonlinear dependencies in the data which PCA cannot undo, so the separability assumption does not hold. Nevertheless, we find that MDSH still outperforms the other methods (Figure 4). As discussed in Section 2.2, even when the separability assumption fails, MDSH is still approximating the correct operator, but with an incorrect weight function on the error.

Our next experiment uses the exact same CIFAR dataset used in [6]. It consists of 64,800 images from 10 ground-truth classes. We use 32 PCA coefficients of the GIST descriptors and show results when the “distance threshold”, i.e. the definition of ground-truth neighbors, is set to δ, the average distance between any two images, and when it is set to δ/4. As can be seen, for large


distance thresholds all algorithms perform well, but the ITQ algorithm works best. In contrast, in the more challenging regime of small distance thresholds, the MDSH algorithm outperforms all the other algorithms. BRE and MLH were very slow in these runs and so were not compared. Note that ITQ cannot generate more bits than the number of input dimensions (32 here).

Finally, we show results on a database of 1 million Tiny Images (Figure 6). Unlike the CIFAR dataset, which is structured into 10 groups by construction, in this unstructured dataset we find that setting a reasonable distance threshold is very tricky, since almost all pairs of points are at distance δ. When we set the threshold to δ/20, MDSH performs perfectly and retrieves all similar pairs with perfect precision. For threshold δ/4 it still has high precision but returns only a small fraction of the similar pairs. In contrast, LSH and ITQ simply retrieve a huge number of pairs even for the smallest possible threshold and therefore have very low precision in both cases.

To avoid the dependence on the choice of distance threshold, we present in the bottom-left plot of Figure 6 a different measure of performance. For each algorithm and each number k, we find the kth nearest neighbor according to the code and record the rank of that image under the original GIST distance; the lower the curve, the better. MDSH is the best under this measure (which does not require a distance threshold). The bottom-middle plot shows that MDSH continues to outperform the other methods under this measure for different numbers of bits.
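A sketch of this threshold-free measure, under the assumption that code_dist[q] and gist_dist[q] hold the Hamming and Euclidean (GIST) distances from query q to the entire database (our own notation, not the paper's evaluation code):

```python
import numpy as np

def expected_rank_curve(code_dist, gist_dist, max_k):
    """For each k <= max_k, the average true (Euclidean) rank of the kth code neighbor."""
    ranks = np.zeros(max_k)
    for cd, gd in zip(code_dist, gist_dist):
        code_order = np.argsort(cd)               # database ordered by the binary code
        true_rank = np.argsort(np.argsort(gd))    # Euclidean rank of every database item
        ranks += true_rank[code_order[:max_k]]    # ranks of the 1st..max_k-th code neighbors
    return ranks / len(code_dist)
```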

4. Discussion

What makes a good similarity-preserving binary code? One option is to call a code good if Hamming distance matches the original Euclidean distance. In this paper we have shown that this definition may be problematic, since in applications where we care about retrieving “similar” neighbors, matching all distances exactly may be a waste of bits. Instead, we have suggested an alternative criterion, where the goal is to have the Hamming affinity match a desired affinity matrix. We have shown that this leads to an intractable binary matrix factorization problem, but that its spectral relaxation can be used to build excellent codes with very simple learning algorithms.

References

[1] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In FOCS, pages 459-468, 2006.

[2] M. Belkin and P. Niyogi. Towards a theoretical foundation for Laplacian-based manifold methods. J. Comput. Syst. Sci., 74(8):1289-1308, 2008.

[3] R. R. Coifman, S. Lafon, A. Lee, M. Maggioni, B. Nadler, F. Warner, and S. Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data, part I: Diffusion maps. PNAS, 102(21):7426-7431, 2005.

[4] R. Fergus, Y. Weiss, and A. Torralba. Semi-supervised learning in gigantic image collections. In NIPS, 2009.

[5] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proc. Intl. Conf. on Very Large Data Bases, 1999.

[6] Y. Gong and S. Lazebnik. Iterative quantization: A Procrustean approach to learning binary codes. In CVPR, 2011.

[7] B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings. In NIPS, 2009.

[8] S. Kumar, M. Mohri, and A. Talwalkar. Sampling techniques for the Nystrom method. In AISTATS, 2009.

[9] B. Nadler, N. Srebro, and X. Zhou. Statistical analysis of semi-supervised learning: The limit of infinite unlabelled data. In NIPS, pages 1330-1338, 2009.

[10] M. Norouzi and D. Fleet. Minimal loss hashing for compact binary codes. In ICML, 2011.

[11] R.-S. Lin, D. Ross, and J. Yagnik. SPEC hashing: Similarity preserving algorithm for entropy-based coding. In CVPR, 2010.

[12] M. Raginsky and S. Lazebnik. Locality-sensitive binary codes from shift-invariant kernels. In NIPS, 2009.

[13] R. R. Salakhutdinov and G. E. Hinton. Semantic hashing. In SIGIR Workshop on Information Retrieval and Applications of Graphical Models, 2007.

[14] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE PAMI, 22:888-905, 2000.

[15] N. Srebro and T. Jaakkola. Weighted low-rank approximations. In ICML, pages 720-727, 2003.

[16] A. Torralba, R. Fergus, and Y. Weiss. Small codes and large image databases for recognition. In CVPR, 2008.

[17] J. Wang, S. Kumar, and S.-F. Chang. Sequential projection learning for hashing with compact codes. In ICML, pages 1127-1134, 2010.

[18] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In NIPS, 2008.
