
Efficient Online Locality Sensitive Hashing via Reservoir Counting

Benjamin Van Durme
HLTCOE

Johns Hopkins University

Ashwin Lall
Mathematics and Computer Science

Denison University

Abstract

We describe a novel mechanism called Reservoir Counting for application in online Locality Sensitive Hashing. This technique allows for significant savings in the streaming setting: at a similar memory footprint, one can maintain a larger number of signatures, or an increased level of approximation accuracy.

1 Introduction

Feature vectors based on lexical co-occurrence are often of a high dimension, d. This leads to O(d) operations to calculate cosine similarity, a fundamental tool in distributional semantics. This is improved in practice through the use of data structures that exploit feature sparsity, leading to an expected O(f) operations, where f is the number of unique features we expect to have non-zero entries in a given vector.

Ravichandran et al. (2005) showed that the Locality Sensitive Hash (LSH) procedure of Charikar (2002), following from Indyk and Motwani (1998) and Goemans and Williamson (1995), could be successfully used to compress textually derived feature vectors in order to achieve speed efficiencies in large-scale noun clustering. Such LSH bit signatures are constructed using the following hash function, where $\vec{v} \in \mathbb{R}^d$ is a vector in the original feature space, and $\vec{r}$ is randomly drawn from $N(0,1)^d$:

$$h(\vec{v}) = \begin{cases} 1 & \text{if } \vec{v} \cdot \vec{r} \geq 0, \\ 0 & \text{otherwise.} \end{cases}$$

If $h_b(\vec{v})$ is the b-bit signature resulting from b such hash functions, then the cosine similarity between vectors $\vec{u}$ and $\vec{v}$ is approximated by:

$$\cos(\vec{u},\vec{v}) = \frac{\vec{u}\cdot\vec{v}}{|\vec{u}||\vec{v}|} \approx \cos\!\left(\frac{D(h_b(\vec{u}),\, h_b(\vec{v}))}{b} \cdot \pi\right),$$

where D(·, ·) is Hamming distance, the number of bits that disagree. This technique is used when b ≪ d, which leads to faster pair-wise comparisons between vectors, and a lower memory footprint.
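To make the construction concrete, the following minimal Python sketch (our illustration; function names are not from the paper) builds b-bit signatures from a Gaussian projection and recovers an approximate cosine similarity from Hamming distance:

import numpy as np

# Illustrative sketch, not the authors' code.
def lsh_signature(v, R):
    # One bit per hash function: h(v) = 1 if v . r >= 0, else 0.
    # R is a (b, d) matrix whose rows are drawn from N(0, 1)^d.
    return (R @ v >= 0).astype(np.uint8)

def approx_cosine(sig_u, sig_v):
    # cos(u, v) ~= cos(D(h_b(u), h_b(v)) / b * pi), D = Hamming distance.
    b = sig_u.shape[0]
    D = np.count_nonzero(sig_u != sig_v)
    return np.cos(D / b * np.pi)

rng = np.random.default_rng(0)
d, b = 10000, 256
R = rng.standard_normal((b, d))
u, v = rng.random(d), rng.random(d)
print(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))             # true cosine
print(approx_cosine(lsh_signature(u, R), lsh_signature(v, R)))      # LSH estimate

The approximation tightens as b grows, which is why the accuracy of the routine is described as a function of b.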

Van Durme and Lall (2010) observed¹ that if the feature values are additive over a dataset (e.g., when collecting word co-occurrence frequencies), then these signatures may be constructed online by unrolling the dot-product into a series of local operations: $\vec{v} \cdot \vec{r}_i = \sum_t \vec{v}_t \cdot \vec{r}_i$, where $\vec{v}_t$ represents features observed locally at time t in a data-stream.

Since updates may be done locally, feature vectors do not need to be stored explicitly. This directly leads to significant space savings, as only one counter is needed for each of the b running sums.
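A minimal sketch of this online construction (again our own illustration, not the authors' code): each of the b counters accumulates the partial dot products as feature/value pairs arrive, and the signature is read off from the signs at the end.

import numpy as np

class OnlineLSH:
    # Illustrative sketch: b running sums c[i] = sum_t v_t . r_i,
    # one per hash function.
    def __init__(self, d, b, seed=0):
        rng = np.random.default_rng(seed)
        self.R = rng.standard_normal((b, d))  # projection rows r_1..r_b
        self.c = np.zeros(b)                  # the b online counters

    def update(self, feature_index, value):
        # Additive feature values let the dot product be unrolled into
        # local updates as the stream is consumed.
        self.c += value * self.R[:, feature_index]

    def signature(self):
        return (self.c >= 0).astype(np.uint8)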

In this work we focus on the following observation: the counters used to store the running sums may themselves be an inefficient use of space, in that they may be amenable to compression through approximation.² Since the accuracy of this LSH routine is a function of b, if we were able to reduce the online requirements of each counter, we might afford a larger number of projections. Even if a chance of approximation error were introduced for each hash function, this may be justified by the greater overall fidelity from the resultant increase in b.

¹ A related point was made by Li et al. (2008) when discussing stable random projections.

² A b-bit signature requires the online storage of b ∗ 32 bits of memory when assuming a 32-bit floating point representation per counter; but since all we care about in these sums is their sign (positive or negative), an approximation to the true sum may be sufficient.


Thus, we propose to approximate the online hash function, using a novel technique we call Reservoir Counting, in order to create a space trade-off between the number of projections and the amount of memory each projection requires. We show experimentally that this leads to more accurate approximations at the same memory cost, or similarly accurate approximations at a significantly reduced cost. This result is relevant to work in large-scale distributional semantics (Bhagat and Ravichandran, 2008; Van Durme and Lall, 2009; Pantel et al., 2009; Lin et al., 2010; Goyal et al., 2010; Bergsma and Van Durme, 2011), as well as large-scale processing of social media (Petrovic et al., 2010).

2 Approach

While not strictly required, we assume here that we are dealing exclusively with integer-valued features. We then employ an integer-valued projection matrix in order to work with an integer-valued stream of online updates, which is reduced (implicitly) to a stream of positive and negative unit updates. The sign of the sum of these updates is approximated through a novel twist on Reservoir Sampling. Computed explicitly, this leads to an impractical mechanism linear in each feature value update. To ensure our counter can (approximately) add and subtract in constant time, we then derive expressions for the expected value of each step of the update. The full algorithms are provided at the close.

Unit Projection  Rather than construct a projection matrix from N(0, 1), a matrix randomly populated with entries from the set {−1, 0, 1} will suffice, with quality dependent on the relative proportion of these elements. If we let p be the percent probability mass allocated to zeros, then we create a discrete projection matrix by sampling from the multinomial: ((1 − p)/2 : −1,  p : 0,  (1 − p)/2 : +1). The resultant quality for varied p is displayed in Fig. 1. Henceforth we assume this discrete projection matrix, with p = 0.5.³ The use of such sparse projections was first proposed by Achlioptas (2003), then extended by Li et al. (2006).

³ Note that if using the pooling trick of Van Durme and Lall (2010), this equates to a pool of the form: (-1, 0, 0, 1).
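A sketch of drawing such a discrete projection explicitly (illustrative only; Algorithm 2 below obtains the same effect with hash functions rather than a stored matrix):

import numpy as np

# Illustrative sketch, not the authors' code.
def unit_projection(d, b, p=0.5, seed=0):
    # Entries are -1, 0, +1 with probabilities (1-p)/2, p, (1-p)/2.
    rng = np.random.default_rng(seed)
    return rng.choice([-1, 0, 1], size=(b, d), p=[(1 - p) / 2, p, (1 - p) / 2])

With p = 0.5 this matches the pool (-1, 0, 0, 1) mentioned in footnote 3.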

Figure 1: With b = 256, mean absolute error in cosine approximation when using a projection based on N(0, 1), compared to {−1, 0, 1} (error plotted against the percentage of zeros, for the Discrete and Normal methods).

Unit Stream  Based on a unit projection, we can view an online counter as summing over a stream drawn from {−1, 1}: each projected feature value unrolled into its (positive or negative) unary representation. For example, the stream (3, -2, 1) can be viewed as the updates (1, 1, 1, -1, -1, 1).

Reservoir Sampling  We can maintain a uniform sample of size k over a stream of unknown length as follows. Accept the first k elements into a reservoir (array) of size k. Each following element at position n is accepted with probability k/n, whereupon an element currently in the reservoir is evicted and replaced with the just-accepted item. This scheme is guaranteed to provide a uniform sample, where early items are more likely to be accepted, but also at greater risk of eviction. Reservoir sampling is a folklore algorithm that was extended by Vitter (1985) to allow for multiple updates.
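For reference, the classic scheme in Python (a standard textbook sketch, not code from the paper):

import random

def reservoir_sample(stream, k, seed=0):
    # Keep a uniform sample of size k from a stream of unknown length.
    rng = random.Random(seed)
    reservoir = []
    for n, x in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(x)                  # accept the first k items outright
        elif rng.random() < k / n:
            reservoir[rng.randrange(k)] = x      # accept, evicting a random item
    return reservoir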

Reservoir Counting  If we are sampling over a stream drawn from just two values, we can implicitly represent the reservoir by counting only the frequency of one or the other element.⁴ We can therefore sample the proportion of positive and negative unit values by tracking the current position in the stream, n, and keeping a log₂(k + 1)-bit integer counter, s, for tracking the number of 1 values currently in the reservoir.⁵ When a negative value is accepted, we decrement the counter with probability s/k. When a positive update is accepted, we increment the counter with probability (1 − s/k). This reflects an update evicting either an element of the same sign, which has no effect on the makeup of the reservoir, or decreasing/increasing the number of 1's currently sampled. An approximate sum of all values seen up to position n is then simply: n(2s/k − 1). While this value is potentially interesting in future applications, here we are only concerned with its sign.

⁴ For example, if we have a reservoir of size 5, containing three values of −1, and two values of 1, then the exchangeability of the elements means the reservoir is fully characterized by knowing k, and that there are two 1's.
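The corresponding per-unit update, written out explicitly (our sketch; it assumes the reservoir is already full, i.e. n ≥ k):

import random

# Illustrative sketch, not the authors' code; assumes n >= k.
def unit_update(s, k, n, sign, rng=random):
    # s counts the +1's in an implicit size-k reservoir over a +/-1 stream;
    # n is the current stream position (this unit is the n-th element).
    if rng.random() < k / n:                        # accepted into the reservoir
        if sign > 0 and rng.random() < 1 - s / k:   # evicted a -1, gained a +1
            s += 1
        elif sign < 0 and rng.random() < s / k:     # evicted a +1, gained a -1
            s -= 1
    return s

# Approximate sum of the first n units: n * (2 * s / k - 1); only its sign
# is needed for the LSH bit.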

Parallel Reservoir Counting  On its own this counting mechanism hardly appears useful: since it depends on knowing n, we might just as well sum the elements of the stream directly, counting in whatever space we would otherwise use in maintaining the value of n. However, if we have a set of tied streams that we process in parallel,⁶ then we only need to track n once, across b different streams, each with its own reservoir.

When dealing with parallel streams resulting from different random projections of the same vector, we cannot assume these will be strictly tied. Some projections will cancel out heavier elements than others, leading to update streams of different lengths once elements are unrolled into their (positive or negative) unary representation. In practice we have found that tracking the mean value of n across b streams is sufficient. When using a p = 0.5 zeroed matrix, we can update n by one half the magnitude of each observed value, as on average half the projections will cancel out any given element. This step can be found in Algorithm 2, lines 8 and 9.

Example  To make concrete what we have covered to this point, consider a given feature vector of dimensionality d = 3, say: [3, 2, 1]. This might be projected into b = 4 vectors: [3, 0, 0], [0, -2, 1], [0, 0, 1], and [-3, 2, 0]. When viewed as positive/negative, loosely-tied unit streams, they respectively have length n: 3, 3, 1, and 5, with mean length 3. The goal of reservoir counting is to efficiently keep track of an approximation of their sums (here: 3, -1, 1, and -1), while the underlying feature vector is being updated online. A k = 3 reservoir used for the last projected vector, [-3, 2, 0], might reasonably contain two values of -1, and one value of 1.⁷ Represented explicitly as a vector, the reservoir would thus be in the arrangement: [1, -1, -1], [-1, 1, -1], or [-1, -1, 1]. These are functionally equivalent: we only need to know that one of the k = 3 elements is positive.

k     n      m     mean(A)  mean(A′)
10    20     10    3.80     4.02
10    20     1000  37.96    39.31
50    150    1000  101.30   101.83
100   1100   100   8.88     8.72
100   10100  10    0.13     0.10

Table 1: Average over repeated calls to A and A′.

⁵ E.g., a reservoir of size k = 255 requires an 8-bit integer.
⁶ Tied in the sense that each stream is of the same length, e.g., (-1,1,1) is the same length as (1,-1,-1).

Expected Number of Samples  Traversing m consecutive values of either 1 or −1 in the unit stream should be thought of as seeing positive or negative m as a feature update. For a reservoir of size k, let A(m, n, k) be the number of samples accepted when traversing the stream from position n + 1 to n + m. A is non-deterministic: it represents the results of flipping m consecutive coins, where each coin is increasingly biased towards rejection.

Rather than computing A explicitly, which is linear in m, we will instead use the expected number of updates, A′(m, n, k) = E[A(m, n, k)], which can be computed in constant time. Where H(x) is the harmonic number of x:⁸

$$A'(m, n, k) \;=\; \sum_{i=n+1}^{n+m} \frac{k}{i} \;=\; k\bigl(H(n+m) - H(n)\bigr) \;\approx\; k \log_e\!\left(\frac{n+m}{n}\right).$$

For example, consider m = 30, encountered at position n = 100, with a reservoir of k = 10. We will then accept $10 \log_e(\frac{130}{100}) \approx 2.62$ samples of 1. As the reservoir is a discrete set of bins, fractional portions of a sample are resolved by a coin flip: if $a = k \log_e(\frac{n+m}{n})$, then accept $u = \lceil a \rceil$ samples with probability $(a - \lfloor a \rfloor)$, and $u = \lfloor a \rfloor$ samples otherwise. These steps are found in lines 3 and 4 of Algorithm 1. See Table 1 for simulation results using a variety of parameters.

⁷ Other options are: three -1's, or one -1 and two 1's.
⁸ With x a positive integer, $H(x) = \sum_{i=1}^{x} 1/i \approx \log_e(x) + \gamma$, where γ is Euler's constant.
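A sketch of these two steps (our helper names), computing the expected number of accepts and resolving the fractional part by a coin flip:

import math
import random

# Illustrative sketch, not the authors' code.
def expected_accepts(m, n, k):
    # A'(m, n, k) = k (H(n+m) - H(n)) ~= k ln((n+m)/n)
    return k * math.log((n + m) / n)

def randomized_round(a, rng=random):
    # ceil(a) with probability a - floor(a), floor(a) otherwise.
    return math.floor(a) + (1 if rng.random() < a - math.floor(a) else 0)

u = randomized_round(expected_accepts(30, 100, 10))  # ~2.62, so u is 2 or 3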

Expected Reservoir Change  We now discuss how to simulate many independent updates of the same type to the reservoir counter, e.g., five updates of 1, or three updates of -1, using a single estimate. Consider a situation in which we have a reservoir of size k with some current value of s, 0 ≤ s ≤ k, and we wish to perform u independent updates. We denote by U′_k(s, u) the expected value of the reservoir after these u updates have taken place. Since a single (positive) update leads to no change with probability s/k, we can write the following recurrence for U′_k:

$$U'_k(s, u) = \frac{s}{k}\,U'_k(s, u-1) + \frac{k-s}{k}\,U'_k(s+1, u-1),$$

with the boundary condition U′_k(s, 0) = s for all s. Solving the above recurrence, we get that the expected value of the reservoir after these updates is:

$$U'_k(s, u) = k + (s - k)\left(1 - \frac{1}{k}\right)^{u},$$

which can be mechanically checked via induction. The case for negative updates follows similarly (see lines 7 and 8 of Algorithm 1).

Hence, instead of simulating u independent updates of the same type to the reservoir, we simply update it to this expected value, where fractional updates are handled similarly as when estimating the number of accepts. These steps are found in lines 5 through 9 of Algorithm 1, and as seen in Fig. 2, this can give a tight estimate.
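The closed form, together with a direct simulation one can use to check it (our sketch, in the spirit of Fig. 2):

import random

# Illustrative sketch, not the authors' code.
def expected_reservoir(s, u, k, positive=True):
    # U'_k(s, u): expected counter value after u accepted same-sign updates.
    if positive:
        return k + (s - k) * (1 - 1 / k) ** u
    return s * (1 - 1 / k) ** u

def simulated_reservoir(s, u, k, positive=True, rng=random):
    # Apply the u accepted updates one at a time, for comparison.
    for _ in range(u):
        if positive and rng.random() < 1 - s / k:
            s += 1
        elif not positive and rng.random() < s / k:
            s -= 1
    return s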

Comparison  Simulation results over Zipfian-distributed data can be seen in Fig. 3, which shows the use of reservoir counting in Online Locality Sensitive Hashing (as made explicit in Algorithm 2), as compared to the method described by Van Durme and Lall (2010).

The total amount of space required when using this counting scheme is b log₂(k + 1) + 32 bits: b reservoirs, and a 32-bit integer to track n. This is compared to b 32-bit floating point values, as is standard. Note that our scheme comes away with similar levels of accuracy, often at half the memory cost, while requiring larger b to account for the chance of approximation errors in individual reservoir counters.
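For instance, a back-of-the-envelope check using the formula above (our illustration):

import math

def footprint_bits(b, k):
    reservoir_scheme = b * math.ceil(math.log2(k + 1)) + 32  # b reservoirs + one 32-bit n
    standard_scheme = b * 32                                  # b 32-bit counters
    return reservoir_scheme, standard_scheme

print(footprint_bits(256, 255))  # (2080, 8192): roughly a 4x reduction at equal b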

Figure 2: Results of simulating many iterations of U′, for k = 255, and various values of s and u (expected values plotted against true values).

Algorithm 1 RESERVOIRUPDATE(n, k, m, σ, s)
Parameters:
n : size of stream so far
k : size of reservoir, also maximum value of s
m : magnitude of update
σ : sign of update
s : current value of reservoir

1: if m = 0 or σ = 0 then
2:   Return without doing anything
3: a := A′(m, n, k) = k log_e((n + m)/n)
4: u := ⌈a⌉ with probability a − ⌊a⌋, ⌊a⌋ otherwise
5: if σ = 1 then
6:   s′ := U′(s, u) = k + (s − k)(1 − 1/k)^u
7: else
8:   s′ := U′(s, u) = s(1 − 1/k)^u
9: Return ⌈s′⌉ with probability s′ − ⌊s′⌋, ⌊s′⌋ otherwise
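A direct Python transcription of Algorithm 1 (a sketch; the guard against n = 0, which the log approximation does not handle, is our addition):

import math
import random

# Illustrative transcription of Algorithm 1, not the authors' code.
def reservoir_update(n, k, m, sigma, s, rng=random):
    # n: stream size so far; k: reservoir size; m: update magnitude;
    # sigma: update sign; s: current reservoir counter.
    if m == 0 or sigma == 0:
        return s
    a = k * math.log((n + m) / max(n, 1))            # A'(m, n, k), line 3 (n=0 guarded)
    u = math.floor(a) + (1 if rng.random() < a - math.floor(a) else 0)  # line 4
    if sigma == 1:
        s_prime = k + (s - k) * (1 - 1 / k) ** u     # line 6, positive updates
    else:
        s_prime = s * (1 - 1 / k) ** u               # line 8, negative updates
    return math.floor(s_prime) + (1 if rng.random() < s_prime - math.floor(s_prime) else 0)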

Figure 3: Online LSH using reservoir counting (red) vs. standard counting mechanisms (blue): mean absolute error plotted against the total number of bits required, for b ∈ {64, 128, 192, 256, 512} and log₂ k ∈ {8, 32}.


Algorithm 2 COMPUTESIGNATURE(S, k, b, p)
Parameters:
S : bit array of size b
k : size of each reservoir
b : number of projections
p : percentage of zeros in projection, p ∈ [0, 1]

1: Initialize b reservoirs R[1, . . . , b], each represented by a log₂(k + 1)-bit unsigned integer
2: Initialize b hash functions h_i(w) that map features w to elements in a vector made up of −1 and 1 each with proportion (1 − p)/2, and 0 at proportion p
3: n := 0
4: {Processing the stream}
5: for each feature value pair (w, m) in stream do
6:   for i := 1 to b do
7:     R[i] := RESERVOIRUPDATE(n, k, m, h_i(w), R[i])
8:   n := n + ⌊m(1 − p)⌋
9:   n := n + 1 with probability m(1 − p) − ⌊m(1 − p)⌋
10: {Post-processing to compute signature}
11: for i := 1 . . . b do
12:   if R[i] > k/2 then
13:     S[i] := 1
14:   else
15:     S[i] := 0
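A Python sketch of the same procedure, reusing reservoir_update from the sketch after Algorithm 1 (the hash functions and the initial counter value of k // 2 are our illustrative choices; the pseudocode leaves initialization unspecified):

import math
import random

# Illustrative sketch of Algorithm 2, not the authors' code.
def compute_signature(stream, k, b, p, hash_fns, rng=random):
    # stream yields (feature, magnitude) pairs; hash_fns[i](w) returns
    # -1 or +1 with proportion (1-p)/2 each, and 0 with proportion p.
    R = [k // 2] * b            # b reservoir counters (initial value is our assumption)
    n = 0
    for w, m in stream:
        for i in range(b):
            R[i] = reservoir_update(n, k, m, hash_fns[i](w), R[i], rng)
        inc = m * (1 - p)       # on average, p of the projections zero out this update
        n += math.floor(inc)
        if rng.random() < inc - math.floor(inc):
            n += 1
    return [1 if R[i] > k / 2 else 0 for i in range(b)]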

3 Discussion

Time and Space  While we have provided a constant-time, approximate update mechanism, the constants involved will practically remain larger than the cost of performing single hardware addition or subtraction operations on a traditional 32-bit counter. This leads to a tradeoff in space vs. time, where a high-throughput streaming application that is not concerned with online memory requirements will not have reason to consider the developments in this article. The approach given here is motivated by cases where data is not flooding in at breakneck speed, and resource considerations are dominated by a large number of unique elements for which we are maintaining signatures. Empirically investigating this tradeoff is a matter of future work.

Random Walks  As we only care here for the sign of the online sum, rather than an approximation of its actual value, it is reasonable to consider instead modeling the problem directly as a random walk on a linear Markov chain, with unit updates directly corresponding to forward or backward state transitions. Assuming a fixed probability of a positive versus negative update, then in expectation the state of the chain should correspond to the sign. However, if we are concerned with the global statistic, as we are here, then the assumption of a fixed-probability update precludes the analysis of streaming sources that contain local irregularities.⁹

Figure 4: A simple 8-state Markov chain (states labeled −4 through 3), requiring lg(8) = 3 bits. Dark or light states correspond to a prediction of a running sum being positive or negative. States are numerically labeled to reflect the similarity to a small-bit integer data type, one that never overflows.

In distributional semantics, consider a feature stream formed by sequentially reading the n-gram resource of Brants and Franz (2006). The pair (the dog : 3,502,485) can be viewed as a feature value pair (leftWord='the' : 3,502,485), with respect to online signature generation for the word dog. Rather than viewing this feature repeatedly, spread over a large corpus, the update happens just once, with large magnitude. A simple chain such as seen in Fig. 4 will be "pushed" completely to the right or the left, based on the polarity of the projection, irrespective of previously observed updates. Reservoir Counting, representing an online uniform sample, is agnostic to the ordering of elements in the stream.

4 Conclusion

We have presented a novel approximation scheme we call Reservoir Counting, motivated here by a desire for greater space efficiency in Online Locality Sensitive Hashing. Going beyond our results provided for synthetic data, future work will explore applications of this technique, such as in experiments with streaming social media like Twitter.

Acknowledgments
This work benefited from conversations with Daniel Stefonkovic and Damianos Karakos.

⁹ For instance: (1,1,...,1,1,-1,-1,-1) is overall positive, but locally negative at the end.


References

Dimitris Achlioptas. 2003. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. J. Comput. Syst. Sci., 66:671–687, June.

Shane Bergsma and Benjamin Van Durme. 2011. Learning Bilingual Lexicons using the Visual Similarity of Labeled Web Images. In Proc. of the International Joint Conference on Artificial Intelligence (IJCAI).

Rahul Bhagat and Deepak Ravichandran. 2008. Large Scale Acquisition of Paraphrases for Learning Surface Patterns. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL).

Thorsten Brants and Alex Franz. 2006. Web 1T 5-gram version 1.

Moses Charikar. 2002. Similarity estimation techniques from rounding algorithms. In Proceedings of STOC.

Michel X. Goemans and David P. Williamson. 1995. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. JACM, 42:1115–1145.

Amit Goyal, Jagadeesh Jagarlamudi, Hal Daume III, and Suresh Venkatasubramanian. 2010. Sketch Techniques for Scaling Distributional Similarity to the Web. In Proceedings of the ACL Workshop on GEometrical Models of Natural Language Semantics.

Piotr Indyk and Rajeev Motwani. 1998. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of STOC.

Ping Li, Trevor J. Hastie, and Kenneth W. Church. 2006. Very sparse random projections. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '06, pages 287–296, New York, NY, USA. ACM.

Ping Li, Kenneth W. Church, and Trevor J. Hastie. 2008. One Sketch For All: Theory and Application of Conditional Random Sampling. In Proc. of the Conference on Advances in Neural Information Processing Systems (NIPS).

Dekang Lin, Kenneth Church, Heng Ji, Satoshi Sekine, David Yarowsky, Shane Bergsma, Kailash Patil, Emily Pitler, Rachel Lathbury, Vikram Rao, Kapil Dalwani, and Sushant Narsale. 2010. New Tools for Web-Scale N-grams. In Proceedings of LREC.

Patrick Pantel, Eric Crestan, Arkady Borkovsky, Ana-Maria Popescu, and Vishnu Vyas. 2009. Web-Scale Distributional Similarity and Entity Set Expansion. In Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Sasa Petrovic, Miles Osborne, and Victor Lavrenko. 2010. Streaming First Story Detection with application to Twitter. In Proceedings of the Annual Meeting of the North American Association of Computational Linguistics (NAACL).

Deepak Ravichandran, Patrick Pantel, and Eduard Hovy. 2005. Randomized Algorithms and NLP: Using Locality Sensitive Hash Functions for High Speed Noun Clustering. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL).

Benjamin Van Durme and Ashwin Lall. 2009. Streaming Pointwise Mutual Information. In Proc. of the Conference on Advances in Neural Information Processing Systems (NIPS).

Benjamin Van Durme and Ashwin Lall. 2010. Online Generation of Locality Sensitive Hash Signatures. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL).

Jeffrey S. Vitter. 1985. Random sampling with a reservoir. ACM Trans. Math. Softw., 11:37–57, March.