
MapReduce Based Personalized Locality Sensitive Hashing for Similarity Joins on Large Scale Data

Jingjing Wang, Chen Lin

February 24, 2015

Abstract

Locality Sensitive Hashing (LSH) has been proposed as an efficient technique for similarity joins for high dimensional data. The efficiency and approximation rate of LSH depend on the number of generated false positive instances and false negative instances. In many domains, reducing the number of false positives is crucial. Furthermore, in some application scenarios, balancing false positives and false negatives is favored. To address these problems, in this paper we propose Personalized Locality Sensitive Hashing (PLSH), where a new banding scheme is embedded to tailor the number of false positives, false negatives, and the sum of both. PLSH is implemented in parallel using the MapReduce framework to deal with similarity joins on large scale data. Experimental studies on real and simulated data verify the efficiency and effectiveness of our proposed PLSH technique, compared with state-of-the-art methods.

Keywords: Locality Sensitive Hashing, MapReduce, Similarity Joins

1 Introduction

A fundamental problem in data mining is to detect similar items. Finding similar pairs of instances is an essential component in mining numerous types of data, including document clustering [1, 2], plagiarism detection [3], image search [4], recommender systems [5], and so on.

Identifying pairs of similar instances is also called similarity joins [6]. Given a set of data instances, a similarity threshold J, and a join attribute a, the goal of similarity joins is to find all pairs of instances ⟨A, B⟩ whose similarity on the join attribute is larger than the threshold J (i.e., sim(A.a, B.a) ≥ J). There are various similarity measurements, including cosine similarity [7], edit distance [6, 8, 9], hamming distance [7, 10], dimension root similarity [2], EDU-based similarity for elementary discourse units [11], and so on. In this work we focus on Jaccard similarity, which is proven to be successful for high dimensional, sparse feature sets [6]. Jaccard similarity


for two feature vectors S1 and S2 is defined as sim(S1, S2) = |S1 ∩ S2| / |S1 ∪ S2|. As an example, we illustrate the naive computation for similarity joins based on Jaccard similarity in Table 1. Suppose there are 5 instances, namely A, B, C, D and E, and the join attribute consists of features a, b, c, d, e, f. The Jaccard similarity of ⟨A, B⟩ is 1/4, ⟨A, C⟩ is 2/3, ⟨B, C⟩ is 1/5, ⟨B, E⟩, ⟨C, D⟩ and ⟨C, E⟩ are 1/4, ⟨D, E⟩ is 1/3, and the similarities for the remaining pairs ⟨A, D⟩, ⟨A, E⟩, ⟨B, D⟩ are all 0. Given the similarity threshold J = 0.5, it is directly concluded that instances A and C are similar.

Table 1: An illustrative example of similarity joins based on Jaccard similarity. 0/1 indicates absence/presence of features in each instance.

Instance   a  b  c  d  e  f
A          0  1  0  0  1  0
B          1  0  0  0  1  1
C          0  1  0  1  1  0
D          0  0  1  1  0  0
E          0  0  0  1  0  1
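The Jaccard computations above are easy to verify mechanically. Below is a minimal Python sketch (ours, for illustration; the instance names follow Table 1) that reproduces the pairwise similarities:

# Feature sets of Table 1: each instance keeps only the features marked 1.
instances = {
    "A": {"b", "e"},
    "B": {"a", "e", "f"},
    "C": {"b", "d", "e"},
    "D": {"c", "d"},
    "E": {"d", "f"},
}

def jaccard(s1, s2):
    """Jaccard similarity: |S1 intersect S2| / |S1 union S2|."""
    return len(s1 & s2) / len(s1 | s2)

# Naive similarity join: compare every pair against the threshold J.
J = 0.5
for x in sorted(instances):
    for y in sorted(instances):
        if x < y and jaccard(instances[x], instances[y]) >= J:
            print(x, y)  # prints only: A C  (similarity 2/3)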

A naive algorithm, which finds similar pairs by computing similarities for all instance pairs, is clearly impracticable on a large collection of instances with high dimensional features. To improve the efficiency and scalability of similarity joins, previous research efforts generally fall into two categories. On one hand, parallel algorithms are adopted on clusters of machines. Most of them are implemented using the MapReduce framework, including a 3-stage MapReduce approach for an end-to-end set-similarity join algorithm [12], fast computation of inner products for large scale news articles [13], a new ant colony optimization algorithm parallelized using MapReduce [14] to select features in a high dimensional space, and so on. Others exploit the parallelism and high data throughput of GPUs, e.g., the LSS algorithm [15]. On the other hand, algorithmic design can be improved to reduce the time and storage cost of similarity computation for high dimensional feature spaces. One type of such approaches uses dimension reduction technologies, including Principal Component Analysis and neural networks [16]. Another type is to hash and filter, so that the high dimensional feature space can be replaced by smaller representative signatures. The most popular hashing methods include minhashing [17], minwise hashing [18] and Locality Sensitive Hashing (LSH) [10]. The core idea of hashing is to map similar pairs to similar signatures with several hundred dimensions, each element of which is the result of hashing, and hence it sheds insight on the solution of high dimensionality. Hashing can also be a means for data clustering, because it enables similar features with vast dimensions to be hashed into the same buckets, thus partitioning features into groups [19]. Filtering methods, including the length filter [7], prefix filter [20] and suffix filter [21], are frequently utilized subsequently to eliminate dissimilar pairs while retaining possibly similar pairs. As a result, fewer similarity computations are needed. In particular, the banding technique [22], a specialized


form of locality sensitive hashing, which maps every band of signatures to an array of buckets so that the probability of collision is much higher for instances close to each other, is the most efficient filtering method.

Although previous works have demonstrated the importance and feasibility of hashing and filtering approaches, one critical issue remains underestimated. Hashing and filtering approaches produce approximate results. The similarities of selected pairs are not guaranteed to be larger than the predefined threshold. Meanwhile, discarded pairs are not necessarily dissimilar; their similarities may be no less than the predefined threshold. The former case is called false positive, while the latter one is called false negative. A moderate number of false positives and false negatives is acceptable in many applications. However, the tolerance to false positives and false negatives may differ. In most application scenarios, such as clustering and information retrieval, a small number of false positives is emphasized to increase efficiency and precision. In applications such as recommendation and bioinformatics systems [23, 24, 25], a small number of false negatives is more important.

In this paper, we address the problem of tailoring the numbers of false positives and false negatives for different applications. To the best of our knowledge, this is the first time in the literature that such a detailed analysis is presented. False positives and false negatives are caused by the scheme of pruning candidate pairs whose signatures map into disjoint bucket arrays. Intuitively, similar signatures are likely to have highly analogous bands, and analogous bands will be mapped into identical bucket arrays. Inspired by this intuition, we propose a new banding technique called Personalized Locality Sensitive Hashing (PLSH), in which bands of signatures mapped to at least k identical buckets are selected as candidates. We also explore the probability guarantees the new banding technique provides for three cases: false negatives, false positives, and the sum of both. Based on these probabilities, we derive the upper and lower bounds of the numbers of false positives and false negatives, and accordingly show how to personalize the parameters involved in the banding and hashing algorithms to fulfill different application demands.

The contributions of this paper are threefold:

• We improve the traditional banding technique with a new banding technique with a flexible threshold, to reduce the number of false positives and improve efficiency.

• We derive the number and the lower/upper bounds of false negatives, false positives, and the balance between them for our new banding technique.

• We implement the new banding technique using the parallel framework MapReduce.

The rest of the paper is structured as follows. In Section 2, the background of minhashing and the banding technique is presented. In Section 3, we introduce Personalized Locality Sensitive Hashing (PLSH). The implementation of PLSH using MapReduce is shown in Section 4. In Section 5, we present and analyze the experimental results. We survey related works in Section 6. Finally, the conclusion is given in Section 7.


2 Background

In this section, we briefly introduce the minhashing algorithm and the consequent banding algorithm, which are the fundamental building blocks of Locality Sensitive Hashing (LSH). The intuition of minhashing is to generate low dimensional signatures to represent high dimensional features. The intuition of banding is to filter candidates which are not likely to be similar pairs.

2.1 MinHashing

For large scale data sets, the feature space is usually high dimensional and very sparse, i.e., only a tiny portion of features appear in a single instance. In order to reduce the memory used to store such sparse vectors, we use a signature, an integer vector consisting of up to several hundred elements, to represent an instance. To generate a signature, we first randomly change the order of features. In other words, the permutation defines a hash function h_i that maps the features to features with a different order. Each element of the signature is a minhash value [17], which is the position of the first non-zero feature in the permuted feature vector. For example, the original feature order in Table 1 is abcdef; suppose the permuted feature order is bacdfe, then the feature vectors for A, B, C, D, E become (100001), (010011), (100101), (001100) and (000110), as illustrated in Table 2. Thus the minhash values for A, B, C, D, E are 1, 2, 1, 3, 4, respectively.

We can choose n independent permutations h1, h2, ..., hn. Suppose the minhash value of an instance Si for a certain permutation hj is denoted by min hj(Si); then the signature, denoted Sig(Si), is:

Sig(S_i) = (\min h_1(S_i), \min h_2(S_i), \ldots, \min h_n(S_i))

The approximate similarity between two instances based on their signatures is defined as the percentage of identical values at the same positions in the corresponding signatures. For example, given n = 6, Sig(S1) = (2, 1, 5, 0, 3, 2) and Sig(S2) = (2, 1, 3, 2, 8, 0), the approximate Jaccard similarity is sim(S1, S2) ≈ 2/6 = 0.33.
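The signature construction and the signature-based similarity estimate can be sketched in a few lines of Python (a simplified illustration of the scheme above, not the paper's exact implementation; names are ours):

import random

def minhash_signature(present_features, permutations):
    """One minhash value per permutation: the (0-based) position, in permuted
    order, of the first feature present in the instance."""
    return [next(pos for pos, f in enumerate(perm) if f in present_features)
            for perm in permutations]

def approx_similarity(sig1, sig2):
    """Fraction of positions at which the two signatures agree."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

features = ["a", "b", "c", "d", "e", "f"]
rng = random.Random(7)
perms = [rng.sample(features, len(features)) for _ in range(100)]  # n = 100

sig_A = minhash_signature({"b", "e"}, perms)       # instance A of Table 1
sig_C = minhash_signature({"b", "d", "e"}, perms)  # instance C of Table 1
print(approx_similarity(sig_A, sig_C))  # close to the true Jaccard value 2/3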

Table 2: An illustrative example of permutation of feature vectors. 0/1 indicates absence/presence of features in each instance.

Instance   b  a  c  d  f  e
A          1  0  0  0  0  1
B          0  1  0  0  1  1
C          1  0  0  1  0  1
D          0  0  1  1  0  0
E          0  0  0  1  1  0


2.2 Banding

Given the large set of signatures generated in Section 2.1, it is still too costly to compare similarities for all signature pairs. Therefore, a banding technique is consequently presented to filter dissimilar pairs.

The banding technique divides each signature into b bands, where each band consists of r elements. For each band of every signature, the banding technique maps the vector of r elements to a bucket array.

Figure 1: An illustrative example of banding technique

As shown in Figure 1, the i-th band of each signature maps to bucket array i. Intuitively, if a pair of signatures shares at least one common bucket, then the pair is likely to be similar, e.g., signature 1 and signature 2, and signature 2 and signature m in Figure 1. Such a pair with a common bucket is considered to be a candidate pair, and needs to be verified in the banding technique.
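A compact sketch of this banding step (our illustration; Python's built-in hash stands in for the bucket hash function, which a real implementation would fix explicitly):

from collections import defaultdict
from itertools import combinations

def band_buckets(signatures, b, r):
    """Map the i-th band (r consecutive elements) of every signature into
    bucket array i; returns (band_id, bucket_id) -> list of instance ids."""
    buckets = defaultdict(list)
    for inst_id, sig in signatures.items():
        for band_id in range(b):
            band = tuple(sig[band_id * r:(band_id + 1) * r])
            buckets[(band_id, hash(band))].append(inst_id)
    return buckets

def lsh_candidates(signatures, b, r):
    """Classical LSH: a pair sharing any bucket is a candidate pair."""
    pairs = set()
    for ids in band_buckets(signatures, b, r).values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs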

3 Personalized LSH

3.1 New Banding Technique

The candidates generated by LSH are not guaranteed to be similar pairs. Chances are that a pair of signatures is projected to identical bucket arrays even if the Jaccard similarity between the pair of instances is not larger than the given threshold. In the meantime, a pair of instances can be filtered out from the candidates, because their corresponding signatures are projected into disjoint bucket arrays, even if their Jaccard similarity is larger than the given threshold. The former case is called false positive, while the latter one is called false negative. Massive false positives will deteriorate the computational efficiency of LSH, while a large number of false negatives will lead to inaccurate results. To enhance the algorithm's precision and efficiency, we present here a new banding scheme to filter more dissimilar instance pairs. Intuitively, if two instances are highly alike, it is possible that many bands of the two corresponding signatures are mapped to identical buckets. For example, in Figure 1, there are at least 3 bands (i.e., the 1st, the


5th, and the b-th bands) of signature 1 and signature 2 which map to the same buckets (i.e., to the same buckets in the corresponding bucket arrays 1, 5, and b).

Therefore we change the banding scheme as follows. For any pair of instances, if the two corresponding signatures do not map into at least k (k ∈ [1, b]) identical buckets, the pair is filtered out. Otherwise, it is considered to be a candidate pair, and the exact Jaccard similarity is computed and verified. For the signatures shown in Figure 1, given k = 3, signature 1 and signature m, as well as signature 2 and signature m, are filtered.
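Only the selection rule changes relative to classical LSH: count the bucket arrays in which a pair collides, and keep pairs with at least k collisions. A sketch (ours), reusing band_buckets from Section 2.2:

from collections import Counter
from itertools import combinations

def plsh_candidates(signatures, b, r, k):
    """PLSH: keep a pair only if its two signatures fall into identical
    buckets in at least k of the b bucket arrays."""
    collisions = Counter()
    for ids in band_buckets(signatures, b, r).values():
        # Each bucket array contributes at most one collision per pair,
        # because a signature occupies exactly one bucket per band.
        collisions.update(combinations(sorted(ids), 2))
    return {pair for pair, count in collisions.items() if count >= k}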

3.2 Number of False Positives

A candidate pair ⟨S1, S2⟩ is a false positive if sim(S1, S2) < J and S1, S2 share at least k common bucket arrays. Since the efficiency of LSH mainly depends on the number of false positives, and most real applications demand high precision, we first derive the possible number of false positives generated by the new banding technique.

Lemma 1. The upper bound of the number of false positives generated by the new banding technique is equal to that of the original LSH, and the lower bound is approximately 0.

Proof. According to the law of large numbers, the probability that the minhash values of two feature vectors (e.g., S1, S2) are equal under any random permutation h is very close to the frequency with which identical values are observed at the same position in two long signatures of the corresponding feature vectors. That is,

P(\min h(S_1) = \min h(S_2)) = \lim_{n \to +\infty} \frac{|\{r : Sig(S_1)_r = Sig(S_2)_r\}|}{|Sig(S_1)|}   (1)

where n is the length of the signatures Sig(S1) and Sig(S2), and r is a position in the signatures, r ∈ [1, n].

Also, the probability that a random permutation of two feature vectors produces the same minhash value equals the Jaccard similarity of those instances [17]. That is,

P(\min h(S_1) = \min h(S_2)) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = sim(S_1, S_2).   (2)

Based on the above two equations, the probability that two instances with Jaccard similarity s are considered to be a candidate pair by the new banding technique, denoted by P_new(s), is

P_{new}(s) = 1 - \sum_{i=0}^{k-1} \binom{b}{i} (s^r)^i (1 - s^r)^{b-i}   (3)

where s is the Jaccard similarity of the two instances, r is the length of each band, and b is the number of bands. We can prove that the derivative of P_new(s) is greater than 0, which means P_new(s) is a monotonically increasing function of s.
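Equation (3) is cheap to evaluate numerically; a small sketch (ours):

from math import comb

def p_new(s, r, b, k):
    """Eq. (3): probability that a pair with Jaccard similarity s agrees in
    at least k of the b bands, where each band agrees with probability s^r."""
    p_band = s ** r
    return 1.0 - sum(comb(b, i) * p_band**i * (1.0 - p_band)**(b - i)
                     for i in range(k))

print(p_new(0.7, r=5, b=20, k=1))  # classical LSH (k = 1) acceptance probability
print(p_new(0.7, r=5, b=20, k=3))  # stricter PLSH acceptance probability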

The number of false positives, denoted by FP_max(k), is

FP_{max}(k) = \int_0^J N_s P_{new}(s) \, ds   (4)


where N_s denotes the total number of instance pairs whose Jaccard similarity is s in the instance set. Given an instance set, N_s is a constant, and J is the given similarity threshold.

The value of FP_max(k) depends on the similarity distribution of a given instance set. The upper bound of FP_max(k) equals that of the original LSH, FP_max(1). Without knowledge of the similarity distribution of the data set, the lower bound of the number of false positives cannot be directly derived. Hence, we introduce a threshold ε to ensure

\frac{FP_{max}(k)}{FP_{max}(1)} \le \epsilon   (5)

where ε is close to zero for increasing k. If k = ⌊Jn/r⌋, the lower bound of the number of false positives approximates 0, which indicates that the candidates generated by the proposed new banding technique are almost all truly similar pairs.

To understand the zero lower bound with k = ⌊Jn/r⌋, suppose there are two signatures with n elements each, ⌊Jn/r⌋ bands of which are mapped to the same buckets. At least r⌊Jn/r⌋ ≈ Jn elements in the two signatures are identical, because a band includes r elements. According to Equation (1) and Equation (2), the approximate similarity between the two corresponding instances is then greater than Jn/n = J. Hence, the similarity of each such pair of signatures is greater than the threshold J, and no false positives exist.

The introduction of ε also enables us to personalize the number of false positives, i.e., to vary the range of k for different ε. The range of k for a desired ε is a function of J, b, r that can be numerically solved. For example, given the specified arguments J = 0.7, b = 20, r = 5, Figure 2 shows the trend of FP_max(k)/FP_max(1) with k. The minimum of FP_max(k)/FP_max(1) is achieved when k = b. If the desired ε = 0.4, we can find a satisfying range of k ∈ [3, 20], since FP_max(2)/FP_max(1) ≥ ε and FP_max(3)/FP_max(1) ≤ ε.

Figure 2: An illustrative example of the number of false positives for various k


3.3 Number of False Negatives

False negatives are truly similar pairs mapped to disjoint bucket arrays. We also derive the upper and lower bounds of the number of false negatives generated by the proposed new banding technique.

Lemma 2. The upper bound of the number of false negatives generated by the new banding technique is

\left( \sum_{i=0}^{b-1} \binom{b}{i} (s^r)^i (1 - s^r)^{b-i} \right) N_{s \ge J}.

The lower bound is close to that of the original LSH.

Proof. Similar to Section 3.2, the number of false negatives, denoted by FN_max(k), is

FN_{max}(k) = \int_J^1 N_s (1 - P_{new}(s)) \, ds   (6)

FN_max(k) is a monotonically increasing function of k. Its lower bound is achieved when k = 1. The upper bound of FN_max(k) is obtained when k is the total number of bands. Hence, the upper bound of FN_max(k) is proportional to the number of similar instances N_{s≥J}:

\lim_{k \to b} FN_{max}(k) = \left( \sum_{i=0}^{b-1} \binom{b}{i} (s^r)^i (1 - s^r)^{b-i} \right) N_{s \ge J}.   (7)

For a desired number of false negatives, we consider the ratio between FN_max(k) and FN_max(1), in terms of

\frac{FN_{max}(k)}{FN_{max}(1)} \le \epsilon   (8)

where ε is a threshold, which is always greater than 1. By deriving the numerical solution of \int_J^1 N_s (1 - \sum_{i=0}^{b-1} \binom{b}{i} (s^r)^i (1 - s^r)^{b-i}) \, ds, the range of k for a desired ε is obtained. For example, given the arguments J = 0.7, b = 20, r = 5, Figure 3 shows the trend of FN_max(k)/FN_max(1). If the desired ε = 100, from Figure 3 we can find that FN_max(5)/FN_max(1) ≈ 70 and FN_max(6)/FN_max(1) ≈ 100, so the satisfying range is k ∈ [1, 5].

Figure 3: An illustrative example of the number of false negatives for various k

3.4 Balancing False Positives and False Negatives

In some application scenarios, we want a balance between false positives and false negatives. Here we analyze a special case where we want a desired aggregated number of false positives and false negatives. We use FNP_max to denote the sum of false positives and false negatives, defined as follows:

FNP_{max}(k) = FP_{max}(k) + FN_{max}(k)   (9)

The lower bound of FNP_max(k) depends on the similarity distribution of the given data set. However, since in most cases N_{s<J} ≫ N_{s≥J}, FNP_max(k) = N_{s<J}P_new(k) + N_{s≥J}(1 − P_new(k)) is less than FNP_max(1) when k is appropriately chosen.

Inspired by Sections 3.2 and 3.3, we can also use a threshold ε to obtain the desired degree of precision. As shown in Figure 4, the ratio FNP_max(k)/FNP_max(1) for J = 0.7, b = 20, r = 5 on a uniformly distributed data set first decreases as the value of k increases. The minimum, ε = 0.2674, is achieved when k = 4. Then the ratio increases as k becomes larger. If we require the new banding technique to have a higher precision than the traditional banding technique, in terms of the aggregated number of false negatives and false positives (i.e., FNP_max(k)/FNP_max(1) ≤ 1), then k ∈ [1, 12] is acceptable.

Figure 4: An illustrative example of balancing false positives and false negatives
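The curves in Figures 2-4 can be approximated numerically. The sketch below (ours) assumes a uniform similarity distribution, so that the constant N_s cancels in each ratio, and reuses p_new from Section 3.2:

def integrate(f, lo, hi, steps=2000):
    """Midpoint-rule integration; adequate for these smooth integrands."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width

def error_ratios(J, r, b, k):
    """FP, FN and FP+FN ratios of PLSH (threshold k) vs. LSH (k = 1),
    following Eqs. (4), (6) and (9) with N_s constant."""
    fp = lambda kk: integrate(lambda s: p_new(s, r, b, kk), 0.0, J)
    fn = lambda kk: integrate(lambda s: 1.0 - p_new(s, r, b, kk), J, 1.0)
    return (fp(k) / fp(1),
            fn(k) / fn(1),
            (fp(k) + fn(k)) / (fp(1) + fn(1)))

# Scan k to find the ranges satisfying a desired epsilon, as in Figures 2-4.
for k in range(1, 9):
    print(k, error_ratios(J=0.7, r=5, b=20, k=k))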


4 MapReduce Implementation of PLSH

In this section, we first introduce the MapReduce framework. Then we present the details of implementing Personalized LSH with MapReduce, including minhashing, banding and verification.

4.1 MapReduce

MapReduce [26] is a framework for running parallel algorithms on large scale data sets using a cluster of computers. MapReduce allows for distributed processing of data, which is partitioned and stored in a distributed file system (HDFS). Data is stored in the form of ⟨key, value⟩ pairs to facilitate computation.

As illustrated in Figure 5, the MapReduce data flow consists of two key phases: the map phase and the reduce phase. In the map phase, each computing node works on the local input data and processes the input ⟨key, value⟩ pairs into a list of intermediate ⟨key, value⟩ pairs in a different domain. The ⟨key, value⟩ pairs generated in the map phase are hash-partitioned and sorted by the key, and then they are sent across the computing cluster in a shuffle phase. In the reduce phase, pairs with the same key are passed to the same reduce task. User-provided functions are processed in the reduce task on each key to produce the desired output.

Figure 5: An illustrative example of MapReduce data flow, with 2 reduce tasks.

In similarity joins, to generate ⟨key, value⟩ pairs, we first segment the join attributes in each instance into tokens. Each token is denoted by a unique integer id. In the following steps, the token id is used to represent each feature.

4.2 MinHashing

We use one MapReduce job to implement minhashing. Before the map function is called, ⟨token, id⟩ pairs are loaded. In the map task, each instance is represented by the set of token ids {id} present in this instance.

In the reduce task, for each instance, the reducer produces a signature with length n. As described in Section 2, the minhashing function requires random permutations of features. But it is not feasible to permute massive features explicitly. Instead, a random


hash function is adopted to simulate this task. Suppose the total number of features is tc and the integer set is X = [0, 1, ..., tc − 1]; we choose the hash function

h(i) = (a \cdot i + b) \bmod tc   (10)

where a, b, i ∈ [0, tc − 1], a and tc must be relatively prime, and x mod y denotes the remainder of x divided by y. This function maps a number i ∈ [0, tc − 1] to another number h(i) ∈ [0, tc − 1] with no collision. Hence the resulting list h(0), h(1), ..., h(tc − 1) is a permutation of the original features.
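A sketch of the simulated permutation (ours); the assertion checks the no-collision property that holds whenever gcd(a, tc) = 1:

from math import gcd

def permutation(a, b, tc):
    """h(i) = (a*i + b) mod tc, a bijection on [0, tc-1] when gcd(a, tc) = 1."""
    assert gcd(a, tc) == 1, "a and tc must be relatively prime"
    return [(a * i + b) % tc for i in range(tc)]

perm = permutation(a=3, b=4, tc=7)
print(perm)                             # [4, 0, 3, 6, 2, 5, 1]
assert sorted(perm) == list(range(7))   # a true permutation: no collisions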

Furthermore, since n independent permutations are required to produce a signature for each instance, we prove that there are more than n different permutations.

Lemma 3. Given tc features, the desired signature length n, and the hash function h(i) = (a · i + b) mod tc, where a, b and i ∈ [0, tc − 1], h produces more than n different permutations.

Proof. Assume a permutation x0, x1, ..., x_{tc−1} is generated by the hash function h with parameters a and b. Then b = x0, x1 = (a + b) mod tc, a = (x1 − b + tc) mod tc, and x_{k+1} = (x_k + a) mod tc for k ∈ [0, tc − 2]. Hence, for a specified a, different integers b ∈ [0, tc − 1] produce different permutations. Euler's totient function φ(n′) is an arithmetic function that counts the number of totatives of an integer n′, which indicates that the number of suitable values of a is φ(tc). Therefore, there are φ(tc) · tc pairs ⟨a, b⟩ which produce φ(tc) · tc different permutations. Since φ(tc) · tc ≥ tc ≥ n, the hash function h produces more than n different permutations.

4.3 Banding

The banding technique filters dissimilar pairs. As shown in Figure 6, we implement the banding procedure in two MapReduce phases.

In the first phase, the signatures are input to the map function. The map function divides each signature into b bands, each band consisting of r elements, and then each band is mapped to a bucket. The outputs are of the form ⟨[bandId, bucketId], instanceId⟩. In other words, bandId and bucketId are combined as the key, and instanceId is the corresponding value. For example, as shown in Figure 6, the signature of instance 1 is (1 2 11 3 4 23 ...), and that of instance 2 is (2 2 13 3 4 23 ...). Suppose r = 3; then instance 1 is divided into at least 2 bands, (1 2 11) and (3 4 23). The two bands are mapped to bucket 11 in bucket array 1 and bucket 12 in bucket array 2. So the outputs of the map function for instance 1 include ⟨[1, 11], 1⟩ and ⟨[2, 12], 1⟩. Analogously, ⟨[1, 5], 2⟩ and ⟨[2, 12], 2⟩ are part of the map outputs for instance 2.

In the reduce task, all instances with the same bandId and bucketId are assigned to the same reduce task. An output of the form ⟨[instanceId1, instanceId2], 1⟩ is produced for every pair of such instances, where the fixed value 1 represents one occurrence of the pair [instanceId1, instanceId2]. For instance, ⟨[1, 5], 2⟩ and ⟨[1, 5], 23⟩ are map outputs for instances 2 and 23; since they have the same bandId 1 and bucketId 5, the reduce task produces the pair ⟨[2, 23], 1⟩. That is to say, instance 2 and instance 23 are likely to be a candidate pair, because their first bands are both mapped to the 5th bucket.


In the second phase, the map task outputs what was produced in the first phase. To minimize the network traffic between the map and reduce functions, we use a combine function to aggregate the outputs generated by the map function into partial local counts on each computing node. Subsequently, the reduce function computes the total counts for each instance pair. Outputs of the reduce function are of the form ⟨[instanceId1, instanceId2], count⟩, where count is the global frequency for the instance pair. Personalized LSH eliminates those pairs of instances whose count is less than the given threshold k. As shown in Figure 6, suppose k = 12; the count of instance pair ⟨[2, 23], 15⟩ is greater than k, so [2, 23] is a candidate pair.
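The two banding phases can be emulated with plain map/reduce functions (a single-process Python sketch of the data flow, not the actual Hadoop code; identifiers are ours):

from collections import defaultdict

def banding_phase1(signatures, b, r):
    """Phase 1. Map: emit <[bandId, bucketId], instanceId>; reduce: emit
    <[id1, id2], 1> for every pair of instances sharing a bucket."""
    groups = defaultdict(list)                   # shuffle by (bandId, bucketId)
    for inst_id, sig in signatures.items():
        for band_id in range(b):
            bucket_id = hash(tuple(sig[band_id * r:(band_id + 1) * r]))
            groups[(band_id, bucket_id)].append(inst_id)
    for ids in groups.values():                  # reduce
        ids.sort()
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                yield (ids[i], ids[j]), 1

def banding_phase2(pair_counts, k):
    """Phase 2. Reduce: sum the per-pair counts and keep pairs that share
    at least k bucket arrays."""
    totals = defaultdict(int)
    for pair, count in pair_counts:
        totals[pair] += count
    return [pair for pair, total in totals.items() if total >= k]

# candidates = banding_phase2(banding_phase1(signatures, b=20, r=5), k=3)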

Figure 6: An illustrative example of data flow in banding stage

4.4 Verification

Candidate pairs generated in Section 4.3 need to be checked in the verification stage. For each candidate instance pair [instanceId1, instanceId2], the signature similarity is computed. Because of the massive number of instances, this is not a trivial task.

In order to reduce the storage cost for each reduce task, the set of signatures for all instances is partitioned into small files according to the instance ids. In this way, each reduce task holds only two different small partitions, where the first partition is for the first instance in the pair [instanceId1, instanceId2], and the second partition is for the second instance. For example, for each reduce input pair [instanceId1, instanceId2], partition 1 contains the signature for instanceId1 while the other partition contains the signature for instanceId2. All pairs of instance ids contained in the same partitions have to be assigned to the same reduce task. Hence, the map task calculates the reduce task id for each pair according to its instance ids and produces an output ⟨reduceTaskId, [instanceId1, instanceId2]⟩. In the reduce task, the signature similarity for each pair of instances is computed. Finally, the reduce task outputs the pairs whose similarities are greater than the given threshold J. The outputs are of the form ⟨[instanceId1, instanceId2], sim⟩, where sim is the Jaccard similarity of instanceId1 and instanceId2. As shown in Figure 7, suppose the given threshold is J = 0.7; the similarity between signature 2 and signature 5 is 0.73, which is greater than J, so the output is ⟨[2, 5], 0.73⟩.
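Per candidate pair, the reduce-side check is a single signature comparison; a sketch (ours):

def verify(candidates, signatures, J):
    """Keep candidate pairs whose signature similarity reaches threshold J."""
    results = []
    for id1, id2 in candidates:
        sig1, sig2 = signatures[id1], signatures[id2]
        sim = sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)
        if sim >= J:
            results.append(((id1, id2), sim))
    return results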


Figure 7: An illustrative example of the data flow in verification stage

5 Experimental Evaluation

In this section, we design a series of experiments to analyze the performance of the proposed PLSH algorithms. We want to study the following research questions.

• The efficiency of PLSH. For example, is PLSH fast, compared with other similarity join algorithms? Can it scale to large scale data sets? Do different values of the parameters affect the efficiency of PLSH?

• The effectiveness of PLSH. For example, is PLSH accurate? Can it generate fewer false positives and false negatives?

• The personalization of PLSH. For example, how should we set the parameters of PLSH to generate the desired numbers of false positives and false negatives? Is the tailored PLSH more appropriate for different applications?

The experiments are conducted on a 6-node cluster. Each node has one Intel i7-3820 3.6GHz processor with four cores, 32GB of RAM, and a 100GB hard disk. On each node, we install Ubuntu 12.04, 64-bit server edition, Java 1.6 with a 64-bit server JVM, and Hadoop 1.0. Apache Hadoop is an open source implementation of MapReduce. We run 2 reduce tasks in parallel on each node.

We use the DBLP and CITESEERX data sets and increase the number of instances when needed. The original DBLP data set has approximately 1.2M instances, while the original CITESEERX has about 1.3M instances. As shown in Figure 8, when increasing the number of instances of these two data sets, CITESEERX occupies larger storage than DBLP.

In our experiments, we tokenize the data by word. The concatenation of the paper title and the list of authors is the join attribute (i.e., paper title and the list of authors are two attributes in each instance). The default threshold of Jaccard similarity is J = 0.7. The default length of the hash signature n is 100, and each band has r = 5 elements.

5.1 Efficiency of PLSH

The ground truth is the set of truly similar pairs generated by fuzzyjoin [12]. A 3-stage MapReduce approach is implemented for the end-to-end set-similarity join algorithm, and it selects instance pairs with Jaccard similarities greater than the given threshold.

Figure 8: Dataset sizes of DBLP and CITESEERX

We first compare the efficiency of our method PLSH with fuzzyjoin on the DBLP and CITESEERX data sets. The CPU times of the different algorithms are shown in Figure 9. We can conclude from Figure 9 that: (1) Generally, PLSH is faster than fuzzyjoin. When the number of instances is 7.5M in DBLP and CITESEERX, the time cost of fuzzyjoin is nearly two times that of PLSH. (2) When the data size increases, the efficiency improvement is more significant. This suggests that PLSH is more scalable to large scale data sets. (3) Fuzzyjoin takes roughly equivalent CPU time on DBLP and CITESEERX of similar size, while PLSH works faster on DBLP than on CITESEERX. This suggests that PLSH is more affected by the similarity distribution in a data set.

Figure 9: CPU time for various data sizes

Next we analyze the effect of the number of reduce tasks on algorithm efficiency. Because the number of reduce tasks in the verification step differs from that of the previous steps, we record the CPU time of the first three stages (minhashing, banding1, banding2; note that the banding procedure is completed in two MapReduce phases) and of the verification step separately. For the three stages, we vary the number of reduce tasks from 2 to 12 with a step size of 2. The data sets are DBLP data sets containing 0.6M, 1.2M, 2.5M, 5M, and 7.5M instances, respectively.

Figure 10: CPU time of the stages previous to verification for different data sizes with various numbers of reduce tasks

From Figure 10 we have the following observations. (1) In general, the time cost of PLSH decreases with more reduce tasks. This verifies our assumption that the parallel mechanism improves the algorithm's efficiency. (2) When there are fewer than 2.5M instances, the CPU time decreases slowly as the number of reduce tasks increases. This is because the start-up time of MapReduce cannot be shortened even though the number of tasks increases. When the CPU time is more than 2000 seconds, the start-up time is just a small fraction of the total time. Therefore, when the number of instances is relatively large, the improvement is more significant. This suggests that the parallel mechanism of PLSH is more suitable for large scale data sets.

We further study the speedup with respect to the number of reduce tasks. We set the time cost with 2 reduce tasks as the baseline, and plot the ratio between the running time and the baseline for various numbers of reduce tasks. From Figure 11, we observe that when the size of the data set increases, the speedup is more significant. But none of the curves is a straight line, which suggests that the speedup is not linear with respect to the number of reduce tasks. The limited speedup is due to two main reasons. (1) There are 5 jobs to find the similar instances. The start-up time of each job cannot be shortened. (2) As the size of the data set increases, more data is sent through the network and more data is merged and reduced; thus the traffic cost cannot be reduced.

The number of reduce tasks is fixed in the verification stage. However, in this stage, the parameter k of the proposed PLSH has a significant effect on the running time. We analyze the performance of PLSH with various values of k. From Figure 12, we have the following observations. (1) In general, as the value of k increases, the time cost of the verification stage decreases. This suggests that with a larger k, the proposed PLSH can better prune dissimilar pairs and enhance efficiency. (2) When the data set is small, i.e., the number of instances is smaller than 2.5M, the enhancement is not obvious. The underlying reason is that there are fewer candidates in a small data set. (3) The speedup is significant on large scale data sets (with more than 5M instances), which suggests the scalability of PLSH.

5.2 Effectiveness Evaluation

In this subsection, we study the effectiveness of PLSH, and how the length of the signature, the defined threshold, and the parameter k affect effectiveness.


Figure 11: Relative CPU time for different data sizes with various numbers of reduce tasks

Figure 12: Verification time with various k

We first define four types of instance pairs. False positives (FP) are dissimilar pairs (similarity lower than the threshold) that are selected as candidates by PLSH. True positives (TP) are similar pairs (similarity higher than the threshold) that are selected as candidates by PLSH. False negatives (FN) are similar pairs (similarity higher than the threshold) that are pruned by PLSH. True negatives (TN) are dissimilar pairs (similarity lower than the threshold) that are pruned by PLSH.

The numbers of FP with various signature lengths and k, with similarity threshold J = 0.7, are shown in Figure 13. We have the following observations. (1) For a fixed signature length, the number of false positives significantly decreases with larger k. (2) For larger k (e.g., k = 5), prolonging the signature results in fewer false positives. (3) For small k and shorter signatures (< 400 elements), the number of false positives is uncertain. The above three observations indicate that the proposed PLSH can achieve a high precision with larger k and longer signatures.

The numbers of FP with various similarity thresholds and k, with signature length n = 500, are shown in Figure 14. We have the following observations. (1) For a fixed similarity threshold, the number of false positives significantly decreases with larger k. (2) For a fixed k, the number of false positives significantly decreases with a larger similarity threshold. This is because with a larger similarity threshold, there are fewer qualified candidates.

Figure 13: The number of false positives with various signature lengths and k

The numbers of FN with various signature lengths and k, with similarity threshold J = 0.7, are shown in Figure 15. We have the following observations. (1) For a fixed signature length, the number of false negatives generated by the original LSH (k = 1) is smaller than that of PLSH. (2) For the original LSH, shorter signatures generate more FN. (3) In terms of FN, PLSH is not very sensitive to the length of the signature.

Figure 14: The number of false positives with various similarity thresholds and k

The numbers of FN with various similarity thresholds and k, with signature length n = 500, are shown in Figure 16. Although the number of false negatives for PLSH is larger than that of the original LSH, we observe that for a large similarity threshold J = 0.9 the difference is minute. In most applications, we want to search for very similar pairs; thus PLSH will still perform well in most scenarios.

Since the false negatives generated by PLSH are in general more numerous than those of LSH, we further use precision and specificity to evaluate the performance of PLSH. Precision is defined as PR = TP/(TP + FP), and specificity is defined as TN/(FP + TN). In this subsection, we want to analyze the effect of k, n, and J in terms of precision and specificity. We vary k from 1 to 5 with a step size of 1, n from 200 to 500 with a step size of 100, and J from 0.7 to 0.9 with a step size of 0.1.

From Figures 17 to 20, we have the following observations. (1) The PLSH method performs better than LSH in terms of specificity and precision, which demonstrates the potential of PLSH in many data mining and bioinformatics applications. (2) Generally, longer signatures outperform shorter signatures.


Figure 15: The number of false negatives with various signature lengths and k

Figure 16: The number of false negatives with various similarity thresholds and k

5.3 Personalized LSH

As proved in Section 3, an important characteristic of our proposed PLSH is that it is capable of tailoring the numbers of false positives and false negatives for different applications. In this subsection, we present a pilot study on the tailoring capability of PLSH. We first numerically derive the appropriate k for different degrees of desired precision. As shown in Table 3, the required precision is measured in terms of the ratio of false positives vs. conventional LSH, FP_max(k)/FP_max(1), the ratio of false negatives vs. conventional LSH, FN_max(k)/FN_max(1), and the ratio of total errors (the sum of false positives and false negatives) vs. conventional LSH, FNP_max(k)/FNP_max(1). For example, we can see that if we want to generate less than half of the false positives of conventional LSH, i.e., FP_max(k)/FP_max(1) ≤ 0.5, we should set k = 3.

Figure 17: Specificity with various n and k

Figure 18: Specificity with various J and k

We then use the different settings of k to generate collaborator recommendations on the DBLP and CITESEERX data sets. We keep the authors who have published more than 25 papers. We use a modified collaborative filtering approach [25] to generate recommendations. For each paper pa published by author a, PLSH is utilized to find a set of similar publications {r} with similarity sim(r, pa) ≥ J, J = 0.7. We then gather the set of authors {a(r)} who wrote the similar publications. In recommendation systems, the similar authors are treated as nearest neighbors. Each nearest neighbor is assigned a score, which is the accumulated similarity between the publications of the nearest neighbor and the publications of the author a. For each author, the top 5 nearest neighbors with the largest accumulated similarities are returned as recommended collaborators. We manually annotate the results. The evaluation metric is the precision at the top 5 results (P@5). The average P@5 results are shown in Table 3. We have the following conclusions. (1) When the number of false positives is reduced, we can generate more precise nearest neighbors, and thus the recommendation performance is boosted. (2) When the number of false positives is too small, due to the sparsity problem, the collaborative filtering framework is not able to generate enough nearest neighbors, and thus the recommendation performance deteriorates. (3) A recommendation system achieves its best performance with a fine-tuned parameter k, which optimizes the trade-off between false positives and false negatives.

6 Related Work

There is a rich body of research on the similarity joins problem. Commonly adopted similarity measurements include cosine similarity [7], edit distance [8], Jaccard similarity [12], hamming distance [10], synonym based similarity [27], and so on.

Figure 19: Precision with various n and k

Figure 20: Precision with various J and k

The methodologies used in similarity joins can be categorized into two classes. The first class adopts dimension reduction technologies [16], so that the similarity computation is conducted in a lower dimensional space. The second class utilizes filtering approaches to avoid unnecessary similarity computations, including the length filter [7], prefix filter [20], suffix filter [21], and positional filter [8]. To take advantage of both categories, a series of hashing methods (such as minhashing [17], minwise hashing [18], Locality Sensitive Hashing (LSH) [10, 19], and so on) combine dimension reduction and filtering approaches by first using a signature scheme [28] to project the original feature vectors to low dimensional signatures, and then filtering unqualified signatures.

Table 3: Recommendation performance with different k.

k   FPmax(k)/FPmax(1)   FNmax(k)/FNmax(1)   FNPmax(k)/FNPmax(1)   P@5 DBLP   P@5 CITESEERX
1   1                   7                   1                     0.71       0.82
2   0.9                 20                  0.9                   0.83       0.90
3   0.5                 42                  0.5                   0.91       0.99
4   0.2                 71                  0.3                   0.99       0.93
5   0.1                 103                 0.4                   0.97       0.84

Recently, parallel algorithms have been adopted on clusters of machines to improve the efficiency of similarity joins for large scale and high dimensional data sets. We have seen an emerging trend of research efforts in this line, including a 3-stage MapReduce approach for the end-to-end set-similarity join algorithm [12, 13], fast computation of inner products for large scale news articles [13], a new ant colony optimization algorithm parallelized using MapReduce [14] to select features in a high dimensional space, the LSS algorithm exploiting the high data throughput of GPUs [15], etc.

In the era of big data, approximate computation is also a possible solution, if the approximate result, obtained with much less computation cost, is close to the accurate solution. Approximate computation can be implemented not only through hardware modules [29, 30] but also through approximate algorithms [31, 32]. Unlike exact similarity joins [25], approximate similarity joins and top-k similarity joins [34] are better adapted to different application domains [33].

7 Conclusion

The problem of similarity joins based on Jaccard similarity is studied. We propose a new banding technique which reduces the number of false positives, the number of false negatives, or the sum of both. Our experiments show that the new method is both efficient and effective.

References

[1] Claudio Carpineto, Stanislaw Osinski, Giovanni Romano, and Dawid Weiss. A survey of web clustering engines. ACM Computing Surveys (CSUR), 41(3):17, 2009.

[2] Rıdvan Saracoglu, Kemal Tutuncu, and Novruz Allahverdi. A fuzzy clustering approach for finding similar documents using a novel similarity measure. Expert Systems with Applications, 33(3):600–605, 2007.

[3] Timothy C Hoad and Justin Zobel. Methods for identifying versioned and plagiarized documents. Journal of the American Society for Information Science and Technology, 54(3):203–215, 2003.

[4] Brian Kulis and Kristen Grauman. Kernelized locality-sensitive hashing for scalable image search. In Computer Vision, 2009 IEEE 12th International Conference on, pages 2130–2137. IEEE, 2009.

[5] Lei Li, Li Zheng, Fan Yang, and Tao Li. Modeling and broadening temporal user interest in personalized news recommendation. Expert Systems with Applications, 41(7):3168–3177, 2014.

[6] Jiannan Wang, Jianhua Feng, and Guoliang Li. Trie-join: Efficient trie-based string similarity joins with edit-distance constraints. Proceedings of the VLDB Endowment, 3(1-2):1219–1230, 2010.


[7] Roberto J Bayardo, Yiming Ma, and Ramakrishnan Srikant. Scaling up all pairs similarity search. In Proceedings of the 16th International Conference on World Wide Web, pages 131–140. ACM, 2007.

[8] Chuan Xiao, Wei Wang, and Xuemin Lin. Ed-join: An efficient algorithm for similarity joins with edit distance constraints. Proceedings of the VLDB Endowment, 1(1):933–944, 2008.

[9] Guoliang Li, Dong Deng, Jiannan Wang, and Jianhua Feng. Pass-join: A partition-based method for similarity joins. Proceedings of the VLDB Endowment, 5(3):253–264, 2011.

[10] Aristides Gionis, Piotr Indyk, Rajeev Motwani, et al. Similarity search in high dimensions via hashing. In VLDB, volume 99, pages 518–529, 1999.

[11] Ngo Xuan Bach, Nguyen Le Minh, and Akira Shimazu. Exploiting discourse information to identify paraphrases. Expert Systems with Applications, 41(6):2832–2841, 2014.

[12] Rares Vernica, Michael J Carey, and Chen Li. Efficient parallel set-similarity joins using MapReduce. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pages 495–506. ACM, 2010.

[13] Tamer Elsayed, Jimmy Lin, and Douglas W Oard. Pairwise document similarity in large collections with MapReduce. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, pages 265–268. Association for Computational Linguistics, 2008.

[14] M Janaki Meena, KR Chandran, A Karthik, and A Vijay Samuel. An enhanced ACO algorithm to select features for text categorization and its parallelization. Expert Systems with Applications, 39(5):5861–5871, 2012.

[15] Michael D Lieberman, Jagan Sankaranarayanan, and Hanan Samet. A fast similarity join algorithm using graphics processing units. In Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on, pages 1111–1120. IEEE, 2008.

[16] Mevlut Ture, Imran Kurt, and Zekeriya Akturk. Comparison of dimension reduction methods using patient satisfaction data. Expert Systems with Applications, 32(2):422–426, 2007.

[17] Andrei Z Broder, Moses Charikar, Alan M Frieze, and Michael Mitzenmacher. Min-wise independent permutations. Journal of Computer and System Sciences, 60(3):630–659, 2000.

[18] Ping Li and Christian Konig. b-Bit minwise hashing. In Proceedings of the 19th International Conference on World Wide Web, pages 671–680. ACM, 2010.


[19] Chyouhwa Chen, Shi-Jinn Horng, and Chin-Pin Huang. Locality sensitive hashing for sampling-based algorithms in association rule mining. Expert Systems with Applications, 38(10):12388–12397, 2011.

[20] Surajit Chaudhuri, Venkatesh Ganti, and Raghav Kaushik. A primitive operator for similarity joins in data cleaning. In Data Engineering, 2006. ICDE'06. Proceedings of the 22nd International Conference on, pages 5–5. IEEE, 2006.

[21] Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu, and Guoren Wang. Efficient similarity joins for near-duplicate detection. ACM Transactions on Database Systems (TODS), 36(3):15, 2011.

[22] Anand Rajaraman and Jeffrey David Ullman. Mining of Massive Datasets. Cambridge University Press, 2011.

[23] Joel Pinho Lucas, Saddys Segrera, and María N Moreno. Making use of associative classifiers in order to alleviate typical drawbacks in recommender systems. Expert Systems with Applications, 39(1):1273–1283, 2012.

[24] Joel P Lucas, Nuno Luz, María N Moreno, Ricardo Anacleto, Ana Almeida Figueiredo, and Constantino Martins. A hybrid recommendation approach for a tourism system. Expert Systems with Applications, 40(9):3532–3550, 2013.

[25] Jesus Bobadilla, Fernando Ortega, Antonio Hernando, and Abraham Gutierrez. Recommender systems survey. Knowledge-Based Systems, 46:109–132, 2013.

[26] Tom White. Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2012.

[27] Arvind Arasu, Surajit Chaudhuri, and Raghav Kaushik. Transformation-based framework for record matching. In Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on, pages 40–49. IEEE, 2008.

[28] Arvind Arasu, Venkatesh Ganti, and Raghav Kaushik. Efficient exact set-similarity joins. In Proceedings of the 32nd International Conference on Very Large Data Bases, pages 918–929. VLDB Endowment, 2006.

[29] W.-T.J. Chan, A.B. Kahng, Seokhyeong Kang, R. Kumar, and J. Sartori. Statistical analysis and modeling for error composition in approximate computation circuits. In Computer Design (ICCD), 2013 IEEE 31st International Conference on, pages 47–53, Oct 2013.

[30] Jiawei Huang, John Lach, and Gabriel Robins. A methodology for energy-quality tradeoff using imprecise hardware. In Proceedings of the 49th Annual Design Automation Conference, DAC '12, pages 504–509, New York, NY, USA, 2012. ACM.

[31] Michael W. Mahoney. Approximate computation and implicit regularization for very large-scale data analysis. In Proceedings of the 31st Symposium on Principles of Database Systems, PODS '12, pages 143–154, New York, NY, USA, 2012. ACM.


[32] Janko Strassburg and Vassil Alexandrov. On scalability behaviour of Monte Carlo sparse approximate inverse for matrix computations. In Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA '13, pages 6:1–6:8, New York, NY, USA, 2013. ACM.

[33] Kaushik Chakrabarti, Surajit Chaudhuri, Venkatesh Ganti, and Dong Xin. An efficient filter for approximate membership checking. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 805–818. ACM, 2008.

[34] Chuan Xiao, Wei Wang, Xuemin Lin, and Haichuan Shang. Top-k set similarity joins. In Data Engineering, 2009. ICDE'09. IEEE 25th International Conference on, pages 916–927. IEEE, 2009.
