
Page 1:

Locality-Sensitive Hashing & Minhash on Facebook friend links data & friend recommendation

Chengeng Ma

Stony Brook University

2016/03/05

Page 2:

1. What are Locality-Sensitive Hashing & Minhash?

• If you are already familiar with LSH and Minhash, please go directly to page 12; the following pages are just fundamental background on this topic, which you can find in more detail in the book Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman.

Page 3:

What are LSH & Minhash about?

• Locality-Sensitive Hashing (LSH) & Minhash are two profoundly important methods in big data for finding similar items.

• On Amazon, if you can find two similar persons, you can recommend to one person the items the other has purchased.

• For Google, Baidu, …, users always hope the search engine can find pictures similar to the one they have uploaded.

Page 4:

Calculating the similarity between each pair is a lot of computation (why LSH?)

• If you have $10^6$ items within your data, you will need almost $0.5 \times 10^{12}$ computations to know the similarities between each pair (see the quick check after this list).

• You will need to parallelize a lot of tasks to deal with this huge amount of computation.

• You can do this with the help of Hadoop, but you can do better with the help of LSH & Minhash.

• LSH hashes an item to a bucket based on the feature list that item has.

• If two items are quite similar to each other in their feature lists, then they have a large probability of being hashed into the same bucket.

• You can amplify this effect to different extents by setting parameters.

• Finally, you only need to compute similarities for the pairs formed by the items within the same bucket.
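As a quick check of that pair count: the number of unordered pairs among $n$ items is

$$\binom{n}{2} = \frac{n(n-1)}{2} \approx \frac{(10^6)^2}{2} = 0.5 \times 10^{12} \quad \text{for } n = 10^6.$$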

Page 5:

How does Minhash come in?

• LSH needs you to keep the feature list of each item in a matrix-like format (the row order matters).

• If the size of the universal set is fixed or small, e.g., a fingerprint array, then LSH alone can work well.

[Figure: example characteristic matrix. The 1st column represents the items person S1 has purchased; the 1st row represents who has purchased item a.]

Page 6:

How does Minhash come in?

• For the example matrix above, the Jaccard similarity is 2/7.

• However, if the universal set is large or not fixed in size, e.g., the items purchased by each account, friend lists on a social network, …

• Then formatting the dataset into a matrix is not efficient, since the dataset is usually very sparse.

• This is where Minhash comes in, provided the similarity between two feature lists is measured as Jaccard similarity.
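For reference, the Jaccard similarity of two sets is the size of their intersection divided by the size of their union:

$$J(S_a, S_b) = \frac{|S_a \cap S_b|}{|S_a \cup S_b|},$$

so the 2/7 above means the two columns share 2 items out of 7 distinct items in total.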

Page 7:

What is a minhash value?

• Permute the rows of the original matrix.

• For each column (set), the row index of the 1st non-empty element is the minhash value of that column.

Original matrix

Permute to a different order: b, e, a, d, c.

H(S1)=a, H(S2)=c, H(S3)=b, H(S4)=a.

Page 8:

Minhash's property (similarity preserved):

• There are 3 kinds of rows between sets $S_a$ & $S_b$:

(X): both sets have 1;

(Y): one has 1, the other has 0;

(Z): both sets have 0.

π½π‘Žπ‘ =|𝑋|

𝑋 + |π‘Œ|

Pr β„Ž π‘†π‘Ž = β„Ž 𝑆𝑏 =|𝑋|

𝑋 + |π‘Œ|

• If you do 100 different minhashes, you reduce one dimension of the matrix from an unknown large number to 100.

• The probability that two sets share the same minhash value equals the Jaccard similarity between them: scanning the permuted rows from the top, the first row that is not of type (Z) decides both minhash values, and it is of type (X) with probability $|X|/(|X|+|Y|)$.

$$\Pr[h(S_a) = h(S_b)] = J_{ab}$$

Page 9:

Permutations can be simulated by hash functions

• For the j-th column of the original matrix, find all the non-empty elements and feed their row indexes into the i-th hash function; the minimum output is the element SIG(i, j).

• Hash function: $(a \cdot x + b) \,\%\, N$:

• where N is a prime, equal to or slightly larger than the size of the universal set (the number of rows of the original matrix),

• a & b must be integers within [1, N−1].

• The result is the signature matrix, where the row index corresponds to hash functions and the column index to sets.

For example, we use 2 hash functions to simulate 2 permutations: (x+1)%5 and (3x+1)%5, where x is the row index.

[Figure: the resulting signature matrix SIG.]
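To make the simulation concrete, here is a minimal Java sketch of this computation. The four sets below stand in for the slide's matrix (they follow the textbook's example, so treat them as an assumption); the two hash functions are the ones named above.

    import java.util.Arrays;

    public class MinhashExample {
        // The two permutations simulated by hash functions, as on the slide.
        static long h1(long x) { return (x + 1) % 5; }
        static long h2(long x) { return (3 * x + 1) % 5; }

        public static void main(String[] args) {
            // Hypothetical characteristic matrix: each set lists the row
            // indexes (0..4) of its non-empty elements.
            long[][] sets = {
                {0, 3},        // S1
                {2},           // S2
                {1, 3, 4},     // S3
                {0, 2, 3}      // S4
            };
            long[][] sig = new long[2][sets.length]; // 2 hash functions x 4 sets
            for (long[] row : sig) Arrays.fill(row, Long.MAX_VALUE);

            // SIG(i, j) = minimum of hash function i over set j's row indexes.
            for (int j = 0; j < sets.length; j++) {
                for (long x : sets[j]) {
                    sig[0][j] = Math.min(sig[0][j], h1(x));
                    sig[1][j] = Math.min(sig[1][j], h2(x));
                }
            }
            for (long[] row : sig) System.out.println(Arrays.toString(row));
            // Prints [1, 3, 0, 1] and [0, 2, 0, 0] for these example sets.
        }
    }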

Page 10:

Now that you have the signature matrix, you use it instead of the original matrix to do LSH.

• Divide the signature matrix into b bands, each of which has r rows.

• For each band range, build an empty hash table and hash each column (the portion within the band range) into a bucket, so that only identical bands are hashed into the same bucket.

• Columns within the same bucket are candidates: you should form pairs from them and calculate similarities.

• Take the union over the different band ranges and filter out the false positives.

Page 11:

Why does LSH work? --- the amplification effect

[Figure: S-curve of the probability of becoming a candidate vs. Jaccard similarity.]

Page 12:

2. Details of my class project: dataset

• User-to-user links from the Facebook New Orleans network data.

• The data was created by Bimal Viswanath et al. and used for their paper On the Evolution of User Interaction in Facebook.

• It can be downloaded at http://socialnetworks.mpi-sws.org/data/facebook-links.txt.gz

• It has 63,731 persons and 1,545,686 links, and is 10.4 MB in size.

• The data is not large, but as a training exercise, I will use Hadoop throughout this project.

Page 13:

My class project plan:

• First, find similar persons based on users' friend lists, where LSH and Minhash are implemented in Hadoop.

• The similar persons are called "close friends" in this project.

• Then recommend to each user the persons who are friends of his/her close friends but not yet his/her friends.

• It generally sounds like collaborative filtering.

• Two persons who have similar friend lists are considered "close friends", since they must have some relationship in the real world, e.g., schoolmates, workmates, teammates, …

• If you are a good friend of someone, you may like to know more of his/her friends.

• We do not set the similarity threshold too high, since finding a duplicate of you is not interesting.

Page 14:

Why not just use common friend counts?

• The classical way is based on the number of common friends.

• However, some persons have a lot of common friends with you but have nothing to do with you, e.g., celebrities, politicians, salesmen who want to sell their stuff through the social network, or even swindlers…

• People use social networks to find friends who can physically reach them, not persons too far away from them.

• Most of my friends may like a pop singer and become friends of his. Based on common friends, the system will recommend that pop singer to me.

• But the pop singer can never remember me, since he has millions of friends on the site.

Page 15:

Preparation work:

• 1. Put the data into the format below, where j represents the j-th person and Pj is the list of friends of person j.

• 2. In this study, 63,731 is both the number of sets to compare and the size of the universal element set, because both the key j and the elements within set Pj are user ids.

1: P1
2: P2
… …
n: Pn

• 3. 63731 is not a prime (101 × 631). Only a prime modulus can simulate true permutations, so we use 63737 instead, which is equivalent to adding 6 persons who have no friends online.

• 4. Hash function for Minhash:

    N = 63737L; hashNum = 100;

    private long fhash(int i, long x) {
        // The i-th hash of row index x: an affine map modulo the prime N.
        return (13 + (x - 1) * (N * i / (3 * hashNum) + 1)) % N;
    }

$1 \le x \le N, \qquad 0 \le i \le \text{hashNum} - 1$

Page 16:

Pseudocode of Minhash (Map job only)

• Mapper input: (c, Pc), where Pc represents a list [j1, j2, …, js];

• Build a new array s[hashNum] (hashNum = 100 here), initialized to infinity everywhere.

• For the i-th hash function, each element jj in Pc is an opportunity to get a lower hash value; the minimum over all jj is the minhash SIG[i, c].

• Output c as key and the contents of array s as value.

input (c, Pc), where Pc = [j1, j2, …, js]

long[] s = new long[hashNum];
for 0 ≤ ii ≤ hashNum − 1:
    s[ii] = infinity;
end
for jj in [j1, j2, …, js]:
    for 0 ≤ ii ≤ hashNum − 1:
        s[ii] = min(s[ii], fhash(ii, jj));
    end
end
Output (c, array s);
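Outside of Hadoop, the same map logic can be sketched as plain Java (fhash and the constants follow page 15; the friend list below is a hypothetical example):

    import java.util.Arrays;

    public class MinhashMapSketch {
        static final long N = 63737L;
        static final int hashNum = 100;

        // The i-th minhash function from page 15.
        static long fhash(int i, long x) {
            return (13 + (x - 1) * (N * i / (3 * hashNum) + 1)) % N;
        }

        // Computes the signature column for one person's friend list Pc.
        static long[] signature(long[] Pc) {
            long[] s = new long[hashNum];
            Arrays.fill(s, Long.MAX_VALUE);          // "infinity"
            for (long jj : Pc)                       // each friend id
                for (int ii = 0; ii < hashNum; ii++) // each hash function
                    s[ii] = Math.min(s[ii], fhash(ii, jj));
            return s;
        }

        public static void main(String[] args) {
            long[] Pc = {17, 42, 63730};             // hypothetical friend list
            System.out.println(Arrays.toString(signature(Pc)));
        }
    }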

Page 17:

Pseudocode for LSH:

• Mapper input: (j, Sj), where Sj is the j-th column of the signature matrix.

• Split array Sj into B bands: Sj1, Sj2, …, SjB.

• For the b-th band, get its hash value, stored in myHash.

• Output the tuple (b, myHash) as key, j as value.

for 1 ≤ b ≤ B:
    myHash = getHashValue(Sjb)
    Output { (b, myHash), j }
end

• Reducer input:

{ (b, aHashValue), [j1, j2, …, jp] }

Now form pairs among j1, …, jp and output them as candidate pairs.

for 1 ≤ x ≤ p − 1:
    for x + 1 ≤ y ≤ p:
        output (jx, jy)
    end
end

One more program is needed to remove duplicates.

Hadoop's sorting procedure helps us gather all the items that both have the same hash value and come from the same band range.
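A minimal single-machine Java sketch of this banding step (B=50 and R=2 anticipate the settings on page 19, and the string bucket key anticipates page 18; the three signature columns are hypothetical):

    import java.util.*;

    public class LshBandingSketch {
        public static void main(String[] args) {
            int B = 50, R = 2;                 // 50 bands of 2 rows each
            // Hypothetical signature matrix: person id -> signature column.
            Map<Integer, long[]> sig = new HashMap<>();
            sig.put(1, randomSig(B * R, 7));
            sig.put(2, randomSig(B * R, 7));   // same seed: identical signature
            sig.put(3, randomSig(B * R, 99));

            // Key (band index, band contents as string) -> persons in bucket.
            Map<String, List<Integer>> buckets = new HashMap<>();
            for (Map.Entry<Integer, long[]> e : sig.entrySet())
                for (int b = 0; b < B; b++) {
                    StringBuilder key = new StringBuilder(b + ":");
                    for (int r = 0; r < R; r++)
                        key.append(e.getValue()[b * R + r]).append(",");
                    buckets.computeIfAbsent(key.toString(), k -> new ArrayList<>())
                           .add(e.getKey());
                }

            // Persons sharing a bucket in any band become candidate pairs;
            // the Set removes the duplicates, as the slide notes.
            Set<String> candidates = new TreeSet<>();
            for (List<Integer> ids : buckets.values())
                for (int x = 0; x < ids.size() - 1; x++)
                    for (int y = x + 1; y < ids.size(); y++)
                        candidates.add(Math.min(ids.get(x), ids.get(y)) + "-"
                                     + Math.max(ids.get(x), ids.get(y)));
            System.out.println(candidates);    // prints [1-2]
        }

        static long[] randomSig(int len, long seed) {
            Random rnd = new Random(seed);
            long[] s = new long[len];
            for (int i = 0; i < len; i++) s[i] = rnd.nextInt(1000);
            return s;
        }
    }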

Page 18:

Hash function for LSH:

• LSH needs to hash a band portion of a vector into a value.

• We hope that only identical vectors are hashed into the same bucket.

• An easy way is to directly use the band's string representation, since Hadoop also uses Text to transport data.

• For example, a band portion [21, 14, 36, 55] is hashed to the string "21,14,36,55".

• In this way, only exactly the same vector portion can come into the same bucket.

Page 19:

Parameter settings:

• We do not want to set the similarity threshold too high, since finding a duplicate of you on the web is not interesting.

• So we set the similarity threshold near 0.1.

• We set B=50 and hashNum=100, so that each band in LSH has R=2 rows.

𝑃 π‘Ÿπ‘’π‘π‘œπ‘šπ‘šπ‘’π‘›π‘‘ π‘₯) = 1 βˆ’ (1 βˆ’ π‘₯𝑅)𝐡

• The S-curve grows quickly (B=50, R=2):

• x = 0.1 → P = 0.39

• x = 0.15 → P = 0.68

• x = 0.2 → P = 0.87
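For instance, the first value can be checked directly:

$$1 - (1 - 0.1^2)^{50} = 1 - 0.99^{50} \approx 1 - 0.605 = 0.395 \approx 0.39.$$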

Page 20:

Result Test:

β€’ 𝑃 π‘Ÿπ‘’π‘π‘œπ‘šπ‘šπ‘’π‘›π‘‘ π‘₯) =𝑃(π‘Ÿπ‘’π‘π‘œπ‘šπ‘šπ‘’π‘›π‘‘ & π‘₯≀𝑠<π‘₯+𝑑π‘₯)

𝑃(π‘₯≀𝑠<π‘₯+𝑑π‘₯)

β€’ 𝑃 π‘Ÿπ‘’π‘π‘œπ‘šπ‘šπ‘’π‘›π‘‘ π‘₯) = 1 βˆ’ (1βˆ’ π‘₯𝑅)𝐡

β€’ The Hadoop output can be analyzed to get 𝑃(π‘Ÿπ‘’π‘π‘œπ‘šπ‘šπ‘’π‘›π‘‘ & π‘₯ ≀ 𝑠 < π‘₯ + 𝑑π‘₯).

• For $P(x \le s < x + dx)$, another Hadoop program was written to actually calculate the similarities of all possible pairs (which takes N(N−1)/2 computations), since the dataset is not too large.

• But only similarities equal to or larger than 0.1 are stored in the output file, because it would take several terabytes to store all the similarities.

Page 21:

𝑃 π‘Ÿπ‘’π‘π‘œπ‘šπ‘šπ‘’π‘›π‘‘ π‘₯) derived from real dataset (blue) and the theoretical curve (red)

Page 22:

[Figure: histogram of LSH-recommended pairs and of all existing pairs (cut at 0.1) within the data.]

Page 23:

Statistics

• LSH & Minhash recommend 1,065,318 pairs.

• There are 660,334 existing pairs that really have s larger than 0.1.

• Their intersection has 429,176 pairs, which covers 65% of the similar pairs (s > 0.1).

• But the computation is hundreds of times faster than before.

• $1 - (1 - x^R)^B = \dfrac{P(\text{recommend} \;\&\; x \le s < x + dx)}{P(x \le s < x + dx)}$

• We define a reference value $x_{Ref}$:

$$1 - (1 - x_{Ref}^R)^B = \frac{\int_{0.1}^{1} P(\text{recommend} \;\&\; x \le s < x + dx)\, dx}{\int_{0.1}^{1} P(x \le s < x + dx)\, dx} = \frac{429176}{660334} = 0.649938$$

• Plugging in the parameters B=50, R=2 gives $x_{Ref} = 0.1441$, which is slightly above 0.1.

Page 24:

How to calculate $P(x \le s < x + dx)$? (The similarity-joins problem)

• To get the exact P.D.F., you need to actually calculate the similarities of all N(N−1)/2 pairs.

• Using Hadoop can parallelize and speed this up. But don't use too high a replication rate.

• How about the naive method below?

Mapper input: (i, Pi)

for 1 ≤ j ≤ N:
    if i < j: output { (i, j), Pi }
    else if i > j: output { (j, i), Pi }
end

Reducer input: { (i, j), [Pi, Pj] }

Output { (i, j), Sij }

• This method has a replication rate of N and will definitely fail. The correct way is to split the persons into G groups.
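To see why, compare the mapper output volumes, counting how many times each friend list is emitted (N = 63,731 here; G = 100 is a hypothetical choice):

$$\text{naive: } N(N-1) \approx 4.1 \times 10^{9} \text{ emitted lists}, \qquad \text{grouped: } N(G-1) \approx 6.3 \times 10^{6} \text{ lists for } G = 100.$$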

Page 25:

The correct way to get the similarities between all pairs with Hadoop.

β€’ Mapper input: (i, Pi)

• Determine its group number as u = i % G, where G is the number of groups you split into; G is also the replication rate.

• For 0 ≤ v ≤ G−1:
    If u < v: Output { (u, v), (i, Pi) }
    Else if u > v: Output { (v, u), (i, Pi) }
  end

• Reducer input:

{ (u, v), [ ∀ (i, Pi) ∈ Group u, ∀ (j, Pj) ∈ Group v ] }

• Create two empty lists, uList & vList, to separately gather all (i, Pi) that belong to group u and to group v.

For 0 ≤ α ≤ size(uList) − 1:
    Get i and Pi from uList[α]
    For 0 ≤ β ≤ size(vList) − 1:
        Get j and Pj from vList[β]
        If i < j: output { (i, j), Sij }
        Else if i > j: output { (j, i), Sij }


Continued on the next page

Page 26:

Still within the reducer:

• The above only considers pairs whose elements come from different groups.

• Now we consider elements within the same group.

• We avoid calculating the same pair multiple times by setting if-conditions.

If v == u + 1:
    For 0 ≤ α ≤ size(uList) − 2:
        Get i and Pi from uList[α]
        For α + 1 ≤ β ≤ size(uList) − 1:
            Get j and Pj from uList[β]
            If i < j: output { (i, j), Sij }
            Else if i > j: output { (j, i), Sij }

If u == 0 & v == G−1:
    For 0 ≤ α ≤ size(vList) − 2:
        Get i and Pi from vList[α]
        For α + 1 ≤ β ≤ size(vList) − 1:
            Get j and Pj from vList[β]
            If i < j: output { (i, j), Sij }
            Else if i > j: output { (j, i), Sij }
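These if-conditions assign each group's internal pairs to exactly one reducer key: group u goes to key (u, u+1) for u < G−1, and group G−1 goes to key (0, G−1). A small Java check of this bookkeeping, with toy values for G and the number of persons:

    public class GroupPairCoverage {
        public static void main(String[] args) {
            int G = 4, N = 20;                 // toy group count and persons
            int[][] counted = new int[N][N];   // times each pair is computed

            // Every reducer key (u, v) with u < v receives groups u and v.
            for (int u = 0; u < G; u++)
                for (int v = u + 1; v < G; v++) {
                    // Cross-group pairs: one person from each group.
                    for (int i = 0; i < N; i++)
                        for (int j = 0; j < N; j++)
                            if (i % G == u && j % G == v) count(counted, i, j);
                    // Same-group pairs, guarded by the slide's conditions.
                    if (v == u + 1) samePairs(counted, u, G, N);
                    if (u == 0 && v == G - 1) samePairs(counted, G - 1, G, N);
                }

            for (int i = 0; i < N; i++)
                for (int j = i + 1; j < N; j++)
                    if (counted[i][j] != 1)
                        throw new AssertionError("pair " + i + "," + j);
            System.out.println("every pair computed exactly once");
        }

        static void samePairs(int[][] c, int g, int G, int N) {
            for (int i = 0; i < N; i++)
                for (int j = i + 1; j < N; j++)
                    if (i % G == g && j % G == g) count(c, i, j);
        }

        static void count(int[][] c, int i, int j) {
            c[Math.min(i, j)][Math.max(i, j)]++;
        }
    }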

Page 27:

Post-processing work:

• 1. Filter out the false positives by calculating similarities for the candidate pairs. Then we have the similar persons ("close friends") for a lot of users.

• The general idea is to use 2 MR jobs:

• The 1st MR job uses i as key and changes (i, j) to (i, j, Pi);

• The 2nd MR job changes (i, j, Pi) to (i, j, Pi, Pj), so you can get the similarity Sij.

• 2. Recommendation: for each user, take the union of his/her close friends' friend lists and filter out the members he/she already knows.

• The general idea: when you have a similar-person list like {a, [b1, b2, …, bs]}, transform it into {a, [Pb1, Pb2, …, Pbs]}, where Pbi is the friend list of person bi.

• Then take the union of the Pbi and finally subtract Pa (see the sketch below).
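A minimal in-memory Java sketch of that union-and-subtract step (the friend lists are hypothetical):

    import java.util.*;

    public class RecommendSketch {
        public static void main(String[] args) {
            // Hypothetical friend lists: Pa plus a's close friends' lists.
            Set<Integer> Pa = new HashSet<>(Arrays.asList(2, 3, 4));
            List<Set<Integer>> closeFriendLists = Arrays.asList(
                new HashSet<>(Arrays.asList(3, 5, 6)),   // Pb1
                new HashSet<>(Arrays.asList(4, 6, 7)));  // Pb2

            // U = Pb1 ∪ Pb2 ∪ … ∪ Pbs
            Set<Integer> U = new HashSet<>();
            for (Set<Integer> Pbi : closeFriendLists) U.addAll(Pbi);

            // Recommend U − Pa: friends of close friends a doesn't know yet.
            U.removeAll(Pa);
            System.out.println(U);   // prints [5, 6, 7]
        }
    }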

Page 28:

Filter out false positives (2 MR jobs)

• 1st Mapper (multiple inputs):

Recommendation data:
(i, j) → { i, (j, "R") } if i < j
         { j, (i, "R") } if i > j

Friend-list data:
(i, Pi) → { i, (Pi, "F") }

• 1st Reducer input:

{ i, [ (j, "R") ∀ j candidate-paired with i & j > i; (Pi, "F") ] }

For each j from the input:
    Output { j, (i, Pi, "temp") }

• 2nd Mapper (multiple inputs):

Temporary data: pass through

Friend-list data:
(j, Pj) → { j, (Pj, "F") }

• 2nd Reducer input:

{ j, [ (i, Pi, "temp") ∀ i associated with j; (Pj, "F") ] }

For each i:
    Sij = similarity(Pi, Pj)
    If Sij >= 0.1: output { (i, j), Sij }

Page 29:

Recommendation (3 MR jobs):

• 1st Mapper (multiple inputs):

Similar-persons list data:
{ a, [b1, b2, …, bs] } → { bi, (a, "S") } for all i

Friend-list data:
(bi, Pbi) → { bi, (Pbi, "F") }

• 1st Reducer input:
{ bi, [ (a, "S") ∀ a similar to bi; (Pbi, "F") ] }
For each a from the input:
    Output { a, Pbi }

• 2nd Mapper: pass through

• 2nd Reducer input: { a, [Pb1, Pb2, …, Pbs] }
    U = Pb1 ∪ Pb2 ∪ … ∪ Pbs
    Output { a, U }

• 3rd Mapper (multiple inputs):
(i, Ui) → { i, (Ui, "u") }
(i, Pi) → { i, (Pi, "F") }

• 3rd Reducer input: { i, [(Ui, "u"), (Pi, "F")] }
    Output { i, Ui − Pi }

Page 30:

References:

• 1. Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman. Mining of Massive Datasets.

• 2. Bimal Viswanath, Alan Mislove, Meeyoung Cha, and Krishna P. Gummadi. 2009. On the evolution of user interaction in Facebook. In Proceedings of the 2nd ACM Workshop on Online Social Networks (WOSN '09). ACM, New York, NY, USA, 37-42. DOI: http://dx.doi.org/10.1145/1592665.1592675