Mining of Massive Datasets
Stream Processing
Have I already processed this?
How many distinct queries were made?
How many hits did I get?
Stream Processing – Bloom Filters
Guaranteed detection of negatives.
Possible false positives.
Stream Processing – Bloom Filters
Have a collection of hash functions (h1, h2, h3…).
For an input, run the hash functions. Map to bit array.
If all the hashed bits are lit in the working store, this might already have been
processed (false positives are possible).
If any of the hashed bits is not lit in the working store, we need to process
this. (Guaranteed…no false negatives.)
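The membership test above can be sketched as a minimal Bloom filter. Deriving the k bit positions by salting a single hash function is an implementation choice of this sketch, not something the slides prescribe:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k salted SHA-1 hashes over an m-bit array."""

    def __init__(self, m=1000, k=3):
        self.m = m
        self.k = k
        self.bits = [0] * m  # the "working store"

    def _positions(self, item):
        # Derive k hash positions by salting one hash function (illustrative).
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True  -> possibly seen before (false positives possible)
        # False -> definitely never added (no false negatives)
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("Foo")
```

Checking `bf.might_contain("Foo")` is guaranteed to return True; an item never added can only return True by colliding on all k bits.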
Stream Processing – Bloom Filters
Worked example (bit-array rows from the slide diagram):
1 0 0 1 1 0 1 1 1 0
0 0 1 0 0 1 1 0 0 0
0 0 1 1 1 0 0 0 0 1
0 1 0 0 0 1 0 0 0 1
1 0 0 1 1 0 0 0 0 0
Input 1: “Foo” hashes to:
1 0 1 1 1 0 0 0 0 0
Input 2: “Bar” hashes to:
Stream Processing – Bloom Filters
Not just for streams (everything is a stream, right?)
Cassandra uses Bloom filters to check whether some data might be in a low-level
storage file.
Map Reduce
A little smarts goes a l-o-o-o-n-g way.
Map Reduce – Multiway Joins
R join S join T
size(R) = r, size(S) = s, size(T) = t
Probability of match for R and S = p
Probability of match for S and T = p
Which do we join first?
Map Reduce – Multiway Joins
R (A, B) join S(B, C) join T(C, D)
size(R) = r, size(S) = s, size(T) = t
Probability of match for R and S = p
Probability of match for S and T = p
Communication cost:
* If we join R and S first: O(r + s + t + prs)
* If we join S and T first: O(r + s + t + pst)
Map Reduce – Multiway Joins
Can we do better?
Map Reduce – Multiway Joins
Hash B to b buckets, C to c buckets.
bc = k
Cost ~ r + 2s + t + 2 * sqrt(krt)
Usually, can neglect r + t compared to the k term. So,
2s + 2*sqrt(krt)
[Single MR job]
Map Reduce – Multiway Joins
2s + 2*sqrt(krt)
[Single MR job]
vs (r + s + t + prs)
[Two MR jobs]
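Whether the single-job plan beats the two-job cascade depends on the relation sizes and the selectivity p. A quick check with illustrative numbers (the sizes below are made up for the example):

```python
import math

def two_job_cost(r, s, t, p):
    # Cascade of two binary joins, e.g. R join S first:
    # read all inputs, then ship the p*r*s intermediate result.
    return r + s + t + p * r * s

def one_job_cost(r, s, t, k):
    # Single 3-way join with k = b*c reducers:
    # S shipped twice, R and T replicated across buckets.
    return r + 2 * s + t + 2 * math.sqrt(k * r * t)

r = s = t = 10_000_000
k = 1000
selective = two_job_cost(r, s, t, p=1e-6)   # few matches: cascade wins
single = one_job_cost(r, s, t, k)
unselective = two_job_cost(r, s, t, p=1e-2) # many matches: single job wins
```

With a very selective join the cascade's intermediate result is tiny and the cascade is cheaper; with many matches the single 3-way job avoids shipping a huge intermediate relation.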
Map Reduce – Multiway Joins
So…is this always better?
Map Reduce – Complexity
Replication Rate (r):
Number of outputs by all Map tasks / number of inputs
Reducer Size (q):
Max number of inputs per key at a reducer
With p = number of inputs: r >= p / q
For n x n matrix multiplication (p = 2n^2 inputs): qr >= 2n^2
Map Reduce – Matrix Multiplication
Approach 1
Matrix M, N
M(i, j), N(j, k)
Map1: map matrix entries to (j, (M, i, mij)) and (j, (N, k, njk))
Reduce1: for each key j, output ((i, k), mij*njk) for every (M, N) pair
Map2: Identity
Reduce2: for each key (i, k), get the sum of values.
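Approach 1 can be simulated with plain dictionaries standing in for the shuffle; a sketch, not real MapReduce code:

```python
from collections import defaultdict

def mr_matmul_two_pass(M, N):
    """Simulate Approach 1: join on j, then group-and-sum on (i, k)."""
    # Map1 + shuffle: group both matrices' entries by the shared index j.
    by_j = defaultdict(lambda: ([], []))
    for (i, j), mij in M.items():
        by_j[j][0].append((i, mij))
    for (j, k), njk in N.items():
        by_j[j][1].append((k, njk))
    # Reduce1: emit ((i, k), mij * njk) for every pair sharing j.
    # Map2 is the identity; Reduce2 sums per (i, k).
    result = defaultdict(float)
    for ms, ns in by_j.values():
        for i, mij in ms:
            for k, njk in ns:
                result[(i, k)] += mij * njk
    return dict(result)

# 2x2 example; matrices stored as sparse {(row, col): value} dicts
M = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}
N = {(0, 0): 5.0, (0, 1): 6.0, (1, 0): 7.0, (1, 1): 8.0}
P = mr_matmul_two_pass(M, N)   # [[1,2],[3,4]] @ [[5,6],[7,8]]
```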
Map Reduce – Matrix Multiplication
Approach 2
One step:
Map:
For M, produce ((i, k), (M, j, mij)) for k = 1…(number of columns of N)
For N, produce ((i, k), (N, j, njk)) for i = 1…(number of rows of M)
Reduce:
For each key (i, k), multiply values that share j and sum the products.
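The one-step approach can be simulated with plain dictionaries standing in for the shuffle; note how the Map phase replicates every entry across all target cells:

```python
from collections import defaultdict

def mr_matmul_one_pass(M, N, n_rows_M, n_cols_N):
    """Simulate Approach 2: key every entry directly by its output cell (i, k)."""
    by_cell = defaultdict(lambda: (defaultdict(float), defaultdict(float)))
    # Map: replicate each M entry across all target columns k,
    #      and each N entry across all target rows i.
    for (i, j), mij in M.items():
        for k in range(n_cols_N):
            by_cell[(i, k)][0][j] = mij
    for (j, k), njk in N.items():
        for i in range(n_rows_M):
            by_cell[(i, k)][1][j] = njk
    # Reduce: for each cell, multiply values that share j and sum.
    return {cell: sum(ms[j] * ns[j] for j in ms.keys() & ns.keys())
            for cell, (ms, ns) in by_cell.items()}

M = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}
N = {(0, 0): 5.0, (0, 1): 6.0, (1, 0): 7.0, (1, 1): 8.0}
P = mr_matmul_one_pass(M, N, 2, 2)
```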
Map Reduce – Matrix Multiplication
Approach 3
Two steps again.
Map Reduce – Matrix Multiplication
One pass:
(4n^4) / q
Two pass:
(4n^3) / sqrt(q)
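Comparing the two communication-cost formulas with illustrative values of n and q shows that one pass wins only while n < sqrt(q):

```python
import math

def one_pass_cost(n, q):
    # Communication cost of the one-pass algorithm
    return 4 * n**4 / q

def two_pass_cost(n, q):
    # Communication cost of the two-pass algorithm
    return 4 * n**3 / math.sqrt(q)

# For large matrices (n > sqrt(q)) the two-pass algorithm communicates less.
big = (one_pass_cost(10_000, 1_000_000), two_pass_cost(10_000, 1_000_000))
small = (one_pass_cost(100, 1_000_000), two_pass_cost(100, 1_000_000))
```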
Similarity - Shingling
“abcdef” -> [“abc”, “bcd”, “cde”…]
Jaccard similarity -> N(intersection) / N(union)
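A minimal sketch of shingling and Jaccard similarity:

```python
def shingles(s, k=3):
    """k-shingles of a string: the set of every length-k substring."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| for two shingle sets."""
    return len(a & b) / len(a | b)

s1 = shingles("abcdef")   # {"abc", "bcd", "cde", "def"}
s2 = shingles("abcdeg")   # {"abc", "bcd", "cde", "deg"}
sim = jaccard(s1, s2)     # 3 shared shingles out of 5 total -> 0.6
```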
Similarity - Shingling
Problem?
Size
Similarity - Minhashing
Similarity - Minhashing
h(S1) = a, h(S2) = c, h(S3) = b, h(S4) = a
Similarity - Minhashing
Problem?
Similarity – Minhash Signatures
Problem? Still can’t find pairs with greatest similarity efficiently
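Minhash signatures can be sketched by simulating random permutations with random linear hash functions (a standard trick; the parameters below are illustrative):

```python
import random

def minhash_signatures(sets, universe, num_hashes=100, seed=1):
    """Signature matrix: sigs[f][s] = min, over the set's elements, of the
    f-th random hash function (simulating a random row permutation)."""
    rng = random.Random(seed)
    p = 2**31 - 1  # a prime larger than any element index
    funcs = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(num_hashes)]
    elems = {x: i for i, x in enumerate(universe)}
    sigs = []
    for a, b in funcs:
        sigs.append([min((a * elems[x] + b) % p for x in s) for s in sets])
    return sigs

def estimate_similarity(sigs, i, j):
    """Fraction of hash functions that agree ≈ Jaccard similarity."""
    return sum(row[i] == row[j] for row in sigs) / len(sigs)

S1 = {"a", "b", "c"}
S2 = {"a", "b", "d"}
sigs = minhash_signatures([S1, S2], sorted(S1 | S2))
est = estimate_similarity(sigs, 0, 1)   # true Jaccard similarity is 2/4 = 0.5
```

The signatures are small, but comparing all pairs of columns is still quadratic, which is the problem the next slide addresses.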
Similarity – LSH for Minhash Signatures
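LSH over a signature matrix splits the rows into bands and hashes each band; columns that collide in any band become candidate pairs. A sketch with a toy signature matrix (the band and row counts are illustrative):

```python
from collections import defaultdict

def lsh_candidate_pairs(sigs, bands, rows_per_band):
    """sigs: signature matrix as a list of rows (one row per hash function).
    Columns with identical values within any band become candidate pairs."""
    assert bands * rows_per_band <= len(sigs)
    n_sets = len(sigs[0])
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for col in range(n_sets):
            band = tuple(sigs[b * rows_per_band + r][col]
                         for r in range(rows_per_band))
            buckets[band].append(col)
        for cols in buckets.values():
            for i in range(len(cols)):
                for j in range(i + 1, len(cols)):
                    candidates.add((cols[i], cols[j]))
    return candidates

# Toy 4-row signature matrix for 3 sets: columns 0 and 1 agree in band 0.
sigs = [[1, 1, 9],
        [2, 2, 8],
        [3, 4, 7],
        [5, 6, 7]]
pairs = lsh_candidate_pairs(sigs, bands=2, rows_per_band=2)
```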
Clustering – Hierarchical
Clustering – K Means
1. Pick k points (centroids)
2. Assign points to clusters
3. Shift centroids to “centre”.
4. Repeat
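The four steps above are Lloyd's algorithm; this toy version works on 2-D points and stops when the centroids stop moving:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """K-means on 2-D points, following the four steps above."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                  # 1. pick k centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                               # 2. assign to nearest
            c = min(range(k), key=lambda i: (p[0] - centroids[i][0]) ** 2
                                            + (p[1] - centroids[i][1]) ** 2)
            clusters[c].append(p)
        new = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
               if c else centroids[i]                  # 3. shift to "centre"
               for i, c in enumerate(clusters)]
        if new == centroids:                           # 4. repeat until stable
            break
        centroids = new
    return centroids

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cs = kmeans(pts, k=2)   # one centroid per obvious cluster
```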
Clustering – BFR
• 3 sets – Discard, Compressed and Retained
• First two are kept only as summaries: N, sum per dimension, sum of squares per dimension
• Assumes high-dimensional Euclidean space
Mahalanobis Distance
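BFR uses the N / SUM / SUMSQ summaries to compute a Mahalanobis-style distance, normalizing each dimension by that dimension's standard deviation; a sketch (the example cluster is made up):

```python
import math

def mahalanobis(point, n, sums, sumsqs):
    """Distance of a point from a cluster summarized by (N, SUM, SUMSQ),
    with each dimension normalized by its variance."""
    total = 0.0
    for x, s, sq in zip(point, sums, sumsqs):
        centroid = s / n
        variance = sq / n - centroid ** 2
        total += (x - centroid) ** 2 / variance
    return math.sqrt(total)

# Cluster of the 3 points (0,1), (1,2), (2,3), summarized per dimension
n, sums, sumsqs = 3, [3.0, 6.0], [5.0, 14.0]
d = mahalanobis([1.0, 2.0], n, sums, sumsqs)   # point at the centroid
```

A point is typically added to the cluster if this distance is below a few standard deviations (a common threshold choice, not from the slides).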
Clustering – CURE
• Sample; run clustering on the sample.
• Pick “representative” points from each cluster.
• Move representatives about 20% or so toward the centre.
• Merge clusters whose representatives are close.
Dimensionality Reduction
Dimensionality Reduction - SVD
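As a concrete illustration of SVD (the ratings-style matrix below is made up), numpy's `linalg.svd` factors M = U Σ V' and a low-rank approximation keeps only the largest singular values:

```python
import numpy as np

# Toy "users x movies" matrix: two rank-1 blocks
M = np.array([[5.0, 5.0, 0.0],
              [4.0, 4.0, 0.0],
              [0.0, 0.0, 5.0]])

U, s, Vt = np.linalg.svd(M, full_matrices=False)   # M = U @ diag(s) @ Vt

# Rank-1 approximation: keep only the largest singular value
M1 = s[0] * np.outer(U[:, 0], Vt[0, :])
```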
Dimensionality Reduction - CUR
SVD results in U and V being dense, even when M is sparse.
O(n^3)
Dimensionality Reduction - CUR
Choose r.
Choose r rows and r columns of M.
Intersection is W.
Run SVD on W (much smaller than M). W = XΣY’
Compute Σ+, the Moore-Penrose pseudoinverse of Σ.
Then, U = Y * (Σ+)^2 * X’
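The middle-matrix construction above can be sketched with numpy; this computes only U from a given intersection W (choosing and scaling the rows and columns are separate steps, covered next):

```python
import numpy as np

def cur_middle(W):
    """Middle matrix of CUR from the intersection W of the chosen
    rows and columns: W = X Σ Y', then U = Y (Σ+)^2 X'."""
    X, sigma, Yt = np.linalg.svd(W)
    # Moore-Penrose pseudoinverse of Σ: invert only the nonzero singular values
    sigma_plus = np.array([1.0 / v if v > 1e-12 else 0.0 for v in sigma])
    return Yt.T @ np.diag(sigma_plus ** 2) @ X.T

W = np.array([[2.0, 0.0],
              [0.0, 4.0]])
U = cur_middle(W)   # for this diagonal W: diag(1/2^2, 1/4^2)
```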
Dimensionality Reduction – CUR
Choosing Rows and Columns
Random, but with bias for importance.
(Frobenius Norm)^2
Probability of picking a row or column:
Sum of squares for row or column / Sum of squares of all elements
Dimensionality Reduction – CUR
Choosing Rows and Columns
Same row / column may get picked (selection with replacement).
Reduces rank.
Can be combined: multiply the vector by sqrt(k) if it appears k times.
Compute the pseudo-inverse as before, but transpose the result.
Thanks
Mining of Massive Datasets
Leskovec, Rajaraman, Ullman
Coursera / Stanford Course
Book: http://www.mmds.org/ [free]