![Page 1: Large-scale Data Mining: Hadoop and Beyongcs.kangwon.ac.kr/.../2011_1/grad_mining/slides/07-2.pdf · 2016-06-02 · Large-scale Data Mining: MapReduce and Beyond Part 2: Algorithms](https://reader030.vdocuments.net/reader030/viewer/2022040308/5f0329a37e708231d407d97c/html5/thumbnails/1.jpg)
Large-scale Data Mining: MapReduce and Beyond
Part 2: Algorithms
Spiros Papadimitriou, IBM Research
Jimeng Sun, IBM Research
Rong Yan, Facebook
Part 2: Mining using MapReduce
Mining algorithms using MapReduce
Information retrieval
Graph algorithms: PageRank
Clustering: Canopy clustering, KMeans
Classification: kNN, Naïve Bayes
MapReduce Mining Summary
MapReduce Interface and Data Flow
Map: (K1, V1) → list(K2, V2)
Combine: (K2, list(V2)) → list(K2, V2)
Partition: (K2, V2) → reducer_id
Reduce: (K2, list(V2)) → list(K3, V3)
[Diagram: Map and Combine tasks run on several hosts (Hosts 1-3); their output is partitioned and shuffled to Reduce tasks on other hosts (e.g. Host 4). Example flow for an inverted index: (id, doc) → list(w, id) → list(unique_w, id) in each mapper; after the shuffle, the reducer turns (w1, list(id)) into (w1, list(unique_id)), and likewise for w2.]
Information retrieval using MapReduce
IR: Distributed Grep
Find the doc_id and line # of each line matching a pattern
Map: (id, doc) → list(id, line #)
Reduce: none (map output is written directly)
[Diagram: grep "data mining" over docs 1-6, split across Map1-Map3; the matching outputs are <1, 123>, <3, 717>, <5, 1231>, and <6, 1012>.]
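The map step above can be sketched in a single process as follows; the function name, sample documents, and pattern are illustrative, not from the slides:

```python
# Sketch of the distributed-grep mapper: each map task scans its shard
# of (doc_id, doc) pairs and emits (doc_id, line_number) for every
# matching line. No reducer is needed.
def grep_map(doc_id, doc, pattern):
    for line_no, line in enumerate(doc.splitlines(), start=1):
        if pattern in line:
            yield (doc_id, line_no)

docs = {
    1: "intro\nwe study data mining here",
    2: "nothing relevant",
    3: "data mining\nagain data mining",
}
matches = [kv for doc_id, doc in docs.items()
           for kv in grep_map(doc_id, doc, "data mining")]
```

In a real job each mapper would receive only its split of the document collection; the loop over `docs` stands in for that split.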
IR: URL Access Frequency
Map: (null, log) → (URL, 1)
Reduce: (URL, list(1)) → (URL, total_count)
[Diagram: log splits processed by Map1-Map3 emit <u1,1>, <u1,1>, <u2,1>, <u3,1>, <u3,1>; Reduce outputs <u1,2>, <u2,1>, <u3,2>.]
Also described in Part 1.
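A minimal single-process sketch of this job, with an in-memory "shuffle" standing in for the framework (the log format and names are made up):

```python
from collections import defaultdict

# URL-frequency job: map emits (URL, 1) per log record; reduce (and any
# combiner) sums the 1s per URL.
def url_map(log_line):
    url = log_line.split()[0]  # assume the URL is the first field
    yield (url, 1)

def url_reduce(url, counts):
    return (url, sum(counts))

logs = ["u1 GET", "u2 GET", "u3 GET", "u3 GET", "u1 GET"]
shuffled = defaultdict(list)          # simulates the shuffle/group-by-key
for line in logs:
    for url, one in url_map(line):
        shuffled[url].append(one)
result = dict(url_reduce(u, vs) for u, vs in shuffled.items())
```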
IR: Reverse Web-Link Graph
Map: (null, page) → list(target, source)
Reduce: (target, list(source)) → (target, list(source))
[Diagram: page splits processed by Map1-Map3 emit <t1,s2>, <t2,s3>, <t2,s5>, <t3,s5>; Reduce outputs <t1,[s2]>, <t2,[s3,s5]>, <t3,[s5]>.]
This is the same as a matrix transpose.
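A single-process sketch of the reverse-link job (pages and link targets are illustrative):

```python
from collections import defaultdict

# Reverse web-link graph: map emits (target, source) for every link on
# a page; reduce groups the sources per target, which transposes the
# link matrix.
def reverse_map(source, targets):
    for t in targets:
        yield (t, source)

pages = {"s2": ["t1"], "s3": ["t2"], "s5": ["t2", "t3"]}
grouped = defaultdict(list)           # simulates the shuffle
for src, tgts in pages.items():
    for t, s in reverse_map(src, tgts):
        grouped[t].append(s)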
IR: Inverted Index
Map: (id, doc) → list(word, id)
Reduce: (word, list(id)) → (word, list(id))
[Diagram: doc splits processed by Map1-Map3 emit <w1,1>, <w2,2>, <w3,3>, <w1,5>; Reduce outputs <w1,[1,5]>, <w2,[2]>, <w3,[3]>.]
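A single-process sketch of the inverted-index job, using the same toy data as the slide (doc contents are made up):

```python
from collections import defaultdict

# Inverted index: map emits (word, doc_id) per word occurrence; reduce
# collects the posting list per word (duplicates removed, ids sorted).
def index_map(doc_id, doc):
    for word in doc.split():
        yield (word, doc_id)

docs = {1: "w1", 2: "w2", 3: "w3", 5: "w1"}
postings = defaultdict(set)
for doc_id, text in docs.items():
    for w, d in index_map(doc_id, text):
        postings[w].add(d)
index = {w: sorted(ids) for w, ids in postings.items()}
```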
Graph mining using MapReduce
PageRank
The PageRank vector q is defined as

    q = c * A^T q + ((1-c)/N) * e

where
  A is the source-by-destination adjacency matrix,
  e is the all-ones vector,
  N is the number of nodes,
  c is a weight between 0 and 1 (e.g. 0.85).
The first term corresponds to browsing (following out-links); the second to teleporting (jumping to a random page).
PageRank indicates the importance of a page.
Algorithm: iterative powering for finding the first eigenvector.

Example 4-node graph (1 links to 2, 3, 4; 2 links to 3, 4; 3 links to 4; 4 links to 3):

    A = [ 0 1 1 1 ]
        [ 0 0 1 1 ]
        [ 0 0 0 1 ]
        [ 0 0 1 0 ]
MapReduce: PageRank
PageRank Map()
  Input: key = page x, value = (PageRank qx, out-links [y1…ym])
  Output: key = page x, value = partial PageRank
  1. Emit(x, 0)  // guarantee all pages will be emitted
  2. For each outgoing link yi:
       Emit(yi, qx/m)

PageRank Reduce()
  Input: key = page x, value = the list of partial values [partialx]
  Output: key = page x, value = PageRank qx
  1. qx = 0
  2. For each partial value d in the list:
       qx += d
  3. qx = c*qx + (1-c)/N
  4. Emit(x, qx)

[Diagram: Map distributes each page's PageRank qi over its out-links; Reduce sums the partials to produce the new q1…q4.]
Check out Kang et al., ICDM'09.
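One iteration of the Map()/Reduce() pseudocode above can be sketched in a single process like this, on the 4-node example graph (c = 0.85 as on the slides; all helper names are illustrative):

```python
from collections import defaultdict

def pagerank_map(x, qx, links):
    yield (x, 0.0)                      # guarantee every page is emitted
    for y in links:
        yield (y, qx / len(links))      # distribute qx over m out-links

def pagerank_reduce(x, partials, c, N):
    return (x, c * sum(partials) + (1 - c) / N)

graph = {1: [2, 3, 4], 2: [3, 4], 3: [4], 4: [3]}
q = {x: 0.25 for x in graph}            # uniform initial PageRank
c, N = 0.85, len(graph)

partials = defaultdict(list)            # simulates the shuffle
for x, links in graph.items():
    for y, p in pagerank_map(x, q[x], links):
        partials[y].append(p)
q = dict(pagerank_reduce(x, ps, c, N) for x, ps in partials.items())
```

A full PageRank computation would run this job repeatedly until q converges, feeding each iteration's output back as the next iteration's input.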
Clustering using MapReduce
Canopy: single-pass clustering
Canopy creation
  Construct overlapping clusters (canopies)
  Make sure no two canopies overlap too much
  Key: no two canopy centers are too close to each other
Uses two distance thresholds T1 > T2: points within T1 of a center join the canopy (allowing overlap), and no new center may be created within T2 of an existing one.
[Diagram: four overlapping canopies C1-C4, with the outer threshold T1 and inner threshold T2 marked.]
McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD'00
Canopy creation
McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD'00

Input: (1) points, (2) thresholds T1, T2 where T1 > T2
Output: cluster centroids

Put all points into a queue Q
While Q is not empty:
  p = dequeue(Q)
  For each canopy c:
    if dist(p, c) < T1: c.add(p)
    if dist(p, c) < T2: strongBound = true
  If not strongBound: create a new canopy at p
For each canopy c:
  Set its centroid to the mean of all points in c
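A single-machine sketch of the canopy-creation pseudocode above. The points, thresholds, and Euclidean dist() are illustrative: a point within T1 of a canopy's seed joins that canopy, and a point within T2 of any seed is strongly bound and never starts a new canopy.

```python
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def create_canopies(points, t1, t2):
    canopies = []  # each canopy: (seed point, list of member points)
    for p in points:
        strongly_bound = False
        for seed, members in canopies:
            d = dist(p, seed)
            if d < t1:
                members.append(p)
            if d < t2:
                strongly_bound = True
        if not strongly_bound:
            canopies.append((p, [p]))
    # set each centroid to the mean of the canopy's points
    return [tuple(sum(coord) / len(members) for coord in zip(*members))
            for _, members in canopies]

centroids = create_canopies(
    [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.2)], t1=2.0, t2=0.5)
```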
MapReduce - Canopy Map()
McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD'00

Canopy creation Map()
  Input: a set of points P, thresholds T1, T2
  Output: key = null; value = a list of local canopies (total, count)
  For each p in P:
    For each canopy c:
      if dist(p, c) < T1: c.total += p, c.count++
      if dist(p, c) < T2: strongBound = true
    If not strongBound: create a new canopy at p
  Close():
    For each canopy c:
      Emit(null, (c.total, c.count))

[Diagram: Map1 and Map2 each build local canopies on their own split of the points.]
MapReduce - Canopy Reduce()
McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD'00

For simplicity we assume only one reducer.

Reduce()
  Input: key = null; values = the list of (total, count) pairs
  Output: key = null; value = cluster centroids
  For each intermediate value (total, count):
    p = total/count
    For each canopy c:
      if dist(p, c) < T1: c.total += p, c.count++
      if dist(p, c) < T2: strongBound = true
    If not strongBound: create a new canopy at p
  Close():
    For each canopy c: Emit(null, c.total/c.count)

[Diagram: the single reducer merges the local canopies from Map1 and Map2 into global canopies.]
Clustering Assignment
McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD'00

Clustering assignment:
  For each point p, assign p to the closest canopy center.
MapReduce: Cluster Assignment
McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD'00

Cluster assignment Map()
  Input: point p; cluster centroids
  Output: key = cluster id; value = point id
  currentDist = inf
  For each cluster centroid c:
    If dist(p, c) < currentDist:
      bestCluster = c; currentDist = dist(p, c)
  Emit(bestCluster, p)

Results can be written directly back to HDFS without a reducer, or an identity reducer can be applied to get the output sorted on cluster id.
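The assignment Map() above can be sketched as follows (the centroid ids, points, and helper names are illustrative):

```python
import math

# Cluster-assignment mapper: emit (closest_centroid_id, point_id).
def assign_map(point_id, p, centroids):
    best_id, best_dist = None, float("inf")
    for cid, c in centroids.items():
        d = math.dist(p, c)
        if d < best_dist:
            best_id, best_dist = cid, d
    return (best_id, point_id)

centroids = {"c1": (0.0, 0.0), "c2": (10.0, 10.0)}
points = {"p1": (1.0, 1.0), "p2": (9.0, 9.5), "p3": (0.2, -0.1)}
assignments = [assign_map(pid, p, centroids) for pid, p in points.items()]
```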
KMeans: Multi-pass clustering
KMeans()
  While not converged:
    AssignCluster()
    UpdateCentroids()

Traditional AssignCluster():
  For each point p:
    Assign p to the closest centroid c

Traditional UpdateCentroids():
  For each cluster:
    Update the cluster center to the mean of its points
MapReduce – KMeans
KMeansIter()
  Map(p)  // assign cluster
    For each centroid c in clusters:
      If dist(p, c) < minDist:
        minC = c; minDist = dist(p, c)
    Emit(minC.id, (p, 1))
  Reduce()  // update centroids
    For all values (p, c):
      total += p; count += c
    Emit(key, (total, count))

[Diagram: Map1 and Map2 assign their points to the initial centroids.]
[Diagram: one KMeans iteration on points 1-4. Map1 and Map2 assign each point to the closest initial centroid; Reduce1 and Reduce2 update each centroid with its new location (total, count).]
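One KMeansIter() pass following the pseudocode above, as a single-process sketch (points and initial centroids are illustrative): Map assigns each point to its closest centroid and emits (cluster_id, (p, 1)); Reduce sums the points and counts, and the new centroid is total/count.

```python
import math
from collections import defaultdict

def kmeans_map(p, centroids):
    min_c = min(centroids, key=lambda cid: math.dist(p, centroids[cid]))
    return (min_c, (p, 1))

def kmeans_reduce(cid, values):
    total = [0.0, 0.0]
    count = 0
    for p, c in values:
        total[0] += p[0]; total[1] += p[1]; count += c
    return (cid, (total[0] / count, total[1] / count))

centroids = {0: (0.0, 0.0), 1: (10.0, 10.0)}
points = [(1.0, 0.0), (0.0, 1.0), (9.0, 10.0), (10.0, 9.0)]

shuffled = defaultdict(list)            # simulates the shuffle
for p in points:
    cid, val = kmeans_map(p, centroids)
    shuffled[cid].append(val)
centroids = dict(kmeans_reduce(cid, vals) for cid, vals in shuffled.items())
```

The driver would rerun this job with the updated centroids until they stop moving.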
Classification using MapReduce
MapReduce kNN
[Diagram: points labeled 0 and 1, split between Map1 and Map2, with a query point; K = 3.]

Map()
  Input: all points in the split; the query point p
  Output: the k nearest neighbors (local)
  Emit the k points closest to p

Reduce()
  Input: key = null; values = the local neighbors; the query point p
  Output: the k nearest neighbors (global)
  Emit the k points closest to p among all local neighbors
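A single-process sketch of this two-stage selection (k = 3 as in the slides; splits and points are illustrative). Note that map and reduce perform the same top-k selection, first per split and then globally:

```python
import math
import heapq

def knn_map(points, query, k):
    # local top-k within one split
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, query))

def knn_reduce(local_neighbors, query, k):
    # global top-k among all local candidates
    return heapq.nsmallest(k, local_neighbors,
                           key=lambda p: math.dist(p, query))

query, k = (0.0, 0.0), 3
split1 = [(1.0, 0.0), (5.0, 5.0), (0.0, 2.0)]
split2 = [(0.5, 0.5), (8.0, 1.0)]
local = knn_map(split1, query, k) + knn_map(split2, query, k)
neighbors = knn_reduce(local, query, k)
```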
Naïve Bayes
c is a class label, d is a doc, w is a word.

Formulation:
    P(c|d) ∝ P(c) ∏_{w∈d} P(w|c)

Parameter estimation:
  Class prior: P(c) = Nc / N, where Nc is the number of docs in class c and N is the total number of docs
  Conditional probability: P(w|c) = Tcw / Σ_{w'} Tcw', where Tcw is the number of occurrences of w in class c

Goals:
  1. total number of docs N
  2. number of docs in c: Nc
  3. word count histogram in c: Tcw
  4. total word count in c: Σ_{w'} Tcw'

[Diagram: docs d1-d7 as term vectors, with class c1 holding N1 = 3 docs and class c2 holding N2 = 4 docs.]
MapReduce: Naïve Bayes
Naïve Bayes can be implemented using MapReduce jobs that compute histograms.

ClassPrior()
  Map(doc):
    Emit(class_id, (doc_id, doc_length))
  Combine()/Reduce():
    Nc = 0; sTcw = 0
    For each (doc_id, doc_length):
      Nc++; sTcw += doc_length
    Emit(c, (Nc, sTcw))  // doc count Nc and total word count sTcw per class

ConditionalProbability()
  Map(doc):
    For each word w in doc:
      Emit(pair(c, w), 1)
  Combine()/Reduce():
    Tcw = 0
    For each value v: Tcw += v
    Emit(pair(c, w), Tcw)
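A simplified single-process sketch of the two histogram jobs (the prior job here just counts docs per class; the labeled sample docs and function names are made up):

```python
from collections import Counter

def class_prior_map(label, doc):
    yield (label, 1)                    # one count per doc in class

def cond_prob_map(label, doc):
    for w in doc.split():
        yield ((label, w), 1)           # one count per (class, word)

docs = [("c1", "w1 w2"), ("c1", "w1"), ("c2", "w2 w3")]

Nc = Counter()                          # docs per class
Tcw = Counter()                         # (class, word) occurrence counts
for label, doc in docs:
    for c, one in class_prior_map(label, doc):
        Nc[c] += one
    for key, one in cond_prob_map(label, doc):
        Tcw[key] += one

N = sum(Nc.values())
prior = {c: n / N for c, n in Nc.items()}   # P(c) = Nc / N
```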
MapReduce Mining Summary
Taxonomy of MapReduce algorithms

                        One iteration       Multiple iterations   Not good for MapReduce
Clustering              Canopy              KMeans                -
Classification          Naïve Bayes, kNN    Gaussian Mixture      SVM
Graphs                  -                   PageRank              -
Information Retrieval   Inverted Index      -                     Topic modeling (PLSI, LDA)
One-iteration algorithms are perfect fits.
Multiple-iteration algorithms are OK fits, but small shared state has to be synchronized across iterations (typically through the filesystem).
Some algorithms are not a good fit for the MapReduce framework: they typically require large shared state with a lot of synchronization. Traditional parallel frameworks like MPI are better suited for those.
MapReduce for machine learning algorithms
The key is to convert the algorithm into summation form (the Statistical Query model [Kearns '94]):
    y = Σ_i f(x_i)
where computing f(x_i) corresponds to Map() and the summation Σ corresponds to Reduce().
Naïve Bayes:
  MR job: P(c)
  MR job: P(w|c)
KMeans:
  MR job: split the data into subgroups and compute partial sums in Map(), then sum them up in Reduce()
Map-Reduce for Machine Learning on Multicore [NIPS'06]
Machine learning algorithms using MapReduce
Linear regression: θ = (X^T X)^{-1} X^T y, where X ∈ R^{m×n} and y ∈ R^m
  MR job 1: A = X^T X = Σ_i x_i x_i^T
  MR job 2: b = X^T y = Σ_i x_i y_i
  Finally, solve A θ = b

Locally weighted linear regression:
  MR job 1: A = Σ_i w_i x_i x_i^T
  MR job 2: b = Σ_i w_i x_i y_i

Map-Reduce for Machine Learning on Multicore [NIPS'06]
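A single-process sketch of the two summation jobs for plain linear regression in two dimensions (pure Python; the data, chunking, and helper names are illustrative). Each "mapper" computes partial sums of x_i x_i^T and x_i y_i over its split; the driver adds them and solves A θ = b:

```python
def partial_sums(chunk):
    # one map task: partial A = sum of x x^T, partial b = sum of x*y
    A = [[0.0, 0.0], [0.0, 0.0]]
    b = [0.0, 0.0]
    for x, y in chunk:
        for i in range(2):
            for j in range(2):
                A[i][j] += x[i] * x[j]
            b[i] += x[i] * y
    return A, b

# data drawn from y = 2*x1 + 3*x2, so theta is exactly recoverable
data = [((1.0, 0.0), 2.0), ((0.0, 1.0), 3.0), ((1.0, 1.0), 5.0)]
chunks = [data[:2], data[2:]]           # two "map" splits

A = [[0.0, 0.0], [0.0, 0.0]]
b = [0.0, 0.0]
for pA, pb in map(partial_sums, chunks):    # "reduce": add partial sums
    for i in range(2):
        for j in range(2):
            A[i][j] += pA[i][j]
        b[i] += pb[i]

# solve the 2x2 system A theta = b by Cramer's rule
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
theta = [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
         (A[0][0] * b[1] - b[0] * A[1][0]) / det]
```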
Machine learning algorithms using MapReduce (cont.)
Logistic Regression
Neural Networks
PCA
ICA
EM for Gaussian Mixture Model
Map-Reduce for Machine Learning on Multicore [NIPS'06]
MapReduce Mining Resources
Mahout: Hadoop data mining library
Mahout: http://lucene.apache.org/mahout/
Scalable data mining libraries, mostly implemented on Hadoop.
Data structures for vectors and matrices:
Vectors
  Dense vectors as double[]
  Sparse vectors as HashMap<Integer, Double>
  Operations: assign, cardinality, copy, divide, dot, get, haveSharedCells, like, minus, normalize, plus, set, size, times, toArray, viewPart, zSum and cross
Matrices
  Dense matrix as double[][]
  SparseRowMatrix or SparseColumnMatrix as Vector[], holding the rows or columns of the matrix as SparseVectors
  SparseMatrix as HashMap<Integer, Vector>
  Operations: assign, assignColumn, assignRow, cardinality, copy, divide, get, haveSharedCells, like, minus, plus, set, size, times, transpose, toArray, viewPart and zSum
MapReduce Mining Papers
[Chu et al., NIPS'06] Map-Reduce for Machine Learning on Multicore
  General framework under MapReduce
[Papadimitriou et al., ICDM'08] DisCo: Distributed Co-clustering with Map-Reduce
  Co-clustering
[Kang et al., ICDM'09] PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations
  Graph algorithms
[Das et al., WWW'07] Google news personalization: scalable online collaborative filtering
  PLSI EM
[Grossman+Gu, KDD'08] Data Mining Using High Performance Data Clouds: Experimental Studies Using Sector and Sphere
  An alternative to Hadoop that supports wide-area data collection and distribution
Summary: algorithms
Best for MapReduce:
  Single pass; keys are uniformly distributed.
OK for MapReduce:
  Multiple passes; intermediate state is small.
Bad for MapReduce:
  Key distribution is skewed.
  Fine-grained synchronization is required.