overlapping clusters for distributed computation
DESCRIPTION
My talk from WSDM2012. See the paper on my webpage: http://www.cs.purdue.edu/homes/dgleich/publications/Andersen%202012%20-%20overlapping.pdf and the code: http://www.cs.purdue.edu/homes/dgleich/codes/overlapping/

TRANSCRIPT
Overlapping Clusters for Distributed Computation
DAVID F. GLEICH, PURDUE UNIVERSITY COMPUTER SCIENCE DEPARTMENT
REID ANDERSEN, MICROSOFT CORP.
VAHAB MIRROKNI, GOOGLE RESEARCH, NYC
WSDM2012 David Gleich · Purdue
Problem
Find a good way to distribute a big graph for solving things like linear systems and simulating random walks.

Contributions
A theoretical demonstration that overlap helps, and a proof-of-concept procedure to find overlapping partitions that reduce communication (~20%).

All code available: http://www.cs.purdue.edu/~dgleich/codes/overlapping
The problem
WHAT OUR NETWORKS LOOK LIKE
WHAT OUR OTHER NETWORKS LOOK LIKE

The problem
COMBINING NETWORKS AND GRAPHS IS A MESS
"Good" data distributions are a fundamental problem in distributed computation. How do we divide the communication graph? Balance work. Balance communication. Balance data. Balance programming complexity too.
Current solutions
  Disjoint vertex partitions:              Work: Excellent;  Comm.: Okay to Good;       Data: Excellent;    Programming: "Think like a vertex"
  2d or edge partitions:                   Work: Excellent;  Comm.: Excellent;          Data: Good;         Programming: "Impossible"
  Overlapping partitions (where we fit!):  Work: Okay;       Comm.: Good to Excellent;  Data: "Let's see";  Programming: "Think like a cached vertex"
Goals
Find a set of overlapping clusters where random walks stay in a cluster for a long time, so that solving diffusion-like problems requires little communication (think PageRank, Katz, hitting times, semi-supervised learning).
Related work
Domain decomposition, Schwarz methods: how to solve a linear system with overlap (Szyld et al.).
Communication-avoiding algorithms: k-step matrix-vector products (Demmel et al.) and growing overlap around partitions (Fritzsche, Frommer, Szyld).
Overlapping communities and link-partitioning algorithms for social network analysis: link communities (Ahn et al.); surveys by Santo Fortunato.
P2P-based PageRank algorithms: Parreira, Castillo, Donato et al.
Overlapping clusters
Each vertex is in at least one cluster and has one home cluster.

Formally, an overlapping cover is a pair (C, τ):
  C = the set of clusters
  τ : V → C = the map from each vertex to its home cluster
The home map τ induces a partition of the vertices!
Random walks in overlapping clusters
Each vertex is in at least one cluster and has one home cluster. Random walks go to the home cluster after leaving.
[Illustration: the red cluster sends the walk to the gray cluster, or the red cluster keeps the walk.]
An evaluation metric: swapping probability
Is (C, τ) a good overlapping cover? Does a random walk swap clusters often?
ρ₁ = the probability that a walk changes clusters on each step; a computable expression is given in the paper.
[Illustration: the red cluster sends the walk to the gray cluster, or keeps it.]
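The metric above can also be estimated by direct simulation. Below is a minimal sketch, not the paper's closed-form expression: it assumes the graph is a dict of adjacency lists, `clusters` is a list of vertex sets, and `home` maps each vertex to the index of its home cluster; all names are hypothetical.

```python
import random

def swap_probability(adj, clusters, home, steps=20000, seed=0):
    """Estimate rho_1: the fraction of random-walk steps that swap clusters.

    adj      : dict vertex -> list of neighbours
    clusters : list of vertex sets (the overlapping cover C)
    home     : dict vertex -> index of the home cluster (the map tau)
    """
    rng = random.Random(seed)
    v = rng.choice(list(adj))
    cur = home[v]                       # the walk starts in v's home cluster
    swaps = 0
    for _ in range(steps):
        v = rng.choice(adj[v])          # one random-walk step
        if v not in clusters[cur]:      # the walk left its current cluster
            swaps += 1                  # ... which is a swap (communication)
            cur = home[v]               # it continues from v's home cluster
    return swaps / steps
```

On a cycle, a partition into arcs swaps noticeably more often than an overlapping cover whose home vertices sit near cluster centers, which is the effect the next slides quantify.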
Overlapping clusters
Each vertex is in at least one cluster and has one home cluster.
Vol(C) = sum of degrees of vertices in cluster C
MaxVol = upper bound on Vol(C)
TotalVol(C) = sum of Vol(C) over all clusters
VolRatio = TotalVol(C) / Vol(G) = how much extra data we store!
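These volume quantities are straightforward to compute. A small sketch, assuming the same dict-of-adjacency-lists representation (the helper names are ours, not the paper's):

```python
def vol(adj, cluster):
    # Vol(C): sum of the degrees of the vertices in cluster C
    return sum(len(adj[v]) for v in cluster)

def volume_ratio(adj, clusters):
    # VolRatio = TotalVol(C) / Vol(G): how much extra graph data the
    # overlapping cover stores relative to the graph itself
    total = sum(vol(adj, C) for C in clusters)
    vol_g = sum(len(nbrs) for nbrs in adj.values())
    return total / vol_g
```

A VolRatio of 1 means no overlap at all; 2 means the cover stores the graph's adjacency data twice over.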
Swapping probability & partitioning
For a partition P, ρ₁ is much like a classical graph-partitioning metric:
  ρ₁(P) = (1/Vol(G)) · Σ_{P ∈ P} Cut(P)
(No overlap in this figure!)
Overlapping clusters vs. partitioning in theory
Take a cycle graph with M groups of ℓ vertices and MaxVol = 2ℓ.
For partitioning (optimal!): ρ₁ = 1/ℓ
For overlapping: ρ₁ = 4/ℓ², an order-ℓ improvement
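The 1/ℓ vs. 4/ℓ² gap comes from the expected exit time of a walk dropped near the middle of a cluster. An illustrative check, not code from the paper: solve the standard first-step recurrence E[i] = 1 + (E[i-1] + E[i+1])/2 on a path of ℓ sites and compare the implied swap rates.

```python
def exit_times(ell):
    """Expected steps for a walk started at each site of an ell-site path
    to first step off either end: E[i] = 1 + (E[i-1] + E[i+1]) / 2, with
    E = 0 just outside both ends.  Solved by Gauss-Seidel-style sweeps;
    the closed form is E[i] = (i + 1) * (ell - i)."""
    E = [0.0] * ell
    for _ in range(5000):
        for i in range(ell):
            left = E[i - 1] if i > 0 else 0.0
            right = E[i + 1] if i < ell - 1 else 0.0
            E[i] = 1.0 + 0.5 * (left + right)
    return E

def rho_partition(ell):
    # Disjoint arcs of ell vertices on a cycle: each arc has 2 cut edges
    # and volume 2*ell, so rho_1 = 2 / (2*ell) = 1/ell.
    return 1.0 / ell

def rho_overlap(ell):
    # With home vertices in the middle of each cluster, a walk re-enters
    # near the centre and needs about ell^2/4 steps before the next swap.
    return 1.0 / exit_times(ell)[ell // 2]
```

For large ℓ the overlap rate behaves like 4/ℓ², which dominates the 1/ℓ partitioning rate.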
Heuristics for finding good overlapping clusters
(NP-hard to solve optimally.)
Our multi-stage heuristic:
1. Find a large set of good clusters. Use personalized PageRank clusters, or METIS. (Each cluster takes "< MaxVol" work.)
2. Find "well contained" nodes (cores). Compute the expected "leave time". (Takes O(Vol) work per cluster.)
3. Cover the graph with core vertices. Approximately solve a min set-cover problem. (Fast enough.)
4. Combine clusters up to MaxVol. The swapping probability is sub-modular. (Fast enough.)
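Step 3 is a classic min set-cover instance, for which the greedy rule (repeatedly pick the cluster whose core covers the most still-uncovered vertices) gives the usual logarithmic approximation. A sketch of that step, with hypothetical input names:

```python
def greedy_cover(vertices, cluster_cores):
    """Greedy approximation to min set cover over cluster cores.

    vertices      : iterable of all vertices to cover
    cluster_cores : list of (cluster_id, core_vertex_set) pairs
    Returns the ids of the chosen clusters, in the order picked.
    """
    uncovered = set(vertices)
    chosen = []
    while uncovered:
        # pick the cluster whose core covers the most uncovered vertices
        cid, core = max(cluster_cores, key=lambda c: len(c[1] & uncovered))
        if not core & uncovered:
            raise ValueError("cores do not cover all vertices")
        chosen.append(cid)
        uncovered -= core
    return chosen
```

In the full heuristic the covered sets are the cluster cores from step 2, so every vertex ends up inside the well-contained region of at least one chosen cluster.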
Demo!
Solving linear systems
Like PageRank, Katz, and semi-supervised learning.

All nodes solve locally using the coordinate descent method.
A core vertex for the gray cluster.
Red sends residuals to white. White sends residuals to red.
White then uses the coordinate descent method to adjust its solution. This will cause communication to red/blue.
That algorithm is called restricted additive Schwarz. It applies to PageRank, Katz scores, semi-supervised learning, and any SPD or M-matrix linear system. We look at PageRank!
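A minimal sketch of the restricted additive Schwarz idea for the PageRank system (I − αPᵀ)x = e, in pure Python on dict adjacency lists. The local solves here use simple Richardson iterations rather than the paper's push algorithm, and all names are ours; the "restricted" part is that each cluster's correction is kept only at its home vertices.

```python
def ras_pagerank(adj, clusters, home, alpha=0.85, sweeps=300, inner=100):
    """Sketch of restricted additive Schwarz for (I - alpha*P^T) x = e.

    adj      : dict vertex -> list of neighbours (undirected graph)
    clusters : list of vertex lists forming an overlapping cover
    home     : dict vertex -> index of its home cluster
    """
    deg = {v: len(adj[v]) for v in adj}
    x = {v: 0.0 for v in adj}
    for _ in range(sweeps):
        # global residual r = e - (I - alpha*P^T) x, with e = all-ones
        r = {v: 1.0 - x[v] + alpha * sum(x[u] / deg[u] for u in adj[v])
             for v in adj}
        for k, C in enumerate(clusters):
            # inexact local solve of the cluster block by Richardson iteration
            d = {v: 0.0 for v in C}
            for _ in range(inner):
                d = {v: r[v] + alpha * sum(d[u] / deg[u]
                                           for u in adj[v] if u in d)
                     for v in C}
            # 'restricted': keep the correction only at this cluster's homes
            for v in C:
                if home[v] == k:
                    x[v] += d[v]
    return x
```

Because I − αPᵀ is an M-matrix, this stationary iteration converges; on a regular graph the solution is the constant vector 1/(1−α), which makes a convenient sanity check.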
It works!
[Figure: relative work vs. volume ratio (1 to 1.7), with the METIS partitioner as the partitioning baseline. Series: swapping probability and PageRank communication for usroads and web-Google. The volume ratio measures how much more of the graph we need to store.]
…the number of foreign residual elements communicated during the course of the linear system solve.
PageRank as a linear system is (I − αPᵀ)x = e, where P is the random-walk normalization of an adjacency matrix and α is the link-following probability in PageRank. This "A" is an M-matrix and thus the above procedure will converge, although PageRank admits a much simpler convergence analysis, which we omit due to space. For PageRank, we solve each local system using the PageRank push algorithm proposed by McSherry [29], using a queue as described by Andersen et al. [3]. This algorithm updates the foreign residuals along with the local residuals. At each iteration, it "pushes" the residual rank to the boundary of the cluster, updating the solution x_C within the current cluster. After completing all of these push operations, it communicates the residuals to the core vertices as above.
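The queue-driven push algorithm described above can be sketched as follows for a whole undirected graph; restricting the pushes to one cluster and shipping the boundary residuals yields the distributed variant. This is a sketch after McSherry [29] and Andersen et al. [3], not the authors' code, and the function name is ours.

```python
from collections import deque

def pagerank_push(adj, alpha=0.85, tol=1e-8):
    """Queue-driven 'push' solver for (I - alpha*P^T) x = e.

    adj : dict vertex -> list of neighbours (undirected graph)
    Maintains a solution x and a residual r with the invariant that the
    true solution equals x plus the solve of the system with rhs r.
    """
    deg = {u: len(adj[u]) for u in adj}
    x = {u: 0.0 for u in adj}
    r = {u: 1.0 for u in adj}       # residual starts at the rhs e = all-ones
    queue = deque(adj)
    inq = set(adj)
    while queue:
        u = queue.popleft()
        inq.discard(u)
        ru, r[u] = r[u], 0.0
        x[u] += ru                   # push the residual into the solution
        spread = alpha * ru / deg[u]
        for v in adj[u]:
            r[v] += spread           # new residual appears at the neighbours
            if r[v] > tol and v not in inq:
                queue.append(v)
                inq.add(v)
    return x
```

At termination every residual entry is below `tol`, so the error is at most ‖r‖/(1 − α); on a regular graph the exact answer is the constant 1/(1 − α).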
There is no closed-form solution for the PageRank vector of an undirected graph, although it is usually near the standard normalized-degree stationary distribution. Nonetheless, using this approach, it is actually possible to solve a PageRank system with zero communication. This occurs when the core vertices are sufficiently far from the boundary, so that the residual rank arriving at the boundary is negligible. In a real implementation, there would be a few small communication steps in aggregating the solution vector x and checking the residual r; however, we do not model those communication steps in our PageRank communication metric. This result cannot occur in a partitioning because there is no boundary around the cluster.
To summarize, when solving a problem with overlapping clusters, solve locally within each cluster and communicate the residual on the boundary to the core vertices identified by the map τ.
7. EXPERIMENTAL RESULTS
At this point, we provide experimental evidence for the ability of our heuristic technique to (i) reduce the swapping probability (Table 2) and (ii) reduce the communication in a distributed PageRank solve (Table 3). Before getting to that demonstration, we first discuss the datasets we utilize for our experiments.
7.1 Data Sets
The data we use to evaluate our techniques comes in two classes: geometric networks and information networks. (We include social networks within information networks.) Our technique is extremely effective at the geometric networks, whereas it is less effective at information networks. Thus, we focus more on the latter. All of our experimental data comes from the following public sources: the University of Florida Sparse Matrix collection [12], the Stanford SNAP collection [26], the Library of Congress subject headings (lcsh graph) [39], and the National Highway Planning Network (usroads graph, http://www.fhwa.dot.gov/planning/nhpn/). The annulus graph is a random triangulation of points in a large annulus. Please see Table 1 for information about the size of the networks. We removed the direction of any edges and only consider the largest connected component of the network.
7.2 Cluster performance
The first stage of our algorithm is to generate a set of clusters with small conductance using a local personalized
Table 1: All graphs are undirected and connected. Edges are counted twice and some graphs have self-loops. The first group are geometric networks and the second are information networks.

Graph         |V|      |E|       maxdeg  |E|/|V|
onera         85567    419201    5       4.9
usroads       126146   323900    7       2.6
annulus       500000   2999258   19      6.0
email-Enron   33696    361622    1383    10.7
soc-Slashdot  77360    1015667   2540    13.1
dico          111982   2750576   68191   24.6
lcsh          144791   394186    1025    2.7
web-Google    855802   8582704   6332    10.0
as-skitter    1694616  22188418  35455   13.1
cit-Patents   3764117  33023481  793     8.8
[Three panels: conductance (0 to 1) vs. cluster vertices (log scale).]
Figure 2: Cluster conductance for three different regimes on the graph lcsh. From left: small, medium, and big. The red dots are the results with the final set of combined clusters.
PageRank algorithm [3]. This algorithm has well-known properties and has been used in other experimental probes of graph clusters [27]. This algorithm has a few parameters we can control. Although our final goal may be to have clusters of the network with a MaxVol of 10% of the total graph volume, we often found it effective to find smaller clusters and combine these together. We investigate three regimes. The first, called small, seeks clusters up to MaxVol of 10000 with a small average volume. The second, called med, seeks clusters up to MaxVol of 10000 with a large average volume. The third, called big, seeks clusters up to MaxVol of 10% of the total graph volume with an average volume comparable to the medium set. The conductance of the sets generated by these algorithms is plotted in Figure 2 for the graph lcsh.
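For reference, the conductance these clusters minimize is the cut divided by the smaller of the two volumes. A minimal sketch, again on dict adjacency lists:

```python
def conductance(adj, S):
    """phi(S) = Cut(S) / min(Vol(S), Vol(G) - Vol(S)).

    adj : dict vertex -> list of neighbours; S : set of vertices.
    """
    S = set(S)
    # Cut(S): edges with exactly one endpoint inside S
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    vol_s = sum(len(adj[u]) for u in S)
    vol_g = sum(len(nbrs) for nbrs in adj.values())
    return cut / min(vol_s, vol_g - vol_s)
```

Small conductance means a walk started inside S rarely crosses the cut, which is exactly the property the cover heuristic wants in its candidate clusters.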
7.3 Combine performance
We now show that we can effectively combine clusters. See Figure 3 for relationships between the input to the cluster combination process and the output in three measures. This experiment summarizes the result of all the cluster combination experiments we performed on the information networks during the sweep experiment discussed in the next section. The first measure is the number of clusters. In this measure, the algorithm effectively combines up to 1,000,000 initial clusters down to a few hundred or thousand. The second measure is the volume ratio of the input clusters to the volume ratio of the output clusters. This ratio always decreases, though not always by a large margin. The final measure is the average conductance of the input clusters to the average conductance of the output clusters. Here, we find the average conductance can increase, although it most often decreases.
OVERLAPPING CLUSTERS FOR DISTRIBUTED COMPUTATION
REID ANDERSEN · MICROSOFT
DAVID F. GLEICH · SANDIA
VAHAB S. MIRROKNI · GOOGLE
1. THE IDEA
Scalable, distributed algorithms must address communication problems. We investigate overlapping clusters, or vertex partitions that intersect, for graph computations. This setup stores more of the graph than required but then affords the ease of implementation of vertex-partitioned algorithms. Our hope is that this technique allows us to reduce communication in a computation on a distributed graph.
2. RELATED WORK
The motivation above draws on recent work in communication-avoiding algorithms. Mohiyuddin et al. (SC09) design a matrix-powers kernel that gives rise to an overlapping partition. Fritzsche et al. (CSC2009) develop an overlapping clustering for a Schwarz method. Both techniques extend an initial partitioning with overlap. Our procedure generates overlap directly. Indeed, Schwarz methods are commonly used to capitalize on overlap. Elsewhere, overlapping communities (Ahn et al., Nature 2009; Mishra et al., WAW2007) are now a popular model of structure in social networks. These have long been studied in statistics (Cole and Wishart, CompJ 1970).
3. PROBLEM SETUP
Let Vol(C) = sum of degrees for v ∈ C; Cut(C) = total edges between C and the rest of the graph. Note that Vol(C) is a proxy for the adjacency data size of vertices in C.

Given a graph G, an overlapping clustering (C, τ) is a set of clusters C and a mapping τ from each vertex to a home cluster. The total number of edges in a cluster (Vol(C)) is constrained by MaxVol. In a random walk on an overlapping clustering, the walk moves from cluster to cluster. On leaving a cluster, it goes to the home cluster of the new vertex. In the illustrations here, the color indicates the home vertices for a cluster. (See the example above too.) A transition between clusters is a swap, and requires a communication if the underlying graph is distributed. We thus wish to minimize swaps in a random walk. Let ρT(v) = the expected fraction of steps that swap in a T-step walk starting from v. We study
  ρ₁ = lim_{T→∞} (1/n) Σ_v ρT(v),
the fraction of steps with swaps for a long walk. For a cycle graph, we can prove that overlap reduces the communication.
THEOREM. Consider a large cycle Cn of n = Mℓ nodes for a large number M > 0, and let the maximum volume of a cluster MaxVol be ℓ. Let P be the optimal partitioning of G into non-overlapping clusters of size at most MaxVol and ρ₁(P) be the swapping probability of P. There exists an overlapping cover with TotalVol of 2 Vol(G) whose swapping probability ρ₁′ is less than ρ₁(P)/Ω(MaxVol).
(Proof sketch) The overlapping clustering that achieves this bound wraps the cycle in overlapping clusters:
[Figure: the cycle graph covered by overlapping clusters; the four "middle" vertices of the blue cluster are labeled H.]
Each cluster has ℓ vertices, and the home vertices are the "middle" ones, as in the four vertices labeled H above for the blue cluster. The best ρ₁ for a partitioning is 2/ℓ, because ρ₁ = (1/Vol(G)) Σ_{C ∈ P} Cut(C) for a partitioning. A random walk travels O(√t) distance in t steps. The edge of an overlapping cluster is always in the center of another cluster, and so it will take ℓ²/4 steps to exit after a swap, yielding ρ₁ = 4/ℓ².
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
4. HEURISTICS FOR OVERLAPPING CLUSTERS
Optimizing ρ₁ with a MaxVol constraint is NP-hard by a relaxation from minimum bisection. To produce clusters with a small ρ₁ we use a multi-stage heuristic:
1. Identify candidate clusters. Use a PageRank clustering heuristic or METIS to find small-conductance clusters up to size MaxVol.
2. Compute well-contained sets. For each vertex, compute the time for a random walk to leave a cluster starting there and use this to pick home vertices.
3. Cover with cluster cores. Approximately solve a set-cover problem to pick a subset of clusters.
4. Combine clusters. Finally, we combine any small clusters until the final size of each is about MaxVol.
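Step 2's leave times satisfy a small linear system: t(v) = 1 + the average of t over v's neighbours, with t = 0 outside the cluster, so home vertices can be taken as those with the largest t. A sketch, assuming dict adjacency lists (the function name is ours):

```python
def leave_times(adj, cluster, iters=2000):
    """Expected number of steps before a walk started at v first leaves
    `cluster`: t[v] = 1 + mean of t over v's neighbours, with t = 0
    outside.  Solved here by plain fixed-point sweeps, which converge
    quickly for small clusters."""
    C = set(cluster)
    t = {v: 0.0 for v in C}
    for _ in range(iters):
        t = {v: 1.0 + sum(t.get(u, 0.0) for u in adj[v]) / len(adj[v])
             for v in C}
    return t
```

Vertices deep inside the cluster get large leave times and are the natural "well contained" cores; boundary vertices get small ones.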
THE OUTPUT
[Figures: a PageRank cluster (color is containment, white = best), and another; a combined cluster (red = home vertices), and another; overall best containment (white = best); vertex overlap (gray = 1, black = 2, red = 2).]
5. DATA
We empirically study this idea on 10 public graphs.

Graph         |V|      |E|       maxdeg  |E|/|V|
onera         85567    419201    5       4.9
usroads       126146   323900    7       2.6
annulus       500000   2999258   19      6.0
email-Enron   33696    361622    1383    10.7
soc-Slashdot  77360    1015667   2540    13.1
dico          111982   2750576   68191   24.6
lcsh          144791   394186    1025    2.7
web-Google    855802   8582704   6332    10.0
as-skitter    1694616  22188418  35455   13.1
cit-Patents   3764117  33023481  793     8.8
6. RESULTS
We present two types of results: (i) an estimated swapping probability ρ₁; and (ii) the communication volume of a parallel PageRank solution (link-following α = 0.85) using an additive Schwarz method. The volume ratio is the amount of extra storage for the overlap (2 means we store the graph twice). Below, as the ratio increases, the swapping probability and PageRank communication volume decrease.

Two graphs, in detail. All graphs, multiple experiments.
[Left figure: relative work vs. volume ratio (1 to 1.7), against the METIS partitioner; series: swapping probability and PageRank communication for usroads and web-Google. Right figure: relative swapping probability vs. volume ratio (0 to 25), against the METIS partitioner.]
The communication ratio of our best result for the PageRank communication volume compared to METIS or GRACLUS shows that the method works for 6 of them (perf. ratio < 1). The 0-communication result is not a bug.
Graph         Comm. of Partition  Comm. of Overlap  Perf. Ratio  Vol. Ratio
onera         18654               48                0.003        2.82
usroads       3256                0                 0.000        1.49
annulus       12074               2                 0.000        0.01
email-Enron   194536*             235316            1.210        1.7
soc-Slashdot  875435*             1.3 × 10⁶         1.480        1.78
dico          1.5 × 10⁶*          2.0 × 10⁶         1.320        1.53
lcsh          73000*              48777             0.668        2.17
web-Google    201159*             167609            0.833        1.57
as-skitter    2.4 × 10⁶           3.9 × 10⁶         1.645        1.93
cit-Patents   8.7 × 10⁶           7.3 × 10⁶         0.845        1.34
Finally, we evaluate our heuristic. At left, the cluster combine procedure reduces 10⁶ clusters to around 10². Middle, combining clusters can decrease the volume ratio from 10 to around 1. At right, the mean conductance tends to decrease after combining clusters.
[Figure: three panels. Left: output clusters vs. input clusters (log-log). Middle: output volume ratio vs. input volume ratio (log-log). Right: output mean conductance vs. input mean conductance, both on [0, 1].]
Please email David Gleich ([email protected]) for questions. Our code is available online: Google "overlapping gleich" to find it. CSC2011 · 17 May 2011
* means GRACLUS gave a better partition than METIS.
All code available: http://www.cs.purdue.edu/~dgleich/codes/overlapping
Summary
Overlap helps reduce communication in a distributed process. We gave a proof-of-concept procedure to find overlapping partitions that reduce communication.

Future work
A truly distributed implementation and evaluation. Can we exploit data redundancy to solve problems on large graphs faster?
[Backup slide: two edge lists of "src -> dst" entries, labeled Copy 1 and Copy 2.]