non-exhaustive, overlapping k-means
TRANSCRIPT
Non-exhaustive, overlapping K-means clustering
David F. Gleich, Purdue University
Real-world graph and point data have overlapping clusters.
Other uses for PageRank: what else people use PageRank to do
GeneRank
Use (I − αGD⁻¹)x = w to find "nearby" important genes.
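A made-up toy sketch of this GeneRank-style solve; the graph G, prior scores w, and damping value α below are illustrative assumptions, not data from the talk:

```python
import numpy as np

# Toy sketch: G is a gene-gene adjacency matrix, D the diagonal matrix
# of its column sums, w a vector of prior gene scores (e.g., differential
# expression), and alpha < 1 a damping parameter. All values made up.
G = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
d = G.sum(axis=0)                    # degrees (column sums)
w = np.array([1.0, 0.0, 0.0, 0.5])  # prior importance of each gene
alpha = 0.85

# Solve (I - alpha G D^{-1}) x = w; dividing G by d scales column j by 1/d_j.
M = np.eye(4) - alpha * (G / d)
x = np.linalg.solve(M, w)
print(x)  # genes near high-prior genes receive elevated scores
```

Because GD⁻¹ is column-stochastic and α < 1, the system is nonsingular and the scores stay nonnegative.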
ProteinRank
IsoRank
Clustering (graph partitioning)
Sports ranking
Teaching
Morrison et al., GeneRank, 2005. (Slide from Gleich's Ph.D. defense, PageRank intro.)
Social networks have overlapping clusters because of social circles
Genes have overlapping clusters due to their role in multiple functions
SILO Seminar David Gleich · Purdue
Overlapping research projects are what got me here too!
PhD Thesis on Google's PageRank
MSR Intern and Overlapping Clusters for Distributed Computation
Accelerated NCP plots and locally minimal communities
Neighborhood inflated seed expansion for overlapping communities
Non-exhaustive overlapping K-means
1. NISE Clustering — Whang, Gleich, Dhillon, CIKM 2013
2. NEO-K-Means — Whang, Gleich, Dhillon, SDM 2015
3. NEO-K-Means SDP — Hou, Whang, Gleich, Dhillon, KDD 2015
4. Multiplier Methods for Overlapping K-Means — Hou, Whang, Gleich, Dhillon, submitted
Proposed Algorithm
Seed Set Expansion: carefully select seeds, then greedily expand communities around the seed sets.
The algorithm: Filtering Phase, Seeding Phase, Seed Set Expansion Phase, Propagation Phase.
Joyce Jiyoung Whang, The University of Texas at Austin. Conference on Information and Knowledge Management.
Overlapping communities via seed set expansion works nicely.
Table 4: Returned number of clusters and graph coverage of each algorithm
Graph         metric             random   egonet   graclus ctr.   spread hubs   demon     bigclam
HepPh         coverage (%)         97.1     72.1      100            100          88.8      62.1
HepPh         no. of clusters        97      241      109            100        5,138       100
AstroPh       coverage (%)         97.6     71.1      100            100          94.2      62.3
AstroPh       no. of clusters       192      282      256            212        8,282       200
CondMat       coverage (%)         92.4     99.5      100            100          91.2      79.5
CondMat       no. of clusters       199      687      257            202       10,547       200
DBLP          coverage (%)         99.9     86.3      100            100          84.9      94.6
DBLP          no. of clusters    21,272    8,643   18,477         26,503      174,627    25,000
Amazon        coverage (%)         99.9      100      100            100          79.2      99.2
Amazon        no. of clusters    21,553   14,919   20,036         27,763      105,828    25,000
Flickr        coverage (%)         76.0     54.0      100           93.6          -         52.1
Flickr        no. of clusters    14,638   24,150   16,347         15,349          -       15,000
LiveJournal   coverage (%)         88.9     66.7     99.8           99.8          -         43.9
LiveJournal   no. of clusters    14,850   34,389   16,271         15,058          -       15,000
Myspace       coverage (%)         91.4     69.1      100           99.9          -          -
Myspace       no. of clusters    14,909   67,126   16,366         15,324          -          -
[Figure 2 panels (a) AstroPh, (b) HepPh, (c) CondMat, (d) Flickr, (e) LiveJournal, (f) Myspace: maximum conductance (y-axis, 0 to 1) vs. coverage percentage (x-axis, 0 to 100) for the seeding strategies egonet, graclus centers, spread hubs, random, demon, and bigclam.]
Figure 2: Conductance vs. graph coverage – lower curve indicates better communities. Overall, "graclus centers" outperforms other seeding strategies, including the state-of-the-art methods Demon and Bigclam.
We can cover 95% of the network with communities of conductance ≈ 0.15.
Flickr social network: 2M vertices, 22M edges.
cond(S) = cut(S) / "size"(S)
We wanted a more principled approach to achieve these results.
The state of the art for clustering

               Problem 1   Problem 2   Problem 3   Problem 4
K-Means          😀          😊          😟          😢
NEO-K-Means                              😊          😊
[Diagram: a point x_i with distances ‖x_i − m_1‖ and ‖x_i − m_2‖ to two centroids m_1 and m_2.]

K-means as optimization.
minimize    Σ_ij U_ij ‖x_i − m_j‖²
subject to  U is an assignment to clusters,
            m_j = (Σ_i U_ij x_i) / (Σ_i U_ij)

minimize    Σ_ij U_ij ‖x_i − m_j‖²
subject to  U is a multi-assignment to clusters,
            m_j = (Σ_i U_ij x_i) / (Σ_i U_ij)
Input: points x_1, ..., x_n. Find an assignment matrix U that gives cluster assignments to minimize the objective. For example, with points x_1, ..., x_4 and clusters c_1, c_2:

      c1  c2
U = [  1   0
       1   0
       0   1
       0   1 ]
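The objective and the centroid formula can be evaluated directly from an assignment matrix like the one above; a small sketch in which the point coordinates are made up for illustration:

```python
import numpy as np

# Evaluate the k-means objective sum_ij U_ij ||x_i - m_j||^2 for the
# four-point assignment matrix shown above (coordinates made up).
X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [5.0, 5.0],
              [6.0, 5.0]])
U = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [0, 1]], dtype=float)

# Centroids m_j = (sum_i U_ij x_i) / (sum_i U_ij)
M = (U.T @ X) / U.sum(axis=0)[:, None]

# Squared distance of every point to every centroid, masked by U.
sq = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
objective = (U * sq).sum()
print(objective)  # 1.0: each point sits 0.5 from its centroid (4 x 0.25)
```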
K-means objective!
K-means’ objective with overlap?!
Overlap is not a natural addition to optimization based clustering.
The NEO-K-means objective balances overlap and outliers.
minimize    Σ_ij U_ij ‖x_i − m_j‖²
subject to  U_ij is binary,
            trace(UᵀU) = (1 + α)n   (αn overlap)
            eᵀ 𝟙[Ue ≥ 1] ≥ (1 − β)n   (up to βn outliers)
            m_j = (Σ_i U_ij x_i) / (Σ_i U_ij)

· If α, β = 0, then we get back to K-means.
· Automatically choose α, β based on K-means.

😊 1. Make (1 + α)n total assignments.
   2. Allow up to βn outliers.
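The two constraints are easy to check mechanically; a sketch on a made-up assignment of n = 5 points to k = 2 clusters, with two overlap points and one outlier:

```python
import numpy as np

# Checking the two NEO-K-Means constraints on a made-up assignment:
# two points sit in both clusters (overlap), one point is unassigned.
U = np.array([[1, 0],
              [1, 0],
              [1, 1],
              [1, 1],
              [0, 0]])
n = U.shape[0]

total_assignments = np.trace(U.T @ U)        # equals (1 + alpha) * n
outliers = int((U.sum(axis=1) == 0).sum())   # points in no cluster
alpha = (total_assignments - n) / n
beta = outliers / n
print(total_assignments, outliers, alpha, beta)  # 6 1 0.2 0.2
```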
[Figure 1: three scatter plots on axes from −8 to 8. Panels: (a) Ground-truth clusters, (b) First extension of k-means, (c) NEO-K-Means. Legend: Cluster 1, Cluster 2, Cluster 1 & 2, Not assigned.]
Figure 1: (a) Two ground-truth clusters are generated (n=1,000, α=0.1, β=0.005). Green points indicate overlap between the clusters, and black points indicate outliers. See Section 5 for details. (b) Our first extension of the k-means objective function defined in (2.2) makes too many outlier assignments and fails to recover the ground truth. (c) The NEO-K-Means objective defined in (2.3) adds an explicit term for non-exhaustiveness that enables it to correctly detect the outliers and find a natural overlapping clustering structure very similar to the ground-truth clusters (α and β are automatically estimated by the heuristics discussed in Section 2.5).
Let e denote the vector having all elements equal to one. Then the vector U1 gives the number of clusters to which each data point belongs; thus, (U1)_i = 0 means that x_i does not belong to any cluster. Now, by adding a non-exhaustiveness constraint to (2.2), we define our NEO-K-Means objective function as follows:

(2.3)   min_U  Σ_{j=1}^{k} Σ_{i=1}^{n} u_ij ‖x_i − m_j‖²,  where  m_j = (Σ_{i=1}^{n} u_ij x_i) / (Σ_{i=1}^{n} u_ij)
        s.t.   trace(UᵀU) = (1 + α)n,   Σ_{i=1}^{n} 𝟙{(U1)_i = 0} ≤ βn.

We allow at most βn data points to be assigned to no cluster, i.e., at most βn data points can be considered outliers. We require 0 ≤ βn and note that βn ≪ n so that most data points are assigned to clusters; by the definition of "outliers", βn should be very small compared to n. The parameters α and β offer an intuitive way to capture the degree of overlap and non-exhaustiveness: by "turning the knob" on these parameters, the user can explore the landscape of overlapping, non-exhaustive clusterings. If α=0 and β=0, the NEO-K-Means objective function is equivalent to the standard k-means objective presented in (2.1). To see this, note that setting β=0 requires every data point to belong to at least one cluster, while setting α=0 makes n assignments. Putting these together, the resulting clustering is disjoint and exhaustive. Note that objective (2.2) with α=0 does not have this property.

To see whether the objective function (2.3) yields a reasonable clustering, we test it on the same dataset used in the previous subsection. Figure 1(c) shows the result (α and β are automatically estimated by the heuristics discussed in Section 2.5). We see that NEO-K-Means correctly finds all the outliers and produces overlapping structure very similar to the ground-truth clusters.

2.4 The NEO-K-Means Algorithm. We now propose a simple iterative algorithm which monotonically decreases the NEO-K-Means objective until it converges to a local minimum. Given the hard constraints in (2.3), we make n + αn assignments such that at most βn data points have no membership in any cluster. The second constraint can be interpreted as follows: among the n data points, at least n − βn should have membership in some cluster. When the algorithm assigns points to clusters, it uses two phases to satisfy these two constraints; thus each cluster C_j decomposes into two sets Ĉ_j and C̄_j that record the assignments made in each phase.

Algorithm 1 describes the NEO-K-Means algorithm. We first initialize the cluster centroids; any initialization strategy used in k-means may also be applied to our algorithm. Given the centroids, we compute all the distances [d_ij]_{n×k} between every data point and every cluster, and for each data point record its closest cluster and that distance. The data points are then sorted in ascending order by the distance to their closest cluster. To ensure that at least n − βn data points are assigned to some cluster (i.e., to satisfy the second constraint), we assign the first n − βn data points to their closest clusters. Let Ĉ_j denote the assignments made in this step; thus Σ_{j=1}^{k} |Ĉ_j| = n − βn. Then, we make βn + αn more assignments by taking the βn + αn minimum distances among [d_ij]_{n×k} such that x_i ∉ Ĉ_j. Let C̄_j denote the assignments made in this step; thus Σ_{j=1}^{k} |C̄_j| = βn + αn. Finally, Σ_{j=1}^{k} (|Ĉ_j| + |C̄_j|) = n + αn. Once all the assignments are made, we update the cluster centroids by recomputing the mean of each cluster. We repeat this procedure until the change in the objective function is sufficiently small or the maximum number of iterations is reached. Note that the algorithm does not forcibly choose βn points as outliers.
Lloyd's algorithm for NEO-K-Means is just a wee bit more complex.

Until done:
1. Update centroids.
2. Assign (1 − β)n points to their closest centroid.
3. Make (α + β)n more assignments by minimizing distance.
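These steps can be sketched in a few lines of NumPy. This is an illustrative reimplementation under stated assumptions, not the authors' code: the function name `neo_kmeans`, the random-point initialization, and the fixed iteration count are all made up for this sketch.

```python
import numpy as np

def neo_kmeans(X, k, alpha, beta, iters=100, seed=0):
    """Sketch of the NEO-K-Means Lloyd iteration described above.
    Makes (1 + alpha) * n assignments total while leaving at most
    beta * n points unassigned (outliers)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    M = X[rng.choice(n, size=k, replace=False)]   # initial centroids
    U = np.zeros((n, k), dtype=bool)
    n_assign = round((1 + alpha) * n)             # total assignments
    n_first = n - round(beta * n)                 # phase-1 assignments
    for _ in range(iters):
        D = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
        U[:] = False
        # Phase 1: assign the (1 - beta)n points closest to some centroid.
        closest = D.argmin(axis=1)
        order = np.argsort(D[np.arange(n), closest])
        U[order[:n_first], closest[order[:n_first]]] = True
        # Phase 2: make the remaining (alpha + beta)n assignments by
        # taking the smallest still-unused point-centroid distances.
        D_free = np.where(U, np.inf, D)
        flat = np.argsort(D_free, axis=None)[: n_assign - n_first]
        U[np.unravel_index(flat, D.shape)] = True
        # Update each nonempty (possibly overlapping) cluster's centroid.
        counts = U.sum(axis=0)
        keep = counts > 0
        M[keep] = (U[:, keep].T.astype(float) @ X) / counts[keep][:, None]
    return U, M
```

By construction every iteration makes exactly (1 + α)n assignments and leaves at most βn points unassigned.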
This algorithm correctly assigns our example case and even determines the overlap and outlier parameters!

THEOREM. Lloyd's algorithm decreases the objective monotonically.

The non-exhaustiveness constraint is necessary for correct assignments.
Output without the assignment constraint (β = 1).
NEO-K-Means output (correct).
The Weighted Kernel NEO-K-Means objective.
• Introduce weights for each data point.
• Introduce feature maps for each data point too.
SILO Seminar David Gleich · Purdue
minimize over U:  Σᵢⱼ Uᵢⱼ wᵢ ‖φ(xᵢ) − mⱼ‖²
subject to  Uᵢⱼ is binary,
            trace(UᵀU) = (1 + α)n      (αn overlap)
            eᵀ Ind[Ue] ≥ (1 − β)n      (up to βn outliers)
where  mⱼ = (Σᵢ Uᵢⱼ wᵢ φ(xᵢ)) / (Σᵢ Uᵢⱼ wᵢ).

In terms of the kernel, the objective expands as

Σᵢⱼ Uᵢⱼ wᵢ ‖φ(xᵢ) − mⱼ‖² = Σᵢⱼ Uᵢⱼ wᵢ Kᵢᵢ − Σⱼ (uⱼᵀ W K W uⱼ) / (uⱼᵀ W uⱼ).
Theorem. If K = D⁻¹ + D⁻¹AD⁻¹, then the NEO-K-Means objective is equivalent to overlapping conductance.
NOTE
This means that NEO-K-Means was the principled objective we were after!
Conductance communities. Conductance is one of the most important community scores [Schaeffer07]. The conductance of a set of vertices is the ratio of edges leaving the set to the total edges in the set; equivalently, it is the probability that a random edge leaves the set. Small conductance ⇔ good community.
φ(S) = cut(S) / min(vol(S), vol(S̄))  =  (edges leaving the set) / (total edges in the set)

Example: cut(S) = 7, vol(S) = 33, vol(S̄) = 11, so φ(S) = 7/11.
Our theorem means that NEO-K-Means can optimize the sum-of-conductances objective.
φ(S) (conductance)  ≤  cut(S)/vol(S) + cut(S̄)/vol(S̄)  (normalized cut of a bi-partition)

Σ_{S∈C} cut(S)/vol(S)  =  Σ_{S∈C} φ(S)  if vol(S) ≤ vol(S̄)  (the NEO-K-Means objective)
When we use this method to partition the Karate club network, we get reasonable solutions.
• Inspired by Dhillon et al.’s work on Graclus
• We have a multilevel method to optimize the graph case.
We get state-of-the-art clustering performance on vector and graph datasets.
F1 scores on vector datasets from the Mulan repository:

         moc     fuzzy   esp     isp     okm     rokm    NEO
synth1   0.833   0.959   0.977   0.985   0.989   0.969   0.996
synth2   0.836   0.957   0.952   0.973   0.967   0.975   0.996
synth3   0.547   0.919   0.968   0.952   0.970   0.928   0.996
yeast    -       0.308   0.289   0.203   0.311   0.203   0.366
music    0.534   0.533   0.527   0.508   0.527   0.454   0.550
scene    0.467   0.431   0.572   0.586   0.571   0.593   0.626

Dataset statistics:

         n       dim.   avg |C|   outliers   k
synth1   5,000   2      2,750     0          2
synth2   1,000   2      550       5          2
synth3   6,000   2      3,600     6          2
yeast    2,417   103    731.5     0          14
music    593     72     184.7     0          6
scene    2,407   294    430.8     0          6

The Mulan testset has a number of appropriate datasets.
NEO-K-Means with Lloyd's method is fast and usually accurate, but inconsistent.
[Three scatter plots of the same dataset, with points labeled Cluster 1, Cluster 2, Cluster 1 & 2, Cluster 3, and Not assigned.]
A more complicated overlapping test case
The output from NEO-K-Means with Lloyd’s method
Can we get a more robust method? Yes!
Towards better optimization of the objective
1. An SDP relaxation of the objective. 2. A practical low-rank SDP heuristic. 3. Faster optimization methods for the heuristic.
From assignments to co-occurrence matrices
There are three key variables in our formulation
1. The co-occurrence matrix
   Z = Σⱼ W uⱼ uⱼᵀ W / (uⱼᵀ W uⱼ)
2. The overlap vector f
3. The assignment indicator g

For example,
U = [1 0; 1 1; 0 1; 0 0],   f = [1; 2; 1; 0],   g = [1; 1; 1; 0].
We can convert our objective into a trace minimization problem.
Kᵢⱼ = φ(xᵢ)ᵀ φ(xⱼ),   dᵢ = wᵢ Kᵢᵢ

Σᵢⱼ Uᵢⱼ wᵢ ‖φ(xᵢ) − mⱼ‖²
  = Σᵢⱼ Uᵢⱼ wᵢ Kᵢᵢ − Σⱼ (uⱼᵀ W K W uⱼ) / (uⱼᵀ W uⱼ)
  = fᵀd − trace(KZ)

Z = normalized co-occurrence
f = overlap count
g = assignment indicator
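The identity above can be verified numerically. The sketch below checks it for a linear kernel (φ(x) = x, so K = XXᵀ) on random data with a random binary assignment matrix; the variable names and the random instance are our own.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 3
X = rng.normal(size=(n, 2))            # linear kernel: phi(x) = x, K = X X^T
w = rng.uniform(1.0, 2.0, size=n)      # positive data-point weights
U = rng.integers(0, 2, size=(n, k)).astype(float)
U[:k, :] += np.eye(k)                  # ensure every cluster is nonempty
U = np.minimum(U, 1.0)                 # keep U binary

K = X @ X.T
W = np.diag(w)

# Left side: the weighted kernel k-means objective via the centroids m_j.
lhs = 0.0
for j in range(k):
    wj = w * U[:, j]                   # w_i U_ij
    m_j = (wj @ X) / wj.sum()          # weighted cluster mean
    lhs += (wj * ((X - m_j) ** 2).sum(axis=1)).sum()

# Right side: f^T d - trace(K Z), with Z the normalized co-occurrence.
f = U.sum(axis=1)                      # overlap counts, f = U e
d = w * np.diag(K)                     # d_i = w_i K_ii
Z = sum(np.outer(W @ U[:, j], W @ U[:, j]) / (U[:, j] @ W @ U[:, j])
        for j in range(k))
rhs = f @ d - np.trace(K @ Z)
```

The two quantities agree to machine precision, which is exactly the trace-form rewriting that the SDP below exploits.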
The objective function
There is an SDP-like framework to solve NEO-K-Means.
maximize over Z, f, g:  trace(KZ) − fᵀd
subject to  trace(W⁻¹Z) = k,          (a)
            Zᵢⱼ ≥ 0,                  (b)
            Z ⪰ 0, Z = Zᵀ,            (c)
            Ze = Wf,                  (d)
            eᵀf = (1 + α)n,           (e)
            eᵀg ≥ (1 − β)n,           (f)
            f ≥ g,                    (g)
            rank(Z) = k,              (h)
            f ∈ ℤⁿ≥0, g ∈ {0, 1}ⁿ.    (i)

Constraints (a)–(c) and (h): Z must come from an assignment matrix. Constraints (d)–(g): overlap and assignment constraints. Constraints (h) and (i): combinatorial constraints.
There is an SDP relaxation to approximate NEO-K-Means.
maximize over Z, f, g:  trace(KZ) − fᵀd
subject to  trace(W⁻¹Z) = k,          (a)
            Zᵢⱼ ≥ 0,                  (b)
            Z ⪰ 0, Z = Zᵀ,            (c)
            Ze = Wf,                  (d)
            eᵀf = (1 + α)n,           (e)
            eᵀg ≥ (1 − β)n,           (f)
            f ≥ g,                    (g)
            0 ≤ g ≤ 1.

Constraints (a)–(c): Z must come from an assignment matrix. Constraints (d)–(g): overlap and assignment constraints. The last constraint is the relaxed version of the combinatorial constraints.
This SDP can easily solve simple problems.
NEO-K-Means SDP
Solution Z from CVX is even rank 2!
But SDP methods have a number of issues for large-scale problems:
1. The number of variables is quadratic in the number of data points.
2. The best solvers can only handle problems with a few hundred or thousand points.
So, like many before us (e.g., Burer & Monteiro; Kulis, Surendran & Platt 2007; and more), we optimize a low-rank factorization of the solution.
Using the NEO-K-Means Low-Rank SDP, we can find assignments directly.
NEO-K-Means Low-rank SDP
Y Yᵀ, with ‖Z − Y Yᵀ‖ = 2.3 × 10⁻⁴
maximize over Y, f, g, s, r:  trace(YᵀKY) − fᵀd
subject to  k = trace(YᵀW⁻¹Y)
            0 = Y Yᵀe − Wf
            0 = eᵀf − (1 + α)n
            0 = f − g − s
            0 = eᵀg − (1 − β)n − r
            Yᵢⱼ ≥ 0, s ≥ 0, r ≥ 0,
            0 ≤ f ≤ ke, 0 ≤ g ≤ 1
The Low-Rank NEO-K-Means SDP
We lose convexity but gain practicality. We introduce slacks at this point.
icky non-convex term
simple bound constraints
We use an augmented Lagrangian method to optimize this problem
[28] J. Peng and Y. Wei. Approximating k-means-type clustering via semidefinite programming. SIAM Journal on Optimization, 18(1):186–205, 2007.
[29] K. Trohidis, G. Tsoumakas, G. Kalliris, and I. P. Vlahavas. Multi-label classification of music into emotions. In International Conference on Music Information Retrieval, pages 325–330, 2008.
[30] J. J. Whang, I. S. Dhillon, and D. F. Gleich. Non-exhaustive, overlapping k-means. In Proceedings of the SIAM International Conference on Data Mining, pages 936–944, 2015.
[31] J. J. Whang, D. Gleich, and I. S. Dhillon. Overlapping community detection using seed set expansion. In ACM International Conference on Information and Knowledge Management, pages 2099–2108, 2013.
[32] L. F. Wu, T. R. Hughes, A. P. Davierwala, M. D. Robinson, R. Stoughton, and S. J. Altschuler. Large-scale prediction of Saccharomyces cerevisiae gene function using overlapping transcriptional clusters. Nature Genetics, 31(3):255–265, June 2002.
[33] E. P. Xing and M. I. Jordan. On semidefinite relaxations for normalized k-cut and connections to spectral clustering. Technical Report UCB/CSD-3-1265, University of California, Berkeley, 2003.
[34] J. Yang and J. Leskovec. Overlapping community detection at scale: a nonnegative matrix factorization approach. In ACM International Conference on Web Search and Data Mining, pages 587–596, 2013.
[35] S. X. Yu and J. Shi. Multiclass spectral clustering. In IEEE International Conference on Computer Vision, Volume 2, 2003.
APPENDIX
A. AUGMENTED LAGRANGIANS
The augmented Lagrangian framework is a general strategy for solving nonlinear optimization problems with equality constraints. We briefly review a standard textbook derivation for completeness [26]. Consider a general problem:

minimize over x:  f(x)
subject to  cᵢ(x) = 0, i = 1, …, m,
            l ≤ x ≤ u.     (7)

The augmented Lagrangian for this problem involves a set of Lagrange multipliers λᵢ that estimate the influence of each constraint on the objective, as well as a quadratic penalty to satisfy the nonlinear constraints. It is defined as

L_A(x; λ, γ) = f(x) − Σᵢ₌₁ᵐ λᵢ cᵢ(x) + (γ/2) Σᵢ₌₁ᵐ cᵢ(x)².

An augmented Lagrangian algorithm iteratively proceeds from an arbitrary starting point to a local solution of (7). At each step, a bound-constrained solver minimizes L_A over x subject to l ≤ x ≤ u. Based on an approximate solution, it adjusts the Lagrange multipliers λ and may update the penalty parameter γ. See Algorithm 17.4 in Nocedal and Wright [26] for a standard strategy to adjust the multipliers, penalty, and tolerances for each subproblem.
We use the L-BFGS-B procedure [9] to solve the subproblem. This requires subroutines to evaluate both the function and the gradient vector.
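The framework above can be illustrated on a toy instance of (7). This is a minimal sketch, not the NEO-LR solver: the toy objective, constraint, starting point, and parameter schedule are all our own choices; SciPy's L-BFGS-B plays the role of the bound-constrained subproblem solver.

```python
import numpy as np
from scipy.optimize import minimize

# Toy instance of (7): minimize f(x) = x0^2 + x1^2
# subject to c(x) = x0 + x1 - 1 = 0 and bounds 0 <= x <= 1.
# The exact solution is x = (0.5, 0.5) with multiplier lambda = 1.
def f(x):
    return float(x @ x)

def c(x):
    return x[0] + x[1] - 1.0

def augmented_lagrangian(x0, lam=0.0, gamma=10.0, outer_iters=15):
    """Minimal augmented Lagrangian loop: minimize L_A over the bounds
    with L-BFGS-B, then update the multiplier estimate and grow the
    penalty (a simplified version of the textbook schedule)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(outer_iters):
        LA = lambda x: f(x) - lam * c(x) + 0.5 * gamma * c(x) ** 2
        x = minimize(LA, x, method="L-BFGS-B",
                     bounds=[(0.0, 1.0), (0.0, 1.0)]).x
        lam = lam - gamma * c(x)        # first-order multiplier update
        gamma = min(2.0 * gamma, 1e8)   # optionally increase the penalty
    return x, lam
```

The multiplier update sign matches the definition of L_A above (linear term subtracted), so λ ← λ − γ c(x) at each outer step.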
B. GRADIENTS FOR NEO-LR
We now describe the analytic form of the gradients for the augmented Lagrangian of the NEO-LR objective, and briefly validate that they are correct. Consider the augmented Lagrangian (5). The gradient has five components for the five sets of variables Y, f, g, s, and r:

∇_Y L_A(Y, f, g, s, r; λ, µ, Λ, γ) = −2KY − eµᵀY − µeᵀY
    − 2(λ₁ − γ(trace(YᵀW⁻¹Y) − k))W⁻¹Y
    + γ(Y YᵀeeᵀY + eeᵀY YᵀY) − γ(WfeᵀY + efᵀWY)

∇_f L_A(Y, f, g, s, r; λ, µ, Λ, γ) =
    d + Wµ − γ(WY Yᵀe − W²f) − λ₂e + γ(eᵀf − (1 + α)n)e − Λ + γ(f − g − s)

∇_g L_A(Y, f, g, s, r; λ, µ, Λ, γ) =
    Λ − γ(f − g − s) − λ₃e + γ(eᵀg − (1 − β)n − r)e

∇_s L_A(Y, f, g, s, r; λ, µ, Λ, γ) = Λ − γ(f − g − s)

∇_r L_A(Y, f, g, s, r; λ, µ, Λ, γ) = λ₃ − γ(eᵀg − (1 − β)n − r)

Using analytic gradients in a black-box solver such as L-BFGS-B is problematic if the gradients are even slightly incorrectly computed. To guarantee that the analytic gradients we derive are correct, we use a forward finite-difference method to get a numerical approximation of the gradients based on the objective function. We compare these with our analytic gradient and expect to see small relative differences, on the order of 10⁻⁵ or 10⁻⁶. This is exactly what Figure 4 shows.
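The validation procedure described above can be sketched generically. This is our own minimal version of the check (function names and the smooth test function are illustrative, not the NEO-LR Lagrangian):

```python
import numpy as np

def check_gradient(fun, grad, x, eps=1e-6):
    """Compare an analytic gradient against forward finite differences.

    Returns per-component relative differences; values on the order of
    1e-5 or 1e-6 indicate a correctly coded gradient, as in Figure 4.
    """
    g = np.asarray(grad(x), dtype=float)
    g_fd = np.empty_like(g)
    f0 = fun(x)
    for i in range(x.size):
        xp = x.copy()
        xp[i] += eps
        g_fd[i] = (fun(xp) - f0) / eps   # forward difference
    return np.abs(g - g_fd) / np.maximum(np.abs(g), 1e-12)

# Example on a smooth test function:
fun = lambda x: float(np.sum(x ** 3))
grad = lambda x: 3.0 * x ** 2
rel = check_gradient(fun, grad, np.array([1.0, 2.0, 3.0]))
```

For forward differences the relative error scales like ε times the local curvature, so with ε = 10⁻⁶ a correct gradient shows differences near 10⁻⁵ to 10⁻⁷, while a sign or factor error shows up as O(1).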
Figure 4: Finite difference comparison of gradients where ε = 10⁻⁶, plotting the relative difference (rel. diff) against the gradient magnitude (abs(g)). The relative difference between the analytical gradient and the gradient computed via finite differences is small, indicating the gradient is correctly computed.
Finally, we denote by e the vector of all 1s. The following program is equivalent to the NEO-K-Means objective with a discrete assignment matrix:

maximize over Z, f, g:  trace(KZ) − fᵀd
subject to  trace(W⁻¹Z) = k,          (a)
            Zᵢⱼ ≥ 0,                  (b)
            Z ⪰ 0, Z = Zᵀ,            (c)
            Ze = Wf,                  (d)
            eᵀf = (1 + α)n,           (e)
            eᵀg ≥ (1 − β)n,           (f)
            f ≥ g,                    (g)
            rank(Z) = k,              (h)
            f ∈ ℤⁿ≥0, g ∈ {0, 1}ⁿ.    (i)     (2)
We omit the verification that this is actually equivalent to the NEO-K-Means objective (1), as it is not informative for our discussion. Constraints (a), (b), (c), and (h) encode the fact that Z must arise from an assignment matrix. Constraints (d), (e), (f), (g), and (i) are new to our NEO-K-Means formulation and express the amount of overlap and non-exhaustiveness in the solution. This is a mixed-integer, rank-constrained SDP; as such, it is combinatorially hard to optimize, just like the original NEO-K-Means objective.
The constraints that make this a combinatorial problem are (h) and (i). If we relax these constraints:
maximize over Z, f, g:  trace(KZ) − fᵀd
subject to  (a), (b), (c), (d), (e), (f), (g),
            0 ≤ g ≤ 1,     (3)
then we arrive at a convex problem. Thus, any locally optimal solution of (3) must be a global solution.
Solving (3) requires a black-box SDP solver such as CVX. As the solver converts this problem into a standard form, the number of variables becomes O(n²); the resulting complexity is worse than O(n³) in most cases and can be as bad as O(n⁶). These solvers are further limited by the delicate numerical precision issues that arise as they approach a solution. The combination of these features means that off-the-shelf procedures struggle to solve problems with more than 100 data points. We now describe a means of solving larger problems.
4. A LOW-RANK SDP FOR NEO-K-MEANS
In the SDP formulation of the NEO-K-Means objective (3), the matrix Z should only be rank k. By applying the low-rank factorization idea, Z becomes Y Yᵀ, where Y is n × k and non-negative. Thus, the following optimization program is a low-rank SDP for (3); we have chosen to write it in the standard form of a minimization problem, with explicit slack variables s, r that convert the inequality constraints into equality and bound constraints:

minimize over Y, f, g, s, r:  fᵀd − trace(YᵀKY)
subject to  k = trace(YᵀW⁻¹Y)        (s)
            0 = Y Yᵀe − Wf            (t)
            0 = eᵀf − (1 + α)n        (u)
            0 = f − g − s             (v)
            0 = eᵀg − (1 − β)n − r    (w)
            Yᵢⱼ ≥ 0, s ≥ 0, r ≥ 0,
            0 ≤ f ≤ ke, 0 ≤ g ≤ 1     (4)

Here we have also replaced the constraint Y Yᵀ ≥ 0 with the stronger constraint Y ≥ 0. This problem is a quadratic programming problem with quadratic constraints; we discuss how to solve it in the next subsection. We call the problem NEO-LR and the solution procedure LRSDP. Even though we lose convexity by formulating the low-rank SDP, this nonlinear programming problem requires only O(nk) memory, and existing nonlinear programming techniques allow us to scale to large problems.
After we get a solution, Y can be regarded as the normalized assignment matrix Y = WŪ, where Ū = [ū₁, ū₂, …, ū_k] and ū_c = u_c/√s_c for c = 1, …, k.
4.1 Solving the NEO-K-Means low-rank SDP
To solve the NEO-LR problem (4), we use an augmented Lagrangian framework. This is an iterative strategy in which each step minimizes an augmented Lagrangian of the problem that includes a current estimate of the Lagrange multipliers for the constraints, as well as a penalty term that drives the solution towards the feasible set. Augmented Lagrangian techniques have been successful in previous studies of low-rank SDP approximations [6]. Let λ = [λ₁; λ₂; λ₃] be the Lagrange multipliers associated with the three scalar constraints (s), (u), (w), and let µ and Λ be the Lagrange multipliers associated with the vector constraints (t) and (v), respectively. Let γ ≥ 0 be a penalty parameter. The augmented Lagrangian for (4) is:
L_A(Y, f, g, s, r; λ, µ, Λ, γ) =
    fᵀd − trace(YᵀKY)                                        [the objective]
  − λ₁(trace(YᵀW⁻¹Y) − k) + (γ/2)(trace(YᵀW⁻¹Y) − k)²
  − µᵀ(Y Yᵀe − Wf) + (γ/2)(Y Yᵀe − Wf)ᵀ(Y Yᵀe − Wf)
  − λ₂(eᵀf − (1 + α)n) + (γ/2)(eᵀf − (1 + α)n)²
  − Λᵀ(f − g − s) + (γ/2)(f − g − s)ᵀ(f − g − s)
  − λ₃(eᵀg − (1 − β)n − r) + (γ/2)(eᵀg − (1 − β)n − r)²      (5)
At each step in the augmented Lagrangian solution framework, we solve the following subproblem:

minimize  L_A(Y, f, g, s, r; λ, µ, Λ, γ)
subject to  Yᵢⱼ ≥ 0, s ≥ 0, r ≥ 0,
            0 ≤ f ≤ ke, 0 ≤ g ≤ 1.     (6)
We use a limited-memory BFGS algorithm with bound constraints [9] to minimize the subproblem with respect to the variables Y, f, g, s, and r. This requires computing the gradient of L_A with respect to the variables; we determine and validate an analytic form for the gradient in Appendix B. In Section 6.1, we provide evidence that our optimization procedure is correctly implemented. Those experiments also show that we achieve the same objective function values.
We use an augmented Lagrangian method to optimize this problem.
• Use L-BFGS-B to optimize each step.
• Update the multiplier estimates in the standard way.
• Pick parameters in a modestly standard way.
• Some variability between problems to show best results; only a little variation in time/performance.
• Faster than the NEOS solvers.
Low rank structure in NEO-K-Means solution Explore low rank structure in NEO-K-Means SDP
Comparison with Solvers on NEOS Server
NEOS Server¹: state-of-the-art solvers for numerical optimization.
Our ALM solver is much faster than theirs (e.g., SNOPT, which is suitable for large nonlinearly constrained problems with a modest number of degrees of freedom).

         Our ALM solver (obj/time)   SNOPT solver (obj/time)
MUSIC    79514.130 / 92s             79515.156 / 306s
SCENE    18534.030 / 3798s           18534.021 / 8910s
YEAST    8902.253 / 4331s            Not solved

¹ http://www.neos-server.org/neos/
Yangyang Hou (Purdue CS) Low Rank Methods for Optimizing Clustering Nov 2, 2015 25 / 61
We win with our LRSDP solver vs. the CVX default solver.
• Dolphins (n=62) and Les Mis (n=77) are graph probs • LRSDP is much faster and just as accurate.
Algorithmic Validation: comparison of SDP and LRSDP. LRSDP is roughly an order of magnitude faster than CVX, and LRSDP generates solutions as good as the global optimum from CVX; the objective values differ only within the solution tolerances. dolphins¹: 62 nodes, 159 edges; les miserables²: 77 nodes, 254 edges.

                        Objective value            Run time
                        SDP         LRSDP          SDP           LRSDP
dolphins
 k=2, α=0.2, β=0        -1.968893   -1.968329      107.03 secs   2.55 secs
 k=2, α=0.2, β=0.05     -1.969080   -1.968128      56.99 secs    2.96 secs
 k=3, α=0.3, β=0        -2.913601   -2.915384      160.57 secs   5.39 secs
 k=3, α=0.3, β=0.05     -2.921634   -2.922252      71.83 secs    8.39 secs
les miserables
 k=2, α=0.2, β=0        -1.937268   -1.935365      453.96 secs   7.10 secs
 k=2, α=0.3, β=0        -1.949212   -1.945632      447.20 secs   10.24 secs
 k=3, α=0.2, β=0.05     -2.845720   -2.845070      261.64 secs   13.53 secs
 k=3, α=0.3, β=0.05     -2.859959   -2.859565      267.07 secs   19.31 secs

¹ D. Lusseau et al., Behavioral Ecology and Sociobiology, 2003.
² D. E. Knuth. The Stanford GraphBase: A Platform for Combinatorial Computing. Addison-Wesley, 1993.
Dolphins from Lusseau et al. 2003; Les Mis from Knuth GraphBase
Rounding and Improvement are both important.
Input ! Relaxed solution ! Rounded solution ! Improved solution
Rounding
f gives the number of clusters; g gives the set of assignments.
Option 1: use g and f to determine the number of assignments, and assign greedily.
Option 2: just greedily assign based on W⁻¹Y.

Improvement
Run NEO-K-Means on the output.

Initialization
Run NEO-K-Means on the input.
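A minimal sketch of the "Option 2" greedy rounding above; the function name, tie-breaking, and the decision to leave the β coverage constraint to a later repair pass are our own simplifications.

```python
import numpy as np

def round_option2(Y, w, alpha):
    """Greedily round a relaxed solution: score point/cluster pairs by
    W^{-1} Y and make (1 + alpha) n assignments, largest scores first.
    Returns a boolean assignment matrix U (n x k)."""
    n, k = Y.shape
    S = Y / w[:, None]                     # W^{-1} Y scores
    U = np.zeros((n, k), dtype=bool)
    budget = n + int(alpha * n)            # total number of assignments
    for idx in np.argsort(-S, axis=None):  # descending scores
        if budget == 0:
            break
        i, j = divmod(int(idx), k)
        U[i, j] = True
        budget -= 1
    return U
```

The rounded U then seeds the improvement step: running the iterative NEO-K-Means algorithm from this initialization.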
The new method is more robust, even in simple tests. Consider clustering a cycle graph
We use disconnected nodes to measure the cluster quality.
disconnected nodes
[Plot: number of disconnected nodes (0–100) versus noise (0–4) for random+onelevel neo, multilevel neo, and lrsdp.] As we increase the noise, only the LRSDP method can reliably find the true clustering.
We get improved vector and graph clustering results too.
Experimental Results on Data Clustering
Comparison of NEO-K-Means objective function values on real-world datasets from Mulan¹. By using the LRSDP solution as the initialization of the iterative algorithm, we can achieve better (smaller) objective function values.
                worst    best     avg.
yeast
  kmeans+neo    9611     9495     9549
  lrsdp+neo     9440     9280     9364
  slrsdp+neo    9471     9231     9367
music
  kmeans+neo    87779    70158    77015
  lrsdp+neo     82323    70157    75923
  slrsdp+neo    82336    70159    75926
scene
  kmeans+neo    18905    18745    18806
  lrsdp+neo     18904    18759    18811
  slrsdp+neo    18895    18760    18810
¹ http://mulan.sourceforge.net/datasets.html
F1 scores on real-world vector datasets (the larger, the better). NEO-K-Means-based methods outperform the other methods, and the low-rank SDP method improves the clustering results.

               moc     esp     isp     okm     kmeans+neo  lrsdp+neo  slrsdp+neo
yeast  worst   -       0.274   0.232   0.311   0.356       0.390      0.369
       best    -       0.289   0.256   0.323   0.366       0.391      0.391
       avg.    -       0.284   0.248   0.317   0.360       0.391      0.382
music  worst   0.530   0.514   0.506   0.524   0.526       0.537      0.541
       best    0.544   0.539   0.539   0.531   0.551       0.552      0.552
       avg.    0.538   0.526   0.517   0.527   0.543       0.545      0.547
scene  worst   0.466   0.569   0.586   0.571   0.597       0.610      0.605
       best    0.470   0.582   0.609   0.576   0.627       0.614      0.625
       avg.    0.467   0.575   0.598   0.573   0.610       0.613      0.613
We have improved results – impressively so on the yeast dataset – and only slightly worse on the scene data.
We get improved vector and graph clustering results too.
           Facebook1   Facebook2   HepPh   AstroPh
bigclam    0.830       0.640       0.625   0.645
demon      0.495       0.318       0.503   0.570
oslom      0.319       0.445       0.465   0.580
nise       0.297       0.293       0.102   0.153
m-neo      0.285       0.269       0.206   0.190
LRSDP      0.222       0.148       0.091   0.137

            No. of vertices   No. of edges
Facebook1   348               2,866
Facebook2   756               30,780
HepPh       11,204            117,619
AstroPh     17,903            196,972
For these graphs, we dramatically improve the conductance-vs-coverage plots.
Lloyd's iterative method takes O(1 second); the LRSDP method takes O(1 hour). Now we want to improve the LRSDP time.
We can improve the optimization beyond ALM.
1. Proximal augmented Lagrangian (PALM): add a regularization term to the augmented Lagrangian and solve with L-BFGS-B.
2. ADMM method (5 blocks).
PALM update:
x^{k+1} = argmin over x of  L_A(x; λ^{k}, …) + (1/2τ)‖x − x^{k}‖²

ADMM block updates:
Y^{k+1} = argmin_Y L_A(Y, f^k, g^k, s^k, r^k; λ^k, µ^k, Λ^k, γ)          (non-convex)
f^{k+1} = argmin_f L_A(Y^{k+1}, f, g^k, s^k, r^k; λ^k, µ^k, Λ^k, γ)      (convex)
g^{k+1} = argmin_g L_A(Y^{k+1}, f^{k+1}, g, s^k, r^k; λ^k, µ^k, Λ^k, γ)  (convex)
s^{k+1} = argmin_s L_A(Y^{k+1}, f^{k+1}, g^{k+1}, s, r^k; λ^k, µ^k, Λ^k, γ)      (convex)
r^{k+1} = argmin_r L_A(Y^{k+1}, f^{k+1}, g^{k+1}, s^{k+1}, r; λ^k, µ^k, Λ^k, γ)  (convex)
We had to derive a new convergence result for the proximal method. Are there results for bound-constrained subproblems? Ours is a small adaptation of a general result due to Pennanen (2002).
Convergence analysis of PALM¹

Theorem 1. Let (x̄, λ̄) be a KKT pair satisfying the strong second-order sufficient condition, and assume the gradients ∇c(x̄) are linearly independent. If the {γ_k} are large enough with γ_k → γ̄ ≤ ∞, and if ‖(x⁰, λ⁰) − (x̄, λ̄)‖ is small enough, then there exists a sequence {(x^k, λ^k)} conforming to Algorithm 1, along with open neighborhoods C_k, such that for each k, x^{k+1} is the unique solution in C_k to (P_k). Moreover, the sequence {(x^k, λ^k)} converges linearly and Fejér monotonically to (x̄, λ̄) with rate r(γ) < 1 that is decreasing in γ, and r(γ) → 0 as γ → ∞.

¹ We specialize a general convergence result due to Pennanen (Local convergence of the proximal point algorithm and multiplier methods without monotonicity. Math. Oper. Res., 27(1):170–191, 2002) to our algorithm. A sketch of the proof can be found in the Appendix.
On the yeast dataset, we see no difference in objective, but faster solves
[Bar charts: runtimes on YEAST (seconds, roughly 0–4500) for the iterative method, ALM, PALM, and ADMM, and f(x) values on YEAST (roughly 8700–9200) for ALM, PALM, and ADMM.]
On yeast, we see much better discrete objectives and F1 scores.
[Bar charts: NEO-K-Means objectives on YEAST (roughly 9000–9700) and F1 scores on YEAST (roughly 0.34–0.39) for the iterative method, ALM, PALM, and ADMM.]
Recap For overlapping clustering of data and overlapping community detection of graphs, we have a new objective • Fast Lloyd-like iterative algorithm • SDP relaxation • Low-rank SDP relaxation • Proximal and ADMM acceleration techniques
1. NEO-K-Means – Whang, Gleich, Dhillon, SDM 2015
2. NEO-K-Means SDP + Aug. Lagrangian – Hou, Whang, Gleich, Dhillon, KDD 2015
3. Multiplier Methods for Overlapping K-Means – Hou, Whang, Gleich, Dhillon, submitted
plot(x)
[Plots: the solution vector x and its nonzeros for a crawl of flickr from 2006 (~800k nodes, 6M edges, β = 1/2).]

(I − βP)x = (1 − β)s,   nnz(x) ≈ 800k,   ‖D⁻¹(x − x*)‖∞ ≤ ε

Localized solutions of diffusion equations in large graphs. Joint with Kyle Kloster. WAW2013, KDD2014, WAW2015; J. Internet Math.
Feature, XRDS • Spring 2013 • Vol. 19 • No. 3
damentally limited); the second type models a diffusion of a virtual good (think of sending links or a virus spreading on a network), which can be copied infinitely often. Google's celebrated PageRank model uses a conservative diffusion to determine the importance of pages on the Web [4], whereas those studying when a virus will continue to propagate found that the eigenvalues of the non-conservative diffusion determine the answer [5]. Thus, just as in scientific computing, marrying the method to the model is key for the best scientific computing on social networks.
Ultimately, none of these steps differ from the practice of physical scientific computing. The challenges in creating models, devising algorithms, validating results, and comparing models just take on different forms when the problems come from social data instead of physical models. Thus, let us return to our starting question: what does the matrix have to do with the social network? Just as in scientific computing, many interesting problems, models, and methods for social networks boil down to matrix computations. Yet, as in the expander example above, the types of matrix questions change dramatically in order to fit social network models. Let's see what's been done that's enticingly and refreshingly different from the types of matrix computations encountered in physical scientific computing.
EXPANDER GRAPHS AND PARALLEL COMPUTING
Recently, a coalition of folks from academia, national labs, and industry set out to tackle the problems in parallel computing and expander graphs. They established the Graph 500 benchmark (http://www.graph500.org) to measure the performance of a parallel computer on a standard graph computation with an expander graph. Over the past three years, they've seen performance grow by more than 1,000 times through a combination of novel software algorithms and higher-performance parallel computers. But there is still work left in adapting the software innovations for parallel computing back to matrix computations for social networks.
Figure 1. In a standard scientific computing problem, we find the steady-state heat distribution of a plate with a heat source in the middle. This scientific problem is solved via a linear system. In a social diffusion problem, we are trying to find people who like the movie (labeled in dark orange) instead of people who don't like the movie (labeled in dark purple). By solving a different linear system, we can determine who is likely to enjoy the movie (light orange). [Figure panels: Problem (diffusion in a plate; movie interest in diffusion) and Equations.]
Figure 2. The network, or mesh, from a typical problem in scientific computing resides in a low-dimensional space, think of two or three dimensions. These physical spaces put limits on the size of the boundary, or "surface area," of the space given its volume. No such limits exist in social networks, and these two sets are usually about the same size. A network with this property is called an expander network. [Figure labels: "Networks" from PDEs are usually physical: size of set ≈ size of boundary. Social networks are expanders: size of set » size of boundary.]
Higher order organization of complex networks Joint with Austin Benson and Jure Leskovec
[Figure: a small example graph with numbered nodes and a cluster of labeled neurons from the connectome (e.g., RIAL, RIAR, RMDDR, SMDDL, URBR).]
By using a new generalization of spectral clustering methods, we are able to find completely novel and relevant structures in complex systems such as the connectome and transport networks.
SIAM Annual Meeting (AN16), July 11–15, 2016, The Westin Waterfront, Boston, Massachusetts. David Gleich, Purdue; Mary Silber, Northwestern.
• Big Data, Data Science, and Privacy
• Education, Communication, and Policy
• Reproducibility and Ethics
• Efficiency and Optimization
• Integrating Models and Data (incl. computational social science, PDEs)
• Dynamic Networks (learning, evolution, adaptation, and cooperation)
• Applied Math, Statistics, and Machine Learning
• Earth systems; environmental/ecological applications
• Epidemiology
Future work
• Even faster solvers
• Understand why the solution seems to be rank 2
• Better initialization for Lloyd's method