non-exhaustive, overlapping k-means
TRANSCRIPT
Non-exhaustive, overlapping K-means clustering
David F. Gleich, Purdue University
Real-world graph and point data have overlapping clusters.
Other uses for PageRank: what else people use PageRank to do
GeneRank
Use (I − αGD⁻¹)x = w to find "nearby" important genes.
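A made-up toy sketch of this GeneRank-style solve; the graph G, prior scores w, and damping value α below are illustrative assumptions, not data from the talk:

```python
import numpy as np

# Toy sketch: G is a gene-gene adjacency matrix, D the diagonal matrix
# of its column sums, w a vector of prior gene scores (e.g., differential
# expression), and alpha < 1 a damping parameter. All values made up.
G = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
d = G.sum(axis=0)                    # degrees (column sums)
w = np.array([1.0, 0.0, 0.0, 0.5])  # prior importance of each gene
alpha = 0.85

# Solve (I - alpha G D^{-1}) x = w; dividing G by d scales column j by 1/d_j.
M = np.eye(4) - alpha * (G / d)
x = np.linalg.solve(M, w)
print(x)  # genes near high-prior genes receive elevated scores
```

Because GD⁻¹ is column-stochastic and α < 1, the system is nonsingular and the scores stay nonnegative.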
ProteinRank
IsoRank
Clustering (graph partitioning)
Sports ranking
Teaching
Morrison et al., GeneRank, 2005. (Slide from Gleich's Ph.D. defense, PageRank intro.)
Social networks have overlapping clusters because of social circles
Genes have overlapping clusters due to their role in multiple functions
SILO Seminar David Gleich · Purdue
Overlapping research projects are what got me here too!
PhD Thesis on Google's PageRank
MSR Intern and Overlapping Clusters for Distributed Computation
Accelerated NCP plots and locally minimal communities
Neighborhood inflated seed expansion for overlapping communities
Non-exhaustive overlapping K-means
1. NISE Clustering — Whang, Gleich, Dhillon, CIKM 2013
2. NEO-K-Means — Whang, Gleich, Dhillon, SDM 2015
3. NEO-K-Means SDP — Hou, Whang, Gleich, Dhillon, KDD 2015
4. Multiplier Methods for Overlapping K-Means — Hou, Whang, Gleich, Dhillon, submitted
Proposed Algorithm
Seed Set Expansion: carefully select seeds, then greedily expand communities around the seed sets.
The algorithm: Filtering Phase, Seeding Phase, Seed Set Expansion Phase, Propagation Phase.
Joyce Jiyoung Whang, The University of Texas at Austin. Conference on Information and Knowledge Management.
Overlapping communities via seed set expansion works nicely.
Table 4: Returned number of clusters and graph coverage of each algorithm
Graph         metric             random   egonet   graclus ctr.   spread hubs   demon     bigclam
HepPh         coverage (%)         97.1     72.1      100            100          88.8      62.1
HepPh         no. of clusters        97      241      109            100        5,138       100
AstroPh       coverage (%)         97.6     71.1      100            100          94.2      62.3
AstroPh       no. of clusters       192      282      256            212        8,282       200
CondMat       coverage (%)         92.4     99.5      100            100          91.2      79.5
CondMat       no. of clusters       199      687      257            202       10,547       200
DBLP          coverage (%)         99.9     86.3      100            100          84.9      94.6
DBLP          no. of clusters    21,272    8,643   18,477         26,503      174,627    25,000
Amazon        coverage (%)         99.9      100      100            100          79.2      99.2
Amazon        no. of clusters    21,553   14,919   20,036         27,763      105,828    25,000
Flickr        coverage (%)         76.0     54.0      100           93.6          -         52.1
Flickr        no. of clusters    14,638   24,150   16,347         15,349          -       15,000
LiveJournal   coverage (%)         88.9     66.7     99.8           99.8          -         43.9
LiveJournal   no. of clusters    14,850   34,389   16,271         15,058          -       15,000
Myspace       coverage (%)         91.4     69.1      100           99.9          -          -
Myspace       no. of clusters    14,909   67,126   16,366         15,324          -          -
[Figure 2 panels (a) AstroPh, (b) HepPh, (c) CondMat, (d) Flickr, (e) LiveJournal, (f) Myspace: maximum conductance (y-axis, 0 to 1) vs. coverage percentage (x-axis, 0 to 100) for the seeding strategies egonet, graclus centers, spread hubs, random, demon, and bigclam.]
Figure 2: Conductance vs. graph coverage – lower curve indicates better communities. Overall, "graclus centers" outperforms other seeding strategies, including the state-of-the-art methods Demon and Bigclam.
We can cover 95% of the network with communities of conductance ≈ 0.15.
Flickr social network: 2M vertices, 22M edges.
cond(S) = cut(S) / "size"(S)
We wanted a more principled approach to achieve these results.
The state of the art for clustering

               Problem 1   Problem 2   Problem 3   Problem 4
K-Means          😀          😊          😟          😢
NEO-K-Means                              😊          😊
[Diagram: a point x_i with distances ‖x_i − m_1‖ and ‖x_i − m_2‖ to two centroids m_1 and m_2.]

K-means as optimization.
minimize    Σ_ij U_ij ‖x_i − m_j‖²
subject to  U is an assignment to clusters,
            m_j = (Σ_i U_ij x_i) / (Σ_i U_ij)

minimize    Σ_ij U_ij ‖x_i − m_j‖²
subject to  U is a multi-assignment to clusters,
            m_j = (Σ_i U_ij x_i) / (Σ_i U_ij)
Input: points x_1, ..., x_n. Find an assignment matrix U that gives cluster assignments to minimize the objective. For example, with points x_1, ..., x_4 and clusters c_1, c_2:

      c1  c2
U = [  1   0
       1   0
       0   1
       0   1 ]
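The objective and the centroid formula can be evaluated directly from an assignment matrix like the one above; a small sketch in which the point coordinates are made up for illustration:

```python
import numpy as np

# Evaluate the k-means objective sum_ij U_ij ||x_i - m_j||^2 for the
# four-point assignment matrix shown above (coordinates made up).
X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [5.0, 5.0],
              [6.0, 5.0]])
U = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [0, 1]], dtype=float)

# Centroids m_j = (sum_i U_ij x_i) / (sum_i U_ij)
M = (U.T @ X) / U.sum(axis=0)[:, None]

# Squared distance of every point to every centroid, masked by U.
sq = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
objective = (U * sq).sum()
print(objective)  # 1.0: each point sits 0.5 from its centroid (4 x 0.25)
```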
K-means objective!
K-means’ objective with overlap?!
Overlap is not a natural addition to optimization based clustering.
The NEO-K-means objective balances overlap and outliers.
minimize    Σ_ij U_ij ‖x_i − m_j‖²
subject to  U_ij is binary,
            trace(UᵀU) = (1 + α)n   (αn overlap)
            eᵀ 𝟙[Ue ≥ 1] ≥ (1 − β)n   (up to βn outliers)
            m_j = (Σ_i U_ij x_i) / (Σ_i U_ij)

· If α, β = 0, then we get back to K-means.
· Automatically choose α, β based on K-means.

😊 1. Make (1 + α)n total assignments.
   2. Allow up to βn outliers.
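The two constraints are easy to check mechanically; a sketch on a made-up assignment of n = 5 points to k = 2 clusters, with two overlap points and one outlier:

```python
import numpy as np

# Checking the two NEO-K-Means constraints on a made-up assignment:
# two points sit in both clusters (overlap), one point is unassigned.
U = np.array([[1, 0],
              [1, 0],
              [1, 1],
              [1, 1],
              [0, 0]])
n = U.shape[0]

total_assignments = np.trace(U.T @ U)        # equals (1 + alpha) * n
outliers = int((U.sum(axis=1) == 0).sum())   # points in no cluster
alpha = (total_assignments - n) / n
beta = outliers / n
print(total_assignments, outliers, alpha, beta)  # 6 1 0.2 0.2
```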
[Figure 1: three scatter plots on axes from −8 to 8. Panels: (a) Ground-truth clusters, (b) First extension of k-means, (c) NEO-K-Means. Legend: Cluster 1, Cluster 2, Cluster 1 & 2, Not assigned.]
Figure 1: (a) Two ground-truth clusters are generated (n=1,000, α=0.1, β=0.005). Green points indicate overlap between the clusters, and black points indicate outliers. See Section 5 for details. (b) Our first extension of the k-means objective function defined in (2.2) makes too many outlier assignments and fails to recover the ground truth. (c) The NEO-K-Means objective defined in (2.3) adds an explicit term for non-exhaustiveness that enables it to correctly detect the outliers and find a natural overlapping clustering structure very similar to the ground-truth clusters (α and β are automatically estimated by the heuristics discussed in Section 2.5).
Let e denote the vector having all elements equal to one. Then the vector U1 gives the number of clusters to which each data point belongs; thus, (U1)_i = 0 means that x_i does not belong to any cluster. Now, by adding a non-exhaustiveness constraint to (2.2), we define our NEO-K-Means objective function as follows:

(2.3)   min_U  Σ_{j=1}^{k} Σ_{i=1}^{n} u_ij ‖x_i − m_j‖²,  where  m_j = (Σ_{i=1}^{n} u_ij x_i) / (Σ_{i=1}^{n} u_ij)
        s.t.   trace(UᵀU) = (1 + α)n,   Σ_{i=1}^{n} 𝟙{(U1)_i = 0} ≤ βn.

We allow at most βn data points to be assigned to no cluster, i.e., at most βn data points can be considered outliers. We require 0 ≤ βn and note that βn ≪ n so that most data points are assigned to clusters; by the definition of "outliers", βn should be very small compared to n. The parameters α and β offer an intuitive way to capture the degree of overlap and non-exhaustiveness: by "turning the knob" on these parameters, the user can explore the landscape of overlapping, non-exhaustive clusterings. If α=0 and β=0, the NEO-K-Means objective function is equivalent to the standard k-means objective presented in (2.1). To see this, note that setting β=0 requires every data point to belong to at least one cluster, while setting α=0 makes n assignments. Putting these together, the resulting clustering is disjoint and exhaustive. Note that objective (2.2) with α=0 does not have this property.

To see whether the objective function (2.3) yields a reasonable clustering, we test it on the same dataset used in the previous subsection. Figure 1(c) shows the result (α and β are automatically estimated by the heuristics discussed in Section 2.5). We see that NEO-K-Means correctly finds all the outliers and produces overlapping structure very similar to the ground-truth clusters.

2.4 The NEO-K-Means Algorithm. We now propose a simple iterative algorithm which monotonically decreases the NEO-K-Means objective until it converges to a local minimum. Given the hard constraints in (2.3), we make n + αn assignments such that at most βn data points have no membership in any cluster. The second constraint can be interpreted as follows: among the n data points, at least n − βn should have membership in some cluster. When the algorithm assigns points to clusters, it uses two phases to satisfy these two constraints; thus each cluster C_j decomposes into two sets Ĉ_j and C̄_j that record the assignments made in each phase.

Algorithm 1 describes the NEO-K-Means algorithm. We first initialize the cluster centroids; any initialization strategy used in k-means may also be applied to our algorithm. Given the centroids, we compute all the distances [d_ij]_{n×k} between every data point and every cluster, and for each data point record its closest cluster and that distance. The data points are then sorted in ascending order by the distance to their closest cluster. To ensure that at least n − βn data points are assigned to some cluster (i.e., to satisfy the second constraint), we assign the first n − βn data points to their closest clusters. Let Ĉ_j denote the assignments made in this step; thus Σ_{j=1}^{k} |Ĉ_j| = n − βn. Then, we make βn + αn more assignments by taking the βn + αn minimum distances among [d_ij]_{n×k} such that x_i ∉ Ĉ_j. Let C̄_j denote the assignments made in this step; thus Σ_{j=1}^{k} |C̄_j| = βn + αn. Finally, Σ_{j=1}^{k} (|Ĉ_j| + |C̄_j|) = n + αn. Once all the assignments are made, we update the cluster centroids by recomputing the mean of each cluster. We repeat this procedure until the change in the objective function is sufficiently small or the maximum number of iterations is reached. Note that the algorithm does not forcibly choose βn points as outliers.
Lloyd's algorithm for NEO-K-Means is just a wee bit more complex.

Until done:
1. Update centroids.
2. Assign (1 − β)n points to their closest centroid.
3. Make (α + β)n more assignments by minimizing distance.
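These steps can be sketched in a few lines of NumPy. This is an illustrative reimplementation under stated assumptions, not the authors' code: the function name `neo_kmeans`, the random-point initialization, and the fixed iteration count are all made up for this sketch.

```python
import numpy as np

def neo_kmeans(X, k, alpha, beta, iters=100, seed=0):
    """Sketch of the NEO-K-Means Lloyd iteration described above.
    Makes (1 + alpha) * n assignments total while leaving at most
    beta * n points unassigned (outliers)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    M = X[rng.choice(n, size=k, replace=False)]   # initial centroids
    U = np.zeros((n, k), dtype=bool)
    n_assign = round((1 + alpha) * n)             # total assignments
    n_first = n - round(beta * n)                 # phase-1 assignments
    for _ in range(iters):
        D = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
        U[:] = False
        # Phase 1: assign the (1 - beta)n points closest to some centroid.
        closest = D.argmin(axis=1)
        order = np.argsort(D[np.arange(n), closest])
        U[order[:n_first], closest[order[:n_first]]] = True
        # Phase 2: make the remaining (alpha + beta)n assignments by
        # taking the smallest still-unused point-centroid distances.
        D_free = np.where(U, np.inf, D)
        flat = np.argsort(D_free, axis=None)[: n_assign - n_first]
        U[np.unravel_index(flat, D.shape)] = True
        # Update each nonempty (possibly overlapping) cluster's centroid.
        counts = U.sum(axis=0)
        keep = counts > 0
        M[keep] = (U[:, keep].T.astype(float) @ X) / counts[keep][:, None]
    return U, M
```

By construction every iteration makes exactly (1 + α)n assignments and leaves at most βn points unassigned.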
This algorithm correctly assigns our example case and even determines the overlap and outlier parameters!

THEOREM. Lloyd's algorithm decreases the objective monotonically.

The non-exhaustiveness constraint is necessary for correct assignments.
Output without the assignment constraint (β = 1).
NEO-K-Means output (correct).
The Weighted Kernel NEO-K-Means objective.
• Introduce weights for each data point.
• Introduce feature maps for each data point too.
SILO Seminar David Gleich · Purdue
minimize over U:  Σᵢⱼ Uᵢⱼ wᵢ ‖φ(xᵢ) − mⱼ‖²
subject to  Uᵢⱼ is binary,
            trace(UᵀU) = (1 + α)n      (αn overlap)
            eᵀ Ind[Ue] ≥ (1 − β)n      (up to βn outliers)
where  mⱼ = (Σᵢ Uᵢⱼ wᵢ φ(xᵢ)) / (Σᵢ Uᵢⱼ wᵢ).

In terms of the kernel, the objective expands as

Σᵢⱼ Uᵢⱼ wᵢ ‖φ(xᵢ) − mⱼ‖² = Σᵢⱼ Uᵢⱼ wᵢ Kᵢᵢ − Σⱼ (uⱼᵀ W K W uⱼ) / (uⱼᵀ W uⱼ).
Theorem. If K = D⁻¹ + D⁻¹AD⁻¹, then the NEO-K-Means objective is equivalent to overlapping conductance.
NOTE
This means that NEO-K-Means was the principled objective we were after!
Conductance communities. Conductance is one of the most important community scores [Schaeffer07]. The conductance of a set of vertices is the ratio of edges leaving the set to the total edges in the set; equivalently, it is the probability that a random edge leaves the set. Small conductance ⇔ good community.
φ(S) = cut(S) / min(vol(S), vol(S̄))  =  (edges leaving the set) / (total edges in the set)

Example: cut(S) = 7, vol(S) = 33, vol(S̄) = 11, so φ(S) = 7/11.
Our theorem means that NEO-K-Means can optimize the sum-of-conductances objective.
φ(S) (conductance)  ≤  cut(S)/vol(S) + cut(S̄)/vol(S̄)  (normalized cut of a bi-partition)

Σ_{S∈C} cut(S)/vol(S)  =  Σ_{S∈C} φ(S)  if vol(S) ≤ vol(S̄)  (the NEO-K-Means objective)
When we use this method to partition the Karate club network, we get reasonable solutions.
• Inspired by Dhillon et al.’s work on Graclus
• We have a multilevel method to optimize the graph case.
We get state-of-the-art clustering performance on vector and graph datasets.
F1 scores on vector datasets from the Mulan repository:

         moc     fuzzy   esp     isp     okm     rokm    NEO
synth1   0.833   0.959   0.977   0.985   0.989   0.969   0.996
synth2   0.836   0.957   0.952   0.973   0.967   0.975   0.996
synth3   0.547   0.919   0.968   0.952   0.970   0.928   0.996
yeast    -       0.308   0.289   0.203   0.311   0.203   0.366
music    0.534   0.533   0.527   0.508   0.527   0.454   0.550
scene    0.467   0.431   0.572   0.586   0.571   0.593   0.626

Dataset statistics:

         n       dim.   avg |C|   outliers   k
synth1   5,000   2      2,750     0          2
synth2   1,000   2      550       5          2
synth3   6,000   2      3,600     6          2
yeast    2,417   103    731.5     0          14
music    593     72     184.7     0          6
scene    2,407   294    430.8     0          6

The Mulan testset has a number of appropriate datasets.
NEO-K-Means with Lloyd's method is fast and usually accurate, but inconsistent.
[Three scatter plots of the same dataset, with points labeled Cluster 1, Cluster 2, Cluster 1 & 2, Cluster 3, and Not assigned.]
A more complicated overlapping test case
The output from NEO-K-Means with Lloyd’s method
Can we get a more robust method? Yes!
Towards better optimization of the objective
1. An SDP relaxation of the objective. 2. A practical low-rank SDP heuristic. 3. Faster optimization methods for the heuristic.
From assignments to co-occurrence matrices
There are three key variables in our formulation
1. The co-occurrence matrix
   Z = Σⱼ W uⱼ uⱼᵀ W / (uⱼᵀ W uⱼ)
2. The overlap vector f
3. The assignment indicator g

For example,
U = [1 0; 1 1; 0 1; 0 0],   f = [1; 2; 1; 0],   g = [1; 1; 1; 0].
We can convert our objective into a trace minimization problem.
Kᵢⱼ = φ(xᵢ)ᵀ φ(xⱼ),   dᵢ = wᵢ Kᵢᵢ

Σᵢⱼ Uᵢⱼ wᵢ ‖φ(xᵢ) − mⱼ‖²
  = Σᵢⱼ Uᵢⱼ wᵢ Kᵢᵢ − Σⱼ (uⱼᵀ W K W uⱼ) / (uⱼᵀ W uⱼ)
  = fᵀd − trace(KZ)

Z = normalized co-occurrence
f = overlap count
g = assignment indicator
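The identity above can be verified numerically. The sketch below checks it for a linear kernel (φ(x) = x, so K = XXᵀ) on random data with a random binary assignment matrix; the variable names and the random instance are our own.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 3
X = rng.normal(size=(n, 2))            # linear kernel: phi(x) = x, K = X X^T
w = rng.uniform(1.0, 2.0, size=n)      # positive data-point weights
U = rng.integers(0, 2, size=(n, k)).astype(float)
U[:k, :] += np.eye(k)                  # ensure every cluster is nonempty
U = np.minimum(U, 1.0)                 # keep U binary

K = X @ X.T
W = np.diag(w)

# Left side: the weighted kernel k-means objective via the centroids m_j.
lhs = 0.0
for j in range(k):
    wj = w * U[:, j]                   # w_i U_ij
    m_j = (wj @ X) / wj.sum()          # weighted cluster mean
    lhs += (wj * ((X - m_j) ** 2).sum(axis=1)).sum()

# Right side: f^T d - trace(K Z), with Z the normalized co-occurrence.
f = U.sum(axis=1)                      # overlap counts, f = U e
d = w * np.diag(K)                     # d_i = w_i K_ii
Z = sum(np.outer(W @ U[:, j], W @ U[:, j]) / (U[:, j] @ W @ U[:, j])
        for j in range(k))
rhs = f @ d - np.trace(K @ Z)
```

The two quantities agree to machine precision, which is exactly the trace-form rewriting that the SDP below exploits.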
The objective function
There is an SDP-like framework to solve NEO-K-Means.
maximize over Z, f, g:  trace(KZ) − fᵀd
subject to  trace(W⁻¹Z) = k,          (a)
            Zᵢⱼ ≥ 0,                  (b)
            Z ⪰ 0, Z = Zᵀ,            (c)
            Ze = Wf,                  (d)
            eᵀf = (1 + α)n,           (e)
            eᵀg ≥ (1 − β)n,           (f)
            f ≥ g,                    (g)
            rank(Z) = k,              (h)
            f ∈ ℤⁿ≥0, g ∈ {0, 1}ⁿ.    (i)

Constraints (a)–(c) and (h): Z must come from an assignment matrix. Constraints (d)–(g): overlap and assignment constraints. Constraints (h) and (i): combinatorial constraints.
There is an SDP relaxation to approximate NEO-K-Means.
maximize over Z, f, g:  trace(KZ) − fᵀd
subject to  trace(W⁻¹Z) = k,          (a)
            Zᵢⱼ ≥ 0,                  (b)
            Z ⪰ 0, Z = Zᵀ,            (c)
            Ze = Wf,                  (d)
            eᵀf = (1 + α)n,           (e)
            eᵀg ≥ (1 − β)n,           (f)
            f ≥ g,                    (g)
            0 ≤ g ≤ 1.

Constraints (a)–(c): Z must come from an assignment matrix. Constraints (d)–(g): overlap and assignment constraints. The last constraint is the relaxed version of the combinatorial constraints.
This SDP can easily solve simple problems.
NEO-K-Means SDP
Solution Z from CVX is even rank 2!
But SDP methods have a number of issues for large-scale problems:
1. The number of variables is quadratic in the number of data points.
2. The best solvers can only handle problems with a few hundred or thousand points.
So, like many before us (e.g., Burer & Monteiro; Kulis, Surendran & Platt 2007; and more), we optimize a low-rank factorization of the solution.
Using the NEO-K-Means Low-Rank SDP, we can find assignments directly.
NEO-K-Means Low-rank SDP
Y Yᵀ, with ‖Z − Y Yᵀ‖ = 2.3 × 10⁻⁴
maximize over Y, f, g, s, r:  trace(YᵀKY) − fᵀd
subject to  k = trace(YᵀW⁻¹Y)
            0 = Y Yᵀe − Wf
            0 = eᵀf − (1 + α)n
            0 = f − g − s
            0 = eᵀg − (1 − β)n − r
            Yᵢⱼ ≥ 0, s ≥ 0, r ≥ 0,
            0 ≤ f ≤ ke, 0 ≤ g ≤ 1
The Low-Rank NEO-K-Means SDP
We lose convexity but gain practicality. We introduce slacks at this point.
icky non-convex term
simple bound constraints
We use an augmented Lagrangian method to optimize this problem
[28] J. Peng and Y. Wei. Approximating k-means-type clustering via semidefinite programming. SIAM Journal on Optimization, 18(1):186–205, 2007.
[29] K. Trohidis, G. Tsoumakas, G. Kalliris, and I. P. Vlahavas. Multi-label classification of music into emotions. In International Conference on Music Information Retrieval, pages 325–330, 2008.
[30] J. J. Whang, I. S. Dhillon, and D. F. Gleich. Non-exhaustive, overlapping k-means. In Proceedings of the SIAM International Conference on Data Mining, pages 936–944, 2015.
[31] J. J. Whang, D. Gleich, and I. S. Dhillon. Overlapping community detection using seed set expansion. In ACM International Conference on Information and Knowledge Management, pages 2099–2108, 2013.
[32] L. F. Wu, T. R. Hughes, A. P. Davierwala, M. D. Robinson, R. Stoughton, and S. J. Altschuler. Large-scale prediction of Saccharomyces cerevisiae gene function using overlapping transcriptional clusters. Nature Genetics, 31(3):255–265, June 2002.
[33] E. P. Xing and M. I. Jordan. On semidefinite relaxations for normalized k-cut and connections to spectral clustering. Technical Report UCB/CSD-3-1265, University of California, Berkeley, 2003.
[34] J. Yang and J. Leskovec. Overlapping community detection at scale: a nonnegative matrix factorization approach. In ACM International Conference on Web Search and Data Mining, pages 587–596, 2013.
[35] S. X. Yu and J. Shi. Multiclass spectral clustering. In IEEE International Conference on Computer Vision, Volume 2, 2003.
APPENDIX
A. AUGMENTED LAGRANGIANS
The augmented Lagrangian framework is a general strategy for solving nonlinear optimization problems with equality constraints. We briefly review a standard textbook derivation for completeness [26]. Consider a general problem:

minimize over x:  f(x)
subject to  cᵢ(x) = 0, i = 1, …, m,
            l ≤ x ≤ u.     (7)

The augmented Lagrangian for this problem involves a set of Lagrange multipliers λᵢ that estimate the influence of each constraint on the objective, as well as a quadratic penalty to satisfy the nonlinear constraints. It is defined as

L_A(x; λ, γ) = f(x) − Σᵢ₌₁ᵐ λᵢ cᵢ(x) + (γ/2) Σᵢ₌₁ᵐ cᵢ(x)².

An augmented Lagrangian algorithm iteratively proceeds from an arbitrary starting point to a local solution of (7). At each step, a bound-constrained solver minimizes L_A over x subject to l ≤ x ≤ u. Based on an approximate solution, it adjusts the Lagrange multipliers λ and may update the penalty parameter γ. See Algorithm 17.4 in Nocedal and Wright [26] for a standard strategy to adjust the multipliers, penalty, and tolerances for each subproblem.
We use the L-BFGS-B procedure [9] to solve the subproblem. This requires subroutines to evaluate both the function and the gradient vector.
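The framework above can be illustrated on a toy instance of (7). This is a minimal sketch, not the NEO-LR solver: the toy objective, constraint, starting point, and parameter schedule are all our own choices; SciPy's L-BFGS-B plays the role of the bound-constrained subproblem solver.

```python
import numpy as np
from scipy.optimize import minimize

# Toy instance of (7): minimize f(x) = x0^2 + x1^2
# subject to c(x) = x0 + x1 - 1 = 0 and bounds 0 <= x <= 1.
# The exact solution is x = (0.5, 0.5) with multiplier lambda = 1.
def f(x):
    return float(x @ x)

def c(x):
    return x[0] + x[1] - 1.0

def augmented_lagrangian(x0, lam=0.0, gamma=10.0, outer_iters=15):
    """Minimal augmented Lagrangian loop: minimize L_A over the bounds
    with L-BFGS-B, then update the multiplier estimate and grow the
    penalty (a simplified version of the textbook schedule)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(outer_iters):
        LA = lambda x: f(x) - lam * c(x) + 0.5 * gamma * c(x) ** 2
        x = minimize(LA, x, method="L-BFGS-B",
                     bounds=[(0.0, 1.0), (0.0, 1.0)]).x
        lam = lam - gamma * c(x)        # first-order multiplier update
        gamma = min(2.0 * gamma, 1e8)   # optionally increase the penalty
    return x, lam
```

The multiplier update sign matches the definition of L_A above (linear term subtracted), so λ ← λ − γ c(x) at each outer step.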
B. GRADIENTS FOR NEO-LR
We now describe the analytic form of the gradients for the augmented Lagrangian of the NEO-LR objective, and briefly validate that they are correct. Consider the augmented Lagrangian (5). The gradient has five components for the five sets of variables Y, f, g, s, and r:

∇_Y L_A(Y, f, g, s, r; λ, µ, Λ, γ) = −2KY − eµᵀY − µeᵀY
    − 2(λ₁ − γ(trace(YᵀW⁻¹Y) − k))W⁻¹Y
    + γ(Y YᵀeeᵀY + eeᵀY YᵀY) − γ(WfeᵀY + efᵀWY)

∇_f L_A(Y, f, g, s, r; λ, µ, Λ, γ) =
    d + Wµ − γ(WY Yᵀe − W²f) − λ₂e + γ(eᵀf − (1 + α)n)e − Λ + γ(f − g − s)

∇_g L_A(Y, f, g, s, r; λ, µ, Λ, γ) =
    Λ − γ(f − g − s) − λ₃e + γ(eᵀg − (1 − β)n − r)e

∇_s L_A(Y, f, g, s, r; λ, µ, Λ, γ) = Λ − γ(f − g − s)

∇_r L_A(Y, f, g, s, r; λ, µ, Λ, γ) = λ₃ − γ(eᵀg − (1 − β)n − r)

Using analytic gradients in a black-box solver such as L-BFGS-B is problematic if the gradients are even slightly incorrectly computed. To guarantee that the analytic gradients we derive are correct, we use a forward finite-difference method to get a numerical approximation of the gradients based on the objective function. We compare these with our analytic gradient and expect to see small relative differences, on the order of 10⁻⁵ or 10⁻⁶. This is exactly what Figure 4 shows.
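The validation procedure described above can be sketched generically. This is our own minimal version of the check (function names and the smooth test function are illustrative, not the NEO-LR Lagrangian):

```python
import numpy as np

def check_gradient(fun, grad, x, eps=1e-6):
    """Compare an analytic gradient against forward finite differences.

    Returns per-component relative differences; values on the order of
    1e-5 or 1e-6 indicate a correctly coded gradient, as in Figure 4.
    """
    g = np.asarray(grad(x), dtype=float)
    g_fd = np.empty_like(g)
    f0 = fun(x)
    for i in range(x.size):
        xp = x.copy()
        xp[i] += eps
        g_fd[i] = (fun(xp) - f0) / eps   # forward difference
    return np.abs(g - g_fd) / np.maximum(np.abs(g), 1e-12)

# Example on a smooth test function:
fun = lambda x: float(np.sum(x ** 3))
grad = lambda x: 3.0 * x ** 2
rel = check_gradient(fun, grad, np.array([1.0, 2.0, 3.0]))
```

For forward differences the relative error scales like ε times the local curvature, so with ε = 10⁻⁶ a correct gradient shows differences near 10⁻⁵ to 10⁻⁷, while a sign or factor error shows up as O(1).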
Figure 4: Finite difference comparison of gradients where ε = 10⁻⁶, plotting the relative difference (rel. diff) against the gradient magnitude (abs(g)). The relative difference between the analytical gradient and the gradient computed via finite differences is small, indicating the gradient is correctly computed.
Finally, we denote by e the vector of all 1s. The following program is equivalent to the NEO-K-Means objective with a discrete assignment matrix:

maximize over Z, f, g:  trace(KZ) − fᵀd
subject to  trace(W⁻¹Z) = k,          (a)
            Zᵢⱼ ≥ 0,                  (b)
            Z ⪰ 0, Z = Zᵀ,            (c)
            Ze = Wf,                  (d)
            eᵀf = (1 + α)n,           (e)
            eᵀg ≥ (1 − β)n,           (f)
            f ≥ g,                    (g)
            rank(Z) = k,              (h)
            f ∈ ℤⁿ≥0, g ∈ {0, 1}ⁿ.    (i)     (2)
We omit the verification that this is actually equivalent to the NEO-K-Means objective (1), as it is not informative for our discussion. Constraints (a), (b), (c), and (h) encode the fact that Z must arise from an assignment matrix. Constraints (d), (e), (f), (g), and (i) are new to our NEO-K-Means formulation and express the amount of overlap and non-exhaustiveness in the solution. This is a mixed-integer, rank-constrained SDP; as such, it is combinatorially hard to optimize, just like the original NEO-K-Means objective.
The constraints that make this a combinatorial problem are (h) and (i). If we relax these constraints:
maximize over Z, f, g:  trace(KZ) − fᵀd
subject to  (a), (b), (c), (d), (e), (f), (g),
            0 ≤ g ≤ 1,     (3)
then we arrive at a convex problem. Thus, any locally optimal solution of (3) must be a global solution.
Solving (3) requires a black-box SDP solver such as CVX. As the solver converts this problem into a standard form, the number of variables becomes O(n²); the resulting complexity is worse than O(n³) in most cases and can be as bad as O(n⁶). These solvers are further limited by the delicate numerical precision issues that arise as they approach a solution. The combination of these features means that off-the-shelf procedures struggle to solve problems with more than 100 data points. We now describe a means of solving larger problems.
4. A LOW-RANK SDP FOR NEO-K-MEANS
In the SDP formulation of the NEO-K-Means objective (3), the matrix Z should only be rank k. By applying the low-rank factorization idea, Z becomes Y Yᵀ, where Y is n × k and non-negative. Thus, the following optimization program is a low-rank SDP for (3); we have chosen to write it in the standard form of a minimization problem, with explicit slack variables s, r that convert the inequality constraints into equality and bound constraints:

minimize over Y, f, g, s, r:  fᵀd − trace(YᵀKY)
subject to  k = trace(YᵀW⁻¹Y)        (s)
            0 = Y Yᵀe − Wf            (t)
            0 = eᵀf − (1 + α)n        (u)
            0 = f − g − s             (v)
            0 = eᵀg − (1 − β)n − r    (w)
            Yᵢⱼ ≥ 0, s ≥ 0, r ≥ 0,
            0 ≤ f ≤ ke, 0 ≤ g ≤ 1     (4)

Here we have also replaced the constraint Y Yᵀ ≥ 0 with the stronger constraint Y ≥ 0. This problem is a quadratic programming problem with quadratic constraints; we discuss how to solve it in the next subsection. We call the problem NEO-LR and the solution procedure LRSDP. Even though we lose convexity by formulating the low-rank SDP, this nonlinear programming problem requires only O(nk) memory, and existing nonlinear programming techniques allow us to scale to large problems.
After we get a solution, Y can be regarded as the normalized assignment matrix Y = WŪ, where Ū = [ū₁, ū₂, …, ū_k] and ū_c = u_c/√s_c for c = 1, …, k.
4.1 Solving the NEO-K-Means low-rank SDP
To solve the NEO-LR problem (4), we use an augmented Lagrangian framework. This is an iterative strategy in which each step minimizes an augmented Lagrangian of the problem that includes a current estimate of the Lagrange multipliers for the constraints, as well as a penalty term that drives the solution towards the feasible set. Augmented Lagrangian techniques have been successful in previous studies of low-rank SDP approximations [6]. Let λ = [λ₁; λ₂; λ₃] be the Lagrange multipliers associated with the three scalar constraints (s), (u), (w), and let µ and Λ be the Lagrange multipliers associated with the vector constraints (t) and (v), respectively. Let γ ≥ 0 be a penalty parameter. The augmented Lagrangian for (4) is:
L_A(Y, f, g, s, r; λ, µ, Λ, γ) =
    fᵀd − trace(YᵀKY)                                        [the objective]
  − λ₁(trace(YᵀW⁻¹Y) − k) + (γ/2)(trace(YᵀW⁻¹Y) − k)²
  − µᵀ(Y Yᵀe − Wf) + (γ/2)(Y Yᵀe − Wf)ᵀ(Y Yᵀe − Wf)
  − λ₂(eᵀf − (1 + α)n) + (γ/2)(eᵀf − (1 + α)n)²
  − Λᵀ(f − g − s) + (γ/2)(f − g − s)ᵀ(f − g − s)
  − λ₃(eᵀg − (1 − β)n − r) + (γ/2)(eᵀg − (1 − β)n − r)²      (5)
At each step in the augmented Lagrangian solution framework, we solve the following subproblem:

minimize  L_A(Y, f, g, s, r; λ, µ, Λ, γ)
subject to  Yᵢⱼ ≥ 0, s ≥ 0, r ≥ 0,
            0 ≤ f ≤ ke, 0 ≤ g ≤ 1.     (6)
We use a limited-memory BFGS algorithm with bound constraints [9] to minimize the subproblem with respect to the variables Y, f, g, s, and r. This requires computing the gradient of L_A with respect to the variables; we determine and validate an analytic form for the gradient in Appendix B. In Section 6.1, we provide evidence that our optimization procedure is correctly implemented. Those experiments also show that we achieve the same objective function values.
We use an augmented Lagrangian method to optimize this problem.
• Use L-BFGS-B to optimize each step.
• Update the multiplier estimates in the standard way.
• Pick parameters in a modestly standard way.
• Some variability between problems to show best results; only a little variation in time/performance.
• Faster than the NEOS solvers.
Low rank structure in NEO-K-Means solution Explore low rank structure in NEO-K-Means SDP
Comparison with Solvers on NEOS Server
NEOS Server¹: state-of-the-art solvers for numerical optimization.
Our ALM solver is much faster than theirs (e.g., SNOPT, which is suitable for large nonlinearly constrained problems with a modest number of degrees of freedom).

         Our ALM solver (obj/time)   SNOPT solver (obj/time)
MUSIC    79514.130 / 92s             79515.156 / 306s
SCENE    18534.030 / 3798s           18534.021 / 8910s
YEAST    8902.253 / 4331s            Not solved

¹ http://www.neos-server.org/neos/
Yangyang Hou (Purdue CS) Low Rank Methods for Optimizing Clustering Nov 2, 2015 25 / 61
We win with our LRSDP solver vs. the CVX default solver.
• Dolphins (n=62) and Les Mis (n=77) are graph probs • LRSDP is much faster and just as accurate.
Algorithmic Validation: comparison of SDP and LRSDP. LRSDP is roughly an order of magnitude faster than CVX, and LRSDP generates solutions as good as the global optimum from CVX; the objective values differ only within the solution tolerances. dolphins¹: 62 nodes, 159 edges; les miserables²: 77 nodes, 254 edges.

                        Objective value            Run time
                        SDP         LRSDP          SDP           LRSDP
dolphins
 k=2, α=0.2, β=0        -1.968893   -1.968329      107.03 secs   2.55 secs
 k=2, α=0.2, β=0.05     -1.969080   -1.968128      56.99 secs    2.96 secs
 k=3, α=0.3, β=0        -2.913601   -2.915384      160.57 secs   5.39 secs
 k=3, α=0.3, β=0.05     -2.921634   -2.922252      71.83 secs    8.39 secs
les miserables
 k=2, α=0.2, β=0        -1.937268   -1.935365      453.96 secs   7.10 secs
 k=2, α=0.3, β=0        -1.949212   -1.945632      447.20 secs   10.24 secs
 k=3, α=0.2, β=0.05     -2.845720   -2.845070      261.64 secs   13.53 secs
 k=3, α=0.3, β=0.05     -2.859959   -2.859565      267.07 secs   19.31 secs

¹ D. Lusseau et al., Behavioral Ecology and Sociobiology, 2003.
² D. E. Knuth. The Stanford GraphBase: A Platform for Combinatorial Computing. Addison-Wesley, 1993.
Dolphins from Lusseau et al. 2003; Les Mis from Knuth GraphBase
Rounding and Improvement are both important.
Input ! Relaxed solution ! Rounded solution ! Improved solution
Rounding
f gives the number of clusters; g gives the set of assignments.
Option 1: use g and f to determine the number of assignments, and assign greedily.
Option 2: just greedily assign based on W⁻¹Y.

Improvement
Run NEO-K-Means on the output.

Initialization
Run NEO-K-Means on the input.
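A minimal sketch of the "Option 2" greedy rounding above; the function name, tie-breaking, and the decision to leave the β coverage constraint to a later repair pass are our own simplifications.

```python
import numpy as np

def round_option2(Y, w, alpha):
    """Greedily round a relaxed solution: score point/cluster pairs by
    W^{-1} Y and make (1 + alpha) n assignments, largest scores first.
    Returns a boolean assignment matrix U (n x k)."""
    n, k = Y.shape
    S = Y / w[:, None]                     # W^{-1} Y scores
    U = np.zeros((n, k), dtype=bool)
    budget = n + int(alpha * n)            # total number of assignments
    for idx in np.argsort(-S, axis=None):  # descending scores
        if budget == 0:
            break
        i, j = divmod(int(idx), k)
        U[i, j] = True
        budget -= 1
    return U
```

The rounded U then seeds the improvement step: running the iterative NEO-K-Means algorithm from this initialization.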
The new method is more robust, even in simple tests. Consider clustering a cycle graph
We use disconnected nodes to measure the cluster quality.
disconnected nodes
[Plot: number of disconnected nodes (0–100) versus noise (0–4) for random+onelevel neo, multilevel neo, and lrsdp.] As we increase the noise, only the LRSDP method can reliably find the true clustering.
We get improved vector and graph clustering results too.
Experimental Results on Data Clustering
Comparison of NEO-K-Means objective function values on real-world datasets from Mulan¹. By using the LRSDP solution as the initialization of the iterative algorithm, we can achieve better (smaller) objective function values.
                worst    best     avg.
yeast
  kmeans+neo    9611     9495     9549
  lrsdp+neo     9440     9280     9364
  slrsdp+neo    9471     9231     9367
music
  kmeans+neo    87779    70158    77015
  lrsdp+neo     82323    70157    75923
  slrsdp+neo    82336    70159    75926
scene
  kmeans+neo    18905    18745    18806
  lrsdp+neo     18904    18759    18811
  slrsdp+neo    18895    18760    18810
¹ http://mulan.sourceforge.net/datasets.html
F1 scores on real-world vector datasets (the larger, the better). NEO-K-Means-based methods outperform the other methods, and the low-rank SDP method improves the clustering results.

               moc     esp     isp     okm     kmeans+neo  lrsdp+neo  slrsdp+neo
yeast  worst   -       0.274   0.232   0.311   0.356       0.390      0.369
       best    -       0.289   0.256   0.323   0.366       0.391      0.391
       avg.    -       0.284   0.248   0.317   0.360       0.391      0.382
music  worst   0.530   0.514   0.506   0.524   0.526       0.537      0.541
       best    0.544   0.539   0.539   0.531   0.551       0.552      0.552
       avg.    0.538   0.526   0.517   0.527   0.543       0.545      0.547
scene  worst   0.466   0.569   0.586   0.571   0.597       0.610      0.605
       best    0.470   0.582   0.609   0.576   0.627       0.614      0.625
       avg.    0.467   0.575   0.598   0.573   0.610       0.613      0.613
We have improved results – impressively so on the yeast dataset – and only slightly worse on the scene data.
We get improved vector and graph clustering results too.
           Facebook1   Facebook2   HepPh   AstroPh
bigclam    0.830       0.640       0.625   0.645
demon      0.495       0.318       0.503   0.570
oslom      0.319       0.445       0.465   0.580
nise       0.297       0.293       0.102   0.153
m-neo      0.285       0.269       0.206   0.190
LRSDP      0.222       0.148       0.091   0.137

            No. of vertices   No. of edges
Facebook1   348               2,866
Facebook2   756               30,780
HepPh       11,204            117,619
AstroPh     17,903            196,972
For these graphs, we dramatically improve the conductance-vs-coverage plots.
Lloyd's iterative method takes O(1 second); the LRSDP method takes O(1 hour). Now we want to improve the LRSDP time.
We can improve the optimization beyond ALM.
1. Proximal augmented Lagrangian (PALM): add a regularization term to the augmented Lagrangian and solve with L-BFGS-B.
2. ADMM method (5 blocks).
PALM update:
x^{k+1} = argmin over x of  L_A(x; λ^{k}, …) + (1/2τ)‖x − x^{k}‖²

ADMM block updates:
Y^{k+1} = argmin_Y L_A(Y, f^k, g^k, s^k, r^k; λ^k, µ^k, Λ^k, γ)          (non-convex)
f^{k+1} = argmin_f L_A(Y^{k+1}, f, g^k, s^k, r^k; λ^k, µ^k, Λ^k, γ)      (convex)
g^{k+1} = argmin_g L_A(Y^{k+1}, f^{k+1}, g, s^k, r^k; λ^k, µ^k, Λ^k, γ)  (convex)
s^{k+1} = argmin_s L_A(Y^{k+1}, f^{k+1}, g^{k+1}, s, r^k; λ^k, µ^k, Λ^k, γ)      (convex)
r^{k+1} = argmin_r L_A(Y^{k+1}, f^{k+1}, g^{k+1}, s^{k+1}, r; λ^k, µ^k, Λ^k, γ)  (convex)
We had to derive a new convergence result for the proximal method. Are there results for bound-constrained subproblems? Ours is a small adaptation of a general result due to Pennanen (2002).
Convergence analysis of PALM¹

Theorem 1. Let (x̄, λ̄) be a KKT pair satisfying the strong second-order sufficient condition, and assume the gradients ∇c(x̄) are linearly independent. If the {γ_k} are large enough with γ_k → γ̄ ≤ ∞, and if ‖(x⁰, λ⁰) − (x̄, λ̄)‖ is small enough, then there exists a sequence {(x^k, λ^k)} conforming to Algorithm 1, along with open neighborhoods C_k, such that for each k, x^{k+1} is the unique solution in C_k to (P_k). Moreover, the sequence {(x^k, λ^k)} converges linearly and Fejér monotonically to (x̄, λ̄) with rate r(γ) < 1 that is decreasing in γ, and r(γ) → 0 as γ → ∞.

¹ We specialize a general convergence result due to Pennanen (Local convergence of the proximal point algorithm and multiplier methods without monotonicity. Math. Oper. Res., 27(1):170–191, 2002) to our algorithm. A sketch of the proof can be found in the Appendix.
On the yeast dataset, we see no difference in objective, but faster solves
[Bar charts: runtimes on YEAST (seconds, roughly 0–4500) for the iterative method, ALM, PALM, and ADMM, and f(x) values on YEAST (roughly 8700–9200) for ALM, PALM, and ADMM.]
On yeast, we see much better discrete objectives and F1 scores.
[Bar charts: NEO-K-Means objectives on YEAST (roughly 9000–9700) and F1 scores on YEAST (roughly 0.34–0.39) for the iterative method, ALM, PALM, and ADMM.]
Recap For overlapping clustering of data and overlapping community detection of graphs, we have a new objective • Fast Lloyd-like iterative algorithm • SDP relaxation • Low-rank SDP relaxation • Proximal and ADMM acceleration techniques
1. NEO-K-Means – Whang, Gleich, Dhillon, SDM 2015
2. NEO-K-Means SDP + Aug. Lagrangian – Hou, Whang, Gleich, Dhillon, KDD 2015
3. Multiplier Methods for Overlapping K-Means – Hou, Whang, Gleich, Dhillon, submitted
plot(x)
[Plots: the solution vector x and its nonzeros for a crawl of flickr from 2006 (~800k nodes, 6M edges, β = 1/2).]

(I − βP)x = (1 − β)s,   nnz(x) ≈ 800k,   ‖D⁻¹(x − x*)‖∞ ≤ ε

Localized solutions of diffusion equations in large graphs. Joint with Kyle Kloster. WAW2013, KDD2014, WAW2015; J. Internet Math.
Feature, XRDS • Spring 2013 • Vol. 19 • No. 3
damentally limited); the second type models a diffusion of a virtual good (think of sending links or a virus spreading on a network), which can be copied infinitely often. Google's celebrated PageRank model uses a conservative diffusion to determine the importance of pages on the Web [4], whereas those studying when a virus will continue to propagate found that the eigenvalues of the non-conservative diffusion determine the answer [5]. Thus, just as in scientific computing, marrying the method to the model is key for the best scientific computing on social networks.
Ultimately, none of these steps differ from the practice of physical scientific computing. The challenges in creating models, devising algorithms, validating results, and comparing models just take on different forms when the problems come from social data instead of physical models. Thus, let us return to our starting question: what does the matrix have to do with the social network? Just as in scientific computing, many interesting problems, models, and methods for social networks boil down to matrix computations. Yet, as in the expander example above, the types of matrix questions change dramatically in order to fit social network models. Let's see what's been done that's enticingly and refreshingly different from the types of matrix computations encountered in physical scientific computing.
EXPANDER GRAPHS AND PARALLEL COMPUTING
Recently, a coalition of folks from academia, national labs, and industry set out to tackle the problems in parallel computing and expander graphs. They established the Graph 500 benchmark (http://www.graph500.org) to measure the performance of a parallel computer on a standard graph computation with an expander graph. Over the past three years, they've seen performance grow by more than 1,000 times through a combination of novel software algorithms and higher-performance parallel computers. But there is still work left in adapting the software innovations for parallel computing back to matrix computations for social networks.
Figure 1. In a standard scientific computing problem, we find the steady-state heat distribution of a plate with a heat source in the middle. This scientific problem is solved via a linear system. In a social diffusion problem, we are trying to find people who like the movie (labeled in dark orange) instead of people who don't like the movie (labeled in dark purple). By solving a different linear system, we can determine who is likely to enjoy the movie (light orange). [Figure panels: Problem (diffusion in a plate; movie interest in diffusion) and Equations.]
Figure 2. The network, or mesh, from a typical problem in scientific computing resides in a low-dimensional space, think of two or three dimensions. These physical spaces put limits on the size of the boundary, or "surface area," of the space given its volume. No such limits exist in social networks, and these two sets are usually about the same size. A network with this property is called an expander network. [Figure labels: "Networks" from PDEs are usually physical: size of set ≈ size of boundary. Social networks are expanders: size of set » size of boundary.]
Higher order organization of complex networks Joint with Austin Benson and Jure Leskovec
[Figure: a small example graph with numbered nodes and a cluster of labeled neurons from the connectome (e.g., RIAL, RIAR, RMDDR, SMDDL, URBR).]
By using a new generalization of spectral clustering methods, we are able to find completely novel and relevant structures in complex systems such as the connectome and transport networks.
SIAM Annual Meeting (AN16), July 11–15, 2016, The Westin Waterfront, Boston, Massachusetts. David Gleich, Purdue; Mary Silber, Northwestern.
• Big Data, Data Science, and Privacy
• Education, Communication, and Policy
• Reproducibility and Ethics
• Efficiency and Optimization
• Integrating Models and Data (incl. computational social science, PDEs)
• Dynamic Networks (learning, evolution, adaptation, and cooperation)
• Applied Math, Statistics, and Machine Learning
• Earth systems; environmental/ecological applications
• Epidemiology
Future work
• Even faster solvers
• Understand why the solution seems to be rank 2
• Better initialization for Lloyd's method