


Clusterpath: an Algorithm for Clustering using Convex Fusion Penalties

Toby Dylan Hocking, Armand Joulin, Francis Bach, Jean-Philippe Vert

INRIA – Sierra team, Laboratoire d'Informatique de l'ÉNS
Mines ParisTech – CBIO, INSERM U900, Institut Curie

Paris, France

The clustering problem: different approaches

Clustering: assign labels to n points in p dimensions, X ∈ R^{n×p}.

Methods:
- K-means
- Hierarchical
- Mixture models
- Spectral (Ng et al. 2001)

Issues:
- Hierarchy
- Convexity
- Greediness
- Stability
- Interpretability

Clusterpath: relaxing a hard fusion penalty

Hard-thresholding of differences is a combinatorial problem:

  min_{α ∈ R^{n×p}} ||α − X||²_F subject to Σ_{i<j} 1_{α_i ≠ α_j} ≤ t

Relaxation:

  Σ_{i<j} ||α_i − α_j||_q w_ij ≤ t

The Lagrange form is useful for optimization algorithms:

  α*(X, λ, q, w) = argmin_{α ∈ R^{n×p}} (1/2) ||α − X||²_F + λ Σ_{i<j} ||α_i − α_j||_q w_ij

The clusterpath of X is the path of optimal α* obtained by varying λ, for fixed weights w_ij ∈ R+ and norm q ∈ {1, 2, ∞}.

Related work: "fused lasso" Tibshirani and Saunders (2005), "grouping pursuit" Shen and Huang (2010), "sum of norms" Lindsten et al. (2011).
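The Lagrange form above transcribes directly into code. A minimal numpy sketch (the function name and array layout are our own, not from the poster):

```python
import numpy as np

def clusterpath_objective(alpha, X, lam, w, q=2):
    """Lagrange-form clusterpath objective:
    (1/2)||alpha - X||_F^2 + lam * sum_{i<j} w_ij ||alpha_i - alpha_j||_q."""
    n = X.shape[0]
    fit = 0.5 * np.sum((alpha - X) ** 2)
    penalty = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            penalty += w[i, j] * np.linalg.norm(alpha[i] - alpha[j], ord=q)
    return fit + lam * penalty
```

At λ = 0 the identity solution α = X is optimal; for large λ a fully fused α (every row equal to the mean) attains a lower objective, which is what drives fusions along the path.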

Norm and weights control the clusterpath

[Figure: clusterpaths of the same data X under norm = 1, norm = 2, and norm = ∞, for weight parameters γ = 0 and γ = 1.]

Geometric interpretation: constrain area between points

[Figure: three configurations of points X1, X2, X3 with the ℓ1, ℓ2, ℓ∞ areas between them:
- Identity weights, t = Ω(X).
- Decreasing weights after join, t < Ω(X): α1 stays apart while αC = α2 = α3.
- Decreasing weights, t = Ω(X): weights w12, w13, w23.]
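The "decreasing weights" used throughout can be taken as Gaussian-kernel weights w_ij = exp(−γ ||X_i − X_j||²); that specific kernel is our assumption, flagged in the comments:

```python
import numpy as np

def decreasing_weights(X, gamma):
    # Gaussian-kernel fusion weights w_ij = exp(-gamma * ||X_i - X_j||^2);
    # this kernel is our assumption for the "decreasing weights" / gamma
    # parameter in the figures, not a formula stated on the poster.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq)
```

γ = 0 recovers identity weights (w_ij = 1 for all pairs); larger γ makes distant pairs nearly free to stay apart, localizing fusions.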

We propose dedicated algorithms for each norm

Norm  Properties                   Algorithm    Complexity    Problem sizes
1     piecewise linear, separable  path         O(pn log n)   large ≈ 10^5
2     rotation invariant           active-set   O(n²p)        medium ≈ 10^3
∞     piecewise linear             Frank-Wolfe  unknown*      medium ≈ 10^3

*each iteration has complexity O(n²p).

Outline of the ℓ1 path algorithm

A sufficient condition for optimality:

  0 = α_i − X_i + λ Σ_{j≠i, α_i≠α_j} w_ij sign(α_i − α_j) + λ Σ_{j≠i, α_i=α_j} w_ij β_ij,

with |β_ij| ≤ 1 and β_ij = −β_ji (Hoefling 2009).

1. For λ = 0 the solution α = X is optimal. We initialize the clusters C_i = {i} and coefficients α_i = X_i for all i.
2. As λ increases, the solutions follow straight lines.
3. Taking the derivative of the optimality condition with respect to λ and averaging over all points in a cluster C leads to:

  dα_C/dλ = v_C = (1/|C|) Σ_{j∉C} w_jC sign(α_j − α_C) = (1/|C|) Σ_{j∉C} Σ_{i∈C} w_ij sign(α_j − α_C)

4. When 2 clusters C1 and C2 fuse, they form a new cluster C = C1 ∪ C2 with v_C = (|C1| v_1 + |C2| v_2)/(|C1| + |C2|).
5. Stop when all the points merge at the mean X̄.
6. Combine dimensions using λ values.
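For 1-d data with identity weights the steps above reduce to an event-driven merge loop: with w_ij = 1, each cluster's velocity simplifies to (#points above − #points below), paths are straight lines between fusions, and only adjacent clusters can meet. A sketch (function name and structure are ours):

```python
def l1_clusterpath_1d(x):
    """Event-driven l1 clusterpath for 1-d data with identity weights.
    Returns a list of (lambda at fusion, cluster centers after fusion)."""
    centers = sorted(float(v) for v in x)
    sizes = [1] * len(centers)
    lam = 0.0
    events = []
    while len(centers) > 1:
        k = len(centers)
        # velocity of cluster c: (#points above) - (#points below)
        v = [sum(sizes[c + 1:]) - sum(sizes[:c]) for c in range(k)]
        # adjacent clusters always approach here: v[c] - v[c+1] = |C_c| + |C_{c+1}| > 0,
        # so the next fusion is the earliest crossing of adjacent centers
        dls = [(centers[c + 1] - centers[c]) / (v[c] - v[c + 1])
               for c in range(k - 1)]
        c = min(range(k - 1), key=dls.__getitem__)
        dl = dls[c]
        lam += dl
        # advance every center along its straight line, then fuse c and c+1
        centers = [centers[i] + v[i] * dl for i in range(k)]
        s = sizes[c] + sizes[c + 1]
        centers[c:c + 2] = [(sizes[c] * centers[c]
                             + sizes[c + 1] * centers[c + 1]) / s]
        sizes[c:c + 2] = [s]
        events.append((lam, list(centers)))
    return events
```

The last event places the single remaining cluster at the mean of the data, matching the stopping condition of step 5.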

ℓ1 clusterpath of 10 points in 2d

[Figure: the ℓ1 clusterpath of 10 points in 2d, shown in the (α1, α2) plane and as the optimal values of the ℓ1 clusterpath against the regularization parameter λ. The path joins the left cluster on α1 before joining the right cluster; the solution at λ = 0.18 yields 2 clusters.]

Free software! http://clusterpath.r-forge.r-project.org/

- Dedicated C++ optimization algorithms with R interface.
- Calculates the exact ℓ1 clusterpath for identity weights.
- Active-set algorithm for the ℓ1 and ℓ2 clusterpath with general weights.
- R interface to Python cvxmod clusterpath solver.
- Clusterpath visualizations in 2d, 3d, and animations.
- Coming soon: picking the number of clusters automatically!

Future work

- Necessary and sufficient conditions for cluster splitting?
- Automatically learning weights and number of clusters?
- Applications to solving proximal problems.

Clustering performance and timings

Cluster using the prior knowledge that there are 2 clusters. Quantify partition correspondence using the Normalized Rand Index (Hubert and Arabie, 1985): 1 for perfect correspondence, 0 for completely random assignment.

Results for 2 non-convex interlocking half-moons in 2d:

Clustering method           Rand   SD     Seconds  SD
eexp spectral clusterpath   0.99   0.00   8.49     2.64
eexp spectral kmeans        0.99   0.00   3.10     0.08
ℓ2 clusterpath              0.95   0.12   29.47    2.31
e01 Ng et al. kmeans        0.95   0.19   7.37     0.42
e01 spectral kmeans         0.91   0.19   3.26     0.21
Gaussian mixture            0.42   0.13   0.07     0.00
average linkage             0.40   0.13   0.05     0.00
kmeans                      0.26   0.04   0.01     0.00
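The Normalized Rand Index of Hubert and Arabie is the adjusted Rand index; a compact transcription of the standard formula (our code, not taken from the poster):

```python
from math import comb
from collections import Counter

def normalized_rand_index(a, b):
    """Adjusted Rand Index (Hubert and Arabie, 1985): 1 for identical
    partitions, 0 in expectation for random label assignments."""
    n = len(a)
    # contingency counts between the two labelings
    sum_ij = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)
```

Because the index is corrected for chance, relabeling the clusters (swapping cluster names) leaves it unchanged, which is why it is the right score for comparing an unsupervised partition to known classes.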

Similar performance to spectral clustering, and learns a tree:

The weighted ℓ2 clusterpath applied to the iris data:

[Figure: scatterplot matrix of the iris data (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width), with points colored by species: setosa, versicolor, virginica.]

Performance for several model sizes

[Figure: Normalized Rand index (bigger means better agreement with known clusters) against the number of clusters (2 to 11), for the iris and moons data, comparing clusterpath weights γ = 0.5, γ = 2, γ = 10 with GMM, HC, and kmeans.]