


Clusterpath: an Algorithm for Clustering using Convex Fusion Penalties

Toby Dylan Hocking, Armand Joulin, Francis Bach, Jean-Philippe Vert

INRIA – Sierra team, Laboratoire d'Informatique de l'ÉNS
Mines ParisTech – CBIO, INSERM U900, Institut Curie

Paris, France

The clustering problem: different approaches

Clustering: assign labels to n points in p dimensions, X ∈ R^{n×p}.

Methods:
- K-means
- Hierarchical
- Mixture models
- Spectral (Ng et al. 2001)

Issues:
- Hierarchy
- Convexity
- Greediness
- Stability
- Interpretability

Clusterpath: relaxing a hard fusion penalty

Hard-thresholding of differences is a combinatorial problem:

  min_{α ∈ R^{n×p}} ||α − X||²_F subject to Σ_{i<j} 1_{α_i ≠ α_j} ≤ t

Relaxation:

  Σ_{i<j} ||α_i − α_j||_q w_ij ≤ t

The Lagrange form is useful for optimization algorithms:

  α*(X, λ, q, w) = argmin_{α ∈ R^{n×p}} (1/2) ||α − X||²_F + λ Σ_{i<j} ||α_i − α_j||_q w_ij

The clusterpath of X is the path of optimal α* obtained by varying λ, for fixed weights w_ij ∈ R+ and norm q ∈ {1, 2, ∞}.

Related work: "fused lasso" Tibshirani and Saunders (2005), "grouping pursuit" Shen and Huang (2010), "sum of norms" Lindsten et al. (2011).
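The Lagrange form above transcribes directly into code. A minimal numpy sketch (the function name and array layout are our own, not from the poster):

```python
import numpy as np

def clusterpath_objective(alpha, X, lam, w, q=2):
    """Lagrange-form clusterpath objective:
    (1/2)||alpha - X||_F^2 + lam * sum_{i<j} w_ij ||alpha_i - alpha_j||_q."""
    n = X.shape[0]
    fit = 0.5 * np.sum((alpha - X) ** 2)
    penalty = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            penalty += w[i, j] * np.linalg.norm(alpha[i] - alpha[j], ord=q)
    return fit + lam * penalty
```

At λ = 0 the identity solution α = X is optimal; for large λ a fully fused α (every row equal to the mean) attains a lower objective, which is what drives fusions along the path.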

Norm and weights control the clusterpath

[Figure: clusterpaths of the same data X under norm = 1, norm = 2, and norm = ∞, for weight parameters γ = 0 and γ = 1.]

Geometric interpretation: constrain area between points

[Figure: three configurations of points X1, X2, X3 with the ℓ1, ℓ2, ℓ∞ areas between them:
- Identity weights, t = Ω(X).
- Decreasing weights after join, t < Ω(X): α1 stays apart while αC = α2 = α3.
- Decreasing weights, t = Ω(X): weights w12, w13, w23.]
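The "decreasing weights" used throughout can be taken as Gaussian-kernel weights w_ij = exp(−γ ||X_i − X_j||²); that specific kernel is our assumption, flagged in the comments:

```python
import numpy as np

def decreasing_weights(X, gamma):
    # Gaussian-kernel fusion weights w_ij = exp(-gamma * ||X_i - X_j||^2);
    # this kernel is our assumption for the "decreasing weights" / gamma
    # parameter in the figures, not a formula stated on the poster.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq)
```

γ = 0 recovers identity weights (w_ij = 1 for all pairs); larger γ makes distant pairs nearly free to stay apart, localizing fusions.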

We propose dedicated algorithms for each norm

Norm  Properties                   Algorithm    Complexity    Problem sizes
1     piecewise linear, separable  path         O(pn log n)   large ≈ 10^5
2     rotation invariant           active-set   O(n²p)        medium ≈ 10^3
∞     piecewise linear             Frank-Wolfe  unknown*      medium ≈ 10^3

*each iteration has complexity O(n²p).

Outline of the ℓ1 path algorithm

A sufficient condition for optimality:

  0 = α_i − X_i + λ Σ_{j≠i, α_i≠α_j} w_ij sign(α_i − α_j) + λ Σ_{j≠i, α_i=α_j} w_ij β_ij,

with |β_ij| ≤ 1 and β_ij = −β_ji (Hoefling 2009).

1. For λ = 0 the solution α = X is optimal. We initialize the clusters C_i = {i} and coefficients α_i = X_i for all i.
2. As λ increases, the solutions follow straight lines.
3. Taking the derivative of the optimality condition with respect to λ and averaging over all points in a cluster C leads to:

  dα_C/dλ = v_C = (1/|C|) Σ_{j∉C} w_jC sign(α_j − α_C) = (1/|C|) Σ_{j∉C} Σ_{i∈C} w_ij sign(α_j − α_C)

4. When 2 clusters C1 and C2 fuse, they form a new cluster C = C1 ∪ C2 with v_C = (|C1| v_1 + |C2| v_2)/(|C1| + |C2|).
5. Stop when all the points merge at the mean X̄.
6. Combine dimensions using λ values.
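For 1-d data with identity weights the steps above reduce to an event-driven merge loop: with w_ij = 1, each cluster's velocity simplifies to (#points above − #points below), paths are straight lines between fusions, and only adjacent clusters can meet. A sketch (function name and structure are ours):

```python
def l1_clusterpath_1d(x):
    """Event-driven l1 clusterpath for 1-d data with identity weights.
    Returns a list of (lambda at fusion, cluster centers after fusion)."""
    centers = sorted(float(v) for v in x)
    sizes = [1] * len(centers)
    lam = 0.0
    events = []
    while len(centers) > 1:
        k = len(centers)
        # velocity of cluster c: (#points above) - (#points below)
        v = [sum(sizes[c + 1:]) - sum(sizes[:c]) for c in range(k)]
        # adjacent clusters always approach here: v[c] - v[c+1] = |C_c| + |C_{c+1}| > 0,
        # so the next fusion is the earliest crossing of adjacent centers
        dls = [(centers[c + 1] - centers[c]) / (v[c] - v[c + 1])
               for c in range(k - 1)]
        c = min(range(k - 1), key=dls.__getitem__)
        dl = dls[c]
        lam += dl
        # advance every center along its straight line, then fuse c and c+1
        centers = [centers[i] + v[i] * dl for i in range(k)]
        s = sizes[c] + sizes[c + 1]
        centers[c:c + 2] = [(sizes[c] * centers[c]
                             + sizes[c + 1] * centers[c + 1]) / s]
        sizes[c:c + 2] = [s]
        events.append((lam, list(centers)))
    return events
```

The last event places the single remaining cluster at the mean of the data, matching the stopping condition of step 5.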

ℓ1 clusterpath of 10 points in 2d

[Figure: the ℓ1 clusterpath of 10 points in 2d, shown in the (α1, α2) plane and as the optimal values of the ℓ1 clusterpath against the regularization parameter λ. The path joins the left cluster on α1 before joining the right cluster; the solution at λ = 0.18 yields 2 clusters.]

Free software! http://clusterpath.r-forge.r-project.org/

- Dedicated C++ optimization algorithms with R interface.
- Calculates the exact ℓ1 clusterpath for identity weights.
- Active-set algorithm for the ℓ1 and ℓ2 clusterpath with general weights.
- R interface to Python cvxmod clusterpath solver.
- Clusterpath visualizations in 2d, 3d, and animations.
- Coming soon: picking the number of clusters automatically!

Future work

- Necessary and sufficient conditions for cluster splitting?
- Automatically learning weights and number of clusters?
- Applications to solving proximal problems.

Clustering performance and timings

Cluster using the prior knowledge that there are 2 clusters. Quantify partition correspondence using the Normalized Rand Index (Hubert and Arabie, 1985): 1 for perfect correspondence, 0 for completely random assignment.

Results for 2 non-convex interlocking half-moons in 2d:

Clustering method           Rand   SD     Seconds  SD
eexp spectral clusterpath   0.99   0.00   8.49     2.64
eexp spectral kmeans        0.99   0.00   3.10     0.08
ℓ2 clusterpath              0.95   0.12   29.47    2.31
e01 Ng et al. kmeans        0.95   0.19   7.37     0.42
e01 spectral kmeans         0.91   0.19   3.26     0.21
Gaussian mixture            0.42   0.13   0.07     0.00
average linkage             0.40   0.13   0.05     0.00
kmeans                      0.26   0.04   0.01     0.00
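The Normalized Rand Index of Hubert and Arabie is the adjusted Rand index; a compact transcription of the standard formula (our code, not taken from the poster):

```python
from math import comb
from collections import Counter

def normalized_rand_index(a, b):
    """Adjusted Rand Index (Hubert and Arabie, 1985): 1 for identical
    partitions, 0 in expectation for random label assignments."""
    n = len(a)
    # contingency counts between the two labelings
    sum_ij = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)
```

Because the index is corrected for chance, relabeling the clusters (swapping cluster names) leaves it unchanged, which is why it is the right score for comparing an unsupervised partition to known classes.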

Similar performance to spectral clustering, and learns a tree:

The weighted ℓ2 clusterpath applied to the iris data:

[Figure: scatterplot matrix of the iris data (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width), with points colored by species: setosa, versicolor, virginica.]

Performance for several model sizes

[Figure: Normalized Rand index (bigger means better agreement with known clusters) against the number of clusters (2 to 11), for the iris and moons data, comparing clusterpath weights γ = 0.5, γ = 2, γ = 10 with GMM, HC, and kmeans.]