Clustering on the Simplex

Morten Mørup, DTU Informatics, Intelligent Signal Processing, Technical University of Denmark
Informatics and Mathematical Modelling / Intelligent Signal Processing
EMMDS 2009, July 3rd, 2009


Page 1: Clustering on the Simplex


Clustering on the Simplex

Morten Mørup DTU Informatics

Intelligent Signal Processing, Technical University of Denmark

Page 2: Clustering on the Simplex


Joint work with

Lars Kai Hansen, DTU Informatics, Intelligent Signal Processing, Technical University of Denmark

Christian Walder, DTU Informatics, Intelligent Signal Processing, Technical University of Denmark

Page 3: Clustering on the Simplex


Clustering

Cluster analysis or clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. (Wikipedia)

Page 4: Clustering on the Simplex


Clustering approaches

K-means iterative refinement algorithm (Lloyd, 1982; Hartigan, 1979):
Assignment step (S): assign each data point to the cluster with the closest mean value.
Update step (C): calculate the new mean value for each cluster.

Guarantee of optimality: no single change in assignment is better than the current assignment (1-spin stability).

Drawbacks: the problem is NP-complete (Megiddo and Supowit, 1984).

Relaxations of the hard assignment problem:
Annealing approaches based on a temperature parameter (as T→0 the original clustering problem is recovered) (see for instance Hofmann and Buhmann, 1997)
Fuzzy clustering (Hathaway and Bezdek, 1988)
Expectation Maximization (Mixture of Gaussians)
Spectral Clustering

Previous relaxations are either not exact or depend on a problem-specific annealing parameter in order to recover the original binary combinatorial assignments.
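The K-means iterative refinement above can be sketched as follows; this is a minimal NumPy illustration of Lloyd's algorithm on our part, not code from the talk:

```python
import numpy as np

def lloyd_kmeans(X, K, n_iter=100, seed=0):
    """Lloyd's algorithm: X is an (N, M) data matrix."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]  # means from K random points
    for _ in range(n_iter):
        # Assignment step (S): each point goes to the closest mean.
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        z = d.argmin(1)
        # Update step (C): recompute each cluster mean (keep old if empty).
        new_mu = np.array([X[z == k].mean(0) if np.any(z == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):
            break  # no mean changed: the refinement has converged
        mu = new_mu
    return z, mu
```

Convergence here only means no mean moved; as the slide notes, this local optimum is merely 1-spin stable, not globally optimal.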

Page 5: Clustering on the Simplex


From the K-means objective to Pairwise Clustering

K-means objective

Pairwise Clustering (Buhmann and Hofmann, 1994)

K is a similarity matrix; with K = X^T X the pairwise clustering objective is equivalent to the K-means objective.
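This equivalence is easy to verify numerically. The sketch below (our own, with a toy assignment) checks that the K-means cost equals trace(K) minus the pairwise within-cluster similarity score when K = X^T X:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 30))      # M=5 features, N=30 points as columns
z = np.repeat([0, 1, 2], 10)      # a hard assignment into 3 clusters
S = np.eye(3)[z].T                # 3 x N binary assignment matrix

K = X.T @ X                       # inner-product similarity matrix

# K-means cost: squared distance of each column to its cluster mean.
mu = np.stack([X[:, z == k].mean(1) for k in range(3)], axis=1)
kmeans_cost = ((X - mu[:, z]) ** 2).sum()

# Pairwise clustering score: within-cluster similarity over cluster size.
score = sum(S[k] @ K @ S[k] / S[k].sum() for k in range(3))

# Minimizing the K-means cost == maximizing the pairwise score.
assert np.isclose(kmeans_cost, np.trace(K) - score)
```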

Page 6: Clustering on the Simplex


Although clustering is hard, there is room to be simple(x) minded!

Binary Combinatorial (BC) Simplicial Relaxation (SR)

Page 7: Clustering on the Simplex


The simplicial relaxation (SR) admits standard continuous optimization for solving the pairwise clustering problem,

for instance by normalization-invariant projected gradient ascent:
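A rough sketch of such a scheme; the Euclidean projection onto the probability simplex, the step size, and the iteration count are our assumptions, not details from the talk:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of a vector onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def sr_cluster(K, n_clusters, n_iter=200, step=1e-2, seed=0):
    """Maximize sum_k (s_k K s_k) / n_k over column-stochastic S."""
    rng = np.random.default_rng(seed)
    N = K.shape[0]
    S = rng.random((n_clusters, N))
    S /= S.sum(0)                                   # start on the simplex
    for _ in range(n_iter):
        n = np.maximum(S.sum(1, keepdims=True), 1e-9)  # soft cluster sizes
        SK = S @ K
        # Gradient of sum_k (s_k K s_k) / n_k with respect to S.
        G = 2 * SK / n - np.sum(SK * S, 1, keepdims=True) / n ** 2
        S = np.apply_along_axis(project_simplex, 0, S + step * G)
    return S
```

The projection keeps every column of S on the simplex throughout, so the iterate always admits a probabilistic assignment interpretation.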

Page 8: Clustering on the Simplex


Synthetic data example: K-means vs. SR-clustering

The brown and grey clusters each contain 1000 data points in R^2, whereas the remaining clusters each have 250 data points.

Page 9: Clustering on the Simplex


The SR-clustering algorithm is driven by high-density regions.

Page 10: Clustering on the Simplex


SR-clustering (init=1) SR-clustering (init=0.01) Lloyd’s K-means

Thus, the solutions are in general substantially better than those of Lloyd's algorithm, at the same computational complexity.

Page 11: Clustering on the Simplex


[Figure: results for 10, 50, and 100 components; rows: K-means, SR-clustering (init=1), SR-clustering (init=0.01)]

Page 12: Clustering on the Simplex


SR-clustering for Kernel based semi-supervised learning

(Basu et al, 2004, Kulis et al. 2005, Kulis et al, 2009)

Kernel based semi-supervised learning based on pairwise clustering

Page 13: Clustering on the Simplex


The simplicial relaxation admits solving the problem as a (non-convex) continuous optimization problem.

Page 14: Clustering on the Simplex


Class labels can be handled by explicit fixing; must-links and cannot-links can be absorbed into the kernel.

Hence the problem more or less reduces to the standard SR-clustering problem for the estimation of S.
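One common way to absorb such links, sketched here in the spirit of Kulis et al. (the symmetric additive weighting is our assumption): reward must-linked pairs and penalize cannot-linked pairs directly in the kernel:

```python
import numpy as np

def supervised_kernel(K, must, cannot, w=1.0):
    """Add weight w for must-links and subtract it for cannot-links."""
    Ks = K.astype(float).copy()
    for i, j in must:                  # reward must-linked pairs
        Ks[i, j] += w
        Ks[j, i] += w
    for i, j in cannot:                # penalize cannot-linked pairs
        Ks[i, j] -= w
        Ks[j, i] -= w
    return Ks
```

Clustering the modified kernel with the plain SR-clustering machinery then trades off data similarity against the supervision weight w.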

Page 15: Clustering on the Simplex


At stationarity, the gradients of the elements in each column of S that are 1 are larger than those of the elements that are 0. Thus, the impact of the supervision can be evaluated by estimating the minimal Lagrange multipliers that guarantee stationarity of the solution obtained by the SR-clustering algorithm; this is a convex optimization problem.

The Lagrange multipliers thereby give a measure of conflict between the data and the supervision.

Page 16: Clustering on the Simplex


Digit classification with one mislabeled data observation from each class.

Page 17: Clustering on the Simplex


Community Detection in Complex Networks

Communities/modules: natural divisions of network nodes into densely connected subgroups (Newman & Girvan 2003)

G(V,E)

Adjacency matrix A

Community detection algorithm

Permuted adjacency matrix PAP^T

Permutation P of the graph from the clustering assignment S

Page 18: Clustering on the Simplex


Common Community detection objectives

Hamiltonian (Fu & Anderson, 1986, Reichardt & Bornholdt, 2004)

Modularity (Newman & Girvan, 2004)

Generic problems of the form
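For concreteness, the Newman-Girvan modularity of a hard assignment can be computed as follows; this is a self-contained sketch of the standard formula, not code from the talk:

```python
import numpy as np

def modularity(A, z):
    """Q = (1/2m) * sum_ij (A_ij - k_i k_j / 2m) * [z_i == z_j]."""
    k = A.sum(1)                       # node degrees
    two_m = k.sum()                    # twice the number of edges
    same = z[:, None] == z[None, :]    # same-community indicator
    return ((A - np.outer(k, k) / two_m) * same).sum() / two_m
```

Two disconnected triangles, split into their natural communities, give the well-known value Q = 0.5.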

Page 19: Clustering on the Simplex


Again we can make an exact relaxation to the simplex!

Page 20: Clustering on the Simplex


Page 21: Clustering on the Simplex


Page 22: Clustering on the Simplex


SR-clustering of complex networks

The quality of the solutions is comparable to results obtained by extensive Gibbs sampling.

Page 23: Clustering on the Simplex


So far we have demonstrated how binary combinatorial constraints are recovered at stationarity when relaxing the problems to the simplex.

However, simplex constraints also hold promising data mining properties of their own!

Page 24: Clustering on the Simplex


The Convex Hull

Def: The convex hull/convex envelope of X ∈ R^(M×N) is the minimal convex set containing X. (Informally, it can be described as a rubber band wrapped around the data points.)

Finding the convex hull is solvable in linear time, O(N) (McCallum and Avis, 1979). However, the size of the convex set grows exponentially with the dimensionality of the data, O(log^(M-1)(N)) (Dwyer, 1988).

The Principal Convex Hull (PCH)

Def: The best convex set of size K according to some measure of distortion D(·|·) (Mørup et al. 2009). (Informally, it can be described as a less flexible rubber band that wraps most of the data points.)
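To make the convex hull definition concrete, here is a self-contained 2-D hull routine (Andrew's monotone chain); this helper is our own illustration, not code from the talk:

```python
import numpy as np

def convex_hull_2d(points):
    """Return hull vertices in counter-clockwise order for an (N, 2) array."""
    pts = sorted(map(tuple, points))
    if len(pts) <= 2:
        return np.array(pts)
    def cross(o, a, b):
        # Positive if o->a->b makes a counter-clockwise turn.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    lower, upper = [], []
    for p in pts:                      # build lower hull left to right
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):            # build upper hull right to left
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return np.array(lower[:-1] + upper[:-1])
```

The "rubber band" intuition: points strictly inside the band are discarded, only the extreme points survive as hull vertices.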

Page 25: Clustering on the Simplex


The mathematical formulation of the Principal Convex Hull (PCH) is given by two simplex constraints; "principal" is meant in terms of the Frobenius norm:

X ≈ XCS

C: gives the fractions in which the observations in X are used to form each feature (distinct aspect). In general C will be very sparse!
S: gives the fraction by which each observation resembles each distinct aspect in XC.

(Note that when K is large enough, the PCH recovers the convex hull.)
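A rough sketch of fitting the PCH factorization X ≈ XCS by alternating projected gradient steps; the step size, iteration count, and the simplex projection routine are our assumptions, not details from the talk:

```python
import numpy as np

def proj_cols_simplex(V):
    """Project each column of V onto the probability simplex."""
    def proj(v):
        u = np.sort(v)[::-1]
        css = np.cumsum(u) - 1.0
        rho = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
        return np.maximum(v - css[rho] / (rho + 1.0), 0.0)
    return np.apply_along_axis(proj, 0, V)

def fit_pch(X, K, n_iter=500, step=1e-3, seed=0):
    """X: (M, N). Returns C (N, K) and S (K, N) with simplex columns."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    C = proj_cols_simplex(rng.random((N, K)))
    S = proj_cols_simplex(rng.random((K, N)))
    for _ in range(n_iter):
        R = X @ C @ S - X                            # residual of X C S vs X
        C = proj_cols_simplex(C - step * (X.T @ R @ S.T))  # gradient in C
        R = X @ C @ S - X
        S = proj_cols_simplex(S - step * ((X @ C).T @ R))  # gradient in S
    return C, S
```

The columns of C mix observations into the K aspects XC, and the columns of S express each observation as a convex combination of those aspects.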

Page 26: Clustering on the Simplex


Relation between the PCH model, low rank decomposition and clustering approaches

PCH naturally bridges clustering and low-rank approximations!

Page 27: Clustering on the Simplex


Two important properties of the PCH model

The PCH model is invariant to affine transformation and scaling.

The PCH model is unique up to permutation of the components.

Page 28: Clustering on the Simplex


A feature extraction example

More contrast in the features than obtained by clustering approaches. As such, PCH aims for distinct aspects/regions in the data.

The PCH model strives to attain Platonic ”Ideal Forms”

Page 29: Clustering on the Simplex


PCH model for PET data (Positron Emission Tomography)

The data contain 3 components:
High-binding regions
Low-binding regions
Non-binding regions
Each voxel is given as a concentration fraction of these regions.

[Figure: estimated XC and S]

Page 30: Clustering on the Simplex


NMR spectroscopy of samples of mixtures of propanol, butanol, and pentanol.

Page 31: Clustering on the Simplex


Collaborative filtering example

Medium-size and large-size MovieLens data (www.grouplens.org):
Medium size: 1,000,209 ratings of 3,952 movies by 6,040 users.
Large size: 10,000,054 ratings of 10,677 movies by 71,567 users.

Page 32: Clustering on the Simplex


Conclusion

The simplex offers unique data mining properties.

Simplicial relaxations (SR) form exact relaxations of common hard-assignment clustering problems, i.e. K-means, Pairwise Clustering, and Community detection in graphs.

SR enables solving binary combinatorial problems using standard solvers from continuous optimization.

The proposed SR-clustering algorithm outperforms traditional iterative refinement algorithms: no need for an annealing parameter, and hard assignments are guaranteed at stationarity (Theorems 1 and 2).

Semi-supervised learning can be posed as a continuous optimization problem, with the associated Lagrange multipliers giving an evaluation measure for each supervised constraint.

Page 33: Clustering on the Simplex


Conclusion cont.

The Principal Convex Hull (PCH) is formed by two types of simplex constraints.

It extracts distinct aspects of the data and is relevant for data mining in general, wherever low-rank approximation and clustering approaches have been invoked.

Page 34: Clustering on the Simplex


A reformulation of "Lex Parsimoniae"

Simplicity is the ultimate sophistication.
Simplexity is the ultimate sophistication.
- Leonardo da Vinci

The simplest explanation is usually the best.
The simplex explanation is usually the best.
- William of Ockham

The presented work is described in:
M. Mørup and L. K. Hansen, "An Exact Relaxation of Clustering", submitted to JMLR, 2009.
M. Mørup, C. Walder and L. K. Hansen, "Simplicial Semi-supervised Learning", submitted.
M. Mørup and L. K. Hansen, "Platonic Forms Revisited", submitted.