
Measures of Clustering Quality: A Working Set of Axioms for Clustering
Margareta Ackerman
Joint work with Shai Ben-David

The Theory-Practice Gap
Clustering is one of the most widely used tools for exploratory data analysis. Social sciences, biology, astronomy, computer science, and other fields all apply clustering to gain a first understanding of the structure of large data sets.
Yet there is distressingly little theoretical understanding of clustering.

Questions that research on the fundamentals of clustering should address
Can clustering be given a formal and general definition?
What is a good clustering?
Can we distinguish clusterable from structureless data?

Inherent Obstacles
Clustering is not well defined. There is a wide variety of different clustering tasks, with different (often implicit) measures of quality.
In most practical clustering tasks there is no clear ground truth against which to evaluate a solution (in contrast with classification tasks, where a held-out labeled set can be used to evaluate the classifier).
A clustering may have different value to different users, e.g. clustering paintings by painter vs. by topic.

Common Solutions
Objective utility functions: sum of in-cluster distances, average distances to center points, cut weight, spectral clustering, etc. (Shmoys, Charikar, Meyerson, Luxburg, ...). Analyze the computational complexity of discrete optimization problems.
Consider a restricted set of distributions (generative models), e.g. mixtures of Gaussians [Dasgupta 99], [Vempala 03], [Kannan et al. 04], [Achlioptas, McSherry 05], and many more. Recover the parameters of the model generating the data.
Add structure (relevant information), e.g. the information bottleneck approach [Tishby, Pereira, Bialek 99]. Factor out user-irrelevant information.

Quest for a General Theory
What can we say independently of any specific algorithm, specific objective function, or specific generative data model?
Clustering axioms: postulate axioms that, ideally, every clustering approach should satisfy, e.g. [Hartigan 1975], [Puzicha, Hofmann, Buhmann 00], [Kleinberg 02]. These efforts usually conclude with negative results.

Our Formal Setup
For a finite domain set $S$, a distance function $d$ is the distance defined between the domain points.
A clustering function maps
Input: a distance function $d$ over $S$
to
Output: a partition (clustering) of $S$.

Kleinberg's Work on Clustering Functions
Kleinberg proposes natural-looking axioms that distinguish clustering functions from other functions that output domain partitions.

Kleinberg's Axioms
Scale Invariance: $F(\lambda d) = F(d)$ for all $d$ and all strictly positive $\lambda$.
Consistency: if $d'$ equals $d$, except for shrinking distances within clusters of $F(d)$ or stretching between-cluster distances, then $F(d) = F(d')$.
Richness: for any partition $P$ of $S$, there exists a distance function $d$ over $S$ so that $F(d) = P$.
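To make the scale-invariance axiom concrete, here is a minimal sketch that checks it numerically for one clustering function. Single linkage stopped at k clusters is used only as a convenient example of a function F; it is not taken from the slides, and the helper names `single_linkage` and `scale` are hypothetical. Because it always returns exactly k clusters, this F cannot satisfy Richness.

```python
from itertools import combinations

def single_linkage(d, k):
    """Example clustering function F: merge the two closest clusters
    until k clusters remain. d is a symmetric n x n list of lists."""
    clusters = [{i} for i in range(len(d))]
    while len(clusters) > k:
        # Cluster-to-cluster distance = smallest distance between members.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: min(
                d[x][y] for x in clusters[p[0]] for y in clusters[p[1]]
            ),
        )
        clusters[a] |= clusters[b]  # a < b, so deleting b keeps a valid
        del clusters[b]
    return {frozenset(c) for c in clusters}

def scale(d, lam):
    """Return the distance function lambda * d."""
    return [[lam * v for v in row] for row in d]

# Five points on a line; d(i, j) is the absolute difference.
points = [0.0, 1.0, 5.0, 6.0, 20.0]
d = [[abs(p - q) for q in points] for p in points]

base = single_linkage(d, 3)
# Scale invariance: F(lambda * d) == F(d) for strictly positive lambda.
invariant = all(single_linkage(scale(d, lam), 3) == base for lam in (0.5, 3.0, 10.0))
```

A check like this can only falsify an axiom on particular inputs; it does not prove that the axiom holds for every distance function.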

Theorem [Kleinberg, 2002]: These axioms are inconsistent. Namely, no function can satisfy all three axioms.
How can axioms that seem to capture our intuition about clustering be inconsistent?
Our answer: the formalization of these axioms is stronger than the intuition they intend to capture. We express that same intuition in an alternative framework, and achieve consistency.

Clustering-Quality Measures
How good is this clustering? Clustering-quality measures quantify the quality of clusterings.

Defining Clustering-Quality Measures
A clustering-quality measure is a function $m(\text{dataset}, \text{clustering}) \in \mathbb{R}$ satisfying some properties that make this function a meaningful clustering-quality measure.
What properties should it satisfy?

Rephrasing Kleinberg's axioms as clustering-quality measure axioms
Scale Invariance: $m(C, d) = m(C, \lambda d)$ for all $d$, all strictly positive $\lambda$, and all clusterings $C$ over $d$.
Richness: for any clustering $C$ of $S$, there exists a distance function $d$ over $S$ so that $C = \arg\max_{C'} m(C', d)$.

Rephrasing Kleinberg's axioms as clustering-quality measure axioms (continued)
Consistency: if $d'$ equals $d$, except for shrinking distances within clusters of $C$ or stretching between-cluster distances, then $m(C, d') \geq m(C, d)$.
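The consistency axiom for quality measures can be illustrated with a small sketch. It builds a consistent change d' of d for a fixed clustering C and evaluates one common form of Dunn's index (smallest between-cluster distance divided by largest within-cluster distance) before and after. The helper names and the uniform `shrink` and `stretch` factors are assumptions made for illustration; the axiom allows any mix of shrinking and stretching. As in Kleinberg's setting, d' is not required to satisfy the triangle inequality.

```python
def consistent_variant(d, labels, shrink=0.5, stretch=2.0):
    """Build d' from d: shrink within-cluster distances and stretch
    between-cluster distances (hypothetical helper, uniform factors)."""
    if not (0 < shrink <= 1 <= stretch):
        raise ValueError("need 0 < shrink <= 1 <= stretch")
    n = len(d)
    return [
        [d[i][j] * (shrink if labels[i] == labels[j] else stretch)
         for j in range(n)]
        for i in range(n)
    ]

def is_consistent_variant(d, d_new, labels):
    """True if d_new never grows a within-cluster distance and never
    shrinks a between-cluster distance, relative to d."""
    n = len(d)
    for i in range(n):
        for j in range(n):
            same = labels[i] == labels[j]
            if same and d_new[i][j] > d[i][j]:
                return False
            if not same and d_new[i][j] < d[i][j]:
                return False
    return True

def dunn_index(d, labels):
    """One common form of Dunn's index: min between-cluster distance
    over max within-cluster distance (needs a non-singleton cluster)."""
    n = len(d)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    within = [d[i][j] for i, j in pairs if labels[i] == labels[j]]
    between = [d[i][j] for i, j in pairs if labels[i] != labels[j]]
    return min(between) / max(within)

points = [0.0, 1.0, 5.0, 6.0]
d = [[abs(p - q) for q in points] for p in points]
labels = [0, 0, 1, 1]          # clustering C = {{0, 1}, {2, 3}}
d_new = consistent_variant(d, labels)

before = dunn_index(d, labels)      # 4 / 1
after = dunn_index(d_new, labels)   # 8 / 0.5
```

On this example the quality of C rises from 4.0 to 16.0, in line with the requirement that a consistent change must not lower the quality of C.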

An Additional Axiom
Clusterings $C$ over $(X, d)$ and $C'$ over $(X, d)$ are isomorphic if there exists a distance-preserving automorphism $f: X \to X$ such that $x, y$ share the same $C$-cluster iff $f(x)$ and $f(y)$ share the same $C'$-cluster.
Isomorphism Invariance: if $C$ and $C'$ are isomorphic, then $m(C, d) = m(C', d)$.

Major Gain: Consistency of New Axioms
Theorem: Consistency, scale invariance, richness, and isomorphism invariance for clustering-quality measures form a consistent set of requirements.
We prove this result by demonstrating measures that satisfy these axioms.
Moreover, every reasonable clustering-quality measure (CQM) satisfies our axioms.

An Example of a CQM for Center-Based Clustering: Relative Margin
The Relative Margin of a point $x$ in $C$ is (distance from $x$ to its closest center) / (distance from $x$ to its second-closest center).
The Relative Margin of $C$ is the average relative margin over all non-center points (over all possible center settings).
Relative Margin satisfies scale invariance, consistency, richness, and isomorphism invariance.
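The definition above can be sketched for one fixed choice of centers. This is only an illustration under stated assumptions: the function name `relative_margin` is hypothetical, centers are given as point indices, and the step "over all possible center settings" is deliberately not implemented because the slide does not say how the settings are combined.

```python
def relative_margin(d, centers):
    """Average, over non-center points, of
    (distance to closest center) / (distance to second-closest center).
    d is a symmetric n x n list of lists; centers are point indices."""
    centers = list(centers)
    if len(centers) < 2:
        raise ValueError("need at least two centers")
    ratios = []
    for x in range(len(d)):
        if x in centers:
            continue  # center points are excluded from the average
        nearest, second = sorted(d[x][c] for c in centers)[:2]
        ratios.append(nearest / second)
    if not ratios:
        raise ValueError("no non-center points to average over")
    return sum(ratios) / len(ratios)

points = [0.0, 1.0, 9.0, 10.0]
d = [[abs(p - q) for q in points] for p in points]

rm = relative_margin(d, centers=[0, 3])
# The ratio is unchanged when every distance is multiplied by 7.
rm_scaled = relative_margin([[7 * v for v in row] for row in d], centers=[0, 3])
```

Each ratio lies between 0 and 1, and a small value means the point sits much closer to its own center than to any other center.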

Additional CQMs Satisfying Our Axioms
- C-index (Dalrymple-Alford, 1970)
- Gamma (Baker & Hubert, 1975)
- Adjusted ratio of clustering (Roenker et al., 1971)
- D-index (Dalrymple-Alford, 1970)
- Modified ratio of repetition (Bower, Lesgold, and Tieman, 1969)
- Dunn's index (Dunn, 1973)
- Variations of Dunn's index (Bezdek and Pal, 1998)
- Strict separation (based on Balcan, Blum, and Vempala, 2008)
- And many more

Why is the CQM formalism more faithful to intuition?
In the setting of clustering functions, the consistency axiom requires that consistent changes to the underlying distance should not create any new contenders for the best clustering of the data.
If $F(d) = C$ and $d'$ is a consistent change of $d$, a clustering function that satisfies Kleinberg's Consistency cannot output a different clustering $C'$ on $d'$.

Why is the CQM formalism more faithful to intuition? (continued)
In the setting of clustering-quality measures, the consistency axiom requires only that the quality of a given clustering $C$ does not get worse.
While the quality of $C$ improves under a consistent change from $d$ to $d'$, a different clustering $C'$ can still have better quality.

Summary
The intuition behind Kleinberg's axioms is consistent (in spite of his impossibility result).
The impossibility result can be overcome by a change of formalism. We do this by focusing on clustering-quality measures.
Every reasonable clustering-quality measure satisfies our axioms.

Future Work
How can the completeness of a set of axioms be argued?
Are the axioms useful for gaining interesting new insights about clusterings?
Can we find properties that distinguish different clustering paradigms?

Appendix: Another Clustering-Quality Measure: Gamma (Baker & Hubert, 1975)
Gamma is the best-performing measure in Milligan's study of 30 internal criteria (Milligan, 1981).
Let $d(+)$ denote the number of times that two points clustered together in $C$ had a distance greater than two points that were not in the same cluster.
Let $d(-)$ denote the opposite result.
Gamma satisfies scale invariance, consistency, richness, and isomorphism invariance.
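The two counts can be computed by comparing every within-cluster pair against every between-cluster pair. The slide does not state how the counts are combined, so the sketch below assumes the usual rank-correlation form (concordant minus discordant, divided by their sum), oriented so that higher values mean better clusterings. The function names are hypothetical, and ties are simply ignored.

```python
from itertools import combinations

def gamma_counts(d, labels):
    """Return (d_plus, d_minus) as defined on the slide:
    d_plus  = comparisons where a within-cluster pair is FARTHER apart
              than a between-cluster pair,
    d_minus = comparisons where it is CLOSER (the opposite result)."""
    pairs = list(combinations(range(len(d)), 2))
    within = [d[i][j] for i, j in pairs if labels[i] == labels[j]]
    between = [d[i][j] for i, j in pairs if labels[i] != labels[j]]
    d_plus = sum(1 for w in within for b in between if w > b)
    d_minus = sum(1 for w in within for b in between if w < b)
    return d_plus, d_minus

def gamma(d, labels):
    """ASSUMED combination (not given on the slide):
    (concordant - discordant) / (concordant + discordant), where
    concordant = d_minus and discordant = d_plus."""
    d_plus, d_minus = gamma_counts(d, labels)
    return (d_minus - d_plus) / (d_minus + d_plus)

points = [0.0, 1.0, 5.0, 6.0]
d = [[abs(p - q) for q in points] for p in points]

good = gamma(d, [0, 0, 1, 1])   # nearby points share a cluster
bad = gamma(d, [0, 1, 0, 1])    # nearby points are split apart
```

Only the ordering of distances enters the counts, which is why multiplying all distances by a positive constant leaves the value unchanged.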

Variants of Quality Measures
Given a clustering-quality measure $m$, we can create new ones by applying it to a subset of the clusters:
$m_{\min}(C, d) = \min_{S \subseteq C} m(S, d)$, where $S$ is a subset of at least 2 clusters in $C$.
Similarly, we can define $m_{\max}$ and $m_{\text{average}}$.
If $m$ satisfies the axioms of clustering-quality measures, then so do $m_{\min}$, $m_{\max}$, and $m_{\text{average}}$.
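The construction can be sketched by enumerating every subset of at least two clusters and evaluating a base measure on each. Both function names are hypothetical, and the base measure `separation_ratio` (a Dunn-style ratio) is only a stand-in for m, chosen to keep the example short. Clusters are given as lists of point indices.

```python
from itertools import combinations

def separation_ratio(d, clusters):
    """Stand-in base measure m: smallest between-cluster distance over
    largest within-cluster distance, using only the given clusters."""
    between = [
        d[x][y]
        for a, b in combinations(clusters, 2)
        for x in a for y in b
    ]
    within = [d[x][y] for c in clusters for x, y in combinations(c, 2)]
    return min(between) / max(within)

def measure_variants(m, d, clusters):
    """Return (m_min, m_max, m_average) over all subsets S of the
    clustering that contain at least 2 clusters."""
    values = [
        m(d, list(subset))
        for size in range(2, len(clusters) + 1)
        for subset in combinations(clusters, size)
    ]
    return min(values), max(values), sum(values) / len(values)

points = [0.0, 1.0, 5.0, 6.0, 20.0, 22.0]
d = [[abs(p - q) for q in points] for p in points]
clusters = [[0, 1], [2, 3], [4, 5]]

m_min, m_max, m_avg = measure_variants(separation_ratio, d, clusters)
```

The number of subsets grows exponentially with the number of clusters, so full enumeration like this is only practical for small clusterings.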
