margareta ackerman joint work with shai ben-david measures of clustering quality: a working set of...

Upload: dylan-west

Post on 17-Dec-2015

TRANSCRIPT

  • Slide 1
  • Measures of Clustering Quality: A Working Set of Axioms for Clustering. Margareta Ackerman. Joint work with Shai Ben-David.
  • Slide 2
  • Clustering is one of the most widely used tools for exploratory data analysis. The social sciences, biology, astronomy, computer science, and other fields all apply clustering to gain a first understanding of the structure of large data sets. The Theory-Practice Gap: yet there is distressingly little theoretical understanding of clustering.
  • Slide 3
  • Questions that research on the fundamentals of clustering should address: Can clustering be given a formal and general definition? What is a good clustering? Can we distinguish clusterable from structureless data?
  • Slide 4
  • Inherent Obstacles. Clustering is not well defined: there is a wide variety of clustering tasks, with different (often implicit) measures of quality. In most practical clustering tasks there is no clear ground truth to evaluate a solution by (in contrast with classification tasks, where a held-out labeled set can be used to evaluate the classifier). A clustering may have different value to different users, e.g. clustering paintings by painter vs. by topic.
  • Slide 5
  • Common Solutions. Objective utility functions: sum of in-cluster distances, average distances to center points, cut weight, spectral clustering, etc. (Shmoys, Charikar, Meyerson, Luxburg, ...); analyze the computational complexity of discrete optimization problems. Consider a restricted set of distributions (generative models), e.g. mixtures of Gaussians [Dasgupta 99], [Vempala 03], [Kannan et al. 04], [Achlioptas, McSherry 05]; recover the parameters of the model generating the data. Add structure (relevant information), e.g. the information bottleneck approach [Tishby, Pereira, Bialek 99]; factor out user-irrelevant information. Many more.
  • Slide 6
  • Quest for a General Theory. What can we say independently of any specific algorithm, specific objective function, or specific generative data model? Clustering Axioms: postulate axioms that, ideally, every clustering approach should satisfy, e.g. [Hartigan 1975], [Puzicha, Hofmann, Buhmann 00], [Kleinberg 02]. These attempts usually conclude with negative results.
  • Slide 7
  • Our Formal Setup. For a finite domain set $S$, a distance function $d$ gives the distance between the domain points. A clustering function maps an input, a distance function $d$ over $S$, to an output, a partition (clustering) of $S$.
  • Slide 8
  • Kleinberg's Work on Clustering Functions. Kleinberg proposes natural-looking axioms that distinguish clustering functions from other functions that output domain partitions.
  • Slide 9
  • Kleinberg's Axioms. Scale Invariance: $F(\lambda d) = F(d)$ for all $d$ and all strictly positive $\lambda$. Consistency: if $d'$ equals $d$, except for shrinking distances within clusters of $F(d)$ or stretching between-cluster distances, then $F(d) = F(d')$. Richness: for any partition $P$ of $S$, there exists a distance function $d$ over $S$ so that $F(d) = P$.
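The first two axioms can be spot-checked numerically on a toy clustering function. The sketch below is only an illustration: `single_linkage_k`, `scale`, and `consistent_change` are hypothetical helper names, and single linkage stopped at `k` clusters is an assumed stand-in for a clustering function $F$, not something taken from the slides. Because it always returns exactly `k` clusters, it cannot satisfy Richness.

```python
from itertools import combinations

def single_linkage_k(d, k):
    """Assumed example of F: merge the two closest clusters until k remain."""
    clusters = [frozenset([i]) for i in range(len(d))]
    while len(clusters) > k:
        # Single linkage: cluster distance is the smallest point-to-point distance.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: min(d[i][j] for i in pair[0] for j in pair[1]),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return set(clusters)

def scale(d, lam):
    """Return the distance matrix lam * d."""
    return [[lam * v for v in row] for row in d]

def consistent_change(partition, d, shrink=0.5, stretch=2.0):
    """Shrink within-cluster distances and stretch between-cluster distances."""
    label = {i: c for c in partition for i in c}
    n = len(d)
    return [
        [d[i][j] * (shrink if label[i] == label[j] else stretch) for j in range(n)]
        for i in range(n)
    ]

# Four points on a line; d is the matrix of pairwise distances.
points = [0.0, 1.0, 5.0, 6.0]
d = [[abs(p - q) for q in points] for p in points]

F_d = single_linkage_k(d, 2)
F_scaled = single_linkage_k(scale(d, 3.0), 2)
F_changed = single_linkage_k(consistent_change(F_d, d), 2)
print(F_d == F_scaled, F_d == F_changed)
```

A check on one data set is of course not a proof; it only shows what the two axioms ask of $F$.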
  • Slide 10
  • Theorem [Kleinberg, 2002]: These axioms are inconsistent. Namely, no function can satisfy these three axioms. How come axioms that seem to capture our intuition about clustering are inconsistent? Our answer: the formalization of these axioms is stronger than the intuition they intend to capture. We express that same intuition in an alternative framework, and achieve consistency.
  • Slide 11
  • Clustering-Quality Measures. Clustering-quality measures quantify the quality of clusterings: how good is this clustering?
  • Slide 12
  • Defining Clustering-Quality Measures. A clustering-quality measure is a function $m(\text{dataset}, \text{clustering}) \in \mathbb{R}$ satisfying some properties that make this function a meaningful clustering-quality measure. What properties should it satisfy?
  • Slide 13
  • Rephrasing Kleinberg's axioms as clustering-quality measure axioms. Scale Invariance: $m(C, d) = m(C, \lambda d)$ for all $d$, all strictly positive $\lambda$, and all clusterings $C$ over $d$. Richness: for any clustering $C$ of $S$, there exists a distance function $d$ over $S$ so that $C = \arg\max_{C'} m(C', d)$.
  • Slide 14
  • Rephrasing Kleinberg's axioms as clustering-quality measure axioms (continued). Consistency: if $d'$ equals $d$, except for shrinking distances within clusters of $C$ or stretching between-cluster distances, then $m(C, d') \geq m(C, d)$.
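To make the rephrased scale-invariance and consistency axioms concrete, here is a minimal sketch that tests them on one small data set. The measure used, `dunn_index` (smallest between-cluster distance divided by largest within-cluster distance), is an assumed simple form of Dunn's index chosen for illustration; the helper names `scale` and `consistent_change` are hypothetical.

```python
from itertools import combinations

def dunn_index(labels, d):
    """Assumed measure m: min between-cluster distance / max within-cluster distance."""
    pairs = list(combinations(range(len(labels)), 2))
    within = [d[i][j] for i, j in pairs if labels[i] == labels[j]]
    between = [d[i][j] for i, j in pairs if labels[i] != labels[j]]
    return min(between) / max(within)

def scale(d, lam):
    """Return the distance matrix lam * d."""
    return [[lam * v for v in row] for row in d]

def consistent_change(labels, d, shrink=0.5, stretch=2.0):
    """Shrink distances inside clusters of C and stretch distances between them."""
    n = len(labels)
    return [
        [d[i][j] * (shrink if labels[i] == labels[j] else stretch) for j in range(n)]
        for i in range(n)
    ]

# Four points on a line, clustered as {0, 1} and {2, 3}.
points = [0.0, 1.0, 5.0, 6.0]
d = [[abs(p - q) for q in points] for p in points]
labels = [0, 0, 1, 1]

base = dunn_index(labels, d)
scaled = dunn_index(labels, scale(d, 3.0))
changed = dunn_index(labels, consistent_change(labels, d))
print(base, scaled, changed)
```

Scaling leaves the ratio unchanged, and the consistent change can only raise the numerator and lower the denominator, so the value does not decrease.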
  • Slide 15
  • An Additional Axiom. Clusterings $C$ and $C'$ over $(X, d)$ are isomorphic if there exists a distance-preserving automorphism $f: X \to X$ such that $x, y$ share the same $C$-cluster iff $f(x)$ and $f(y)$ share the same $C'$-cluster. Isomorphism Invariance: if $C$ and $C'$ are isomorphic, then $m(C, d) = m(C', d)$.
  • Slide 16
  • Major Gain: Consistency of New Axioms. Theorem: Consistency, scale invariance, richness, and isomorphism invariance for clustering-quality measures form a consistent set of requirements. We prove this result by demonstrating measures that satisfy these axioms. Moreover, every reasonable CQM satisfies our axioms.
  • Slide 17
  • An example of a CQM for center-based clustering: Relative Margin. The relative margin of a point $x$ in $C$ is (distance from $x$ to its closest center) / (distance from $x$ to its second-closest center). The relative margin of $C$ is the average relative margin over all non-center points (over all possible center settings). Relative Margin satisfies scale invariance, consistency, richness, and isomorphism invariance.
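The per-point ratio and its average can be sketched as follows. This is a simplified illustration that assumes one fixed set of centers is given; how the slide's "all possible center settings" are combined is not specified here and is left out. The function name `relative_margin` and the example data are hypothetical.

```python
def relative_margin(d, centers):
    """Average over non-center points of
    (distance to closest center) / (distance to second-closest center).

    Assumes a fixed list of at least two center indices.
    """
    non_centers = [x for x in range(len(d)) if x not in centers]
    ratios = []
    for x in non_centers:
        nearest, second = sorted(d[x][c] for c in centers)[:2]
        ratios.append(nearest / second)
    return sum(ratios) / len(ratios)

# Four points on a line; points 0 and 3 act as the two centers.
points = [0.0, 1.0, 5.0, 7.0]
d = [[abs(p - q) for q in points] for p in points]

rm = relative_margin(d, centers=[0, 3])
rm_scaled = relative_margin([[10.0 * v for v in row] for row in d], centers=[0, 3])
print(rm, rm_scaled)
```

Each ratio lies between 0 and 1, and a small value means the point sits much closer to its own center than to the next one. Since every term is a ratio of distances, multiplying all distances by a constant leaves the value unchanged.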
  • Slide 18
  • Additional CQMs Satisfying Our Axioms: C-index (Dalrymple-Alford, 1970); Gamma (Baker & Hubert, 1975); Adjusted ratio of clustering (Roenker et al., 1971); D-index (Dalrymple-Alford, 1970); Modified ratio of repetition (Bower, Lesgold, and Tieman, 1969); Dunn's index (Dunn, 1973); Variations of Dunn's index (Bezdek and Pal, 1998); Strict separation (based on Balcan, Blum, and Vempala, 2008); and many more.
  • Slide 19
  • Why is the CQM formalism more faithful to intuition? In the setting of clustering functions, the consistency axiom requires that consistent changes to the underlying distance should not create any new contenders for the best clustering of the data. (Figure: the data under $d$ and $d'$, with clusterings $C$ and $C'$.) A clustering function that satisfies Kleinberg's Consistency cannot output $C'$.
  • Slide 20
  • Why is the CQM formalism more faithful to intuition? In the setting of clustering-quality measures, the consistency axiom requires only that the quality of a given clustering $C$ does not get worse. While the quality of $C$ improves, a different clustering, $C'$, can still have better quality.
  • Slide 21
  • Summary. The intuition behind Kleinberg's axioms is consistent (in spite of his impossibility result). The impossibility result can be overcome by a change of formalism; we do this by focusing on clustering-quality measures. Every reasonable clustering-quality measure satisfies our axioms.
  • Slide 22
  • Future Work. How can the completeness of a set of axioms be argued? Are the axioms useful for gaining interesting new insights about clusterings? Can we find properties that distinguish different clustering paradigms?
  • Slide 23
  • Appendix: Another Clustering-Quality Measure: Gamma (Baker & Hubert, 1975). Gamma is the best-performing measure in Milligan's study of 30 internal criteria (Milligan, 1981). Let d(+) denote the number of times that points which were clustered together in $C$ had distance greater than two points which were not in the same cluster, and let d(-) denote the opposite result. Gamma satisfies scale invariance, consistency, richness, and isomorphism invariance.
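The two counts can be computed by comparing every within-cluster pair against every between-cluster pair. The sketch below follows the wording of the slide for d(+) and assumes that "the opposite result" means a within-cluster distance strictly smaller than a between-cluster distance; ignoring ties is also an assumption. The function name `gamma_counts` is hypothetical, and the sketch stops at the counts rather than asserting a particular formula for combining them.

```python
from itertools import combinations

def gamma_counts(labels, d):
    """Return (d_plus, d_minus) as described on the slide.

    d_plus:  within-cluster distance greater than a between-cluster distance.
    d_minus: within-cluster distance smaller than a between-cluster distance
             (assumed reading of "the opposite result"); ties are ignored.
    """
    pairs = list(combinations(range(len(labels)), 2))
    within = [d[i][j] for i, j in pairs if labels[i] == labels[j]]
    between = [d[i][j] for i, j in pairs if labels[i] != labels[j]]
    d_plus = sum(1 for w in within for b in between if w > b)
    d_minus = sum(1 for w in within for b in between if w < b)
    return d_plus, d_minus

# Four points on a line, with a natural and an unnatural 2-clustering.
points = [0.0, 1.0, 5.0, 6.0]
d = [[abs(p - q) for q in points] for p in points]

print(gamma_counts([0, 0, 1, 1], d))  # clusters {0, 1} and {2, 3}
print(gamma_counts([0, 1, 0, 1], d))  # clusters {0, 2} and {1, 3}
```

Both counts depend only on the ordering of distances, which is why multiplying all distances by a positive constant cannot change them.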
  • Slide 24
  • Variants of Quality Measures. Given a clustering-quality measure $m$, we can create new ones by applying it to a subset of the clusters: $m_{\min}(C, d) = \min_{S \subseteq C} m(S, d)$, where $S$ is a subset of at least 2 clusters in $C$. Similarly, we can define $m_{\max}$ and $m_{\text{average}}$. If $m$ satisfies the axioms of clustering-quality measures, then so do $m_{\min}$, $m_{\max}$, and $m_{\text{average}}$.
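A minimal sketch of this construction, assuming a clustering is represented as a list of clusters (each a list of point indices) and that the average is taken uniformly over all sub-collections of at least two clusters. The base measure `separation_ratio` is a hypothetical stand-in for $m$, and `variants` is a hypothetical helper name.

```python
from itertools import combinations

def separation_ratio(clusters, d):
    """Assumed base measure m: min between-cluster / max within-cluster distance,
    evaluated only on the points of the given clusters."""
    within = [d[i][j] for c in clusters for i, j in combinations(c, 2)]
    between = [d[i][j] for a, b in combinations(clusters, 2) for i in a for j in b]
    return min(between) / max(within)

def sub_clusterings(clusters):
    """Yield every sub-collection S of at least two clusters of C."""
    for r in range(2, len(clusters) + 1):
        yield from combinations(clusters, r)

def variants(m, clusters, d):
    """Return (m_min, m_max, m_average) over all sub-collections S."""
    scores = [m(list(sub), d) for sub in sub_clusterings(clusters)]
    return min(scores), max(scores), sum(scores) / len(scores)

# Six points on a line grouped into three clusters of two points each.
points = [0.0, 1.0, 5.0, 6.0, 20.0, 21.0]
d = [[abs(p - q) for q in points] for p in points]
clusters = [[0, 1], [2, 3], [4, 5]]

m_min, m_max, m_avg = variants(separation_ratio, clusters, d)
print(m_min, m_max, m_avg)
```

The number of sub-collections grows exponentially with the number of clusters, so this direct enumeration is only practical for small clusterings.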