data clustering 50 years beyond k-meansdataclustering.cse.msu.edu/papers/ecmltalk.pdf · 2012. 11....
![Page 1: Data Clustering 50 Years Beyond K-meansdataclustering.cse.msu.edu/papers/ECMLTalk.pdf · 2012. 11. 1. · 50 Years Beyond K-means Anil K. Jain Department of Computer Science Michigan](https://reader035.vdocuments.net/reader035/viewer/2022071103/5fdd048650e2ed3d3135ab7e/html5/thumbnails/1.jpg)
Data Clustering: 50 Years Beyond K-means
Anil K. Jain
Department of Computer Science
Michigan State University
Angkor Wat
Hindu temple built by a Khmer king ~1150 AD; the Khmer kingdom declined in the 15th century; French explorers discovered the hidden ruins in the late 1800s
Apsaras of Angkor Wat
• Angkor Wat contains a unique gallery of
~2,000 women depicted in detailed full-body portraits
• What facial types are represented in these portraits?
Kent Davis, “Biometrics of the Goddess”, DatAsia, Aug 2008
S. Marchal, “Costumes et Parures Khmers: D’après les devata d’Angkor-Vat”, 1927
Clustering of Apsara Faces
127 facial landmarks
Shape alignment
Single Link clusters (faces 1 to 10)
An ethnologist needs to validate the groups
Clustering of Apsara Faces
4 clusters with K-means in 3D feature space
Dissimilarity matrix (lower triangle; rows and columns are faces 1 to 10; diagonal is 0)
2: 68.
3: 67. 72.
4: 86. 75. 60.
5: 78. 69. 76. 72.
6: 71. 62. 75. 74. 72.
7: 97. 88. 94. 87. 78. 69.
8: 83. 71. 77. 73. 53. 78. 87.
9: 59. 55. 66. 80. 70. 72. 167.
10: 61. 61. 78. 89. 73. 70. 96. 72. 51.
Data Explosion
• The digital universe was ~281 exabytes (281 billion
gigabytes) in 2007
• By 2011, the digital universe will be 10 times the
size it was in 2006
• Images and video, captured by over one billion
devices (camera phones), are the major source
• To archive and effectively use this data, we need
tools for data visualization & categorization
http://eon.businesswire.com/releases/information/digital/prweb509640.htm
http://www.emc.com/collateral/analyst-reports/diverse-exploding-digital-universe.pdf
Exploratory Data Analysis
• A collection of techniques to gain insight into data,
uncover underlying structure, generate
hypotheses, detect anomalies, and identify
important measurements (Tukey, 1977)
• Does not require assumptions common in
confirmatory data analysis (hypothesis testing or
discriminant analysis)
• Graphical techniques, visualization, outlier
detection, multidimensional scaling, clustering
Clustering
“A statistical classification technique for
discovering whether the individuals of a
population fall into different groups by
making quantitative comparisons of
multiple characteristics” - Webster’s
• Q-analysis, typology, grouping, clumping,
taxonomy, unsupervised learning
• Given a representation of n objects, find K
clusters based on the given measure of similarity
A.K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988
http://www.cse.msu.edu/~jain/Clustering_Jain_Dubes.pdf
Numerical Taxonomy
Sokal and Sneath, Principles of Numerical Taxonomy, 1963
Michener (1957) makes a distinction between hierarchies of categories for
• Convenience: as a method for organizing data
• Natural classification: based on phylogenetic relationship or degree of similarity among forms
Historical Developments
• Cluster analysis first appeared in the title of a 1954 article
analyzing anthropological data (JSTOR)
• Hierarchical Clustering: Sneath (1957), Sorensen (1957)
• K-Means: Steinhaus¹ (1956), Lloyd² (1957), Cox³ (1957),
Ball & Hall⁴ (1967), MacQueen⁵ (1967)
• Mixture models (Wolfe, 1970)
• Graph-theoretic methods (Zahn, 1971)
• K Nearest neighbors (Jarvis & Patrick, 1973)
• Fuzzy clustering (Bezdek, 1973)
• Self Organizing Map (Kohonen, 1982)
• Vector Quantization (Gersho and Gray, 1992)
¹ Acad. Polon. Sci., ² Bell Tel. Report, ³ JASA, ⁴ Behavioral Sci., ⁵ Berkeley Symp. Math Stat & Prob.
K-Means Algorithm
[Figure: four 2D scatter-plot panels; both axes run from 0 to 1]
Bisecting K-means (Karypis et al.); X-means (Pelleg and Moore); K-means with constraints (Davidson); scalable K-means (Bradley et al.)
• Initialization
• Value of K
• Distance metric
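The three user choices listed on this slide (initialization, value of K, distance metric) are the ones any basic implementation has to fix. The following is a minimal sketch of the standard Lloyd-style iteration, assuming random data points as initial centers and squared Euclidean distance; the function name `kmeans` and the sample data are illustrative and not taken from the talk.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd-style K-means: alternate assignment and mean-update steps."""
    rng = np.random.default_rng(seed)
    # Initialization (assumed): pick k distinct data points as starting centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: squared Euclidean distance to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each center moves to the mean of its points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two well-separated synthetic blobs inside the unit square.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=(0.2, 0.2), scale=0.03, size=(50, 2)),
    rng.normal(loc=(0.8, 0.8), scale=0.03, size=(50, 2)),
])
labels, centers = kmeans(X, k=2)
```

On less cleanly separated data, a different seed or a different K can change the result, which is the sensitivity the bullets point to.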
Beyond K-Means
• Density-based (Ester et al., 1996)
• Subspace (Agrawal et al., 1998)
• Spectral (Hagen & Kahng, 1991; Shi & Malik, 2000)
• Dirichlet Process (Ferguson, 1973; Rasmussen, 2000)
• Information bottleneck (Tishby et al., 1999)
• Non-negative matrix factorization (Lee & Seung, 1999)
• Ensemble (Strehl & Ghosh, 2002; Fred & Jain, 2002)
• Semi-supervised (Wagstaff et al., 2003; Basu et al., 2004)
• Overlapping (Segal et al., 2003; Banerjee et al., 2005)
• Maximum margin (Xu et al., 2005)
• Discriminative (Bach & Harchaoui, 2007; Ye et al., 2007)
• ~155 papers on clustering in ML conf. (2006-07); Google
Scholar: 1,560 papers with “data clustering” in 2007 alone!
• Methods differ on choice of objective function, generative
models and heuristics
User’s Dilemma!
• What features and normalization scheme to use?
• How to define pair-wise similarity?
• How many clusters?
• Which clustering method?
• How to choose algorithmic parameters?
• Does the data have any clustering tendency?
• Are the discovered clusters & partition valid?
• How to visualize, interpret & evaluate clusters?
Dubes and Jain, “Clustering Techniques: User’s Dilemma”, Pattern Recognition, 1976
What is a Cluster?
• A set of entities which are alike;
entities from different clusters
are not alike
• Compact clusters
– within-cluster distance < between-cluster distance
• Connected clusters
– within-cluster connectivity > between-cluster connectivity
• Ideal cluster: compact and isolated
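The compactness criterion can be checked numerically for a labeled dataset. The sketch below takes one possible reading of "within-cluster distance < between-cluster distance", namely a comparison of mean pairwise Euclidean distances; the function name `within_between` and the toy points are assumptions made for illustration.

```python
import numpy as np

def within_between(X, labels):
    """Return (mean within-cluster distance, mean between-cluster distance)."""
    # Full matrix of pairwise Euclidean distances.
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(X), dtype=bool)
    within = d[same & off_diag].mean()   # pairs sharing a label, excluding self-pairs
    between = d[~same].mean()            # pairs with different labels
    return within, between

# Two tight groups of three points each, far apart.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels = np.array([0, 0, 0, 1, 1, 1])
within, between = within_between(X, labels)
is_compact = within < between
```

Other readings, such as maximum within-cluster distance against minimum between-cluster distance, give a stricter test.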
Representation
Objects: pixels, images, time series, documents
Representation: features, similarity
n×d pattern matrix
n×n similarity matrix
[Figure panels: Handwritten digits; Image retrieval; Gene expressions; Segmentation; Sea-surface temperature time series (map with longitude from -180 to 180 and a vertical axis from 90 to -90)]
Shamir et al., BMC Bioinformatics, 2005
Good Representation
Points in given 2D space
Representation based on eigenvectors of
RBF kernel
A good representation leads to compact &
isolated clusters
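A rough illustration of an eigenvector-based representation follows. The slide does not say how the kernel is normalized or how many eigenvectors are kept, so this sketch assumes a symmetric degree normalization and unit-length rows, in the spirit of common spectral clustering recipes; the function name `rbf_spectral_embedding`, the value of `sigma`, and the two-ring data are all illustrative.

```python
import numpy as np

def rbf_spectral_embedding(X, sigma, n_components=2):
    """Map points to row-normalized top eigenvectors of a normalized RBF kernel."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-sq_dists / (2.0 * sigma ** 2))
    # Symmetric degree normalization (an assumption, not stated on the slide).
    d_inv_sqrt = 1.0 / np.sqrt(K.sum(axis=1))
    M = d_inv_sqrt[:, None] * K * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(M)  # eigenvalues in ascending order
    Z = eigvecs[:, -n_components:]        # eigenvectors of the largest eigenvalues
    # Scale every row to unit length.
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

# Two concentric rings: not compact clusters in the original 2D space.
theta = np.linspace(0.0, 2.0 * np.pi, 40, endpoint=False)
inner = np.column_stack([np.cos(theta), np.sin(theta)])
outer = 4.0 * inner
X = np.vstack([inner, outer])
Z = rbf_spectral_embedding(X, sigma=0.5)
```

With this kernel width, the points of each ring land very close to a single location in the new space, so the two rings become compact and isolated. A much larger `sigma` would blur the rings together.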
Purpose of Grouping
Two different meaningful groupings of 16 animals
based on 13 Boolean features (appearance & activity)
• Large weight on appearance features: Mammals vs. Birds
• Large weight on activity features: Predators vs. Non-Predators
http://www.ofai.at/~elias.pampalk/kdd03/animals/
Number of Clusters
[Figure panels: Original data; True labels, K = 6; Clustering with K = 2; Clustering with K = 5; Clustering with K = 6]
Clustering is in the eyes of the beholder
Cluster Validity
• Clustering algorithms find clusters, even if there are
no natural clusters in the data!
• Cluster stability (Lange et al., 2004)
[Figure panels: 100 2D uniform data points; K-Means with K=3]
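The experiment named on this slide is easy to reproduce in outline. The sketch below draws 100 points uniformly from the unit square and runs a plain K-means with K=3; the helper `kmeans_labels`, the seeds, and the iteration count are assumptions, and the exact partition depends on them.

```python
import numpy as np

def kmeans_labels(X, k, n_iter=50, seed=0):
    """Plain K-means returning only the cluster label of each point."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):          # skip empty clusters
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Uniform data has no natural cluster structure.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(100, 2))
labels = kmeans_labels(X, k=3)
sizes = np.bincount(labels, minlength=3)     # points per reported cluster
```

The algorithm still assigns every point to one of the three groups, which is why a separate validity or stability check is needed before the output is interpreted.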
Comparing Clustering Methods
Which clustering algorithm is the best?
[Figure panels: 15 data points; MST; FORGY; ISODATA; WISH; JP; CLUSTER; Complete Link]
Dubes and Jain, “Clustering Techniques: User’s Dilemma”, Pattern Recognition, 1976
Grouping of Clustering Algorithms
Clustering method vs. clustering algorithm
A. K. Jain, A. Topchy, M. Law, J. Buhmann, "Landscape of Clustering Algorithms", ICPR, 2004
Chameleon variants
K-means, Spectral, GMM & Ward’s linkage
Hierarchical clustering of 35 different algorithms (evaluated on 12 datasets)
Mathematical & Statistical Links
K-Means Spectral Clustering
Matrix Factorization
Eigen Analysis of data/similarity
matrix
Prob. Latent Semantic Indexing
Zha et al., 2001; Dhillon et al., 2004; Gaussier et al., 2005, Ding et al., 2006; Ding et al., 2008
Admissibility Criteria
• A technique is P-admissible if it satisfies a desirable
property P (Fisher & Van Ness, Biometrika, 1971)
• Properties that test sensitivity w.r.t. changes that do
not alter the essential structure of data: Point &
cluster proportion, cluster omission, monotone
• Impossibility theorem (Kleinberg, NIPS 2002); no
clustering function satisfies scale invariance, richness
and consistency properties
• Difficulty in unifying the informal concept of
clustering and inherent tradeoffs
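The three properties in the impossibility theorem can be stated compactly. The notation below is an assumption that follows the usual statement of the result rather than the slide: $f$ is a clustering function that maps a distance function $d$ on a point set $S$ to a partition $\Gamma$ of $S$, and $d'$ is a $\Gamma$-transformation of $d$ if $d'(i,j) \le d(i,j)$ for $i, j$ in the same cluster of $\Gamma$ and $d'(i,j) \ge d(i,j)$ otherwise.

```latex
\begin{align*}
\text{Scale invariance:}\quad & f(\alpha \cdot d) = f(d)
    \quad \text{for every } d \text{ and every } \alpha > 0,\\
\text{Richness:}\quad & \operatorname{Range}(f) = \{\text{all partitions of } S\},\\
\text{Consistency:}\quad & f(d) = \Gamma \ \text{and}\ d' \text{ is a }
    \Gamma\text{-transformation of } d \;\Longrightarrow\; f(d') = \Gamma.
\end{align*}
```

The theorem says that no single $f$ satisfies all three when $S$ has at least two points.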
No Best Clustering Algorithm
Each algorithm, implicitly or explicitly, imposes a structure
on the data; if the match is “good”, the algorithm is successful
[Figure panels: Mixture of 3 Gaussians; Two “half rings”; GMM with K = 3; GMM with K = 2; Spectral with K = 3; Spectral with K = 2]
Data Compression
• Pixels with similar attributes and spatial location
are clustered to find segments (Leeser et al., ‘98)
• Each segment indexed to its mean attribute value
http://www.ece.neu.edu/groups/rpl/projects/kmeans/
[Figure panels: Input image; Segmentation; Reconstruction]
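The segment-then-index idea can be sketched on a tiny grayscale array. This is a simplification, not the cited system: it clusters on intensity alone and ignores spatial location, and the function name `quantize` and the toy image are illustrative.

```python
import numpy as np

def quantize(image, k, n_iter=20):
    """Cluster pixel intensities with 1-D K-means; rebuild from cluster means."""
    pixels = image.reshape(-1).astype(float)
    # Deterministic initialization (assumed): k evenly spaced intensity levels.
    centers = np.linspace(pixels.min(), pixels.max(), k)
    for _ in range(n_iter):
        labels = np.abs(pixels[:, None] - centers[None, :]).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = pixels[labels == j].mean()
    # Final assignment against the converged centers.
    labels = np.abs(pixels[:, None] - centers[None, :]).argmin(axis=1)
    segmentation = labels.reshape(image.shape)
    # Each pixel is replaced by the mean intensity of its segment.
    reconstruction = centers[segmentation]
    return segmentation, centers, reconstruction

image = np.array([[10, 12, 200, 202],
                  [11, 13, 201, 203],
                  [10, 12, 200, 202]])
segmentation, centers, reconstruction = quantize(image, k=2)
```

Storing `segmentation` plus the short `centers` table is what yields the compression; the reconstruction is lossy because every pixel in a segment gets the same value.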
Object Recognition
Local descriptors are hierarchically quantized in a
vocabulary tree (Nister et al., CVPR, 2006)
[Figure: hierarchical codebook using K-Means, with branches labeled b1 to b8]
Finding Lexemes
• Find subclasses in handwritten “online” characters (122,000 characters written by 100 writers)
• Performance improves by modeling subclasses
Connell and Jain, “Writer Adaptation for Online Handwriting Recognition”, IEEE PAMI, Mar 2002
Information Retrieval
Xu & Croft (ACM TOIS, 1998) used corpus analysis
based on word co-occurrence to refine the large
equivalence classes generated by a stemmer
Human race
Horse race
Map of Science
Clustering of network (relational) data
800,000 scientific papers clustered into 776 scientific paradigms based on how often the papers were cited together by authors of other papers
Nature (2006)
Some Trends
• Large-scale data
– Clustering of 1.5B images into 50M clusters; 10 hours on
2000 CPUs (Liu et al., WACV 2007)
• Evidence Accumulation
– Multi-way clustering (documents/words/authors)
– Multi-modal data (clustering genes based on expression
levels and text literature, Yang et al., CSB 2007)
• Domain Knowledge
– How to acquire & incorporate domain knowledge? Pairwise
constraints, feature constraints (e.g., WordNet)
• Complex Data Types
– Dynamically evolving data (cluster maintenance)
– Networks/graphs (How to define kernel/similarity matrix?)
Clustering Ensemble
• Combine many “weak” partitions of a dataset to generate a
better partition (Strehl & Ghosh, 2002; Fred & Jain, 2002)
• Pairwise co-occurrences from different K-Means partitions
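The pairwise co-occurrence idea can be sketched as a co-association matrix: entry (i, j) is the fraction of K-means runs in which points i and j share a cluster. The helper names, the choice of K=4 on two-blob data, and the number of runs are assumptions made for illustration; the cited papers describe the full methods.

```python
import numpy as np

def kmeans_labels(X, k, seed, n_iter=30):
    """One K-means run from a seed-dependent random initialization."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def co_association(X, k, n_runs):
    """Fraction of runs in which each pair of points lands in the same cluster."""
    n = len(X)
    C = np.zeros((n, n))
    for seed in range(n_runs):
        labels = kmeans_labels(X, k, seed)
        C += labels[:, None] == labels[None, :]
    return C / n_runs

# Two well-separated blobs, deliberately over-segmented with k=4 ("weak" partitions).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(20, 2)),
    rng.normal(loc=(10.0, 10.0), scale=0.3, size=(20, 2)),
])
C = co_association(X, k=4, n_runs=10)
```

The matrix `C` can then be treated as a similarity matrix and passed to, for example, a hierarchical method to extract the combined partition.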
Multiobjective Clustering
• Different clusters in data may have different shapes and densities; difficult for a single criterion
• Find stable clusters from different algorithms
• Four stable clusters identified in image segmentation data using GMM, Single-link, K-means and spectral
Law, Topchy and Jain, CVPR 2004
Semi-supervised Clustering
Clustering with side information: modify the objective
function of a given algorithm or design a new algorithm to
utilize pairwise constraints
I: initialization, C: constraints, D: distance learning
Basu et al., KDD’04
BoostCluster
• Can we improve any generic clustering algorithm in
the presence of “constraints”?
• BoostCluster: an unsupervised boosting algorithm
to iteratively update the similarity matrix given the
constraints & clustering output
Liu, Jin & Jain, BoostCluster: Boosting Clustering by Pairwise Constraints, KDD, Aug 2007
[Figure panels: Original data; Similarity matrix; New representation; Similarity matrix]
Performance of BoostCluster
Handwritten digit (UCI); 4,000 points in 256 dimensions; 10 clusters
Summary
• It is natural to seek clustering methods to group a
heterogeneous set of objects based on similarity
• The objective should not be to choose the best clustering
technique; it would be fruitless & contrary to the
exploratory nature of clustering
• Enough clustering algorithms known to uncover
specific data structures are available; representation
is critical
• Future research: rational basis for comparing
clustering methods, quick-look procedures for very
large databases, taking multiple looks at the same
data and incorporating domain knowledge