The K-Means Method in Cluster Analysis and Its Intelligentization
B.G. Mirkin
Professor, Department of Data Analysis and Artificial Intelligence, NRU HSE (Higher School of Economics), Moscow, Russia
Professor Emeritus, School of Computer Science & Information Systems, Birkbeck College, University of London, UK
Outline:
- Clustering as empirical classification
- K-Means and its issues: (1) determining K and initialization; (2) weighting variables
- Addressing (1): data recovery clustering and K-Means (Mirkin 1987, 1990); one-by-one clustering: Anomalous Patterns and iK-Means; other approaches; computational experiment
- Addressing (2): three-stage K-Means; Minkowski K-Means; computational experiment
- Conclusion
WHAT IS CLUSTERING; WHAT IS DATA
K-MEANS CLUSTERING: Conventional K-Means; Initialization of K-Means; Intelligent K-Means; Mixed Data; Interpretation Aids
WARD HIERARCHICAL CLUSTERING: Agglomeration; Divisive Clustering with Ward Criterion; Extensions of Ward Clustering
DATA RECOVERY MODELS: Statistics Modelling as Data Recovery; Data Recovery Model for K-Means; for Ward; Extensions to Other Data Types; One-by-One Clustering
DIFFERENT CLUSTERING APPROACHES: Extensions of K-Means; Graph-Theoretic Approaches; Conceptual Description of Clusters
GENERAL ISSUES: Feature Selection and Extraction; Similarity on Subsets and Partitions; Validity and Reliability
Recent related work:
- B.G. Mirkin, M. Chiang (2010) Intelligent choice of the number of clusters in K-Means clustering: an experimental study with different cluster spreads, Journal of Classification, 27(1), 3-41
- B.G. Mirkin (2011) Choosing the number of clusters, WIREs Data Mining and Knowledge Discovery, 1(3), 252-260
- B.G. Mirkin, R. Amorim (2012) Minkowski metric, feature weighting and anomalous cluster initializing in K-Means clustering, Pattern Recognition, 45, 1061-1075
What is clustering?
Finding homogeneous fragments, mostly sets of entities, in datasets for further analysis
Example: W. Jevons (1857) planet clusters (updated by Mirkin 1996)
Pluto does not fit into either of the two clusters of planets: it originated another cluster (September 2006)
Example: a few clusters. A clustering interface to WEB search engines (Grouper). Query: Israel (after O. Zamir and O. Etzioni 2001)

Cluster   # sites   Interpretation
1         24        Society, religion: Israel and Judaism; Judaica collection
2         12        Middle East, war, history: the state of Israel; Arabs and Palestinians
3         31        Economy, travel: Israel Hotel Association; electronics in Israel
Clustering algorithms:
- Nearest neighbour
- Agglomerative clustering
- Divisive clustering
- Conceptual clustering
- K-Means
- Kohonen SOM
- Spectral clustering
- etc.
Batch K-Means: a generic clustering method. Entities are presented as multidimensional points (*).
0. Put K hypothetical centroids (seeds).
1. Assign points to the centroids according to the minimum distance rule.
2. Put centroids at the gravity centres of the clusters thus obtained.
3. Iterate steps 1 and 2 until convergence.
A code sketch follows the figure below.
[Figure: data points (*) and K = 3 hypothetical centroids (@)]
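The steps above translate directly into code. The following is a minimal sketch in Python with NumPy, not the author's implementation; the names Y, K and k_means are illustrative, and random entities serve as seeds.

```python
import numpy as np

def k_means(Y, K, init=None, rng=None, max_iter=100):
    """Batch K-Means on an (N, M) data matrix Y."""
    rng = np.random.default_rng(0) if rng is None else rng
    # Step 0: put K hypothetical centroids (seeds).
    if init is None:
        centroids = Y[rng.choice(len(Y), size=K, replace=False)].astype(float)
    else:
        centroids = np.asarray(init, dtype=float).copy()
    for _ in range(max_iter):
        # Step 1: assign points to centroids by the minimum distance rule.
        d2 = ((Y[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 2: put centroids at the gravity centres of the clusters.
        new_centroids = centroids.copy()
        for k in range(K):
            members = Y[labels == k]
            if len(members) > 0:  # guard against empty clusters
                new_centroids[k] = members.mean(axis=0)
        # Step 3: iterate steps 1 and 2 until convergence.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, new_centroids
```

The `init` argument lets later slides initialise the method with precomputed centres instead of random seeds.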
K-Means: a generic clustering method. Entities are presented as multidimensional points (*).
0. Put K hypothetical centroids (seeds).
1. Assign points to the centroids according to the minimum distance rule.
2. Put centroids at the gravity centres of the clusters thus obtained.
3. Iterate steps 1 and 2 until convergence.
4. Output the final centroids and clusters.
[Figure: final centroids (@) and the resulting clusters]
K-Means criterion: summary distance to cluster centroids. Minimize
$$ W(S, c) = \sum_{k=1}^{K} \sum_{i \in S_k} d(y_i, c_k) = \sum_{k=1}^{K} \sum_{i \in S_k} \sum_{v=1}^{M} (y_{iv} - c_{kv})^2 $$
[Figure: points (*) with their cluster centroids (@)]
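As a small companion to the formula, the criterion can be computed directly from the output of the k_means sketch above (criterion_W is my name for it):

```python
import numpy as np

def criterion_W(Y, labels, centroids):
    # W(S, c): summary squared Euclidean distance of entities to their centroids.
    return float(((Y - centroids[labels]) ** 2).sum())
```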
Advantages of K-Means:
- models typology building;
- simple "data recovery" criterion;
- computationally effective;
- can be utilised incrementally, "on-line".
Shortcomings of K-Means:
- initialisation: no advice on K or on the initial centroids;
- no guarantee of reaching a deep minimum;
- no defence against irrelevant features.
Initial Centroids: Correct
Two cluster case
[Figure: correct initial centroids; left panel: initial, right panel: final]
Different Initial Centroids
Different Initial Centroids: Wrong
[Figure: wrong initial centroids; left panel: initial, right panel: final]
(1) To address:
- the number of clusters (issue: the criterion always decreases with K, W_K < W_{K-1}, so minimizing it cannot choose K by itself);
- the initial setting;
- a deeper minimum.
The two are interrelated: a good initial setting leads to a deeper minimum.
Number K: the conventional approach. Take a range R_K of values of K, say K = 3, 4, ..., 15. For each K ∈ R_K, run K-Means 100-200 times from randomly chosen initial centroids and keep the best of the runs, W(S, c) = W_K.
Compare W_K over all K ∈ R_K in a special way and choose the best K, for example with:
- the Gap statistic (2001);
- the Jump statistic (2003);
- Hartigan (1975): in the ascending order of K, pick the first K at which H_K = (W_K / W_{K+1} - 1)(N - K - 1) ≤ 10.
A code sketch follows this list.
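A sketch of this procedure with Hartigan's rule, reusing the k_means and criterion_W sketches above; the form of H_K and the threshold 10 follow the reconstruction given above and should be checked against Hartigan (1975):

```python
import numpy as np

def choose_K_hartigan(Y, K_range=range(2, 16), n_starts=100):
    N = len(Y)
    W = {}
    for K in K_range:
        best = np.inf
        for s in range(n_starts):
            # Restart from randomly chosen initial centroids.
            labels, centroids = k_means(Y, K, rng=np.random.default_rng(s))
            best = min(best, criterion_W(Y, labels, centroids))
        W[K] = best
    for K in list(K_range)[:-1]:
        H_K = (W[K] / W[K + 1] - 1.0) * (N - K - 1)
        if H_K <= 10:  # pick the first K passing Hartigan's threshold
            return K
    return list(K_range)[-1]
```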
(1) Addressing the number of clusters and the initial setting with a PCA-like method in the data recovery approach
Representing a partition. Cluster k is represented by:
- its centroid c_k = (c_kv), where v indexes features;
- a binary 1/0 membership vector z_k = (z_ik), where i indexes entities.
Basic equations (same as for PCA, but the score vectors z_k are constrained to be binary):
$$ y_{iv} = \sum_{k=1}^{K} c_{kv} z_{ik} + e_{iv} $$
Here y is a data entry, z a 1/0 membership value (not a score), c a cluster centroid; i indexes entities, v features/categories, k clusters.
Quadratic data scatter decomposition (Pythagorean):
$$ \sum_{i=1}^{N} \sum_{v=1}^{V} y_{iv}^2 = \sum_{k=1}^{K} \sum_{v=1}^{V} c_{kv}^2 N_k + \sum_{k=1}^{K} \sum_{i \in S_k} \sum_{v=1}^{V} (y_{iv} - c_{kv})^2 $$
under the model
$$ y_{iv} = \sum_{k=1}^{K} c_{kv} z_{ik} + e_{iv} $$
K-Means is the alternating least-squares minimisation of the residual term. Here y is a data entry, z a 1/0 membership value, c a cluster centroid, N_k the cardinality of cluster S_k; i indexes entities, v features/categories, k clusters.
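The decomposition is easy to verify numerically with the earlier sketches; the data here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.normal(size=(200, 5))
labels, centroids = k_means(Y, K=3)

scatter = (Y ** 2).sum()                           # left-hand side
N_k = np.bincount(labels, minlength=3)
explained = (N_k[:, None] * centroids ** 2).sum()  # sum_k sum_v c_kv^2 N_k
residual = criterion_W(Y, labels, centroids)       # within-cluster term
assert np.isclose(scatter, explained + residual)
```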
Equivalent criteria (1)
A. Bilinear residuals squared, MIN (minimizing the difference between the data and the cluster structure):
$$ \sum_{i=1}^{N} \sum_{v \in V} e_{iv}^2 \to \min $$
B. Distance-to-centre squared, MIN (minimizing the difference between the data and the cluster structure):
$$ W = \sum_{k=1}^{K} \sum_{i \in S_k} d(i, c_k) \to \min $$
Equivalent criteria (2)
C. Within-group error squared, MIN (minimizing the difference between the data and the cluster structure):
$$ \sum_{k=1}^{K} \sum_{v \in V} \sum_{i \in S_k} (c_{kv} - y_{iv})^2 \to \min $$
D. Within-group variance weighted by cluster size, MIN (minimizing within-cluster variance):
$$ \sum_{k=1}^{K} |S_k|\, \sigma_k^2 \to \min $$
Equivalent criteria (3)
E. Semi-averaged within-cluster distance squared, MIN (minimizing dissimilarities within clusters; d is the squared Euclidean distance):
$$ \sum_{k=1}^{K} \sum_{i,j \in S_k} d(i, j)\,/\,|S_k| \to \min $$
F. Semi-averaged within-cluster similarity, MAX (maximizing similarities within clusters):
$$ \sum_{k=1}^{K} \sum_{i,j \in S_k} a(i, j)\,/\,|S_k| \to \max, \quad \text{where } a(i, j) = \langle y_i, y_j \rangle $$
Equivalent criteria (4)
G. Distant centroids, MAX (finding anomalous types):
$$ \sum_{k=1}^{K} \sum_{v \in V} c_{kv}^2\, |S_k| \to \max $$
H. Consensus partition, MAX (maximizing correlation between the sought partition and the given variables):
$$ \sum_{v=1}^{V} \xi(S, v) \to \max $$
where ξ(S, v) is an association (correlation) measure between partition S and variable v.
Equivalent criteria (5)
I. Spectral clusters, MAX (maximizing the summary Rayleigh quotient over binary vectors):
$$ \sum_{k=1}^{K} \frac{z_k^T Y Y^T z_k}{z_k^T z_k} \to \max $$
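The equivalences can be checked numerically; for instance, criterion B coincides with criterion D when σ_k² is the average squared distance of cluster members to their centroid. A small made-up check, reusing the sketches above:

```python
import numpy as np

rng = np.random.default_rng(2)
Y = rng.normal(size=(150, 4))
K = 3
labels, centroids = k_means(Y, K)

B = criterion_W(Y, labels, centroids)  # criterion B
D = 0.0
for k in range(K):
    members = Y[labels == k]
    if len(members) == 0:
        continue
    sigma2_k = ((members - members.mean(axis=0)) ** 2).sum() / len(members)
    D += len(members) * sigma2_k       # criterion D: |S_k| * sigma_k^2
assert np.isclose(B, D)
```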
PCA-inspired Anomalous Pattern clustering. The one-cluster model is
$$ y_{iv} = c_v z_i + e_{iv}, \quad \text{where } z_i = 1 \text{ if } i \in S \text{ and } z_i = 0 \text{ if } i \notin S. $$
With the squared Euclidean distance, the data scatter decomposes as
$$ \sum_{i=1}^{N} \sum_{v=1}^{V} y_{iv}^2 = N_S \sum_{v=1}^{V} c_{Sv}^2 + \sum_{i \in S} \sum_{v=1}^{V} (y_{iv} - c_{Sv})^2 + \sum_{i \notin S} \sum_{v=1}^{V} y_{iv}^2, $$
that is,
$$ \sum_{i=1}^{N} d(i, 0) = N_S\, d(c_S, 0) + \sum_{i \in S} d(i, c_S) + \sum_{i \notin S} d(i, 0). $$
To minimise the residual, the explained part N_S d(c_S, 0) must be maximised: c_S must be anomalous, that is, interesting.
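A sketch of extracting one Anomalous Pattern under this model, assuming the data have been centred so that the reference point is the origin; the function and argument names are mine:

```python
import numpy as np

def anomalous_pattern(Y, active):
    """Extract one Anomalous Pattern cluster from the rows of Y listed in active."""
    Z = Y[active]
    # Start the centroid at the entity farthest from the reference point 0.
    c = Z[np.argmax((Z ** 2).sum(axis=1))]
    while True:
        # Assign to S the entities closer to c than to 0 (ties go to S).
        in_S = ((Z - c) ** 2).sum(axis=1) <= (Z ** 2).sum(axis=1)
        new_c = Z[in_S].mean(axis=0)
        if np.allclose(new_c, c):
            break
        c = new_c
    return active[in_S], c
```

This is just K-Means with two centroids, one of which is kept fixed at 0; a call looks like anomalous_pattern(Y - Y.mean(axis=0), np.arange(len(Y))).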
Initial setting with Anomalous Pattern Cluster
[Figure: the "Tom Sawyer" example with one Anomalous Pattern cluster]
Anomalous Pattern Clusters: Iterate
[Figure: the "Tom Sawyer" example with reference point 0 and Anomalous Pattern clusters 1, 2, 3 extracted one by one]
iK-Means: Anomalous clusters + K-Means. After extracting 2 clusters (how can one know that 2 is right?).
[Figure: final clustering]
iK-Means: defining K and the initial setting with iterative Anomalous Pattern clustering (a code sketch follows this list):
1. Find all Anomalous Pattern clusters.
2. Remove the smaller (e.g., singleton) clusters.
3. Take the number of remaining clusters as K and initialise K-Means with their centres.
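A sketch of the whole pipeline, chaining the anomalous_pattern and k_means sketches above; min_size is an illustrative threshold for what counts as a "smaller" cluster:

```python
import numpy as np

def i_k_means(Y, min_size=2):
    Yc = Y - Y.mean(axis=0)        # reference point: the grand mean
    remaining = np.arange(len(Yc))
    centres = []
    # 1. Find all Anomalous Pattern clusters, one by one.
    while len(remaining) > 0:
        S, c = anomalous_pattern(Yc, remaining)
        # 2. Remove the smaller (e.g., singleton) clusters.
        if len(S) >= min_size:
            centres.append(c)
        remaining = np.setdiff1d(remaining, S)
    # 3. Take K as the number of remaining clusters and initialise
    #    K-Means with their centres.
    return k_means(Yc, K=len(centres), init=np.array(centres))
```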
Study of eight number-of-clusters methods (joint work with Mark Chiang):
• Variance based: Hartigan (HK); Calinski & Harabasz (CH); Jump Statistic (JS)
• Structure based: Silhouette Width (SW)
• Consensus based: Consensus Distribution area (CD); Consensus Distribution mean (DD)
• Sequential extraction of APs (iK-Means): Least Squares (LS); Least Moduli (LM)
Experimental results at 9 Gaussian clusters (3 spread patterns), data size 1000 × 15.
[Tables: estimated numbers of clusters and Adjusted Rand Index for HK, CH, JS, SW, CD, DD, LS and LM under large and small spreads; methods marked as 1-time, 2-time or 3-time winners, with two winners counted each time]
(2) To address: weighting features according to relevance. The criterion becomes
$$ \sum_{k=1}^{K} \sum_{i \in I} \sum_{v=1}^{M} s_{ik}\, w_v^{\beta}\, |y_{iv} - c_{kv}|^{\beta} = \sum_{k=1}^{K} \sum_{i \in S_k} d(y_i, c_k) $$
where s_ik is the 1/0 cluster membership and the feature weights w_v act as scale factors.
Three-step K-Means:
- given s, c, find w (weights);
- given w, c, find s (clusters);
- given s, w, find c (centroids);
- iterate till convergence.
A sketch of the weight step follows this list.
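A sketch of the weight step ("given s, c, find w"); the update rule below is the one I associate with the Mirkin & Amorim (2012) paper cited earlier, so its exact form should be treated as an assumption and checked against the paper. It requires β > 1:

```python
import numpy as np

def update_weights(Y, labels, centroids, beta):
    # Per-feature dispersion: D_v = sum_k sum_{i in S_k} |y_iv - c_kv|^beta.
    D = (np.abs(Y - centroids[labels]) ** beta).sum(axis=0) + 1e-12
    # w_v = 1 / sum_u (D_v / D_u)^(1 / (beta - 1)); the weights sum to 1.
    w = 1.0 / ((D[:, None] / D[None, :]) ** (1.0 / (beta - 1.0))).sum(axis=1)
    return w
```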
Minkowski centres: minimize over c
$$ d(c) = \sum_{i \in S_k} |y_{iv} - c|^{\beta} $$
At β > 1, d(c) is convex, so a gradient method applies (sketched below).
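A sketch of a steepest-descent computation of a one-dimensional Minkowski centre (one feature within one cluster); the step size and stopping rule are illustrative:

```python
import numpy as np

def minkowski_center(y, beta, lr=1e-3, tol=1e-7, max_iter=100000):
    """Minimise d(c) = sum_i |y_i - c|**beta over c; convex for beta > 1."""
    y = np.asarray(y, dtype=float)
    c = float(np.median(y))  # any starting point works on a convex d(c)
    for _ in range(max_iter):
        # Gradient of the average of |y_i - c|**beta (same minimiser,
        # better-scaled steps): beta * |y_i - c|**(beta-1) * sign(c - y_i).
        grad = (beta * np.abs(y - c) ** (beta - 1) * np.sign(c - y)).mean()
        if abs(lr * grad) < tol:
            break
        c -= lr * grad
    return c
```

For β = 2 this recovers the mean; as β approaches 1, the centre approaches the median.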
Minkowski metric effects:
- The more uniform the distribution of the entities over a feature, the smaller its weight; for a uniform distribution, w = 0.
- The best Minkowski power β is data dependent.
- The best β can be learnt from the data in a semi-supervised manner (with clustering of all objects).
- Example: on Fisher's Iris data, iMWK-Means makes only 5 errors (a record).
Conclusion: the data-recovery, K-Means-wise model of clustering is a tool that involves a wealth of interesting criteria for mathematical investigation and application projects.
Further work:
- extending the approach to other data types: text, sequence, image, web page;
- upgrading K-Means to address the issue of interpretation of the results.
[Diagram: Data → Coder (clustering) → Model (Clusters) → Decoder (data recovery)]
HEFCE survey of students’ satisfaction
HEFCE method: ALL — 93 with the highest mark; STRATA — the 43 best, ranging from 71.8 to 84.6.