Introduction
The data used for clustering
Characterising clustering methods
Cluster benchmarking
Cluster validation
Decisions that are needed when using cluster analysis,
and research that helps with making them
Christian Hennig
Supported by EPSRC Grant EP/K033972/1
Christian Hennig Decisions in cluster analysis
1. Introduction
How to do clustering?
How to represent information?
What’s the best method?
What do we want from clustering?
Various things... that aren’t necessarily served by the same clustering.
Optimal representation vs. pattern recognition
[Figure: two scatterplot panels of xy[,1] (horizontal) against xy[,2] (vertical); points are plotted as cluster labels 1, 2 and 3, and three points are marked X.]
Granularity?
[Figure: two scatterplot panels of waiting (horizontal) against duration (vertical); the second panel is titled “mclust”.]
Clustering is applied in various fields with various aims that have different requirements.
E.g.,
object recognition in images requires separation on suitable features,
clustering for information reduction requires good representation by centroids,
clustering regions for mapping requires similar (and close?) regions within clusters,
noisy observations from heterogeneous sources require telling apart homogeneous probability models.
How to do clustering depends on decisions that we have to make.
Speaking of “the natural clusters” is misleading; it suggests that the data on their own can know the best clustering, and that decisions can/should be made automatically.
Not so... but how can cluster analysis research help with deciding?
Standardisation
Variable selection
2. The data used for clustering
Decisions on how to represent the information in the data:
Definition, choice and selection of variables
Dissimilarity definition
Transformation and standardisation
Dealing with missing information etc.
These often make a huge difference.
2.1 Standardisation
... is reweighting of variables in order to give all variables the same influence (depending on the method).
If the relevant information content is proportional to variation, that is a reason not to standardise (and not to use anything scale invariant).
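The reweighting effect of standardisation can be illustrated with a minimal sketch. The data, the function names `standardise` and `nearest`, and the choice of unit-variance scaling with Euclidean distance are assumptions made for illustration; the slides do not prescribe a particular standardisation.

```python
import math

def standardise(data):
    """Scale every variable (column) to mean 0 and unit standard deviation."""
    n = len(data)
    cols = list(zip(*data))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in data]

def nearest(data, i):
    """Index of the observation closest to observation i (Euclidean distance)."""
    others = [j for j in range(len(data)) if j != i]
    return min(others, key=lambda j: math.dist(data[i], data[j]))

# Hypothetical data: the second variable has a far larger numerical range,
# so it dominates unstandardised Euclidean distances.
raw = [[1.0, 1000.0], [1.1, 1400.0], [2.5, 1050.0], [5.0, 3000.0], [5.2, 2900.0]]

nearest_raw = nearest(raw, 0)               # decided almost entirely by variable 2
nearest_std = nearest(standardise(raw), 0)  # both variables now carry similar weight
print(nearest_raw, nearest_std)
```

Whether the standardised or the raw neighbour structure is the "right" one is exactly the decision discussed above: it depends on whether the larger variation of the second variable is meaningful.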
[Figure: four panels of Voltage against Time (0.0 to 3.0): Bermuda pollen, Bermuda spores, Johnson spores, Walnut pollen.]
2.2 Variable selection (and dimension reduction)
There’s much literature on automatic variable selection for clustering.
Principles:
Models assuming some variables are “noise” (e.g., Law, Figueiredo & Jain 2004)
Select variables to find the “most clustered clustering” (e.g., Montanari & Lizzani 2001)
Eliminate redundant (dependent) variables (e.g., Fraiman, Justel & Svarc 2008)
Obviously this is helpful for exploratory data analysis...
... but the selected variables define the meaning of the clustering!
Is this for the data to decide?
Reasons against automatic variable selection
Don’t eliminate variables essential for the research aim (prescribed cluster meaning; or use for an external objective).
Statistically dependent variables may be different in meaning; dependence information is not redundant.
“Most clustered clustering” may mean very little due to degrees of freedom in finding it.
Could create meaningful summary variables rather than reducing dimension automatically.
3. Characterising clustering methods
For choosing a suitable clustering method, we need to understand how what the methods do relates to the clustering aim.
Every clustering method comes with an implicit “cluster concept” with certain characteristics.
Understanding comes from “direct interpretation” and theoretical investigation of characteristics.
E.g., k-means:
$$\sum_{i=1}^{n} \| x_i - m_{c(i)} \|^2 = \min!$$
Formalises representation of objects by centroids
Distances in all directions from centroid count the same (standardisation will have impact)
No enforcement of between-cluster separation
Squared distances ⇒ unforgiving against large within-cluster distances
Maximum likelihood for partition model of spherical Gaussians (characteristic, not requirement!)
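The k-means criterion can be made concrete with a small sketch. It computes the sum of squared distances to the assigned centroids and reduces it with Lloyd's algorithm, which is the usual heuristic but is not named on the slide and only finds a local optimum. The data, starting centroids and function names are assumptions for illustration.

```python
import math

def kmeans_objective(points, centroids, labels):
    """Sum over i of ||x_i - m_c(i)||^2, the criterion that k-means minimises."""
    return sum(math.dist(p, centroids[c]) ** 2 for p, c in zip(points, labels))

def lloyd(points, start_centroids, max_iter=100):
    """Alternate nearest-centroid assignment and centroid update until stable."""
    centroids = [list(c) for c in start_centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centroids)),
                          key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments no longer change
        labels = new_labels
        for c in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

# Hypothetical data: two tight groups of three points each.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centroids, labels = lloyd(points, start_centroids=[[0, 0], [1, 0]])
objective = kmeans_objective(points, centroids, labels)
print(labels, objective)
```

Because every coordinate enters the squared Euclidean distance with the same weight, rescaling one variable changes the criterion and hence, possibly, the clustering.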
Representation not separation
[Figure: scatterplot of xy[,1] (horizontal) against xy[,2] (vertical); points are plotted as cluster labels 1, 2 and 3, and three points are marked X.]
Unforgiving against large within-cluster distances
[Figure: two scatterplot panels of x1 against x2 with points plotted as cluster labels 1 to 4; the first panel spans 0 to 4 on both axes, the second spans −30 to 30 on x1 and 0 to 30 on x2.]
Theory:
Axiomatic characterisation of clustering methods: Jardine and Sibson (1971), Fisher and van Ness (1971)
Impossibility Theorem (Kleinberg, 2002): There’s no clustering method that fulfils scale invariance, richness, and “consistency”.
Consistency: if all within-cluster distances are made smaller and all between-cluster distances are made larger, we get the same clustering.
This is not usually desirable!
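For reference, the three axioms can be written out as follows. This is a sketch of the usual formulation, with notation chosen here rather than taken from the slides: f denotes a clustering method that maps a dissimilarity d on a finite set S to a partition of S.

```latex
% f maps a dissimilarity d on a finite set S to a partition f(d) of S.
\begin{align*}
\text{Scale invariance:}\quad & f(\alpha d) = f(d) \quad \text{for all } \alpha > 0,\\
\text{Richness:}\quad & \{ f(d) : d \text{ a dissimilarity on } S \}
   \text{ contains every partition of } S,\\
\text{Consistency:}\quad & f(d') = f(d) \text{ whenever }
   d'(i,j) \le d(i,j) \text{ for } i, j \text{ in the same cluster of } f(d)\\
 & \text{and } d'(i,j) \ge d(i,j) \text{ for } i, j \text{ in different clusters of } f(d).
\end{align*}
```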
Up to that point: focus on supposedly generally desirable characteristics, “admissibility”.
More useful for helping with decisions: characteristics that distinguish different kinds of clustering methods, e.g.,
order invariance: an order-preserving transformation of distances doesn’t change the clustering (Ackerman et al. 2010; Fisher & van Ness 1971)
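Order invariance can be checked numerically on a small example. The sketch below uses single linkage, chosen here as an illustration because it only uses the ordering of the dissimilarities; the data and the function name `single_linkage` are assumptions, not part of the slides.

```python
from itertools import combinations

def single_linkage(dist, n, k):
    """Merge the two closest clusters repeatedly until k clusters remain.

    dist maps each pair (i, j) with i < j to a dissimilarity."""
    parent = list(range(n))

    def find(i):
        # Follow parent links to the representative of i's cluster.
        while parent[i] != i:
            i = parent[i]
        return i

    n_clusters = n
    for i, j in sorted(dist, key=dist.get):  # pairs in ascending dissimilarity
        if n_clusters == k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri
            n_clusters -= 1
    roots = [find(i) for i in range(n)]
    # A set of frozensets, so that the naming of clusters does not matter.
    return {frozenset(i for i in range(n) if roots[i] == r) for r in set(roots)}

# Hypothetical one-dimensional data with distinct pairwise distances.
xs = [0.0, 1.0, 2.5, 7.0, 8.2, 15.0]
dist = {(i, j): abs(xs[i] - xs[j]) for i, j in combinations(range(len(xs)), 2)}
squared = {pair: d ** 2 for pair, d in dist.items()}  # order-preserving transform

original = single_linkage(dist, len(xs), k=3)
transformed = single_linkage(squared, len(xs), k=3)
print(original == transformed)
```

A criterion such as the k-means sum of squares uses the sizes of the distances, not just their order, so it does not generally share this property.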
Attempts to formalise robustness against adding observations:
Hennig (2008) cluster-wise “dissolution point”
Ackerman et al. (2012) “δ-robustness”
How dissimilar can a clustering become when adding g points?
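One way to quantify how much of an original cluster survives after points are added is sketched below. It assumes Jaccard similarity between the original cluster and the best-matching new cluster as the measure; the slide does not specify the measure, and the function names and data are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two sets of observation indices."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def survival(cluster, new_clustering):
    """Largest Jaccard similarity between an original cluster and any cluster
    of the new clustering (restricted to the original observations)."""
    return max(jaccard(cluster, c) for c in new_clustering)

# Hypothetical example: original cluster {0, 1, 2, 3}; after adding g points and
# reclustering, the original observations are grouped as follows.
original_cluster = {0, 1, 2, 3}
split_clustering = [{0, 1}, {2, 3, 4, 5}, {6, 7}]
stable_clustering = [{0, 1, 2, 3, 4}, {5, 6, 7}]

split_score = survival(original_cluster, split_clustering)    # cluster is torn apart
stable_score = survival(original_cluster, stable_clustering)  # cluster mostly kept
print(split_score, stable_score)
```

A low value indicates that no cluster in the new clustering resembles the original one; where to draw the line for calling a cluster "dissolved" is a further choice not made in this sketch.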
The fixed-k dissolution point of k-means is the minimum possible.
[Figure: two scatterplot panels of x1 against x2 with points plotted as cluster labels 1 to 4; the first panel spans 0 to 4 on both axes, the second spans −30 to 30 on x1 and 0 to 30 on x2.]
Theoretical characterisation of clustering methods relevant for practice by connecting characteristics to clustering aims is an underdeveloped, important field for further research!
Approaches to cluster benchmarking
Quality measurement with unknown truth
Known clusters used in different ways
4. Cluster Benchmarking - for help with choosing a method
CLUSTER BENCHMARK DATA REPOSITORY
HTTP://IFCS.BOKU.AC.AT/REPOSITORY/
(With Iven van Mechelen, Nema Dean, Fritz Leisch, Rainer Dangl, Anne-Laure Boulesteix, Isabelle Guyon, Doug Steinley)
4.1 Approaches to cluster benchmarking
Comparison of clustering methods on
Real datasets with known classes
Simulated datasets (from mixture distributions?)
Real datasets without known classes
Datasets? Competitors? Quality measurement?
Why not just use data with known true classes?
There may be more than one legitimate clustering in a dataset.
Real datasets with known classes will tell us nothing about generalisability.
“True classes” may not be data analytic clusters.
7 standard occupation classes such as “manual workers”, “managerials and professionals”, “not working”
4.2 Quality measurement with unknown truth (current research, Hennig 2017, arXiv)
General approach: Measure different aspects of clustering by different statistics to give a multivariate characterisation of cluster validity.
An optimal clustering could be found by computing a weighted average, according to the relative importance of the aspects in the given application.
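The weighted-average idea can be sketched as follows. The index names, values and weights are hypothetical, and the sketch assumes that all indexes have already been calibrated to a common scale on which larger means better; how to do that calibration is not covered here.

```python
def aggregate(index_values, weights):
    """Weighted average of validity index values for one clustering.

    Assumes all indexes are on a common scale where larger values are better."""
    total = sum(weights.values())
    return sum(weights[name] * index_values[name] for name in weights) / total

# Hypothetical calibrated index values in [0, 1] for two candidate clusterings.
candidates = {
    "clustering_A": {"separation": 0.9, "homogeneity": 0.4, "stability": 0.7},
    "clustering_B": {"separation": 0.5, "homogeneity": 0.8, "stability": 0.6},
}

# Hypothetical weights: separation matters most in this application.
weights = {"separation": 3.0, "homogeneity": 1.0, "stability": 1.0}
scores = {name: aggregate(values, weights) for name, values in candidates.items()}
best = max(scores, key=scores.get)

# A different application that values homogeneity most prefers the other clustering.
other_weights = {"separation": 1.0, "homogeneity": 3.0, "stability": 1.0}
other_scores = {name: aggregate(values, other_weights)
                for name, values in candidates.items()}
other_best = max(other_scores, key=other_scores.get)
print(best, other_best)
```

The same pair of clusterings is ranked differently under the two weightings, which is the point: the ranking encodes a decision about the aims of the application.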
Typical clustering aims
Between-cluster separation
Within-cluster homogeneity (low distances)
Within-cluster homogeneous distributional shape
Good representation of data by centroids
Good representation of dissimilarity by clustering
Clusters are regions of high density without within-cluster gaps
Uniform cluster sizes
Clusters easily characterisable by few variables
Clusters well related to external information
Stability
Measuring within-cluster homogeneity by average within-cluster dissimilarity.
Measuring representation of dissimilarities by “Pearson-Γ” (Hubert and Schultz 1976, Halkidi et al. 2016): Correlation between the vector of dissimilarities and the 0-1 vector for “in same/different cluster”.
Representation of objects by centroids - e.g., k-means criterion.
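The first two statistics can be sketched in a few lines. The sketch assumes Euclidean distance as the dissimilarity and a plain average over all within-cluster pairs (the slides do not say how clusters of different size are weighted); the data and function names are hypothetical.

```python
import math
from itertools import combinations

def pearson(u, v):
    """Pearson correlation between two equally long numeric sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    return cov / math.sqrt(sum((a - mu) ** 2 for a in u) *
                           sum((b - mv) ** 2 for b in v))

def validity(points, labels):
    """Return (average within-cluster dissimilarity, Pearson-Gamma)."""
    pairs = list(combinations(range(len(points)), 2))
    d = [math.dist(points[i], points[j]) for i, j in pairs]
    # 0 if the pair is in the same cluster, 1 if in different clusters.
    different = [0 if labels[i] == labels[j] else 1 for i, j in pairs]
    within = [x for x, diff in zip(d, different) if diff == 0]
    return sum(within) / len(within), pearson(d, different)

# Hypothetical data: two tight groups, clustered once well and once badly.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
within_good, gamma_good = validity(points, [0, 0, 0, 1, 1, 1])
within_bad, gamma_bad = validity(points, [0, 1, 0, 1, 0, 1])
print(within_good, gamma_good, within_bad, gamma_bad)
```

A Pearson-Γ close to 1 means that large dissimilarities coincide with pairs in different clusters, so the clustering represents the dissimilarity structure well.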
Measuring between-cluster separation:
p-separation index: Average distance to the nearest point in a different cluster for the p = 10% “border” points in any cluster.
[Figure: scatterplot of waiting (horizontal) against duration (vertical), with some points marked X.]
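A minimal sketch of the p-separation index described above follows. It makes assumptions beyond the slide: the border points are selected separately within each cluster as the proportion p of its points that are closest to another cluster, at least one point per cluster is used, and the dissimilarity is Euclidean distance. The function name and data are hypothetical.

```python
import math

def separation_index(points, labels, p=0.1):
    """Average distance to the nearest point of another cluster, taken over
    the proportion p of 'border' points of each cluster (needs >= 2 clusters)."""
    # For every point: distance to the nearest point in a different cluster.
    nearest_other = [
        min(math.dist(x, y) for y, ly in zip(points, labels) if ly != lx)
        for x, lx in zip(points, labels)
    ]
    border = []
    for c in set(labels):
        d_c = sorted(d for d, l in zip(nearest_other, labels) if l == c)
        n_border = max(1, math.ceil(p * len(d_c)))  # assumption: at least one point
        border.extend(d_c[:n_border])               # the points closest to other clusters
    return sum(border) / len(border)

# Hypothetical data: two tight groups, clustered once well and once badly.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
sep_good = separation_index(points, [0, 0, 0, 1, 1, 1])
sep_bad = separation_index(points, [0, 1, 0, 1, 0, 1])
print(sep_good, sep_bad)
```

Using only the border points keeps the index focused on the gap between clusters rather than on how far apart the cluster centres are.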
Clusters corresponding to high density regions - tough to measure (work in progress);
Many open problems: theoretical characterisation of indexes, relations between indexes, non-distance based criteria, big data computation/approximation.
4.3 Data with known clusters used in different ways
Benchmarking with data with known clusters is surely relevant; discovering hidden structure is a major aim of clustering, and can only be tested this way.
Unfortunately, benchmarking with known clusters is often used for 1-d quality ranking with little attention to generalisation and what can be learnt for clustering in practice.
Data with known clusters used in different ways:
On what characteristics of a real clustering problem does a method’s performance depend?
On what features/characteristics of the data does a method’s performance depend?
How are the different characteristics/criteria and methods related on real data?
Visual exploration
Parametric bootstrap
5. Cluster validation
Assessing the quality of a clustering on a dataset, incl. comparing clusterings, parameters (such as number of clusters)
Elements of cluster validation
Internal validation indexes
Stability assessment
Use of external information
Testing for clustering structure
Visual exploration
Comparison of different clusterings on same data/sensitivity analysis
5.1 Visual exploration
Visualisation is central for cluster validation. I’d never trust and interpret a clustering without visual evaluation.
(If at all possible, I wouldn’t leave the number of clusters to an automatic criterion, but rather decide from graphs and background information.)
Visualisation for cluster validation (Rao 1952, Hennig 2004)
[Figure: four scatterplots of the same clustered data, points labelled by cluster number (1 to 9), panel title "EEE, all equal". Axes of the four panels: PC 1 vs. PC 2, dc 1 vs. dc 2, arc 1 vs. arc 2, and anc 1 vs. anc 2.]
5.2 Parametric bootstrap
Parametric bootstrap (Efron 1979): fitting a model to data, sampling from the fitted model.
Uses in cluster validation: compare real data with a homogeneity model for "no clustering" (Hennig & Lin 2015) that captures all non-clustering structure.
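The idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure of Hennig & Lin (2015): the homogeneity model here is a single Gaussian with independent variables (a deliberately crude stand-in that captures no structure beyond location and scale), the clustering method is plain k-means, and the quality index is the average silhouette width (ASW). All function names are hypothetical.

```python
import math
import random
import statistics


def kmeans(points, k, rng, n_init=5, n_iter=50):
    """Lloyd's algorithm with random restarts; returns the best labelling."""
    best_labels, best_cost = None, math.inf
    for _ in range(n_init):
        centres = rng.sample(points, k)
        for _ in range(n_iter):
            labels = [min(range(k), key=lambda j: math.dist(p, centres[j]))
                      for p in points]
            new_centres = []
            for j in range(k):
                members = [p for p, lab in zip(points, labels) if lab == j]
                # An empty cluster keeps its previous centre.
                new_centres.append(
                    tuple(statistics.fmean(c) for c in zip(*members))
                    if members else centres[j])
            if new_centres == centres:
                break
            centres = new_centres
        cost = sum(math.dist(p, centres[lab]) ** 2
                   for p, lab in zip(points, labels))
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels


def average_silhouette_width(points, labels):
    """Average silhouette width; points in singleton clusters contribute 0."""
    clusters = set(labels)
    if len(clusters) < 2:
        return 0.0
    widths = []
    for i, p in enumerate(points):
        dists = {c: [] for c in clusters}
        for j, q in enumerate(points):
            if j != i:
                dists[labels[j]].append(math.dist(p, q))
        own = dists.pop(labels[i])
        if not own:
            widths.append(0.0)
            continue
        a = statistics.fmean(own)                              # own cluster
        b = min(statistics.fmean(d) for d in dists.values())   # nearest other
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return statistics.fmean(widths)


def fit_gaussian_null(points):
    """Assumed null model: independent Gaussian per variable (mean, sd)."""
    return [(statistics.fmean(col), statistics.stdev(col))
            for col in zip(*points)]


def sample_null(params, n, rng):
    """Draw n points from the fitted null model."""
    return [tuple(rng.gauss(m, s) for m, s in params) for _ in range(n)]


def bootstrap_p_value(points, k=2, n_boot=50, seed=0):
    """Compare the observed ASW with ASW values simulated under the null."""
    rng = random.Random(seed)
    observed = average_silhouette_width(points, kmeans(points, k, rng))
    params = fit_gaussian_null(points)
    null_values = []
    for _ in range(n_boot):
        sim = sample_null(params, len(points), rng)
        null_values.append(average_silhouette_width(sim, kmeans(sim, k, rng)))
    exceed = sum(v >= observed for v in null_values)
    return observed, null_values, (exceed + 1) / (n_boot + 1)


if __name__ == "__main__":
    rng = random.Random(1)
    # Toy data: two well-separated groups of 30 points each.
    data = ([(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(30)]
            + [(rng.gauss(8, 1), rng.gauss(8, 1)) for _ in range(30)])
    observed, null_values, p = bootstrap_p_value(data)
    print(f"observed ASW {observed:.3f}, null max {max(null_values):.3f}, p {p:.3f}")
```

The choice of null model carries most of the weight: a model that fails to capture non-clustering structure (such as dependence between variables) can make homogeneous data look clustered, which is why the simple Gaussian above should be read as a placeholder only.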
Buja et al. (2009) - visual testing:
Comparing ASW with values under null model
[Figure: average silhouette width (y-axis, 0.00 to 0.30) against number of clusters (x-axis, 2 to 10).]
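For reference, the slide does not spell out the index. The following is the usual definition of the average silhouette width (ASW) for a dissimilarity $d$ and a clustering with $k$ clusters, together with one common way of comparing it with simulated null values (a Monte Carlo p-value). The symbols $a(i)$, $b(i)$, $C_i$ and $M$ are introduced here for illustration, and the p-value is an assumption about how the comparison could be formalised, not necessarily what the plot shows.

```latex
% Silhouette of observation i in cluster C_i (with |C_i| > 1):
a(i) = \frac{1}{|C_i| - 1} \sum_{j \in C_i,\ j \neq i} d(i, j), \qquad
b(i) = \min_{C \neq C_i} \frac{1}{|C|} \sum_{j \in C} d(i, j),

s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad
\mathrm{ASW}_k = \frac{1}{n} \sum_{i=1}^{n} s(i).

% Comparison with M datasets simulated from the fitted null model:
\hat{p}_k = \frac{1 + \#\{\, m : \mathrm{ASW}_k^{*(m)} \geq \mathrm{ASW}_k \,\}}{M + 1}.
```

Values of $s(i)$ lie in $[-1, 1]$, and observations in singleton clusters are conventionally given $s(i) = 0$.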
Comparing quality index with fitted models for all numbers of clusters (work in progress)
[Figure: density criterion (y-axis, 0 to 60) against number of clusters (x-axis, 1 to 11).]
Conclusion
Too much cluster analysis research is about automatising decisions.
Too little cluster analysis research acknowledges what researchers should decide, and tries to help them make the decisions.
There are endless opportunities for this kind of research.
Some marketing:
... and some more:
C. Hennig (2004) Asymmetric linear dimension reduction for classification. Journal of Computational and Graphical Statistics 13, 930-945.
C. Hennig (2008) Dissolution point and isolation robustness: robustness criteria for general cluster analysis methods. Journal of Multivariate Analysis 99, 1154-1176.
C. Hennig and C.-J. Lin (2015) Flexible parametric bootstrap for testing homogeneity against clustering and assessing the number of clusters. Statistics and Computing 25, 821-833. Open access.
C. Hennig (2015) Clustering strategy and method selection. In C. Hennig, M. Meila, F. Murtagh, and R. Rocci (Eds.), Handbook of Cluster Analysis. Chapman and Hall/CRC. Free version on arXiv.
C. Hennig (2017) Cluster validation by measurement of clustering characteristics relevant to the user. Submitted. Free version on arXiv.
Other authors (selection):
M. Ackerman, S. Ben-David, and D. Loker (2010) Towards Property-Based Classification of Clustering Paradigms. NIPS 2010.
A. Buja, D. Cook, H. Hofmann, M. Lawrence, E.-K. Lee, D. Swayne, H. Wickham (2009) Statistical inference for exploratory data analysis and model diagnostics. Philosophical Transactions of The Royal Society, A.
L. Fisher and J. W. Van Ness (1971) Admissible Clustering Procedures. Biometrika 58, 91-104.
J. Kleinberg (2002) An Impossibility Theorem for Clustering. NIPS 2002.
C. R. Rao (1952) Advanced Statistical Methods in Biometric Research. Wiley, New York.