decisions that are needed when using cluster analysis, and

IntroductionThe data used for clustering

Characterising clustering methodsCluster benchmarking

Cluster validation

Decisions that are needed when using clusteranalysis,

and research that helps with making them

Christian Hennig

Supported by EPSRC Grant EP/K033972/1

Christian Hennig Decisions in cluster analysis



Cluster validation

1. Introduction

How to do clustering?How to represent information?What’s the best method?

What do we want from clustering?Various things. . .that aren’t necessarily served by the same clustering.




Cluster validation

Optimal representation vs. pattern recognition

3 3333

33

33

333

3333333 3 33

3

3333

3

3333 3

33

333

33

1

22

1 1

2

1

2

1

2

2

21

2

1

1

1

1

1

1

22

1

2

2

1

2

2

12

1

2

2 12

21

2

2

2

1

11

1

1

2

2

1

2 1

12

2

2 1

1

2

2

1

1

1

1

21

1

2

2

1

2

2

1

22

1

1

2

2 1

2

2

22

1

2

2

2

12

12

1

12

2

1

22

2

12

−10 0 10 20 30

010

2030

40

xy[,1]

xy[,2

]

X

XX

−10 0 10 20 30

01

02

03

04

0

xy[,1]

xy[,

2]




Cluster validation

Granularity?

●

●

● ●●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●● ●●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

● ●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●●

●

●

●

●

● ●●

●

●

●

●

●

●

●

●

●

●

●

●●● ●●●●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

−2 −1 0 1 2

−2

−1

01

waiting

dura

tion

−2 −1 0 1 2

−2

−1

01

mclust

waiting

du

ratio

n




Cluster validation

Clustering is applied in various fieldswith various aims that have different requirements




Cluster validation

E.g., object recognition in imagesrequires separation on suitable features,

clustering for information reduction requiresgood representation by centroids,

clustering regions for mapping requiressimilar (and close?) regions within clusters,

noisy observations from heterogeneous sourcesrequire telling apart homogeneous probability models.




Cluster validation

How to do clusteringdepends on decisions that we have to make.

Speaking of “the natural clusters” is misleading;it suggests that the data on their owncan know the best clustering,and decisions can/should be made automatically.

Not so. . .but how can cluster analysis research help deciding?




Cluster validation

StandardisationVariable selection

2. The data used for clustering

Decision how to represent information in dataDefinition, choice and selection of variablesDissimilarity definitionTransformation and standardisationDealing with missing information etc.

These often make a huge difference.




Cluster validation


2.1 Standardisation

. . . is reweighting of variablesin order to give all variables same influence(depending on method).

If relevant information contentis proportional to variationit’s a reason not to standardise(and not to use anything scale invariant).




Cluster validation


0.0 0.5 1.0 1.5 2.0 2.5 3.0

Bermuda pollen

Time

Vo

lta

ge

0.0 0.5 1.0 1.5 2.0 2.5 3.0

Bermuda spores

Time

Vo

lta

ge

0.0 0.5 1.0 1.5 2.0 2.5 3.0

Johnson spores

Time

Vo

lta

ge

0.0 0.5 1.0 1.5 2.0 2.5 3.0

Walnut pollen

Time

Vo

lta

ge




Cluster validation


2.2 Variable selection (and dimension reduction)

There’s much literatureon automatic variable selection for clustering.

Principles:Models assuming some variables “noise”(eg, Law, Figueiredo & Jain 2004)

Select variables to find “most clustered clustering”(eg, Montanari & Lizzani 2001)Eliminate redundant (dependent) variables(eg, Fraiman, Justel & Svarc 2008)




Cluster validation




Principles:Models assuming some variables “noise”(eg, Law, Figueiredo & Jain 2004)Select variables to find “most clustered clustering”(eg, Montanari & Lizzani 2001)

Eliminate redundant (dependent) variables(eg, Fraiman, Justel & Svarc 2008)




Cluster validation




Principles:Models assuming some variables “noise”(eg, Law, Figueiredo & Jain 2004)Select variables to find “most clustered clustering”(eg, Montanari & Lizzani 2001)Eliminate redundant (dependent) variables(eg, Fraiman, Justel & Svarc 2008)




Cluster validation


Obviously this is helpful for exploratory DA. . .

. . . but the selected variables definethe meaning of the clustering!

Is this for the data to decide?




Cluster validation


Reasons against automatic variable selectionDon’t eliminate variables essential for research aim(prescribed cluster meaning; or use for external objective)

Statistically dependent variablesmay be different in meaning;dependence information is not redundant“Most clustered clustering” may mean very littledue to degrees of freedom in finding it.

Could create meaningful summary variablesrather than reducing dimension automatically.




Cluster validation


Reasons against automatic variable selectionDon’t eliminate variables essential for research aim(prescribed cluster meaning; or use for external objective)Statistically dependent variablesmay be different in meaning;dependence information is not redundant

“Most clustered clustering” may mean very littledue to degrees of freedom in finding it.





Cluster validation


Reasons against automatic variable selectionDon’t eliminate variables essential for research aim(prescribed cluster meaning; or use for external objective)Statistically dependent variablesmay be different in meaning;dependence information is not redundant“Most clustered clustering” may mean very littledue to degrees of freedom in finding it.





Cluster validation

3. Characterising clustering methods

For choosing a suitable clustering method, needunderstanding how what they do relates to clustering aim.

Every clustering method comes with an implicit“cluster concept” with certain characteristics.

Understanding by “direct interpretation”and theoretical investigation of characteristics.




Cluster validation

E.g., k -means:n∑

i=1

‖xi −mc(i)‖2 = min!

Formalises representation of objects by centroids

Distances in all directions from centroid count the same(standardisation will have impact)No enforcement of between-cluster separationSquared distances⇒unforgiving against large within-cluster distancesMaximum likelihood for partition model ofspherical Gaussians (characteristic, not requirement!)




Cluster validation

E.g., k -means:n∑

i=1


Formalises representation of objects by centroidsDistances in all directions from centroid count the same(standardisation will have impact)

No enforcement of between-cluster separationSquared distances⇒unforgiving against large within-cluster distancesMaximum likelihood for partition model ofspherical Gaussians (characteristic, not requirement!)




Cluster validation

E.g., k -means:n∑

i=1


Formalises representation of objects by centroidsDistances in all directions from centroid count the same(standardisation will have impact)No enforcement of between-cluster separation

Squared distances⇒unforgiving against large within-cluster distancesMaximum likelihood for partition model ofspherical Gaussians (characteristic, not requirement!)




Cluster validation

E.g., k -means:n∑

i=1


Formalises representation of objects by centroidsDistances in all directions from centroid count the same(standardisation will have impact)No enforcement of between-cluster separationSquared distances⇒unforgiving against large within-cluster distances

Maximum likelihood for partition model ofspherical Gaussians (characteristic, not requirement!)




Cluster validation

E.g., k -means:n∑

i=1


Formalises representation of objects by centroidsDistances in all directions from centroid count the same(standardisation will have impact)No enforcement of between-cluster separationSquared distances⇒unforgiving against large within-cluster distancesMaximum likelihood for partition model ofspherical Gaussians (characteristic, not requirement!)




Cluster validation

Representation not separation

3 3333

33

33

333

3333333 3 33

3

3333

3

3333 3

33

333

33

1

22

1 1

2

1

2

1

2

2

21

2

1

1

1

1

1

1

22

1

2

2

1

2

2

12

1

2

2 12

21

2

2

2

1

11

1

1

2

2

1

2 1

12

2

2 1

1

2

2

1

1

1

1

21

1

2

2

1

2

2

1

22

1

1

2

2 1

2

2

22

1

2

2

2

12

12

1

12

2

1

22

2

12

−10 0 10 20 30

010

2030

40

xy[,1]

xy[,2

]

X

XX




Cluster validation

Unforgiving against large within-cluster distances

2

222

22 22 2

2

2

22

22 22

2

2

2

2

2

22

2

11

11

1

1

1

11

1

1

1

1

1

11

1

1

1

1

1

1

11

1

4

4

4 4

4

44

4

4 44

4

4

4

4

4

4

4

4

4 44

44

4

3

333

3

3

3

3

3

33

33

3

3

3 3

33

33

3

3

3

3

0 1 2 3 4

01

23

4

x1

x2

4444444444444444444444444

4444444444444444444444444

4444444444444444444444444

4444444444444444444

444444

3

1

2

−30 −20 −10 0 10 20 30

05

1015

2025

30

x1

x2




Cluster validation

Theory:Axiomatic characterisation of clustering methodsJardine and Sibson (1971), Fisher and van Ness (1971)

Impossibility Theorem (Kleinberg, 2002)There’s no clustering method that fulfillsscale invariance, richness, “consistency”

Consistency: if all within-cluster distancesare made smaller and all between-cluster distancesmade larger, we get the same clustering.

This is not usually desirable!




Cluster validation

Until that point: focus on supposedlygenerally desirable characteristics, “admissibility”.

More useful for helping with decisions:Characteristics that distinguishdifferent kinds of clustering methods, e.g.,

order invariance: order-preserving transformationof distances doesn’t change clustering(Ackerman et al. 2010; Fisher & van Ness 1971)




Cluster validation

Attempts to formalise robustnessagainst adding observations:Hennig (2008) cluster-wise “dissolution point”Ackerman et al. (2012) “δ-robustness”

How dissimilar can a clustering becomewhen adding g points?




Cluster validation

k -means’ fixed k dissolution point is minimum.

2

222

22 22 2

2

2

22

22 22

2

2

2

2

2

22

2

11

11

1

1

1

11

1

1

1

1

1

11

1

1

1

1

1

1

11

1

4

4

4 4

4

44

4

4 44

4

4

4

4

4

4

4

4

4 44

44

4

3

333

3

3

3

3

3

33

33

3

3

3 3

33

33

3

3

3

3

0 1 2 3 4

01

23

4

x1

x2

4444444444444444444444444

4444444444444444444444444

4444444444444444444444444

4444444444444444444

444444

3

1

2

−30 −20 −10 0 10 20 30

05

1015

2025

30

x1

x2




Cluster validation

Theoretical characterisation of clustering methodsrelevant for practice byconnecting characteristics to clustering aimsis an underdeveloped,important field for further research!




Cluster validation

Approaches to cluster benchmarkingQuality measurement with unknown truthKnown clusters used in different ways

4. Cluster Benchmarking - for help with choosing a method

CLUSTER BENCHMARK

DATA REPOSITORY

HTTP://IFCS.BOKU.AC.AT/REPOSITORY

/

(With Iven van Mechelen, Nema Dean, Fritz Leisch, Rainer Dangl, Anne-Laure Boulesteix, Isabelle Guyon, Doug

Steinley)




Cluster validation


4.1 Approaches to cluster benchmarkingComparison of clustering methods on

Real datasets with known classesSimulated datasets (from mixture distributions?)Real datasets without known classes

Datasets?Competitors?Quality measurement?




Cluster validation


Why not just use data with known true classes?

There may be more than onelegitimate clustering in a dataset.Real datasets with known classeswill tell us nothing about generalisability.“True classes” may not be data analytic clusters.




Cluster validation


Why not just use data with known true classes?There may be more than onelegitimate clustering in a dataset.Real datasets with known classeswill tell us nothing about generalisability.“True classes” may not be data analytic clusters.




Cluster validation


7 standard occupation classes such as “manual workers”,“managerials and professionals”, “not working”




Cluster validation


4.2 Quality measurement with unknown truth(current research, Hennig 2017, arxiv)

General approach: Measure different aspectsof clustering by different statisticsto give a multivariate characterisationof cluster validity.

Optimal clustering could be found bycomputing weighted average,according to relative importance of aspectsin given application.




Cluster validation


Typical clustering aimsBetween-cluster separationWithin-cluster homogeneity (low distances)Within-cluster homogeneous distributional shapeGood representation of data by centroidsGood representation of dissimilarity by clusteringClusters are regions of high density without within-clustergapsUniform cluster sizesClusters easily characterisable by few variablesClusters well related to external informationStability




Cluster validation


Measuring within-cluster homogeneityby average within-cluster dissimilarity.

Measuring representation of dissimilaritiesby “Pearson-Γ” (Hubert and Schultz 1976, Halkidi et al. 2016):Correlation between vector of dissimilaritiesand 0-1 vector for “in same/different cluster”.

Representation of objects by centroids- eg, k -means criterion.




Cluster validation


Measuring between-cluster separation:p-separation index:Average distance to nearest point in different cluster forp = 10% “border” points in any cluster.

−2 −1 0 1 2

−2

−1

01

waiting

du

ratio

n

X

X

X

X

X

X

X

XX

XX X

X

X

X

X

X

X

X

XX

XX

X

X

X X

X

X

X

X




Cluster validation


Clusters corresponding to high density regions- tough to measure (work in progress);

Many open problems:theoretical characterisation of indexes,relations between indexes,non-distance based criteria,big data computation/approximation.




Cluster validation


4.3 Data with known clusters used in different ways

Benchmarking with data with known clustersis surely relevant;discovering hidden structureis a major aim of clustering,and can only be tested this way.

Unfortunately, benchmarking with known clustersis often used for 1-d quality rankingwith little attention to generalisation andwhat can be learnt for clustering in practice.




Cluster validation


Data with known clusters used in different ways:On what characteristics of real clustering problemdoes a method’s performance depend?

On what features/characteristics of the datadoes a method’s performance depend?How are the different characteristics/criteriaand methods related on real data?




Cluster validation


Data with known clusters used in different ways:On what characteristics of real clustering problemdoes a method’s performance depend?On what features/characteristics of the datadoes a method’s performance depend?

How are the different characteristics/criteriaand methods related on real data?




Cluster validation


Data with known clusters used in different ways:On what characteristics of real clustering problemdoes a method’s performance depend?On what features/characteristics of the datadoes a method’s performance depend?How are the different characteristics/criteriaand methods related on real data?




Cluster validation

Visual explorationParametric bootstrap

5. Cluster validationAssessing the quality of a clustering on a dataset,incl. comparing clusterings, parameters(such as number of clusters)




Cluster validation


Elements of cluster validationInternal validation indexesStability assessmentUse of external informationTesting for clustering structureVisual explorationComparison of different clusterings on same data/sensitivity analysis




Cluster validation


5.1 Visual exploration

Visualisation is central for cluster validation.I’d never trust and interpret a clusteringwithout visual evaluation.

(If at all possible, I wouldn’t leave number of clustersto automatic criterion, rather decide from graphsand background information.)




Cluster validation


Visualisation for cluster validation (Rao 1952, Hennig 2004)

1

2

1

1

1

1

11

1

3

4

4

5

3

33

3

3

3

333 1

1

18

55

7

11

55

4

4

44

7

4 5

4

55 65

5

5

45

4

564

54

55 5

55

5

8

555

5 565

59

4

55

4

7

4

5 55

7

7

55

75

5

5

5 5 5

8

7

4 6

44

41

3

7777

5

7775

5

8

6 7 7

8

51

7

15

65

67

74

8

775

7

77

74 5

55

4

4 5 54 4

55

54

45

4

4 4 554

55

77

8

444

95

5 7

4

−5 0 5

05

1015

EEE, all equal

PC1

PC

21

2

1

1

11

1

1

1

3

4

4

5

33 3

3

3

3

33 3

1 11

8

55

7

11

5 5

4

4

44

7 4

5

4

55

655

5

45 4

56

4

5

4

555

55

5

8

555

55655

94

55

47

4

55

5

775 57 5

5

5

55

5

8

7

4

6

4

44

1

3

77

7

7

57

7 75 5

8

67 7

8

51

7

15

6

56

7

7

4

8

775

777

7

4

555

4

455

445

5

54

4

5 444

5

5 4

55

7

7

8

44 4

95

57

4

−5 0 5

−15

−10

−5

05

dc 1

dc 2

1

2

1

1

1

11

1

1

3

44

5

3

3

3

3

3

3

3

3

3

111

8

5

5

7

11

5

5

4

4

4

4

7

4

5

4

55

6

5

5 5

4

5

4

56

4

5

4

5

5

5

5

5

5

8

5

5

55

5

6

5

5

94

55

4

7

4

55

5

7

7

5

5

7

5

5

5

555

8

7

4

64

4

4 1

3

7

7

77

5

7

7

7

5

5

867

7

8

5

17

15

6

56

7

7

4

8

7

7

5

7

7

77

4

5

5

5

4

4

5

5

4

4

5

5

54

4

5

4

4

4

5

5

4

55

7

7

8

4

4 49

5

5

7

4

−40 −20 0 20 40 60

−5

05

arc 1

arc

2

1

2

11

11

11

1

3

4

4

5

333

3

3

3

33

3

1

1

1

8

5

5

7

11

55

4

4

4

4

7

4

5

4

5

565

55

45

4

56

4

5

4

55

5

55

5

8

5

55

55

6

55

9

4

5

5

4

7

45

5

5

7

755

7

5 5

55

5

5

87

4

6

4

4

4

1

3

7

7

7

7

5 777

55

8

6

77

85

1

715

6

5

6

7

7

4

8

77

57

7

77

4

5 55

4

4 5

5

4

4

55

5

4

4

5

44 4

5

5

4

5

5

778

4

449

5

57

4

−5 0 5

−10

−5

05

anc 1

anc

2




Cluster validation


5.2 Parametric bootstrap

Parametric bootstrap (Efron 1979):fitting a model to data, sampling from fitted model.

Uses in cluster validation:Compare real datawith homogeneity model for “no clustering” (Hennig & Lin 2015)capturing all non-clustering structure.




Cluster validation


Buja et al. (2009) - visual testing:




Cluster validation


Comparing ASW with values under null model

2 4 6 8 10

0.0

00

.05

0.1

00

.15

0.2

00

.25

0.3

0

Number of clusters

Ave

rag

e s

ilho

ue

tte

wid

th




Cluster validation


Comparing quality index with fitted models for all n.o.c.(work in progress)

010

2030

4050

60

Number of clusters

Den

sity

crit

erio

n

●

●

●

●

●

●●

●

●

●●

●

●●

●

●

●

●

●

●

●●●●●●●

●

●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●● ●

●●●

●●●

●●●

●

●●●●●●●●● ●

●●●

●●

●●

●

●●●●

●

●●●●●● ●●●

●●

●●●●

●●●●●●●

●

●●●●●●●●●

●

●

●

●●

●●

●

●●● ●

●●

●

●●●●●●●●●●●●●●●● ●●●

●●●●●●●●●

●

●

●●●●●●●●●●

●

●●●●●●●●

●●●

●

●●●

●

●●●●●

●●

●

●

●

●

●●

●

●●●●●

1 2 3 4 5 6 7 8 9 10 11




Cluster validation


Conclusion

Too much cluster analysis researchis about automatising decisions.

Too little cluster analysis researchacknowledges what researchers should decide,and tries to help them making the decisions.

There are endless opportunities forthis kind of research.




Cluster validation


Some marketing:




Cluster validation


. . . and some more:

C. Hennig (2004) Asymmetric linear dimension reduction for classification. Journal of Computational andGraphical Statistics 13, 930-945

C. Hennig (2008) Dissolution point and isolation robustness: robustness criteria for general cluster analysismethods. Journal of Multivariate Analysis 99, 1154-1176.

C. Hennig and C.-J. Lin (2015) Flexible parametric bootstrap for testing homogeneity against clustering andassessing the number of clusters. Statistics and Computing 25, 821-833Open access

C. Hennig (2015) Clustering strategy and method selection. In Hennig, C., M. Meila, F. Murtagh, and R.Rocci (Eds.). Handbook of Cluster Analysis. Chapman and Hall/CRCFree version on arxiv

C. Hennig (2017) Cluster validation by measurement of clustering characteristics relevant to the user.Submitted.Free version on arxiv

Other authors (selection):

M. Ackerman, S. Ben-David, and D. Loker (2010) Towards Property-Based Classification of ClusteringParadigms. NIPS 2010.

A. Buja, D. Cook, H. Hofmann, M. Lawrence, E.-K. Lee, D. Swayne, H. Wickham (2009) Statistical inferencefor exploratory data analysis and model diagnostics. Philosophical Transactions of The Royal Society, A.

L. Fisher and J. W. Van Ness (1971) Admissible Clustering Procedures. Biometrika 58, 91-104.

J. Kleinberg (2002) An Impossibility Theorem for Clustering. NIPS 2002.

C. R. Rao (1952) Advanced Statistical Methods in Biometric Research. Wiley, New York.


decisions that are needed when using cluster analysis, and

Documents