Decisions that are needed when using cluster analysis, and research that helps with making them

Christian Hennig

Supported by EPSRC Grant EP/K033972/1

Outline:
Introduction
The data used for clustering
Characterising clustering methods
Cluster benchmarking
Cluster validation

Christian Hennig Decisions in cluster analysis

1. Introduction

How to do clustering?
How to represent information?
What’s the best method?

What do we want from clustering?
Various things... that aren’t necessarily served by the same clustering.

Optimal representation vs. pattern recognition

[Figure: two scatterplot panels of xy[,2] against xy[,1]. In the first panel, points are labelled with cluster numbers 1, 2 and 3, and three positions are marked X.]

Granularity?

[Figure: two scatterplot panels of duration against waiting; the second panel is titled “mclust”.]

Clustering is applied in various fields with various aims that have different requirements.

E.g.,
object recognition in images requires separation on suitable features,
clustering for information reduction requires good representation by centroids,
clustering regions for mapping requires similar (and close?) regions within clusters,
noisy observations from heterogeneous sources require telling apart homogeneous probability models.

How to do clustering depends on decisions that we have to make.

Speaking of “the natural clusters” is misleading; it suggests that the data on their own can know the best clustering, and that decisions can/should be made automatically.

Not so... but how can cluster analysis research help with deciding?

2. The data used for clustering

Decision how to represent information in data:
Definition, choice and selection of variables
Dissimilarity definition
Transformation and standardisation
Dealing with missing information etc.

These often make a huge difference.

2.1 Standardisation

... is reweighting of variables in order to give all variables the same influence (depending on method).

If relevant information content is proportional to variation, that is a reason not to standardise (and not to use anything scale invariant).
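The reweighting effect can be sketched on a tiny example. The data, the variable names (income, age) and the helper functions `standardise` and `nearest` below are hypothetical and not from the talk; the sketch only assumes Euclidean distance and standardisation to unit standard deviation. Before standardisation the large-scale variable dominates the distance; afterwards both variables contribute, and the nearest neighbour of the first object changes.

```python
import math
import statistics


def standardise(rows):
    """Scale every column to mean 0 and sample standard deviation 1."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(r, means, sds)] for r in rows]


def nearest(rows, i):
    """Index of the row closest to row i in Euclidean distance."""
    others = [j for j in range(len(rows)) if j != i]
    return min(others, key=lambda j: math.dist(rows[i], rows[j]))


# Hypothetical toy data: (income in currency units, age in years).
data = [
    [30000, 25],
    [30500, 60],
    [34000, 26],
    [50000, 40],
]

raw_nn = nearest(data, 0)               # income dominates the raw distance
std_nn = nearest(standardise(data), 0)  # age now has comparable influence
print(raw_nn, std_nn)
```

Whether the raw or the standardised neighbour is the "right" one is exactly the kind of decision the slide refers to; the code only shows that the choice matters.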

[Figure: four panels showing Voltage against Time (time axis from 0.0 to 3.0), titled “Bermuda pollen”, “Bermuda spores”, “Johnson spores” and “Walnut pollen”.]

2.2 Variable selection (and dimension reduction)

There’s much literature on automatic variable selection for clustering.

Principles:
Models assuming some variables are “noise” (e.g., Law, Figueiredo & Jain 2004)
Select variables to find the “most clustered clustering” (e.g., Montanari & Lizzani 2001)
Eliminate redundant (dependent) variables (e.g., Fraiman, Justel & Svarc 2008)

Obviously this is helpful for exploratory data analysis...

... but the selected variables define the meaning of the clustering!

Is this for the data to decide?

Reasons against automatic variable selection:
Don’t eliminate variables essential for the research aim (prescribed cluster meaning; or use for external objective).
Statistically dependent variables may be different in meaning; dependence information is not redundant.
“Most clustered clustering” may mean very little due to degrees of freedom in finding it.

Could create meaningful summary variables rather than reducing dimension automatically.

3. Characterising clustering methods

For choosing a suitable clustering method, we need to understand how what the methods do relates to the clustering aim.

Every clustering method comes with an implicit “cluster concept” with certain characteristics.

Understanding comes from “direct interpretation” and theoretical investigation of characteristics.

E.g., k-means:

$$\sum_{i=1}^{n} \|x_i - m_{c(i)}\|^2 = \min!$$

Formalises representation of objects by centroids
Distances in all directions from centroid count the same (standardisation will have impact)
No enforcement of between-cluster separation
Squared distances ⇒ unforgiving against large within-cluster distances
Maximum likelihood for partition model of spherical Gaussians (characteristic, not requirement!)
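As a minimal sketch, the k-means criterion above can be evaluated directly for a given partition. Here x_i is read as the i-th observation and m_c(i) as the mean of the cluster that observation i is assigned to; this reading, the function name `kmeans_criterion` and the four toy points are assumptions of the sketch, not taken from the talk. The second half rescales one variable to illustrate why standardisation has an impact: the partition with the smaller criterion value changes.

```python
def kmeans_criterion(points, labels):
    """Sum of squared Euclidean distances from each point to its cluster mean."""
    clusters = {}
    for x, c in zip(points, labels):
        clusters.setdefault(c, []).append(x)
    centroids = {
        c: [sum(col) / len(xs) for col in zip(*xs)] for c, xs in clusters.items()
    }
    return sum(
        sum((a - b) ** 2 for a, b in zip(x, centroids[c]))
        for x, c in zip(points, labels)
    )


# Hypothetical toy data: corners of a 10 x 2 rectangle.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
left_right = [0, 0, 1, 1]
bottom_top = [0, 1, 0, 1]

good = kmeans_criterion(points, left_right)
bad = kmeans_criterion(points, bottom_top)

# Rescale the first variable by 1/10: the preferred partition flips.
scaled = [(x / 10, y) for x, y in points]
good_scaled = kmeans_criterion(scaled, left_right)
bad_scaled = kmeans_criterion(scaled, bottom_top)
print(good, bad, good_scaled, bad_scaled)
```

The sketch only evaluates the criterion for fixed partitions; it does not search for the minimising partition.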

Representation not separation

[Figure: scatterplot of xy[,2] against xy[,1]; points are labelled with cluster numbers 1, 2 and 3, and three positions are marked X.]

Unforgiving against large within-cluster distances

[Figure: two scatterplot panels of x2 against x1. In the first panel (axes from 0 to 4), points are labelled with cluster numbers 1 to 4. In the second panel (x1 from −30 to 30), almost all points are labelled 4, with single points labelled 3, 1 and 2.]

Theory: axiomatic characterisation of clustering methods
Jardine and Sibson (1971), Fisher and van Ness (1971)

Impossibility Theorem (Kleinberg, 2002):
There’s no clustering method that fulfils scale invariance, richness, and “consistency”.

Consistency: if all within-cluster distances are made smaller and all between-cluster distances are made larger, we get the same clustering.

This is not usually desirable!

Until that point: focus on supposedly generally desirable characteristics, “admissibility”.

More useful for helping with decisions: characteristics that distinguish different kinds of clustering methods, e.g.,

order invariance: an order-preserving transformation of distances doesn’t change the clustering (Ackerman et al. 2010; Fisher & van Ness 1971)
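Order invariance can be checked on a small example. The sketch below uses single linkage cut at k clusters as an example of a method that only depends on the ordering of the distances; the choice of single linkage, the function name `single_linkage` and the toy observations are assumptions made for illustration and are not from the talk. Squaring the distances is an order-preserving transformation on non-negative values, so the resulting partition should not change.

```python
import itertools


def single_linkage(dist, k):
    """Merge the two clusters with the smallest pairwise distance until k remain."""
    clusters = [{i} for i in range(len(dist))]
    while len(clusters) > k:
        a, b = min(
            itertools.combinations(range(len(clusters)), 2),
            key=lambda ab: min(
                dist[i][j] for i in clusters[ab[0]] for j in clusters[ab[1]]
            ),
        )
        clusters[a] |= clusters[b]  # a < b, so deleting b keeps index a valid
        del clusters[b]
    return sorted(sorted(c) for c in clusters)


# Hypothetical one-dimensional observations.
xs = [0.0, 1.0, 2.5, 7.0, 8.0, 20.0]
d = [[abs(a - b) for b in xs] for a in xs]
d_squared = [[v ** 2 for v in row] for row in d]  # order-preserving for v >= 0

original = single_linkage(d, 3)
transformed = single_linkage(d_squared, 3)
print(original, transformed)
```

A method based on the k-means criterion would not generally pass this check, which is what makes the characteristic useful for telling methods apart.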

Attempts to formalise robustness against adding observations:
Hennig (2008) cluster-wise “dissolution point”
Ackerman et al. (2012) “δ-robustness”

How dissimilar can a clustering become when adding g points?
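A rough illustration of the question for g = 1: the sketch below computes the optimal 2-means partition of one-dimensional data by trying every split of the sorted values, adds a single extreme point, and compares an original cluster with its best match in the new clustering. Measuring the match by Jaccard similarity, the helper names and the data are assumptions of this sketch and not a reproduction of the formal definitions in the cited papers.

```python
def sse(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def two_means_1d(values):
    """Optimal 2-means in one dimension: try every split of the sorted values."""
    s = sorted(values)
    cut = min(range(1, len(s)), key=lambda c: sse(s[:c]) + sse(s[c:]))
    return [set(s[:cut]), set(s[cut:])]


def jaccard(a, b):
    return len(a & b) / len(a | b)


data = [0, 1, 2, 10, 11, 12]  # hypothetical data with two clear groups
before = two_means_1d(data)

# Add one extreme observation, then restrict the new clusters to the old points.
after = [c & set(data) for c in two_means_1d(data + [1000])]

target = {10, 11, 12}
best = max(jaccard(target, c) for c in after if c)
print(before, after, best)
```

With k fixed at 2, the single added point takes a cluster of its own and the two original groups are merged, so the original cluster {10, 11, 12} is no longer recovered.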

k-means’ fixed-k dissolution point is minimal.

[Figure: two scatterplot panels of x2 against x1. In the first panel (axes from 0 to 4), points are labelled with cluster numbers 1 to 4. In the second panel (x1 from −30 to 30), almost all points are labelled 4, with single points labelled 3, 1 and 2.]

Theoretical characterisation of clustering methods that is relevant for practice, by connecting characteristics to clustering aims, is an underdeveloped, important field for further research!

4. Cluster Benchmarking - for help with choosing a method

CLUSTER BENCHMARK DATA REPOSITORY
HTTP://IFCS.BOKU.AC.AT/REPOSITORY/

(With Iven van Mechelen, Nema Dean, Fritz Leisch, Rainer Dangl, Anne-Laure Boulesteix, Isabelle Guyon, Doug Steinley)

4.1 Approaches to cluster benchmarking

Comparison of clustering methods on
Real datasets with known classes
Simulated datasets (from mixture distributions?)
Real datasets without known classes

Datasets? Competitors? Quality measurement?

Why not just use data with known true classes?

There may be more than one legitimate clustering in a dataset.
Real datasets with known classes will tell us nothing about generalisability.
“True classes” may not be data analytic clusters.

7 standard occupation classes such as “manual workers”, “managerials and professionals”, “not working”

4.2 Quality measurement with unknown truth
(current research, Hennig 2017, arXiv)

General approach: measure different aspects of clustering by different statistics to give a multivariate characterisation of cluster validity.

Optimal clustering could be found by computing a weighted average, according to the relative importance of aspects in the given application.
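The weighted-average idea can be sketched as follows. All index names, index values and weights below are made up for illustration, and the sketch assumes that every index has already been put on a comparable scale where larger means better; how to calibrate indexes in that way is not addressed here.

```python
def aggregate(index_values, weights):
    """Weighted average of validity indexes (assumed comparable, larger is better)."""
    total = sum(weights.values())
    return sum(weights[name] * index_values[name] for name in weights) / total


# Hypothetical index values in [0, 1] for two candidate clusterings.
candidates = {
    "A": {"separation": 0.9, "homogeneity": 0.4, "stability": 0.7},
    "B": {"separation": 0.5, "homogeneity": 0.9, "stability": 0.6},
}

# Hypothetical weights expressing two different clustering aims.
weights_separation = {"separation": 3, "homogeneity": 1, "stability": 1}
weights_homogeneity = {"separation": 1, "homogeneity": 3, "stability": 1}


def best(weights):
    """Name of the candidate clustering with the largest aggregated score."""
    return max(candidates, key=lambda c: aggregate(candidates[c], weights))


print(best(weights_separation), best(weights_homogeneity))
```

The same two clusterings are ranked differently depending on the weights, which mirrors the point that the "optimal" clustering depends on the aim of the application.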

Typical clustering aims:
Between-cluster separation
Within-cluster homogeneity (low distances)
Within-cluster homogeneous distributional shape
Good representation of data by centroids
Good representation of dissimilarity by clustering
Clusters are regions of high density without within-cluster gaps
Uniform cluster sizes
Clusters easily characterisable by few variables
Clusters well related to external information
Stability

Measuring within-cluster homogeneity by average within-cluster dissimilarity.

Measuring representation of dissimilarities by “Pearson-Γ” (Hubert and Schultz 1976, Halkidi et al. 2016): correlation between the vector of dissimilarities and the 0-1 vector for “in same/different cluster”.

Representation of objects by centroids, e.g., k-means criterion.
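The first two statistics can be sketched in a few lines. The coding of the 0-1 vector (0 for “same cluster”, 1 for “different cluster”, so that larger values of Γ are better), the function names and the toy data are assumptions of this sketch; degenerate cases such as a single cluster are not handled.

```python
import itertools
import math


def average_within(dist, labels):
    """Mean dissimilarity over all pairs of objects that share a cluster."""
    pairs = [
        dist[i][j]
        for i, j in itertools.combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    ]
    return sum(pairs) / len(pairs)


def pearson_gamma(dist, labels):
    """Pearson correlation of pairwise dissimilarities with a 0-1 indicator
    (0 = same cluster, 1 = different cluster)."""
    pairs = list(itertools.combinations(range(len(labels)), 2))
    d = [dist[i][j] for i, j in pairs]
    z = [0.0 if labels[i] == labels[j] else 1.0 for i, j in pairs]
    md, mz = sum(d) / len(d), sum(z) / len(z)
    cov = sum((a - md) * (b - mz) for a, b in zip(d, z))
    var_d = sum((a - md) ** 2 for a in d)
    var_z = sum((b - mz) ** 2 for b in z)
    return cov / math.sqrt(var_d * var_z)


# Hypothetical one-dimensional data and two candidate clusterings.
xs = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(a - b) for b in xs] for a in xs]
good, bad = [0, 0, 1, 1], [0, 1, 0, 1]

print(average_within(dist, good), average_within(dist, bad))
print(pearson_gamma(dist, good), pearson_gamma(dist, bad))
```

On this toy example both statistics favour the clustering that groups nearby values, but in general the two measure different aspects and need not agree.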

Measuring between-cluster separation:
p-separation index: average distance to the nearest point in a different cluster for the p = 10% “border” points in any cluster.

[Figure: scatterplot of duration against waiting, with points marked X.]
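One possible reading of the p-separation index is sketched below: within every cluster, the fraction p of points that are closest to some point of another cluster are taken as “border” points, and their nearest-other-cluster distances are averaged. The rounding rule (at least one border point per cluster, rounding up), the function name and the toy data are assumptions of this sketch; the exact definition in the cited work may differ in such details.

```python
import math


def p_separation(points, labels, p=0.1):
    """Average distance to the nearest point of another cluster, taken over
    the fraction p of points per cluster that are closest to another cluster."""
    nearest_other = [
        min(math.dist(x, y) for y, k in zip(points, labels) if k != c)
        for x, c in zip(points, labels)
    ]
    border = []
    for c in set(labels):
        d = sorted(nearest_other[i] for i in range(len(points)) if labels[i] == c)
        border.extend(d[: max(1, math.ceil(p * len(d)))])
    return sum(border) / len(border)


# Hypothetical one-dimensional data: two clusters of ten points with a gap of 6.
points = [(float(v),) for v in range(10)] + [(float(v),) for v in range(15, 25)]
labels = [0] * 10 + [1] * 10

print(p_separation(points, labels))         # one border point per cluster
print(p_separation(points, labels, p=0.5))  # five border points per cluster
```

Using only the border points keeps the index focused on the gap between clusters instead of on the cluster interiors.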

Clusters corresponding to high density regions: tough to measure (work in progress).

Many open problems: theoretical characterisation of indexes, relations between indexes, non-distance based criteria, big data computation/approximation.

4.3 Data with known clusters used in different ways

Benchmarking with data with known clusters is surely relevant; discovering hidden structure is a major aim of clustering, and can only be tested this way.

Unfortunately, benchmarking with known clusters is often used for one-dimensional quality ranking with little attention to generalisation and what can be learnt for clustering in practice.

Data with known clusters used in different ways:

On what characteristics of the real clustering problem does a method’s performance depend?
On what features/characteristics of the data does a method’s performance depend?
How are the different characteristics/criteria and methods related on real data?

5. Cluster validation

Assessing the quality of a clustering on a dataset, including comparing clusterings and parameters (such as the number of clusters).

Elements of cluster validation

Internal validation indexes
Stability assessment
Use of external information
Testing for clustering structure
Visual exploration
Comparison of different clusterings on same data/sensitivity analysis

5.1 Visual exploration

Visualisation is central for cluster validation. I’d never trust and interpret a clustering without visual evaluation.

(If at all possible, I wouldn’t leave the number of clusters to an automatic criterion, but rather decide from graphs and background information.)

Visualisation for cluster validation (Rao 1952, Hennig 2004)

[Figure: four two-dimensional views of a clustering, with points plotted as cluster numbers 1 to 9. Title: “EEE, all equal”. Axes of the four panels: PC 1 vs. PC 2, dc 1 vs. dc 2, arc 1 vs. arc 2, anc 1 vs. anc 2.]

5.2 Parametric bootstrap

Parametric bootstrap (Efron 1979): fitting a model to data, sampling from the fitted model.

Uses in cluster validation: compare real data with a homogeneity model for “no clustering” (Hennig & Lin 2015) capturing all non-clustering structure.
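
The idea can be sketched as follows. This is a deliberately simplified stand-in, not the method of Hennig & Lin (2015): the homogeneity model here is a single one-dimensional Gaussian fitted by mean and standard deviation, and the clustering statistic is the within-cluster sum of squares of the best two-group split, divided by the total sum of squares. Both choices, and all function names, are assumptions made for illustration.

```python
import random
import statistics


def best_two_split_ratio(values):
    """Within-cluster sum of squares of the best split of 1-d data into
    two groups, divided by the total sum of squares (small = clustered)."""
    xs = sorted(values)

    def ss(segment):
        m = sum(segment) / len(segment)
        return sum((x - m) ** 2 for x in segment)

    return min(ss(xs[:k]) + ss(xs[k:]) for k in range(1, len(xs))) / ss(xs)


def parametric_bootstrap_pvalue(values, n_boot=200, seed=0):
    """Compare the observed statistic with its distribution under a
    fitted homogeneity model (here: one Gaussian, an assumed null model)."""
    rng = random.Random(seed)
    # Step 1: fit the null model to the data.
    mu = statistics.fmean(values)
    sigma = statistics.stdev(values)
    observed = best_two_split_ratio(values)
    # Step 2: sample data sets of the same size from the fitted model.
    simulated = [
        best_two_split_ratio([rng.gauss(mu, sigma) for _ in values])
        for _ in range(n_boot)
    ]
    # Step 3: how often does homogeneous data look at least as clustered?
    n_as_extreme = sum(s <= observed for s in simulated)
    return (1 + n_as_extreme) / (n_boot + 1)


# Toy data: two clearly separated groups of 20 values each.
data = [0.1 * i for i in range(20)] + [10.0 + 0.1 * i for i in range(20)]
print(parametric_bootstrap_pvalue(data))
```

A small value indicates that the data look more clustered than data sets generated from the fitted homogeneity model. In practice the null model has to capture the non-clustering structure of the real data (for example dependence or skewness), which a plain Gaussian usually does not.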

Buja et al. (2009) - visual testing:

Comparing ASW with values under null model

[Figure: average silhouette width (vertical axis, 0.00 to 0.30) against number of clusters (horizontal axis, 2 to 10).]
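
For reference, the average silhouette width (ASW) used in this comparison can be computed as in the sketch below. It assumes Euclidean distance and the usual convention that a point in a singleton cluster gets silhouette width 0; the function name is hypothetical.

```python
import math


def average_silhouette_width(points, labels):
    """Mean over all points of s(i) = (b - a) / max(a, b), where
    a = mean distance to the other points of the own cluster and
    b = smallest mean distance to the points of any other cluster.
    Needs at least two clusters."""
    clusters = {}
    for i, label in enumerate(labels):
        clusters.setdefault(label, []).append(i)

    widths = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Assumed convention: singleton clusters contribute 0.
            widths.append(0.0)
            continue
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denominator = max(a, b)
        widths.append((b - a) / denominator if denominator > 0 else 0.0)
    return sum(widths) / len(widths)


# Toy data on a line with an obvious and a poor two-cluster labelling.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
print(average_silhouette_width(points, [0, 0, 1, 1]))
print(average_silhouette_width(points, [0, 1, 0, 1]))
```

Computing this index for each number of clusters, both on the real data and on data sets sampled from a fitted null model, gives the kind of comparison shown in the figure.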

Comparing quality index with fitted models for all numbers of clusters (work in progress)

[Figure: density criterion (vertical axis, 0 to 60) against number of clusters (horizontal axis, 1 to 11).]

Conclusion

Too much cluster analysis research is about automatising decisions.

Too little cluster analysis research acknowledges what researchers should decide, and tries to help them make the decisions.

There are endless opportunities for this kind of research.

Some marketing:

. . . and some more:

C. Hennig (2004) Asymmetric linear dimension reduction for classification. Journal of Computational and Graphical Statistics 13, 930-945.

C. Hennig (2008) Dissolution point and isolation robustness: robustness criteria for general cluster analysis methods. Journal of Multivariate Analysis 99, 1154-1176.

C. Hennig and C.-J. Lin (2015) Flexible parametric bootstrap for testing homogeneity against clustering and assessing the number of clusters. Statistics and Computing 25, 821-833. (Open access)

C. Hennig (2015) Clustering strategy and method selection. In Hennig, C., M. Meila, F. Murtagh, and R. Rocci (Eds.). Handbook of Cluster Analysis. Chapman and Hall/CRC. (Free version on arXiv)

C. Hennig (2017) Cluster validation by measurement of clustering characteristics relevant to the user. Submitted. (Free version on arXiv)

Other authors (selection):

M. Ackerman, S. Ben-David, and D. Loker (2010) Towards Property-Based Classification of Clustering Paradigms. NIPS 2010.

A. Buja, D. Cook, H. Hofmann, M. Lawrence, E.-K. Lee, D. Swayne, H. Wickham (2009) Statistical inference for exploratory data analysis and model diagnostics. Philosophical Transactions of The Royal Society, A.

L. Fisher and J. W. Van Ness (1971) Admissible Clustering Procedures. Biometrika 58, 91-104.

J. Kleinberg (2002) An Impossibility Theorem for Clustering. NIPS 2002.

C. R. Rao (1952) Advanced Statistical Methods in Biometric Research. Wiley, New York.
