


IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. PAMI-5, NO. 1, JANUARY 1983

The Effect of Cluster Size, Dimensionality, and the Number of Clusters on Recovery of True Cluster Structure

GLENN W. MILLIGAN, S. C. SOON, AND LISA M. SOKOL

Abstract-An evaluation of four clustering methods and four external criterion measures was conducted with respect to the effect of the number of clusters, dimensionality, and relative cluster sizes on the recovery of true cluster structure. The four methods were the single link, complete link, group average (UPGMA), and Ward's minimum variance algorithms. The results indicated that the four criterion measures were generally consistent with each other, and two highly similar pairs were identified: the first pair consisted of the Rand and corrected Rand statistics, and the second pair was the Jaccard and the Fowlkes and Mallows indexes. With respect to the methods, recovery was found to improve as the number of clusters increased and as the number of dimensions increased. The relative cluster size factor produced differential performance effects, with Ward's procedure providing the best recovery when the clusters were of equal size. The group average method gave equivalent or better recovery when the clusters were of unequal size.

Index Terms-Cluster recovery, clustering validation, error perturbation procedures, external criterion statistics, Monte Carlo methods.

INTRODUCTION

THE present study was designed to provide information as to the effect of the number of clusters, the number of dimensions, and the relative cluster sizes on the recovery of true cluster structure. Four commonly used hierarchical algorithms were examined to determine whether the methods were differentially sensitive to the three factors. The methods were the single link, complete link, group average, and Ward's technique. Although several Monte Carlo studies have examined the recovery characteristics of these algorithms [12], few experiments have attempted to provide specific information with respect to these three design factors.

For example, Blashfield [1] included no systematic design factors in his study; rather, cluster characteristics were generated randomly under a specified scheme. Similarly, Milligan and Isaac [13] considered only the effect of intercluster spacing and error level on the recovery of cluster structure with ultrametric data. Kuiper and Fisher [9] reported several experiments which included two of the design factors used in the present study. However, the experiment which dealt with the effect of cluster size on recovery was confounded with an overall increase in the size of the entire data set. Furthermore, as the authors noted in the paper, the results concerning the number of clusters were partially confounded due to known characteristics of the Rand statistic. In fact, the majority of Monte Carlo studies have used only one external criterion index to evaluate the recovery of cluster structure. Hence, results from such studies may be influenced by the characteristics of the recovery statistic. This provided a secondary purpose for the paper: by measuring recovery with several criterion measures, it was possible to examine the extent of agreement between the indexes as to the degree of recovery of the true cluster structure.

Manuscript received October 28, 1981; revised April 7, 1982.
G. W. Milligan is with the Faculty of Management Sciences, Ohio State University, Columbus, OH 43210.
S. C. Soon is with the College of Administrative Sciences, Ohio State University, Columbus, OH 43210.
L. M. Sokol is with Flight Systems, Inc., Arlington, VA 22209.

DESIGN AND DATA GENERATION

Data Sets

The clusters present in the generated data sets were designed to possess the features of internal cohesion and external isolation as discussed by Cormack [4]. Internal cohesion requires that entities within the same cluster be similar to each other. External isolation requires that entities in one cluster be separated from entities in another cluster by fairly empty areas of space. This definition would seem to satisfy the intuitive concept of cluster structure held by many applied researchers. The definition is similar to the concept of natural clusters as described by Everitt [7], and it is consistent with the nature of the clustering algorithms which were examined. The results of the present study are meaningful only for these types of structures; generalization of the results to overlapping structures is risky at best.

The clusters were assumed to be embedded in a k-dimensional space, and Euclidean distance was used as the similarity measure. Each data set consisted of 50 points. Internal cohesion was achieved through the use of truncated multivariate normal mixtures. Data points assigned to each cluster were required to fall within 1.5 standard deviations of the cluster mean on each dimension.

A spatial or geometric approach was used to provide the property of external isolation. Specifically, the clusters were not allowed to overlap on the first dimension of the variable space; that is, clusters were required to occupy disjoint regions of space. The separation between adjacent cluster boundaries on the first dimension was a function of the sum of the two corresponding within-cluster standard deviations. The separation factor was equal to this sum times a uniform random variable distributed on the interval 0.25-0.75. Cluster overlap was permitted on the remaining dimensions


of the variable space. The range of each of the remaining dimensions was set equal to two-thirds of the range of the first dimension. Overlap was common on the remaining dimensions since the cluster boundary locations were randomly selected within the permissible range.
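The cluster-construction rules above (per-dimension truncation at 1.5 standard deviations of the cluster mean, and a first-dimension gap equal to the sum of the two adjacent within-cluster standard deviations times a Uniform(0.25, 0.75) factor) can be sketched as follows. This is an illustrative reconstruction, not the authors' original cluster construction program; the function names are ours.

```python
import random

def truncated_normal(mean, sd, rng):
    """Draw one coordinate from a normal distribution, rejecting
    draws beyond 1.5 standard deviations of the cluster mean
    (the paper's per-dimension truncation rule)."""
    while True:
        x = rng.gauss(mean, sd)
        if abs(x - mean) <= 1.5 * sd:
            return x

def separation(sd_left, sd_right, rng):
    """Gap between two adjacent cluster boundaries on the first
    dimension: the sum of the two within-cluster standard
    deviations times a Uniform(0.25, 0.75) factor."""
    return (sd_left + sd_right) * rng.uniform(0.25, 0.75)
```

Under these rules a cluster's points on the first dimension occupy a bounded interval, and adjacent intervals are separated by a strictly positive gap, which is the external-isolation property the paper requires.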

An example plot of a five-cluster data set is presented in Fig. 1. The vertical axis corresponds to the dimension where cluster overlap was not permitted. Projection of the cluster boundaries onto the horizontal axis indicates that clusters 1, 4, and 5 possessed overlapping boundaries. It should be noted that the data set consisted of a total of six dimensions, and the plot which was used in Fig. 1 indicates the maximum cluster separation. Other two-dimensional plots of this data set suggested that several clusters occupied the same region of space.

Finally, the lengths of the cluster boundaries were randomly selected from the uniform interval of 10-40 units. The boundary length was defined to be three standard deviations for the cluster on that dimension.

Error Perturbation Conditions

Four error conditions were used in the study. The first error condition consisted of the error-free data sets as produced by the cluster construction program. The second error condition involved the error perturbation of the interpoint distances; hence, the true structure was hidden by what could be considered a noisy measurement process. This was accomplished by adding normally distributed noise to each dimensional comparison when computing the distance between two points. The mean of the distribution of the random errors was zero. The standard deviation was set equal to the sum of the standard deviations from the cluster(s) that contained the two points for the given dimension. The effect of this error level was to allow the boundaries to expand to the point where overlap between adjacent clusters was possible on the first dimension of the variable space.
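As a sketch of this noisy-distance computation: zero-mean normal error is added to each dimensional difference before squaring, with the noise standard deviation on a dimension equal to the sum of the two points' cluster standard deviations there. The exact form used in the original program is not given in the text, so this is one plausible reading, and the function name is ours.

```python
import math
import random

def perturbed_distance(x, y, noise_sds, rng):
    """Euclidean distance between points x and y with normally
    distributed noise added to each dimensional comparison.

    noise_sds[i] is the noise standard deviation on dimension i,
    taken as the sum of the two clusters' standard deviations
    on that dimension (the paper's second error condition)."""
    total = 0.0
    for xi, yi, sd in zip(x, y, noise_sds):
        diff = (xi - yi) + rng.gauss(0.0, sd)  # perturb the comparison
        total += diff * diff
    return math.sqrt(total)
```

With all noise standard deviations set to zero the function reduces to the plain Euclidean distance, which matches the error-free condition.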

The third error condition involved the addition of two random noise dimensions to the basic set of dimensions which defined the true cluster structure. The coordinate values on the error dimensions were determined with the aid of a uniform random number generator. The range of the random noise dimensions was set equal to the range of the first dimension of the variable space where cluster overlap was not permitted. Thus, the added dimensions provided no information whatsoever as to the true cluster structure; they served only to hide the structure.

The last error condition actually represented random noise data which contained no inherent cluster structure. The random noise data sets were used to illustrate characteristics of the external criterion statistics and to provide baseline recovery rates. Characteristics such as the number of dimensions, the number and relative sizes of hypothesized clusters, and the range on each dimension matched the properties of one of the error-free data sets.

The error conditions in the present study correspond to the error-free, high error, two-dimensional error, and random noise error levels in [10]. Additional details concerning the cluster construction process and the error conditions can be found in [10] and [11].
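The third error condition (uniform noise dimensions whose range matches that of the first, non-overlapping dimension) is straightforward to sketch; again, the function name is ours and this is only an illustration of the procedure described above.

```python
import random

def add_noise_dimensions(points, first_dim_range, n_noise, rng):
    """Append n_noise uniformly distributed coordinates to each point.

    The noise range matches the range of the first dimension of the
    variable space, so the added dimensions carry no information
    about cluster membership and serve only to hide the structure."""
    lo, hi = first_dim_range
    return [list(p) + [rng.uniform(lo, hi) for _ in range(n_noise)]
            for p in points]
```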

Fig. 1. Plot of a five-cluster data set with equal cluster densities.

TABLE I
POINT DISTRIBUTION ACROSS CLUSTERS

Number of                     Density Level
Clusters      Equal             10%              60%
   2          25-25             5-45             30-20
   3          16-17-17          5-22-23          30-10-10
   4          12-12-13-13       5-15-15-15       30-6-7-7
   5          10-10-10-10-10    5-11-11-11-12    30-5-5-5-5

Design Factors

The overall characteristics of the clusters were controlled by a three-factor design. The first design factor determined whether two, three, four, or five clusters would be present in the data. The second factor involved embedding the clusters in either a four-, six-, or eight-dimensional space. The third factor (point density) varied the size of the clusters by using three different distribution patterns of the points to the clusters. Three replications were taken from each cell in the design. This created a total of 108 error-free data sets.

The first level of the point density factor distributed the points across the clusters as equally as possible (see Table I). The second level specified that one cluster must always contain 10 percent of the data points, while the third level placed 60 percent of the points in a single cluster. The remaining


points in the 10 and 60 percent density conditions were distributed as equally as possible across the remaining clusters. For the larger numbers of clusters, the 60 percent condition produced a large discrepancy between the relative sizes; only a moderate difference existed when fewer clusters were present. Conversely, the 10 percent condition produced a large discrepancy between the relative sizes when only two or three clusters were present, and a moderate discrepancy when the number of clusters was larger.

External Criterion Measures

An external criterion measure is used to evaluate recovery of true cluster structure [17]. An external criterion index uses information obtained from outside the clustering process; in the present situation, the external information is the knowledge of the true cluster structure. Four criterion measures were used in the study: the Rand [16], corrected Rand [2], [15], Jaccard [5], [11], and Fowlkes and Mallows [8] statistics. The Rand statistic is one of the earliest such measures and has been used extensively in Monte Carlo research [12]. The other three measures have been proposed more recently in an attempt to overcome some undesirable properties of the Rand index. For example, the Rand statistic approaches its upper boundary value of 1.00 as the number of clusters increases without limit [2], [8], [16].

Table II will be used to define the four external indexes.

TABLE II
PAIRWISE CLASSIFICATION NOTATION FOR THE EXTERNAL CRITERIA

                                    Criterion Solution
                             Pair in Same    Pair Not in
Algorithm Solution           Cluster         Same Cluster
Pair in Same Cluster             a               b
Pair Not in Same Cluster         c               d

The four cells in the table indicate whether each pair of points was properly classified together. For example, cell a indicates the frequency count for the number of pairs which were correctly clustered together by the algorithm. On the other hand, cell c indicates the number of occurrences where the algorithm placed a pair of points in different clusters when the points actually came from the same cluster. Thus, cells a and d indicate the frequency of correct pairwise classifications, and cells b and c represent the number of improperly clustered pairs. Given these definitions, the four statistics are defined as follows.

Rand:

    [a + d] / [a + b + c + d].                         (1)

Corrected Rand:

    [a + d - N_c] / [a + b + c + d - N_c].             (2)

Jaccard:

    [a] / [a + b + c].                                 (3)

Fowlkes and Mallows:

    [a] / [(a + b)(a + c)]^(1/2).                      (4)

Computational formulas for the four statistics are presented in the Appendix. It should be noted that the term N_c in (2) for the corrected Rand statistic does not have a simple expression in terms of Table II. The corrected Rand index is a direct extension of Cohen's kappa statistic for use in the clustering context [3].

When evaluating cluster recovery, the indexes were computed at the level of the hierarchy which corresponds to the exact number of clusters known to be present in the data. All four statistics return a value of 1.00 when the partition produced by the algorithm exactly matches the true data clustering. When the clustering is perfect, the values for b and c are equal to zero. The lower limit for the Rand, Jaccard, and Fowlkes and Mallows statistics is 0.0. However, the indexes will virtually never produce such values in an actual data clustering. The lower bound of the corrected Rand index depends on the exact data partitioning.
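A minimal sketch of the pairwise bookkeeping of Table II and of statistics (1), (3), and (4) follows; the corrected Rand index (2) is omitted because its chance-agreement term N_c has no simple expression in terms of the table. The function names are ours.

```python
import math
from itertools import combinations

def pair_counts(true_labels, found_labels):
    """Count the four cells of Table II over all pairs of points.

    a: pair together in both partitions; b: together only in the
    algorithm's solution; c: together only in the true solution;
    d: apart in both."""
    a = b = c = d = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_found = found_labels[i] == found_labels[j]
        if same_true and same_found:
            a += 1
        elif same_found:
            b += 1
        elif same_true:
            c += 1
        else:
            d += 1
    return a, b, c, d

def rand_index(a, b, c, d):
    return (a + d) / (a + b + c + d)          # statistic (1)

def jaccard_index(a, b, c, d):
    return a / (a + b + c)                    # statistic (3)

def fowlkes_mallows(a, b, c, d):
    return a / math.sqrt((a + b) * (a + c))   # statistic (4)
```

For a perfect recovery b = c = 0, so all three statistics return 1.00, matching the behavior described above.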

RESULTS

The results for the point density factor are presented in Tables III and IV. For each error condition, four sets of means are reported, computed using each of the four external criteria. Since the four sets of means were computed from the same Monte Carlo data, comparisons between criteria can be conducted within each error condition. Since the indexes are measuring the same relative degree of cluster recovery, the rank order of the methods within each error condition level should be the same from one index to the next. Similarly, the rank order within method across error levels should be consistent between indexes. Such a framework provides a direct method for comparing and evaluating the external criteria.

For example, consider the Rand statistic with the error-free data. The 60 percent density level produced recovery means of 0.940, 0.995, 0.999, and 0.991 for Methods 1-4, respectively. This produces a rank order of 4, 2, 1, and 3. It turns out that this rank order exactly matches the ranking produced by each of the other three recovery measures. Similarly, when considering the order within rows, the Rand statistic means for Method 1 were 0.999, 0.984, and 0.940. The ranking of 1, 2, and 3 exactly matches the ranks obtained from each of the other recovery measures. When all comparisons similar to these are made for the error-free condition, one discovers that the four statistics produced identical rankings in every comparison. Thus, the four statistics were perfectly consistent in ranking the degree of cluster recovery by the algorithms.

When considering the actual performance of the methods, recovery was excellent in all density levels for the error-free data. The only exceptions appear to be slightly depressed recovery rates for the single link method in the 10 and 60 percent levels and in the 10 percent level for Ward's method. The corrected Rand and Jaccard measures seem to have accentuated this effect more than the Rand or Fowlkes and Mallows indexes.

For the error-perturbed condition, the Jaccard and Fowlkes and Mallows indexes exhibited perfect agreement for all rankings. The Rand and corrected Rand indexes were generally


TABLE III
RECOVERY VALUES FOR THE POINT DENSITY FACTOR: ERROR-FREE AND ERROR-PERTURBED CONDITIONS

                              Error Free              Error Perturbed
Index          Method    Equal   10%    60%       Equal   10%    60%
Rand              1      .999   .984   .940       .847   .872   .785
                  2      .991   .998   .995       .949   .917   .895
                  3      .999   .997   .999       .945   .957   .899
                  4      .999   .971   .991       .974   .882   .921
Corrected Rand    1      .998   .944   .879       .693   .673   .580
                  2      .983   .992   .990       .893   .801   .790
                  3      .998   .988   .998       .887   .855   .800
                  4      .998   .941   .983       .943   .760   .840
Jaccard           1      .998   .978   .923       .775   .829   .748
                  2      .986   .997   .991       .906   .877   .826
                  3      .998   .996   .998       .910   .933   .854
                  4      .998   .968   .982       .942   .845   .848
Fowlkes and       1      .999   .988   .956       .867   .900   .850
Mallows           2      .992   .999   .995       .941   .926   .893
                  3      .999   .998   .999       .946   .964   .913
                  4      .999   .979   .989       .966   .901   .904

Note: Methods 1-4 are respectively the single link, complete link, group average, and Ward's algorithms. Each entry is the average of 36 data sets (four levels of clusters x three levels of dimensions x three replications).

TABLE IV
RECOVERY VALUES FOR THE POINT DENSITY FACTOR: TWO-DIMENSIONAL NOISE AND RANDOM NOISE CONDITIONS

                          Two-Dimensional Noise        Random Noise
Index          Method    Equal   10%    60%       Equal   10%    60%
Rand              1      .877   .833   .818       .377   .472   .458
                  2      .840   .826   .812       .585   .579   .534
                  3      .904   .906   .897       .552   .554   .508
                  4      .922   .828   .815       .590   .585   .534
Corrected Rand    1      .740   .582   .656       .006   .004   .006
                  2      .608   .596   .608       .045   .040   .040
                  3      .769   .776   .793       .032   .029   .015
                  4      .804   .621   .609       .047   .042   .035
Jaccard           1      .744   .727   .753       .295   .403   .404
                  2      .584   .614   .640       .201   .248   .257
                  3      .739   .786   .811       .230   .307   .297
                  4      .769   .666   .616       .199   .243   .238
Fowlkes and       1      .841   .836   .857       .507   .586   .607
Mallows           2      .702   .727   .744       .327   .383   .408
                  3      .825   .857   .877       .371   .454   .457
                  4      .845   .775   .740       .323   .375   .384

consistent with each other, but produced slightly different rank orders than the other two measures. However, all four statistics agreed that the single link method gave the poorest recovery in all density levels. Similarly, all indexes indicated that Ward's method gave the best recovery when the clusters were of equal size, whereas all statistics confirmed that the group average method gave the best recovery in the 10 percent level. Overall, the poorest recovery was obtained in the 60 percent condition, with the two Rand statistics suggesting Ward's method as the best, while the Jaccard and Fowlkes and Mallows measures supported the group average algorithm. Actually, an examination of the recovery means for these last two measures indicates essentially identical recovery rates for the algorithms. The inconsistency between the two sets of statistics may be indicating that the recovery rates between methods were rather close.

For the two-dimensional noise condition, the rankings for the Jaccard and Fowlkes and Mallows statistics were perfectly consistent. As before, the Rand and corrected Rand indexes produced similar, but not perfectly matched, rankings. All four measures indicated that Ward's method gave the best recovery in the equal density level. When the clusters were of unequal size (10 and 60 percent levels), the group average method was rated by all statistics as producing the best recovery.

Given the overall magnitude of the recovery means, it appears that the two-dimensional noise condition was a more severe error condition than the error-perturbed data. However, an examination of the baseline means from the random noise data indicates that a significant degree of cluster structure was being recovered.

The random noise condition means are of interest because they indicate the lower range of values for the four statistics. Since all four measures can obtain their upper limit of 1.00, the baseline measures give an indication as to the overall variability of the statistics. The Rand statistic has been criticized because of its limited variance [8]. The Fowlkes and Mallows index was derived in the hope of offering a statistic with greater variability, among other goals. As can be seen in Table IV, the baseline means for the Fowlkes and Mallows measure are lower than those for the Rand statistic for all methods except the single linkage procedure. However, it is clear that the Jaccard, and especially the corrected Rand, measures have lower baseline values. In fact, it appears that the corrected Rand statistic has an obtainable lower bound very close to 0.0. This provides the statistic with an effective range equivalent to that of a measure of proportionality or correlation, which has some intuitive appeal to applied researchers. An examination of the within-cell standard deviations presented in Table IX confirms the hypothesis of greater variance for the corrected Rand index for all data sets which possessed cluster structure. It is interesting to note that the index was derived for somewhat different purposes; specifically, the originators had hoped to provide a statistic which would allow for comparisons across hierarchy levels [2], [5].

The results for the number of clusters factor are presented in Tables V and VI. Highly consistent rankings between the four recovery measures can be found for the error-free, error-perturbed, and two-dimensional noise conditions. The few inconsistent rankings which can be found do not exhibit a systematic pattern, except that the rankings for the Rand and corrected Rand measures are more similar to each other than to the remaining two indexes. A similar statement can be made concerning the Jaccard and Fowlkes and Mallows measures.

When considering the recovery means for the error-free condition, it is clear that the group average method gave the best recovery and the single link method produced the poorest recovery. More importantly, it is interesting to note that the lowest recovery values were obtained from data sets consisting of two clusters. This result was confirmed in the error-perturbed data condition. In fact, all four statistics seem to suggest that recovery improves as one moves to data sets containing a larger number of clusters. This trend was noted by Kuiper and Fisher [9], but due to the known characteristics of the Rand statistic, the authors chose not to claim the result as a valid feature of the algorithms. The present paper will


TABLE V
RECOVERY VALUES FOR THE NUMBER OF CLUSTERS FACTOR: ERROR-FREE AND ERROR-PERTURBED CONDITIONS

                                Error Free                   Error Perturbed
Index          Method     2      3      4      5        2      3      4      5
Rand              1     .930   .981   .990   .995     .663   .766   .929   .981
                  2     .978  1.000  1.000  1.000     .799   .940   .949   .993
                  3     .993  1.000  1.000  1.000     .925   .942   .971   .997
                  4     .957  1.000   .992  1.000     .809   .948   .957   .988
Corrected Rand    1     .826   .963   .981   .991     .208   .578   .854   .955
                  2     .953  1.000  1.000  1.000     .560   .880   .888   .984
                  3     .978  1.000  1.000  1.000     .575   .884   .938   .992
                  4     .913  1.000   .983  1.000     .618   .893   .906   .973
Jaccard           1     .927   .966   .982   .991     .653   .693   .846   .943
                  2     .966  1.000  1.000  1.000     .746   .879   .873   .979
                  3     .989  1.000  1.000  1.000     .792   .887   .929   .989
                  4     .948  1.000   .981  1.000     .763   .889   .894   .966
Fowlkes and       1     .957   .981   .990   .995     .794   .815   .911   .969
Mallows           2     .980  1.000  1.000  1.000     .840   .929   .922   .988
                  3     .994  1.000  1.000  1.000     .875   .934   .960   .994
                  4     .968  1.000   .988  1.000     .848   .932   .935   .981

Note: Each mean is based on a sample size of 27 data sets.

TABLE VI
RECOVERY VALUES FOR THE NUMBER OF CLUSTERS FACTOR: TWO-DIMENSIONAL NOISE AND RANDOM NOISE CONDITIONS

                            Two-Dimensional Noise            Random Noise
Index          Method     2      3      4      5        2      3      4      5
Rand              1     .862   .814   .844   .850     .598   .408   .364   .374
                  2     .907   .797   .793   .807     .504   .537   .586   .638
                  3     .982   .859   .883   .887     .538   .494   .532   .588
                  4     .888   .837   .845   .849     .507   .535   .594   .643
Corrected Rand    1     .599   .658   .692   .688     .003   .001   .008   .010
                  2     .815   .583   .520   .499     .007   .042   .046   .073
                  3     .946   .721   .736   .714     .006   .027   .029   .040
                  4     .778   .665   .646   .623     .015   .037   .049   .064
Jaccard           1     .858   .738   .695   .674     .589   .364   .282   .234
                  2     .881   .617   .500   .452     .396   .236   .170   .140
                  3     .972   .754   .714   .674     .472   .285   .204   .151
                  4     .869   .678   .619   .568     .390   .227   .162   .127
Fowlkes and       1     .918   .843   .818   .799     .756   .580   .497   .433
Mallows           2     .924   .735   .643   .594     .572   .381   .291   .247
                  3     .985   .841   .813   .774     .645   .452   .346   .267
                  4     .918   .786   .741   .702     .564   .371   .280   .229

claim that recovery does improve as the number of clusters increases, at least for this type of error condition. Not only do all four external criteria in the error-perturbed condition support this view, but the random noise data provide strong evidence for the hypothesis.

Specifically, the baseline values for the corrected Rand measure are very close to 0.0, yet the index exhibits a clear tendency to increase in value as the number of clusters increases in the error-perturbed data. In fact, recovery for the five-cluster data sets is excellent despite the added error. Even more convincing are the results for the Jaccard and Fowlkes and Mallows measures. Both statistics exhibit a clear tendency to decline in value as the number of clusters increases in the random noise data. This pattern is exactly opposite to the one seen in the error-perturbed condition, and it would seem to suggest that the increasing trend is not an artifact of the statistics.

TABLE VII
RECOVERY VALUES FOR THE NUMBER OF DIMENSIONS FACTOR: ERROR-FREE AND ERROR-PERTURBED CONDITIONS

                              Error Free            Error Perturbed
Index          Method     4      6      8         4      6      8
Rand              1     .974   .964   .984      .794   .832   .878
                  2     .990   .995   .999      .878   .947   .936
                  3     .998   .997  1.000      .911   .949   .941
                  4     .979   .984   .999      .899   .943   .935
Corrected Rand    1     .940   .916   .964      .584   .642   .720
                  2     .979   .988   .998      .745   .892   .847
                  3     .994   .989  1.000      .812   .892   .838
                  4     .958   .966   .998      .791   .884   .867
Jaccard           1     .956   .959   .984      .715   .785   .851
                  2     .985   .992   .998      .784   .921   .903
                  3     .997   .995  1.000      .844   .929   .924
                  4     .969   .981   .998      .814   .911   .909
Fowlkes and       1     .976   .976   .991      .831   .873   .913
Mallows           2     .991   .995   .999      .866   .952   .942
                  3     .998   .998  1.000      .908   .958   .956
                  4     .980   .988   .999      .884   .946   .942

Note: Each mean is based on a sample size of 36 data sets.

Interestingly, when the two-dimensional noise condition is considered, the best recovery was obtained with the two cluster data sets. The corrected Rand, Jaccard, and Fowlkes and Mallows indexes indicate that the poorest recovery was obtained when five clusters were present. This result suggests that the two-dimensional error condition was qualitatively different from the error-perturbed data. This view will be considered further in the discussion.

Before leaving the number of clusters factor, it is interesting to consider the behavior of the four recovery statistics with the random noise data. The corrected Rand index produced the lowest baseline values, and hence it possesses the greatest overall variability. A particularly curious result can be seen at the level of two clusters. The baseline means for the Rand statistic are uniformly lower than the means for the Fowlkes and Mallows statistic, thus indicating greater variance for the Rand measure. This occurred despite the authors' intention of deriving a measure with greater variance [8].

Tables VII and VIII present the recovery means for the factor involving the number of dimensions. Again, when considering the comparative rank orders, all four statistics are fairly consistent. The agreement is greater between the Rand and corrected Rand statistics and between the Jaccard and Fowlkes and Mallows indexes.

The random noise condition indicates that all four statistics were unaffected by the number of dimensions present in the data. This information is useful when evaluating the error-free, error-perturbed, and two-dimensional noise conditions. In all conditions, the lowest recovery means were obtained for data sets consisting of four dimensions. For the error-free and the two-dimensional noise conditions, recovery gradually improved as one moved from four to six to eight dimensions. In general, slightly higher recovery values were obtained in the six-dimensional than in the eight-dimensional data sets for the error-perturbed condition. However, the recovery means in


TABLE VIII
RECOVERY VALUES FOR THE NUMBER OF DIMENSIONS FACTOR:
TWO-DIMENSIONAL NOISE AND RANDOM NOISE CONDITIONS

                            Two-Dimensional Noise    Random Noise
Index                Method   4     6     8          4     6     8

Rand                   1    .768  .846  .914       .439  .445  .425
                       2    .764  .809  .906       .568  .564  .567
                       3    .831  .909  .968       .558  .532  .525
                       4    .787  .864  .914       .568  .570  .571

Corrected Rand         1    .533  .664  .781       .001  .020 -.003
                       2    .470  .563  .779       .041  .040  .044
                       3    .622  .800  .915       .033  .026  .019
                       4    .521  .701  .812       .039  .041  .043

Jaccard                1    .637  .745  .841       .361  .374  .367
                       2    .499  .566  .772       .237  .236  .233
                       3    .629  .787  .920       .259  .278  .297
                       4    .540  .701  .809       .225  .229  .226

Fowlkes and Mallows    1    .775  .850  .908       .557  .573  .570
                       2    .640  .694  .839       .374  .374  .370
                       3    .741  .868  .951       .399  .430  .453
                       4    .677  .805  .878       .360  .364  .359

Note. Column headings give the number of dimensions.

TABLE IX
WITHIN-CELL STANDARD DEVIATIONS

                                        Error Condition
                            Error   Error       Two-Dimensional   Random
Index                Method  Free   Perturbed       Noise          Noise

Rand                   1    .088     .200           .163           .129
                       2    .031     .132           .143           .062
                       3    .010     .123           .120           .075
                       4    .068     .140           .141           .065

Corrected Rand         1    .192     .400           .325           .042
                       2    .062     .280           .308           .03'
                       3    .034     .277           .263           .039
                       4    .136     .282           .296           .035

Jaccard                1    .101     .224           .227           .165
                       2    .048     .185           .286           .109
                       3    .014     .158           .251           .141
                       4    .087     .192           .263           .109

Fowlkes and Mallows    1    .059     .136           .143           .143
                       2    .029     .120           .217           .138
                       3    .007     .097           .179           .163
                       4    .056     .125           .138           .139

Note. Each standard deviation was based on 108 observations.

TABLE X
PAIRWISE CORRELATIONS BETWEEN INDEXES

Index                     Rand    Corrected Rand    Fowlkes and Mallows

Corrected Rand            .966
Fowlkes and Mallows       .859        .901
Jaccard                   .908        .935                .987

Note. Each correlation was based on a sample size of 1,728.

the six- and eight-dimensional data sets are quite close and not significantly different.

The final results, which are presented in Table X, are the pairwise correlations for the four recovery measures computed across all data sets and methods. All correlations are substantial and indicate that each pair of external criteria share between 74 and 97 percent common variance. The highest correlations are between the Rand and corrected Rand measures (0.966) and between the Jaccard and Fowlkes and Mallows indexes (0.987). As was found in Tables III-VIII, these two pairs of indexes produced rankings which were more similar to each other than to members of the other pair.

DISCUSSION

The results for the point density factor are in general agreement with previous research [12]. When the clusters are of equal size, Ward's method gives the best recovery. This characteristic is consistent with its observed tendency to produce clusters of approximately equal size, and it may be a function of the minimum variance criterion invoked by the algorithm. The group average method, which invokes a mean instead of a variance criterion, appears to be more suitable when the clusters are of unequal size.

The algorithms showed a clear tendency to give improved recovery as the number of clusters increased in the error-free and error-perturbed conditions. This is a logical consequence of the operation of the algorithms. Specifically, the algorithms begin the clustering process by placing all points into single element clusters. The algorithms then merge the most similar pair of clusters together. If an error is made, then it cannot be corrected later in the hierarchy. That is, all cluster merges are permanent. If an error is made, say at the level of four clusters, then the error will still be present at the level of two clusters. Thus, recovery at five clusters may be better than at four, three, or two clusters because the algorithms have had fewer opportunities to make merger errors.
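The permanence of agglomerative merges can be made concrete with a minimal sketch (our own illustration, not the authors' code). The function below performs single link agglomeration; once a merge is executed, no later step can undo it, so a mistaken merge at one level persists at every coarser level of the hierarchy.

```python
# Illustrative sketch (not from the paper): greedy single link agglomeration.
# Each pass merges the closest pair of clusters; merges are never undone,
# so an early error remains embedded in every later partition.
import itertools
import math

def agglomerate(points, k):
    """Reduce the points to k clusters by repeatedly merging the closest pair."""
    clusters = [[p] for p in points]  # begin with single-element clusters
    while len(clusters) > k:
        # single link distance: closest pair of points between two clusters
        a, b = min(itertools.combinations(range(len(clusters)), 2),
                   key=lambda ab: min(math.dist(p, q)
                                      for p in clusters[ab[0]]
                                      for q in clusters[ab[1]]))
        clusters[a] += clusters.pop(b)  # permanent merge (b > a, so index a is safe)
    return clusters
```

For well-separated groups, stopping the loop early (larger k) leaves the algorithm fewer chances to commit a bad merge, which mirrors the recovery pattern described above.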

It should be noted that most statistical procedures become more accurate as the sample size increases. Yet, since the overall sample size was fixed, the preceding result indicates that recovery improved as the clusters became smaller (see Table I). One of the error conditions, the two-dimensional noise condition, indicated that recovery was better for the two cluster data sets than for those containing a larger number of clusters. The nature of this error condition may have required larger cluster sizes for better recovery. For the error-perturbed data condition, the error involved adding small amounts of noise to each dimensional comparison when computing the distance. Although the point would have moved locations, it would not have moved very far. On the other hand, in the two-dimensional noise condition, each set of true cluster coordinates was augmented with an additional set of coordinates which were random noise. The added coordinates could have had the effect of moving the point to another region of the hyperspace and away from its original cluster. The overall lower recovery means for the condition in Tables IV, VI, and VIII support the view that this error disturbance process was more severe than the error-perturbed condition. Thus, when the error level becomes as severe as in the two-dimensional condition, larger cluster sizes may be necessary for accurate recovery. Further research will be necessary to resolve this issue.
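The two disturbance types contrasted above can be sketched schematically (our own illustration; the paper's actual noise distributions and magnitudes are not reproduced here). Small coordinate jitter bounds how far a point can move, whereas appended noise dimensions can relocate it anywhere along the new axes.

```python
# Schematic contrast of the two error disturbances (illustration only).
import random

rng = random.Random(1)

def error_perturb(point, scale=0.05):
    """Jitter each true coordinate slightly: the point barely moves."""
    return [x + rng.uniform(-scale, scale) for x in point]

def add_noise_dims(point, extra=2):
    """Append pure-noise coordinates: in the enlarged space the point
    may land far from the rest of its cluster."""
    return list(point) + [rng.uniform(0.0, 1.0) for _ in range(extra)]
```

Under the first disturbance a point's displacement is bounded by the jitter scale; under the second, the displacement along the added axes is unbounded by cluster structure, which is consistent with the lower recovery means for the two-dimensional noise condition.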


In general, recovery was found to increase as the number of dimensions increased. This result follows logically from the construction of the data sets. On the first dimension of the variable space, cluster boundaries were not allowed to overlap. Although overlap was allowed on all remaining dimensions, it is likely that each successive dimension contained some information as to the true cluster structure. Thus, as more dimensions were added, more information as to the true structure was available to the algorithms. Given this characteristic of the data, it is not surprising to find enhanced recovery with increased dimensionality. However, it should be noted that the addition of useless dimensions, as in the two-dimensional noise error condition, can have a serious effect on recovery performance.

It was refreshing to find a high degree of consistency among the external criterion measures. In those cases where discrepant rankings were found, such as the 60 percent density level for the error-perturbed condition in Table III, the actual means were fairly close, and they probably reflected equivalent recovery on the part of the algorithms.

The high similarity between the Jaccard index and the Fowlkes and Mallows measure is perhaps a function of the fact that both statistics use the same numerator term. The term represents the number of pairs of points which were properly clustered together by the algorithm. Likewise, the similarity of the Rand to the corrected Rand measure is reasonable. The corrected Rand statistic represents nothing more than an adjustment to the original Rand index for the marginal frequencies.

In terms of recommending an external criterion for future research, it would seem reasonable to select one index from each of the two highly similar pairs. The enhanced variability of the corrected Rand statistic seems to be desirable. Certainly, the fact that the statistic can assume values between 0.0 and 1.0 for real life data provides the index with strong intuitive appeal for applied researchers. Similarly, since the Jaccard index exhibited a wider range of variance than the Fowlkes and Mallows measure, the Jaccard index may be the best choice from this set. However, other characteristics which were not fully explored in the present work may necessitate a change in this recommendation.

The use of simulated random noise data sets to generate baseline rates for the recovery measures seems advisable. A few attempts have been made to derive characteristics of the sampling distribution of the external criteria. For example, DuBien and Warde [6] have derived expressions for the mean and variance of the Rand statistic under a set of "reasonable assumptions." In fact, at least one of the assumptions can be considered unreasonable. In order to simplify the derivation process, the authors assumed that all possible partitions of N objects into K clusters were equally likely. However, clustering algorithms will almost always place the two points nearest to each other in the same cluster and place the two most distant points in different clusters. This will hold even for random noise data which contain no cluster structure. This result makes the assumption of equally likely partitions unreasonable. Such an assumption ignores what is occurring in the clustering process, and it can be seriously misleading to applied researchers [14]. Thus, until such derivations are conducted under more appropriate conditions, the use of simulated random noise data sets to produce baseline rates seems advisable.

TABLE XI
COMPUTATIONAL FORMULAS FOR THE EXPRESSIONS IN TABLE II

Each entry gives a count of point pairs, classified by whether the pair falls in the same cluster of the algorithm solution and of the criterion solution.

Pair in same cluster in both solutions:
$\sum_i \sum_j N_{ij}^2/2 - N_{..}/2$

Pair in same algorithm cluster, not in same criterion cluster:
$\sum_i N_{i.}^2/2 - \sum_i \sum_j N_{ij}^2/2$

Pair not in same algorithm cluster, in same criterion cluster:
$\sum_j N_{.j}^2/2 - \sum_i \sum_j N_{ij}^2/2$

Pair not in same cluster in either solution:
$\sum_i \sum_j N_{ij}^2/2 + N_{..}^2/2 - \sum_i N_{i.}^2/2 - \sum_j N_{.j}^2/2$
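The random-noise baseline procedure recommended above can be sketched as follows (our own illustration under stated assumptions; `cluster_fn` is a placeholder for any clustering routine and is not a procedure from the paper). Structureless data are generated, clustered, and scored against an arbitrary labeling, and the average index value serves as the empirical chance baseline.

```python
# Illustrative sketch (not the authors' code): empirical baseline rates for a
# recovery index, obtained by clustering pure random noise rather than by
# assuming all partitions are equally likely.
import random

def rand_index(labels_a, labels_b):
    """Rand (1971) index: proportion of point pairs on which two partitions agree."""
    n = len(labels_a)
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i in range(n) for j in range(i + 1, n))
    return agree / (n * (n - 1) / 2)

def noise_baseline(cluster_fn, n_points, k, dims=4, trials=50, seed=0):
    """Average index value when cluster_fn is applied to structureless data."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        data = [[rng.random() for _ in range(dims)] for _ in range(n_points)]
        truth = [rng.randrange(k) for _ in range(n_points)]  # arbitrary "criterion"
        total += rand_index(cluster_fn(data, k), truth)
    return total / trials
```

Comparing an algorithm's recovery score on real data against such a baseline guards against indexes whose chance-level values are far from zero, as the Rand results in Tables VI and VIII illustrate.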

APPENDIXCOMPUTATIONAL FORMULAS

In order to provide computational formulas for the four external indexes, let us define the criterion solution to be the true clustering of the data. Define $N_{ij}$ as the number of points in cluster $i$ as produced by the algorithm which are also in cluster $j$ of the criterion solution. Let $N_{i.}$, $N_{.j}$, and $N_{..}$ represent the typical marginal and grand totals. Given these definitions, the indexes are computed in the following ways.

Rand:
\[
\Bigl[\sum_i \sum_j N_{ij}^2 + N_{..}(N_{..}-1)/2 - \sum_i N_{i.}^2/2 - \sum_j N_{.j}^2/2\Bigr] \Big/ \bigl[N_{..}(N_{..}-1)/2\bigr]. \tag{A1}
\]

Corrected Rand:
\[
\Bigl[\sum_i \sum_j N_{ij}^2 - \bigl(\textstyle\sum_i N_{i.}^2 \sum_j N_{.j}^2\bigr)/N_{..}^2\Bigr] \Big/ \Bigl[\sum_i N_{i.}^2/2 + \sum_j N_{.j}^2/2 - \bigl(\textstyle\sum_i N_{i.}^2 \sum_j N_{.j}^2\bigr)/N_{..}^2\Bigr]. \tag{A2}
\]

Jaccard:
\[
\Bigl[\sum_i \sum_j N_{ij}^2 - N_{..}\Bigr] \Big/ \Bigl[\sum_i N_{i.}^2 + \sum_j N_{.j}^2 - \sum_i \sum_j N_{ij}^2 - N_{..}\Bigr]. \tag{A3}
\]

Fowlkes and Mallows:
\[
\Bigl[\sum_i \sum_j N_{ij}^2 - N_{..}\Bigr] \Big/ \sqrt{\Bigl(\sum_i N_{i.}^2 - N_{..}\Bigr)\Bigl(\sum_j N_{.j}^2 - N_{..}\Bigr)}. \tag{A4}
\]

Table XI provides computational formulas for the quantities presented in Table II. When using Table XI to determine (A1)-(A4), algebraic simplification may be necessary. For the corrected Rand index, the formula for the correction factor $N_c$ indicated in (2) is given in (A5). If the clustering of the data matches the agreement expected by chance, then the corrected Rand index is equal to zero.

\[
N_c = \bigl(\textstyle\sum_i N_{i.}^2 \sum_j N_{.j}^2\bigr)/N_{..}^2 + N_{..}(N_{..}-1)/2 - \sum_i N_{i.}^2/2 - \sum_j N_{.j}^2/2. \tag{A5}
\]
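As a check on the algebra, (A1)-(A4) can be computed directly from the contingency table. The sketch below is our own hedged translation of the formulas; it takes N as a list of rows with N[i][j] the count of points in algorithm cluster i and criterion cluster j.

```python
# Sketch of (A1)-(A4): the four external indexes from a contingency table.
import math

def recovery_indexes(N):
    n = sum(sum(row) for row in N)                       # N.. (grand total)
    t = sum(x * x for row in N for x in row)             # sum of N_ij squared
    r = sum(sum(row) ** 2 for row in N)                  # sum of N_i. squared
    c = sum(sum(row[j] for row in N) ** 2
            for j in range(len(N[0])))                   # sum of N_.j squared
    pairs = n * (n - 1) / 2
    rand = (t + pairs - r / 2 - c / 2) / pairs                           # (A1)
    corrected = (t - r * c / n ** 2) / (r / 2 + c / 2 - r * c / n ** 2)  # (A2)
    jaccard = (t - n) / (r + c - t - n)                                  # (A3)
    fowlkes_mallows = (t - n) / math.sqrt((r - n) * (c - n))             # (A4)
    return rand, corrected, jaccard, fowlkes_mallows
```

A perfect match, e.g. N = [[3, 0], [0, 3]], yields 1.0 on all four indexes, while the corrected Rand falls to zero whenever agreement equals its chance expectation.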

REFERENCES

[1] R. K. Blashfield, "Mixture model tests of cluster analysis: Accuracy of four agglomerative hierarchical methods," Psychol. Bull., vol. 83, pp. 377-388, May 1976.

[2] M. Childress, "Statistics for evaluating classifications: A new view," presented at the Meeting of the Classification Soc., Toronto, Ont., Canada, June 1981.


[3] J. Cohen, "A coefficient of agreement for nominal scales,"Educ. Psychol. Measurement, vol. 20, pp. 37-46, Jan. 1960.

[4] R. M. Cormack, "A review of classification," J. Royal Statist. Soc., ser. A, vol. 134, pp. 321-367, 1971.

[5] M. Downton and T. Brennan, "Comparing classifications: An evaluation of several coefficients of partition agreement," presented at the Meeting of the Classification Soc., Boulder, CO, June 1980.

[6] J. L. DuBien and W. D. Warde, "Some distributional resultsconcerning a comparative statistic used in cluster analysis," pre-sented at the Meeting of American Statist. Ass., Detroit, MI, Aug.1981.

[7] B. Everitt, Cluster Analysis. New York: Wiley, 1975.

[8] E. B. Fowlkes and C. L. Mallows, "A new measure of similarity between two hierarchical clusterings and its use in studying hierarchical clustering methods," presented at the Meeting of the Classification Soc., Boulder, CO, June 1980.

[9] F. K. Kuiper and L. A. Fisher, "A Monte Carlo comparison ofsix clustering procedures," Biometrics, vol. 31, pp. 777-783,1975.

[10] G. W. Milligan, "An examination of the effect of six types of error perturbation on fifteen clustering algorithms," Psychometrika, vol. 45, pp. 325-342, Sept. 1980.

[11] -, "A Monte Carlo study of thirty internal criterion measuresfor cluster analysis," Psychometrika, vol. 46, pp. 187-199, June1981.

[12] -, "A review of Monte Carlo tests of cluster analysis," Multivariate Behavioral Res., vol. 16, pp. 379-407, July 1981.

[13] G. W. Milligan and P. D. Isaac, "The validation of four ultrametric clustering algorithms," Pattern Recognition, vol. 12, pp. 41-50, Mar. 1980.

[14] G. W. Milligan and V. Mahajan, "A note on procedures for testing the quality of a clustering of a set of objects," Decision Sci., vol. 11, pp. 669-677, Oct. 1980.

[15] L. Morey, "Comparison of clustering techniques in a validationframework," presented at the Meeting of the Classification Soc.,Toronto, Ont., Canada, June 1981.

[16] W. M. Rand, "Objective criteria for the evaluation of clustering methods," J. Amer. Statist. Ass., vol. 66, pp. 846-850, 1971.

[17] P. H. A. Sneath, "Evaluation of clustering methods," in NumericalTaxonomy, A. J. Cole, Ed. New York: Academic, 1969.

Glenn W. Milligan received the B.A. and M.A. degrees in psychology from the University of Southern California, Los Angeles, and California State University, Long Beach, in 1971 and 1974, respectively, and the Ph.D. degree in quantitative psychology from the Ohio State University, Columbus, in 1978.

Since 1978 he has been an Assistant Professor in the Faculty of Management Sciences, Ohio State University. He has teaching and research interests in cluster analysis, multidimensional scaling, experimental design, and multidimensional contingency tables.

S. C. Soon received the B.S. degree from National Chiao Tung University, Taiwan, Republic of China, in 1975.

He is currently in the Ph.D. program in Management Sciences at the Ohio State University, Columbus. His areas of interest are in management information systems and statistics.

Lisa M. Sokol received the B.A. degree from the State University of New York, Stony Brook, in 1974, the M.S. degree from Northwestern University, Evanston, IL, in 1975, and the Ph.D. degree in industrial engineering from the University of Massachusetts, Amherst, in 1978.

From 1978 to 1980 she was an Assistant Professor at the Ohio State University, Columbus. She is currently with Flight Systems, Inc., Arlington, VA. Her research interests are in database management systems and pattern analysis.
