
Bagged Clustering

Friedrich Leisch

Working Paper No. 51, August 1999


August 1999

SFB 'Adaptive Information Systems and Modelling in Economics and Management Science'

Vienna University of Economics and Business Administration

Augasse 2–6, 1090 Wien, Austria

in cooperation with University of Vienna

Vienna University of Technology

http://www.wu-wien.ac.at/am

This piece of research was supported by the Austrian Science Foundation (FWF) under grant SFB#010 ('Adaptive Information Systems and Modelling in Economics and Management Science').


Bagged Clustering

Friedrich Leisch

Abstract: A new ensemble method for cluster analysis is introduced, which can be interpreted in two different ways: as a complexity-reducing preprocessing stage for hierarchical clustering and as a combination procedure for several partitioning results. The basic idea is to locate and combine structurally stable cluster centers and/or prototypes. Random effects of the training set are reduced by repeatedly training on resampled sets (bootstrap samples). We discuss the algorithm both from a more theoretical and an applied point of view and demonstrate it on several data sets.

Keywords: cluster analysis, bagging, bootstrap samples, k-means, learning vector quantization

I. Introduction

Clustering is an old data analysis problem and numerous methods have been developed to solve this task. Most of the currently popular clustering techniques fall into one of the following two major categories:

• Partitioning Methods
• Hierarchical Methods

Both families have in common that they try to group the data such that patterns belonging to the same group ("cluster") are as similar as possible and patterns belonging to different groups differ as strongly as possible. This definition is of course rather vague, and accordingly a lot of algorithms have been defined with respect to different notions of similarity/dissimilarity between points and/or groups of points.

In this paper we propose a novel method we call bagged clustering, which is a combination of partitioning and hierarchical methods and has, to our knowledge, not been reported in the literature before.

In recent years ensemble methods have been successfully applied to enhance the performance of unstable or weak regression and classification algorithms in a variety of ways. The two most popular approaches are probably bagging [1] and boosting [2]. We take the main idea of bagging ("bootstrap aggregating"), the creation of new training sets by bootstrap sampling, and incorporate it into the cluster analysis framework.

The rest of this paper is organized as follows: Section II gives a short introduction to partitioning and hierarchical cluster methods and discusses their respective advantages and disadvantages. Section III introduces the bagged cluster algorithm; various aspects of the algorithm are discussed in Section IV and demonstrated on several examples in Section V. Finally, some more theoretical aspects of bagged clustering are analyzed in Section VI.

The author is with the Institut für Statistik, Wahrscheinlichkeitstheorie und Versicherungsmathematik, Technische Universität Wien, Wiedner Hauptstraße 8-10/1071, A-1040 Wien, Austria. Email: [email protected]

II. Cluster Analysis Methods

A. Partitioning Methods

The standard partitioning methods are designed to find convex clusters in the data, such that each segment can be represented by a cluster center. Convex clustering (or data segmentation) is closely related to vector quantization, where each input vector is mapped onto a corresponding representative. In fact, the two can be shown to be identical, i.e., a data partition and the corresponding segment centers (with respect to a given distance measure) have a one-to-one relation where one defines the other and vice versa [3] under rather general conditions [4].

Let $\mathcal{X}_N = \{x_1, \ldots, x_N\}$ denote the data set available for training and let $\mathcal{C}_K = \{c_1, \ldots, c_K\}$ be a set of $K$ cluster centers. Further let $c(x) \in \mathcal{C}_K$ denote the center closest to $x$ with respect to some distance measure $d$. Then solving a convex clustering problem amounts to

$$\sum_{n=1}^{N} d(x_n, c(x_n)) \;\to\; \min_{\mathcal{C}_K} \qquad (1)$$

i.e., finding a set of centers such that the mean distance of a data point to the closest center is minimal. Unfortunately this problem cannot be solved directly even for simple distance measures, and iterative optimization procedures have to be used.

Usually $d$ is the Euclidean distance, such that the center of each cluster is simply the mean of the cluster and Equation (1) is the sum of the within-cluster variances. If absolute distance is used, then the correct cluster centers are the respective medians. Recently several extensions to non-Euclidean distances have been proposed [5], [6].

Popular partitioning algorithms include "classic" methods like the k-means algorithm and its online variants (which are often called hard competitive learning). More recent algorithms like the neural gas algorithm [7] or SOMs [8] also fall into this category, but add some regularization terms to Equation (1), which control the structure of the set of centers $\mathcal{C}_K$ by enforcing neighborhood topologies among the centers.
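As a concrete illustration, the following short R sketch evaluates criterion (1) for a fitted set of centers with squared Euclidean distance. It uses kmeans() from the base stats package as the partitioning method; the helper name crit1 and the toy data are ours, not part of the paper.

## Sketch: evaluate criterion (1), the summed distance of each point to its
## closest center, for a k-means fit (squared Euclidean distance assumed).
crit1 <- function(x, centers) {
  d <- as.matrix(dist(rbind(centers, x)))[-(1:nrow(centers)), 1:nrow(centers)]
  sum(apply(d, 1, min)^2)             # distance to the closest center, squared
}

set.seed(1)
x  <- matrix(rnorm(200), ncol = 2)    # toy data: 100 points in 2 dimensions
km <- kmeans(x, centers = 3)          # base partitioning method
crit1(x, km$centers)                  # essentially km$tot.withinss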

B. Hierarchical Methods

Hierarchical methods do not try to find a segmentation with a fixed number of clusters, but create solutions for K = 1, ..., N clusters. Trivially, for K = 1 the only possible solution is one big cluster consisting of the complete data set $\mathcal{X}_N$. Similarly, for K = N we have N clusters containing only one point, i.e., each point is its own cluster. In between, a hierarchy of clusters is created by repeatedly joining the two "closest" clusters until the complete data set forms one cluster (agglomerative clustering); another method is to repeatedly split clusters (divisive clustering). We only consider agglomerative methods below.

First a dissimilarity matrix D containing the pairwise distances $d(x_n, x_m)$ between all data points is computed; any distance measure may be used. Then one needs a method for applying the distance to complete clusters, i.e., for measuring the distance between two sets of points A and B. As this is used for joining (or linking) clusters, these methods are often referred to as linkage methods [9]. Popular linkage methods include

Single linkage: Distance between the two closest points of the clusters,
$$d(A, B) = \min_{a \in A,\, b \in B} d(a, b),$$
resulting in non-convex, chain-like cluster structures.

Ward's minimum variance: Tries to find compact, spherical clusters by using the distance
$$d(A, B) = \frac{2\,|A|\,|B|}{|A| + |B|}\, \|\bar{a} - \bar{b}\|^2,$$
where $|\cdot|$ denotes the size of a set, $\bar{a}$ the mean of set A, and $\|\cdot\|$ the Euclidean norm.

Other linkage methods like average or complete linkage are not listed, because they are not used in the experiments below. The result of hierarchical clustering is typically presented as a dendrogram, i.e., a tree whose root represents the one-cluster solution (complete data set) and whose leaves are the single data points. The heights of the branches correspond to the distances between the clusters.

Usually there is no "correct" combination of distance and linkage method. Clustering in general, and hierarchical clustering in particular, should be seen as exploratory data analysis, and different combinations may reveal different features of the data set. See standard textbooks on multivariate data analysis for details [10].
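In R, the two linkage methods used later in this paper are available through the base functions dist() and hclust(). The small sketch below (toy data and object names are ours) shows the typical workflow of computing the dissimilarity matrix, building the dendrogram and cutting it into a fixed number of clusters:

## Sketch: agglomerative clustering with single linkage and Ward's method
## on a small toy data set (base R only).
set.seed(2)
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 4), ncol = 2))   # two well-separated groups

D <- dist(x)                                 # pairwise distances, O(N^2) entries
hc_single <- hclust(D, method = "single")    # single linkage
hc_ward   <- hclust(D, method = "ward.D2")   # Ward's criterion ("ward" in older R versions)

plot(hc_single)                              # dendrogram: root = 1 cluster, leaves = points
cutree(hc_ward, k = 2)                       # partition obtained by cutting the dendrogram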

C. Problems of Classic Methods

Both partitioning and hierarchical cluster methods have particular strengths and weaknesses. Hierarchical methods provide solutions for K = 1, ..., N clusters which are compatible in the sense that a solution with fewer clusters is obtained by joining some clusters; hence clusters at a finer resolution are simply subgroups of the bigger clusters. Hierarchical methods are also more flexible in the sense that they can more easily be adapted to distance measures other than the usual metric distances (Euclidean, absolute). The greatest weakness of hierarchical methods is the computational effort involved. The input typically consists of a distance matrix between all data points, which is of size O(N^2). In each iteration all clusters have to be compared in order to join the closest two, resulting in long runtimes. This makes hierarchical methods infeasible for large data sets.

Partitioning methods, especially online algorithms, scale much better to large data sets. Solutions for different numbers of clusters need not be nested, such that they often cannot easily be compared, but they are more flexible at different resolution levels. However, they are not as flexible as hierarchical methods with respect to distance measures etc. Also, all partitioning methods are iterative stochastic procedures and depend very much on initialization. Running the K-means algorithm twice with different starting points on the same data set may result in two different solutions. There is also the open problem of choosing the "correct" number of clusters. Many different indices have been developed for this model selection task, but none has yet been globally accepted [11].

III. The Bagged Cluster Algorithm

In this section we introduce a novel clustering algorithm combining partitioning and hierarchical methods. The central idea is to stabilize partitioning methods like K-means or competitive learning by repeatedly running the cluster algorithm and combining the results. K-means is an unstable method in the sense that in many runs it will not find the global optimum of the error function but only a local optimum. Both the initialization and small changes in the training set can have a big influence on the actual local minimum where the algorithm converges, especially when the correct number of clusters is unknown. By repeatedly training on new data sets one gets different solutions which should, on average, be independent of training set influence and random initializations.

We can obtain a collection of training sets by sampling from the empirical distribution of the original data, i.e., by bootstrapping. We then run any partitioning cluster algorithm, called the base cluster method below, on each of these training sets. However, we are left with the typical problem when one obtains several cluster results: there is no obvious way of choosing the "correct" one (when they partition the input space differently but have similar error) or of combining them.

In [12] the authors propose a voting scheme for cluster algorithms. The voting proceeds by pairwise comparison of clusters, measuring their similarity by the number of shared points. The combined clustering provides a fuzzy partition of the data.

We propose to combine the cluster results by hierarchical clustering, i.e., the results of the base methods are combined into a new data set which is then used as input for a hierarchical method. The bagged clustering algorithm works as follows:

1. Construct B bootstrap training samples $\mathcal{X}_N^1, \ldots, \mathcal{X}_N^B$ by drawing with replacement from the original sample $\mathcal{X}_N$.

2. Run the base cluster method (K-means, competitive learning, ...) on each set, resulting in $B \times K$ centers $c_{11}, c_{12}, \ldots, c_{1K}, c_{21}, \ldots, c_{BK}$, where K is the number of centers used in the base method and $c_{ij}$ is the j-th center found using $\mathcal{X}_N^i$.

3. Combine all centers into a new data set $\mathcal{C}^B = \mathcal{C}^B(K) = \{c_{11}, \ldots, c_{BK}\}$.

4. (Optional) Prune the set $\mathcal{C}^B$ by computing the partition of $\mathcal{X}_N$ with respect to $\mathcal{C}^B$ and removing all centers where the corresponding cluster is empty (or below a predefined threshold $\nu$), resulting in the new set
$$\mathcal{C}^B_{\text{prune}}(K, \nu) = \left\{ c \in \mathcal{C}^B(K) \;\middle|\; \#\{x : c = c(x)\} \geq \nu \right\}.$$
We also make all members of $\mathcal{C}^B_{\text{prune}}(K, \nu)$ unique, i.e., remove duplicates.

5. Run a hierarchical cluster algorithm on $\mathcal{C}^B$ (or $\mathcal{C}^B_{\text{prune}}$), resulting in the usual dendrogram.

6. Let $c(x) \in \mathcal{C}^B$ denote the center closest to x. A partition of the original data can now be obtained by cutting the dendrogram at a certain level, resulting in a partition $\mathcal{C}^B_1, \ldots, \mathcal{C}^B_m$, $1 \leq m \leq BK$, of the set $\mathcal{C}^B$. Each point $x \in \mathcal{X}_N$ is now assigned to the cluster containing $c(x)$.
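These steps translate almost directly into R. The following is a minimal sketch, not the reference implementation: the function name bagged_clust and its arguments are ours, kmeans() serves as base method, hclust() performs the combination step, the assignment in step 6 assumes Euclidean distance, and pruning (step 4) is omitted for brevity.

## Minimal sketch of bagged clustering (steps 1-3, 5, 6; pruning omitted).
## Base method: k-means; combination: hierarchical clustering of all B*K centers.
bagged_clust <- function(x, B = 10, k = 20, m = 3, link = "single") {
  x <- as.matrix(x)
  N <- nrow(x)
  ## Steps 1 + 2: bootstrap samples, base cluster method on each
  centers <- do.call(rbind, lapply(seq_len(B), function(b) {
    xb <- x[sample(N, N, replace = TRUE), , drop = FALSE]
    kmeans(xb, centers = k)$centers
  }))
  ## Step 5: hierarchical clustering of the combined set of centers
  hc <- hclust(dist(centers), method = link)
  ## Step 6: cut the dendrogram into m groups of centers and assign each
  ## original data point to the group of its closest center
  center_group <- cutree(hc, k = m)
  nearest <- apply(x, 1, function(p) which.min(colSums((t(centers) - p)^2)))
  list(centers = centers, hclust = hc, cluster = center_group[nearest])
}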

Example

We take a simple 2-dimensional example from [12] to demonstrate the algorithm. The data set consists of 3900 points in 3 clusters, as shown in Figure 1. The authors call this example "Cassini" because the shape of the two big clusters is a Cassini curve.

Fig. 1. Cassini problem: local minimum (left) and global minimum (right) of the K-means algorithm.

Standard K-means clustering cannot find the structure in the data completely because the outer clusters are not convex. Hence, even when we start K-means with the true centers (the mean values of the three groups), such that the algorithm converges immediately, we make some errors at the edges (left plot in Figure 1); however, this error is only very small. The real problem is that the "true cluster partition" is only a local minimum of error function (1). The minimum error solution found in 1000 independent repetitions of the K-means algorithm splits one of the large clusters into two parts and ignores the small cluster in the middle (right plot in Figure 1). Note that we used the correct number of clusters, information which is typically not available for real-world problems.

Fig. 2. Cassini problem: 200 centers placed by bagged clustering (left) and final solution (right) obtained by combining the 200 centers using hierarchical clustering.

We now apply our bagged cluster algorithm to this data using B = 10 bootstrap training samples and K-means as base method with K = 20 centers in each run. The left plot in Figure 2 shows the resulting 200 centers. We then perform hierarchical clustering (Euclidean distance, single linkage) on these 200 points. The three-cluster partition can be seen in the right plot in Figure 2; it recovers the three clusters without error.
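With the sketch from Section III, this experiment would look roughly as follows (cassini is assumed to be a matrix holding the 3900 two-dimensional points; it is not part of base R):

## Reproducing the setting of Figure 2 with the sketch from Section III.
bc <- bagged_clust(cassini, B = 10, k = 20, m = 3, link = "single")
plot(bc$centers, col = cutree(bc$hclust, k = 3))  # the 200 centers in 3 groups
table(bc$cluster)                                 # sizes of the final clusters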

IV. Discussion

A. Number of Clusters

By inspection of the dendrogram from the hierarchical clustering one can in many cases infer the appropriate number of clusters for a given data set. The dendrogram corresponding to single linkage for the Cassini example from above is shown in Figure 3; it clearly indicates that 3 clusters are present in the data. The lower plot shows the relative height at which the next split after the current one occurs for 1, ..., 20 clusters (black line). The grey line shows the first differences of the black line. This value is large for splits into two well-separated subtrees; the differences are small if another split follows shortly after the respective split. Note that we did not need to use the correct number of clusters in the bagged clustering algorithm; it can be inferred from the dendrogram.
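The merge heights stored in the hclust object give a simple numeric version of this inspection. The sketch below (our helper name; bc is assumed to be the result of the bagged_clust() sketch above) lists, for k = 1, ..., 20 clusters, the relative height of the next split together with its first differences, corresponding to the black and grey lines of Figure 3:

## Sketch: relative height of the next split for k = 1, ..., 20 clusters and
## the first differences of these heights (large values suggest a good k).
split_heights <- function(hc, kmax = 20) {
  h <- sort(hc$height, decreasing = TRUE)[seq_len(kmax)]
  h / max(h)
}
rel <- split_heights(bc$hclust)
cbind(k = 1:20, rel.height = rel, first.diff = c(-diff(rel), NA))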

B. Preprocessing for Hierarchical Clustering

The Cassini example above could also be solved by direct hierarchical clustering with single linkage of the original data set. However, the number of data points, N = 3900, is too large; the distance matrix alone has more than 15 million entries. Of course the data structure can easily be represented by fewer samples; if we take a subsample of size 100 and cluster it hierarchically, we get the same solution as with bagged clustering. The correct number of clusters can also be inferred from the dendrogram.

Fig. 3. Cassini problem: hierarchical clustering of 200 bagged cluster centers using single linkage.

However, taking a subsample of the original data has the disadvantage that possibly valuable information gets lost. Hence, the complexity reduction (reduction of data set size) should take all data into account. This can be done by vector quantization techniques, i.e., by using a partitioning cluster method such as K-means with a large K. The quantization performs a smoothing of the original data; noise and outliers are removed.

Bagged clustering differs from this standard textbook approach in that far fewer centers are used in the partitioning step, i.e., the smoothing (in a single run) is much stronger (see also the first part of Theorem 2 below). However, new variation is introduced by the use of bootstrap samples, such that the overall result is less dependent on random fluctuations in the training set and on the random seed (initial centers) of the base method.

C. Combination of Independent Cluster Results

The second view of (and initial motivation for) bagged clustering is that it offers a method for combining several outcomes of partitioning methods. Partitioning methods are usually iterative optimization techniques which can easily get stuck in local minima and depend heavily on starting conditions. Intuitively, a researcher will trust a certain outcome much more if it is reproducible, i.e., if different restarts of the algorithm produce the same result and the resulting centers are always the same (or at least close).

Ideally one would use independent training sets for each repetition in order to become independent of a particular training set. With the same motivation as in bagged regression or classification [1], we replace the (in practice typically unavailable) training sets drawn independently from the data generating distribution by bootstrap samples and run the base method on each. One then has to check whether several of the resulting centers are close to each other and group them accordingly, which is exactly what hierarchical clustering has been designed for.

D. Bagged Clustering for Very Large Data Sets

As the prices of data storage devices (hard disks, ...) have decreased dramatically during the last decade, more and more data get stored. E.g., supermarkets or telephone companies routinely log all consumer transactions, such that the corresponding data sets easily reach the Gigabyte range. For such very large data sets even the most basic calculations like the sample mean or variance are computationally very intensive, and many standard statistical approaches become as infeasible as they were 30 years ago (back then for small to moderate sample sizes).

Committee methods offer a viable way of adapting standard algorithms to scale well as the number of samples increases. In data mining situations it is no problem to get several independent training sets: one simply samples several sets $\mathcal{X}^i_{N_1}$ from the original sample, where $N_1 \ll N$ is chosen as large as possible such that each set can still be handled with reasonable computational effort. Finally, the partial results inferred from the subsamples are combined into a final solution; in our case the combination is done by hierarchical clustering.
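For this setting only the sampling step of the earlier sketch changes: each base run sees a subsample of manageable size N1 drawn without replacement instead of a full-size bootstrap sample. A hedged variant (names again ours):

## Sketch: bagged clustering for very large data sets; each base run uses a
## subsample of size N1 < N drawn without replacement.
bagged_clust_big <- function(x, B = 10, k = 20, N1 = 10000, link = "ward.D2") {
  x <- as.matrix(x)
  centers <- do.call(rbind, lapply(seq_len(B), function(b) {
    xb <- x[sample(nrow(x), min(N1, nrow(x))), , drop = FALSE]
    kmeans(xb, centers = k)$centers
  }))
  hc <- hclust(dist(centers), method = link)
  list(centers = centers, hclust = hc)   # assign points to clusters as before
}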

V. Experiments

A. Data Sets

We have tested bagged clustering on several benchmark examples: five examples with continuous data and three examples with binary data which are related to our current research on data segmentation for tourism marketing. As base methods we used K-means and hard competitive learning. In the hierarchical step we used Euclidean distance for the continuous data and absolute distance for the binary data (these combinations worked best); the agglomeration methods were single linkage and Ward's method.

The continuous examples are:

Cassini: Three clusters in 2-dimensional space. Two bigger clusters of size 1500 each with one small cluster of size 900 in between. See Figure 1 for details. This example is taken from [12].

Quadrants: Four clusters in 3-dimensional space. Three large quadrant-shaped clusters of size 500 are located around a smaller cube-shaped cluster of size 200. This example is also taken from [12].


TABLE I
Scenario 1: Symmetric distribution of 0s and 1s.

          x1-x3   x4-x6   x7-x9   x10-x12      m
Type 1    high    high    low     low       1000
Type 2    low     low     high    high      1000
Type 3    low     high    high    low       1000
Type 4    high    low     low     high      1000
Type 5    low     high    low     high      1000
Type 6    high    low     high    low       1000

2 Spirals: This example is often used as a classification benchmark and consists of 2 spiral-shaped clusters. As this is a hard problem even for supervised learners, we increased the size of the training set considerably to make the example feasible for unsupervised learners, such that both spirals contain 2000 points each. A similar example with 3 spirals has been used in [13] as a cluster benchmark.

Iris: Edgar Anderson's Iris data, 150 4-dimensional observations on 3 species of iris (setosa, versicolor, and virginica).

Segmentation: Image data drawn randomly from a database of 7 outdoor images. Each instance gives 19 statistics (centroids, densities, saturation, ...) of a 3 x 3 pixel region. The size of the complete data set is 2310 (330 per class). This example was taken from the UCI repository of machine learning databases at http://www.ics.uci.edu/~mlearn/.

The binary examples are taken from a larger collection of data scenarios from tourism marketing [14]. These scenarios model "typical" data found in tourism marketing in a simplified manner and are the result of a joint effort between researchers from statistics and management science to create a benchmark collection for this type of data. All examples were tested with many different clustering algorithms [15], such that their characteristics are well known. All scenarios use 12-dimensional binary data with 6 clusters grouped as shown in Table I.

Scenario 1: Variables denoted as "high" are 1 with probability 0.8, "low" variables are 1 with a probability of 0.2. All clusters have size 1000.

Scenario 3: The probability for "low" is increased from 0.2 to 0.5. All clusters again have size 1000.

Scenario 5: The probability for "low" to be 1 is 0.2 as in scenario 1, but the cluster sizes are 1000, 300, 700, 3000, 500, and 500, respectively.

B. Evaluation Method

As the true clusters are known in classification problems, we can evaluate the partition provided by a cluster algorithm by comparing it with the true classes. Let $a_i$ denote the true classes of a problem and $b_j$ the clusters found by the cluster algorithm. We then associate cluster $b_j$ with class $a_i$ if the majority of points in $b_j$ is from class $a_i$. All points in $b_j$ are then classified as $a_i$. Note that the number of clusters need not necessarily be the same as the number of classes.
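A sketch of this evaluation in R (the helper name is ours; true and found are label vectors of equal length, e.g. species labels and the cluster labels returned by bagged_clust()):

## Sketch: "winner-take-all" matching of clusters to classes and the
## resulting percentage of correctly classified cases.
cluster_class_rate <- function(true, found) {
  tab <- table(found, true)                             # rows: clusters, cols: classes
  majority  <- colnames(tab)[apply(tab, 1, which.max)]  # class assigned to each cluster
  predicted <- majority[match(found, rownames(tab))]    # classify all points of a cluster
  100 * mean(predicted == as.character(true))
}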

Using this procedure, every cluster algorithm can be turned into a classification algorithm. As the class information is not used during partitioning (unsupervised learning) but only in the final step of associating clusters with classes, the classification performance of such a "clustering classifier" will usually be (much) worse than the performance of a designated classification method which uses the class labels during training. However, we can use this method to compare and measure the ability of different cluster algorithms to recover structures in data.

The "winner-take-all" matching between clusters and classes described above is of course not the only possibility. Another possibility is to compare class and cluster centers and map them according to their distance. Obviously this only makes sense for problems and cluster methods with convex clusters, where centers and segments are dual. The center-based matching is appropriate if the cluster centers themselves are used after clustering. E.g., in marketing research one is often interested in customer profiles as described by the mean values of market segments.

C. Results

All experiments were performed using the R software package for statistical computing, which is a free implementation of the S language and can be downloaded from http://www.ci.tuwien.ac.at/R. R functions for bagged clustering and corresponding graphs will soon be available on the web and can be obtained from the author upon request in the meantime.

Table II shows the results of our experiments for the continuous data sets. First we used hard competitive learning (HCL) and K-means (KMN) as benchmark algorithms (with the correct number of centers). Then we ran bagged clustering (BC) with these two base methods, both with single (s) and Ward's (w) linkage. We used 10 bootstrap samples and 20 base centers for BC and produced the correct number of clusters by cutting the tree at the respective level (before the cluster-class matching). All algorithms were run 100 times on each data set.

The first three columns of the table give the median (Med), mean and standard deviation (SD) of the correctly classified cases in percent. The last column (MErr) gives the percentage of correctly classified cases for the minimum error run, i.e., the run which minimized the internal error criterion of the algorithm over the 100 repetitions (and would hence be chosen when clustering real data). For HCL and K-means this internal error is simply the sum of within-cluster variances, which measures the size of the clusters. For bagged clustering we also use a sum of cluster sizes of the hierarchical clustering (with respect to the current linkage method), where each cluster size is given by the height of the respective branch in the dendrogram. Note that the table reports the percentage of correctly classified cases, not the internal error criterion.
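One way to read this internal criterion for bagged clustering is sketched below. This is our interpretation, not the reference code: after cutting the dendrogram into the desired number of clusters, each cluster contributes the height of its branch, i.e., the highest merge occurring inside it (zero for singletons), and these heights are summed.

## Sketch: internal error criterion for a cut of the bagged-cluster dendrogram,
## read as the sum of the branch heights of the resulting clusters.
bc_internal_error <- function(hc, k) {
  cl <- cutree(hc, k)              # cluster label (1..k) of each leaf
  n  <- length(cl)
  heights <- numeric(k)            # branch height per cluster (0 for singletons)
  node_cl <- integer(n - 1)        # cluster label of each internal node
  for (i in seq_len(n - 1)) {
    lab <- function(j) if (j < 0) cl[-j] else node_cl[j]
    l1 <- lab(hc$merge[i, 1]); l2 <- lab(hc$merge[i, 2])
    node_cl[i] <- l1
    if (l1 == l2)                  # merge happens inside one cluster of the cut
      heights[l1] <- max(heights[l1], hc$height[i])
  }
  sum(heights)
}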

The Cassini problem is "almost" solvable for HCL and K-means, with only a few errors at the edges. However, this solution is only a local minimum of the error function; the global optimum splits one of the large clusters into two parts and ignores the small one, getting approximately 77% correct. In some repetitions HCL and K-means do better, such that the mean is greater than the median and the standard deviation is rather large. Bagged clustering makes no error at all in more than half of all repetitions (the median is 100%), and we can also detect the correct solution using the MErr criterion. Even the mean value is above 98% for Ward's method and above 99% for single linkage. It is not surprising that single linkage performs better for this example, because we have non-convex and non-overlapping true clusters.

TABLE II
Results for continuous problems.

                   Med     Mean     SD     MErr
Cassini
  HCL             76.92   80.91    8.55    76.92
  KMN             76.92   79.02    8.57    76.92
  BC (HCL, s)    100.00   99.82    1.67   100.00
  BC (HCL, w)    100.00   98.47    5.48   100.00
  BC (KMN, s)    100.00   99.45    3.23   100.00
  BC (KMN, w)    100.00   98.25    5.90   100.00
Quadrants
  HCL             88.23   91.27    5.20    88.17
  KMN             88.23   88.56    2.02    88.17
  BC (HCL, s)    100.00  100.00      -    100.00
  BC (HCL, w)    100.00   98.12    4.31   100.00
  BC (KMN, s)    100.00   99.99    0.01   100.00
  BC (KMN, w)    100.00   97.92    4.46   100.00
2 Spirals
  BC (HCL, s)    100.00   87.87   19.83   100.00
  BC (KMN, s)     95.67   80.11   22.04    84.95
Iris
  HCL             89.33   89.04    0.32    89.33
  KMN             89.33   84.80    9.11    89.33
  BC (HCL, w)     91.33   91.07    0.75    91.33
  BC (KMN, w)     90.00   90.27    0.34    90.00
Segmentation
  HCL             56.62   55.57    3.05    57.40
  KMN             55.97   55.90    3.23    53.12
  BC (HCL, w)     61.23   61.28    3.22    63.16
  BC (KMN, w)     60.24   60.49    2.91    62.42

The results for the quadrants problem are similar. Again the correct solution is only a local optimum for HCL and K-means, such that the average performance of both algorithms is around 90%. The MErr solution gets 88% correct. Again bagged clustering solves the problem in more than 50% of all repetitions. Using bagged clustering with HCL and single linkage even gave the correct solution in all repetitions.

The spirals problem is of course unsolvable for HCL and K-means, as the two clusters cannot even be approximated by convex sets. For the same reason we use only single linkage for bagged clustering. Both clusters are very long and thin and rather close together. Hence, we need many support points for a successful single linkage and used K = 100 centers for the base methods. Using HCL as base method yields excellent results with a median of 100%. K-means is not as good, but still gives competitive results.

The two real-world data sets (Iris, segmentation) both have overlapping classes, hence only Ward's linkage is used, as single linkage is not appropriate for overlapping clusters (the clusters would be joined immediately). The Iris data set is rather small (150 cases), hence we use fewer centers in the base method (K = 10). The segmentation data are larger and high-dimensional; here K = 30 gave good performance. For both data sets bagged clustering increases the number of correctly classified cases.

For the binary data scenarios we are not only interested in statistics of the number of correctly classified cases, but also in how often the correct group profiles (cluster centers) were found. Each center is converted to a binary vector by thresholding at the overall mean value, which is 0.5 for scenarios 1 and 5 and 0.65 for scenario 3. A center is considered as detected if all bits are equal (Hamming distance of zero between thresholded cluster center and thresholded class mean). The practical motivation for this procedure is that in marketing research groups are often characterized as having a certain feature "above or below average".
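A sketch of this detection rule (helper and argument names are ours; found_centers and class_means are matrices with one 12-dimensional profile per row, overall_mean is 0.5 or 0.65 as above):

## Sketch: count how many true group profiles are recovered.  A class mean is
## detected if some cluster center matches it exactly after thresholding both
## at the overall mean of the data (Hamming distance zero).
centers_found <- function(found_centers, class_means, overall_mean = 0.5) {
  fc <- found_centers >= overall_mean        # thresholded cluster centers
  cm <- class_means   >= overall_mean        # thresholded class means
  detected <- apply(cm, 1, function(profile)
    any(apply(fc, 1, function(center) all(center == profile))))
  sum(detected)
}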

In all scenarios the classes overlap, hence a 100% correct classification is impossible. The (optimal) Bayes rate is 82.98% for scenario 1, 48.93% for scenario 3 and 88.85% for scenario 5. Again we use only Ward's linkage. For all scenarios we use both the correct number of centers (K = 6) in the base method and a larger value of K = 20.

HCL is close to the Bayes classifier for scenario 1, hence almost no further improvement is possible. Bagged HCL is slightly better than plain HCL with K = 6 and slightly worse for K = 20. For K-means, bagging stabilizes the performance for K = 6 close to the base rate; the performance decrease for K = 20 is larger than with HCL, but the correct group centers are found more reliably.

In the other two scenarios bagging always improves on the base method's classification rate and boosts the number of centers found (except for K = 6 in scenario 3). Scenario 3 has turned out to be very hard to learn [15]; the maximum number of correctly identified centers using 12 different cluster algorithms (HCL, K-means, neural gas, self-organizing maps, an improved fixpoint method and variants of these) has been 4 so far. Bagged clustering based on K-means identifies all 6 clusters in some runs using K = 20; however, we currently have no way of identifying these runs (without using the true data structure).

TABLE III
Results for binary data scenarios from tourism marketing.

                        Classification rate              Centers found
                   K    Med     Mean    SD     MErr      min    mean   max
Scenario 1
  HCL                   82.46   82.49   0.24   82.31     6      6      6
  KMN                   72.45   80.60   4.76   82.28     4      5.82   6
  BC (HCL, w)      6    82.46   82.46   0.25   82.93     6      6      6
  BC (KMN, w)      6    81.65   81.48   1.00   81.65     6      6      6
  BC (HCL, w)     20    81.17   81.03   0.88   79.50     6      6      6
  BC (KMN, w)     20    76.78   76.69   1.71   75.13     6      6      6
Scenario 3
  HCL                   29.08   29.34   1.42   29.08     0      1.14   3
  KMN                   31.29   31.76   2.29   27.68     0      0.76   3
  BC (HCL, w)      6    31.43   31.41   1.19   29.40     0      1.00   1
  BC (KMN, w)      6    31.53   31.82   1.75   30.52     0      2.60   2
  BC (HCL, w)     20    34.93   35.06   1.91   37.13     0      1.31   4
  BC (KMN, w)     20    35.55   35.48   2.03   37.68     0      2.07   6
Scenario 5
  HCL                   80.09   79.80   0.71   78.96     4      5.36   6
  KMN                   79.09   78.82   1.87   79.05     4      4.86   6
  BC (HCL, w)      6    86.54   86.29   0.90   84.18     5      5.95   6
  BC (KMN, w)      6    84.52   84.33   2.18   84.25     5      5.70   6
  BC (HCL, w)     20    84.31   84.29   1.11   84.40     6      6      6
  BC (KMN, w)     20    82.12   81.74   1.81   82.28     5      5.97   6

The dendrogram of scenario 5 (Figure 4) clearly shows the structure of the data set. One big cluster (half of the training set) dominates; then there are larger clusters of size 1000 and 700, and finally some small clusters of sizes 300 and 500 (twice). Due to the symmetries in the data set, the dendrogram suggests using 2 clusters (big cluster vs. the others) or 6 clusters.

Fig. 4. Scenario 5: Hierarchical clustering of 200 bagged cluster centers using Ward's method.

VI. Analysis of the Algorithm

A. Effects of Pruning

In the first phase of bagged clustering we apply a (partitioning) base cluster method to bootstrap samples of the original data. Let $\mathcal{P}$ denote the base cluster method (including all hyperparameters such as learning rates, random initializations, ...). Further let

$$\mathcal{C}^\star(K) = \mathcal{C}^\star(K, \mathcal{X}_N, \mathcal{P}) := \{ c \mid \exists\, \mathcal{C}^B(K) : c \in \mathcal{C}^B(K) \}$$

denote the (theoretical) set of all cluster centers that can be generated by the base method when applied to bootstrap samples of size N. Analogously define $\mathcal{C}^\star_{\text{prune}}(K, \nu)$.

Theorem 1: For bagged clustering as defined above,
$$\mathcal{C}^\star_{\text{prune}}(K, \nu) = \{ x \in \mathcal{X}_N \mid \#\{x_i \in \mathcal{X}_N : x_i = x\} \geq \nu \}.$$

Corollary 1: Bagged clustering with pruning of empty clusters ($\nu = 1$) is asymptotically equivalent to hierarchical clustering of the original data set (where the asymptotics are with respect to B).

Corollary 2: If $x_i \neq x_j$ for all $i \neq j$, then $\mathcal{C}^\star_{\text{prune}}(K, \nu) = \emptyset$ for all $\nu > 1$.

Proof. The sample $\mathcal{X}_N^{[n]} := \{x_n, \ldots, x_n\}$ containing only replicates of $x_n$ is a valid bootstrap sample (generated with probability $N^{-N}$ if all points in $\mathcal{X}_N$ are unique), and trivially any partitioning cluster algorithm $\mathcal{P}$ should output $x_n$ as the unique center in this case. Hence $x_n \in \mathcal{C}^\star(K)$ for all $n = 1, \ldots, N$, and $\mathcal{X}_N \subseteq \mathcal{C}^\star(K)$ for all K. Pruning removes all points that are not contained at least $\nu$ times in $\mathcal{X}_N$. □


The above theorem basically says that in the limit $B \to \infty$ bagged clustering with pruning is identical to clustering the original data set. Note that without pruning the algorithm behaves completely differently, as bootstrap samples containing only a few distinct original data points have very low probability (and hence centers with extreme overfitting are also not very frequent). Pruning effectively "thins out" regions containing many centers while keeping outliers, and should therefore be used only very carefully. The main advantage of pruning is that it can drastically reduce the number of centers used as input for the hierarchical clustering step.

B. Cluster Distance and Background Noise

Clustering the original data set with the base method amounts to drawing a single K-tuple from some probability distribution $G = G(F, \mathcal{P}, K, N)$ depending on the (unknown) data generating distribution F of x, the cluster algorithm $\mathcal{P}$ (random initializations, ...), K and N. By replacing the true distribution F of x with the empirical distribution $\hat{F}$ of $\mathcal{X}_N$ we bootstrap the base cluster algorithm and are hence provided with a sample $\mathcal{C}^B(K)$ drawn from $\hat{G} = G(\hat{F}, \mathcal{P}, K, N)$. Standard bootstrap analysis would now proceed by computing statistics like means, standard deviations or confidence intervals; see [16] for a comprehensive introduction to the bootstrap. Bagged clustering explores $\mathcal{C}^B(K)$ using hierarchical clustering.

For the following we need generalized linkage methods using distances on continuous sets; these can easily be obtained by replacing all minima/maxima with infima/suprema and sizes of sets with the probabilities of the sets (with respect to the data distribution F). E.g., the continuous generalization of the distance corresponding to single linkage is
$$d_s(A, B) = \inf_{a \in A,\, b \in B} d(a, b)$$
and for Ward's method we get
$$d_w(A, B) = \frac{2\, \mathbb{P}_F(A)\, \mathbb{P}_F(B)}{\mathbb{P}_F(A) + \mathbb{P}_F(B)}\, \|\bar{a} - \bar{b}\|^2.$$

Additionally we will need the following properties for distances between sets:

(I) For all nonempty subsets B of some convex set A and all sets C with $A \cap C = \emptyset$, it follows that $d(A, C) > 0 \Rightarrow d(B, C) > 0$.

(II) For all nonempty compact sets B in the interior of A and all sets C with $A \cap C = \emptyset$, it follows that $d(B, C) > d(A, C)$.

It follows directly from these definitions that $d_s$ fulfills both (I) and (II), while $d_w$ fulfills only (I).

Suppose that F is absolutely continuous on the input space $\mathcal{X}$ such that the density f exists. One possible definition of a "cluster" is to assume that f is multimodal, with each mode corresponding to one cluster [17]. The following definition characterizes a clustering problem by the amount of background noise $\epsilon$ (the maximum density outside the clusters) and the minimum distance between the clusters.

Definition 1: We call the pairwise disjoint sets $A_i \subset \mathcal{X}$, $i = 1, \ldots, M$, $(\epsilon, \delta)$-separated clusters with respect to distance d, if

1. $\forall x \in \mathcal{X}: f(x) \geq \epsilon \Rightarrow \exists i : x \in A_i$, and
2. $\min_{i \neq j} d(A_i^\epsilon, A_j^\epsilon) = \delta$, where $A_i^\epsilon = \{x \in A_i \mid f(x) \geq \epsilon\}$.

In general, $(\epsilon, \delta)$-separated clusters do not correspond to the usual notion of partitions (as returned by a partitioning cluster algorithm), because they do not form a partition of the complete input space $\mathcal{X}$. However, $(\epsilon, \delta)$-separated clusters are of course a partition of $\{x \in \mathcal{X} : f(x) \geq \epsilon\}$. A sensible assumption on any cluster algorithm $\mathcal{P}$ is that it places cluster centers with high probability in regions of high data density f, i.e., in the modes of f, if there are sufficiently many centers available. Suppose that G is absolutely continuous such that its density g exists.

Theorem 2: Let $A_i \subset \mathcal{X}$, $i = 1, \ldots, M$, be $(\epsilon, \delta)$-separated clusters with respect to linkage method d and density f on $\mathcal{X}$ for some $\epsilon, \delta > 0$. Further let $B_i \subset A_i$, $i = 1, \ldots, M$, be a set of compact subsets of the clusters. If
$$g(x) > f(x) \quad \forall x \in \bigcup_{i=1}^{M} B_i, \qquad g(x) \leq f(x) \quad \text{otherwise},$$
then there exists $\epsilon_1 > \epsilon$ such that the $B_i$, $i = 1, \ldots, M$, are $(\epsilon_1, \delta_1)$-separated clusters with $\delta_1 \geq 0$.

If additionally
1. d fulfills (I), then there exists $\delta_2 > 0$ such that the $B_i$ are $(\epsilon_1, \delta_2)$-separated;
2. all $B_i$ are in the interior of the respective $A_i$ and d fulfills (II), then there exists $\delta_3 > \delta$ such that the $B_i$ are $(\epsilon_1, \delta_3)$-separated.

Proof. Using the compactness of the $B_i$, g attains a minimum on $\bigcup B_i$, say at $b_1 \in \bigcup B_i$. Let $\epsilon_1 := g(b_1)$; then $\epsilon_1 = g(b_1) > f(b_1) \geq \epsilon$. The distance properties 1 and 2 follow directly from (I) and (II). □

Bagged clustering transforms the clustering problem in the original data space into a new problem in the space of centers produced by the base method. If the centers of the base method are concentrated in the modes of f, then the new clusters are smaller than the original ones and have higher density. Whether the distance between the new clusters is larger than the distance between the original ones depends on the linkage method actually used. The single linkage distance will increase, but Ward's distance or the average distance between clusters may even decrease.

E.g., consider Gaussian clusters of equal size N/K and assume that the number of clusters K is known. Then the minimum error solution of K-means places a cluster center at the mean of each cluster (this is also the maximum likelihood solution). By clustering bootstrap samples we get centers with multivariate normal distributions around the true cluster means with K/N times the original variance. Hence, the clusters in the new space have only K/N times the size of the original clusters, and the single linkage distance grows accordingly. Ward's distance between clusters remains unchanged because the centers of the new clusters are the same as the centers of the original clusters. However, Ward's distance within clusters also decreases by a factor of K/N.
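The K/N factor is the usual variance of a sample mean. A short sketch of the argument, under the simplifying assumption that the base method essentially returns the mean of the roughly N/K points falling into each Gaussian cluster of a bootstrap sample:

$$x_1, \ldots, x_{N/K} \overset{\text{iid}}{\sim} \mathcal{N}(\mu_i, \Sigma) \quad \Longrightarrow \quad \bar{x} = \frac{K}{N} \sum_{n=1}^{N/K} x_n \sim \mathcal{N}\!\left(\mu_i, \frac{K}{N}\,\Sigma\right),$$

so the cloud of bagged centers around each true cluster mean $\mu_i$ has covariance roughly $(K/N)\,\Sigma$ instead of $\Sigma$.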

VII. Summary

We have presented a novel clustering framework which allows for the combination of hierarchical and partitioning algorithms. Partitioning cluster algorithms such as K-means are used to concentrate centers in regions of high data density and to remove background noise. These centers are then explored using hierarchical methods. The algorithm compares favorably with standard partitioning techniques on a mixture of artificial and real-world benchmark problems.

We are currently extending this research in several directions. The exact number K of centers used by the base method did not seem to be a critical parameter in our simulations; however, better guidelines for choosing K would be needed. Another future direction involves the interpretation of the result of the hierarchical clustering, as a broad spectrum of methods for analyzing dendrograms is available in the literature. This includes splitting the tree into clusters by more refined methods than a horizontal cut and choosing the number of clusters. Finally, we are also working on new methods for the graphical visualization of (bagged) clusters in binary data sets.

Acknowledgement

This piece of research was supported by the Austrian Science Foundation (FWF) under grant SFB#010 ('Adaptive Information Systems and Modeling in Economics and Management Science'). The author wants to thank Kurt Hornik and Andreas Weingessel for helpful discussions.

References

[1] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, pp. 123-140, 1996.

[2] Y. Freund and R. E. Schapire, "Experiments with a new boosting algorithm," in Thirteenth International Conference on Machine Learning, 1996.

[3] J. Max, "Quantizing for minimum distortion," IRE Transactions on Information Theory, vol. IT-6, pp. 7-12, Mar. 1960.

[4] K. Pötzelberger and H. Strasser, "Data compression by unsupervised classification," Report 10, SFB "Adaptive Information Systems and Modeling in Economics and Management Science", http://www.wu-wien.ac.at/am, 1997.

[5] F. Leisch, A. Weingessel, and E. Dimitriadou, "Competitive learning for binary valued data," in Proceedings of the 8th International Conference on Artificial Neural Networks (ICANN 98) (L. Niklasson, M. Bodén, and T. Ziemke, eds.), vol. 2, (Skövde, Sweden), pp. 779-784, Springer, Sept. 1998.

[6] D. Weinshall, D. W. Jacobs, and Y. Gdalyahu, "Classification in non-metric spaces," in Advances in Neural Information Processing Systems (M. Kearns, S. Solla, and D. Cohn, eds.), vol. 11, MIT Press, USA, 1999.

[7] T. M. Martinetz, S. G. Berkovich, and K. J. Schulten, ""Neural-Gas" network for vector quantization and its application to time-series prediction," IEEE Transactions on Neural Networks, vol. 4, pp. 558-569, July 1993.

[8] T. Kohonen, Self-Organizing Maps. Berlin: Springer, 1995.

[9] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data. New York, USA: John Wiley & Sons, Inc., 1990.

[10] J. Hartung and B. Elpelt, Multivariate Statistik: Lehr- und Handbuch der angewandten Statistik. München, Germany: Oldenbourg Verlag, fourth ed., 1992.

[11] G. W. Milligan, "Clustering validation: Results and implications for applied analyses," in Clustering and Classification (P. Arabie, L. Hubert, and G. DeSoete, eds.), pp. 341-375, River Edge, NJ, USA: World Scientific Publishers, 1996.

[12] A. Weingessel, E. Dimitriadou, and K. Hornik, "A voting scheme for cluster algorithms," in Neural Networks in Applications, Proceedings of the Fourth International Workshop NN'99 (G. Krell, B. Michaelis, D. Nauck, and R. Kruse, eds.), (Otto-von-Guericke University of Magdeburg, Germany), pp. 31-37, 1999.

[13] Y. Gdalyahu, D. Weinshall, and M. Werman, "A randomized algorithm for pairwise clustering," in Advances in Neural Information Processing Systems (M. Kearns, S. Solla, and D. Cohn, eds.), vol. 11, MIT Press, USA, 1999.

[14] S. Dolnicar, F. Leisch, and A. Weingessel, "Artificial binary data scenarios," Working Paper Series 20, SFB "Adaptive Information Systems and Modeling in Economics and Management Science", http://www.wu-wien.ac.at/am, Sept. 1998.

[15] S. Dolnicar, F. Leisch, A. Weingessel, C. Buchta, and E. Dimitriadou, "A comparison of several cluster algorithms on artificial binary data scenarios from tourism marketing," Working Paper Series 7, SFB "Adaptive Information Systems and Modeling in Economics and Management Science", http://www.wu-wien.ac.at/am, 1998.

[16] B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap. Monographs on Statistics and Applied Probability, New York, USA: Chapman & Hall, 1993.

[17] H. H. Bock, "Probabilistic models in cluster analysis," Computational Statistics & Data Analysis, vol. 23, pp. 5-28, 1996.