HYBRID HIERARCHICAL CLUSTERING:
AN EXPERIMENTAL ANALYSIS ∗
Keerthiram Murugesan, Jun Zhang
Technical Report: CMIDA-HiPSCCS #001-11
Department of Computer Science,
University of Kentucky,
Lexington, KY
{kmu222, jzhang}@cs.uky.edu
February 17, 2011
Abstract
In this paper, we present a hybrid clustering method that combines
divisive hierarchical clustering with agglomerative hierarchical
clustering. We use the bisect K-means divisive clustering algorithm
in our method. First, we cluster the document collection using the bisect
K-means clustering algorithm with K' > K as the total number of
clusters. Second, we calculate the centroids of the K' clusters obtained
from the previous step. Then we apply the Unweighted Pair Group
Method with Arithmetic Mean (UPGMA) agglomerative hierarchical
algorithm to these centroids for the given K. After UPGMA finds
K clusters among these K' centroids, if two centroids end up in the same
cluster, then all of their documents belong to the same cluster.
∗ © University of Kentucky, 2011.
We compared the goodness of the clusters generated by the standard bisect
K-means algorithm and the proposed hybrid algorithm, measured
on various cluster evaluation metrics. Our experimental results show
that the proposed method outperforms the standard bisect K-means
algorithm. Finally, we show that the choice of K' has no
major impact on the final results by analyzing the relation between the
value of K' and the quality of the clusters.
1 INTRODUCTION
Document clustering algorithms help find groups of documents that share a
common pattern [TSK05,SKK00,GRS98,ZKF05,CKPT92]. They have been used
to automatically find clusters in a collection without any user supervision.
The main goal of clustering is to find meaningful groups so that
analyzing the documents within clusters is much easier than
viewing the collection as a whole. Some of the most common applications
of clustering are information retrieval, document organization, genetics,
weather forecasting, and medical imaging [TSK05].
There are different ways to cluster documents, but two types
of clustering methods are most common: partitional and hierarchical clustering. A
partitional clustering algorithm finds all the non-overlapping clusters at once
by dividing the set of documents so as to minimize or maximize an objective
function [JD88,Mac67,NH94,CS96,ZHD+01,HXZ+98,Bol98,SG00,DHZ+01].
Most partitional clustering algorithms are prototype-based: a prototype is
chosen for each cluster and the documents are grouped based on these prototypes. Usually
these algorithms run several iterations until they converge or an optimum
condition is met.
Hierarchical clustering generates a tree of clusters by splitting or merging
clusters at each level until the desired number of clusters is generated.
The generated tree is often called a dendrogram [GRS98,SS73,Kin67,
KHK99], as shown in Figure 1.
Figure 1: Dendrogram
Hierarchical clustering can use a top-down (divisive) or a bottom-up
(agglomerative) approach to construct the dendrogram. Agglomerative
clustering starts with each document in its own cluster and repeatedly merges
the two clusters that are most similar at each step until a single
cluster of all documents is obtained. Divisive clustering, on the other hand,
starts with all documents in a single cluster and splits clusters until all
clusters are singletons. These types of clustering are widely used
because the hierarchy of clusters they build resembles the structure of
their application domains [ZKF05,XW05,CKPT92].
Hierarchical agglomerative clustering algorithms merge a pair of clusters
at each step based on one of the following linkage metrics for measuring
the proximity between clusters [GRS98,JD88,SS73,Kin67,KHK99,GRS99].
The Single Link Algorithm (SLA) uses the maximum pair-wise
similarity between the two clusters to decide which pair to merge [SS73]:

Similarity_SLA(C_A, C_B) = max_{x ∈ C_A, y ∈ C_B} cos(x, y)    (1)
The Complete Link Algorithm (CLA) uses the minimum pair-wise
similarity between the two clusters [Kin67]:

Similarity_CLA(C_A, C_B) = min_{x ∈ C_A, y ∈ C_B} cos(x, y)    (2)
The Group Average Algorithm (Unweighted Pair Group Method with
Arithmetic Mean, UPGMA) uses the average pair-wise similarity of
the documents from the two clusters [JD88]:

Similarity_UPGMA(C_A, C_B) = (1 / (n_A * n_B)) * Σ_{x ∈ C_A, y ∈ C_B} cos(x, y)    (3)

where n_A and n_B are the numbers of documents in C_A and C_B.
Here cos(x, y) is the cosine similarity between two documents or clusters
represented as vectors; cosine similarity is discussed further in Section 4.2.
Though there are many other linkage metrics, Equations (1), (2), and (3)
are the most commonly used. These methods use a Euclidean distance or
inter-cluster similarity matrix for their comparisons and measurements.
Experiments have shown that UPGMA performs better than the SLA and
CLA algorithms.
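The three linkage metrics above translate directly into code. The following sketch is illustrative (the helper names and toy clusters are ours, not from the report); each cluster is a matrix with one document vector per row.

```python
import numpy as np

def cos_sim(x, y):
    # Cosine similarity between two document vectors (see Section 4.2)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def linkage_similarity(A, B, method="upgma"):
    """Cluster-to-cluster similarity: 'single' is Eq. (1), 'complete' is
    Eq. (2), and 'upgma' is Eq. (3)."""
    sims = [cos_sim(x, y) for x in A for y in B]
    if method == "single":
        return max(sims)              # most similar cross-cluster pair
    if method == "complete":
        return min(sims)              # least similar cross-cluster pair
    return sum(sims) / len(sims)      # average over all pairs

A = np.array([[1.0, 0.0], [1.0, 1.0]])
B = np.array([[0.0, 1.0]])
print(linkage_similarity(A, B, "single"))    # cos([1,1],[0,1]) = 1/sqrt(2)
print(linkage_similarity(A, B, "complete"))  # cos([1,0],[0,1]) = 0.0
```

Note how the same pairwise similarities yield very different cluster similarities under the three metrics, which is why the choice of linkage matters.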
Divisive hierarchical clustering algorithms pick a large cluster, or the cluster
with the lowest intra-cluster similarity, to split at each level [ELL01,
KR90]. More recently, partitional clustering algorithms have been used to
split the clusters. The bisect K-means algorithm [SKK00] is a typical divisive
hierarchical clustering algorithm that uses the K-means partitional clustering
algorithm to split its clusters.
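The bisecting idea can be sketched in a few lines. This is a minimal sketch under our own assumptions (Euclidean 2-way K-means, always splitting the largest cluster, no handling of degenerate empty splits); the function names are ours.

```python
import numpy as np

def two_means(X, iters=20, seed=0):
    """2-way K-means: returns a boolean mask splitting the rows of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=2, replace=False)].astype(float)
    for _ in range(iters):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        for j in (0, 1):
            if (labels == j).any():       # keep old center if side is empty
                centers[j] = X[labels == j].mean(axis=0)
    return labels == 1

def bisect_kmeans(X, k):
    """Repeatedly bisect the largest cluster until k clusters exist.
    Returns a list of index arrays into the rows of X."""
    clusters = [np.arange(len(X))]
    while len(clusters) < k:
        i = max(range(len(clusters)), key=lambda j: len(clusters[j]))
        idx = clusters.pop(i)
        mask = two_means(X[idx])
        clusters += [idx[~mask], idx[mask]]
    return clusters

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0],
              [5.0, 5.1], [10.0, 0.0], [10.0, 0.2]])
parts = bisect_kmeans(X, 3)
```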
This paper focuses on how a hybrid of one or more of these clustering
algorithms can give better results and generate better clusters than
the traditional methods. The rest of this paper is organized as follows: Section 2
discusses the motivations for this work and related work.
In Section 3, we present the suggested hybrid method, which uses both the
divisive and agglomerative hierarchical clustering algorithms. Section 4 provides
the experimental analysis for this method and discusses the data
collections and cluster validity measures used. Then, we present
the detailed experimental results for the suggested hybrid clustering method.
Appendix A presents the results of a hybrid partitional algorithm (hybrid K-means)
based on the quality of the clusters and explains how it performs
in this experimental setting. Finally, we present the relation between K' and
the quality of the clusters generated by this hybrid clustering method and
discuss the total computational cost of the method in Appendix B.
2 MOTIVATIONS
In general, hierarchical clustering algorithms outperform partitional clustering
algorithms in terms of cluster quality [JD88]. But the computational cost
(space and time complexity) of agglomerative and divisive hierarchical clustering
is very high compared to that of partitional clustering [LA99,PHB00,
XW05,ELL01]. At each step, we need to analyze all the documents in a
collection to split or merge clusters to construct the dendrogram.
A new method that combines the advantages of the partitional and hierarchical
clustering algorithms should give better results. The bisect K-means
algorithm is a hybrid partitional clustering algorithm that uses the (2-way)
K-means partitional clustering algorithm to split a cluster at each step,
constructing the hierarchical tree in a top-down fashion. The bisect K-means
algorithm picks a cluster based on certain criteria at each step and splits it until the
desired number of clusters, or the complete hierarchical tree, is generated.
Using partitional clustering algorithms to generate the hierarchical
tree has become quite popular in recent years. Experiments have shown that these
hybrid partitional clustering algorithms perform better than the traditional
clustering algorithms [ZKF05].
A cluster generated by a clustering algorithm may contain outliers and false-positive
documents. Recently, an approach has been discussed that refines the clusters
generated by a clustering algorithm in a second level to remove the outliers
and improve the quality of the clusters. It is similar to using the centroids of
the clusters (generated by any clustering algorithm) as the initial centroids
of the standard K-means algorithm, which eliminates the
initialization problem of the K-means clustering algorithm. Chitta et al.
proposed a two-level K-means algorithm [CN10] that generates an arbitrarily
chosen number k' of clusters in the first level and refines these clusters in the
second level using the cluster radius ri and an allowed threshold value t.
A hybrid hierarchical agglomerative algorithm was discussed by Vijaya et
al. [VNS05]. It uses the Single Link (SLA) or Complete Link (CLA) agglomerative
hierarchical algorithm with an incremental partitional clustering
algorithm (the Leader algorithm). In order to generate k clusters using the
bottom-up hierarchical tree with (N to k) levels, it generates an arbitrarily
chosen number k' of clusters using the leader-based partitional clustering
algorithm, thereby bypassing the cumbersome process of generating the (N to k')
levels of the bottom-up hierarchical tree. Then the SLA / CLA hierarchical
agglomerative clustering algorithm is applied for the remaining (k' to k) levels
to generate the final k clusters.
In this paper, we used the ideas from [SKK00, CN10, VNS05] for the
hybrid bisect K-means clustering algorithm to generate a set of good clusters
using the agglomerative and divisive hierarchical clustering algorithms.
3 HYBRID CLUSTERING METHOD
Hierarchical clustering algorithms build a hierarchy of quality clusters. One
of the main problems with hierarchical clustering is that documents
put together in an early stage of the algorithm are never moved again. In
other words, hierarchical clustering preserves a local optimization
criterion but not a global one [TSK05]. If we can somehow
correct these misplaced documents in the generated clusters, we can move
toward preserving the global optimization criterion.
Our algorithm uses both the top-down (bisect K-means) and bottom-up
(UPGMA) hierarchical clustering algorithms to address this
problem. We pass the K' cluster information (centroids) computed by the
bisect K-means algorithm to the UPGMA algorithm to correct the inconsistencies
that occur due to wrong decisions made while merging or splitting
a cluster.
First, we run the bisect K-means algorithm on the document collection
for a particular value of K' (in this case K' = √N; see Appendix B) until K'
document clusters are generated. At each step, the cluster with the largest
number of documents or the lowest intra-cluster similarity value is chosen
to split. The generated document clusters should not be empty (discussed
in Appendix A). Then, we calculate the centroid of each of the resulting
clusters. Each of these centroids represents a document cluster and all of its
documents.
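The centroid computation just described is a simple mean of the document vectors in each cluster; a sketch (the helper and toy data are ours):

```python
import numpy as np

def centroids(X, clusters):
    """Centroid of each document cluster: the mean of its rows in X.
    `clusters` is a list of index arrays into the rows of X."""
    return np.vstack([X[idx].mean(axis=0) for idx in clusters])

X = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 0.0]])
cents = centroids(X, [np.array([0, 1]), np.array([2])])
print(cents)  # rows [1., 1.] and [4., 0.]
```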
Listing 1: Hybrid Bisect K-means algorithm
1. Pick a cluster to split. (Initially the whole document collection is used as a single cluster.)
2. Find 2 sub-clusters using the K-means algorithm.
3. Repeat Steps 1 (Initialization step) and 2 (Bisecting step) until K' (> K) clusters are generated.
4. Compute the centroids (cluster prototypes) for each of the K' clusters such that each document in the collection belongs to one of these centroids.
5. Construct a K' x K' similarity matrix between these centroid clusters.
6. Merge the two most similar centroid clusters (i.e., place these centroids in the same cluster).
7. Update the centroid cluster similarity matrix.
8. Repeat Steps 6 (Merging step) and 7 (Updating step) until the K clusters of centroids are generated.
9. If two centroids belong to the same centroid cluster, then the document clusters of these centroids go together as a final cluster (Merging step).
In Steps 5–8, we run the UPGMA agglomerative hierarchical clustering
algorithm on the centroids of these document clusters for a given value of K
(given in the algorithm) to generate a set of K centroid clusters. We use
the term centroid cluster to avoid possible confusion with the document
clusters: just as a document cluster is a cluster of documents, a centroid cluster is
a cluster of centroids. The resulting centroid clusters are used as a reference
in merging the document clusters to obtain the final K clusters, as shown in
Step 9 of Listing 1.
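The UPGMA stage over the centroids can be sketched with SciPy's average-linkage routine; the helper name and toy data are ours, and we use cosine distance (1 − cosine similarity) since SciPy's hierarchy module works with distances rather than similarities:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def merge_centroids(cents, doc_labels, K):
    """UPGMA over the K' centroids, then relabel every document with the
    final cluster of its centroid (the merging steps of Listing 1)."""
    Z = linkage(pdist(cents, metric="cosine"), method="average")  # UPGMA
    cent_cluster = fcluster(Z, t=K, criterion="maxclust")  # K' -> K labels
    return cent_cluster[np.asarray(doc_labels)]            # label per document

# Toy example: 4 centroids forming 2 natural groups, 6 documents.
cents = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
doc_labels = [0, 0, 1, 2, 2, 3]  # which centroid each document belongs to
final = merge_centroids(cents, doc_labels, K=2)
```

Here documents attached to centroids 0 and 1 end up in one final cluster and those attached to centroids 2 and 3 in the other, which is exactly the "centroids in the same centroid cluster" merge rule.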
4 EXPERIMENTAL ANALYSIS
4.1 DOCUMENT COLLECTIONS
We used ten data collections from TREC-5, TREC-6, and TREC-7 [TRE99]
and the Reuters source [Lew99]. Table 1 shows the list of data
used in this experiment and its class distribution [SKK00,HK00]. The TR11,
TR12, TR23, TR31, and TR45 collections are taken from TREC-5, TREC-6,
and TREC-7. LA1 and LA2 are from the Los Angeles Times data of
TREC-5. FBIS is from the Foreign Broadcast Information Service data of
TREC-5. For Reuters-21578, we selected documents that belong to only one
category; the resulting set is split into two collections, RE0 and RE1.
For all the document collections shown in Table 1, we removed the
common words in the stop list and stemmed the remaining words using Porter's
stemming algorithm [Por80]. The classes of all these data collections are
generated from the relevance judgments given in these collections.
4.2 DISTANCE AND SIMILARITY MEASURES
All the documents in a document collection are represented using the vector
space model [Sal89], which uses the bag-of-words (vector of terms) concept, in
which each unique term in a document is assigned a weight based on
its importance. Several term weighting schemes are in use [SB88,
SW97]. We used the tf-idf term weighting scheme, in which a document is
represented by:

d = (t_1, w_1; t_2, w_2; t_3, w_3; ... ; t_n, w_n)    (4)

where w_i = tf_i * log(N / df_i) is the weight of the term t_i in Equation (4), tf_i is
the frequency of the term t_i in the document d, df_i is the document frequency
of the term t_i in the document collection ζ, and N is the total number of
documents in the collection ζ.
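As a concrete illustration of Equation (4), the tf-idf weights can be computed as follows (the helper and the tokenized toy documents are ours):

```python
import math

def tfidf(docs):
    """Per-document term weights w_i = tf_i * log(N / df_i).
    `docs` is a list of token lists; returns one {term: weight} dict each."""
    N = len(docs)
    df = {}                      # document frequency of each term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        tf = {}                  # term frequency within this document
        for term in doc:
            tf[term] = tf.get(term, 0) + 1
        weights.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return weights

docs = [["data", "mining", "data"], ["text", "mining"], ["data", "text"]]
w = tfidf(docs)
# "data" occurs twice in doc 0 and in 2 of the 3 documents: w = 2 * log(3/2)
```

A term that appears in every document gets weight 0, reflecting that it carries no discriminating power.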
Collection  Source  # of Docs  # of Classes  Min class size  Max class size  Avg. class size  # of Words
FBIS TREC 2463 17 38 506 144.9 2000
LA1 TREC 3204 6 273 943 534.0 31472
LA2 TREC 3075 6 248 905 512.5 31472
RE0 REUTERS-21578 1504 13 11 608 115.7 2886
RE1 REUTERS-21578 1657 25 10 371 66.3 3758
TR11 TREC 414 9 6 132 46.0 6429
TR12 TREC 313 8 9 93 39.1 5804
TR23 TREC 204 6 6 91 34.0 5832
TR31 TREC 927 7 2 352 132.4 10128
TR45 TREC 690 10 14 160 69.0 8261
Table 1: Data collections used in this experiment.
The hybrid bisect K-means algorithm uses the Euclidean distance and the cosine
similarity measure between documents/clusters to find their relationship
[XW05]. The Euclidean distance measure is used by the K-means algorithm in
the divisive step to split the document clusters, and the cosine similarity is used
in UPGMA to merge the centroid clusters. The Euclidean distance measures the
distance between two documents, or between a document and a centroid, projected
in Euclidean space:

D(x, y) = sqrt( Σ_{i = 1..T} (x_i − y_i)^2 )    (5)

where T is the total number of terms in the collection ζ. Cosine similarity is
used to find the similarity between two documents: if two documents x and
y are similar, the cosine similarity measure will be close to 1; otherwise
it will be close to 0.

Cos(x, y) = x^t y / (||x|| ||y||)    (6)
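Equations (5) and (6) translate directly into code (the helper names are ours):

```python
import numpy as np

def euclidean(x, y):
    """Eq. (5): Euclidean distance between two term vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sqrt(((x - y) ** 2).sum()))

def cosine(x, y):
    """Eq. (6): cosine similarity; near 1 for similar documents, near 0
    for unrelated ones (tf-idf term vectors are non-negative)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

print(euclidean([0, 0], [3, 4]))   # 5.0
print(cosine([1, 0], [0, 1]))      # 0.0 (orthogonal vectors)
```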
4.3 CLUSTER VALIDITY MEASURES
Cluster validity measures the goodness of the clusters generated by a clustering
algorithm. In order to compare the proposed clustering with the standard
bisect K-means algorithm, we used classification-oriented measures. A
clustering algorithm's performance can be measured using an internal
quality criterion or an external quality criterion. In this paper, we used
external quality criteria, which compare the clustering result produced
by the clustering algorithm with the known classes. These classes are identified
based on human judgment. We used three external quality measures
in this paper: Entropy, F-Measure, and Purity [TSK05,SKK00,ZKF05].
We use the following notation throughout this paper: C = {C_1, C_2, ..., C_k}
is the set of clusters produced by a clustering algorithm and
Ω = {Ω_1, Ω_2, ..., Ω_l} is the set of known classes, where l = k in
our experiment. |Ω_i| is the number of documents in the class Ω_i,
|C_j| is the number of documents in the cluster C_j, |Ω_i ∩ C_j| is the
number of documents that appear in both class Ω_i and cluster C_j, N is the total
number of documents in the collection, and c is the number of known classes
in the collection.
Entropy provides a measure of goodness [Sha48]. The entropy of the cluster
C_j can be defined as

H_j = − Σ_{i = 1..l} p_ij log(p_ij)    (7)

H_j = − Σ_{i = 1..l} (|Ω_i ∩ C_j| / |C_j|) log(|Ω_i ∩ C_j| / |C_j|),  j = 1..k    (8)

Equations (7) and (8) are equivalent, with p_ij = |Ω_i ∩ C_j| / |C_j|. The total
entropy of the clustering algorithm is the weighted sum of the entropies of all
the clusters in a collection:

H = Σ_{j = 1..k} H_j * |C_j| / N    (9)
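Equations (7)–(9) amount to the following computation on parallel lists of class and cluster labels (the helper and toy labels are ours):

```python
import math
from collections import Counter

def total_entropy(class_labels, cluster_labels):
    """Weighted cluster entropy, Eqs. (7)-(9): 0 for a perfect clustering."""
    N = len(class_labels)
    H = 0.0
    for c in set(cluster_labels):
        members = [cls for cls, cl in zip(class_labels, cluster_labels)
                   if cl == c]
        nj = len(members)
        # Eq. (8): entropy of this cluster from its class proportions
        Hj = -sum((n / nj) * math.log(n / nj)
                  for n in Counter(members).values())
        H += Hj * nj / N   # Eq. (9): weight by cluster size
    return H

print(total_entropy(["a", "a", "b", "b"], [0, 0, 1, 1]))  # perfect split: 0.0
print(total_entropy(["a", "b"], [0, 0]))                  # one mixed cluster
```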
F-Measure provides a measure of accuracy [LA99]. It is based on the
recall and precision measures used in the evaluation of information retrieval
systems. Precision is the fraction of a cluster that consists of objects
of a specified class. Recall is the extent to which a cluster contains all objects
of a specified class [TSK05]. Recall and precision can be computed as:

Recall:    R(Ω_i, C_j) = |Ω_i ∩ C_j| / |Ω_i|    (10)

Precision: P(Ω_i, C_j) = |Ω_i ∩ C_j| / |C_j|    (11)

The F-Measure between class Ω_i and cluster C_j is given by:

F(Ω_i, C_j) = 2 * R(Ω_i, C_j) * P(Ω_i, C_j) / (R(Ω_i, C_j) + P(Ω_i, C_j))    (12)

The F-Measure of a class Ω_i is the maximum F-Measure value obtained
at any node of the hierarchical clustering; the total F-Measure is the weighted
average of the F-Measure values of all the classes:

F-Measure(Ω_i) = max_{C_j ∈ C} F(Ω_i, C_j)    (13)

F-Measure = Σ_{i = 1..c} (|Ω_i| / N) * F-Measure(Ω_i)    (14)
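Equations (10)–(14), with classes and clusters represented as sets of document ids (the helper and toy data are ours):

```python
def total_f_measure(class_sets, cluster_sets, N):
    """Eqs. (10)-(14): best F over all clusters for each class (Eq. 13),
    weighted by class size (Eq. 14)."""
    total = 0.0
    for cls in class_sets:
        best = 0.0
        for clu in cluster_sets:
            inter = len(cls & clu)
            if inter == 0:
                continue
            recall = inter / len(cls)      # Eq. (10)
            precision = inter / len(clu)   # Eq. (11)
            f = 2 * recall * precision / (recall + precision)  # Eq. (12)
            best = max(best, f)
        total += len(cls) / N * best
    return total

classes = [{0, 1}, {2, 3}]
print(total_f_measure(classes, [{0, 1}, {2, 3}], N=4))   # perfect: 1.0
print(total_f_measure(classes, [{0, 1, 2}, {3}], N=4))   # imperfect split
```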
Purity measures the quality of the clusters. The purity of the cluster C_j is
given by:

Purity(C_j) = max_i |Ω_i ∩ C_j|,  Ω_i ∈ Ω    (15)

Purity = Σ_{j = 1..k} Purity(C_j) / N    (16)
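Equations (15)–(16) on the same label-list representation used for entropy (the helper is ours):

```python
from collections import Counter

def purity(class_labels, cluster_labels):
    """Eqs. (15)-(16): sum each cluster's dominant class count, divide by N."""
    N = len(class_labels)
    total = 0
    for c in set(cluster_labels):
        counts = Counter(cls for cls, cl in zip(class_labels, cluster_labels)
                         if cl == c)
        total += max(counts.values())  # Eq. (15): size of the dominant class
    return total / N                   # Eq. (16)

print(purity(["a", "a", "b", "b"], [0, 0, 1, 1]))  # 1.0
print(purity(["a", "b", "b"], [0, 0, 0]))          # 2/3
```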
5 RESULTS AND DISCUSSIONS
We evaluated the clusters generated by the hybrid bisect K-means algorithm by
running it on the various document collections discussed in Section 4, along with the
Data Source  Hybrid (Average)  Hybrid (Best)  Bisect K-means (Average)  Bisect K-means (Best)
FBIS 1.4430 1.3098 1.3533 1.3068
LA1 1.3663 1.1512 1.4164 1.3395
LA2 1.2797 1.3046 1.3143 1.2877
RE0 1.4100 1.2677 1.4162 1.3716
RE1 1.6190 1.4806 1.6418 1.5241
TR11 1.4011 1.2500 1.4102 1.3028
TR12 1.3798 1.1100 1.7344 1.6600
TR23 1.2071 1.0776 1.3351 1.3313
TR31 1.1233 0.9852 1.4105 1.3761
TR45 1.4059 1.1880 1.5922 1.5122
Table 2: Average and Best Entropy measured for Hybrid and Bisect K-means
algorithms.
bisect K-means algorithm to compare the quality of the generated clusters.
We used Entropy, F-Measure, and Purity as the evaluation metrics. The
generated clusters vary from run to run based on the following:
• The initial centroids used by K-means within the bisect K-means algorithm,
• The cluster picked to split at each step of the bisect K-means stage of the hybrid algorithm,
• The clusters picked to merge at each step of the UPGMA stage of the hybrid algorithm.
To account for this variation, we ran the hybrid algorithm 10 times on each document
collection for a particular value of K' (in this case K' = √N). Averaging the
results of these 10 runs on each document collection gives the final results as
Average Entropy, Average F-Measure, and Average Purity. The best results of
Average Entropy, Average
Data Source  Hybrid (Average)  Hybrid (Best)  Bisect K-means (Average)  Bisect K-means (Best)
FBIS 0.3798 0.4449 0.4052 0.4591
LA1 0.2400 0.3468 0.3256 0.3993
LA2 0.3181 0.3961 0.3905 0.4255
RE0 0.2936 0.3823 0.2502 0.3225
RE1 0.2929 0.3515 0.2818 0.3369
TR11 0.2910 0.3944 0.2478 0.3017
TR12 0.2928 0.3522 0.1946 0.2409
TR23 0.3217 0.4099 0.1719 0.2016
TR31 0.3361 0.4231 0.1407 0.2659
TR45 0.3981 0.4952 0.2627 0.3455
Table 3: Average and Best F-Measure measured for Hybrid and Bisect K-
Means algorithms.
F-Measure, and Average Purity for each document collection are boldfaced
in Tables 2, 3, and 4. The lower the Average Entropy, and the higher the
Average F-Measure and Average Purity, the better the generated clusters.
We also recorded the best value obtained for Average Entropy, F-Measure,
and Purity over the 10 runs for each data collection.
Figure 2: Average Entropy measured for various datasets.
Data Source  Hybrid (Average)  Hybrid (Best)  Bisect K-means (Average)  Bisect K-means (Best)
FBIS 0.5040 0.5684 0.5678 0.5863
LA1 0.4556 0.4969 0.4292 0.4800
LA2 0.4956 0.5385 0.4721 0.5024
RE0 0.4961 0.5299 0.5108 0.5352
RE1 0.4289 0.4689 0.5020 0.5576
TR11 0.4894 0.5362 0.4850 0.5362
TR12 0.3837 0.4281 0.3514 0.3802
TR23 0.5113 0.5735 0.4853 0.4853
TR31 0.5813 0.6656 0.4344 0.4423
TR45 0.4774 0.5710 0.4210 0.4507
Table 4: Average and Best Purity measured for Hybrid and Bisect K-Means
algorithms.
Figure 3: Average F-measure measured for various datasets.
Figure 4: Average Purity measured for various datasets.
Figures 2, 3, and 4 show graphical representations of the Average Entropy,
Average F-Measure, and Average Purity values measured for the various data
collections. Series 1 represents the hybrid algorithm and Series 2 represents
the bisect K-means algorithm.
6 CONCLUSION
In this paper, we proposed a hybrid algorithm, called hybrid bisect K-means,
that uses the top-down (bisect K-means) and bottom-up (UPGMA) hierarchical
clustering algorithms. We compared the clusters generated
by the hybrid bisect K-means algorithm with the clusters generated by the
bisect K-means algorithm based on three evaluation metrics: Entropy, F-Measure,
and Purity of the clusters. Based on the results obtained, we found
that the hybrid bisect K-means algorithm outperforms the bisect K-means
algorithm and produces better clusters.
We have also shown how the hybrid K-means algorithm performs worse
than the K-means algorithm in Appendix A. The relation between the
value of the parameter K' used in hybrid bisect K-means and the cluster
quality is discussed in Appendix B.
References
[Bol98] Daniel Boley. Principal direction divisive partitioning. Data Min-
ing and Knowledge Discovery, 2(4):325–344, December 1998.
[CKPT92] Douglass R. Cutting, David R. Karger, Jan O. Pedersen, and
John W. Tukey. Scatter/Gather: A cluster-based approach to
browsing large document collections. In Proceedings of the 15th
annual international ACM SIGIR conference on Research and de-
velopment in information retrieval - SIGIR ’92, pages 318–329,
New York, New York, USA, June 1992. ACM Press.
[CN10] Radha Chitta and M. Narasimha Murty. Two-level k-means clus-
tering algorithm for k–τ relationship establishment and linear-
time classification. Pattern Recognition, 43(3):796–804, March
2010.
[CS96] Peter Cheeseman and John Stutz. Bayesian Classification (Au-
toClass): Theory and Results, volume 180, pages 153–180. MIT
Press, 1996.
[DHZ+01] Chris H. Q. Ding, Xiaofeng He, Hongyuan Zha, Ming Gu, and
Horst D. Simon. A min-max cut algorithm for graph partitioning
and data clustering. In Proceedings of the 2001 IEEE Interna-
tional Conference on Data Mining, ICDM ’01, pages 107–114,
Washington, 2001. IEEE Computer Society.
[ELL01] Brian S Everitt, Sabine Landau, and Morven Leese. Cluster Anal-
ysis, volume 33 of Social Science Research Council Reviews of
Current Research. Arnold, 2001.
[GRS98] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. CURE: An
efficient clustering algorithm for large databases. ACM SIGMOD
Record, 27(2):73–84, June 1998.
[GRS99] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. ROCK: A
robust clustering algorithm for categorical attributes. In 15th In-
ternational Conference on Data Engineering, pages 512 – 521,
March 1999.
[HJJ09] Xiong Hui, Wu Junjie, and Chen Jian. K-Means clustering ver-
sus validation measures: A data-distribution perspective. IEEE
Transactions on Systems, Man, and Cybernetics, Part B (Cyber-
netics), 39(2):318–331, April 2009.
[HK00] Eui-Hong Han and George Karypis. Centroid-based document
classification: Analysis and experimental results, 2000.
[HXZ+98] Tianming Hu, Hui Xiong, Wenjun Zhou, Sam Yuan Sung, and
Hangzai Luo. Hypergraph partitioning for document clustering:
A summary of results. Bulletin of the Technical committee on
Data Engineering, 21(1):15 – 22, July 1998.
[JD88] Anil K. Jain and Richard C. Dubes. Algorithms for Clustering
Data. Prentice Hall, 1988.
[KHK99] G Karypis, E H Han, and V Kumar. Chameleon: A hierarchical
clustering algorithm using dynamic modeling. IEEE Computer,
32(8):68–75, 1999.
[Kin67] B King. Step-wise clustering procedures. Journal of the American
Statistical Association, 69:86–101, 1967.
[KR90] Leonard Kaufman and Peter J Rousseeuw. Finding Groups in
Data: An Introduction to Cluster Analysis. Wiley Series in Prob-
ability and Mathematical Statistics. Wiley, 1990.
[LA99] Bjornar Larsen and Chinatsu Aone. Fast and effective text mining
using linear-time document clustering. In Proceedings of the fifth
ACM SIGKDD international conference on Knowledge discovery
and data mining KDD 99, volume 5, pages 16–22. ACM Press,
1999.
[Lew99] David D. Lewis. The reuters-21578 text categorization test col-
lection, 1999.
[Mac67] J B MacQueen. Some methods for classification and analysis of
multivariate observations. In L M Le Cam and J Neyman, edi-
tors, Proceedings of the Fifth Berkeley Symposium on Mathemat-
ical Statistics and Probability, volume 1, pages 281–297. West-
ern Management Science Institute, University of California Press,
1967.
[NH94] R T Ng and J Han. Efficient and effective clustering methods for
spatial data mining. In Proceedings of the International Confer-
ence on Very Large Data Bases, pages 144–155. Citeseer, 1994.
[PHB00] Jan Puzicha, Tomas Hofmann, and Joachim M Buhmann. A the-
ory of proximity based clustering: structure detection by opti-
mization. Pattern Recognition, 33(4):617–634, 2000.
[Por80] M F Porter. An algorithm for suffix stripping. Program,
14(3):130–137, 1980.
[Sal89] Gerard Salton. Automatic Text Processing: The Transformation,
Analysis, and Retrieval of Information by Computer. Addison-
Wesley, 1989.
[SB88] Gerard Salton and Christopher Buckley. Term-weighting ap-
proaches in automatic text retrieval. Information Processing and
Management, 24(5):513–523, 1988.
[SG00] Alexander Strehl and Joydeep Ghosh. A scalable approach to bal-
anced, high-dimensional clustering of market-baskets. In HiPC,
pages 525–536, December 2000.
[Sha48] C E Shannon. A mathematical theory of communication. The
Bell System Technical Journal, 27(4):379–423, 1948.
[SKK00] M Steinbach, G Karypis, and V Kumar. A comparison of doc-
ument clustering techniques. In KDD workshop on text mining,
volume 400, pages 525–526. Department of Computer Science and
Engineering University of Minnesota, Citeseer, 2000.
[SS73] P H A Sneath and R R Sokal. Numerical Taxonomy: The Prin-
ciples and Practice of Numerical Classification. A Series of books
in biology. Freeman, 1973.
[SW97] Karen Spärck Jones and Peter Willett. Readings in Information
Retrieval. Morgan Kaufmann, 1997.
[TRE99] TREC. Text REtrieval Conference (TREC), 1999.
[TSK05] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Introduction
to Data Mining, First Edition, chapter 8. Addison-Wesley, May
2005.
[VNS05] P.A. Vijaya, M. Narasimha Murty, and D.K. Subramanian. An ef-
ficient hybrid hierarchical agglomerative clustering (HHAC) tech-
nique for partitioning large data sets. In PReMI, Lecture Notes
in Computer Science, pages 583–588, Berlin, Heidelberg, 2005.
Springer Berlin Heidelberg.
[XW05] Rui Xu and Donald Wunsch. Survey of clustering algorithms.
IEEE Transactions on Neural Networks, 16(3):645–678, 2005.
[ZHD+01] Hongyuan Zha, Xiaofeng He, Chris Ding, Horst Simon, and Ming
Gu. Bipartite graph partitioning and data clustering. ACM Press,
New York, New York, USA, October 2001.
[ZKF05] Ying Zhao, George Karypis, and Usama Fayyad. Hierarchical
clustering algorithms for document datasets. Data Mining and
Knowledge Discovery, 10(2):141–168, March 2005.
A PRELIMINARY ANALYSIS (HYBRID K-MEANS ALGORITHM)
We started our experiment with the K-means partitional clustering algorithm
to check whether the hybrid K-means algorithm would give better results.
Listing 2: Hybrid K-means algorithm
1. Pick K' random points as the initial centroids (cluster prototypes).
2. Assign each document to the closest centroid.
3. Recalculate the centroid of each cluster.
4. Repeat Steps 2 (Assignment step) and 3 (Recalculation step) until the centroids do not change.
5. Compute the centroids (cluster prototypes) for each of the K' clusters such that each document in the collection belongs to one of these centroids.
6. Construct a K' x K' similarity matrix between these centroids.
7. Merge the two most similar centroids (i.e., place these centroids in the same cluster).
8. Update the centroid similarity matrix.
9. Repeat Steps 7 (Merging step) and 8 (Updating step) until the K clusters of centroids are generated.
10. If two centroids belong to the same centroid cluster, then the document clusters of these centroids go together as a final cluster (Merging step).
Listing 2 shows the modified version of the algorithm in Listing 1. Steps 1–4
of Listing 2 use the K-means algorithm. Steps 6–9 use the UPGMA
agglomerative hierarchical algorithm. The final Step 10 maps and merges
the document clusters based on the centroid clusters from Step 9 to generate
the final clusters.
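The K-means stage (Steps 1–4 of Listing 2) can be sketched as follows; the UPGMA merge of the resulting centroids is the same as before. The function names and toy data are ours, and the empty-cluster problem discussed below is deliberately left unhandled, as in the basic algorithm:

```python
import numpy as np

def kmeans(X, k_prime, iters=30, seed=0):
    """Steps 1-4 of Listing 2: assign / recalculate until (near) convergence.
    Returns a centroid index per document and the K' centroids."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k_prime, replace=False)].astype(float)
    for _ in range(iters):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        for j in range(k_prime):
            if (labels == j).any():   # an empty cluster keeps its old center
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

X = np.array([[0.0, 0.0], [0.0, 0.1], [9.0, 9.0], [9.0, 9.1]])
labels, centers = kmeans(X, k_prime=2)
```

The random initial centroids in the first line of the loop are exactly the initialization that, as discussed next, makes the hybrid K-means variant fragile.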
We ran the hybrid K-means algorithm shown in Listing 2 along with
the standard K-means algorithm to compare the goodness of the clusters
generated by both methods. The TREC collections TR11, TR12, TR23,
TR31, and TR45 were used in this experiment. We calculated the Average
F-Measure of the clusters generated by the K-means and hybrid K-means
algorithms for each dataset to compare the results of both experiments.
Figure 5: Average F-measure for TREC Collections.
In this experiment, the hybrid K-means clustering algorithm performed
worse than the K-means algorithm. We found the following possible reasons
for this inferior performance.
• The initialization problem of the K-means algorithm has a major negative
impact on its hybrid method. The initial centroids used by the
K-means stage of hybrid K-means affect the generated document
clusters. The bisect K-means algorithm is less susceptible to the
initialization problem.
• When we ran the hybrid K-means algorithm, all the documents tended
to be placed in one or two document clusters on each dataset, and most of
the generated document clusters were empty. Each cluster should have
at least a few documents in order to compute its centroid. If a document
cluster is empty, then its centroid will be at the origin; if many document
clusters are empty, these centroids will all be placed in the same centroid
cluster, which degrades the cluster quality. For the algorithm to work
properly, most of the document clusters should contain at least a few
documents.
• The K-means algorithm tends to generate non-uniform clusters when the
data is uniform [HJJ09].

Figure 6: TR31-Average F-Measure measured for various K'.
B RELATION BETWEEN K' AND QUALITY OF THE CLUSTERS
In this section, we analyze the impact of the chosen value of K' on the
quality of the clusters. To find the importance of K', we ran the algorithm
shown in Listing 1 for various values of K' with the same experimental
setting, and measured the Average Entropy, Average F-Measure, and Average
Purity for each value of K'. As the value of the parameter K' increases, the
Average Entropy, F-Measure, and Purity vary only slightly. We used the TREC
collections TR31 and FBIS for this experiment.
Figure 7: FBIS-Average Entropy computed for various values of K’.
Figure 8: FBIS-Average F-measure computed for various values of K’.
Figure 9: FBIS-Average Purity computed for various values of K’.
Figure 10: TR31-Average Entropy computed for various values of K’.
Figure 11: TR31-Average F-measure computed for various values of K’.
Figure 12: TR31-Average Purity computed for various values of K’.
From this analysis, we found that the value of the parameter K' should
be greater than the value of the parameter K, i.e., K' > K or K' = nK for
n > 1. Even for K < K' ≤ 2K, the method gives better results, as shown
in Figures 7–12. For K' = N, the hybrid algorithm performs worse: it
generates a complete top-down hierarchical tree in which each document has
its own cluster, and then uses these single-document clusters in the UPGMA
algorithm to build the bottom-up hierarchical tree until K
clusters are obtained. The value of K' should therefore satisfy K < K' < N.
With K' = 2K, we can compute the time complexity of our method.
It takes O(N) to generate the set of K' document clusters using the bisect
K-means algorithm and O(K'^2) to generate the set of K centroid clusters using
UPGMA, where N and K are the total numbers of documents and
clusters in a collection ζ [SKK00]. Finally, O(K) is required to compute the
final K document clusters from these K centroid clusters. The total
computational cost of this method is therefore O(N + K'^2 + K) = O(N),
since K'^2 = O(N) for the values of K' considered here.