HYBRID HIERARCHICAL CLUSTERING:
AN EXPERIMENTAL ANALYSIS ∗
Keerthiram Murugesan, Jun Zhang
Technical Report: CMIDA-HiPSCCS #001-11
Department of Computer Science,
University of Kentucky,
Lexington, KY
{kmu222, jzhang}@cs.uky.edu
February 17, 2011
Abstract
In this paper, we present a hybrid clustering method that combines
divisive hierarchical clustering with agglomerative hierarchical
clustering. We use the bisect K-means divisive clustering algorithm
in our method. First, we cluster the document collection using the bisect
K-means clustering algorithm with K' > K as the total number of
clusters. Second, we calculate the centroids of the K' clusters obtained
from the previous step. Then we apply the Unweighted Pair Group
Method with Arithmetic Mean (UPGMA) agglomerative hierarchical
algorithm to these centroids for the given K. After UPGMA finds
K clusters among these K' centroids, if two centroids end up in the same
cluster, then all of their documents belong to the same cluster.
∗ © University of Kentucky, 2011.
We compared the goodness of the clusters generated by the standard bisect
K-means algorithm and the proposed hybrid algorithm, measured
on various cluster evaluation metrics. Our experimental results show
that the proposed method outperforms the standard bisect K-means
algorithm. Finally, we show that the choice of K' has no
major impact on the final results by analyzing the relation between the
value of K' and the quality of the clusters.
1 INTRODUCTION
Document clustering algorithms help find groups of documents that share a
common pattern [TSK05,SKK00,GRS98,ZKF05,CKPT92]. They have been used
to automatically find clusters in a collection without any user supervision.
The main goal of clustering is to find meaningful groups so that
analyzing the documents within clusters is much easier than
viewing the collection as a whole. Some of the most common applications
of clustering are information retrieval, document organization, genetics,
weather forecasting, and medical imaging [TSK05].
There are different ways to cluster documents, but two types
of clustering methods are most common: partitional and hierarchical clustering. A
partitional clustering algorithm finds all the non-overlapping clusters at once
by dividing the set of documents so as to minimize or maximize an objective
function [JD88,Mac67,NH94,CS96,ZHD+01,HXZ+98,Bol98,SG00,DHZ+01].
Most partitional clustering algorithms are prototype-based: a prototype is
chosen for each cluster and the documents are grouped based on these prototypes. Usually
these algorithms run several iterations until they converge or an optimum
condition is met.
Hierarchical clustering generates a tree of clusters by splitting or merging
clusters at each level until the desired number of clusters is generated.
The generated tree is often called a dendrogram [GRS98,SS73,Kin67,
KHK99], as shown in Figure 1.
Figure 1: Dendrogram
Hierarchical clustering can use a top-down (divisive) or a bottom-up
(agglomerative) approach to construct the dendrogram. Agglomerative
clustering starts with each document in its own cluster and repeatedly merges
the two clusters that are most similar at each step until a single
cluster of all documents is obtained. Divisive clustering, on the other hand,
starts with all documents in a single cluster and splits clusters until all
clusters are singletons. These types of clustering are widely used
because the hierarchy of clusters they build resembles the structure of
their application domains [ZKF05,XW05,CKPT92].
Hierarchical agglomerative clustering algorithms merge a pair of clusters
at each step based on one of the following linkage metrics for measuring
the proximity between clusters [GRS98,JD88,SS73,Kin67,KHK99,GRS99].
The Single Link Algorithm (SLA) uses the maximum pair-wise
similarity between the two clusters to decide which pair to merge [SS73]:

Similarity_SLA(C_A, C_B) = max_{x ∈ C_A, y ∈ C_B} cos(x, y)    (1)
The Complete Link Algorithm (CLA) uses the minimum pair-wise
similarity between the two clusters [Kin67]:

Similarity_CLA(C_A, C_B) = min_{x ∈ C_A, y ∈ C_B} cos(x, y)    (2)
The Group Average Algorithm (Unweighted Pair Group Method with
Arithmetic Mean, UPGMA) uses the average pair-wise similarity of
the documents from the two clusters [JD88]:

Similarity_UPGMA(C_A, C_B) = (1 / (n_A * n_B)) * Σ_{x ∈ C_A, y ∈ C_B} cos(x, y)    (3)

where n_A and n_B are the numbers of documents in C_A and C_B.
Here cos(x, y) is the cosine similarity between two documents or clusters
represented as vectors; cosine similarity is discussed further in Section 4.2.
Though there are many other linkage metrics, Equations (1), (2), and (3)
are the most commonly used. These methods use a Euclidean distance or
inter-cluster similarity matrix for their comparisons and measurements.
Experiments have shown that UPGMA performs better than the SLA and
CLA algorithms.
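The three linkage metrics above translate directly into code. The following sketch is illustrative (the helper names and toy clusters are ours, not from the report); each cluster is a matrix with one document vector per row.

```python
import numpy as np

def cos_sim(x, y):
    # Cosine similarity between two document vectors (see Section 4.2)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def linkage_similarity(A, B, method="upgma"):
    """Cluster-to-cluster similarity: 'single' is Eq. (1), 'complete' is
    Eq. (2), and 'upgma' is Eq. (3)."""
    sims = [cos_sim(x, y) for x in A for y in B]
    if method == "single":
        return max(sims)              # most similar cross-cluster pair
    if method == "complete":
        return min(sims)              # least similar cross-cluster pair
    return sum(sims) / len(sims)      # average over all pairs

A = np.array([[1.0, 0.0], [1.0, 1.0]])
B = np.array([[0.0, 1.0]])
print(linkage_similarity(A, B, "single"))    # cos([1,1],[0,1]) = 1/sqrt(2)
print(linkage_similarity(A, B, "complete"))  # cos([1,0],[0,1]) = 0.0
```

Note how the same pairwise similarities yield very different cluster similarities under the three metrics, which is why the choice of linkage matters.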
Divisive hierarchical clustering algorithms pick a large cluster, or the cluster
with the lowest intra-cluster similarity, to split at each level [ELL01,
KR90]. More recently, partitional clustering algorithms have been used to
split the clusters. The bisect K-means algorithm [SKK00] is a typical divisive
hierarchical clustering algorithm that uses the K-means partitional clustering
algorithm to split its clusters.
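The bisecting idea can be sketched in a few lines. This is a minimal sketch under our own assumptions (Euclidean 2-way K-means, always splitting the largest cluster, no handling of degenerate empty splits); the function names are ours.

```python
import numpy as np

def two_means(X, iters=20, seed=0):
    """2-way K-means: returns a boolean mask splitting the rows of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=2, replace=False)].astype(float)
    for _ in range(iters):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        for j in (0, 1):
            if (labels == j).any():       # keep old center if side is empty
                centers[j] = X[labels == j].mean(axis=0)
    return labels == 1

def bisect_kmeans(X, k):
    """Repeatedly bisect the largest cluster until k clusters exist.
    Returns a list of index arrays into the rows of X."""
    clusters = [np.arange(len(X))]
    while len(clusters) < k:
        i = max(range(len(clusters)), key=lambda j: len(clusters[j]))
        idx = clusters.pop(i)
        mask = two_means(X[idx])
        clusters += [idx[~mask], idx[mask]]
    return clusters

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0],
              [5.0, 5.1], [10.0, 0.0], [10.0, 0.2]])
parts = bisect_kmeans(X, 3)
```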
This paper focuses on how a hybrid of one or more of these clustering
algorithms can give better results and generate better clusters than
the traditional methods. The rest of this paper is organized as follows: Section 2
discusses the motivations for this work and related work.
In Section 3, we present the suggested hybrid method, which uses both the
divisive and agglomerative hierarchical clustering algorithms. Section 4 provides
the experimental analysis for this method and discusses the data
collections and cluster validity measures used. Then, we present
the detailed experimental results for the suggested hybrid clustering method.
Appendix A presents the results of a hybrid partitional algorithm (hybrid K-means)
based on the quality of the clusters and explains how it performs
in this experimental setting. Finally, we present the relation between K' and
the quality of the clusters generated by this hybrid clustering method and
discuss the total computational cost of the method in Appendix B.
2 MOTIVATIONS
In general, hierarchical clustering algorithms outperform partitional clustering
algorithms in terms of cluster quality [JD88]. But the computational cost
(space and time complexity) of agglomerative and divisive hierarchical clustering
is very high compared to that of partitional clustering [LA99,PHB00,
XW05,ELL01]. At each step, we need to analyze all the documents in a
collection to split or merge clusters to construct the dendrogram.
A new method that combines the advantages of the partitional and hierarchical
clustering algorithms should give better results. The bisect K-means
algorithm is a hybrid partitional clustering algorithm that uses the (2-way)
K-means partitional clustering algorithm to split a cluster at each step,
constructing the hierarchical tree in a top-down fashion. The bisect K-means
algorithm picks a cluster based on certain criteria at each step and splits it until the
desired number of clusters, or the complete hierarchical tree, is generated.
Using partitional clustering algorithms to generate the hierarchical
tree has become quite popular in recent years. Experiments have shown that these
hybrid partitional clustering algorithms perform better than the traditional
clustering algorithms [ZKF05].
A cluster generated by a clustering algorithm may contain outliers and false-positive
documents. Recently, an approach has been discussed that refines the clusters
generated by a clustering algorithm in a second level to remove the outliers
and improve the quality of the clusters. It is similar to using the centroids of
the clusters (generated by any clustering algorithm) as the initial centroids
of the standard K-means algorithm, which eliminates the
initialization problem of the K-means clustering algorithm. Chitta et al.
proposed a two-level K-means algorithm [CN10] that generates an arbitrarily
chosen number k' of clusters in the first level and refines these clusters in the
second level using the cluster radius ri and an allowed threshold value t.
A hybrid hierarchical agglomerative algorithm was discussed by Vijaya et
al. [VNS05]. It uses the Single Link (SLA) or Complete Link (CLA) agglomerative
hierarchical algorithm with an incremental partitional clustering
algorithm (the Leader algorithm). In order to generate k clusters using the
bottom-up hierarchical tree with (N to k) levels, it generates an arbitrarily
chosen number k' of clusters using the leader-based partitional clustering
algorithm, thereby bypassing the cumbersome process of generating the (N to k')
levels of the bottom-up hierarchical tree. Then the SLA / CLA hierarchical
agglomerative clustering algorithm is applied for the remaining (k' to k) levels
to generate the final k clusters.
In this paper, we used the ideas from [SKK00, CN10, VNS05] for the
hybrid bisect K-means clustering algorithm to generate a set of good clusters
using the agglomerative and divisive hierarchical clustering algorithms.
3 HYBRID CLUSTERING METHOD
Hierarchical clustering algorithms build a hierarchy of quality clusters. One
of the main problems with hierarchical clustering is that documents
put together in an early stage of the algorithm are never moved again. In
other words, hierarchical clustering preserves a local optimization
criterion but not a global one [TSK05]. If we can somehow
correct these misplaced documents in the generated clusters, we can move
toward preserving the global optimization criterion.
Our algorithm uses both the top-down (bisect K-means) and bottom-up
(UPGMA) hierarchical clustering algorithms to address this
problem. We pass the K' cluster information (centroids) computed by the
bisect K-means algorithm to the UPGMA algorithm to correct the inconsistencies
that occur due to wrong decisions made while merging or splitting
a cluster.
First, we run the bisect K-means algorithm on the document collection
for a particular value of K' (in this case K' = √N; see Appendix B) until K'
document clusters are generated. At each step, the cluster with the largest
number of documents or the lowest intra-cluster similarity value is chosen
to split. The generated document clusters should not be empty (discussed
in Appendix A). Then, we calculate the centroid of each of the resulting
clusters. Each of these centroids represents a document cluster and all of its
documents.
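The centroid computation just described is a simple mean of the document vectors in each cluster; a sketch (the helper and toy data are ours):

```python
import numpy as np

def centroids(X, clusters):
    """Centroid of each document cluster: the mean of its rows in X.
    `clusters` is a list of index arrays into the rows of X."""
    return np.vstack([X[idx].mean(axis=0) for idx in clusters])

X = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 0.0]])
cents = centroids(X, [np.array([0, 1]), np.array([2])])
print(cents)  # rows [1., 1.] and [4., 0.]
```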
Listing 1: Hybrid Bisect K-means algorithm
1. Pick a cluster to split. (Initially the whole document collection is used as a single cluster.)
2. Find 2 sub-clusters using the K-means algorithm.
3. Repeat Steps 1 (Initialization step) and 2 (Bisecting step) until K' (> K) clusters are generated.
4. Compute the centroids (cluster prototypes) for each of the K' clusters such that each document in the collection belongs to one of these centroids.
5. Construct a K' x K' similarity matrix between these centroid clusters.
6. Merge the two most similar centroid clusters (i.e., place these centroids in the same cluster).
7. Update the centroid cluster similarity matrix.
8. Repeat Steps 6 (Merging step) and 7 (Updating step) until the K clusters of centroids are generated.
9. If two centroids belong to the same centroid cluster, then the document clusters of these centroids go together as a final cluster (Merging step).
In Steps 5–8, we run the UPGMA agglomerative hierarchical clustering
algorithm on the centroids of these document clusters for a given value of K
(given in the algorithm) to generate a set of K centroid clusters. We use
the term centroid cluster to avoid possible confusion with the document
clusters: just as a document cluster is a cluster of documents, a centroid cluster is
a cluster of centroids. The resulting centroid clusters are used as a reference
in merging the document clusters to obtain the final K clusters, as shown in
Step 9 of Listing 1.
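The UPGMA stage over the centroids can be sketched with SciPy's average-linkage routine; the helper name and toy data are ours, and we use cosine distance (1 − cosine similarity) since SciPy's hierarchy module works with distances rather than similarities:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def merge_centroids(cents, doc_labels, K):
    """UPGMA over the K' centroids, then relabel every document with the
    final cluster of its centroid (the merging steps of Listing 1)."""
    Z = linkage(pdist(cents, metric="cosine"), method="average")  # UPGMA
    cent_cluster = fcluster(Z, t=K, criterion="maxclust")  # K' -> K labels
    return cent_cluster[np.asarray(doc_labels)]            # label per document

# Toy example: 4 centroids forming 2 natural groups, 6 documents.
cents = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
doc_labels = [0, 0, 1, 2, 2, 3]  # which centroid each document belongs to
final = merge_centroids(cents, doc_labels, K=2)
```

Here documents attached to centroids 0 and 1 end up in one final cluster and those attached to centroids 2 and 3 in the other, which is exactly the "centroids in the same centroid cluster" merge rule.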
4 EXPERIMENTAL ANALYSIS
4.1 DOCUMENT COLLECTIONS
We used ten data collections from TREC-5, TREC-6, and TREC-7 [TRE99]
and the Reuters source [Lew99]. Table 1 shows the list of data
used in this experiment and its class distribution [SKK00,HK00]. The TR11,
TR12, TR23, TR31, and TR45 collections are taken from TREC-5, TREC-6,
and TREC-7. LA1 and LA2 are from the Los Angeles Times data of
TREC-5. FBIS is from the Foreign Broadcast Information Service data of
TREC-5. For Reuters-21578, we selected documents that belong to only one
category; the resulting set is split into two collections, RE0 and RE1.
For all the document collections shown in Table 1, we removed the
common words in the stop list and stemmed the remaining words using Porter's
stemming algorithm [Por80]. The classes of all these data collections are
generated from the relevance judgments given in these collections.
4.2 DISTANCE AND SIMILARITY MEASURES
All the documents in a document collection are represented using the vector
space model [Sal89], which uses the bag-of-words (vector of terms) concept, in
which each unique term in a document is assigned a weight based on
its importance. Several term weighting schemes are in use [SB88,
SW97]. We used the tf-idf term weighting scheme, in which a document is
represented by:

d = (t_1, w_1; t_2, w_2; t_3, w_3; ... ; t_n, w_n)    (4)

where w_i = tf_i * log(N / df_i) is the weight of the term t_i in Equation (4), tf_i is
the frequency of the term t_i in the document d, df_i is the document frequency
of the term t_i in the document collection ζ, and N is the total number of
documents in the collection ζ.
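As a concrete illustration of Equation (4), the tf-idf weights can be computed as follows (the helper and the tokenized toy documents are ours):

```python
import math

def tfidf(docs):
    """Per-document term weights w_i = tf_i * log(N / df_i).
    `docs` is a list of token lists; returns one {term: weight} dict each."""
    N = len(docs)
    df = {}                      # document frequency of each term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        tf = {}                  # term frequency within this document
        for term in doc:
            tf[term] = tf.get(term, 0) + 1
        weights.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return weights

docs = [["data", "mining", "data"], ["text", "mining"], ["data", "text"]]
w = tfidf(docs)
# "data" occurs twice in doc 0 and in 2 of the 3 documents: w = 2 * log(3/2)
```

A term that appears in every document gets weight 0, reflecting that it carries no discriminating power.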
Collection  Source  # of Docs  # of Classes  Min class size  Max class size  Avg. class size  # of Words
FBIS TREC 2463 17 38 506 144.9 2000
LA1 TREC 3204 6 273 943 534.0 31472
LA2 TREC 3075 6 248 905 512.5 31472
RE0 REUTERS-21578 1504 13 11 608 115.7 2886
RE1 REUTERS-21578 1657 25 10 371 66.3 3758
TR11 TREC 414 9 6 132 46.0 6429
TR12 TREC 313 8 9 93 39.1 5804
TR23 TREC 204 6 6 91 34.0 5832
TR31 TREC 927 7 2 352 132.4 10128
TR45 TREC 690 10 14 160 69.0 8261
Table 1: Data collections used in this experiment.
The hybrid bisect K-means algorithm uses the Euclidean distance and the cosine
similarity measure between documents/clusters to find their relationship
[XW05]. The Euclidean distance measure is used by the K-means algorithm in
the divisive step to split the document clusters, and the cosine similarity is used
in UPGMA to merge the centroid clusters. The Euclidean distance measures the
distance between two documents, or between a document and a centroid, projected
in Euclidean space:

D(x, y) = sqrt( Σ_{i = 1..T} (x_i − y_i)^2 )    (5)

where T is the total number of terms in the collection ζ. Cosine similarity is
used to find the similarity between two documents: if two documents x and
y are similar, the cosine similarity measure will be close to 1; otherwise
it will be close to 0.

Cos(x, y) = x^t y / (||x|| ||y||)    (6)
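Equations (5) and (6) translate directly into code (the helper names are ours):

```python
import numpy as np

def euclidean(x, y):
    """Eq. (5): Euclidean distance between two term vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sqrt(((x - y) ** 2).sum()))

def cosine(x, y):
    """Eq. (6): cosine similarity; near 1 for similar documents, near 0
    for unrelated ones (tf-idf term vectors are non-negative)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

print(euclidean([0, 0], [3, 4]))   # 5.0
print(cosine([1, 0], [0, 1]))      # 0.0 (orthogonal vectors)
```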
4.3 CLUSTER VALIDITY MEASURES
Cluster validity measures the goodness of the clusters generated by a clustering
algorithm. In order to compare the proposed clustering with the standard
bisect K-means algorithm, we used classification-oriented measures. A
clustering algorithm's performance can be measured using an internal
quality criterion or an external quality criterion. In this paper, we used
external quality criteria, which compare the clustering result produced
by the clustering algorithm with the known classes. These classes are identified
based on human judgment. We used three external quality measures
in this paper: Entropy, F-Measure, and Purity [TSK05,SKK00,ZKF05].
We use the following notation throughout this paper: C = {C_1, C_2, ..., C_k}
is the set of clusters produced by a clustering algorithm and
Ω = {Ω_1, Ω_2, ..., Ω_l} is the set of known classes, where l = k in
our experiment. |Ω_i| is the number of documents in the class Ω_i,
|C_j| is the number of documents in the cluster C_j, |Ω_i ∩ C_j| is the
number of documents that appear in both class Ω_i and cluster C_j, N is the total
number of documents in the collection, and c is the number of known classes
in the collection.
Entropy provides a measure of goodness [Sha48]. The entropy of the cluster
C_j can be defined as

H_j = − Σ_{i = 1..l} p_ij log(p_ij)    (7)

H_j = − Σ_{i = 1..l} (|Ω_i ∩ C_j| / |C_j|) log(|Ω_i ∩ C_j| / |C_j|),  j = 1..k    (8)

Equations (7) and (8) are equivalent, with p_ij = |Ω_i ∩ C_j| / |C_j|. The total
entropy of the clustering algorithm is the weighted sum of the entropies of all
the clusters in a collection:

H = Σ_{j = 1..k} H_j * |C_j| / N    (9)
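Equations (7)–(9) amount to the following computation on parallel lists of class and cluster labels (the helper and toy labels are ours):

```python
import math
from collections import Counter

def total_entropy(class_labels, cluster_labels):
    """Weighted cluster entropy, Eqs. (7)-(9): 0 for a perfect clustering."""
    N = len(class_labels)
    H = 0.0
    for c in set(cluster_labels):
        members = [cls for cls, cl in zip(class_labels, cluster_labels)
                   if cl == c]
        nj = len(members)
        # Eq. (8): entropy of this cluster from its class proportions
        Hj = -sum((n / nj) * math.log(n / nj)
                  for n in Counter(members).values())
        H += Hj * nj / N   # Eq. (9): weight by cluster size
    return H

print(total_entropy(["a", "a", "b", "b"], [0, 0, 1, 1]))  # perfect split: 0.0
print(total_entropy(["a", "b"], [0, 0]))                  # one mixed cluster
```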
F-Measure provides a measure of accuracy [LA99]. It is based on the
recall and precision measures used in the evaluation of information retrieval
systems. Precision is the fraction of a cluster that consists of objects
of a specified class. Recall is the extent to which a cluster contains all objects
of a specified class [TSK05]. Recall and precision can be computed as:

Recall:    R(Ω_i, C_j) = |Ω_i ∩ C_j| / |Ω_i|    (10)

Precision: P(Ω_i, C_j) = |Ω_i ∩ C_j| / |C_j|    (11)

The F-Measure between class Ω_i and cluster C_j is given by:

F(Ω_i, C_j) = 2 * R(Ω_i, C_j) * P(Ω_i, C_j) / (R(Ω_i, C_j) + P(Ω_i, C_j))    (12)

The F-Measure of a class Ω_i is the maximum F-Measure value obtained
at any node of the hierarchical clustering; the total F-Measure is the weighted
average of the F-Measure values of all the classes:

F-Measure(Ω_i) = max_{C_j ∈ C} F(Ω_i, C_j)    (13)

F-Measure = Σ_{i = 1..c} (|Ω_i| / N) * F-Measure(Ω_i)    (14)
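Equations (10)–(14), with classes and clusters represented as sets of document ids (the helper and toy data are ours):

```python
def total_f_measure(class_sets, cluster_sets, N):
    """Eqs. (10)-(14): best F over all clusters for each class (Eq. 13),
    weighted by class size (Eq. 14)."""
    total = 0.0
    for cls in class_sets:
        best = 0.0
        for clu in cluster_sets:
            inter = len(cls & clu)
            if inter == 0:
                continue
            recall = inter / len(cls)      # Eq. (10)
            precision = inter / len(clu)   # Eq. (11)
            f = 2 * recall * precision / (recall + precision)  # Eq. (12)
            best = max(best, f)
        total += len(cls) / N * best
    return total

classes = [{0, 1}, {2, 3}]
print(total_f_measure(classes, [{0, 1}, {2, 3}], N=4))   # perfect: 1.0
print(total_f_measure(classes, [{0, 1, 2}, {3}], N=4))   # imperfect split
```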
Purity measures the quality of the clusters. The purity of the cluster C_j is
given by:

Purity(C_j) = max_i |Ω_i ∩ C_j|,  Ω_i ∈ Ω    (15)

Purity = Σ_{j = 1..k} Purity(C_j) / N    (16)
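Equations (15)–(16) on the same label-list representation used for entropy (the helper is ours):

```python
from collections import Counter

def purity(class_labels, cluster_labels):
    """Eqs. (15)-(16): sum each cluster's dominant class count, divide by N."""
    N = len(class_labels)
    total = 0
    for c in set(cluster_labels):
        counts = Counter(cls for cls, cl in zip(class_labels, cluster_labels)
                         if cl == c)
        total += max(counts.values())  # Eq. (15): size of the dominant class
    return total / N                   # Eq. (16)

print(purity(["a", "a", "b", "b"], [0, 0, 1, 1]))  # 1.0
print(purity(["a", "b", "b"], [0, 0, 0]))          # 2/3
```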
5 RESULTS AND DISCUSSIONS
We evaluated the clusters generated by the hybrid bisect K-means algorithm by
running it on the various document collections discussed in Section 4, along with the
Data Source  Hybrid (Average)  Hybrid (Best)  Bisect K-means (Average)  Bisect K-means (Best)
FBIS 1.4430 1.3098 1.3533 1.3068
LA1 1.3663 1.1512 1.4164 1.3395
LA2 1.2797 1.3046 1.3143 1.2877
RE0 1.4100 1.2677 1.4162 1.3716
RE1 1.6190 1.4806 1.6418 1.5241
TR11 1.4011 1.2500 1.4102 1.3028
TR12 1.3798 1.1100 1.7344 1.6600
TR23 1.2071 1.0776 1.3351 1.3313
TR31 1.1233 0.9852 1.4105 1.3761
TR45 1.4059 1.1880 1.5922 1.5122
Table 2: Average and Best Entropy measured for Hybrid and Bisect K-means
algorithms.
bisect K-means algorithm to compare the quality of the generated clusters.
We used Entropy, F-Measure, and Purity as the evaluation metrics. The
generated clusters vary from run to run based on the following:
• The initial centroids used by K-means within the bisect K-means algorithm,
• The cluster picked to split at each step of the bisect K-means stage of the hybrid algorithm,
• The clusters picked to merge at each step of the UPGMA stage of the hybrid algorithm.
To account for this variation, we ran the hybrid algorithm 10 times on each document
collection for a particular value of K' (in this case K' = √N). Averaging the
results of these 10 runs on each document collection gives the final results as
Average Entropy, Average F-Measure, and Average Purity. The best results of
Average Entropy, Average
Data Source  Hybrid (Average)  Hybrid (Best)  Bisect K-means (Average)  Bisect K-means (Best)
FBIS 0.3798 0.4449 0.4052 0.4591
LA1 0.2400 0.3468 0.3256 0.3993
LA2 0.3181 0.3961 0.3905 0.4255
RE0 0.2936 0.3823 0.2502 0.3225
RE1 0.2929 0.3515 0.2818 0.3369
TR11 0.2910 0.3944 0.2478 0.3017
TR12 0.2928 0.3522 0.1946 0.2409
TR23 0.3217 0.4099 0.1719 0.2016
TR31 0.3361 0.4231 0.1407 0.2659
TR45 0.3981 0.4952 0.2627 0.3455
Table 3: Average and Best F-Measure measured for Hybrid and Bisect K-
Means algorithms.
F-Measure, and Average Purity for each document collection are boldfaced
in Tables 2, 3, and 4. The lower the Average Entropy, and the higher the
Average F-Measure and Average Purity, the better the generated clusters.
We also recorded the best value obtained for Average Entropy, F-Measure,
and Purity over the 10 runs for each data collection.
Figure 2: Average Entropy measured for various datasets.
Data Source  Hybrid (Average)  Hybrid (Best)  Bisect K-means (Average)  Bisect K-means (Best)
FBIS 0.5040 0.5684 0.5678 0.5863
LA1 0.4556 0.4969 0.4292 0.4800
LA2 0.4956 0.5385 0.4721 0.5024
RE0 0.4961 0.5299 0.5108 0.5352
RE1 0.4289 0.4689 0.5020 0.5576
TR11 0.4894 0.5362 0.4850 0.5362
TR12 0.3837 0.4281 0.3514 0.3802
TR23 0.5113 0.5735 0.4853 0.4853
TR31 0.5813 0.6656 0.4344 0.4423
TR45 0.4774 0.5710 0.4210 0.4507
Table 4: Average and Best Purity measured for Hybrid and Bisect K-Means
algorithms.
Figure 3: Average F-measure measured for various datasets.
Figure 4: Average Purity measured for various datasets.
Figures 2, 3, and 4 show graphical representations of the Average Entropy,
Average F-Measure, and Average Purity values measured for the various data
collections. Series 1 represents the hybrid algorithm and Series 2 represents
the bisect K-means algorithm.
6 CONCLUSION
In this paper, we proposed a hybrid algorithm, called hybrid bisect K-means,
that uses the top-down (bisect K-means) and bottom-up (UPGMA) hierarchical
clustering algorithms. We compared the clusters generated
by the hybrid bisect K-means algorithm with the clusters generated by the
bisect K-means algorithm based on three evaluation metrics: Entropy, F-Measure,
and Purity of the clusters. Based on the results obtained, we found
that the hybrid bisect K-means algorithm outperforms the bisect K-means
algorithm and produces better clusters.
We have also shown how the hybrid K-means algorithm performs worse
than the K-means algorithm in Appendix A. The relation between the
value of the parameter K' used in hybrid bisect K-means and the cluster
quality is discussed in Appendix B.
References
[Bol98] Daniel Boley. Principal direction divisive partitioning. Data Min-
ing and Knowledge Discovery, 2(4):325–344, December 1998.
[CKPT92] Douglass R. Cutting, David R. Karger, Jan O. Pedersen, and
John W. Tukey. Scatter/Gather: A cluster-based approach to
browsing large document collections. In Proceedings of the 15th
annual international ACM SIGIR conference on Research and de-
velopment in information retrieval - SIGIR ’92, pages 318–329,
New York, New York, USA, June 1992. ACM Press.
[CN10] Radha Chitta and M. Narasimha Murty. Two-level k-means clus-
tering algorithm for k–τ relationship establishment and linear-
time classification. Pattern Recognition, 43(3):796–804, March
2010.
[CS96] Peter Cheeseman and John Stutz. Bayesian Classification (Au-
toClass): Theory and Results, volume 180, pages 153–180. MIT
Press, 1996.
[DHZ+01] Chris H. Q. Ding, Xiaofeng He, Hongyuan Zha, Ming Gu, and
Horst D. Simon. A min-max cut algorithm for graph partitioning
and data clustering. In Proceedings of the 2001 IEEE Interna-
tional Conference on Data Mining, ICDM ’01, pages 107–114,
Washington, 2001. IEEE Computer Society.
[ELL01] Brian S Everitt, Sabine Landau, and Morven Leese. Cluster Anal-
ysis, volume 33 of Social Science Research Council Reviews of
Current Research. Arnold, 2001.
[GRS98] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. CURE: An
efficient clustering algorithm for large databases. ACM SIGMOD
Record, 27(2):73–84, June 1998.
[GRS99] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. ROCK: A
robust clustering algorithm for categorical attributes. In 15th In-
ternational Conference on Data Engineering, pages 512 – 521,
March 1999.
[HJJ09] Xiong Hui, Wu Junjie, and Chen Jian. K-Means clustering ver-
sus validation measures: A data-distribution perspective. IEEE
Transactions on Systems, Man, and Cybernetics, Part B (Cyber-
netics), 39(2):318–331, April 2009.
[HK00] Eui-Hong Han and George Karypis. Centroid-based document
classification: Analysis and experimental results, 2000.
[HXZ+98] Tianming Hu, Hui Xiong, Wenjun Zhou, Sam Yuan Sung, and
Hangzai Luo. Hypergraph partitioning for document clustering:
A summary of results. Bulletin of the Technical committee on
Data Engineering, 21(1):15 – 22, July 1998.
[JD88] Anil K. Jain and Richard C. Dubes. Algorithms for Clustering
Data. Prentice Hall, 1988.
[KHK99] G Karypis, E H Han, and V Kumar. Chameleon: A hierarchical
clustering algorithm using dynamic modeling. IEEE Computer,
32(8):68–75, 1999.
[Kin67] B King. Step-wise clustering procedures. Journal of the American
Statistical Association, 69:86–101, 1967.
[KR90] Leonard Kaufman and Peter J Rousseeuw. Finding Groups in
Data: An Introduction to Cluster Analysis. Wiley Series in Prob-
ability and Mathematical Statistics. Wiley, 1990.
[LA99] Bjornar Larsen and Chinatsu Aone. Fast and effective text mining
using linear-time document clustering. In Proceedings of the fifth
ACM SIGKDD international conference on Knowledge discovery
and data mining KDD 99, volume 5, pages 16–22. ACM Press,
1999.
[Lew99] David D. Lewis. The reuters-21578 text categorization test col-
lection, 1999.
[Mac67] J B MacQueen. Some methods for classification and analysis of
multivariate observations. In L M Le Cam and J Neyman, edi-
tors, Proceedings of the Fifth Berkeley Symposium on Mathemat-
ical Statistics and Probability, volume 1, pages 281–297. West-
ern Management Science Institute, University of California Press,
1967.
[NH94] R T Ng and J Han. Efficient and effective clustering methods for
spatial data mining. In Proceedings of the International Confer-
ence on Very Large Data Bases, pages 144–155. Citeseer, 1994.
[PHB00] Jan Puzicha, Tomas Hofmann, and Joachim M Buhmann. A the-
ory of proximity based clustering: structure detection by opti-
mization. Pattern Recognition, 33(4):617–634, 2000.
[Por80] M F Porter. An algorithm for suffix stripping. Program,
14(3):130–137, 1980.
[Sal89] Gerard Salton. Automatic Text Processing: The Transformation,
Analysis, and Retrieval of Information by Computer. Addison-
Wesley, 1989.
[SB88] Gerard Salton and Christopher Buckley. Term-weighting ap-
proaches in automatic text retrieval. Information Processing and
Management, 24(5):513–523, 1988.
[SG00] Alexander Strehl and Joydeep Ghosh. A scalable approach to bal-
anced, high-dimensional clustering of market-baskets. In HiPC,
pages 525–536, December 2000.
[Sha48] C E Shannon. A mathematical theory of communication. The
Bell System Technical Journal, 27(4):379–423, 1948.
[SKK00] M Steinbach, G Karypis, and V Kumar. A comparison of doc-
ument clustering techniques. In KDD workshop on text mining,
volume 400, pages 525–526. Department of Computer Science and
Engineering University of Minnesota, Citeseer, 2000.
[SS73] P H A Sneath and R R Sokal. Numerical Taxonomy: The Prin-
ciples and Practice of Numerical Classification. A Series of books
in biology. Freeman, 1973.
[SW97] Karen Spärck Jones and Peter Willett. Readings in Information
Retrieval. Morgan Kaufmann, 1997.
[TRE99] TREC. Text REtrieval Conference (TREC), 1999.
[TSK05] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Introduction
to Data Mining, First Edition, chapter 8. Addison-Wesley, May
2005.
[VNS05] P.A. Vijaya, M. Narasimha Murty, and D.K. Subramanian. An ef-
ficient hybrid hierarchical agglomerative clustering (HHAC) tech-
nique for partitioning large data sets. In PReMI, Lecture Notes
in Computer Science, pages 583–588, Berlin, Heidelberg, 2005.
Springer Berlin Heidelberg.
[XW05] Rui Xu and Donald Wunsch. Survey of clustering algorithms.
IEEE Transactions on Neural Networks, 16(3):645–678, 2005.
[ZHD+01] Hongyuan Zha, Xiaofeng He, Chris Ding, Horst Simon, and Ming
Gu. Bipartite graph partitioning and data clustering. ACM Press,
New York, New York, USA, October 2001.
[ZKF05] Ying Zhao, George Karypis, and Usama Fayyad. Hierarchical
clustering algorithms for document datasets. Data Mining and
Knowledge Discovery, 10(2):141–168, March 2005.
A PRELIMINARY ANALYSIS (HYBRID K-MEANS ALGORITHM)
We started our experiment with the K-means partitional clustering algorithm
to check whether the hybrid K-means algorithm would give better results.
Listing 2: Hybrid K-means algorithm
1. Pick K' random points as the initial centroids (cluster prototypes).
2. Assign each document to the closest centroid.
3. Recalculate the centroid of each cluster.
4. Repeat Steps 2 (Assignment step) and 3 (Recalculation step) until the centroids do not change.
5. Compute the centroids (cluster prototypes) for each of the K' clusters such that each document in the collection belongs to one of these centroids.
6. Construct a K' x K' similarity matrix between these centroids.
7. Merge the two most similar centroids (i.e., place these centroids in the same cluster).
8. Update the centroid similarity matrix.
9. Repeat Steps 7 (Merging step) and 8 (Updating step) until the K clusters of centroids are generated.
10. If two centroids belong to the same centroid cluster, then the document clusters of these centroids go together as a final cluster (Merging step).
Listing 2 shows the modified version of the algorithm in Listing 1. Steps 1–4
of Listing 2 use the K-means algorithm. Steps 6–9 use the UPGMA
agglomerative hierarchical algorithm. The final Step 10 maps and merges
the document clusters based on the centroid clusters from Step 9 to generate
the final clusters.
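The K-means stage (Steps 1–4 of Listing 2) can be sketched as follows; the UPGMA merge of the resulting centroids is the same as before. The function names and toy data are ours, and the empty-cluster problem discussed below is deliberately left unhandled, as in the basic algorithm:

```python
import numpy as np

def kmeans(X, k_prime, iters=30, seed=0):
    """Steps 1-4 of Listing 2: assign / recalculate until (near) convergence.
    Returns a centroid index per document and the K' centroids."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k_prime, replace=False)].astype(float)
    for _ in range(iters):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        for j in range(k_prime):
            if (labels == j).any():   # an empty cluster keeps its old center
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

X = np.array([[0.0, 0.0], [0.0, 0.1], [9.0, 9.0], [9.0, 9.1]])
labels, centers = kmeans(X, k_prime=2)
```

The random initial centroids in the first line of the loop are exactly the initialization that, as discussed next, makes the hybrid K-means variant fragile.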
We ran the hybrid K-means algorithm shown in Listing 2 along with
the standard K-means algorithm to compare the goodness of the clusters
generated by both methods. The TREC collections TR11, TR12, TR23,
TR31, and TR45 were used in this experiment. We calculated the Average
F-Measure of the clusters generated by the K-means and hybrid K-means
algorithms for each dataset to compare the results of both experiments.
Figure 5: Average F-measure for TREC Collections.
In this experiment, the hybrid K-means clustering algorithm performed
worse than the K-means algorithm. We found the following possible reasons
for this inferior performance.
• The initialization problem of the K-means algorithm has a major negative
impact on its hybrid method. The initial centroids used by the
K-means stage of hybrid K-means affect the generated document
clusters. The bisect K-means algorithm is less susceptible to the
initialization problem.
• When we ran the hybrid K-means algorithm, all the documents tended
to be placed in one or two document clusters on each dataset, and most of
the generated document clusters were empty. Each cluster should have
at least a few documents in order to compute its centroid. If a document
cluster is empty, then its centroid will be at the origin; if many document
clusters are empty, these centroids will all be placed in the same centroid
cluster, which degrades the cluster quality. For the algorithm to work
properly, most of the document clusters should contain at least a few
documents.
• The K-means algorithm tends to generate non-uniform clusters when the
data is uniform [HJJ09].

Figure 6: TR31-Average F-Measure measured for various K'.
B RELATION BETWEEN K' AND QUALITY OF THE CLUSTERS
In this section, we analyze the impact of the chosen value of K' on the
quality of the clusters. To find the importance of K', we ran the algorithm
shown in Listing 1 for various values of K' with the same experimental
setting, and measured the Average Entropy, Average F-Measure, and Average
Purity for each value of K'. As the value of the parameter K' increases, the
Average Entropy, F-Measure, and Purity vary only slightly. We used the TREC
collections TR31 and FBIS for this experiment.
Figure 7: FBIS-Average Entropy computed for various values of K’.
Figure 8: FBIS-Average F-measure computed for various values of K’.
Figure 9: FBIS-Average Purity computed for various values of K’.
Figure 10: TR31-Average Entropy computed for various values of K’.
Figure 11: TR31-Average F-measure computed for various values of K’.
Figure 12: TR31-Average Purity computed for various values of K’.
From this analysis, we found that the value of the parameter K' should
be greater than the value of the parameter K, i.e., K' > K or K' = nK for
n > 1. Even for K < K' ≤ 2K, the method gives better results, as shown
in Figures 7–12. For K' = N, the hybrid algorithm performs worse: it
generates a complete top-down hierarchical tree in which each document has
its own cluster, and then uses these single-document clusters in the UPGMA
algorithm to build the bottom-up hierarchical tree until K
clusters are obtained. The value of K' should therefore satisfy K < K' < N.
With K' = 2K, we can compute the time complexity of our method.
It takes O(N) to generate the set of K' document clusters using the bisect
K-means algorithm and O(K'^2) to generate the set of K centroid clusters using
UPGMA, where N and K are the total numbers of documents and
clusters in a collection ζ [SKK00]. Finally, O(K) is required to compute the
final K document clusters from these K centroid clusters. The total
computational cost of this method is therefore O(N + K'^2 + K) = O(N),
since K'^2 = O(N) for the values of K' considered here.