A similarity assessment technique for effective grouping
of documents
Tanmay Basu *, C.A. Murthy
Machine Intelligence Unit, Indian Statistical Institute, Kolkata 700108, India
ARTICLE INFO
Article history:
Received 20 February 2014
Received in revised form 25 December 2014
Accepted 15 March 2015
Available online 21 March 2015
Keywords:
Document clustering
Text mining
Applied data mining
ABSTRACT
Document clustering refers to the task of grouping similar documents and segregating dissimilar
documents. It is very useful to find meaningful categories from a large corpus. In
practice, the task of categorizing a corpus is not easy, since a corpus generally contains a huge
number of documents and the document vectors are high dimensional. This paper introduces a hybrid
document clustering technique by combining a new hierarchical and the traditional
k-means clustering techniques. A distance function is proposed to find the distance
between the hierarchical clusters. Initially the algorithm constructs some clusters by the
hierarchical clustering technique using the new distance function. Then k-means algorithm
is performed by using the centroids of the hierarchical clusters to group the documents
that are not included in the hierarchical clusters. The major advantage of the proposed dis-
tance function is that it is able to find the nature of the corpora by varying a similarity
threshold. Thus the proposed clustering technique does not require the number of clusters
prior to executing the algorithm. In this way the initial random selection of k centroids for
k-means algorithm is not needed for the proposed method. The experimental evaluation
using Reuter, Ohsumed and various TREC data sets shows that the proposed method per-
forms significantly better than several other document clustering techniques. F-measure
and normalized mutual information are used to show that the proposed method is effectively
grouping the text data sets.
© 2015 Elsevier Inc. All rights reserved.
1. Introduction
Clustering algorithms partition a data set into several groups such that the data points in the same group are close to each
other and the points across groups are far from each other [9]. The document clustering algorithms try to identify inherent
grouping of the documents to produce good quality clusters for text data sets. In recent years it has been recognized that
partitional clustering algorithms, e.g., k-means and buckshot, are advantageous due to their low computational complexity. On the other hand, these algorithms need the knowledge of the number of clusters. Generally document corpora are huge in size
with high dimensionality. Hence it is not easy to estimate the number of clusters for any real life document corpus.
Hierarchical clustering techniques do not need the knowledge of number of clusters, but a stopping criterion is needed to
terminate the algorithms. Finding a specific stopping criterion is difficult for large data sets.
The main difficulty of most of the document clustering techniques is to determine the (content) similarity of a pair of docu-
ments for putting them into the same cluster [3]. Generally cosine similarity is used to determine the content similarity
between two documents [24]. Cosine similarity actually checks the number of common terms present in the documents. If
http://dx.doi.org/10.1016/j.ins.2015.03.038
0020-0255/© 2015 Elsevier Inc. All rights reserved.
* Corresponding author. Tel.: +91 33 25753109; fax: +91 33 25783357.
E-mail addresses: [email protected] (T. Basu), [email protected] (C.A. Murthy).
Information Sciences 311 (2015) 149–162
two documents contain many common terms then they are very likely to be similar. The difficulty is that there is no clear explanation
as to how many common terms can identify two documents as similar. The text data sets are high dimensional
and most of the terms do not occur in each document. Hence the issue is to find the content similarity in such a way that it can
restrict the low similarity values. The actual content similarity between two documents may not be found properly by checking
only the individual terms of the documents. A new distance function is proposed to find the distance between two clusters based
on a similarity measure, extensive similarity between documents. Intuitively, the extensive similarity restricts the low (con-
tent) similarity values by a predefined threshold and then determines the similarity between two documents by finding their
distance with every other document in the corpus. It assigns a score to each pair of documents to measure the degree of content
similarity. A threshold is set on the content similarity value of the document vectors to restrict the low similarity values. A histogram
thresholding based method is used to estimate the value of the threshold from the similarity matrix of a corpus.
A new hybrid document clustering algorithm is proposed, which is a combination of a hierarchical and k-means clustering
technique. The hierarchical clustering technique produces some baseline clusters by using the proposed cluster distance
function. The hierarchical clusters are named as baseline clusters. These clusters are created in such a way that the documents
inside a cluster are very similar to each other. Actually the extensive similarity of every pair of documents of a baseline cluster is
very high. The documents of two different baseline clusters are very dissimilar to each other. Thus the baseline clusters intui-
tively determine the actual categories of the document collection. Generally there exist some singleton clusters after con-
structing the hierarchical clusters. The distance between a singleton cluster and each baseline cluster is not so small.
Hence the k-means clustering algorithm is performed to group each of these documents into the particular baseline cluster with which
it has the highest content similarity. If, for several iterations of the k-means algorithm, each of these singleton clusters is grouped
into the same baseline cluster, then it is likely to be assigned correctly. The significant property of the proposed technique
is that it can automatically identify the number of clusters. It has become clear from the experiments that the number of
clusters of each corpus is very close to the actual number of categories. The experimental analysis using several well known TREC and
Reuter data sets has shown that the proposed method performs significantly better than several existing document clustering
algorithms.
The paper is organized as follows. Section 2 describes some related works. The document representation technique is pre-
sented in Section 3. The proposed document clustering technique is explained in Section 4. The evaluation criteria for evaluating
the clusters generated by a particular method are described in Section 5. Section 6 presents the experimental results
and a detailed analysis of the results. Finally we conclude and discuss the further scope of this work in Section 7.
2. Related works
There are two basic types of document clustering techniques available in the literature – hierarchical and partitional clus-
tering techniques [8,11].
Hierarchical clustering produces a hierarchical tree of clusters where each individual level can be viewed as a com-
bination of clusters in the next lower level. This hierarchical structure of clusters is also known as dendrogram. The
hierarchical clustering techniques can be divided into two parts – agglomerative and divisive. In an Agglomerative
Hierarchical Clustering (AHC) method [30], starting with each document as an individual cluster, at each step, the most similar
clusters are merged until a given termination condition is satisfied. In a divisive method, starting with the whole set of docu-
ments as a single cluster, the method splits a cluster into smaller clusters at each step until a given termination condition is
satisfied. Several halting criteria for AHC algorithms have been proposed. But no widely acceptable halting criterion is avail-
able for these algorithms. As a result some good clusters may be merged, which will be eventually meaningless to the user.
There are mainly three variations of AHC techniques – single-link, complete-link and group-average hierarchical method for
document clustering [6].
In the single-link method the similarity between a pair of clusters is calculated as the similarity between the two most similar
documents, one from each cluster. The complete-link method measures the similarity
between a pair of clusters as the similarity between the two least similar documents, one from each cluster. The group average method merges
the two clusters that have the highest average similarity among all cluster pairs, where average similarity means the average of the similarities
between the documents of the two clusters. In a divisive hierarchical clustering technique, initially, the method assumes
the whole data set as a single cluster. Then at each step, the method chooses one of the existing clusters and splits it into two.
The process continues till only singleton clusters remain or it reaches a given halting criterion. Generally the cluster with the
least overall similarity is chosen for splitting [30].
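The three linkage rules above can be sketched directly from a precomputed document-similarity matrix; the function and the toy matrix below are illustrative assumptions, not data from the paper:

```python
# Sketch of the three AHC cluster-similarity rules over a precomputed
# similarity matrix (toy values, not from the paper).

def cluster_similarity(sim, cx, cy, rule):
    """Similarity between clusters cx and cy (lists of document indices)."""
    pairs = [sim[i][j] for i in cx for j in cy]
    if rule == "single":      # the two most similar cross-cluster documents
        return max(pairs)
    if rule == "complete":    # the two least similar cross-cluster documents
        return min(pairs)
    if rule == "average":     # mean over all cross-cluster pairs
        return sum(pairs) / len(pairs)
    raise ValueError(rule)

sim = [[1.0, 0.9, 0.25],
       [0.9, 1.0, 0.75],
       [0.25, 0.75, 1.0]]

print(cluster_similarity(sim, [0, 1], [2], "single"))    # 0.75
print(cluster_similarity(sim, [0, 1], [2], "complete"))  # 0.25
print(cluster_similarity(sim, [0, 1], [2], "average"))   # 0.5
```

An agglomerative pass would repeatedly merge the pair of clusters with the highest value under the chosen rule.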
In a recent study, Lai et al. have proposed an agglomerative hierarchical clustering algorithm by using dynamic k-nearest
neighbor list for each cluster. The clustering technique is named as Dynamic k-Nearest Neighbor Algorithm (DKNNA) [16]. The
method uses a list of dynamic k nearest neighbors to store k nearest neighbors of each cluster. Initially the method assumes
each document as a cluster and finds the k nearest neighbors of each cluster. The two clusters at minimum distance are merged and
their nearest neighbors are updated accordingly; then the minimum distant clusters are again found and merged, and so
on. The algorithm continues until the desired number of clusters is obtained. In the merging and updating process of each
iteration, the k nearest neighbors of the clusters that are affected by the merging process are updated. If the set of k nearest
neighbors is empty for some of the clusters being updated, their nearest neighbors are determined by searching all the
clusters. Thus the approach can guarantee the exactness of the nearest neighbors of a cluster and can obtain good quality clusters [16]. Although the algorithm has shown good results for some artificial and image data sets, it has two
limitations when applied to text data sets. The method needs the knowledge of the desired number of clusters, which is very difficult
to predict, and it is problematic to determine a valid k for text data sets.
In contrast to hierarchical clustering techniques, partitional clustering techniques allocate data into a previously known
fixed number of clusters. The commonly used partitional clustering technique is k-means method, where k is the desired
number of clusters [13]. Here initially k documents are chosen randomly from the data set, and they are called seed points.
Each document is assigned to its nearest seed point, thereby creating k clusters. Then the centroids of the clusters are com-
puted, and each document is assigned to its nearest centroid. The process continues until it converges to a solution,
i.e., the centroids remain the same in two consecutive iterations, or until it terminates
after a fixed number of iterations set by the user. The k-means algorithm is advantageous for its low computational
complexity [23]. It takes linear time to build the clusters. The main disadvantage is that the number of clusters is fixed and it
is very difficult to select a valid k for an unknown text data set. Also there is no universally acceptable way of choosing the
initial seed points. Recently Chiang et al. proposed a time efficient k-means algorithm by compressing and removing the
patterns at each iteration that are unlikely to change their membership thereafter [22], but the limitations of the k-means
clustering technique have not been discussed.
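The k-means procedure described above can be sketched as follows; this is a minimal version assuming dense vectors and Euclidean distance, and the toy points are illustrative (the paper's setting would use high-dimensional tf-idf vectors):

```python
import numpy as np

# Minimal k-means sketch: random seed points, assignment to the nearest
# centroid, centroid recomputation, and termination when the assignment
# stops changing. Toy data, illustrative only.

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(iters):
        # distance of every point to every centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                       # assignments stable: converged
        labels = new_labels
        # recompute each centroid; keep the old one if a cluster empties
        centroids = np.array([X[labels == j].mean(axis=0)
                              if np.any(labels == j) else centroids[j]
                              for j in range(k)])
    return labels, centroids

X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]], dtype=float)
labels, cents = kmeans(X, k=2)
```

On this toy data the two well-separated groups are recovered regardless of which points are drawn as seeds, which is exactly what cannot be guaranteed on real corpora with a poor initial selection.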
Bisecting k-means method [30] is a variation of basic k-means algorithm. This algorithm tries to improve the quality of
clusters in comparison to k-means clusters. In each iteration, it selects the largest existing cluster (the whole data set in
the first iteration) and divides it into two subsets using k-means (k = 2) algorithm. This process is continued till k clusters
are formed. Bisecting k-means algorithm generally produces almost uniform sized clusters. Thus it can perform better than
k-means algorithm when the actual groups of a data set are of almost similar size, i.e., the numbers of documents in the
categories of a corpus are close to each other. On the contrary, the method produces poor clusters for corpora where
the numbers of documents in the categories differ greatly. This method also faces difficulties like the k-means algorithm,
in choosing the initial seed points and a proper value of the parameter k.
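The bisecting procedure can be sketched as below; the deterministic farthest-pair seeding of the inner 2-means step is an assumption made here for reproducibility, not the paper's prescription:

```python
import numpy as np

# Bisecting k-means sketch: repeatedly split the largest cluster into
# two until k clusters remain. The inner 2-means is seeded with the two
# farthest-apart points (an illustrative choice).

def two_means(X, iters=50):
    d = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    i, j = np.unravel_index(d.argmax(), d.shape)   # farthest pair as seeds
    c = X[[i, j]].astype(float)
    lab = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        lab = np.linalg.norm(X[:, None] - c[None, :], axis=2).argmin(axis=1)
        c = np.array([X[lab == t].mean(axis=0) if np.any(lab == t) else c[t]
                      for t in (0, 1)])
    return lab

def bisecting_kmeans(X, k):
    clusters = [np.arange(len(X))]        # start with one cluster of all docs
    while len(clusters) < k:
        big = max(range(len(clusters)), key=lambda t: len(clusters[t]))
        idx = clusters.pop(big)           # bisect the largest cluster
        lab = two_means(X[idx])
        clusters += [idx[lab == 0], idx[lab == 1]]
    return clusters

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]], dtype=float)
parts = bisecting_kmeans(X, k=3)
```

Because the largest cluster is always the one split, the resulting clusters tend toward uniform sizes, which is the behavior discussed above.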
Buckshot algorithm is a combination of basic k-means and hierarchical clustering methods. It tries to improve the perfor-
mance of k-means algorithm by choosing better initial centroids [26]. It uses a hierarchical clustering algorithm on some
sample documents of the corpus in order to find robust initial centroids. Then k-means algorithm is performed to find
the clusters using these robust centroids as the initial centroids [3]. But repeated calls to this algorithm may produce differ-
ent partitions. If the initial random sampling does not represent the whole data set properly, the resulting clusters may be of
poor quality. Note that appropriate value of k is necessary for this method too.
Spectral clustering algorithm is a very popular clustering method which works on the similarity matrix rather than the
original term-document matrix using the idea of graph cut. It uses the top eigenvectors of the similarity matrix derived from
the similarity between documents [25]. The basic idea is to construct a weighted graph from the corpus, where each node
represents a document and each weighted edge represents the similarity between two documents. In this technique the
clustering problem is formulated as a graph cut problem. The core of this theory is the eigenvalue decomposition of the
Laplacian matrix of the weighted graph obtained from data [10]. Let X = {d_1, d_2, ..., d_N} be the set of N documents to cluster.
Let S be the N × N similarity matrix, where S_ij represents the similarity between the documents d_i and d_j. Ng et al. [25] proposed
a spectral clustering algorithm, which partitions the Laplacian data matrix L into k subsets using the k
largest eigenvectors, and they have used a Gaussian kernel S_ij = exp(−ρ(d_i, d_j)/(2σ²)) on the similarity matrix. Here ρ(d_i, d_j) denotes
the similarity between d_i and d_j, and σ is the scaling parameter. The Gaussian kernel is used to get rid of the curse of dimensionality.
The main difficulty of using a Gaussian kernel is that it is sensitive to the parameter σ [21]. A wrong value of σ may
highly degrade the quality of the clusters. It is extremely difficult to select a proper value of σ for a document collection,
since the text data sets are generally sparse with high dimension. It should be noted that the method also suffers from
the limitations of the k-means method, discussed above.
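The embedding step of an Ng et al.-style spectral method can be sketched as follows, assuming a Gaussian kernel over Euclidean distances; the toy points and σ are illustrative. Rows of the embedding that belong to the same group come out nearly identical, so a simple k-means on the rows would recover the groups:

```python
import numpy as np

# Spectral embedding sketch: Gaussian-kernel similarity matrix, its
# normalized Laplacian D^{-1/2} S D^{-1/2}, and the k largest
# eigenvectors as the low-dimensional representation (toy data).

def spectral_embedding(X, k, sigma=1.0):
    sq = ((X[:, None] - X[None, :]) ** 2).sum(-1)   # pairwise squared distances
    S = np.exp(-sq / (2 * sigma ** 2))              # Gaussian kernel
    np.fill_diagonal(S, 0.0)
    d = S.sum(axis=1)
    L = S / np.sqrt(d[:, None] * d[None, :])        # normalized Laplacian form
    vals, vecs = np.linalg.eigh(L)                  # eigenvalues ascending
    U = vecs[:, -k:]                                # k largest eigenvectors
    return U / np.linalg.norm(U, axis=1, keepdims=True)  # row-normalize

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
U = spectral_embedding(X, k=2)
```

Shrinking or growing `sigma` changes S drastically, which is the sensitivity to σ noted in the text.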
Non-negative Matrix Factorization (NMF) has previously been shown to be a useful decomposition for multivariate data. It
finds the positive factorization of a given positive matrix [19]. Xu et al. [33] have demonstrated that NMF performs well for
text clustering compared to the other similar methods like singular value decomposition and latent semantic indexing. The
technique factorizes the original term-document matrix D approximately as D ≈ UV^T, where U is a non-negative matrix
of size n × m, and V is an N × m non-negative matrix. The base vectors in U can be interpreted as a set of terms in the vocabulary
of the corpus, while V describes the contribution of the documents to these terms. The matrices U and V are randomly
initialized, and their contents iteratively estimated [1]. The Non-negative Matrix Factorization method attempts to determine
U and V which minimize the following objective function:

J = (1/2) ||D − UV^T||²   (1)

where ||·||² denotes the squared sum of all the elements in the matrix. This is an optimization problem with respect to the
matrices U = [u_ik] and V = [v_jk], ∀i = 1, 2, ..., n, ∀j = 1, 2, ..., N and k = 1, 2, ..., m, and as the matrices U and V are non-negative,
we have u_ik ≥ 0, v_jk ≥ 0. This is a typical constrained non-linear optimization problem and can be solved using the Lagrange
method [3]. An interesting property of the NMF technique is that it can also be used to find word clusters instead of document
clusters. The columns of U can be used to discover a basis which corresponds to word clusters. The NMF algorithm has
its disadvantages too. The optimization problem of Eq. (1) is convex in either U or V, but not in both U and V, which means
that the algorithm can guarantee convergence to a local minimum only. In practice, NMF users often compare the local min-
ima from several different starting points, using the results of the best local minimum found. On large sized corpora this may
be problematic [17]. Another problem with NMF algorithm is that it relies on random initialization and as a result, the same
data might produce different results across runs [1].
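A minimal NMF sketch using the classic multiplicative update rules for the Frobenius objective of Eq. (1); the random initialization mirrors the description above, while the matrix sizes and iteration count are illustrative assumptions:

```python
import numpy as np

# NMF sketch via multiplicative updates, minimizing
# J = 1/2 * ||D - U V^T||^2 subject to U, V >= 0 (toy sizes).

def nmf(D, m, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    n, N = D.shape
    U = rng.random((n, m)) + 0.1          # random non-negative initialization
    V = rng.random((N, m)) + 0.1
    eps = 1e-9                            # avoids division by zero
    for _ in range(iters):
        U *= (D @ V) / (U @ (V.T @ V) + eps)
        V *= (D.T @ U) / (V @ (U.T @ U) + eps)
    return U, V

# toy non-negative term-document matrix (6 terms, 8 documents)
D = np.random.default_rng(1).random((6, 8))
U, V = nmf(D, m=2)
```

Running the same call with a different `seed` illustrates the run-to-run variability discussed above: different random starts can land in different local minima.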
Xu et al. [34] proposed a concept factorization (CF) based document clustering technique, which models each cluster as a
linear combination of the documents, and each document as a linear combination of the cluster centers. The document clus-
tering is then accomplished by computing the two sets of linear coefficients, which is carried out by finding the non-negative
solution that minimizes the reconstruction error of the documents. The major advantage of CF over NMF is that it can be
applied to data containing negative values and the method can be implemented in the kernel space. The method has to select
k concepts (cluster centers) initially and it is very difficult to predict a value of k in practice. Dasgupta et al. [7] proposed a
simple active clustering algorithm which is capable of producing multiple clusterings of the same data according to user
interest. The advantage of this algorithm is that the user feedback required by this algorithm is minimal compared to the
other existing feedback-oriented clustering techniques, but the algorithm may suffer from human feedback, if the topics
are sensitive or when the perception varies. Carpineto et al. have done a good survey on search results clustering techniques.
They have elaborately explained and discussed various issues related to web clustering engines [5]. Wang et al. [32] proposed
an efficient soft-constraint algorithm that obtains a clustering result in which as many constraints as possible are
respected. The algorithm is basically an optimization problem and it starts by randomly assuming some
initial cluster centroids. The method can produce insignificant clusters if the initial centroids are not properly selected. Zhu
et al. [35] proposed a semi-supervised Non-negative Matrix Factorization method based on the pairwise constraints – must-
link and cannot-link. In this method must-link constraints are used to control the distance of the data in the compressed
form, and cannot-link constraints are used to control the encoding factor to obtain a very good performance. The method
has shown very good performance on some real life text corpora. The algorithm is a new variety of the NMF method, which again
relies on random initialization and may produce different clusters over several runs on a corpus where the sizes of the
categories vary highly from each other.
3. Vector space model for document representation
The number of documents in the corpus throughout this article is denoted by N. The number of terms in the corpus is
denoted by n. The ith term is represented by t_i. The number of times the term t_i occurs in the jth document is denoted by
tf_ij, i = 1, 2, ..., n, j = 1, 2, ..., N. Document frequency df_i is the number of documents in which the term t_i occurs. Inverse
document frequency, idf_i = log(N / df_i), determines how rarely a term occurs across the document collection. The weight of
the ith term in the jth document, denoted by w_ij, is determined by combining the term frequency with the inverse document
frequency [29] as follows:

w_ij = tf_ij × idf_i = tf_ij × log(N / df_i), ∀i = 1, 2, ..., n and ∀j = 1, 2, ..., N

The documents are represented using the vector space model in most of the clustering algorithms [29]. In this model each
document d_j is considered to be a vector, where the ith component of the vector is w_ij, i.e., d_j = (w_1j, w_2j, ..., w_nj).
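The weighting scheme above can be sketched directly; the three-document toy corpus is illustrative:

```python
import math

# tf-idf weighting as defined above: w_ij = tf_ij * log(N / df_i).
# Toy corpus; each document is a list of terms.
docs = [["data", "mining", "text"],
        ["data", "clustering"],
        ["text", "clustering", "text"]]

N = len(docs)                                   # number of documents
vocab = sorted({t for d in docs for t in d})    # term list
df = {t: sum(t in d for d in docs) for t in vocab}  # document frequencies

def weight(term, doc):
    """w_ij = tf_ij * log(N / df_i)."""
    return doc.count(term) * math.log(N / df[term])

vectors = [[weight(t, d) for t in vocab] for d in docs]
```

A term that occurs in every document gets weight 0 (log of 1), while a term confined to one document gets the largest idf boost, which is the intended behavior of the scheme.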
The key factor in the success of any clustering algorithm is the selection of a good similarity measure. The similarity
between two documents is measured through some similarity or distance function. Given two document vectors d_i and d_j, it is required
to find the degree of similarity (or dissimilarity) between them. Various similarity measures are available in the literature,
but the commonly used measure is the cosine similarity between two document vectors [30], which is given by

cos(d_i, d_j) = (d_i · d_j) / (||d_i|| ||d_j||) = Σ_{k=1}^{n} (w_ik × w_jk) / √(Σ_{k=1}^{n} w_ik² × Σ_{k=1}^{n} w_jk²), ∀i, j   (2)
The weight of each term in a document is non-negative. As a result the cosine similarity is non-negative and bounded
between 0 and 1. cos(d_i, d_j) = 1 means the documents are exactly similar, and the similarity decreases as the value decreases
to 0. An important property of the cosine similarity is its independence of document length. Thus cosine similarity has
become popular as a similarity measure in the vector space model [14]. Let D = {d_1, d_2, ..., d_r} be a set of r documents,
where each document has n terms. The centroid of D, D_cn, can be calculated as D_cn = (1/r) Σ_{j=1}^{r} d_j, where d_j is the
corresponding vector of document d_j.
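Eq. (2) and the centroid definition can be sketched as follows, over plain weight vectors with toy values:

```python
import math

def cosine(a, b):
    """Cosine similarity of Eq. (2) over plain weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def centroid(vectors):
    """Componentwise mean of r document vectors."""
    r = len(vectors)
    return [sum(col) / r for col in zip(*vectors)]

d1 = [1.0, 2.0, 0.0]
d2 = [2.0, 4.0, 0.0]

print(round(cosine(d1, d2), 6))   # 1.0, parallel vectors
print(centroid([d1, d2]))         # [1.5, 3.0, 0.0]
```

Note that d2 is twice d1 yet the cosine is still 1: this is the length independence mentioned above.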
4. Proposed clustering technique for effective grouping of documents
A combination of hierarchical clustering and k-means clustering methods has been introduced based on a similarity
assessment technique to effectively group the documents. The existing document clustering algorithms so far discussed
determine the (content) similarity of a pair of documents for putting them into the same cluster. Generally the content similarity
is determined by the cosine of the angle between two document vectors. The cosine similarity actually checks the number of common terms present in the documents. If two documents contain many common terms then the documents
are very likely to be similar, but the difficulty is that there is no clear explanation as to how many common terms can identify
two documents as similar. The text data sets are high dimensional and most of the terms do not occur in each document.
Hence the issue is to find the content similarity in such a way that it can restrict the low similarity values. The
actual content similarity between two documents may not be found properly by checking the individual terms of the documents.
Intuitively, if two documents are content wise similar then they should have similar types of relations with most of the
other documents i.e., if two documents x and y have similar content and if x is similar to any other document z then y must
be similar or somehow related to z. This important characteristic is not captured by the cosine similarity measure.
4.1. A similarity assessment technique
A similarity measure, Extensive Similarity (ES) is used to find the similarity between two documents in the proposed work.
The similarity measure extensively checks all the documents in the corpus to determine the similarity. The extensive similar-
ity between two documents is determined depending on their distances with every other document in the corpus. Intuitively,
two documents are exactly similar, if they have sufficient content similarity and they have almost same distance with every
other document in the corpus (i.e., both are either similar or dissimilar to all the other documents) [18]. The content similarity
is defined as a binary valued distance function. The distance between two documents is minimum i.e., 0 when they have suf-
ficient content similarity, otherwise the distance is 1 i.e., they have very low content similarity. The distance between two
documents d_i and d_j, ∀i, j, is determined by putting a threshold θ ∈ (0, 1) on their content similarity as follows:

dis(d_i, d_j) = 1 if ρ(d_i, d_j) ≤ θ, and 0 otherwise   (3)
where ρ is the similarity measure to find the content similarity between d_i and d_j. Here θ is a threshold value on the content
similarity and it is used to restrict the low similarity values. A data dependent method for estimating the value of θ is discussed
later. In the context of document clustering ρ is considered as cosine similarity, i.e., ρ(d_i, d_j) = cos(d_i, d_j), where d_i and
d_j denote the corresponding vectors of the documents. If dis(d_i, d_j) = 1, i.e., cos(d_i, d_j) ≤ θ, then we can strictly say that the documents
are dissimilar. On the other hand, if the distance is 0, i.e., cos(d_i, d_j) > θ, then they have sufficient content similarity
and the documents are somehow related to each other. Let us assume that d_i and d_j have cosine similarity 0.52, d_j and d_0
(another document) have cosine similarity 0.44, and θ = 0.1. Hence both dis(d_i, d_j) = 0 and dis(d_j, d_0) = 0, and the task is to
distinguish these two distances of the same value.
The extensive similarity is thus designed to find the grade of similarity of a pair of documents which are similar content
wise [18]. If dis(d_i, d_j) = 0 then extensive similarity finds the individual content similarities of d_i and d_j with every other
document, and assigns a score (μ) to denote the extensive similarity between the documents as below:

μ_ij = Σ_{k=1}^{N} |dis(d_i, d_k) − dis(d_j, d_k)|

Thus the extensive similarity between documents d_i and d_j, ∀i, j, is defined as

ES(d_i, d_j) = N − μ_ij if dis(d_i, d_j) = 0, and −1 otherwise   (4)
Two documents d_i, d_j have the maximum extensive similarity N if the distance between them is zero, and the distance between d_i
and d_k is the same as the distance between d_j and d_k for every k. In general, if the above said distances differ for μ_ij documents then the
extensive similarity is N − μ_ij. Unlike other similarity measures, ES takes into account the distances of the said two documents
d_i, d_j with respect to all the other documents in the corpus when measuring the distance between them [18]. μ_ij indicates
the number of documents with which the similarity of d_i is not the same as the similarity of d_j. As the μ_ij value increases,
the similarity between the documents d_i and d_j decreases. If μ_ij = 0 then d_i and d_j are exactly similar. Actually μ_ij denotes a
grade of dissimilarity and it indicates that d_i and d_j have different distances with μ_ij number of documents. The extensive
similarity is used to define the distance between two clusters in the first stage of the proposed document clustering method.
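Eqs. (3) and (4) can be sketched over a precomputed cosine similarity matrix; the matrix and θ below are toy values:

```python
# Extensive similarity sketch over a toy cosine-similarity matrix.

def dis(sim, i, j, theta):
    """Binary distance of Eq. (3): 1 when content similarity is at most theta."""
    return 1 if sim[i][j] <= theta else 0

def extensive_similarity(sim, i, j, theta):
    """Extensive similarity of Eq. (4): N - mu for similar pairs, -1 otherwise."""
    n = len(sim)
    if dis(sim, i, j, theta) == 1:
        return -1                          # dissimilar pair
    mu = sum(abs(dis(sim, i, k, theta) - dis(sim, j, k, theta))
             for k in range(n))
    return n - mu

sim = [[1.00, 0.52, 0.44, 0.05],
       [0.52, 1.00, 0.60, 0.07],
       [0.44, 0.60, 1.00, 0.06],
       [0.05, 0.07, 0.06, 1.00]]
theta = 0.1

print(extensive_similarity(sim, 0, 1, theta))  # 4: same relation to every doc
print(extensive_similarity(sim, 0, 3, theta))  # -1: low content similarity
```

Documents 0 and 1 reach the maximum score N = 4 because they are similar to each other and agree, under dis, about every other document in the toy corpus.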
A distance function is proposed to create the baseline clusters. It finds the distance between two clusters, say C_x and C_y.
Let T_xy be a multi-set consisting of the extensive similarities between each pair of documents, one from C_x and the other from
C_y, and it is defined as

T_xy = {ES(d_i, d_j) : ES(d_i, d_j) ≥ 0, ∀d_i ∈ C_x and d_j ∈ C_y}

Note that T_xy consists of all the occurrences of the same extensive similarity values (if any) for different pairs of documents.
The proposed distance between two clusters C_x and C_y can be defined as

dist_cluster(C_x, C_y) = ∞ if T_xy = ∅, and N − avg(T_xy) otherwise   (5)
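Eq. (5) can be sketched on top of the extensive similarity; the toy similarity matrix and θ are illustrative, and the block is self-contained:

```python
import math

# Cluster distance of Eq. (5): N minus the average of the non-negative
# cross-cluster extensive similarities, infinite when no pair is similar.

def dis(sim, i, j, theta):
    return 1 if sim[i][j] <= theta else 0          # Eq. (3)

def es(sim, i, j, theta):
    n = len(sim)
    if dis(sim, i, j, theta) == 1:
        return -1
    return n - sum(abs(dis(sim, i, k, theta) - dis(sim, j, k, theta))
                   for k in range(n))              # Eq. (4)

def dist_cluster(sim, cx, cy, theta):
    """Eq. (5): N - avg of non-negative cross-cluster ES values."""
    n = len(sim)
    T = [v for v in (es(sim, i, j, theta) for i in cx for j in cy) if v >= 0]
    if not T:                                      # no similar pair at all
        return math.inf
    return n - sum(T) / len(T)

sim = [[1.00, 0.52, 0.44, 0.05],
       [0.52, 1.00, 0.60, 0.07],
       [0.44, 0.60, 1.00, 0.06],
       [0.05, 0.07, 0.06, 1.00]]

print(dist_cluster(sim, [0, 1], [2], theta=0.1))   # 0.0: very similar clusters
print(dist_cluster(sim, [0], [3], theta=0.1))      # inf: segregated clusters
```

The infinite case is the one the text highlights: such cluster pairs are never merged by the proposed algorithm.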
The function dist_cluster finds the distance between two clusters C_x and C_y using the average of the multi-set of non-negative ES
values. The distance between C_x and C_y is infinite if there are no two documents that have a non-negative ES value, i.e., no
similar documents are present in C_x and C_y. Intuitively, an infinite distance between clusters denotes that every pair of documents,
one from C_x and the other from C_y, either shares very few terms, or no term is common between them, i.e.,
they have a very low content similarity. Later we shall observe that any two clusters with infinite distance between them
remain segregated from each other. Thus, a significant characteristic of the function dist_cluster is that it would never merge
two clusters with infinite distance between them.

The proposed document clustering algorithm initially assumes each document as a singleton cluster. Then it merges those clusters with minimum distance, provided the distance is within a previously fixed limit α. The process of merging continues until
the distance between every pair of remaining clusters exceeds α. The clusters which are not singletons are named Baseline
Clusters (BC). The selection of the value of α is discussed in Section 6.2 of this article.
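The baseline-cluster construction just described can be sketched as follows; `toy_dist` is a stand-in for the dist_cluster function of Eq. (5), used here only to keep the example small:

```python
import math

def baseline_clusters(n_docs, cluster_distance, alpha):
    """Merge the closest pair of clusters while its distance is within
    alpha; the non-singleton clusters that remain are the baseline clusters."""
    clusters = [[i] for i in range(n_docs)]      # every document alone
    while True:
        best, pair = math.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b])
                if d < best:
                    best, pair = d, (a, b)
        if pair is None or best > alpha:         # no pair within alpha left
            break
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return [c for c in clusters if len(c) > 1]

# toy stand-in for dist_cluster of Eq. (5): documents live on a line and
# the cluster distance is the gap between their closest members
positions = [0, 1, 50, 100, 101]

def toy_dist(cx, cy):
    return min(abs(positions[i] - positions[j]) for i in cx for j in cy)

print(baseline_clusters(5, toy_dist, alpha=5))   # [[0, 1], [3, 4]]
```

Document 2 is left as a singleton, illustrating the situation handled by the subsequent k-means stage of the proposed method.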
4.2. Properties of dist_cluster

The important properties of the function dist_cluster are described below.
- The minimum distance between any two clusters C_x and C_y is 0, when avg(T_xy) = N, i.e., the extensive similarity value
between every pair of documents, one from C_x and the other from C_y, is N. In practice this minimum value is rarely
observed between two different document clusters. The maximum value of dist_cluster is infinite.
- If C_x = C_y then dist_cluster(C_x, C_y) = N − avg(T_xx) = 0.
- dist_cluster(C_x, C_y) = 0 ⇒ avg(T_xy) = N ⇒ ES(d_i, d_j) = N, ∀d_i ∈ C_x and ∀d_j ∈ C_y.
Now ES(d_i, d_j) = N implies that the two documents d_i and d_j are exactly similar. Note that ES(d_i, d_j) = N ⇒ dis(d_i, d_j) = 0
and μ_ij = 0. Here dis(d_i, d_j) = 0 implies that d_i and d_j are similar in terms of content, but they are not necessarily the same,
i.e., we cannot say d_i = d_j if dis(d_i, d_j) = 0.
Thus dist_cluster(C_x, C_y) = 0 does not imply C_x = C_y, and hence dist_cluster is not a metric.
- It is symmetric: for every pair of clusters C_x and C_y, dist_cluster(C_x, C_y) = dist_cluster(C_y, C_x).
- dist_cluster(C_x, C_y) ≥ 0 for any pair of clusters C_x and C_y.
- For any three clusters C_x, C_y and C_0, we may have

dist_cluster(C_x, C_y) + dist_cluster(C_y, C_0) − dist_cluster(C_x, C_0) < 0

when 0 ≤ dist_cluster(C_x, C_y) < N, 0 ≤ dist_cluster(C_y, C_0) < N and dist_cluster(C_x, C_0) = ∞. Thus it does not satisfy the
triangle inequality.
4.3. A method for estimation of θ
There are several types of document collections available in real life. The similarities or dissimilarities between documents present in one corpus may not be the same as those of other corpora, since the characteristics of the corpora are different [18]. Additionally, one may view the clusters present in a corpus (or in different corpora) under different scales, and different scales produce different partitions. Similarities corresponding to one scale in one corpus may not be the same as the similarities corresponding to the same scale in a different corpus. This has been the reason for making the threshold on similarities data dependent [18]. In fact, we feel that a fixed threshold on similarities will not give satisfactory results on several data sets.
There are several methods available in the literature for finding a threshold for a two-class (one class corresponds to similar points, and the other corresponds to dissimilar points) classification problem. A popular method for such classification is histogram thresholding [12].
Let, for a given corpus, the number of distinct similarity values be p, and let the similarity values be s_0, s_1, ..., s_{p−1}. Without loss of generality, let us assume that (a) s_i < s_j if i < j, and (b) (s_{i+1} − s_i) = (s_1 − s_0) for all i = 1, 2, ..., (p − 2). Let g(s_i) denote the number of occurrences of s_i, for all i = 0, 1, ..., (p − 1). Our aim is to find a threshold θ on the similarity values so that a similarity value s < θ implies the corresponding documents are practically dissimilar, and otherwise they are similar. The aim is to make the choice of the threshold data dependent. The basic steps of the histogram thresholding technique are as follows:
Obtain the histogram corresponding to the given problem.
Reduce the ambiguity in the histogram. Usually this step is carried out using a window. One of the earliest such techniques is the moving average technique in time series analysis [2], which is used to reduce the local variations in a histogram. The window is convolved with the histogram, resulting in a less ambiguous histogram. We have used the weighted moving averages of the g(s_i) values with a window of length 5,

f(s_i) = [ g(s_i) / Σ_{j=0}^{p−1} g(s_j) ] × [ g(s_{i−2}) + g(s_{i−1}) + g(s_i) + g(s_{i+1}) + g(s_{i+2}) ] / 5,  for all i = 2, 3, ..., p − 3   (6)
154 T. Basu, C.A. Murthy/ Information Sciences 311 (2015) 149–162
Find the valley points in the modified histogram. A point s_i corresponding to the weight function f(s_i) is said to be a valley point if f(s_{i−1}) > f(s_i) and f(s_i) < f(s_{i+1}).
The first valley point of the modified histogram is taken as the required threshold on the similarity values.
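A minimal sketch of the smoothing step of Eq. (6) is given below; the helper name `smooth_histogram` is ours, and `g` is the list of bin counts g(s_0), ..., g(s_{p−1}):

```python
def smooth_histogram(g):
    """Weighted moving average of Eq. (6) with window length 5.

    Since f(s_i) is defined only for i = 2, ..., p - 3, the result is
    a dict mapping those indices to f(s_i).
    """
    total = sum(g)
    return {
        i: (g[i] / total) * (sum(g[i - 2:i + 3]) / 5)
        for i in range(2, len(g) - 2)
    }
```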
In the modified histogram corresponding to f, there can be three possibilities regarding the valley points, which are stated below.
(i) There is no valley point in the histogram. If there is no valley point then the histogram is either a constant function, or an increasing or decreasing function of the similarity values. These three types of histograms impose strong conditions on the similarity values which are unnatural for a document collection. Another possible histogram without a valley point is a unimodal histogram. There is a single mode in the histogram, and the number of occurrences of a similarity value increases as the similarity values increase towards the mode, and decreases as the similarity values move away from the mode. This is also an unnatural setup, since there is no reason for such a strong property to be satisfied by a histogram of similarity values.
(ii) Another option is that there exists exactly one valley point in the histogram. The number of occurrences of the valley point is smaller than the number of occurrences of the other similarity values in a neighborhood of the valley point. In practice this type of example is also rare.
(iii) The third and most usual possibility is that there is more than one valley point, i.e., there exist several variations in the number of occurrences of the similarity values. Here the task is to find a threshold from a particular valley. In the proposed technique the threshold is selected from the first valley point. The threshold could instead be selected from the second, third or a higher valley, but then some really similar documents, whose similarity values lie between the first valley point and the higher one, would be treated as dissimilar. In practice, text data sets are sparse and high dimensional, so high similarities between documents are observed in very few cases. It is true that, for a high θ value, the extensive similarity between every two documents in a cluster will be high, but the number of documents in each cluster will be too small due to the sparsity of the data. Hence θ is selected from the first valley point, as the similarity values in the other valleys are higher than the similarity values in the first valley.
Generally, similarity values do not satisfy the property that (s_{i+1} − s_i) = (s_1 − s_0) for all i = 1, 2, ..., (p − 2). In reality there are (p + 1) distinct class intervals of similarity values, where the ith class interval is [v_{i−1}, v_i), a semi-closed interval, for i = 1, 2, ..., p. The (p + 1)th class interval corresponds to the set where each similarity value is greater than or equal to v_p. g(s_i) corresponds to the number of similarity values falling in the ith class interval. The v_i's are taken in such a way that (v_i − v_{i−1}) = (v_1 − v_0) for all i = 2, 3, ..., p. Note that v_0 = 0 and the value of v_p is decided on the basis of the observations. The last interval, i.e., the (p + 1)th interval, is not considered for the valley point selection, since we assume that if any similarity value is greater than or equal to v_p then the corresponding documents are actually similar. Under this setup, we have taken s_i = (v_i + v_{i+1})/2 for all i = 0, 1, ..., (p − 1). Note that the defined s_i's satisfy the properties (a) s_i < s_j if i < j, and (b) (s_{i+1} − s_i) = (s_1 − s_0) for all i = 1, 2, ..., (p − 2). The proposed method finds the valley point and its corresponding class interval. The minimum value of that particular class interval is taken as the threshold.
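Putting the pieces together, a hypothetical helper `estimate_theta` illustrating the class-interval construction, smoothing and first-valley selection might look like the sketch below. The bin `width` and upper bound `v_max` are illustrative parameters, not values prescribed by the paper:

```python
def estimate_theta(similarities, width=0.001, v_max=0.08):
    """Sketch of the theta estimation procedure.

    Bins the similarity values into p = v_max / width class intervals
    [v_{i-1}, v_i) of equal width (values >= v_max fall into the last,
    ignored interval), smooths the counts with the length-5 weighted
    moving average of Eq. (6), and returns the minimum value of the
    class interval of the first valley point (None if no valley exists).
    """
    p = round(v_max / width)
    g = [0] * (p + 1)
    for s in similarities:
        g[min(int(s / width), p)] += 1
    total = sum(g[:p]) or 1
    f = {i: (g[i] / total) * (sum(g[i - 2:i + 3]) / 5)
         for i in range(2, p - 2)}
    for i in range(3, p - 3):
        if f[i - 1] > f[i] < f[i + 1]:   # first valley point
            return i * width             # lower bound of its class interval
    return None
```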
Example. Let us consider an example of histogram thresholding for the selection of θ for a corpus. The similarity values and the values of g and f are shown in Table 1. Initially we have divided the similarity values into class intervals of length 0.001. Let us assume that there are 80 such intervals of equal length and that s_i represents the middle point of the ith class interval for i = 0, 1, ..., 79. The values of the g(s_i)'s and the corresponding f(s_i)'s are then found. Note that the moving averages have been used to remove the ambiguities in the g(s_i) values. Valleys in the similarity values corresponding to the 76 f(s_i)'s are then found. Let s_40, which is equal to 0.0405, be the first valley point, i.e., f(s_39) > f(s_40) and f(s_40) < f(s_41). The minimum similarity value of the class interval [0.040, 0.041), i.e., 0.040, is taken as the threshold θ.
Table 1
An example of θ estimation by the histogram thresholding technique.

Class intervals (v_i's)    s_i's     No. of elements of the intervals    Moving averages
[0.000–0.001)              0.0005    g(s_0)                              –
[0.001–0.002)              0.0015    g(s_1)                              –
[0.002–0.003)              0.0025    g(s_2)                              f(s_2)
...                        ...       ...                                 ...
[0.040–0.041)              0.0405    g(s_40)                             f(s_40)
...                        ...       ...                                 ...
[0.077–0.078)              0.0775    g(s_77)                             f(s_77)
[0.078–0.079)              0.0785    g(s_78)                             –
[0.079–0.080)              0.0795    g(s_79)                             –
≥ 0.080                    –         g(s_80)                             –
4.4. Procedure of the proposed document clustering technique
The proposed document clustering technique is described in Algorithm 1. Initially each document is taken as a cluster. Therefore Algorithm 1 starts with N individual clusters. In the first stage of Algorithm 1, a distance matrix is developed whose (i, j)th entry is the dist_cluster(C_i, C_j) value, where C_i and C_j are the ith and jth clusters respectively. It is a square matrix with N rows and N columns for the N documents in the corpus. Each row or column of the distance matrix is treated as a cluster. Then the Baseline Clusters (BC) are generated by merging the clusters whose distance is within a fixed threshold α. The value of α is constant throughout Algorithm 1. The merging in step 3 of Algorithm 1 merges two rows, say i and j, and the corresponding columns of the distance matrix by following a convention regarding numbering: the two rows are merged into one, the resultant row is numbered as the minimum of i and j, and the other row is removed. The same numbering convention is followed for the columns. Then the index structure of the distance matrix is updated accordingly.
Algorithm 1. Iterative document clustering by baseline clusters

Input: (a) A set of clusters C = {C_1, C_2, ..., C_N}, where N is the number of documents, and C_i = {d_i}, i = 1, 2, ..., N, where d_i is the ith document of the corpus.
(b) A distance matrix DM[i][j] = dist_cluster(C_i, C_j), for all i, j ∈ {1, 2, ..., N}.
(c) α, the desired threshold on dist_cluster, and iter, the maximum number of iterations.
Steps of the algorithm:
1: for each pair of clusters C_i, C_j ∈ C where C_i ≠ C_j and N > 1 do
2:   if dist_cluster(C_i, C_j) ≤ α then
3:     DM ← merge(DM, i, j)
4:     C_i ← C_i ∪ C_j
5:     N ← N − 1
6:   end if
7: end for
8: nbc ← 0; BC ← ∅  // Baseline clusters are initialized to the empty set
9: nsc ← 0; SC ← ∅  // Singleton clusters are initialized to the empty set
10: for i = 1 to N do
11:   if |C_i| > 1 then
12:     nbc ← nbc + 1  // No. of baseline clusters
13:     BC_nbc ← C_i  // Baseline clusters
14:   else
15:     nsc ← nsc + 1  // No. of singleton clusters
16:     SC_nsc ← C_i  // Singleton clusters
17:   end if
18: end for
19: if nsc = 0 or nbc = 0 then
20:   return BC  // If no singleton cluster exists or no baseline cluster is generated
21: else
22:   EBC_k ← BC_k, for all k = 1, 2, ..., nbc  // Initialization of extended baseline clusters
23:   ebct_k ← centroid of BC_k, for all k = 1, 2, ..., nbc  // Extended base centroids
24:   nct_k ← (0), for all k = 1, 2, ..., nbc; it ← 0
25:   while ebct_k ≠ nct_k for some k = 1, 2, ..., nbc and it ≤ iter do
26:     ebct_k ← centroid of EBC_k, for all k = 1, 2, ..., nbc
27:     NCL_k ← BC_k, for all k = 1, 2, ..., nbc  // New set of clusters at each iteration
28:     for j = 1 to nsc do
29:       if ebct_k is the nearest centroid of SC_j among ebct_1, ..., ebct_nbc then
30:         NCL_k ← NCL_k ∪ SC_j  // Merger of singleton clusters into baseline clusters
31:       end if
32:     end for
33:     nct_k ← centroid of NCL_k, for all k = 1, 2, ..., nbc
34:     EBC_k ← NCL_k, for all k = 1, 2, ..., nbc
35:     it ← it + 1
36:   end while
37:   return EBC
38: end if
Output: A set of extended baseline clusters EBC = {EBC_1, EBC_2, ..., EBC_nbc}
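A compact Python sketch of both stages of Algorithm 1 is given below. It is an illustration under simplifying assumptions — documents as sparse term-weight dicts, a caller-supplied `dist` implementing dist_cluster on lists of document ids, and cosine similarity for the assignment step — not the authors' code:

```python
import math

def centroid(member_ids, docs):
    """Mean term-weight vector of a cluster."""
    c = {}
    for i in member_ids:
        for t, w in docs[i].items():
            c[t] = c.get(t, 0.0) + w
    return {t: w / len(member_ids) for t, w in c.items()}

def cosine(u, v):
    """Cosine similarity of two sparse term-weight vectors."""
    num = sum(w * v.get(t, 0.0) for t, w in u.items())
    den = math.sqrt(sum(w * w for w in u.values())) * \
          math.sqrt(sum(w * w for w in v.values()))
    return num / den if den else 0.0

def cluster_documents(docs, dist, alpha, max_iter=100):
    """Two-stage sketch of Algorithm 1."""
    # Stage 1: repeatedly merge the closest pair while it is within alpha.
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > 1:
        d, ai, bi = min((dist(a, b), ai, bi)
                        for ai, a in enumerate(clusters)
                        for bi, b in enumerate(clusters) if ai < bi)
        if d > alpha:
            break
        clusters[ai] += clusters.pop(bi)
    baseline = [c for c in clusters if len(c) > 1]
    singles = [c[0] for c in clusters if len(c) == 1]
    if not baseline or not singles:
        return clusters  # degenerate case: nothing to extend
    # Stage 2: k-means-style growth of the baseline clusters; only the
    # singletons are reassigned, but centroids use all member documents.
    ebc = [list(c) for c in baseline]
    for _ in range(max_iter):
        cents = [centroid(c, docs) for c in ebc]
        new = [list(c) for c in baseline]
        for s in singles:
            k = max(range(len(cents)), key=lambda j: cosine(docs[s], cents[j]))
            new[k].append(s)
        if new == ebc:  # assignments stable: converged
            break
        ebc = new
    return ebc
```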
After constructing the baseline clusters, some clusters may remain as singleton clusters. Every such singleton cluster (i.e., a single document) is merged with one of the baseline clusters using the k-means algorithm in the second stage. In the second stage the centroids of the baseline clusters (i.e., the non-singleton clusters) are calculated, and they are named base centroids. The value of k for the k-means algorithm is taken as the number of baseline clusters. The rest of the documents, which are not included in the baseline clusters, are grouped by the iterative steps of the k-means algorithm using these base centroids as the initial seed points. Note that only those documents which are not included in the baseline clusters are considered for clustering in this stage. But, for the calculation of a cluster centroid, every document in the cluster, including the documents in the baseline clusters, is considered. A document is put into that cluster for which the content similarity between the document and the base centroid is maximum. The newly formed clusters are named Extended Baseline Clusters (EBC).
It may be noted that the processing in the second stage is not needed if no singleton cluster is produced in the first stage. We believe that such a possibility is remote in real life, and none of our experiments yielded such an outcome. However, such a clustering is desirable, as it produces compact clusters.
4.5. Impact of extensive similarity on the document clustering technique
The extensive similarity plays a significant role in constructing the baseline clusters. The documents in the baseline clusters are very similar to each other, as their extensive similarity is very high (above a threshold θ). It may be observed that whenever two baseline clusters are merged in the first stage, the similarity between any two documents in the merged cluster is at least θ. Note that the distance between two different baseline clusters is greater than or equal to α, and the distance between a baseline cluster and a singleton cluster (or between two singleton clusters) may be infinite, in which case they would never merge to construct a new baseline cluster. Infinite distance between two clusters indicates that the extensive similarity between every document of the baseline cluster and the document of the singleton cluster (or between the documents of two different singleton clusters) is −1. Thus the baseline clusters intuitively determine the categories of the document collection by measuring the extensive similarity between documents.
4.6. Discussion
The proposed clustering method is a combination of the baseline clustering and k-means clustering methods. Initially it creates some baseline clusters. The documents which do not have much similarity with any of the baseline clusters remain as singleton clusters. Therefore the k-means method is employed to assign these documents to the corresponding baseline clusters. The k-means algorithm has been used due to its low computational complexity; it is also useful as it can be easily implemented. However, the performance of the k-means algorithm suffers from the selection of the initial seed points, and there is no method for selecting a valid k. It is very difficult to select a proper k for a sparse text data set with high dimensionality. In various other clustering techniques, e.g., spectral clustering and the buckshot algorithm, k-means has been used as an intermediary stage; these algorithms also suffer from the said limitations of the k-means method. Note that the proposed clustering method overcomes these two major limitations of the k-means clustering algorithm and utilizes the effectiveness of the k-means method by introducing the idea of baseline clusters. The effectiveness of the proposed technique in terms of clustering quality may be observed in the experimental results section later.
The proposed technique is designed like the buckshot clustering algorithm. The main difference between buckshot and the proposed one lies in how the hierarchical clusters are designed in the first stage of the two methods. Buckshot uses the traditional single-link clustering technique to develop the hierarchical clusters that provide the initial centroids of the k-means clustering in the second stage. Thus buckshot may suffer from the limitations of both the single-link clustering technique (e.g., the chaining effect [8]) and the k-means clustering technique. In practice, text data sets contain many categories of uneven sizes. In those data sets the initial random selection of √(kn) documents may not be proper, i.e., no documents may be selected from an original cluster if its size is small. Note that no random sampling is required for the proposed clustering technique. In the proposed one the hierarchical clusters are created using the extensive similarity between documents, and these hierarchical baseline clusters are not used any further in the second stage. In the second stage, the k-means algorithm is performed only to group those documents that have not been included in the baseline clusters, and the initial centroids are generated from these baseline clusters. In the buckshot algorithm, all the documents are taken into consideration for clustering by the k-means algorithm; it uses the single-link clustering technique only to create the initial seed points of the k-means algorithm. Later it can be seen from the experiments that the proposed one performs significantly better than the buckshot clustering technique.
The process of creating baseline clusters in the first stage of the proposed technique is quite similar to the group-average hierarchical document clustering technique [30]. Both techniques average the similarities between the documents of two individual clusters to decide whether to merge them into one. The proposed method finds the distance between two clusters using extensive similarity, whereas the group-average hierarchical document clustering technique generally uses cosine similarity. Moreover, the group-average hierarchical clustering technique cannot explicitly distinguish two dissimilar clusters, unlike the proposed method. This is the main difference between the two techniques.
5. Evaluation criteria
If the documents within a cluster are similar to each other and dissimilar to the documents in the other clusters then the
clustering algorithm is considered to perform well. The data sets under consideration have labeled documents. Hence quality
measures based on labeled data are used here for comparison.
Normalized mutual information and f-measure are very popular and are used by a number of researchers [30,31] to measure the quality of a cluster using the information of the actual categories of the document collection. Let us assume that R is the set of categories and S is the set of clusters. Consider that there are I categories in R and J clusters in S. There are a total of N documents in the corpus, i.e., both R and S individually contain N documents. Let n_i be the number of documents belonging to category i, m_j be the number of documents belonging to cluster j, and n_ij be the number of documents belonging to both category i and cluster j, for all i = 1, 2, ..., I and j = 1, 2, ..., J.
Mutual information is a symmetric measure that quantifies the statistical information shared between two distributions, and it provides an indication of the shared information between a set of categories and a set of clusters. Let I(R, S) denote the mutual information between R and S, and let E(R) and E(S) be the entropies of R and S respectively. I(R, S) and E(R) can be defined as

I(R, S) = Σ_{i=1}^{I} Σ_{j=1}^{J} (n_ij / N) log( N·n_ij / (n_i·m_j) ),    E(R) = − Σ_{i=1}^{I} (n_i / N) log(n_i / N)
There is no upper bound for I(R, S), so for easier interpretation and comparison a normalized mutual information that ranges from 0 to 1 is desirable. The normalized mutual information (NMI) is defined by Strehl et al. [31] as follows:

NMI(R, S) = I(R, S) / √( E(R)·E(S) )
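The NMI formula can be computed directly from two label lists via the contingency counts n_ij; the helper below is an illustrative sketch (the function name is ours):

```python
import math

def nmi(labels_true, labels_pred):
    """Normalized mutual information, NMI = I(R, S) / sqrt(E(R) * E(S))."""
    N = len(labels_true)
    cats, clus = set(labels_true), set(labels_pred)
    n = {(i, j): sum(1 for t, p in zip(labels_true, labels_pred)
                     if t == i and p == j) for i in cats for j in clus}
    ni = {i: sum(n[i, j] for j in clus) for i in cats}
    mj = {j: sum(n[i, j] for i in cats) for j in clus}
    mi = sum(n[i, j] / N * math.log(N * n[i, j] / (ni[i] * mj[j]))
             for i in cats for j in clus if n[i, j] > 0)
    er = -sum(ni[i] / N * math.log(ni[i] / N) for i in cats)
    es = -sum(mj[j] / N * math.log(mj[j] / N) for j in clus)
    return mi / math.sqrt(er * es) if er and es else 0.0
```

The ratio is independent of the logarithm base, and a perfect one-to-one match between categories and clusters yields NMI = 1.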
The f-measure determines the recall and precision value of each cluster with respect to a corresponding category. Let, for a query, the set of relevant documents be from category i and the set of retrieved documents be from cluster j. Then recall, precision and f-measure are given as follows:

Recall_ij = n_ij / n_i,    Precision_ij = n_ij / m_j,    for all i, j

F_ij = (2 × Recall_ij × Precision_ij) / (Recall_ij + Precision_ij),    for all i, j

If there is no common instance between a category and a cluster (i.e., n_ij = 0) then we shall assume F_ij = 0. The value of F_ij is maximum when Precision_ij = Recall_ij = 1 for a category i and a cluster j. Thus the value of F_ij lies between 0 and 1. The best f-measure among all the clusters is selected as the f-measure for the query of a particular category, i.e., F_i = max_{j ∈ [1, J]} F_ij for all i. The overall f-measure is the weighted average of the f-measures of the categories, F = Σ_{i=1}^{I} (n_i / N) F_i. We would like to maximize the f-measure and normalized mutual information to achieve good quality clusters.
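Similarly, the overall f-measure above can be sketched as a small helper (the name `f_measure` is ours):

```python
def f_measure(labels_true, labels_pred):
    """Overall f-measure: best F_ij per category i, weighted by n_i / N."""
    N = len(labels_true)
    total = 0.0
    for i in set(labels_true):
        ni = sum(1 for t in labels_true if t == i)
        best = 0.0
        for j in set(labels_pred):
            nij = sum(1 for t, p in zip(labels_true, labels_pred)
                      if t == i and p == j)
            if nij == 0:
                continue  # F_ij is taken as 0
            mj = sum(1 for p in labels_pred if p == j)
            rec, prec = nij / ni, nij / mj
            best = max(best, 2 * rec * prec / (rec + prec))
        total += ni / N * best
    return total
```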
6. Experimental evaluation
6.1. Document collections
Reuters-21578 is a collection of documents that appeared on Reuters newswire in 1987. The documents were originally
assembled and indexed with categories by Carnegie Group, Inc. and Reuters, Ltd. The corpus contains 21,578 documents in
135 categories. Here we considered the ModApte version used in [4], in which there are 30 categories and 8067 documents.
We have divided this corpus into four groups, named rcv1, rcv2, rcv3 and rcv4.
The 20-Newsgroups corpus is a collection of news articles collected from 20 different news sources. Each news source constitutes a different category. In this data set, articles with multiple topics are cross-posted to multiple newsgroups, i.e., there are overlaps between several categories. The data set is named 20ns here.
The rest of the corpora were developed in the Karypis lab [15]. The corpora tr31, tr41 and tr45 are derived from the TREC-5, TREC-6 and TREC-7 collections.1 The categories of tr31, tr41 and tr45 were generated from the relevance judgments provided in these collections. The corpus fbis was collected from the Foreign Broadcast Information Service data of TREC-5. The corpora la1 and la2 were taken from the Los Angeles Times data of TREC-5. The category labels of la1 and la2 were generated according to the names of the newspaper sections where these articles appeared, such as Entertainment, Financial, Foreign, Metro, National and Sports. The documents that have a single label were selected for the la1 and la2 data sets. The corpora oh10 and oh15 were created from the OHSUMED collection, a subset of the MEDLINE database, which contains 233,445 documents indexed using 14,321 unique categories [15]. Different subsets of categories have been taken to construct these data sets.
1 http://trec.nist.gov.
The number of documents, number of terms and number of categories of these corpora can be found in Table 2. For each of the above corpora, the stop words have been removed using the standard English stop word list.2 Then, by applying the standard Porter stemming algorithm [27], the inverted index is developed.
6.2. Experimental setup
Single-Link Hierarchical Clustering (SLHC) [30], Average-Link Hierarchical Clustering (ALHC) [30], the Dynamic k-Nearest Neighbor Algorithm (DKNNA) [16], k-means clustering [13], bisecting k-means clustering [30], buckshot clustering [6], spectral clustering [25] and clustering by Non-negative Matrix Factorization (NMF) [33] are selected for comparison with the proposed clustering technique. The k-means and bisecting k-means algorithms have been executed 10 times to reduce the effect of the random initialization of seed points, and in each execution they have been iterated 100 times to reach a solution (if they did not converge earlier). The buckshot algorithm has also been executed 10 times to reduce the effect of the random selection of the initial √(kN) documents. The f-measure and NMI values of the k-means, bisecting k-means and buckshot clustering techniques shown here are the averages of 10 different results. Note that the proposed method finds the number of clusters automatically from the data sets. The proposed clustering algorithm has been executed first, and then all the other algorithms have been executed to produce the same number of clusters as the proposed one. Tables 3 and 4 show the f-measure and NMI values respectively for all the data sets. The Number of Clusters (NCL) developed by the proposed method is also
shown. The f-measure and NMI are calculated using these NCL values. The value of α is chosen as α = √(√N) for a corpus of N documents. The NMF based clustering algorithm has been executed 10 times to reduce the effect of random initialization, and each time it has been iterated 100 times to reach a solution. The value of k for DKNNA is taken as k = 10. The value of σ for the spectral clustering technique is set by searching over values from 10 to 20 percent of the total range of the similarity values, and the one that gives the tightest clusters is picked, as suggested by Ng et al. [25].
The proposed histogram thresholding based technique for estimating the value of θ has been followed in the experiments. We have considered class intervals of length 0.005 for the similarity values. We have also assumed that a content similarity (here cosine similarity) value greater than 0.5 means that the corresponding documents are similar. Thus, the issue here is to find a θ, 0 < θ < 0.5, such that a similarity value greater than θ denotes that the corresponding documents are similar. In the experiments we have used the method of moving averages with a window length of 5 for convolution. The text data sets are generally sparse, the number of high similarity values is practically very low, and there are fluctuations in the heights of the histogram for two successive similarity values. Hence it is not desirable to take a window length of 3, as the method would then consider the heights of just the previous and the next value for calculating the f(s_i) values. We have tried window lengths of 7 and 9 on some of the corpora in the experiments, but the values of θ remain more or less the same as those selected with a window length of 5. On the other hand, window lengths of 7 and 9 need more calculations than a window length of 5. These are the reasons for the choice of a window of length 5. It has been found that several local peaks and local valleys are removed by this method. The number of valley regions after smoothing the histogram by the method of moving averages is always found to be greater than three.
6.3. Analysis of results
Tables 3 and 4 show the comparison of the proposed document clustering method with the other methods using f-measure and NMI respectively, for all data sets. There are 104 comparisons for the proposed method using f-measure in Table 3. The proposed one performs better than the other methods in 91 cases, and in the remaining 13 cases the other methods (e.g., the buckshot and spectral clustering algorithms) have an edge over the proposed method. A few of the exceptions where the other methods perform better than the proposed one are SLHC and NMF for rcv3 (here the f-measures of SLHC and NMF are respectively
Table 2
Data sets overview.
Data set No. of documents No. of terms No. of categories
20ns 18,000 35,218 20
fbis 2463 2000 17
la1 3204 31,472 6
la2 3075 31,472 6
oh10 1050 3238 10
oh15 913 3100 10
rcv1 2017 12,906 30
rcv2 2017 12,912 30
rcv3 2017 12,820 30
rcv4 2016 13,181 30
tr31 927 10,128 7
tr41 878 7454 10
tr45 690 8261 10
2 http://www.textfixer.com/resources/common-english-words.txt.
0.408 and 0.511, while the f-measure of the proposed method is 0.294). Similarly, Table 4 shows that the proposed method performs better than the other methods using NMI in 98 out of 104 cases.
A statistical significance test has been performed to check whether the differences are significant in the cases, in both Tables 3 and 4, where other clustering algorithms beat the proposed algorithm. The same statistical significance test has been performed for the cases where the proposed algorithm performs better than the other clustering algorithms.
A generalized version of the paired t-test is suitable for testing the equality of means when the variances are unknown. This problem is the classical Behrens–Fisher problem in hypothesis testing, and a suitable test statistic (see footnote 3) is described and tabled in [20] and [28] respectively. It has been found that of the 91 cases in Table 3 where the proposed algorithm performed better than the other algorithms, the differences are statistically significant in 86 cases at the 0.05 level of significance. For all of the remaining 13 cases in Table 3, the differences are statistically significant at the same level of significance. Hence the performance of the proposed method is found to be significantly better than that of the other methods in 86.86% (86/99) of the cases using f-measure. Similarly, in Table 4 the results are significant in 89 of the 98 cases where the proposed method performed better than the other methods, and all the results of the remaining 6 cases are significant. Thus in 93.68% (89/95) of the cases the proposed method performs significantly better than the other methods using NMI. Clearly, these results show the effectiveness of the proposed document clustering technique.
Remark. It is to be noted that the number of clusters produced by the proposed method for each corpus is close to the actual number of categories of that corpus. It may be observed from Tables 3 and 4 that the number of clusters is equal to the actual number of categories for the la2, oh15, rcv2, tr31 and tr41 corpora. The difference between the number of clusters and the actual number of categories is at most 2 for the rest of the corpora. Since the text data sets used here are very sparse and
Table 3
Comparison of various clustering methods using f-measure.
Data sets NCT^a NCL^b F-measure
BKM^c KM BS SLHC ALHC DKNNA SC NMF Proposed
20ns 20 23 0.357 0.449 0.436 0.367 0.385 0.408 0.428 0.445 0.474
fbis 17 19 0.423 0.534 0.516 0.192 0.192 0.288 0.535 0.435 0.584
la1 6 8 0.506 0.531 0.504 0.327 0.325 0.393 0.536 0.544 0.570
la2 6 6 0.484 0.550 0.553 0.330 0.328 0.405 0.541 0.542 0.563
oh10 10 12 0.304 0.465 0.461 0.205 0.206 0.381 0.527 0.481 0.500
oh15 10 10 0.363 0.485 0.482 0.206 0.202 0.366 0.516 0.478 0.532
rcv1 30 31 0.231 0.247 0.307 0.411 0.360 0.431 0.298 0.516 0.553
rcv2 30 30 0.233 0.281 0.324 0.404 0.353 0.438 0.312 0.489 0.517
rcv3 30 32 0.188 0.271 0.351 0.408 0.376 0.436 0.338 0.511 0.294
rcv4 30 32 0.247 0.322 0.289 0.405 0.381 0.440 0.401 0.509 0.289
tr31 7 7 0.558 0.665 0.646 0.388 0.387 0.457 0.589 0.545 0.678
tr41 10 10 0.564 0.607 0.593 0.286 0.280 0.416 0.557 0.537 0.698
tr45 10 11 0.556 0.673 0.681 0.243 0.248 0.444 0.605 0.596 0.750
a NCT stands for number of categories.
b NCL stands for number of clusters.
c BKM, KM, BS, SLHC, ALHC, DKNNA, SC and NMF stand for Bisecting k-Means, k-Means, BuckShot, Single-Link Hierarchical Clustering, Average-Link Hierarchical Clustering, Dynamic k-Nearest Neighbor Algorithm, Spectral Clustering and Non-negative Matrix Factorization respectively.
Table 4
Comparison of various clustering methods using normalized mutual information.
Data sets NCT NCL Normalized mutual information
BKMa KM BS SLHC ALHC DKNNA SC NMF Proposed
20ns 20 23 0.417 0.428 0.437 0.270 0.286 0.325 0.451 0.432 0.433
fbis 17 19 0.443 0.525 0.524 0.051 0.362 0.405 0.520 0.446 0.544
la1 6 8 0.266 0.299 0.295 0.021 0.218 0.241 0.285 0.296 0.308
la2 6 6 0.249 0.312 0.323 0.021 0.215 0.252 0.335 0.360 0.386
oh10 10 12 0.226 0.352 0.333 0.050 0.157 0.239 0.417 0.410 0.406
oh15 10 10 0.213 0.352 0.357 0.067 0.155 0.236 0.358 0.357 0.380
rcv1 30 31 0.302 0.409 0.407 0.0871 0.108 0.213 0.429 0.434 0.495
rcv2 30 30 0.296 0.411 0.399 0.053 0.150 0.218 0.426 0.420 0.465
rcv3 30 32 0.316 0.416 0.408 0.049 0.162 0.215 0.404 0.476 0.448
rcv4 30 32 0.317 0.414 0.416 0.048 0.175 0.220 0.414 0.507 0.452
tr31 7 7 0.478 0.463 0.471 0.065 0.212 0.414 0.436 0.197 0.509
tr41 10 10 0.470 0.550 0.553 0.054 0.237 0.456 0.479 0.506 0.619
tr45 10 11 0.492 0.599 0.591 0.084 0.354 0.512 0.503 0.488 0.694
a All the symbols in this table are the same symbols used in Table 3.
3 The test statistic is of the form t = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2), where x̄1, x̄2 are the means, s1, s2 are the standard deviations and n1, n2 are the numbers of observations of the two samples.
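The footnote's statistic can be computed directly. Below is a small sketch of this two-sample t statistic; the two score samples are invented purely to illustrate the computation and are not results from the paper.

```python
# Sketch of the footnote's test statistic:
# t = (mean1 - mean2) / sqrt(s1^2/n1 + s2^2/n2)
import math

def t_statistic(sample1, sample2):
    n1, n2 = len(sample1), len(sample2)
    m1 = sum(sample1) / n1
    m2 = sum(sample2) / n2
    # Unbiased sample variances
    v1 = sum((x - m1) ** 2 for x in sample1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in sample2) / (n2 - 1)
    return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

proposed = [0.49, 0.51, 0.50, 0.52, 0.48]   # made-up NMI scores
baseline = [0.41, 0.43, 0.40, 0.44, 0.42]   # made-up NMI scores
print(round(t_statistic(proposed, baseline), 2))
```

A large positive t (relative to the appropriate critical value) is what justifies the "significantly better" claims above.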
160 T. Basu, C.A. Murthy/ Information Sciences 311 (2015) 149–162
high dimensional, it may be inferred that the method proposed here for estimating the value of h is able to detect the actual grouping of a corpus.
6.4. Processing time
The similarity matrix requires N × N memory locations, and to store the initial N clusters another N memory locations are needed for the proposed method. Thus the space complexity of the proposed document clustering algorithm is O(N²). O(N²) time is required to build the extensive similarity matrix, and to construct (say) m (≪ N) baseline clusters the proposed method takes O(mN²) time. In the final stage of the proposed technique, the k-means algorithm takes O((N − b)mt) time to merge (say) b singleton clusters into the baseline clusters, where t is the number of iterations of the k-means algorithm. Thus the time complexity of the proposed algorithm is O(N²), as m is very small compared to N.
The processing time of each algorithm used in the experiments has been measured on a quad-core Linux workstation. The times (in seconds) taken by the different clustering algorithms to cluster each text data set are reported in Table 5. The time shown for the proposed algorithm is the sum of the times taken to estimate the value of h, to build the baseline clusters, and to perform the k-means clustering algorithm that merges the remaining singleton clusters into the baseline clusters. The times shown for the bisecting k-means, buckshot, k-means and NMF clustering techniques are averages of the processing times over 10 runs. It is to be mentioned that the codes for all the algorithms were written in C++ and the data structures for all the algorithms were developed by the authors. Hence the processing times could be reduced by incorporating more efficient data structures, for the proposed algorithm as well as for the other algorithms. Note that the processing time of the proposed algorithm is less than that of KM, SLHC, ALHC and DKNNA for every data set. The execution time of BKM is less than that of the proposed method for 20ns, the execution time of SC is less for tr41, and the execution time of NMF is less for tr31; for each of the other data sets the execution time of the proposed algorithm is less than those of BKM, SC and NMF. The processing time of the proposed algorithm is comparable with that of the buckshot algorithm (although on most of the data sets the proposed algorithm is faster). The dimensionality of the data sets used in the experiments varies from 2000 (fbis) to 35,218 (20ns). Hence the proposed clustering algorithm may be useful, in terms of processing time, for real-life high-dimensional data sets.
7. Conclusions
A hybrid document clustering algorithm has been introduced by combining a new hierarchical clustering technique with the traditional k-means technique. The baseline clusters produced by the new hierarchical technique are clusters whose documents possess high similarity among themselves. The extensive similarity between documents ensures this quality of the baseline clusters; it is developed on the basis of the similarity between two documents and their distances to every other document in the document collection. Thus documents with high extensive similarity are grouped in the same cluster. Most of the singleton clusters are simply the documents that have low content similarity with every other document. In practice the number of such singleton clusters is substantial, and they cannot be ignored as outliers. Therefore the k-means algorithm is performed iteratively to assign each of these singleton clusters to one of the baseline clusters. Thus the proposed method reduces the error of the k-means algorithm due to random seed selection. Moreover, the method is not as expensive as the hierarchical clustering algorithms, as can be observed from Table 5. A significant characteristic of the proposed clustering technique is that the algorithm automatically decides the number of clusters in the data. The automatic detection of the number of clusters in such sparse, high-dimensional text data is very important.
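The two-stage idea summarized above can be sketched as follows. This is a strong simplification of the paper's method: it uses plain cosine similarity with a greedy leader pass in place of the extensive-similarity hierarchical stage, and a single assignment pass in place of full k-means iterations; the function names and the threshold parameter `theta` are illustrative only.

```python
# Illustrative two-stage sketch: (1) form "baseline" clusters from
# documents whose similarity to a cluster leader exceeds a threshold,
# (2) assign the leftover singletons to the most similar baseline
# centroid, in the spirit of k-means.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def hybrid_cluster(docs, theta):
    # Stage 1: greedy baseline clusters -- a document joins the first
    # cluster whose leader it is sufficiently similar to.
    groups = []
    for i, d in enumerate(docs):
        for g in groups:
            if cosine(d, docs[g[0]]) >= theta:
                g.append(i)
                break
        else:
            groups.append([i])
    clusters = [g for g in groups if len(g) > 1]
    singletons = [g[0] for g in groups if len(g) == 1]
    # Stage 2: assign each leftover singleton to the most similar
    # baseline-cluster centroid (one k-means-style assignment pass).
    centroids = [[sum(docs[i][k] for i in c) / len(c)
                  for k in range(len(docs[0]))] for c in clusters]
    for s in singletons:
        best = max(range(len(centroids)),
                   key=lambda j: cosine(docs[s], centroids[j]))
        clusters[best].append(s)
    return clusters

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.5, 0.6]]
print(hybrid_cluster(docs, theta=0.95))  # → [[0, 1], [2, 3, 4]]
```

On these toy vectors the two near-duplicate pairs form baseline clusters, and the ambiguous fifth document, left as a singleton in stage 1, is pulled into the closer cluster in stage 2.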
The proposed method is able to determine the number of clusters prior to executing the algorithm, by applying a threshold h on the similarity values between documents. An estimation technique is introduced to determine a value of h from a
Table 5
Processing time (in seconds) of different clustering methods.
Methods BKMa KM BS SLHC ALHC DKNNA SC NMF Proposed
20ns 1582.25 1594.54 1578.12 1618.50 1664.31 1601.23 1583.62 1587.36 1595.23
fbis 94.17 91.52 92.46 112.15 129.58 104.32 100.90 93.19 90.05
la1 159.23 153.22 142.36 160.12 179.62 162.25 146.68 153.50 140.31
la2 149.41 144.12 139.34 163.47 182.50 164.33 142.33 144.46 140.29
oh10 18.57 18.32 17.32 26.31 33.51 23.26 24.82 22.48 18.06
oh15 18.12 20.02 18.26 24.15 31.46 23.18 20.94 17.61 16.22
rcv1 89.41 91.15 86.37 87.47 103.37 88.37 94.62 87.80 86.18
rcv2 98.32 106.08 103.16 104.17 120.45 99.27 97.81 93.51 92.53
rcv3 97.11 106.29 98.35 98.47 124.72 98.47 108.24 94.96 93.54
rcv4 100.17 95.28 96.49 109.32 126.30 99.49 113.81 98.70 93.32
tr31 29.41 30.23 30.34 33.15 40.13 32.35 37.98 29.33 29.38
tr41 27.29 28.16 27.46 33.96 39.58 30.52 25.85 27.54 26.49
tr45 25.45 25.01 26.06 31.17 38.23 28.50 29.72 26.51 24.65
a All the symbols in this table are the same symbols used in Table 3.
corpus. The experimental results show the value and validity of the proposed estimation of h. In the experiments the threshold on the distance between two clusters is taken as a = N^(1/4), which indicates that the distance between two different clusters must be greater than a. It is very difficult to fix a lower bound on the distance between two clusters in practice, as the corpora are sparse in nature and have high dimensionality. If we select a > N^(1/4), say a = N^(1/3) or a = N^(1/2), then some really different clusters may be merged into one, which is surely not desirable. On the other hand, if we select a < N^(1/4), say a = N^(1/5), then we may get some very compact clusters, but a large number of small clusters would be created, which is also not expected in practice.
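The trade-off described here can be seen numerically. The snippet below (an illustration only, for a few sample values of N) tabulates the threshold a = N^(1/4) against the looser alternatives N^(1/3) and N^(1/2) and the tighter N^(1/5).

```python
# Cluster-distance threshold a = N**(1/4) versus looser (N**(1/3),
# N**(1/2)) and tighter (N**(1/5)) choices, for a few sizes N.
print("N      N^1/4  N^1/3  N^1/2  N^1/5")
for n in (2000, 10000, 35218):
    print(n, round(n ** 0.25, 1), round(n ** (1 / 3), 1),
          round(n ** 0.5, 1), round(n ** 0.2, 1))
```

The looser thresholds grow much faster with N (risking merged clusters), while N^(1/5) stays small (risking many tiny clusters), which is the behavior the paragraph above describes.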
It may be observed from the experiments that the number of clusters produced by the proposed technique is very close to the actual number of categories for each corpus and that the proposed method outperforms the other methods. Hence we may claim that the selection of a = N^(1/4) is proper, though it has been made heuristically. The proposed hybrid clustering technique addresses some issues of several well-known partitional and hierarchical clustering techniques. Hence it may be useful in many real-life unsupervised applications. Note that, for data sets other than text, any similarity measure can be used instead of cosine similarity to design the extensive similarity. It is to be mentioned that the value of a should be chosen carefully whenever the method is applied to other types of applications. In future we shall apply the proposed method to social network data to find different types of communities or topics. In that case we may have to incorporate ideas from graph theory into the proposed distance function to find relations between different sets of nodes of a social site.
Acknowledgment
The authors would like to thank the reviewers and the editor for their valuable comments and suggestions.