A similarity assessment technique for effective grouping
of documents
Tanmay Basu *, C.A. Murthy
Machine Intelligence Unit, Indian Statistical Institute, Kolkata 700108, India
ARTICLE INFO
Article history:
Received 20 February 2014
Received in revised form 25 December 2014
Accepted 15 March 2015
Available online 21 March 2015
Keywords:
Document clustering
Text mining
Applied data mining
ABSTRACT
Document clustering refers to the task of grouping similar documents and segregating dissimilar
documents. It is very useful to find meaningful categories from a large corpus. In
practice, the task of categorizing a corpus is not easy, since a corpus generally contains a huge
number of documents and the document vectors are high dimensional. This paper introduces a hybrid
document clustering technique by combining a new hierarchical and the traditional
k-means clustering techniques. A distance function is proposed to find the distance
between the hierarchical clusters. Initially the algorithm constructs some clusters by the
hierarchical clustering technique using the new distance function. Then k-means algorithm
is performed by using the centroids of the hierarchical clusters to group the documents
that are not included in the hierarchical clusters. The major advantage of the proposed dis-
tance function is that it is able to find the nature of the corpora by varying a similarity
threshold. Thus the proposed clustering technique does not require the number of clusters
prior to executing the algorithm. In this way the initial random selection of k centroids for
k-means algorithm is not needed for the proposed method. The experimental evaluation
using Reuter, Ohsumed and various TREC data sets shows that the proposed method per-
forms significantly better than several other document clustering techniques. F-measure
and normalized mutual information are used to show that the proposed method is effectively
grouping the text data sets.
© 2015 Elsevier Inc. All rights reserved.
1. Introduction
Clustering algorithms partition a data set into several groups such that the data points in the same group are close to each
other and the points across groups are far from each other [9]. The document clustering algorithms try to identify inherent
grouping of the documents to produce good quality clusters for text data sets. In recent years it has been recognized that
partitional clustering algorithms, e.g., k-means and buckshot, are advantageous due to their low computational complexity. On the other hand, these algorithms need the knowledge of the number of clusters. Generally document corpora are huge in size
with high dimensionality. Hence it is not easy to estimate the number of clusters for any real life document corpus.
Hierarchical clustering techniques do not need the knowledge of number of clusters, but a stopping criterion is needed to
terminate the algorithms. Finding a specific stopping criterion is difficult for large data sets.
The main difficulty of most of the document clustering techniques is to determine the (content) similarity of a pair of docu-
ments for putting them into the same cluster [3]. Generally cosine similarity is used to determine the content similarity
between two documents [24]. Cosine similarity actually checks the number of common terms present in the documents. If
http://dx.doi.org/10.1016/j.ins.2015.03.038
0020-0255/© 2015 Elsevier Inc. All rights reserved.
* Corresponding author. Tel.: +91 33 25753109; fax: +91 33 25783357.
E-mail addresses: [email protected] (T. Basu), [email protected] (C.A. Murthy).
Information Sciences 311 (2015) 149–162
two documents contain many common terms then they are very likely to be similar. The difficulty is that there is no clear explanation
as to how many common terms can identify two documents as similar. The text data sets are high dimensional
and most of the terms do not occur in each document. Hence the issue is to find the content similarity in such a way that it can
restrict the low similarity values. The actual content similarity between two documents may not be found properly by checking
only the individual terms of the documents. A new distance function is proposed to find the distance between two clusters based
on a similarity measure, extensive similarity between documents. Intuitively, the extensive similarity restricts the low (con-
tent) similarity values by a predefined threshold and then determines the similarity between two documents by finding their
distance with every other document in the corpus. It assigns a score to each pair of documents to measure the degree of content
similarity. A threshold is set on the content similarity value of the document vectors to restrict the low similarity values. A histogram
thresholding based method is used to estimate the value of the threshold from the similarity matrix of a corpus.
A new hybrid document clustering algorithm is proposed, which is a combination of a hierarchical and k-means clustering
technique. The hierarchical clustering technique produces some baseline clusters by using the proposed cluster distance
function. The hierarchical clusters are named as baseline clusters. These clusters are created in such a way that the documents
inside a cluster are very similar to each other. Actually the extensive similarity of every pair of documents of a baseline cluster is
very high. The documents of two different baseline clusters are very dissimilar to each other. Thus the baseline clusters intui-
tively determine the actual categories of the document collection. Generally there exist some singleton clusters after con-
structing the hierarchical clusters. The distance between a singleton cluster and each baseline cluster is not so small.
Hence the k-means clustering algorithm is performed to group each of these documents into the particular baseline cluster with which
it has the highest content similarity. If, for several iterations of the k-means algorithm, each of these singleton clusters is grouped
into the same baseline cluster, then it is likely to be assigned correctly. The significant property of the proposed technique
is that it can automatically identify the number of clusters. It has become clear from the experiments that the number of
clusters of each corpus is very close to the actual number of categories. The experimental analysis using several well known TREC and
Reuter data sets has shown that the proposed method performs significantly better than several existing document clustering
algorithms.
The paper is organized as follows. Section 2 describes some related works. The document representation technique is pre-
sented in Section 3. The proposed document clustering technique is explained in Section 4. The evaluation criteria for evaluating
the clusters generated by a particular method are described in Section 5. Section 6 presents the experimental results
and a detailed analysis of the results. Finally we conclude and discuss the further scope of this work in Section 7.
2. Related works
There are two basic types of document clustering techniques available in the literature – hierarchical and partitional clus-
tering techniques [8,11].
Hierarchical clustering produces a hierarchical tree of clusters where each individual level can be viewed as a com-
bination of clusters in the next lower level. This hierarchical structure of clusters is also known as dendrogram. The
hierarchical clustering techniques can be divided into two parts – agglomerative and divisive. In an Agglomerative
Hierarchical Clustering (AHC) method [30], starting with each document as an individual cluster, at each step, the most similar
clusters are merged until a given termination condition is satisfied. In a divisive method, starting with the whole set of docu-
ments as a single cluster, the method splits a cluster into smaller clusters at each step until a given termination condition is
satisfied. Several halting criteria for AHC algorithms have been proposed. But no widely acceptable halting criterion is avail-
able for these algorithms. As a result some good clusters may be merged, which will be eventually meaningless to the user.
There are mainly three variations of AHC techniques – single-link, complete-link and group-average hierarchical method for
document clustering [6].
In the single-link method the similarity between a pair of clusters is calculated as the similarity between the two most similar
documents, one from each cluster. The complete-link method measures the similarity
between a pair of clusters as the similarity between the two least similar documents, one from each cluster. The group average method merges
the two clusters that have the highest average similarity among all cluster pairs, where average similarity means the average of the similarities
between the documents of the two clusters. In a divisive hierarchical clustering technique, initially, the method assumes
the whole data set as a single cluster. Then at each step, the method chooses one of the existing clusters and splits it into two.
The process continues till only singleton clusters remain or it reaches a given halting criterion. Generally the cluster with the
least overall similarity is chosen for splitting [30].
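The three linkage rules above can be sketched directly from a precomputed document-similarity matrix; the function and the toy matrix below are illustrative assumptions, not data from the paper:

```python
# Sketch of the three AHC cluster-similarity rules over a precomputed
# similarity matrix (toy values, not from the paper).

def cluster_similarity(sim, cx, cy, rule):
    """Similarity between clusters cx and cy (lists of document indices)."""
    pairs = [sim[i][j] for i in cx for j in cy]
    if rule == "single":      # the two most similar cross-cluster documents
        return max(pairs)
    if rule == "complete":    # the two least similar cross-cluster documents
        return min(pairs)
    if rule == "average":     # mean over all cross-cluster pairs
        return sum(pairs) / len(pairs)
    raise ValueError(rule)

sim = [[1.0, 0.9, 0.25],
       [0.9, 1.0, 0.75],
       [0.25, 0.75, 1.0]]

print(cluster_similarity(sim, [0, 1], [2], "single"))    # 0.75
print(cluster_similarity(sim, [0, 1], [2], "complete"))  # 0.25
print(cluster_similarity(sim, [0, 1], [2], "average"))   # 0.5
```

An agglomerative pass would repeatedly merge the pair of clusters with the highest value under the chosen rule.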
In a recent study, Lai et al. have proposed an agglomerative hierarchical clustering algorithm by using dynamic k-nearest
neighbor list for each cluster. The clustering technique is named as Dynamic k-Nearest Neighbor Algorithm (DKNNA) [16]. The
method uses a list of dynamic k nearest neighbors to store k nearest neighbors of each cluster. Initially the method assumes
each document as a cluster and finds the k nearest neighbors of each cluster. The two clusters at minimum distance are merged and
their nearest neighbors are updated accordingly; then the minimum distant clusters are again found and merged, and so
on. The algorithm continues until the desired number of clusters is obtained. In the merging and updating process of each
iteration, the k nearest neighbors of the clusters that are affected by the merging process are updated. If the set of k nearest
neighbors is empty for some of the clusters being updated, their nearest neighbors are determined by searching all the
clusters. Thus the approach can guarantee the exactness of the nearest neighbors of a cluster and can obtain good quality clusters [16]. Although the algorithm has shown good results for some artificial and image data sets, it has two
limitations when applied to text data sets. The method needs the knowledge of the desired number of clusters, which is very difficult
to predict, and it is problematic to determine a valid k for text data sets.
In contrast to hierarchical clustering techniques, partitional clustering techniques allocate data into a previously known
fixed number of clusters. The commonly used partitional clustering technique is k-means method, where k is the desired
number of clusters [13]. Here initially k documents are chosen randomly from the data set, and they are called seed points.
Each document is assigned to its nearest seed point, thereby creating k clusters. Then the centroids of the clusters are com-
puted, and each document is assigned to its nearest centroid. The process continues until it converges to a solution,
i.e., the centroids remain the same in two consecutive iterations, or until it terminates
after a fixed number of iterations set by the user. The k-means algorithm is advantageous for its low computational
complexity [23]. It takes linear time to build the clusters. The main disadvantage is that the number of clusters is fixed and it
is very difficult to select a valid k for an unknown text data set. Also there is no universally acceptable way of choosing the
initial seed points. Recently Chiang et al. proposed a time efficient k-means algorithm by compressing and removing the
patterns at each iteration that are unlikely to change their membership thereafter [22], but the limitations of the k-means
clustering technique have not been discussed.
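The k-means procedure described above can be sketched as follows; this is a minimal version assuming dense vectors and Euclidean distance, and the toy points are illustrative (the paper's setting would use high-dimensional tf-idf vectors):

```python
import numpy as np

# Minimal k-means sketch: random seed points, assignment to the nearest
# centroid, centroid recomputation, and termination when the assignment
# stops changing. Toy data, illustrative only.

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(iters):
        # distance of every point to every centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                       # assignments stable: converged
        labels = new_labels
        # recompute each centroid; keep the old one if a cluster empties
        centroids = np.array([X[labels == j].mean(axis=0)
                              if np.any(labels == j) else centroids[j]
                              for j in range(k)])
    return labels, centroids

X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]], dtype=float)
labels, cents = kmeans(X, k=2)
```

On this toy data the two well-separated groups are recovered regardless of which points are drawn as seeds, which is exactly what cannot be guaranteed on real corpora with a poor initial selection.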
Bisecting k-means method [30] is a variation of basic k-means algorithm. This algorithm tries to improve the quality of
clusters in comparison to k-means clusters. In each iteration, it selects the largest existing cluster (the whole data set in
the first iteration) and divides it into two subsets using k-means (k = 2) algorithm. This process is continued till k clusters
are formed. Bisecting k-means algorithm generally produces almost uniform sized clusters. Thus it can perform better than
k-means algorithm when the actual groups of a data set are of almost similar size, i.e., the numbers of documents in the
categories of a corpus are close to each other. On the contrary, the method produces poor clusters for corpora where
the numbers of documents in the categories differ greatly. This method also faces difficulties like the k-means algorithm,
in choosing the initial seed points and a proper value of the parameter k.
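The bisecting procedure can be sketched as below; the deterministic farthest-pair seeding of the inner 2-means step is an assumption made here for reproducibility, not the paper's prescription:

```python
import numpy as np

# Bisecting k-means sketch: repeatedly split the largest cluster into
# two until k clusters remain. The inner 2-means is seeded with the two
# farthest-apart points (an illustrative choice).

def two_means(X, iters=50):
    d = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    i, j = np.unravel_index(d.argmax(), d.shape)   # farthest pair as seeds
    c = X[[i, j]].astype(float)
    lab = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        lab = np.linalg.norm(X[:, None] - c[None, :], axis=2).argmin(axis=1)
        c = np.array([X[lab == t].mean(axis=0) if np.any(lab == t) else c[t]
                      for t in (0, 1)])
    return lab

def bisecting_kmeans(X, k):
    clusters = [np.arange(len(X))]        # start with one cluster of all docs
    while len(clusters) < k:
        big = max(range(len(clusters)), key=lambda t: len(clusters[t]))
        idx = clusters.pop(big)           # bisect the largest cluster
        lab = two_means(X[idx])
        clusters += [idx[lab == 0], idx[lab == 1]]
    return clusters

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]], dtype=float)
parts = bisecting_kmeans(X, k=3)
```

Because the largest cluster is always the one split, the resulting clusters tend toward uniform sizes, which is the behavior discussed above.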
Buckshot algorithm is a combination of basic k-means and hierarchical clustering methods. It tries to improve the perfor-
mance of k-means algorithm by choosing better initial centroids [26]. It uses a hierarchical clustering algorithm on some
sample documents of the corpus in order to find robust initial centroids. Then k-means algorithm is performed to find
the clusters using these robust centroids as the initial centroids [3]. But repeated calls to this algorithm may produce differ-
ent partitions. If the initial random sampling does not represent the whole data set properly, the resulting clusters may be of
poor quality. Note that appropriate value of k is necessary for this method too.
Spectral clustering algorithm is a very popular clustering method which works on the similarity matrix rather than the
original term-document matrix using the idea of graph cut. It uses the top eigenvectors of the similarity matrix derived from
the similarity between documents [25]. The basic idea is to construct a weighted graph from the corpus, where each node
represents a document and each weighted edge represents the similarity between two documents. In this technique the
clustering problem is formulated as a graph cut problem. The core of this theory is the eigenvalue decomposition of the
Laplacian matrix of the weighted graph obtained from data [10]. Let X = {d_1, d_2, ..., d_N} be the set of N documents to cluster.
Let S be the N × N similarity matrix, where S_ij represents the similarity between the documents d_i and d_j. Ng et al. [25] proposed
a spectral clustering algorithm, which partitions the Laplacian data matrix L into k subsets using the k
largest eigenvectors, and they have used a Gaussian kernel S_ij = exp(−ρ(d_i, d_j)/(2σ²)) on the similarity matrix. Here ρ(d_i, d_j) denotes
the similarity between d_i and d_j, and σ is the scaling parameter. The Gaussian kernel is used to get rid of the curse of dimensionality.
The main difficulty of using a Gaussian kernel is that it is sensitive to the parameter σ [21]. A wrong value of σ may
highly degrade the quality of the clusters. It is extremely difficult to select a proper value of σ for a document collection,
since the text data sets are generally sparse with high dimension. It should be noted that the method also suffers from
the limitations of the k-means method, discussed above.
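The embedding step of an Ng et al.-style spectral method can be sketched as follows, assuming a Gaussian kernel over Euclidean distances; the toy points and σ are illustrative. Rows of the embedding that belong to the same group come out nearly identical, so a simple k-means on the rows would recover the groups:

```python
import numpy as np

# Spectral embedding sketch: Gaussian-kernel similarity matrix, its
# normalized Laplacian D^{-1/2} S D^{-1/2}, and the k largest
# eigenvectors as the low-dimensional representation (toy data).

def spectral_embedding(X, k, sigma=1.0):
    sq = ((X[:, None] - X[None, :]) ** 2).sum(-1)   # pairwise squared distances
    S = np.exp(-sq / (2 * sigma ** 2))              # Gaussian kernel
    np.fill_diagonal(S, 0.0)
    d = S.sum(axis=1)
    L = S / np.sqrt(d[:, None] * d[None, :])        # normalized Laplacian form
    vals, vecs = np.linalg.eigh(L)                  # eigenvalues ascending
    U = vecs[:, -k:]                                # k largest eigenvectors
    return U / np.linalg.norm(U, axis=1, keepdims=True)  # row-normalize

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
U = spectral_embedding(X, k=2)
```

Shrinking or growing `sigma` changes S drastically, which is the sensitivity to σ noted in the text.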
Non-negative Matrix Factorization (NMF) has previously been shown to be a useful decomposition for multivariate data. It
finds the positive factorization of a given positive matrix [19]. Xu et al. [33] have demonstrated that NMF performs well for
text clustering compared to the other similar methods like singular value decomposition and latent semantic indexing. The
technique factorizes the original term-document matrix D approximately as D ≈ UV^T, where U is a non-negative matrix
of size n × m, and V is an N × m non-negative matrix. The base vectors in U can be interpreted as a set of terms in the vocabulary
of the corpus, while V describes the contribution of the documents to these terms. The matrices U and V are randomly
initialized, and their contents iteratively estimated [1]. The Non-negative Matrix Factorization method attempts to determine
U and V which minimize the following objective function:

J = (1/2) ||D − UV^T||²   (1)

where ||·||² denotes the squared sum of all the elements in the matrix. This is an optimization problem with respect to the
matrices U = [u_ik] and V = [v_jk], ∀i = 1, 2, ..., n, ∀j = 1, 2, ..., N and k = 1, 2, ..., m, and as the matrices U and V are non-negative,
we have u_ik ≥ 0, v_jk ≥ 0. This is a typical constrained non-linear optimization problem and can be solved using the Lagrange
method [3]. An interesting property of the NMF technique is that it can also be used to find word clusters instead of document
clusters. The columns of U can be used to discover a basis which corresponds to word clusters. The NMF algorithm has
its disadvantages too. The optimization problem of Eq. (1) is convex in either U or V, but not in both U and V, which means
that the algorithm can guarantee convergence to a local minimum only. In practice, NMF users often compare the local min-
ima from several different starting points, using the results of the best local minimum found. On large sized corpora this may
be problematic [17]. Another problem with NMF algorithm is that it relies on random initialization and as a result, the same
data might produce different results across runs [1].
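A minimal NMF sketch using the classic multiplicative update rules for the Frobenius objective of Eq. (1); the random initialization mirrors the description above, while the matrix sizes and iteration count are illustrative assumptions:

```python
import numpy as np

# NMF sketch via multiplicative updates, minimizing
# J = 1/2 * ||D - U V^T||^2 subject to U, V >= 0 (toy sizes).

def nmf(D, m, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    n, N = D.shape
    U = rng.random((n, m)) + 0.1          # random non-negative initialization
    V = rng.random((N, m)) + 0.1
    eps = 1e-9                            # avoids division by zero
    for _ in range(iters):
        U *= (D @ V) / (U @ (V.T @ V) + eps)
        V *= (D.T @ U) / (V @ (U.T @ U) + eps)
    return U, V

# toy non-negative term-document matrix (6 terms, 8 documents)
D = np.random.default_rng(1).random((6, 8))
U, V = nmf(D, m=2)
```

Running the same call with a different `seed` illustrates the run-to-run variability discussed above: different random starts can land in different local minima.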
Xu et al. [34] proposed a concept factorization (CF) based document clustering technique, which models each cluster as a
linear combination of the documents, and each document as a linear combination of the cluster centers. The document clus-
tering is then accomplished by computing the two sets of linear coefficients, which is carried out by finding the non-negative
solution that minimizes the reconstruction error of the documents. The major advantage of CF over NMF is that it can be
applied to data containing negative values and the method can be implemented in the kernel space. The method has to select
k concepts (cluster centers) initially and it is very difficult to predict a value of k in practice. Dasgupta et al. [7] proposed a
simple active clustering algorithm which is capable of producing multiple clusterings of the same data according to user
interest. The advantage of this algorithm is that the user feedback required by this algorithm is minimal compared to the
other existing feedback-oriented clustering techniques, but the algorithm may suffer from human feedback, if the topics
are sensitive or when the perception varies. Carpineto et al. have done a good survey on search results clustering techniques.
They have elaborately explained and discussed various issues related to web clustering engines [5]. Wang et al. [32] proposed
an efficient soft-constraint algorithm that obtains a clustering result in which as many constraints as possible are
respected. The algorithm is basically an optimization problem and it starts by randomly assuming some
initial cluster centroids. The method can produce insignificant clusters if the initial centroids are not properly selected. Zhu
et al. [35] proposed a semi-supervised Non-negative Matrix Factorization method based on the pairwise constraints – must-
link and cannot-link. In this method must-link constraints are used to control the distance of the data in the compressed
form, and cannot-link constraints are used to control the encoding factor to obtain a very good performance. The method
has shown very good performance on some real life text corpora. The algorithm is a new variety of the NMF method, which again
relies on random initialization and may produce different clusters over several runs on a corpus where the sizes of the
categories vary highly from each other.
3. Vector space model for document representation
The number of documents in the corpus throughout this article is denoted by N. The number of terms in the corpus is
denoted by n. The ith term is represented by t_i. The number of times the term t_i occurs in the jth document is denoted by
tf_ij, i = 1, 2, ..., n, j = 1, 2, ..., N. Document frequency df_i is the number of documents in which the term t_i occurs. Inverse
document frequency, idf_i = log(N / df_i), determines how rarely a term occurs across the document collection. The weight of
the ith term in the jth document, denoted by w_ij, is determined by combining the term frequency with the inverse document
frequency [29] as follows:

w_ij = tf_ij × idf_i = tf_ij × log(N / df_i), ∀i = 1, 2, ..., n and ∀j = 1, 2, ..., N

The documents are represented using the vector space model in most of the clustering algorithms [29]. In this model each
document d_j is considered to be a vector, where the ith component of the vector is w_ij, i.e., d_j = (w_1j, w_2j, ..., w_nj).
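The weighting scheme above can be sketched directly; the three-document toy corpus is illustrative:

```python
import math

# tf-idf weighting as defined above: w_ij = tf_ij * log(N / df_i).
# Toy corpus; each document is a list of terms.
docs = [["data", "mining", "text"],
        ["data", "clustering"],
        ["text", "clustering", "text"]]

N = len(docs)                                   # number of documents
vocab = sorted({t for d in docs for t in d})    # term list
df = {t: sum(t in d for d in docs) for t in vocab}  # document frequencies

def weight(term, doc):
    """w_ij = tf_ij * log(N / df_i)."""
    return doc.count(term) * math.log(N / df[term])

vectors = [[weight(t, d) for t in vocab] for d in docs]
```

A term that occurs in every document gets weight 0 (log of 1), while a term confined to one document gets the largest idf boost, which is the intended behavior of the scheme.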
The key factor in the success of any clustering algorithm is the selection of a good similarity measure. The similarity
between two documents is measured through some similarity or distance function. Given two document vectors d_i and d_j, it is required
to find the degree of similarity (or dissimilarity) between them. Various similarity measures are available in the literature,
but the commonly used measure is the cosine similarity between two document vectors [30], which is given by

cos(d_i, d_j) = (d_i · d_j) / (||d_i|| ||d_j||) = Σ_{k=1}^{n} (w_ik × w_jk) / √(Σ_{k=1}^{n} w_ik² × Σ_{k=1}^{n} w_jk²), ∀i, j   (2)
The weight of each term in a document is non-negative. As a result the cosine similarity is non-negative and bounded
between 0 and 1. cos(d_i, d_j) = 1 means the documents are exactly similar, and the similarity decreases as the value decreases
to 0. An important property of the cosine similarity is its independence of document length. Thus cosine similarity has
become popular as a similarity measure in the vector space model [14]. Let D = {d_1, d_2, ..., d_r} be a set of r documents,
where each document has n terms. The centroid of D, D_cn, can be calculated as D_cn = (1/r) Σ_{j=1}^{r} d_j, where d_j is the
corresponding vector of document d_j.
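Eq. (2) and the centroid definition can be sketched as follows, over plain weight vectors with toy values:

```python
import math

def cosine(a, b):
    """Cosine similarity of Eq. (2) over plain weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def centroid(vectors):
    """Componentwise mean of r document vectors."""
    r = len(vectors)
    return [sum(col) / r for col in zip(*vectors)]

d1 = [1.0, 2.0, 0.0]
d2 = [2.0, 4.0, 0.0]

print(round(cosine(d1, d2), 6))   # 1.0, parallel vectors
print(centroid([d1, d2]))         # [1.5, 3.0, 0.0]
```

Note that d2 is twice d1 yet the cosine is still 1: this is the length independence mentioned above.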
4. Proposed clustering technique for effective grouping of documents
A combination of hierarchical clustering and k-means clustering methods has been introduced based on a similarity
assessment technique to effectively group the documents. The existing document clustering algorithms so far discussed
determine the (content) similarity of a pair of documents for putting them into the same cluster. Generally the content similarity
is determined by the cosine of the angle between two document vectors. The cosine similarity actually checks the number of common terms present in the documents. If two documents contain many common terms then the documents
are very likely to be similar, but the difficulty is that there is no clear explanation as to how many common terms can identify
two documents as similar. The text data sets are high dimensional and most of the terms do not occur in each document.
Hence the issue is to find the content similarity in such a way that it can restrict the low similarity values. The
actual content similarity between two documents may not be found properly by checking the individual terms of the documents.
Intuitively, if two documents are content wise similar then they should have similar types of relations with most of the
other documents i.e., if two documents x and y have similar content and if x is similar to any other document z then y must
be similar or somehow related to z. This important characteristic is not captured by the cosine similarity measure.
4.1. A similarity assessment technique
A similarity measure, Extensive Similarity (ES) is used to find the similarity between two documents in the proposed work.
The similarity measure extensively checks all the documents in the corpus to determine the similarity. The extensive similar-
ity between two documents is determined depending on their distances with every other document in the corpus. Intuitively,
two documents are exactly similar, if they have sufficient content similarity and they have almost same distance with every
other document in the corpus (i.e., both are either similar or dissimilar to all the other documents) [18]. The content similarity
is defined as a binary valued distance function. The distance between two documents is minimum i.e., 0 when they have suf-
ficient content similarity, otherwise the distance is 1 i.e., they have very low content similarity. The distance between two
documents d_i and d_j, ∀i, j, is determined by putting a threshold θ ∈ (0, 1) on their content similarity as follows:

dis(d_i, d_j) = 1 if ρ(d_i, d_j) ≤ θ, and 0 otherwise   (3)
where ρ is the similarity measure to find the content similarity between d_i and d_j. Here θ is a threshold value on the content
similarity and it is used to restrict the low similarity values. A data dependent method for estimating the value of θ is discussed
later. In the context of document clustering ρ is considered as cosine similarity, i.e., ρ(d_i, d_j) = cos(d_i, d_j), where d_i and
d_j denote the corresponding vectors of the documents. If dis(d_i, d_j) = 1, i.e., cos(d_i, d_j) ≤ θ, then we can strictly say that the documents
are dissimilar. On the other hand, if the distance is 0, i.e., cos(d_i, d_j) > θ, then they have sufficient content similarity
and the documents are somehow related to each other. Let us assume that d_i and d_j have cosine similarity 0.52, d_j and d_0
(another document) have cosine similarity 0.44, and θ = 0.1. Hence both dis(d_i, d_j) = 0 and dis(d_j, d_0) = 0, and the task is to
distinguish these two distances of the same value.
The extensive similarity is thus designed to find the grade of similarity of a pair of documents which are similar content
wise [18]. If dis(d_i, d_j) = 0 then extensive similarity finds the individual content similarities of d_i and d_j with every other
document, and assigns a score (μ) to denote the extensive similarity between the documents as below:

μ_ij = Σ_{k=1}^{N} |dis(d_i, d_k) − dis(d_j, d_k)|

Thus the extensive similarity between documents d_i and d_j, ∀i, j, is defined as

ES(d_i, d_j) = N − μ_ij if dis(d_i, d_j) = 0, and −1 otherwise   (4)
Two documents d_i, d_j have the maximum extensive similarity N if the distance between them is zero, and the distance between d_i
and d_k is the same as the distance between d_j and d_k for every k. In general, if the above said distances differ for μ_ij documents then the
extensive similarity is N − μ_ij. Unlike other similarity measures, ES takes into account the distances of the said two documents
d_i, d_j with respect to all the other documents in the corpus when measuring the distance between them [18]. μ_ij indicates
the number of documents with which the similarity of d_i is not the same as the similarity of d_j. As the μ_ij value increases,
the similarity between the documents d_i and d_j decreases. If μ_ij = 0 then d_i and d_j are exactly similar. Actually μ_ij denotes a
grade of dissimilarity and it indicates that d_i and d_j have different distances with μ_ij number of documents. The extensive
similarity is used to define the distance between two clusters in the first stage of the proposed document clustering method.
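Eqs. (3) and (4) can be sketched over a precomputed cosine similarity matrix; the matrix and θ below are toy values:

```python
# Extensive similarity sketch over a toy cosine-similarity matrix.

def dis(sim, i, j, theta):
    """Binary distance of Eq. (3): 1 when content similarity is at most theta."""
    return 1 if sim[i][j] <= theta else 0

def extensive_similarity(sim, i, j, theta):
    """Extensive similarity of Eq. (4): N - mu for similar pairs, -1 otherwise."""
    n = len(sim)
    if dis(sim, i, j, theta) == 1:
        return -1                          # dissimilar pair
    mu = sum(abs(dis(sim, i, k, theta) - dis(sim, j, k, theta))
             for k in range(n))
    return n - mu

sim = [[1.00, 0.52, 0.44, 0.05],
       [0.52, 1.00, 0.60, 0.07],
       [0.44, 0.60, 1.00, 0.06],
       [0.05, 0.07, 0.06, 1.00]]
theta = 0.1

print(extensive_similarity(sim, 0, 1, theta))  # 4: same relation to every doc
print(extensive_similarity(sim, 0, 3, theta))  # -1: low content similarity
```

Documents 0 and 1 reach the maximum score N = 4 because they are similar to each other and agree, under dis, about every other document in the toy corpus.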
A distance function is proposed to create the baseline clusters. It finds the distance between two clusters, say C_x and C_y.
Let T_xy be a multi-set consisting of the extensive similarities between each pair of documents, one from C_x and the other from
C_y, and it is defined as

T_xy = {ES(d_i, d_j) : ES(d_i, d_j) ≥ 0, ∀d_i ∈ C_x and d_j ∈ C_y}

Note that T_xy consists of all the occurrences of the same extensive similarity values (if any) for different pairs of documents.
The proposed distance between two clusters C_x and C_y can be defined as

dist_cluster(C_x, C_y) = ∞ if T_xy = ∅, and N − avg(T_xy) otherwise   (5)
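Eq. (5) can be sketched on top of the extensive similarity; the toy similarity matrix and θ are illustrative, and the block is self-contained:

```python
import math

# Cluster distance of Eq. (5): N minus the average of the non-negative
# cross-cluster extensive similarities, infinite when no pair is similar.

def dis(sim, i, j, theta):
    return 1 if sim[i][j] <= theta else 0          # Eq. (3)

def es(sim, i, j, theta):
    n = len(sim)
    if dis(sim, i, j, theta) == 1:
        return -1
    return n - sum(abs(dis(sim, i, k, theta) - dis(sim, j, k, theta))
                   for k in range(n))              # Eq. (4)

def dist_cluster(sim, cx, cy, theta):
    """Eq. (5): N - avg of non-negative cross-cluster ES values."""
    n = len(sim)
    T = [v for v in (es(sim, i, j, theta) for i in cx for j in cy) if v >= 0]
    if not T:                                      # no similar pair at all
        return math.inf
    return n - sum(T) / len(T)

sim = [[1.00, 0.52, 0.44, 0.05],
       [0.52, 1.00, 0.60, 0.07],
       [0.44, 0.60, 1.00, 0.06],
       [0.05, 0.07, 0.06, 1.00]]

print(dist_cluster(sim, [0, 1], [2], theta=0.1))   # 0.0: very similar clusters
print(dist_cluster(sim, [0], [3], theta=0.1))      # inf: segregated clusters
```

The infinite case is the one the text highlights: such cluster pairs are never merged by the proposed algorithm.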
The function dist_cluster finds the distance between two clusters C_x and C_y using the average of the multi-set of non-negative ES
values. The distance between C_x and C_y is infinite if there are no two documents that have a non-negative ES value, i.e., no
similar documents are present in C_x and C_y. Intuitively, an infinite distance between clusters denotes that every pair of documents,
one from C_x and the other from C_y, either shares very few terms, or no term is common between them, i.e.,
they have a very low content similarity. Later we shall observe that any two clusters with infinite distance between them
remain segregated from each other. Thus, a significant characteristic of the function dist_cluster is that it would never merge
two clusters with infinite distance between them.

The proposed document clustering algorithm initially assumes each document as a singleton cluster. Then it merges those clusters with minimum distance, provided the distance is within a previously fixed limit α. The process of merging continues until
the distance between every pair of remaining clusters exceeds α. The clusters which are not singletons are named Baseline
Clusters (BC). The selection of the value of α is discussed in Section 6.2 of this article.
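The baseline-cluster construction just described can be sketched as follows; `toy_dist` is a stand-in for the dist_cluster function of Eq. (5), used here only to keep the example small:

```python
import math

def baseline_clusters(n_docs, cluster_distance, alpha):
    """Merge the closest pair of clusters while its distance is within
    alpha; the non-singleton clusters that remain are the baseline clusters."""
    clusters = [[i] for i in range(n_docs)]      # every document alone
    while True:
        best, pair = math.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b])
                if d < best:
                    best, pair = d, (a, b)
        if pair is None or best > alpha:         # no pair within alpha left
            break
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return [c for c in clusters if len(c) > 1]

# toy stand-in for dist_cluster of Eq. (5): documents live on a line and
# the cluster distance is the gap between their closest members
positions = [0, 1, 50, 100, 101]

def toy_dist(cx, cy):
    return min(abs(positions[i] - positions[j]) for i in cx for j in cy)

print(baseline_clusters(5, toy_dist, alpha=5))   # [[0, 1], [3, 4]]
```

Document 2 is left as a singleton, illustrating the situation handled by the subsequent k-means stage of the proposed method.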
4.2. Properties of dist_cluster

The important properties of the function dist_cluster are described below.
- The minimum distance between any two clusters C_x and C_y is 0, when avg(T_xy) = N, i.e., the extensive similarity value
between every pair of documents, one from C_x and the other from C_y, is N. In practice this minimum value is rarely
observed between two different document clusters. The maximum value of dist_cluster is infinite.
- If C_x = C_y then dist_cluster(C_x, C_y) = N − avg(T_xx) = 0.
- dist_cluster(C_x, C_y) = 0 ⇒ avg(T_xy) = N ⇒ ES(d_i, d_j) = N, ∀d_i ∈ C_x and ∀d_j ∈ C_y.
Now ES(d_i, d_j) = N implies that the two documents d_i and d_j are exactly similar. Note that ES(d_i, d_j) = N ⇒ dis(d_i, d_j) = 0
and μ_ij = 0. Here dis(d_i, d_j) = 0 implies that d_i and d_j are similar in terms of content, but they are not necessarily the same,
i.e., we cannot say d_i = d_j if dis(d_i, d_j) = 0.
Thus dist_cluster(C_x, C_y) = 0 does not imply C_x = C_y, and hence dist_cluster is not a metric.
- It is symmetric: for every pair of clusters C_x and C_y, dist_cluster(C_x, C_y) = dist_cluster(C_y, C_x).
- dist_cluster(C_x, C_y) ≥ 0 for any pair of clusters C_x and C_y.
- For any three clusters C_x, C_y and C_0, we may have

dist_cluster(C_x, C_y) + dist_cluster(C_y, C_0) − dist_cluster(C_x, C_0) < 0

when 0 ≤ dist_cluster(C_x, C_y) < N, 0 ≤ dist_cluster(C_y, C_0) < N and dist_cluster(C_x, C_0) = ∞. Thus it does not satisfy the
triangle inequality.
4.3. A method for estimation of θ
There are several types of document collections available in real life. The similarities or dissimilarities between documents present in one corpus may not be the same as those of other corpora, since the characteristics of the corpora are different [18]. Additionally, one may view the clusters present in a corpus (or in different corpora) under different scales, and different scales produce different partitions. Similarities corresponding to one scale in one corpus may not be the same as the similarities corresponding to the same scale in a different corpus. This has been the reason for making the threshold on similarities data dependent [18]. In fact, we feel that a fixed threshold on similarities will not give satisfactory results on several data sets.
There are several methods available in the literature for finding a threshold for a two-class (one class corresponds to similar points, and the other corresponds to dissimilar points) classification problem. A popular method for such classification is histogram thresholding [12].
Let, for a given corpus, the number of distinct similarity values be p, and let the similarity values be s_0, s_1, ..., s_{p−1}. Without loss of generality, let us assume that (a) s_i < s_j if i < j, and (b) (s_{i+1} − s_i) = (s_1 − s_0) for all i = 1, 2, ..., (p − 2). Let g(s_i) denote the number of occurrences of s_i, for all i = 0, 1, ..., (p − 1). Our aim is to find a threshold θ on the similarity values so that a similarity value s < θ implies the corresponding documents are practically dissimilar, and otherwise they are similar. The aim is to make the choice of the threshold data dependent. The basic steps of the histogram thresholding technique are as follows:
Obtain the histogram corresponding to the given problem.
Reduce the ambiguity in the histogram. Usually this step is carried out using a window. One of the earliest such techniques is the moving average technique in time series analysis [2], which is used to reduce the local variations in a histogram. The window is convolved with the histogram, resulting in a less ambiguous histogram. We have used the weighted moving averages of the g(s_i) values with a window of length 5,

f(s_i) = [ g(s_i) / Σ_{j=0}^{p−1} g(s_j) ] × [ g(s_{i−2}) + g(s_{i−1}) + g(s_i) + g(s_{i+1}) + g(s_{i+2}) ] / 5,  for all i = 2, 3, ..., p − 3   (6)
154 T. Basu, C.A. Murthy/ Information Sciences 311 (2015) 149–162
Find the valley points in the modified histogram. A point s_i corresponding to the weight function f(s_i) is said to be a valley point if f(s_{i−1}) > f(s_i) and f(s_i) < f(s_{i+1}).
The first valley point of the modified histogram is taken as the required threshold on the similarity values.
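A minimal sketch of the smoothing step of Eq. (6) is given below; the helper name `smooth_histogram` is ours, and `g` is the list of bin counts g(s_0), ..., g(s_{p−1}):

```python
def smooth_histogram(g):
    """Weighted moving average of Eq. (6) with window length 5.

    Since f(s_i) is defined only for i = 2, ..., p - 3, the result is
    a dict mapping those indices to f(s_i).
    """
    total = sum(g)
    return {
        i: (g[i] / total) * (sum(g[i - 2:i + 3]) / 5)
        for i in range(2, len(g) - 2)
    }
```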
In the modified histogram corresponding to f, there can be three possibilities regarding the valley points, which are stated below.
(i) There is no valley point in the histogram. If there is no valley point then the histogram is either a constant function, or an increasing or decreasing function of the similarity values. These three types of histograms impose strong conditions on the similarity values which are unnatural for a document collection. Another possible histogram without a valley point is a unimodal histogram. There is a single mode in the histogram, and the number of occurrences of a similarity value increases as the similarity values increase towards the mode, and decreases as the similarity values move away from the mode. This is also an unnatural setup, since there is no reason for such a strong property to be satisfied by a histogram of similarity values.
(ii) Another option is that there exists exactly one valley point in the histogram. The number of occurrences of the valley point is smaller than the number of occurrences of the other similarity values in a neighborhood of the valley point. In practice this type of example is also rare.
(iii) The third and most usual possibility is that there is more than one valley point, i.e., there exist several variations in the number of occurrences of the similarity values. Here the task is to find a threshold from a particular valley. In the proposed technique the threshold is selected from the first valley point. The threshold could instead be selected from the second, third or a higher valley, but then some really similar documents, whose similarity values lie between the first valley point and the higher one, would be treated as dissimilar. In practice, text data sets are sparse and high dimensional, so high similarities between documents are observed in very few cases. It is true that, for a high θ value, the extensive similarity between every two documents in a cluster will be high, but the number of documents in each cluster will be too small due to the sparsity of the data. Hence θ is selected from the first valley point, as the similarity values in the other valleys are higher than the similarity values in the first valley.
Generally, similarity values do not satisfy the property that (s_{i+1} − s_i) = (s_1 − s_0) for all i = 1, 2, ..., (p − 2). In reality there are (p + 1) distinct class intervals of similarity values, where the ith class interval is [v_{i−1}, v_i), a semi-closed interval, for i = 1, 2, ..., p. The (p + 1)th class interval corresponds to the set where each similarity value is greater than or equal to v_p. g(s_i) corresponds to the number of similarity values falling in the ith class interval. The v_i's are taken in such a way that (v_i − v_{i−1}) = (v_1 − v_0) for all i = 2, 3, ..., p. Note that v_0 = 0 and the value of v_p is decided on the basis of the observations. The last interval, i.e., the (p + 1)th interval, is not considered for the valley point selection, since we assume that if any similarity value is greater than or equal to v_p then the corresponding documents are actually similar. Under this setup, we have taken s_i = (v_i + v_{i+1})/2 for all i = 0, 1, ..., (p − 1). Note that the defined s_i's satisfy the properties (a) s_i < s_j if i < j, and (b) (s_{i+1} − s_i) = (s_1 − s_0) for all i = 1, 2, ..., (p − 2). The proposed method finds the valley point and its corresponding class interval. The minimum value of that particular class interval is taken as the threshold.
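Putting the pieces together, a hypothetical helper `estimate_theta` illustrating the class-interval construction, smoothing and first-valley selection might look like the sketch below. The bin `width` and upper bound `v_max` are illustrative parameters, not values prescribed by the paper:

```python
def estimate_theta(similarities, width=0.001, v_max=0.08):
    """Sketch of the theta estimation procedure.

    Bins the similarity values into p = v_max / width class intervals
    [v_{i-1}, v_i) of equal width (values >= v_max fall into the last,
    ignored interval), smooths the counts with the length-5 weighted
    moving average of Eq. (6), and returns the minimum value of the
    class interval of the first valley point (None if no valley exists).
    """
    p = round(v_max / width)
    g = [0] * (p + 1)
    for s in similarities:
        g[min(int(s / width), p)] += 1
    total = sum(g[:p]) or 1
    f = {i: (g[i] / total) * (sum(g[i - 2:i + 3]) / 5)
         for i in range(2, p - 2)}
    for i in range(3, p - 3):
        if f[i - 1] > f[i] < f[i + 1]:   # first valley point
            return i * width             # lower bound of its class interval
    return None
```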
Example. Let us consider an example of histogram thresholding for the selection of θ for a corpus. The similarity values and the values of g and f are shown in Table 1. Initially we have divided the similarity values into class intervals of length 0.001. Let us assume that there are 80 such intervals of equal length and that s_i represents the middle point of the ith class interval for i = 0, 1, ..., 79. The values of the g(s_i)'s and the corresponding f(s_i)'s are then found. Note that the moving averages have been used to remove the ambiguities in the g(s_i) values. Valleys in the similarity values corresponding to the 76 f(s_i)'s are then found. Let s_40, which is equal to 0.0405, be the first valley point, i.e., f(s_39) > f(s_40) and f(s_40) < f(s_41). The minimum similarity value of the class interval [0.040, 0.041), i.e., 0.040, is taken as the threshold θ.
Table 1
An example of θ estimation by the histogram thresholding technique.

Class intervals (v_i's)    s_i's     No. of elements of the intervals    Moving averages
[0.000–0.001)              0.0005    g(s_0)                              –
[0.001–0.002)              0.0015    g(s_1)                              –
[0.002–0.003)              0.0025    g(s_2)                              f(s_2)
...                        ...       ...                                 ...
[0.040–0.041)              0.0405    g(s_40)                             f(s_40)
...                        ...       ...                                 ...
[0.077–0.078)              0.0775    g(s_77)                             f(s_77)
[0.078–0.079)              0.0785    g(s_78)                             –
[0.079–0.080)              0.0795    g(s_79)                             –
≥ 0.080                    –         g(s_80)                             –
4.4. Procedure of the proposed document clustering technique
The proposed document clustering technique is described in Algorithm 1. Initially each document is taken as a cluster. Therefore Algorithm 1 starts with N individual clusters. In the first stage of Algorithm 1, a distance matrix is developed whose (i, j)th entry is the dist_cluster(C_i, C_j) value, where C_i and C_j are the ith and jth clusters respectively. It is a square matrix with N rows and N columns for the N documents in the corpus. Each row or column of the distance matrix is treated as a cluster. Then the Baseline Clusters (BC) are generated by merging the clusters whose distance is within a fixed threshold α. The value of α is constant throughout Algorithm 1. The merging in step 3 of Algorithm 1 merges two rows, say i and j, and the corresponding columns of the distance matrix by following a convention regarding numbering: the two rows are merged into one, the resultant row is numbered as the minimum of i and j, and the other row is removed. The same numbering convention is followed for the columns. Then the index structure of the distance matrix is updated accordingly.
Algorithm 1. Iterative document clustering by baseline clusters

Input: (a) A set of clusters C = {C_1, C_2, ..., C_N}, where N is the number of documents, and C_i = {d_i}, i = 1, 2, ..., N, where d_i is the ith document of the corpus.
(b) A distance matrix DM[i][j] = dist_cluster(C_i, C_j), for all i, j ∈ {1, 2, ..., N}.
(c) α, the desired threshold on dist_cluster, and iter, the maximum number of iterations.
Steps of the algorithm:
1: for each pair of clusters C_i, C_j ∈ C where C_i ≠ C_j and N > 1 do
2:   if dist_cluster(C_i, C_j) ≤ α then
3:     DM ← merge(DM, i, j)
4:     C_i ← C_i ∪ C_j
5:     N ← N − 1
6:   end if
7: end for
8: nbc ← 0; BC ← ∅  // Baseline clusters are initialized to the empty set
9: nsc ← 0; SC ← ∅  // Singleton clusters are initialized to the empty set
10: for i = 1 to N do
11:   if |C_i| > 1 then
12:     nbc ← nbc + 1  // No. of baseline clusters
13:     BC_nbc ← C_i  // Baseline clusters
14:   else
15:     nsc ← nsc + 1  // No. of singleton clusters
16:     SC_nsc ← C_i  // Singleton clusters
17:   end if
18: end for
19: if nsc = 0 or nbc = 0 then
20:   return BC  // If no singleton cluster exists or no baseline cluster is generated
21: else
22:   EBC_k ← BC_k, for all k = 1, 2, ..., nbc  // Initialization of extended baseline clusters
23:   ebct_k ← centroid of BC_k, for all k = 1, 2, ..., nbc  // Extended base centroids
24:   nct_k ← (0), for all k = 1, 2, ..., nbc; it ← 0
25:   while ebct_k ≠ nct_k for some k = 1, 2, ..., nbc and it ≤ iter do
26:     ebct_k ← centroid of EBC_k, for all k = 1, 2, ..., nbc
27:     NCL_k ← BC_k, for all k = 1, 2, ..., nbc  // New set of clusters at each iteration
28:     for j = 1 to nsc do
29:       if ebct_k is the nearest centroid of SC_j among ebct_1, ..., ebct_nbc then
30:         NCL_k ← NCL_k ∪ SC_j  // Merger of singleton clusters into baseline clusters
31:       end if
32:     end for
33:     nct_k ← centroid of NCL_k, for all k = 1, 2, ..., nbc
34:     EBC_k ← NCL_k, for all k = 1, 2, ..., nbc
35:     it ← it + 1
36:   end while
37:   return EBC
38: end if
Output: A set of extended baseline clusters EBC = {EBC_1, EBC_2, ..., EBC_nbc}
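A compact Python sketch of both stages of Algorithm 1 is given below. It is an illustration under simplifying assumptions — documents as sparse term-weight dicts, a caller-supplied `dist` implementing dist_cluster on lists of document ids, and cosine similarity for the assignment step — not the authors' code:

```python
import math

def centroid(member_ids, docs):
    """Mean term-weight vector of a cluster."""
    c = {}
    for i in member_ids:
        for t, w in docs[i].items():
            c[t] = c.get(t, 0.0) + w
    return {t: w / len(member_ids) for t, w in c.items()}

def cosine(u, v):
    """Cosine similarity of two sparse term-weight vectors."""
    num = sum(w * v.get(t, 0.0) for t, w in u.items())
    den = math.sqrt(sum(w * w for w in u.values())) * \
          math.sqrt(sum(w * w for w in v.values()))
    return num / den if den else 0.0

def cluster_documents(docs, dist, alpha, max_iter=100):
    """Two-stage sketch of Algorithm 1."""
    # Stage 1: repeatedly merge the closest pair while it is within alpha.
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > 1:
        d, ai, bi = min((dist(a, b), ai, bi)
                        for ai, a in enumerate(clusters)
                        for bi, b in enumerate(clusters) if ai < bi)
        if d > alpha:
            break
        clusters[ai] += clusters.pop(bi)
    baseline = [c for c in clusters if len(c) > 1]
    singles = [c[0] for c in clusters if len(c) == 1]
    if not baseline or not singles:
        return clusters  # degenerate case: nothing to extend
    # Stage 2: k-means-style growth of the baseline clusters; only the
    # singletons are reassigned, but centroids use all member documents.
    ebc = [list(c) for c in baseline]
    for _ in range(max_iter):
        cents = [centroid(c, docs) for c in ebc]
        new = [list(c) for c in baseline]
        for s in singles:
            k = max(range(len(cents)), key=lambda j: cosine(docs[s], cents[j]))
            new[k].append(s)
        if new == ebc:  # assignments stable: converged
            break
        ebc = new
    return ebc
```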
After constructing the baseline clusters, some clusters may remain as singleton clusters. Every such singleton cluster (i.e., a single document) is merged with one of the baseline clusters using the k-means algorithm in the second stage. In the second stage the centroids of the baseline clusters (i.e., the non-singleton clusters) are calculated, and they are named base centroids. The value of k for the k-means algorithm is taken as the number of baseline clusters. The rest of the documents, which are not included in the baseline clusters, are grouped by the iterative steps of the k-means algorithm using these base centroids as the initial seed points. Note that only those documents which are not included in the baseline clusters are considered for clustering in this stage. But, for the calculation of a cluster centroid, every document in the cluster, including the documents in the baseline clusters, is considered. A document is put into that cluster for which the content similarity between the document and the base centroid is maximum. The newly formed clusters are named Extended Baseline Clusters (EBC).
It may be noted that the processing in the second stage is not needed if no singleton cluster is produced in the first stage. We believe that such a possibility is remote in real life, and none of our experiments yielded such an outcome. However, such a clustering is desirable, as it produces compact clusters.
4.5. Impact of extensive similarity on the document clustering technique
The extensive similarity plays a significant role in constructing the baseline clusters. The documents in the baseline clusters are very similar to each other, as their extensive similarity is very high (above a threshold θ). It may be observed that whenever two baseline clusters are merged in the first stage, the similarity between any two documents in the merged cluster is at least θ. Note that the distance between two different baseline clusters is greater than or equal to α, and the distance between a baseline cluster and a singleton cluster (or between two singleton clusters) may be infinite, in which case they would never merge to construct a new baseline cluster. Infinite distance between two clusters indicates that the extensive similarity between every document of the baseline cluster and the document of the singleton cluster (or between the documents of two different singleton clusters) is −1. Thus the baseline clusters intuitively determine the categories of the document collection by measuring the extensive similarity between documents.
4.6. Discussion
The proposed clustering method is a combination of the baseline clustering and k-means clustering methods. Initially it creates some baseline clusters. The documents which do not have much similarity with any of the baseline clusters remain as singleton clusters. Therefore the k-means method is employed to assign these documents to the corresponding baseline clusters. The k-means algorithm has been used due to its low computational complexity; it is also useful as it can be easily implemented. However, the performance of the k-means algorithm suffers from the selection of the initial seed points, and there is no method for selecting a valid k. It is very difficult to select a proper k for a sparse text data set with high dimensionality. In various other clustering techniques, e.g., spectral clustering and the buckshot algorithm, k-means has been used as an intermediary stage; these algorithms also suffer from the said limitations of the k-means method. Note that the proposed clustering method overcomes these two major limitations of the k-means clustering algorithm and utilizes the effectiveness of the k-means method by introducing the idea of baseline clusters. The effectiveness of the proposed technique in terms of clustering quality may be observed in the experimental results section later.
The proposed technique is designed like the buckshot clustering algorithm. The main difference between buckshot and the proposed one lies in how the hierarchical clusters are designed in the first stage of the two methods. Buckshot uses the traditional single-link clustering technique to develop the hierarchical clusters that provide the initial centroids of the k-means clustering in the second stage. Thus buckshot may suffer from the limitations of both the single-link clustering technique (e.g., the chaining effect [8]) and the k-means clustering technique. In practice, text data sets contain many categories of uneven sizes. In those data sets the initial random selection of √(kn) documents may not be proper, i.e., no documents may be selected from an original cluster if its size is small. Note that no random sampling is required for the proposed clustering technique. In the proposed one the hierarchical clusters are created using the extensive similarity between documents, and these hierarchical baseline clusters are not used any further in the second stage. In the second stage, the k-means algorithm is performed only to group those documents that have not been included in the baseline clusters, and the initial centroids are generated from these baseline clusters. In the buckshot algorithm, all the documents are taken into consideration for clustering by the k-means algorithm; it uses the single-link clustering technique only to create the initial seed points of the k-means algorithm. Later it can be seen from the experiments that the proposed one performs significantly better than the buckshot clustering technique.
The process of creating baseline clusters in the first stage of the proposed technique is quite similar to the group-average hierarchical document clustering technique [30]. Both techniques average the similarities between the documents of two individual clusters to decide whether to merge them into one. The proposed method finds the distance between two clusters using extensive similarity, whereas the group-average hierarchical document clustering technique generally uses cosine similarity. Moreover, the group-average hierarchical clustering technique cannot explicitly distinguish two dissimilar clusters, unlike the proposed method. This is the main difference between the two techniques.
5. Evaluation criteria
If the documents within a cluster are similar to each other and dissimilar to the documents in the other clusters then the
clustering algorithm is considered to perform well. The data sets under consideration have labeled documents. Hence quality
measures based on labeled data are used here for comparison.
Normalized mutual information and f-measure are very popular and are used by a number of researchers [30,31] to measure the quality of a cluster using the information of the actual categories of the document collection. Let us assume that R is the set of categories and S is the set of clusters. Consider that there are I categories in R and J clusters in S. There are a total of N documents in the corpus, i.e., both R and S individually contain N documents. Let n_i be the number of documents belonging to category i, m_j be the number of documents belonging to cluster j, and n_ij be the number of documents belonging to both category i and cluster j, for all i = 1, 2, ..., I and j = 1, 2, ..., J.
Mutual information is a symmetric measure that quantifies the statistical information shared between two distributions, and it provides an indication of the shared information between a set of categories and a set of clusters. Let I(R, S) denote the mutual information between R and S, and let E(R) and E(S) be the entropies of R and S respectively. I(R, S) and E(R) can be defined as

I(R, S) = Σ_{i=1}^{I} Σ_{j=1}^{J} (n_ij / N) log( N·n_ij / (n_i·m_j) ),    E(R) = − Σ_{i=1}^{I} (n_i / N) log(n_i / N)
There is no upper bound for I(R, S), so for easier interpretation and comparison a normalized mutual information that ranges from 0 to 1 is desirable. The normalized mutual information (NMI) is defined by Strehl et al. [31] as follows:

NMI(R, S) = I(R, S) / √( E(R)·E(S) )
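The NMI formula can be computed directly from two label lists via the contingency counts n_ij; the helper below is an illustrative sketch (the function name is ours):

```python
import math

def nmi(labels_true, labels_pred):
    """Normalized mutual information, NMI = I(R, S) / sqrt(E(R) * E(S))."""
    N = len(labels_true)
    cats, clus = set(labels_true), set(labels_pred)
    n = {(i, j): sum(1 for t, p in zip(labels_true, labels_pred)
                     if t == i and p == j) for i in cats for j in clus}
    ni = {i: sum(n[i, j] for j in clus) for i in cats}
    mj = {j: sum(n[i, j] for i in cats) for j in clus}
    mi = sum(n[i, j] / N * math.log(N * n[i, j] / (ni[i] * mj[j]))
             for i in cats for j in clus if n[i, j] > 0)
    er = -sum(ni[i] / N * math.log(ni[i] / N) for i in cats)
    es = -sum(mj[j] / N * math.log(mj[j] / N) for j in clus)
    return mi / math.sqrt(er * es) if er and es else 0.0
```

The ratio is independent of the logarithm base, and a perfect one-to-one match between categories and clusters yields NMI = 1.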
The f-measure determines the recall and precision value of each cluster with respect to a corresponding category. Let, for a query, the set of relevant documents be from category i and the set of retrieved documents be from cluster j. Then recall, precision and f-measure are given as follows:

Recall_ij = n_ij / n_i,    Precision_ij = n_ij / m_j,    for all i, j

F_ij = (2 × Recall_ij × Precision_ij) / (Recall_ij + Precision_ij),    for all i, j

If there is no common instance between a category and a cluster (i.e., n_ij = 0) then we shall assume F_ij = 0. The value of F_ij is maximum when Precision_ij = Recall_ij = 1 for a category i and a cluster j. Thus the value of F_ij lies between 0 and 1. The best f-measure among all the clusters is selected as the f-measure for the query of a particular category, i.e., F_i = max_{j ∈ [1, J]} F_ij for all i. The overall f-measure is the weighted average of the f-measures of the categories, F = Σ_{i=1}^{I} (n_i / N) F_i. We would like to maximize the f-measure and normalized mutual information to achieve good quality clusters.
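Similarly, the overall f-measure above can be sketched as a small helper (the name `f_measure` is ours):

```python
def f_measure(labels_true, labels_pred):
    """Overall f-measure: best F_ij per category i, weighted by n_i / N."""
    N = len(labels_true)
    total = 0.0
    for i in set(labels_true):
        ni = sum(1 for t in labels_true if t == i)
        best = 0.0
        for j in set(labels_pred):
            nij = sum(1 for t, p in zip(labels_true, labels_pred)
                      if t == i and p == j)
            if nij == 0:
                continue  # F_ij is taken as 0
            mj = sum(1 for p in labels_pred if p == j)
            rec, prec = nij / ni, nij / mj
            best = max(best, 2 * rec * prec / (rec + prec))
        total += ni / N * best
    return total
```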
6. Experimental evaluation
6.1. Document collections
Reuters-21578 is a collection of documents that appeared on Reuters newswire in 1987. The documents were originally
assembled and indexed with categories by Carnegie Group, Inc. and Reuters, Ltd. The corpus contains 21,578 documents in
135 categories. Here we considered the ModApte version used in [4], in which there are 30 categories and 8067 documents.
We have divided this corpus into four groups, named rcv1, rcv2, rcv3 and rcv4.
The 20-Newsgroups corpus is a collection of news articles collected from 20 different news sources. Each news source constitutes a different category. In this data set, articles with multiple topics are cross-posted to multiple newsgroups, i.e., there are overlaps between several categories. The data set is named 20ns here.
The rest of the corpora were developed in the Karypis lab [15]. The corpora tr31, tr41 and tr45 are derived from the TREC-5, TREC-6 and TREC-7 collections.1 The categories of tr31, tr41 and tr45 were generated from the relevance judgments provided in these collections. The corpus fbis was collected from the Foreign Broadcast Information Service data of TREC-5. The corpora la1 and la2 were taken from the Los Angeles Times data of TREC-5. The category labels of la1 and la2 were generated according to the names of the newspaper sections where these articles appeared, such as Entertainment, Financial, Foreign, Metro, National and Sports. The documents that have a single label were selected for the la1 and la2 data sets. The corpora oh10 and oh15 were created from the OHSUMED collection, a subset of the MEDLINE database, which contains 233,445 documents indexed using 14,321 unique categories [15]. Different subsets of categories have been taken to construct these data sets.
1 http://trec.nist.gov.
The number of documents, number of terms and number of categories of these corpora can be found in Table 2. For each of the above corpora, the stop words have been removed using the standard English stop word list.2 Then, by applying the standard Porter stemming algorithm [27], the inverted index is developed.
6.2. Experimental setup
Single-Link Hierarchical Clustering (SLHC) [30], Average-Link Hierarchical Clustering (ALHC) [30], the Dynamic k-Nearest Neighbor Algorithm (DKNNA) [16], k-means clustering [13], bisecting k-means clustering [30], buckshot clustering [6], spectral clustering [25] and clustering by Non-negative Matrix Factorization (NMF) [33] are selected for comparison with the proposed clustering technique. The k-means and bisecting k-means algorithms have been executed 10 times to reduce the effect of the random initialization of seed points, and in each execution they have been iterated 100 times to reach a solution (if they did not converge earlier). The buckshot algorithm has also been executed 10 times to reduce the effect of the random selection of the initial √(kN) documents. The f-measure and NMI values of the k-means, bisecting k-means and buckshot clustering techniques shown here are the averages of 10 different results. Note that the proposed method finds the number of clusters automatically from the data sets. The proposed clustering algorithm has been executed first, and then all the other algorithms have been executed to produce the same number of clusters as the proposed one. Tables 3 and 4 show the f-measure and NMI values respectively for all the data sets. The Number of Clusters (NCL) developed by the proposed method is also
shown. The f-measure and NMI are calculated using these NCL values. The value of α is chosen as α = √(√N) for a corpus of N documents. The NMF based clustering algorithm has been executed 10 times to reduce the effect of random initialization, and each time it has been iterated 100 times to reach a solution. The value of k for DKNNA is taken as k = 10. The value of σ for the spectral clustering technique is set by searching over values from 10 to 20 percent of the total range of the similarity values, and the one that gives the tightest clusters is picked, as suggested by Ng et al. [25].
The proposed histogram thresholding based technique for estimating the value of θ has been followed in the experiments. We have considered class intervals of length 0.005 for the similarity values. We have also assumed that a content similarity (here cosine similarity) value greater than 0.5 means that the corresponding documents are similar. Thus, the issue here is to find a θ, 0 < θ < 0.5, such that a similarity value greater than θ denotes that the corresponding documents are similar. In the experiments we have used the method of moving averages with a window length of 5 for convolution. The text data sets are generally sparse, the number of high similarity values is practically very low, and there are fluctuations in the heights of the histogram for two successive similarity values. Hence it is not desirable to take a window length of 3, as the method would then consider the heights of just the previous and the next value for calculating the f(s_i) values. We have tried window lengths of 7 and 9 on some of the corpora in the experiments, but the values of θ remain more or less the same as those selected with a window length of 5. On the other hand, window lengths of 7 and 9 need more calculations than a window length of 5. These are the reasons for the choice of a window of length 5. It has been found that several local peaks and local valleys are removed by this method. The number of valley regions after smoothing the histogram by the method of moving averages is always found to be greater than three.
6.3. Analysis of results
Tables 3 and 4 show the comparison of the proposed document clustering method with the other methods using f-measure and NMI respectively, for all data sets. There are 104 comparisons for the proposed method using f-measure in Table 3. The proposed one performs better than the other methods in 91 cases, and in the remaining 13 cases the other methods (e.g., the buckshot and spectral clustering algorithms) have an edge over the proposed method. A few of the exceptions where the other methods perform better than the proposed one are SLHC and NMF for rcv3 (here the f-measures of SLHC and NMF are respectively
Table 2
Data sets overview.
Data set No. of documents No. of terms No. of categories
20ns 18,000 35,218 20
fbis 2463 2000 17
la1 3204 31,472 6
la2 3075 31,472 6
oh10 1050 3238 10
oh15 913 3100 10
rcv1 2017 12,906 30
rcv2 2017 12,912 30
rcv3 2017 12,820 30
rcv4 2016 13,181 30
tr31 927 10,128 7
tr41 878 7454 10
tr45 690 8261 10
2 http://www.textfixer.com/resources/common-english-words.txt.
0.408 and 0.511, while the f-measure of the proposed method is 0.294). Similarly, Table 4 shows that the proposed method performs better than the other methods using NMI in 98 out of 104 cases.
A statistical significance test has been performed to check whether the differences are significant in the cases, in both Tables 3 and 4, where other clustering algorithms beat the proposed algorithm. The same statistical significance test has been performed for the cases where the proposed algorithm performs better than the other clustering algorithms.
A generalized version of the paired t-test is suitable for testing the equality of means when the variances are unknown. This problem is the classical Behrens–Fisher problem in hypothesis testing, and a suitable test statistic (see footnote 3) is described and tabled in [20] and [28] respectively. It has been found that of the 91 cases in Table 3 where the proposed algorithm performed better than the other algorithms, the differences are statistically significant in 86 cases at the 0.05 level of significance. For all of the remaining 13 cases in Table 3, the differences are statistically significant at the same level of significance. Hence the performance of the proposed method is found to be significantly better than that of the other methods in 86.86% (86/99) of the cases using f-measure. Similarly, in Table 4 the results are significant in 89 of the 98 cases where the proposed method performed better than the other methods, and all the results of the remaining 6 cases are significant. Thus in 93.68% (89/95) of the cases the proposed method performs significantly better than the other methods using NMI. Clearly, these results show the effectiveness of the proposed document clustering technique.
Remark. It is to be noted that the number of clusters produced by the proposed method for each corpus is close to the actual number of categories of that corpus. It may be observed from Tables 3 and 4 that the number of clusters is equal to the actual number of categories for the la2, oh15, rcv2, tr31 and tr41 corpora. The difference between the number of clusters and the actual number of categories is at most 2 for the rest of the corpora. Since the text data sets used here are very sparse and
Table 3
Comparison of various clustering methods using f-measure.
Data sets NCT^a NCL^b F-measure
BKM^c KM BS SLHC ALHC DKNNA SC NMF Proposed
20ns 20 23 0.357 0.449 0.436 0.367 0.385 0.408 0.428 0.445 0.474
fbis 17 19 0.423 0.534 0.516 0.192 0.192 0.288 0.535 0.435 0.584
la1 6 8 0.506 0.531 0.504 0.327 0.325 0.393 0.536 0.544 0.570
la2 6 6 0.484 0.550 0.553 0.330 0.328 0.405 0.541 0.542 0.563
oh10 10 12 0.304 0.465 0.461 0.205 0.206 0.381 0.527 0.481 0.500
oh15 10 10 0.363 0.485 0.482 0.206 0.202 0.366 0.516 0.478 0.532
rcv1 30 31 0.231 0.247 0.307 0.411 0.360 0.431 0.298 0.516 0.553
rcv2 30 30 0.233 0.281 0.324 0.404 0.353 0.438 0.312 0.489 0.517
rcv3 30 32 0.188 0.271 0.351 0.408 0.376 0.436 0.338 0.511 0.294
rcv4 30 32 0.247 0.322 0.289 0.405 0.381 0.440 0.401 0.509 0.289
tr31 7 7 0.558 0.665 0.646 0.388 0.387 0.457 0.589 0.545 0.678
tr41 10 10 0.564 0.607 0.593 0.286 0.280 0.416 0.557 0.537 0.698
tr45 10 11 0.556 0.673 0.681 0.243 0.248 0.444 0.605 0.596 0.750
a NCT stands for number of categories.
b NCL stands for number of clusters.
c BKM, KM, BS, SLHC, ALHC, DKNNA, SC and NMF stand for Bisecting k-Means, k-Means, BuckShot, Single-Link Hierarchical Clustering, Average-Link Hierarchical Clustering, Dynamic k-Nearest Neighbor Algorithm, Spectral Clustering and Non-negative Matrix Factorization respectively.
Table 4
Comparison of various clustering methods using normalized mutual information.
Data sets NCT NCL Normalized mutual information
BKMa KM BS SLHC ALHC DKNNA SC NMF Proposed
20ns 20 23 0.417 0.428 0.437 0.270 0.286 0.325 0.451 0.432 0.433
fbis 17 19 0.443 0.525 0.524 0.051 0.362 0.405 0.520 0.446 0.544
la1 6 8 0.266 0.299 0.295 0.021 0.218 0.241 0.285 0.296 0.308
la2 6 6 0.249 0.312 0.323 0.021 0.215 0.252 0.335 0.360 0.386
oh10 10 12 0.226 0.352 0.333 0.050 0.157 0.239 0.417 0.410 0.406
oh15 10 10 0.213 0.352 0.357 0.067 0.155 0.236 0.358 0.357 0.380
rcv1 30 31 0.302 0.409 0.407 0.0871 0.108 0.213 0.429 0.434 0.495
rcv2 30 30 0.296 0.411 0.399 0.053 0.150 0.218 0.426 0.420 0.465
rcv3 30 32 0.316 0.416 0.408 0.049 0.162 0.215 0.404 0.476 0.448
rcv4 30 32 0.317 0.414 0.416 0.048 0.175 0.220 0.414 0.507 0.452
tr31 7 7 0.478 0.463 0.471 0.065 0.212 0.414 0.436 0.197 0.509
tr41 10 10 0.470 0.550 0.553 0.054 0.237 0.456 0.479 0.506 0.619
tr45 10 11 0.492 0.599 0.591 0.084 0.354 0.512 0.503 0.488 0.694
a All the symbols in this table are the same symbols used in Table 3.
3 The test statistic is of the form t = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2), where x̄1, x̄2 are the means, s1, s2 are the standard deviations and n1, n2 are the numbers of observations of the two samples.
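The footnote's statistic can be computed directly. Below is a small sketch of this two-sample t statistic; the two score samples are invented purely to illustrate the computation and are not results from the paper.

```python
# Sketch of the footnote's test statistic:
# t = (mean1 - mean2) / sqrt(s1^2/n1 + s2^2/n2)
import math

def t_statistic(sample1, sample2):
    n1, n2 = len(sample1), len(sample2)
    m1 = sum(sample1) / n1
    m2 = sum(sample2) / n2
    # Unbiased sample variances
    v1 = sum((x - m1) ** 2 for x in sample1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in sample2) / (n2 - 1)
    return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

proposed = [0.49, 0.51, 0.50, 0.52, 0.48]   # made-up NMI scores
baseline = [0.41, 0.43, 0.40, 0.44, 0.42]   # made-up NMI scores
print(round(t_statistic(proposed, baseline), 2))
```

A large positive t (relative to the appropriate critical value) is what justifies the "significantly better" claims above.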
160 T. Basu, C.A. Murthy/ Information Sciences 311 (2015) 149–162
high dimensional, it may be inferred that the method proposed here for estimating the value of h is able to detect the actual grouping of a corpus.
6.4. Processing time
The similarity matrix requires N × N memory locations, and to store the initial N clusters another N memory locations are needed for the proposed method. Thus the space complexity of the proposed document clustering algorithm is O(N²). O(N²) time is required to build the extensive similarity matrix, and to construct (say) m (≪ N) baseline clusters the proposed method takes O(mN²) time. In the final stage of the proposed technique, the k-means algorithm takes O((N − b)mt) time to merge (say) b singleton clusters into the baseline clusters, where t is the number of iterations of the k-means algorithm. Thus the time complexity of the proposed algorithm is O(N²), as m is very small compared to N.
The processing time of each algorithm used in the experiments has been measured on a quad-core Linux workstation. The times (in seconds) taken by the different clustering algorithms to cluster each text data set are reported in Table 5. The time shown for the proposed algorithm is the sum of the times taken to estimate the value of h, to build the baseline clusters, and to perform the k-means clustering algorithm that merges the remaining singleton clusters into the baseline clusters. The times shown for the bisecting k-means, buckshot, k-means and NMF clustering techniques are averages of the processing times over 10 runs. It is to be mentioned that the codes for all the algorithms were written in C++ and the data structures for all the algorithms were developed by the authors. Hence the processing times could be reduced by incorporating more efficient data structures, for the proposed algorithm as well as for the other algorithms. Note that the processing time of the proposed algorithm is less than that of KM, SLHC, ALHC and DKNNA for every data set. The execution time of BKM is less than that of the proposed method for 20ns, the execution time of SC is less for tr41, and the execution time of NMF is less for tr31; for each of the other data sets the execution time of the proposed algorithm is less than those of BKM, SC and NMF. The processing time of the proposed algorithm is comparable with that of the buckshot algorithm (although on most of the data sets the proposed algorithm is faster). The dimensionality of the data sets used in the experiments varies from 2000 (fbis) to 35,218 (20ns). Hence the proposed clustering algorithm may be useful, in terms of processing time, for real-life high-dimensional data sets.
7. Conclusions
A hybrid document clustering algorithm has been introduced by combining a new hierarchical clustering technique with the traditional k-means technique. The baseline clusters produced by the new hierarchical technique are clusters whose documents possess high similarity among themselves. The extensive similarity between documents ensures this quality of the baseline clusters; it is developed on the basis of the similarity between two documents and their distances to every other document in the document collection. Thus documents with high extensive similarity are grouped in the same cluster. Most of the singleton clusters are simply the documents that have low content similarity with every other document. In practice the number of such singleton clusters is substantial, and they cannot be ignored as outliers. Therefore the k-means algorithm is performed iteratively to assign each of these singleton clusters to one of the baseline clusters. Thus the proposed method reduces the error of the k-means algorithm due to random seed selection. Moreover, the method is not as expensive as the hierarchical clustering algorithms, as can be observed from Table 5. A significant characteristic of the proposed clustering technique is that the algorithm automatically decides the number of clusters in the data. The automatic detection of the number of clusters in such sparse, high-dimensional text data is very important.
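The two-stage idea summarized above can be sketched as follows. This is a strong simplification of the paper's method: it uses plain cosine similarity with a greedy leader pass in place of the extensive-similarity hierarchical stage, and a single assignment pass in place of full k-means iterations; the function names and the threshold parameter `theta` are illustrative only.

```python
# Illustrative two-stage sketch: (1) form "baseline" clusters from
# documents whose similarity to a cluster leader exceeds a threshold,
# (2) assign the leftover singletons to the most similar baseline
# centroid, in the spirit of k-means.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def hybrid_cluster(docs, theta):
    # Stage 1: greedy baseline clusters -- a document joins the first
    # cluster whose leader it is sufficiently similar to.
    groups = []
    for i, d in enumerate(docs):
        for g in groups:
            if cosine(d, docs[g[0]]) >= theta:
                g.append(i)
                break
        else:
            groups.append([i])
    clusters = [g for g in groups if len(g) > 1]
    singletons = [g[0] for g in groups if len(g) == 1]
    # Stage 2: assign each leftover singleton to the most similar
    # baseline-cluster centroid (one k-means-style assignment pass).
    centroids = [[sum(docs[i][k] for i in c) / len(c)
                  for k in range(len(docs[0]))] for c in clusters]
    for s in singletons:
        best = max(range(len(centroids)),
                   key=lambda j: cosine(docs[s], centroids[j]))
        clusters[best].append(s)
    return clusters

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.5, 0.6]]
print(hybrid_cluster(docs, theta=0.95))  # → [[0, 1], [2, 3, 4]]
```

On these toy vectors the two near-duplicate pairs form baseline clusters, and the ambiguous fifth document, left as a singleton in stage 1, is pulled into the closer cluster in stage 2.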
The proposed method is able to determine the number of clusters prior to executing the algorithm, by applying a threshold h on the similarity values between documents. An estimation technique is introduced to determine a value of h from a
Table 5
Processing time (in seconds) of different clustering methods.
Methods BKMa KM BS SLHC ALHC DKNNA SC NMF Proposed
20ns 1582.25 1594.54 1578.12 1618.50 1664.31 1601.23 1583.62 1587.36 1595.23
fbis 94.17 91.52 92.46 112.15 129.58 104.32 100.90 93.19 90.05
la1 159.23 153.22 142.36 160.12 179.62 162.25 146.68 153.50 140.31
la2 149.41 144.12 139.34 163.47 182.50 164.33 142.33 144.46 140.29
oh10 18.57 18.32 17.32 26.31 33.51 23.26 24.82 22.48 18.06
oh15 18.12 20.02 18.26 24.15 31.46 23.18 20.94 17.61 16.22
rcv1 89.41 91.15 86.37 87.47 103.37 88.37 94.62 87.80 86.18
rcv2 98.32 106.08 103.16 104.17 120.45 99.27 97.81 93.51 92.53
rcv3 97.11 106.29 98.35 98.47 124.72 98.47 108.24 94.96 93.54
rcv4 100.17 95.28 96.49 109.32 126.30 99.49 113.81 98.70 93.32
tr31 29.41 30.23 30.34 33.15 40.13 32.35 37.98 29.33 29.38
tr41 27.29 28.16 27.46 33.96 39.58 30.52 25.85 27.54 26.49
tr45 25.45 25.01 26.06 31.17 38.23 28.50 29.72 26.51 24.65
a All the symbols in this table are the same symbols used in Table 3.
corpus. The experimental results show the value and validity of the proposed estimation of h. In the experiments the threshold on the distance between two clusters is taken as a = N^(1/4), which indicates that the distance between two different clusters must be greater than a. It is very difficult to fix a lower bound on the distance between two clusters in practice, as the corpora are sparse in nature and have high dimensionality. If we select a > N^(1/4), say a = N^(1/3) or a = N^(1/2), then some really different clusters may be merged into one, which is surely not desirable. On the other hand, if we select a < N^(1/4), say a = N^(1/5), then we may get some very compact clusters, but a large number of small clusters would be created, which is also not expected in practice.
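The trade-off described here can be seen numerically. The snippet below (an illustration only, for a few sample values of N) tabulates the threshold a = N^(1/4) against the looser alternatives N^(1/3) and N^(1/2) and the tighter N^(1/5).

```python
# Cluster-distance threshold a = N**(1/4) versus looser (N**(1/3),
# N**(1/2)) and tighter (N**(1/5)) choices, for a few sizes N.
print("N      N^1/4  N^1/3  N^1/2  N^1/5")
for n in (2000, 10000, 35218):
    print(n, round(n ** 0.25, 1), round(n ** (1 / 3), 1),
          round(n ** 0.5, 1), round(n ** 0.2, 1))
```

The looser thresholds grow much faster with N (risking merged clusters), while N^(1/5) stays small (risking many tiny clusters), which is the behavior the paragraph above describes.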
It may be observed from the experiments that the number of clusters produced by the proposed technique is very close to the actual number of categories for each corpus and that the proposed method outperforms the other methods. Hence we may claim that the selection of a = N^(1/4) is proper, though it has been made heuristically. The proposed hybrid clustering technique addresses some issues of several well-known partitional and hierarchical clustering techniques. Hence it may be useful in many real-life unsupervised applications. Note that, for data sets other than text, any similarity measure can be used instead of cosine similarity to design the extensive similarity. It is to be mentioned that the value of a should be chosen carefully whenever the method is applied to other types of applications. In future we shall apply the proposed method to social network data to find different types of communities or topics. In that case we may have to incorporate ideas from graph theory into the proposed distance function to find relations between different sets of nodes of a social site.
Acknowledgment
The authors would like to thank the reviewers and the editor for their valuable comments and suggestions.