“A Comparison of Document Clustering Techniques”

Michael Steinbach, George Karypis and Vipin Kumar

(Technical Report, CSE, UMN, 2000)

Mahashweta Das ([email protected])

CSE 6339 (Dr. Chengkai Li)

Feb 9, 2010

Document Clustering

• Clustering - act of grouping similar objects into sets

• Document Clustering - act of collecting similar documents into bins, where similarity is some function on a pair of documents

• Uses of Document Clustering

  • Browsing a large collection of documents (document organization, automatic topic extraction, fast information retrieval)

  • Organizing results returned by a search engine (efficient web search, automatic generation of taxonomy of web documents, effective document classifier)

  • Improves precision and recall in information retrieval systems

Types of Clustering

• Agglomerative Hierarchical Clustering

  • Begin with as many clusters as objects; the most similar clusters are successively merged until only one cluster remains

  • Superior cluster quality, but $O(n^2)$ complexity

• Partitional Clustering

  • Begin with k initial centroids and assign all n objects to the closest centroid; recompute the centroid of each cluster and repeat until the centroids don’t change

  • Efficient $O(knt)$ complexity, where t is the number of iterations, but often poor cluster quality
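
The partitional procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function name `kmeans`, the explicit `initial_centroids` argument, the `max_iters` cap, and the toy 2-D points are all hypothetical, and Euclidean distance is used for simplicity.

```python
import math

def kmeans(points, initial_centroids, max_iters=100):
    """Basic K-means sketch: returns (centroids, labels)."""
    centroids = [tuple(c) for c in initial_centroids]  # the k initial centroids
    k = len(centroids)
    labels = None
    for _ in range(max_iters):
        # Assignment step: every object goes to its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments are stable, so centroids are too
            break
        labels = new_labels
        # Update step: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels

# Toy data (hypothetical): two well-separated groups of 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centroids, labels = kmeans(points, initial_centroids=[points[0], points[3]])
print(labels)
```

Each pass over the data costs on the order of k times n distance computations, which is where the linear-per-iteration cost quoted above comes from.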

Agglomerative Hierarchical Clustering

Euclidean distance is the similarity/distance metric

Comparison: Agglomerative Hierarchical Clustering

• Intra-Cluster Similarity Technique (IST)

  • looks at the similarity of all documents in a cluster to their cluster centroid, and merges the pair of clusters that leads to the smallest decrease in similarity

• Centroid Similarity Technique (CST)

  • looks at the cosine similarity between the centroids of the two clusters

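• UPGMA

  • looks at cluster similarity as the average pairwise cosine similarity between the documents of the two clusters:

$$\text{similarity}(C_1, C_2) = \frac{\sum_{d_1 \in C_1,\, d_2 \in C_2} \text{cosine}(d_1, d_2)}{|C_1| \cdot |C_2|}$$

  • Performs best of the three techniques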
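
To make the UPGMA merge rule concrete, here is a small agglomerative sketch that repeatedly merges the two clusters with the highest average pairwise cosine similarity. The names `cosine`, `upgma_similarity`, and `agglomerate` and the toy vectors are assumptions made for illustration; the sketch recomputes every pairwise similarity on each merge, so it is far slower than a tuned implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def upgma_similarity(c1, c2, docs):
    """Average pairwise cosine similarity between two clusters of document indices."""
    total = sum(cosine(docs[i], docs[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerate(docs, num_clusters):
    """Merge the most similar pair of clusters until num_clusters remain."""
    clusters = [[i] for i in range(len(docs))]  # start with one cluster per document
    while len(clusters) > num_clusters:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        a, b = max(
            pairs,
            key=lambda pair: upgma_similarity(clusters[pair[0]], clusters[pair[1]], docs),
        )
        clusters[a] = clusters[a] + clusters[b]  # merge b into a (a < b)
        del clusters[b]
    return clusters

# Toy term vectors (hypothetical): documents 0-1 and 2-3 share vocabulary.
docs = [(1.0, 0.0, 0.0), (0.9, 0.1, 0.0), (0.0, 1.0, 0.0), (0.0, 0.9, 0.1)]
clusters = agglomerate(docs, num_clusters=2)
print(clusters)
```

Stopping at `num_clusters` instead of one cluster is a convenience of this sketch; running it down to a single cluster and recording each merge would yield the full hierarchy.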

Partitional Clustering (K-Means)

Euclidean distance is the similarity/distance metric

Vector Space Model and Document Clustering

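• Cosine Similarity between documents d1 and d2

$$\text{cosine}(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$$

• Cluster Centroid Vector for a set of S documents in a cluster

$$c = \frac{1}{|S|} \sum_{d \in S} d$$

• Cosine Similarity between a document and centroid vector

$$\text{cosine}(d, c) = \frac{d \cdot c}{\|d\| \, \|c\|}$$

• Cosine Similarity between centroid vectors c1 and c2

$$\text{cosine}(c_1, c_2) = \frac{c_1 \cdot c_2}{\|c_1\| \, \|c_2\|}$$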
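
The four vector-space quantities listed above all reduce to two helpers, a cosine function and a centroid function. The sketch below uses hypothetical names (`cosine`, `centroid`) and made-up 3-term document vectors; real document vectors would typically be much longer, sparse term-weight vectors.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def centroid(docs):
    """Component-wise mean of the document vectors in a cluster."""
    n = len(docs)
    return [sum(col) / n for col in zip(*docs)]

# Hypothetical term-weight vectors for two documents.
d1 = (3.0, 4.0, 0.0)
d2 = (4.0, 3.0, 0.0)

c = centroid([d1, d2])                # centroid of the cluster {d1, d2}
sim_docs = cosine(d1, d2)             # document vs. document
sim_doc_centroid = cosine(d1, c)      # document vs. centroid
sim_centroids = cosine(c, centroid([d2]))  # centroid vs. centroid
print(sim_docs, sim_doc_centroid, sim_centroids)
```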

Cluster Quality Evaluation Measures

• Internal Quality Measure

  • Cohesiveness of a cluster as a measure of cluster quality

  • OVERALL SIMILARITY (the higher, the better)

    • Based on pairwise similarity of documents in a cluster

    • For a set of S documents in a cluster:

$$\text{Overall Similarity} = \frac{1}{|S|^2} \sum_{d \in S} \sum_{d' \in S} \text{cosine}(d', d)$$

• External Quality Measure

  • Compares the groups produced by clustering techniques to known classes

  • ENTROPY

  • F-MEASURE
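
As a rough illustration of the internal measure, the sketch below computes overall similarity as the average cosine similarity over all ordered pairs of documents in a cluster (self-pairs included), and checks that, for unit-length documents, it equals the squared length of the centroid. The function names and the toy clusters are assumptions made for this example.

```python
import math

def normalize(v):
    """Scale a vector to unit length so that dot products equal cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def overall_similarity(docs):
    """Average pairwise cosine similarity over all ordered pairs in the cluster."""
    unit = [normalize(d) for d in docs]
    n = len(unit)
    return sum(dot(a, b) for a in unit for b in unit) / (n * n)

def squared_centroid_norm(docs):
    """Squared length of the centroid of the unit-length documents."""
    unit = [normalize(d) for d in docs]
    c = [sum(col) / len(unit) for col in zip(*unit)]
    return dot(c, c)

# Hypothetical clusters: one cohesive, one made of unrelated documents.
tight = [(1.0, 0.0), (0.9, 0.1), (1.0, 0.1)]
loose = [(1.0, 0.0), (0.0, 1.0)]

tight_score = overall_similarity(tight)
loose_score = overall_similarity(loose)
print(tight_score, loose_score, squared_centroid_norm(tight))
```

The centroid shortcut matters in practice because it avoids the quadratic number of pairwise comparisons.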

ENTROPY: External Cluster Quality Measure

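• ENTROPY

  • Calculate class distribution of data

  • $p_{ij}$: the “probability” that a member of cluster j belongs to class i

  • Entropy of cluster j:

$$E_j = -\sum_{i} p_{ij} \log p_{ij}$$

  • Total entropy, for m clusters, where $n_j$ is the size of cluster j and n is the total number of documents:

$$E = \sum_{j=1}^{m} \frac{n_j}{n} E_j$$

The Lower, The Better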
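
A small sketch of the entropy computation, assuming the clustering result is given as a table of counts in which `table[j][i]` is the number of members of cluster j that belong to class i. The function names, the table layout, the natural logarithm, and the convention that a zero count contributes nothing are all assumptions of this example.

```python
import math

def cluster_entropy(class_counts):
    """Entropy of one cluster from its per-class member counts."""
    n_j = sum(class_counts)
    # Assumption: classes with zero members are skipped (0 * log 0 treated as 0).
    return -sum((c / n_j) * math.log(c / n_j) for c in class_counts if c > 0)

def total_entropy(table):
    """Size-weighted sum of the cluster entropies."""
    n = sum(sum(row) for row in table)
    return sum((sum(row) / n) * cluster_entropy(row) for row in table)

# Hypothetical results for two clusters and two classes.
pure = [[5, 0], [0, 5]]    # every cluster holds a single class
mixed = [[5, 5], [5, 5]]   # every cluster is an even mix of both classes

pure_entropy = total_entropy(pure)
mixed_entropy = total_entropy(mixed)
print(pure_entropy, mixed_entropy)
```

A perfectly pure clustering scores zero, and the score grows as clusters mix classes, which is why lower is better.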

F-MEASURE: External Cluster Quality Measure

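• F-MEASURE

  • Combines precision and recall ideas from information retrieval

  • For cluster j and class i:

$$\text{Recall}(i, j) = \frac{n_{ij}}{n_i}, \qquad \text{Precision}(i, j) = \frac{n_{ij}}{n_j}$$

    where $n_{ij}$: number of members of cluster j in class i; $n_j$: number of members of cluster j; $n_i$: number of members of class i

  • F-measure of cluster j and class i:

$$F(i, j) = \frac{2 \cdot \text{Recall}(i, j) \cdot \text{Precision}(i, j)}{\text{Precision}(i, j) + \text{Recall}(i, j)}$$

  • Entire F-Measure, where n is the total number of documents:

$$F = \sum_{i} \frac{n_i}{n} \max_{j} F(i, j)$$

The Higher, The Better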
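
The F-measure can be sketched from the same kind of count table, where `table[j][i]` is the number of members of cluster j in class i. The function name `f_measure`, the table layout, and the sample counts are assumptions made for illustration.

```python
def f_measure(table):
    """Overall F-measure: for each class, take the best F value over all
    clusters, then weight by the class size."""
    cluster_sizes = [sum(row) for row in table]        # n_j
    class_sizes = [sum(col) for col in zip(*table)]    # n_i
    n = sum(cluster_sizes)
    total = 0.0
    for i, n_i in enumerate(class_sizes):
        best = 0.0
        for j, n_j in enumerate(cluster_sizes):
            n_ij = table[j][i]
            if n_ij == 0:
                continue  # F(i, j) is taken as 0 when the cluster has no members of the class
            recall = n_ij / n_i
            precision = n_ij / n_j
            best = max(best, 2 * recall * precision / (recall + precision))
        total += (n_i / n) * best
    return total

# Hypothetical results for two clusters and two classes.
perfect = [[5, 0], [0, 5]]
noisy = [[4, 1], [1, 4]]

perfect_f = f_measure(perfect)
noisy_f = f_measure(noisy)
print(perfect_f, noisy_f)
```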

Bisecting K-Means Clustering

The algorithm starts with a single cluster of all documents

Cluster chosen for splitting: the largest cluster, the one with the least overall similarity, or a choice based on both
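
A rough sketch of the bisecting idea: start with one cluster holding everything, repeatedly pick a cluster, and split it in two with basic K-means. Several choices here are assumptions and not taken from the slides: the largest cluster is always the one split, each split is tried a few times with random starting points and the attempt with the lowest within-cluster squared error is kept (a stand-in for "highest overall similarity"), Euclidean distance is used, and the points are assumed to be distinct. All function names are hypothetical.

```python
import math
import random

def mean(points):
    return tuple(sum(col) / len(points) for col in zip(*points))

def sse(points):
    """Sum of squared distances from each point to the group mean."""
    c = mean(points)
    return sum(math.dist(p, c) ** 2 for p in points)

def two_means(points, rng, max_iters=50):
    """Split points into two groups with basic K-means (k = 2)."""
    centroids = rng.sample(points, 2)  # two distinct points as starting centroids
    groups = None
    for _ in range(max_iters):
        new_groups = ([], [])
        for p in points:
            nearer = 0 if math.dist(p, centroids[0]) <= math.dist(p, centroids[1]) else 1
            new_groups[nearer].append(p)
        if not new_groups[0] or not new_groups[1] or new_groups == groups:
            break  # keep the last valid split
        groups = new_groups
        centroids = [mean(g) for g in groups]
    return groups

def bisecting_kmeans(points, k, trials=5, seed=0):
    rng = random.Random(seed)
    clusters = [list(points)]  # start with a single cluster of all documents
    while len(clusters) < k:
        largest = max(clusters, key=len)  # assumption: always split the largest cluster
        clusters.remove(largest)
        candidates = [two_means(largest, rng) for _ in range(trials)]
        best = min(candidates, key=lambda g: sse(g[0]) + sse(g[1]))
        clusters.extend(best)
    return clusters

# Toy data (hypothetical): three well-separated groups of 2-D points.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
    (20.0, 0.0), (20.0, 1.0), (21.0, 0.0),
]
clusters = bisecting_kmeans(points, k=3)
print([sorted(c) for c in clusters])
```

Recording which cluster each split came from would give a divisive hierarchy instead of a flat partition.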

Bisecting K-Means Example

[Figure: Bisecting K-Means Clustering Document Cluster Hierarchy]


Observations

• Bisecting K-Means is actually divisive hierarchical clustering

• Bisecting K-Means has a time complexity linear in the number of documents

• Multiple runs of Bisecting K-Means do not improve results

• Bisecting K-Means (with or without refinement) is quite consistently better than regular K-Means and UPGMA (with or without refinement), on both Overall Similarity and Entropy

• Bisecting K-means produces better document hierarchies

Refinement: the Bisecting K-Means and UPGMA algorithms are followed by the basic K-Means clustering algorithm, which uses the centroids of the clusters they produce as its initial centroids

Agglomerative Hierarchical Clustering vs. K-Means/Bisecting K-Means

• Documents share “core” vocabularies

• Two documents can often be nearest neighbors without belonging to the same class, so agglomerative algorithms make mistakes

• “Global properties” help overcome local minima

  • Global property: computing the cosine similarity of a document to a cluster centroid is the same as computing the average similarity of the document to all the cluster’s documents

• K-means better suited to document clustering

  • However, UPGMA outperforms a single run of K-Means

• The incremental-centroid-update version of K-Means has been used

• Hybrid Hierarchical K-Means performs better than Hierarchical
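
The "global property" mentioned above can be checked numerically. The sketch below assumes all document vectors are scaled to unit length, in which case the dot product of a document with the (unnormalized) centroid equals the average cosine similarity of that document to every document in the cluster. The vectors and helper names are made up for this check.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical cluster of three documents and one query document.
cluster = [normalize(d) for d in [(1.0, 2.0, 0.0), (2.0, 1.0, 1.0), (0.0, 1.0, 3.0)]]
doc = normalize((1.0, 1.0, 1.0))

# One comparison against the centroid...
centroid = [sum(col) / len(cluster) for col in zip(*cluster)]
via_centroid = dot(doc, centroid)

# ...versus one comparison per cluster member, then averaged.
via_average = sum(dot(doc, d) for d in cluster) / len(cluster)

print(via_centroid, via_average)
```

This is why a centroid-based method effectively consults the whole cluster in a single comparison, whereas a nearest-neighbor merge looks at only one pair of documents.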

Bisecting K-Means vs. K-Means

• Bisecting K-means tends to produce clusters of relatively uniform size

• Regular K-means tends to produce clusters of widely different sizes, which affects the overall cluster quality measure

• Bisecting K-means beats Regular K-means in Entropy measurement

• Is this explanation/intuition sufficient? What is the scope of the algorithm outside document clustering?

Thank You !!
