Mahashweta Das (mahashweta.das@mavs.uta.edu), CSE 6339 (Dr. Chengkai Li)
Post on 03-Feb-2016
“A Comparison of Document Clustering Techniques”
Michael Steinbach, George Karypis and Vipin Kumar
(Technical Report, CSE, UMN, 2000)
Mahashweta Das (mahashweta.das@mavs.uta.edu)
CSE 6339 (Dr. Chengkai Li)
Feb 9, 2010
Feb 9, 2010 CSE 6339 2
Document Clustering
• Clustering - the act of grouping similar objects into sets
• Document Clustering - the act of collecting similar documents into bins, where similarity is some function defined on a pair of documents
• Uses of Document Clustering
  - Browsing a large collection of documents (document organization, automatic topic extraction, fast information retrieval)
  - Organizing results returned by a search engine (efficient web search, automatic generation of a taxonomy of web documents, effective document classifier)
  - Improves precision and recall in information retrieval systems
Types of Clustering
• Agglomerative Hierarchical Clustering
  - Begin with as many clusters as objects; the most similar clusters are successively merged until only one cluster remains
  - Superior cluster quality, but O(n^2) complexity
• Partitional Clustering
  - Begin with k initial centroids and assign all n objects to the closest centroid; recompute the centroid of each cluster and repeat until the centroids don't change
  - Efficient O(knt) complexity for t iterations, but often poor cluster quality
Agglomerative Hierarchical Clustering
Euclidean distance is the similarity/distance metric
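A minimal sketch of the merge loop, assuming points are plain lists of floats and that the distance between two clusters is the Euclidean distance between their centroids. The linkage choice and the names `agglomerative`, `sq_dist` and `mean` are assumptions for illustration, and this naive version recomputes every pairwise distance at each step, so it is slower than an optimized implementation.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean(points):
    """Component-wise mean (centroid) of a list of points."""
    return [sum(col) / len(points) for col in zip(*points)]


def agglomerative(points, k=1):
    """Merge the two closest clusters until only k (>= 1) clusters remain."""
    clusters = [[p] for p in points]  # start with one cluster per object
    while len(clusters) > k:
        pairs = [(a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        # Closest pair by centroid distance (an assumed linkage choice).
        i, j = min(pairs, key=lambda ab: sq_dist(mean(clusters[ab[0]]),
                                                 mean(clusters[ab[1]])))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
print(agglomerative(points, k=2))
```

Stopping at k = 1 and recording each merge would give the full hierarchy.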
Comparison: Agglomerative Hierarchical Clustering
• Intra-Cluster Similarity Technique (IST)
  - looks at the similarity of all documents in a cluster to their cluster centroid, to find which pair of clusters, when merged, leads to the smallest decrease in similarity
• Centroid Similarity Technique (CST)
  - looks at the cosine similarity between the centroids of the two clusters
• UPGMA (performs best)
  - looks at cluster similarity as the average pairwise cosine similarity between the documents of the two clusters
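The three merge criteria can be sketched as scoring functions over two clusters, each given as a list of document vectors. This is an illustration under assumptions: the function names are hypothetical, and `ist_score` expresses the IST idea as the change in summed document-to-centroid cosine similarity caused by a merge, which is one plausible reading of the criterion.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def centroid(docs):
    return [sum(col) / len(docs) for col in zip(*docs)]


def cst_similarity(a, b):
    """CST: cosine similarity between the two cluster centroids."""
    return cosine(centroid(a), centroid(b))


def upgma_similarity(a, b):
    """UPGMA: average cosine similarity over all cross-cluster pairs."""
    total = sum(cosine(d1, d2) for d1 in a for d2 in b)
    return total / (len(a) * len(b))


def intra_similarity(docs):
    """Sum of cosine similarities of documents to their own centroid."""
    c = centroid(docs)
    return sum(cosine(d, c) for d in docs)


def ist_score(a, b):
    """IST (assumed form): change in intra-cluster similarity after merging."""
    return intra_similarity(a + b) - intra_similarity(a) - intra_similarity(b)


cluster_a = [[1.0, 0.0], [0.9, 0.1]]
cluster_b = [[0.0, 1.0], [0.1, 0.9]]
cluster_c = [[1.0, 0.1]]
print(upgma_similarity(cluster_a, cluster_c), upgma_similarity(cluster_a, cluster_b))
```

With any of the three scores, the agglomerative step merges the pair of clusters with the highest value.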
Partitional Clustering (K-Means)
Euclidean distance is the similarity/distance metric
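A minimal K-Means sketch with Euclidean distance, assuming the initial centroids are passed in explicitly so the run is deterministic. The names `kmeans`, `sq_dist` and `mean` are illustrative, and keeping the old centroid when a cluster becomes empty is an assumed convention.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean(points):
    """Component-wise mean (centroid) of a list of points."""
    return [sum(col) / len(points) for col in zip(*points)]


def kmeans(points, init_centroids, max_iter=100):
    """Return (labels, centroids) after basic K-Means iterations."""
    centroids = [list(c) for c in init_centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        # Update step: recompute each centroid from its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centroids.append(mean(members) if members else old)
        if new_centroids == centroids:  # centroids stopped changing
            break
        centroids = new_centroids
    return labels, centroids


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
print(kmeans(points, init_centroids=[[0.0, 0.0], [10.0, 10.0]]))
```

Each iteration costs O(kn) distance computations, which is where the O(knt) total for t iterations comes from.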
Vector Space Model and Document Clustering
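• Cosine Similarity between documents d1 and d2: $\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$
• Cluster Centroid Vector for a set S of documents in a cluster: $c = \frac{1}{|S|} \sum_{d \in S} d$
• Cosine Similarity between a document and a centroid vector: $\cos(d, c) = \frac{d \cdot c}{\|d\| \, \|c\|}$
• Cosine Similarity between centroid vectors c1 and c2: $\cos(c_1, c_2) = \frac{c_1 \cdot c_2}{\|c_1\| \, \|c_2\|}$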
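The four quantities above can be sketched with the standard definitions of cosine similarity and the mean vector. The document vectors below are hypothetical term-frequency vectors over a three-term vocabulary, chosen only for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def cosine(u, v):
    """Cosine similarity; works for documents and centroids alike."""
    return dot(u, v) / (norm(u) * norm(v))


def centroid(docs):
    """Mean vector of a set of document vectors."""
    return [sum(col) / len(docs) for col in zip(*docs)]


# Hypothetical term-frequency vectors.
d1 = [2.0, 1.0, 0.0]
d2 = [1.0, 1.0, 1.0]
d3 = [0.0, 1.0, 2.0]
c1 = centroid([d1, d2])
c2 = centroid([d2, d3])
print(cosine(d1, d2), cosine(d1, c1), cosine(c1, c2))
```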
Cluster Quality Evaluation Measures
• Internal Quality Measure
  - Cohesiveness of a cluster as a measure of cluster quality
  - OVERALL SIMILARITY (the higher, the better)
    - Based on the pairwise similarity of documents in a cluster
    - For a set S of documents in a cluster: $\frac{1}{|S|^2} \sum_{d \in S} \sum_{d' \in S} \cos(d', d)$
• External Quality Measure
  - Compares the groups produced by clustering techniques to known classes
  - ENTROPY
  - F-MEASURE
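A sketch of overall similarity as the average pairwise cosine similarity of the documents in one cluster. It assumes documents are normalized to unit length first, in which case the value also equals the squared length of the cluster centroid. Function names are illustrative.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def overall_similarity(docs):
    """Average pairwise cosine similarity, including self-pairs."""
    unit = [normalize(d) for d in docs]
    total = sum(sum(a * b for a, b in zip(u, v)) for u in unit for v in unit)
    return total / (len(unit) ** 2)


def squared_centroid_norm(docs):
    """Squared length of the centroid of the unit-length documents."""
    unit = [normalize(d) for d in docs]
    c = [sum(col) / len(unit) for col in zip(*unit)]
    return sum(x * x for x in c)


docs = [[2.0, 1.0, 0.0], [1.0, 1.0, 1.0], [0.0, 1.0, 2.0]]
print(overall_similarity(docs), squared_centroid_norm(docs))
```

The centroid form needs one pass over the documents instead of all pairs, which is why it is the cheaper way to evaluate cohesiveness.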
ENTROPY: External Cluster Quality Measure
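• ENTROPY
  - Calculate the class distribution of the data in each cluster
  - $p_{ij}$: the “probability” that a member of cluster j belongs to class i
  - Entropy of cluster j: $E_j = -\sum_i p_{ij} \log p_{ij}$
  - Total entropy: $E = \sum_j \frac{n_j}{n} E_j$, where n_j is the number of members of cluster j and n is the total number of documents
The Lower, The Better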
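A sketch of the entropy measure, assuming each cluster is represented by the list of true class labels of its members. The per-cluster entropy is the negative sum of p log p over classes, and the total is the size-weighted average over clusters. The natural logarithm is an assumption, since the slide does not state the base.

```python
import math
from collections import Counter


def cluster_entropy(class_labels):
    """Entropy of the class distribution inside one cluster."""
    n = len(class_labels)
    return -sum((count / n) * math.log(count / n)
                for count in Counter(class_labels).values())


def total_entropy(clusters):
    """Cluster entropies weighted by cluster size."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_entropy(c) for c in clusters)


# Hypothetical clustering: one pure cluster, one evenly mixed cluster.
clusters = [["sports", "sports"], ["sports", "politics"]]
print(total_entropy(clusters))
```

A clustering whose clusters each contain a single class scores 0, the best possible value.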
F-MEASURE: External Cluster Quality Measure
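• F-MEASURE
  - Combines the precision and recall ideas from information retrieval
  - For cluster j and class i: $Recall(i, j) = n_{ij} / n_i$ and $Precision(i, j) = n_{ij} / n_j$, where n_ij: number of members of cluster j in class i; n_j: number of members of cluster j; n_i: number of members of class i
  - $F(i, j) = \frac{2 \cdot Recall(i, j) \cdot Precision(i, j)}{Precision(i, j) + Recall(i, j)}$
  - Entire F-Measure: $F = \sum_i \frac{n_i}{n} \max_j F(i, j)$, where n is the total number of documents
The Higher, The Better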
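A sketch of the F-measure computation, assuming each cluster is given as the list of true class labels of its members. For every class it takes the best F value over all clusters (harmonic mean of precision n_ij / n_j and recall n_ij / n_i) and weights it by class size. The function name and input layout are illustrative.

```python
from collections import Counter


def f_measure(clusters):
    """Overall F-measure of a clustering against known class labels."""
    n = sum(len(c) for c in clusters)
    class_sizes = Counter(label for c in clusters for label in c)
    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for members in clusters:
            n_ij = members.count(cls)  # members of class i in cluster j
            if n_ij == 0:
                continue
            recall = n_ij / n_i
            precision = n_ij / len(members)
            best = max(best, 2 * recall * precision / (recall + precision))
        total += n_i / n * best  # weight the best match by class size
    return total


print(f_measure([["a", "a"], ["b", "b"]]))
print(f_measure([["a", "b"], ["a", "b"]]))
```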
Bisecting K-Means Clustering
The algorithm starts with a single cluster of all documents
The cluster to split is chosen as the largest cluster, the one with the least overall similarity, or by a criterion based on both
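A simplified sketch of bisecting K-Means, under stated assumptions: it always splits the largest cluster, uses Euclidean distance, and seeds each 2-means split deterministically with the first point and the point farthest from it. Random seeding with several trial splits per step is the more usual choice; the deterministic seeding here is only to keep the example reproducible. All function names are illustrative.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean(points):
    """Component-wise mean (centroid) of a list of points."""
    return [sum(col) / len(points) for col in zip(*points)]


def bisect(points, max_iter=50):
    """Split one cluster (at least 2 points) in two with 2-means."""
    # Assumed deterministic seeding: first point and its farthest point.
    far = max(points, key=lambda p: sq_dist(p, points[0]))
    centroids = [list(points[0]), list(far)]
    halves = ([], [])
    for _ in range(max_iter):
        halves = ([], [])
        for p in points:
            closer_to_first = sq_dist(p, centroids[0]) <= sq_dist(p, centroids[1])
            halves[0 if closer_to_first else 1].append(p)
        if not halves[0] or not halves[1]:
            # Degenerate case (e.g. identical points): split arbitrarily.
            return points[:1], points[1:]
        new_centroids = [mean(h) for h in halves]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return halves


def bisecting_kmeans(points, k):
    """Start from one cluster of everything and bisect until k clusters."""
    clusters = [list(points)]
    while len(clusters) < k:
        largest = max(clusters, key=len)
        if len(largest) < 2:
            break  # nothing left to split
        clusters.remove(largest)
        clusters.extend(bisect(largest))
    return clusters


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0],
          [10.0, 1.0], [30.0, 0.0], [30.0, 1.0]]
print(bisecting_kmeans(points, k=3))
```

Recording which cluster each split came from yields the document hierarchy as a binary tree.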
Bisecting K-Means Example
[Figure: Bisecting K-Means Clustering and the resulting Document Cluster Hierarchy]
Observations
• Bisecting K-Means is actually divisive hierarchical clustering
• Bisecting K-Means has a time complexity linear in the number of documents
• Multiple runs of Bisecting K-Means do not improve results
• Bisecting K-Means (with or without refinement) is quite consistently better than regular K-Means and UPGMA (with or without refinement), as measured by Overall Similarity and Entropy
• Bisecting K-means produces better document hierarchies
Refinement: the Bisecting K-Means and UPGMA algorithms are followed by the basic K-Means clustering algorithm, which uses the centroids of the clusters they produce as its initial centroids
Agglomerative Hierarchical Clustering vs. K-Means/Bisecting K-Means
• Documents share “core” vocabularies
• Two documents can often be nearest neighbors without belonging to the same class, so agglomerative algorithms make mistakes
• “Global properties” help overcome local minima
  - Global property: for unit-length document vectors, the dot product of a document with a cluster centroid equals the average cosine similarity of the document to all the cluster’s documents
• K-means is better suited to document clustering
  - However, UPGMA outperforms a single run of K-Means
  - The incremental centroid update version of K-Means has been used
• A hybrid of hierarchical clustering followed by K-Means refinement performs better than hierarchical clustering alone
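The global property can be checked with a short derivation. It assumes every document vector has unit length, so that a dot product of two documents equals their cosine similarity, and it writes c for the centroid of the cluster's document set S.

```latex
% Assumes unit-length document vectors, so d \cdot d' = \cos(d, d').
\[
d \cdot c
  = d \cdot \Bigl( \frac{1}{|S|} \sum_{d' \in S} d' \Bigr)
  = \frac{1}{|S|} \sum_{d' \in S} d \cdot d'
  = \frac{1}{|S|} \sum_{d' \in S} \cos(d, d')
\]
```

Since the centroid itself is generally not of unit length, the cosine of d with c differs from this average by the factor 1 / ||c||.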
Bisecting K-Means vs. K-Means
• Bisecting K-means tends to produce clusters of relatively uniform size
• Regular K-means tends to produce clusters of widely different sizes, which affects the overall cluster quality measure
• Bisecting K-means beats regular K-means on the Entropy measure
• Is this explanation/intuition sufficient? What is the scope of the algorithm outside document clustering?
Thank You!
Questions?