Clustering
Rong Jin
What is Clustering?
- Identify the underlying structure of a given set of data points
- Document clustering: group documents on the same topic into the same cluster
[Figure: scatter plot of data points with axes "age" and "$$$", showing natural groupings]
Improve IR by Document Clustering
- Cluster-based retrieval:
  - Cluster the docs in the collection a priori
  - Only compute relevance scores for the docs in the cluster closest to the query
  - Improve retrieval efficiency by searching only a small portion of the document collection (a sketch follows below)
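To make the idea concrete, here is a minimal Python sketch of cluster-based retrieval; the names (`centers`, `members`) and the dot-product relevance score are illustrative assumptions, not part of the slides:

```python
import numpy as np

def cluster_based_retrieval(query, doc_vecs, centers, members, top=10):
    """Score only the docs in the cluster whose center is closest to the query.

    doc_vecs: (n_docs, dim) array of document vectors
    centers:  (k, dim) array of cluster centers, built a priori (offline)
    members:  members[j] = list of doc ids assigned to cluster j
    """
    j = ((centers - query) ** 2).sum(axis=1).argmin()  # closest cluster to the query
    docs = members[j]
    scores = doc_vecs[docs] @ query                    # illustrative relevance score
    order = np.argsort(-scores)[:top]
    return [docs[i] for i in order]
```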
Application (I): Search Result Clustering
Application (II): Navigation
Application (III): Google News
Application (IV): Visualization
Islands of Music (Pampalk et al., KDD'03)
How to Find Good Clusters?
[Figure: seven data points x1–x7 in the plane, partitioned into two clusters around centers C1 and C2]
- Measure the compactness of a clustering by the sum of squared distances within clusters
- Membership indicators: m_{i,j} = 1 if x_i is assigned to C_j, and zero otherwise
- Compactness: J = \sum_{j} \sum_{i=1}^{n} m_{i,j} \|x_i - C_j\|^2
- Find good clusters by minimizing the cluster compactness over both the cluster centers C_1, C_2 and the memberships m_{i,j}
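A small helper (a sketch, not from the slides) that evaluates this compactness for a given assignment:

```python
import numpy as np

def compactness(X, centers, m):
    """Sum of squared distances from each point to its assigned center.

    X: (n, d) data points; centers: (k, d) cluster centers;
    m: (n,) array, m[i] = index of the cluster that point X[i] belongs to.
    """
    return float(((X - centers[m]) ** 2).sum())
```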
How to Efficiently Cluster Data?
Alternate between two updates:
- Given centers \{C_j\}, update m_{i,j} by assigning x_i to the closest C_j:
  m_{i,j} = 1 if j = \arg\min_k \|x_i - C_k\|^2, and 0 otherwise
- Given memberships m_{i,j}, update C_j as the average of the x_i assigned to C_j:
  C_j = \frac{\sum_{i=1}^{n} m_{i,j}\, x_i}{\sum_{i=1}^{n} m_{i,j}}
Iterating these two steps is the K-means algorithm.
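The two alternating updates translate almost line-for-line into code. A minimal NumPy sketch of k-means (illustrative; the random initialization and the stopping rule are my assumptions):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centers as k distinct random data points
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Update m_{i,j}: assign each x_i to the closest center C_j
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        m = d2.argmin(axis=1)
        # Update C_j: average of the x_i assigned to C_j (keep old center if empty)
        new_centers = np.array([X[m == j].mean(axis=0) if (m == j).any() else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, m
```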
Example of k-means
[Figure sequence: points x1–x7 with centers C1 and C2 moving as the algorithm runs]
- Start with random cluster centers C1 and C2
- Identify the points that are closer to C1 than to C2; update C1
- Identify the points that are closer to C2 than to C1; update C2
- Repeat: identify the points closest to each center, then update C1 and C2
K-means for Clustering
- Start with a random guess of the cluster centers
- Determine the membership of each data point
- Adjust the cluster centers, and repeat
K-means
1. Ask the user how many clusters they'd like (e.g., k = 5)
2. Randomly guess k cluster center locations
3. Each data point finds out which center it's closest to (thus each center "owns" a set of data points)
4. Each center finds the centroid of the points it owns
K-means: Any Computational Problem?
- Need to go through every data point at each iteration of k-means
Improve K-means
- Group nearby data points by region (e.g., a KD tree or an SR tree), and try to update the membership of all the data points in the same region at once
- Find the closest center for each rectangle, and assign all the points within a rectangle to one cluster (see the sketch below)
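One way to make the bulk-assignment test precise: a whole axis-aligned rectangle can be handed to center C_j if the farthest point of the rectangle from C_j is still closer than the nearest point of the rectangle to every other center. Below is a sketch of that test; the formulation is mine, not the original course code, and the tree descent itself is omitted:

```python
import numpy as np

def box_dist2(lo, hi, c):
    """Squared min and max distance from the axis-aligned box [lo, hi] to point c."""
    nearest = np.clip(c, lo, hi)
    farthest = np.where(np.abs(c - lo) > np.abs(c - hi), lo, hi)
    return ((c - nearest) ** 2).sum(), ((c - farthest) ** 2).sum()

def assign_box(lo, hi, centers):
    """Return the index of the center that provably owns every point in the box,
    or None if the box must be split further (e.g., by descending the KD tree)."""
    d = [box_dist2(lo, hi, c) for c in centers]
    j = min(range(len(centers)), key=lambda k: d[k][0])  # smallest min-distance
    if all(k == j or d[k][0] > d[j][1] for k in range(len(centers))):
        return j  # every point in the box is closer to center j than to any other
    return None
```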
Document Clustering
A Mixture Model for Document Clustering
- Assume that the data are generated from a mixture of multinomial distributions
- Estimate the mixture distribution from the observed documents
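In symbols (a sketch; the symbols \pi_z for the cluster priors and \theta_{z,w} for the per-cluster word probabilities are my notation, not the slides'), the likelihood of a document d with word counts n(d, w) under this model is

\[ p(d) = \sum_{z=1}^{K} \pi_z \prod_{w \in V} \theta_{z,w}^{\, n(d,w)} \]

and the parameters \{\pi_z, \theta_{z,w}\} are estimated from the observed documents, typically with the EM algorithm.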
Gaussian Mixture Example
- Measure the probability of every data point being associated with each cluster
[Figure sequence: EM fitting a Gaussian mixture, from the random start through the 1st, 2nd, 3rd, 4th, 5th, 6th, and 20th iterations]
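The soft memberships shown in the figure sequence are easy to reproduce. A brief sketch using scikit-learn (the library and the synthetic data are my assumptions; the slides do not name an implementation):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic two-cluster data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal((0, 0), 0.5, size=(100, 2)),
               rng.normal((3, 3), 0.8, size=(100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)  # EM iterations
resp = gmm.predict_proba(X)  # probability of each point belonging to each cluster
```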
Hierarchical Doc Clustering
- Goal: create a hierarchy of topics
- Challenge: create this hierarchy automatically
- Approaches: top-down or bottom-up
Hierarchical Agglomerative Clustering (HAC)
- Given a similarity measure for determining the similarity between two clusters:
  - Start with each document in a separate cluster
  - Repeatedly merge the two most similar clusters
  - Until only one cluster remains
- The history of merging forms a binary tree
- The standard way of depicting this history is a dendrogram
An Example of a Dendrogram
[Figure: dendrogram over documents, with similarity on the vertical axis]
With an appropriately chosen similarity cut, we can convert the dendrogram into a flat clustering.
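For reference, the merge history and the flat cut are both one-liners in SciPy (an illustrative sketch; note that SciPy cuts by distance, the mirror image of a similarity cut):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(5, 2)), rng.normal(6, 1, size=(5, 2))])

# Merge history = the binary tree behind the dendrogram;
# method='single' or 'centroid' give the other linkages discussed next
Z = linkage(X, method='complete')
labels = fcluster(Z, t=4.0, criterion='distance')  # cut the tree into flat clusters
```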
Similarity of Clusters
- Single-link: maximum similarity (the maximum over all document pairs)
- Complete-link: minimum similarity (the minimum over all document pairs)
- Centroid: average "inter-similarity" (the average over all document pairs)
Single Link vs. Complete Link
[Figure: the clusterings produced by single link and by complete link, side by side]
Complete link usually produces balanced clusters.
Divisive Hierarchical Clustering
- Top-down (instead of bottom-up as in HAC)
- Start with all docs in one big cluster, then recursively split clusters
- Eventually each node forms a cluster on its own
- Example: bisecting K-means (sketched below)
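A compact sketch of bisecting k-means; this is my implementation of the idea described above, with "split the largest cluster" as one common heuristic (degenerate empty splits are not handled):

```python
import numpy as np

def bisecting_kmeans(X, k, n_iter=20, seed=0):
    """Top-down: start with one big cluster, repeatedly split the largest via 2-means."""
    rng = np.random.default_rng(seed)
    clusters = [np.arange(len(X))]                # all points in one cluster
    while len(clusters) < k:
        clusters.sort(key=len)
        idx = clusters.pop()                      # take the largest cluster
        pts = X[idx].astype(float)
        centers = pts[rng.choice(len(pts), size=2, replace=False)]
        for _ in range(n_iter):                   # plain 2-means on this cluster
            m = ((pts[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
            for j in range(2):
                if (m == j).any():
                    centers[j] = pts[m == j].mean(axis=0)
        clusters += [idx[m == 0], idx[m == 1]]    # replace it with its two halves
    return clusters
```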