Clustering
Rong Jin
What is Clustering?
- Identify the underlying structure of a given set of data points
- Document clustering: group documents on the same topic into the same cluster
[Figure: scatter plot of data points with axes "age" and "$$$", showing natural groupings]
Improve IR by Document Clustering
- Cluster-based retrieval:
  - Cluster the docs in the collection a priori
  - Only compute relevance scores for the docs in the cluster closest to the query
  - Improve retrieval efficiency by searching only a small portion of the document collection (a sketch follows below)
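To make the idea concrete, here is a minimal Python sketch of cluster-based retrieval; the names (`centers`, `members`) and the dot-product relevance score are illustrative assumptions, not part of the slides:

```python
import numpy as np

def cluster_based_retrieval(query, doc_vecs, centers, members, top=10):
    """Score only the docs in the cluster whose center is closest to the query.

    doc_vecs: (n_docs, dim) array of document vectors
    centers:  (k, dim) array of cluster centers, built a priori (offline)
    members:  members[j] = list of doc ids assigned to cluster j
    """
    j = ((centers - query) ** 2).sum(axis=1).argmin()  # closest cluster to the query
    docs = members[j]
    scores = doc_vecs[docs] @ query                    # illustrative relevance score
    order = np.argsort(-scores)[:top]
    return [docs[i] for i in order]
```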
Application (I): Search Result Clustering
Application (II): Navigation
Application (III): Google News
Application (IV): Visualization
Islands of Music (Pampalk et al., KDD'03)
How to Find Good Clusters?
[Figure: seven data points x1–x7 in the plane, partitioned into two clusters around centers C1 and C2]
- Measure the compactness of a clustering by the sum of squared distances within clusters
- Membership indicators: m_{i,j} = 1 if x_i is assigned to C_j, and zero otherwise
- Compactness: J = \sum_{j} \sum_{i=1}^{n} m_{i,j} \|x_i - C_j\|^2
- Find good clusters by minimizing the cluster compactness over both the cluster centers C_1, C_2 and the memberships m_{i,j}
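A small helper (a sketch, not from the slides) that evaluates this compactness for a given assignment:

```python
import numpy as np

def compactness(X, centers, m):
    """Sum of squared distances from each point to its assigned center.

    X: (n, d) data points; centers: (k, d) cluster centers;
    m: (n,) array, m[i] = index of the cluster that point X[i] belongs to.
    """
    return float(((X - centers[m]) ** 2).sum())
```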
How to Efficiently Cluster Data?
Alternate between two updates:
- Given centers \{C_j\}, update m_{i,j} by assigning x_i to the closest C_j:
  m_{i,j} = 1 if j = \arg\min_k \|x_i - C_k\|^2, and 0 otherwise
- Given memberships m_{i,j}, update C_j as the average of the x_i assigned to C_j:
  C_j = \frac{\sum_{i=1}^{n} m_{i,j}\, x_i}{\sum_{i=1}^{n} m_{i,j}}
Iterating these two steps is the K-means algorithm.
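The two alternating updates translate almost line-for-line into code. A minimal NumPy sketch of k-means (illustrative; the random initialization and the stopping rule are my assumptions):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centers as k distinct random data points
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Update m_{i,j}: assign each x_i to the closest center C_j
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        m = d2.argmin(axis=1)
        # Update C_j: average of the x_i assigned to C_j (keep old center if empty)
        new_centers = np.array([X[m == j].mean(axis=0) if (m == j).any() else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, m
```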
Example of k-means
[Figure sequence: points x1–x7 with centers C1 and C2 moving as the algorithm runs]
- Start with random cluster centers C1 and C2
- Identify the points that are closer to C1 than to C2; update C1
- Identify the points that are closer to C2 than to C1; update C2
- Repeat: identify the points closest to each center, then update C1 and C2
K-means for Clustering
- Start with a random guess of the cluster centers
- Determine the membership of each data point
- Adjust the cluster centers, and repeat
K-means
1. Ask the user how many clusters they'd like (e.g., k = 5)
2. Randomly guess k cluster center locations
3. Each data point finds out which center it's closest to (thus each center "owns" a set of data points)
4. Each center finds the centroid of the points it owns
K-means: Any Computational Problem?
- Need to go through every data point at each iteration of k-means
Improve K-means
- Group nearby data points by region (e.g., a KD tree or an SR tree), and try to update the membership of all the data points in the same region at once
- Find the closest center for each rectangle, and assign all the points within a rectangle to one cluster (see the sketch below)
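One way to make the bulk-assignment test precise: a whole axis-aligned rectangle can be handed to center C_j if the farthest point of the rectangle from C_j is still closer than the nearest point of the rectangle to every other center. Below is a sketch of that test; the formulation is mine, not the original course code, and the tree descent itself is omitted:

```python
import numpy as np

def box_dist2(lo, hi, c):
    """Squared min and max distance from the axis-aligned box [lo, hi] to point c."""
    nearest = np.clip(c, lo, hi)
    farthest = np.where(np.abs(c - lo) > np.abs(c - hi), lo, hi)
    return ((c - nearest) ** 2).sum(), ((c - farthest) ** 2).sum()

def assign_box(lo, hi, centers):
    """Return the index of the center that provably owns every point in the box,
    or None if the box must be split further (e.g., by descending the KD tree)."""
    d = [box_dist2(lo, hi, c) for c in centers]
    j = min(range(len(centers)), key=lambda k: d[k][0])  # smallest min-distance
    if all(k == j or d[k][0] > d[j][1] for k in range(len(centers))):
        return j  # every point in the box is closer to center j than to any other
    return None
```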
Document Clustering
A Mixture Model for Document Clustering
- Assume that the data are generated from a mixture of multinomial distributions
- Estimate the mixture distribution from the observed documents
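In symbols (a sketch; the symbols \pi_z for the cluster priors and \theta_{z,w} for the per-cluster word probabilities are my notation, not the slides'), the likelihood of a document d with word counts n(d, w) under this model is

\[ p(d) = \sum_{z=1}^{K} \pi_z \prod_{w \in V} \theta_{z,w}^{\, n(d,w)} \]

and the parameters \{\pi_z, \theta_{z,w}\} are estimated from the observed documents, typically with the EM algorithm.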
Gaussian Mixture Example
- Measure the probability of every data point being associated with each cluster
[Figure sequence: EM fitting a Gaussian mixture, from the random start through the 1st, 2nd, 3rd, 4th, 5th, 6th, and 20th iterations]
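The soft memberships shown in the figure sequence are easy to reproduce. A brief sketch using scikit-learn (the library and the synthetic data are my assumptions; the slides do not name an implementation):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic two-cluster data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal((0, 0), 0.5, size=(100, 2)),
               rng.normal((3, 3), 0.8, size=(100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)  # EM iterations
resp = gmm.predict_proba(X)  # probability of each point belonging to each cluster
```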
Hierarchical Doc Clustering
- Goal: create a hierarchy of topics
- Challenge: create this hierarchy automatically
- Approaches: top-down or bottom-up
Hierarchical Agglomerative Clustering (HAC)
- Given a similarity measure for determining the similarity between two clusters:
  - Start with each document in a separate cluster
  - Repeatedly merge the two most similar clusters
  - Until only one cluster remains
- The history of merging forms a binary tree
- The standard way of depicting this history is a dendrogram
An Example of a Dendrogram
[Figure: dendrogram over documents, with similarity on the vertical axis]
With an appropriately chosen similarity cut, we can convert the dendrogram into a flat clustering.
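For reference, the merge history and the flat cut are both one-liners in SciPy (an illustrative sketch; note that SciPy cuts by distance, the mirror image of a similarity cut):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(5, 2)), rng.normal(6, 1, size=(5, 2))])

# Merge history = the binary tree behind the dendrogram;
# method='single' or 'centroid' give the other linkages discussed next
Z = linkage(X, method='complete')
labels = fcluster(Z, t=4.0, criterion='distance')  # cut the tree into flat clusters
```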
Similarity of Clusters
- Single-link: maximum similarity (the maximum over all document pairs)
- Complete-link: minimum similarity (the minimum over all document pairs)
- Centroid: average "inter-similarity" (the average over all document pairs)
Single Link vs. Complete Link
[Figure: the clusterings produced by single link and by complete link, side by side]
Complete link usually produces balanced clusters.
Divisive Hierarchical Clustering
- Top-down (instead of bottom-up as in HAC)
- Start with all docs in one big cluster, then recursively split clusters
- Eventually each node forms a cluster on its own
- Example: bisecting K-means (sketched below)
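A compact sketch of bisecting k-means; this is my implementation of the idea described above, with "split the largest cluster" as one common heuristic (degenerate empty splits are not handled):

```python
import numpy as np

def bisecting_kmeans(X, k, n_iter=20, seed=0):
    """Top-down: start with one big cluster, repeatedly split the largest via 2-means."""
    rng = np.random.default_rng(seed)
    clusters = [np.arange(len(X))]                # all points in one cluster
    while len(clusters) < k:
        clusters.sort(key=len)
        idx = clusters.pop()                      # take the largest cluster
        pts = X[idx].astype(float)
        centers = pts[rng.choice(len(pts), size=2, replace=False)]
        for _ in range(n_iter):                   # plain 2-means on this cluster
            m = ((pts[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
            for j in range(2):
                if (m == j).any():
                    centers[j] = pts[m == j].mean(axis=0)
        clusters += [idx[m == 0], idx[m == 1]]    # replace it with its two halves
    return clusters
```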