
Page 1: Clustering

Clustering

Rong Jin

Page 2: Clustering

What is Clustering?

Identify the underlying structure of a given set of data points
Document clustering: group documents on the same topic into the same cluster

[Figure: scatter plot of data points over "$$$" vs. "age" axes, showing natural groups]

Page 3: Clustering

Improve IR by Document Clustering

Cluster-based retrieval

[Figure: a query matched against clusters of documents]

Page 4: Clustering

Improve IR by Document Clustering

Cluster-based retrieval:
Cluster the docs in the collection a priori
Only compute relevance scores for docs in the cluster closest to the query
Improve retrieval efficiency by searching only a small portion of the document collection
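As a rough illustration of this idea (not from the slides), the sketch below clusters TF-IDF document vectors with scikit-learn's KMeans offline, then at query time scores only the documents in the cluster whose center is closest to the query. All names and parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def build_index(docs, n_clusters=10):
    """Offline: vectorize the collection and cluster it a priori."""
    vec = TfidfVectorizer()
    X = vec.fit_transform(docs)                       # TF-IDF vectors (unit length by default)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    return vec, X, km

def retrieve(query, vec, X, km, top_n=10):
    """Online: score only the docs in the cluster closest to the query."""
    q = vec.transform([query])
    cluster = km.predict(q)[0]                        # cluster whose center is closest to the query
    idx = np.where(km.labels_ == cluster)[0]          # docs in that cluster
    scores = (X[idx] @ q.T).toarray().ravel()         # cosine similarity on unit-length vectors
    return idx[np.argsort(-scores)][:top_n]           # indices of the best-matching docs
```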

Page 5: Clustering

Application (I): Search Result Clustering

Page 6: Clustering

Application (II): Navigation

Page 7: Clustering

Application (III): Google News

Page 8: Clustering

Application (IV): Visualization

Islands of Music (Pampalk et al., KDD'03)

Page 9: Clustering

How to Find Good Clusters?

[Figure: seven data points x1–x7 in the plane]

Page 10: Clustering

How to Find Good Clusters?

Measure compactness by the sum of squared distances within clusters

[Figure: data points x1–x7 in the plane]


Page 12: Clustering

How to Find Good Clusters?

Measure compactness by the sum of squared distances within clusters

[Figure: data points x1–x7 grouped around candidate cluster centers C1 and C2]


Page 14: Clustering

How to Find Good Clusters?

Measure compactness by the sum of squared distances within clusters

Membership indicators: m_{i,j} = 1 if x_i is assigned to C_j, and 0 otherwise

[Figure: data points x1–x7 with cluster centers C1 and C2]



Page 17: Clustering

How to Find Good Clusters?

Measure compactness by the sum of squared distances within clusters

Find good clusters by minimizing the cluster compactness over both the cluster centers C1, C2 and the memberships m_{i,j}

[Figure: data points x1–x7 with cluster centers C1 and C2]


Page 19: Clustering

How to Find Good Clusters?

Find good clusters by minimizing the cluster compactness over both the cluster centers C1, C2 and the memberships m_{i,j}

[Figure: data points x1–x7 with cluster centers C1 and C2]
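In symbols, the compactness being minimized is the within-cluster sum of squared distances,

$$\sum_{j}\sum_{i=1}^{n} m_{i,j}\,\|x_i - C_j\|^2,$$

and a minimal NumPy sketch of this quantity (function and variable names are illustrative, not from the slides) is:

```python
import numpy as np

def cluster_compactness(X, centers, labels):
    """Sum of squared distances from each point to the center of its assigned cluster."""
    return sum(((X[labels == j] - c) ** 2).sum() for j, c in enumerate(centers))
```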

Page 20: Clustering

How to Efficiently Cluster Data?

Given centers {C_j}:

$$m_{i,j} = \begin{cases} 1 & \text{if } j = \arg\min_k \|x_i - C_k\|^2 \\ 0 & \text{otherwise} \end{cases}$$

Update m_{i,j}: assign x_i to the closest C_j

Page 21: Clustering

How to Efficiently Cluster Data?

Given memberships m_{i,j}:

$$C_j = \frac{\sum_{i=1}^{n} m_{i,j}\, x_i}{\sum_{i=1}^{n} m_{i,j}}$$

Given centers {C_j}:

$$m_{i,j} = \begin{cases} 1 & \text{if } j = \arg\min_k \|x_i - C_k\|^2 \\ 0 & \text{otherwise} \end{cases}$$

Update m_{i,j}: assign x_i to the closest C_j
Update C_j as the average of the x_i assigned to C_j

Page 22: Clustering

How to Efficiently Cluster Data?

Given memberships m_{i,j}:

$$C_j = \frac{\sum_{i=1}^{n} m_{i,j}\, x_i}{\sum_{i=1}^{n} m_{i,j}}$$

Given centers {C_j}:

$$m_{i,j} = \begin{cases} 1 & \text{if } j = \arg\min_k \|x_i - C_k\|^2 \\ 0 & \text{otherwise} \end{cases}$$

Update m_{i,j}: assign x_i to the closest C_j
Update C_j as the average of the x_i assigned to C_j

K-means algorithm
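A minimal NumPy sketch of the resulting algorithm, alternating the two updates above until the centers stop moving (function and variable names are illustrative, not from the slides):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: alternate the membership update and the center update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random initial centers
    for _ in range(n_iter):
        # membership update: assign each x_i to the closest center C_j
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # center update: C_j becomes the average of the points assigned to it
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```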

Page 23: Clustering

Example of k-means

Start with random cluster centers C1 and C2

[Figure: data points x1–x7 with initial centers C1 and C2]

Page 24: Clustering

Example of k-means

Identify the points that are closer to C1 than to C2

[Figure: data points x1–x7 with centers C1 and C2]

Page 25: Clustering

Example of k-means

Update C1

[Figure: data points x1–x7 with updated center C1 and center C2]

Page 26: Clustering

Example of k-means

Identify the points that are closer to C2 than to C1

[Figure: data points x1–x7 with centers C1 and C2]


Page 28: Clustering

Example of k-means

Identify the points that are closer to C2 than to C1, and the points that are closer to C1 than to C2

[Figure: data points x1–x7 with centers C1 and C2]

Page 29: Clustering

Example of k-means

Identify the points that are closer to C2 than to C1, and the points that are closer to C1 than to C2

Update C1 and C2

[Figure: data points x1–x7 with updated centers C1 and C2]

Page 30: Clustering

K-means for Clustering

K-means:
Start with a random guess of the cluster centers
Determine the membership of each data point
Adjust the cluster centers


Page 33: Clustering

K-means

1. Ask the user how many clusters they'd like (e.g., k = 5)

Page 34: Clustering

K-means

1. Ask the user how many clusters they'd like (e.g., k = 5)
2. Randomly guess k cluster center locations

Page 35: Clustering

K-means

1. Ask the user how many clusters they'd like (e.g., k = 5)
2. Randomly guess k cluster center locations
3. Each data point finds out which center it's closest to (thus each center "owns" a set of data points)

Page 36: Clustering

K-means

1. Ask the user how many clusters they'd like (e.g., k = 5)
2. Randomly guess k cluster center locations
3. Each data point finds out which center it's closest to
4. Each center finds the centroid of the points it owns

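These four steps are essentially the loop that scikit-learn's KMeans runs; a small usage sketch on toy data (the data and parameter values are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(500, 2))       # toy data points
km = KMeans(n_clusters=5, init="random", n_init=10,       # steps 1-2: k = 5, random center guesses
            random_state=0).fit(X)
labels = km.labels_                                       # step 3: closest center for each point
centers = km.cluster_centers_                             # step 4: centroids of the owned points
```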

Page 38: Clustering

K-means

Any computational problem?

Page 39: Clustering

K-means

Need to go through each data point at each iteration of k-means

Page 40: Clustering

Improve K-means

Group nearby data points by region (e.g., KD-tree, SR-tree)
Try to update the membership of all the data points in the same region together

Page 41: Clustering

Improved K-means

Find the closest center for each rectangle
Assign all the points within a rectangle to one cluster
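A much-simplified sketch of this region-based update (the full method also verifies that a single center dominates an entire rectangle before assigning it; here every leaf rectangle of a tiny hand-rolled KD-tree is simply assigned to the center closest to its mean, so this is only an approximation, with illustrative names):

```python
import numpy as np

def kd_regions(X, idx=None, leaf_size=32):
    """Split points by the median along the widest dimension, returning the
    index sets of the leaf rectangles of a simple KD-tree."""
    if idx is None:
        idx = np.arange(len(X))
    if len(idx) <= leaf_size:
        return [idx]
    dim = int(np.argmax(X[idx].max(axis=0) - X[idx].min(axis=0)))
    order = idx[np.argsort(X[idx, dim])]
    mid = len(order) // 2
    return kd_regions(X, order[:mid], leaf_size) + kd_regions(X, order[mid:], leaf_size)

def assign_by_region(X, centers, leaf_size=32):
    """Approximate membership update: every point in a leaf rectangle is assigned
    to the center closest to that rectangle's mean."""
    labels = np.empty(len(X), dtype=int)
    for region in kd_regions(X, leaf_size=leaf_size):
        mean = X[region].mean(axis=0)
        labels[region] = int(((centers - mean) ** 2).sum(axis=1).argmin())
    return labels
```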

Page 42: Clustering

Document Clustering

Page 43: Clustering

A Mixture Model for Document Clustering

Assume that data are generated from a mixture of multinomial distributions
Estimate the mixture distribution from the observed documents
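A minimal sketch of EM for such a multinomial mixture, assuming X is a document-term count matrix; the constant multinomial coefficient is dropped since it does not affect the cluster posteriors, and all names and the iteration count are illustrative:

```python
import numpy as np

def multinomial_mixture_em(X, k, n_iter=50, seed=0, eps=1e-12):
    """EM for a mixture of k multinomials over term counts X of shape (n_docs, n_terms)."""
    rng = np.random.default_rng(seed)
    n, v = X.shape
    pi = np.full(k, 1.0 / k)                           # mixing weights
    theta = rng.dirichlet(np.ones(v), size=k)          # per-cluster term distributions
    for _ in range(n_iter):
        # E-step: posterior probability of each cluster for each document
        log_post = np.log(pi + eps) + X @ np.log(theta + eps).T
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: re-estimate the mixture from the expected assignments
        pi = post.mean(axis=0)
        counts = post.T @ X                            # expected term counts per cluster
        theta = (counts + eps) / (counts + eps).sum(axis=1, keepdims=True)
    return pi, theta, post                             # post[i, j] = P(cluster j | doc i)
```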

Page 44: Clustering

Gaussian Mixture Example: Start

Measure the probability for every data point to be associated with each cluster
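For the Gaussian case illustrated on the following slides, the per-point cluster probabilities can be obtained with scikit-learn's GaussianMixture; a small sketch on toy data (all values are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),    # two toy 2-D blobs
               rng.normal(4.0, 1.0, size=(100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
probs = gmm.predict_proba(X)    # probability of each point belonging to each cluster
```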

Page 45: Clustering

After First Iteration

Page 46: Clustering

After 2nd Iteration

Page 47: Clustering

After 3rd Iteration

Page 48: Clustering

After 4th Iteration

Page 49: Clustering

After 5th Iteration

Page 50: Clustering

After 6th Iteration

Page 51: Clustering

After 20th Iteration

Page 52: Clustering

Hierarchical Doc Clustering

Goal: create a hierarchy of topics
Challenge: create this hierarchy automatically
Approaches: top-down or bottom-up

Page 53: Clustering

Hierarchical Agglomerative Clustering (HAC)

Given a similarity measure for determining the similarity between two clusters:
Start with each document in a separate cluster
Repeatedly merge the two most similar clusters
Until there is only one cluster
The history of merging forms a binary tree
The standard way of depicting this history is a dendrogram
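A short sketch of this merge history with SciPy (the data, linkage choice, and parameter values are illustrative; SciPy works with distances rather than similarities):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.default_rng(0).normal(size=(20, 2))    # toy "documents" as 2-D points
Z = linkage(X, method="complete")                    # merge history: a binary tree of merges
dendrogram(Z)                                        # the standard depiction of that history
```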

Page 54: Clustering

An Example of Dendrogram

With an appropriately chosen similarity cut, we can convert the dendrogram into a flat clustering.

[Figure: dendrogram with similarity on the vertical axis]
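Continuing the sketch above, the flat clustering can be read off by cutting the tree at a chosen level; SciPy expresses the cut as a distance threshold rather than a similarity (the threshold value is illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))
Z = linkage(X, method="complete")
flat = fcluster(Z, t=2.5, criterion="distance")      # one cluster label per point, cut at distance 2.5
```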

Page 55: Clustering

Similarity of Clusters

Single-link: maximum similarity (maximum over all document pairs)

Page 56: Clustering

Similarity of Clusters

Single-link: maximum similarity (maximum over all document pairs)
Complete-link: minimum similarity (minimum over all document pairs)

Page 57: Clustering

Similarity of Clusters

Single-link: maximum similarity (maximum over all document pairs)
Complete-link: minimum similarity (minimum over all document pairs)
Centroid: average "intersimilarity" (average over all document pairs)
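These three criteria map onto the method argument of SciPy's linkage; note that the average-over-pairs variant described above corresponds to method="average" (the toy data are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.default_rng(0).normal(size=(20, 2))
Z_single   = linkage(X, method="single")     # merge by the closest pair (maximum similarity)
Z_complete = linkage(X, method="complete")   # merge by the farthest pair (minimum similarity)
Z_average  = linkage(X, method="average")    # merge by the average over all pairs
```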

Page 58: Clustering

Single Link vs. Complete Link

[Figure: example clusterings produced by single link and by complete link]

Complete link usually produces balanced clusters

Page 59: Clustering

Divisive Hierarchical Clustering

Top-down (instead of bottom-up as in HAC)
Start with all docs in one big cluster
Then recursively split clusters
Eventually each node forms a cluster on its own
Example: Bisecting K-means (see the sketch below)
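A minimal sketch of bisecting k-means along these lines, using scikit-learn's KMeans for each binary split and always splitting the currently largest cluster (one common heuristic; names and choices are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, n_clusters, seed=0):
    """Top-down clustering: start with one big cluster, repeatedly split one in two."""
    labels = np.zeros(len(X), dtype=int)
    for next_label in range(1, n_clusters):
        sizes = np.bincount(labels, minlength=next_label)
        target = int(sizes.argmax())                     # split the largest cluster
        idx = np.where(labels == target)[0]
        km = KMeans(n_clusters=2, n_init=10, random_state=seed).fit(X[idx])
        labels[idx[km.labels_ == 1]] = next_label        # one half keeps its label, the other gets a new one
    return labels
```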