ue 141 clustering - university at buffalojing/ue141/sp13/doc/clustering1.pdfagglomerative clustering...

27
Clustering UE 141 Spring 2013 1 Jing Gao SUNY Buffalo

Upload: others

Post on 12-Sep-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: UE 141 Clustering - University at Buffalojing/ue141/sp13/doc/Clustering1.pdfAgglomerative Clustering Algorithm • More popular hierarchical clustering technique • Basic algorithm

Clustering

UE 141 Spring 2013

1

Jing Gao SUNY Buffalo

Page 2: UE 141 Clustering - University at Buffalojing/ue141/sp13/doc/Clustering1.pdfAgglomerative Clustering Algorithm • More popular hierarchical clustering technique • Basic algorithm

Definition of Clustering • Finding groups of objects such that the objects in a group will

be similar (or related) to one another and different from (or unrelated to) the objects in other groups

Inter-cluster similarity are minimized

Intra-cluster similarity are maximized

2

Page 3: UE 141 Clustering - University at Buffalojing/ue141/sp13/doc/Clustering1.pdfAgglomerative Clustering Algorithm • More popular hierarchical clustering technique • Basic algorithm

3

K-means

• Partition {x1,…,xn} into K clusters – K is predefined

• Initialization – Specify the initial cluster centers (centroids)

• Iteration until no change – For each object xi

• Calculate the distances between xi and the K centroids • (Re)assign xi to the cluster whose centroid is the

closest to xi – Update the cluster centroids based on current

assignment

Page 4: UE 141 Clustering - University at Buffalojing/ue141/sp13/doc/Clustering1.pdfAgglomerative Clustering Algorithm • More popular hierarchical clustering technique • Basic algorithm

0

1

2

3

4

5

0 1 2 3 4 5

K-means: Initialization Initialization: Determine the three cluster centers

m1

m2

m3

4

Page 5: UE 141 Clustering - University at Buffalojing/ue141/sp13/doc/Clustering1.pdfAgglomerative Clustering Algorithm • More popular hierarchical clustering technique • Basic algorithm

0

1

2

3

4

5

0 1 2 3 4 5

K-means Clustering: Cluster Assignment Assign each object to the cluster which has the closet distance from the centroid to the object

m1

m2

m3

5

Page 6: UE 141 Clustering - University at Buffalojing/ue141/sp13/doc/Clustering1.pdfAgglomerative Clustering Algorithm • More popular hierarchical clustering technique • Basic algorithm

0

1

2

3

4

5

0 1 2 3 4 5

K-means Clustering: Update Cluster Centroid Compute cluster centroid as the center of the points in the cluster

m1

m2

m3

6

Page 7: UE 141 Clustering - University at Buffalojing/ue141/sp13/doc/Clustering1.pdfAgglomerative Clustering Algorithm • More popular hierarchical clustering technique • Basic algorithm

K-means Clustering: Update Cluster Centroid Compute cluster centroid as the center of the points in the cluster

0

1

2

3

4

5

0 1 2 3 4 5

m1

m2

m3

7

Page 8: UE 141 Clustering - University at Buffalojing/ue141/sp13/doc/Clustering1.pdfAgglomerative Clustering Algorithm • More popular hierarchical clustering technique • Basic algorithm

0

1

2

3

4

5

0 1 2 3 4 5

m1

m2

m3

K-means Clustering: Cluster Assignment Assign each object to the cluster which has the closet distance from the centroid to the object

8

Page 9: UE 141 Clustering - University at Buffalojing/ue141/sp13/doc/Clustering1.pdfAgglomerative Clustering Algorithm • More popular hierarchical clustering technique • Basic algorithm

0

1

2

3

4

5

0 1 2 3 4 5

m1

m2

m3

K-means Clustering: Update Cluster Centroid Compute cluster centroid as the center of the points in the cluster

9

Page 10: UE 141 Clustering - University at Buffalojing/ue141/sp13/doc/Clustering1.pdfAgglomerative Clustering Algorithm • More popular hierarchical clustering technique • Basic algorithm

0

1

2

3

4

5

0 1 2 3 4 5

m1

m2 m3

K-means Clustering: Update Cluster Centroid Compute cluster centroid as the center of the points in the cluster

10

Page 11: UE 141 Clustering - University at Buffalojing/ue141/sp13/doc/Clustering1.pdfAgglomerative Clustering Algorithm • More popular hierarchical clustering technique • Basic algorithm

Sum of Squared Error (SSE)

• Suppose the centroid of cluster Cj is mj • For each object x in Cj, compute the squared error between x and the

centroid mj • Sum up the error of all the objects

å åÎ

-=j Cx

jj

mxSSE 2)(

11

1 2 4 5 ´ ´ m1=1.5

m2=4.5

1)5.45()5.44()5.12()5.11( 2222 =-+-+-+-=SSE

Page 12: UE 141 Clustering - University at Buffalojing/ue141/sp13/doc/Clustering1.pdfAgglomerative Clustering Algorithm • More popular hierarchical clustering technique • Basic algorithm

How to Minimize SSE

12

• Two sets of variables to minimize – Each object x belongs to which cluster? – What’s the cluster centroid? mj

• Minimize the error wrt each set of variable alternatively – Fix the cluster centroid—find cluster assignment that

minimizes the current error – Fix the cluster assignment—compute the cluster centroids

that minimize the current error

å åÎ

-j Cx

jj

mx 2)(min

jCxÎ

Page 13: UE 141 Clustering - University at Buffalojing/ue141/sp13/doc/Clustering1.pdfAgglomerative Clustering Algorithm • More popular hierarchical clustering technique • Basic algorithm

Cluster Assignment Step

13

• Cluster centroids (mj) are known • For each object

– Choose Cj among all the clusters for x such that the distance between x and mj is the minimum

– Choose another cluster will incur a bigger error

• Minimize error on each object will minimize the SSE

å åÎ

-j Cx

jj

mx 2)(min

Page 14: UE 141 Clustering - University at Buffalojing/ue141/sp13/doc/Clustering1.pdfAgglomerative Clustering Algorithm • More popular hierarchical clustering technique • Basic algorithm

14

0 1 2 3 4 5 6 7 8 9 10

10

9

8

7

6

5

4

3

2

1

m1

m2

x1

x2

x3

x4

x5

Example—Cluster Assignment

1321 ,, Cxxx Î

254 , Cxx Î

213

212

211 )()()( mxmxmxSSE -+-+-=

Given m1, m2, which cluster each of the five points belongs to?

Assign points to the closet centroid—minimize SSE

225

224 )()( mxmx -+-+

Page 15: UE 141 Clustering - University at Buffalojing/ue141/sp13/doc/Clustering1.pdfAgglomerative Clustering Algorithm • More popular hierarchical clustering technique • Basic algorithm

Cluster Centroid Computation Step

15

• For each cluster – Choose cluster centroid mj as the center of the

points

• Minimize error on each cluster will minimize the SSE

å åÎ

-j Cx

jj

mx 2)(min

|| j

Cxj C

xm j

åÎ=

Page 16: UE 141 Clustering - University at Buffalojing/ue141/sp13/doc/Clustering1.pdfAgglomerative Clustering Algorithm • More popular hierarchical clustering technique • Basic algorithm

16

10

9

8

7

6

5

4

3

2

1

m1

m2

x1

x2

x3

x4

x5

Example—Cluster Centroid Computation

Given the cluster assignment, compute the centers of the two clusters

0 1 2 3 4 5 6 7 8 9 10

Page 17: UE 141 Clustering - University at Buffalojing/ue141/sp13/doc/Clustering1.pdfAgglomerative Clustering Algorithm • More popular hierarchical clustering technique • Basic algorithm

17

Hierarchical Clustering

• Agglomerative approach

b

d c

e

a a b

d e c d e

a b c d e

Step 0 Step 1 Step 2 Step 3 Step 4 bottom-up

Initialization:

Each object is a cluster

Iteration:

Merge two clusters which are

most similar to each other;

Until all objects are merged

into a single cluster

Page 18: UE 141 Clustering - University at Buffalojing/ue141/sp13/doc/Clustering1.pdfAgglomerative Clustering Algorithm • More popular hierarchical clustering technique • Basic algorithm

18

Hierarchical Clustering

• Divisive Approaches

b

d c

e

a a b

d e c d e

a b c d e

Step 4 Step 3 Step 2 Step 1 Step 0 Top-down

Initialization:

All objects stay in one cluster

Iteration:

Select a cluster and split it into

two sub clusters

Until each leaf cluster contains

only one object

Page 19: UE 141 Clustering - University at Buffalojing/ue141/sp13/doc/Clustering1.pdfAgglomerative Clustering Algorithm • More popular hierarchical clustering technique • Basic algorithm

19

Dendrogram • A tree that shows how clusters are merged/split

hierarchically • Each node on the tree is a cluster; each leaf node is a

singleton cluster

Page 20: UE 141 Clustering - University at Buffalojing/ue141/sp13/doc/Clustering1.pdfAgglomerative Clustering Algorithm • More popular hierarchical clustering technique • Basic algorithm

20

Dendrogram • A clustering of the data objects is obtained by cutting

the dendrogram at the desired level, then each connected component forms a cluster

Page 21: UE 141 Clustering - University at Buffalojing/ue141/sp13/doc/Clustering1.pdfAgglomerative Clustering Algorithm • More popular hierarchical clustering technique • Basic algorithm

Agglomerative Clustering Algorithm

• More popular hierarchical clustering technique

• Basic algorithm is straightforward 1. Compute the distance matrix 2. Let each data point be a cluster 3. Repeat 4. Merge the two closest clusters 5. Update the distance matrix 6. Until only a single cluster remains

• Key operation is the computation of the distance between two clusters

– Different approaches to defining the distance between clusters distinguish the different algorithms

21

Page 22: UE 141 Clustering - University at Buffalojing/ue141/sp13/doc/Clustering1.pdfAgglomerative Clustering Algorithm • More popular hierarchical clustering technique • Basic algorithm

Starting Situation

• Start with clusters of individual points and a distance matrix

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5 . . .

.

.

. Distance Matrix

...p1 p2 p3 p4 p9 p10 p11 p12

22

Page 23: UE 141 Clustering - University at Buffalojing/ue141/sp13/doc/Clustering1.pdfAgglomerative Clustering Algorithm • More popular hierarchical clustering technique • Basic algorithm

Intermediate Situation • After some merging steps, we have some clusters • Choose two clusters that has the smallest distance (largest similarity) to merge

C1

C4

C2 C5

C3

C2 C1 C1

C3

C5

C4

C2

C3 C4 C5

Distance Matrix

...p1 p2 p3 p4 p9 p10 p11 p12

23

Page 24: UE 141 Clustering - University at Buffalojing/ue141/sp13/doc/Clustering1.pdfAgglomerative Clustering Algorithm • More popular hierarchical clustering technique • Basic algorithm

Intermediate Situation • We want to merge the two closest clusters (C2 and C5) and update

the distance matrix.

C1

C4

C2 C5

C3

C2 C1 C1

C3

C5

C4

C2

C3 C4 C5

Distance Matrix

...p1 p2 p3 p4 p9 p10 p11 p12

24

Page 25: UE 141 Clustering - University at Buffalojing/ue141/sp13/doc/Clustering1.pdfAgglomerative Clustering Algorithm • More popular hierarchical clustering technique • Basic algorithm

After Merging • The question is “How do we update the distance matrix?”

C1

C4

C2 U C5

C3 ? ? ? ?

?

?

?

C2 U C5 C1

C1

C3

C4

C2 U C5

C3 C4

Distance Matrix

...p1 p2 p3 p4 p9 p10 p11 p12

25

Page 26: UE 141 Clustering - University at Buffalojing/ue141/sp13/doc/Clustering1.pdfAgglomerative Clustering Algorithm • More popular hierarchical clustering technique • Basic algorithm

How to Define Inter-Cluster Distance

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5 . . .

.

.

.

Distance?

l MIN l MAX l Group Average l Distance Between Centroids l ……

Distance Matrix

26

Page 27: UE 141 Clustering - University at Buffalojing/ue141/sp13/doc/Clustering1.pdfAgglomerative Clustering Algorithm • More popular hierarchical clustering technique • Basic algorithm

Question

• Talk about big data – What’s the definition of big data? – What applications generate big data? – What are the challenges? – What are the technologies for mining big data?

27