EE3J2 Data Mining Slide 1 EE3J2 Data Mining Lecture 11: Clustering Martin Russell

Post on 22-Dec-2015


Slide 1

EE3J2 Data Mining

Lecture 11: Clustering

Martin Russell

Slide 2

Objectives

To explain the motivation for clustering

To introduce the ideas of distance and distortion

To describe agglomerative and divisive clustering

To explain the relationships between clustering and decision trees

Slide 3

Example from speech processing

[Figure: plot of high-frequency energy vs low-frequency energy, for 25 ms speech segments, sampled every 10 ms; both axes run from 6 to 14]

Slide 4

Structure of data

Typical real data is not uniformly distributed

It has structure

Variables might be correlated

The data might be grouped into natural ‘clusters’

The purpose of cluster analysis is to find this underlying structure automatically

Slide 5

Clusters and centroids

If we assume that the clusters are spherical, then they are determined by their centres

The cluster centres are called centroids

How many centroids do we need?

Where should we put them?

Slide 6

Distance

A function d(x,y) defined on pairs of points x and y is called a distance or metric if it satisfies:

– d(x,x) = 0 for every point x

– d(x,y) = d(y,x) for all points x and y (d is symmetric)

– d(x,z) ≤ d(x,y) + d(y,z) for all points x, y and z (this is called the triangle inequality)
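The three conditions above can be spot-checked numerically on a handful of points. The sketch below is illustrative only: the helper name `satisfies_metric_axioms`, the sample points and the tolerance are assumptions, and a finite check can refute a candidate distance but cannot prove that it is a metric.

```python
import itertools
import math

def satisfies_metric_axioms(d, points, tol=1e-12):
    """Check the three listed conditions on every combination of sample points."""
    # d(x,x) = 0 for every point x
    for x in points:
        if abs(d(x, x)) > tol:
            return False
    # d(x,y) = d(y,x) for all points x and y
    for x, y in itertools.product(points, repeat=2):
        if abs(d(x, y) - d(y, x)) > tol:
            return False
    # d(x,z) <= d(x,y) + d(y,z) for all points x, y and z
    for x, y, z in itertools.product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:
            return False
    return True

def squared(x, y):
    """Squared Euclidean distance, used here as a counter-example."""
    return math.dist(x, y) ** 2

sample = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (1.0, -2.0)]
euclid_ok = satisfies_metric_axioms(math.dist, sample)
squared_ok = satisfies_metric_axioms(squared, sample)
```

On these points the Euclidean distance passes all three checks, while the squared distance fails the triangle inequality: from (0,0) to (6,8) it is 100, but going via (3,4) gives 25 + 25 = 50.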

Slide 7

Example metrics

The most common metric is the Euclidean metric

In this case, if x = (x1, x2,…,xN) and y = (y1, y2,…,yN) then:

$$d(x,y) = \sqrt{(x_1-y_1)^2 + (x_2-y_2)^2 + \cdots + (x_N-y_N)^2}$$

This corresponds to the standard notion of distance in Euclidean space

There are lots of others, but we focus on this one
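As a minimal sketch, the Euclidean metric can be written directly from its definition. The function name `euclidean` and the example points are assumptions chosen for illustration.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two points with the same number of components."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of components")
    # Sum the squared component differences, then take the square root.
    return math.sqrt(sum((xn - yn) ** 2 for xn, yn in zip(x, y)))

# The component differences are 3 and 4, so the distance is 5.
d = euclidean((6.0, 8.0), (9.0, 12.0))
```

Python's standard library offers the same calculation as `math.dist` (Python 3.8 and later).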

Slide 8

Distortion

Distortion is a measure of how well a set of centroids models a set of data

Suppose we have:

– data points y1, y2,…,yT

– centroids c1,…,cM

For each data point yt let ci(t) be the closest centroid

In other words: d(yt, ci(t)) = min_m d(yt, cm)
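The closest-centroid index i(t) can be sketched as a minimisation over the centroid list. The function name, the example centroids and the data points are assumptions. Indices here start at 0 rather than 1, and ties go to the lowest index.

```python
import math

def nearest_centroid_index(y, centroids, d=math.dist):
    """Return the index m that minimises d(y, centroids[m])."""
    return min(range(len(centroids)), key=lambda m: d(y, centroids[m]))

centroids = [(7.0, 7.0), (12.0, 12.0)]
data = [(6.5, 7.5), (11.0, 13.0), (9.0, 9.0)]

# One index per data point, in the same order as `data`.
assignments = [nearest_centroid_index(y, centroids) for y in data]
```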

Slide 9

Distortion

The distortion for the centroid set C = {c1,…,cM} is defined by:

$$\mathrm{Dist}(C) = \sum_{t=1}^{T} d\left(y_t, c_{i(t)}\right)$$

In other words, the distortion is the sum of distances between each data point and its nearest centroid

The task of clustering is to find a centroid set C such that the distortion Dist(C) is minimised
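A short sketch of the distortion calculation, assuming the Euclidean metric and made-up data; the function name `distortion` is an assumption as well. It compares a single central centroid with a pair of centroids placed inside the two visible groups.

```python
import math

def distortion(data, centroids, d=math.dist):
    """Sum, over all data points, of the distance to the nearest centroid."""
    return sum(min(d(y, c) for c in centroids) for y in data)

# Two groups of two points each, 10 units apart.
data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

one = distortion(data, [(5.0, 1.0)])               # a single centroid in the middle
two = distortion(data, [(0.0, 1.0), (10.0, 1.0)])  # one centroid per group
```

With one centroid every point lies sqrt(26) away, so the distortion is 4·sqrt(26), roughly 20.4. With two centroids every point lies 1 away, so the distortion is 4.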

Slide 10

Types of Clustering

Initially we will look at two types of cluster analysis:

– Agglomerative clustering, or ‘bottom-up’ clustering

– Divisive clustering, or ‘top-down’ clustering

Slide 11

Agglomerative clustering

Agglomerative clustering begins by assuming that each data point belongs to its own unique one-point cluster

Clusters are then combined until the required number of clusters is obtained

The simplest agglomerative clustering algorithm is one which, at each stage, combines the two closest centroids into a single centroid
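The simplest algorithm described above might be sketched as follows. The slides do not say how two centroids are combined, so this sketch assumes the merged centroid is the mean of the two, weighted by the number of points each represents. The function name and example data are also assumptions.

```python
import itertools
import math

def agglomerate(points, num_clusters):
    """Merge the two closest centroids until num_clusters remain."""
    if num_clusters < 1:
        raise ValueError("num_clusters must be at least 1")
    # Each cluster is (centroid, size); start with one cluster per data point.
    clusters = [(tuple(p), 1) for p in points]
    while len(clusters) > num_clusters:
        # Find the pair of clusters whose centroids are closest.
        i, j = min(
            itertools.combinations(range(len(clusters)), 2),
            key=lambda pair: math.dist(clusters[pair[0]][0], clusters[pair[1]][0]),
        )
        (ci, ni), (cj, nj) = clusters[i], clusters[j]
        # Assumption: the merged centroid is the size-weighted mean of the pair.
        merged = tuple((ni * a + nj * b) / (ni + nj) for a, b in zip(ci, cj))
        # Delete j first (j > i) so that index i is still valid.
        del clusters[j]
        del clusters[i]
        clusters.append((merged, ni + nj))
    return clusters

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
result = agglomerate(points, 2)
```

This version searches all pairs on every merge, which is fine for a few hundred points but would need a smarter search for large data sets.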

Slide 12

[Figure: scatter plot of the original data (302 points); both axes run from 6 to 14]

Slide 13

[Figure: scatter plot, 252 centroids; both axes run from 6 to 14]

Slide 14

[Figure: scatter plot, 152 centroids; both axes run from 6 to 14]

Slide 15

[Figure: scatter plot, 52 centroids; both axes run from 6 to 14]

Slide 16

[Figure: scatter plot, 12 centroids; both axes run from 6 to 14]

Slide 17

Divisive Clustering

Divisive clustering begins by assuming that there is just one centroid – typically in the centre of the set of data points

That point is replaced with 2 new centroids

Then each of these is replaced with 2 new centroids …
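The slides do not say how a centroid is replaced by two new ones. One possible sketch, under stated assumptions: each centroid is split into two copies nudged by a small offset `eps` in opposite directions, and the copies are then moved repeatedly to the mean of the data points nearest to them. The splitting rule, the refinement step and all names below are assumptions, not the lecture's method.

```python
import math

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def refine(data, centroids, iterations=10):
    """Repeatedly move each centroid to the mean of the points nearest to it."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for y in data:
            m = min(range(len(centroids)), key=lambda k: math.dist(y, centroids[k]))
            groups[m].append(y)
        # A centroid that attracts no points is left where it is.
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids

def divisive(data, levels, eps=0.01):
    """Start from one central centroid and double the number `levels` times."""
    centroids = [mean(data)]  # a single centroid at the centre of the data
    for _ in range(levels):
        split = []
        for c in centroids:
            # Assumed splitting rule: two copies offset by +eps and -eps.
            split.append(tuple(v + eps for v in c))
            split.append(tuple(v - eps for v in c))
        centroids = refine(data, split)
    return centroids

data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
result = divisive(data, levels=1)
```

Because the offset is always along the same diagonal direction, this rule can split poorly when the natural groups are separated across that direction; a different perturbation would be needed in that case.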

Slide 18

[Figure: scatter plot of the original data (302 points); both axes run from 6 to 14]

Slide 19

[Figure: scatter plot of the original data (302 points); both axes run from 6 to 14]

Slide 20

Decision tree interpretation

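[Figure: tree diagram. Root: single centroid (whole set). Leaves: multiple centroids, one per data point. Top-down clustering (divisive) runs from the root towards the leaves; bottom-up clustering (agglomerative) runs from the leaves towards the root.]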

Slide 21

Note on optimality

An ‘optimal’ set of centroids is one which minimises the distortion

None of these methods necessarily gives an optimal set of centroids

Instead they give locally optimal sets of centroids

Why?

Slide 22

Summary

Distance metrics and distortion

Agglomerative clustering

Divisive clustering

Decision tree interpretation