
Clustering

Lecturer: Dr. Bo Yuan

E-mail: yuanb@sz.tsinghua.edu.cn

Overview

Partitioning Methods

K-Means

Sequential Leader

Model Based Methods

Density Based Methods

Hierarchical Methods


What is cluster analysis?

Finding groups of objects

Objects similar to each other are in the same group

Objects are different from those in other groups

Unsupervised Learning

No labels

Data driven

Clusters

[Figure: example clusters, with small intra-cluster distances (within a group) and large inter-cluster distances (between groups).]

Applications of Clustering

Marketing: finding groups of customers with similar behaviours

Biology: finding groups of animals or plants with similar features

Bioinformatics: clustering of microarray data, genes and sequences

Earthquake Studies: clustering observed earthquake epicenters to identify dangerous zones

WWW: clustering weblog data to discover groups with similar access patterns

Social Networks: discovering groups of individuals with close internal friendships


Earthquakes

[Figure: map of clustered earthquake epicenters.]

Image Segmentation

[Figure: an image segmented by clustering its pixels.]

The Big Picture

[Figure]

Requirements

Scalability

Ability to deal with different types of attributes

Ability to discover clusters with arbitrary shape

Minimum requirements for domain knowledge

Ability to deal with noise and outliers

Insensitivity to order of input records

Incorporation of user-defined constraints

Interpretability and usability


Practical Considerations

Normalization or Not

Evaluation

The sum-of-squared-error criterion, where $m_i$ is the mean of cluster $D_i$:

$$J_e = \sum_{i=1}^{c} \sum_{x \in D_i} \lVert x - m_i \rVert^2, \qquad m_i = \frac{1}{n_i} \sum_{x \in D_i} x$$

[Figure: two alternative partitions of the same data compared by their $J_e$ values.]
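As an illustration, a minimal NumPy sketch of this criterion (function and variable names are my own):

```python
import numpy as np

def sse_criterion(X, labels):
    """Sum-of-squared-error J_e for a hard clustering.

    X: (N, d) data matrix; labels: (N,) cluster index per point.
    """
    J = 0.0
    for i in np.unique(labels):
        D_i = X[labels == i]           # points in cluster i
        m_i = D_i.mean(axis=0)         # cluster mean m_i
        J += ((D_i - m_i) ** 2).sum()  # squared distances to m_i
    return J
```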


The Influence of Outliers

[Figure: K-Means with K = 2 on data containing an outlier.]

K-Means

[Figures: three slides stepping through K-Means assignments and centroid updates on an example dataset.]

K-Means

Determine the value of K

Choose K cluster centres randomly

Each data point is assigned to its closest centroid

Use the mean of each cluster to update each centroid

Repeat until no more new assignments

Return the K centroids

Reference: J. MacQueen (1967). "Some Methods for Classification and Analysis of Multivariate Observations". Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297.

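A minimal NumPy sketch of these steps (illustrative, not MacQueen's original formulation; function and variable names are my own):

```python
import numpy as np

def k_means(X, K, max_iter=100, seed=0):
    """Assign points to the nearest centroid, update centroids as
    cluster means, and stop when the assignments no longer change."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]  # random initial centres
    labels = None
    for _ in range(max_iter):
        # distance of every point to every centroid, shape (N, K)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                       # no more new assignments
        labels = new_labels
        for k in range(K):              # empty clusters keep their old centroid
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return centroids, labels
```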

Comments on K-Means

Pros

Simple and works well for regular, disjoint clusters

Converges relatively fast

Relatively efficient and scalable: O(t·k·n)
• t: iterations; k: number of centroids; n: number of data points

Cons

Need to specify the value of K in advance
• Difficult; domain knowledge may help

May converge to local optima
• In practice, try different initial centroids

May be sensitive to noisy data and outliers
• Mean of data points …

Not suitable for clusters of:
• Non-convex shapes


The Influence of Initial Centroids

[Figures: runs with different random initial centroids converging to different final clusterings.]

The K-Medoids Method

The basic idea is to use real data points as centres

Determine the value of K in advance

Randomly select K points as medoids

Assign each data point to the closest medoid

Calculate the cost of the configuration J

For each medoid m

For each non-medoid point o

Swap m and o and calculate the new cost of the configuration, J′

If the cost J′ of the best new configuration is lower than J, make the corresponding swap and repeat the above steps

Otherwise, terminate the procedure
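A PAM-style sketch of this procedure in Python, assuming a precomputed distance matrix (names are my own):

```python
import numpy as np

def k_medoids(D, K, seed=0):
    """K-Medoids on an n-by-n distance matrix D; the cost J of a
    configuration is the total distance to each point's closest medoid."""
    D = np.asarray(D, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(D)
    medoids = list(rng.choice(n, size=K, replace=False))
    cost = lambda med: D[:, med].min(axis=1).sum()
    J = cost(medoids)
    while True:
        best = None
        for mi in range(K):              # for each medoid m ...
            for o in range(n):           # ... and each non-medoid point o
                if o in medoids:
                    continue
                trial = medoids.copy()
                trial[mi] = o            # swap m and o
                J_new = cost(trial)
                if J_new < J and (best is None or J_new < best[0]):
                    best = (J_new, mi, o)
        if best is None:                 # no swap lowers the cost: terminate
            break
        J, mi, o = best                  # make the best improving swap, repeat
        medoids[mi] = o
    return medoids, D[:, medoids].argmin(axis=1)
```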

The K-Medoids Method

[Figure: two medoid configurations compared by total cost: 20 vs. 26.]

Sequential Leader Clustering

A very efficient clustering algorithm

No iteration

Time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point

Compute the distance between the new data point and every cluster's centre

If the distance is smaller than the chosen threshold, assign the new data point to the corresponding cluster and re-compute that cluster's centre

Otherwise create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

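A minimal sketch of the single pass described above (names are my own):

```python
import numpy as np

def sequential_leader(points, threshold):
    """One pass over the data: each point joins the nearest cluster
    whose centre is within `threshold`, otherwise it starts a new cluster."""
    centres, members = [], []
    for x in points:
        x = np.asarray(x, dtype=float)
        if centres:
            d = [np.linalg.norm(x - c) for c in centres]
            j = int(np.argmin(d))
            if d[j] < threshold:
                members[j].append(x)
                centres[j] = np.mean(members[j], axis=0)  # re-compute the centre
                continue
        centres.append(x)                # new cluster with x as its centre
        members.append([x])
    return centres, members
```

Note how the threshold plays the role of K: a larger threshold yields fewer, coarser clusters.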

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters


$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\ b(i)\}}$$
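A direct sketch of this definition (assumes at least two clusters; names are my own):

```python
import numpy as np

def silhouette_values(X, labels):
    """s(i) = (b(i) - a(i)) / max(a(i), b(i)) for every point i."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False
        if not same.any():
            continue                      # singleton cluster: leave s(i) = 0
        a = D[i, same].mean()             # a(i): mean dissimilarity within own cluster
        b = min(D[i, labels == c].mean()  # b(i): lowest mean dissimilarity
                for c in np.unique(labels) if c != labels[i])  # to another cluster
        s[i] = (b - a) / max(a, b)
    return s
```

For real use, scikit-learn's sklearn.metrics.silhouette_score returns the mean of these values.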

Silhouette

[Figure: silhouette values per cluster (x-axis from −0.2 to 1) for a two-cluster solution, alongside the clustered 2-D data.]

Gaussian Mixture


A single Gaussian component:

$$g(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$

A mixture of n components:

$$f(x) = \sum_{i=1}^{n} \alpha_i\, g_i(x \mid \mu_i, \sigma_i), \qquad \alpha_i \ge 0 \ \text{and} \ \sum_{i=1}^{n} \alpha_i = 1$$
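A small sketch of these two formulas for the 1-D case (names are my own):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """g(x) for a 1-D Gaussian with mean mu and standard deviation sigma."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def mixture_pdf(x, alphas, mus, sigmas):
    """f(x) = sum_i alpha_i * g_i(x), with alpha_i >= 0 and sum alpha_i = 1."""
    return sum(a * gaussian_pdf(x, m, s) for a, m, s in zip(alphas, mus, sigmas))

# e.g. an equal-weight mixture of two unit-variance Gaussians at -2 and +2:
# mixture_pdf(0.0, [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0])
```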

Clustering by Mixture Models


K-Means Revisited


θ = ((x₁, y₁), (x₂, y₂)): the model parameters (the two cluster centres)

Z = (Cluster 1, Cluster 2): the latent variables (the hidden cluster assignments)

Expectation Maximization

EM: Gaussian Mixture

Notation:

m: the number of data points

n: the number of mixture components

z_ij: whether instance i is generated by the jth Gaussian

E-step (expected membership of each point, for equal-width Gaussians):

$$E[z_{ij}] = \frac{p(x = x_i \mid \mu = \mu_j)}{\sum_{k=1}^{n} p(x = x_i \mid \mu = \mu_k)} = \frac{e^{-(x_i - \mu_j)^2 / 2\sigma^2}}{\sum_{k=1}^{n} e^{-(x_i - \mu_k)^2 / 2\sigma^2}}$$

M-step (re-estimate the means and mixing weights from the expected memberships):

$$\mu_j = \frac{\sum_{i=1}^{m} E[z_{ij}]\, x_i}{\sum_{i=1}^{m} E[z_{ij}]}, \qquad \alpha_j = \frac{1}{m} \sum_{i=1}^{m} E[z_{ij}]$$
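A sketch of these updates in Python for 1-D data with a known, shared σ (the initialisation and iteration count are my own choices; the slide's E-step assumes equal weights, while this sketch also re-estimates α_j as in the last formula):

```python
import numpy as np

def em_gaussian_means(x, n_components, sigma=1.0, n_iter=50, seed=0):
    """EM for a 1-D mixture of Gaussians with known shared sigma,
    estimating the means mu_j and mixing weights alpha_j."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    mu = rng.choice(x, size=n_components, replace=False)  # initial means
    alpha = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: E[z_ij] proportional to alpha_j * exp(-(x_i - mu_j)^2 / 2 sigma^2)
        w = alpha * np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma ** 2))
        E_z = w / w.sum(axis=1, keepdims=True)            # (m, n) memberships
        # M-step: mu_j = sum_i E[z_ij] x_i / sum_i E[z_ij];  alpha_j = mean_i E[z_ij]
        mu = (E_z * x[:, None]).sum(axis=0) / E_z.sum(axis=0)
        alpha = E_z.mean(axis=0)
    return mu, alpha
```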

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision


DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density: the number of points within a specified radius

Core Point: a point with high density

Border Point: a point with low density, but in the neighbourhood of a core point

Noise Point: any point that is neither a core point nor a border point

[Figure: example of core, border, and noise points for a given radius and density threshold.]
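As a sketch, the three point types can be computed directly from pairwise distances (the eps and min_pts names are my own):

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise'."""
    X = np.asarray(X, dtype=float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbours = D <= eps                      # includes the point itself
    is_core = neighbours.sum(axis=1) >= min_pts
    labels = []
    for i in range(len(X)):
        if is_core[i]:
            labels.append("core")
        elif neighbours[i][is_core].any():     # within eps of some core point
            labels.append("border")
        else:
            labels.append("noise")
    return labels
```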

DBSCAN

[Figures: p and q directly density-reachable; p and q density-reachable through a chain of core points; p and q density-connected via a common point o.]

DBSCAN

A cluster is defined as the maximal set of density connected points

Start from a randomly selected unseen point P

If P is a core point, build a cluster by gradually adding all points that are density-reachable to the current point set

Noise points are discarded (unlabelled)

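A compact sketch of this procedure (illustrative; each cluster is grown with a queue of core points, and noise keeps the label −1):

```python
import numpy as np
from collections import deque

def dbscan(X, eps, min_pts):
    """Grow clusters from unseen core points by adding all
    density-reachable points; unreached points stay labelled -1 (noise)."""
    X = np.asarray(X, dtype=float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbours = [np.flatnonzero(D[i] <= eps) for i in range(len(X))]
    is_core = np.array([len(nb) >= min_pts for nb in neighbours])
    labels = np.full(len(X), -1)
    cluster = 0
    for p in range(len(X)):
        if not is_core[p] or labels[p] != -1:
            continue                     # start only from unseen core points
        queue = deque([p])
        labels[p] = cluster
        while queue:                     # expand all density-reachable points
            q = queue.popleft()
            for r in neighbours[q]:
                if labels[r] == -1:
                    labels[r] = cluster
                    if is_core[r]:       # only core points keep expanding
                        queue.append(r)
        cluster += 1
    return labels
```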

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram

A clustering is obtained by cutting the dendrogram at the desired level

No need to specify K in advance

May correspond to meaningful taxonomies


Dinosaur Family Tree

[Figure: a dinosaur family tree, an example of a meaningful taxonomy.]

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters?

Single Link: the minimum distance between points

Complete Link: the maximum distance between points


Example


     BA   FI   MI   NA   RM   TO
BA    0  662  877  255  412  996
FI  662    0  295  468  268  400
MI  877  295    0  754  564  138
NA  255  468  754    0  219  869
RM  412  268  564  219    0  669
TO  996  400  138  869  669    0

Single Link: merge the closest pair, MI and TO, at distance 138.

Example


       BA   FI  MI/TO  NA   RM
BA      0  662   877  255  412
FI    662    0   295  468  268
MI/TO 877  295     0  754  564
NA    255  468   754    0  219
RM    412  268   564  219    0

       BA   FI  MI/TO  NA/RM
BA      0  662   877    255
FI    662    0   295    268
MI/TO 877  295     0    564
NA/RM 255  268   564      0

Example


          BA/NA/RM   FI  MI/TO
BA/NA/RM        0   268    564
FI            268     0    295
MI/TO         564   295      0

             BA/FI/NA/RM  MI/TO
BA/FI/NA/RM            0    295
MI/TO                295      0
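The single-link merges above can be reproduced with SciPy (a sketch; assumes SciPy is available):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

cities = ["BA", "FI", "MI", "NA", "RM", "TO"]
D = np.array([[  0, 662, 877, 255, 412, 996],
              [662,   0, 295, 468, 268, 400],
              [877, 295,   0, 754, 564, 138],
              [255, 468, 754,   0, 219, 869],
              [412, 268, 564, 219,   0, 669],
              [996, 400, 138, 869, 669,   0]], dtype=float)

# single-link agglomeration on the condensed distance matrix;
# method="complete" would give the complete-link variant instead
Z = linkage(squareform(D), method="single")
print(Z[:, :3])  # each row: the two clusters merged and the merge distance
# merges occur at 138 (MI+TO), 219 (NA+RM), 255 (+BA), 268 (+FI), 295 (+MI/TO)
```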

Min vs. Max

[Figure: single-link (min) and complete-link (max) clustering results compared.]

Reading Materials

Text Books:

Richard O. Duda et al., Pattern Classification, Chapter 10, John Wiley & Sons.

J. Han and M. Kamber, Data Mining: Concepts and Techniques, Chapter 8, Morgan Kaufmann.

Survey Papers:

A. K. Jain, M. N. Murty and P. J. Flynn (1999). "Data Clustering: A Review". ACM Computing Surveys, Vol. 31(3), pp. 264–323.

R. Xu and D. Wunsch (2005). "Survey of Clustering Algorithms". IEEE Transactions on Neural Networks, Vol. 16(3), pp. 645–678.

A. K. Jain (2010). "Data Clustering: 50 Years Beyond K-Means". Pattern Recognition Letters, Vol. 31, pp. 651–666.

Online Tutorials:

http://home.dei.polimi.it/matteucc/Clustering/tutorial_html
http://www.autonlab.org/tutorials/kmeans.html
http://users.informatik.uni-halle.de/~hinneburg/ClusterTutorial


Review

What is clustering?

What are the two categories of clustering methods?

How does the K-Means algorithm work?

What are the major issues of K-Means?

How do we control the number of clusters in Sequential Leader Clustering?

How can Gaussian mixture models be used for clustering?

What are the main advantages of density-based methods?

What is the core idea of DBSCAN?

What is the general procedure of hierarchical clustering?

Which clustering methods do not require K as an input?


Next Week's Class Talk

Volunteers are required for next week's class talk.

Topic: Affinity Propagation

Science, 315, 972–976, 2007

Clustering by passing messages between points

http://www.psi.toronto.edu/index.php?q=affinity%20propagation

Topic: Clustering by Fast Search and Find of Density Peaks

Science, 344, 1492–1496, 2014

Cluster centers: higher density than their neighbors

Cluster centers: relatively distant from other points with higher densities

Length: 20 minutes plus question time


Assignment

Topic: Clustering Techniques and Applications

Techniques:

K-Means

Another clustering method, for comparison

Task 1: 2D Artificial Datasets

To demonstrate the influence of data patterns

To demonstrate the influence of algorithm factors

Task 2: Image Segmentation

Gray vs. Colour

Deliverables:

Reports (experiment specification, algorithm parameters, in-depth analysis)

Code (any programming language, with detailed comments)

Due: Sunday, 28 December

Credit: 15

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 2: LOGO Clustering Lecturer: Dr. Bo Yuan E-mail: yuanb@sz.tsinghua.edu.cn

Overview

Partitioning Methods

K-Means

Sequential Leader

Model Based Methods

Density Based Methods

Hierarchical Methods

2

What is cluster analysis

Finding groups of objects

Objects similar to each other are in the same group

Objects are different from those in other groups

Unsupervised Learning

No labels

Data driven

3

Clusters

4

Inter-Cluster

Intra-Cluster

Clusters

5

Applications of Clustering

Marketing Finding groups of customers with similar behaviours

Biology Finding groups of animals or plants with similar features

Bioinformatics Clustering of microarray data genes and sequences

Earthquake Studies Clustering observed earthquake epicenters to identify dangerous zones

WWW Clustering weblog data to discover groups of similar access patterns

Social Networks Discovering groups of individuals with close friendships internally

6

Earthquakes

7

Image Segmentation

8

The Big Picture

9

Requirements

Scalability

Ability to deal with different types of attributes

Ability to discover clusters with arbitrary shape

Minimum requirements for domain knowledge

Ability to deal with noise and outliers

Insensitivity to order of input records

Incorporation of user-defined constraints

Interpretability and usability

10

Practical Considerations

11

Normalization or Not

12

Evaluation

13

ii Dxi

i

c

i Dxie x

nmmxJ

1

1

2

VS

Evaluation

14

The Influence of Outliers

15

outlier

K=2

K-Means

16

K-Means

17

K-Means

18

K-Means

Determine the value of K

Choose K cluster centres randomly

Each data point is assigned to its closest centroid

Use the mean of each cluster to update each centroid

Repeat until no more new assignment

Return the K centroids

Reference J MacQueen (1967) Some Methods for Classification and Analysis of

Multivariate Observations Proceedings of the 5th Berkeley Symposium on

Mathematical Statistics and Probability vol1 pp 281-297

19

Comments on K-Means

Pros

Simple and works well for regular disjoint clusters

Converges relatively fast

Relatively efficient and scalable O(tkn)bull t iteration k number of centroids n number of data points

Cons

Need to specify the value of K in advancebull Difficult and domain knowledge may help

May converge to local optimabull In practice try different initial centroids

May be sensitive to noisy data and outliersbull Mean of data points hellip

Not suitable for clusters of bull Non-convex shapes

20

The Influence of Initial Centroids

21

The Influence of Initial Centroids

22

The K-Medoids Method

The basic idea is to use real data points as centres

Determine the value of K in advance

Randomly select K points as medoids

Assign each data point to the closest medoid

Calculate the cost of the configuration J

For each medoid m

For each non-medoid point o

Swap m and o and calculate the new cost of configuration Jprime

If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps

Otherwise terminate the procedure23

The K-Medoids Method

24

Cost =20 Cost =26

Sequential Leader Clustering

A very efficient clustering algorithm

No iteration

Time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point

Compute the distance between the new data point and every clusters centre

If the distance is smaller than the chosen threshold assign the new data point

to the corresponding cluster and re-compute cluster centre

Otherwise create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

25

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max

)()()(

iaib

iaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

2

1)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethe

components mixture ofnumber the

points data ofnumber the

ijz

n

m

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points

Start from a randomly selected unseen point P

If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set

Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram

Clustering is obtained by cutting at desired level

No need to specify K in advance

May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link

Minimum distance between points

Complete Link

Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255

FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564

FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons

J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo

ACM Computing Surveys Vol 31(3) pp 264-323

R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678

A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation

Science 315 972ndash976 2007

Clustering by passing messages between points

httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks

Science 344 1492ndash1496 2014

Cluster centers higher density than neighbors

Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques

K-Means

Another clustering method for comparison

Task 1 2D Artificial Datasets

To demonstrate the influence of data patterns

To demonstrate the influence of algorithm factors

Task 2 Image Segmentation

Gray vs Colour

Deliverables

Reports (experiment specification algorithm parameters in-depth analysis)

Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 15 48

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 3: LOGO Clustering Lecturer: Dr. Bo Yuan E-mail: yuanb@sz.tsinghua.edu.cn

What is cluster analysis

Finding groups of objects

Objects similar to each other are in the same group

Objects are different from those in other groups

Unsupervised Learning

No labels

Data driven

3

Clusters

4

Inter-Cluster

Intra-Cluster

Clusters

5

Applications of Clustering

Marketing Finding groups of customers with similar behaviours

Biology Finding groups of animals or plants with similar features

Bioinformatics Clustering of microarray data genes and sequences

Earthquake Studies Clustering observed earthquake epicenters to identify dangerous zones

WWW Clustering weblog data to discover groups of similar access patterns

Social Networks Discovering groups of individuals with close friendships internally

6

Earthquakes

7

Image Segmentation

8

The Big Picture

9

Requirements

Scalability

Ability to deal with different types of attributes

Ability to discover clusters with arbitrary shape

Minimum requirements for domain knowledge

Ability to deal with noise and outliers

Insensitivity to order of input records

Incorporation of user-defined constraints

Interpretability and usability

10

Practical Considerations

11

Normalization or Not

12

Evaluation

13

ii Dxi

i

c

i Dxie x

nmmxJ

1

1

2

VS

Evaluation

14

The Influence of Outliers

15

outlier

K=2

K-Means

16

K-Means

17

K-Means

18

K-Means

Determine the value of K

Choose K cluster centres randomly

Each data point is assigned to its closest centroid

Use the mean of each cluster to update each centroid

Repeat until no more new assignment

Return the K centroids

Reference J MacQueen (1967) Some Methods for Classification and Analysis of

Multivariate Observations Proceedings of the 5th Berkeley Symposium on

Mathematical Statistics and Probability vol1 pp 281-297

19

Comments on K-Means

Pros

Simple and works well for regular disjoint clusters

Converges relatively fast

Relatively efficient and scalable O(tkn)bull t iteration k number of centroids n number of data points

Cons

Need to specify the value of K in advancebull Difficult and domain knowledge may help

May converge to local optimabull In practice try different initial centroids

May be sensitive to noisy data and outliersbull Mean of data points hellip

Not suitable for clusters of bull Non-convex shapes

20

The Influence of Initial Centroids

21

The Influence of Initial Centroids

22

The K-Medoids Method

The basic idea is to use real data points as centres

Determine the value of K in advance

Randomly select K points as medoids

Assign each data point to the closest medoid

Calculate the cost of the configuration J

For each medoid m

For each non-medoid point o

Swap m and o and calculate the new cost of configuration Jprime

If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps

Otherwise terminate the procedure23

The K-Medoids Method

24

Cost =20 Cost =26

Sequential Leader Clustering

A very efficient clustering algorithm

No iteration

Time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point

Compute the distance between the new data point and every clusters centre

If the distance is smaller than the chosen threshold assign the new data point

to the corresponding cluster and re-compute cluster centre

Otherwise create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

25

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max

)()()(

iaib

iaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

2

1)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethe

components mixture ofnumber the

points data ofnumber the

ijz

n

m

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points

Start from a randomly selected unseen point P

If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set

Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram

Clustering is obtained by cutting at desired level

No need to specify K in advance

May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link

Minimum distance between points

Complete Link

Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255

FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564

FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons

J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo

ACM Computing Surveys Vol 31(3) pp 264-323

R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678

A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation

Science 315 972ndash976 2007

Clustering by passing messages between points

httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks

Science 344 1492ndash1496 2014

Cluster centers higher density than neighbors

Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques

K-Means

Another clustering method for comparison

Task 1 2D Artificial Datasets

To demonstrate the influence of data patterns

To demonstrate the influence of algorithm factors

Task 2 Image Segmentation

Gray vs Colour

Deliverables

Reports (experiment specification algorithm parameters in-depth analysis)

Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 15 48

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 4: LOGO Clustering Lecturer: Dr. Bo Yuan E-mail: yuanb@sz.tsinghua.edu.cn

Clusters

4

Inter-Cluster

Intra-Cluster

Clusters

5

Applications of Clustering

Marketing Finding groups of customers with similar behaviours

Biology Finding groups of animals or plants with similar features

Bioinformatics Clustering of microarray data genes and sequences

Earthquake Studies Clustering observed earthquake epicenters to identify dangerous zones

WWW Clustering weblog data to discover groups of similar access patterns

Social Networks Discovering groups of individuals with close friendships internally

6

Earthquakes

7

Image Segmentation

8

The Big Picture

9

Requirements

Scalability

Ability to deal with different types of attributes

Ability to discover clusters with arbitrary shape

Minimum requirements for domain knowledge

Ability to deal with noise and outliers

Insensitivity to order of input records

Incorporation of user-defined constraints

Interpretability and usability

10

Practical Considerations

11

Normalization or Not

12

Evaluation

13

ii Dxi

i

c

i Dxie x

nmmxJ

1

1

2

VS

Evaluation

14

The Influence of Outliers

15

outlier

K=2

K-Means

16

K-Means

17

K-Means

18

K-Means

Determine the value of K

Choose K cluster centres randomly

Each data point is assigned to its closest centroid

Use the mean of each cluster to update each centroid

Repeat until no more new assignment

Return the K centroids

Reference J MacQueen (1967) Some Methods for Classification and Analysis of

Multivariate Observations Proceedings of the 5th Berkeley Symposium on

Mathematical Statistics and Probability vol1 pp 281-297

19

Comments on K-Means

Pros

Simple and works well for regular disjoint clusters

Converges relatively fast

Relatively efficient and scalable O(tkn)bull t iteration k number of centroids n number of data points

Cons

Need to specify the value of K in advancebull Difficult and domain knowledge may help

May converge to local optimabull In practice try different initial centroids

May be sensitive to noisy data and outliersbull Mean of data points hellip

Not suitable for clusters of bull Non-convex shapes

20

The Influence of Initial Centroids

21

The Influence of Initial Centroids

22

The K-Medoids Method

The basic idea is to use real data points as centres

Determine the value of K in advance

Randomly select K points as medoids

Assign each data point to the closest medoid

Calculate the cost of the configuration J

For each medoid m

For each non-medoid point o

Swap m and o and calculate the new cost of configuration Jprime

If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps

Otherwise terminate the procedure23

The K-Medoids Method

24

Cost =20 Cost =26

Sequential Leader Clustering

A very efficient clustering algorithm

No iteration

Time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point

Compute the distance between the new data point and every clusters centre

If the distance is smaller than the chosen threshold assign the new data point

to the corresponding cluster and re-compute cluster centre

Otherwise create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

25

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max

)()()(

iaib

iaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

2

1)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethe

components mixture ofnumber the

points data ofnumber the

ijz

n

m

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points

Start from a randomly selected unseen point P

If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set

Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram

Clustering is obtained by cutting at desired level

No need to specify K in advance

May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link

Minimum distance between points

Complete Link

Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255

FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564

FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons

J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo

ACM Computing Surveys Vol 31(3) pp 264-323

R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678

A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation

Science 315 972ndash976 2007

Clustering by passing messages between points

httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks

Science 344 1492ndash1496 2014

Cluster centers higher density than neighbors

Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques

K-Means

Another clustering method for comparison

Task 1 2D Artificial Datasets

To demonstrate the influence of data patterns

To demonstrate the influence of algorithm factors

Task 2 Image Segmentation

Gray vs Colour

Deliverables

Reports (experiment specification algorithm parameters in-depth analysis)

Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 15 48

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 5: LOGO Clustering Lecturer: Dr. Bo Yuan E-mail: yuanb@sz.tsinghua.edu.cn

Clusters

5

Applications of Clustering

Marketing Finding groups of customers with similar behaviours

Biology Finding groups of animals or plants with similar features

Bioinformatics Clustering of microarray data genes and sequences

Earthquake Studies Clustering observed earthquake epicenters to identify dangerous zones

WWW Clustering weblog data to discover groups of similar access patterns

Social Networks Discovering groups of individuals with close friendships internally

6

Earthquakes

7

Image Segmentation

8

The Big Picture

9

Requirements

Scalability

Ability to deal with different types of attributes

Ability to discover clusters with arbitrary shape

Minimum requirements for domain knowledge

Ability to deal with noise and outliers

Insensitivity to order of input records

Incorporation of user-defined constraints

Interpretability and usability

10

Practical Considerations

11

Normalization or Not

12

Evaluation

13

ii Dxi

i

c

i Dxie x

nmmxJ

1

1

2

VS

Evaluation

14

The Influence of Outliers

15

outlier

K=2

K-Means

16

K-Means

17

K-Means

18

K-Means

Determine the value of K

Choose K cluster centres randomly

Each data point is assigned to its closest centroid

Use the mean of each cluster to update each centroid

Repeat until no more new assignment

Return the K centroids

Reference J MacQueen (1967) Some Methods for Classification and Analysis of

Multivariate Observations Proceedings of the 5th Berkeley Symposium on

Mathematical Statistics and Probability vol1 pp 281-297

19

Comments on K-Means

Pros

Simple and works well for regular disjoint clusters

Converges relatively fast

Relatively efficient and scalable O(tkn)bull t iteration k number of centroids n number of data points

Cons

Need to specify the value of K in advancebull Difficult and domain knowledge may help

May converge to local optimabull In practice try different initial centroids

May be sensitive to noisy data and outliersbull Mean of data points hellip

Not suitable for clusters of bull Non-convex shapes

20

The Influence of Initial Centroids

21

The Influence of Initial Centroids

22

The K-Medoids Method

The basic idea is to use real data points as centres

Determine the value of K in advance

Randomly select K points as medoids

Assign each data point to the closest medoid

Calculate the cost of the configuration J

For each medoid m

For each non-medoid point o

Swap m and o and calculate the new cost of configuration Jprime

If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps

Otherwise terminate the procedure23

The K-Medoids Method

24

Cost =20 Cost =26

Sequential Leader Clustering

A very efficient clustering algorithm

No iteration

Time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point

Compute the distance between the new data point and every clusters centre

If the distance is smaller than the chosen threshold assign the new data point

to the corresponding cluster and re-compute cluster centre

Otherwise create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

25

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max

)()()(

iaib

iaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

2

1)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethe

components mixture ofnumber the

points data ofnumber the

ijz

n

m

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points

Start from a randomly selected unseen point P

If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set

Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram

Clustering is obtained by cutting at desired level

No need to specify K in advance

May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link

Minimum distance between points

Complete Link

Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255

FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564

FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons

J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo

ACM Computing Surveys Vol 31(3) pp 264-323

R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678

A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation

Science 315 972ndash976 2007

Clustering by passing messages between points

httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks

Science 344 1492ndash1496 2014

Cluster centers higher density than neighbors

Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques

K-Means

Another clustering method for comparison

Task 1 2D Artificial Datasets

To demonstrate the influence of data patterns

To demonstrate the influence of algorithm factors

Task 2 Image Segmentation

Gray vs Colour

Deliverables

Reports (experiment specification algorithm parameters in-depth analysis)

Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 15 48

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 6: LOGO Clustering Lecturer: Dr. Bo Yuan E-mail: yuanb@sz.tsinghua.edu.cn

Applications of Clustering

Marketing Finding groups of customers with similar behaviours

Biology Finding groups of animals or plants with similar features

Bioinformatics Clustering of microarray data genes and sequences

Earthquake Studies Clustering observed earthquake epicenters to identify dangerous zones

WWW Clustering weblog data to discover groups of similar access patterns

Social Networks Discovering groups of individuals with close friendships internally

6

Earthquakes

7

Image Segmentation

8

The Big Picture

9

Requirements

Scalability

Ability to deal with different types of attributes

Ability to discover clusters with arbitrary shape

Minimum requirements for domain knowledge

Ability to deal with noise and outliers

Insensitivity to order of input records

Incorporation of user-defined constraints

Interpretability and usability

10

Practical Considerations

11

Normalization or Not

12

Evaluation

13

ii Dxi

i

c

i Dxie x

nmmxJ

1

1

2

VS

Evaluation

14

The Influence of Outliers

15

outlier

K=2

K-Means

16

K-Means

17

K-Means

18

K-Means

Determine the value of K

Choose K cluster centres randomly

Each data point is assigned to its closest centroid

Use the mean of each cluster to update each centroid

Repeat until no more new assignment

Return the K centroids

Reference J MacQueen (1967) Some Methods for Classification and Analysis of

Multivariate Observations Proceedings of the 5th Berkeley Symposium on

Mathematical Statistics and Probability vol1 pp 281-297

19

Comments on K-Means

Pros

Simple and works well for regular disjoint clusters

Converges relatively fast

Relatively efficient and scalable O(tkn)bull t iteration k number of centroids n number of data points

Cons

Need to specify the value of K in advancebull Difficult and domain knowledge may help

May converge to local optimabull In practice try different initial centroids

May be sensitive to noisy data and outliersbull Mean of data points hellip

Not suitable for clusters of bull Non-convex shapes

20

The Influence of Initial Centroids

21

The Influence of Initial Centroids

22

The K-Medoids Method

The basic idea is to use real data points as centres

Determine the value of K in advance

Randomly select K points as medoids

Assign each data point to the closest medoid

Calculate the cost of the configuration J

For each medoid m

For each non-medoid point o

Swap m and o and calculate the new cost of configuration Jprime

If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps

Otherwise terminate the procedure23

The K-Medoids Method

24

Cost =20 Cost =26

Sequential Leader Clustering

A very efficient clustering algorithm

No iteration

Time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point

Compute the distance between the new data point and every clusters centre

If the distance is smaller than the chosen threshold assign the new data point

to the corresponding cluster and re-compute cluster centre

Otherwise create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

25

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max

)()()(

iaib

iaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

2

1)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethe

components mixture ofnumber the

points data ofnumber the

ijz

n

m

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points

Start from a randomly selected unseen point P

If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set

Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram

Clustering is obtained by cutting at desired level

No need to specify K in advance

May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link

Minimum distance between points

Complete Link

Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255

FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564

FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books:
Richard O. Duda et al., Pattern Classification, Chapter 10, John Wiley & Sons.
J. Han and M. Kamber, Data Mining: Concepts and Techniques, Chapter 8, Morgan Kaufmann.

Survey Papers:
A. K. Jain, M. N. Murty and P. J. Flynn (1999), "Data Clustering: A Review", ACM Computing Surveys, Vol. 31(3), pp. 264-323.
R. Xu and D. Wunsch (2005), "Survey of Clustering Algorithms", IEEE Transactions on Neural Networks, Vol. 16(3), pp. 645-678.
A. K. Jain (2010), "Data Clustering: 50 Years Beyond K-Means", Pattern Recognition Letters, Vol. 31, pp. 651-666.

Online Tutorials:
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html
http://www.autonlab.org/tutorials/kmeans.html
http://users.informatik.uni-halle.de/~hinneburg/ClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Week's Class Talk

Volunteers are required for next week's class talk

Topic: Affinity Propagation

Science 315, 972–976, 2007

Clustering by passing messages between points

http://www.psi.toronto.edu/index.php?q=affinity%20propagation

Topic: Clustering by Fast Search and Find of Density Peaks

Science 344, 1492–1496, 2014

Cluster centers: higher density than their neighbors

Cluster centers: relatively distant from points with higher densities

Length: 20 minutes plus question time

47

Assignment

Topic: Clustering Techniques and Applications

Techniques

K-Means

Another clustering method for comparison

Task 1 2D Artificial Datasets

To demonstrate the influence of data patterns

To demonstrate the influence of algorithm factors

Task 2 Image Segmentation

Gray vs Colour

Deliverables

Reports (experiment specification, algorithm parameters, in-depth analysis)

Code (any programming language with detailed comments)

Due Sunday 28 December

Credit: 15%

48

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Week's Class Talk
  • Assignment

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points

Start from a randomly selected unseen point P

If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set

Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram

Clustering is obtained by cutting at desired level

No need to specify K in advance

May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link

Minimum distance between points

Complete Link

Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255

FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564

FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons

J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo

ACM Computing Surveys Vol 31(3) pp 264-323

R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678

A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation

Science 315 972ndash976 2007

Clustering by passing messages between points

httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks

Science 344 1492ndash1496 2014

Cluster centers higher density than neighbors

Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques

K-Means

Another clustering method for comparison

Task 1 2D Artificial Datasets

To demonstrate the influence of data patterns

To demonstrate the influence of algorithm factors

Task 2 Image Segmentation

Gray vs Colour

Deliverables

Reports (experiment specification algorithm parameters in-depth analysis)

Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 15 48

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 14: LOGO Clustering Lecturer: Dr. Bo Yuan E-mail: yuanb@sz.tsinghua.edu.cn

Evaluation

14

The Influence of Outliers

15

outlier

K=2

K-Means

16

K-Means

17

K-Means

18

K-Means

Determine the value of K

Choose K cluster centres randomly

Each data point is assigned to its closest centroid

Use the mean of each cluster to update each centroid

Repeat until no more new assignment

Return the K centroids

Reference J MacQueen (1967) Some Methods for Classification and Analysis of

Multivariate Observations Proceedings of the 5th Berkeley Symposium on

Mathematical Statistics and Probability vol1 pp 281-297

19

Comments on K-Means

Pros

Simple and works well for regular disjoint clusters

Converges relatively fast

Relatively efficient and scalable O(tkn)bull t iteration k number of centroids n number of data points

Cons

Need to specify the value of K in advancebull Difficult and domain knowledge may help

May converge to local optimabull In practice try different initial centroids

May be sensitive to noisy data and outliersbull Mean of data points hellip

Not suitable for clusters of bull Non-convex shapes

20

The Influence of Initial Centroids

21

The Influence of Initial Centroids

22

The K-Medoids Method

The basic idea is to use real data points as centres

Determine the value of K in advance

Randomly select K points as medoids

Assign each data point to the closest medoid

Calculate the cost of the configuration J

For each medoid m

For each non-medoid point o

Swap m and o and calculate the new cost of configuration Jprime

If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps

Otherwise terminate the procedure23

The K-Medoids Method

24

Cost =20 Cost =26

Sequential Leader Clustering

A very efficient clustering algorithm

No iteration

Time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point

Compute the distance between the new data point and every clusters centre

If the distance is smaller than the chosen threshold assign the new data point

to the corresponding cluster and re-compute cluster centre

Otherwise create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

25

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max

)()()(

iaib

iaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

2

1)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethe

components mixture ofnumber the

points data ofnumber the

ijz

n

m

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points

Start from a randomly selected unseen point P

If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set

Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram

Clustering is obtained by cutting at desired level

No need to specify K in advance

May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link

Minimum distance between points

Complete Link

Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255

FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564

FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons

J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo

ACM Computing Surveys Vol 31(3) pp 264-323

R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678

A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation

Science 315 972ndash976 2007

Clustering by passing messages between points

httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks

Science 344 1492ndash1496 2014

Cluster centers higher density than neighbors

Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques

K-Means

Another clustering method for comparison

Task 1 2D Artificial Datasets

To demonstrate the influence of data patterns

To demonstrate the influence of algorithm factors

Task 2 Image Segmentation

Gray vs Colour

Deliverables

Reports (experiment specification algorithm parameters in-depth analysis)

Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 15 48

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 15: LOGO Clustering Lecturer: Dr. Bo Yuan E-mail: yuanb@sz.tsinghua.edu.cn

The Influence of Outliers

15

outlier

K=2

K-Means

16

K-Means

17

K-Means

18

K-Means

Determine the value of K

Choose K cluster centres randomly

Each data point is assigned to its closest centroid

Use the mean of each cluster to update each centroid

Repeat until no more new assignment

Return the K centroids

Reference J MacQueen (1967) Some Methods for Classification and Analysis of

Multivariate Observations Proceedings of the 5th Berkeley Symposium on

Mathematical Statistics and Probability vol1 pp 281-297

19

Comments on K-Means

Pros

Simple and works well for regular disjoint clusters

Converges relatively fast

Relatively efficient and scalable O(tkn)bull t iteration k number of centroids n number of data points

Cons

Need to specify the value of K in advancebull Difficult and domain knowledge may help

May converge to local optimabull In practice try different initial centroids

May be sensitive to noisy data and outliersbull Mean of data points hellip

Not suitable for clusters of bull Non-convex shapes

20

The Influence of Initial Centroids

21

The Influence of Initial Centroids

22

The K-Medoids Method

The basic idea is to use real data points as centres

Determine the value of K in advance

Randomly select K points as medoids

Assign each data point to the closest medoid

Calculate the cost of the configuration J

For each medoid m

For each non-medoid point o

Swap m and o and calculate the new cost of configuration Jprime

If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps

Otherwise terminate the procedure23

The K-Medoids Method

24

Cost =20 Cost =26

Sequential Leader Clustering

A very efficient clustering algorithm

No iteration

Time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point

Compute the distance between the new data point and every clusters centre

If the distance is smaller than the chosen threshold assign the new data point

to the corresponding cluster and re-compute cluster centre

Otherwise create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

25

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max

)()()(

iaib

iaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

2

1)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethe

components mixture ofnumber the

points data ofnumber the

ijz

n

m

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points

Start from a randomly selected unseen point P

If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set

Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram

Clustering is obtained by cutting at desired level

No need to specify K in advance

May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link

Minimum distance between points

Complete Link

Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255

FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564

FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons

J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo

ACM Computing Surveys Vol 31(3) pp 264-323

R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678

A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation

Science 315 972ndash976 2007

Clustering by passing messages between points

httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks

Science 344 1492ndash1496 2014

Cluster centers higher density than neighbors

Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques

K-Means

Another clustering method for comparison

Task 1 2D Artificial Datasets

To demonstrate the influence of data patterns

To demonstrate the influence of algorithm factors

Task 2 Image Segmentation

Gray vs Colour

Deliverables

Reports (experiment specification algorithm parameters in-depth analysis)

Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 15 48

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 16: LOGO Clustering Lecturer: Dr. Bo Yuan E-mail: yuanb@sz.tsinghua.edu.cn

K-Means

16

K-Means

17

K-Means

18

K-Means

Determine the value of K

Choose K cluster centres randomly

Each data point is assigned to its closest centroid

Use the mean of each cluster to update each centroid

Repeat until no more new assignment

Return the K centroids

Reference J MacQueen (1967) Some Methods for Classification and Analysis of

Multivariate Observations Proceedings of the 5th Berkeley Symposium on

Mathematical Statistics and Probability vol1 pp 281-297

19

Comments on K-Means

Pros

Simple and works well for regular disjoint clusters

Converges relatively fast

Relatively efficient and scalable O(tkn)bull t iteration k number of centroids n number of data points

Cons

Need to specify the value of K in advancebull Difficult and domain knowledge may help

May converge to local optimabull In practice try different initial centroids

May be sensitive to noisy data and outliersbull Mean of data points hellip

Not suitable for clusters of bull Non-convex shapes

20

The Influence of Initial Centroids

21

The Influence of Initial Centroids

22

The K-Medoids Method

The basic idea is to use real data points as centres

Determine the value of K in advance

Randomly select K points as medoids

Assign each data point to the closest medoid

Calculate the cost of the configuration J

For each medoid m

For each non-medoid point o

Swap m and o and calculate the new cost of configuration Jprime

If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps

Otherwise terminate the procedure23

The K-Medoids Method

24

Cost =20 Cost =26

Sequential Leader Clustering

A very efficient clustering algorithm

No iteration

Time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point

Compute the distance between the new data point and every clusters centre

If the distance is smaller than the chosen threshold assign the new data point

to the corresponding cluster and re-compute cluster centre

Otherwise create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

25

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max

)()()(

iaib

iaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

2

1)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethe

components mixture ofnumber the

points data ofnumber the

ijz

n

m

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points

Start from a randomly selected unseen point P

If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set

Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram

Clustering is obtained by cutting at desired level

No need to specify K in advance

May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link

Minimum distance between points

Complete Link

Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255

FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564

FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons

J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo

ACM Computing Surveys Vol 31(3) pp 264-323

R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678

A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation

Science 315 972ndash976 2007

Clustering by passing messages between points

httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks

Science 344 1492ndash1496 2014

Cluster centers higher density than neighbors

Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques

K-Means

Another clustering method for comparison

Task 1 2D Artificial Datasets

To demonstrate the influence of data patterns

To demonstrate the influence of algorithm factors

Task 2 Image Segmentation

Gray vs Colour

Deliverables

Reports (experiment specification algorithm parameters in-depth analysis)

Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 15 48

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 17: LOGO Clustering Lecturer: Dr. Bo Yuan E-mail: yuanb@sz.tsinghua.edu.cn

K-Means

17

K-Means

18

K-Means

Determine the value of K

Choose K cluster centres randomly

Each data point is assigned to its closest centroid

Use the mean of each cluster to update each centroid

Repeat until no more new assignment

Return the K centroids

Reference J MacQueen (1967) Some Methods for Classification and Analysis of

Multivariate Observations Proceedings of the 5th Berkeley Symposium on

Mathematical Statistics and Probability vol1 pp 281-297

19

Comments on K-Means

Pros

Simple and works well for regular disjoint clusters

Converges relatively fast

Relatively efficient and scalable O(tkn)bull t iteration k number of centroids n number of data points

Cons

Need to specify the value of K in advancebull Difficult and domain knowledge may help

May converge to local optimabull In practice try different initial centroids

May be sensitive to noisy data and outliersbull Mean of data points hellip

Not suitable for clusters of bull Non-convex shapes

20

The Influence of Initial Centroids

21

The Influence of Initial Centroids

22

The K-Medoids Method

The basic idea is to use real data points as centres

Determine the value of K in advance

Randomly select K points as medoids

Assign each data point to the closest medoid

Calculate the cost of the configuration J

For each medoid m

For each non-medoid point o

Swap m and o and calculate the new cost of configuration Jprime

If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps

Otherwise terminate the procedure23

The K-Medoids Method

24

Cost =20 Cost =26

Sequential Leader Clustering

A very efficient clustering algorithm

No iteration

Time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point

Compute the distance between the new data point and every clusters centre

If the distance is smaller than the chosen threshold assign the new data point

to the corresponding cluster and re-compute cluster centre

Otherwise create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

25

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max

)()()(

iaib

iaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

2

1)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethe

components mixture ofnumber the

points data ofnumber the

ijz

n

m

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points

Start from a randomly selected unseen point P

If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set

Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram

Clustering is obtained by cutting at desired level

No need to specify K in advance

May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link

Minimum distance between points

Complete Link

Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255

FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564

FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons

J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo

ACM Computing Surveys Vol 31(3) pp 264-323

R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678

A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation

Science 315 972ndash976 2007

Clustering by passing messages between points

httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks

Science 344 1492ndash1496 2014

Cluster centers higher density than neighbors

Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques

K-Means

Another clustering method for comparison

Task 1 2D Artificial Datasets

To demonstrate the influence of data patterns

To demonstrate the influence of algorithm factors

Task 2 Image Segmentation

Gray vs Colour

Deliverables

Reports (experiment specification algorithm parameters in-depth analysis)

Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 15 48

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 18: LOGO Clustering Lecturer: Dr. Bo Yuan E-mail: yuanb@sz.tsinghua.edu.cn

K-Means

18

K-Means

Determine the value of K

Choose K cluster centres randomly

Each data point is assigned to its closest centroid

Use the mean of each cluster to update each centroid

Repeat until no more new assignment

Return the K centroids

Reference J MacQueen (1967) Some Methods for Classification and Analysis of

Multivariate Observations Proceedings of the 5th Berkeley Symposium on

Mathematical Statistics and Probability vol1 pp 281-297

19

Comments on K-Means

Pros

Simple and works well for regular disjoint clusters

Converges relatively fast

Relatively efficient and scalable O(tkn)bull t iteration k number of centroids n number of data points

Cons

Need to specify the value of K in advancebull Difficult and domain knowledge may help

May converge to local optimabull In practice try different initial centroids

May be sensitive to noisy data and outliersbull Mean of data points hellip

Not suitable for clusters of bull Non-convex shapes

20

The Influence of Initial Centroids

21

The Influence of Initial Centroids

22

The K-Medoids Method

The basic idea is to use real data points as centres

Determine the value of K in advance

Randomly select K points as medoids

Assign each data point to the closest medoid

Calculate the cost of the configuration J

For each medoid m

For each non-medoid point o

Swap m and o and calculate the new cost of configuration Jprime

If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps

Otherwise terminate the procedure23

The K-Medoids Method

24

Cost =20 Cost =26

Sequential Leader Clustering

A very efficient clustering algorithm

No iteration

Time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point

Compute the distance between the new data point and every clusters centre

If the distance is smaller than the chosen threshold assign the new data point

to the corresponding cluster and re-compute cluster centre

Otherwise create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

25

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max

)()()(

iaib

iaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

2

1)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethe

components mixture ofnumber the

points data ofnumber the

ijz

n

m

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points

Start from a randomly selected unseen point P

If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set

Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram

Clustering is obtained by cutting at desired level

No need to specify K in advance

May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link

Minimum distance between points

Complete Link

Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255

FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564

FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons

J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo

ACM Computing Surveys Vol 31(3) pp 264-323

R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678

A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation

Science 315 972ndash976 2007

Clustering by passing messages between points

httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks

Science 344 1492ndash1496 2014

Cluster centers higher density than neighbors

Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques

K-Means

Another clustering method for comparison

Task 1 2D Artificial Datasets

To demonstrate the influence of data patterns

To demonstrate the influence of algorithm factors

Task 2 Image Segmentation

Gray vs Colour

Deliverables

Reports (experiment specification algorithm parameters in-depth analysis)

Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 15 48

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 19: LOGO Clustering Lecturer: Dr. Bo Yuan E-mail: yuanb@sz.tsinghua.edu.cn

K-Means

Determine the value of K

Choose K cluster centres randomly

Each data point is assigned to its closest centroid

Use the mean of each cluster to update each centroid

Repeat until no more new assignment

Return the K centroids

Reference J MacQueen (1967) Some Methods for Classification and Analysis of

Multivariate Observations Proceedings of the 5th Berkeley Symposium on

Mathematical Statistics and Probability vol1 pp 281-297

19

Comments on K-Means

Pros

Simple and works well for regular disjoint clusters

Converges relatively fast

Relatively efficient and scalable O(tkn)bull t iteration k number of centroids n number of data points

Cons

Need to specify the value of K in advancebull Difficult and domain knowledge may help

May converge to local optimabull In practice try different initial centroids

May be sensitive to noisy data and outliersbull Mean of data points hellip

Not suitable for clusters of bull Non-convex shapes

20

The Influence of Initial Centroids

21

The Influence of Initial Centroids

22

The K-Medoids Method

The basic idea is to use real data points as centres

Determine the value of K in advance

Randomly select K points as medoids

Assign each data point to the closest medoid

Calculate the cost of the configuration J

For each medoid m

For each non-medoid point o

Swap m and o and calculate the new cost of configuration Jprime

If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps

Otherwise terminate the procedure23

The K-Medoids Method

24

Cost =20 Cost =26

Sequential Leader Clustering

A very efficient clustering algorithm

No iteration

Time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point

Compute the distance between the new data point and every clusters centre

If the distance is smaller than the chosen threshold assign the new data point

to the corresponding cluster and re-compute cluster centre

Otherwise create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

25

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max

)()()(

iaib

iaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

2

1)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethe

components mixture ofnumber the

points data ofnumber the

ijz

n

m

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points

Start from a randomly selected unseen point P

If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set

Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram

Clustering is obtained by cutting at desired level

No need to specify K in advance

May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

A bottom-up method:

Assign each data point to its own cluster.

Calculate the proximity matrix.

Merge the pair of closest clusters.

Repeat until only a single cluster remains.

How should the distance between two clusters be calculated (see the example below)?

Single Link: the minimum distance between points.

Complete Link: the maximum distance between points.

Example

Distance matrix for six cities (single link: clusters are merged at the minimum inter-point distance; the closest pair is MI and TO at 138):

      BA   FI   MI   NA   RM   TO
BA     0  662  877  255  412  996
FI   662    0  295  468  268  400
MI   877  295    0  754  564  138
NA   255  468  754    0  219  869
RM   412  268  564  219    0  669
TO   996  400  138  869  669    0

Example

After merging MI and TO into MI/TO (each new entry is the minimum of the two original distances):

        BA   FI  MI/TO   NA   RM
BA       0  662   877   255  412
FI     662    0   295   468  268
MI/TO  877  295     0   754  564
NA     255  468   754     0  219
RM     412  268   564   219    0

After merging NA and RM into NA/RM (at distance 219):

        BA   FI  MI/TO  NA/RM
BA       0  662   877    255
FI     662    0   295    268
MI/TO  877  295     0    564
NA/RM  255  268   564      0

Example

After merging BA with NA/RM (at distance 255):

           BA/NA/RM   FI  MI/TO
BA/NA/RM        0    268   564
FI            268      0   295
MI/TO         564    295     0

After merging FI with BA/NA/RM (at distance 268); the final merge joins the two remaining clusters at distance 295:

              BA/FI/NA/RM  MI/TO
BA/FI/NA/RM        0        295
MI/TO            295          0
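The whole sequence of merges above can be reproduced with SciPy as a quick check (a sketch; the labels list and variable names are our own):

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

labels = ["BA", "FI", "MI", "NA", "RM", "TO"]
D = np.array([
    [  0, 662, 877, 255, 412, 996],
    [662,   0, 295, 468, 268, 400],
    [877, 295,   0, 754, 564, 138],
    [255, 468, 754,   0, 219, 869],
    [412, 268, 564, 219,   0, 669],
    [996, 400, 138, 869, 669,   0],
], dtype=float)

# squareform turns the symmetric matrix into the condensed form linkage expects
Z = linkage(squareform(D), method="single")   # use method="complete" for complete link
print(Z[:, 2])                                # merge distances: 138, 219, 255, 268, 295
# scipy.cluster.hierarchy.dendrogram(Z, labels=labels) draws the dendrogram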

Min vs Max

[Figure: the same data clustered with single link (min) and complete link (max), illustrating how the choice of inter-cluster distance changes the result.]

Reading Materials

Text Books:

Richard O. Duda et al., Pattern Classification, Chapter 10, John Wiley & Sons.

J. Han and M. Kamber, Data Mining: Concepts and Techniques, Chapter 8, Morgan Kaufmann.

Survey Papers:

A. K. Jain, M. N. Murty and P. J. Flynn (1999). "Data Clustering: A Review", ACM Computing Surveys, Vol. 31(3), pp. 264-323.

R. Xu and D. Wunsch (2005). "Survey of Clustering Algorithms", IEEE Transactions on Neural Networks, Vol. 16(3), pp. 645-678.

A. K. Jain (2010). "Data Clustering: 50 Years Beyond K-Means", Pattern Recognition Letters, Vol. 31, pp. 651-666.

Online Tutorials:

http://home.dei.polimi.it/matteucc/Clustering/tutorial_html
http://www.autonlab.org/tutorials/kmeans.html
http://users.informatik.uni-halle.de/~hinnebur/ClusterTutorial

Review

What is clustering?

What are the two categories of clustering methods?

How does the K-Means algorithm work?

What are the major issues of K-Means?

How to control the number of clusters in Sequential Leader Clustering?

How to use Gaussian mixture models for clustering?

What are the main advantages of density based methods?

What is the core idea of DBSCAN?

What is the general procedure of hierarchical clustering?

Which clustering methods do not require K as the input?

Next Weekrsquos Class Talk

Volunteers are required for next week's class talk.

Topic: Affinity Propagation

Science, 315: 972-976, 2007.

Clustering by passing messages between points.

http://www.psi.toronto.edu/index.php?q=affinity%20propagation

Topic: Clustering by Fast Search and Find of Density Peaks

Science, 344: 1492-1496, 2014.

Cluster centers: higher local density than their neighbors.

Cluster centers: relatively distant from any points with higher densities.

Length: 20 minutes, plus question time.

Assignment

Topic: Clustering Techniques and Applications

Techniques:

K-Means

Another clustering method for comparison

Task 1: 2D Artificial Datasets

To demonstrate the influence of data patterns.

To demonstrate the influence of algorithm factors.

Task 2: Image Segmentation

Gray vs. Colour

Deliverables:

Reports (experiment specification, algorithm parameters, in-depth analysis)

Code (any programming language, with detailed comments)

Due: Sunday, 28 December

Credit: 15%

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 20: LOGO Clustering Lecturer: Dr. Bo Yuan E-mail: yuanb@sz.tsinghua.edu.cn

Comments on K-Means

Pros

Simple and works well for regular disjoint clusters

Converges relatively fast

Relatively efficient and scalable O(tkn)bull t iteration k number of centroids n number of data points

Cons

Need to specify the value of K in advancebull Difficult and domain knowledge may help

May converge to local optimabull In practice try different initial centroids

May be sensitive to noisy data and outliersbull Mean of data points hellip

Not suitable for clusters of bull Non-convex shapes

20

The Influence of Initial Centroids

21

The Influence of Initial Centroids

22

The K-Medoids Method

The basic idea is to use real data points as centres

Determine the value of K in advance

Randomly select K points as medoids

Assign each data point to the closest medoid

Calculate the cost of the configuration J

For each medoid m

For each non-medoid point o

Swap m and o and calculate the new cost of configuration Jprime

If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps

Otherwise terminate the procedure23

The K-Medoids Method

24

Cost =20 Cost =26

Sequential Leader Clustering

A very efficient clustering algorithm

No iteration

Time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point

Compute the distance between the new data point and every clusters centre

If the distance is smaller than the chosen threshold assign the new data point

to the corresponding cluster and re-compute cluster centre

Otherwise create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

25

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max

)()()(

iaib

iaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

2

1)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethe

components mixture ofnumber the

points data ofnumber the

ijz

n

m

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points

Start from a randomly selected unseen point P

If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set

Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram

Clustering is obtained by cutting at desired level

No need to specify K in advance

May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link

Minimum distance between points

Complete Link

Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255

FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564

FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons

J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo

ACM Computing Surveys Vol 31(3) pp 264-323

R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678

A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation

Science 315 972ndash976 2007

Clustering by passing messages between points

httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks

Science 344 1492ndash1496 2014

Cluster centers higher density than neighbors

Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques

K-Means

Another clustering method for comparison

Task 1 2D Artificial Datasets

To demonstrate the influence of data patterns

To demonstrate the influence of algorithm factors

Task 2 Image Segmentation

Gray vs Colour

Deliverables

Reports (experiment specification algorithm parameters in-depth analysis)

Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 15 48

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 21: LOGO Clustering Lecturer: Dr. Bo Yuan E-mail: yuanb@sz.tsinghua.edu.cn

The Influence of Initial Centroids

21

The Influence of Initial Centroids

22

The K-Medoids Method

The basic idea is to use real data points as centres

Determine the value of K in advance

Randomly select K points as medoids

Assign each data point to the closest medoid

Calculate the cost of the configuration J

For each medoid m

For each non-medoid point o

Swap m and o and calculate the new cost of configuration Jprime

If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps

Otherwise terminate the procedure23

The K-Medoids Method

24

Cost =20 Cost =26

Sequential Leader Clustering

A very efficient clustering algorithm

No iteration

Time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point

Compute the distance between the new data point and every clusters centre

If the distance is smaller than the chosen threshold assign the new data point

to the corresponding cluster and re-compute cluster centre

Otherwise create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

25

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max

)()()(

iaib

iaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

2

1)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethe

components mixture ofnumber the

points data ofnumber the

ijz

n

m

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points

Start from a randomly selected unseen point P

If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set

Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram

Clustering is obtained by cutting at desired level

No need to specify K in advance

May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link

Minimum distance between points

Complete Link

Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255

FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564

FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons

J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo

ACM Computing Surveys Vol 31(3) pp 264-323

R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678

A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation

Science 315 972ndash976 2007

Clustering by passing messages between points

httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks

Science 344 1492ndash1496 2014

Cluster centers higher density than neighbors

Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques

K-Means

Another clustering method for comparison

Task 1 2D Artificial Datasets

To demonstrate the influence of data patterns

To demonstrate the influence of algorithm factors

Task 2 Image Segmentation

Gray vs Colour

Deliverables

Reports (experiment specification algorithm parameters in-depth analysis)

Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 15 48

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 22: LOGO Clustering Lecturer: Dr. Bo Yuan E-mail: yuanb@sz.tsinghua.edu.cn

The Influence of Initial Centroids

22

The K-Medoids Method

The basic idea is to use real data points as centres

Determine the value of K in advance

Randomly select K points as medoids

Assign each data point to the closest medoid

Calculate the cost of the configuration J

For each medoid m

For each non-medoid point o

Swap m and o and calculate the new cost of configuration Jprime

If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps

Otherwise terminate the procedure23

The K-Medoids Method

24

Cost =20 Cost =26

Sequential Leader Clustering

A very efficient clustering algorithm

No iteration

Time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point

Compute the distance between the new data point and every clusters centre

If the distance is smaller than the chosen threshold assign the new data point

to the corresponding cluster and re-compute cluster centre

Otherwise create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

25

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max

)()()(

iaib

iaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

2

1)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethe

components mixture ofnumber the

points data ofnumber the

ijz

n

m

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points

Start from a randomly selected unseen point P

If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set

Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram

Clustering is obtained by cutting at desired level

No need to specify K in advance

May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link

Minimum distance between points

Complete Link

Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255

FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564

FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons

J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo

ACM Computing Surveys Vol 31(3) pp 264-323

R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678

A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation

Science 315 972ndash976 2007

Clustering by passing messages between points

httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks

Science 344 1492ndash1496 2014

Cluster centers higher density than neighbors

Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques

K-Means

Another clustering method for comparison

Task 1 2D Artificial Datasets

To demonstrate the influence of data patterns

To demonstrate the influence of algorithm factors

Task 2 Image Segmentation

Gray vs Colour

Deliverables

Reports (experiment specification algorithm parameters in-depth analysis)

Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 15 48

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 23: LOGO Clustering Lecturer: Dr. Bo Yuan E-mail: yuanb@sz.tsinghua.edu.cn

The K-Medoids Method

The basic idea is to use real data points as centres

Determine the value of K in advance

Randomly select K points as medoids

Assign each data point to the closest medoid

Calculate the cost of the configuration J

For each medoid m

For each non-medoid point o

Swap m and o and calculate the new cost of configuration Jprime

If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps

Otherwise terminate the procedure23

The K-Medoids Method

24

Cost =20 Cost =26

Sequential Leader Clustering

A very efficient clustering algorithm

No iteration

Time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point

Compute the distance between the new data point and every clusters centre

If the distance is smaller than the chosen threshold assign the new data point

to the corresponding cluster and re-compute cluster centre

Otherwise create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

25

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max

)()()(

iaib

iaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

2

1)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethe

components mixture ofnumber the

points data ofnumber the

ijz

n

m

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points

Start from a randomly selected unseen point P

If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set

Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram

Clustering is obtained by cutting at desired level

No need to specify K in advance

May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link

Minimum distance between points

Complete Link

Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255

FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564

FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons

J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo

ACM Computing Surveys Vol 31(3) pp 264-323

R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678

A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation

Science 315 972ndash976 2007

Clustering by passing messages between points

httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks

Science 344 1492ndash1496 2014

Cluster centers higher density than neighbors

Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques

K-Means

Another clustering method for comparison

Task 1 2D Artificial Datasets

To demonstrate the influence of data patterns

To demonstrate the influence of algorithm factors

Task 2 Image Segmentation

Gray vs Colour

Deliverables

Reports (experiment specification algorithm parameters in-depth analysis)

Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 15 48

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 24: LOGO Clustering Lecturer: Dr. Bo Yuan E-mail: yuanb@sz.tsinghua.edu.cn

The K-Medoids Method

24

Cost =20 Cost =26

Sequential Leader Clustering

A very efficient clustering algorithm

No iteration

Time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point

Compute the distance between the new data point and every clusters centre

If the distance is smaller than the chosen threshold assign the new data point

to the corresponding cluster and re-compute cluster centre

Otherwise create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

25

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max

)()()(

iaib

iaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

2

1)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethe

components mixture ofnumber the

points data ofnumber the

ijz

n

m

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points

Start from a randomly selected unseen point P

If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set

Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram

Clustering is obtained by cutting at desired level

No need to specify K in advance

May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link

Minimum distance between points

Complete Link

Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255

FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564

FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons

J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo

ACM Computing Surveys Vol 31(3) pp 264-323

R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678

A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation

Science 315 972ndash976 2007

Clustering by passing messages between points

httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks

Science 344 1492ndash1496 2014

Cluster centers higher density than neighbors

Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques

K-Means

Another clustering method for comparison

Task 1 2D Artificial Datasets

To demonstrate the influence of data patterns

To demonstrate the influence of algorithm factors

Task 2 Image Segmentation

Gray vs Colour

Deliverables

Reports (experiment specification algorithm parameters in-depth analysis)

Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 15 48

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 25: LOGO Clustering Lecturer: Dr. Bo Yuan E-mail: yuanb@sz.tsinghua.edu.cn

Sequential Leader Clustering

A very efficient clustering algorithm

No iteration

Time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point

Compute the distance between the new data point and every clusters centre

If the distance is smaller than the chosen threshold assign the new data point

to the corresponding cluster and re-compute cluster centre

Otherwise create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

25

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max

)()()(

iaib

iaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

2

1)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethe

components mixture ofnumber the

points data ofnumber the

ijz

n

m

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points

Start from a randomly selected unseen point P

If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set

Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram

Clustering is obtained by cutting at desired level

No need to specify K in advance

May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link

Minimum distance between points

Complete Link

Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255

FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564

FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons

J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo

ACM Computing Surveys Vol 31(3) pp 264-323

R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678

A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation

Science 315 972ndash976 2007

Clustering by passing messages between points

httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks

Science 344 1492ndash1496 2014

Cluster centers higher density than neighbors

Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques

K-Means

Another clustering method for comparison

Task 1 2D Artificial Datasets

To demonstrate the influence of data patterns

To demonstrate the influence of algorithm factors

Task 2 Image Segmentation

Gray vs Colour

Deliverables

Reports (experiment specification algorithm parameters in-depth analysis)

Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 15 48

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 26: LOGO Clustering Lecturer: Dr. Bo Yuan E-mail: yuanb@sz.tsinghua.edu.cn

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max

)()()(

iaib

iaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

2

1)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethe

components mixture ofnumber the

points data ofnumber the

ijz

n

m

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points

Start from a randomly selected unseen point P

If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set

Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram

Clustering is obtained by cutting at desired level

No need to specify K in advance

May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link

Minimum distance between points

Complete Link

Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255

FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564

FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons

J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo

ACM Computing Surveys Vol 31(3) pp 264-323

R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678

A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation

Science 315 972ndash976 2007

Clustering by passing messages between points

httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks

Science 344 1492ndash1496 2014

Cluster centers higher density than neighbors

Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques

K-Means

Another clustering method for comparison

Task 1 2D Artificial Datasets

To demonstrate the influence of data patterns

To demonstrate the influence of algorithm factors

Task 2 Image Segmentation

Gray vs Colour

Deliverables

Reports (experiment specification algorithm parameters in-depth analysis)

Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 15 48

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 27: LOGO Clustering Lecturer: Dr. Bo Yuan E-mail: yuanb@sz.tsinghua.edu.cn

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

2

1)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethe

components mixture ofnumber the

points data ofnumber the

ijz

n

m

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points

Start from a randomly selected unseen point P

If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set

Noise points are discarded (unlabelled)

37
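A minimal sketch of this procedure, assuming Euclidean distance and in-memory data; the parameter names eps and min_pts and the toy data are ours:

```python
import numpy as np

def dbscan(X, eps=0.5, min_pts=5):
    """Sketch of DBSCAN: returns -1 for noise, 0.. for cluster ids."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbours = [np.flatnonzero(row <= eps) for row in d]   # Eps-neighbourhoods
    core = np.array([len(nb) >= min_pts for nb in neighbours])
    labels = np.full(n, -1)          # everything starts as noise
    cluster = 0
    for p in range(n):
        if labels[p] != -1 or not core[p]:
            continue                 # skip labelled points and non-core seeds
        # Grow a cluster from core point p by adding density reachable points.
        labels[p] = cluster
        frontier = list(neighbours[p])
        while frontier:
            q = frontier.pop()
            if labels[q] == -1:
                labels[q] = cluster
                if core[q]:          # only core points extend the frontier
                    frontier.extend(neighbours[q])
        cluster += 1
    return labels

# Example: two dense groups plus one isolated noise point.
X = np.array([[0, 0], [0, .2], [.2, 0], [5, 5], [5, 5.2], [5.2, 5], [10, 0]])
print(dbscan(X, eps=0.5, min_pts=3))   # -> [0 0 0 1 1 1 -1]
```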

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram

Clustering is obtained by cutting at desired level

No need to specify K in advance

May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters?

Single Link: minimum distance between points

Complete Link: maximum distance between points

40

Example

41

     BA   FI   MI   NA   RM   TO
BA    0  662  877  255  412  996
FI  662    0  295  468  268  400
MI  877  295    0  754  564  138
NA  255  468  754    0  219  869
RM  412  268  564  219    0  669
TO  996  400  138  869  669    0

(Single Link)

Example

42

        BA   FI  MI/TO   NA   RM
BA       0  662    877  255  412
FI     662    0    295  468  268
MI/TO  877  295      0  754  564
NA     255  468    754    0  219
RM     412  268    564  219    0

        BA   FI  MI/TO  NA/RM
BA       0  662    877    255
FI     662    0    295    268
MI/TO  877  295      0    564
NA/RM  255  268    564      0

Example

43

           BA/NA/RM   FI  MI/TO
BA/NA/RM          0  268    564
FI              268    0    295
MI/TO           564  295      0

              BA/FI/NA/RM  MI/TO
BA/FI/NA/RM             0    295
MI/TO                 295      0
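The merge sequence in these tables can be reproduced with a few lines of single-link agglomeration; this sketch (variable names are ours) prints each merge and the distance at which it happens:

```python
import numpy as np

# Distance matrix from the example above.
cities = ["BA", "FI", "MI", "NA", "RM", "TO"]
D = np.array([
    [  0, 662, 877, 255, 412, 996],
    [662,   0, 295, 468, 268, 400],
    [877, 295,   0, 754, 564, 138],
    [255, 468, 754,   0, 219, 869],
    [412, 268, 564, 219,   0, 669],
    [996, 400, 138, 869, 669,   0],
], dtype=float)

dist = {(a, b): D[i, j] for i, a in enumerate(cities)
        for j, b in enumerate(cities) if i < j}
clusters = [[c] for c in cities]

def d_single(c1, c2):
    # Single link: minimum pairwise distance between the two clusters.
    return min(dist[(a, b)] if (a, b) in dist else dist[(b, a)]
               for a in c1 for b in c2)

while len(clusters) > 1:
    # Find and merge the pair of closest clusters.
    i, j = min(((i, j) for i in range(len(clusters))
                for j in range(i + 1, len(clusters))),
               key=lambda p: d_single(clusters[p[0]], clusters[p[1]]))
    print("merge", "/".join(clusters[i]), "+", "/".join(clusters[j]),
          "at", d_single(clusters[i], clusters[j]))
    merged = clusters[i] + clusters[j]
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
```

Running it merges MI+TO at 138, NA+RM at 219, then BA, FI and finally MI/TO, matching the tables above.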

Min vs Max

44

Reading Materials

Text Books: Richard O. Duda et al., Pattern Classification, Chapter 10, John Wiley & Sons

J. Han and M. Kamber, Data Mining: Concepts and Techniques, Chapter 8, Morgan Kaufmann

Survey Papers: A. K. Jain, M. N. Murty and P. J. Flynn (1999), "Data Clustering: A Review", ACM Computing Surveys, Vol. 31(3), pp. 264-323

R. Xu and D. Wunsch (2005), "Survey of Clustering Algorithms", IEEE Transactions on Neural Networks, Vol. 16(3), pp. 645-678

A. K. Jain (2010), "Data Clustering: 50 Years Beyond K-Means", Pattern Recognition Letters, Vol. 31, pp. 651-666

Online Tutorials: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html http://www.autonlab.org/tutorials/kmeans.html http://users.informatik.uni-halle.de/~hinneburg/ClusterTutorial

45

Review

What is clustering?

What are the two categories of clustering methods?

How does the K-Means algorithm work?

What are the major issues of K-Means?

How to control the number of clusters in Sequential Leader Clustering?

How to use Gaussian mixture models for clustering?

What are the main advantages of density methods?

What is the core idea of DBSCAN?

What is the general procedure of hierarchical clustering?

Which clustering methods do not require K as the input?

46

Next Week's Class Talk

Volunteers are required for next week's class talk.

Topic: Affinity Propagation

Science 315, 972-976, 2007

Clustering by passing messages between points

http://www.psi.toronto.edu/index.php?q=affinity%20propagation

Topic: Clustering by Fast Search and Find of Density Peaks

Science 344, 1492-1496, 2014

Cluster centers: higher density than neighbors

Cluster centers: distant from other points with higher densities

Length: 20 minutes plus question time

47

Assignment

Topic: Clustering Techniques and Applications

Techniques:

K-Means

Another clustering method for comparison

Task 1: 2D Artificial Datasets

To demonstrate the influence of data patterns

To demonstrate the influence of algorithm factors

Task 2: Image Segmentation

Gray vs. Colour

Deliverables:

Reports (experiment specification, algorithm parameters, in-depth analysis)

Code (any programming language, with detailed comments)

Due: Sunday, 28 December

Credit: 15

48

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 28: LOGO Clustering Lecturer: Dr. Bo Yuan E-mail: yuanb@sz.tsinghua.edu.cn

Gaussian Mixture

28

)2()(

2

22

2

1)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethe

components mixture ofnumber the

points data ofnumber the

ijz

n

m

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points

Start from a randomly selected unseen point P

If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set

Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram

Clustering is obtained by cutting at desired level

No need to specify K in advance

May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link

Minimum distance between points

Complete Link

Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255

FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564

FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons

J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo

ACM Computing Surveys Vol 31(3) pp 264-323

R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678

A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation

Science 315 972ndash976 2007

Clustering by passing messages between points

httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks

Science 344 1492ndash1496 2014

Cluster centers higher density than neighbors

Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques

K-Means

Another clustering method for comparison

Task 1 2D Artificial Datasets

To demonstrate the influence of data patterns

To demonstrate the influence of algorithm factors

Task 2 Image Segmentation

Gray vs Colour

Deliverables

Reports (experiment specification algorithm parameters in-depth analysis)

Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 15 48

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 29: LOGO Clustering Lecturer: Dr. Bo Yuan E-mail: yuanb@sz.tsinghua.edu.cn

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethe

components mixture ofnumber the

points data ofnumber the

ijz

n

m

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points

Start from a randomly selected unseen point P

If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set

Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram

Clustering is obtained by cutting at desired level

No need to specify K in advance

May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link

Minimum distance between points

Complete Link

Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255

FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564

FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons

J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo

ACM Computing Surveys Vol 31(3) pp 264-323

R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678

A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation

Science 315 972ndash976 2007

Clustering by passing messages between points

httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks

Science 344 1492ndash1496 2014

Cluster centers higher density than neighbors

Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques

K-Means

Another clustering method for comparison

Task 1 2D Artificial Datasets

To demonstrate the influence of data patterns

To demonstrate the influence of algorithm factors

Task 2 Image Segmentation

Gray vs Colour

Deliverables

Reports (experiment specification algorithm parameters in-depth analysis)

Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 15 48

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 30: LOGO Clustering Lecturer: Dr. Bo Yuan E-mail: yuanb@sz.tsinghua.edu.cn

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethe

components mixture ofnumber the

points data ofnumber the

ijz

n

m

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points

Start from a randomly selected unseen point P

If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set

Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram

Clustering is obtained by cutting at desired level

No need to specify K in advance

May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link

Minimum distance between points

Complete Link

Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255

FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564

FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons

J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo

ACM Computing Surveys Vol 31(3) pp 264-323

R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678

A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation

Science 315 972ndash976 2007

Clustering by passing messages between points

httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks

Science 344 1492ndash1496 2014

Cluster centers higher density than neighbors

Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques

K-Means

Another clustering method for comparison

Task 1 2D Artificial Datasets

To demonstrate the influence of data patterns

To demonstrate the influence of algorithm factors

Task 2 Image Segmentation

Gray vs Colour

Deliverables

Reports (experiment specification algorithm parameters in-depth analysis)

Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 15 48

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 31: LOGO Clustering Lecturer: Dr. Bo Yuan E-mail: yuanb@sz.tsinghua.edu.cn

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethe

components mixture ofnumber the

points data ofnumber the

ijz

n

m

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points

Start from a randomly selected unseen point P

If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set

Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram

Clustering is obtained by cutting at desired level

No need to specify K in advance

May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link

Minimum distance between points

Complete Link

Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255

FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564

FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons

J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo

ACM Computing Surveys Vol 31(3) pp 264-323

R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678

A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation

Science 315 972ndash976 2007

Clustering by passing messages between points

httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks

Science 344 1492ndash1496 2014

Cluster centers higher density than neighbors

Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques

K-Means

Another clustering method for comparison

Task 1 2D Artificial Datasets

To demonstrate the influence of data patterns

To demonstrate the influence of algorithm factors

Task 2 Image Segmentation

Gray vs Colour

Deliverables

Reports (experiment specification algorithm parameters in-depth analysis)

Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 15 48

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 32: LOGO Clustering Lecturer: Dr. Bo Yuan E-mail: yuanb@sz.tsinghua.edu.cn

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethe

components mixture ofnumber the

points data ofnumber the

ijz

n

m

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points

Start from a randomly selected unseen point P

If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set

Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram

Clustering is obtained by cutting at desired level

No need to specify K in advance

May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link

Minimum distance between points

Complete Link

Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255

FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564

FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons

J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo

ACM Computing Surveys Vol 31(3) pp 264-323

R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678

A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation

Science 315 972ndash976 2007

Clustering by passing messages between points

httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks

Science 344 1492ndash1496 2014

Cluster centers higher density than neighbors

Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques

K-Means

Another clustering method for comparison

Task 1 2D Artificial Datasets

To demonstrate the influence of data patterns

To demonstrate the influence of algorithm factors

Task 2 Image Segmentation

Gray vs Colour

Deliverables

Reports (experiment specification algorithm parameters in-depth analysis)

Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 15 48

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 33: LOGO Clustering Lecturer: Dr. Bo Yuan E-mail: yuanb@sz.tsinghua.edu.cn

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethe

components mixture ofnumber the

points data ofnumber the

ijz

n

m

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points

Start from a randomly selected unseen point P

If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set

Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram

Clustering is obtained by cutting at desired level

No need to specify K in advance

May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link

Minimum distance between points

Complete Link

Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255

FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564

FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons

J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo

ACM Computing Surveys Vol 31(3) pp 264-323

R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678

A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation

Science 315 972ndash976 2007

Clustering by passing messages between points

httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks

Science 344 1492ndash1496 2014

Cluster centers higher density than neighbors

Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques

K-Means

Another clustering method for comparison

Task 1 2D Artificial Datasets

To demonstrate the influence of data patterns

To demonstrate the influence of algorithm factors

Task 2 Image Segmentation

Gray vs Colour

Deliverables

Reports (experiment specification algorithm parameters in-depth analysis)

Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 15 48

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 34: LOGO Clustering Lecturer: Dr. Bo Yuan E-mail: yuanb@sz.tsinghua.edu.cn

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points

Start from a randomly selected unseen point P

If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set

Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram

Clustering is obtained by cutting at desired level

No need to specify K in advance

May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link

Minimum distance between points

Complete Link

Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255

FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564

FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons

J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo

ACM Computing Surveys Vol 31(3) pp 264-323

R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678

A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation

Science 315 972ndash976 2007

Clustering by passing messages between points

httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks

Science 344 1492ndash1496 2014

Cluster centers higher density than neighbors

Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques

K-Means

Another clustering method for comparison

Task 1 2D Artificial Datasets

To demonstrate the influence of data patterns

To demonstrate the influence of algorithm factors

Task 2 Image Segmentation

Gray vs Colour

Deliverables

Reports (experiment specification algorithm parameters in-depth analysis)

Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 15 48

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 35: LOGO Clustering Lecturer: Dr. Bo Yuan E-mail: yuanb@sz.tsinghua.edu.cn

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points

Start from a randomly selected unseen point P

If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set

Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram

Clustering is obtained by cutting at desired level

No need to specify K in advance

May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link

Minimum distance between points

Complete Link

Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255

FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564

FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons

J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo

ACM Computing Surveys Vol 31(3) pp 264-323

R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678

A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation

Science 315 972ndash976 2007

Clustering by passing messages between points

httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks

Science 344 1492ndash1496 2014

Cluster centers higher density than neighbors

Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques

K-Means

Another clustering method for comparison

Task 1 2D Artificial Datasets

To demonstrate the influence of data patterns

To demonstrate the influence of algorithm factors

Task 2 Image Segmentation

Gray vs Colour

Deliverables

Reports (experiment specification algorithm parameters in-depth analysis)

Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 15 48

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 36: LOGO Clustering Lecturer: Dr. Bo Yuan E-mail: yuanb@sz.tsinghua.edu.cn

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points

Start from a randomly selected unseen point P

If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set

Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram

Clustering is obtained by cutting at desired level

No need to specify K in advance

May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link

Minimum distance between points

Complete Link

Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255

FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564

FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons

J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo

ACM Computing Surveys Vol 31(3) pp 264-323

R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678

A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation

Science 315 972ndash976 2007

Clustering by passing messages between points

httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks

Science 344 1492ndash1496 2014

Cluster centers higher density than neighbors

Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques

K-Means

Another clustering method for comparison

Task 1 2D Artificial Datasets

To demonstrate the influence of data patterns

To demonstrate the influence of algorithm factors

Task 2 Image Segmentation

Gray vs Colour

Deliverables

Reports (experiment specification algorithm parameters in-depth analysis)

Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 15 48

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 37: LOGO Clustering Lecturer: Dr. Bo Yuan E-mail: yuanb@sz.tsinghua.edu.cn

DBSCAN

A cluster is defined as the maximal set of density connected points

Start from a randomly selected unseen point P

If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set

Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram

Clustering is obtained by cutting at desired level

No need to specify K in advance

May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link

Minimum distance between points

Complete Link

Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255

FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564

FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons

J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo

ACM Computing Surveys Vol 31(3) pp 264-323

R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678

A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering?

What are the two categories of clustering methods?

How does the K-Means algorithm work?

What are the major issues of K-Means?

How to control the number of clusters in Sequential Leader Clustering?

How to use Gaussian mixture models for clustering?

What are the main advantages of density-based methods?

What is the core idea of DBSCAN?

What is the general procedure of hierarchical clustering?

Which clustering methods do not require K as an input?

46

Next Week's Class Talk

Volunteers are required for next week's class talk

Topic: Affinity Propagation

Science, 315: 972-976, 2007

Clustering by passing messages between points

http://www.psi.toronto.edu/index.php?q=affinity%20propagation

Topic: Clustering by Fast Search and Find of Density Peaks

Science, 344: 1492-1496, 2014

Cluster centers: higher density than neighbors

Cluster centers: distant from other points with higher densities

Length: 20 minutes plus question time

47
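For the second talk topic, those two criteria can be sketched in a few lines (a cutoff-kernel approximation only; d_c and the toy data are assumptions, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in ((0, 0), (4, 4))])

dist = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances

d_c = 1.0                              # cutoff distance (assumed)
rho = (dist < d_c).sum(axis=1) - 1     # local density: neighbours within d_c

# delta_i: distance to the nearest point with strictly higher density.
delta = np.empty(len(X))
for i in range(len(X)):
    higher = np.where(rho > rho[i])[0]
    delta[i] = dist[i, higher].min() if len(higher) else dist[i].max()

# Cluster centers are the points where both rho and delta are large.
print("candidate centers:", np.argsort(rho * delta)[-2:])
```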

Assignment

Topic: Clustering Techniques and Applications

Techniques:

K-Means

Another clustering method for comparison

Task 1: 2D Artificial Datasets

To demonstrate the influence of data patterns

To demonstrate the influence of algorithm factors

Task 2: Image Segmentation

Gray vs Colour

Deliverables:

Reports (experiment specification, algorithm parameters, in-depth analysis)

Code (any programming language, with detailed comments)

Due: Sunday, 28 December

Credit: 15

48
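One possible starting point for Task 2 (a sketch only, not a required solution; 'example.png' and k = 4 are placeholders):

```python
import numpy as np
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans

img = plt.imread('example.png')        # hypothetical input image in [0, 1]
h, w, c = img.shape
pixels = img.reshape(-1, c)            # one row per pixel (colour features)

k = 4                                  # number of segments (assumed)
km = KMeans(n_clusters=k, n_init=10).fit(pixels)

# Replace every pixel by its cluster centroid colour.
segmented = km.cluster_centers_[km.labels_].reshape(h, w, c)
plt.imsave('segmented.png', np.clip(segmented, 0, 1))
```

For the gray-scale variant, each pixel is a single intensity value, so pixels has one column instead of three.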
