Clustering
Lecturer: Dr. Bo Yuan
E-mail: [email protected]
Overview
Partitioning Methods
K-Means
Sequential Leader
Model Based Methods
Density Based Methods
Hierarchical Methods
2
What is cluster analysis
Finding groups of objects
Objects similar to each other are in the same group
Objects are different from those in other groups
Unsupervised Learning
No labels
Data driven
3
Clusters
4
Inter-Cluster
Intra-Cluster
Clusters
5
Applications of Clustering
Marketing: Finding groups of customers with similar behaviours
Biology: Finding groups of animals or plants with similar features
Bioinformatics: Clustering of microarray data, genes and sequences
Earthquake Studies: Clustering observed earthquake epicenters to identify dangerous zones
WWW: Clustering weblog data to discover groups of similar access patterns
Social Networks: Discovering groups of individuals with close friendships internally
6
Earthquakes
7
Image Segmentation
8
The Big Picture
9
Requirements
Scalability
Ability to deal with different types of attributes
Ability to discover clusters with arbitrary shape
Minimum requirements for domain knowledge
Ability to deal with noise and outliers
Insensitivity to order of input records
Incorporation of user-defined constraints
Interpretability and usability
10
Practical Considerations
11
Normalization or Not
12
Evaluation
13
The evaluation criterion is the within-cluster sum of squared errors:

J = \sum_{i=1}^{c} \sum_{x \in D_i} \lVert x - m_i \rVert^2

where c is the number of clusters, D_i is the i-th cluster and m_i is its centre; a lower J indicates more compact clusters (the slide compares two alternative clusterings).
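As a concrete illustration, here is a minimal sketch of this criterion for one-dimensional data (the function name `sse` and the list-of-lists layout are my own, not from the slides):

```python
def sse(clusters, centres):
    """J: sum of squared distances from each point to its cluster centre (1-D data)."""
    return sum((x - m) ** 2
               for cluster, m in zip(clusters, centres)
               for x in cluster)
```

For example, `sse([[0, 2], [10]], [1, 10])` evaluates (0-1)^2 + (2-1)^2 + (10-10)^2 = 2.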
Evaluation
14
The Influence of Outliers
15
[Figure: K = 2 clustering distorted by a single outlier]
K-Means
16
K-Means
17
K-Means
18
K-Means
Determine the value of K
Choose K cluster centres randomly
Each data point is assigned to its closest centroid
Use the mean of each cluster to update each centroid
Repeat until no new assignments are made
Return the K centroids
Reference: J. MacQueen (1967). "Some Methods for Classification and Analysis of Multivariate Observations". Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, pp. 281-297.
19
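The steps above can be sketched in a few lines. This is a deliberately plain implementation for 2-D points, not MacQueen's original formulation; the function name and the fixed random seed are my own choices:

```python
import random

def k_means(points, k, max_iter=100, seed=0):
    """Plain K-Means on points given as [x, y] lists."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # choose K cluster centres randomly
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each data point goes to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for j, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append([sum(xs) / len(cluster) for xs in zip(*cluster)])
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centre
        if new_centroids == centroids:  # no more new assignments
            break
        centroids = new_centroids
    return centroids, clusters
```

On two well-separated groups, e.g. `k_means([[0, 0], [0, 1], [10, 10], [10, 11]], 2)`, the centroids converge to the two group means, (0, 0.5) and (10, 10.5).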
Comments on K-Means
Pros
Simple and works well for regular disjoint clusters
Converges relatively fast
Relatively efficient and scalable: O(t·k·n), where t is the number of iterations, k the number of centroids and n the number of data points
Cons
Need to specify the value of K in advance (difficult; domain knowledge may help)
May converge to local optima (in practice, try several different initial centroids)
May be sensitive to noisy data and outliers (the mean is easily pulled away by extreme points)
Not suitable for clusters of non-convex shapes
20
The Influence of Initial Centroids
21
The Influence of Initial Centroids
22
The K-Medoids Method
The basic idea is to use real data points as centres
Determine the value of K in advance
Randomly select K points as medoids
Assign each data point to the closest medoid
Calculate the cost of the configuration J
For each medoid m
For each non-medoid point o
Swap m and o and calculate the new cost of the configuration, J′
If the cost of the best new configuration J′ is lower than J, make the corresponding swap and repeat the above steps
Otherwise, terminate the procedure
23
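The swap loop above can be sketched as follows. For brevity this works on 1-D points with absolute distance and a simple greedy pass; the names `total_cost` and `k_medoids` are mine, and this is a naive version of the method, not an optimized PAM implementation:

```python
def total_cost(points, medoids):
    """J: sum of distances from each point to its closest medoid (1-D data)."""
    return sum(min(abs(p - m) for m in medoids) for p in points)

def k_medoids(points, k):
    """Greedily swap medoids with non-medoid points while the cost J improves."""
    medoids = list(points[:k])  # arbitrary initial medoids (real data points)
    cost = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        best = (cost, medoids)
        for m in medoids:                      # for each medoid m
            for o in points:                   # for each non-medoid point o
                if o in medoids:
                    continue
                candidate = [o if x == m else x for x in medoids]
                c = total_cost(points, candidate)  # cost after swapping m and o
                if c < best[0]:
                    best = (c, candidate)
        if best[0] < cost:                     # best swap lowers J: apply and repeat
            cost, medoids = best
            improved = True
    return medoids, cost
```

For `k_medoids([1, 2, 3, 20, 21, 22], 2)` the procedure settles on the medoids 2 and 21 with cost 4.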
The K-Medoids Method
24
[Figure: two medoid configurations, Cost = 20 vs Cost = 26]
Sequential Leader Clustering
A very efficient clustering algorithm
No iteration
Time complexity O(nk)
No need to specify K in advance
Choose a cluster threshold value
For every new data point
Compute the distance between the new data point and every cluster's centre
If the smallest distance is below the chosen threshold, assign the new data point to the corresponding cluster and re-compute that cluster's centre
Otherwise, create a new cluster with the new data point as its centre
Clustering results may be influenced by the sequence of data points
25
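A one-pass sketch of the procedure for 1-D data follows; the function name and the running-mean update of the centre are my own choices (the slides do not fix how the centre is re-computed):

```python
def sequential_leader(points, threshold):
    """Single-pass leader clustering on 1-D data; centres are running means."""
    centres, members = [], []
    for p in points:
        if centres:
            # Distance between the new data point and every cluster centre.
            j = min(range(len(centres)), key=lambda c: abs(p - centres[c]))
            if abs(p - centres[j]) < threshold:
                members[j].append(p)
                centres[j] = sum(members[j]) / len(members[j])  # re-compute centre
                continue
        centres.append(p)       # otherwise: new cluster led by this point
        members.append([p])
    return centres, members
```

With `sequential_leader([1, 2, 10, 11], 3)` the stream splits into two clusters with centres 1.5 and 10.5; feeding the same points in a different order can change the result, which is exactly the order sensitivity noted above.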
Silhouette
A method of interpretation and validation of clusters of data
A succinct graphical representation of how well each data point lies within its cluster compared to other clusters
a(i): the average dissimilarity of i with all other points in the same cluster
b(i): the lowest average dissimilarity of i to the points of any other cluster
26
s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}
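The definition translates directly into code. This sketch computes s(i) for a single 1-D point; the helper names are mine, and the point itself is assumed to be excluded from its own cluster list:

```python
def mean_dist(p, group):
    """Average dissimilarity of point p to a group of 1-D points."""
    return sum(abs(p - q) for q in group) / len(group)

def silhouette_value(p, own, others):
    """s(p): own = p's cluster without p itself, others = the remaining clusters."""
    a = mean_dist(p, own)                       # a(i): within-cluster dissimilarity
    b = min(mean_dist(p, c) for c in others)    # b(i): nearest other cluster
    return (b - a) / max(a, b)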
Silhouette
27
[Figure: silhouette plot (silhouette values from -0.2 to 1 for clusters 1 and 2) next to a scatter plot of the clustered data]
Gaussian Mixture
28
g(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-(x-\mu)^2 / (2\sigma^2)}

f(x) = \sum_{i=1}^{n} \alpha_i \, g_i(x), \qquad \alpha_i \ge 0, \quad \sum_{i=1}^{n} \alpha_i = 1
Clustering by Mixture Models
29
K-Means Revisited
30
θ = ((x₁, y₁), (x₂, y₂)): the model parameters (the cluster centres)
Z = (Cluster 1, Cluster 2): the latent variables (the cluster assignments)
Expectation Maximization
31
32
EM Gaussian Mixture
33
Notation:
z_ij: whether instance i is generated by the j-th Gaussian
n: the number of mixture components
m: the number of data points

E-step (expected responsibilities, assuming a shared variance \sigma^2):

E[z_{ij}] = \frac{p(x = x_i \mid \mu = \mu_j)}{\sum_{k=1}^{n} p(x = x_i \mid \mu = \mu_k)} = \frac{e^{-(x_i - \mu_j)^2 / 2\sigma^2}}{\sum_{k=1}^{n} e^{-(x_i - \mu_k)^2 / 2\sigma^2}}

M-step (re-estimated means and mixing weights):

\mu_j = \frac{\sum_{i=1}^{m} E[z_{ij}] \, x_i}{\sum_{i=1}^{m} E[z_{ij}]}, \qquad \alpha_j = \frac{1}{m} \sum_{i=1}^{m} E[z_{ij}]
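The two steps alternate until the means stop moving. A minimal 1-D sketch under the same simplifying assumptions as the slides (equal, fixed variance; for brevity the mixing weights are kept uniform rather than re-estimated):

```python
import math

def em_gmm(xs, mus, sigma=1.0, steps=50):
    """EM for a 1-D Gaussian mixture with shared fixed variance.
    Re-estimates only the component means."""
    for _ in range(steps):
        # E-step: E[z_ij] proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2)).
        resp = []
        for x in xs:
            w = [math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) for mu in mus]
            s = sum(w)
            resp.append([wi / s for wi in w])
        # M-step: mu_j = sum_i E[z_ij] x_i / sum_i E[z_ij].
        mus = [sum(r[j] * x for r, x in zip(resp, xs)) / sum(r[j] for r in resp)
               for j in range(len(mus))]
    return mus
```

Starting from means 0 and 10 on the data [0, 1, 9, 10], the means converge to roughly 0.5 and 9.5, i.e. each Gaussian claims one of the two groups.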
Density Based Methods
Generate clusters of arbitrary shapes
Robust against noise
No K value required in advance
Somewhat similar to human vision
34
DBSCAN
Density-Based Spatial Clustering of Applications with Noise
Density: the number of points within a specified radius
Core Point: a point with high density
Border Point: a point with low density that lies in the neighbourhood of a core point
Noise Point: a point that is neither a core point nor a border point
35
[Figure: examples of a core point, a border point and a noise point]
DBSCAN
36
[Figure: p and q, directly density reachable; p and q, density reachable; p and q, density connected through o]
DBSCAN
A cluster is defined as the maximal set of density connected points
Start from a randomly selected unseen point P
If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set
Noise points are discarded (unlabelled)
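The cluster-growing procedure can be sketched as below. This is a naive O(n²) version on 1-D points with absolute distance; the label -1 for noise and the counting of a point as its own neighbour are my conventions:

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN on 1-D points; returns index -> cluster label (-1 = noise)."""
    labels = {i: None for i in range(len(points))}

    def neighbours(i):
        return [j for j in range(len(points)) if abs(points[i] - points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:   # not a core point; may later become a border point
            labels[i] = -1
            continue
        labels[i] = cluster        # start a new cluster from this core point
        queue = [j for j in seeds if j != i]
        while queue:               # add all density-reachable points
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster    # noise reclassified as a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_pts:     # j is itself a core point: keep expanding
                queue.extend(nb)
        cluster += 1
    return labels
```

On `[1, 2, 3, 10, 11, 12, 50]` with `eps=1.5, min_pts=2`, the two dense runs form two clusters and the isolated 50 stays labelled as noise.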
37
Hierarchical Clustering
Produce a set of nested tree-like clusters
Can be visualized as a dendrogram
Clustering is obtained by cutting at desired level
No need to specify K in advance
May correspond to meaningful taxonomies
38
Dinosaur Family Tree
39
Agglomerative Methods
Bottom-up Method
Assign each data point to its own cluster
Calculate the proximity matrix
Merge the pair of closest clusters
Repeat until only a single cluster remains
How to calculate the distance between clusters
Single Link
Minimum distance between points
Complete Link
Maximum distance between points
40
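The loop above, with the single-link rule, can be sketched for 1-D points as follows (function names and the returned list of merge distances are my own; complete link would only change the inner `min` to `max`):

```python
def single_link(clusters, dist):
    """One merge step: combine the two clusters whose closest members are nearest."""
    best = None
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            d = min(dist(p, q) for p in clusters[a] for q in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
    d, a, b = best
    merged = clusters[:a] + clusters[a + 1:b] + clusters[b + 1:]
    merged.append(clusters[a] + clusters[b])
    return merged, d

def agglomerate(points, dist=lambda p, q: abs(p - q)):
    """Start from singleton clusters; merge until one remains, recording distances."""
    clusters = [[p] for p in points]
    history = []
    while len(clusters) > 1:
        clusters, d = single_link(clusters, dist)
        history.append(d)
    return history
```

For `agglomerate([1, 2, 10])`, the first merge joins 1 and 2 at distance 1, then the pair joins 10 at single-link distance 8; these merge heights are what a dendrogram plots.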
Example
41
     BA   FI   MI   NA   RM   TO
BA    0  662  877  255  412  996
FI  662    0  295  468  268  400
MI  877  295    0  754  564  138
NA  255  468  754    0  219  869
RM  412  268  564  219    0  669
TO  996  400  138  869  669    0

(Single Link)
Example
42
        BA   FI  MI/TO   NA   RM
BA       0  662   877   255  412
FI     662    0   295   468  268
MI/TO  877  295     0   754  564
NA     255  468   754     0  219
RM     412  268   564   219    0

        BA   FI  MI/TO  NA/RM
BA       0  662   877    255
FI     662    0   295    268
MI/TO  877  295     0    564
NA/RM  255  268   564      0
Example
43
           BA/NA/RM   FI  MI/TO
BA/NA/RM         0   268   564
FI             268     0   295
MI/TO          564   295     0

             BA/FI/NA/RM  MI/TO
BA/FI/NA/RM            0    295
MI/TO                295      0
Min vs Max
44
Reading Materials
Text Books:
Richard O. Duda et al., Pattern Classification, Chapter 10, John Wiley & Sons.
J. Han and M. Kamber, Data Mining: Concepts and Techniques, Chapter 8, Morgan Kaufmann.
Survey Papers:
A. K. Jain, M. N. Murty and P. J. Flynn (1999). "Data Clustering: A Review". ACM Computing Surveys, Vol. 31(3), pp. 264-323.
R. Xu and D. Wunsch (2005). "Survey of Clustering Algorithms". IEEE Transactions on Neural Networks, Vol. 16(3), pp. 645-678.
A. K. Jain (2010). "Data Clustering: 50 Years Beyond K-Means". Pattern Recognition Letters, Vol. 31, pp. 651-666.
Online Tutorials:
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html
http://www.autonlab.org/tutorials/kmeans.html
http://users.informatik.uni-halle.de/~hinnebur/ClusterTutorial
45
Review
What is clustering?
What are the two categories of clustering methods?
How does the K-Means algorithm work?
What are the major issues of K-Means?
How to control the number of clusters in Sequential Leader Clustering?
How to use Gaussian mixture models for clustering?
What are the main advantages of density-based methods?
What is the core idea of DBSCAN?
What is the general procedure of hierarchical clustering?
Which clustering methods do not require K as an input?
46
Next Weekrsquos Class Talk
Volunteers are required for next weekrsquos class talk
Topic 1: Affinity Propagation
Science, 315, 972–976, 2007
"Clustering by passing messages between data points"
http://www.psi.toronto.edu/index.php?q=affinity%20propagation
Topic 2: Clustering by Fast Search and Find of Density Peaks
Science, 344, 1492–1496, 2014
Cluster centers have a higher density than their neighbors
Cluster centers are distant from other points with higher densities
Length 20 minutes plus question time
47
Assignment
Topic Clustering Techniques and Applications
Techniques
K-Means
Another clustering method for comparison
Task 1 2D Artificial Datasets
To demonstrate the influence of data patterns
To demonstrate the influence of algorithm factors
Task 2 Image Segmentation
Gray vs Colour
Deliverables
Reports (experiment specification, algorithm parameters, in-depth analysis)
Code (any programming language, with detailed comments)
Due: Sunday, 28 December
Credit: 15
48
Overview
Partitioning Methods
K-Means
Sequential Leader
Model Based Methods
Density Based Methods
Hierarchical Methods
2
What is cluster analysis
Finding groups of objects
Objects similar to each other are in the same group
Objects are different from those in other groups
Unsupervised Learning
No labels
Data driven
3
Clusters
4
Inter-Cluster
Intra-Cluster
Clusters
5
Applications of Clustering
Marketing Finding groups of customers with similar behaviours
Biology Finding groups of animals or plants with similar features
Bioinformatics Clustering of microarray data genes and sequences
Earthquake Studies Clustering observed earthquake epicenters to identify dangerous zones
WWW Clustering weblog data to discover groups of similar access patterns
Social Networks Discovering groups of individuals with close friendships internally
6
Earthquakes
7
Image Segmentation
8
The Big Picture
9
Requirements
Scalability
Ability to deal with different types of attributes
Ability to discover clusters with arbitrary shape
Minimum requirements for domain knowledge
Ability to deal with noise and outliers
Insensitivity to order of input records
Incorporation of user-defined constraints
Interpretability and usability
10
Practical Considerations
11
Normalization or Not
12
Evaluation
13
ii Dxi
i
c
i Dxie x
nmmxJ
1
1
2
VS
Evaluation
14
The Influence of Outliers
15
outlier
K=2
K-Means
16
K-Means
17
K-Means
18
K-Means
Determine the value of K
Choose K cluster centres randomly
Each data point is assigned to its closest centroid
Use the mean of each cluster to update each centroid
Repeat until no more new assignment
Return the K centroids
Reference J MacQueen (1967) Some Methods for Classification and Analysis of
Multivariate Observations Proceedings of the 5th Berkeley Symposium on
Mathematical Statistics and Probability vol1 pp 281-297
19
Comments on K-Means
Pros
Simple and works well for regular disjoint clusters
Converges relatively fast
Relatively efficient and scalable O(tkn)bull t iteration k number of centroids n number of data points
Cons
Need to specify the value of K in advancebull Difficult and domain knowledge may help
May converge to local optimabull In practice try different initial centroids
May be sensitive to noisy data and outliersbull Mean of data points hellip
Not suitable for clusters of bull Non-convex shapes
20
The Influence of Initial Centroids
21
The Influence of Initial Centroids
22
The K-Medoids Method
The basic idea is to use real data points as centres
Determine the value of K in advance
Randomly select K points as medoids
Assign each data point to the closest medoid
Calculate the cost of the configuration J
For each medoid m
For each non-medoid point o
Swap m and o and calculate the new cost of configuration Jprime
If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps
Otherwise terminate the procedure23
The K-Medoids Method
24
Cost =20 Cost =26
Sequential Leader Clustering
A very efficient clustering algorithm
No iteration
Time complexity O(nk)
No need to specify K in advance
Choose a cluster threshold value
For every new data point
Compute the distance between the new data point and every clusters centre
If the distance is smaller than the chosen threshold assign the new data point
to the corresponding cluster and re-compute cluster centre
Otherwise create a new cluster with the new data point as its centre
Clustering results may be influenced by the sequence of data points
25
Silhouette
A method of interpretation and validation of clusters of data
A succinct graphical representation of how well each data point lies within its cluster compared to other clusters
a(i) average dissimilarity of i with all other points in the same cluster
b(i) the lowest average dissimilarity of i to other clusters
26
)()(max
)()()(
iaib
iaibis
Silhouette
27
-02 0 02 04 06 08 1
1
2
Silhouette Value
Clu
ster
-3 -2 -1 0 1 2 3 4-3
-2
-1
0
1
2
3
4
Gaussian Mixture
28
)2()(
2
22
2
1)(
xexg
1amp0)()(1
i
ii
n
iiii xgxf
Clustering by Mixture Models
29
K-Means Revisited
30
120579=(1199091 1199101 ) (1199092 119910 2)
119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2
model parameters
latent parameters
Expectation Maximization
31
32
EM Gaussian Mixture
33
Gaussianjth by the generated is i instancer whethe
components mixture ofnumber the
points data ofnumber the
ijz
n
m
n
kk
x
j
x
n
kkki
jjiij
ki
ji
e
e
xxp
xxpzE
1
)(2
1
)(2
1
1
22
22
)|(
)|(][
m
iij
m
iiij
j
zE
xzE
1
1
][
][
m
iijj zE
m 1
][1
Density Based Methods
Generate clusters of arbitrary shapes
Robust against noise
No K value required in advance
Somewhat similar to human vision
34
DBSCAN
Density-Based Spatial Clustering of Applications with Noise
Density number of points within a specified radius
Core Point points with high density
Border Point points with low density but in the neighbourhood of a core point
Noise Point neither a core point nor a border point
35
Core Point
Noise Point
Border Point
DBSCAN
36
p
q
directly density reachable
p
q
density reachable
o
qp
density connected
DBSCAN
A cluster is defined as the maximal set of density connected points
Start from a randomly selected unseen point P
If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set
Noise points are discarded (unlabelled)
37
Hierarchical Clustering
Produce a set of nested tree-like clusters
Can be visualized as a dendrogram
Clustering is obtained by cutting at desired level
No need to specify K in advance
May correspond to meaningful taxonomies
38
Dinosaur Family Tree
39
Agglomerative Methods
Bottom-up Method
Assign each data point to a cluster
Calculate the proximity matrix
Merge the pair of closest clusters
Repeat until only a single cluster remains
How to calculate the distance between clusters
Single Link
Minimum distance between points
Complete Link
Maximum distance between points
40
Example
41
BA FI MI NA RM TO
BA 0 662 877 255 412 996
FI 662 0 295 468 268 400
MI 877 295 0 754 564 138
NA 255 468 754 0 219 869
RM 412 268 564 219 0 669
TO 996 400 138 869 669 0Single Link
Example
42
BA FI MITO NA RM
BA 0 662 877 255 412
FI 662 0 295 468 268
MITO 877 295 0 754 564
NA 255 468 754 0 219
RM 412 268 564 219 0
BA FI MITO NARM
BA 0 662 877 255
FI 662 0 295 268
MITO 877 295 0 564
NARM 255 268 564 0
Example
43
BANARM FI MITO
BANARM 0 268 564
FI 268 0 295
MITO 564 295 0
BAFINARM MITO
BAFINARM 0 295
MITO 295 0
Min vs Max
44
Reading Materials
Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons
J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann
Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo
ACM Computing Surveys Vol 31(3) pp 264-323
R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678
A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666
Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial
45
Review
What is clustering
What are the two categories of clustering methods
How does the K-Means algorithm work
What are the major issues of K-Means
How to control the number of clusters in Sequential Leader Clustering
How to use Gaussian mixture models for clustering
What are the main advantages of density methods
What is the core idea of DBSCAN
What is the general procedure of hierarchical clustering
Which clustering methods do not require K as the input
46
Next Weekrsquos Class Talk
Volunteers are required for next weekrsquos class talk
Topic Affinity Propagation
Science 315 972ndash976 2007
Clustering by passing messages between points
httpwwwpsitorontoeduindexphpq=affinity20propagation
Topic Clustering by Fast Search and Find of Density Peaks
Science 344 1492ndash1496 2014
Cluster centers higher density than neighbors
Cluster centers distant from others points with higher densities
Length 20 minutes plus question time
47
Assignment
Topic Clustering Techniques and Applications
Techniques
K-Means
Another clustering method for comparison
Task 1 2D Artificial Datasets
To demonstrate the influence of data patterns
To demonstrate the influence of algorithm factors
Task 2 Image Segmentation
Gray vs Colour
Deliverables
Reports (experiment specification algorithm parameters in-depth analysis)
Code (any programming language with detailed comments)
Due Sunday 28 December
Credit 15 48
- Clustering
- Overview
- What is cluster analysis
- Clusters
- Clusters (2)
- Applications of Clustering
- Earthquakes
- Image Segmentation
- The Big Picture
- Requirements
- Practical Considerations
- Normalization or Not
- Evaluation
- Evaluation (2)
- The Influence of Outliers
- K-Means
- K-Means (2)
- K-Means (3)
- K-Means (4)
- Comments on K-Means
- The Influence of Initial Centroids
- The Influence of Initial Centroids (2)
- The K-Medoids Method
- The K-Medoids Method (2)
- Sequential Leader Clustering
- Silhouette
- Silhouette (2)
- Gaussian Mixture
- Clustering by Mixture Models
- K-Means Revisited
- Expectation Maximization
- Slide 32
- EM Gaussian Mixture
- Density Based Methods
- DBSCAN
- DBSCAN (2)
- DBSCAN (3)
- Hierarchical Clustering
- Dinosaur Family Tree
- Agglomerative Methods
- Example
- Example (2)
- Example (3)
- Min vs Max
- Reading Materials
- Review
- Next Weekrsquos Class Talk
- Assignment
-
What is cluster analysis
Finding groups of objects
Objects similar to each other are in the same group
Objects are different from those in other groups
Unsupervised Learning
No labels
Data driven
3
Clusters
4
Inter-Cluster
Intra-Cluster
Clusters
5
Applications of Clustering
Marketing Finding groups of customers with similar behaviours
Biology Finding groups of animals or plants with similar features
Bioinformatics Clustering of microarray data genes and sequences
Earthquake Studies Clustering observed earthquake epicenters to identify dangerous zones
WWW Clustering weblog data to discover groups of similar access patterns
Social Networks Discovering groups of individuals with close friendships internally
6
Earthquakes
7
Image Segmentation
8
The Big Picture
9
Requirements
Scalability
Ability to deal with different types of attributes
Ability to discover clusters with arbitrary shape
Minimum requirements for domain knowledge
Ability to deal with noise and outliers
Insensitivity to order of input records
Incorporation of user-defined constraints
Interpretability and usability
10
Practical Considerations
11
Normalization or Not
12
Evaluation
13
ii Dxi
i
c
i Dxie x
nmmxJ
1
1
2
VS
Evaluation
14
The Influence of Outliers
15
outlier
K=2
K-Means
16
K-Means
17
K-Means
18
K-Means
Determine the value of K
Choose K cluster centres randomly
Each data point is assigned to its closest centroid
Use the mean of each cluster to update each centroid
Repeat until no more new assignment
Return the K centroids
Reference J MacQueen (1967) Some Methods for Classification and Analysis of
Multivariate Observations Proceedings of the 5th Berkeley Symposium on
Mathematical Statistics and Probability vol1 pp 281-297
19
Comments on K-Means
Pros
Simple and works well for regular disjoint clusters
Converges relatively fast
Relatively efficient and scalable O(tkn)bull t iteration k number of centroids n number of data points
Cons
Need to specify the value of K in advancebull Difficult and domain knowledge may help
May converge to local optimabull In practice try different initial centroids
May be sensitive to noisy data and outliersbull Mean of data points hellip
Not suitable for clusters of bull Non-convex shapes
20
The Influence of Initial Centroids
21
The Influence of Initial Centroids
22
The K-Medoids Method
The basic idea is to use real data points as centres
Determine the value of K in advance
Randomly select K points as medoids
Assign each data point to the closest medoid
Calculate the cost of the configuration J
For each medoid m
For each non-medoid point o
Swap m and o and calculate the new cost of configuration Jprime
If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps
Otherwise terminate the procedure23
The K-Medoids Method
24
Cost =20 Cost =26
Sequential Leader Clustering
A very efficient clustering algorithm
No iteration
Time complexity O(nk)
No need to specify K in advance
Choose a cluster threshold value
For every new data point
Compute the distance between the new data point and every clusters centre
If the distance is smaller than the chosen threshold assign the new data point
to the corresponding cluster and re-compute cluster centre
Otherwise create a new cluster with the new data point as its centre
Clustering results may be influenced by the sequence of data points
25
Silhouette
A method of interpretation and validation of clusters of data
A succinct graphical representation of how well each data point lies within its cluster compared to other clusters
a(i) average dissimilarity of i with all other points in the same cluster
b(i) the lowest average dissimilarity of i to other clusters
26
)()(max
)()()(
iaib
iaibis
Silhouette
27
-02 0 02 04 06 08 1
1
2
Silhouette Value
Clu
ster
-3 -2 -1 0 1 2 3 4-3
-2
-1
0
1
2
3
4
Gaussian Mixture
28
)2()(
2
22
2
1)(
xexg
1amp0)()(1
i
ii
n
iiii xgxf
Clustering by Mixture Models
29
K-Means Revisited
30
120579=(1199091 1199101 ) (1199092 119910 2)
119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2
model parameters
latent parameters
Expectation Maximization
31
32
EM Gaussian Mixture
33
Gaussianjth by the generated is i instancer whethe
components mixture ofnumber the
points data ofnumber the
ijz
n
m
n
kk
x
j
x
n
kkki
jjiij
ki
ji
e
e
xxp
xxpzE
1
)(2
1
)(2
1
1
22
22
)|(
)|(][
m
iij
m
iiij
j
zE
xzE
1
1
][
][
m
iijj zE
m 1
][1
Density Based Methods
Generate clusters of arbitrary shapes
Robust against noise
No K value required in advance
Somewhat similar to human vision
34
DBSCAN
Density-Based Spatial Clustering of Applications with Noise
Density number of points within a specified radius
Core Point points with high density
Border Point points with low density but in the neighbourhood of a core point
Noise Point neither a core point nor a border point
35
Core Point
Noise Point
Border Point
DBSCAN
36
p
q
directly density reachable
p
q
density reachable
o
qp
density connected
DBSCAN
A cluster is defined as the maximal set of density connected points
Start from a randomly selected unseen point P
If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set
Noise points are discarded (unlabelled)
37
Hierarchical Clustering
Produce a set of nested tree-like clusters
Can be visualized as a dendrogram
Clustering is obtained by cutting at desired level
No need to specify K in advance
May correspond to meaningful taxonomies
38
Dinosaur Family Tree
39
Agglomerative Methods
Bottom-up Method
Assign each data point to a cluster
Calculate the proximity matrix
Merge the pair of closest clusters
Repeat until only a single cluster remains
How to calculate the distance between clusters
Single Link
Minimum distance between points
Complete Link
Maximum distance between points
40
Example
41
BA FI MI NA RM TO
BA 0 662 877 255 412 996
FI 662 0 295 468 268 400
MI 877 295 0 754 564 138
NA 255 468 754 0 219 869
RM 412 268 564 219 0 669
TO 996 400 138 869 669 0Single Link
Example
42
BA FI MITO NA RM
BA 0 662 877 255 412
FI 662 0 295 468 268
MITO 877 295 0 754 564
NA 255 468 754 0 219
RM 412 268 564 219 0
BA FI MITO NARM
BA 0 662 877 255
FI 662 0 295 268
MITO 877 295 0 564
NARM 255 268 564 0
Example
43
BANARM FI MITO
BANARM 0 268 564
FI 268 0 295
MITO 564 295 0
BAFINARM MITO
BAFINARM 0 295
MITO 295 0
Min vs Max
44
Reading Materials
Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons
J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann
Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo
ACM Computing Surveys Vol 31(3) pp 264-323
R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678
A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666
Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial
45
Review
What is clustering
What are the two categories of clustering methods
How does the K-Means algorithm work
What are the major issues of K-Means
How to control the number of clusters in Sequential Leader Clustering
How to use Gaussian mixture models for clustering
What are the main advantages of density methods
What is the core idea of DBSCAN
What is the general procedure of hierarchical clustering
Which clustering methods do not require K as the input
46
Next Weekrsquos Class Talk
Volunteers are required for next weekrsquos class talk
Topic Affinity Propagation
Science 315 972ndash976 2007
Clustering by passing messages between points
httpwwwpsitorontoeduindexphpq=affinity20propagation
Topic Clustering by Fast Search and Find of Density Peaks
Science 344 1492ndash1496 2014
Cluster centers higher density than neighbors
Cluster centers distant from others points with higher densities
Length 20 minutes plus question time
47
Assignment
Topic Clustering Techniques and Applications
Techniques
K-Means
Another clustering method for comparison
Task 1 2D Artificial Datasets
To demonstrate the influence of data patterns
To demonstrate the influence of algorithm factors
Task 2 Image Segmentation
Gray vs Colour
Deliverables
Reports (experiment specification algorithm parameters in-depth analysis)
Code (any programming language with detailed comments)
Due Sunday 28 December
Credit 15 48
- Clustering
- Overview
- What is cluster analysis
- Clusters
- Clusters (2)
- Applications of Clustering
- Earthquakes
- Image Segmentation
- The Big Picture
- Requirements
- Practical Considerations
- Normalization or Not
- Evaluation
- Evaluation (2)
- The Influence of Outliers
- K-Means
- K-Means (2)
- K-Means (3)
- K-Means (4)
- Comments on K-Means
- The Influence of Initial Centroids
- The Influence of Initial Centroids (2)
- The K-Medoids Method
- The K-Medoids Method (2)
- Sequential Leader Clustering
- Silhouette
- Silhouette (2)
- Gaussian Mixture
- Clustering by Mixture Models
- K-Means Revisited
- Expectation Maximization
- Slide 32
- EM Gaussian Mixture
- Density Based Methods
- DBSCAN
- DBSCAN (2)
- DBSCAN (3)
- Hierarchical Clustering
- Dinosaur Family Tree
- Agglomerative Methods
- Example
- Example (2)
- Example (3)
- Min vs Max
- Reading Materials
- Review
- Next Weekrsquos Class Talk
- Assignment
-
Clusters
4
Inter-Cluster
Intra-Cluster
Clusters
5
Applications of Clustering
Marketing Finding groups of customers with similar behaviours
Biology Finding groups of animals or plants with similar features
Bioinformatics Clustering of microarray data genes and sequences
Earthquake Studies Clustering observed earthquake epicenters to identify dangerous zones
WWW Clustering weblog data to discover groups of similar access patterns
Social Networks Discovering groups of individuals with close friendships internally
6
Earthquakes
7
Image Segmentation
8
The Big Picture
9
Requirements
Scalability
Ability to deal with different types of attributes
Ability to discover clusters with arbitrary shape
Minimum requirements for domain knowledge
Ability to deal with noise and outliers
Insensitivity to order of input records
Incorporation of user-defined constraints
Interpretability and usability
10
Practical Considerations
11
Normalization or Not
12
Evaluation
13
ii Dxi
i
c
i Dxie x
nmmxJ
1
1
2
VS
Evaluation
14
The Influence of Outliers
15
outlier
K=2
K-Means
16
K-Means
17
K-Means
18
K-Means
Determine the value of K
Choose K cluster centres randomly
Each data point is assigned to its closest centroid
Use the mean of each cluster to update each centroid
Repeat until no more new assignment
Return the K centroids
Reference J MacQueen (1967) Some Methods for Classification and Analysis of
Multivariate Observations Proceedings of the 5th Berkeley Symposium on
Mathematical Statistics and Probability vol1 pp 281-297
19
Comments on K-Means
Pros
Simple and works well for regular disjoint clusters
Converges relatively fast
Relatively efficient and scalable O(tkn)bull t iteration k number of centroids n number of data points
Cons
Need to specify the value of K in advancebull Difficult and domain knowledge may help
May converge to local optimabull In practice try different initial centroids
May be sensitive to noisy data and outliersbull Mean of data points hellip
Not suitable for clusters of bull Non-convex shapes
20
The Influence of Initial Centroids
21
The Influence of Initial Centroids
22
The K-Medoids Method
The basic idea is to use real data points as centres
Determine the value of K in advance
Randomly select K points as medoids
Assign each data point to the closest medoid
Calculate the cost of the configuration J
For each medoid m
For each non-medoid point o
Swap m and o and calculate the new cost of configuration Jprime
If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps
Otherwise terminate the procedure23
The K-Medoids Method
24
Cost =20 Cost =26
Sequential Leader Clustering
A very efficient clustering algorithm
No iteration
Time complexity O(nk)
No need to specify K in advance
Choose a cluster threshold value
For every new data point
Compute the distance between the new data point and every clusters centre
If the distance is smaller than the chosen threshold assign the new data point
to the corresponding cluster and re-compute cluster centre
Otherwise create a new cluster with the new data point as its centre
Clustering results may be influenced by the sequence of data points
25
Silhouette
A method of interpretation and validation of clusters of data
A succinct graphical representation of how well each data point lies within its cluster compared to other clusters
a(i): the average dissimilarity of i to all other points in the same cluster
b(i): the lowest average dissimilarity of i to the points of any other cluster
26
s(i) = (b(i) − a(i)) / max(a(i), b(i))
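The definition above translates directly into code; the following is an illustrative sketch (not from the slides), using Euclidean distance as the dissimilarity and the convention s(i) = 0 for singleton clusters:

```python
import numpy as np

def silhouette(X, labels):
    """Per-point silhouette s(i) = (b(i) - a(i)) / max(a(i), b(i)).
    a(i): mean distance to the other points of i's own cluster.
    b(i): lowest mean distance to the points of any other cluster."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        if own.sum() == 1:
            continue  # singleton cluster: s(i) = 0 by convention
        a = D[i, own].sum() / (own.sum() - 1)  # exclude i itself
        b = min(D[i, labels == c].mean()
                for c in set(labels.tolist()) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s
```

Values near 1 mean a point sits well inside its cluster; values near 0 mean it lies between clusters; negative values suggest it may be assigned to the wrong cluster.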
Silhouette
27
[Figure: silhouette values (roughly −0.2 to 1) for clusters 1 and 2, alongside the corresponding 2-D scatter plot of the data]
Gaussian Mixture
28
g(x; μ, σ) = (1 / √(2πσ²)) e^(−(x − μ)² / (2σ²))
f(x) = Σ_{i=1}^{n} α_i g_i(x; μ_i, σ_i), with α_i ≥ 0 and Σ_{i=1}^{n} α_i = 1
Clustering by Mixture Models
29
K-Means Revisited
30
θ = ((x₁, y₁), (x₂, y₂))    model parameters (the cluster centres)
Z = (Cluster 1, Cluster 2)    latent parameters (the cluster assignments)
Expectation Maximization
31
32
EM Gaussian Mixture
33
Notation:
m: the number of data points
n: the number of mixture components
z_ij: whether instance i is generated by the jth Gaussian

E-step (expected membership of each point in each component):
E[z_ij] = p(x = x_i | μ = μ_j) / Σ_{k=1}^{n} p(x = x_i | μ = μ_k)
        = e^(−(x_i − μ_j)² / (2σ²)) / Σ_{k=1}^{n} e^(−(x_i − μ_k)² / (2σ²))

M-step (re-estimate each mean as a weighted average):
μ_j = Σ_{i=1}^{m} E[z_ij] x_i / Σ_{i=1}^{m} E[z_ij]
α_j = (1/m) Σ_{i=1}^{m} E[z_ij]
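For concreteness, here is an illustrative sketch (not from the slides) of these two EM steps for a 1-D mixture of K Gaussians with a fixed, shared standard deviation, so that only the means are estimated, as in the update equations above:

```python
import numpy as np

def em_means(x, k, sigma=1.0, iters=50, seed=0):
    """EM for a 1-D mixture of k Gaussians with fixed, shared
    standard deviation sigma: E-step computes E[z_ij], M-step
    re-estimates each mean as the E[z_ij]-weighted average."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    mu = rng.choice(x, k, replace=False)  # init means from data points
    for _ in range(iters):
        # E-step: responsibility of component j for point i
        w = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma ** 2))
        w /= w.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted mean for each component
        mu = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)
    return np.sort(mu)
```

Replacing the soft responsibilities with hard 0/1 assignments recovers exactly the K-Means updates, which is why K-Means can be viewed as a degenerate special case of this procedure.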
Density Based Methods
Generate clusters of arbitrary shapes
Robust against noise
No K value required in advance
Somewhat similar to human vision
34
DBSCAN
Density-Based Spatial Clustering of Applications with Noise
Density number of points within a specified radius
Core Point points with high density
Border Point points with low density but in the neighbourhood of a core point
Noise Point neither a core point nor a border point
35
Core Point
Noise Point
Border Point
DBSCAN
36
[Figure: q is directly density reachable from p; q is density reachable from p through a chain of core points; p and q are density connected via a point o]
DBSCAN
A cluster is defined as the maximal set of density connected points
Start from a randomly selected unseen point P
If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set
Noise points are discarded (unlabelled)
37
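The cluster-growing procedure above can be sketched as follows; this is an illustrative implementation (not from the slides) where the radius neighbourhood includes the point itself when counting density:

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Grow a cluster from each unvisited core point by repeatedly
    adding density reachable points; -1 marks noise points."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbours = [np.flatnonzero(D[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)
    cluster = 0
    for p in range(n):
        if labels[p] != -1 or len(neighbours[p]) < min_pts:
            continue  # already assigned, or not a core point
        labels[p] = cluster
        frontier = list(neighbours[p])
        while frontier:
            q = frontier.pop()
            if labels[q] == -1:
                labels[q] = cluster
                # Only core points expand the cluster further;
                # border points are labelled but not expanded
                if len(neighbours[q]) >= min_pts:
                    frontier.extend(neighbours[q])
        cluster += 1
    return labels
```

Because clusters are grown through chains of dense neighbourhoods rather than distances to a centre, the method finds arbitrarily shaped clusters and leaves isolated points labelled as noise.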
Hierarchical Clustering
Produce a set of nested tree-like clusters
Can be visualized as a dendrogram
A clustering is obtained by cutting the dendrogram at the desired level
No need to specify K in advance
May correspond to meaningful taxonomies
38
Dinosaur Family Tree
39
Agglomerative Methods
Bottom-up Method
Assign each data point to a cluster
Calculate the proximity matrix
Merge the pair of closest clusters
Repeat until only a single cluster remains
How to calculate the distance between clusters
Single Link
Minimum distance between points
Complete Link
Maximum distance between points
40
Example
41
BA FI MI NA RM TO
BA 0 662 877 255 412 996
FI 662 0 295 468 268 400
MI 877 295 0 754 564 138
NA 255 468 754 0 219 869
RM 412 268 564 219 0 669
TO 996 400 138 869 669 0
Single Link
Example
42
BA FI MITO NA RM
BA 0 662 877 255 412
FI 662 0 295 468 268
MITO 877 295 0 754 564
NA 255 468 754 0 219
RM 412 268 564 219 0
BA FI MITO NARM
BA 0 662 877 255
FI 662 0 295 268
MITO 877 295 0 564
NARM 255 268 564 0
Example
43
BANARM FI MITO
BANARM 0 268 564
FI 268 0 295
MITO 564 295 0
BAFINARM MITO
BAFINARM 0 295
MITO 295 0
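The merge sequence worked through above can be reproduced with a short single-link implementation (an illustrative sketch, not from the slides), run on the same city distance matrix:

```python
import numpy as np

# Distances between the six Italian cities from the worked example
names = ["BA", "FI", "MI", "NA", "RM", "TO"]
D = np.array([
    [  0, 662, 877, 255, 412, 996],
    [662,   0, 295, 468, 268, 400],
    [877, 295,   0, 754, 564, 138],
    [255, 468, 754,   0, 219, 869],
    [412, 268, 564, 219,   0, 669],
    [996, 400, 138, 869, 669,   0],
], dtype=float)

def single_link(names, D):
    """Repeatedly merge the closest pair of clusters, where the
    distance between clusters is the minimum over their members."""
    clusters = [[i] for i in range(len(names))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((d, sorted(names[i] for i in clusters[a] + clusters[b])))
        clusters[a] += clusters[b]
        del clusters[b]
    return merges

merges = single_link(names, D)
```

The resulting merge heights (138, 219, 255, 268, 295) match the matrix reductions above: MI+TO first, then NA+RM, then BA joins NA/RM, then FI, then MI/TO last.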
Min vs Max
44
Reading Materials
Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons
J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann
Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo
ACM Computing Surveys Vol 31(3) pp 264-323
R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678
A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666
Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial
45
Review
What is clustering?
What are the two categories of clustering methods?
How does the K-Means algorithm work?
What are the major issues of K-Means?
How to control the number of clusters in Sequential Leader Clustering?
How to use Gaussian mixture models for clustering?
What are the main advantages of density based methods?
What is the core idea of DBSCAN?
What is the general procedure of hierarchical clustering?
Which clustering methods do not require K as an input?
46
Next Weekrsquos Class Talk
Volunteers are required for next weekrsquos class talk
Topic: Affinity Propagation
Science 315, 972-976, 2007
Clustering by passing messages between points
http://www.psi.toronto.edu/index.php?q=affinity%20propagation
Topic: Clustering by Fast Search and Find of Density Peaks
Science 344, 1492-1496, 2014
Cluster centers have a higher density than their neighbors
Cluster centers are distant from other points with higher densities
Length 20 minutes plus question time
47
Assignment
Topic Clustering Techniques and Applications
Techniques
K-Means
Another clustering method for comparison
Task 1 2D Artificial Datasets
To demonstrate the influence of data patterns
To demonstrate the influence of algorithm factors
Task 2 Image Segmentation
Gray vs Colour
Deliverables
Reports (experiment specification, algorithm parameters, in-depth analysis)
Code (any programming language, with detailed comments)
Due: Sunday, 28 December
Credit: 15
48
- Clustering
- Overview
- What is cluster analysis
- Clusters
- Clusters (2)
- Applications of Clustering
- Earthquakes
- Image Segmentation
- The Big Picture
- Requirements
- Practical Considerations
- Normalization or Not
- Evaluation
- Evaluation (2)
- The Influence of Outliers
- K-Means
- K-Means (2)
- K-Means (3)
- K-Means (4)
- Comments on K-Means
- The Influence of Initial Centroids
- The Influence of Initial Centroids (2)
- The K-Medoids Method
- The K-Medoids Method (2)
- Sequential Leader Clustering
- Silhouette
- Silhouette (2)
- Gaussian Mixture
- Clustering by Mixture Models
- K-Means Revisited
- Expectation Maximization
- Slide 32
- EM Gaussian Mixture
- Density Based Methods
- DBSCAN
- DBSCAN (2)
- DBSCAN (3)
- Hierarchical Clustering
- Dinosaur Family Tree
- Agglomerative Methods
- Example
- Example (2)
- Example (3)
- Min vs Max
- Reading Materials
- Review
- Next Weekrsquos Class Talk
- Assignment
-
Clusters
5
Applications of Clustering
Marketing Finding groups of customers with similar behaviours
Biology Finding groups of animals or plants with similar features
Bioinformatics Clustering of microarray data genes and sequences
Earthquake Studies Clustering observed earthquake epicenters to identify dangerous zones
WWW Clustering weblog data to discover groups of similar access patterns
Social Networks Discovering groups of individuals with close friendships internally
6
Earthquakes
7
Image Segmentation
8
The Big Picture
9
Requirements
Scalability
Ability to deal with different types of attributes
Ability to discover clusters with arbitrary shape
Minimum requirements for domain knowledge
Ability to deal with noise and outliers
Insensitivity to order of input records
Incorporation of user-defined constraints
Interpretability and usability
10
Practical Considerations
11
Normalization or Not
12
Evaluation
13
ii Dxi
i
c
i Dxie x
nmmxJ
1
1
2
VS
Evaluation
14
The Influence of Outliers
15
outlier
K=2
K-Means
16
K-Means
17
K-Means
18
K-Means
Determine the value of K
Choose K cluster centres randomly
Each data point is assigned to its closest centroid
Use the mean of each cluster to update each centroid
Repeat until no more new assignment
Return the K centroids
Reference J MacQueen (1967) Some Methods for Classification and Analysis of
Multivariate Observations Proceedings of the 5th Berkeley Symposium on
Mathematical Statistics and Probability vol1 pp 281-297
19
Comments on K-Means
Pros
Simple and works well for regular disjoint clusters
Converges relatively fast
Relatively efficient and scalable O(tkn)bull t iteration k number of centroids n number of data points
Cons
Need to specify the value of K in advancebull Difficult and domain knowledge may help
May converge to local optimabull In practice try different initial centroids
May be sensitive to noisy data and outliersbull Mean of data points hellip
Not suitable for clusters of bull Non-convex shapes
20
The Influence of Initial Centroids
21
The Influence of Initial Centroids
22
The K-Medoids Method
The basic idea is to use real data points as centres
Determine the value of K in advance
Randomly select K points as medoids
Assign each data point to the closest medoid
Calculate the cost of the configuration J
For each medoid m
For each non-medoid point o
Swap m and o and calculate the new cost of configuration Jprime
If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps
Otherwise terminate the procedure23
The K-Medoids Method
24
Cost =20 Cost =26
Sequential Leader Clustering
A very efficient clustering algorithm
No iteration
Time complexity O(nk)
No need to specify K in advance
Choose a cluster threshold value
For every new data point
Compute the distance between the new data point and every clusters centre
If the distance is smaller than the chosen threshold assign the new data point
to the corresponding cluster and re-compute cluster centre
Otherwise create a new cluster with the new data point as its centre
Clustering results may be influenced by the sequence of data points
25
Silhouette
A method of interpretation and validation of clusters of data
A succinct graphical representation of how well each data point lies within its cluster compared to other clusters
a(i) average dissimilarity of i with all other points in the same cluster
b(i) the lowest average dissimilarity of i to other clusters
26
)()(max
)()()(
iaib
iaibis
Silhouette
27
-02 0 02 04 06 08 1
1
2
Silhouette Value
Clu
ster
-3 -2 -1 0 1 2 3 4-3
-2
-1
0
1
2
3
4
Gaussian Mixture
28
)2()(
2
22
2
1)(
xexg
1amp0)()(1
i
ii
n
iiii xgxf
Clustering by Mixture Models
29
K-Means Revisited
30
120579=(1199091 1199101 ) (1199092 119910 2)
119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2
model parameters
latent parameters
Expectation Maximization
31
32
EM Gaussian Mixture
33
Gaussianjth by the generated is i instancer whethe
components mixture ofnumber the
points data ofnumber the
ijz
n
m
n
kk
x
j
x
n
kkki
jjiij
ki
ji
e
e
xxp
xxpzE
1
)(2
1
)(2
1
1
22
22
)|(
)|(][
m
iij
m
iiij
j
zE
xzE
1
1
][
][
m
iijj zE
m 1
][1
Density Based Methods
Generate clusters of arbitrary shapes
Robust against noise
No K value required in advance
Somewhat similar to human vision
34
DBSCAN
Density-Based Spatial Clustering of Applications with Noise
Density number of points within a specified radius
Core Point points with high density
Border Point points with low density but in the neighbourhood of a core point
Noise Point neither a core point nor a border point
35
Core Point
Noise Point
Border Point
DBSCAN
36
p
q
directly density reachable
p
q
density reachable
o
qp
density connected
DBSCAN
A cluster is defined as the maximal set of density connected points
Start from a randomly selected unseen point P
If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set
Noise points are discarded (unlabelled)
37
Hierarchical Clustering
Produce a set of nested tree-like clusters
Can be visualized as a dendrogram
Clustering is obtained by cutting at desired level
No need to specify K in advance
May correspond to meaningful taxonomies
38
Dinosaur Family Tree
39
Agglomerative Methods
Bottom-up Method
Assign each data point to a cluster
Calculate the proximity matrix
Merge the pair of closest clusters
Repeat until only a single cluster remains
How to calculate the distance between clusters
Single Link
Minimum distance between points
Complete Link
Maximum distance between points
40
Example
41
BA FI MI NA RM TO
BA 0 662 877 255 412 996
FI 662 0 295 468 268 400
MI 877 295 0 754 564 138
NA 255 468 754 0 219 869
RM 412 268 564 219 0 669
TO 996 400 138 869 669 0Single Link
Example
42
BA FI MITO NA RM
BA 0 662 877 255 412
FI 662 0 295 468 268
MITO 877 295 0 754 564
NA 255 468 754 0 219
RM 412 268 564 219 0
BA FI MITO NARM
BA 0 662 877 255
FI 662 0 295 268
MITO 877 295 0 564
NARM 255 268 564 0
Example
43
BANARM FI MITO
BANARM 0 268 564
FI 268 0 295
MITO 564 295 0
BAFINARM MITO
BAFINARM 0 295
MITO 295 0
Min vs Max
44
Reading Materials
Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons
J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann
Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo
ACM Computing Surveys Vol 31(3) pp 264-323
R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678
A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666
Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial
45
Review
What is clustering
What are the two categories of clustering methods
How does the K-Means algorithm work
What are the major issues of K-Means
How to control the number of clusters in Sequential Leader Clustering
How to use Gaussian mixture models for clustering
What are the main advantages of density methods
What is the core idea of DBSCAN
What is the general procedure of hierarchical clustering
Which clustering methods do not require K as the input
46
Next Weekrsquos Class Talk
Volunteers are required for next weekrsquos class talk
Topic Affinity Propagation
Science 315 972ndash976 2007
Clustering by passing messages between points
httpwwwpsitorontoeduindexphpq=affinity20propagation
Topic Clustering by Fast Search and Find of Density Peaks
Science 344 1492ndash1496 2014
Cluster centers higher density than neighbors
Cluster centers distant from others points with higher densities
Length 20 minutes plus question time
47
Assignment
Topic Clustering Techniques and Applications
Techniques
K-Means
Another clustering method for comparison
Task 1 2D Artificial Datasets
To demonstrate the influence of data patterns
To demonstrate the influence of algorithm factors
Task 2 Image Segmentation
Gray vs Colour
Deliverables
Reports (experiment specification algorithm parameters in-depth analysis)
Code (any programming language with detailed comments)
Due Sunday 28 December
Credit 15 48
- Clustering
- Overview
- What is cluster analysis
- Clusters
- Clusters (2)
- Applications of Clustering
- Earthquakes
- Image Segmentation
- The Big Picture
- Requirements
- Practical Considerations
- Normalization or Not
- Evaluation
- Evaluation (2)
- The Influence of Outliers
- K-Means
- K-Means (2)
- K-Means (3)
- K-Means (4)
- Comments on K-Means
- The Influence of Initial Centroids
- The Influence of Initial Centroids (2)
- The K-Medoids Method
- The K-Medoids Method (2)
- Sequential Leader Clustering
- Silhouette
- Silhouette (2)
- Gaussian Mixture
- Clustering by Mixture Models
- K-Means Revisited
- Expectation Maximization
- Slide 32
- EM Gaussian Mixture
- Density Based Methods
- DBSCAN
- DBSCAN (2)
- DBSCAN (3)
- Hierarchical Clustering
- Dinosaur Family Tree
- Agglomerative Methods
- Example
- Example (2)
- Example (3)
- Min vs Max
- Reading Materials
- Review
- Next Weekrsquos Class Talk
- Assignment
-
Applications of Clustering
Marketing Finding groups of customers with similar behaviours
Biology Finding groups of animals or plants with similar features
Bioinformatics Clustering of microarray data genes and sequences
Earthquake Studies Clustering observed earthquake epicenters to identify dangerous zones
WWW Clustering weblog data to discover groups of similar access patterns
Social Networks Discovering groups of individuals with close friendships internally
6
Earthquakes
7
Image Segmentation
8
The Big Picture
9
Requirements
Scalability
Ability to deal with different types of attributes
Ability to discover clusters with arbitrary shape
Minimum requirements for domain knowledge
Ability to deal with noise and outliers
Insensitivity to order of input records
Incorporation of user-defined constraints
Interpretability and usability
10
Practical Considerations
11
Normalization or Not
12
Evaluation
13
ii Dxi
i
c
i Dxie x
nmmxJ
1
1
2
VS
Evaluation
14
The Influence of Outliers
15
outlier
K=2
K-Means
16
K-Means
17
K-Means
18
K-Means
Determine the value of K
Choose K cluster centres randomly
Each data point is assigned to its closest centroid
Use the mean of each cluster to update each centroid
Repeat until no more new assignment
Return the K centroids
Reference J MacQueen (1967) Some Methods for Classification and Analysis of
Multivariate Observations Proceedings of the 5th Berkeley Symposium on
Mathematical Statistics and Probability vol1 pp 281-297
19
Comments on K-Means
Pros
Simple and works well for regular disjoint clusters
Converges relatively fast
Relatively efficient and scalable O(tkn)bull t iteration k number of centroids n number of data points
Cons
Need to specify the value of K in advancebull Difficult and domain knowledge may help
May converge to local optimabull In practice try different initial centroids
May be sensitive to noisy data and outliersbull Mean of data points hellip
Not suitable for clusters of bull Non-convex shapes
20
The Influence of Initial Centroids
21
The Influence of Initial Centroids
22
The K-Medoids Method
The basic idea is to use real data points as centres
Determine the value of K in advance
Randomly select K points as medoids
Assign each data point to the closest medoid
Calculate the cost of the configuration J
For each medoid m
For each non-medoid point o
Swap m and o and calculate the new cost of configuration Jprime
If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps
Otherwise terminate the procedure23
The K-Medoids Method
24
Cost =20 Cost =26
Sequential Leader Clustering
A very efficient clustering algorithm
No iteration
Time complexity O(nk)
No need to specify K in advance
Choose a cluster threshold value
For every new data point
Compute the distance between the new data point and every clusters centre
If the distance is smaller than the chosen threshold assign the new data point
to the corresponding cluster and re-compute cluster centre
Otherwise create a new cluster with the new data point as its centre
Clustering results may be influenced by the sequence of data points
25
Silhouette
A method of interpretation and validation of clusters of data
A succinct graphical representation of how well each data point lies within its cluster compared to other clusters
a(i) average dissimilarity of i with all other points in the same cluster
b(i) the lowest average dissimilarity of i to other clusters
26
)()(max
)()()(
iaib
iaibis
Silhouette
27
-02 0 02 04 06 08 1
1
2
Silhouette Value
Clu
ster
-3 -2 -1 0 1 2 3 4-3
-2
-1
0
1
2
3
4
Gaussian Mixture
28
)2()(
2
22
2
1)(
xexg
1amp0)()(1
i
ii
n
iiii xgxf
Clustering by Mixture Models
29
K-Means Revisited
30
120579=(1199091 1199101 ) (1199092 119910 2)
119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2
model parameters
latent parameters
Expectation Maximization
31
32
EM Gaussian Mixture
33
Gaussianjth by the generated is i instancer whethe
components mixture ofnumber the
points data ofnumber the
ijz
n
m
n
kk
x
j
x
n
kkki
jjiij
ki
ji
e
e
xxp
xxpzE
1
)(2
1
)(2
1
1
22
22
)|(
)|(][
m
iij
m
iiij
j
zE
xzE
1
1
][
][
m
iijj zE
m 1
][1
Density Based Methods
Generate clusters of arbitrary shapes
Robust against noise
No K value required in advance
Somewhat similar to human vision
34
DBSCAN
Density-Based Spatial Clustering of Applications with Noise
Density number of points within a specified radius
Core Point points with high density
Border Point points with low density but in the neighbourhood of a core point
Noise Point neither a core point nor a border point
35
Core Point
Noise Point
Border Point
DBSCAN
36
p
q
directly density reachable
p
q
density reachable
o
qp
density connected
DBSCAN
A cluster is defined as the maximal set of density connected points
Start from a randomly selected unseen point P
If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set
Noise points are discarded (unlabelled)
37
Hierarchical Clustering
Produce a set of nested tree-like clusters
Can be visualized as a dendrogram
Clustering is obtained by cutting at desired level
No need to specify K in advance
May correspond to meaningful taxonomies
38
Dinosaur Family Tree
39
Agglomerative Methods
Bottom-up Method
Assign each data point to a cluster
Calculate the proximity matrix
Merge the pair of closest clusters
Repeat until only a single cluster remains
How to calculate the distance between clusters
Single Link
Minimum distance between points
Complete Link
Maximum distance between points
40
Example
41
BA FI MI NA RM TO
BA 0 662 877 255 412 996
FI 662 0 295 468 268 400
MI 877 295 0 754 564 138
NA 255 468 754 0 219 869
RM 412 268 564 219 0 669
TO 996 400 138 869 669 0Single Link
Example
42
BA FI MITO NA RM
BA 0 662 877 255 412
FI 662 0 295 468 268
MITO 877 295 0 754 564
NA 255 468 754 0 219
RM 412 268 564 219 0
BA FI MITO NARM
BA 0 662 877 255
FI 662 0 295 268
MITO 877 295 0 564
NARM 255 268 564 0
Example
43
BANARM FI MITO
BANARM 0 268 564
FI 268 0 295
MITO 564 295 0
BAFINARM MITO
BAFINARM 0 295
MITO 295 0
Min vs Max
44
Reading Materials
Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons
J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann
Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo
ACM Computing Surveys Vol 31(3) pp 264-323
R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678
A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666
Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial
45
Review
What is clustering
What are the two categories of clustering methods
How does the K-Means algorithm work
What are the major issues of K-Means
How to control the number of clusters in Sequential Leader Clustering
How to use Gaussian mixture models for clustering
What are the main advantages of density methods
What is the core idea of DBSCAN
What is the general procedure of hierarchical clustering
Which clustering methods do not require K as the input
46
Next Weekrsquos Class Talk
Volunteers are required for next weekrsquos class talk
Topic Affinity Propagation
Science 315 972ndash976 2007
Clustering by passing messages between points
httpwwwpsitorontoeduindexphpq=affinity20propagation
Topic Clustering by Fast Search and Find of Density Peaks
Science 344 1492ndash1496 2014
Cluster centers higher density than neighbors
Cluster centers distant from others points with higher densities
Length 20 minutes plus question time
47
Assignment
Topic Clustering Techniques and Applications
Techniques
K-Means
Another clustering method for comparison
Task 1 2D Artificial Datasets
To demonstrate the influence of data patterns
To demonstrate the influence of algorithm factors
Task 2 Image Segmentation
Gray vs Colour
Deliverables
Reports (experiment specification algorithm parameters in-depth analysis)
Code (any programming language with detailed comments)
Due Sunday 28 December
Credit 15 48
- Clustering
- Overview
- What is cluster analysis
- Clusters
- Clusters (2)
- Applications of Clustering
- Earthquakes
- Image Segmentation
- The Big Picture
- Requirements
- Practical Considerations
- Normalization or Not
- Evaluation
- Evaluation (2)
- The Influence of Outliers
- K-Means
- K-Means (2)
- K-Means (3)
- K-Means (4)
- Comments on K-Means
- The Influence of Initial Centroids
- The Influence of Initial Centroids (2)
- The K-Medoids Method
- The K-Medoids Method (2)
- Sequential Leader Clustering
- Silhouette
- Silhouette (2)
- Gaussian Mixture
- Clustering by Mixture Models
- K-Means Revisited
- Expectation Maximization
- Slide 32
- EM Gaussian Mixture
- Density Based Methods
- DBSCAN
- DBSCAN (2)
- DBSCAN (3)
- Hierarchical Clustering
- Dinosaur Family Tree
- Agglomerative Methods
- Example
- Example (2)
- Example (3)
- Min vs Max
- Reading Materials
- Review
- Next Weekrsquos Class Talk
- Assignment
-
Earthquakes
7
Image Segmentation
8
The Big Picture
9
Requirements
Scalability
Ability to deal with different types of attributes
Ability to discover clusters with arbitrary shape
Minimum requirements for domain knowledge
Ability to deal with noise and outliers
Insensitivity to order of input records
Incorporation of user-defined constraints
Interpretability and usability
10
Practical Considerations
11
Normalization or Not
12
Evaluation
13
ii Dxi
i
c
i Dxie x
nmmxJ
1
1
2
VS
Evaluation
14
The Influence of Outliers
15
outlier
K=2
K-Means
16
K-Means
17
K-Means
18
K-Means
Determine the value of K
Choose K cluster centres randomly
Each data point is assigned to its closest centroid
Use the mean of each cluster to update each centroid
Repeat until no more new assignment
Return the K centroids
Reference J MacQueen (1967) Some Methods for Classification and Analysis of
Multivariate Observations Proceedings of the 5th Berkeley Symposium on
Mathematical Statistics and Probability vol1 pp 281-297
19
Comments on K-Means
Pros
Simple and works well for regular disjoint clusters
Converges relatively fast
Relatively efficient and scalable O(tkn)bull t iteration k number of centroids n number of data points
Cons
Need to specify the value of K in advancebull Difficult and domain knowledge may help
May converge to local optimabull In practice try different initial centroids
May be sensitive to noisy data and outliersbull Mean of data points hellip
Not suitable for clusters of bull Non-convex shapes
20
The Influence of Initial Centroids
21
The Influence of Initial Centroids
22
The K-Medoids Method
The basic idea is to use real data points as centres
Determine the value of K in advance
Randomly select K points as medoids
Assign each data point to the closest medoid
Calculate the cost of the configuration J
For each medoid m
For each non-medoid point o
Swap m and o and calculate the new cost of configuration Jprime
If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps
Otherwise terminate the procedure23
The K-Medoids Method
24
Cost =20 Cost =26
Sequential Leader Clustering
A very efficient clustering algorithm
No iteration
Time complexity O(nk)
No need to specify K in advance
Choose a cluster threshold value
For every new data point
Compute the distance between the new data point and every clusters centre
If the distance is smaller than the chosen threshold assign the new data point
to the corresponding cluster and re-compute cluster centre
Otherwise create a new cluster with the new data point as its centre
Clustering results may be influenced by the sequence of data points
25
Silhouette
A method of interpretation and validation of clusters of data
A succinct graphical representation of how well each data point lies within its cluster compared to other clusters
a(i) average dissimilarity of i with all other points in the same cluster
b(i) the lowest average dissimilarity of i to other clusters
26
)()(max
)()()(
iaib
iaibis
Silhouette
27
-02 0 02 04 06 08 1
1
2
Silhouette Value
Clu
ster
-3 -2 -1 0 1 2 3 4-3
-2
-1
0
1
2
3
4
Gaussian Mixture
28
)2()(
2
22
2
1)(
xexg
1amp0)()(1
i
ii
n
iiii xgxf
Clustering by Mixture Models
29
K-Means Revisited
30
120579=(1199091 1199101 ) (1199092 119910 2)
119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2
model parameters
latent parameters
Expectation Maximization
31
32
EM Gaussian Mixture
33
Gaussianjth by the generated is i instancer whethe
components mixture ofnumber the
points data ofnumber the
ijz
n
m
n
kk
x
j
x
n
kkki
jjiij
ki
ji
e
e
xxp
xxpzE
1
)(2
1
)(2
1
1
22
22
)|(
)|(][
m
iij
m
iiij
j
zE
xzE
1
1
][
][
m
iijj zE
m 1
][1
Density Based Methods
Generate clusters of arbitrary shapes
Robust against noise
No K value required in advance
Somewhat similar to human vision
34
DBSCAN
Density-Based Spatial Clustering of Applications with Noise
Density number of points within a specified radius
Core Point points with high density
Border Point points with low density but in the neighbourhood of a core point
Noise Point neither a core point nor a border point
35
Core Point
Noise Point
Border Point
DBSCAN
36
p
q
directly density reachable
p
q
density reachable
o
qp
density connected
DBSCAN
A cluster is defined as the maximal set of density connected points
Start from a randomly selected unseen point P
If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set
Noise points are discarded (unlabelled)
37
Hierarchical Clustering
Produce a set of nested tree-like clusters
Can be visualized as a dendrogram
Clustering is obtained by cutting at desired level
No need to specify K in advance
May correspond to meaningful taxonomies
38
Dinosaur Family Tree
39
Agglomerative Methods
Bottom-up Method
Assign each data point to a cluster
Calculate the proximity matrix
Merge the pair of closest clusters
Repeat until only a single cluster remains
How to calculate the distance between clusters
Single Link
Minimum distance between points
Complete Link
Maximum distance between points
40
Example
41
BA FI MI NA RM TO
BA 0 662 877 255 412 996
FI 662 0 295 468 268 400
MI 877 295 0 754 564 138
NA 255 468 754 0 219 869
RM 412 268 564 219 0 669
TO 996 400 138 869 669 0Single Link
Example
42
BA FI MITO NA RM
BA 0 662 877 255 412
FI 662 0 295 468 268
MITO 877 295 0 754 564
NA 255 468 754 0 219
RM 412 268 564 219 0
BA FI MITO NARM
BA 0 662 877 255
FI 662 0 295 268
MITO 877 295 0 564
NARM 255 268 564 0
Example
43
BANARM FI MITO
BANARM 0 268 564
FI 268 0 295
MITO 564 295 0
BAFINARM MITO
BAFINARM 0 295
MITO 295 0
Min vs Max
44
Reading Materials
Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons
Comments on K-Means
Pros
Simple and works well for regular disjoint clusters
Converges relatively fast
Relatively efficient and scalable: O(tkn), where t is the number of iterations, k the number of centroids, and n the number of data points
Cons
Need to specify the value of K in advance (difficult; domain knowledge may help)
May converge to local optima (in practice, try different initial centroids)
May be sensitive to noisy data and outliers (mean of data points ...)
Not suitable for clusters of non-convex shapes
20
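The K-Means procedure from the earlier slide (pick K centres, assign each point to its closest centroid, update each centroid to its cluster mean, repeat until assignments stop changing) can be sketched in plain Python. The two-blob dataset is illustrative, and for brevity the first K points stand in for the random initial centroids chosen on the slide:

```python
def k_means(points, k, max_iter=100):
    # For brevity the first K points serve as initial centroids
    # (the lecture chooses them randomly).
    centroids = [points[i] for i in range(k)]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each data point goes to its closest centroid.
        new_assignment = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centroids[j])))
            for x in points
        ]
        if new_assignment == assignment:   # no more new assignments: stop
            break
        assignment = new_assignment
        # Update step: the mean of each cluster becomes the new centroid.
        for j in range(k):
            members = [x for x, c in zip(points, assignment) if c == j]
            if members:
                centroids[j] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assignment

data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = k_means(data, k=2)
```

Even with the poor initial centroids here (both inside the left blob), the update step pulls the centres apart and the two blobs are recovered; with less well-separated data, different initial centroids can give different local optima.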
The Influence of Initial Centroids
21
The Influence of Initial Centroids
22
The K-Medoids Method
The basic idea is to use real data points as centres
Determine the value of K in advance
Randomly select K points as medoids
Assign each data point to the closest medoid
Calculate the cost of the configuration J
For each medoid m
For each non-medoid point o
Swap m and o and calculate the new cost of configuration Jprime
If the cost of the best new configuration J' is lower than J, make the corresponding swap and repeat the above steps
Otherwise, terminate the procedure
23
The K-Medoids Method
24
Cost = 20 vs. Cost = 26
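The swap-based procedure above can be sketched as follows (a minimal PAM-style implementation; the dataset is illustrative, and the first K points again stand in for the random initial medoids):

```python
def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def total_cost(points, medoids):
    # Configuration cost J: distance from each point to its closest medoid.
    return sum(min(dist(p, m) for m in medoids) for p in points)

def k_medoids(points, k):
    medoids = list(points[:k])      # for brevity: first K points as medoids
    while True:
        best, best_cost = medoids, total_cost(points, medoids)
        # Try swapping every medoid m with every non-medoid point o.
        for m in medoids:
            for o in points:
                if o in medoids:
                    continue
                candidate = [o if x == m else x for x in medoids]
                cost = total_cost(points, candidate)
                if cost < best_cost:
                    best, best_cost = candidate, cost
        if best is medoids:         # no improving swap found: terminate
            return medoids
        medoids = best

data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
medoids = k_medoids(data, k=2)
```

Because the centres are real data points rather than means, a single far-away outlier cannot drag a centre towards itself the way it drags a K-Means centroid.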
Sequential Leader Clustering
A very efficient clustering algorithm
No iteration
Time complexity O(nk)
No need to specify K in advance
Choose a cluster threshold value
For every new data point
Compute the distance between the new data point and every cluster's centre
If the distance is smaller than the chosen threshold, assign the new data point to the corresponding cluster and re-compute the cluster centre
Otherwise create a new cluster with the new data point as its centre
Clustering results may be influenced by the sequence of data points
25
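The single-pass procedure above can be sketched as follows; the threshold and data are illustrative:

```python
def sequential_leader(points, threshold):
    clusters = []                          # each cluster is [centre, members]
    for x in points:
        # Distance from the new point to every existing cluster centre.
        best, d = None, None
        for i, (centre, _) in enumerate(clusters):
            dist = sum((a - b) ** 2 for a, b in zip(x, centre)) ** 0.5
            if d is None or dist < d:
                best, d = i, dist
        if best is not None and d < threshold:
            # Assign to the closest cluster and re-compute its centre (mean).
            clusters[best][1].append(x)
            members = clusters[best][1]
            clusters[best][0] = tuple(sum(v) / len(members) for v in zip(*members))
        else:
            # Otherwise the new point starts a cluster with itself as centre.
            clusters.append([x, [x]])
    return clusters

stream = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (0.2, 0.1), (10.5, 9.8)]
clusters = sequential_leader(stream, threshold=2.0)
```

Note that each point is seen exactly once, which is why the method is so cheap, and also why presenting the same points in a different order can produce different clusters.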
Silhouette
A method of interpretation and validation of clusters of data
A succinct graphical representation of how well each data point lies within its cluster compared to other clusters
a(i): average dissimilarity of i with all other points in the same cluster
b(i): the lowest average dissimilarity of i to any other cluster
26
s(i) = (b(i) - a(i)) / max{a(i), b(i)}
Silhouette
27
(Figure: silhouette plot of the silhouette value for each point in clusters 1 and 2, alongside a scatter plot of the clustered data.)
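The silhouette of each point can be computed directly from the definition of a(i) and b(i) above; the tiny two-cluster dataset and labelling below are illustrative:

```python
def silhouette(points, labels):
    # s(i) = (b(i) - a(i)) / max{a(i), b(i)}
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    scores = []
    for i, (p, li) in enumerate(zip(points, labels)):
        # a(i): average dissimilarity to the other points of its own cluster.
        same = [dist(p, q) for j, (q, lj) in enumerate(zip(points, labels))
                if lj == li and j != i]
        a = sum(same) / len(same)
        # b(i): lowest average dissimilarity to the points of another cluster.
        b = min(
            sum(dist(p, q) for q, lj in zip(points, labels) if lj == l)
            / sum(1 for lj in labels if lj == l)
            for l in set(labels) if l != li
        )
        scores.append((b - a) / max(a, b))
    return scores

pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
scores = silhouette(pts, [0, 0, 0, 1, 1, 1])
```

For well-separated clusters like these, every s(i) is close to 1; values near 0 indicate points on a cluster boundary, and negative values indicate points that are probably in the wrong cluster.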
Gaussian Mixture
28
g(x) = (1 / sqrt(2*pi*sigma^2)) * e^( -(x - mu)^2 / (2*sigma^2) )

f(x) = sum_{i=1}^{n} w_i * g_i(x), where w_i >= 0 and sum_{i=1}^{n} w_i = 1
Clustering by Mixture Models
29
K-Means Revisited
30
theta = ((x_1, y_1), (x_2, y_2)) (model parameters)
Z = (Cluster 1, Cluster 2) (latent parameters)
Expectation Maximization
31
32
EM Gaussian Mixture
33
m: the number of data points
n: the number of mixture components
z_ij: whether instance i is generated by the j-th Gaussian

E-step (expected responsibilities):
E[z_ij] = p(x = x_i | mu = mu_j) / sum_{k=1}^{n} p(x = x_i | mu = mu_k)
        = e^( -(x_i - mu_j)^2 / (2*sigma^2) ) / sum_{k=1}^{n} e^( -(x_i - mu_k)^2 / (2*sigma^2) )

M-step (parameter updates):
mu_j = sum_{i=1}^{m} E[z_ij] * x_i / sum_{i=1}^{m} E[z_ij]
w_j = (1/m) * sum_{i=1}^{m} E[z_ij]
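The E-step and M-step on this slide translate into a short routine for a 1-D mixture with a shared, fixed sigma; the data and starting means below are illustrative:

```python
import math

def em_gmm_1d(xs, mus, sigma=1.0, iters=50):
    # EM for a 1-D Gaussian mixture with a shared, fixed sigma,
    # following the update rules on the slide.
    n = len(mus)                 # number of mixture components
    m = len(xs)                  # number of data points
    w = [1.0 / n] * n
    for _ in range(iters):
        # E-step: E[z_ij] = p(x_i | mu_j) / sum_k p(x_i | mu_k)
        resp = []
        for x in xs:
            num = [math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) for mu in mus]
            s = sum(num)
            resp.append([v / s for v in num])
        # M-step: mu_j = sum_i E[z_ij] x_i / sum_i E[z_ij];  w_j = (1/m) sum_i E[z_ij]
        for j in range(n):
            rj = sum(resp[i][j] for i in range(m))
            mus[j] = sum(resp[i][j] * xs[i] for i in range(m)) / rj
            w[j] = rj / m
    return mus, w

mus, w = em_gmm_1d([0.0, 0.2, -0.1, 5.0, 5.2, 4.9], [1.0, 4.0])
```

With the two groups about five sigmas apart, the responsibilities quickly become almost hard assignments and the means converge to the group averages; K-Means can be seen as the limiting case where each E[z_ij] is rounded to 0 or 1.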
Density Based Methods
Generate clusters of arbitrary shapes
Robust against noise
No K value required in advance
Somewhat similar to human vision
34
DBSCAN
Density-Based Spatial Clustering of Applications with Noise
Density: the number of points within a specified radius
Core Point: a point with high density
Border Point: a point with low density, but in the neighbourhood of a core point
Noise Point: neither a core point nor a border point
35
Core Point
Noise Point
Border Point
DBSCAN
36
(Figure: q is directly density reachable from p; q is density reachable from p via a chain of core points; p and q are density connected through a common point o.)
DBSCAN
A cluster is defined as the maximal set of density connected points
Start from a randomly selected unseen point P
If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set
Noise points are discarded (unlabelled)
37
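The procedure above can be sketched compactly; the radius, density threshold and dataset (two square blobs plus one outlier) are illustrative:

```python
def dbscan(points, eps, min_pts):
    # Returns a cluster id for each point, or None for noise points.
    def neighbours(i):
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) ** 0.5 <= eps]

    labels = [None] * len(points)
    visited = set()
    cluster = -1
    for i in range(len(points)):
        if i in visited:
            continue
        visited.add(i)
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            continue                      # not a core point (noise, unless it
                                          # later turns out to be a border point)
        cluster += 1                      # start a new cluster from this core point
        labels[i] = cluster
        queue = list(nbrs)
        while queue:                      # add every density reachable point
            j = queue.pop()
            if labels[j] is None:
                labels[j] = cluster
            if j not in visited:
                visited.add(j)
                more = neighbours(j)
                if len(more) >= min_pts:  # j is also a core point: keep expanding
                    queue.extend(more)
    return labels

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
          (50.0, 50.0)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

The isolated point at (50, 50) has no dense neighbourhood, so it is left unlabelled as noise, and no value of K was needed anywhere.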
Hierarchical Clustering
Produce a set of nested tree-like clusters
Can be visualized as a dendrogram
Clustering is obtained by cutting the dendrogram at the desired level
No need to specify K in advance
May correspond to meaningful taxonomies
38
Dinosaur Family Tree
39
Agglomerative Methods
Bottom-up Method
Assign each data point to a cluster
Calculate the proximity matrix
Merge the pair of closest clusters
Repeat until only a single cluster remains
How to calculate the distance between clusters?
Single Link
Minimum distance between points
Complete Link
Maximum distance between points
40
Example
41
BA FI MI NA RM TO
BA 0 662 877 255 412 996
FI 662 0 295 468 268 400
MI 877 295 0 754 564 138
NA 255 468 754 0 219 869
RM 412 268 564 219 0 669
TO 996 400 138 869 669 0
(Single Link)
Example
42
BA FI MITO NA RM
BA 0 662 877 255 412
FI 662 0 295 468 268
MITO 877 295 0 754 564
NA 255 468 754 0 219
RM 412 268 564 219 0
BA FI MITO NARM
BA 0 662 877 255
FI 662 0 295 268
MITO 877 295 0 564
NARM 255 268 564 0
Example
43
BANARM FI MITO
BANARM 0 268 564
FI 268 0 295
MITO 564 295 0
BAFINARM MITO
BAFINARM 0 295
MITO 295 0
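The merge sequence in the tables above can be reproduced with a short single-link routine on the same city distance matrix; the five merges happen at distances 138 (MI+TO), 219 (NA+RM), 255, 268 and 295:

```python
# Single-link agglomerative clustering on the city distance matrix.
cities = ["BA", "FI", "MI", "NA", "RM", "TO"]
d = {("BA", "FI"): 662, ("BA", "MI"): 877, ("BA", "NA"): 255,
     ("BA", "RM"): 412, ("BA", "TO"): 996, ("FI", "MI"): 295,
     ("FI", "NA"): 468, ("FI", "RM"): 268, ("FI", "TO"): 400,
     ("MI", "NA"): 754, ("MI", "RM"): 564, ("MI", "TO"): 138,
     ("NA", "RM"): 219, ("NA", "TO"): 869, ("RM", "TO"): 669}

def dist(a, b):
    return d[(a, b)] if (a, b) in d else d[(b, a)]

def single_link(ca, cb):
    # Single link: the minimum distance between members of the two clusters.
    return min(dist(a, b) for a in ca for b in cb)

clusters = [(c,) for c in cities]
merges = []
while len(clusters) > 1:
    # Merge the pair of closest clusters.
    i, j = min(((i, j) for i in range(len(clusters))
                for j in range(i + 1, len(clusters))),
               key=lambda p: single_link(clusters[p[0]], clusters[p[1]]))
    merges.append((clusters[i], clusters[j],
                   single_link(clusters[i], clusters[j])))
    clusters = ([c for k, c in enumerate(clusters) if k not in (i, j)]
                + [clusters[i] + clusters[j]])
```

Replacing `min` by `max` inside `single_link` gives complete link, which merges the same cities in a different order and tends to produce more compact clusters.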
Min vs Max
44
Reading Materials
Text Books: Richard O. Duda et al., Pattern Classification, Chapter 10, John Wiley & Sons
J. Han and M. Kamber, Data Mining: Concepts and Techniques, Chapter 8, Morgan Kaufmann
Survey Papers: A. K. Jain, M. N. Murty and P. J. Flynn (1999), "Data Clustering: A Review", ACM Computing Surveys, Vol. 31(3), pp. 264-323
R. Xu and D. Wunsch (2005), "Survey of Clustering Algorithms", IEEE Transactions on Neural Networks, Vol. 16(3), pp. 645-678
A. K. Jain (2010), "Data Clustering: 50 Years Beyond K-Means", Pattern Recognition Letters, Vol. 31, pp. 651-666
Online Tutorials:
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html
http://www.autonlab.org/tutorials/kmeans.html
http://users.informatik.uni-halle.de/~hinneburg/ClusterTutorial
45
Review
What is clustering?
What are the two categories of clustering methods?
How does the K-Means algorithm work?
What are the major issues of K-Means?
How to control the number of clusters in Sequential Leader Clustering?
How to use Gaussian mixture models for clustering?
What are the main advantages of density-based methods?
What is the core idea of DBSCAN?
What is the general procedure of hierarchical clustering?
Which clustering methods do not require K as the input?
46
Next Week's Class Talk
Volunteers are required for next weekrsquos class talk
Topic: Affinity Propagation
Science, 315, 972-976, 2007
Clustering by passing messages between points
http://www.psi.toronto.edu/index.php?q=affinity%20propagation
Topic: Clustering by Fast Search and Find of Density Peaks
Science, 344, 1492-1496, 2014
Cluster centers have a higher density than their neighbors
Cluster centers are distant from other points with higher densities
Length 20 minutes plus question time
47
Assignment
Topic Clustering Techniques and Applications
Techniques
K-Means
Another clustering method for comparison
Task 1 2D Artificial Datasets
To demonstrate the influence of data patterns
To demonstrate the influence of algorithm factors
Task 2 Image Segmentation
Gray vs Colour
Deliverables
Reports (experiment specification, algorithm parameters, in-depth analysis)
Code (any programming language with detailed comments)
Due Sunday 28 December
Credit: 15
48
- Clustering
- Overview
- What is cluster analysis
- Clusters
- Clusters (2)
- Applications of Clustering
- Earthquakes
- Image Segmentation
- The Big Picture
- Requirements
- Practical Considerations
- Normalization or Not
- Evaluation
- Evaluation (2)
- The Influence of Outliers
- K-Means
- K-Means (2)
- K-Means (3)
- K-Means (4)
- Comments on K-Means
- The Influence of Initial Centroids
- The Influence of Initial Centroids (2)
- The K-Medoids Method
- The K-Medoids Method (2)
- Sequential Leader Clustering
- Silhouette
- Silhouette (2)
- Gaussian Mixture
- Clustering by Mixture Models
- K-Means Revisited
- Expectation Maximization
- Slide 32
- EM Gaussian Mixture
- Density Based Methods
- DBSCAN
- DBSCAN (2)
- DBSCAN (3)
- Hierarchical Clustering
- Dinosaur Family Tree
- Agglomerative Methods
- Example
- Example (2)
- Example (3)
- Min vs Max
- Reading Materials
- Review
- Next Week's Class Talk
- Assignment
K-Means
Another clustering method for comparison
Task 1 2D Artificial Datasets
To demonstrate the influence of data patterns
To demonstrate the influence of algorithm factors
Task 2 Image Segmentation
Gray vs Colour
Deliverables
Reports (experiment specification algorithm parameters in-depth analysis)
Code (any programming language with detailed comments)
Due Sunday 28 December
Credit 15 48
- Clustering
- Overview
- What is cluster analysis
- Clusters
- Clusters (2)
- Applications of Clustering
- Earthquakes
- Image Segmentation
- The Big Picture
- Requirements
- Practical Considerations
- Normalization or Not
- Evaluation
- Evaluation (2)
- The Influence of Outliers
- K-Means
- K-Means (2)
- K-Means (3)
- K-Means (4)
- Comments on K-Means
- The Influence of Initial Centroids
- The Influence of Initial Centroids (2)
- The K-Medoids Method
- The K-Medoids Method (2)
- Sequential Leader Clustering
- Silhouette
- Silhouette (2)
- Gaussian Mixture
- Clustering by Mixture Models
- K-Means Revisited
- Expectation Maximization
- Slide 32
- EM Gaussian Mixture
- Density Based Methods
- DBSCAN
- DBSCAN (2)
- DBSCAN (3)
- Hierarchical Clustering
- Dinosaur Family Tree
- Agglomerative Methods
- Example
- Example (2)
- Example (3)
- Min vs Max
- Reading Materials
- Review
- Next Weekrsquos Class Talk
- Assignment
-
Normalization or Not
12
Evaluation
13
ii Dxi
i
c
i Dxie x
nmmxJ
1
1
2
VS
Evaluation
14
The Influence of Outliers
15
outlier
K=2
K-Means
16
K-Means
17
K-Means
18
K-Means
Determine the value of K
Choose K cluster centres randomly
Each data point is assigned to its closest centroid
Use the mean of each cluster to update each centroid
Repeat until no more new assignment
Return the K centroids
Reference J MacQueen (1967) Some Methods for Classification and Analysis of
Multivariate Observations Proceedings of the 5th Berkeley Symposium on
Mathematical Statistics and Probability vol1 pp 281-297
19
Comments on K-Means
Pros
Simple and works well for regular disjoint clusters
Converges relatively fast
Relatively efficient and scalable: O(tkn), where t = number of iterations, k = number of centroids, n = number of data points
Cons
Need to specify the value of K in advance (difficult; domain knowledge may help)
May converge to local optima (in practice, try different initial centroids)
May be sensitive to noisy data and outliers (the mean of data points is easily pulled by outliers)
Not suitable for clusters of non-convex shapes
20
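The K-Means steps summarised on slide 19 can be sketched in a few lines. This is a minimal illustration (the function name and toy data are mine, not from the lecture):

```python
import random

def k_means(points, k, max_iter=100, seed=0):
    """Plain K-Means on 2-D points: random centres, assign, update, repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # choose K cluster centres randomly
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # assign each data point to its closest centroid
        clusters = [[] for _ in range(k)]
        for x, y in points:
            j = min(range(k),
                    key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2)
            clusters[j].append((x, y))
        # use the mean of each cluster to update each centroid
        new_centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:         # no more new assignments
            break
        centroids = new_centroids
    return centroids, clusters
```

On two well-separated blobs this recovers the blob means; with unlucky initial centroids it can still settle in a local optimum, which is exactly the weakness listed under Cons.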
The Influence of Initial Centroids
21
The Influence of Initial Centroids
22
The K-Medoids Method
The basic idea is to use real data points as centres
Determine the value of K in advance
Randomly select K points as medoids
Assign each data point to the closest medoid
Calculate the cost of the configuration J
For each medoid m
For each non-medoid point o
Swap m and o and calculate the cost J′ of the new configuration
If the cost J′ of the best new configuration is lower than J, make the corresponding swap and repeat the above steps
Otherwise, terminate the procedure
23
The K-Medoids Method
24
Cost = 20 vs Cost = 26
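A PAM-style sketch of the swap procedure above (deterministic start instead of the random selection on the slide; names are illustrative):

```python
def k_medoids(points, k, dist):
    """K-Medoids: repeatedly apply the best medoid/non-medoid swap while it lowers J."""
    medoids = list(points[:k])                  # the slides pick K medoids randomly
    def cost(meds):
        # J: total distance from every point to its closest medoid
        return sum(min(dist(p, m) for m in meds) for p in points)
    J = cost(medoids)
    while True:
        best = None
        for mi in range(k):                     # for each medoid m ...
            for o in points:                    # ... and each non-medoid point o
                if o in medoids:
                    continue
                trial = medoids[:mi] + [o] + medoids[mi + 1:]
                J_new = cost(trial)             # cost J' after swapping m and o
                if J_new < J and (best is None or J_new < best[0]):
                    best = (J_new, trial)
        if best is None:
            break                               # no swap lowers the cost: terminate
        J, medoids = best
    return medoids, J
```

Because centres are restricted to actual data points, a single far-away outlier cannot drag a centre towards itself the way it drags a mean in K-Means.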
Sequential Leader Clustering
A very efficient clustering algorithm
No iteration
Time complexity O(nk)
No need to specify K in advance
Choose a cluster threshold value
For every new data point
Compute the distance between the new data point and every cluster's centre
If the distance to the closest centre is smaller than the chosen threshold, assign the new data point to that cluster and re-compute the cluster centre
Otherwise, create a new cluster with the new data point as its centre
Clustering results may be influenced by the sequence of data points
25
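The single pass above can be sketched directly; this illustration uses 1-D values to keep the distance computation trivial (real data would use vectors):

```python
def sequential_leader(points, threshold):
    """Sequential Leader Clustering: one pass, no iteration, K emerges from the threshold."""
    centres, members = [], []
    for p in points:
        if centres:
            j = min(range(len(centres)), key=lambda c: abs(p - centres[c]))
            if abs(p - centres[j]) < threshold:
                members[j].append(p)            # assign to the closest existing cluster
                centres[j] = sum(members[j]) / len(members[j])   # re-compute its centre
                continue
        centres.append(p)                       # too far from every centre: new cluster
        members.append([p])
    return centres, members
```

A larger threshold yields fewer, broader clusters; and because centres drift as points arrive, presenting the same points in a different order can produce a different clustering, as noted on the slide.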
Silhouette
A method of interpretation and validation of clusters of data
A succinct graphical representation of how well each data point lies within its cluster compared to other clusters
a(i) average dissimilarity of i with all other points in the same cluster
b(i) the lowest average dissimilarity of i to other clusters
26
s(i) = (b(i) - a(i)) / max(a(i), b(i))
Silhouette
27
[Figure: a 2-D scatter plot of two clusters and the corresponding silhouette plot; x-axis: Silhouette Value (-0.2 to 1), y-axis: Cluster (1, 2)]
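The silhouette formula can be computed from a precomputed distance matrix. A minimal sketch (assumes every cluster has at least two points; names are mine):

```python
def silhouette(i, labels, D):
    """s(i) = (b(i) - a(i)) / max(a(i), b(i)) for point i, given distance matrix D."""
    own = [j for j, l in enumerate(labels) if l == labels[i] and j != i]
    a = sum(D[i][j] for j in own) / len(own)      # a(i): mean dissimilarity within own cluster
    b = min(                                      # b(i): lowest mean dissimilarity to another cluster
        sum(D[i][j] for j in other) / len(other)
        for lab in set(labels) if lab != labels[i]
        for other in [[j for j, l in enumerate(labels) if l == lab]]
    )
    return (b - a) / max(a, b)
```

Values near 1 mean the point sits deep inside its own cluster; values near 0 mean it lies between clusters; negative values suggest it was assigned to the wrong cluster.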
Gaussian Mixture
28
g(x; μ, σ) = (1 / √(2πσ²)) exp(-(x - μ)² / (2σ²))

f(x) = Σ_{i=1..n} α_i g(x; μ_i, σ_i), where α_i ≥ 0 and Σ_{i=1..n} α_i = 1
Clustering by Mixture Models
29
K-Means Revisited
30
Model parameters: θ = ((x₁, y₁), (x₂, y₂)), the two cluster centres
Latent parameters: Z = {Cluster 1, Cluster 2}, the assignment of each point
Expectation Maximization
31
32
EM Gaussian Mixture
33
m: the number of data points
n: the number of mixture components
z_ij: whether instance i is generated by the jth Gaussian

E step:
E[z_ij] = p(x_i | μ_j) / Σ_{k=1..n} p(x_i | μ_k)
        = exp(-(x_i - μ_j)² / (2σ²)) / Σ_{k=1..n} exp(-(x_i - μ_k)² / (2σ²))

M step:
μ_j = Σ_{i=1..m} E[z_ij] x_i / Σ_{i=1..m} E[z_ij]
α_j = (1/m) Σ_{i=1..m} E[z_ij]
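The E and M updates above translate almost line for line into code. A sketch for a 1-D mixture with a fixed, shared σ, as on the slide (it also re-estimates the mixing weights α_j, which the slide's E step takes as uniform; names are mine):

```python
import math

def em_gmm_1d(xs, mus, sigma=1.0, iters=30):
    """EM for a 1-D Gaussian mixture: alternate responsibilities (E) and means/weights (M)."""
    m, n = len(xs), len(mus)
    alphas = [1.0 / n] * n
    for _ in range(iters):
        # E step: responsibilities E[z_ij] for each point i and component j
        E = []
        for x in xs:
            ps = [a * math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
                  for a, mu in zip(alphas, mus)]
            s = sum(ps)
            E.append([p / s for p in ps])
        # M step: each mean becomes a responsibility-weighted average of the data
        mus = [sum(E[i][j] * xs[i] for i in range(m)) / sum(E[i][j] for i in range(m))
               for j in range(n)]
        alphas = [sum(E[i][j] for i in range(m)) / m for j in range(n)]
    return mus, alphas
```

Unlike K-Means, each point contributes to every component in proportion to its responsibility; hard assignment is the limiting case as σ shrinks.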
Density Based Methods
Generate clusters of arbitrary shapes
Robust against noise
No K value required in advance
Somewhat similar to human vision
34
DBSCAN
Density-Based Spatial Clustering of Applications with Noise
Density: the number of points within a specified radius
Core Point: a point with high density
Border Point: a point with low density but in the neighbourhood of a core point
Noise Point: neither a core point nor a border point
35
[Figure: examples of a core point, a border point, and a noise point for a given radius]
DBSCAN
36
[Figure: q is directly density reachable from core point p; q is density reachable from p via a chain of core points; p and q are density connected through a common point o]
DBSCAN
A cluster is defined as the maximal set of density connected points
Start from a randomly selected unseen point P
If P is a core point, build a cluster by gradually adding all points that are density reachable from the current point set
Noise points are discarded (unlabelled)
37
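The cluster-growing procedure above can be sketched as follows (a minimal illustration with hashable points; names are mine):

```python
def dbscan(points, eps, min_pts, dist):
    """DBSCAN sketch: grow clusters from core points; noise keeps the label -1."""
    labels = {p: None for p in points}
    def neighbours(p):                           # all points within Eps of p (including p)
        return [q for q in points if dist(p, q) <= eps]
    cid = 0
    for p in points:
        if labels[p] is not None:
            continue
        seeds = neighbours(p)
        if len(seeds) < min_pts:
            labels[p] = -1                       # noise for now (may become a border point)
            continue
        labels[p] = cid                          # p is a core point: start a new cluster
        queue = [q for q in seeds if q != p]
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cid                  # border point reached from a core point
            if labels[q] is not None:
                continue
            labels[q] = cid
            qn = neighbours(q)
            if len(qn) >= min_pts:               # q is also core: keep expanding through it
                queue.extend(qn)
        cid += 1
    return labels
```

Note that K never appears: the number of clusters falls out of Eps and min_pts, and isolated points simply stay labelled as noise.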
Hierarchical Clustering
Produce a set of nested tree-like clusters
Can be visualized as a dendrogram
Clustering is obtained by cutting at desired level
No need to specify K in advance
May correspond to meaningful taxonomies
38
Dinosaur Family Tree
39
Agglomerative Methods
Bottom-up Method
Assign each data point to a cluster
Calculate the proximity matrix
Merge the pair of closest clusters
Repeat until only a single cluster remains
How to calculate the distance between clusters
Single Link
Minimum distance between points
Complete Link
Maximum distance between points
40
Example
41
BA FI MI NA RM TO
BA 0 662 877 255 412 996
FI 662 0 295 468 268 400
MI 877 295 0 754 564 138
NA 255 468 754 0 219 869
RM 412 268 564 219 0 669
TO 996 400 138 869 669 0
(Single Link)
Example
42
BA FI MITO NA RM
BA 0 662 877 255 412
FI 662 0 295 468 268
MITO 877 295 0 754 564
NA 255 468 754 0 219
RM 412 268 564 219 0
BA FI MITO NARM
BA 0 662 877 255
FI 662 0 295 268
MITO 877 295 0 564
NARM 255 268 564 0
Example
43
BANARM FI MITO
BANARM 0 268 564
FI 268 0 295
MITO 564 295 0
BAFINARM MITO
BAFINARM 0 295
MITO 295 0
Min vs Max
44
Reading Materials
Text Books:
Richard O. Duda et al., Pattern Classification, Chapter 10, John Wiley & Sons
J. Han and M. Kamber, Data Mining: Concepts and Techniques, Chapter 8, Morgan Kaufmann
Survey Papers:
A. K. Jain, M. N. Murty and P. J. Flynn (1999). "Data Clustering: A Review", ACM Computing Surveys, Vol. 31(3), pp. 264-323
R. Xu and D. Wunsch (2005). "Survey of Clustering Algorithms", IEEE Transactions on Neural Networks, Vol. 16(3), pp. 645-678
A. K. Jain (2010). "Data Clustering: 50 Years Beyond K-Means", Pattern Recognition Letters, Vol. 31, pp. 651-666
Online Tutorials:
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html
http://www.autonlab.org/tutorials/kmeans.html
http://users.informatik.uni-halle.de/~hinnebur/ClusterTutorial
45
Review
What is clustering?
What are the two categories of clustering methods?
How does the K-Means algorithm work?
What are the major issues of K-Means?
How to control the number of clusters in Sequential Leader Clustering?
How to use Gaussian mixture models for clustering?
What are the main advantages of density based methods?
What is the core idea of DBSCAN?
What is the general procedure of hierarchical clustering?
Which clustering methods do not require K as an input?
46
Next Week's Class Talk
Volunteers are required for next week's class talk
Topic: Affinity Propagation
Science, 315, 972-976, 2007
Clustering by passing messages between points
http://www.psi.toronto.edu/index.php?q=affinity%20propagation
Topic: Clustering by Fast Search and Find of Density Peaks
Science, 344, 1492-1496, 2014
Cluster centres have higher density than their neighbours
Cluster centres are distant from other points with higher densities
Length: 20 minutes plus question time
47
Assignment
Topic Clustering Techniques and Applications
Techniques
K-Means
Another clustering method for comparison
Task 1 2D Artificial Datasets
To demonstrate the influence of data patterns
To demonstrate the influence of algorithm factors
Task 2 Image Segmentation
Gray vs Colour
Deliverables
Reports (experiment specification, algorithm parameters, in-depth analysis)
Code (any programming language with detailed comments)
Due Sunday 28 December
Credit: 15
48
- Clustering
- Overview
- What is cluster analysis
- Clusters
- Clusters (2)
- Applications of Clustering
- Earthquakes
- Image Segmentation
- The Big Picture
- Requirements
- Practical Considerations
- Normalization or Not
- Evaluation
- Evaluation (2)
- The Influence of Outliers
- K-Means
- K-Means (2)
- K-Means (3)
- K-Means (4)
- Comments on K-Means
- The Influence of Initial Centroids
- The Influence of Initial Centroids (2)
- The K-Medoids Method
- The K-Medoids Method (2)
- Sequential Leader Clustering
- Silhouette
- Silhouette (2)
- Gaussian Mixture
- Clustering by Mixture Models
- K-Means Revisited
- Expectation Maximization
- Slide 32
- EM Gaussian Mixture
- Density Based Methods
- DBSCAN
- DBSCAN (2)
- DBSCAN (3)
- Hierarchical Clustering
- Dinosaur Family Tree
- Agglomerative Methods
- Example
- Example (2)
- Example (3)
- Min vs Max
- Reading Materials
- Review
- Next Week's Class Talk
- Assignment
45
Review
What is clustering
What are the two categories of clustering methods
How does the K-Means algorithm work
What are the major issues of K-Means
How to control the number of clusters in Sequential Leader Clustering
How to use Gaussian mixture models for clustering
What are the main advantages of density methods
What is the core idea of DBSCAN
What is the general procedure of hierarchical clustering
Which clustering methods do not require K as the input
46
Next Weekrsquos Class Talk
Volunteers are required for next weekrsquos class talk
Topic Affinity Propagation
Science 315 972ndash976 2007
Clustering by passing messages between points
httpwwwpsitorontoeduindexphpq=affinity20propagation
Topic Clustering by Fast Search and Find of Density Peaks
Science 344 1492ndash1496 2014
Cluster centers higher density than neighbors
Cluster centers distant from others points with higher densities
Length 20 minutes plus question time
47
Assignment
Topic Clustering Techniques and Applications
Techniques
K-Means
Another clustering method for comparison
Task 1 2D Artificial Datasets
To demonstrate the influence of data patterns
To demonstrate the influence of algorithm factors
Task 2 Image Segmentation
Gray vs Colour
Deliverables
Reports (experiment specification algorithm parameters in-depth analysis)
Code (any programming language with detailed comments)
Due Sunday 28 December
Credit 15 48
- Clustering
- Overview
- What is cluster analysis
- Clusters
- Clusters (2)
- Applications of Clustering
- Earthquakes
- Image Segmentation
- The Big Picture
- Requirements
- Practical Considerations
- Normalization or Not
- Evaluation
- Evaluation (2)
- The Influence of Outliers
- K-Means
- K-Means (2)
- K-Means (3)
- K-Means (4)
- Comments on K-Means
- The Influence of Initial Centroids
- The Influence of Initial Centroids (2)
- The K-Medoids Method
- The K-Medoids Method (2)
- Sequential Leader Clustering
- Silhouette
- Silhouette (2)
- Gaussian Mixture
- Clustering by Mixture Models
- K-Means Revisited
- Expectation Maximization
- Slide 32
- EM Gaussian Mixture
- Density Based Methods
- DBSCAN
- DBSCAN (2)
- DBSCAN (3)
- Hierarchical Clustering
- Dinosaur Family Tree
- Agglomerative Methods
- Example
- Example (2)
- Example (3)
- Min vs Max
- Reading Materials
- Review
- Next Weekrsquos Class Talk
- Assignment
-
K-Means
16
K-Means
17
K-Means
18
K-Means
Determine the value of K
Choose K cluster centres randomly
Each data point is assigned to its closest centroid
Use the mean of each cluster to update each centroid
Repeat until no more new assignment
Return the K centroids
Reference J MacQueen (1967) Some Methods for Classification and Analysis of
Multivariate Observations Proceedings of the 5th Berkeley Symposium on
Mathematical Statistics and Probability vol1 pp 281-297
19
Comments on K-Means
Pros
Simple and works well for regular disjoint clusters
Converges relatively fast
Relatively efficient and scalable: O(tkn), where t = number of iterations, k = number of centroids, n = number of data points
Cons
Need to specify the value of K in advance (difficult; domain knowledge may help)
May converge to local optima (in practice, try different initial centroids)
May be sensitive to noisy data and outliers (the mean is easily distorted by extreme points)
Not suitable for clusters of non-convex shapes
20
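As a concrete reference for the pros and cons above, here is a minimal 1-D K-Means sketch in plain Python (the function name and convergence check are illustrative, not from the lecture):

```python
import random

def k_means(points, k, max_iter=100):
    # Choose K cluster centres randomly from the data
    centroids = random.sample(points, k)
    for _ in range(max_iter):
        # Assignment step: each data point goes to its closest centroid
        clusters = [[] for _ in range(k)]
        for x in points:
            j = min(range(k), key=lambda c: (x - centroids[c]) ** 2)
            clusters[j].append(x)
        # Update step: move each centroid to the mean of its cluster
        new_centroids = [sum(c) / len(c) if c else centroids[j]
                         for j, c in enumerate(clusters)]
        if new_centroids == centroids:  # no more new assignments
            break
        centroids = new_centroids
    return sorted(centroids)
```

Each iteration costs O(kn), giving the O(tkn) total above; running it from several random initialisations is the usual remedy for local optima.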
The Influence of Initial Centroids
21
The Influence of Initial Centroids
22
The K-Medoids Method
The basic idea is to use real data points as centres
Determine the value of K in advance
Randomly select K points as medoids
Assign each data point to the closest medoid
Calculate the cost of the configuration J
For each medoid m
For each non-medoid point o
Swap m and o and calculate the new cost of configuration J′
If the cost J′ of the best new configuration is lower than J, make the corresponding swap and repeat the above steps
Otherwise terminate the procedure
23
The K-Medoids Method
24
[Figure: two medoid configurations, with total cost 20 vs 26]
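The swap procedure above can be sketched for 1-D points as a simplified swap search in the spirit of PAM (names and the greedy acceptance rule are illustrative):

```python
def k_medoids(points, medoids):
    """Swap-based K-Medoids for 1-D points; medoids are real data points."""
    def cost(meds):
        # Configuration cost J: each point's distance to its closest medoid
        return sum(min(abs(x - m) for m in meds) for x in points)

    best = cost(medoids)
    improved = True
    while improved:
        improved = False
        for i in range(len(medoids)):          # for each medoid m
            for o in points:                   # for each non-medoid point o
                if o in medoids:
                    continue
                trial = medoids[:i] + [o] + medoids[i + 1:]  # swap m and o
                if cost(trial) < best:         # keep the swap if J' < J
                    best, medoids, improved = cost(trial), trial, True
    return sorted(medoids), best
```

Because medoids are actual data points, a single outlier cannot drag a centre away the way it drags a mean.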
Sequential Leader Clustering
A very efficient clustering algorithm
No iteration
Time complexity O(nk)
No need to specify K in advance
Choose a cluster threshold value
For every new data point
Compute the distance between the new data point and every cluster's centre
If the distance is smaller than the chosen threshold, assign the new data point to the corresponding cluster and re-compute that cluster's centre
Otherwise create a new cluster with the new data point as its centre
Clustering results may be influenced by the sequence of data points
25
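The one-pass procedure above can be sketched as follows (1-D points; the list-based bookkeeping is illustrative):

```python
def sequential_leader(points, threshold):
    clusters = []                      # each cluster: [centre, members]
    for x in points:
        # Distance from the new point to every existing cluster centre
        best = min(clusters, key=lambda cl: abs(x - cl[0]), default=None)
        if best is not None and abs(x - best[0]) < threshold:
            # Closer than the threshold: assign and re-compute the centre
            best[1].append(x)
            best[0] = sum(best[1]) / len(best[1])
        else:
            # Otherwise start a new cluster with x as its centre
            clusters.append([x, [x]])
    return [cl[0] for cl in clusters]
```

Feeding the same points in a different order can yield different clusters, which is exactly the order sensitivity noted above.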
Silhouette
A method of interpretation and validation of clusters of data
A succinct graphical representation of how well each data point lies within its cluster compared to other clusters
a(i): the average dissimilarity of i to all other points in the same cluster
b(i): the lowest average dissimilarity of i to any other cluster
26
s(i) = (b(i) − a(i)) / max{a(i), b(i)}
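The formula can be transcribed directly for a single point (1-D points; the argument layout is illustrative):

```python
def silhouette(x, own_cluster, other_clusters):
    # a(i): average dissimilarity of x to the other points in its cluster
    a = sum(abs(x - p) for p in own_cluster) / len(own_cluster)
    # b(i): lowest average dissimilarity of x to any other cluster
    b = min(sum(abs(x - p) for p in c) / len(c) for c in other_clusters)
    # s(i) = (b - a) / max(a, b), in [-1, 1]; close to 1 means well clustered
    return (b - a) / max(a, b)
```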
Silhouette
27
[Figure: silhouette values for two clusters, shown alongside the corresponding 2-D scatter plot]
Gaussian Mixture
28
g(x) = (1 / √(2πσ²)) exp(−(x − μ)² / (2σ²))
f(x) = Σ_{i=1..n} α_i g_i(x),  with α_i ≥ 0 and Σ_{i=1..n} α_i = 1
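In code, the component density g(x) and the mixture f(x) read as follows (plain Python; a shared variance across components is an illustrative simplification):

```python
import math

def gauss(x, mu, sigma):
    # g(x) = exp(-(x - mu)^2 / (2 sigma^2)) / sqrt(2 pi sigma^2)
    return (math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
            / math.sqrt(2 * math.pi * sigma ** 2))

def mixture(x, alphas, mus, sigma=1.0):
    # f(x) = sum_i alpha_i * g(x; mu_i), with alpha_i >= 0, sum alpha_i = 1
    assert abs(sum(alphas) - 1.0) < 1e-9 and all(a >= 0 for a in alphas)
    return sum(a * gauss(x, m, sigma) for a, m in zip(alphas, mus))
```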
Clustering by Mixture Models
29
K-Means Revisited
30
θ = ((x₁, y₁), (x₂, y₂)): model parameters
Z = (Cluster 1, Cluster 2): latent variables (cluster assignments)
Expectation Maximization
31
32
EM Gaussian Mixture
33
m: the number of data points
n: the number of mixture components
z_ij: whether instance i is generated by the jth Gaussian

E-step:
E[z_ij] = p(x_i | μ_j) / Σ_{k=1..n} p(x_i | μ_k)
        = exp(−(x_i − μ_j)² / (2σ²)) / Σ_{k=1..n} exp(−(x_i − μ_k)² / (2σ²))

M-step:
μ_j = Σ_{i=1..m} E[z_ij] x_i / Σ_{i=1..m} E[z_ij]
α_j = (1/m) Σ_{i=1..m} E[z_ij]
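The E-step and M-step above can be sketched for a 1-D mixture with a fixed, shared σ (equal mixing weights are an illustrative simplification; variable names follow the slide):

```python
import math

def em_gaussian_mixture(xs, mus, sigma=1.0, iters=50):
    n = len(mus)                          # n: number of mixture components
    for _ in range(iters):
        # E-step: E[z_ij] = p(x_i | mu_j) / sum_k p(x_i | mu_k)
        resp = []
        for x in xs:
            ps = [math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) for mu in mus]
            total = sum(ps)
            resp.append([p / total for p in ps])
        # M-step: mu_j = sum_i E[z_ij] x_i / sum_i E[z_ij]
        mus = [sum(r[j] * x for r, x in zip(resp, xs)) /
               sum(r[j] for r in resp)
               for j in range(n)]
    return sorted(mus)
```

With hard 0/1 responsibilities instead of soft ones, this loop degenerates into K-Means, which is the sense in which EM "revisits" it.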
Density Based Methods
Generate clusters of arbitrary shapes
Robust against noise
No K value required in advance
Somewhat similar to human vision
34
DBSCAN
Density-Based Spatial Clustering of Applications with Noise
Density: the number of points within a specified radius
Core Point: a point with high density
Border Point: a point with low density, but in the neighbourhood of a core point
Noise Point: any point that is neither a core point nor a border point
35
[Figure: a core point, a border point and a noise point in a point cloud]
DBSCAN
36
[Figure: p directly density-reachable from core point q; p density-reachable from q via a chain of core points; p and q density-connected through o]
DBSCAN
A cluster is defined as a maximal set of density-connected points
Start from a randomly selected unseen point P
If P is a core point, build a cluster by gradually adding all points that are density-reachable from the current point set
Noise points are discarded (left unlabelled)
37
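The cluster-growing procedure above can be sketched compactly (1-D points, naive O(n²) neighbour search; the labelling scheme is illustrative):

```python
def dbscan(points, eps, min_pts):
    def neighbours(i):
        # Density: indices of points within radius eps of points[i]
        return [j for j, q in enumerate(points) if abs(points[i] - q) <= eps]

    labels = {}                 # point index -> cluster id (-1 = noise)
    cluster = 0
    for i in range(len(points)):
        if i in labels:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1      # not core; may become a border point later
            continue
        labels[i] = cluster     # i is a core point: grow a new cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels.get(j) == -1:
                labels[j] = cluster      # noise reclassified as border point
            if j in labels:
                continue
            labels[j] = cluster
            if len(neighbours(j)) >= min_pts:  # j is also core: keep expanding
                queue.extend(neighbours(j))
        cluster += 1
    return labels
```

Note that no K is supplied; the number of clusters falls out of the radius and the density threshold.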
Hierarchical Clustering
Produce a set of nested tree-like clusters
Can be visualized as a dendrogram
A clustering is obtained by cutting the dendrogram at the desired level
No need to specify K in advance
May correspond to meaningful taxonomies
38
Dinosaur Family Tree
39
Agglomerative Methods
Bottom-up Method
Assign each data point to its own cluster
Calculate the proximity matrix
Merge the pair of closest clusters
Repeat until only a single cluster remains
How to calculate the distance between clusters
Single Link
Minimum distance between points
Complete Link
Maximum distance between points
40
Example
41
BA FI MI NA RM TO
BA 0 662 877 255 412 996
FI 662 0 295 468 268 400
MI 877 295 0 754 564 138
NA 255 468 754 0 219 869
RM 412 268 564 219 0 669
TO 996 400 138 869 669 0
Single Link
Example
42
BA FI MITO NA RM
BA 0 662 877 255 412
FI 662 0 295 468 268
MITO 877 295 0 754 564
NA 255 468 754 0 219
RM 412 268 564 219 0
BA FI MITO NARM
BA 0 662 877 255
FI 662 0 295 268
MITO 877 295 0 564
NARM 255 268 564 0
Example
43
BANARM FI MITO
BANARM 0 268 564
FI 268 0 295
MITO 564 295 0
BAFINARM MITO
BAFINARM 0 295
MITO 295 0
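The worked example above (BA/FI/MI/NA/RM/TO) can be reproduced with a naive single-link agglomerative loop (illustrative; a real implementation would update a condensed distance matrix instead of rescanning all pairs):

```python
def single_link(names, dist):
    idx = {n: i for i, n in enumerate(names)}
    clusters = [{n} for n in names]
    merges = []                        # (distance, merged cluster) per step
    while len(clusters) > 1:
        # Single link: cluster distance = minimum pairwise point distance
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[idx[p]][idx[q]]
                        for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = clusters[a] | clusters[b]
        merges.append((d, merged))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges
```

Swapping the inner `min` for `max` turns this into complete link, the other criterion named above.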
Min vs Max
44
Reading Materials
Text Books:
Richard O. Duda et al., Pattern Classification, Chapter 10, John Wiley & Sons
J. Han and M. Kamber, Data Mining: Concepts and Techniques, Chapter 8, Morgan Kaufmann
Survey Papers:
A. K. Jain, M. N. Murty and P. J. Flynn (1999) "Data Clustering: A Review", ACM Computing Surveys, Vol. 31(3), pp. 264-323
R. Xu and D. Wunsch (2005) "Survey of Clustering Algorithms", IEEE Transactions on Neural Networks, Vol. 16(3), pp. 645-678
A. K. Jain (2010) "Data Clustering: 50 Years Beyond K-Means", Pattern Recognition Letters, Vol. 31, pp. 651-666
Online Tutorials:
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html
http://www.autonlab.org/tutorials/kmeans.html
http://users.informatik.uni-halle.de/~hinneburg/ClusterTutorial
45
Review
What is clustering
What are the two categories of clustering methods
How does the K-Means algorithm work
What are the major issues of K-Means
How to control the number of clusters in Sequential Leader Clustering
How to use Gaussian mixture models for clustering
What are the main advantages of density-based methods
What is the core idea of DBSCAN
What is the general procedure of hierarchical clustering
Which clustering methods do not require K as the input
46
Next Week's Class Talk
Volunteers are required for next weekrsquos class talk
Topic: Affinity Propagation
Science 315, 972-976, 2007
Clustering by passing messages between points
http://www.psi.toronto.edu/index.php?q=affinity%20propagation
Topic: Clustering by Fast Search and Find of Density Peaks
Science 344, 1492-1496, 2014
Cluster centers have a higher density than their neighbors
Cluster centers are relatively distant from other points with higher densities
Length 20 minutes plus question time
47
Assignment
Topic: Clustering Techniques and Applications
Techniques
K-Means
Another clustering method for comparison
Task 1 2D Artificial Datasets
To demonstrate the influence of data patterns
To demonstrate the influence of algorithm factors
Task 2 Image Segmentation
Gray vs Colour
Deliverables
Reports (experiment specification, algorithm parameters, in-depth analysis)
Code (any programming language with detailed comments)
Due Sunday 28 December
Credit: 15
48
Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons
J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann
Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo
ACM Computing Surveys Vol 31(3) pp 264-323
R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678
A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666
Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial
45
Review
What is clustering
What are the two categories of clustering methods
How does the K-Means algorithm work
What are the major issues of K-Means
How to control the number of clusters in Sequential Leader Clustering
How to use Gaussian mixture models for clustering
What are the main advantages of density methods
What is the core idea of DBSCAN
What is the general procedure of hierarchical clustering
Which clustering methods do not require K as the input
46
Next Weekrsquos Class Talk
Volunteers are required for next weekrsquos class talk
Topic Affinity Propagation
Science 315 972ndash976 2007
Clustering by passing messages between points
httpwwwpsitorontoeduindexphpq=affinity20propagation
Topic Clustering by Fast Search and Find of Density Peaks
Science 344 1492ndash1496 2014
Cluster centers higher density than neighbors
Cluster centers distant from others points with higher densities
Length 20 minutes plus question time
47
Assignment
Topic Clustering Techniques and Applications
Techniques
K-Means
Another clustering method for comparison
Task 1 2D Artificial Datasets
To demonstrate the influence of data patterns
To demonstrate the influence of algorithm factors
Task 2 Image Segmentation
Gray vs Colour
Deliverables
Reports (experiment specification algorithm parameters in-depth analysis)
Code (any programming language with detailed comments)
Due Sunday 28 December
Credit 15 48
- Clustering
- Overview
- What is cluster analysis
- Clusters
- Clusters (2)
- Applications of Clustering
- Earthquakes
- Image Segmentation
- The Big Picture
- Requirements
- Practical Considerations
- Normalization or Not
- Evaluation
- Evaluation (2)
- The Influence of Outliers
- K-Means
- K-Means (2)
- K-Means (3)
- K-Means (4)
- Comments on K-Means
- The Influence of Initial Centroids
- The Influence of Initial Centroids (2)
- The K-Medoids Method
- The K-Medoids Method (2)
- Sequential Leader Clustering
- Silhouette
- Silhouette (2)
- Gaussian Mixture
- Clustering by Mixture Models
- K-Means Revisited
- Expectation Maximization
- Slide 32
- EM Gaussian Mixture
- Density Based Methods
- DBSCAN
- DBSCAN (2)
- DBSCAN (3)
- Hierarchical Clustering
- Dinosaur Family Tree
- Agglomerative Methods
- Example
- Example (2)
- Example (3)
- Min vs Max
- Reading Materials
- Review
- Next Weekrsquos Class Talk
- Assignment
-
Comments on K-Means
Pros
Simple and works well for regular disjoint clusters
Converges relatively fast
Relatively efficient and scalable: O(tkn), where t = number of iterations, k = number of centroids, n = number of data points
Cons
Need to specify the value of K in advance
• This can be difficult; domain knowledge may help
May converge to local optima
• In practice, try different initial centroids
May be sensitive to noisy data and outliers
• The mean of the data points is easily distorted by extreme values
Not suitable for clusters of non-convex shapes
20
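Since K-Means may converge to a local optimum, the usual remedy noted in the Cons list is to run it from several random initial centroids and keep the lowest-cost result. A minimal sketch in plain Python (1-D points for brevity; the function names are illustrative, not from the slides):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm on 1-D points; returns (centroids, total squared error)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)               # choose K initial centres
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                            # assignment step
            j = min(range(k), key=lambda j: (p - centroids[j]) ** 2)
            clusters[j].append(p)
        new = [sum(c) / len(c) if c else centroids[j]
               for j, c in enumerate(clusters)]     # update step (cluster means)
        if new == centroids:                        # no more new assignment
            break
        centroids = new
    cost = sum(min((p - c) ** 2 for c in centroids) for p in points)
    return centroids, cost

def kmeans_restarts(points, k, restarts=10):
    """Keep the lowest-cost run over several random initialisations."""
    return min((kmeans(points, k, seed=s) for s in range(restarts)),
               key=lambda t: t[1])

points = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
centroids, cost = kmeans_restarts(points, k=2)
```

With these two well-separated groups, the best run recovers the true cluster means regardless of initialisation.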
The Influence of Initial Centroids
21
The Influence of Initial Centroids
22
The K-Medoids Method
The basic idea is to use real data points as centres
Determine the value of K in advance
Randomly select K points as medoids
Assign each data point to the closest medoid
Calculate the cost of the configuration J
For each medoid m
For each non-medoid point o
Swap m and o and calculate the cost J′ of the new configuration
If the cost J′ of the best new configuration is lower than J, make the corresponding swap and repeat the above steps
Otherwise terminate the procedure
23
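The swap-based procedure above can be sketched as a greedy PAM-style loop in plain Python (the deterministic initial medoids stand in for a random choice; names are illustrative):

```python
def cost(points, medoids, dist):
    """Total distance of every point to its closest medoid (the J above)."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def k_medoids(points, k, dist):
    medoids = points[:k]                   # stand-in for a random selection
    best = cost(points, medoids, dist)
    improved = True
    while improved:                        # repeat while some swap lowers J
        improved = False
        for m in list(medoids):
            for o in points:               # try every non-medoid point o
                if o in medoids:
                    continue
                trial = [o if x == m else x for x in medoids]  # swap m and o
                j_prime = cost(points, trial, dist)
                if j_prime < best:         # keep the cheaper configuration
                    best, medoids, improved = j_prime, trial, True
    return medoids, best

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
manhattan = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
meds, j = k_medoids(pts, 2, manhattan)
```

Because medoids are real data points, the result is robust to the kind of outlier that would drag a K-Means centroid away.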
The K-Medoids Method
24
[Figure: the same data under two different medoid choices, with total cost = 20 vs. cost = 26]
Sequential Leader Clustering
A very efficient clustering algorithm
No iteration
Time complexity O(nk)
No need to specify K in advance
Choose a cluster threshold value
For every new data point
Compute the distance between the new data point and every cluster's centre
If the smallest such distance is below the chosen threshold, assign the new data point to the corresponding cluster and re-compute that cluster's centre
Otherwise create a new cluster with the new data point as its centre
Clustering results may be influenced by the sequence of data points
25
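The single pass above translates directly into code. A sketch in plain Python (1-D points; each centre is re-computed as the running mean of its members):

```python
def sequential_leader(points, threshold):
    """One pass over the data; returns a list of (centre, members) pairs."""
    clusters = []
    for p in points:
        if clusters:
            # distance from the new point to every existing centre
            i = min(range(len(clusters)),
                    key=lambda i: abs(p - clusters[i][0]))
            centre, members = clusters[i]
            if abs(p - centre) < threshold:
                members.append(p)          # assign to the closest cluster
                clusters[i] = (sum(members) / len(members), members)
                continue
        clusters.append((p, [p]))          # otherwise start a new cluster
    return clusters

clusters = sequential_leader([1.0, 1.2, 8.0, 0.9, 8.2], threshold=3.0)
```

Note how the threshold, not a K value, controls the number of clusters, and how feeding the same points in a different order could produce different clusters.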
Silhouette
A method of interpretation and validation of clusters of data
A succinct graphical representation of how well each data point lies within its cluster compared to other clusters
a(i): the average dissimilarity of i to all other points in the same cluster
b(i): the lowest average dissimilarity of i to the points of any other cluster
26
s(i) = (b(i) − a(i)) / max{a(i), b(i)}
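With a(i) and b(i) as defined above, s(i) can be computed directly. A sketch in plain Python (1-D values with absolute difference as the dissimilarity; assumes every cluster has at least two points):

```python
def silhouette(points, labels):
    """s(i) = (b(i) - a(i)) / max(a(i), b(i)) for every point i."""
    scores = []
    for i, (p, l) in enumerate(zip(points, labels)):
        # a(i): average dissimilarity to the rest of i's own cluster
        own = [abs(p - q) for j, (q, m) in enumerate(zip(points, labels))
               if m == l and j != i]
        a = sum(own) / len(own)
        # b(i): lowest average dissimilarity to any other cluster
        b = min(sum(abs(p - q) for q, m in zip(points, labels) if m == c)
                / labels.count(c)
                for c in set(labels) if c != l)
        scores.append((b - a) / max(a, b))
    return scores

s = silhouette([1.0, 1.2, 8.0, 8.2], [0, 0, 1, 1])
```

Values near 1 mean a point sits well inside its cluster; values near 0 or below suggest it lies between clusters or is mis-assigned.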
Silhouette
27
[Figure: silhouette values of two clusters (roughly −0.2 to 1) alongside the corresponding 2-D scatter plot of the data]
Gaussian Mixture
28
g(x; μ, σ) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²))
f(x) = Σᵢ₌₁ⁿ αᵢ gᵢ(x; μᵢ, σᵢ), where αᵢ ≥ 0 and Σᵢ₌₁ⁿ αᵢ = 1
Clustering by Mixture Models
29
K-Means Revisited
30
θ = ((x₁, y₁), (x₂, y₂)): model parameters
Z = (Cluster 1, Cluster 2): latent variables
Expectation Maximization
31
32
EM Gaussian Mixture
33
z_ij: whether instance i is generated by the jth Gaussian
n: the number of mixture components
m: the number of data points

E-step:
E[z_ij] = p(x = x_i | μ = μ_j) / Σₖ₌₁ⁿ p(x = x_i | μ = μ_k)
        = exp(−(x_i − μ_j)² / (2σ²)) / Σₖ₌₁ⁿ exp(−(x_i − μ_k)² / (2σ²))

M-step:
μ_j = Σᵢ₌₁ᵐ E[z_ij] x_i / Σᵢ₌₁ᵐ E[z_ij]
α_j = (1/m) Σᵢ₌₁ᵐ E[z_ij]
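The E and M steps above translate almost line for line into code. A sketch in plain Python for 1-D data with a shared, fixed σ (the data and starting means are illustrative toy values):

```python
import math

def em_gaussian_mixture(xs, mus, sigma=1.0, iters=50):
    """Iterate E and M steps; returns the fitted component means mu_j."""
    for _ in range(iters):
        # E-step: E[z_ij] proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2))
        resp = []
        for x in xs:
            w = [math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) for mu in mus]
            total = sum(w)
            resp.append([v / total for v in w])
        # M-step: mu_j = sum_i E[z_ij] x_i / sum_i E[z_ij]
        mus = [sum(r[j] * x for r, x in zip(resp, xs)) /
               sum(r[j] for r in resp)
               for j in range(len(mus))]
    return mus

mus = em_gaussian_mixture([0.0, 0.5, 9.5, 10.0], mus=[1.0, 8.0])
```

Compare this with K-Means: the E-step is a soft version of the assignment step (fractional responsibilities instead of hard labels), and the M-step is a weighted version of the centroid update.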
Density Based Methods
Generate clusters of arbitrary shapes
Robust against noise
No K value required in advance
Somewhat similar to human vision
34
DBSCAN
Density-Based Spatial Clustering of Applications with Noise
Density: the number of points within a specified radius
Core Point: a point with high density
Border Point: a point with low density, but in the neighbourhood of a core point
Noise Point: neither a core point nor a border point
35
[Figure: illustration of a core point, a border point, and a noise point]
DBSCAN
36
[Figure: p directly density reachable from q; p density reachable from q through intermediate core points; p and q density connected via o]
DBSCAN
A cluster is defined as the maximal set of density connected points
Start from a randomly selected unseen point P
If P is a core point, build a cluster by gradually adding all points that are density reachable from the current point set
Noise points are discarded (unlabelled)
37
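A compact sketch of this procedure in plain Python (eps and min_pts are the usual DBSCAN parameters; the helper names and toy data are illustrative):

```python
def dbscan(points, eps, min_pts, dist):
    """Label each point with a cluster id, or -1 for noise."""
    labels = {p: None for p in points}

    def neighbours(p):
        return [q for q in points if dist(p, q) <= eps]

    cid = 0
    for p in points:
        if labels[p] is not None:
            continue
        seeds = neighbours(p)
        if len(seeds) < min_pts:           # not a core point: mark as noise
            labels[p] = -1
            continue
        labels[p] = cid                    # start a new cluster from core p
        queue = [q for q in seeds if q != p]
        while queue:                       # grow by density reachability
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cid            # noise turns out to be a border point
            if labels[q] is not None:
                continue
            labels[q] = cid
            qn = neighbours(q)
            if len(qn) >= min_pts:         # q is also core: expand further
                queue.extend(x for x in qn if labels[x] is None)
        cid += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
euclid = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
labels = dbscan(pts, eps=1.5, min_pts=3, dist=euclid)
```

The dense square of four points forms one cluster, while the isolated point is left as noise, with no K supplied anywhere.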
Hierarchical Clustering
Produce a set of nested tree-like clusters
Can be visualized as a dendrogram
A clustering is obtained by cutting the dendrogram at the desired level
No need to specify K in advance
May correspond to meaningful taxonomies
38
Dinosaur Family Tree
39
Agglomerative Methods
Bottom-up Method
Assign each data point to its own cluster
Calculate the proximity matrix
Merge the pair of closest clusters
Repeat until only a single cluster remains
How to calculate the distance between clusters
Single Link
Minimum distance between points
Complete Link
Maximum distance between points
40
Example
41
BA FI MI NA RM TO
BA 0 662 877 255 412 996
FI 662 0 295 468 268 400
MI 877 295 0 754 564 138
NA 255 468 754 0 219 869
RM 412 268 564 219 0 669
TO 996 400 138 869 669 0
(Single Link)
Example
42
BA FI MITO NA RM
BA 0 662 877 255 412
FI 662 0 295 468 268
MITO 877 295 0 754 564
NA 255 468 754 0 219
RM 412 268 564 219 0
BA FI MITO NARM
BA 0 662 877 255
FI 662 0 295 268
MITO 877 295 0 564
NARM 255 268 564 0
Example
43
BANARM FI MITO
BANARM 0 268 564
FI 268 0 295
MITO 564 295 0
BAFINARM MITO
BAFINARM 0 295
MITO 295 0
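The merge sequence traced through the tables above (MI/TO at 138, then NA/RM at 219, then BA, then FI) can be reproduced with a small single-link loop in plain Python, using the distance matrix from slide 41:

```python
# Distance matrix from slide 41 (distances between Italian cities).
d = {"BA": {"FI": 662, "MI": 877, "NA": 255, "RM": 412, "TO": 996},
     "FI": {"MI": 295, "NA": 468, "RM": 268, "TO": 400},
     "MI": {"NA": 754, "RM": 564, "TO": 138},
     "NA": {"RM": 219, "TO": 869},
     "RM": {"TO": 669}}

def dist(a, b):
    """Symmetric lookup in the upper-triangular table above."""
    return d[a][b] if b in d.get(a, {}) else d[b][a]

def single_link(items):
    """Repeatedly merge the pair of clusters at the smallest single-link distance."""
    clusters = [{i} for i in items]
    merges = []
    while len(clusters) > 1:
        # single-link distance between clusters: minimum over all point pairs
        link = lambda a, b: min(dist(x, y) for x in a for y in b)
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: link(clusters[p[0]], clusters[p[1]]))
        h = link(clusters[i], clusters[j])
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        merges.append((sorted(merged), h))
    return merges

merges = single_link(["BA", "FI", "MI", "NA", "RM", "TO"])
```

Replacing min with max inside link turns this into complete link, which is exactly the Min vs. Max contrast of the next slide.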
Min vs Max
44
Reading Materials
Text Books
Richard O. Duda et al., Pattern Classification, Chapter 10, John Wiley & Sons
J. Han and M. Kamber, Data Mining: Concepts and Techniques, Chapter 8, Morgan Kaufmann
Survey Papers
A. K. Jain, M. N. Murty and P. J. Flynn (1999) "Data Clustering: A Review", ACM Computing Surveys, Vol. 31(3), pp. 264-323
R. Xu and D. Wunsch (2005) "Survey of Clustering Algorithms", IEEE Transactions on Neural Networks, Vol. 16(3), pp. 645-678
A. K. Jain (2010) "Data Clustering: 50 Years Beyond K-Means", Pattern Recognition Letters, Vol. 31, pp. 651-666
Online Tutorials
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html
http://www.autonlab.org/tutorials/kmeans.html
http://users.informatik.uni-halle.de/~hinnebur/ClusterTutorial
45
Review
What is clustering
What are the two categories of clustering methods
How does the K-Means algorithm work
What are the major issues of K-Means
How to control the number of clusters in Sequential Leader Clustering
How to use Gaussian mixture models for clustering
What are the main advantages of density methods
What is the core idea of DBSCAN
What is the general procedure of hierarchical clustering
Which clustering methods do not require K as the input
46
Next Weekrsquos Class Talk
Volunteers are required for next weekrsquos class talk
Topic: Affinity Propagation
Science 315, 972-976, 2007
Clustering by passing messages between points
http://www.psi.toronto.edu/index.php?q=affinity%20propagation
Topic: Clustering by Fast Search and Find of Density Peaks
Science 344, 1492-1496, 2014
Cluster centers have higher density than their neighbors
Cluster centers are relatively distant from any points with higher density
Length 20 minutes plus question time
47
Assignment
Topic: Clustering Techniques and Applications
Techniques
K-Means
Another clustering method for comparison
Task 1: 2D Artificial Datasets
To demonstrate the influence of data patterns
To demonstrate the influence of algorithm factors
Task 2: Image Segmentation
Gray vs Colour
Deliverables
Reports (experiment specification, algorithm parameters, in-depth analysis)
Code (any programming language with detailed comments)
Due Sunday 28 December
Credit: 15
48
- Clustering
- Overview
- What is cluster analysis
- Clusters
- Clusters (2)
- Applications of Clustering
- Earthquakes
- Image Segmentation
- The Big Picture
- Requirements
- Practical Considerations
- Normalization or Not
- Evaluation
- Evaluation (2)
- The Influence of Outliers
- K-Means
- K-Means (2)
- K-Means (3)
- K-Means (4)
- Comments on K-Means
- The Influence of Initial Centroids
- The Influence of Initial Centroids (2)
- The K-Medoids Method
- The K-Medoids Method (2)
- Sequential Leader Clustering
- Silhouette
- Silhouette (2)
- Gaussian Mixture
- Clustering by Mixture Models
- K-Means Revisited
- Expectation Maximization
- Slide 32
- EM Gaussian Mixture
- Density Based Methods
- DBSCAN
- DBSCAN (2)
- DBSCAN (3)
- Hierarchical Clustering
- Dinosaur Family Tree
- Agglomerative Methods
- Example
- Example (2)
- Example (3)
- Min vs Max
- Reading Materials
- Review
- Next Weekrsquos Class Talk
- Assignment
The Influence of Initial Centroids
21
The Influence of Initial Centroids
22
The K-Medoids Method
The basic idea is to use real data points as centres
Determine the value of K in advance
Randomly select K points as medoids
Assign each data point to the closest medoid
Calculate the cost of the configuration J
For each medoid m
For each non-medoid point o
Swap m and o and calculate the new cost of configuration Jprime
If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps
Otherwise terminate the procedure23
The K-Medoids Method
24
Cost =20 Cost =26
Sequential Leader Clustering
A very efficient clustering algorithm
No iteration
Time complexity O(nk)
No need to specify K in advance
Choose a cluster threshold value
For every new data point
Compute the distance between the new data point and every clusters centre
If the distance is smaller than the chosen threshold assign the new data point
to the corresponding cluster and re-compute cluster centre
Otherwise create a new cluster with the new data point as its centre
Clustering results may be influenced by the sequence of data points
25
Silhouette
A method of interpretation and validation of clusters of data
A succinct graphical representation of how well each data point lies within its cluster compared to other clusters
a(i) average dissimilarity of i with all other points in the same cluster
b(i) the lowest average dissimilarity of i to other clusters
26
)()(max
)()()(
iaib
iaibis
Silhouette
27
-02 0 02 04 06 08 1
1
2
Silhouette Value
Clu
ster
-3 -2 -1 0 1 2 3 4-3
-2
-1
0
1
2
3
4
Gaussian Mixture
28
)2()(
2
22
2
1)(
xexg
1amp0)()(1
i
ii
n
iiii xgxf
Clustering by Mixture Models
29
K-Means Revisited
30
120579=(1199091 1199101 ) (1199092 119910 2)
119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2
model parameters
latent parameters
Expectation Maximization
31
32
EM Gaussian Mixture
33
Gaussianjth by the generated is i instancer whethe
components mixture ofnumber the
points data ofnumber the
ijz
n
m
n
kk
x
j
x
n
kkki
jjiij
ki
ji
e
e
xxp
xxpzE
1
)(2
1
)(2
1
1
22
22
)|(
)|(][
m
iij
m
iiij
j
zE
xzE
1
1
][
][
m
iijj zE
m 1
][1
Density Based Methods
Generate clusters of arbitrary shapes
Robust against noise
No K value required in advance
Somewhat similar to human vision
34
DBSCAN
Density-Based Spatial Clustering of Applications with Noise
Density number of points within a specified radius
Core Point points with high density
Border Point points with low density but in the neighbourhood of a core point
Noise Point neither a core point nor a border point
35
Core Point
Noise Point
Border Point
DBSCAN
36
p
q
directly density reachable
p
q
density reachable
o
qp
density connected
DBSCAN
A cluster is defined as the maximal set of density connected points
Start from a randomly selected unseen point P
If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set
Noise points are discarded (unlabelled)
37
Hierarchical Clustering
Produce a set of nested tree-like clusters
Can be visualized as a dendrogram
Clustering is obtained by cutting at desired level
No need to specify K in advance
May correspond to meaningful taxonomies
38
Dinosaur Family Tree
39
Agglomerative Methods
Bottom-up Method
Assign each data point to a cluster
Calculate the proximity matrix
Merge the pair of closest clusters
Repeat until only a single cluster remains
How to calculate the distance between clusters
Single Link
Minimum distance between points
Complete Link
Maximum distance between points
40
Example
41
BA FI MI NA RM TO
BA 0 662 877 255 412 996
FI 662 0 295 468 268 400
MI 877 295 0 754 564 138
NA 255 468 754 0 219 869
RM 412 268 564 219 0 669
TO 996 400 138 869 669 0Single Link
Example
42
BA FI MITO NA RM
BA 0 662 877 255 412
FI 662 0 295 468 268
MITO 877 295 0 754 564
NA 255 468 754 0 219
RM 412 268 564 219 0
BA FI MITO NARM
BA 0 662 877 255
FI 662 0 295 268
MITO 877 295 0 564
NARM 255 268 564 0
Example
43
BANARM FI MITO
BANARM 0 268 564
FI 268 0 295
MITO 564 295 0
BAFINARM MITO
BAFINARM 0 295
MITO 295 0
Min vs Max
44
Reading Materials
Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons
J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann
Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo
ACM Computing Surveys Vol 31(3) pp 264-323
R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678
A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666
Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial
45
Review
What is clustering
What are the two categories of clustering methods
How does the K-Means algorithm work
What are the major issues of K-Means
How to control the number of clusters in Sequential Leader Clustering
How to use Gaussian mixture models for clustering
What are the main advantages of density methods
What is the core idea of DBSCAN
What is the general procedure of hierarchical clustering
Which clustering methods do not require K as the input
46
Next Weekrsquos Class Talk
Volunteers are required for next weekrsquos class talk
Topic Affinity Propagation
Science 315 972ndash976 2007
Clustering by passing messages between points
httpwwwpsitorontoeduindexphpq=affinity20propagation
Topic Clustering by Fast Search and Find of Density Peaks
Science 344 1492ndash1496 2014
Cluster centers higher density than neighbors
Cluster centers distant from others points with higher densities
Length 20 minutes plus question time
47
Assignment
Topic Clustering Techniques and Applications
Techniques
K-Means
Another clustering method for comparison
Task 1 2D Artificial Datasets
To demonstrate the influence of data patterns
To demonstrate the influence of algorithm factors
Task 2 Image Segmentation
Gray vs Colour
Deliverables
Reports (experiment specification algorithm parameters in-depth analysis)
Code (any programming language with detailed comments)
Due Sunday 28 December
Credit 15 48
- Clustering
- Overview
- What is cluster analysis
- Clusters
- Clusters (2)
- Applications of Clustering
- Earthquakes
- Image Segmentation
- The Big Picture
- Requirements
- Practical Considerations
- Normalization or Not
- Evaluation
- Evaluation (2)
- The Influence of Outliers
- K-Means
- K-Means (2)
- K-Means (3)
- K-Means (4)
- Comments on K-Means
- The Influence of Initial Centroids
- The Influence of Initial Centroids (2)
- The K-Medoids Method
- The K-Medoids Method (2)
- Sequential Leader Clustering
- Silhouette
- Silhouette (2)
- Gaussian Mixture
- Clustering by Mixture Models
- K-Means Revisited
- Expectation Maximization
- Slide 32
- EM Gaussian Mixture
- Density Based Methods
- DBSCAN
- DBSCAN (2)
- DBSCAN (3)
- Hierarchical Clustering
- Dinosaur Family Tree
- Agglomerative Methods
- Example
- Example (2)
- Example (3)
- Min vs Max
- Reading Materials
- Review
- Next Weekrsquos Class Talk
- Assignment
-
The Influence of Initial Centroids
22
The K-Medoids Method
The basic idea is to use real data points as centres
Determine the value of K in advance
Randomly select K points as medoids
Assign each data point to the closest medoid
Calculate the cost of the configuration J
For each medoid m
For each non-medoid point o
Swap m and o and calculate the new cost of configuration Jprime
If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps
Otherwise terminate the procedure23
The K-Medoids Method
24
Cost =20 Cost =26
Sequential Leader Clustering
A very efficient clustering algorithm
No iteration
Time complexity O(nk)
No need to specify K in advance
Choose a cluster threshold value
For every new data point
Compute the distance between the new data point and every clusters centre
If the distance is smaller than the chosen threshold assign the new data point
to the corresponding cluster and re-compute cluster centre
Otherwise create a new cluster with the new data point as its centre
Clustering results may be influenced by the sequence of data points
25
Silhouette
A method of interpretation and validation of clusters of data
A succinct graphical representation of how well each data point lies within its cluster compared to other clusters
a(i) average dissimilarity of i with all other points in the same cluster
b(i) the lowest average dissimilarity of i to other clusters
26
)()(max
)()()(
iaib
iaibis
Silhouette
27
-02 0 02 04 06 08 1
1
2
Silhouette Value
Clu
ster
-3 -2 -1 0 1 2 3 4-3
-2
-1
0
1
2
3
4
Gaussian Mixture
28
)2()(
2
22
2
1)(
xexg
1amp0)()(1
i
ii
n
iiii xgxf
Clustering by Mixture Models
29
K-Means Revisited
30
120579=(1199091 1199101 ) (1199092 119910 2)
119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2
model parameters
latent parameters
Expectation Maximization
31
32
EM Gaussian Mixture
33
Gaussianjth by the generated is i instancer whethe
components mixture ofnumber the
points data ofnumber the
ijz
n
m
n
kk
x
j
x
n
kkki
jjiij
ki
ji
e
e
xxp
xxpzE
1
)(2
1
)(2
1
1
22
22
)|(
)|(][
m
iij
m
iiij
j
zE
xzE
1
1
][
][
m
iijj zE
m 1
][1
Density Based Methods
Generate clusters of arbitrary shapes
Robust against noise
No K value required in advance
Somewhat similar to human vision
34
DBSCAN
Density-Based Spatial Clustering of Applications with Noise
Density number of points within a specified radius
Core Point points with high density
Border Point points with low density but in the neighbourhood of a core point
Noise Point neither a core point nor a border point
35
Core Point
Noise Point
Border Point
DBSCAN
36
p
q
directly density reachable
p
q
density reachable
o
qp
density connected
DBSCAN
A cluster is defined as the maximal set of density connected points
Start from a randomly selected unseen point P
If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set
Noise points are discarded (unlabelled)
37
Hierarchical Clustering
Produce a set of nested tree-like clusters
Can be visualized as a dendrogram
Clustering is obtained by cutting at desired level
No need to specify K in advance
May correspond to meaningful taxonomies
38
Dinosaur Family Tree
39
Agglomerative Methods
Bottom-up Method
Assign each data point to a cluster
Calculate the proximity matrix
Merge the pair of closest clusters
Repeat until only a single cluster remains
How to calculate the distance between clusters
Single Link
Minimum distance between points
Complete Link
Maximum distance between points
40
Example
41
BA FI MI NA RM TO
BA 0 662 877 255 412 996
FI 662 0 295 468 268 400
MI 877 295 0 754 564 138
NA 255 468 754 0 219 869
RM 412 268 564 219 0 669
TO 996 400 138 869 669 0Single Link
Example
42
BA FI MITO NA RM
BA 0 662 877 255 412
FI 662 0 295 468 268
MITO 877 295 0 754 564
NA 255 468 754 0 219
RM 412 268 564 219 0
BA FI MITO NARM
BA 0 662 877 255
FI 662 0 295 268
MITO 877 295 0 564
NARM 255 268 564 0
Example
43
BANARM FI MITO
BANARM 0 268 564
FI 268 0 295
MITO 564 295 0
BAFINARM MITO
BAFINARM 0 295
MITO 295 0
Min vs Max
44
Reading Materials
Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons
J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann
Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo
ACM Computing Surveys Vol 31(3) pp 264-323
R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678
A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666
Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial
45
Review
What is clustering
What are the two categories of clustering methods
How does the K-Means algorithm work
What are the major issues of K-Means
How to control the number of clusters in Sequential Leader Clustering
How to use Gaussian mixture models for clustering
What are the main advantages of density methods
What is the core idea of DBSCAN
What is the general procedure of hierarchical clustering
Which clustering methods do not require K as the input
46
Next Weekrsquos Class Talk
Volunteers are required for next weekrsquos class talk
Topic Affinity Propagation
Science 315 972ndash976 2007
Clustering by passing messages between points
httpwwwpsitorontoeduindexphpq=affinity20propagation
Topic Clustering by Fast Search and Find of Density Peaks
Science 344 1492ndash1496 2014
Cluster centers higher density than neighbors
Cluster centers distant from others points with higher densities
Length 20 minutes plus question time
47
Assignment
Topic Clustering Techniques and Applications
Techniques
K-Means
Another clustering method for comparison
Task 1 2D Artificial Datasets
To demonstrate the influence of data patterns
To demonstrate the influence of algorithm factors
Task 2 Image Segmentation
Gray vs Colour
Deliverables
Reports (experiment specification algorithm parameters in-depth analysis)
Code (any programming language with detailed comments)
Due Sunday 28 December
Credit 15 48
- Clustering
- Overview
- What is cluster analysis
- Clusters
- Clusters (2)
- Applications of Clustering
- Earthquakes
- Image Segmentation
- The Big Picture
- Requirements
- Practical Considerations
- Normalization or Not
- Evaluation
- Evaluation (2)
- The Influence of Outliers
- K-Means
- K-Means (2)
- K-Means (3)
- K-Means (4)
- Comments on K-Means
- The Influence of Initial Centroids
- The Influence of Initial Centroids (2)
- The K-Medoids Method
- The K-Medoids Method (2)
- Sequential Leader Clustering
- Silhouette
- Silhouette (2)
- Gaussian Mixture
- Clustering by Mixture Models
- K-Means Revisited
- Expectation Maximization
- Slide 32
- EM Gaussian Mixture
- Density Based Methods
- DBSCAN
- DBSCAN (2)
- DBSCAN (3)
- Hierarchical Clustering
- Dinosaur Family Tree
- Agglomerative Methods
- Example
- Example (2)
- Example (3)
- Min vs Max
- Reading Materials
- Review
- Next Weekrsquos Class Talk
- Assignment
-
The K-Medoids Method
The basic idea is to use real data points as centres
Determine the value of K in advance
Randomly select K points as medoids
Assign each data point to the closest medoid
Calculate the cost of the configuration J
For each medoid m
For each non-medoid point o
Swap m and o and calculate the new cost of configuration Jprime
If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps
Otherwise terminate the procedure23
The K-Medoids Method
24
Cost =20 Cost =26
Sequential Leader Clustering
A very efficient clustering algorithm
No iteration
Time complexity O(nk)
No need to specify K in advance
Choose a cluster threshold value
For every new data point
Compute the distance between the new data point and every clusters centre
If the distance is smaller than the chosen threshold assign the new data point
to the corresponding cluster and re-compute cluster centre
Otherwise create a new cluster with the new data point as its centre
Clustering results may be influenced by the sequence of data points
25
Silhouette
A method of interpretation and validation of clusters of data
A succinct graphical representation of how well each data point lies within its cluster compared to other clusters
a(i) average dissimilarity of i with all other points in the same cluster
b(i) the lowest average dissimilarity of i to other clusters
26
)()(max
)()()(
iaib
iaibis
Silhouette
27
-02 0 02 04 06 08 1
1
2
Silhouette Value
Clu
ster
-3 -2 -1 0 1 2 3 4-3
-2
-1
0
1
2
3
4
Gaussian Mixture
28
)2()(
2
22
2
1)(
xexg
1amp0)()(1
i
ii
n
iiii xgxf
Clustering by Mixture Models
29
K-Means Revisited
30
120579=(1199091 1199101 ) (1199092 119910 2)
119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2
model parameters
latent parameters
Expectation Maximization
31
32
EM Gaussian Mixture
33
Gaussianjth by the generated is i instancer whethe
components mixture ofnumber the
points data ofnumber the
ijz
n
m
n
kk
x
j
x
n
kkki
jjiij
ki
ji
e
e
xxp
xxpzE
1
)(2
1
)(2
1
1
22
22
)|(
)|(][
m
iij
m
iiij
j
zE
xzE
1
1
][
][
m
iijj zE
m 1
][1
Density Based Methods
Generate clusters of arbitrary shapes
Robust against noise
No K value required in advance
Somewhat similar to human vision
34
DBSCAN
Density-Based Spatial Clustering of Applications with Noise
Density number of points within a specified radius
Core Point points with high density
Border Point points with low density but in the neighbourhood of a core point
Noise Point neither a core point nor a border point
35
Core Point
Noise Point
Border Point
DBSCAN
36
p
q
directly density reachable
p
q
density reachable
o
qp
density connected
DBSCAN
A cluster is defined as the maximal set of density connected points
Start from a randomly selected unseen point P
If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set
Noise points are discarded (unlabelled)
37
Hierarchical Clustering
Produce a set of nested tree-like clusters
Can be visualized as a dendrogram
Clustering is obtained by cutting at desired level
No need to specify K in advance
May correspond to meaningful taxonomies
38
Dinosaur Family Tree
39
Agglomerative Methods
Bottom-up Method
Assign each data point to a cluster
Calculate the proximity matrix
Merge the pair of closest clusters
Repeat until only a single cluster remains
How to calculate the distance between clusters
Single Link
Minimum distance between points
Complete Link
Maximum distance between points
40
Example
41
BA FI MI NA RM TO
BA 0 662 877 255 412 996
FI 662 0 295 468 268 400
MI 877 295 0 754 564 138
NA 255 468 754 0 219 869
RM 412 268 564 219 0 669
TO 996 400 138 869 669 0Single Link
Example
42
BA FI MITO NA RM
BA 0 662 877 255 412
FI 662 0 295 468 268
MITO 877 295 0 754 564
NA 255 468 754 0 219
RM 412 268 564 219 0
BA FI MITO NARM
BA 0 662 877 255
FI 662 0 295 268
MITO 877 295 0 564
NARM 255 268 564 0
Example
43
BANARM FI MITO
BANARM 0 268 564
FI 268 0 295
MITO 564 295 0
BAFINARM MITO
BAFINARM 0 295
MITO 295 0
Min vs Max
44
Reading Materials
Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons
J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann
Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo
ACM Computing Surveys Vol 31(3) pp 264-323
R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678
A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666
Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial
45
Review
What is clustering
What are the two categories of clustering methods
How does the K-Means algorithm work
What are the major issues of K-Means
How to control the number of clusters in Sequential Leader Clustering
How to use Gaussian mixture models for clustering
What are the main advantages of density methods
What is the core idea of DBSCAN
What is the general procedure of hierarchical clustering
Which clustering methods do not require K as the input
46
Next Week's Class Talk
Volunteers are required for next week's class talk
Topic: Affinity Propagation
Science 315, 972–976, 2007
Clustering by passing messages between points
http://www.psi.toronto.edu/index.php?q=affinity%20propagation
Topic: Clustering by Fast Search and Find of Density Peaks
Science 344, 1492–1496, 2014
Cluster centers have higher density than their neighbors
Cluster centers are relatively distant from other points with higher densities
Length 20 minutes plus question time
47
Assignment
Topic Clustering Techniques and Applications
Techniques
K-Means
Another clustering method for comparison
Task 1 2D Artificial Datasets
To demonstrate the influence of data patterns
To demonstrate the influence of algorithm factors
Task 2 Image Segmentation
Gray vs Colour
Deliverables
Reports (experiment specification, algorithm parameters, in-depth analysis)
Code (any programming language with detailed comments)
Due Sunday 28 December
Credit 15
48
- Clustering
- Overview
- What is cluster analysis
- Clusters
- Clusters (2)
- Applications of Clustering
- Earthquakes
- Image Segmentation
- The Big Picture
- Requirements
- Practical Considerations
- Normalization or Not
- Evaluation
- Evaluation (2)
- The Influence of Outliers
- K-Means
- K-Means (2)
- K-Means (3)
- K-Means (4)
- Comments on K-Means
- The Influence of Initial Centroids
- The Influence of Initial Centroids (2)
- The K-Medoids Method
- The K-Medoids Method (2)
- Sequential Leader Clustering
- Silhouette
- Silhouette (2)
- Gaussian Mixture
- Clustering by Mixture Models
- K-Means Revisited
- Expectation Maximization
- Slide 32
- EM Gaussian Mixture
- Density Based Methods
- DBSCAN
- DBSCAN (2)
- DBSCAN (3)
- Hierarchical Clustering
- Dinosaur Family Tree
- Agglomerative Methods
- Example
- Example (2)
- Example (3)
- Min vs Max
- Reading Materials
- Review
- Next Week's Class Talk
- Assignment
The K-Medoids Method
24
Cost = 20 vs Cost = 26
Sequential Leader Clustering
A very efficient clustering algorithm
No iteration
Time complexity O(nk)
No need to specify K in advance
Choose a cluster threshold value
For every new data point
Compute the distance between the new data point and every cluster's centre
If the distance is smaller than the chosen threshold, assign the new data point
to the corresponding cluster and re-compute the cluster centre
Otherwise create a new cluster with the new data point as its centre
Clustering results may be influenced by the sequence of data points
25
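The one-pass procedure above can be sketched in Python as follows (function name, sample points and the threshold value are illustrative, not from the slides):

```python
import math

def leader_cluster(points, threshold):
    """Sequential Leader Clustering: a single pass over the data;
    the threshold (not K) controls how many clusters emerge."""
    centres, members = [], []
    for p in points:
        if centres:
            d, idx = min((math.dist(p, c), i) for i, c in enumerate(centres))
        else:
            d, idx = float("inf"), -1
        if d < threshold:
            members[idx].append(p)
            m = members[idx]
            centres[idx] = tuple(sum(coord) / len(m) for coord in zip(*m))  # re-compute centre
        else:
            centres.append(p)       # new cluster with p as its centre
            members.append([p])
    return centres, members

centres, members = leader_cluster([(0, 0), (0, 1), (10, 10), (10, 11)], threshold=3.0)
print(len(centres))  # 2 clusters for this ordering and threshold
```

Feeding the same points in a different order can change both the number of clusters and their centres, which is the order sensitivity noted in the last bullet.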
Silhouette
A method of interpretation and validation of clusters of data
A succinct graphical representation of how well each data point lies within its cluster compared to other clusters
a(i): average dissimilarity of i with all other points in the same cluster
b(i): the lowest average dissimilarity of i to any other cluster
26
s(i) = (b(i) - a(i)) / max(a(i), b(i))
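A direct Python transcription of these definitions (the helper name and test points are illustrative):

```python
import math

def silhouette(point, own_cluster, other_clusters):
    # a(i): average dissimilarity to the other points of its own cluster
    a = (sum(math.dist(point, q) for q in own_cluster if q != point)
         / max(len(own_cluster) - 1, 1))
    # b(i): lowest average dissimilarity to any other cluster
    b = min(sum(math.dist(point, q) for q in c) / len(c) for c in other_clusters)
    # s(i) = (b - a) / max(a, b), always in [-1, 1]
    return (b - a) / max(a, b)

print(silhouette((0, 0), [(0, 0), (0, 1)], [[(10, 0), (10, 1)]]))  # close to 1
```

A value of s(i) close to 1 means the point sits well inside its cluster; values near 0 or below suggest it lies between clusters or is assigned to the wrong one.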
Silhouette
27
[Figure: silhouette values for two clusters (horizontal axis from -0.2 to 1) alongside the corresponding 2-D scatter plot]
Gaussian Mixture
28
g(x; mu, sigma) = (1 / sqrt(2 pi sigma^2)) * exp(-(x - mu)^2 / (2 sigma^2))

f(x) = sum_{i=1}^{n} alpha_i * g(x; mu_i, sigma_i), with alpha_i >= 0 and sum_{i=1}^{n} alpha_i = 1
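The mixture density f(x) is just a weighted sum of component Gaussians, e.g. in Python (an illustrative sketch; the example parameters are made up):

```python
import math

def gaussian(x, mu, sigma):
    # g(x; mu, sigma): the 1-D normal density
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def mixture_pdf(x, alphas, mus, sigmas):
    # f(x) = sum_i alpha_i * g(x; mu_i, sigma_i), with the alphas summing to 1
    return sum(a * gaussian(x, m, s) for a, m, s in zip(alphas, mus, sigmas))

print(mixture_pdf(0.0, [0.6, 0.4], [0.0, 5.0], [1.0, 2.0]))
```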
Clustering by Mixture Models
29
K-Means Revisited
30
theta = ((x1, y1), (x2, y2)) : model parameters (the cluster centres)
Z = (Cluster 1, Cluster 2) : latent parameters (the cluster assignments)
Expectation Maximization
31
32
EM Gaussian Mixture
33
m : the number of data points
n : the number of mixture components
z_ij : whether instance i is generated by the jth Gaussian

E-step:
E[z_ij] = p(x = x_i | mu = mu_j) / sum_{k=1}^{n} p(x = x_i | mu = mu_k)
        = exp(-(x_i - mu_j)^2 / (2 sigma^2)) / sum_{k=1}^{n} exp(-(x_i - mu_k)^2 / (2 sigma^2))

M-step:
mu_j = sum_{i=1}^{m} E[z_ij] x_i / sum_{i=1}^{m} E[z_ij]
alpha_j = (1/m) sum_{i=1}^{m} E[z_ij]
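In the slide's simplified setting (equal mixing weights and a known shared sigma, so only the means are re-estimated), the alternating E- and M-steps can be sketched as (illustrative Python, not the lecture's code):

```python
import math

def em_gmm(xs, mus, sigma=1.0, n_iter=100):
    """EM for a 1-D Gaussian mixture with equal weights and a fixed shared sigma."""
    for _ in range(n_iter):
        # E-step: E[z_ij] proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2))
        resp = []
        for x in xs:
            w = [math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) for m in mus]
            total = sum(w)
            resp.append([wi / total for wi in w])
        # M-step: mu_j = sum_i E[z_ij] x_i / sum_i E[z_ij]
        mus = [sum(r[j] * x for r, x in zip(resp, xs)) / sum(r[j] for r in resp)
               for j in range(len(mus))]
    return mus

print(em_gmm([-0.1, 0.0, 0.1, 4.9, 5.0, 5.1], [-1.0, 6.0]))  # means converge near 0 and 5
```

Note the parallel with K-Means: hard assignments become soft responsibilities E[z_ij], and the centroid update becomes a responsibility-weighted mean.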
Density Based Methods
Generate clusters of arbitrary shapes
Robust against noise
No K value required in advance
Somewhat similar to human vision
34
DBSCAN
Density-Based Spatial Clustering of Applications with Noise
Density: number of points within a specified radius
Core Point: points with high density
Border Point: points with low density but in the neighbourhood of a core point
Noise Point: neither a core point nor a border point
35
Core Point
Noise Point
Border Point
DBSCAN
36
[Figure: examples of points p and q being directly density reachable; density reachable via intermediate points; and density connected through a common point o]
Gaussian Mixture
28
)2()(
2
22
2
1)(
xexg
1amp0)()(1
i
ii
n
iiii xgxf
Clustering by Mixture Models
29
K-Means Revisited
30
120579=(1199091 1199101 ) (1199092 119910 2)
119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2
model parameters
latent parameters
Expectation Maximization
31
32
EM Gaussian Mixture
33
Gaussianjth by the generated is i instancer whethe
components mixture ofnumber the
points data ofnumber the
ijz
n
m
n
kk
x
j
x
n
kkki
jjiij
ki
ji
e
e
xxp
xxpzE
1
)(2
1
)(2
1
1
22
22
)|(
)|(][
m
iij
m
iiij
j
zE
xzE
1
1
][
][
m
iijj zE
m 1
][1
Density Based Methods
Generate clusters of arbitrary shapes
Robust against noise
No K value required in advance
Somewhat similar to human vision
34
DBSCAN
Density-Based Spatial Clustering of Applications with Noise
Density number of points within a specified radius
Core Point points with high density
Border Point points with low density but in the neighbourhood of a core point
Noise Point neither a core point nor a border point
35
Core Point
Noise Point
Border Point
DBSCAN
36
p
q
directly density reachable
p
q
density reachable
o
qp
density connected
DBSCAN
A cluster is defined as the maximal set of density connected points
Start from a randomly selected unseen point P
If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set
Noise points are discarded (unlabelled)
37
Hierarchical Clustering
Produce a set of nested tree-like clusters
Can be visualized as a dendrogram
Clustering is obtained by cutting at desired level
No need to specify K in advance
May correspond to meaningful taxonomies
38
Dinosaur Family Tree
39
Agglomerative Methods
Bottom-up Method
Assign each data point to a cluster
Calculate the proximity matrix
Merge the pair of closest clusters
Repeat until only a single cluster remains
How to calculate the distance between clusters
Single Link
Minimum distance between points
Complete Link
Maximum distance between points
40
Example
41
BA FI MI NA RM TO
BA 0 662 877 255 412 996
FI 662 0 295 468 268 400
MI 877 295 0 754 564 138
NA 255 468 754 0 219 869
RM 412 268 564 219 0 669
TO 996 400 138 869 669 0Single Link
Example
42
BA FI MITO NA RM
BA 0 662 877 255 412
FI 662 0 295 468 268
MITO 877 295 0 754 564
NA 255 468 754 0 219
RM 412 268 564 219 0
BA FI MITO NARM
BA 0 662 877 255
FI 662 0 295 268
MITO 877 295 0 564
NARM 255 268 564 0
Example
43
BANARM FI MITO
BANARM 0 268 564
FI 268 0 295
MITO 564 295 0
BAFINARM MITO
BAFINARM 0 295
MITO 295 0
Min vs Max
44
Reading Materials
Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons
J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann
Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo
ACM Computing Surveys Vol 31(3) pp 264-323
R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678
A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666
Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial
45
Review
What is clustering
What are the two categories of clustering methods
How does the K-Means algorithm work
What are the major issues of K-Means
How to control the number of clusters in Sequential Leader Clustering
How to use Gaussian mixture models for clustering
What are the main advantages of density methods
What is the core idea of DBSCAN
What is the general procedure of hierarchical clustering
Which clustering methods do not require K as the input
46
Next Weekrsquos Class Talk
Volunteers are required for next weekrsquos class talk
Topic Affinity Propagation
Science 315 972ndash976 2007
Clustering by passing messages between points
httpwwwpsitorontoeduindexphpq=affinity20propagation
Topic Clustering by Fast Search and Find of Density Peaks
Science 344 1492ndash1496 2014
Cluster centers higher density than neighbors
Cluster centers distant from others points with higher densities
Length 20 minutes plus question time
47
Assignment
Topic Clustering Techniques and Applications
Techniques
K-Means
Another clustering method for comparison
Task 1 2D Artificial Datasets
To demonstrate the influence of data patterns
To demonstrate the influence of algorithm factors
Task 2 Image Segmentation
Gray vs Colour
Deliverables
Reports (experiment specification algorithm parameters in-depth analysis)
Code (any programming language with detailed comments)
Due Sunday 28 December
Credit 15 48
- Clustering
- Overview
- What is cluster analysis
- Clusters
- Clusters (2)
- Applications of Clustering
- Earthquakes
- Image Segmentation
- The Big Picture
- Requirements
- Practical Considerations
- Normalization or Not
- Evaluation
- Evaluation (2)
- The Influence of Outliers
- K-Means
- K-Means (2)
- K-Means (3)
- K-Means (4)
- Comments on K-Means
- The Influence of Initial Centroids
- The Influence of Initial Centroids (2)
- The K-Medoids Method
- The K-Medoids Method (2)
- Sequential Leader Clustering
- Silhouette
- Silhouette (2)
- Gaussian Mixture
- Clustering by Mixture Models
- K-Means Revisited
- Expectation Maximization
- Slide 32
- EM Gaussian Mixture
- Density Based Methods
- DBSCAN
- DBSCAN (2)
- DBSCAN (3)
- Hierarchical Clustering
- Dinosaur Family Tree
- Agglomerative Methods
- Example
- Example (2)
- Example (3)
- Min vs Max
- Reading Materials
- Review
- Next Week's Class Talk
- Assignment
Clustering by Mixture Models
29
K-Means Revisited
30
θ = ((x1, y1), (x2, y2)) : model parameters (the cluster centres)
Z = (Cluster 1, Cluster 2) : latent variables (the cluster assignments)
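This slide frames K-Means in mixture-model terms: the centroids play the role of the model parameters θ, and the hard cluster assignments play the role of the latent variables Z. A minimal sketch of this view (plain Python; the toy data points are illustrative):

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal 2-D K-Means: theta = the centroids, Z = the hard assignments."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)           # choose K cluster centres randomly
    for _ in range(iters):
        # "E-step": assign each point to its closest centroid (latent Z)
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        # "M-step": update each centroid to the mean of its cluster (theta)
        new_centroids = [
            tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:          # no more new assignments
            break
        centroids = new_centroids
    return centroids

# two well-separated toy groups
data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
        (5.0, 5.0), (5.1, 4.9), (4.9, 5.1)]
centres = sorted(kmeans(data, k=2))
```

Replacing the hard assignment with a soft, probabilistic one turns this alternation into the EM algorithm for a Gaussian mixture, which the following slides develop.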
Expectation Maximization
31
32
EM Gaussian Mixture
33
Notation:
m : the number of data points
n : the number of mixture components
z_ij : whether instance i is generated by the jth Gaussian

E-Step (expected responsibilities, with a shared variance σ²):

E[z_ij] = p(x_i | μ_j) / Σ_{k=1..n} p(x_i | μ_k)
        = exp(−(x_i − μ_j)² / 2σ²) / Σ_{k=1..n} exp(−(x_i − μ_k)² / 2σ²)

M-Step (re-estimate each mean as a responsibility-weighted average, and each mixing weight as an average responsibility):

μ_j = Σ_{i=1..m} E[z_ij] x_i / Σ_{i=1..m} E[z_ij]

p_j = (1/m) Σ_{i=1..m} E[z_ij]
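The EM updates on this slide can be sketched directly for a one-dimensional mixture with a shared, fixed variance σ² (plain Python; the toy data and σ = 1 are illustrative assumptions):

```python
import math

def em_gmm(x, mu, sigma=1.0, iters=50):
    """EM for a 1-D Gaussian mixture with fixed, shared variance sigma^2.

    E-step: E[z_ij] proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2))
    M-step: mu_j = sum_i E[z_ij] * x_i / sum_i E[z_ij]
    """
    mu = list(mu)
    for _ in range(iters):
        # E-step: responsibilities E[z_ij], normalized over components
        r = []
        for xi in x:
            w = [math.exp(-(xi - m) ** 2 / (2 * sigma ** 2)) for m in mu]
            s = sum(w)
            r.append([wj / s for wj in w])
        # M-step: each mean becomes a responsibility-weighted average
        mu = [
            sum(r[i][j] * x[i] for i in range(len(x))) /
            sum(r[i][j] for i in range(len(x)))
            for j in range(len(mu))
        ]
    return mu

data = [0.0, 0.2, -0.1, 9.9, 10.0, 10.2]
mu = em_gmm(data, mu=[1.0, 8.0])
```

Because the two groups are far apart relative to σ, the responsibilities become almost hard assignments and the means converge to the two group averages; the mixing-weight update p_j = (1/m) Σ_i E[z_ij] could be added in the same loop.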
Density Based Methods
Generate clusters of arbitrary shapes
Robust against noise
No K value required in advance
Somewhat similar to human vision
34
DBSCAN
Density-Based Spatial Clustering of Applications with Noise
Density: the number of points within a specified radius
Core Point: a point with high density
Border Point: a point with low density but in the neighbourhood of a core point
Noise Point: neither a core point nor a border point
35
(Figure: examples of a core point, a border point and a noise point)
DBSCAN
36
(Figures: directly density reachable (p, q); density reachable (p, q); density connected (p, q through o))
DBSCAN
A cluster is defined as the maximal set of density connected points.
Start from a randomly selected unseen point P.
If P is a core point, build a cluster by gradually adding all points that are density reachable to the current point set.
Noise points are discarded (unlabelled).
37
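The procedure above can be sketched with a naive O(n²) neighbour search (plain Python; the Eps and MinPts values and the toy points are illustrative):

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id; -1 marks noise."""
    n = len(points)
    labels = [None] * n                       # None = unseen
    neighbours = lambda i: [j for j in range(n)
                            if math.dist(points[i], points[j]) <= eps]
    cid = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:              # not a core point: noise, for now
            labels[i] = -1
            continue
        labels[i] = cid                       # core point: start a new cluster
        queue = [j for j in seeds if j != i]
        while queue:                          # add everything density reachable
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid               # border point: relabel from noise
            if labels[j] is not None:
                continue
            labels[j] = cid
            js = neighbours(j)
            if len(js) >= min_pts:            # j is also a core point: expand
                queue.extend(js)
        cid += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1),       # dense blob -> cluster 0
       (8, 8), (8, 9), (9, 8), (9, 9),       # second blob -> cluster 1
       (20, 20)]                             # isolated point -> noise
labels = dbscan(pts, eps=1.5, min_pts=3)
```

Border points are absorbed into the cluster of a nearby core point but never expand it further, and isolated points keep the label -1, which is how DBSCAN finds arbitrary shapes without a K value.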
Hierarchical Clustering
Produce a set of nested tree-like clusters
Can be visualized as a dendrogram
Clustering is obtained by cutting the dendrogram at the desired level
No need to specify K in advance
May correspond to meaningful taxonomies
38
Dinosaur Family Tree
39
Agglomerative Methods
Bottom-up Method
Assign each data point to its own cluster
Calculate the proximity matrix
Merge the pair of closest clusters
Repeat until only a single cluster remains
How to calculate the distance between clusters
Single Link
Minimum distance between points
Complete Link
Maximum distance between points
40
Example
41
BA FI MI NA RM TO
BA 0 662 877 255 412 996
FI 662 0 295 468 268 400
MI 877 295 0 754 564 138
NA 255 468 754 0 219 869
RM 412 268 564 219 0 669
TO 996 400 138 869 669 0
Single Link
Example
42
BA FI MITO NA RM
BA 0 662 877 255 412
FI 662 0 295 468 268
MITO 877 295 0 754 564
NA 255 468 754 0 219
RM 412 268 564 219 0
BA FI MITO NARM
BA 0 662 877 255
FI 662 0 295 268
MITO 877 295 0 564
NARM 255 268 564 0
Example
43
BANARM FI MITO
BANARM 0 268 564
FI 268 0 295
MITO 564 295 0
BAFINARM MITO
BAFINARM 0 295
MITO 295 0
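The merge sequence shown in these tables can be reproduced with a small single-link sketch (plain Python; the city distance matrix is copied from the slide):

```python
cities = ["BA", "FI", "MI", "NA", "RM", "TO"]
D = {
    ("BA", "FI"): 662, ("BA", "MI"): 877, ("BA", "NA"): 255,
    ("BA", "RM"): 412, ("BA", "TO"): 996, ("FI", "MI"): 295,
    ("FI", "NA"): 468, ("FI", "RM"): 268, ("FI", "TO"): 400,
    ("MI", "NA"): 754, ("MI", "RM"): 564, ("MI", "TO"): 138,
    ("NA", "RM"): 219, ("NA", "TO"): 869, ("RM", "TO"): 669,
}

def dist(a, b):
    """Single link: minimum distance over all cross-cluster city pairs."""
    return min(D[tuple(sorted((x, y)))] for x in a for y in b)

clusters = [frozenset([c]) for c in cities]
merges = []
while len(clusters) > 1:
    # find and merge the closest pair of clusters
    a, b = min(((a, b) for i, a in enumerate(clusters) for b in clusters[i+1:]),
               key=lambda p: dist(*p))
    merges.append(("".join(sorted(a | b)), dist(a, b)))
    clusters = [c for c in clusters if c not in (a, b)] + [a | b]

for name, d in merges:
    print(name, d)
```

The merges come out in the order of the tables above: MI+TO at 138, NA+RM at 219, BA+NARM at 255, +FI at 268, and finally +MITO at 295. Replacing min with max in dist gives complete link, the "Max" side of the next slide, which generally produces more compact clusters.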
Min vs Max
44
Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo
ACM Computing Surveys Vol 31(3) pp 264-323
R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678
A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666
Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial
45
Review
What is clustering
What are the two categories of clustering methods
How does the K-Means algorithm work
What are the major issues of K-Means
How to control the number of clusters in Sequential Leader Clustering
How to use Gaussian mixture models for clustering
What are the main advantages of density methods
What is the core idea of DBSCAN
What is the general procedure of hierarchical clustering
Which clustering methods do not require K as the input
46
Next Weekrsquos Class Talk
Volunteers are required for next weekrsquos class talk
Topic Affinity Propagation
Science 315 972ndash976 2007
Clustering by passing messages between points
httpwwwpsitorontoeduindexphpq=affinity20propagation
Topic Clustering by Fast Search and Find of Density Peaks
Science 344 1492ndash1496 2014
Cluster centers higher density than neighbors
Cluster centers distant from others points with higher densities
Length 20 minutes plus question time
47
Assignment
Topic Clustering Techniques and Applications
Techniques
K-Means
Another clustering method for comparison
Task 1 2D Artificial Datasets
To demonstrate the influence of data patterns
To demonstrate the influence of algorithm factors
Task 2 Image Segmentation
Gray vs Colour
Deliverables
Reports (experiment specification algorithm parameters in-depth analysis)
Code (any programming language with detailed comments)
Due Sunday 28 December
Credit 15 48
- Clustering
- Overview
- What is cluster analysis
- Clusters
- Clusters (2)
- Applications of Clustering
- Earthquakes
- Image Segmentation
- The Big Picture
- Requirements
- Practical Considerations
- Normalization or Not
- Evaluation
- Evaluation (2)
- The Influence of Outliers
- K-Means
- K-Means (2)
- K-Means (3)
- K-Means (4)
- Comments on K-Means
- The Influence of Initial Centroids
- The Influence of Initial Centroids (2)
- The K-Medoids Method
- The K-Medoids Method (2)
- Sequential Leader Clustering
- Silhouette
- Silhouette (2)
- Gaussian Mixture
- Clustering by Mixture Models
- K-Means Revisited
- Expectation Maximization
- Slide 32
- EM Gaussian Mixture
- Density Based Methods
- DBSCAN
- DBSCAN (2)
- DBSCAN (3)
- Hierarchical Clustering
- Dinosaur Family Tree
- Agglomerative Methods
- Example
- Example (2)
- Example (3)
- Min vs Max
- Reading Materials
- Review
- Next Weekrsquos Class Talk
- Assignment
-
K-Means Revisited
30
120579=(1199091 1199101 ) (1199092 119910 2)
119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2
model parameters
latent parameters
Expectation Maximization
31
32
EM Gaussian Mixture
33
Gaussianjth by the generated is i instancer whethe
components mixture ofnumber the
points data ofnumber the
ijz
n
m
n
kk
x
j
x
n
kkki
jjiij
ki
ji
e
e
xxp
xxpzE
1
)(2
1
)(2
1
1
22
22
)|(
)|(][
m
iij
m
iiij
j
zE
xzE
1
1
][
][
m
iijj zE
m 1
][1
Density Based Methods
Generate clusters of arbitrary shapes
Robust against noise
No K value required in advance
Somewhat similar to human vision
34
DBSCAN
Density-Based Spatial Clustering of Applications with Noise
Density number of points within a specified radius
Core Point points with high density
Border Point points with low density but in the neighbourhood of a core point
Noise Point neither a core point nor a border point
35
Core Point
Noise Point
Border Point
DBSCAN
36
p
q
directly density reachable
p
q
density reachable
o
qp
density connected
DBSCAN
A cluster is defined as the maximal set of density connected points
Start from a randomly selected unseen point P
If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set
Noise points are discarded (unlabelled)
37
Hierarchical Clustering
Produce a set of nested tree-like clusters
Can be visualized as a dendrogram
Clustering is obtained by cutting at desired level
No need to specify K in advance
May correspond to meaningful taxonomies
38
Dinosaur Family Tree
39
Agglomerative Methods
Bottom-up Method
Assign each data point to a cluster
Calculate the proximity matrix
Merge the pair of closest clusters
Repeat until only a single cluster remains
How to calculate the distance between clusters
Single Link
Minimum distance between points
Complete Link
Maximum distance between points
40
Example
41
BA FI MI NA RM TO
BA 0 662 877 255 412 996
FI 662 0 295 468 268 400
MI 877 295 0 754 564 138
NA 255 468 754 0 219 869
RM 412 268 564 219 0 669
TO 996 400 138 869 669 0Single Link
Example
42
BA FI MITO NA RM
BA 0 662 877 255 412
FI 662 0 295 468 268
MITO 877 295 0 754 564
NA 255 468 754 0 219
RM 412 268 564 219 0
BA FI MITO NARM
BA 0 662 877 255
FI 662 0 295 268
MITO 877 295 0 564
NARM 255 268 564 0
Example
43
BANARM FI MITO
BANARM 0 268 564
FI 268 0 295
MITO 564 295 0
BAFINARM MITO
BAFINARM 0 295
MITO 295 0
Min vs Max
44
Reading Materials
Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons
J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann
Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo
ACM Computing Surveys Vol 31(3) pp 264-323
R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678
A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666
Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial
45
Review
What is clustering
What are the two categories of clustering methods
How does the K-Means algorithm work
What are the major issues of K-Means
How to control the number of clusters in Sequential Leader Clustering
How to use Gaussian mixture models for clustering
What are the main advantages of density methods
What is the core idea of DBSCAN
What is the general procedure of hierarchical clustering
Which clustering methods do not require K as the input
46
Next Weekrsquos Class Talk
Volunteers are required for next weekrsquos class talk
Topic Affinity Propagation
Science 315 972ndash976 2007
Clustering by passing messages between points
httpwwwpsitorontoeduindexphpq=affinity20propagation
Topic Clustering by Fast Search and Find of Density Peaks
Science 344 1492ndash1496 2014
Cluster centers higher density than neighbors
Cluster centers distant from others points with higher densities
Length 20 minutes plus question time
47
Assignment
Topic Clustering Techniques and Applications
Techniques
K-Means
Another clustering method for comparison
Task 1 2D Artificial Datasets
To demonstrate the influence of data patterns
To demonstrate the influence of algorithm factors
Task 2 Image Segmentation
Gray vs Colour
Deliverables
Reports (experiment specification algorithm parameters in-depth analysis)
Code (any programming language with detailed comments)
Due Sunday 28 December
Credit 15 48
- Clustering
- Overview
- What is cluster analysis
- Clusters
- Clusters (2)
- Applications of Clustering
- Earthquakes
- Image Segmentation
- The Big Picture
- Requirements
- Practical Considerations
- Normalization or Not
- Evaluation
- Evaluation (2)
- The Influence of Outliers
- K-Means
- K-Means (2)
- K-Means (3)
- K-Means (4)
- Comments on K-Means
- The Influence of Initial Centroids
- The Influence of Initial Centroids (2)
- The K-Medoids Method
- The K-Medoids Method (2)
- Sequential Leader Clustering
- Silhouette
- Silhouette (2)
- Gaussian Mixture
- Clustering by Mixture Models
- K-Means Revisited
- Expectation Maximization
- Slide 32
- EM Gaussian Mixture
- Density Based Methods
- DBSCAN
- DBSCAN (2)
- DBSCAN (3)
- Hierarchical Clustering
- Dinosaur Family Tree
- Agglomerative Methods
- Example
- Example (2)
- Example (3)
- Min vs Max
- Reading Materials
- Review
- Next Weekrsquos Class Talk
- Assignment
-
Expectation Maximization
31
32
EM Gaussian Mixture
33
Gaussianjth by the generated is i instancer whethe
components mixture ofnumber the
points data ofnumber the
ijz
n
m
n
kk
x
j
x
n
kkki
jjiij
ki
ji
e
e
xxp
xxpzE
1
)(2
1
)(2
1
1
22
22
)|(
)|(][
m
iij
m
iiij
j
zE
xzE
1
1
][
][
m
iijj zE
m 1
][1
Density Based Methods
Generate clusters of arbitrary shapes
Robust against noise
No K value required in advance
Somewhat similar to human vision
34
DBSCAN
Density-Based Spatial Clustering of Applications with Noise
Density number of points within a specified radius
Core Point points with high density
Border Point points with low density but in the neighbourhood of a core point
Noise Point neither a core point nor a border point
35
Core Point
Noise Point
Border Point
DBSCAN
36
p
q
directly density reachable
p
q
density reachable
o
qp
density connected
DBSCAN
A cluster is defined as the maximal set of density connected points
Start from a randomly selected unseen point P
If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set
Noise points are discarded (unlabelled)
37
Hierarchical Clustering
Produce a set of nested tree-like clusters
Can be visualized as a dendrogram
Clustering is obtained by cutting at desired level
No need to specify K in advance
May correspond to meaningful taxonomies
38
Dinosaur Family Tree
39
Agglomerative Methods
Bottom-up Method
Assign each data point to a cluster
Calculate the proximity matrix
Merge the pair of closest clusters
Repeat until only a single cluster remains
How to calculate the distance between clusters
Single Link
Minimum distance between points
Complete Link
Maximum distance between points
40
Example
41
BA FI MI NA RM TO
BA 0 662 877 255 412 996
FI 662 0 295 468 268 400
MI 877 295 0 754 564 138
NA 255 468 754 0 219 869
RM 412 268 564 219 0 669
TO 996 400 138 869 669 0Single Link
Example
42
BA FI MITO NA RM
BA 0 662 877 255 412
FI 662 0 295 468 268
MITO 877 295 0 754 564
NA 255 468 754 0 219
RM 412 268 564 219 0
BA FI MITO NARM
BA 0 662 877 255
FI 662 0 295 268
MITO 877 295 0 564
NARM 255 268 564 0
Example
43
BANARM FI MITO
BANARM 0 268 564
FI 268 0 295
MITO 564 295 0
BAFINARM MITO
BAFINARM 0 295
MITO 295 0
Min vs Max
44
Reading Materials
Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons
J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann
Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo
ACM Computing Surveys Vol 31(3) pp 264-323
R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678
A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666
Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial
45
Review
What is clustering
What are the two categories of clustering methods
How does the K-Means algorithm work
What are the major issues of K-Means
How to control the number of clusters in Sequential Leader Clustering
How to use Gaussian mixture models for clustering
What are the main advantages of density methods
What is the core idea of DBSCAN
What is the general procedure of hierarchical clustering
Which clustering methods do not require K as the input
46
Next Weekrsquos Class Talk
Volunteers are required for next weekrsquos class talk
Topic Affinity Propagation
Science 315 972ndash976 2007
Clustering by passing messages between points
httpwwwpsitorontoeduindexphpq=affinity20propagation
Topic Clustering by Fast Search and Find of Density Peaks
Science 344 1492ndash1496 2014
Cluster centers higher density than neighbors
Cluster centers distant from others points with higher densities
Length 20 minutes plus question time
47
Assignment
Topic Clustering Techniques and Applications
Techniques
K-Means
Another clustering method for comparison
Task 1 2D Artificial Datasets
To demonstrate the influence of data patterns
To demonstrate the influence of algorithm factors
Task 2 Image Segmentation
Gray vs Colour
Deliverables
Reports (experiment specification algorithm parameters in-depth analysis)
Code (any programming language with detailed comments)
Due Sunday 28 December
Credit 15 48
- Clustering
- Overview
- What is cluster analysis
- Clusters
- Clusters (2)
- Applications of Clustering
- Earthquakes
- Image Segmentation
- The Big Picture
- Requirements
- Practical Considerations
- Normalization or Not
- Evaluation
- Evaluation (2)
- The Influence of Outliers
- K-Means
- K-Means (2)
- K-Means (3)
- K-Means (4)
- Comments on K-Means
- The Influence of Initial Centroids
- The Influence of Initial Centroids (2)
- The K-Medoids Method
- The K-Medoids Method (2)
- Sequential Leader Clustering
- Silhouette
- Silhouette (2)
- Gaussian Mixture
- Clustering by Mixture Models
- K-Means Revisited
- Expectation Maximization
- Slide 32
- EM Gaussian Mixture
- Density Based Methods
- DBSCAN
- DBSCAN (2)
- DBSCAN (3)
- Hierarchical Clustering
- Dinosaur Family Tree
- Agglomerative Methods
- Example
- Example (2)
- Example (3)
- Min vs Max
- Reading Materials
- Review
- Next Weekrsquos Class Talk
- Assignment
-
32
EM Gaussian Mixture
33
Gaussianjth by the generated is i instancer whethe
components mixture ofnumber the
points data ofnumber the
ijz
n
m
n
kk
x
j
x
n
kkki
jjiij
ki
ji
e
e
xxp
xxpzE
1
)(2
1
)(2
1
1
22
22
)|(
)|(][
m
iij
m
iiij
j
zE
xzE
1
1
][
][
m
iijj zE
m 1
][1
Density Based Methods
Generate clusters of arbitrary shapes
Robust against noise
No K value required in advance
Somewhat similar to human vision
34
DBSCAN
Density-Based Spatial Clustering of Applications with Noise
Density number of points within a specified radius
Core Point points with high density
Border Point points with low density but in the neighbourhood of a core point
Noise Point neither a core point nor a border point
35
Core Point
Noise Point
Border Point
DBSCAN
36
p
q
directly density reachable
p
q
density reachable
o
qp
density connected
DBSCAN
A cluster is defined as the maximal set of density connected points
Start from a randomly selected unseen point P
If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set
Noise points are discarded (unlabelled)
37
Hierarchical Clustering
Produce a set of nested tree-like clusters
Can be visualized as a dendrogram
Clustering is obtained by cutting at desired level
No need to specify K in advance
May correspond to meaningful taxonomies
38
Dinosaur Family Tree
39
Agglomerative Methods
Bottom-up Method
Assign each data point to a cluster
Calculate the proximity matrix
Merge the pair of closest clusters
Repeat until only a single cluster remains
How to calculate the distance between clusters
Single Link
Minimum distance between points
Complete Link
Maximum distance between points
40
Example
41
BA FI MI NA RM TO
BA 0 662 877 255 412 996
FI 662 0 295 468 268 400
MI 877 295 0 754 564 138
NA 255 468 754 0 219 869
RM 412 268 564 219 0 669
TO 996 400 138 869 669 0Single Link
Example
42
BA FI MITO NA RM
BA 0 662 877 255 412
FI 662 0 295 468 268
MITO 877 295 0 754 564
NA 255 468 754 0 219
RM 412 268 564 219 0
BA FI MITO NARM
BA 0 662 877 255
FI 662 0 295 268
MITO 877 295 0 564
NARM 255 268 564 0
Example
43
BANARM FI MITO
BANARM 0 268 564
FI 268 0 295
MITO 564 295 0
BAFINARM MITO
BAFINARM 0 295
MITO 295 0
Min vs Max
44
Reading Materials
Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons
J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann
Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo
ACM Computing Surveys Vol 31(3) pp 264-323
R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678
A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666
Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial
45
Review
What is clustering
What are the two categories of clustering methods
How does the K-Means algorithm work
What are the major issues of K-Means
How to control the number of clusters in Sequential Leader Clustering
How to use Gaussian mixture models for clustering
What are the main advantages of density methods
What is the core idea of DBSCAN
What is the general procedure of hierarchical clustering
Which clustering methods do not require K as the input
46
Next Weekrsquos Class Talk
Volunteers are required for next weekrsquos class talk
Topic Affinity Propagation
Science 315 972ndash976 2007
Clustering by passing messages between points
httpwwwpsitorontoeduindexphpq=affinity20propagation
Topic Clustering by Fast Search and Find of Density Peaks
Science 344 1492ndash1496 2014
Cluster centers higher density than neighbors
Cluster centers distant from others points with higher densities
Length 20 minutes plus question time
47
Assignment
Topic Clustering Techniques and Applications
Techniques
K-Means
Another clustering method for comparison
Task 1 2D Artificial Datasets
To demonstrate the influence of data patterns
To demonstrate the influence of algorithm factors
Task 2 Image Segmentation
Gray vs Colour
Deliverables
Reports (experiment specification algorithm parameters in-depth analysis)
Code (any programming language with detailed comments)
Due Sunday 28 December
Credit 15 48
- Clustering
- Overview
- What is cluster analysis
- Clusters
- Clusters (2)
- Applications of Clustering
- Earthquakes
- Image Segmentation
- The Big Picture
- Requirements
- Practical Considerations
- Normalization or Not
- Evaluation
- Evaluation (2)
- The Influence of Outliers
- K-Means
- K-Means (2)
- K-Means (3)
- K-Means (4)
- Comments on K-Means
- The Influence of Initial Centroids
- The Influence of Initial Centroids (2)
- The K-Medoids Method
- The K-Medoids Method (2)
- Sequential Leader Clustering
- Silhouette
- Silhouette (2)
- Gaussian Mixture
- Clustering by Mixture Models
- K-Means Revisited
- Expectation Maximization
- Slide 32
- EM Gaussian Mixture
- Density Based Methods
- DBSCAN
- DBSCAN (2)
- DBSCAN (3)
- Hierarchical Clustering
- Dinosaur Family Tree
- Agglomerative Methods
- Example
- Example (2)
- Example (3)
- Min vs Max
- Reading Materials
- Review
- Next Weekrsquos Class Talk
- Assignment
-
EM Gaussian Mixture
33
Gaussianjth by the generated is i instancer whethe
components mixture ofnumber the
points data ofnumber the
ijz
n
m
n
kk
x
j
x
n
kkki
jjiij
ki
ji
e
e
xxp
xxpzE
1
)(2
1
)(2
1
1
22
22
)|(
)|(][
m
iij
m
iiij
j
zE
xzE
1
1
][
][
m
iijj zE
m 1
][1
Density Based Methods
Generate clusters of arbitrary shapes
Robust against noise
No K value required in advance
Somewhat similar to human vision
34
DBSCAN
Density-Based Spatial Clustering of Applications with Noise
Density number of points within a specified radius
Core Point points with high density
Border Point points with low density but in the neighbourhood of a core point
Noise Point neither a core point nor a border point
35
Core Point
Noise Point
Border Point
DBSCAN
36
p
q
directly density reachable
p
q
density reachable
o
qp
density connected
DBSCAN
A cluster is defined as the maximal set of density connected points
Start from a randomly selected unseen point P
If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set
Noise points are discarded (unlabelled)
37
Hierarchical Clustering
Produce a set of nested tree-like clusters
Can be visualized as a dendrogram
Clustering is obtained by cutting at desired level
No need to specify K in advance
May correspond to meaningful taxonomies
38
Dinosaur Family Tree
39
Agglomerative Methods
Bottom-up Method
Assign each data point to a cluster
Calculate the proximity matrix
Merge the pair of closest clusters
Repeat until only a single cluster remains
How to calculate the distance between clusters
Single Link
Minimum distance between points
Complete Link
Maximum distance between points
40
Example
41
BA FI MI NA RM TO
BA 0 662 877 255 412 996
FI 662 0 295 468 268 400
MI 877 295 0 754 564 138
NA 255 468 754 0 219 869
RM 412 268 564 219 0 669
TO 996 400 138 869 669 0Single Link
Example
42
BA FI MITO NA RM
BA 0 662 877 255 412
FI 662 0 295 468 268
MITO 877 295 0 754 564
NA 255 468 754 0 219
RM 412 268 564 219 0
BA FI MITO NARM
BA 0 662 877 255
FI 662 0 295 268
MITO 877 295 0 564
NARM 255 268 564 0
Example
43
BANARM FI MITO
BANARM 0 268 564
FI 268 0 295
MITO 564 295 0
BAFINARM MITO
BAFINARM 0 295
MITO 295 0
Min vs Max
44
Reading Materials
Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons
J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann
Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo
ACM Computing Surveys Vol 31(3) pp 264-323
R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678
A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666
Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial
45
Review
What is clustering
What are the two categories of clustering methods
How does the K-Means algorithm work
What are the major issues of K-Means
How to control the number of clusters in Sequential Leader Clustering
How to use Gaussian mixture models for clustering
What are the main advantages of density methods
What is the core idea of DBSCAN
What is the general procedure of hierarchical clustering
Which clustering methods do not require K as the input
46
Next Weekrsquos Class Talk
Volunteers are required for next weekrsquos class talk
Topic Affinity Propagation
Science 315 972ndash976 2007
Clustering by passing messages between points
httpwwwpsitorontoeduindexphpq=affinity20propagation
Topic Clustering by Fast Search and Find of Density Peaks
Science 344 1492ndash1496 2014
Cluster centers higher density than neighbors
Cluster centers distant from others points with higher densities
Length 20 minutes plus question time
47
Assignment
Topic Clustering Techniques and Applications
Techniques
K-Means
Another clustering method for comparison
Task 1 2D Artificial Datasets
To demonstrate the influence of data patterns
To demonstrate the influence of algorithm factors
Task 2 Image Segmentation
Gray vs Colour
Deliverables
Reports (experiment specification algorithm parameters in-depth analysis)
Code (any programming language with detailed comments)
Due Sunday 28 December
Credit 15 48
- Clustering
- Overview
- What is cluster analysis
- Clusters
- Clusters (2)
- Applications of Clustering
- Earthquakes
- Image Segmentation
- The Big Picture
- Requirements
- Practical Considerations
- Normalization or Not
- Evaluation
- Evaluation (2)
- The Influence of Outliers
- K-Means
- K-Means (2)
- K-Means (3)
- K-Means (4)
- Comments on K-Means
- The Influence of Initial Centroids
- The Influence of Initial Centroids (2)
- The K-Medoids Method
- The K-Medoids Method (2)
- Sequential Leader Clustering
- Silhouette
- Silhouette (2)
- Gaussian Mixture
- Clustering by Mixture Models
- K-Means Revisited
- Expectation Maximization
- Slide 32
- EM Gaussian Mixture
- Density Based Methods
- DBSCAN
- DBSCAN (2)
- DBSCAN (3)
- Hierarchical Clustering
- Dinosaur Family Tree
- Agglomerative Methods
- Example
- Example (2)
- Example (3)
- Min vs Max
- Reading Materials
- Review
- Next Weekrsquos Class Talk
- Assignment
-
Density Based Methods
Generate clusters of arbitrary shapes
Robust against noise
No K value required in advance
Somewhat similar to human vision
34
DBSCAN
Density-Based Spatial Clustering of Applications with Noise
Density number of points within a specified radius
Core Point points with high density
Border Point points with low density but in the neighbourhood of a core point
Noise Point neither a core point nor a border point
35
Core Point
Noise Point
Border Point
DBSCAN
36
p
q
directly density reachable
p
q
density reachable
o
qp
density connected
DBSCAN
A cluster is defined as the maximal set of density connected points
Start from a randomly selected unseen point P
If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set
Noise points are discarded (unlabelled)
37
Hierarchical Clustering
Produce a set of nested tree-like clusters
Can be visualized as a dendrogram
Clustering is obtained by cutting at desired level
No need to specify K in advance
May correspond to meaningful taxonomies
38
Dinosaur Family Tree
39
Agglomerative Methods
Bottom-up Method
Assign each data point to a cluster
Calculate the proximity matrix
Merge the pair of closest clusters
Repeat until only a single cluster remains
How to calculate the distance between clusters
Single Link
Minimum distance between points
Complete Link
Maximum distance between points
40
Example
41
BA FI MI NA RM TO
BA 0 662 877 255 412 996
FI 662 0 295 468 268 400
MI 877 295 0 754 564 138
NA 255 468 754 0 219 869
RM 412 268 564 219 0 669
TO 996 400 138 869 669 0Single Link
Example
42
BA FI MITO NA RM
BA 0 662 877 255 412
FI 662 0 295 468 268
MITO 877 295 0 754 564
NA 255 468 754 0 219
RM 412 268 564 219 0
BA FI MITO NARM
BA 0 662 877 255
FI 662 0 295 268
MITO 877 295 0 564
NARM 255 268 564 0
Example
43
BANARM FI MITO
BANARM 0 268 564
FI 268 0 295
MITO 564 295 0
BAFINARM MITO
BAFINARM 0 295
MITO 295 0
Min vs Max
44
Reading Materials
Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons
J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann
Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo
ACM Computing Surveys Vol 31(3) pp 264-323
DBSCAN
Density-Based Spatial Clustering of Applications with Noise
Density: the number of points within a specified radius (Eps)
Core Point: a point with high density (at least MinPts points within Eps)
Border Point: a point with low density, but in the neighbourhood of a core point
Noise Point: neither a core point nor a border point
35
[Figure: example of core, border, and noise points]
DBSCAN
36
[Figure: point pairs illustrating the three reachability relations]
Directly density reachable: q is within Eps of a core point p
Density reachable: p and q are linked by a chain of directly density reachable points
Density connected: p and q are both density reachable from a common point o
DBSCAN
A cluster is defined as the maximal set of density connected points
Start from a randomly selected unseen point P
If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set
Noise points are discarded (unlabelled)
37
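The procedure above can be sketched in Python. This is a minimal illustration, not the lecture's code: `eps` and `min_pts` stand in for the Eps and MinPts parameters, distances are Euclidean, and noise points are returned as `None` labels.

```python
# Minimal DBSCAN sketch: grow a cluster from each unvisited core point
# by repeatedly absorbing density-reachable points.
import math

def dbscan(points, eps, min_pts):
    labels = {}          # point index -> cluster id
    visited = set()
    cluster_id = 0

    def neighbours(i):
        # all points within radius eps of point i (including i itself)
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if i in visited:
            continue
        visited.add(i)
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            continue                     # not a core point: noise unless claimed later
        cluster_id += 1                  # i is a core point: start a new cluster
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if j not in labels:
                labels[j] = cluster_id   # core or border point joins the cluster
            if j in visited:
                continue
            visited.add(j)
            jn = neighbours(j)
            if len(jn) >= min_pts:       # j is also a core point: expand further
                queue.extend(jn)
    return [labels.get(i) for i in range(len(points))]

# two dense blobs and one far-away point (noise)
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11),
       (5, 50)]
print(dbscan(pts, eps=1.5, min_pts=3))  # -> [1, 1, 1, 1, 2, 2, 2, 2, None]
```

Note that border points keep whichever cluster label reaches them first, which is why DBSCAN can be sensitive to the processing order for points on the boundary between two clusters.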
Hierarchical Clustering
Produces a set of nested, tree-like clusters
Can be visualized as a dendrogram
A clustering is obtained by cutting the dendrogram at the desired level
No need to specify K in advance
May correspond to meaningful taxonomies
38
Dinosaur Family Tree
39
Agglomerative Methods
Bottom-up method:
Assign each data point to its own cluster
Calculate the proximity matrix
Merge the pair of closest clusters
Repeat until only a single cluster remains
How to calculate the distance between clusters?
Single Link: minimum distance between points
Complete Link: maximum distance between points
40
Example
41
BA FI MI NA RM TO
BA 0 662 877 255 412 996
FI 662 0 295 468 268 400
MI 877 295 0 754 564 138
NA 255 468 754 0 219 869
RM 412 268 564 219 0 669
TO 996 400 138 869 669 0
(Single Link)
Example
42
BA FI MITO NA RM
BA 0 662 877 255 412
FI 662 0 295 468 268
MITO 877 295 0 754 564
NA 255 468 754 0 219
RM 412 268 564 219 0
BA FI MITO NARM
BA 0 662 877 255
FI 662 0 295 268
MITO 877 295 0 564
NARM 255 268 564 0
Example
43
BANARM FI MITO
BANARM 0 268 564
FI 268 0 295
MITO 564 295 0
BAFINARM MITO
BAFINARM 0 295
MITO 295 0
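The merge sequence in the tables above can be reproduced with a short single-link sketch (a minimal illustration, not the lecture's code; the city abbreviations are Bari, Florence, Milan, Naples, Rome, Turin):

```python
# Single-link agglomerative clustering on the city distance matrix:
# repeatedly merge the pair of clusters with the smallest minimum
# pairwise distance, as in the worked example.
cities = ["BA", "FI", "MI", "NA", "RM", "TO"]
D = {
    ("BA", "FI"): 662, ("BA", "MI"): 877, ("BA", "NA"): 255,
    ("BA", "RM"): 412, ("BA", "TO"): 996, ("FI", "MI"): 295,
    ("FI", "NA"): 468, ("FI", "RM"): 268, ("FI", "TO"): 400,
    ("MI", "NA"): 754, ("MI", "RM"): 564, ("MI", "TO"): 138,
    ("NA", "RM"): 219, ("NA", "TO"): 869, ("RM", "TO"): 669,
}

def dist(a, b):
    # single link: minimum pairwise distance between two clusters
    return min(D[tuple(sorted((x, y)))] for x in a for y in b)

clusters = [frozenset([c]) for c in cities]
merges = []
while len(clusters) > 1:
    i, j = min(
        ((i, j) for i in range(len(clusters))
                for j in range(i + 1, len(clusters))),
        key=lambda p: dist(clusters[p[0]], clusters[p[1]]),
    )
    d = dist(clusters[i], clusters[j])
    merged = clusters[i] | clusters[j]
    merges.append((merged, d))
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

for m, d in merges:
    print(sorted(m), d)
# MI/TO merge first (138), then NA/RM (219), then BA joins NA/RM (255),
# then FI (268), and the final merge happens at 295.
```

Swapping `min` for `max` inside `dist` turns this into complete link, which is the point of the Min vs Max comparison on the next slide.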
Min vs Max
44
Reading Materials
Text Books:
Richard O. Duda et al., Pattern Classification, Chapter 10, John Wiley & Sons
J. Han and M. Kamber, Data Mining: Concepts and Techniques, Chapter 8, Morgan Kaufmann
Survey Papers:
A. K. Jain, M. N. Murty and P. J. Flynn (1999), "Data Clustering: A Review", ACM Computing Surveys, Vol. 31(3), pp. 264-323
R. Xu and D. Wunsch (2005), "Survey of Clustering Algorithms", IEEE Transactions on Neural Networks, Vol. 16(3), pp. 645-678
A. K. Jain (2010), "Data Clustering: 50 Years Beyond K-Means", Pattern Recognition Letters, Vol. 31, pp. 651-666
Online Tutorials:
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/
http://www.autonlab.org/tutorials/kmeans.html
http://users.informatik.uni-halle.de/~hinnebur/ClusterTutorial
45
Review
What is clustering?
What are the two categories of clustering methods?
How does the K-Means algorithm work?
What are the major issues of K-Means?
How to control the number of clusters in Sequential Leader Clustering?
How to use Gaussian mixture models for clustering?
What are the main advantages of density based methods?
What is the core idea of DBSCAN?
What is the general procedure of hierarchical clustering?
Which clustering methods do not require K as the input?
46
Next Week's Class Talk
Volunteers are required for next week's class talk.
Topic: Affinity Propagation
Science 315, 972–976, 2007
Clustering by passing messages between points
http://www.psi.toronto.edu/index.php?q=affinity%20propagation
Topic: Clustering by Fast Search and Find of Density Peaks
Science 344, 1492–1496, 2014
Cluster centers have a higher density than their neighbors
Cluster centers are distant from other points with higher densities
Length: 20 minutes plus question time
47
Assignment
Topic: Clustering Techniques and Applications
Techniques:
K-Means
Another clustering method for comparison
Task 1: 2D Artificial Datasets
To demonstrate the influence of data patterns
To demonstrate the influence of algorithm factors
Task 2: Image Segmentation
Gray vs Colour
Deliverables:
Reports (experiment specification, algorithm parameters, in-depth analysis)
Code (any programming language, with detailed comments)
Due: Sunday, 28 December
Credit: 15
48
DBSCAN
36
p
q
directly density reachable
p
q
density reachable
o
qp
density connected
DBSCAN
A cluster is defined as the maximal set of density connected points
Start from a randomly selected unseen point P
If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set
Noise points are discarded (unlabelled)
37
Hierarchical Clustering
Produce a set of nested tree-like clusters
Can be visualized as a dendrogram
Clustering is obtained by cutting at desired level
No need to specify K in advance
May correspond to meaningful taxonomies
38
Dinosaur Family Tree
39
Agglomerative Methods
Bottom-up Method
Assign each data point to a cluster
Calculate the proximity matrix
Merge the pair of closest clusters
Repeat until only a single cluster remains
How to calculate the distance between clusters
Single Link
Minimum distance between points
Complete Link
Maximum distance between points
40
Example
41
BA FI MI NA RM TO
BA 0 662 877 255 412 996
FI 662 0 295 468 268 400
MI 877 295 0 754 564 138
NA 255 468 754 0 219 869
RM 412 268 564 219 0 669
TO 996 400 138 869 669 0Single Link
Example
42
BA FI MITO NA RM
BA 0 662 877 255 412
FI 662 0 295 468 268
MITO 877 295 0 754 564
NA 255 468 754 0 219
RM 412 268 564 219 0
BA FI MITO NARM
BA 0 662 877 255
FI 662 0 295 268
MITO 877 295 0 564
NARM 255 268 564 0
Example
43
BANARM FI MITO
BANARM 0 268 564
FI 268 0 295
MITO 564 295 0
BAFINARM MITO
BAFINARM 0 295
MITO 295 0
Min vs Max
44
Reading Materials
Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons
J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann
Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo
ACM Computing Surveys Vol 31(3) pp 264-323
R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678
A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666
Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial
45
Review
What is clustering
What are the two categories of clustering methods
How does the K-Means algorithm work
What are the major issues of K-Means
How to control the number of clusters in Sequential Leader Clustering
How to use Gaussian mixture models for clustering
What are the main advantages of density methods
What is the core idea of DBSCAN
What is the general procedure of hierarchical clustering
Which clustering methods do not require K as the input
46
Next Weekrsquos Class Talk
Volunteers are required for next weekrsquos class talk
Topic Affinity Propagation
Science 315 972ndash976 2007
Clustering by passing messages between points
httpwwwpsitorontoeduindexphpq=affinity20propagation
Topic Clustering by Fast Search and Find of Density Peaks
Science 344 1492ndash1496 2014
Cluster centers higher density than neighbors
Cluster centers distant from others points with higher densities
Length 20 minutes plus question time
47
Assignment
Topic Clustering Techniques and Applications
Techniques
K-Means
Another clustering method for comparison
Task 1 2D Artificial Datasets
To demonstrate the influence of data patterns
To demonstrate the influence of algorithm factors
Task 2 Image Segmentation
Gray vs Colour
Deliverables
Reports (experiment specification algorithm parameters in-depth analysis)
Code (any programming language with detailed comments)
Due Sunday 28 December
Credit 15 48
- Clustering
- Overview
- What is cluster analysis
- Clusters
- Clusters (2)
- Applications of Clustering
- Earthquakes
- Image Segmentation
- The Big Picture
- Requirements
- Practical Considerations
- Normalization or Not
- Evaluation
- Evaluation (2)
- The Influence of Outliers
- K-Means
- K-Means (2)
- K-Means (3)
- K-Means (4)
- Comments on K-Means
- The Influence of Initial Centroids
- The Influence of Initial Centroids (2)
- The K-Medoids Method
- The K-Medoids Method (2)
- Sequential Leader Clustering
- Silhouette
- Silhouette (2)
- Gaussian Mixture
- Clustering by Mixture Models
- K-Means Revisited
- Expectation Maximization
- Slide 32
- EM Gaussian Mixture
- Density Based Methods
- DBSCAN
- DBSCAN (2)
- DBSCAN (3)
- Hierarchical Clustering
- Dinosaur Family Tree
- Agglomerative Methods
- Example
- Example (2)
- Example (3)
- Min vs Max
- Reading Materials
- Review
- Next Weekrsquos Class Talk
- Assignment
-
DBSCAN
A cluster is defined as the maximal set of density connected points
Start from a randomly selected unseen point P
If P is a core point build a cluster by gradually adding all points that are density reachable to the current point set
Noise points are discarded (unlabelled)
37
Hierarchical Clustering
Produce a set of nested tree-like clusters
Can be visualized as a dendrogram
Clustering is obtained by cutting at desired level
No need to specify K in advance
May correspond to meaningful taxonomies
38
Dinosaur Family Tree
39
Agglomerative Methods
Bottom-up Method
Assign each data point to a cluster
Calculate the proximity matrix
Merge the pair of closest clusters
Repeat until only a single cluster remains
How to calculate the distance between clusters
Single Link
Minimum distance between points
Complete Link
Maximum distance between points
40
Example
41
BA FI MI NA RM TO
BA 0 662 877 255 412 996
FI 662 0 295 468 268 400
MI 877 295 0 754 564 138
NA 255 468 754 0 219 869
RM 412 268 564 219 0 669
TO 996 400 138 869 669 0Single Link
Example
42
BA FI MITO NA RM
BA 0 662 877 255 412
FI 662 0 295 468 268
MITO 877 295 0 754 564
NA 255 468 754 0 219
RM 412 268 564 219 0
BA FI MITO NARM
BA 0 662 877 255
FI 662 0 295 268
MITO 877 295 0 564
NARM 255 268 564 0
Example
43
BANARM FI MITO
BANARM 0 268 564
FI 268 0 295
MITO 564 295 0
BAFINARM MITO
BAFINARM 0 295
MITO 295 0
Min vs Max
44
Reading Materials
Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons
J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann
Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo
ACM Computing Surveys Vol 31(3) pp 264-323
R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678
A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666
Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial
45
Review
What is clustering
What are the two categories of clustering methods
How does the K-Means algorithm work
What are the major issues of K-Means
How to control the number of clusters in Sequential Leader Clustering
How to use Gaussian mixture models for clustering
What are the main advantages of density methods
What is the core idea of DBSCAN
What is the general procedure of hierarchical clustering
Which clustering methods do not require K as the input
46
Next Weekrsquos Class Talk
Volunteers are required for next weekrsquos class talk
Topic Affinity Propagation
Science 315 972ndash976 2007
Clustering by passing messages between points
httpwwwpsitorontoeduindexphpq=affinity20propagation
Topic Clustering by Fast Search and Find of Density Peaks
Science 344 1492ndash1496 2014
Cluster centers higher density than neighbors
Cluster centers distant from others points with higher densities
Length 20 minutes plus question time
47
Assignment
Topic Clustering Techniques and Applications
Techniques
K-Means
Another clustering method for comparison
Task 1 2D Artificial Datasets
To demonstrate the influence of data patterns
To demonstrate the influence of algorithm factors
Task 2 Image Segmentation
Gray vs Colour
Deliverables
Reports (experiment specification algorithm parameters in-depth analysis)
Code (any programming language with detailed comments)
Due Sunday 28 December
Credit 15 48
- Clustering
- Overview
- What is cluster analysis
- Clusters
- Clusters (2)
- Applications of Clustering
- Earthquakes
- Image Segmentation
- The Big Picture
- Requirements
- Practical Considerations
- Normalization or Not
- Evaluation
- Evaluation (2)
- The Influence of Outliers
- K-Means
- K-Means (2)
- K-Means (3)
- K-Means (4)
- Comments on K-Means
- The Influence of Initial Centroids
- The Influence of Initial Centroids (2)
- The K-Medoids Method
- The K-Medoids Method (2)
- Sequential Leader Clustering
- Silhouette
- Silhouette (2)
- Gaussian Mixture
- Clustering by Mixture Models
- K-Means Revisited
- Expectation Maximization
- Slide 32
- EM Gaussian Mixture
- Density Based Methods
- DBSCAN
- DBSCAN (2)
- DBSCAN (3)
- Hierarchical Clustering
- Dinosaur Family Tree
- Agglomerative Methods
- Example
- Example (2)
- Example (3)
- Min vs Max
- Reading Materials
- Review
- Next Weekrsquos Class Talk
- Assignment
-
Hierarchical Clustering
Produce a set of nested tree-like clusters
Can be visualized as a dendrogram
Clustering is obtained by cutting at desired level
No need to specify K in advance
May correspond to meaningful taxonomies
38
Dinosaur Family Tree
39
Agglomerative Methods
Bottom-up Method
Assign each data point to a cluster
Calculate the proximity matrix
Merge the pair of closest clusters
Repeat until only a single cluster remains
How to calculate the distance between clusters
Single Link
Minimum distance between points
Complete Link
Maximum distance between points
40
Example
41
BA FI MI NA RM TO
BA 0 662 877 255 412 996
FI 662 0 295 468 268 400
MI 877 295 0 754 564 138
NA 255 468 754 0 219 869
RM 412 268 564 219 0 669
TO 996 400 138 869 669 0Single Link
Example
42
BA FI MITO NA RM
BA 0 662 877 255 412
FI 662 0 295 468 268
MITO 877 295 0 754 564
NA 255 468 754 0 219
RM 412 268 564 219 0
BA FI MITO NARM
BA 0 662 877 255
FI 662 0 295 268
MITO 877 295 0 564
NARM 255 268 564 0
Example
43
BANARM FI MITO
BANARM 0 268 564
FI 268 0 295
MITO 564 295 0
BAFINARM MITO
BAFINARM 0 295
MITO 295 0
Min vs Max
44
Reading Materials
Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons
J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann
Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo
ACM Computing Surveys Vol 31(3) pp 264-323
R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678
A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666
Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial
45
Review
What is clustering
What are the two categories of clustering methods
How does the K-Means algorithm work
What are the major issues of K-Means
How to control the number of clusters in Sequential Leader Clustering
How to use Gaussian mixture models for clustering
What are the main advantages of density methods
What is the core idea of DBSCAN
What is the general procedure of hierarchical clustering
Which clustering methods do not require K as the input
46
Next Weekrsquos Class Talk
Volunteers are required for next weekrsquos class talk
Topic Affinity Propagation
Science 315 972ndash976 2007
Clustering by passing messages between points
httpwwwpsitorontoeduindexphpq=affinity20propagation
Topic Clustering by Fast Search and Find of Density Peaks
Science 344 1492ndash1496 2014
Cluster centers higher density than neighbors
Cluster centers distant from others points with higher densities
Length 20 minutes plus question time
47
Assignment
Topic Clustering Techniques and Applications
Techniques
K-Means
Another clustering method for comparison
Task 1 2D Artificial Datasets
To demonstrate the influence of data patterns
To demonstrate the influence of algorithm factors
Task 2 Image Segmentation
Gray vs Colour
Deliverables
Reports (experiment specification algorithm parameters in-depth analysis)
Code (any programming language with detailed comments)
Due Sunday 28 December
Credit 15 48
- Clustering
- Overview
- What is cluster analysis
- Clusters
- Clusters (2)
- Applications of Clustering
- Earthquakes
- Image Segmentation
- The Big Picture
- Requirements
- Practical Considerations
- Normalization or Not
- Evaluation
- Evaluation (2)
- The Influence of Outliers
- K-Means
- K-Means (2)
- K-Means (3)
- K-Means (4)
- Comments on K-Means
- The Influence of Initial Centroids
- The Influence of Initial Centroids (2)
- The K-Medoids Method
- The K-Medoids Method (2)
- Sequential Leader Clustering
- Silhouette
- Silhouette (2)
- Gaussian Mixture
- Clustering by Mixture Models
- K-Means Revisited
- Expectation Maximization
- Slide 32
- EM Gaussian Mixture
- Density Based Methods
- DBSCAN
- DBSCAN (2)
- DBSCAN (3)
- Hierarchical Clustering
- Dinosaur Family Tree
- Agglomerative Methods
- Example
- Example (2)
- Example (3)
- Min vs Max
- Reading Materials
- Review
- Next Weekrsquos Class Talk
- Assignment
-
Dinosaur Family Tree
39
Agglomerative Methods
Bottom-up Method
Assign each data point to a cluster
Calculate the proximity matrix
Merge the pair of closest clusters
Repeat until only a single cluster remains
How to calculate the distance between clusters
Single Link
Minimum distance between points
Complete Link
Maximum distance between points
40
Example
41
BA FI MI NA RM TO
BA 0 662 877 255 412 996
FI 662 0 295 468 268 400
MI 877 295 0 754 564 138
NA 255 468 754 0 219 869
RM 412 268 564 219 0 669
TO 996 400 138 869 669 0Single Link
Example
42
BA FI MITO NA RM
BA 0 662 877 255 412
FI 662 0 295 468 268
MITO 877 295 0 754 564
NA 255 468 754 0 219
RM 412 268 564 219 0
BA FI MITO NARM
BA 0 662 877 255
FI 662 0 295 268
MITO 877 295 0 564
NARM 255 268 564 0
Example
43
BANARM FI MITO
BANARM 0 268 564
FI 268 0 295
MITO 564 295 0
BAFINARM MITO
BAFINARM 0 295
MITO 295 0
Min vs Max
44
Reading Materials
Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons
J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann
Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo
ACM Computing Surveys Vol 31(3) pp 264-323
R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678
A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666
Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial
45
Review
What is clustering
What are the two categories of clustering methods
How does the K-Means algorithm work
What are the major issues of K-Means
How to control the number of clusters in Sequential Leader Clustering
How to use Gaussian mixture models for clustering
What are the main advantages of density methods
What is the core idea of DBSCAN
What is the general procedure of hierarchical clustering
Which clustering methods do not require K as the input
46
Next Weekrsquos Class Talk
Volunteers are required for next weekrsquos class talk
Topic Affinity Propagation
Science 315 972ndash976 2007
Clustering by passing messages between points
httpwwwpsitorontoeduindexphpq=affinity20propagation
Topic Clustering by Fast Search and Find of Density Peaks
Science 344 1492ndash1496 2014
Cluster centers higher density than neighbors
Cluster centers distant from others points with higher densities
Length 20 minutes plus question time
47
Assignment
Topic Clustering Techniques and Applications
Techniques
K-Means
Another clustering method for comparison
Task 1 2D Artificial Datasets
To demonstrate the influence of data patterns
To demonstrate the influence of algorithm factors
Task 2 Image Segmentation
Gray vs Colour
Deliverables
Reports (experiment specification algorithm parameters in-depth analysis)
Code (any programming language with detailed comments)
Due Sunday 28 December
Credit 15 48
- Clustering
- Overview
- What is cluster analysis
- Clusters
- Clusters (2)
- Applications of Clustering
- Earthquakes
- Image Segmentation
- The Big Picture
- Requirements
- Practical Considerations
- Normalization or Not
- Evaluation
- Evaluation (2)
- The Influence of Outliers
- K-Means
- K-Means (2)
- K-Means (3)
- K-Means (4)
- Comments on K-Means
- The Influence of Initial Centroids
- The Influence of Initial Centroids (2)
- The K-Medoids Method
- The K-Medoids Method (2)
- Sequential Leader Clustering
- Silhouette
- Silhouette (2)
- Gaussian Mixture
- Clustering by Mixture Models
- K-Means Revisited
- Expectation Maximization
- Slide 32
- EM Gaussian Mixture
- Density Based Methods
- DBSCAN
- DBSCAN (2)
- DBSCAN (3)
- Hierarchical Clustering
- Dinosaur Family Tree
- Agglomerative Methods
- Example
- Example (2)
- Example (3)
- Min vs Max
- Reading Materials
- Review
- Next Weekrsquos Class Talk
- Assignment
-
Agglomerative Methods
Bottom-up Method
Assign each data point to a cluster
Calculate the proximity matrix
Merge the pair of closest clusters
Repeat until only a single cluster remains
How to calculate the distance between clusters
Single Link
Minimum distance between points
Complete Link
Maximum distance between points
40
Example
41
BA FI MI NA RM TO
BA 0 662 877 255 412 996
FI 662 0 295 468 268 400
MI 877 295 0 754 564 138
NA 255 468 754 0 219 869
RM 412 268 564 219 0 669
TO 996 400 138 869 669 0Single Link
Example
42
BA FI MITO NA RM
BA 0 662 877 255 412
FI 662 0 295 468 268
MITO 877 295 0 754 564
NA 255 468 754 0 219
RM 412 268 564 219 0
BA FI MITO NARM
BA 0 662 877 255
FI 662 0 295 268
MITO 877 295 0 564
NARM 255 268 564 0
Example
43
BANARM FI MITO
BANARM 0 268 564
FI 268 0 295
MITO 564 295 0
BAFINARM MITO
BAFINARM 0 295
MITO 295 0
Min vs Max
44
Reading Materials
Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons
J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann
Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo
ACM Computing Surveys Vol 31(3) pp 264-323
R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678
A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666
Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial
45
Review
What is clustering
What are the two categories of clustering methods
How does the K-Means algorithm work
What are the major issues of K-Means
How to control the number of clusters in Sequential Leader Clustering
How to use Gaussian mixture models for clustering
What are the main advantages of density methods
What is the core idea of DBSCAN
What is the general procedure of hierarchical clustering
Which clustering methods do not require K as the input
46
Next Weekrsquos Class Talk
Volunteers are required for next weekrsquos class talk
Topic Affinity Propagation
Science 315 972ndash976 2007
Clustering by passing messages between points
httpwwwpsitorontoeduindexphpq=affinity20propagation
Topic Clustering by Fast Search and Find of Density Peaks
Science 344 1492ndash1496 2014
Cluster centers higher density than neighbors
Cluster centers distant from others points with higher densities
Length 20 minutes plus question time
47
Assignment
Topic Clustering Techniques and Applications
Techniques
K-Means
Another clustering method for comparison
Task 1 2D Artificial Datasets
To demonstrate the influence of data patterns
To demonstrate the influence of algorithm factors
Task 2 Image Segmentation
Gray vs Colour
Deliverables
Reports (experiment specification algorithm parameters in-depth analysis)
Code (any programming language with detailed comments)
Due Sunday 28 December
Credit 15 48
- Clustering
- Overview
- What is cluster analysis
- Clusters
- Clusters (2)
- Applications of Clustering
- Earthquakes
- Image Segmentation
- The Big Picture
- Requirements
- Practical Considerations
- Normalization or Not
- Evaluation
- Evaluation (2)
- The Influence of Outliers
- K-Means
- K-Means (2)
- K-Means (3)
- K-Means (4)
- Comments on K-Means
- The Influence of Initial Centroids
- The Influence of Initial Centroids (2)
- The K-Medoids Method
- The K-Medoids Method (2)
- Sequential Leader Clustering
- Silhouette
- Silhouette (2)
- Gaussian Mixture
- Clustering by Mixture Models
- K-Means Revisited
- Expectation Maximization
- Slide 32
- EM Gaussian Mixture
- Density Based Methods
- DBSCAN
- DBSCAN (2)
- DBSCAN (3)
- Hierarchical Clustering
- Dinosaur Family Tree
- Agglomerative Methods
- Example
- Example (2)
- Example (3)
- Min vs Max
- Reading Materials
- Review
- Next Weekrsquos Class Talk
- Assignment
-
Example
41
BA FI MI NA RM TO
BA 0 662 877 255 412 996
FI 662 0 295 468 268 400
MI 877 295 0 754 564 138
NA 255 468 754 0 219 869
RM 412 268 564 219 0 669
TO 996 400 138 869 669 0Single Link
Example
42
BA FI MITO NA RM
BA 0 662 877 255 412
FI 662 0 295 468 268
MITO 877 295 0 754 564
NA 255 468 754 0 219
RM 412 268 564 219 0
BA FI MITO NARM
BA 0 662 877 255
FI 662 0 295 268
MITO 877 295 0 564
NARM 255 268 564 0
Example
43
BANARM FI MITO
BANARM 0 268 564
FI 268 0 295
MITO 564 295 0
BAFINARM MITO
BAFINARM 0 295
MITO 295 0
Min vs Max
44
Reading Materials
Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons
J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann
Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo
ACM Computing Surveys Vol 31(3) pp 264-323
R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678
A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666
Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial
45
Review
What is clustering
What are the two categories of clustering methods
How does the K-Means algorithm work
What are the major issues of K-Means
How to control the number of clusters in Sequential Leader Clustering
How to use Gaussian mixture models for clustering
What are the main advantages of density methods
What is the core idea of DBSCAN
What is the general procedure of hierarchical clustering
Which clustering methods do not require K as the input
46
Next Weekrsquos Class Talk
Volunteers are required for next weekrsquos class talk
Topic Affinity Propagation
Science 315 972ndash976 2007
Clustering by passing messages between points
httpwwwpsitorontoeduindexphpq=affinity20propagation
Topic Clustering by Fast Search and Find of Density Peaks
Science 344 1492ndash1496 2014
Cluster centers higher density than neighbors
Cluster centers distant from others points with higher densities
Length 20 minutes plus question time
47
Assignment
Topic Clustering Techniques and Applications
Techniques
K-Means
Another clustering method for comparison
Task 1 2D Artificial Datasets
To demonstrate the influence of data patterns
To demonstrate the influence of algorithm factors
Task 2 Image Segmentation
Gray vs Colour
Deliverables
Reports (experiment specification algorithm parameters in-depth analysis)
Code (any programming language with detailed comments)
Due Sunday 28 December
Credit 15 48
- Clustering
- Overview
- What is cluster analysis
- Clusters
- Clusters (2)
- Applications of Clustering
- Earthquakes
- Image Segmentation
- The Big Picture
- Requirements
- Practical Considerations
- Normalization or Not
- Evaluation
- Evaluation (2)
- The Influence of Outliers
- K-Means
- K-Means (2)
- K-Means (3)
- K-Means (4)
- Comments on K-Means
- The Influence of Initial Centroids
- The Influence of Initial Centroids (2)
- The K-Medoids Method
- The K-Medoids Method (2)
- Sequential Leader Clustering
- Silhouette
- Silhouette (2)
- Gaussian Mixture
- Clustering by Mixture Models
- K-Means Revisited
- Expectation Maximization
- Slide 32
- EM Gaussian Mixture
- Density Based Methods
- DBSCAN
- DBSCAN (2)
- DBSCAN (3)
- Hierarchical Clustering
- Dinosaur Family Tree
- Agglomerative Methods
- Example
- Example (2)
- Example (3)
- Min vs Max
- Reading Materials
- Review
- Next Weekrsquos Class Talk
- Assignment
-
Example
42
BA FI MITO NA RM
BA 0 662 877 255 412
FI 662 0 295 468 268
MITO 877 295 0 754 564
NA 255 468 754 0 219
RM 412 268 564 219 0
BA FI MITO NARM
BA 0 662 877 255
FI 662 0 295 268
MITO 877 295 0 564
NARM 255 268 564 0
Example
43
BANARM FI MITO
BANARM 0 268 564
FI 268 0 295
MITO 564 295 0
BAFINARM MITO
BAFINARM 0 295
MITO 295 0
Min vs Max
44
Reading Materials
Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons
J Han and M Kamber Data Mining Concepts and Techniques Chapter 8 Morgan Kaufmann
Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo
ACM Computing Surveys Vol 31(3) pp 264-323
R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions on Neural Networks Vol 16(3) pp 645-678
A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern Recognition Letters Vol 31 pp 651-666
Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial
45
Review
What is clustering
What are the two categories of clustering methods
How does the K-Means algorithm work
What are the major issues of K-Means
How to control the number of clusters in Sequential Leader Clustering
How to use Gaussian mixture models for clustering
What are the main advantages of density methods
What is the core idea of DBSCAN
What is the general procedure of hierarchical clustering
Which clustering methods do not require K as the input
46
Next Week's Class Talk
Volunteers are required for next week's class talk.
Topic: Affinity Propagation
Science 315, 972-976, 2007
"Clustering by passing messages between data points"
http://www.psi.toronto.edu/index.php?q=affinity%20propagation
Topic: Clustering by Fast Search and Find of Density Peaks
Science 344, 1492-1496, 2014
Cluster centers have a higher density than their neighbors.
Cluster centers are relatively distant from other points with higher densities.
Length: 20 minutes plus question time
47
Assignment
Topic: Clustering Techniques and Applications
Techniques:
K-Means
Another clustering method for comparison
Task 1: 2D Artificial Datasets
To demonstrate the influence of data patterns
To demonstrate the influence of algorithm factors
Task 2: Image Segmentation
Gray vs Colour
Deliverables:
Reports (experiment specification, algorithm parameters, in-depth analysis)
Code (any programming language, with detailed comments)
Due: Sunday, 28 December
Credit: 15
48