12 Clustering - K-means (i-systems.github.io/HSE545/machine learning all/05...)
TRANSCRIPT
![Page 1](https://reader034.vdocuments.net/reader034/viewer/2022051803/5b057e6f7f8b9a93418b6b1b/html5/thumbnails/1.jpg)
Clustering: K-means
Industrial AI Lab.
![Page 2](https://reader034.vdocuments.net/reader034/viewer/2022051803/5b057e6f7f8b9a93418b6b1b/html5/thumbnails/2.jpg)
Supervised vs. Unsupervised Learning
• Supervised learning: building a model from labeled data
• Unsupervised learning: clustering from unlabeled data
![Page 3](https://reader034.vdocuments.net/reader034/viewer/2022051803/5b057e6f7f8b9a93418b6b1b/html5/thumbnails/3.jpg)
Data Clustering
• Data clustering is an unsupervised learning problem
• Given:
– $m$ unlabeled examples $\{x^{(1)}, x^{(2)}, \cdots, x^{(m)}\}$
– the number of partitions $k$
• Goal: group the examples into $k$ partitions
![Page 4](https://reader034.vdocuments.net/reader034/viewer/2022051803/5b057e6f7f8b9a93418b6b1b/html5/thumbnails/4.jpg)
Data Clustering: Similarity
• The only information clustering uses is the mutual similarity between samples
• A good clustering is one that achieves:
– high within-cluster similarity
– low inter-cluster similarity
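One way to make "high within-cluster similarity, low inter-cluster similarity" concrete is to compare average pairwise distances. The following is a minimal sketch, assuming Euclidean distance as the (dis)similarity measure and hand-made toy points; the helper name `mean_pairwise_distance` is hypothetical and not part of the lecture.

```python
import numpy as np

def mean_pairwise_distance(A, B):
    """Average Euclidean distance between every row of A and every row of B."""
    diffs = A[:, None, :] - B[None, :, :]            # shape (len(A), len(B), n)
    return np.sqrt((diffs ** 2).sum(axis=2)).mean()

# Two hand-made groups of 2-D points (toy data, an assumption for illustration)
group0 = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
group1 = np.array([[5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])

# Small distance means high similarity, so a good clustering has
# a small within-cluster value and a large between-cluster value.
within = 0.5 * (mean_pairwise_distance(group0, group0)
                + mean_pairwise_distance(group1, group1))
between = mean_pairwise_distance(group0, group1)
print(within, between)
```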
![Page 5](https://reader034.vdocuments.net/reader034/viewer/2022051803/5b057e6f7f8b9a93418b6b1b/html5/thumbnails/5.jpg)
K-means: (Iterative) Algorithm
1) Initialization
• Input
– $k$: the number of clusters
– Training set $\{x^{(1)}, x^{(2)}, \cdots, x^{(m)}\}$
• Randomly initialize cluster centers anywhere in $\mathbb{R}^n$
![Page 6](https://reader034.vdocuments.net/reader034/viewer/2022051803/5b057e6f7f8b9a93418b6b1b/html5/thumbnails/6.jpg)
K-means: (Iterative) Algorithm
2) Iteration
• Repeat until convergence (a possible convergence criterion: cluster centers do not change anymore)
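The body of the loop is not reproduced in this transcript. In the standard formulation of K-means (an assumption here, written with the $c$ and $\mu$ notation of the Output slide and with $j$ as a cluster index), each pass consists of an assignment step followed by an update step:

```latex
\begin{align*}
\text{Assignment:}\quad & c^{(i)} := \arg\min_{j \in \{1, \dots, k\}} \left\lVert x^{(i)} - \mu_j \right\rVert^2, && i = 1, \dots, m \\
\text{Update:}\quad & \mu_j := \frac{1}{\left\lvert \{ i : c^{(i)} = j \} \right\rvert} \sum_{i \,:\, c^{(i)} = j} x^{(i)}, && j = 1, \dots, k
\end{align*}
```

Neither step can increase the sum of squared distances between points and their assigned centers, which is why the iteration settles after finitely many passes.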
![Page 7](https://reader034.vdocuments.net/reader034/viewer/2022051803/5b057e6f7f8b9a93418b6b1b/html5/thumbnails/7.jpg)
K-means: (Iterative) Algorithm
3) Output
• $c$ (label): index (1 to $k$) of the cluster centroid (center) assigned to each example
• $\mu$: averages (means) of the points assigned to each cluster, $\mu_1, \mu_2, \cdots, \mu_k$
![Page 8](https://reader034.vdocuments.net/reader034/viewer/2022051803/5b057e6f7f8b9a93418b6b1b/html5/thumbnails/8.jpg)
Initialization (𝒌 = 𝟐)
![Page 9](https://reader034.vdocuments.net/reader034/viewer/2022051803/5b057e6f7f8b9a93418b6b1b/html5/thumbnails/9.jpg)
Assigning Points
![Page 10](https://reader034.vdocuments.net/reader034/viewer/2022051803/5b057e6f7f8b9a93418b6b1b/html5/thumbnails/10.jpg)
Recomputing the Cluster Centers
![Page 11](https://reader034.vdocuments.net/reader034/viewer/2022051803/5b057e6f7f8b9a93418b6b1b/html5/thumbnails/11.jpg)
Assigning Points
![Page 12](https://reader034.vdocuments.net/reader034/viewer/2022051803/5b057e6f7f8b9a93418b6b1b/html5/thumbnails/12.jpg)
Recomputing the Cluster Centers
![Page 13](https://reader034.vdocuments.net/reader034/viewer/2022051803/5b057e6f7f8b9a93418b6b1b/html5/thumbnails/13.jpg)
Assigning Points
![Page 14](https://reader034.vdocuments.net/reader034/viewer/2022051803/5b057e6f7f8b9a93418b6b1b/html5/thumbnails/14.jpg)
Recomputing the Cluster Centers
![Page 15](https://reader034.vdocuments.net/reader034/viewer/2022051803/5b057e6f7f8b9a93418b6b1b/html5/thumbnails/15.jpg)
Assigning Points
![Page 16](https://reader034.vdocuments.net/reader034/viewer/2022051803/5b057e6f7f8b9a93418b6b1b/html5/thumbnails/16.jpg)
Recomputing the Cluster Centers
![Page 17](https://reader034.vdocuments.net/reader034/viewer/2022051803/5b057e6f7f8b9a93418b6b1b/html5/thumbnails/17.jpg)
Summary: K-means Clustering
• (Iterative) Algorithm
![Page 18](https://reader034.vdocuments.net/reader034/viewer/2022051803/5b057e6f7f8b9a93418b6b1b/html5/thumbnails/18.jpg)
K-means: Optimization Point of View (optional)
• $c^{(i)}$ = index of cluster ($1, 2, \cdots, k$) to which example $x^{(i)}$ is currently assigned
• $\mu_k$ = cluster centroid
• $\mu_{c^{(i)}}$ = cluster centroid of the cluster to which example $x^{(i)}$ has been assigned
• Optimization objective:
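The formula itself is an image on the slide and is missing from this transcript. With $c^{(i)}$ denoting the cluster index of example $x^{(i)}$ and $\mu_{c^{(i)}}$ the centroid it is assigned to, the usual K-means objective (assumed here; the $1/m$ factor is a convention and does not change the minimizer) reads:

```latex
\begin{align*}
J\left(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_k\right)
  &= \frac{1}{m} \sum_{i=1}^{m} \left\lVert x^{(i)} - \mu_{c^{(i)}} \right\rVert^2 \\
\min_{c^{(1)}, \dots, c^{(m)},\; \mu_1, \dots, \mu_k} \;
  & J\left(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_k\right)
\end{align*}
```

Under this reading, assigning each point to its nearest centroid minimizes $J$ over the $c^{(i)}$ with the centroids held fixed, and recomputing the means minimizes $J$ over the $\mu_k$ with the assignments held fixed.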
![Page 19](https://reader034.vdocuments.net/reader034/viewer/2022051803/5b057e6f7f8b9a93418b6b1b/html5/thumbnails/19.jpg)
Expectation Maximization Algorithm
• It is a "chicken and egg" problem (dilemma)
– Q: if we knew the cluster centers $\mu_k$, how would we determine which points to associate with each cluster center?
– A: for each point $x^{(i)}$, choose the closest center $\mu_k$
– Q: if we knew the cluster memberships, how would we get the centers?
– A: choose $\mu_k$ to be the mean of all points in cluster $k$
• Extension of the K-means algorithm
– K-means is a special case of the Expectation Maximization (EM) algorithm
– K-means is a special case of fitting a Gaussian Mixture Model (GMM)
– Won’t be discussed in this course
![Page 20](https://reader034.vdocuments.net/reader034/viewer/2022051803/5b057e6f7f8b9a93418b6b1b/html5/thumbnails/20.jpg)
Python: Data Generation
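The code on this slide is an image and did not survive extraction. As a stand-in, here is a minimal sketch of how such toy data could be generated, assuming three 2-D Gaussian blobs; the blob centers, spread, sample counts, and variable names are all illustrative assumptions, not the lecture's values.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is reproducible

# Assumed setup: three 2-D Gaussian blobs with 100 points each
true_centers = np.array([[1.0, 1.0], [5.0, 5.0], [9.0, 1.0]])
X = np.vstack([center + rng.normal(scale=1.0, size=(100, 2))
               for center in true_centers])

m, n = X.shape
print(m, n)   # 300 examples, 2 features
```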
![Page 21](https://reader034.vdocuments.net/reader034/viewer/2022051803/5b057e6f7f8b9a93418b6b1b/html5/thumbnails/21.jpg)
Python: Random Initialization
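The original code for this slide is not in the transcript. The sketch below shows two common ways to pick random initial centers, under assumptions of my own: placeholder data, `k = 3`, and sampling restricted to the bounding box of the data rather than all of $\mathbb{R}^n$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2)) * 2.0 + 3.0   # placeholder data; any (m, n) array works

k = 3   # assumed number of clusters

# Option 1: draw centers uniformly inside the bounding box of the data
low, high = X.min(axis=0), X.max(axis=0)
mu_box = rng.uniform(low, high, size=(k, X.shape[1]))

# Option 2: use k distinct training examples as the initial centers
mu_pts = X[rng.choice(len(X), size=k, replace=False)]

print(mu_box)
print(mu_pts)
```

Option 2 guarantees that every initial center sits where data actually exists, which reduces the chance of a cluster starting out empty.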
![Page 22](https://reader034.vdocuments.net/reader034/viewer/2022051803/5b057e6f7f8b9a93418b6b1b/html5/thumbnails/22.jpg)
Python: K-Means
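The lecture's implementation is an image and is not available here. What follows is a minimal NumPy sketch of the assign/recompute loop described on the algorithm slides, not the author's code. Assumptions: centers are initialized from random training examples, labels are 0-based (the slides count from 1), and an empty cluster keeps its previous center.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-means. Returns (labels c, centers mu)."""
    rng = np.random.default_rng(seed)
    # Initialization: k distinct examples as starting centers (assumption)
    mu = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point
        dist2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (m, k)
        c = dist2.argmin(axis=1)
        # Update step: mean of the assigned points; empty clusters keep their center
        new_mu = np.array([X[c == j].mean(axis=0) if np.any(c == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):      # centers stopped moving: converged
            break
        mu = new_mu
    return c, mu

# Toy data: two well-separated blobs (illustrative values)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2)) for loc in ([0, 0], [5, 5])])
c, mu = kmeans(X, k=2)
print(mu)
```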
![Page 23](https://reader034.vdocuments.net/reader034/viewer/2022051803/5b057e6f7f8b9a93418b6b1b/html5/thumbnails/23.jpg)
Python: Result
![Page 24](https://reader034.vdocuments.net/reader034/viewer/2022051803/5b057e6f7f8b9a93418b6b1b/html5/thumbnails/24.jpg)
Python: K-Means in Scikit-learn
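The slide's code is not in the transcript; the sketch below shows one typical way to call scikit-learn's `KMeans`. The toy data and the parameter values (`n_clusters=3`, `n_init=10`, `random_state=0`) are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: three 2-D blobs (illustrative values)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2))
               for loc in ([0, 0], [5, 5], [0, 5])])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

labels = kmeans.labels_             # cluster index (0 .. k-1) of every example
centers = kmeans.cluster_centers_   # array of shape (k, n)
print(centers)
print(kmeans.inertia_)              # sum of squared distances to the closest center
```

Here `labels_` plays the role of $c$ and `cluster_centers_` the role of $\mu$ from the Output slide.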
![Page 25](https://reader034.vdocuments.net/reader034/viewer/2022051803/5b057e6f7f8b9a93418b6b1b/html5/thumbnails/25.jpg)
Initialization Issues
• K-means is extremely sensitive to cluster center initialization
• Bad initialization can lead to
– Poor convergence speed
– Bad overall clustering
• Safeguarding measures:
– Choose the first center as one of the examples, the second as the example farthest from the first, the third as the example farthest from both, and so on
– Try multiple initializations and choose the best result
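The first safeguard can be sketched as a "farthest-first" selection. This is one reading of the slide, under the assumption that "farthest from both" means the example whose distance to its nearest already-chosen center is largest; the function name `farthest_first_init` is hypothetical.

```python
import numpy as np

def farthest_first_init(X, k, rng):
    """Pick k examples as centers, each one far from the centers chosen so far."""
    centers = [X[rng.integers(len(X))]]          # first center: a random example
    # d2[i] = squared distance from example i to its nearest chosen center
    d2 = ((X - centers[0]) ** 2).sum(axis=1)
    for _ in range(k - 1):
        nxt = X[d2.argmax()]                     # farthest from all chosen centers
        centers.append(nxt)
        d2 = np.minimum(d2, ((X - nxt) ** 2).sum(axis=1))
    return np.array(centers)

# Toy data: three tight blobs (illustrative values)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.3, size=(30, 2))
               for loc in ([0, 0], [4, 0], [0, 4])])
mu0 = farthest_first_init(X, k=3, rng=rng)
print(mu0)
```

A caveat of this scheme is that it tends to pick outliers, since they are by construction far from everything else. For the second safeguard, scikit-learn's `KMeans` exposes an `n_init` parameter that reruns the algorithm from several initializations and keeps the run with the lowest inertia.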
![Page 26](https://reader034.vdocuments.net/reader034/viewer/2022051803/5b057e6f7f8b9a93418b6b1b/html5/thumbnails/26.jpg)
Choosing the Number of Clusters
• Idea: stop increasing $k$ when adding another cluster does not give a much better model of the data
• One way to select $k$ for the K-means algorithm is to try different values of $k$, plot the K-means objective versus $k$, and look for the "elbow point" in the plot
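The elbow procedure can be sketched numerically without a plot. This is a minimal illustration, assuming the objective is the mean squared distance to the nearest center, a simple restart-based K-means, and toy data with three blobs; the helper name `kmeans_objective` and all numeric settings are assumptions.

```python
import numpy as np

def kmeans_objective(X, k, n_restarts=20, max_iter=100, seed=0):
    """Lowest mean squared distance to the nearest center over several restarts."""
    rng = np.random.default_rng(seed)
    best = np.inf
    for _ in range(n_restarts):
        mu = X[rng.choice(len(X), size=k, replace=False)].astype(float)
        for _ in range(max_iter):
            d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
            c = d2.argmin(axis=1)
            new_mu = np.array([X[c == j].mean(axis=0) if np.any(c == j) else mu[j]
                               for j in range(k)])
            if np.allclose(new_mu, mu):
                break
            mu = new_mu
        # Objective value for the final centers of this restart
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        best = min(best, d2.min(axis=1).mean())
    return best

# Toy data with three blobs, so the elbow is expected around k = 3
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, size=(60, 2))
               for loc in ([0, 0], [6, 0], [3, 5])])

ks = range(1, 7)
J = [kmeans_objective(X, k) for k in ks]
for k, j in zip(ks, J):
    print(k, round(j, 3))
```

The objective keeps decreasing as $k$ grows, so the smallest value is not the criterion; the elbow is where the decrease from one $k$ to the next becomes small.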
![Page 27](https://reader034.vdocuments.net/reader034/viewer/2022051803/5b057e6f7f8b9a93418b6b1b/html5/thumbnails/27.jpg)
Choosing the Number of Clusters
![Page 28](https://reader034.vdocuments.net/reader034/viewer/2022051803/5b057e6f7f8b9a93418b6b1b/html5/thumbnails/28.jpg)
K-means: Limitations
• Makes hard assignments of points to clusters
– A point either belongs completely to a cluster or does not belong to it at all
– No notion of a soft assignment (i.e., probability of being assigned to each cluster)
– Gaussian mixture models (we will study later) and Fuzzy K-means allow soft assignments
• Sensitive to outlier examples
– The K-medians algorithm is a more robust alternative for data with outliers
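The outlier sensitivity comes from the mean used in the center update. A small sketch with made-up one-dimensional numbers shows how a single outlier drags the mean, while the median (the statistic K-medians uses for its centers) barely moves:

```python
import numpy as np

# One-dimensional toy cluster plus a single outlier (made-up numbers)
cluster = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
with_outlier = np.append(cluster, 100.0)

# Without the outlier, mean and median agree
print(cluster.mean(), np.median(cluster))
# With the outlier, the mean jumps far away while the median shifts only slightly
print(with_outlier.mean(), np.median(with_outlier))
```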
![Page 29](https://reader034.vdocuments.net/reader034/viewer/2022051803/5b057e6f7f8b9a93418b6b1b/html5/thumbnails/29.jpg)
K-means: Limitations
• Works well only for round-shaped clusters of roughly equal size and density
• Does badly if the clusters have non-convex shapes
– Spectral clustering (we will study later) and Kernelized K-means can be alternatives
![Page 30](https://reader034.vdocuments.net/reader034/viewer/2022051803/5b057e6f7f8b9a93418b6b1b/html5/thumbnails/30.jpg)
K-means: Limitations
• Non-convex/non-round-shaped clusters: standard K-means fails!
• (optional) Connectivity → networks → spectral partitioning
![Page 31](https://reader034.vdocuments.net/reader034/viewer/2022051803/5b057e6f7f8b9a93418b6b1b/html5/thumbnails/31.jpg)
K-means: Limitations
• Clusters with different densities