clustering: k-means - github pagesi-systems.github.io/.../12_clustering_k-means.pdf · k-means:...

30
Clustering: K-means Industrial AI Lab. Prof. Seungchul Lee

Upload: others

Post on 28-Jan-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Clustering: K-means - GitHub Pagesi-systems.github.io/.../12_Clustering_K-means.pdf · K-means: Limitations •Make hard assignments of points to clusters –A point either completely

Clustering: K-means

Industrial AI Lab.

Prof. Seungchul Lee

Page 2: Clustering: K-means - GitHub Pagesi-systems.github.io/.../12_Clustering_K-means.pdf · K-means: Limitations •Make hard assignments of points to clusters –A point either completely

Supervised vs. Unsupervised Learning

2

Supervised Learning Unsupervised Learning

Building a model from labeled data Clustering from unlabeled data

Page 3: Clustering: K-means - GitHub Pagesi-systems.github.io/.../12_Clustering_K-means.pdf · K-means: Limitations •Make hard assignments of points to clusters –A point either completely

Data Clustering

• Data clustering is an unsupervised learning problem

• Given:

– 𝑚 unlabeled examples {𝑥 1 , 𝑥 2 , ⋯ , 𝑥(𝑚)}

– the number of partitions 𝑘

• Goal: group the examples into 𝑘 partitions

3

Page 4: Clustering: K-means - GitHub Pagesi-systems.github.io/.../12_Clustering_K-means.pdf · K-means: Limitations •Make hard assignments of points to clusters –A point either completely

Data Clustering: Similarity

• The only information clustering uses is the mutual similarity between samples

• A good clustering is one that achieves:

– high within-cluster similarity

– low inter-cluster similarity

4

Page 5: Clustering: K-means - GitHub Pagesi-systems.github.io/.../12_Clustering_K-means.pdf · K-means: Limitations •Make hard assignments of points to clusters –A point either completely

K-means: (Iterative) Algorithm

1) Initialization

• Input

– 𝑘: the number of clusters

– Training set 𝑥 1 , 𝑥 2 , ⋯ , 𝑥 𝑚

• Randomly initialize cluster centers anywhere in ℝ𝑛

5

Page 6: Clustering: K-means - GitHub Pagesi-systems.github.io/.../12_Clustering_K-means.pdf · K-means: Limitations •Make hard assignments of points to clusters –A point either completely

K-means: (Iterative) Algorithm

2) Iteration

• Repeat until convergence – A possible convergence criteria: cluster centers do not change anymore

6

Page 7: Clustering: K-means - GitHub Pagesi-systems.github.io/.../12_Clustering_K-means.pdf · K-means: Limitations •Make hard assignments of points to clusters –A point either completely

K-means: (Iterative) Algorithm

3) Output• 𝑐 (label) : index (1 to 𝑘) of cluster centroid (centers)

• 𝜇: averages (mean) of points assigned to cluster 𝜇1, 𝜇2, ⋯ , 𝜇𝑘

7

Page 8: Clustering: K-means - GitHub Pagesi-systems.github.io/.../12_Clustering_K-means.pdf · K-means: Limitations •Make hard assignments of points to clusters –A point either completely

Initialization (𝒌 = 𝟐)

8

Page 9: Clustering: K-means - GitHub Pagesi-systems.github.io/.../12_Clustering_K-means.pdf · K-means: Limitations •Make hard assignments of points to clusters –A point either completely

Assigning Points

9

Page 10: Clustering: K-means - GitHub Pagesi-systems.github.io/.../12_Clustering_K-means.pdf · K-means: Limitations •Make hard assignments of points to clusters –A point either completely

Recomputing the Cluster Centers

10

Page 11: Clustering: K-means - GitHub Pagesi-systems.github.io/.../12_Clustering_K-means.pdf · K-means: Limitations •Make hard assignments of points to clusters –A point either completely

Assigning Points

11

Page 12: Clustering: K-means - GitHub Pagesi-systems.github.io/.../12_Clustering_K-means.pdf · K-means: Limitations •Make hard assignments of points to clusters –A point either completely

Recomputing the Cluster Centers

12

Page 13: Clustering: K-means - GitHub Pagesi-systems.github.io/.../12_Clustering_K-means.pdf · K-means: Limitations •Make hard assignments of points to clusters –A point either completely

Assigning Points

13

Page 14: Clustering: K-means - GitHub Pagesi-systems.github.io/.../12_Clustering_K-means.pdf · K-means: Limitations •Make hard assignments of points to clusters –A point either completely

Recomputing the Cluster Centers

14

Page 15: Clustering: K-means - GitHub Pagesi-systems.github.io/.../12_Clustering_K-means.pdf · K-means: Limitations •Make hard assignments of points to clusters –A point either completely

Assigning Points

15

Page 16: Clustering: K-means - GitHub Pagesi-systems.github.io/.../12_Clustering_K-means.pdf · K-means: Limitations •Make hard assignments of points to clusters –A point either completely

Recomputing the Cluster Centers

16

Page 17: Clustering: K-means - GitHub Pagesi-systems.github.io/.../12_Clustering_K-means.pdf · K-means: Limitations •Make hard assignments of points to clusters –A point either completely

Summary: K-means Clustering

• (Iterative) Algorithm

17

Page 18: Clustering: K-means - GitHub Pagesi-systems.github.io/.../12_Clustering_K-means.pdf · K-means: Limitations •Make hard assignments of points to clusters –A point either completely

K-means: Optimization Point of View (Optional)

• 𝑐𝑖 = index of cluster (1, 2,⋯ , 𝑘) to which example 𝑥 𝑖 is currently assigned

• 𝜇𝑘 = cluster centroid

• 𝜇𝑐𝑖 = cluster centroid of cluster to which example 𝑥 𝑖 has been assigned

• Optimization objective:

18

Page 19: Clustering: K-means - GitHub Pagesi-systems.github.io/.../12_Clustering_K-means.pdf · K-means: Limitations •Make hard assignments of points to clusters –A point either completely

Expectation Maximization (EM) Algorithm

• It is a "chicken and egg" problem (dilemma)– Q: if we knew 𝑐𝑖s, how would we determine which points to associate with each cluster center?

– A: for each point 𝑥 𝑖 , choose closest 𝑐𝑖

– Q: if we knew the cluster memberships, how do we get the centers?

– A: choose 𝑐𝑖 to be the mean of all points in the cluster

• Extension of K-means algorithm– A special case of Expectation Maximization (EM) algorithm

– A special case of Gaussian Mixture Model (GMM)

– Won’t be discussed in this course

19

Page 20: Clustering: K-means - GitHub Pagesi-systems.github.io/.../12_Clustering_K-means.pdf · K-means: Limitations •Make hard assignments of points to clusters –A point either completely

Python: Data Generation

20

Page 21: Clustering: K-means - GitHub Pagesi-systems.github.io/.../12_Clustering_K-means.pdf · K-means: Limitations •Make hard assignments of points to clusters –A point either completely

Python: Data Generation and Random Initialization

21

Page 22: Clustering: K-means - GitHub Pagesi-systems.github.io/.../12_Clustering_K-means.pdf · K-means: Limitations •Make hard assignments of points to clusters –A point either completely

Python: K-Means

22

Page 23: Clustering: K-means - GitHub Pagesi-systems.github.io/.../12_Clustering_K-means.pdf · K-means: Limitations •Make hard assignments of points to clusters –A point either completely

Python: K-Means in Scikit-learn

23

Page 24: Clustering: K-means - GitHub Pagesi-systems.github.io/.../12_Clustering_K-means.pdf · K-means: Limitations •Make hard assignments of points to clusters –A point either completely

Initialization Issues

• k-means is extremely sensitive to cluster center initialization

• Bad initialization can lead to– Poor convergence speed

– Bad overall clustering

• Safeguarding measures:– Choose first center as one of the examples, second which is the farthest from the first, third which is

the farthest from both, and so on.

– Try multiple initialization and choose the best result

24

Page 25: Clustering: K-means - GitHub Pagesi-systems.github.io/.../12_Clustering_K-means.pdf · K-means: Limitations •Make hard assignments of points to clusters –A point either completely

Choosing the Number of Clusters

• Idea: when adding another cluster does not give much better modeling of the data

• One way to select 𝑘 for the K-means algorithm is to try different values of 𝑘, plot the K-means objective versus 𝑘, and look at the 'elbow-point' in the plot

25

Page 26: Clustering: K-means - GitHub Pagesi-systems.github.io/.../12_Clustering_K-means.pdf · K-means: Limitations •Make hard assignments of points to clusters –A point either completely

Choosing the Number of Clusters

26

Page 27: Clustering: K-means - GitHub Pagesi-systems.github.io/.../12_Clustering_K-means.pdf · K-means: Limitations •Make hard assignments of points to clusters –A point either completely

K-means: Limitations

• Make hard assignments of points to clusters

– A point either completely belongs to a cluster or not belongs at all

– No notion of a soft assignment (i.e., probability of being assigned to each cluster)

– Gaussian mixture model (we will study later) and Fuzzy K-means allow soft assignments

• Sensitive to outlier examples

– K-medians algorithm is a more robust alternative for data with outliers

27

Page 28: Clustering: K-means - GitHub Pagesi-systems.github.io/.../12_Clustering_K-means.pdf · K-means: Limitations •Make hard assignments of points to clusters –A point either completely

K-means: Limitations

• Works well only for round shaped, and of roughly equal sizes/density cluster

• Does badly if the cluster have non-convex shapes

– Spectral clustering (we will study later) and Kernelized K-means can be an alternative

28

Page 29: Clustering: K-means - GitHub Pagesi-systems.github.io/.../12_Clustering_K-means.pdf · K-means: Limitations •Make hard assignments of points to clusters –A point either completely

K-means: Limitations

• Non-convex/non-round-shaped cluster: standard K-means fails !

• (optional) Connectivity → networks → spectral partitioning

29

Page 30: Clustering: K-means - GitHub Pagesi-systems.github.io/.../12_Clustering_K-means.pdf · K-means: Limitations •Make hard assignments of points to clusters –A point either completely

K-means: Limitations

• Clusters with different densities

30