Clustering
CSL465/603 - Fall 2016
Narayanan C Krishnan
[email protected]



Supervised vs Unsupervised Learning
• Supervised learning – given $\{(x_i, y_i)\}_{i=1}^{N}$, learn a function $f: X \rightarrow Y$
  • Categorical output – classification
  • Continuous output – regression
• Unsupervised learning – given $\{x_i\}_{i=1}^{N}$, can we infer the structure of the data?
  • Learning without a teacher

Why Unsupervised Learning?
• Unlabeled data is cheap
• Labeled data is expensive – cumbersome to collect
• Exploratory data analysis
• Preprocessing step for supervised learning algorithms
• Analysis of data in high-dimensional spaces

Cluster Analysis
• Discover groups such that samples within a group are more similar to each other than samples across groups

Applications of Clustering (1)
• Unsupervised image segmentation

Applications of Clustering (2)
• Image compression

Applications of Clustering (3)
• Social network clustering

Applications of Clustering (4)
• Recommendation systems

Components of Clustering
• A dissimilarity (similarity) function
  • Measures the distance/dissimilarity between examples
• A loss function
  • Evaluates the clusters
• An algorithm that optimizes this loss function

Proximity Matrices
• Data is directly represented in terms of the proximity between pairs of objects
• Subjectively judged dissimilarities are seldom distances in the strict sense (they need not satisfy the properties of a distance measure)
• To symmetrize, replace the proximity matrix $D$ by $(D + D^T)/2$ (a small sketch follows below)
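
A minimal NumPy sketch of this symmetrization step; the function name symmetrize is illustrative, not from the slides.

```python
import numpy as np

def symmetrize(D):
    # Replace a possibly asymmetric proximity matrix D by (D + D^T) / 2
    return (D + D.T) / 2
```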

Dissimilarity Based on Attributes (1)
• Data point $x_i$ has $D$ features
• Attributes are real-valued
• Euclidean distance between the data points
$$D(x_i, x_j) = \sqrt{\sum_{d=1}^{D} (x_{id} - x_{jd})^2}$$
• Resulting clusters are invariant to rotation and translation, but not to scaling
• If features have different scales – standardize the data

Dissimilarity Based on Attributes (2)
• Data point $x_i$ has $D$ features
• Attributes are real-valued
• Any $\mathcal{L}_p$ norm
$$D(x_i, x_j) = \left( \sum_{d=1}^{D} |x_{id} - x_{jd}|^p \right)^{1/p}$$
• Cosine distance between the data points
$$D(x_i, x_j) = \frac{\sum_{d=1}^{D} x_{id} x_{jd}}{\sqrt{\sum_{d=1}^{D} x_{id}^2} \; \sqrt{\sum_{d=1}^{D} x_{jd}^2}}$$
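
The three dissimilarity measures above as NumPy sketches; note that the cosine formula, as written on the slide, is really a similarity (large values mean similar points).

```python
import numpy as np

def euclidean(xi, xj):
    # square root of the sum of squared per-feature differences
    return np.sqrt(np.sum((xi - xj) ** 2))

def lp_distance(xi, xj, p):
    # general L_p norm of the difference vector
    return np.sum(np.abs(xi - xj) ** p) ** (1.0 / p)

def cosine(xi, xj):
    # the "cosine distance" of the slide: inner product over norms
    return (xi @ xj) / (np.linalg.norm(xi) * np.linalg.norm(xj))
```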

Dissimilarity Based on Attributes (3)
• Data point $x_i$ has $D$ features
• Attributes are ordinal
  • Grades – A, B, C, D
  • Answers to a survey question – strongly agree, agree, neutral, disagree
• Replace the ordinal values by quantitative representations
$$\frac{m - 1/2}{M}, \quad m = 1, \dots, M$$

Dissimilarity Based on Attributes (4)
• Data point $x_i$ has $D$ features
• Attributes are categorical
  • Values of an attribute are unordered
• Define an explicit difference between the values
$$\begin{pmatrix} d_{11} & \cdots & d_{1M} \\ \vdots & \ddots & \vdots \\ d_{M1} & \cdots & d_{MM} \end{pmatrix}$$
• Often
  • For identical values – $d_{m,m'} = 0$, if $m = m'$
  • For different values – $d_{m,m'} = 1$, if $m \neq m'$
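
A small sketch of both encodings, assuming the $(m - 1/2)/M$ rule for ordinal attributes and the 0/1 difference for categorical ones; encode_ordinal is an illustrative helper name.

```python
import numpy as np

def encode_ordinal(values, ordered_levels):
    # Map the m-th of M ordered levels (1-indexed) to (m - 1/2) / M
    M = len(ordered_levels)
    rank = {v: m for m, v in enumerate(ordered_levels, start=1)}
    return np.array([(rank[v] - 0.5) / M for v in values])

def categorical_d(a, b):
    # 0 for identical values, 1 for different values
    return 0.0 if a == b else 1.0

print(encode_ordinal(["A", "C", "B"], ["D", "C", "B", "A"]))  # [0.875 0.375 0.625]
```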

Loss Function for Clustering (1)
• Assign each observation to a cluster without regard to a probability model describing the data
  • Let $K$ be the number of clusters and $k$ index the clusters
  • Each observation is assigned to one and only one cluster
  • View the assignment as a function $C(i) = k$
• Loss function
$$W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i')=k} d(x_i, x_{i'})$$
• Characterizes the extent to which observations assigned to the same cluster tend to be close to one another
  • Within-cluster distance/scatter

Loss Function for Clustering (2)
• Consider the function
$$T = \frac{1}{2} \sum_{i=1}^{N} \sum_{i'=1}^{N} d_{ii'}$$
  • Total point scatter
• This can be decomposed as
$$T = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \left( \sum_{C(i')=k} d_{ii'} + \sum_{C(i') \neq k} d_{ii'} \right)$$
$$T = W(C) + B(C)$$

Loss Function for Clustering (3)
• The function $B(C)$
$$B(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i') \neq k} d_{ii'}$$
  • Between-cluster distance/scatter
• Since the total scatter $T$ is fixed by the data, minimizing $W(C)$ is equivalent to maximizing $B(C)$
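
A sketch that computes $W(C)$ and $B(C)$ from a pairwise dissimilarity matrix and checks the decomposition $T = W(C) + B(C)$; scatters is an illustrative name.

```python
import numpy as np

def scatters(D, C, K):
    # D: (N, N) pairwise dissimilarities; C: cluster index per point
    W = B = 0.0
    for k in range(K):
        members = np.where(C == k)[0]
        others = np.where(C != k)[0]
        W += 0.5 * D[np.ix_(members, members)].sum()
        B += 0.5 * D[np.ix_(members, others)].sum()
    return W, B

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
C = rng.integers(0, 3, size=20)
W, B = scatters(D, C, 3)
assert np.isclose(W + B, 0.5 * D.sum())   # T = W(C) + B(C)
```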

Combinatorial Clustering
• Minimize $W$ over all possible assignments of $N$ data points to $K$ clusters
• Unfortunately feasible only for very small data sets
• The number of distinct assignments is
$$S(N, K) = \frac{1}{K!} \sum_{k=1}^{K} (-1)^{K-k} \binom{K}{k} k^N$$
  • $S(10, 4) = 34{,}105$
  • $S(19, 4) \approx 10^{10}$
• Not a practical clustering algorithm
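
This count is the Stirling number of the second kind; a short check of the values quoted above (math.comb requires Python 3.8+).

```python
from math import comb, factorial

def S(N, K):
    # Number of distinct assignments of N points into K non-empty clusters
    return sum((-1) ** (K - k) * comb(K, k) * k ** N for k in range(1, K + 1)) // factorial(K)

assert S(10, 4) == 34105
print(f"S(19, 4) = {S(19, 4):.2e}")   # on the order of 10^10
```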

K-Means Clustering (1)
• Most popular iterative descent clustering method
• Suppose all variables/features are real-valued and we use the squared Euclidean distance as the dissimilarity measure
$$d(x_i, x_{i'}) = \| x_i - x_{i'} \|^2$$
• The within-cluster scatter can then be written as
$$W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i')=k} \| x_i - x_{i'} \|^2 = \sum_{k=1}^{K} N_k \sum_{C(i)=k} \| x_i - \bar{x}_k \|^2$$

K-Means Clustering (2)
• Find
$$C^* = \min_{C} \sum_{k=1}^{K} N_k \sum_{C(i)=k} \| x_i - \bar{x}_k \|^2$$
• Note that for a set $S$
$$\bar{x}_S = \arg\min_{m} \sum_{i \in S} \| x_i - m \|^2$$
• So find
$$C^* = \min_{C, \{m_k\}_{k=1}^{K}} \sum_{k=1}^{K} N_k \sum_{C(i)=k} \| x_i - m_k \|^2$$

K-Means Clustering (3)
• Find the optimal solution using Expectation Maximization
• Iterative procedure consisting of two steps (a sketch follows below)
  • Expectation step (E step) – fix the mean vectors $\{m_k\}_{k=1}^{K}$ and find the optimal $C^*$
  • Maximization step (M step) – fix the cluster assignments $C$ and find the optimal mean vectors $\{m_k\}_{k=1}^{K}$
• Each step of this procedure reduces the loss function value
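
A minimal sketch of this alternating procedure (Lloyd's algorithm), assuming X is an (N, D) NumPy array; initializing from K random data points is one common choice among several.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=K, replace=False)]   # K random points as initial means
    for _ in range(n_iters):
        # E step: assign each point to its nearest mean
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        C = d2.argmin(axis=1)
        # M step: recompute each mean from its assigned points
        new_means = np.array([X[C == k].mean(axis=0) if np.any(C == k) else means[k]
                              for k in range(K)])
        if np.allclose(new_means, means):   # converged
            break
        means = new_means
    return C, means
```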

K-Means Clustering Illustrations (1)–(10)
• A sequence of figures tracing K-means iterations on a two-dimensional dataset (figures not reproduced here)
  • Blue point – Expectation step
  • Red point – Maximization step

How to Choose K?
• Similar to choosing $K$ in kNN
• The loss function generally decreases with $K$, so look for the value beyond which further increases stop helping much
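
A hedged sketch of the resulting "elbow" heuristic, reusing the kmeans() sketch from the K-means section above: run K-means for a range of K and pick the K where the loss stops dropping sharply.

```python
import numpy as np

# Assumes the kmeans(X, K) sketch defined earlier is in scope.
X = np.random.default_rng(1).normal(size=(200, 2))
losses = []
for K in range(1, 9):
    C, means = kmeans(X, K)
    losses.append(sum(((X[C == k] - means[k]) ** 2).sum() for k in range(K)))
print(losses)   # pick K near the "elbow" of this curve
```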

Limitations of K-Means Clustering
• Hard assignments are susceptible to noise/outliers
• Assumes spherical (convex) clusters with a uniform prior on the clusters
• Clusters can change arbitrarily for different $K$ and initializations

K-Medoids
• K-means is suitable only when using Euclidean distance
  • Susceptible to outliers
  • A challenge when the centroid of a cluster is not a valid data point
• Generalizes K-means to arbitrary distance measures
  • Replace the mean calculation by a medoid calculation
  • Ensures the cluster center is a medoid – always a valid data point
  • Increases computation, as we now have to find the medoid
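
A small sketch of the medoid update, assuming a precomputed pairwise dissimilarity matrix: the medoid is the cluster member with the smallest total dissimilarity to the other members.

```python
import numpy as np

def medoid(D, members):
    # D: (N, N) pairwise dissimilarities; members: array of indices of one cluster
    sub = D[np.ix_(members, members)]
    return members[sub.sum(axis=1).argmin()]   # member minimizing total dissimilarity
```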

Soft K-Means as Gaussian Mixture Models (1)
• Probabilistic clusters
  • Each cluster is associated with a Gaussian distribution $\mathcal{N}(\mu_k, \Sigma_k)$
  • Each cluster also has a prior probability $\pi_k$
• Then the likelihood of a data point drawn from the $K$ clusters is
$$P(x) = \sum_{k=1}^{K} \pi_k P(x \mid \mu_k, \Sigma_k)$$
  • where $\sum_{k=1}^{K} \pi_k = 1$

Soft K-Means as Gaussian Mixture Models (2)
• Given $N$ iid data points, the likelihood function is
$$P(x_1, \dots, x_N) = \prod_{i=1}^{N} P(x_i) = \prod_{i=1}^{N} \sum_{k=1}^{K} \pi_k P(x_i \mid \mu_k, \Sigma_k)$$
• Taking the log-likelihood
$$\sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k P(x_i \mid \mu_k, \Sigma_k)$$

Soft K-Means as Gaussian Mixture Models (3)
• Problems with maximum likelihood for
$$\sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k P(x_i \mid \mu_k, \Sigma_k)$$
  • The sum over the components appears inside the log, coupling all the parameters
  • Can lead to singularities

Soft K-Means as Gaussian Mixture Models (4)
• Latent variables
  • Each data point $x_i$ is associated with a latent variable $z_i = (z_{i1}, \dots, z_{iK})$
  • where $z_{ik} \in \{0, 1\}$, $\sum_{k=1}^{K} z_{ik} = 1$, and $P(z_{ik} = 1) = \pi_k$
• Given the complete data $(X, Z)$, we look at maximizing $P(X, Z \mid \pi_k, \mu_k, \Sigma_k)$
• Let the probability $P(z_{ik} = 1 \mid x_i)$ be denoted $\gamma(z_{ik})$; from Bayes' theorem
$$\gamma(z_{ik}) = P(z_{ik} = 1 \mid x_i) = \frac{P(z_{ik} = 1) \, P(x_i \mid z_{ik} = 1)}{P(x_i)}$$
• The marginal distribution
$$P(x_i) = \sum_{z_i} P(x_i, z_i) = \sum_{k=1}^{K} P(z_{ik} = 1) \, P(x_i \mid z_{ik} = 1)$$

Soft K-Means as Gaussian Mixture Models (5)
• Now
$$P(z_{ik} = 1) = \pi_k, \qquad P(x_i \mid z_{ik} = 1) = \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$$
• Therefore
$$\gamma(z_{ik}) = P(z_{ik} = 1 \mid x_i) = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$$
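
These responsibilities in code; a sketch assuming SciPy's multivariate normal density.

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pis, mus, Sigmas):
    # gamma[i, k] = pi_k N(x_i | mu_k, Sigma_k) / sum_j pi_j N(x_i | mu_j, Sigma_j)
    g = np.column_stack([pis[k] * multivariate_normal(mus[k], Sigmas[k]).pdf(X)
                         for k in range(len(pis))])
    return g / g.sum(axis=1, keepdims=True)
```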

Estimating the Mean $\mu_k$ (1)
• Begin with the log-likelihood function
$$\sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k P(x_i \mid \mu_k, \Sigma_k)$$
• Take the derivative with respect to $\mu_k$ and equate it to 0

Estimating the Mean $\mu_k$ (2)
$$\mu_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma(z_{ik}) \, x_i$$
• where $N_k = \sum_{i=1}^{N} \gamma(z_{ik})$
  • Effective number of points assigned to cluster $k$
• So the mean of the $k$-th Gaussian component is a weighted mean of all the points in the dataset
  • where the weight of the $i$-th data point is the posterior probability that component $k$ was responsible for generating $x_i$

Estimating the Covariance $\Sigma_k$
• Begin with the log-likelihood function
$$\sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k P(x_i \mid \mu_k, \Sigma_k)$$
• Taking the derivative with respect to $\Sigma_k$ and equating it to 0 gives
$$\Sigma_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma(z_{ik}) \, (x_i - \mu_k)(x_i - \mu_k)^T$$
• Similar to the result for a single Gaussian fit to the dataset, but each data point is weighted by the corresponding posterior probability

Estimating the Mixing Coefficients $\pi_k$
• Begin with the log-likelihood function
$$\sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k P(x_i \mid \mu_k, \Sigma_k)$$
• Maximize the log-likelihood with respect to $\pi_k$
  • subject to the condition that $\sum_{k=1}^{K} \pi_k = 1$
• Use a Lagrange multiplier $\lambda$ and maximize
$$\sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k P(x_i \mid \mu_k, \Sigma_k) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right)$$
• Solving this results in
$$\pi_k = \frac{N_k}{N}$$

Soft K-Means as Gaussian Mixture Models (6)
• In summary
  • $\pi_k = \frac{N_k}{N}$
  • $\mu_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma(z_{ik}) \, x_i$
  • $\Sigma_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma(z_{ik}) \, (x_i - \mu_k)(x_i - \mu_k)^T$
• But what if $z_{ik}$ is unknown?
  • Use the EM algorithm!

EM for GMM
• First choose initial values for $\pi_k, \mu_k, \Sigma_k$
• Alternate between Expectation and Maximization steps (sketched below)
  • Expectation step (E) – given the parameters of the mixture, compute the posterior probabilities $\gamma(z_{ik})$
  • Maximization step (M) – given the posterior probabilities, update $\pi_k, \mu_k, \Sigma_k$
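
A self-contained EM sketch combining the update equations above; the 1e-6 * I ridge term is an illustrative guard against the singularities mentioned earlier, not part of the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    N, D = X.shape
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(N, size=K, replace=False)]
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    pis = np.full(K, 1.0 / K)
    for _ in range(n_iters):
        # E step: responsibilities gamma[i, k]
        g = np.column_stack([pis[k] * multivariate_normal(mus[k], Sigmas[k]).pdf(X)
                             for k in range(K)])
        g /= g.sum(axis=1, keepdims=True)
        # M step: pi_k = N_k / N; mu_k and Sigma_k are gamma-weighted estimates
        Nk = g.sum(axis=0)
        pis = Nk / N
        mus = (g.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (g[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
    return pis, mus, Sigmas, g
```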

EM for GMM Illustrations (1)–(6)
• A sequence of figures tracing successive E and M updates of the mixture on a two-dimensional dataset (figures not reproduced here)

Practical Issues with EM for GMM
• Takes many more iterations than K-means
  • Each iteration requires more computation
• Run K-means first, and then EM for GMM
  • The covariances can be initialized to the covariances of the clusters obtained from K-means
• EM is not guaranteed to find the global maximum of the log-likelihood function
• Check for convergence
  • The log-likelihood does not change significantly between two iterations

Hierarchical Clustering (1)
• Organize clusters in a hierarchical fashion
• Produces a rooted binary tree (dendrogram)

Hierarchical Clustering (2)
• Bottom-up (agglomerative): recursively merge the two groups with the smallest between-cluster dissimilarity
• Top-down (divisive): recursively split the least coherent cluster
• Users can choose a cut through the hierarchy that represents the most natural division into clusters

Hierarchical Clustering (3)
• Bottom-up (agglomerative): recursively merge the two groups with the smallest between-cluster dissimilarity
• Top-down (divisive): recursively split the least coherent cluster
• Both share a monotonicity property
  • The dissimilarity between merged clusters increases monotonically with the level of the merger
• Cophenetic correlation coefficient
  • The correlation between the $N(N-1)/2$ pairwise observation dissimilarities and the cophenetic dissimilarities derived from the dendrogram
  • Cophenetic dissimilarity – the intergroup dissimilarity at which two observations are first joined together in the same cluster

Agglomerative Clustering (1)
• Single linkage – distance between the two most similar points in $G$ and $H$
$$D_{SL}(G, H) = \min_{i \in G, \, j \in H} D(i, j)$$
• Also referred to as nearest-neighbor linkage
• Results in extended clusters through chaining
  • May violate the compactness property (clusters can have large diameters)

Agglomerative Clustering (2)
• Complete linkage – distance between the two most dissimilar points in $G$ and $H$
$$D_{CL}(G, H) = \max_{i \in G, \, j \in H} D(i, j)$$
• The furthest-neighbor technique
• Forces spherical clusters with consistent diameters
• May violate the closeness property

Agglomerative Clustering (3)
• Average linkage (group average) – average dissimilarity between the groups
$$D_{GA}(G, H) = \frac{1}{N_G N_H} \sum_{i \in G} \sum_{j \in H} d(i, j)$$
• Less affected by outliers
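
A sketch comparing the three linkage criteria with SciPy's hierarchical-clustering utilities, including the cophenetic correlation coefficient from the earlier slide.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet, fcluster
from scipy.spatial.distance import pdist

X = np.random.default_rng(0).normal(size=(50, 4))
Y = pdist(X)                                   # condensed pairwise dissimilarities
for method in ("single", "complete", "average"):
    Z = linkage(Y, method=method)              # dendrogram encoded as a merge table
    c, _ = cophenet(Z, Y)                      # cophenetic correlation coefficient
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
    print(method, round(c, 3), np.bincount(labels))
```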

Agglomerative Clustering (4)
• Figure 14.13 from Hastie et al., The Elements of Statistical Learning (p. 524): dendrograms from agglomerative hierarchical clustering of human tumor microarray data under average, complete, and single linkage
• From the accompanying text: single linkage only requires a single dissimilarity $d_{ii'}$, $i \in G$ and $i' \in H$, to be small for two groups $G$ and $H$ to be considered close, irrespective of the other observation dissimilarities between the groups; it therefore tends to combine, at relatively low thresholds, observations linked by a series of close intermediate observations – the chaining phenomenon, often considered a defect of the method; defining the diameter of a group as
$$D_G = \max_{i, i' \in G} d_{ii'}$$
single linkage can produce clusters with very large diameters, violating the "compactness" property
• Complete linkage represents the opposite extreme: two groups are considered close only if all of the observations in their union are relatively similar, so it tends to produce compact clusters with small diameters, but can violate the "closeness" property – observations assigned to a cluster can be much closer to members of other clusters than to some members of their own cluster

Density-Based Clustering (1) (Extra Topic)
• DBSCAN – Density-Based Spatial Clustering of Applications with Noise
  • Proposed by Ester, Kriegel, Sander and Xu (KDD 1996)
  • KDD 2014 Test of Time Award winner
• Basic idea – clusters are dense regions in the data space, separated by regions of lower object density
• Discovers clusters of arbitrary shape in spatial databases with noise

Density-Based Clustering (2)
• Why density-based clustering?
  • [Figure: results of a k-medoid algorithm for k = 4]

Density-Based Clustering (3)
• Principle
  • For any point in a cluster, the local point density around that point has to exceed some threshold
  • The set of points from one cluster is spatially connected
• DBSCAN defines two parameters
  • $\epsilon$ – radius for the neighborhood of point $p$: $N_\epsilon(p) = \{ q \in X \mid d(p, q) \leq \epsilon \}$
  • $MinPts$ – minimum number of points in the given neighborhood $N_\epsilon(p)$

$\epsilon$-Neighborhood
• $\epsilon$-neighborhood – objects within a radius of $\epsilon$ from an object
$$N_\epsilon(p) = \{ q \in X \mid d(p, q) \leq \epsilon \}$$
• High-density $\epsilon$-neighborhood – contains at least $MinPts$ objects
  • [Figure: points $p$ and $q$ with their $\epsilon$-radius neighborhoods]

Core, Border and Outlier Points
• Example: $\epsilon = 1$, $MinPts = 5$
• Given $\epsilon$ and $MinPts$, categorize objects into three exclusive groups
  • Core point – has more than a specified number of points ($MinPts$) within its $\epsilon$-neighborhood (the interior points of a cluster)
  • Border point – has fewer than $MinPts$ within its $\epsilon$-neighborhood, but is in the neighborhood of a core point
  • Noise/outlier – any point that is neither a core nor a border point
  • [Figure: core, border, and outlier points labeled on a sample dataset]

Density-Reachability (1)
• Directly density-reachable
  • An object $q$ is directly density-reachable from object $p$ if $p$ is a core object and $q$ is in $p$'s $\epsilon$-neighborhood
  • Direct density-reachability is asymmetric
  • [Figure: $q$ in the $\epsilon$-neighborhood of core point $p$, with $MinPts = 4$]

Density-Reachability (2)
• Density-reachable (directly and indirectly):
  • A point $p$ is directly density-reachable from $p_2$
  • $p_2$ is directly density-reachable from $p_1$
  • $p_1$ is directly density-reachable from $q$
  • $p \leftarrow p_2 \leftarrow p_1 \leftarrow q$ form a chain
  • $p$ is indirectly density-reachable from $q$
  • [Figure: the chain $q, p_1, p_2, p$]

Density-Connectivity
• Density-reachability is not symmetric
  • Not good enough to describe clusters
• Density-connected
  • A pair of points $p$ and $q$ are density-connected if they are commonly density-reachable from a point $o$
  • Density-connectivity is symmetric
  • [Figure: $p$ and $q$ both density-reachable from $o$]

Cluster in DBSCAN
• Given a dataset $X$, parameter $\epsilon$ and threshold $MinPts$
• A cluster $C$ is a subset of objects satisfying two criteria
  • Connected: $\forall p, q \in C$: $p$ and $q$ are density-connected
  • Maximal: $\forall p, q \in X$: if $p \in C$ and $q$ is density-reachable from $p$, then $q \in C$

DBSCAN – Algorithm
• Input – dataset $X$, parameters $\epsilon$, $MinPts$
• For each object $p \in X$ (a sketch of this loop follows below)
  • If $p$ is a core object and has not been processed
    • $C$ = retrieve all objects density-reachable from $p$
    • Mark all objects in $C$ as processed
    • Report $C$ as a cluster
  • Else mark $p$ as an outlier
• If $p$ is a border point, no points are density-reachable from $p$, and DBSCAN visits the next point in $X$
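
A minimal sketch of this algorithm, assuming Euclidean distance and labeling noise as -1; border points are absorbed into the first cluster whose core point reaches them.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    N = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
    neighbors = [np.where(D[i] <= eps)[0] for i in range(N)]    # eps-neighborhoods
    core = np.array([len(nb) >= min_pts for nb in neighbors])
    labels = np.full(N, -1)          # -1 = noise / not yet processed
    cluster = 0
    for p in range(N):
        if not core[p] or labels[p] != -1:
            continue
        labels[p] = cluster          # start a new cluster from this core point
        stack = [p]
        while stack:                 # retrieve everything density-reachable from p
            q = stack.pop()
            for r in neighbors[q]:
                if labels[r] == -1:
                    labels[r] = cluster
                    if core[r]:      # only core points extend the reachability chain
                        stack.append(r)
        cluster += 1
    return labels
```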

DBSCAN Algorithm – Illustrations (1)–(3)
• Step-by-step figures tracing the algorithm above with $\epsilon = 2$, $MinPts = 3$ (figures not reproduced here)

DBSCAN – Example (1)
• Where it works
  • [Figure: original points and the discovered clusters]

DBSCAN – Example (2)
• Where it does not work
  • Varying densities
  • [Figure: original points]

Summary
• Unsupervised learning
• K-means clustering
  • Expectation Maximization for discovering the clusters
• K-medoids clustering
• Gaussian mixture models
  • Expectation Maximization for estimating the parameters of the Gaussian mixtures
• Hierarchical clustering
  • Agglomerative clustering
• Density-based clustering
  • DBSCAN