Clustering
CSL465/603 - Fall 2016
Narayanan C Krishnan
[email protected]



Supervised vs Unsupervised Learning
• Supervised learning – given $\{(x_i, y_i)\}_{i=1}^{N}$, learn a function $f: X \rightarrow Y$
  • Categorical output – classification
  • Continuous output – regression
• Unsupervised learning – given $\{x_i\}_{i=1}^{N}$, can we infer the structure of the data?
  • Learning without a teacher

Why Unsupervised Learning?
• Unlabeled data is cheap
• Labeled data is expensive – cumbersome to collect
• Exploratory data analysis
• Preprocessing step for supervised learning algorithms
• Analysis of data in high-dimensional spaces

Cluster Analysis
• Discover groups such that samples within a group are more similar to each other than samples across groups

Applications of Clustering (1)
• Unsupervised image segmentation

Applications of Clustering (2)
• Image compression

Applications of Clustering (3)
• Social network clustering

Applications of Clustering (4)
• Recommendation systems

Components of Clustering
• A dissimilarity (similarity) function
  • Measures the distance/dissimilarity between examples
• A loss function
  • Evaluates the clusters
• An algorithm that optimizes this loss function

Proximity Matrices
• Data is directly represented in terms of the proximity between pairs of objects
• Subjectively judged dissimilarities are seldom distances in the strict sense (they need not satisfy the properties of a distance measure)
• To symmetrize, replace the proximity matrix $D$ by $(D + D^T)/2$ (a small sketch follows below)
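
A minimal NumPy sketch of this symmetrization step; the function name symmetrize is illustrative, not from the slides.

```python
import numpy as np

def symmetrize(D):
    # Replace a possibly asymmetric proximity matrix D by (D + D^T) / 2
    return (D + D.T) / 2
```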

Dissimilarity Based on Attributes (1)
• Data point $x_i$ has $D$ features
• Attributes are real-valued
• Euclidean distance between the data points
$$D(x_i, x_j) = \sqrt{\sum_{d=1}^{D} (x_{id} - x_{jd})^2}$$
• Resulting clusters are invariant to rotation and translation, but not to scaling
• If features have different scales – standardize the data

Dissimilarity Based on Attributes (2)
• Data point $x_i$ has $D$ features
• Attributes are real-valued
• Any $\mathcal{L}_p$ norm
$$D(x_i, x_j) = \left( \sum_{d=1}^{D} |x_{id} - x_{jd}|^p \right)^{1/p}$$
• Cosine distance between the data points
$$D(x_i, x_j) = \frac{\sum_{d=1}^{D} x_{id} x_{jd}}{\sqrt{\sum_{d=1}^{D} x_{id}^2} \; \sqrt{\sum_{d=1}^{D} x_{jd}^2}}$$
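
The three dissimilarity measures above as NumPy sketches; note that the cosine formula, as written on the slide, is really a similarity (large values mean similar points).

```python
import numpy as np

def euclidean(xi, xj):
    # square root of the sum of squared per-feature differences
    return np.sqrt(np.sum((xi - xj) ** 2))

def lp_distance(xi, xj, p):
    # general L_p norm of the difference vector
    return np.sum(np.abs(xi - xj) ** p) ** (1.0 / p)

def cosine(xi, xj):
    # the "cosine distance" of the slide: inner product over norms
    return (xi @ xj) / (np.linalg.norm(xi) * np.linalg.norm(xj))
```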

Dissimilarity Based on Attributes (3)
• Data point $x_i$ has $D$ features
• Attributes are ordinal
  • Grades – A, B, C, D
  • Answers to a survey question – strongly agree, agree, neutral, disagree
• Replace the ordinal values by quantitative representations
$$\frac{m - 1/2}{M}, \quad m = 1, \dots, M$$

Dissimilarity Based on Attributes (4)
• Data point $x_i$ has $D$ features
• Attributes are categorical
  • Values of an attribute are unordered
• Define an explicit difference between the values
$$\begin{pmatrix} d_{11} & \cdots & d_{1M} \\ \vdots & \ddots & \vdots \\ d_{M1} & \cdots & d_{MM} \end{pmatrix}$$
• Often
  • For identical values – $d_{m,m'} = 0$, if $m = m'$
  • For different values – $d_{m,m'} = 1$, if $m \neq m'$
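
A small sketch of both encodings, assuming the $(m - 1/2)/M$ rule for ordinal attributes and the 0/1 difference for categorical ones; encode_ordinal is an illustrative helper name.

```python
import numpy as np

def encode_ordinal(values, ordered_levels):
    # Map the m-th of M ordered levels (1-indexed) to (m - 1/2) / M
    M = len(ordered_levels)
    rank = {v: m for m, v in enumerate(ordered_levels, start=1)}
    return np.array([(rank[v] - 0.5) / M for v in values])

def categorical_d(a, b):
    # 0 for identical values, 1 for different values
    return 0.0 if a == b else 1.0

print(encode_ordinal(["A", "C", "B"], ["D", "C", "B", "A"]))  # [0.875 0.375 0.625]
```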

Loss Function for Clustering (1)
• Assign each observation to a cluster without regard to a probability model describing the data
  • Let $K$ be the number of clusters and $k$ index the clusters
  • Each observation is assigned to one and only one cluster
  • View the assignment as a function $C(i) = k$
• Loss function
$$W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i')=k} d(x_i, x_{i'})$$
• Characterizes the extent to which observations assigned to the same cluster tend to be close to one another
  • Within-cluster distance/scatter

Loss Function for Clustering (2)
• Consider the function
$$T = \frac{1}{2} \sum_{i=1}^{N} \sum_{i'=1}^{N} d_{ii'}$$
  • Total point scatter
• This can be decomposed as
$$T = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \left( \sum_{C(i')=k} d_{ii'} + \sum_{C(i') \neq k} d_{ii'} \right)$$
$$T = W(C) + B(C)$$

Loss Function for Clustering (3)
• The function $B(C)$
$$B(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i') \neq k} d_{ii'}$$
  • Between-cluster distance/scatter
• Since the total scatter $T$ is fixed by the data, minimizing $W(C)$ is equivalent to maximizing $B(C)$
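
A sketch that computes $W(C)$ and $B(C)$ from a pairwise dissimilarity matrix and checks the decomposition $T = W(C) + B(C)$; scatters is an illustrative name.

```python
import numpy as np

def scatters(D, C, K):
    # D: (N, N) pairwise dissimilarities; C: cluster index per point
    W = B = 0.0
    for k in range(K):
        members = np.where(C == k)[0]
        others = np.where(C != k)[0]
        W += 0.5 * D[np.ix_(members, members)].sum()
        B += 0.5 * D[np.ix_(members, others)].sum()
    return W, B

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
C = rng.integers(0, 3, size=20)
W, B = scatters(D, C, 3)
assert np.isclose(W + B, 0.5 * D.sum())   # T = W(C) + B(C)
```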

Combinatorial Clustering
• Minimize $W$ over all possible assignments of $N$ data points to $K$ clusters
• Unfortunately feasible only for very small data sets
• The number of distinct assignments is
$$S(N, K) = \frac{1}{K!} \sum_{k=1}^{K} (-1)^{K-k} \binom{K}{k} k^N$$
  • $S(10, 4) = 34{,}105$
  • $S(19, 4) \approx 10^{10}$
• Not a practical clustering algorithm
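
This count is the Stirling number of the second kind; a short check of the values quoted above (math.comb requires Python 3.8+).

```python
from math import comb, factorial

def S(N, K):
    # Number of distinct assignments of N points into K non-empty clusters
    return sum((-1) ** (K - k) * comb(K, k) * k ** N for k in range(1, K + 1)) // factorial(K)

assert S(10, 4) == 34105
print(f"S(19, 4) = {S(19, 4):.2e}")   # on the order of 10^10
```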

K-Means Clustering (1)
• Most popular iterative descent clustering method
• Suppose all variables/features are real-valued and we use the squared Euclidean distance as the dissimilarity measure
$$d(x_i, x_{i'}) = \| x_i - x_{i'} \|^2$$
• The within-cluster scatter can then be written as
$$W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i')=k} \| x_i - x_{i'} \|^2 = \sum_{k=1}^{K} N_k \sum_{C(i)=k} \| x_i - \bar{x}_k \|^2$$

K-Means Clustering (2)
• Find
$$C^* = \min_{C} \sum_{k=1}^{K} N_k \sum_{C(i)=k} \| x_i - \bar{x}_k \|^2$$
• Note that for a set $S$
$$\bar{x}_S = \arg\min_{m} \sum_{i \in S} \| x_i - m \|^2$$
• So find
$$C^* = \min_{C, \{m_k\}_{k=1}^{K}} \sum_{k=1}^{K} N_k \sum_{C(i)=k} \| x_i - m_k \|^2$$

K-Means Clustering (3)
• Find the optimal solution using Expectation Maximization
• Iterative procedure consisting of two steps (a sketch follows below)
  • Expectation step (E step) – fix the mean vectors $\{m_k\}_{k=1}^{K}$ and find the optimal $C^*$
  • Maximization step (M step) – fix the cluster assignments $C$ and find the optimal mean vectors $\{m_k\}_{k=1}^{K}$
• Each step of this procedure reduces the loss function value
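
A minimal sketch of this alternating procedure (Lloyd's algorithm), assuming X is an (N, D) NumPy array; initializing from K random data points is one common choice among several.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=K, replace=False)]   # K random points as initial means
    for _ in range(n_iters):
        # E step: assign each point to its nearest mean
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        C = d2.argmin(axis=1)
        # M step: recompute each mean from its assigned points
        new_means = np.array([X[C == k].mean(axis=0) if np.any(C == k) else means[k]
                              for k in range(K)])
        if np.allclose(new_means, means):   # converged
            break
        means = new_means
    return C, means
```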

K-Means Clustering Illustrations (1)–(10)
• A sequence of figures tracing K-means iterations on a two-dimensional dataset (figures not reproduced here)
  • Blue point – Expectation step
  • Red point – Maximization step

How to Choose K?
• Similar to choosing $K$ in kNN
• The loss function generally decreases with $K$, so look for the value beyond which further increases stop helping much
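
A hedged sketch of the resulting "elbow" heuristic, reusing the kmeans() sketch from the K-means section above: run K-means for a range of K and pick the K where the loss stops dropping sharply.

```python
import numpy as np

# Assumes the kmeans(X, K) sketch defined earlier is in scope.
X = np.random.default_rng(1).normal(size=(200, 2))
losses = []
for K in range(1, 9):
    C, means = kmeans(X, K)
    losses.append(sum(((X[C == k] - means[k]) ** 2).sum() for k in range(K)))
print(losses)   # pick K near the "elbow" of this curve
```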

Limitations of K-Means Clustering
• Hard assignments are susceptible to noise/outliers
• Assumes spherical (convex) clusters with a uniform prior on the clusters
• Clusters can change arbitrarily for different $K$ and initializations

K-Medoids
• K-means is suitable only when using Euclidean distance
  • Susceptible to outliers
  • A challenge when the centroid of a cluster is not a valid data point
• Generalizes K-means to arbitrary distance measures
  • Replace the mean calculation by a medoid calculation
  • Ensures the cluster center is a medoid – always a valid data point
  • Increases computation, as we now have to find the medoid
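
A small sketch of the medoid update, assuming a precomputed pairwise dissimilarity matrix: the medoid is the cluster member with the smallest total dissimilarity to the other members.

```python
import numpy as np

def medoid(D, members):
    # D: (N, N) pairwise dissimilarities; members: array of indices of one cluster
    sub = D[np.ix_(members, members)]
    return members[sub.sum(axis=1).argmin()]   # member minimizing total dissimilarity
```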

Soft K-Means as Gaussian Mixture Models (1)
• Probabilistic clusters
  • Each cluster is associated with a Gaussian distribution $\mathcal{N}(\mu_k, \Sigma_k)$
  • Each cluster also has a prior probability $\pi_k$
• Then the likelihood of a data point drawn from the $K$ clusters is
$$P(x) = \sum_{k=1}^{K} \pi_k P(x \mid \mu_k, \Sigma_k)$$
  • where $\sum_{k=1}^{K} \pi_k = 1$

Soft K-Means as Gaussian Mixture Models (2)
• Given $N$ iid data points, the likelihood function is
$$P(x_1, \dots, x_N) = \prod_{i=1}^{N} P(x_i) = \prod_{i=1}^{N} \sum_{k=1}^{K} \pi_k P(x_i \mid \mu_k, \Sigma_k)$$
• Taking the log-likelihood
$$\sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k P(x_i \mid \mu_k, \Sigma_k)$$

Soft K-Means as Gaussian Mixture Models (3)
• Problems with maximum likelihood for
$$\sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k P(x_i \mid \mu_k, \Sigma_k)$$
  • The sum over the components appears inside the log, coupling all the parameters
  • Can lead to singularities

Soft K-Means as Gaussian Mixture Models (4)
• Latent variables
  • Each data point $x_i$ is associated with a latent variable $z_i = (z_{i1}, \dots, z_{iK})$
  • where $z_{ik} \in \{0, 1\}$, $\sum_{k=1}^{K} z_{ik} = 1$, and $P(z_{ik} = 1) = \pi_k$
• Given the complete data $(X, Z)$, we look at maximizing $P(X, Z \mid \pi_k, \mu_k, \Sigma_k)$
• Let the probability $P(z_{ik} = 1 \mid x_i)$ be denoted $\gamma(z_{ik})$; from Bayes' theorem
$$\gamma(z_{ik}) = P(z_{ik} = 1 \mid x_i) = \frac{P(z_{ik} = 1) \, P(x_i \mid z_{ik} = 1)}{P(x_i)}$$
• The marginal distribution
$$P(x_i) = \sum_{z_i} P(x_i, z_i) = \sum_{k=1}^{K} P(z_{ik} = 1) \, P(x_i \mid z_{ik} = 1)$$

Soft K-Means as Gaussian Mixture Models (5)
• Now
$$P(z_{ik} = 1) = \pi_k, \qquad P(x_i \mid z_{ik} = 1) = \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$$
• Therefore
$$\gamma(z_{ik}) = P(z_{ik} = 1 \mid x_i) = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$$
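
These responsibilities in code; a sketch assuming SciPy's multivariate normal density.

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pis, mus, Sigmas):
    # gamma[i, k] = pi_k N(x_i | mu_k, Sigma_k) / sum_j pi_j N(x_i | mu_j, Sigma_j)
    g = np.column_stack([pis[k] * multivariate_normal(mus[k], Sigmas[k]).pdf(X)
                         for k in range(len(pis))])
    return g / g.sum(axis=1, keepdims=True)
```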

Estimating the Mean $\mu_k$ (1)
• Begin with the log-likelihood function
$$\sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k P(x_i \mid \mu_k, \Sigma_k)$$
• Take the derivative with respect to $\mu_k$ and equate it to 0

Estimating the Mean $\mu_k$ (2)
$$\mu_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma(z_{ik}) \, x_i$$
• where $N_k = \sum_{i=1}^{N} \gamma(z_{ik})$
  • Effective number of points assigned to cluster $k$
• So the mean of the $k$-th Gaussian component is a weighted mean of all the points in the dataset
  • where the weight of the $i$-th data point is the posterior probability that component $k$ was responsible for generating $x_i$

Estimating the Covariance $\Sigma_k$
• Begin with the log-likelihood function
$$\sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k P(x_i \mid \mu_k, \Sigma_k)$$
• Taking the derivative with respect to $\Sigma_k$ and equating it to 0 gives
$$\Sigma_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma(z_{ik}) \, (x_i - \mu_k)(x_i - \mu_k)^T$$
• Similar to the result for a single Gaussian fit to the dataset, but each data point is weighted by the corresponding posterior probability

Estimating the Mixing Coefficients $\pi_k$
• Begin with the log-likelihood function
$$\sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k P(x_i \mid \mu_k, \Sigma_k)$$
• Maximize the log-likelihood with respect to $\pi_k$
  • subject to the condition that $\sum_{k=1}^{K} \pi_k = 1$
• Use a Lagrange multiplier $\lambda$ and maximize
$$\sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k P(x_i \mid \mu_k, \Sigma_k) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right)$$
• Solving this results in
$$\pi_k = \frac{N_k}{N}$$

Soft K-Means as Gaussian Mixture Models (6)
• In summary
  • $\pi_k = \frac{N_k}{N}$
  • $\mu_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma(z_{ik}) \, x_i$
  • $\Sigma_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma(z_{ik}) \, (x_i - \mu_k)(x_i - \mu_k)^T$
• But what if $z_{ik}$ is unknown?
  • Use the EM algorithm!

EM for GMM
• First choose initial values for $\pi_k, \mu_k, \Sigma_k$
• Alternate between Expectation and Maximization steps (sketched below)
  • Expectation step (E) – given the parameters of the mixture, compute the posterior probabilities $\gamma(z_{ik})$
  • Maximization step (M) – given the posterior probabilities, update $\pi_k, \mu_k, \Sigma_k$
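
A self-contained EM sketch combining the update equations above; the 1e-6 * I ridge term is an illustrative guard against the singularities mentioned earlier, not part of the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    N, D = X.shape
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(N, size=K, replace=False)]
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    pis = np.full(K, 1.0 / K)
    for _ in range(n_iters):
        # E step: responsibilities gamma[i, k]
        g = np.column_stack([pis[k] * multivariate_normal(mus[k], Sigmas[k]).pdf(X)
                             for k in range(K)])
        g /= g.sum(axis=1, keepdims=True)
        # M step: pi_k = N_k / N; mu_k and Sigma_k are gamma-weighted estimates
        Nk = g.sum(axis=0)
        pis = Nk / N
        mus = (g.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (g[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
    return pis, mus, Sigmas, g
```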

EM for GMM Illustrations (1)–(6)
• A sequence of figures tracing successive E and M updates of the mixture on a two-dimensional dataset (figures not reproduced here)

Practical Issues with EM for GMM
• Takes many more iterations than K-means
  • Each iteration requires more computation
• Run K-means first, and then EM for GMM
  • The covariances can be initialized to the covariances of the clusters obtained from K-means
• EM is not guaranteed to find the global maximum of the log-likelihood function
• Check for convergence
  • The log-likelihood does not change significantly between two iterations

Hierarchical Clustering (1)
• Organize clusters in a hierarchical fashion
• Produces a rooted binary tree (dendrogram)

Hierarchical Clustering (2)
• Bottom-up (agglomerative): recursively merge the two groups with the smallest between-cluster dissimilarity
• Top-down (divisive): recursively split the least coherent cluster
• Users can choose a cut through the hierarchy that represents the most natural division into clusters

Hierarchical Clustering (3)
• Bottom-up (agglomerative): recursively merge the two groups with the smallest between-cluster dissimilarity
• Top-down (divisive): recursively split the least coherent cluster
• Both share a monotonicity property
  • The dissimilarity between merged clusters increases monotonically with the level of the merger
• Cophenetic correlation coefficient
  • The correlation between the $N(N-1)/2$ pairwise observation dissimilarities and the cophenetic dissimilarities derived from the dendrogram
  • Cophenetic dissimilarity – the intergroup dissimilarity at which two observations are first joined together in the same cluster

Agglomerative Clustering (1)
• Single linkage – distance between the two most similar points in $G$ and $H$
$$D_{SL}(G, H) = \min_{i \in G, \, j \in H} D(i, j)$$
• Also referred to as nearest-neighbor linkage
• Results in extended clusters through chaining
  • May violate the compactness property (clusters can have large diameters)

Agglomerative Clustering (2)
• Complete linkage – distance between the two most dissimilar points in $G$ and $H$
$$D_{CL}(G, H) = \max_{i \in G, \, j \in H} D(i, j)$$
• The furthest-neighbor technique
• Forces spherical clusters with consistent diameters
• May violate the closeness property

Agglomerative Clustering (3)
• Average linkage (group average) – average dissimilarity between the groups
$$D_{GA}(G, H) = \frac{1}{N_G N_H} \sum_{i \in G} \sum_{j \in H} d(i, j)$$
• Less affected by outliers
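
A sketch comparing the three linkage criteria with SciPy's hierarchical-clustering utilities, including the cophenetic correlation coefficient from the earlier slide.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet, fcluster
from scipy.spatial.distance import pdist

X = np.random.default_rng(0).normal(size=(50, 4))
Y = pdist(X)                                   # condensed pairwise dissimilarities
for method in ("single", "complete", "average"):
    Z = linkage(Y, method=method)              # dendrogram encoded as a merge table
    c, _ = cophenet(Z, Y)                      # cophenetic correlation coefficient
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
    print(method, round(c, 3), np.bincount(labels))
```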

Agglomerative Clustering (4)
• Figure 14.13 from Hastie et al., The Elements of Statistical Learning (p. 524): dendrograms from agglomerative hierarchical clustering of human tumor microarray data under average, complete, and single linkage
• From the accompanying text: single linkage only requires a single dissimilarity $d_{ii'}$, $i \in G$ and $i' \in H$, to be small for two groups $G$ and $H$ to be considered close, irrespective of the other observation dissimilarities between the groups; it therefore tends to combine, at relatively low thresholds, observations linked by a series of close intermediate observations – the chaining phenomenon, often considered a defect of the method; defining the diameter of a group as
$$D_G = \max_{i, i' \in G} d_{ii'}$$
single linkage can produce clusters with very large diameters, violating the "compactness" property
• Complete linkage represents the opposite extreme: two groups are considered close only if all of the observations in their union are relatively similar, so it tends to produce compact clusters with small diameters, but can violate the "closeness" property – observations assigned to a cluster can be much closer to members of other clusters than to some members of their own cluster

Density-Based Clustering (1) (Extra Topic)
• DBSCAN – Density-Based Spatial Clustering of Applications with Noise
  • Proposed by Ester, Kriegel, Sander and Xu (KDD 1996)
  • KDD 2014 Test of Time Award winner
• Basic idea – clusters are dense regions in the data space, separated by regions of lower object density
• Discovers clusters of arbitrary shape in spatial databases with noise

Density-Based Clustering (2)
• Why density-based clustering?
  • [Figure: results of a k-medoid algorithm for k = 4]

Density-Based Clustering (3)
• Principle
  • For any point in a cluster, the local point density around that point has to exceed some threshold
  • The set of points from one cluster is spatially connected
• DBSCAN defines two parameters
  • $\epsilon$ – radius for the neighborhood of point $p$: $N_\epsilon(p) = \{ q \in X \mid d(p, q) \leq \epsilon \}$
  • $MinPts$ – minimum number of points in the given neighborhood $N_\epsilon(p)$

$\epsilon$-Neighborhood
• $\epsilon$-neighborhood – objects within a radius of $\epsilon$ from an object
$$N_\epsilon(p) = \{ q \in X \mid d(p, q) \leq \epsilon \}$$
• High-density $\epsilon$-neighborhood – contains at least $MinPts$ objects
  • [Figure: points $p$ and $q$ with their $\epsilon$-radius neighborhoods]

Core, Border and Outlier Points
• Example: $\epsilon = 1$, $MinPts = 5$
• Given $\epsilon$ and $MinPts$, categorize objects into three exclusive groups
  • Core point – has more than a specified number of points ($MinPts$) within its $\epsilon$-neighborhood (the interior points of a cluster)
  • Border point – has fewer than $MinPts$ within its $\epsilon$-neighborhood, but is in the neighborhood of a core point
  • Noise/outlier – any point that is neither a core nor a border point
  • [Figure: core, border, and outlier points labeled on a sample dataset]

Density-Reachability (1)
• Directly density-reachable
  • An object $q$ is directly density-reachable from object $p$ if $p$ is a core object and $q$ is in $p$'s $\epsilon$-neighborhood
  • Direct density-reachability is asymmetric
  • [Figure: $q$ in the $\epsilon$-neighborhood of core point $p$, with $MinPts = 4$]

Density-Reachability (2)
• Density-reachable (directly and indirectly):
  • A point $p$ is directly density-reachable from $p_2$
  • $p_2$ is directly density-reachable from $p_1$
  • $p_1$ is directly density-reachable from $q$
  • $p \leftarrow p_2 \leftarrow p_1 \leftarrow q$ form a chain
  • $p$ is indirectly density-reachable from $q$
  • [Figure: the chain $q, p_1, p_2, p$]

Density-Connectivity
• Density-reachability is not symmetric
  • Not good enough to describe clusters
• Density-connected
  • A pair of points $p$ and $q$ are density-connected if they are commonly density-reachable from a point $o$
  • Density-connectivity is symmetric
  • [Figure: $p$ and $q$ both density-reachable from $o$]

Cluster in DBSCAN
• Given a dataset $X$, parameter $\epsilon$ and threshold $MinPts$
• A cluster $C$ is a subset of objects satisfying two criteria
  • Connected: $\forall p, q \in C$: $p$ and $q$ are density-connected
  • Maximal: $\forall p, q \in X$: if $p \in C$ and $q$ is density-reachable from $p$, then $q \in C$

DBSCAN – Algorithm
• Input – dataset $X$, parameters $\epsilon$, $MinPts$
• For each object $p \in X$ (a sketch of this loop follows below)
  • If $p$ is a core object and has not been processed
    • $C$ = retrieve all objects density-reachable from $p$
    • Mark all objects in $C$ as processed
    • Report $C$ as a cluster
  • Else mark $p$ as an outlier
• If $p$ is a border point, no points are density-reachable from $p$, and DBSCAN visits the next point in $X$
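
A minimal sketch of this algorithm, assuming Euclidean distance and labeling noise as -1; border points are absorbed into the first cluster whose core point reaches them.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    N = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
    neighbors = [np.where(D[i] <= eps)[0] for i in range(N)]    # eps-neighborhoods
    core = np.array([len(nb) >= min_pts for nb in neighbors])
    labels = np.full(N, -1)          # -1 = noise / not yet processed
    cluster = 0
    for p in range(N):
        if not core[p] or labels[p] != -1:
            continue
        labels[p] = cluster          # start a new cluster from this core point
        stack = [p]
        while stack:                 # retrieve everything density-reachable from p
            q = stack.pop()
            for r in neighbors[q]:
                if labels[r] == -1:
                    labels[r] = cluster
                    if core[r]:      # only core points extend the reachability chain
                        stack.append(r)
        cluster += 1
    return labels
```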

DBSCAN Algorithm – Illustrations (1)–(3)
• Step-by-step figures tracing the algorithm above with $\epsilon = 2$, $MinPts = 3$ (figures not reproduced here)

DBSCAN – Example (1)
• Where it works
  • [Figure: original points and the discovered clusters]

DBSCAN – Example (2)
• Where it does not work
  • Varying densities
  • [Figure: original points]

Summary
• Unsupervised learning
• K-means clustering
  • Expectation Maximization for discovering the clusters
• K-medoids clustering
• Gaussian mixture models
  • Expectation Maximization for estimating the parameters of the Gaussian mixtures
• Hierarchical clustering
  • Agglomerative clustering
• Density-based clustering
  • DBSCAN