
Machine Learning: Clustering

Hamid Beigy

Sharif University of Technology

Fall 1394


Table of contents

1 Introduction

2 Data matrix and dissimilarity matrix

3 Proximity Measures

4 Clustering methods
  Partitioning methods
  Hierarchical methods
  Model-based clustering


Introduction

Clustering is the process of grouping a set of data objects into multiple groups or clusters so that objects within a cluster have high similarity, but are very dissimilar to objects in other clusters.

Dissimilarities and similarities are assessed based on the attribute values describing the objects and often involve distance measures.

Clustering as a data mining tool has its roots in many application areas, such as biology, security, business intelligence, and Web search.


Requirements for cluster analysis

Clustering is a challenging research field and the following are its typical requirements.

Scalability
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Requirements for domain knowledge to determine input parameters
Ability to deal with noisy data
Incremental clustering and insensitivity to input order
Capability of clustering high-dimensionality data
Constraint-based clustering
Interpretability and usability


Comparing clustering methods

The clustering methods can be compared using the following aspects:

The partitioning criteria: In some methods, all the objects are partitioned so that no hierarchy exists among the clusters.

Separation of clusters: In some methods, data are partitioned into mutually exclusive clusters, while in other methods the clusters may not be exclusive; that is, a data object may belong to more than one cluster.

Similarity measure: Some methods determine the similarity between two objects by the distance between them, while in other methods the similarity may be defined by connectivity based on density or contiguity.

Clustering space: Many clustering methods search for clusters within the entire data space. These methods are useful for low-dimensionality data sets. With high-dimensional data, however, there can be many irrelevant attributes, which can make similarity measurements unreliable; consequently, clusters found in the full space are often meaningless. It is often better to search for clusters within different subspaces of the same data set.


Data matrix and dissimilarity matrix

Suppose that we have n objects described by p attributes. The objects are x1 = (x11, x12, . . . , x1p), x2 = (x21, x22, . . . , x2p), and so on, where xij is the value of the jth attribute for object xi. For brevity, we hereafter refer to object xi as object i.

The objects may be tuples in a relational database, and are also referred to as data samples or feature vectors.
Main memory-based clustering and nearest-neighbor algorithms typically operate on either of the following two data structures:

Data matrix: This structure stores the n objects in the form of a table or n × p matrix:

x11 . . . x1f . . . x1p
 .         .         .
xi1 . . . xif . . . xip
 .         .         .
xn1 . . . xnf . . . xnp

Dissimilarity matrix: This structure stores a collection of proximities that are available for all pairs of objects. It is often represented by an n × n matrix or table:

0        d(1, 2)  d(1, 3)  . . .  d(1, n)
d(2, 1)  0        d(2, 3)  . . .  d(2, n)
  .        .        .        .      .
d(n, 1)  d(n, 2)  d(n, 3)  . . .  0
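A minimal sketch (not part of the slides) of how the two structures relate in practice, assuming numeric attributes and Euclidean distance; X is a hypothetical 4 × 2 data matrix and D the corresponding 4 × 4 dissimilarity matrix, built with SciPy's pdist/squareform:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical data matrix: n = 4 objects described by p = 2 numeric attributes.
X = np.array([[1.0, 2.0],
              [1.5, 1.8],
              [8.0, 8.0],
              [9.0, 11.0]])

# Dissimilarity matrix: an n x n table of pairwise Euclidean distances,
# with zeros on the diagonal and d(i, j) = d(j, i).
D = squareform(pdist(X, metric="euclidean"))
print(np.round(D, 2))
```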


Proximity Measures

Proximity measures for nominal attributes: Let the number of states of a nominal attribute be M. The dissimilarity between two objects i and j can be computed based on the ratio of mismatches:

d(i, j) = (p − m) / p

where m is the number of matches and p is the total number of attributes describing the objects.

Proximity measures for binary attributes: Binary attributes are either symmetric or asymmetric.


Alternatively, similarity can be computed as

sim(i, j) = 1 − d(i, j) = m / p

Proximity between objects described by nominal attributes can also be computed using an alternative encoding scheme: a nominal attribute with M states can be encoded using asymmetric binary attributes by creating a new binary attribute for each of the M states. For an object with a given state value, the binary attribute representing that state is set to 1, while the remaining binary attributes are set to 0. For example, to encode a nominal attribute map color with five states, a binary attribute can be created for each of the five colors; for an object having the color yellow, the yellow attribute is set to 1, while the remaining four attributes are set to 0. Proximity for this form of encoding can then be calculated using the binary measures below.

Recall that a binary attribute has only one of two states: 0 and 1, where 0 means that the attribute is absent and 1 means that it is present. Given the attribute smoker describing a patient, for instance, 1 indicates that the patient smokes while 0 indicates that the patient does not. Treating binary attributes as if they were numeric can be misleading, so methods specific to binary data are necessary for computing dissimilarity.

How can we compute the dissimilarity between two objects described by binary attributes? One approach involves computing a dissimilarity matrix from the given binary data. If all binary attributes are thought of as having the same weight, we have the 2 × 2 contingency table below, where q is the number of attributes that equal 1 for both objects i and j, r is the number of attributes that equal 1 for object i but 0 for object j, s is the number of attributes that equal 0 for object i but 1 for object j, and t is the number of attributes that equal 0 for both objects. The total number of attributes is p, where p = q + r + s + t.

For symmetric binary attributes, each state is equally valuable; dissimilarity based on symmetric binary attributes is called symmetric binary dissimilarity.

Contingency table for binary attributes:

                      Object j
                  1        0        sum
Object i   1      q        r        q + r
           0      s        t        s + t
           sum    q + s    r + t    p

For symmetric binary attributes, the dissimilarity is calculated as

d(i, j) = (r + s) / (q + r + s + t)

For asymmetric binary attributes, where the number of negative matches t is considered unimportant and the number of positive matches q is important, the dissimilarity is calculated as

d(i, j) = (r + s) / (q + r + s)

The coefficient 1 − d(i, j) is called the Jaccard coefficient.
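The measures above translate directly into code. The following sketch (function names are my own, assuming NumPy) computes the nominal mismatch ratio and the symmetric/asymmetric binary dissimilarities from the contingency counts q, r, s, t:

```python
import numpy as np

def nominal_dissimilarity(xi, xj):
    """Ratio of mismatches d(i, j) = (p - m) / p for nominal attribute vectors."""
    xi, xj = np.asarray(xi), np.asarray(xj)
    p = len(xi)
    m = np.sum(xi == xj)                # number of matching attributes
    return (p - m) / p

def binary_dissimilarity(xi, xj, symmetric=True):
    """Contingency-table based dissimilarity for 0/1 attribute vectors."""
    xi, xj = np.asarray(xi), np.asarray(xj)
    q = np.sum((xi == 1) & (xj == 1))   # 1 for both objects
    r = np.sum((xi == 1) & (xj == 0))   # 1 for i, 0 for j
    s = np.sum((xi == 0) & (xj == 1))   # 0 for i, 1 for j
    t = np.sum((xi == 0) & (xj == 0))   # 0 for both objects
    if symmetric:
        return (r + s) / (q + r + s + t)
    # asymmetric case: negative matches t are ignored (1 - Jaccard coefficient)
    return (r + s) / (q + r + s)

# toy example with hypothetical attribute vectors
print(nominal_dissimilarity(["red", "round", "small"], ["red", "oval", "small"]))
print(binary_dissimilarity([1, 0, 1, 0, 0], [1, 1, 0, 0, 0], symmetric=False))
```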


Proximity Measures (cont.)

Dissimilarity of numeric attributes: The most popular distance measure is the Euclidean distance

d(i, j) = √((xi1 − xj1)^2 + (xi2 − xj2)^2 + . . . + (xip − xjp)^2)

Another well-known measure is the Manhattan distance

d(i, j) = |xi1 − xj1| + |xi2 − xj2| + . . . + |xip − xjp|

The Minkowski distance is a generalization of the Euclidean and Manhattan distances

d(i, j) = (|xi1 − xj1|^h + |xi2 − xj2|^h + . . . + |xip − xjp|^h)^(1/h)
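As a small illustration (an addition, not from the slides), all three measures can be computed with a single Minkowski routine; h is the order parameter, with h = 1 and h = 2 recovering the Manhattan and Euclidean distances:

```python
import numpy as np

def minkowski(xi, xj, h=2.0):
    """Minkowski distance; h = 1 gives Manhattan, h = 2 gives Euclidean."""
    xi, xj = np.asarray(xi, dtype=float), np.asarray(xj, dtype=float)
    return np.sum(np.abs(xi - xj) ** h) ** (1.0 / h)

x1, x2 = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
print(minkowski(x1, x2, h=1))   # Manhattan: 7.0
print(minkowski(x1, x2, h=2))   # Euclidean: 5.0
```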

Dissimilarity of ordinal attributes: We first replace each xif by its corresponding rank rif ∈ {1, . . . , Mf} and then normalize it using

zif = (rif − 1) / (Mf − 1)

Dissimilarity can then be computed by applying the distance measures for numeric attributes to the zif values.


Proximity Measures (cont.)

Dissimilarity for attributes of mixed types: A preferable approach is to process all attribute types together, performing a single analysis:

d(i, j) = Σ_{f=1}^p δij^(f) dij^(f) / Σ_{f=1}^p δij^(f)

where the indicator δij^(f) = 0 if either
  xif or xjf is missing, or
  xif = xjf = 0 and attribute f is asymmetric binary,
and otherwise δij^(f) = 1.

The distance dij^(f) is computed based on the type of attribute f.
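A sketch of this combined measure, assuming a simplified set of attribute types ("numeric", "nominal", "asym_binary") and range normalization for numeric attributes; missing values are encoded as None:

```python
import numpy as np

def mixed_dissimilarity(xi, xj, types, ranges=None):
    """d(i, j) = sum_f delta_f * d_f / sum_f delta_f over mixed attribute types.

    types[f] is one of "numeric", "nominal", "asym_binary" (a simplified set);
    ranges[f] is the value range of numeric attribute f, used for normalization.
    """
    num, den = 0.0, 0.0
    for f, t in enumerate(types):
        a, b = xi[f], xj[f]
        if a is None or b is None:                    # missing value: delta = 0
            continue
        if t == "asym_binary" and a == 0 and b == 0:  # negative match: delta = 0
            continue
        if t == "numeric":
            d = abs(a - b) / ranges[f]                # distance scaled to [0, 1]
        else:                                         # nominal or binary: mismatch
            d = 0.0 if a == b else 1.0
        num += d
        den += 1.0
    return num / den if den > 0 else 0.0

# hypothetical objects: (weight in kg, color, smoker flag)
types = ["numeric", "nominal", "asym_binary"]
ranges = [100.0, None, None]
print(mixed_dissimilarity((70.0, "red", 1), (80.0, "blue", 0), types, ranges))
```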


Clustering methods

There are many clustering algorithms in the literature. It is difficult to provide a crisp categorization of clustering methods because the categories may overlap, so that a method may have features from several categories. In general, the major fundamental clustering methods can be classified into the following categories.


Grid-based methods: Grid-based methods quantize the object space into a finite number of cells that form a grid structure. All the clustering operations are performed on the grid structure (i.e., on the quantized space). The main advantage of this approach is its fast processing time, which is typically independent of the number of data objects and dependent only on the number of cells in each dimension of the quantized space. Using grids is often an efficient approach to many spatial data mining problems, including clustering; grid-based methods can therefore be integrated with other clustering methods such as density-based methods and hierarchical methods.

These methods are briefly summarized below. Some clustering algorithms integrate the ideas of several clustering methods, so that it is sometimes difficult to classify a given algorithm as uniquely belonging to only one clustering method category. Furthermore, some applications may have clustering criteria that require the integration of several clustering techniques.

Overview of clustering methods (some algorithms may combine various methods):

Partitioning methods
– Find mutually exclusive clusters of spherical shape
– Distance-based
– May use mean or medoid (etc.) to represent cluster center
– Effective for small- to medium-size data sets

Hierarchical methods
– Clustering is a hierarchical decomposition (i.e., multiple levels)
– Cannot correct erroneous merges or splits
– May incorporate other techniques like microclustering or consider object "linkages"

Density-based methods
– Can find arbitrarily shaped clusters
– Clusters are dense regions of objects in space that are separated by low-density regions
– Cluster density: each point must have a minimum number of points within its "neighborhood"
– May filter out outliers

Grid-based methods
– Use a multiresolution grid data structure
– Fast processing time (typically independent of the number of data objects, yet dependent on grid size)


Partitioning methods

The simplest and most fundamental version of cluster analysis is partitioning, which organizes the objects of a set into several exclusive groups or clusters.
Formally, given a data set D of n objects and k, the number of clusters to form, a partitioning algorithm organizes the objects into k partitions (k ≤ n), where each partition represents a cluster.
The clusters are formed to optimize an objective partitioning criterion, such as a dissimilarity function based on distance, so that the objects within a cluster are similar to one another and dissimilar to objects in other clusters in terms of the data set attributes.


[Figure 10.3: clustering of a set of 2-D objects using the k-means method; (a) initial clustering, (b) iterate (update the cluster centers and reassign objects accordingly; the mean of each cluster is marked by a +), (c) final clustering.]

Example (clustering by k-means partitioning). Consider a set of objects located in 2-D space, as depicted in the figure above. Let k = 3; that is, the user would like the objects to be partitioned into three clusters.
We arbitrarily choose three objects as the three initial cluster centers, marked by a +. Each object is assigned to the cluster whose center it is nearest to.
Next, the cluster centers are updated: the mean value of each cluster is recalculated based on the current objects in the cluster. Using the new cluster centers, the objects are redistributed to the clusters based on which cluster center is nearest.
This process iterates until no reassignment of the objects in any cluster occurs, and the resulting clusters are returned. The process of iteratively reassigning objects to clusters to improve the partitioning is referred to as iterative relocation.



k-Means clustering algorithm

Suppose a data set D contains n objects in Euclidean space. Partitioning methods distribute the objects in D into k clusters C1, . . . , Ck, such that Ci ⊂ D and Ci ∩ Cj = ∅ for 1 ≤ i < j ≤ k.

An objective function is used to assess the partitioning quality so that objects within a cluster are similar to one another but dissimilar to objects in other clusters.

That is, the objective function aims for high intracluster similarity and low intercluster similarity.

A centroid-based partitioning technique uses the centroid of a cluster, Ci, to represent that cluster.

The difference between an object p ∈ Ci and µi, the representative of the cluster, is measured by ||p − µi||.

The quality of cluster Ci can be measured by the within-cluster variation, which is the sum of squared errors between all objects in Ci and the centroid µi, defined as

E = Σ_{i=1}^k Σ_{p∈Ci} ||p − µi||^2
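A compact NumPy sketch of the standard Lloyd iteration that minimizes the within-cluster variation E above (the function name, the optional explicit initial centers, and the convergence test are choices made here, not prescribed by the slides):

```python
import numpy as np

def kmeans(X, k, init=None, max_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and mean-update steps until stable."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    if init is not None:
        mu = np.asarray(init, dtype=float)                 # user-supplied initial centers
    else:
        mu = X[rng.choice(len(X), size=k, replace=False)]  # k objects chosen at random
    for _ in range(max_iter):
        # assignment step: each object joins the cluster of its nearest centroid
        labels = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2).argmin(axis=1)
        # update step: each centroid becomes the mean of its current cluster
        new_mu = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    labels = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2).argmin(axis=1)
    E = np.sum((X - mu[labels]) ** 2)                      # within-cluster variation E
    return labels, mu, E
```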


k-Means clustering algorithm (cont.)

Example (k-means in one dimension, Figure 13.1). Consider the one-dimensional data set {2, 3, 4, 10, 11, 12, 20, 25, 30} with k = 2 and initial means µ1 = 2 and µ2 = 4. The iterations proceed as follows:

t = 1: µ1 = 2,    µ2 = 4;     C1 = {2, 3},                 C2 = {4, 10, 11, 12, 20, 25, 30}
t = 2: µ1 = 2.5,  µ2 = 16;    C1 = {2, 3, 4},              C2 = {10, 11, 12, 20, 25, 30}
t = 3: µ1 = 3,    µ2 = 18;    C1 = {2, 3, 4, 10},          C2 = {11, 12, 20, 25, 30}
t = 4: µ1 = 4.75, µ2 = 19.60; C1 = {2, 3, 4, 10, 11, 12},  C2 = {20, 25, 30}
t = 5: µ1 = 7,    µ2 = 25     (converged)
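Running the kmeans sketch given earlier on this one-dimensional data with the same initial means should reproduce the trace above (this snippet assumes that hypothetical kmeans function is in scope):

```python
import numpy as np

X = np.array([[2.], [3.], [4.], [10.], [11.], [12.], [20.], [25.], [30.]])
labels, mu, E = kmeans(X, k=2, init=np.array([[2.], [4.]]))
print(mu.ravel())   # expected to end near [7., 25.], as in the trace above
```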

Example (k-means in two dimensions). The k-means algorithm is illustrated on the Iris dataset, using the first two principal components as the two dimensions. Iris has n = 150 points, and we want to find k = 3 clusters, corresponding to the three types of Irises. A random initialization of the cluster means yields

µ1 = (−0.98, −1.24)^T,  µ2 = (−2.96, 1.16)^T,  µ3 = (−1.69, −0.80)^T

With these initial clusters, k-means takes eight iterations to converge. After one iteration the means are

µ1 = (1.56, −0.08)^T,  µ2 = (−2.86, 0.53)^T,  µ3 = (−1.50, −0.05)^T

and on convergence the final means are

µ1 = (2.64, 0.19)^T,  µ2 = (−2.35, 0.27)^T,  µ3 = (−0.66, −0.33)^T


k-Means clustering algorithm (cont.)

The k-means method is not guaranteed to converge to the global optimum and often terminates at a local optimum.

The results may depend on the initial random selection of cluster centers.

To obtain good results in practice, it is common to run the k-means algorithm multiple times with different initial cluster centers.

The time complexity of the k-means algorithm is O(nkt), where n is the total number of objects, k is the number of clusters, and t is the number of iterations.

Normally, k ≪ n and t ≪ n. Therefore, the method is relatively scalable and efficient in processing large data sets.

There are several variants of the k-means method. These can differ in the selection of the initial k means, the calculation of dissimilarity, and the strategies for calculating cluster means.

The k-modes method is a variant of k-means that extends the k-means paradigm to cluster nominal data by replacing the means of clusters with modes.
The partitioning around medoids (PAM) algorithm is a realization of the k-medoids method.


Hierarchical methods

A hierarchical clustering method works by grouping data objects into a hierarchy or tree of clusters.


[Figure 10.6: agglomerative (AGNES) and divisive (DIANA) hierarchical clustering on data objects {a, b, c, d, e}; AGNES merges clusters from Step 0 to Step 4, while DIANA splits them from Step 4 back to Step 0. Figure 10.7: the corresponding dendrogram, with levels l = 0, . . . , 4 and a similarity scale running from 1.0 down to 0.0.]

AGNES, the agglomerative method, merges clusters step-by-step. It uses a single-linkage approach, in which each cluster is represented by all the objects in the cluster and the similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters. The cluster-merging process repeats until all the objects are eventually merged to form one cluster.

DIANA, the divisive method, proceeds in the contrasting way. All the objects are used to form one initial cluster. The cluster is split according to some principle, such as the maximum Euclidean distance between the closest neighboring objects in the cluster. The cluster-splitting process repeats until, eventually, each new cluster contains only a single object.

A tree structure called a dendrogram is commonly used to represent the process of hierarchical clustering. It shows how objects are grouped together (in an agglomerative method) or partitioned (in a divisive method) step-by-step. In the dendrogram above, l = 0 shows the five objects as singleton clusters at level 0; at l = 1, objects a and b are grouped together to form the first cluster.

Hierarchical clustering methods:
  Agglomerative hierarchical clustering
  Divisive hierarchical clustering
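For illustration (an assumption about tooling, not something used in the slides), SciPy's hierarchical clustering routines perform agglomerative clustering and expose the merge sequence that a dendrogram visualizes:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# five hypothetical 2-D objects standing in for {a, b, c, d, e}
X = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.2, 2.9], [6.0, 0.5]])

# agglomerative clustering with single linkage (minimum distance);
# Z encodes the sequence of merges that a dendrogram would display
Z = linkage(X, method="single")
print(Z)

# cut the tree into two flat clusters
print(fcluster(Z, t=2, criterion="maxclust"))
```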


Distance measures in hierarchical methods

Whether using an agglomerative method or a divisive method, a core need is to measure the distance between two clusters, where each cluster is generally a set of objects.

Four widely used measures for the distance between clusters are as follows, where |p − q| is the distance between two objects or points p and q, µi is the mean of cluster Ci, and ni is the number of objects in Ci. They are also known as linkage measures.

Minimum distance
dmin(Ci, Cj) = min_{p∈Ci, q∈Cj} |p − q|

Maximum distance
dmax(Ci, Cj) = max_{p∈Ci, q∈Cj} |p − q|

Mean distance
dmean(Ci, Cj) = |µi − µj|

Average distance
davg(Ci, Cj) = (1 / (ni nj)) Σ_{p∈Ci, q∈Cj} |p − q|
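A small NumPy sketch of the four linkage measures for two clusters given as arrays of points (the function name is hypothetical):

```python
import numpy as np

def linkage_distances(Ci, Cj):
    """Minimum, maximum, mean, and average inter-cluster distances (a sketch)."""
    Ci, Cj = np.asarray(Ci, dtype=float), np.asarray(Cj, dtype=float)
    # all pairwise distances |p - q| for p in Ci, q in Cj
    pairwise = np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=2)
    d_min = pairwise.min()
    d_max = pairwise.max()
    d_mean = np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0))   # |mu_i - mu_j|
    d_avg = pairwise.mean()                                      # (1 / (n_i n_j)) * sum
    return d_min, d_max, d_mean, d_avg

Ci = [[0.0, 0.0], [1.0, 0.0]]
Cj = [[4.0, 0.0], [6.0, 0.0]]
print(linkage_distances(Ci, Cj))   # (3.0, 6.0, 4.5, 4.5)
```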



Model-based clustering

k-means is closely related to a probabilistic model known as the Gaussian mixture model.

p(x) = Σ_{k=1}^K πk N(x|µk, Σk)

The πk, µk, Σk are parameters. The πk are called mixing proportions and each Gaussian is called a mixture component.
The model is simply a weighted sum of Gaussians, but it is much more powerful than a single Gaussian because it can model multi-modal distributions.
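A minimal sketch of evaluating this density, assuming SciPy and a hypothetical one-dimensional mixture with K = 3 components:

```python
import numpy as np
from scipy.stats import multivariate_normal

# hypothetical 1-D mixture with K = 3 components
pi = np.array([0.5, 0.3, 0.2])                 # mixing proportions, sum to 1
mu = [np.array([-2.0]), np.array([0.0]), np.array([3.0])]
Sigma = [np.array([[0.5]]), np.array([[1.0]]), np.array([[0.8]])]

def gmm_density(x):
    """p(x) = sum_k pi_k N(x | mu_k, Sigma_k)."""
    return sum(pi[k] * multivariate_normal(mu[k], Sigma[k]).pdf(x)
               for k in range(len(pi)))

print(gmm_density(np.array([0.5])))
```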

[Figures: a mixture of three Gaussians; a single Gaussian fit to some data versus a Gaussian mixture fit to the same data.]

Note that for p(x) to be a probability distribution, we require that Σ_k πk = 1 and that πk > 0 for all k. Thus, we may interpret the πk as probabilities themselves.
The set of parameters is θ = {{πk}, {µk}, {Σk}}.


Model-based clustering (cont.)

Let us use a K-dimensional binary random variable z in which a particular element zk equals 1 and all other elements are 0.
The values of zk therefore satisfy zk ∈ {0, 1} and Σ_k zk = 1.

We define the joint distribution p(x, z) in terms of a marginal distribution p(z) and a conditional distribution p(x|z).

The marginal distribution over z is specified in terms of the πk, such that

p(zk = 1) = πk

We can write this distribution in the form

p(z) = Π_{k=1}^K πk^zk

The conditional distribution of x given a particular value for z is a Gaussian

p(x|zk = 1) = N(x|µk, Σk)

which can also be written in the form

p(x|z) = Π_{k=1}^K N(x|µk, Σk)^zk
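This latent-variable view suggests ancestral sampling: draw z from p(z), then x from p(x|z). A sketch with hypothetical parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.3, 0.2])                        # p(z_k = 1)
mu = np.array([[-2.0, 0.0], [0.0, 2.0], [3.0, 1.0]])  # component means
Sigma = np.stack([np.eye(2) * s for s in (0.5, 1.0, 0.8)])

def sample_gmm(n):
    """Ancestral sampling: draw z ~ p(z), then x ~ p(x | z)."""
    z = rng.choice(len(pi), size=n, p=pi)             # component index per point
    x = np.array([rng.multivariate_normal(mu[k], Sigma[k]) for k in z])
    return x, z

X, Z = sample_gmm(500)
print(np.bincount(Z) / len(Z))   # should be roughly pi
```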



Model-based clustering (cont.)

The marginal distribution of x equals

p(x) = Σ_z p(z) p(x|z) = Σ_{k=1}^K πk N(x|µk, Σk)

We can write p(zk = 1|x) as

γ(zk) = p(zk = 1|x) = p(zk = 1) p(x|zk = 1) / p(x)
      = p(zk = 1) p(x|zk = 1) / Σ_{j=1}^K p(zj = 1) p(x|zj = 1)
      = πk N(x|µk, Σk) / Σ_{j=1}^K πj N(x|µj, Σj)

We shall view πk as the prior probability of zk = 1, and the quantity γ(zk) as the corresponding posterior probability once we have observed x.
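A short sketch of computing the responsibilities γ(zk) for a single observation, with hypothetical mixture parameters:

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(x, pi, mu, Sigma):
    """gamma(z_k) = pi_k N(x|mu_k, Sigma_k) / sum_j pi_j N(x|mu_j, Sigma_j)."""
    weighted = np.array([pi[k] * multivariate_normal(mu[k], Sigma[k]).pdf(x)
                         for k in range(len(pi))])
    return weighted / weighted.sum()

# hypothetical 2-component 1-D mixture
pi = np.array([0.6, 0.4])
mu = [np.array([0.0]), np.array([4.0])]
Sigma = [np.array([[1.0]]), np.array([[1.0]])]
print(responsibilities(np.array([1.0]), pi, mu, Sigma))   # entries sum to 1
```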


Gaussian mixture model (example)

[Bishop, Figure 2.23: illustration of a mixture of 3 Gaussians in a two-dimensional space, showing (a) contours of constant density for each mixture component with mixing coefficients 0.5, 0.3, and 0.2, (b) contours of the marginal density p(x), and (c) a surface plot of p(x).]

The mixing coefficients satisfy the requirements to be probabilities. From the sum and product rules, the marginal density is given by

p(x) = Σ_{k=1}^K p(k) p(x|k)

in which we can view πk = p(k) as the prior probability of picking the kth component, and the density N(x|µk, Σk) = p(x|k) as the probability of x conditioned on k. An important role is played by the posterior probabilities p(k|x), also known as responsibilities, which by Bayes' theorem are given by

γk(x) ≡ p(k|x) = p(k) p(x|k) / Σ_l p(l) p(x|l) = πk N(x|µk, Σk) / Σ_l πl N(x|µl, Σl)

The form of the Gaussian mixture distribution is governed by the parameters π, µ, and Σ, where π ≡ {π1, . . . , πK}, µ ≡ {µ1, . . . , µK}, and Σ ≡ {Σ1, . . . , ΣK}. One way to set the values of these parameters is to use maximum likelihood; the log of the likelihood function is given by

ln p(X|π, µ, Σ) = Σ_{n=1}^N ln [ Σ_{k=1}^K πk N(xn|µk, Σk) ]

[Bishop, Figure 9.5: 500 points drawn from the mixture of 3 Gaussians of Figure 2.23; (a) samples from the joint distribution p(z)p(x|z) colored by the three states of z (the complete data set), (b) the corresponding samples from the marginal p(x), obtained by ignoring z (the incomplete data set), and (c) the same samples colored by the responsibilities γ(znk).]

We denote the data set by a matrix X whose nth row is xn^T; the corresponding latent variables form an N × K matrix Z with rows zn^T. Assuming the data points are drawn independently from the distribution, the log-likelihood takes the form above. Before maximizing this function, it is worth emphasizing that maximum likelihood applied to Gaussian mixture models suffers from a significant problem due to the presence of singularities, for example when a component's mean µj coincides exactly with one of the data points.

[Bishop, Figure 9.6: graphical representation of a Gaussian mixture model for a set of N i.i.d. data points {xn} with corresponding latent variables {zn}.]


Model-based clustering (cont.)

Let X = {x1, . . . , xN} be drawn i.i.d. from a mixture of Gaussians. The log-likelihood of the observations equals

ln p(X|π, µ, Σ) = Σ_{n=1}^N ln [ Σ_{k=1}^K πk N(xn|µk, Σk) ]

Setting the derivative of ln p(X|π, µ, Σ) with respect to µk equal to zero, we obtain

0 = − Σ_{n=1}^N [ πk N(xn|µk, Σk) / Σ_{j=1}^K πj N(xn|µj, Σj) ] Σk (xn − µk)

where the bracketed ratio is the responsibility γ(znk).

Multiplying by Σk^{−1} and then simplifying, we obtain

µk = (1/Nk) Σ_{n=1}^N γ(znk) xn    where    Nk = Σ_{n=1}^N γ(znk)


Model-based clustering (cont.)

Setting the derivative of ln p(X|π, µ, Σ) with respect to Σk equal to zero, we obtain

Σk = (1/Nk) Σ_{n=1}^N γ(znk)(xn − µk)(xn − µk)^T

We maximize ln p(X|π, µ, Σ) with respect to πk subject to the constraint Σ_{k=1}^K πk = 1. This can be achieved with a Lagrange multiplier, maximizing the quantity

ln p(X|π, µ, Σ) + λ (Σ_{k=1}^K πk − 1)

which gives

0 = Σ_{n=1}^N [ N(xn|µk, Σk) / Σ_{j=1}^K πj N(xn|µj, Σj) ] + λ

If we now multiply both sides by πk and sum over k, making use of the constraint Σ_{k=1}^K πk = 1, we find λ = −N. Using this to eliminate λ and rearranging, we obtain

πk = Nk / N


EM for Gaussian mixture models

1 Initialize µk , Σk , and πk , and evaluate the initial value of the log likelihood.

2 E step: Evaluate γ(znk) using the current parameter values

γ(znk) = πk N(xn|µk, Σk) / Σ_{j=1}^K πj N(xn|µj, Σj)

3 M step: Re-estimate the parameters using the current values of γ(znk)

µk = (1/Nk) Σ_{n=1}^N γ(znk) xn

Σk = (1/Nk) Σ_{n=1}^N γ(znk)(xn − µk)(xn − µk)^T

πk = Nk / N

where Nk = Σ_{n=1}^N γ(znk).

4 Evaluate the log likelihood ln p(X|π, µ, Σ) = Σ_{n=1}^N ln [ Σ_{k=1}^K πk N(xn|µk, Σk) ] and check for convergence of either the parameters or the log likelihood. If the convergence criterion is not satisfied, return to step 2.

Please read Section 9.2 of Bishop.
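A self-contained NumPy/SciPy sketch of these four steps (names and initialization choices are my own; a practical implementation would work in log space and guard more carefully against the singularities mentioned earlier):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, max_iter=100, tol=1e-6, seed=0):
    """EM for a Gaussian mixture model, following the four steps above."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # 1. initialization: random means, identity covariances, uniform mixing proportions
    mu = X[rng.choice(n, size=K, replace=False)]
    Sigma = np.array([np.eye(d) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(max_iter):
        # 2. E step: responsibilities gamma(z_nk)
        dens = np.column_stack([pi[k] * multivariate_normal(mu[k], Sigma[k]).pdf(X)
                                for k in range(K)])          # shape (n, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # 3. M step: re-estimate mu_k, Sigma_k, pi_k
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
            Sigma[k] += 1e-6 * np.eye(d)    # small ridge to keep covariances non-singular
        pi = Nk / n
        # 4. log likelihood (under the parameters used in the E step) and convergence check
        ll = np.sum(np.log(dens.sum(axis=1)))
        if np.abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pi, mu, Sigma, gamma
```

Calling `em_gmm(X, K=3)` on data such as the samples drawn in the earlier sampling sketch returns the fitted mixing proportions, means, covariances, and responsibilities.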


Model-based clustering (example)

[Bishop, Figure 9.8: illustration of the EM algorithm on the rescaled Old Faithful data set, showing the initial configuration and the fit after L = 1, 2, 5, and 20 iterations.]

In the expectation step (E step), the current values of the parameters are used to evaluate the posterior probabilities, or responsibilities. These probabilities are then used in the maximization step (M step) to re-estimate the means, covariances, and mixing coefficients. Note that the new means are evaluated first and these new values are then used to find the covariances, in keeping with the corresponding result for a single Gaussian distribution. Each update of the parameters resulting from an E step followed by an M step is guaranteed to increase the log likelihood function. In practice, the algorithm is deemed to have converged when the change in the log likelihood function, or alternatively in the parameters, falls below some threshold. In the Old Faithful example, a mixture of two Gaussians is used, with centres initialized to the same values as for the k-means algorithm and with precision matrices initialized to be proportional to the unit matrix.