
An Introduction to the Expectation-Maximization (EM) Algorithm

Overview

• The problem of missing data

• A mixture of Gaussians

• The EM Algorithm

– The Q-function, the E-step and the M-step

• Some examples

• Summary

Missing Data

• Data collection and feature generation are not perfect. Objects in a pattern recognition application may have missing features

– In the classification problem of sea bass versus salmon, the lightness feature can still be estimated if part of the fish is occluded. This is not so for the width feature

– Features are extracted by multiple sensors, and one or more sensors may be malfunctioning when the object is presented

– A respondent fails to answer all the questions in a questionnaire

– Some participants quit in the middle of a study in which measurements are taken at different time points

• An example of missing data: the "dermatology" data set from UCI

[Figure: a snippet of the dermatology data set with the missing values highlighted]

Missing Data

• Different types of missing data

– Missing completely at random (MCAR)

– Missing at random (MAR): whether a feature is missing is random, after conditioning on another observed feature

Example: people who are depressed might be less inclined to report their income, so reported income will be related to depression. However, if, within depressed patients, the probability of reporting income is unrelated to income level, then the data would be considered MAR

• If the data are neither MAR nor MCAR, the missing-data mechanism needs to be modeled explicitly

– Example: if a user fails to provide a rating for a movie, it is often because he/she dislikes the movie and therefore has not watched it, so the missingness depends on the missing value itself

Strategies for Coping With Missing Data

• Discarding cases

– Ignore patterns with missing features

– Ignore features with missing values

– Easy to implement, but wasteful of data; may bias the parameter estimates if the data are not MCAR

• Maximum likelihood (the EM algorithm)

– Under some assumptions about how the data are generated, estimate the parameters that maximize the likelihood of the observed data (the non-missing features)

• Multiple Imputation

– For each object with missing features, generate (impute) multiple instantiations of the missing features based on the values of the non-missing features (a minimal sketch of discarding versus simple imputation follows)
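As a concrete illustration of the first and third strategies, here is a minimal sketch, not taken from the slides: the array X and the NaN missing-value convention are assumptions, and the single mean imputation shown is only the simplest stand-in for true multiple imputation.

```python
# A minimal sketch (hypothetical data): handling missing features marked as NaN.
import numpy as np

X = np.array([[5.1, 3.5],
              [4.9, np.nan],   # second feature missing
              [6.2, 2.9],
              [np.nan, 3.0]])  # first feature missing

# 1) Discarding cases: keep only rows with no missing values (wasteful of data).
complete_rows = ~np.isnan(X).any(axis=1)
X_discard = X[complete_rows]

# 2) Simple (single) imputation: fill each missing entry with the column mean of
#    the observed values.  Multiple imputation would instead draw several values
#    per missing entry from a model of the missing features.
col_means = np.nanmean(X, axis=0)
X_impute = np.where(np.isnan(X), col_means, X)

print(X_discard)
print(X_impute)
```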

Mixture of Gaussians

• In DHS chapters 2 and 3, the class conditional densities are assumed to be parametric, typically Gaussian

– What if this assumption is violated?

• One solution is to use a non-parametric density estimate (DHS Ch. 4)

• An intermediate approach: a mixture of Gaussians

[Figure: example one-dimensional Gaussian and Gaussian-mixture density plots]

Gaussian Mixture

• Let x denote the feature vector of an object

• A Gaussian mixture consists of k different Gaussian distributions, k being specified by the user

– Each of the Gaussians is known as a "component distribution", or simply a "component"

– The j-th Gaussian has mean μ_j and covariance matrix C_j

– In practice, k can be unknown

• Two different ways to think about a Gaussian mixture

– The pdf is a sum of multiple Gaussian pdfs

– Two-stage data generation process (illustrated in the sketch below):

First, a component is randomly selected. The probability that the j-th component is selected is α_j

The identity of the selected component is referred to as the "component label"

The data x is then generated according to the selected Gaussian, with pdf f(x; μ_j, C_j)
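A minimal sketch of this two-stage generation process, under assumed parameter values: the mixing probabilities alphas, means mus and covariances covs below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mixture parameters (k = 2 components in 2-D).
alphas = np.array([0.3, 0.7])                       # mixing probabilities, sum to 1
mus = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]  # component means
covs = [np.eye(2), 0.5 * np.eye(2)]                 # component covariance matrices

def sample_mixture(n):
    """Draw n points: pick a component label, then draw from that Gaussian."""
    # Stage 1: select the component label z with probability alpha_j.
    z = rng.choice(len(alphas), size=n, p=alphas)
    # Stage 2: draw x from the selected Gaussian N(mu_z, C_z).
    x = np.array([rng.multivariate_normal(mus[j], covs[j]) for j in z])
    return x, z

X, Z = sample_mixture(5)
print(Z, X, sep="\n")
```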

Gaussian Mixture

• Let f(x; μ_j, C_j) denote the pdf of a Gaussian with mean μ_j and covariance matrix C_j

• Let z be the random variable corresponding to the identity of the component selected (the component label)

– The event "z = j" means that the j-th component is selected at the first stage of the data generation process

– P(z = j) = α_j

• The pdf of a Gaussian mixture is therefore a weighted sum of Gaussian pdfs:

p(x) = Σ_{j=1}^{k} α_j f(x; μ_j, C_j)
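A minimal sketch of evaluating this weighted sum numerically, again with made-up parameters; multivariate_normal from scipy.stats plays the role of f(x; μ_j, C_j).

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical parameters for a k = 2 mixture in 2-D.
alphas = np.array([0.3, 0.7])
mus = [np.zeros(2), np.array([4.0, 4.0])]
covs = [np.eye(2), 0.5 * np.eye(2)]

def mixture_pdf(x):
    """p(x) = sum_j alpha_j * f(x; mu_j, C_j)."""
    return sum(a * multivariate_normal.pdf(x, mean=m, cov=C)
               for a, m, C in zip(alphas, mus, covs))

print(mixture_pdf(np.array([1.0, 1.0])))
```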

Gaussian Mixture and Missing Data

• Suppose we want to fit a Gaussian mixture of k components to the training data set {x_1, …, x_n}

• The parameters are α_j, μ_j and C_j, for j = 1, …, k

• The log-likelihood of the observed data is given by

l(θ) = Σ_{i=1}^{n} log Σ_{j=1}^{k} α_j f(x_i; μ_j, C_j)

• The parameters can be found by maximizing l(θ)

– Unfortunately, the MLE does not have a closed-form solution (a numerical sketch of l(θ) is given at the end of this slide)

• Let z_i be the component label of the data point x_i

• A simple algorithm to find the MLE views all the z_i as missing data. After all, the z_i are not observed (unknown)!

• This falls under the category of "missing completely at random": all the z_i are missing, so the fact that they are missing is not related to the values of x_i
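To make the log-likelihood above concrete, here is a minimal sketch, not from the slides, that evaluates l(θ) numerically; the parameter names alphas, mus and covs are assumptions standing in for α_j, μ_j and C_j, and logsumexp is used for numerical stability.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def log_likelihood(X, alphas, mus, covs):
    """Observed-data log-likelihood: sum_i log sum_j alpha_j f(x_i; mu_j, C_j)."""
    # log_probs[i, j] = log alpha_j + log f(x_i; mu_j, C_j)
    log_probs = np.column_stack([
        np.log(a) + multivariate_normal.logpdf(X, mean=m, cov=C)
        for a, m, C in zip(alphas, mus, covs)])
    # The log of a sum over components does not decouple, hence no closed-form MLE.
    return logsumexp(log_probs, axis=1).sum()
```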

The EM Algorithm

• The EM algorithm is an iterative algorithm that finds the parameters which maximize the log-likelihood when there are missing data

• Complete data: the combination of the missing data and the observed data

– For the case of a Gaussian mixture, the complete data consist of the pairs (x_i, z_i), i = 1, …, n

• The EM algorithm is best used in situations where the log-likelihood of the observed data is difficult to optimize, whereas the log-likelihood of the complete data is easy to optimize

– In the case of a Gaussian mixture, the log-likelihood of the complete data (the x_i and z_i) is maximized easily by computing a weighted mean and a weighted scatter matrix of the data

• Note that the likelihood of the observed data is obtained by marginalizing (summing out) the hidden data: p(X | θ) = Σ_z p(X, z | θ)

The EM Algorithm

• Let X and Z denote the collection of all observed data and all hidden data, respectively

• Let θ^t be the parameter estimate at the t-th iteration

• Define the Q-function (a function of θ):

Q(θ; θ^t) = E[ log p(X, Z | θ) | X, θ^t ] = Σ_z p(z | X, θ^t) log p(X, z | θ)

• Intuitively, the Q-function is the “expected value of the complete data log-likelihood”

– Fill in all the possible values for the missing data Z. This gives rise to the complete data, and we compute its log-likelihood

– Not all the possible values for the missing features are equally “good”. The “goodness” of a particular way of filling in (Z=z) is determined by how likely the r.v. Z takes the value of z

– This probability is determined by the observed data X and the current parameter estimate θ^t

The EM Algorithm

• An improved parameter estimate at iteration (t+1) is obtained by maximizing the Q-function:

θ^{t+1} = argmax_θ Q(θ; θ^t)

• Maximizing Q(θ; θ^t) with respect to θ is often easy because maximizing the complete-data log-likelihood log p(X, Z | θ) is assumed to be easy

• The EM algorithm takes its name because it alternates between the E-step (expectation) and the M-step (maximization), as in the generic sketch below

– E-step: compute Q(θ; θ^t)

– M-step: find the θ that maximizes Q(θ; θ^t)

• Note the difference between θ and θ^t: the former is a variable, while the latter is given (fixed)
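The alternation can be written as a very small generic loop. The sketch below is an assumption-laden skeleton rather than code from the slides: e_step, m_step and log_lik are hypothetical problem-specific callables supplied by the user.

```python
# A minimal generic EM loop.  e_step, m_step and log_lik are placeholders
# (assumed to be closures over the observed data), not anything defined here.

def em(theta0, e_step, m_step, log_lik, n_iter=100, tol=1e-6):
    """Alternate E-step and M-step until the observed-data log-likelihood
    stops increasing."""
    theta = theta0
    prev = log_lik(theta)
    for _ in range(n_iter):
        stats = e_step(theta)   # E-step: posterior over hidden data given theta^t
        theta = m_step(stats)   # M-step: argmax of Q(theta; theta^t)
        cur = log_lik(theta)
        if cur - prev < tol:    # the log-likelihood is guaranteed non-decreasing
            break
        prev = cur
    return theta
```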

The EM Algorithm

• From a computational viewpoint, the E-step computes the posterior probability p(Z | X, θ^t) based on the current parameter estimate

• The M-step updates the parameter estimate to obtain θ^{t+1}, using update equations derived analytically by maximizing Q(θ; θ^t)

• The EM algorithm requires an initial guess θ^0 for the parameter

• Each iteration of the E-step and M-step is guaranteed to increase l(θ), the log-likelihood of the observed data, until a local maximum of l(θ) is reached (see DHS Problem 3.44)

– That explains why the EM algorithm can find the MLE

– Note that l(θ^{t+1}) ≥ l(θ^t) at every iteration

The EM Algorithm Applied to Gaussian Mixture

• The missing data are the component labels (z_1, …, z_n)

• The parameter vector is θ = (α_1, …, α_k, μ_1, …, μ_k, C_1, …, C_k)

• The E-step computes the posterior probability of the missing data via Bayes' rule

• Let r_ij denote p(z_i = j | x_i, θ^t); then (see the sketch below)

r_ij = α_j^t f(x_i; μ_j^t, C_j^t) / Σ_{l=1}^{k} α_l^t f(x_i; μ_l^t, C_l^t)
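A minimal sketch of this E-step for a Gaussian mixture; the parameter names alphas, mus and covs are assumptions, and the responsibilities are returned as an n-by-k array r with r[i, j] = r_ij.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, alphas, mus, covs):
    """Responsibilities r_ij = p(z_i = j | x_i, theta^t) by Bayes' rule."""
    # Unnormalized posterior: alpha_j * f(x_i; mu_j, C_j)
    w = np.column_stack([a * multivariate_normal.pdf(X, mean=m, cov=C)
                         for a, m, C in zip(alphas, mus, covs)])
    return w / w.sum(axis=1, keepdims=True)   # normalize each row to sum to 1
```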

The EM Algorithm Applied to Gaussian Mixture

Complete-data log-likelihood:

log p(X, Z | θ) = Σ_{i=1}^{n} Σ_{j=1}^{k} [z_i = j] ( log α_j + log f(x_i; μ_j, C_j) ),

where [z_i = j] equals 1 if z_i = j and 0 otherwise. Taking its expectation over the missing data gives the Q-function, in which r_ij acts as the "weight" of the missing data:

Q(θ; θ^t) = Σ_{i=1}^{n} Σ_{j=1}^{k} r_ij ( log α_j + log f(x_i; μ_j, C_j) )

The M-step

• To maximize Q(θ; θ^t) with respect to μ_l, set the gradient to zero. This gives the update equation for the new μ_l at the (t+1)-th iteration:

μ_l^{t+1} = Σ_{i=1}^{n} r_il x_i / Σ_{i=1}^{n} r_il

– r_il is computed based only on the parameter estimate at the t-th iteration

• Similarly, we have

C_l^{t+1} = Σ_{i=1}^{n} r_il (x_i − μ_l^{t+1})(x_i − μ_l^{t+1})^T / Σ_{i=1}^{n} r_il

α_l^{t+1} = (1/n) Σ_{i=1}^{n} r_il

• If all the r_il are binary (i.e., the component labels are known), these formulae reduce to the standard formulae for the mean, the covariance matrix and the class prior (a sketch of the updates follows)
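A minimal sketch of these M-step updates, assuming the responsibilities r come from an E-step like the one sketched earlier.

```python
import numpy as np

def m_step(X, r):
    """Update (alphas, mus, covs) from the responsibilities r (shape n x k)."""
    n, _ = X.shape
    k = r.shape[1]
    nk = r.sum(axis=0)                          # effective number of points per component
    alphas = nk / n                             # new mixing probabilities
    mus = [(r[:, j, None] * X).sum(axis=0) / nk[j] for j in range(k)]  # weighted means
    covs = []
    for j in range(k):
        d = X - mus[j]
        covs.append((r[:, j, None] * d).T @ d / nk[j])   # weighted scatter matrix
    return alphas, mus, covs
```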

Complete EM Algorithm for Gaussian Mixture

1. Start with an initial guess of the parameters, θ^0 = (α_1, …, α_k, μ_1, …, μ_k, C_1, …, C_k)

2. Set t := 0

3. While the log-likelihood of the observed data, l(θ^t), is still increasing, do

1. Perform the E-step by computing r_ij = P(z_i = j | x_i, θ^t)

2. Perform the M-step by re-estimating α_l, μ_l and C_l for all l, using the update equations from the previous slide

3. Form θ^{t+1} from the re-estimated parameters

4. Set t := t+1
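Putting the pieces together, here is a self-contained sketch of the whole loop. It is an illustration under assumptions, not the slides' implementation: the random initialization and the small covariance regularization term are choices made here.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def fit_gmm(X, k, n_iter=200, tol=1e-6, seed=0):
    """EM for a Gaussian mixture, following the steps above (sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alphas = np.full(k, 1.0 / k)
    mus = X[rng.choice(n, size=k, replace=False)]          # k random data points as means
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * k)  # start from the data covariance
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities r_ij = p(z_i = j | x_i, theta^t)
        logw = np.column_stack([np.log(alphas[j]) +
                                multivariate_normal.logpdf(X, mean=mus[j], cov=covs[j])
                                for j in range(k)])
        ll = logsumexp(logw, axis=1).sum()                  # observed-data log-likelihood
        r = np.exp(logw - logsumexp(logw, axis=1, keepdims=True))
        # M-step: re-estimate the parameters from the weighted data
        nk = r.sum(axis=0)
        alphas = nk / n
        mus = (r.T @ X) / nk[:, None]
        covs = np.array([((r[:, j, None] * (X - mus[j])).T @ (X - mus[j])) / nk[j]
                         + 1e-6 * np.eye(d) for j in range(k)])
        if ll - prev_ll < tol:                              # stop when l(theta) stops increasing
            break
        prev_ll = ll
    return alphas, mus, covs
```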

An Example

• 8 data points in 2-D. Fit a Gaussian mixture of two components

• For simplicity, α_1 and α_2 are fixed at 0.5

• 17 iterations are needed for convergence; the figure below shows how the responsibilities r_ij evolve
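A hypothetical way to reproduce this kind of experiment with the fit_gmm sketch from the previous slide; the eight points below are made up (the slides do not list the coordinates), and unlike the slides this sketch also re-estimates α_1 and α_2 instead of fixing them at 0.5.

```python
import numpy as np

# Made-up 2-D data: two loose groups of four points each.
X = np.array([[0.0, 0.5], [0.5, 0.0], [1.0, 1.0], [1.5, 2.0],
              [3.5, 3.0], [4.0, 4.5], [4.5, 4.0], [5.0, 5.0]])

alphas, mus, covs = fit_gmm(X, k=2)   # fit_gmm from the earlier sketch
print(np.round(mus, 2))               # estimated component means
```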

[Figure: scatter plots of the 8 data points with the parameter estimates from the initial guess θ^0 through iteration 17. The responsibilities r_ij printed next to each point start near (0.7, 0.3) or (0.3, 0.7) and gradually sharpen until every point is assigned to one component with probability essentially 1.]

Some Notes on the EM Algorithm and Gaussian Mixture

• Since EM is iterative, it may end up in a local maximum (instead of the global maximum) of the log-likelihood of the observed data

– A good initialization is needed to find a good local maximum (one common approach is sketched after this list)

• The EM algorithm may not be the most efficient algorithm for maximizing the log-likelihood; however, it is often fairly simple to implement

• If the missing data are continuous, an integral should be used instead of the summation to form Q(θ; θ^t)

• The number of components k in a Gaussian mixture is either specified by the user, or advanced techniques can be used to estimate it based on the available data

• A mixture of Gaussians can be viewed as a "middle ground" between a single Gaussian and a non-parametric estimate

– Flexibility: non-parametric > mixture of Gaussians > a single Gaussian

– Memory for storing the parameters: non-parametric > mixture of Gaussians > a single Gaussian
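One common way to cope with the initialization issue, sketched here as an assumption rather than something prescribed by the slides, is to run EM from several random starts and keep the solution with the highest observed-data log-likelihood (reusing the fit_gmm and log_likelihood sketches from earlier).

```python
import numpy as np

def fit_gmm_restarts(X, k, n_restarts=10):
    """Run EM from several random initializations; keep the best local maximum."""
    best, best_ll = None, -np.inf
    for seed in range(n_restarts):
        params = fit_gmm(X, k, seed=seed)          # fit_gmm from the earlier sketch
        ll = log_likelihood(X, *params)            # log_likelihood from the earlier sketch
        if ll > best_ll:
            best, best_ll = params, ll
    return best
```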

A Classification Example

• Different types of class-conditional densities are considered for classifying the two-ring data set

– 1. Gaussian with a common covariance matrix; 2. Gaussian with a different covariance matrix per class; 3. A mixture of Gaussians

• The three classifiers give 31 training errors, 25 training errors and no training errors, respectively; the mixture uses 4 Gaussians per class

[Figure: the two-ring data set and the decision regions obtained with each of the three class-conditional density models]

Summary

• The problem of missing data is regularly encountered in real-world applications

• Instead of discarding the cases with missing data, the EM algorithm can be used to maximize the marginalized log-likelihood (the log-likelihood of data observed)

• A mixture of Gaussians provides more flexibility than a single Gaussian for density estimation

• By regarding the component labels as missing data, the EM algorithm can be used for parameter estimation in a mixture of Gaussians

• The EM algorithm is an iterative algorithm that consists of the E-step (computation of the Q function) and the M-step (maximization of the Q function)

References

• http://www.uvm.edu/~dhowell/StatPages/More_Stuff/Missing_Data/Missing.html

• An EM implementation in Matlab: http://www.cse.msu.edu/~lawhiu/software/em.m

• http://www.lshtm.ac.uk/msu/missingdata/index.html

• DHS chapter 3.9

• Some recent derivations of the EM algorithm in new scenarios

• M. Figueiredo, "Bayesian image segmentation using wavelet-based priors", IEEE Computer Society Conference on Computer Vision and Pattern Recognition - CVPR'2005.

• B. Krishnapuram, A. Hartemink, L. Carin, and M. Figueiredo, "A Bayesian approach to joint feature selection and classifier design", TPAMI, vol. 26, no. 9, pp. 1105-1111, 2004.

• M. Figueiredo and A.K.Jain, "Unsupervised learning of finite mixture models", TPAMI, vol. 24, no. 3, pp. 381-396, March 2002. http://www.lx.it.pt/~mtf/mixturecode.zip for the software