

The Expectation-Maximization Algorithm

Mihaela van der Schaar

Department of Engineering Science, University of Oxford


MLE for Latent Variable Models

Latent Variables and Marginal Likelihoods

Many probabilistic models have hidden variables that are not observable in the dataset $D$: these models are known as latent variable models.

Examples: Hidden Markov Models and Mixture Models. How would MLE be carried out for such models?

Each data point is drawn from a joint distribution $P_\theta(X, Z)$. For a realization $((X_1, Z_1), \ldots, (X_n, Z_n))$, we only observe the variables in the dataset $D = (X_1, \ldots, X_n)$.

Complete-data likelihood:

$$P_\theta((X_1, Z_1), \ldots, (X_n, Z_n)) = \prod_{i=1}^{n} P_\theta(X_i, Z_i).$$

Marginal likelihood:

$$P_\theta(X_1, \ldots, X_n) = \prod_{i=1}^{n} \sum_{z} P_\theta(X_i, Z_i = z).$$
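As a small numerical illustration of the difference (a Python sketch assuming NumPy and SciPy; the two-component one-dimensional mixture and its parameter values below are invented purely for this example), the complete-data log-likelihood uses the latent labels directly, whereas the marginal log-likelihood sums over them inside the logarithm:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

rng = np.random.default_rng(0)

# Illustrative two-component 1D Gaussian mixture (unit variances).
weights = np.array([0.3, 0.7])            # P(Z = z)
means = np.array([-2.0, 3.0])             # component means
Z = rng.choice(2, size=100, p=weights)    # latent labels (normally unobserved)
X = rng.normal(loc=means[Z], scale=1.0)   # observed dataset D

# log P_theta(X_i, Z_i = z) for every data point i and every value of z.
log_joint = np.log(weights) + norm.logpdf(X[:, None], loc=means, scale=1.0)

# Complete-data log-likelihood: picks out the realized Z_i for each point.
complete_ll = log_joint[np.arange(len(X)), Z].sum()

# Marginal log-likelihood: sums over z inside the log (via logsumexp).
marginal_ll = logsumexp(log_joint, axis=1).sum()
print(complete_ll, marginal_ll)
```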


The Hardness of Maximizing Marginal Likelihoods (I)

The MLE is obtained by maximizing the marginal likelihood:

$$\hat{\theta}_n^* = \underset{\theta \in \Theta}{\operatorname{argmax}} \sum_{i=1}^{n} \log\left( \sum_{z} P_\theta(X_i, Z_i = z) \right).$$

Solving this optimization problem is often a hard task: the objective is non-convex, has many local maxima, and admits no analytic solution.

Figure: the complete-data log-likelihood (left) and the marginal log-likelihood (right), plotted as functions of θ.


The Hardness of Maximizing Marginal Likelihoods (II)

The MLE for θ is obtained by maximizing the marginal log-likelihood function:

$$\hat{\theta}_n^* = \underset{\theta \in \Theta}{\operatorname{argmax}} \sum_{i=1}^{n} \log\left( \sum_{z} P_\theta(X_i, Z_i = z) \right).$$

Solving this optimization problem is often a hard task!

The methods used in the previous lecture would not work.

Need a simpler approximate procedure!

The Expectation-Maximization (EM) algorithm is an iterative algorithm that computes an approximate solution to the MLE optimization problem.


Exponential Families (I)

The EM algorithm is well-suited for exponential family distributions.

Exponential Family

A single-parameter exponential family is a set of probability distributions that can be expressed in the form

$$P_\theta(X) = h(X) \cdot \exp\big(\eta(\theta) \cdot T(X) - A(\theta)\big),$$

where $h(X)$, $\eta(\theta)$, $T(X)$ and $A(\theta)$ are known functions. An alternative, equivalent form is often given as

$$P_\theta(X) = h(X) \cdot g(\theta) \cdot \exp\big(\eta(\theta) \cdot T(X)\big).$$

The variable θ is called the parameter of the family.
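The slides do not give this example, but the Poisson(λ) distribution is a convenient single-parameter instance, with $h(X) = 1/X!$, $\eta(\theta) = \log\lambda$, $T(X) = X$ and $A(\theta) = \lambda$. A quick numerical check of that factorization (a sketch assuming NumPy and SciPy):

```python
import numpy as np
from scipy.stats import poisson
from scipy.special import factorial

lam = 3.5                  # illustrative Poisson rate
x = np.arange(15)

# Exponential-family form: h(x) * exp(eta * T(x) - A)
# with h(x) = 1/x!, eta = log(lam), T(x) = x, A = lam.
pmf_expfam = (1.0 / factorial(x)) * np.exp(np.log(lam) * x - lam)

assert np.allclose(pmf_expfam, poisson.pmf(x, lam))
```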


Exponential Families (II)

Exponential family distributions:

$$P_\theta(X) = h(X) \cdot \exp\big(\eta(\theta) \cdot T(X) - A(\theta)\big).$$

$T(X)$ is a sufficient statistic of the distribution.
The sufficient statistic is a function of the data that fully summarizes the data $X$ within the density function $P_\theta(X)$. This means that for any datasets $D_1$ and $D_2$, the density function is the same whenever $T(D_1) = T(D_2)$, even if $D_1$ and $D_2$ are quite different.
The sufficient statistic of a set of independent, identically distributed observations is simply the sum of the individual sufficient statistics, i.e. $T(D) = \sum_{i=1}^{n} T(X_i)$.
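A small check of the last point for the normal distribution (a sketch assuming NumPy and SciPy; μ and σ below are arbitrary): the joint log-likelihood of an i.i.d. sample depends on the data only through $T(D) = \sum_i [X_i, X_i^2]$ and the sample size $n$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
mu, sigma = 1.5, 2.0                        # illustrative parameters
X = rng.normal(mu, sigma, size=200)
n = len(X)

# Sufficient statistic of the i.i.d. sample: sum of the per-point statistics.
T = np.array([X.sum(), (X ** 2).sum()])     # T(D) = sum_i [X_i, X_i^2]

# Joint log-likelihood written only through T(D) and n ...
eta = np.array([mu / sigma ** 2, -1.0 / (2 * sigma ** 2)])
A = mu ** 2 / (2 * sigma ** 2) + np.log(sigma)
ll_from_T = eta @ T - n * A - 0.5 * n * np.log(2 * np.pi)

# ... matches the log-likelihood computed point by point.
assert np.isclose(ll_from_T, norm.logpdf(X, mu, sigma).sum())
```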


Exponential Families (III)

Exponential family distributions:

$$P_\theta(X) = h(X) \cdot \exp\big(\eta(\theta) \cdot T(X) - A(\theta)\big).$$

$\eta(\theta)$ is called the natural parameter. The set of values of $\eta(\theta)$ for which the function $P_\theta(X)$ is finite is called the natural parameter space.

$A(\theta)$ is called the log-partition function. The mean, variance and other moments of the sufficient statistic $T(X)$ can be derived by differentiating $A(\theta)$.
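For instance (an illustration not taken from the slides), the Poisson distribution in canonical form has $\eta = \log\lambda$ and $A(\eta) = e^{\eta}$, so $A'(\eta) = A''(\eta) = \lambda$, which recovers the Poisson mean and variance. A finite-difference check, assuming NumPy:

```python
import numpy as np

# Canonical-form Poisson: eta = log(lam), A(eta) = exp(eta).
def A(eta):
    return np.exp(eta)

lam = 4.0                      # illustrative rate
eta, h = np.log(lam), 1e-5

# First derivative of A gives the mean of T(X) = X; the second gives its variance.
mean_T = (A(eta + h) - A(eta - h)) / (2 * h)
var_T = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h ** 2

print(mean_T, var_T)           # both approximately lam = 4.0
```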


Exponential Families (IV)

Exponential Family Example: Normal Distribution

$$P_\theta(X) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(X-\mu)^2}{2\sigma^2}\right) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{X^2 - 2X\mu + \mu^2}{2\sigma^2} - \log\sigma\right)$$

$$= \frac{1}{\sqrt{2\pi}} \exp\left(\left\langle \left[\frac{\mu}{\sigma^2},\, -\frac{1}{2\sigma^2}\right]^T, \left[X,\, X^2\right]^T \right\rangle - \left(\frac{\mu^2}{2\sigma^2} + \log\sigma\right)\right).$$

$$\eta(\theta) = \left[\frac{\mu}{\sigma^2},\, -\frac{1}{2\sigma^2}\right]^T, \qquad h(X) = (2\pi)^{-\frac{1}{2}}, \qquad T(X) = \left[X,\, X^2\right]^T, \qquad A(\theta) = \frac{\mu^2}{2\sigma^2} + \log\sigma.$$
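A quick numerical sanity check of this factorization against the usual normal density (a sketch assuming NumPy and SciPy; the values of μ and σ are arbitrary):

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 0.7, 1.3                                          # illustrative parameters
x = np.linspace(-4.0, 4.0, 9)

eta = np.array([mu / sigma ** 2, -1.0 / (2 * sigma ** 2)])    # natural parameter
T = np.stack([x, x ** 2], axis=1)                             # sufficient statistic [X, X^2]
A = mu ** 2 / (2 * sigma ** 2) + np.log(sigma)                # log-partition term
h = 1.0 / np.sqrt(2 * np.pi)                                  # base measure

pdf_expfam = h * np.exp(T @ eta - A)
assert np.allclose(pdf_expfam, norm.pdf(x, mu, sigma))
```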


Exponential Families (V)

Properties of Exponential Families

Exponential families have sufficient statistics that can summarize arbitrary amounts of independent identically distributed data using a fixed number of values.

Exponential families have conjugate priors (an important property in Bayesian statistics).

The posterior predictive distribution of an exponential-family random variable with a conjugate prior can always be written in closed form.


Exponential Families (VI)

The Canonical Form of Exponential Families

If $\eta(\theta) = \theta$, then the exponential family is said to be in canonical form.

The canonical form is non-unique, since $\eta(\theta)$ can be multiplied by any nonzero constant, provided that $T(X)$ is multiplied by that constant's reciprocal, or a constant $c$ can be added to $\eta(\theta)$ and $h(X)$ multiplied by $\exp(-c \cdot T(X))$ to offset it.


EM: The Algorithm

Expectation-Maximization (I)

Two unknowns: the latent variables $Z = (Z_1, \ldots, Z_n)$ and the parameter θ.

Complications arise because we do not know the latent variables $(Z_1, \ldots, Z_n)$; if we knew them, maximizing $P_\theta((X_1, Z_1), \ldots, (X_n, Z_n))$ would often be a simpler task! Recall that maximizing the complete-data likelihood is often simpler than maximizing the marginal likelihood.

Figure: the complete-data log-likelihood (left) and the marginal log-likelihood (right), plotted as functions of θ.


Expectation-Maximization (II)

The EM Algorithm

1. Start with an initial guess $\hat{\theta}^{(0)}$ for θ. For every iteration $t$, do the following:
2. E-Step: $Q(\theta, \hat{\theta}^{(t)}) = \sum_z \log\big(P_\theta(Z = z, D)\big) \cdot P(Z = z \mid D, \hat{\theta}^{(t)})$.
3. M-Step: $\hat{\theta}^{(t+1)} = \operatorname{argmax}_{\theta \in \Theta} Q(\theta, \hat{\theta}^{(t)})$.
4. Go to step 2 if the stopping criterion is not met.
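As a toy illustration of steps 1-4 (a Python sketch assuming NumPy and SciPy; the one-dimensional two-component mixture, its unit variances and equal mixing weights are made-up simplifications, so only the component means play the role of θ):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# Toy data: two 1D Gaussian components, unit variance, equal weights.
true_means = np.array([-2.0, 2.0])
X = rng.normal(loc=true_means[rng.integers(0, 2, size=500)], scale=1.0)

mu = np.array([-0.5, 0.5])                  # step 1: initial guess theta_hat^(0)
for t in range(50):
    # E-step: posterior P(Z_i = k | X_i, theta_hat^(t)) under equal weights,
    # which is all that is needed to form Q(theta, theta_hat^(t)).
    log_p = norm.logpdf(X[:, None], loc=mu, scale=1.0)
    gamma = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M-step: maximizing Q gives responsibility-weighted means.
    mu = (gamma * X[:, None]).sum(axis=0) / gamma.sum(axis=0)

print(mu)                                   # close to the true means (-2, 2)
```

(A fixed number of iterations stands in for the stopping criterion of step 4; in practice one would monitor the change in the marginal log-likelihood.)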


Expectation-Maximization (III)

Two unknowns: the latent variables $Z = (Z_1, \ldots, Z_n)$ and the parameter θ.

"Expected" likelihood: $\sum_z \log\big(P_\theta(Z = z, D)\big) \cdot P(Z = z \mid D, \theta)$.

Here the logarithm acts directly on the complete-data likelihood, so the corresponding M-step will be tractable. But we still have two terms, $\log\big(P_\theta(Z = z, D)\big)$ and $P(Z \mid D, \theta)$, that depend on the two unknowns $Z$ and θ.

The EM algorithm: in the E-step, fix the posterior $Z \mid D, \theta$ by conditioning on the current guess for θ, i.e. $Z \mid D, \hat{\theta}^{(t)}$; in the M-step, update the guess for θ by solving a tractable optimization problem.

The EM algorithm breaks down the intractable MLE optimization problem into simpler, tractable iterative steps.


EM for Exponential Family (I)

The critical points of the marginal likelihood function satisfy

$$\frac{\partial \log P_\theta(D)}{\partial \theta} = \frac{1}{P_\theta(D)} \sum_z \frac{\partial P_\theta(D, Z = z)}{\partial \theta} = 0.$$

$$\frac{\partial \log P_\theta(D, Z)}{\partial \theta} = \frac{\partial}{\partial \theta} \log \underbrace{\Big( \exp\big(\langle \eta(\theta), T(D, Z) \rangle - A(\theta)\big) \cdot h(D, Z) \Big)}_{\text{canonical form of exponential family}}.$$

For $\eta(\theta) = \theta$, we have that

$$\frac{\partial P_\theta(D, Z)}{\partial \theta} = \left( T(D, Z) - \frac{\partial}{\partial \theta} A(\theta) \right) P_\theta(D, Z).$$


EM for Exponential Family (II)

For exponential families, $\mathbb{E}_\theta[T(D, Z)] = \frac{\partial}{\partial \theta} A(\theta)$, so

$$\frac{\partial P_\theta(D, Z)}{\partial \theta} = \big( T(D, Z) - \mathbb{E}_\theta[T(D, Z)] \big)\, P_\theta(D, Z).$$

Since $\frac{1}{P_\theta(D)} \sum_z \frac{\partial P_\theta(D, Z = z)}{\partial \theta} = 0$, we have that

$$\frac{1}{P_\theta(D)} \sum_z \big( T(D, Z = z) - \mathbb{E}_\theta[T(D, Z)] \big)\, P_\theta(D, Z = z) = 0$$

$$\sum_z T(D, Z = z)\, \frac{P_\theta(D, Z = z)}{P_\theta(D)} - \mathbb{E}_\theta[T(D, Z)] = 0$$

$$\mathbb{E}_\theta[T(D, Z) \mid D] - \mathbb{E}_\theta[T(D, Z)] = 0.$$


EM for Exponential Family (III)

For the critical values of θ, the following condition is satisfied:

$$\mathbb{E}_\theta[T(D, Z) \mid D] = \mathbb{E}_\theta[T(D, Z)].$$

How is this related to the EM objective Q(θ, θ̂(t))?

$$Q(\theta, \hat{\theta}^{(t)}) = \sum_z \log\big(P_\theta(Z = z, D)\big) \cdot P_{\hat{\theta}^{(t)}}(Z = z \mid D) = \theta\, \mathbb{E}_{\hat{\theta}^{(t)}}[T(D, Z) \mid D] - A(\theta) + \text{constant}.$$

Differentiating and using $\frac{\partial}{\partial \theta} A(\theta) = \mathbb{E}_\theta[T(D, Z)]$:

$$\frac{\partial Q(\theta, \hat{\theta}^{(t)})}{\partial \theta} = 0 \;\Rightarrow\; \mathbb{E}_{\hat{\theta}^{(t)}}[T(D, Z) \mid D] = \mathbb{E}_\theta[T(D, Z)].$$

Since it is difficult to solve the above equation analytically, the EM algorithm solves for θ via successive approximations, i.e. it solves the following for $\hat{\theta}^{(t+1)}$:

$$\mathbb{E}_{\hat{\theta}^{(t)}}[T(D, Z) \mid D] = \mathbb{E}_{\hat{\theta}^{(t+1)}}[T(D, Z)].$$


Multivariate Gaussian Mixture Models

Example: Multivariate Gaussian Mixtures

Parameters for a mixture of $K$ Gaussians: mixture proportions $\{\pi_k\}_{k=1}^{K}$, and mean vectors and covariance matrices $\{(\mu_k, \Sigma_k)\}_{k=1}^{K}$.

Figure: contour plot for the density of a mixture of $K = 3$ bivariate Gaussian distributions (axes $X_1$, $X_2$).


The Generative Process

$Z_i \sim \mathrm{Categorical}(\pi_1, \ldots, \pi_K)$, and $X_i \mid Z_i = z \sim \mathcal{N}(\mu_z, \Sigma_z)$.

Figure: a sample from a mixture model; every data point is colored according to its component membership (axes $X_1$, $X_2$).
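A minimal sketch of this generative process (assuming NumPy; the weights, means and covariances below are invented for the example and are not the ones behind the lecture's figures):

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative 2D mixture parameters.
pis = np.array([0.5, 0.3, 0.2])
mus = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])
Sigmas = np.array([np.eye(2), 0.5 * np.eye(2), np.diag([2.0, 0.3])])

n = 1000
Z = rng.choice(len(pis), size=n, p=pis)          # Z_i ~ Categorical(pi_1, ..., pi_K)
X = np.array([rng.multivariate_normal(mus[z], Sigmas[z])   # X_i | Z_i = z ~ N(mu_z, Sigma_z)
              for z in Z])
```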


The Dataset

We need to learn the parameters $(\pi_k, \mu_k, \Sigma_k)_{k=1}^{K}$ from the data points $D = (X_1, \ldots, X_n)$, which are not "colored" by their component memberships, i.e. we do not observe the latent variables $Z = (Z_1, \ldots, Z_n)$.

Figure: (a) $(D, Z)$: the data points and their component memberships; (b) $D$: the dataset with the observed data points only (component memberships are latent).


EM for Gaussian Mixture Models

MLE for the Gaussian Mixture Models

The complete-data likelihood function is given by

$$P_\theta(D, Z) = \prod_{i=1}^{n} \pi_{z_i} \cdot \mathcal{N}(X_i \mid \mu_{z_i}, \Sigma_{z_i}).$$

The marginal likelihood function is

$$P_\theta(D) = \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_k \cdot \mathcal{N}(X_i \mid \mu_k, \Sigma_k).$$

The MLE can be obtained by maximizing the marginal log-likelihood function:

$$\hat{\theta}_n^* = \underset{\theta \in \Theta}{\operatorname{argmax}} \sum_{i=1}^{n} \log\left( \sum_{k=1}^{K} \pi_k \cdot \mathcal{N}(X_i \mid \mu_k, \Sigma_k) \right).$$

Exercise: Is the objective function above concave?


Implementing EM for the Gaussian Mixture Model (I)

The expected complete-data log-likelihood function is

$$\mathbb{E}_Z[\log P_\theta(D, Z)] = \sum_{i=1}^{n} \sum_{k=1}^{K} \gamma(k, X_i \mid \theta)\, \big(\log(\pi_k) + \log \mathcal{N}(X_i \mid \mu_k, \Sigma_k)\big),$$

$$\gamma(k, X_i \mid \theta) = P_\theta(Z_i = k \mid X_i).$$

$\gamma(k, X_i \mid \theta)$ is called the responsibility of component $k$ towards data point $X_i$.

$$\gamma(k, X_i \mid \theta) = \frac{\pi_k \cdot \mathcal{N}(X_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \cdot \mathcal{N}(X_i \mid \mu_j, \Sigma_j)}.$$

Try to work out the derivation above yourself!
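A small sketch of the responsibility computation (assuming NumPy and SciPy; the data points and the parameter values below are arbitrary placeholders standing in for the current estimate of θ):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(4)
X = rng.normal(size=(6, 2))                     # a few illustrative 2D data points

# Current parameter guesses (arbitrary values for the example).
pis = np.array([0.4, 0.6])
mus = np.array([[0.0, 0.0], [1.0, 1.0]])
Sigmas = np.array([np.eye(2), 2.0 * np.eye(2)])

# gamma[i, k] = pi_k N(X_i | mu_k, Sigma_k) / sum_j pi_j N(X_i | mu_j, Sigma_j)
weighted = np.stack([pis[k] * multivariate_normal(mus[k], Sigmas[k]).pdf(X)
                     for k in range(len(pis))], axis=1)
gamma = weighted / weighted.sum(axis=1, keepdims=True)
print(gamma.sum(axis=1))                        # each row sums to 1
```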


Implementing EM for the Gaussian Mixture Model (II)

(E-step) Approximate the expected complete-data log-likelihood by fixing the responsibilities $\gamma(k, X_i \mid \theta)$ using the parameter estimates obtained from the previous iteration.

$$Q(\theta, \hat{\theta}^{(t)}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \gamma(k, X_i \mid \hat{\theta}^{(t)})\, \big(\log(\pi_k) + \log \mathcal{N}(X_i \mid \mu_k, \Sigma_k)\big),$$

$$\gamma(k, X_i \mid \hat{\theta}^{(t)}) = \frac{\hat{\pi}_k^{(t)} \cdot \mathcal{N}(X_i \mid \hat{\mu}_k^{(t)}, \hat{\Sigma}_k^{(t)})}{\sum_{j=1}^{K} \hat{\pi}_j^{(t)} \cdot \mathcal{N}(X_i \mid \hat{\mu}_j^{(t)}, \hat{\Sigma}_j^{(t)})}.$$

(M-step) Solve a tractable optimization problem

$$(\hat{\pi}^{(t+1)}, \hat{\mu}^{(t+1)}, \hat{\Sigma}^{(t+1)}) = \underset{(\pi, \mu, \Sigma)}{\operatorname{argmax}} \sum_{i=1}^{n} \sum_{k=1}^{K} \gamma(k, X_i \mid \hat{\theta}^{(t)})\, \big(\log(\pi_k) + \log \mathcal{N}(X_i \mid \mu_k, \Sigma_k)\big).$$


Implementing EM for the Gaussian Mixture Model (III)

The M-step yields the following parameter updating equations:

$$\hat{\pi}_k^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} \gamma(k, X_i \mid \hat{\theta}^{(t)})$$

$$\hat{\mu}_k^{(t+1)} = \frac{\sum_{i=1}^{n} \gamma(k, X_i \mid \hat{\theta}^{(t)})\, X_i}{\sum_{j=1}^{n} \gamma(k, X_j \mid \hat{\theta}^{(t)})}$$

$$\hat{\Sigma}_k^{(t+1)} = \sum_{i=1}^{n} \frac{\gamma(k, X_i \mid \hat{\theta}^{(t)})}{\sum_{j=1}^{n} \gamma(k, X_j \mid \hat{\theta}^{(t)})}\, (X_i - \hat{\mu}_k^{(t+1)})(X_i - \hat{\mu}_k^{(t+1)})^T$$

Try to work out the updating equations by yourself!
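A vectorized sketch of these updates (assuming NumPy; the data matrix and responsibilities below are random placeholders standing in for the E-step output):

```python
import numpy as np

rng = np.random.default_rng(5)
n, K, d = 500, 3, 2
X = rng.normal(size=(n, d))                   # placeholder data
gamma = rng.dirichlet(np.ones(K), size=n)     # placeholder responsibilities (rows sum to 1)

Nk = gamma.sum(axis=0)                        # effective number of points per component
pi_new = Nk / n                               # pi_k^(t+1)
mu_new = (gamma.T @ X) / Nk[:, None]          # mu_k^(t+1): responsibility-weighted mean
Sigma_new = np.empty((K, d, d))
for k in range(K):
    diff = X - mu_new[k]
    Sigma_new[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]   # Sigma_k^(t+1)
```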


EM in Practice

Consider a Gaussian mixture model with $K = 3$ and the following parameters:

π1 = 0.6, π2 = 0.05, and π3 = 0.35.

µ1 = [−1.4, 1.8]T , µ2 = [−1.4, −2.8]T , µ3 = [−1.9, 0.55]T .

$$\Sigma_1 = \begin{bmatrix} 0.8 & -0.8 \\ -0.8 & 4 \end{bmatrix}, \quad \Sigma_2 = \begin{bmatrix} 1.2 & 2.3 \\ 2.3 & 5.2 \end{bmatrix}, \quad \Sigma_3 = \begin{bmatrix} 0.4 & -0.01 \\ -0.01 & 0.35 \end{bmatrix}.$$

Try writing MATLAB code that generates a random dataset of 5000 data points drawn from the model specified above, and implement the EM algorithm to learn the model parameters from this dataset.
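The exercise suggests MATLAB; as a rough sketch of the same pipeline in Python (assuming NumPy and SciPy, and using a simple random initialization and a plain stopping rule, neither of which is prescribed by the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Model parameters from the slide.
pis = np.array([0.6, 0.05, 0.35])
mus = np.array([[-1.4, 1.8], [-1.4, -2.8], [-1.9, 0.55]])
Sigmas = np.array([[[0.8, -0.8], [-0.8, 4.0]],
                   [[1.2, 2.3], [2.3, 5.2]],
                   [[0.4, -0.01], [-0.01, 0.35]]])
n, K = 5000, 3

# Generate the dataset.
Z = rng.choice(K, size=n, p=pis)
X = np.array([rng.multivariate_normal(mus[z], Sigmas[z]) for z in Z])

# Simple random initialization.
pi_h = np.full(K, 1.0 / K)
mu_h = X[rng.choice(n, size=K, replace=False)]
Sigma_h = np.array([np.cov(X.T) for _ in range(K)])

prev_ll = -np.inf
for t in range(500):
    # E-step: responsibilities gamma[i, k].
    dens = np.stack([pi_h[k] * multivariate_normal(mu_h[k], Sigma_h[k]).pdf(X)
                     for k in range(K)], axis=1)
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M-step: parameter updates.
    Nk = gamma.sum(axis=0)
    pi_h = Nk / n
    mu_h = (gamma.T @ X) / Nk[:, None]
    for k in range(K):
        diff = X - mu_h[k]
        Sigma_h[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    # Stop when the marginal log-likelihood stops improving.
    ll = np.log(dens.sum(axis=1)).sum()
    if ll - prev_ll < 1e-6:
        break
    prev_ll = ll

print(pi_h)
print(mu_h)
```

(For numerical robustness one would usually evaluate the responsibilities in log space, e.g. with a log-sum-exp, and guard against components collapsing onto single points.)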


EM in Practice

The marginal (observed-data) log-likelihood increases after every EM iteration!

This means that every new iteration finds a "better" estimate!

Figure: the log-likelihood as a function of the EM iteration.


EM in Practice

Compare the true density function with the estimated one.

Figure: contour plot for the true density function (left) and contour plot for the estimated density function (right), over $(X_1, X_2)$.


EM Performance Guarantees

What Does EM Guarantee?

The EM algorithm does not guarantee that $\hat{\theta}^{(t)}$ will converge to $\hat{\theta}_n^*$.

EM guarantees the following: $\hat{\theta}^{(t)}$ always converges (to a local optimum), and every iteration improves the marginal likelihood $P_{\hat{\theta}^{(t)}}(D)$.

Does the Initial Value Matter?

1. The initial value $\hat{\theta}^{(0)}$ affects the speed of convergence and the value of $\hat{\theta}^{(\infty)}$! Smart initialization methods are often needed.

2. The $K$-means algorithm is often used to initialize the parameters in a Gaussian mixture model before applying the EM algorithm, as sketched below.
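A minimal initialization sketch along these lines (assuming NumPy and scikit-learn; the dataset here is a random placeholder): run $K$-means, then seed the mixture weights, means and covariances from the resulting hard clustering before starting EM.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)
X = rng.normal(size=(1000, 2))                 # stand-in dataset for the example
K = 3

km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)

# Seed the mixture parameters from the hard K-means clustering.
pi0 = np.bincount(km.labels_, minlength=K) / len(X)
mu0 = km.cluster_centers_
Sigma0 = np.array([np.cov(X[km.labels_ == k].T) for k in range(K)])
```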

