

The Expectation-Maximization Algorithm

Mihaela van der Schaar

Department of Engineering Science, University of Oxford


MLE for Latent Variable Models

Latent Variables and Marginal Likelihoods

Many probabilistic models have hidden variables that are not observable in the dataset $D$: these models are known as latent variable models.

Examples: Hidden Markov Models and Mixture Models. How would MLE be carried out for such models?

Each data point is drawn from a joint distribution $P_\theta(X, Z)$. For a realization $((X_1, Z_1), \ldots, (X_n, Z_n))$, we only observe the variables in the dataset $D = (X_1, \ldots, X_n)$.

Complete-data likelihood:

$$P_\theta((X_1, Z_1), \ldots, (X_n, Z_n)) = \prod_{i=1}^{n} P_\theta(X_i, Z_i).$$

Marginal likelihood:

$$P_\theta(X_1, \ldots, X_n) = \prod_{i=1}^{n} \sum_{z} P_\theta(X_i, Z_i = z).$$
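As a small numerical illustration of the difference (a Python sketch assuming NumPy and SciPy; the two-component one-dimensional mixture and its parameter values below are invented purely for this example), the complete-data log-likelihood uses the latent labels directly, whereas the marginal log-likelihood sums over them inside the logarithm:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

rng = np.random.default_rng(0)

# Illustrative two-component 1D Gaussian mixture (unit variances).
weights = np.array([0.3, 0.7])            # P(Z = z)
means = np.array([-2.0, 3.0])             # component means
Z = rng.choice(2, size=100, p=weights)    # latent labels (normally unobserved)
X = rng.normal(loc=means[Z], scale=1.0)   # observed dataset D

# log P_theta(X_i, Z_i = z) for every data point i and every value of z.
log_joint = np.log(weights) + norm.logpdf(X[:, None], loc=means, scale=1.0)

# Complete-data log-likelihood: picks out the realized Z_i for each point.
complete_ll = log_joint[np.arange(len(X)), Z].sum()

# Marginal log-likelihood: sums over z inside the log (via logsumexp).
marginal_ll = logsumexp(log_joint, axis=1).sum()
print(complete_ll, marginal_ll)
```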


The Hardness of Maximizing Marginal Likelihoods (I)

The MLE is obtained by maximizing the marginal likelihood:

$$\hat{\theta}_n^* = \underset{\theta \in \Theta}{\operatorname{argmax}} \sum_{i=1}^{n} \log\left( \sum_{z} P_\theta(X_i, Z_i = z) \right).$$

Solving this optimization problem is often a hard task: the objective is non-convex, has many local maxima, and admits no analytic solution.

Figure: the complete-data log-likelihood (left) and the marginal log-likelihood (right), plotted as functions of θ.


The Hardness of Maximizing Marginal Likelihoods (II)

The MLE for θ is obtained by maximizing the marginal log-likelihood function:

$$\hat{\theta}_n^* = \underset{\theta \in \Theta}{\operatorname{argmax}} \sum_{i=1}^{n} \log\left( \sum_{z} P_\theta(X_i, Z_i = z) \right).$$

Solving this optimization problem is often a hard task!

The methods used in the previous lecture would not work.

Need a simpler approximate procedure!

The Expectation-Maximization (EM) algorithm is an iterative algorithm that computes an approximate solution to the MLE optimization problem.


Exponential Families (I)

The EM algorithm is well-suited for exponential family distributions.

Exponential Family

A single-parameter exponential family is a set of probability distributions that can be expressed in the form

$$P_\theta(X) = h(X) \cdot \exp\big(\eta(\theta) \cdot T(X) - A(\theta)\big),$$

where $h(X)$, $\eta(\theta)$, $T(X)$ and $A(\theta)$ are known functions. An alternative, equivalent form is often given as

$$P_\theta(X) = h(X) \cdot g(\theta) \cdot \exp\big(\eta(\theta) \cdot T(X)\big).$$

The variable θ is called the parameter of the family.
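The slides do not give this example, but the Poisson(λ) distribution is a convenient single-parameter instance, with $h(X) = 1/X!$, $\eta(\theta) = \log\lambda$, $T(X) = X$ and $A(\theta) = \lambda$. A quick numerical check of that factorization (a sketch assuming NumPy and SciPy):

```python
import numpy as np
from scipy.stats import poisson
from scipy.special import factorial

lam = 3.5                  # illustrative Poisson rate
x = np.arange(15)

# Exponential-family form: h(x) * exp(eta * T(x) - A)
# with h(x) = 1/x!, eta = log(lam), T(x) = x, A = lam.
pmf_expfam = (1.0 / factorial(x)) * np.exp(np.log(lam) * x - lam)

assert np.allclose(pmf_expfam, poisson.pmf(x, lam))
```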


Exponential Families (II)

Exponential family distributions:

$$P_\theta(X) = h(X) \cdot \exp\big(\eta(\theta) \cdot T(X) - A(\theta)\big).$$

$T(X)$ is a sufficient statistic of the distribution.
The sufficient statistic is a function of the data that fully summarizes the data $X$ within the density function $P_\theta(X)$. This means that for any datasets $D_1$ and $D_2$, the density function is the same whenever $T(D_1) = T(D_2)$, even if $D_1$ and $D_2$ are quite different.
The sufficient statistic of a set of independent, identically distributed observations is simply the sum of the individual sufficient statistics, i.e. $T(D) = \sum_{i=1}^{n} T(X_i)$.
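A small check of the last point for the normal distribution (a sketch assuming NumPy and SciPy; μ and σ below are arbitrary): the joint log-likelihood of an i.i.d. sample depends on the data only through $T(D) = \sum_i [X_i, X_i^2]$ and the sample size $n$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
mu, sigma = 1.5, 2.0                        # illustrative parameters
X = rng.normal(mu, sigma, size=200)
n = len(X)

# Sufficient statistic of the i.i.d. sample: sum of the per-point statistics.
T = np.array([X.sum(), (X ** 2).sum()])     # T(D) = sum_i [X_i, X_i^2]

# Joint log-likelihood written only through T(D) and n ...
eta = np.array([mu / sigma ** 2, -1.0 / (2 * sigma ** 2)])
A = mu ** 2 / (2 * sigma ** 2) + np.log(sigma)
ll_from_T = eta @ T - n * A - 0.5 * n * np.log(2 * np.pi)

# ... matches the log-likelihood computed point by point.
assert np.isclose(ll_from_T, norm.logpdf(X, mu, sigma).sum())
```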


Exponential Families (III)

Exponential family distributions:

$$P_\theta(X) = h(X) \cdot \exp\big(\eta(\theta) \cdot T(X) - A(\theta)\big).$$

$\eta(\theta)$ is called the natural parameter. The set of values of $\eta(\theta)$ for which the function $P_\theta(X)$ is finite is called the natural parameter space.

$A(\theta)$ is called the log-partition function. The mean, variance and other moments of the sufficient statistic $T(X)$ can be derived by differentiating $A(\theta)$.
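For instance (an illustration not taken from the slides), the Poisson distribution in canonical form has $\eta = \log\lambda$ and $A(\eta) = e^{\eta}$, so $A'(\eta) = A''(\eta) = \lambda$, which recovers the Poisson mean and variance. A finite-difference check, assuming NumPy:

```python
import numpy as np

# Canonical-form Poisson: eta = log(lam), A(eta) = exp(eta).
def A(eta):
    return np.exp(eta)

lam = 4.0                      # illustrative rate
eta, h = np.log(lam), 1e-5

# First derivative of A gives the mean of T(X) = X; the second gives its variance.
mean_T = (A(eta + h) - A(eta - h)) / (2 * h)
var_T = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h ** 2

print(mean_T, var_T)           # both approximately lam = 4.0
```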


Exponential Families (IV)

Exponential Family Example: Normal Distribution

$$P_\theta(X) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(X-\mu)^2}{2\sigma^2}\right) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{X^2 - 2X\mu + \mu^2}{2\sigma^2} - \log\sigma\right)$$

$$= \frac{1}{\sqrt{2\pi}} \exp\left(\left\langle \left[\frac{\mu}{\sigma^2},\, -\frac{1}{2\sigma^2}\right]^T, \left[X,\, X^2\right]^T \right\rangle - \left(\frac{\mu^2}{2\sigma^2} + \log\sigma\right)\right).$$

$$\eta(\theta) = \left[\frac{\mu}{\sigma^2},\, -\frac{1}{2\sigma^2}\right]^T, \qquad h(X) = (2\pi)^{-\frac{1}{2}}, \qquad T(X) = \left[X,\, X^2\right]^T, \qquad A(\theta) = \frac{\mu^2}{2\sigma^2} + \log\sigma.$$
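A quick numerical sanity check of this factorization against the usual normal density (a sketch assuming NumPy and SciPy; the values of μ and σ are arbitrary):

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 0.7, 1.3                                          # illustrative parameters
x = np.linspace(-4.0, 4.0, 9)

eta = np.array([mu / sigma ** 2, -1.0 / (2 * sigma ** 2)])    # natural parameter
T = np.stack([x, x ** 2], axis=1)                             # sufficient statistic [X, X^2]
A = mu ** 2 / (2 * sigma ** 2) + np.log(sigma)                # log-partition term
h = 1.0 / np.sqrt(2 * np.pi)                                  # base measure

pdf_expfam = h * np.exp(T @ eta - A)
assert np.allclose(pdf_expfam, norm.pdf(x, mu, sigma))
```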


Exponential Families (V)

Properties of Exponential Families

Exponential families have sufficient statistics that can summarize arbitrary amounts of independent identically distributed data using a fixed number of values.

Exponential families have conjugate priors (an important property in Bayesian statistics).

The posterior predictive distribution of an exponential-family random variable with a conjugate prior can always be written in closed form.


Exponential Families (VI)

The Canonical Form of Exponential Families

If $\eta(\theta) = \theta$, then the exponential family is said to be in canonical form.

The canonical form is non-unique, since $\eta(\theta)$ can be multiplied by any nonzero constant, provided that $T(X)$ is multiplied by that constant's reciprocal, or a constant $c$ can be added to $\eta(\theta)$ and $h(X)$ multiplied by $\exp(-c \cdot T(X))$ to offset it.


EM: The Algorithm

Expectation-Maximization (I)

Two unknowns: the latent variables $Z = (Z_1, \ldots, Z_n)$ and the parameter θ.

Complications arise because we do not know the latent variables $(Z_1, \ldots, Z_n)$; if we knew them, maximizing $P_\theta((X_1, Z_1), \ldots, (X_n, Z_n))$ would often be a simpler task! Recall that maximizing the complete-data likelihood is often simpler than maximizing the marginal likelihood.

Figure: the complete-data log-likelihood (left) and the marginal log-likelihood (right), plotted as functions of θ.


Expectation-Maximization (II)

The EM Algorithm

1. Start with an initial guess $\hat{\theta}^{(0)}$ for θ. For every iteration $t$, do the following:
2. E-Step: $Q(\theta, \hat{\theta}^{(t)}) = \sum_z \log\big(P_\theta(Z = z, D)\big) \cdot P(Z = z \mid D, \hat{\theta}^{(t)})$.
3. M-Step: $\hat{\theta}^{(t+1)} = \operatorname{argmax}_{\theta \in \Theta} Q(\theta, \hat{\theta}^{(t)})$.
4. Go to step 2 if the stopping criterion is not met.
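As a toy illustration of steps 1-4 (a Python sketch assuming NumPy and SciPy; the one-dimensional two-component mixture, its unit variances and equal mixing weights are made-up simplifications, so only the component means play the role of θ):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# Toy data: two 1D Gaussian components, unit variance, equal weights.
true_means = np.array([-2.0, 2.0])
X = rng.normal(loc=true_means[rng.integers(0, 2, size=500)], scale=1.0)

mu = np.array([-0.5, 0.5])                  # step 1: initial guess theta_hat^(0)
for t in range(50):
    # E-step: posterior P(Z_i = k | X_i, theta_hat^(t)) under equal weights,
    # which is all that is needed to form Q(theta, theta_hat^(t)).
    log_p = norm.logpdf(X[:, None], loc=mu, scale=1.0)
    gamma = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M-step: maximizing Q gives responsibility-weighted means.
    mu = (gamma * X[:, None]).sum(axis=0) / gamma.sum(axis=0)

print(mu)                                   # close to the true means (-2, 2)
```

(A fixed number of iterations stands in for the stopping criterion of step 4; in practice one would monitor the change in the marginal log-likelihood.)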


Expectation-Maximization (III)

Two unknowns: the latent variables $Z = (Z_1, \ldots, Z_n)$ and the parameter θ.

"Expected" likelihood: $\sum_z \log\big(P_\theta(Z = z, D)\big) \cdot P(Z = z \mid D, \theta)$.

Here the logarithm acts directly on the complete-data likelihood, so the corresponding M-step will be tractable. But we still have two terms, $\log\big(P_\theta(Z = z, D)\big)$ and $P(Z \mid D, \theta)$, that depend on the two unknowns $Z$ and θ.

The EM algorithm: in the E-step, fix the posterior $Z \mid D, \theta$ by conditioning on the current guess for θ, i.e. $Z \mid D, \hat{\theta}^{(t)}$; in the M-step, update the guess for θ by solving a tractable optimization problem.

The EM algorithm breaks down the intractable MLE optimization problem into simpler, tractable iterative steps.


EM for Exponential Family (I)

The critical points of the marginal likelihood function satisfy

$$\frac{\partial \log P_\theta(D)}{\partial \theta} = \frac{1}{P_\theta(D)} \sum_z \frac{\partial P_\theta(D, Z = z)}{\partial \theta} = 0.$$

$$\frac{\partial \log P_\theta(D, Z)}{\partial \theta} = \frac{\partial}{\partial \theta} \log \underbrace{\Big( \exp\big(\langle \eta(\theta), T(D, Z) \rangle - A(\theta)\big) \cdot h(D, Z) \Big)}_{\text{canonical form of exponential family}}.$$

For $\eta(\theta) = \theta$, we have that

$$\frac{\partial P_\theta(D, Z)}{\partial \theta} = \left( T(D, Z) - \frac{\partial}{\partial \theta} A(\theta) \right) P_\theta(D, Z).$$


EM for Exponential Family (II)

For exponential families, $\mathbb{E}_\theta[T(D, Z)] = \frac{\partial}{\partial \theta} A(\theta)$, so

$$\frac{\partial P_\theta(D, Z)}{\partial \theta} = \big( T(D, Z) - \mathbb{E}_\theta[T(D, Z)] \big)\, P_\theta(D, Z).$$

Since $\frac{1}{P_\theta(D)} \sum_z \frac{\partial P_\theta(D, Z = z)}{\partial \theta} = 0$, we have that

$$\frac{1}{P_\theta(D)} \sum_z \big( T(D, Z = z) - \mathbb{E}_\theta[T(D, Z)] \big)\, P_\theta(D, Z = z) = 0$$

$$\sum_z T(D, Z = z)\, \frac{P_\theta(D, Z = z)}{P_\theta(D)} - \mathbb{E}_\theta[T(D, Z)] = 0$$

$$\mathbb{E}_\theta[T(D, Z) \mid D] - \mathbb{E}_\theta[T(D, Z)] = 0.$$


EM for Exponential Family (III)

For the critical values of θ, the following condition is satisfied:

$$\mathbb{E}_\theta[T(D, Z) \mid D] = \mathbb{E}_\theta[T(D, Z)].$$

How is this related to the EM objective Q(θ, θ̂(t))?

$$Q(\theta, \hat{\theta}^{(t)}) = \sum_z \log\big(P_\theta(Z = z, D)\big) \cdot P_{\hat{\theta}^{(t)}}(Z = z \mid D) = \theta\, \mathbb{E}_{\hat{\theta}^{(t)}}[T(D, Z) \mid D] - A(\theta) + \text{constant}.$$

Differentiating and using $\frac{\partial}{\partial \theta} A(\theta) = \mathbb{E}_\theta[T(D, Z)]$:

$$\frac{\partial Q(\theta, \hat{\theta}^{(t)})}{\partial \theta} = 0 \;\Rightarrow\; \mathbb{E}_{\hat{\theta}^{(t)}}[T(D, Z) \mid D] = \mathbb{E}_\theta[T(D, Z)].$$

Since it is difficult to solve the above equation analytically, the EM algorithm solves for θ via successive approximations, i.e. it solves the following for $\hat{\theta}^{(t+1)}$:

$$\mathbb{E}_{\hat{\theta}^{(t)}}[T(D, Z) \mid D] = \mathbb{E}_{\hat{\theta}^{(t+1)}}[T(D, Z)].$$


Multivariate Gaussian Mixture Models

Example: Multivariate Gaussian Mixtures

Parameters for a mixture of $K$ Gaussians: mixture proportions $\{\pi_k\}_{k=1}^{K}$, and mean vectors and covariance matrices $\{(\mu_k, \Sigma_k)\}_{k=1}^{K}$.

Figure: contour plot for the density of a mixture of $K = 3$ bivariate Gaussian distributions (axes $X_1$, $X_2$).


The Generative Process

$Z_i \sim \mathrm{Categorical}(\pi_1, \ldots, \pi_K)$, and $X_i \mid Z_i = z \sim \mathcal{N}(\mu_z, \Sigma_z)$.

Figure: a sample from a mixture model; every data point is colored according to its component membership (axes $X_1$, $X_2$).
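A minimal sketch of this generative process (assuming NumPy; the weights, means and covariances below are invented for the example and are not the ones behind the lecture's figures):

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative 2D mixture parameters.
pis = np.array([0.5, 0.3, 0.2])
mus = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])
Sigmas = np.array([np.eye(2), 0.5 * np.eye(2), np.diag([2.0, 0.3])])

n = 1000
Z = rng.choice(len(pis), size=n, p=pis)          # Z_i ~ Categorical(pi_1, ..., pi_K)
X = np.array([rng.multivariate_normal(mus[z], Sigmas[z])   # X_i | Z_i = z ~ N(mu_z, Sigma_z)
              for z in Z])
```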


The Dataset

We need to learn the parameters $(\pi_k, \mu_k, \Sigma_k)_{k=1}^{K}$ from the data points $D = (X_1, \ldots, X_n)$, which are not "colored" by their component memberships, i.e. we do not observe the latent variables $Z = (Z_1, \ldots, Z_n)$.

Figure: (a) $(D, Z)$: the data points and their component memberships; (b) $D$: the dataset with the observed data points only (component memberships are latent).


EM for Gaussian Mixture Models

MLE for the Gaussian Mixture Models

The complete-data likelihood function is given by

$$P_\theta(D, Z) = \prod_{i=1}^{n} \pi_{z_i} \cdot \mathcal{N}(X_i \mid \mu_{z_i}, \Sigma_{z_i}).$$

The marginal likelihood function is

$$P_\theta(D) = \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_k \cdot \mathcal{N}(X_i \mid \mu_k, \Sigma_k).$$

The MLE can be obtained by maximizing the marginal log-likelihood function:

$$\hat{\theta}_n^* = \underset{\theta \in \Theta}{\operatorname{argmax}} \sum_{i=1}^{n} \log\left( \sum_{k=1}^{K} \pi_k \cdot \mathcal{N}(X_i \mid \mu_k, \Sigma_k) \right).$$

Exercise: Is the objective function above concave?


Implementing EM for the Gaussian Mixture Model (I)

The expected complete-data log-likelihood function is

$$\mathbb{E}_Z[\log P_\theta(D, Z)] = \sum_{i=1}^{n} \sum_{k=1}^{K} \gamma(k, X_i \mid \theta)\, \big(\log(\pi_k) + \log \mathcal{N}(X_i \mid \mu_k, \Sigma_k)\big),$$

$$\gamma(k, X_i \mid \theta) = P_\theta(Z_i = k \mid X_i).$$

$\gamma(k, X_i \mid \theta)$ is called the responsibility of component $k$ towards data point $X_i$.

$$\gamma(k, X_i \mid \theta) = \frac{\pi_k \cdot \mathcal{N}(X_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \cdot \mathcal{N}(X_i \mid \mu_j, \Sigma_j)}.$$

Try to work out the derivation above yourself!
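A small sketch of the responsibility computation (assuming NumPy and SciPy; the data points and the parameter values below are arbitrary placeholders standing in for the current estimate of θ):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(4)
X = rng.normal(size=(6, 2))                     # a few illustrative 2D data points

# Current parameter guesses (arbitrary values for the example).
pis = np.array([0.4, 0.6])
mus = np.array([[0.0, 0.0], [1.0, 1.0]])
Sigmas = np.array([np.eye(2), 2.0 * np.eye(2)])

# gamma[i, k] = pi_k N(X_i | mu_k, Sigma_k) / sum_j pi_j N(X_i | mu_j, Sigma_j)
weighted = np.stack([pis[k] * multivariate_normal(mus[k], Sigmas[k]).pdf(X)
                     for k in range(len(pis))], axis=1)
gamma = weighted / weighted.sum(axis=1, keepdims=True)
print(gamma.sum(axis=1))                        # each row sums to 1
```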


Implementing EM for the Gaussian Mixture Model (II)

(E-step) Approximate the expected complete-data log-likelihood by fixing the responsibilities $\gamma(k, X_i \mid \theta)$ using the parameter estimates obtained from the previous iteration.

$$Q(\theta, \hat{\theta}^{(t)}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \gamma(k, X_i \mid \hat{\theta}^{(t)})\, \big(\log(\pi_k) + \log \mathcal{N}(X_i \mid \mu_k, \Sigma_k)\big),$$

$$\gamma(k, X_i \mid \hat{\theta}^{(t)}) = \frac{\hat{\pi}_k^{(t)} \cdot \mathcal{N}(X_i \mid \hat{\mu}_k^{(t)}, \hat{\Sigma}_k^{(t)})}{\sum_{j=1}^{K} \hat{\pi}_j^{(t)} \cdot \mathcal{N}(X_i \mid \hat{\mu}_j^{(t)}, \hat{\Sigma}_j^{(t)})}.$$

(M-step) Solve a tractable optimization problem

$$(\hat{\pi}^{(t+1)}, \hat{\mu}^{(t+1)}, \hat{\Sigma}^{(t+1)}) = \underset{(\pi, \mu, \Sigma)}{\operatorname{argmax}} \sum_{i=1}^{n} \sum_{k=1}^{K} \gamma(k, X_i \mid \hat{\theta}^{(t)})\, \big(\log(\pi_k) + \log \mathcal{N}(X_i \mid \mu_k, \Sigma_k)\big).$$


Implementing EM for the Gaussian Mixture Model (III)

The M-step yields the following parameter updating equations:

$$\hat{\pi}_k^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} \gamma(k, X_i \mid \hat{\theta}^{(t)})$$

$$\hat{\mu}_k^{(t+1)} = \frac{\sum_{i=1}^{n} \gamma(k, X_i \mid \hat{\theta}^{(t)})\, X_i}{\sum_{j=1}^{n} \gamma(k, X_j \mid \hat{\theta}^{(t)})}$$

$$\hat{\Sigma}_k^{(t+1)} = \sum_{i=1}^{n} \frac{\gamma(k, X_i \mid \hat{\theta}^{(t)})}{\sum_{j=1}^{n} \gamma(k, X_j \mid \hat{\theta}^{(t)})}\, (X_i - \hat{\mu}_k^{(t+1)})(X_i - \hat{\mu}_k^{(t+1)})^T$$

Try to work out the updating equations by yourself!
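A vectorized sketch of these updates (assuming NumPy; the data matrix and responsibilities below are random placeholders standing in for the E-step output):

```python
import numpy as np

rng = np.random.default_rng(5)
n, K, d = 500, 3, 2
X = rng.normal(size=(n, d))                   # placeholder data
gamma = rng.dirichlet(np.ones(K), size=n)     # placeholder responsibilities (rows sum to 1)

Nk = gamma.sum(axis=0)                        # effective number of points per component
pi_new = Nk / n                               # pi_k^(t+1)
mu_new = (gamma.T @ X) / Nk[:, None]          # mu_k^(t+1): responsibility-weighted mean
Sigma_new = np.empty((K, d, d))
for k in range(K):
    diff = X - mu_new[k]
    Sigma_new[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]   # Sigma_k^(t+1)
```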


EM in Practice

Consider a Gaussian mixture model with $K = 3$ and the following parameters:

π1 = 0.6, π2 = 0.05, and π3 = 0.35.

µ1 = [−1.4, 1.8]T , µ2 = [−1.4, −2.8]T , µ3 = [−1.9, 0.55]T .

$$\Sigma_1 = \begin{bmatrix} 0.8 & -0.8 \\ -0.8 & 4 \end{bmatrix}, \quad \Sigma_2 = \begin{bmatrix} 1.2 & 2.3 \\ 2.3 & 5.2 \end{bmatrix}, \quad \Sigma_3 = \begin{bmatrix} 0.4 & -0.01 \\ -0.01 & 0.35 \end{bmatrix}.$$

Try writing MATLAB code that generates a random dataset of 5000 data points drawn from the model specified above, and implement the EM algorithm to learn the model parameters from this dataset.
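The exercise suggests MATLAB; as a rough sketch of the same pipeline in Python (assuming NumPy and SciPy, and using a simple random initialization and a plain stopping rule, neither of which is prescribed by the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Model parameters from the slide.
pis = np.array([0.6, 0.05, 0.35])
mus = np.array([[-1.4, 1.8], [-1.4, -2.8], [-1.9, 0.55]])
Sigmas = np.array([[[0.8, -0.8], [-0.8, 4.0]],
                   [[1.2, 2.3], [2.3, 5.2]],
                   [[0.4, -0.01], [-0.01, 0.35]]])
n, K = 5000, 3

# Generate the dataset.
Z = rng.choice(K, size=n, p=pis)
X = np.array([rng.multivariate_normal(mus[z], Sigmas[z]) for z in Z])

# Simple random initialization.
pi_h = np.full(K, 1.0 / K)
mu_h = X[rng.choice(n, size=K, replace=False)]
Sigma_h = np.array([np.cov(X.T) for _ in range(K)])

prev_ll = -np.inf
for t in range(500):
    # E-step: responsibilities gamma[i, k].
    dens = np.stack([pi_h[k] * multivariate_normal(mu_h[k], Sigma_h[k]).pdf(X)
                     for k in range(K)], axis=1)
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M-step: parameter updates.
    Nk = gamma.sum(axis=0)
    pi_h = Nk / n
    mu_h = (gamma.T @ X) / Nk[:, None]
    for k in range(K):
        diff = X - mu_h[k]
        Sigma_h[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    # Stop when the marginal log-likelihood stops improving.
    ll = np.log(dens.sum(axis=1)).sum()
    if ll - prev_ll < 1e-6:
        break
    prev_ll = ll

print(pi_h)
print(mu_h)
```

(For numerical robustness one would usually evaluate the responsibilities in log space, e.g. with a log-sum-exp, and guard against components collapsing onto single points.)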


EM in Practice

The marginal (observed-data) log-likelihood increases after every EM iteration!

This means that every new iteration finds a "better" estimate!

Figure: the log-likelihood as a function of the EM iteration.


EM in Practice

Compare the true density function with the estimated one.

Figure: contour plot for the true density function (left) and contour plot for the estimated density function (right), over $(X_1, X_2)$.


EM Performance Guarantees

What Does EM Guarantee?

The EM algorithm does not guarantee that $\hat{\theta}^{(t)}$ will converge to $\hat{\theta}_n^*$.

EM guarantees the following: $\hat{\theta}^{(t)}$ always converges (to a local optimum), and every iteration improves the marginal likelihood $P_{\hat{\theta}^{(t)}}(D)$.

Does the Initial Value Matter?

1. The initial value $\hat{\theta}^{(0)}$ affects the speed of convergence and the value of $\hat{\theta}^{(\infty)}$! Smart initialization methods are often needed.

2. The $K$-means algorithm is often used to initialize the parameters in a Gaussian mixture model before applying the EM algorithm, as sketched below.
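A minimal initialization sketch along these lines (assuming NumPy and scikit-learn; the dataset here is a random placeholder): run $K$-means, then seed the mixture weights, means and covariances from the resulting hard clustering before starting EM.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)
X = rng.normal(size=(1000, 2))                 # stand-in dataset for the example
K = 3

km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)

# Seed the mixture parameters from the hard K-means clustering.
pi0 = np.bincount(km.labels_, minlength=K) / len(X)
mu0 = km.cluster_centers_
Sigma0 = np.array([np.cov(X[km.labels_ == k].T) for k in range(K)])
```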

