CS434a/541a: Pattern Recognition Prof. Olga Veksler Lecture 17


Page 1:

CS434a/541a: Pattern Recognition, Prof. Olga Veksler

Lecture 17

Page 2:

Today

- Parametric Unsupervised Learning
- Expectation Maximization (EM)
  - one of the most useful statistical methods
  - oldest version in 1958 (Hartley)
  - seminal paper in 1977 (Dempster et al.)
  - can also be used when some samples are missing features

Page 3:

Unsupervised Learning

- In unsupervised learning, we are only given samples x1,…, xn without class labels

- Last 2 lectures: nonparametric approach (clustering)

- Today: parametric approach
  - assume a parametric distribution of the data
  - estimate the parameters of this distribution
  - much "harder" than the supervised learning case

Page 4:

Parametric Unsupervised Learning

- Assume the data was generated by a model P(x|θ) with known shape but unknown parameters
- Advantages of having a model
  - Gives a meaningful way to cluster data
    - adjust the parameters of the model to maximize the probability that the model produced the observed data
  - Can sensibly measure if a clustering is good
    - compute the likelihood of the data induced by the clustering
  - Can compare 2 clustering algorithms
    - which one gives the higher likelihood of the observed data?

Page 5:

Parametric Supervised Learning

- Let us recall supervised parametric learning
  - have m classes
  - have samples x1,…, xn, each of class 1, 2,…, m
  - suppose Di holds the samples from class i
  - the probability distribution for class i is pi(x|θi)

[Figure: class-conditional densities p1(x|θ1) and p2(x|θ2)]

Page 6:

Parametric Supervised Learning

- Use the ML method to estimate the parameters θi
  - find θi which maximizes the likelihood function F(θi)

$$p(D_i|\theta_i) = \prod_{x \in D_i} p(x|\theta_i) = F(\theta_i)$$

  - or, equivalently, find θi which maximizes the log likelihood l(θi)

$$l(\theta_i) = \ln p(D_i|\theta_i) = \sum_{x \in D_i} \ln p(x|\theta_i)$$

$$\hat{\theta}_1 = \arg\max_{\theta_1} \ln p(D_1|\theta_1), \qquad \hat{\theta}_2 = \arg\max_{\theta_2} \ln p(D_2|\theta_2)$$

Page 7:

Parametric Supervised Learning

- now the distributions are fully specified
- can classify an unknown sample using the MAP classifier

[Figure: estimated densities p(x|θ̂1) and p(x|θ̂2); decide class 1 where p1(x|c1)P(c1) > p2(x|c2)P(c2), and class 2 where p2(x|c2)P(c2) > p1(x|c1)P(c1)]

Page 8:

Parametric Unsupervised Learning

- In unsupervised learning, no one tells us the true classes of the samples. We still know:
  - we have m classes
  - we have samples x1,…, xn, each of unknown class
  - the probability distribution for class i is pi(x|θi)
- Can we determine the classes and the parameters simultaneously?

Page 9:

Example: MRI Brain Segmentation

- In an MRI brain image, different brain tissues have different intensities
- Know that the brain has 6 major types of tissues
  - Each type of tissue can be modeled reasonably well by a Gaussian N(µi, σi²); the parameters µi, σi² are unknown
- Segmenting (classifying) the brain image into the different tissue classes is very useful
  - don't know which image pixel corresponds to which tissue (class)
  - don't know the parameters of each N(µi, σi²)

[Figure: MRI brain image and its tissue segmentation; picture from M. Leventon]

Page 10:

Mixture Density Model

- Model the data with a mixture density

$$p(x|\theta) = \sum_{j=1}^{m} p(x|c_j, \theta_j)\,P(c_j), \qquad \text{where } \theta = \{\theta_1, \ldots, \theta_m\}$$

  - the p(x|cj, θj) are the component densities, and the P(cj) are the mixing parameters, with P(c1) + P(c2) + … + P(cm) = 1
- To generate a sample from the distribution p(x|θ):
  - first select class j with probability P(cj)
  - then generate x according to the probability law p(x|cj, θj) (see the sampling sketch after this slide)

[Figure: a three-component mixture with mixing parameters P(c1), P(c2), P(c3) and component densities p(x|c1,θ1), p(x|c2,θ2), p(x|c3,θ3)]
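Below is a minimal Python sketch of this two-stage sampling procedure; the function name and the example parameters are assumptions for illustration, not something from the lecture.

```python
import numpy as np

def sample_mixture(priors, means, sigmas, n, seed=0):
    """Draw n samples from a 1-D Gaussian mixture:
    first pick component j with probability P(c_j), then draw x from p(x|c_j, theta_j)."""
    rng = np.random.default_rng(seed)
    priors, means, sigmas = map(np.asarray, (priors, means, sigmas))
    comps = rng.choice(len(priors), size=n, p=priors)   # step 1: select class j with prob P(c_j)
    x = rng.normal(means[comps], sigmas[comps])         # step 2: x ~ N(mu_j, sigma_j^2)
    return x, comps

# example with three components whose mixing parameters sum to 1
x, labels = sample_mixture([0.2, 0.3, 0.5], [0.0, 6.0, -7.0], [1.0, 2.0, 2.5], n=1000)
```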

Page 11:

Example: Gaussian Mixture Density

- Mixture of 3 Gaussians

$$p(x) = 0.2\,p_1(x) + 0.3\,p_2(x) + 0.5\,p_3(x)$$

$$p_1(x) \approx N\!\left(\begin{bmatrix}0\\0\end{bmatrix},\begin{bmatrix}1&0\\0&1\end{bmatrix}\right), \quad p_2(x) \approx N\!\left(\begin{bmatrix}6\\6\end{bmatrix},\begin{bmatrix}4&0\\0&4\end{bmatrix}\right), \quad p_3(x) \approx N\!\left(\begin{bmatrix}-7\\7\end{bmatrix},\begin{bmatrix}6&0\\0&6\end{bmatrix}\right)$$

[Figure: contours of the three component densities p1(x), p2(x), p3(x)]

Page 12:

Mixture Density

$$p(x|\theta) = \sum_{j=1}^{m} p(x|c_j, \theta_j)\,P(c_j)$$

- P(c1),…, P(cm) can be known or unknown
- Suppose we know how to estimate θ1,…, θm and P(c1),…, P(cm)
- Can "break apart" the mixture p(x|θ) for classification
- To classify a sample x, use MAP estimation: choose the class i which maximizes the probability that component i generated x (a small sketch follows),

$$P(c_i|x, \theta_i) \propto p(x|c_i, \theta_i)\,P(c_i)$$
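A small Python sketch of this MAP rule, assuming univariate Gaussian components; the helper names and parameters are illustrative assumptions.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """Univariate Gaussian density N(mu, sigma^2) evaluated at x."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def map_classify(x, priors, means, sigmas):
    """Choose the component i maximizing p(x|c_i, theta_i) * P(c_i)."""
    scores = [gauss_pdf(x, m, s) * p for p, m, s in zip(priors, means, sigmas)]
    return int(np.argmax(scores))

# the sample is assigned to the component most likely to have generated it
print(map_classify(5.5, priors=[0.2, 0.3, 0.5], means=[0.0, 6.0, -7.0], sigmas=[1.0, 2.0, 2.5]))
```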

Page 13:

ML Estimation for Mixture Density

$$p(x|\theta, \rho) = \sum_{j=1}^{m} p(x|c_j, \theta_j)\,\rho_j$$

- Can use Maximum Likelihood estimation for a mixture density; need to estimate
  - θ1,…, θm
  - ρ1 = P(c1),…, ρm = P(cm), with ρ = {ρ1,…, ρm}
- As in the supervised case, form the log likelihood function (a small sketch of its computation follows the slide)

$$l(\theta, \rho) = \ln p(D|\theta, \rho) = \ln \prod_{i=1}^{n} p(x_i|\theta, \rho) = \sum_{i=1}^{n} \ln p(x_i|\theta, \rho) = \sum_{i=1}^{n} \ln\left[\sum_{j=1}^{m} p(x_i|c_j, \theta_j)\,\rho_j\right]$$
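A short Python sketch evaluating this log likelihood for univariate Gaussian components; the function name and parameters are assumptions for illustration.

```python
import numpy as np

def mixture_log_likelihood(X, priors, means, sigmas):
    """l(theta, rho) = sum_i ln sum_j p(x_i|c_j, theta_j) * rho_j for a 1-D Gaussian mixture."""
    X = np.asarray(X)[:, None]                                    # shape (n, 1)
    means, sigmas, priors = map(np.asarray, (means, sigmas, priors))
    comp = np.exp(-(X - means) ** 2 / (2 * sigmas ** 2)) / (np.sqrt(2 * np.pi) * sigmas)  # (n, m)
    return float(np.sum(np.log((comp * priors).sum(axis=1))))     # sum over samples of ln(inner sum)
```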

Page 14:

ML Estimation for Mixture Density

- need to maximize l(θ, ρ) with respect to θ and ρ

$$l(\theta, \rho) = \sum_{i=1}^{n} \ln\left[\sum_{j=1}^{m} p(x_i|c_j, \theta_j)\,\rho_j\right]$$

- As you may have guessed, l(θ, ρ) is not the easiest function to maximize
- If we take partial derivatives with respect to θ and ρ and set them to 0, we typically get a "coupled" nonlinear system of equations
  - usually a closed-form solution cannot be found
- We could use the gradient ascent method
  - in general, it is not the greatest method to use and should only be used as a last resort
- There is a better algorithm, called EM

Page 15:

Mixture Density

- Before EM, let's look at the mixture density again

$$p(x|\theta, \rho) = \sum_{j=1}^{m} p(x|c_j, \theta_j)\,\rho_j$$

- Suppose we know how to estimate θ1,…, θm and ρ1,…, ρm
  - Then estimating the class of x is easy with MAP: maximize

$$p(x|c_i, \theta_i)\,P(c_i) = p(x|c_i, \theta_i)\,\rho_i$$

- Suppose we know the class of the samples x1,…, xn
  - This is just the supervised learning case, so estimating θ1,…, θm and ρ1,…, ρm is easy:

$$\hat{\theta}_i = \arg\max_{\theta_i} \ln p(D_i|\theta_i), \qquad \hat{\rho}_i = \frac{|D_i|}{n}$$

- This is an example of a chicken-and-egg problem
  - The EM algorithm approaches this problem by adding "hidden" variables

Page 16:

Expectation Maximization Algorithm

- EM is an algorithm for ML parameter estimation when the data has missing values. It is used when:
  1. the data is incomplete (has missing values)
     - some features are missing for some samples due to data corruption, partial survey responses, etc.
     - this scenario is very useful; it is covered in section 3.9
  2. the data X is complete, but p(X|θ) is hard to optimize; introducing certain hidden variables U, whose values are missing, makes it easier to optimize the "complete" likelihood function p(X,U|θ)
     - this scenario is useful for mixture density estimation and is the subject of our lecture today
- Notice that after we introduce the artificial (hidden) variables U with missing values, case 2 is completely equivalent to case 1

Page 17:

EM: Hidden Variables for Mixture Density

$$p(x|\theta) = \sum_{j=1}^{m} p(x|c_j, \theta_j)\,\rho_j$$

- For simplicity, assume the component densities are univariate Gaussians

$$p(x|c_j, \theta_j) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{(x-\mu_j)^2}{2\sigma^2}\right]$$

  - assume for now that the variance σ² is known
  - need to estimate θ = {µ1,…, µm}
- If we knew which sample came from which component (that is, the class labels), ML parameter estimation would be easy
- Thus, to get an easier problem, introduce hidden variables which indicate which component each sample belongs to

Page 18:

EM: Hidden Variables for Mixture Density

- For 1 ≤ i ≤ n, 1 ≤ k ≤ m, define the hidden variables z_i^(k):

$$z_i^{(k)} = \begin{cases} 1 & \text{if sample } x_i \text{ was generated by component } k \\ 0 & \text{otherwise} \end{cases}$$

- The z_i^(k) are indicator random variables; they indicate which Gaussian component generated sample x_i
- Each sample is augmented with its indicators: x_i → {x_i, z_i^(1),…, z_i^(m)}
- Let z_i = {z_i^(1),…, z_i^(m)}, the indicator r.v.'s corresponding to sample x_i
- Conditioned on z_i, the distribution of x_i is Gaussian:

$$p(x_i|z_i, \theta) \sim N(\mu_k, \sigma^2), \quad \text{where } k \text{ is such that } z_i^{(k)} = 1$$

Page 19:

EM: Joint Likelihood

- Let z_i = {z_i^(1),…, z_i^(m)} and Z = {z_1,…, z_n}
- The complete likelihood is

$$p(X, Z|\theta) = p(x_1,\ldots,x_n, z_1,\ldots,z_n|\theta) = \prod_{i=1}^{n} p(x_i, z_i|\theta) = \prod_{i=1}^{n} p(x_i|z_i, \theta)\,p(z_i)$$

  - here p(x_i|z_i, θ) is the Gaussian part, and p(z_i) is the mixing parameter ρ_c of the component c that generated x_i
- The problem is, of course, that the values of Z are missing, since we made them up (that is, Z is hidden)
- If we actually observed Z, the log likelihood ln[p(X, Z|θ)] would be trivial to maximize with respect to θ and the ρ_i

Page 20:

EM Derivation

- Instead of maximizing ln[p(X, Z|θ)], the idea behind EM is to maximize some function of ln[p(X, Z|θ)], usually its expected value

$$E_Z\big[\ln p(X, Z|\theta)\big]$$

  - the expectation is with respect to the missing data Z, that is, with respect to the density p(Z|X, θ)
  - however, θ is our ultimate goal; we don't know θ!
- If θ makes ln[p(X, Z|θ)] large, then θ tends to make E_Z[ln p(X, Z|θ)] large

Page 21:

EM Algorithm

- The EM solution is to iterate:

1. start with initial parameters θ^(0)

iterate the following two steps until convergence:

E. compute the expectation of the log likelihood with respect to the current estimate θ^(t) and X; call it Q(θ|θ^(t))

$$Q\big(\theta|\theta^{(t)}\big) = E_Z\big[\ln p(X, Z|\theta)\,\big|\,X, \theta^{(t)}\big]$$

M. maximize Q(θ|θ^(t))

$$\theta^{(t+1)} = \arg\max_{\theta} Q\big(\theta|\theta^{(t)}\big)$$

Page 22:

EM Algorithm: Picture

[Figure: plot of ln p(X|θ) versus θ, marking the optimal value of θ; we'd like to find it, but optimizing p(X|θ) directly is very difficult]

Page 23:

EM Algorithm: Picture

[Figure: family of curves ln p(X, Z|θ) versus θ, one curve per value of Z. For mixture estimation there are m^n curves, each corresponding to a particular assignment of samples to classes. One curve corresponds to the correct (unobserved) Z for the observed data X; that is the one we should optimize, but Z is not observed]

Page 24:

EM Algorithm: Picture

[Figure: the expectation E_Z[ln p(X, Z|θ) | X, θ^(t)] as a function of θ, an average over the curves ln p(X, Z|θ) for the different values of Z]

Page 25:

EM Algorithm

- It can be proven that the EM algorithm converges to a local maximum of the log likelihood ln p(X|θ)
- Why is it better than gradient ascent?
  - Convergence of EM is usually significantly faster: in the beginning, very large steps are made (that is, the likelihood function increases rapidly), as opposed to gradient ascent, which usually takes tiny steps
  - gradient ascent is not guaranteed to converge
    - recall all the difficulties of choosing an appropriate learning rate

Page 26:

EM for Mixture of Gaussians: E step

- Let's come back to our example

$$p(x|\theta) = \sum_{j=1}^{m} p(x|c_j, \theta_j)\,\rho_j, \qquad p(x|c_j, \theta_j) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{(x-\mu_j)^2}{2\sigma^2}\right]$$

- need to estimate θ = {µ1,…, µm} and ρ1,…, ρm
- for 1 ≤ i ≤ n, 1 ≤ k ≤ m, define z_i^(k):

$$z_i^{(k)} = \begin{cases} 1 & \text{if sample } x_i \text{ was generated by component } k \\ 0 & \text{otherwise} \end{cases}$$

- as before, z_i = {z_i^(1),…, z_i^(m)} and Z = {z_1,…, z_n}
- We need the log likelihood of the observed X and hidden Z:

$$\ln p(X, Z|\theta) = \ln \prod_{i=1}^{n} p(x_i, z_i|\theta) = \sum_{i=1}^{n} \ln\big[p(x_i|z_i, \theta)\,P(z_i)\big]$$

Page 27:

EM for Mixture of Gaussians: E step

- We need the log likelihood of the observed X and hidden Z:

$$\ln p(X, Z|\theta) = \ln \prod_{i=1}^{n} p(x_i, z_i|\theta) = \sum_{i=1}^{n} \ln\big[p(x_i|z_i, \theta)\,P(z_i)\big]$$

- First, let's rewrite p(x_i|z_i, θ) P(z_i):

$$p(x_i|z_i,\theta)\,P(z_i) = \begin{cases} p\big(x_i|z_i^{(1)}=1,\theta\big)\,P\big(z_i^{(1)}=1\big) & \text{if } z_i^{(1)}=1 \\ \quad\vdots & \\ p\big(x_i|z_i^{(m)}=1,\theta\big)\,P\big(z_i^{(m)}=1\big) & \text{if } z_i^{(m)}=1 \end{cases}$$

$$= \prod_{k=1}^{m} \Big[p\big(x_i|z_i^{(k)}=1,\theta\big)\,P\big(z_i^{(k)}=1\big)\Big]^{z_i^{(k)}} = \prod_{k=1}^{m} \left[\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x_i-\mu_k)^2}{2\sigma^2}\right)P\big(z_i^{(k)}=1\big)\right]^{z_i^{(k)}}$$

Page 28:

EM for Mixture of Gaussians: E step

- The log likelihood of the observed X and hidden Z is

$$\ln p(X, Z|\theta) = \sum_{i=1}^{n} \ln\big[p(x_i|z_i, \theta)\,P(z_i)\big] = \sum_{i=1}^{n} \ln \prod_{k=1}^{m} \left[\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x_i-\mu_k)^2}{2\sigma^2}\right)P\big(z_i^{(k)}=1\big)\right]^{z_i^{(k)}}$$

$$= \sum_{i=1}^{n} \sum_{k=1}^{m} z_i^{(k)} \left[\ln\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{(x_i-\mu_k)^2}{2\sigma^2} + \ln P\big(z_i^{(k)}=1\big)\right] = \sum_{i=1}^{n} \sum_{k=1}^{m} z_i^{(k)} \left[\ln\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{(x_i-\mu_k)^2}{2\sigma^2} + \ln \rho_k\right]$$

  - here $P\big(z_i^{(k)}=1\big) = P(\text{sample } x_i \text{ is from class } k) = P(c_k) = \rho_k$

Page 29:

EM for Mixture of Gaussians: E step

- The log likelihood of the observed X and hidden Z is

$$\ln p(X, Z|\theta) = \sum_{i=1}^{n} \sum_{k=1}^{m} z_i^{(k)} \left[\ln\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{(x_i-\mu_k)^2}{2\sigma^2} + \ln \rho_k\right]$$

- For the E step, we must compute

$$Q\big(\theta|\theta^{(t)}\big) = Q\big(\mu_1,\ldots,\mu_m,\rho_1,\ldots,\rho_m \,\big|\, \mu_1^{(t)},\ldots,\mu_m^{(t)},\rho_1^{(t)},\ldots,\rho_m^{(t)}\big) = E_Z\big[\ln p(X, Z|\theta)\,\big|\,X, \theta^{(t)}\big]$$

$$= E_Z\left[\sum_{i=1}^{n} \sum_{k=1}^{m} z_i^{(k)} \left(\ln\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{(x_i-\mu_k)^2}{2\sigma^2} + \ln \rho_k\right)\right] = \sum_{i=1}^{n} \sum_{k=1}^{m} E_Z\big[z_i^{(k)}\big] \left(\ln\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{(x_i-\mu_k)^2}{2\sigma^2} + \ln \rho_k\right)$$

  - the last step uses linearity of expectation: $E_X\left[\sum_i a_i x_i + b\right] = \sum_i a_i E_X[x_i] + b$

Page 30:

EM for Mixture of Gaussians: E step

- We need to compute E_Z[z_i^(k)] in the expression

$$Q\big(\theta|\theta^{(t)}\big) = \sum_{i=1}^{n} \sum_{k=1}^{m} E_Z\big[z_i^{(k)}\big] \left(\ln\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{(x_i-\mu_k)^2}{2\sigma^2} + \ln \rho_k\right)$$

$$E_Z\big[z_i^{(k)}\big] = 0 \cdot P\big(z_i^{(k)}=0\,\big|\,x_i, \theta^{(t)}\big) + 1 \cdot P\big(z_i^{(k)}=1\,\big|\,x_i, \theta^{(t)}\big) = P\big(z_i^{(k)}=1\,\big|\,x_i, \theta^{(t)}\big)$$

$$= \frac{p\big(x_i|z_i^{(k)}=1, \theta^{(t)}\big)\,P\big(z_i^{(k)}=1\,\big|\,\theta^{(t)}\big)}{p\big(x_i|\theta^{(t)}\big)} = \frac{\rho_k^{(t)}\exp\left(-\frac{1}{2\sigma^2}\big(x_i-\mu_k^{(t)}\big)^2\right)}{\sum_{j=1}^{m} \rho_j^{(t)}\exp\left(-\frac{1}{2\sigma^2}\big(x_i-\mu_j^{(t)}\big)^2\right)}$$

- We are finally done with the E step
- For implementation, we just need to compute the E_Z[z_i^(k)]'s; we don't need to compute Q itself (see the sketch below)
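A minimal Python sketch of this E step for the univariate, known-variance case; the names are illustrative assumptions. The common factor 1/(√(2π)σ) cancels between numerator and denominator, so only the exponentials are needed.

```python
import numpy as np

def e_step(X, mus, rhos, sigma):
    """E_Z[z_i^(k)]: posterior probability that component k generated sample x_i,
    for 1-D Gaussians with a common known variance sigma^2."""
    X = np.asarray(X)[:, None]                                                      # shape (n, 1)
    w = np.asarray(rhos) * np.exp(-(X - np.asarray(mus)) ** 2 / (2 * sigma ** 2))   # shape (n, m)
    return w / w.sum(axis=1, keepdims=True)                                         # each row sums to 1
```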

Page 31:

EM for Mixture of Gaussians: M step

- Need to maximize Q with respect to all parameters

$$Q\big(\theta|\theta^{(t)}\big) = \sum_{i=1}^{n} \sum_{k=1}^{m} E_Z\big[z_i^{(k)}\big] \left(\ln\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{(x_i-\mu_k)^2}{2\sigma^2} + \ln \rho_k\right)$$

- First differentiate with respect to µ_k:

$$\frac{\partial}{\partial \mu_k} Q\big(\theta|\theta^{(t)}\big) = \sum_{i=1}^{n} E_Z\big[z_i^{(k)}\big]\,\frac{x_i-\mu_k}{\sigma^2} = 0 \quad\Longrightarrow\quad \mu_k^{(t+1)} = \frac{\sum_{i=1}^{n} E_Z\big[z_i^{(k)}\big]\,x_i}{\sum_{i=1}^{n} E_Z\big[z_i^{(k)}\big]}$$

  - the new mean for class k is a weighted average of all samples, and the weight of x_i is proportional to the current estimate of the probability that the sample belongs to class k

Page 32:

EM for Mixture of Gaussians: M step

$$Q\big(\theta|\theta^{(t)}\big) = \sum_{i=1}^{n} \sum_{k=1}^{m} E_Z\big[z_i^{(k)}\big] \left(\ln\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{(x_i-\mu_k)^2}{2\sigma^2} + \ln \rho_k\right)$$

- For the ρ_k we have to use Lagrange multipliers to preserve the constraint $\sum_{j=1}^{m} \rho_j = 1$
- Thus we need to differentiate

$$F(\lambda, \rho) = Q\big(\theta|\theta^{(t)}\big) - \lambda\left(\sum_{j=1}^{m} \rho_j - 1\right)$$

$$\frac{\partial}{\partial \rho_k} F(\lambda, \rho) = \sum_{i=1}^{n} E_Z\big[z_i^{(k)}\big]\,\frac{1}{\rho_k} - \lambda = 0 \quad\Longrightarrow\quad \sum_{i=1}^{n} E_Z\big[z_i^{(k)}\big] - \lambda\rho_k = 0$$

- Summing up over all components:

$$\sum_{k=1}^{m} \sum_{i=1}^{n} E_Z\big[z_i^{(k)}\big] = \lambda \sum_{k=1}^{m} \rho_k$$

- Since $\sum_{k=1}^{m} \sum_{i=1}^{n} E_Z\big[z_i^{(k)}\big] = n$ and $\sum_{k=1}^{m} \rho_k = 1$, we get $\lambda = n$, and therefore

$$\rho_k^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} E_Z\big[z_i^{(k)}\big]$$

Page 33:

EM Algorithm

The algorithm on this slide applies ONLY to the univariate Gaussian case with a known, common variance σ².

1. Randomly initialize µ1,…, µm, ρ1,…, ρm (with the constraint Σ ρi = 1)

iterate the following two steps until there is no change in µ1,…, µm, ρ1,…, ρm (a compact sketch of this loop follows the slide):

E. for all i, k, compute

$$E_Z\big[z_i^{(k)}\big] = \frac{\rho_k \exp\left(-\frac{1}{2\sigma^2}(x_i-\mu_k)^2\right)}{\sum_{j=1}^{m} \rho_j \exp\left(-\frac{1}{2\sigma^2}(x_i-\mu_j)^2\right)}$$

M. for all k, do the parameter update

$$\mu_k = \frac{\sum_{i=1}^{n} E_Z\big[z_i^{(k)}\big]\,x_i}{\sum_{i=1}^{n} E_Z\big[z_i^{(k)}\big]}, \qquad \rho_k = \frac{1}{n}\sum_{i=1}^{n} E_Z\big[z_i^{(k)}\big]$$
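A compact Python sketch of the whole loop on this slide (univariate Gaussians, common known variance); the initialization strategy, iteration count, and names are assumptions, not part of the lecture.

```python
import numpy as np

def em_known_variance(X, m, sigma, n_iter=100, seed=0):
    """EM for a 1-D mixture of m Gaussians with common known variance sigma^2.
    Returns the means mu_1..mu_m and mixing parameters rho_1..rho_m."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    mus = rng.choice(X, size=m, replace=False)          # random initialization of the means
    rhos = np.full(m, 1.0 / m)                          # mixing parameters, sum to 1
    for _ in range(n_iter):
        # E step: responsibilities E_Z[z_i^(k)]
        w = rhos * np.exp(-(X[:, None] - mus) ** 2 / (2 * sigma ** 2))
        w /= w.sum(axis=1, keepdims=True)
        # M step: weighted means and updated mixing parameters
        mus = (w * X[:, None]).sum(axis=0) / w.sum(axis=0)
        rhos = w.mean(axis=0)                           # (1/n) * sum_i E_Z[z_i^(k)]
    return mus, rhos
```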

Page 34:

EM Algorithm

- For the more general case of multivariate Gaussians with unknown means and covariances:
- E step:

$$E_Z\big[z_i^{(k)}\big] = \frac{\rho_k\, p(x_i|\mu_k, \Sigma_k)}{\sum_{j=1}^{m} \rho_j\, p(x_i|\mu_j, \Sigma_j)}, \quad \text{where } p(x|\mu_k, \Sigma_k) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_k|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu_k)^T \Sigma_k^{-1}(x-\mu_k)\right)$$

- M step (a sketch of these updates follows):

$$\mu_k = \frac{\sum_{i=1}^{n} E_Z\big[z_i^{(k)}\big]\,x_i}{\sum_{i=1}^{n} E_Z\big[z_i^{(k)}\big]}, \qquad \Sigma_k = \frac{\sum_{i=1}^{n} E_Z\big[z_i^{(k)}\big]\,(x_i-\mu_k)(x_i-\mu_k)^T}{\sum_{i=1}^{n} E_Z\big[z_i^{(k)}\big]}, \qquad \rho_k = \frac{1}{n}\sum_{i=1}^{n} E_Z\big[z_i^{(k)}\big]$$
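A sketch of the multivariate M step given the responsibilities from the E step; the array names are assumptions, and this is one way to implement the updates above rather than the lecture's own code.

```python
import numpy as np

def m_step_multivariate(X, w):
    """M step for d-dimensional Gaussians with unknown means and covariances.
    X: (n, d) data matrix; w: (n, m) responsibilities E_Z[z_i^(k)]."""
    nk = w.sum(axis=0)                                     # effective number of samples per component
    mus = (w.T @ X) / nk[:, None]                          # mu_k = sum_i w_ik x_i / sum_i w_ik
    covs = []
    for k in range(w.shape[1]):
        d = X - mus[k]                                     # deviations from mu_k, shape (n, d)
        covs.append((w[:, k, None] * d).T @ d / nk[k])     # Sigma_k = sum_i w_ik d_i d_i^T / sum_i w_ik
    rhos = nk / X.shape[0]                                 # rho_k = (1/n) sum_i w_ik
    return mus, np.array(covs), rhos
```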

Page 35:

EM Algorithm and K-means

- k-means can be derived from the EM algorithm
- The E step computes

$$E_Z\big[z_i^{(k)}\big] = \frac{\rho_k \exp\left(-\frac{1}{2\sigma^2}\|x_i-\mu_k\|^2\right)}{\sum_{j=1}^{m} \rho_j \exp\left(-\frac{1}{2\sigma^2}\|x_i-\mu_j\|^2\right)}$$

- Setting the mixing parameters equal for all classes,

$$E_Z\big[z_i^{(k)}\big] = \frac{\exp\left(-\frac{1}{2\sigma^2}\|x_i-\mu_k\|^2\right)}{\sum_{j=1}^{m} \exp\left(-\frac{1}{2\sigma^2}\|x_i-\mu_j\|^2\right)}$$

- If we let σ → 0, then

$$E_Z\big[z_i^{(k)}\big] \to \begin{cases} 1 & \text{if } \|x_i-\mu_k\| < \|x_i-\mu_j\| \;\;\forall j \ne k \\ 0 & \text{otherwise} \end{cases}$$

- so at the E step, for each current mean, we find all points closest to it and form new clusters
- at the M step, we compute the new means inside the current clusters (see the sketch below):

$$\mu_k = \frac{\sum_{i=1}^{n} E_Z\big[z_i^{(k)}\big]\,x_i}{\sum_{i=1}^{n} E_Z\big[z_i^{(k)}\big]}$$
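A tiny Python sketch of one such iteration with hard 0/1 responsibilities, i.e. one k-means step in 1-D; the names are illustrative assumptions.

```python
import numpy as np

def kmeans_step(X, mus):
    """One k-means iteration: hard E step (closest mean wins), then M step (cluster means)."""
    X, mus = np.asarray(X, dtype=float), np.asarray(mus, dtype=float)
    labels = np.argmin(np.abs(X[:, None] - mus), axis=1)    # hard assignment of each sample
    new_mus = np.array([X[labels == k].mean() if np.any(labels == k) else mus[k]
                        for k in range(len(mus))])           # new mean inside each cluster
    return new_mus, labels
```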

Page 36:

EM Gaussian Mixture Example

Page 37:

After first iteration

EM Gaussian Mixture Example

Page 38:

After second iteration

EM Gaussian Mixture Example

Page 39:

After third iteration

EM Gaussian Mixture Example

Page 40:

After 20th iteration

EM Gaussian Mixture Example

Page 41:

EM Example

- Example from R. Gutierrez-Osuna
- Training set of 900 examples forming an annulus
- A mixture model with m = 30 Gaussian components of unknown mean and variance is used
- Training:
  - Initialization:
    - means set to 30 random examples
    - covariance matrices initialized to be diagonal, with large variances on the diagonal (compared to the training data variance)
  - During EM training, components with small mixing coefficients were trimmed
    - This is a trick to get a more compact model, with fewer than 30 Gaussian components

Page 42:

EM Example

from R. Gutierrez-Osuna

Page 43:

Figure from “Color and Texture Based Image Segmentation Using EM and Its Application to Content Based Image Retrieval”, S.J. Belongie et al., ICCV 1998

EM Texture Segmentation Example

Page 44:

Three frames from the MPEG “flower garden” sequence

Figure from “Representing Images with Layers,” by J. Wang and E.H. Adelson, IEEE Transactions on Image Processing, 1994, © 1994 IEEE

EM Motion Segmentation Example

Page 45:

Summary

- Advantages
  - If the assumed data distribution is correct, the algorithm works well
- Disadvantages
  - If the assumed data distribution is wrong, the results can be quite bad
  - In particular, results are bad if the wrong number of classes (i.e. the number of mixture components) is used