Variational Inference for Gaussian Mixture Models
Sargur Srihari (CSE574, Chapter 10.3)
Objective
• Apply variational inference machinery to Gaussian Mixture Models
• Demonstrate how the Bayesian treatment elegantly resolves the difficulties encountered with maximum likelihood
• Many more complex distributions can be handled by straightforward extensions of this analysis
Graphical Model for GMM
• Graphical model corresponding to the likelihood function of the standard GMM:
• For each observation xn we have a corresponding latent variable zn
  – a 1-of-K binary vector with elements znk for k=1,..,K
• Denote the observed data by X={x1,..,xN} and the latent variables by Z={z1,..,zN}
[Figure: plate notation and the equivalent expanded network; a directed acyclic graph representing the mixture]
Likelihood Function for GMM
Since z takes the values {zk} with probabilities πk, the mixture density function is

  p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)

Therefore the likelihood function (a product over the N i.i.d. samples) is

  p(X \mid \pi, \mu, \Sigma) = \prod_{n=1}^{N} \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}

and the log-likelihood function is

  \ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}

Finding the parameters π, µ and Σ that maximize the log-likelihood is a more difficult problem than for a single Gaussian.
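For concreteness, a small NumPy/SciPy sketch (the function and variable names are illustrative, not from the slides) that evaluates this log-likelihood using the log-sum-exp trick for numerical stability:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pi, mus, Sigmas):
    """ln p(X | pi, mu, Sigma) = sum_n ln sum_k pi_k N(x_n | mu_k, Sigma_k)."""
    K = len(pi)
    # log pi_k + log N(x_n | mu_k, Sigma_k), shape (N, K)
    log_terms = np.column_stack([
        np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
        for k in range(K)
    ])
    # log-sum-exp over k, then sum over the N samples
    m = log_terms.max(axis=1, keepdims=True)
    return float(np.sum(m.squeeze() + np.log(np.exp(log_terms - m).sum(axis=1))))
```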
GMM m.l.e. expressions
• Obtained using derivatives of the log-likelihood
• These are not closed-form solutions for the parameters
  – since the responsibilities depend on those parameters in a complex way

  Parameters (means):                \mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n
  Parameters (covariance matrices):  \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, (x_n - \mu_k)(x_n - \mu_k)^T
  Parameters (mixing coefficients):  \pi_k = \frac{N_k}{N}, \quad \text{where} \quad N_k = \sum_{n=1}^{N} \gamma(z_{nk})

• All three are expressed in terms of the responsibilities \gamma(z_{nk})
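A minimal NumPy sketch (illustrative; names are ours, not from the slides) of these re-estimation equations, given a matrix of responsibilities:

```python
import numpy as np

def m_step_mle(X, gamma):
    """Re-estimate pi_k, mu_k, Sigma_k from responsibilities gamma, shape (N, K)."""
    N, D = X.shape
    Nk = gamma.sum(axis=0)                      # N_k = sum_n gamma(z_nk)
    pi = Nk / N                                 # mixing coefficients
    mus = (gamma.T @ X) / Nk[:, None]           # mu_k = (1/N_k) sum_n gamma(z_nk) x_n
    Sigmas = []
    for k in range(gamma.shape[1]):
        diff = X - mus[k]                       # (N, D)
        Sigmas.append((gamma[:, k, None] * diff).T @ diff / Nk[k])
    return pi, mus, np.array(Sigmas)
```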
EM for GMM
• E step
  – Use the current values of the parameters µk, Σk, πk to evaluate the posterior probabilities p(Z|X), i.e., the responsibilities γ(znk)
• M step
  – Use these posterior probabilities to re-estimate the means, covariances and mixing coefficients, by maximizing the expectation of ln p(X,Z) with respect to p(Z|X)
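A matching sketch of the E step (again illustrative, assuming NumPy/SciPy), computing the responsibilities in the log domain to avoid underflow:

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step_mle(X, pi, mus, Sigmas):
    """gamma(z_nk) = pi_k N(x_n|mu_k,Sigma_k) / sum_j pi_j N(x_n|mu_j,Sigma_j)."""
    K = len(pi)
    log_r = np.column_stack([
        np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
        for k in range(K)
    ])
    log_r -= log_r.max(axis=1, keepdims=True)   # log-sum-exp trick
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)
```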
Graphical model for Bayesian GMM
• To specify the model we need these conditional probabilities:
  1. p(Z|π): conditional distribution of Z given the mixing coefficients
  2. p(X|Z, µ, Λ): conditional distribution of the data given the latent variables and component parameters
  3. p(π): distribution of the mixing coefficients
  4. p(µ,Λ): prior governing the mean and precision of each component
[Figure: graphical models for the GMM and the Bayesian GMM, with nodes for the mixing coefficients, means and precisions]
Conditional Distribution Expressions
1. Conditional distribution of Z={z1,..,zN} given the mixing coefficients π
   – Since the components are mutually exclusive, for a single latent variable
     p(z) = \prod_{k=1}^{K} \pi_k^{z_k}
   – and over the whole data set
     p(Z \mid \pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}}
2. Conditional distribution of the observed data X={x1,..,xN} given the latent variables and component parameters
   – Since the components are Gaussian, for a single observation
     p(x \mid z) = \prod_{k=1}^{K} \mathcal{N}(x \mid \mu_k, \Sigma_k)^{z_k}
   – and over the whole data set
     p(X \mid Z, \mu, \Lambda) = \prod_{n=1}^{N} \prod_{k=1}^{K} \mathcal{N}(x_n \mid \mu_k, \Lambda_k^{-1})^{z_{nk}}
   – where µ={µk} and Λ={Λk}
   • Use of the precision matrix simplifies the further analysis
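A short sketch (illustrative names, assuming NumPy/SciPy) that evaluates ln p(Z|π) + ln p(X|Z,µ,Λ) for a given one-hot assignment matrix Z, with components parameterized by precision matrices as above:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_complete_data_likelihood(X, Z, pi, mus, Lambdas):
    """ln p(Z|pi) + ln p(X|Z,mu,Lambda) for one-hot assignments Z of shape (N, K)."""
    K = Z.shape[1]
    log_pZ = float(np.sum(Z * np.log(pi)))      # sum_n sum_k z_nk ln pi_k
    log_pX = sum(
        float(np.sum(Z[:, k] * multivariate_normal.logpdf(
            X, mus[k], np.linalg.inv(Lambdas[k]))))   # covariance = inverse precision
        for k in range(K)
    )
    return log_pZ + log_pX
```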
Parameter Priors: Mixing Coefficients
3. Distribution of the mixing coefficients p(π)
   • Conjugate priors simplify the analysis
   • Dirichlet distribution over π:
     p(\pi) = \mathrm{Dir}(\pi \mid \alpha_0) = C(\alpha_0) \prod_{k=1}^{K} \pi_k^{\alpha_0 - 1}
   – We have chosen the same parameter α0 for each of the components
   – C(α0) is the normalization constant of the Dirichlet distribution
Parameter Priors: Mean, Precision
4. Distribution of the mean and precision of the Gaussian components
   – The Gaussian-Wishart prior is
     p(\mu, \Lambda) = p(\mu \mid \Lambda) \, p(\Lambda) = \prod_{k=1}^{K} \mathcal{N}\!\left(\mu_k \mid m_0, (\beta_0 \Lambda_k)^{-1}\right) \mathcal{W}(\Lambda_k \mid W_0, \nu_0)
   – which is the conjugate prior when both the mean and the precision are unknown
   • The resulting model has a link between Λ and µ, due to the distribution p(µ,Λ) above
Bayesian Network for Bayesian GMM
• Joint distribution of all the random variables:
  p(X, Z, \pi, \mu, \Lambda) = p(X \mid Z, \mu, \Lambda) \, p(Z \mid \pi) \, p(\pi) \, p(\mu \mid \Lambda) \, p(\Lambda)
  – All the factors were given earlier
  – Only X={x1,..,xN} are observed
• This BN provides a nice distinction between latent variables and parameters
  – Variables such as zn that appear inside the plate are latent variables
    • The number of such variables grows with the size of the data set
  – Variables outside the plate are parameters
    • Fixed in number, independent of the size of the data set
  – From the viewpoint of PGMs there is no fundamental difference
[Figure: Bayesian network with a plate over n; nodes for the means, precisions and mixing coefficients]
The variational approach
• Recall the GMM:
  p(x) = \sum_{z} p(z) \, p(x \mid z) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)
  – Here p(z) has parameter π, which itself has a distribution p(π)
• The EM approach:
  1. Evaluation of the posterior distribution p(Z|X)
  2. Evaluation of the expectation of ln p(X,Z) with respect to p(Z|X)
• Our goal is to specify the variational distribution q(Z,π,µ,Λ), which will approximate p(Z,π,µ,Λ|X)
  – Recall
    \ln p(X) = \mathcal{L}(q) + \mathrm{KL}(q \parallel p)
    where
    \mathcal{L}(q) = \int q(Z) \ln\left\{\frac{p(X,Z)}{q(Z)}\right\} dZ
    and
    \mathrm{KL}(q \parallel p) = -\int q(Z) \ln\left\{\frac{p(Z \mid X)}{q(Z)}\right\} dZ
Variational Distribution
• In variational inference we specify q using a factorized distribution
  q(Z) = \prod_{i=1}^{M} q_i(Z_i)
• For the Bayesian GMM the latent variables and parameters are Z, π, µ and Λ
• So we consider the variational distribution
  q(Z, \pi, \mu, \Lambda) = q(Z) \, q(\pi, \mu, \Lambda)
  (subscripts on the q's are omitted)
  – Remarkably, this is the only assumption needed for a tractable solution to a Bayesian mixture model
• The functional forms of both q(Z) and q(π,µ,Λ) are determined automatically by optimizing the variational distribution
Sequential update equations
• Using the general result for factorized distributions:
  – When L(q) is defined as
    \mathcal{L}(q) = \int q(Z) \ln\left\{\frac{p(X,Z)}{q(Z)}\right\} dZ = \int \prod_i q_i \left\{\ln p(X,Z) - \sum_i \ln q_i\right\} dZ
  – the factor qj that makes the functional L(q) largest is
    \ln q_j^*(Z_j) = \mathbb{E}_{i \neq j}\left[\ln p(X,Z)\right] + \text{const}
• For the Bayesian GMM the log of the optimized factor is
  \ln q^*(Z) = \mathbb{E}_{\pi,\mu,\Lambda}\left[\ln p(X, Z, \pi, \mu, \Lambda)\right] + \text{const}
• Since p(X,Z,π,µ,Λ) = p(X|Z,µ,Λ) p(Z|π) p(π) p(µ|Λ) p(Λ), we have
  \ln q^*(Z) = \mathbb{E}_{\pi}\left[\ln p(Z \mid \pi)\right] + \mathbb{E}_{\mu,\Lambda}\left[\ln p(X \mid Z, \mu, \Lambda)\right] + \text{const}
  – Note: the expectations are just weighted sums
Simplification of q*(Z)
• Expression for the factor q*(Z):
  \ln q^*(Z) = \mathbb{E}_{\pi}\left[\ln p(Z \mid \pi)\right] + \mathbb{E}_{\mu,\Lambda}\left[\ln p(X \mid Z, \mu, \Lambda)\right] + \text{const}
• Absorbing terms not depending on Z into the constant:
  \ln q^*(Z) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \ln \rho_{nk} + \text{const}
  where
  \ln \rho_{nk} = \mathbb{E}\left[\ln \pi_k\right] + \tfrac{1}{2}\mathbb{E}\left[\ln |\Lambda_k|\right] - \tfrac{D}{2}\ln(2\pi) - \tfrac{1}{2}\mathbb{E}_{\mu_k,\Lambda_k}\left[(x_n - \mu_k)^T \Lambda_k (x_n - \mu_k)\right]
  and D is the dimensionality of the data variable x
• Taking exponentials of both sides:
  q^*(Z) \propto \prod_{n=1}^{N} \prod_{k=1}^{K} \rho_{nk}^{z_{nk}}
• The normalized distribution is
  q^*(Z) = \prod_{n=1}^{N} \prod_{k=1}^{K} r_{nk}^{z_{nk}}
  where
  r_{nk} = \frac{\rho_{nk}}{\sum_{j=1}^{K} \rho_{nj}}
• The rnk are positive since the ρnk are exponentials of real numbers, and they sum to one as required
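In code, this normalization is typically carried out in the log domain; a minimal sketch (illustrative, assuming NumPy):

```python
import numpy as np

def normalize_responsibilities(log_rho):
    """r_nk = rho_nk / sum_j rho_nj, computed from ln rho_nk with the
    log-sum-exp trick so the exponentials cannot overflow."""
    log_rho = log_rho - log_rho.max(axis=1, keepdims=True)
    rho = np.exp(log_rho)
    return rho / rho.sum(axis=1, keepdims=True)
```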
Factor q*(Z) has the same form as the prior
• The normalized distribution
  q^*(Z) = \prod_{n=1}^{N} \prod_{k=1}^{K} r_{nk}^{z_{nk}}
  has the same form as the prior
  p(Z \mid \pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}}
• We have found the form of q* that maximizes the functional L(q)
• The distribution q*(Z) is discrete, with the standard result E[znk] = rnk
  – so the rnk play the role of responsibilities
• Since the equations for q*(Z) depend on moments of the other variables, they are coupled and must be solved iteratively
Variational EM
• Variational E-step: determine the responsibilities rnk
• Variational M-step:
  1. Determine the statistics of the data set
     N_k = \sum_{n=1}^{N} r_{nk}    (effective number of points assigned to component k)
     \bar{x}_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} \, x_n    (mean of the kth component)
     S_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} \, (x_n - \bar{x}_k)(x_n - \bar{x}_k)^T    (covariance matrix of the kth component)
  2. Find the optimal solution for the factor q(π,µ,Λ)
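A sketch of step 1 in NumPy (illustrative names; the small constant added to Nk is a numerical safeguard against empty components, not part of the slides):

```python
import numpy as np

def vb_statistics(X, r):
    """N_k, xbar_k and S_k from responsibilities r of shape (N, K)."""
    N, D = X.shape
    K = r.shape[1]
    Nk = r.sum(axis=0) + 1e-10                  # guard against empty components
    xbar = (r.T @ X) / Nk[:, None]              # (K, D)
    S = np.zeros((K, D, D))
    for k in range(K):
        diff = X - xbar[k]
        S[k] = (r[:, k, None] * diff).T @ diff / Nk[k]
    return Nk, xbar, S
```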
Factorization of q(π,µ,Λ)
• Using the general result for factorized distributions
  \ln q_j^*(Z_j) = \mathbb{E}_{i \neq j}\left[\ln p(X,Z)\right] + \text{const}
  we can write
  \ln q^*(\pi,\mu,\Lambda) = \ln p(\pi) + \sum_{k=1}^{K} \ln p(\mu_k, \Lambda_k) + \mathbb{E}_Z\left[\ln p(Z \mid \pi)\right] + \sum_{n=1}^{N} \sum_{k=1}^{K} \mathbb{E}\left[z_{nk}\right] \ln \mathcal{N}(x_n \mid \mu_k, \Lambda_k^{-1}) + \text{const}
• This decomposes into terms involving only π and terms involving only µ and Λ
  – The terms involving µ and Λ comprise a sum of terms involving µk and Λk, leading to the factorization
    q(\pi,\mu,\Lambda) = q(\pi) \prod_{k=1}^{K} q(\mu_k, \Lambda_k)
Factor q(π) is a Dirichlet
• Given the factorization
  q(\pi,\mu,\Lambda) = q(\pi) \prod_{k=1}^{K} q(\mu_k, \Lambda_k)
  consider each factor in turn: q(π) and q(µk,Λk)
• (2a) Identifying the terms depending on π, q(π) has the solution
  \ln q^*(\pi) = (\alpha_0 - 1) \sum_{k=1}^{K} \ln \pi_k + \sum_{k=1}^{K} \sum_{n=1}^{N} r_{nk} \ln \pi_k + \text{const}
• Taking the exponential of both sides, we recognize q*(π) as a Dirichlet distribution
  q^*(\pi) = \mathrm{Dir}(\pi \mid \alpha)
  where α has components αk = α0 + Nk
• Recall the Dirichlet distribution (written with α̂ to avoid clashing with the prior parameter α0):
  \mathrm{Dir}(\mu \mid \alpha) = \frac{\Gamma(\hat{\alpha})}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)} \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}, \quad \hat{\alpha} = \sum_{k=1}^{K} \alpha_k
[Figure: Dirichlet density for K=3, αk=0.1]
Factor q*(µk,Λk) is a Gaussian-Wishart
• (2b) Variational posterior for q*(µk,Λk)
  – It does not factorize further into marginals over µk and Λk
  – It is a Gaussian-Wishart distribution:
    q^*(\mu_k, \Lambda_k) = \mathcal{N}\!\left(\mu_k \mid m_k, (\beta_k \Lambda_k)^{-1}\right) \mathcal{W}(\Lambda_k \mid W_k, \nu_k)
  – W denotes the Wishart distribution, which has the form
    \mathcal{W}(\Lambda \mid W, \nu) = B(W, \nu)\, |\Lambda|^{(\nu - D - 1)/2} \exp\!\left[-\tfrac{1}{2}\mathrm{Tr}(W^{-1}\Lambda)\right]
    where ν is the number of degrees of freedom, W is a D × D scale matrix, Tr is the trace, and B(W,ν) is a normalization constant
  • The Wishart is the conjugate prior for a Gaussian with known mean and unknown precision matrix Λ
Parameters of q*(µk,Λk)
• Gaussian-Wishart:
  q^*(\mu_k, \Lambda_k) = \mathcal{N}\!\left(\mu_k \mid m_k, (\beta_k \Lambda_k)^{-1}\right) \mathcal{W}(\Lambda_k \mid W_k, \nu_k)
  where we have defined
  \beta_k = \beta_0 + N_k
  m_k = \frac{1}{\beta_k}\left(\beta_0 m_0 + N_k \bar{x}_k\right)
  W_k^{-1} = W_0^{-1} + N_k S_k + \frac{\beta_0 N_k}{\beta_0 + N_k}\left(\bar{x}_k - m_0\right)\left(\bar{x}_k - m_0\right)^T
  \nu_k = \nu_0 + N_k + 1
• These update equations are analogous to the M-step of EM for the maximum likelihood solution of the GMM
  – They involve evaluation of the same sums over the data set as in EM
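A sketch of these hyperparameter updates in NumPy (illustrative names; it also includes the Dirichlet update αk = α0 + Nk from the earlier slide):

```python
import numpy as np

def vb_m_step(Nk, xbar, S, alpha0, beta0, m0, W0, nu0):
    """Update the Dirichlet and Gaussian-Wishart hyperparameters from N_k, xbar_k, S_k."""
    K, D = xbar.shape
    alpha = alpha0 + Nk                              # Dirichlet: alpha_k = alpha_0 + N_k
    beta = beta0 + Nk                                # beta_k = beta_0 + N_k
    m = (beta0 * m0 + Nk[:, None] * xbar) / beta[:, None]
    W = np.zeros((K, D, D))
    for k in range(K):
        diff = (xbar[k] - m0)[:, None]               # (D, 1)
        Winv = (np.linalg.inv(W0) + Nk[k] * S[k]
                + (beta0 * Nk[k] / (beta0 + Nk[k])) * diff @ diff.T)
        W[k] = np.linalg.inv(Winv)
    nu = nu0 + Nk + 1                                # degrees of freedom, as given above
    return alpha, beta, m, W, nu
```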
Expression for Responsibilities
• For the M-step we need the expectations E[znk] = rnk
  – which are obtained by normalizing the ρnk:
    r_{nk} = \frac{\rho_{nk}}{\sum_{j=1}^{K} \rho_{nj}}
• Since
  \ln \rho_{nk} = \mathbb{E}\left[\ln \pi_k\right] + \tfrac{1}{2}\mathbb{E}\left[\ln |\Lambda_k|\right] - \tfrac{D}{2}\ln(2\pi) - \tfrac{1}{2}\mathbb{E}_{\mu_k,\Lambda_k}\left[(x_n - \mu_k)^T \Lambda_k (x_n - \mu_k)\right]
  the three expectations with respect to the variational distribution of the parameters are easily evaluated to give
  \ln \tilde{\pi}_k \equiv \mathbb{E}\left[\ln \pi_k\right] = \psi(\alpha_k) - \psi(\hat{\alpha}), \quad \hat{\alpha} = \sum_k \alpha_k
  \ln \tilde{\Lambda}_k \equiv \mathbb{E}\left[\ln |\Lambda_k|\right] = \sum_{i=1}^{D} \psi\!\left(\frac{\nu_k + 1 - i}{2}\right) + D \ln 2 + \ln |W_k|
  \mathbb{E}_{\mu_k,\Lambda_k}\left[(x_n - \mu_k)^T \Lambda_k (x_n - \mu_k)\right] = \frac{D}{\beta_k} + \nu_k (x_n - m_k)^T W_k (x_n - m_k)
• ψ is the digamma function, \psi(a) = \frac{d}{da}\ln\Gamma(a)
  – The digamma function arises when taking expectations of ln πk under the Dirichlet distribution
• νk is the number of degrees of freedom of the Wishart distribution
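A sketch of these three expectations using SciPy's digamma (illustrative names, not from the slides):

```python
import numpy as np
from scipy.special import digamma

def expected_log_pi_and_lambda(alpha, W, nu):
    """E[ln pi_k] and E[ln |Lambda_k|] under the variational posterior."""
    D = W.shape[-1]
    ln_pi_tilde = digamma(alpha) - digamma(alpha.sum())
    ln_lambda_tilde = np.array([
        digamma(0.5 * (nu[k] + 1 - np.arange(1, D + 1))).sum()
        + D * np.log(2.0) + np.linalg.slogdet(W[k])[1]
        for k in range(len(nu))
    ])
    return ln_pi_tilde, ln_lambda_tilde

def expected_mahalanobis(X, beta, m, W, nu):
    """E_{mu_k,Lambda_k}[(x_n - mu_k)^T Lambda_k (x_n - mu_k)], shape (N, K)."""
    N, D = X.shape
    K = len(beta)
    E = np.zeros((N, K))
    for k in range(K):
        diff = X - m[k]
        E[:, k] = D / beta[k] + nu[k] * np.einsum('nd,de,ne->n', diff, W[k], diff)
    return E
```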
Evaluation of Responsibilities
• Substituting the three expectations into ln ρnk and normalizing gives
  r_{nk} \propto \tilde{\pi}_k \, \tilde{\Lambda}_k^{1/2} \exp\!\left\{-\frac{D}{2\beta_k} - \frac{\nu_k}{2}\left(x_n - m_k\right)^T W_k \left(x_n - m_k\right)\right\}
• This is similar to the responsibilities of maximum likelihood EM,
  \gamma(z_k) \equiv p(z_k = 1 \mid x) = \frac{p(z_k = 1)\, p(x \mid z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1)\, p(x \mid z_j = 1)} = \frac{\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)}
  which can be written in the form
  r_{nk} \propto \pi_k \, |\Lambda_k|^{1/2} \exp\!\left\{-\frac{1}{2}\left(x_n - \mu_k\right)^T \Lambda_k \left(x_n - \mu_k\right)\right\}
• Here the precision Λk is used instead of the covariance Σk to highlight the similarity
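Putting the pieces together, a sketch of the variational E-step that forms ln ρnk from the three expectations and normalizes it into rnk (illustrative, assuming NumPy and the helpers above):

```python
import numpy as np

def vb_e_step(X, ln_pi_tilde, ln_lambda_tilde, E_maha):
    """ln rho_nk = E[ln pi_k] + 1/2 E[ln|Lambda_k|] - D/2 ln(2 pi)
                   - 1/2 E[(x_n - mu_k)^T Lambda_k (x_n - mu_k)],
    then r_nk = rho_nk / sum_j rho_nj via log-sum-exp normalization."""
    D = X.shape[1]
    log_rho = (ln_pi_tilde + 0.5 * ln_lambda_tilde
               - 0.5 * D * np.log(2 * np.pi) - 0.5 * E_maha)
    log_rho -= log_rho.max(axis=1, keepdims=True)
    r = np.exp(log_rho)
    return r / r.sum(axis=1, keepdims=True)
```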
Summary of Optimization
• Optimization of the variational posterior distribution involves cycling between two stages, analogous to the E and M steps of maximum likelihood EM
• Variational E-step:
  – Use the current distribution over the model parameters to evaluate the moments and hence evaluate E[znk] = rnk
• Variational M-step:
  – Keep the responsibilities fixed and use them to recompute the variational distribution over the parameters using
    q^*(\pi) = \mathrm{Dir}(\pi \mid \alpha)
    and
    q^*(\mu_k, \Lambda_k) = \mathcal{N}\!\left(\mu_k \mid m_k, (\beta_k \Lambda_k)^{-1}\right) \mathcal{W}(\Lambda_k \mid W_k, \nu_k)
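An illustrative driver loop that cycles these two stages, reusing the helper functions sketched on the earlier slides; the initialization choices (m0 = data mean, W0 = I, ν0 = D, random initial responsibilities) are assumptions made for this example, not prescriptions from the slides:

```python
import numpy as np

def variational_gmm(X, K, alpha0=1e-3, beta0=1.0, nu0=None, n_iter=100, tol=1e-6):
    N, D = X.shape
    rng = np.random.default_rng(0)
    m0, W0 = X.mean(axis=0), np.eye(D)
    nu0 = D if nu0 is None else nu0
    r = rng.dirichlet(np.ones(K), size=N)            # random initial responsibilities
    for _ in range(n_iter):
        Nk, xbar, S = vb_statistics(X, r)            # M-step, part 1: data statistics
        alpha, beta, m, W, nu = vb_m_step(Nk, xbar, S, alpha0, beta0, m0, W0, nu0)
        ln_pi, ln_lam = expected_log_pi_and_lambda(alpha, W, nu)
        r_new = vb_e_step(X, ln_pi, ln_lam,          # E-step: recompute r_nk
                          expected_mahalanobis(X, beta, m, W, nu))
        if np.abs(r_new - r).max() < tol:
            r = r_new
            break
        r = r_new
    # alpha / alpha.sum() gives the expected mixing coefficients after convergence
    return r, alpha, beta, m, W, nu
```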
Variational Bayesian GMM
[Figure: Old Faithful data set fitted with K=6 components; after convergence only two components remain, and the density of red ink inside each ellipse shows the mean value of its mixing coefficient]
Similarity of Variational Bayes and EM
• There is a close similarity between the variational solution for the Bayesian mixture of Gaussians and the EM algorithm for maximum likelihood
• In the limit as N → ∞, the Bayesian treatment converges to the maximum likelihood EM solution
• The variational algorithm is computationally more expensive, but the singularity problems of maximum likelihood are eliminated
Variational Lower Bound
• We can straightforwardly evaluate the lower bound L(q) for this model
• Recall
  \ln p(X) = \mathcal{L}(q) + \mathrm{KL}(q \parallel p)
  where
  \mathcal{L}(q) = \int q(Z) \ln\left\{\frac{p(X,Z)}{q(Z)}\right\} dZ
  and
  \mathrm{KL}(q \parallel p) = -\int q(Z) \ln\left\{\frac{p(Z \mid X)}{q(Z)}\right\} dZ
• The lower bound is used to monitor the re-estimation and to test for convergence
Predictive Density
• In using a Bayesian GMM we are interested in the predictive density for a new value x̂ of the observed variable
• Introducing the corresponding latent variable ẑ, we can show that
  p(\hat{x} \mid X) = \frac{1}{\hat{\alpha}} \sum_{k=1}^{K} \alpha_k \, \mathrm{St}\!\left(\hat{x} \mid m_k, L_k, \nu_k + 1 - D\right)
  where the kth component has mean mk and precision
  L_k = \frac{(\nu_k + 1 - D)\,\beta_k}{1 + \beta_k} \, W_k
• This mixture of Student's t distributions becomes a mixture of Gaussians as N → ∞
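A sketch of evaluating this predictive density with SciPy's multivariate Student's t (available in SciPy ≥ 1.6); note that SciPy parameterizes the t by a scale matrix, i.e., the inverse of the precision Lk used above:

```python
import numpy as np
from scipy.stats import multivariate_t   # SciPy >= 1.6

def predictive_density(x_new, alpha, beta, m, W, nu):
    """p(x_hat | X) = (1/alpha_hat) sum_k alpha_k St(x_hat | m_k, L_k, nu_k + 1 - D)."""
    D = m.shape[1]
    alpha_hat = alpha.sum()
    density = 0.0
    for k in range(len(alpha)):
        df = nu[k] + 1 - D
        L = (df * beta[k] / (1 + beta[k])) * W[k]    # precision of the kth Student component
        # SciPy's multivariate_t takes a scale matrix, the inverse of the precision
        density += (alpha[k] / alpha_hat) * multivariate_t.pdf(
            x_new, loc=m[k], shape=np.linalg.inv(L), df=df)
    return density
```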
Determining no. of components
• Plot of the variational lower bound L versus the number of components K
• There is a distinct peak at K=2
• For each value of K the model is trained from 100 different random starts
  – Results are shown as '+' marks