
Page 1:

Variational Mixture of Gaussians

Sargur Srihari [email protected]

Page 2:

Objective

•  Apply variational inference machinery to Gaussian Mixture Models

•  Demonstrates how the Bayesian treatment elegantly resolves the difficulties encountered with maximum likelihood

•  Many more complex distributions can be solved using straightforward extensions of this analysis


Page 3:

Graphical Model for GMM

•  Graphical model corresponding to likelihood function of standard GMM:

•  For each observation xn we have a corresponding latent variable zn
  –  A 1-of-K binary vector with elements znk for k=1,..,K
•  Denote the observed data by X={x1,..,xN}
•  Denote the latent variables by Z={z1,..,zN}

[Figure: plate notation (equivalent networks) — a directed acyclic graph representing the mixture]

Page 4:

Likelihood Function for GMM

Since z takes the values {z_k} with probabilities π_k, the mixture density function is

p(x) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)

Therefore the likelihood function is (the product is over the N i.i.d. samples)

p(X \mid \pi, \mu, \Sigma) = \prod_{n=1}^{N} \left\{ \sum_{k=1}^{K} \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}

and the log-likelihood function is

\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}

Find the parameters π, µ and Σ that maximize the log-likelihood: a more difficult problem than for a single Gaussian.
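As an illustration, here is a minimal NumPy/SciPy sketch of evaluating this log-likelihood; the function name, array layout and toy data are assumptions, not from the slides:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pi, mu, Sigma):
    """ln p(X | pi, mu, Sigma) = sum_n ln sum_k pi_k N(x_n | mu_k, Sigma_k)."""
    K = len(pi)
    # N x K matrix of weighted component densities pi_k * N(x_n | mu_k, Sigma_k)
    dens = np.column_stack([
        pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
        for k in range(K)
    ])
    return np.sum(np.log(dens.sum(axis=1)))

# Hypothetical 2-component parameters on random 2-D data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
pi = np.array([0.5, 0.5])
mu = np.array([[0.0, 0.0], [1.0, 1.0]])
Sigma = np.array([np.eye(2), np.eye(2)])
print(gmm_log_likelihood(X, pi, mu, Sigma))
```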

Page 5:

GMM m.l.e. expressions

•  Obtained using derivatives of the log-likelihood

•  These are not closed-form solutions for the parameters
  –  Since the responsibilities γ(znk) depend on those parameters in a complex way

N_k = \sum_{n=1}^{N} \gamma(z_{nk})

Parameters (means):

\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n

Parameters (covariance matrices):

\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^T

Parameters (mixing coefficients):

\pi_k = \frac{N_k}{N}

All three are in terms of the responsibilities γ(z_{nk})
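A short sketch of these three re-estimation equations, given a responsibility matrix; the function name and array layout are assumptions:

```python
import numpy as np

def gmm_mle_update(X, resp):
    """Re-estimate GMM parameters from responsibilities gamma(z_nk), resp shape (N, K)."""
    N, D = X.shape
    Nk = resp.sum(axis=0)                         # N_k = sum_n gamma(z_nk)
    mu = (resp.T @ X) / Nk[:, None]               # mu_k = (1/N_k) sum_n gamma(z_nk) x_n
    Sigma = np.empty((len(Nk), D, D))
    for k in range(len(Nk)):
        diff = X - mu[k]                          # (x_n - mu_k)
        Sigma[k] = (resp[:, k, None] * diff).T @ diff / Nk[k]
    pi = Nk / N                                   # pi_k = N_k / N
    return pi, mu, Sigma
```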

Page 6:

EM For GMM

•  E step
  –  use the current values of the parameters µk, Σk, πk to evaluate the posterior probabilities p(Z|X), i.e., the responsibilities γ(znk)
•  M step
  –  use these posterior probabilities to re-estimate the means, covariances and mixing coefficients by maximizing the expectation of ln p(X,Z) with respect to p(Z|X)
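To tie the two steps together, here is a minimal E-step sketch; the commented loop assumes the `gmm_mle_update` helper from the previous sketch, and all names are assumptions:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_e_step(X, pi, mu, Sigma):
    """gamma(z_nk) = pi_k N(x_n|mu_k,Sigma_k) / sum_j pi_j N(x_n|mu_j,Sigma_j)."""
    weighted = np.column_stack([
        pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
        for k in range(len(pi))
    ])
    return weighted / weighted.sum(axis=1, keepdims=True)

# One possible EM loop, reusing gmm_mle_update from the previous sketch:
# for it in range(100):
#     resp = gmm_e_step(X, pi, mu, Sigma)          # E step
#     pi, mu, Sigma = gmm_mle_update(X, resp)      # M step
```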

Page 7:

Graphical model for Bayesian GMM

•  To specify the model we need these conditional probabilities:
  1.  p(Z|π): conditional distribution of Z given the mixing coefficients
  2.  p(X|Z, µ, Λ): conditional distribution of the observed data given the latent variables and component parameters
  3.  p(π): distribution of the mixing coefficients
  4.  p(µ,Λ): prior governing the mean and precision of each component

[Figure: graphical models for the GMM and the Bayesian GMM, with nodes for the mixing coefficients π, means µ and precisions Λ]

Page 8:

Conditional Distribution Expressions

1.  Conditional distribution of Z={z1,..,zN} given the mixing coefficients π
  –  Since the components are mutually exclusive

2.  Conditional distribution of the observed data X={x1,..,xN} given the latent variables and component parameters, p(X|Z, µ, Λ)
  –  Since the components are Gaussian
  –  where µ={µk} and Λ={Λk}
    •  use of the precision matrix simplifies the later analysis

For a single observation:

p(z) = \prod_{k=1}^{K} \pi_k^{z_k}
\qquad
p(x \mid z) = \prod_{k=1}^{K} \mathcal{N}(x \mid \mu_k, \Sigma_k)^{z_k}

For the full data set:

p(Z \mid \pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}}

p(X \mid Z, \mu, \Lambda) = \prod_{n=1}^{N} \prod_{k=1}^{K} \mathcal{N}(x_n \mid \mu_k, \Lambda_k^{-1})^{z_{nk}}
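These two conditionals define the generative model; a small ancestral-sampling sketch follows, with hypothetical parameter values that are not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-component, 2-D mixture parameters
pi = np.array([0.3, 0.7])
mu = np.array([[0.0, 0.0], [3.0, 3.0]])
Lambda = np.array([np.eye(2), 2.0 * np.eye(2)])   # precision matrices

def sample_gmm(n):
    """Ancestral sampling: z_n ~ Categorical(pi), then x_n ~ N(mu_k, Lambda_k^{-1})."""
    z = rng.choice(len(pi), size=n, p=pi)          # component index encodes the 1-of-K vector
    X = np.array([rng.multivariate_normal(mu[k], np.linalg.inv(Lambda[k])) for k in z])
    return X, z

X, z = sample_gmm(500)
```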

Page 9:

Parameter Priors: Mixing Coefficients

3. Distribution of mixing coefficients p(π)
•  Conjugate priors simplify the analysis
•  Dirichlet distribution over π

–  We have chosen the same parameter α0 for each of the components

–  C(α0) is the normalization constant for the Dirichlet distribution

p(\pi) = \mathrm{Dir}(\pi \mid \alpha_0) = C(\alpha_0) \prod_{k=1}^{K} \pi_k^{\alpha_0 - 1}
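A sketch of drawing mixing coefficients from this symmetric Dirichlet prior with SciPy; K=3 and α0=0.1 are illustrative values (echoing the Dirichlet figure on a later slide), not prescribed here:

```python
import numpy as np
from scipy.stats import dirichlet

K, alpha0 = 3, 0.1                                       # small alpha_0 favours sparse mixtures
alpha = np.full(K, alpha0)
samples = dirichlet.rvs(alpha, size=5, random_state=0)   # each row of mixing coefficients sums to 1
print(samples)
```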

Page 10:

Parameter Priors: Mean, Precision

4. Distribution of Mean and Precision of Gaussian components

–  Gaussian-Wishart prior is

p(\mu, \Lambda) = p(\mu \mid \Lambda)\, p(\Lambda) = \prod_{k=1}^{K} \mathcal{N}\!\left(\mu_k \mid m_0, (\beta_0 \Lambda_k)^{-1}\right) \mathcal{W}(\Lambda_k \mid W_0, \nu_0)

  –  which represents the conjugate prior when both the mean and the precision are unknown

•  The resulting model has a link between Λ and µ
  –  due to the distribution p(µ,Λ) in (4) above
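A sketch of drawing one component's (µk, Λk) from this Gaussian-Wishart prior with SciPy; the hyperparameter values are illustrative assumptions:

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(0)

# Hypothetical hyperparameters for a 2-D problem
D = 2
m0, beta0 = np.zeros(D), 1.0
W0, nu0 = np.eye(D), float(D)

def sample_gaussian_wishart():
    """Draw (mu_k, Lambda_k) from N(mu | m0, (beta0 Lambda)^-1) W(Lambda | W0, nu0)."""
    Lam = wishart.rvs(df=nu0, scale=W0, random_state=1)      # precision matrix draw
    mu = rng.multivariate_normal(m0, np.linalg.inv(beta0 * Lam))
    return mu, Lam

mu_k, Lambda_k = sample_gaussian_wishart()
```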

Page 11:

Bayesian Network for Bayesian GMM

•  Joint distribution of all random variables:

p(X, Z, \pi, \mu, \Lambda) = p(X \mid Z, \mu, \Lambda)\, p(Z \mid \pi)\, p(\pi)\, p(\mu \mid \Lambda)\, p(\Lambda)

  –  All the factors were given earlier
  –  Only X={x1,..,xN} is observed

•  This BN provides a nice distinction between latent variables and parameters
  –  Variables such as zn that appear inside the plate are latent variables
    •  The number of such variables grows with the data set
  –  Variables outside the plate are parameters
    •  Fixed in number, independent of the size of the data set
  –  From the viewpoint of PGMs there is no fundamental difference

[Figure: Bayesian network for the Bayesian GMM, with a plate over n and nodes for the means, precisions and mixing coefficients]

Page 12:

The variational approach

•  Recall the GMM:

p(x) = \sum_{z} p(z)\, p(x \mid z) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)

Here p(z) has parameter π, which itself has the distribution p(π)

•  The EM approach:
  1.  Evaluation of the posterior distribution p(Z|X)
  2.  Evaluation of the expectation of ln p(X,Z) with respect to p(Z|X)
•  Our goal is to specify the variational distribution q(Z,π,µ,Λ), which will approximate the posterior p(Z,π,µ,Λ|X)
  –  Recall the decomposition

\ln p(X) = \mathcal{L}(q) + \mathrm{KL}(q \,\|\, p)

where

\mathcal{L}(q) = \int q(Z) \ln\left\{ \frac{p(X,Z)}{q(Z)} \right\} dZ
\qquad \text{and} \qquad
\mathrm{KL}(q \,\|\, p) = -\int q(Z) \ln\left\{ \frac{p(Z \mid X)}{q(Z)} \right\} dZ

Page 13:

Variational Distribution

•  In variational inference we can specify q by using a factorized distribution
  –  For the Bayesian GMM the latent variables and parameters are Z, π, µ and Λ
•  So we consider the variational distribution
    q(Z,π,µ,Λ) = q(Z) q(π,µ,Λ)
  –  Remarkably, this is the only assumption needed for a tractable solution to a Bayesian mixture model
•  The functional forms of both q(Z) and q(π,µ,Λ) are determined automatically by optimizing the variational distribution

q(Z) = \prod_{i=1}^{M} q_i(Z_i)

Subscripts for the q's are often omitted

Page 14:

Sequential update equations

•  Using the general result for factorized distributions
  –  When L(q) is defined as

\mathcal{L}(q) = \int q(Z) \ln\left\{ \frac{p(X,Z)}{q(Z)} \right\} dZ = \int \prod_i q_i \left\{ \ln p(X,Z) - \sum_i \ln q_i \right\} dZ

  –  the q that makes the functional L(q) largest is

\ln q_j^*(Z_j) = \mathbb{E}_{i \neq j}\left[ \ln p(X,Z) \right] + \text{const}

•  For the Bayesian GMM the log of the optimized factor is

\ln q^*(Z) = \mathbb{E}_{\pi,\mu,\Lambda}\left[ \ln p(X, Z, \pi, \mu, \Lambda) \right] + \text{const}

•  Since

p(X, Z, \pi, \mu, \Lambda) = p(X \mid Z, \mu, \Lambda)\, p(Z \mid \pi)\, p(\pi)\, p(\mu \mid \Lambda)\, p(\Lambda)

we have

\ln q^*(Z) = \mathbb{E}_{\pi}\left[ \ln p(Z \mid \pi) \right] + \mathbb{E}_{\mu,\Lambda}\left[ \ln p(X \mid Z, \mu, \Lambda) \right] + \text{const}

  –  Note: the expectations are just weighted sums

Page 15:

Simplification of q*(Z)

•  Expression for the factor q*(Z):

\ln q^*(Z) = \mathbb{E}_{\pi}\left[ \ln p(Z \mid \pi) \right] + \mathbb{E}_{\mu,\Lambda}\left[ \ln p(X \mid Z, \mu, \Lambda) \right] + \text{const}

•  Absorbing terms not depending on Z into the constant:

\ln q^*(Z) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \ln \rho_{nk} + \text{const}

where

\ln \rho_{nk} = \mathbb{E}\left[ \ln \pi_k \right] + \tfrac{1}{2} \mathbb{E}\left[ \ln |\Lambda_k| \right] - \tfrac{D}{2} \ln(2\pi) - \tfrac{1}{2} \mathbb{E}_{\mu_k,\Lambda_k}\left[ (x_n - \mu_k)^T \Lambda_k (x_n - \mu_k) \right]

•  where D is the dimensionality of the data variable x
•  Taking exponentials of both sides:

q^*(Z) \propto \prod_{n=1}^{N} \prod_{k=1}^{K} \rho_{nk}^{z_{nk}}

•  The normalized distribution is

q^*(Z) = \prod_{n=1}^{N} \prod_{k=1}^{K} r_{nk}^{z_{nk}}
\qquad \text{where} \qquad
r_{nk} = \frac{\rho_{nk}}{\sum_{j=1}^{K} \rho_{nj}}

The r_{nk} are positive since the ρ_{nk} are exponentials of real numbers, and they sum to one as required
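Since ln ρ_nk is what is actually computed, here is a small sketch of the normalization step done stably in log space; the function name is an assumption:

```python
import numpy as np
from scipy.special import logsumexp

def normalize_responsibilities(log_rho):
    """Given ln rho_nk (shape N x K), return r_nk = rho_nk / sum_j rho_nj, computed in log space."""
    log_r = log_rho - logsumexp(log_rho, axis=1, keepdims=True)
    return np.exp(log_r)

# e.g. with arbitrary log-rho values
r = normalize_responsibilities(np.array([[-1.0, -2.0, -0.5],
                                         [-3.0, -0.1, -4.0]]))
print(r.sum(axis=1))   # each row sums to 1
```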

Page 16:

Factor q*(Z) has the same form as the prior

•  The normalized distribution is shown below
•  We have found the form of q* that maximizes the functional L(q)
  –  It has the same form as the prior p(Z|π)
•  The distribution q*(Z) is discrete and has the standard result E[znk] = rnk
  –  which play the role of responsibilities
•  Since the equations for q*(Z) depend on moments of the other variables
  –  they are coupled and must be solved iteratively

q^*(Z) = \prod_{n=1}^{N} \prod_{k=1}^{K} r_{nk}^{z_{nk}}
\qquad \text{compare} \qquad
p(Z \mid \pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}}

Page 17:

Variational EM

•  Variational E-step: determine the responsibilities rnk
•  Variational M-step:
  1.  determine the statistics of the data set (below)
  2.  find the optimal solution for the factor q(π,µ,Λ)

N_k = \sum_{n=1}^{N} r_{nk}
\qquad \text{(total responsibility of the kth component)}

\bar{x}_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk}\, x_n
\qquad \text{(mean of the kth component)}

S_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} (x_n - \bar{x}_k)(x_n - \bar{x}_k)^T
\qquad \text{(covariance matrix of the kth component)}
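A sketch of computing these three statistics from a responsibility matrix; the function name and array layout are assumptions:

```python
import numpy as np

def moment_statistics(X, r):
    """Weighted statistics N_k, xbar_k, S_k from data X (N x D) and responsibilities r (N x K)."""
    Nk = r.sum(axis=0)                                   # N_k = sum_n r_nk
    xbar = (r.T @ X) / Nk[:, None]                       # xbar_k = (1/N_k) sum_n r_nk x_n
    K, D = r.shape[1], X.shape[1]
    S = np.empty((K, D, D))
    for k in range(K):
        diff = X - xbar[k]
        # S_k = (1/N_k) sum_n r_nk (x_n - xbar_k)(x_n - xbar_k)^T
        S[k] = (r[:, k, None] * diff).T @ diff / Nk[k]
    return Nk, xbar, S
```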

Page 18:

Factorization of q(π,µ,Λ)

•  Using the general result for factorized distributions

\ln q_j^*(Z_j) = \mathbb{E}_{i \neq j}\left[ \ln p(X,Z) \right] + \text{const}

  –  we can write

\ln q^*(\pi, \mu, \Lambda) = \ln p(\pi) + \sum_{k=1}^{K} \ln p(\mu_k, \Lambda_k) + \mathbb{E}_Z\left[ \ln p(Z \mid \pi) \right] + \sum_{n=1}^{N} \sum_{k=1}^{K} \mathbb{E}\left[ z_{nk} \right] \ln \mathcal{N}(x_n \mid \mu_k, \Lambda_k^{-1}) + \text{const}

•  which decomposes into terms involving only π and terms involving only µ, Λ
  –  The terms involving µ and Λ comprise a sum of terms involving µk and Λk, leading to the factorization

q(\pi, \mu, \Lambda) = q(\pi) \prod_{k=1}^{K} q(\mu_k, \Lambda_k)

Page 19:

Factor q(π) is a Dirichlet

•  Given the factorization q(π,µ,Λ) = q(π) ∏_k q(µk,Λk)
•  Consider each factor in turn: q(π) and q(µk,Λk)
•  (2a) Identifying the terms depending on π, q(π) has the solution

\ln q^*(\pi) = (\alpha_0 - 1) \sum_{k=1}^{K} \ln \pi_k + \sum_{k=1}^{K} \sum_{n=1}^{N} r_{nk} \ln \pi_k + \text{const}

•  Taking exponentials of both sides we get q*(π) as a Dirichlet

q^*(\pi) = \mathrm{Dir}(\pi \mid \alpha)
\qquad \text{where } \alpha \text{ has components } \alpha_k = \alpha_0 + N_k

Dirichlet:

\mathrm{Dir}(\mu \mid \alpha) = \frac{\Gamma(\hat{\alpha})}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_K)} \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}
\qquad \text{where } \hat{\alpha} = \sum_{k=1}^{K} \alpha_k

[Figure: Dirichlet density for K=3, αk=0.1]

Page 20:

Factor q*(µk,Λk) is a Gaussian-Wishart

(2b) The variational posterior for q*(µk,Λk)
  –  Does not further factorize into marginals
  –  It is a Gaussian-Wishart distribution

q^*(\mu_k, \Lambda_k) = \mathcal{N}\!\left(\mu_k \mid m_k, (\beta_k \Lambda_k)^{-1}\right) \mathcal{W}(\Lambda_k \mid W_k, \nu_k)

  –  W is the Wishart distribution
    •  It has the form

\mathcal{W}(\Lambda \mid W, \nu) = B(W, \nu)\, |\Lambda|^{(\nu - D - 1)/2} \exp\!\left[ -\tfrac{1}{2} \mathrm{Tr}(W^{-1}\Lambda) \right]

where ν is the number of degrees of freedom, W is a D x D scale matrix, Tr is the trace, and B(W,ν) is a normalization constant
    •  It is the conjugate prior for a Gaussian with known mean and unknown precision matrix Λ

Page 21:

Parameters of q*(µk,Λk)

•  Gaussian-Wishart:

q^*(\mu_k, \Lambda_k) = \mathcal{N}\!\left(\mu_k \mid m_k, (\beta_k \Lambda_k)^{-1}\right) \mathcal{W}(\Lambda_k \mid W_k, \nu_k)

  –  where we have defined

\beta_k = \beta_0 + N_k

m_k = \frac{1}{\beta_k}\left( \beta_0 m_0 + N_k \bar{x}_k \right)

W_k^{-1} = W_0^{-1} + N_k S_k + \frac{\beta_0 N_k}{\beta_0 + N_k} (\bar{x}_k - m_0)(\bar{x}_k - m_0)^T

\nu_k = \nu_0 + N_k

•  These update equations are analogous to the M-step of EM for the maximum-likelihood solution of the GMM
  –  They involve evaluation of the same sums over the data set as in EM
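A sketch of these posterior hyperparameter updates, given the statistics Nk, x̄k and Sk from the earlier sketch; all names are assumptions:

```python
import numpy as np

def gaussian_wishart_update(Nk, xbar, S, m0, beta0, W0, nu0):
    """Posterior hyperparameters beta_k, m_k, W_k, nu_k of q*(mu_k, Lambda_k)."""
    K, D = xbar.shape
    beta = beta0 + Nk                                       # beta_k = beta_0 + N_k
    m = (beta0 * m0 + Nk[:, None] * xbar) / beta[:, None]   # m_k = (beta_0 m_0 + N_k xbar_k) / beta_k
    nu = nu0 + Nk                                           # nu_k = nu_0 + N_k
    W = np.empty((K, D, D))
    for k in range(K):
        diff = (xbar[k] - m0)[:, None]                      # column vector (xbar_k - m_0)
        Winv = (np.linalg.inv(W0) + Nk[k] * S[k]
                + (beta0 * Nk[k] / (beta0 + Nk[k])) * (diff @ diff.T))
        W[k] = np.linalg.inv(Winv)
    return beta, m, W, nu
```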

Page 22:

Expression for Responsibilities

•  For the M step we need the expectations E[znk] = rnk
  –  which are obtained by normalizing the ρnk:

r_{nk} = \frac{\rho_{nk}}{\sum_{j=1}^{K} \rho_{nj}}
\qquad \text{where} \qquad
\ln \rho_{nk} = \mathbb{E}\left[ \ln \pi_k \right] + \tfrac{1}{2} \mathbb{E}\left[ \ln |\Lambda_k| \right] - \tfrac{D}{2} \ln(2\pi) - \tfrac{1}{2} \mathbb{E}_{\mu_k,\Lambda_k}\left[ (x_n - \mu_k)^T \Lambda_k (x_n - \mu_k) \right]

  –  The three expectations with respect to the variational distribution of the parameters are easily evaluated to give

\ln \tilde{\pi}_k \equiv \mathbb{E}\left[ \ln \pi_k \right] = \psi(\alpha_k) - \psi(\hat{\alpha})
\qquad \text{where } \hat{\alpha} = \sum_k \alpha_k

\ln \tilde{\Lambda}_k \equiv \mathbb{E}\left[ \ln |\Lambda_k| \right] = \sum_{i=1}^{D} \psi\!\left( \frac{\nu_k + 1 - i}{2} \right) + D \ln 2 + \ln |W_k|

\mathbb{E}_{\mu_k,\Lambda_k}\left[ (x_n - \mu_k)^T \Lambda_k (x_n - \mu_k) \right] = D\,\beta_k^{-1} + \nu_k (x_n - m_k)^T W_k (x_n - m_k)

  –  ψ is the digamma function, with \psi(a) = \frac{d}{da} \ln \Gamma(a)
    •  The digamma appears in the definition of the Dirichlet; νk is the number of degrees of freedom of the Wishart
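A sketch of evaluating the first two expectations with SciPy's digamma; `alpha` has shape (K,), `nu` shape (K,) and `W` shape (K, D, D), and the function names are assumptions:

```python
import numpy as np
from scipy.special import digamma

def expected_log_pi(alpha):
    """E[ln pi_k] = psi(alpha_k) - psi(alpha_hat)."""
    return digamma(alpha) - digamma(alpha.sum())

def expected_log_det_Lambda(nu, W):
    """E[ln |Lambda_k|] = sum_i psi((nu_k + 1 - i)/2) + D ln 2 + ln |W_k|."""
    D = W.shape[-1]
    i = np.arange(1, D + 1)
    return (digamma((nu[:, None] + 1 - i) / 2).sum(axis=1)
            + D * np.log(2.0)
            + np.linalg.slogdet(W)[1])       # log-determinant of each W_k
```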

Page 23:

Evaluation of Responsibilities

•  Substituting the three expectations into ln ρnk gives

r_{nk} \propto \tilde{\pi}_k\, \tilde{\Lambda}_k^{1/2} \exp\left\{ -\frac{D}{2\beta_k} - \frac{\nu_k}{2} (x_n - m_k)^T W_k (x_n - m_k) \right\}

•  This is similar to the responsibilities for maximum-likelihood EM

\gamma(z_k) \equiv p(z_k = 1 \mid x) = \frac{p(z_k = 1)\, p(x \mid z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1)\, p(x \mid z_j = 1)} = \frac{\pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x \mid \mu_j, \Sigma_j)}

•  which can be written in the form

r_{nk} \propto \pi_k\, |\Lambda_k|^{1/2} \exp\left\{ -\tfrac{1}{2} (x_n - \mu_k)^T \Lambda_k (x_n - \mu_k) \right\}

•  where we have used the precision Λk instead of the covariance Σk to highlight the similarity
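Putting the pieces together, a sketch of the variational E-step that turns these quantities into responsibilities; it assumes the expectation helpers above have already produced `log_pi_tilde` and `log_Lambda_tilde`, and all names are assumptions:

```python
import numpy as np
from scipy.special import logsumexp

def variational_responsibilities(X, log_pi_tilde, log_Lambda_tilde, beta, m, W, nu):
    """r_nk ∝ pi~_k Lambda~_k^{1/2} exp{-D/(2 beta_k) - (nu_k/2)(x_n-m_k)^T W_k (x_n-m_k)}."""
    N, D = X.shape
    K = len(beta)
    log_rho = np.empty((N, K))
    for k in range(K):
        diff = X - m[k]                                     # (x_n - m_k), shape N x D
        quad = np.einsum('ni,ij,nj->n', diff, W[k], diff)   # (x_n - m_k)^T W_k (x_n - m_k)
        log_rho[:, k] = (log_pi_tilde[k] + 0.5 * log_Lambda_tilde[k]
                         - 0.5 * D / beta[k] - 0.5 * nu[k] * quad
                         - 0.5 * D * np.log(2 * np.pi))
    # normalize in log space to obtain r_nk
    return np.exp(log_rho - logsumexp(log_rho, axis=1, keepdims=True))
```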

Page 24:

Summary of Optimization

•  Optimization of the variational posterior distribution involves cycling between two stages
  –  Analogous to the E and M steps of maximum-likelihood EM
•  Variational E-step:
  –  Use the current distribution over the model parameters to evaluate the moments and hence evaluate E[znk] = rnk
•  Variational M-step:
  –  Keep the responsibilities fixed; use them to recompute the variational distribution over the parameters using

q^*(\pi) = \mathrm{Dir}(\pi \mid \alpha)
\qquad \text{and} \qquad
q^*(\mu_k, \Lambda_k) = \mathcal{N}\!\left(\mu_k \mid m_k, (\beta_k \Lambda_k)^{-1}\right) \mathcal{W}(\Lambda_k \mid W_k, \nu_k)

Page 25:

Variational Bayesian GMM

[Figure: Variational Bayesian GMM fitted to the Old Faithful data set, starting with K=6 components; after convergence only two components remain. The density of red ink inside each ellipse shows the mean value of its mixing coefficient.]
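Similar behaviour can be reproduced with scikit-learn's BayesianGaussianMixture, which implements a variational treatment of this kind; a hedged sketch on synthetic two-cluster data standing in for Old Faithful (the data, prior value and settings are assumptions):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Synthetic stand-in for the Old Faithful data: two well-separated 2-D clusters
X = np.vstack([rng.normal([2.0, 55.0], [0.3, 6.0], size=(100, 2)),
               rng.normal([4.3, 80.0], [0.4, 6.0], size=(172, 2))])

vbgmm = BayesianGaussianMixture(
    n_components=6,                                      # start with K=6 components
    weight_concentration_prior_type="dirichlet_distribution",
    weight_concentration_prior=0.001,                    # small alpha_0 tends to prune unused components
    covariance_type="full", max_iter=500, random_state=0)
vbgmm.fit(X)
print(np.round(vbgmm.weights_, 3))   # most mixing coefficients typically collapse toward zero
```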

Page 26:

Similarity of Variational Bayes and EM

•  There is a close similarity between the variational solution for the Bayesian mixture of Gaussians and the EM algorithm for maximum likelihood

•  In the limit as N → ∞, the Bayesian treatment converges to the maximum-likelihood EM solution

•  The variational algorithm is more expensive, but the problem of singularities is eliminated

Page 27:

Variational Lower Bound

•  We can straightforwardly evaluate the lower bound L(q) for this model

•  Recall

•  The lower bound is used to monitor the re-estimation and to test for convergence

\ln p(X) = \mathcal{L}(q) + \mathrm{KL}(q \,\|\, p)

where

\mathcal{L}(q) = \int q(Z) \ln\left\{ \frac{p(X,Z)}{q(Z)} \right\} dZ
\qquad \text{and} \qquad
\mathrm{KL}(q \,\|\, p) = -\int q(Z) \ln\left\{ \frac{p(Z \mid X)}{q(Z)} \right\} dZ

Page 28:

Predictive Density

•  In using a Bayesian GMM we will be interested in the predictive density for a new value x̂ of the observed variable
•  Assuming a corresponding latent variable ẑ, we can show the result below
  –  The mixture of Student's t distributions becomes a GMM as N → ∞

p(\hat{x} \mid X) = \frac{1}{\hat{\alpha}} \sum_{k=1}^{K} \alpha_k\, \mathrm{St}\!\left( \hat{x} \mid m_k, L_k, \nu_k + 1 - D \right)

where the kth component has mean m_k and precision

L_k = \frac{(\nu_k + 1 - D)\, \beta_k}{1 + \beta_k} W_k
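A sketch of evaluating this predictive mixture of Student's t densities with SciPy; note the assumption that scipy.stats.multivariate_t is parameterized by a scale (covariance-like) matrix, so the precision L_k is inverted, and all array names are hypothetical:

```python
import numpy as np
from scipy.stats import multivariate_t

def predictive_density(x_new, alpha, m, W, beta, nu):
    """p(x_new | X) = (1/alpha_hat) sum_k alpha_k St(x_new | m_k, L_k, nu_k + 1 - D)."""
    D = m.shape[1]
    alpha_hat = alpha.sum()
    dens = 0.0
    for k in range(len(alpha)):
        df = nu[k] + 1 - D
        L = (df * beta[k] / (1 + beta[k])) * W[k]        # precision of the kth Student component
        dens += alpha[k] * multivariate_t.pdf(x_new, loc=m[k], shape=np.linalg.inv(L), df=df)
    return dens / alpha_hat
```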

Page 29:

Determining no. of components

•  Plot of the variational lower bound L versus the number of components K
•  There is a distinct peak at K=2
•  For each K the model is trained from 100 different starts
  –  Results are shown as + in the plot
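One possible way to reproduce such a comparison with scikit-learn, scoring each K by the final variational lower bound; the attribute, settings and restart count are assumptions about the library, and X would be the Old Faithful observations:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def lower_bound_vs_K(X, K_values, n_starts=10):
    """Fit a variational Bayesian GMM for each K and record the best lower bound over restarts."""
    bounds = {}
    for K in K_values:
        model = BayesianGaussianMixture(
            n_components=K, covariance_type="full",
            weight_concentration_prior_type="dirichlet_distribution",
            n_init=n_starts, max_iter=500, random_state=0)
        model.fit(X)
        bounds[K] = model.lower_bound_     # variational lower bound of the best run
    return bounds

# e.g. bounds = lower_bound_vs_K(X, K_values=range(1, 7))
```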