Variational Inference for Gaussian Mixture Models
Sargur Srihari (CSE574, Chapter 10.3)
Objective
• Apply variational inference machinery to Gaussian Mixture Models
• Demonstrate how the Bayesian treatment elegantly resolves the difficulties encountered with maximum likelihood
• Many more complex distributions can be handled by straightforward extensions of this analysis
Graphical Model for GMM
• Graphical model corresponding to the likelihood function of the standard GMM:
• For each observation xn we have a corresponding latent variable zn
  – a 1-of-K binary vector with elements znk for k=1,..,K
• Denote the observed data by X={x1,..,xN} and the latent variables by Z={z1,..,zN}
[Figure: plate notation and the equivalent expanded network; a directed acyclic graph representing the mixture]
Likelihood Function for GMM
Since z takes the values {zk} with probabilities πk, the mixture density function is

  p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)

Therefore the likelihood function (a product over the N i.i.d. samples) is

  p(X \mid \pi, \mu, \Sigma) = \prod_{n=1}^{N} \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}

and the log-likelihood function is

  \ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}

Finding the parameters π, µ and Σ that maximize the log-likelihood is a more difficult problem than for a single Gaussian.
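For concreteness, a small NumPy/SciPy sketch (the function and variable names are illustrative, not from the slides) that evaluates this log-likelihood using the log-sum-exp trick for numerical stability:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pi, mus, Sigmas):
    """ln p(X | pi, mu, Sigma) = sum_n ln sum_k pi_k N(x_n | mu_k, Sigma_k)."""
    K = len(pi)
    # log pi_k + log N(x_n | mu_k, Sigma_k), shape (N, K)
    log_terms = np.column_stack([
        np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
        for k in range(K)
    ])
    # log-sum-exp over k, then sum over the N samples
    m = log_terms.max(axis=1, keepdims=True)
    return float(np.sum(m.squeeze() + np.log(np.exp(log_terms - m).sum(axis=1))))
```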
GMM m.l.e. expressions
• Obtained using derivatives of the log-likelihood
• These are not closed-form solutions for the parameters
  – since the responsibilities depend on those parameters in a complex way

  Parameters (means):                \mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n
  Parameters (covariance matrices):  \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, (x_n - \mu_k)(x_n - \mu_k)^T
  Parameters (mixing coefficients):  \pi_k = \frac{N_k}{N}, \quad \text{where} \quad N_k = \sum_{n=1}^{N} \gamma(z_{nk})

• All three are expressed in terms of the responsibilities \gamma(z_{nk})
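A minimal NumPy sketch (illustrative; names are ours, not from the slides) of these re-estimation equations, given a matrix of responsibilities:

```python
import numpy as np

def m_step_mle(X, gamma):
    """Re-estimate pi_k, mu_k, Sigma_k from responsibilities gamma, shape (N, K)."""
    N, D = X.shape
    Nk = gamma.sum(axis=0)                      # N_k = sum_n gamma(z_nk)
    pi = Nk / N                                 # mixing coefficients
    mus = (gamma.T @ X) / Nk[:, None]           # mu_k = (1/N_k) sum_n gamma(z_nk) x_n
    Sigmas = []
    for k in range(gamma.shape[1]):
        diff = X - mus[k]                       # (N, D)
        Sigmas.append((gamma[:, k, None] * diff).T @ diff / Nk[k])
    return pi, mus, np.array(Sigmas)
```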
EM for GMM
• E step
  – Use the current values of the parameters µk, Σk, πk to evaluate the posterior probabilities p(Z|X), i.e., the responsibilities γ(znk)
• M step
  – Use these posterior probabilities to re-estimate the means, covariances and mixing coefficients, by maximizing the expectation of ln p(X,Z) with respect to p(Z|X)
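A matching sketch of the E step (again illustrative, assuming NumPy/SciPy), computing the responsibilities in the log domain to avoid underflow:

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step_mle(X, pi, mus, Sigmas):
    """gamma(z_nk) = pi_k N(x_n|mu_k,Sigma_k) / sum_j pi_j N(x_n|mu_j,Sigma_j)."""
    K = len(pi)
    log_r = np.column_stack([
        np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
        for k in range(K)
    ])
    log_r -= log_r.max(axis=1, keepdims=True)   # log-sum-exp trick
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)
```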
Graphical model for Bayesian GMM
• To specify the model we need these conditional probabilities:
  1. p(Z|π): conditional distribution of Z given the mixing coefficients
  2. p(X|Z, µ, Λ): conditional distribution of the data given the latent variables and component parameters
  3. p(π): distribution of the mixing coefficients
  4. p(µ,Λ): prior governing the mean and precision of each component
[Figure: graphical models for the GMM and the Bayesian GMM, with nodes for the mixing coefficients, means and precisions]
Conditional Distribution Expressions
1. Conditional distribution of Z={z1,..,zN} given the mixing coefficients π
   – Since the components are mutually exclusive, for a single latent variable
     p(z) = \prod_{k=1}^{K} \pi_k^{z_k}
   – and over the whole data set
     p(Z \mid \pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}}
2. Conditional distribution of the observed data X={x1,..,xN} given the latent variables and component parameters
   – Since the components are Gaussian, for a single observation
     p(x \mid z) = \prod_{k=1}^{K} \mathcal{N}(x \mid \mu_k, \Sigma_k)^{z_k}
   – and over the whole data set
     p(X \mid Z, \mu, \Lambda) = \prod_{n=1}^{N} \prod_{k=1}^{K} \mathcal{N}(x_n \mid \mu_k, \Lambda_k^{-1})^{z_{nk}}
   – where µ={µk} and Λ={Λk}
   • Use of the precision matrix simplifies the further analysis
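A short sketch (illustrative names, assuming NumPy/SciPy) that evaluates ln p(Z|π) + ln p(X|Z,µ,Λ) for a given one-hot assignment matrix Z, with components parameterized by precision matrices as above:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_complete_data_likelihood(X, Z, pi, mus, Lambdas):
    """ln p(Z|pi) + ln p(X|Z,mu,Lambda) for one-hot assignments Z of shape (N, K)."""
    K = Z.shape[1]
    log_pZ = float(np.sum(Z * np.log(pi)))      # sum_n sum_k z_nk ln pi_k
    log_pX = sum(
        float(np.sum(Z[:, k] * multivariate_normal.logpdf(
            X, mus[k], np.linalg.inv(Lambdas[k]))))   # covariance = inverse precision
        for k in range(K)
    )
    return log_pZ + log_pX
```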
Parameter Priors: Mixing Coefficients
3. Distribution of the mixing coefficients p(π)
   • Conjugate priors simplify the analysis
   • Dirichlet distribution over π:
     p(\pi) = \mathrm{Dir}(\pi \mid \alpha_0) = C(\alpha_0) \prod_{k=1}^{K} \pi_k^{\alpha_0 - 1}
   – We have chosen the same parameter α0 for each of the components
   – C(α0) is the normalization constant of the Dirichlet distribution
Parameter Priors: Mean, Precision
4. Distribution of the mean and precision of the Gaussian components
   – The Gaussian-Wishart prior is
     p(\mu, \Lambda) = p(\mu \mid \Lambda) \, p(\Lambda) = \prod_{k=1}^{K} \mathcal{N}\!\left(\mu_k \mid m_0, (\beta_0 \Lambda_k)^{-1}\right) \mathcal{W}(\Lambda_k \mid W_0, \nu_0)
   – which is the conjugate prior when both the mean and the precision are unknown
   • The resulting model has a link between Λ and µ, due to the distribution p(µ,Λ) above
Bayesian Network for Bayesian GMM
• Joint distribution of all the random variables:
  p(X, Z, \pi, \mu, \Lambda) = p(X \mid Z, \mu, \Lambda) \, p(Z \mid \pi) \, p(\pi) \, p(\mu \mid \Lambda) \, p(\Lambda)
  – All the factors were given earlier
  – Only X={x1,..,xN} are observed
• This BN provides a nice distinction between latent variables and parameters
  – Variables such as zn that appear inside the plate are latent variables
    • The number of such variables grows with the size of the data set
  – Variables outside the plate are parameters
    • Fixed in number, independent of the size of the data set
  – From the viewpoint of PGMs there is no fundamental difference
[Figure: Bayesian network with a plate over n; nodes for the means, precisions and mixing coefficients]
The variational approach
• Recall the GMM:
  p(x) = \sum_{z} p(z) \, p(x \mid z) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)
  – Here p(z) has parameter π, which itself has a distribution p(π)
• The EM approach:
  1. Evaluation of the posterior distribution p(Z|X)
  2. Evaluation of the expectation of ln p(X,Z) with respect to p(Z|X)
• Our goal is to specify the variational distribution q(Z,π,µ,Λ), which will approximate p(Z,π,µ,Λ|X)
  – Recall
    \ln p(X) = \mathcal{L}(q) + \mathrm{KL}(q \parallel p)
    where
    \mathcal{L}(q) = \int q(Z) \ln\left\{\frac{p(X,Z)}{q(Z)}\right\} dZ
    and
    \mathrm{KL}(q \parallel p) = -\int q(Z) \ln\left\{\frac{p(Z \mid X)}{q(Z)}\right\} dZ
Variational Distribution
• In variational inference we specify q using a factorized distribution
  q(Z) = \prod_{i=1}^{M} q_i(Z_i)
• For the Bayesian GMM the latent variables and parameters are Z, π, µ and Λ
• So we consider the variational distribution
  q(Z, \pi, \mu, \Lambda) = q(Z) \, q(\pi, \mu, \Lambda)
  (subscripts on the q's are omitted)
  – Remarkably, this is the only assumption needed for a tractable solution to a Bayesian mixture model
• The functional forms of both q(Z) and q(π,µ,Λ) are determined automatically by optimizing the variational distribution
Sequential update equations
• Using the general result for factorized distributions:
  – When L(q) is defined as
    \mathcal{L}(q) = \int q(Z) \ln\left\{\frac{p(X,Z)}{q(Z)}\right\} dZ = \int \prod_i q_i \left\{\ln p(X,Z) - \sum_i \ln q_i\right\} dZ
  – the factor qj that makes the functional L(q) largest is
    \ln q_j^*(Z_j) = \mathbb{E}_{i \neq j}\left[\ln p(X,Z)\right] + \text{const}
• For the Bayesian GMM the log of the optimized factor is
  \ln q^*(Z) = \mathbb{E}_{\pi,\mu,\Lambda}\left[\ln p(X, Z, \pi, \mu, \Lambda)\right] + \text{const}
• Since p(X,Z,π,µ,Λ) = p(X|Z,µ,Λ) p(Z|π) p(π) p(µ|Λ) p(Λ), we have
  \ln q^*(Z) = \mathbb{E}_{\pi}\left[\ln p(Z \mid \pi)\right] + \mathbb{E}_{\mu,\Lambda}\left[\ln p(X \mid Z, \mu, \Lambda)\right] + \text{const}
  – Note: the expectations are just weighted sums
Simplification of q*(Z)
• Expression for the factor q*(Z):
  \ln q^*(Z) = \mathbb{E}_{\pi}\left[\ln p(Z \mid \pi)\right] + \mathbb{E}_{\mu,\Lambda}\left[\ln p(X \mid Z, \mu, \Lambda)\right] + \text{const}
• Absorbing terms not depending on Z into the constant:
  \ln q^*(Z) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \ln \rho_{nk} + \text{const}
  where
  \ln \rho_{nk} = \mathbb{E}\left[\ln \pi_k\right] + \tfrac{1}{2}\mathbb{E}\left[\ln |\Lambda_k|\right] - \tfrac{D}{2}\ln(2\pi) - \tfrac{1}{2}\mathbb{E}_{\mu_k,\Lambda_k}\left[(x_n - \mu_k)^T \Lambda_k (x_n - \mu_k)\right]
  and D is the dimensionality of the data variable x
• Taking exponentials of both sides:
  q^*(Z) \propto \prod_{n=1}^{N} \prod_{k=1}^{K} \rho_{nk}^{z_{nk}}
• The normalized distribution is
  q^*(Z) = \prod_{n=1}^{N} \prod_{k=1}^{K} r_{nk}^{z_{nk}}
  where
  r_{nk} = \frac{\rho_{nk}}{\sum_{j=1}^{K} \rho_{nj}}
• The rnk are positive since the ρnk are exponentials of real numbers, and they sum to one as required
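In code, this normalization is typically carried out in the log domain; a minimal sketch (illustrative, assuming NumPy):

```python
import numpy as np

def normalize_responsibilities(log_rho):
    """r_nk = rho_nk / sum_j rho_nj, computed from ln rho_nk with the
    log-sum-exp trick so the exponentials cannot overflow."""
    log_rho = log_rho - log_rho.max(axis=1, keepdims=True)
    rho = np.exp(log_rho)
    return rho / rho.sum(axis=1, keepdims=True)
```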
Factor q*(Z) has the same form as the prior
• The normalized distribution
  q^*(Z) = \prod_{n=1}^{N} \prod_{k=1}^{K} r_{nk}^{z_{nk}}
  has the same form as the prior
  p(Z \mid \pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}}
• We have found the form of q* that maximizes the functional L(q)
• The distribution q*(Z) is discrete, with the standard result E[znk] = rnk
  – so the rnk play the role of responsibilities
• Since the equations for q*(Z) depend on moments of the other variables, they are coupled and must be solved iteratively
Variational EM
• Variational E-step: determine the responsibilities rnk
• Variational M-step:
  1. Determine the statistics of the data set
     N_k = \sum_{n=1}^{N} r_{nk}    (effective number of points assigned to component k)
     \bar{x}_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} \, x_n    (mean of the kth component)
     S_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} \, (x_n - \bar{x}_k)(x_n - \bar{x}_k)^T    (covariance matrix of the kth component)
  2. Find the optimal solution for the factor q(π,µ,Λ)
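A sketch of step 1 in NumPy (illustrative names; the small constant added to Nk is a numerical safeguard against empty components, not part of the slides):

```python
import numpy as np

def vb_statistics(X, r):
    """N_k, xbar_k and S_k from responsibilities r of shape (N, K)."""
    N, D = X.shape
    K = r.shape[1]
    Nk = r.sum(axis=0) + 1e-10                  # guard against empty components
    xbar = (r.T @ X) / Nk[:, None]              # (K, D)
    S = np.zeros((K, D, D))
    for k in range(K):
        diff = X - xbar[k]
        S[k] = (r[:, k, None] * diff).T @ diff / Nk[k]
    return Nk, xbar, S
```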
Factorization of q(π,µ,Λ)
• Using the general result for factorized distributions
  \ln q_j^*(Z_j) = \mathbb{E}_{i \neq j}\left[\ln p(X,Z)\right] + \text{const}
  we can write
  \ln q^*(\pi,\mu,\Lambda) = \ln p(\pi) + \sum_{k=1}^{K} \ln p(\mu_k, \Lambda_k) + \mathbb{E}_Z\left[\ln p(Z \mid \pi)\right] + \sum_{n=1}^{N} \sum_{k=1}^{K} \mathbb{E}\left[z_{nk}\right] \ln \mathcal{N}(x_n \mid \mu_k, \Lambda_k^{-1}) + \text{const}
• This decomposes into terms involving only π and terms involving only µ and Λ
  – The terms involving µ and Λ comprise a sum of terms involving µk and Λk, leading to the factorization
    q(\pi,\mu,\Lambda) = q(\pi) \prod_{k=1}^{K} q(\mu_k, \Lambda_k)
Factor q(π) is a Dirichlet
• Given the factorization
  q(\pi,\mu,\Lambda) = q(\pi) \prod_{k=1}^{K} q(\mu_k, \Lambda_k)
  consider each factor in turn: q(π) and q(µk,Λk)
• (2a) Identifying the terms depending on π, q(π) has the solution
  \ln q^*(\pi) = (\alpha_0 - 1) \sum_{k=1}^{K} \ln \pi_k + \sum_{k=1}^{K} \sum_{n=1}^{N} r_{nk} \ln \pi_k + \text{const}
• Taking the exponential of both sides, we recognize q*(π) as a Dirichlet distribution
  q^*(\pi) = \mathrm{Dir}(\pi \mid \alpha)
  where α has components αk = α0 + Nk
• Recall the Dirichlet distribution (written with α̂ to avoid clashing with the prior parameter α0):
  \mathrm{Dir}(\mu \mid \alpha) = \frac{\Gamma(\hat{\alpha})}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)} \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}, \quad \hat{\alpha} = \sum_{k=1}^{K} \alpha_k
[Figure: Dirichlet density for K=3, αk=0.1]
Factor q*(µk,Λk) is a Gaussian-Wishart
• (2b) Variational posterior for q*(µk,Λk)
  – It does not factorize further into marginals over µk and Λk
  – It is a Gaussian-Wishart distribution:
    q^*(\mu_k, \Lambda_k) = \mathcal{N}\!\left(\mu_k \mid m_k, (\beta_k \Lambda_k)^{-1}\right) \mathcal{W}(\Lambda_k \mid W_k, \nu_k)
  – W denotes the Wishart distribution, which has the form
    \mathcal{W}(\Lambda \mid W, \nu) = B(W, \nu)\, |\Lambda|^{(\nu - D - 1)/2} \exp\!\left[-\tfrac{1}{2}\mathrm{Tr}(W^{-1}\Lambda)\right]
    where ν is the number of degrees of freedom, W is a D × D scale matrix, Tr is the trace, and B(W,ν) is a normalization constant
  • The Wishart is the conjugate prior for a Gaussian with known mean and unknown precision matrix Λ
Parameters of q*(µk,Λk)
• Gaussian-Wishart:
  q^*(\mu_k, \Lambda_k) = \mathcal{N}\!\left(\mu_k \mid m_k, (\beta_k \Lambda_k)^{-1}\right) \mathcal{W}(\Lambda_k \mid W_k, \nu_k)
  where we have defined
  \beta_k = \beta_0 + N_k
  m_k = \frac{1}{\beta_k}\left(\beta_0 m_0 + N_k \bar{x}_k\right)
  W_k^{-1} = W_0^{-1} + N_k S_k + \frac{\beta_0 N_k}{\beta_0 + N_k}\left(\bar{x}_k - m_0\right)\left(\bar{x}_k - m_0\right)^T
  \nu_k = \nu_0 + N_k + 1
• These update equations are analogous to the M-step of EM for the maximum likelihood solution of the GMM
  – They involve evaluation of the same sums over the data set as in EM
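A sketch of these hyperparameter updates in NumPy (illustrative names; it also includes the Dirichlet update αk = α0 + Nk from the earlier slide):

```python
import numpy as np

def vb_m_step(Nk, xbar, S, alpha0, beta0, m0, W0, nu0):
    """Update the Dirichlet and Gaussian-Wishart hyperparameters from N_k, xbar_k, S_k."""
    K, D = xbar.shape
    alpha = alpha0 + Nk                              # Dirichlet: alpha_k = alpha_0 + N_k
    beta = beta0 + Nk                                # beta_k = beta_0 + N_k
    m = (beta0 * m0 + Nk[:, None] * xbar) / beta[:, None]
    W = np.zeros((K, D, D))
    for k in range(K):
        diff = (xbar[k] - m0)[:, None]               # (D, 1)
        Winv = (np.linalg.inv(W0) + Nk[k] * S[k]
                + (beta0 * Nk[k] / (beta0 + Nk[k])) * diff @ diff.T)
        W[k] = np.linalg.inv(Winv)
    nu = nu0 + Nk + 1                                # degrees of freedom, as given above
    return alpha, beta, m, W, nu
```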
Expression for Responsibilities
• For the M-step we need the expectations E[znk] = rnk
  – which are obtained by normalizing the ρnk:
    r_{nk} = \frac{\rho_{nk}}{\sum_{j=1}^{K} \rho_{nj}}
• Since
  \ln \rho_{nk} = \mathbb{E}\left[\ln \pi_k\right] + \tfrac{1}{2}\mathbb{E}\left[\ln |\Lambda_k|\right] - \tfrac{D}{2}\ln(2\pi) - \tfrac{1}{2}\mathbb{E}_{\mu_k,\Lambda_k}\left[(x_n - \mu_k)^T \Lambda_k (x_n - \mu_k)\right]
  the three expectations with respect to the variational distribution of the parameters are easily evaluated to give
  \ln \tilde{\pi}_k \equiv \mathbb{E}\left[\ln \pi_k\right] = \psi(\alpha_k) - \psi(\hat{\alpha}), \quad \hat{\alpha} = \sum_k \alpha_k
  \ln \tilde{\Lambda}_k \equiv \mathbb{E}\left[\ln |\Lambda_k|\right] = \sum_{i=1}^{D} \psi\!\left(\frac{\nu_k + 1 - i}{2}\right) + D \ln 2 + \ln |W_k|
  \mathbb{E}_{\mu_k,\Lambda_k}\left[(x_n - \mu_k)^T \Lambda_k (x_n - \mu_k)\right] = \frac{D}{\beta_k} + \nu_k (x_n - m_k)^T W_k (x_n - m_k)
• ψ is the digamma function, \psi(a) = \frac{d}{da}\ln\Gamma(a)
  – The digamma function arises when taking expectations of ln πk under the Dirichlet distribution
• νk is the number of degrees of freedom of the Wishart distribution
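A sketch of these three expectations using SciPy's digamma (illustrative names, not from the slides):

```python
import numpy as np
from scipy.special import digamma

def expected_log_pi_and_lambda(alpha, W, nu):
    """E[ln pi_k] and E[ln |Lambda_k|] under the variational posterior."""
    D = W.shape[-1]
    ln_pi_tilde = digamma(alpha) - digamma(alpha.sum())
    ln_lambda_tilde = np.array([
        digamma(0.5 * (nu[k] + 1 - np.arange(1, D + 1))).sum()
        + D * np.log(2.0) + np.linalg.slogdet(W[k])[1]
        for k in range(len(nu))
    ])
    return ln_pi_tilde, ln_lambda_tilde

def expected_mahalanobis(X, beta, m, W, nu):
    """E_{mu_k,Lambda_k}[(x_n - mu_k)^T Lambda_k (x_n - mu_k)], shape (N, K)."""
    N, D = X.shape
    K = len(beta)
    E = np.zeros((N, K))
    for k in range(K):
        diff = X - m[k]
        E[:, k] = D / beta[k] + nu[k] * np.einsum('nd,de,ne->n', diff, W[k], diff)
    return E
```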
Evaluation of Responsibilities
• Substituting the three expectations into ln ρnk and normalizing gives
  r_{nk} \propto \tilde{\pi}_k \, \tilde{\Lambda}_k^{1/2} \exp\!\left\{-\frac{D}{2\beta_k} - \frac{\nu_k}{2}\left(x_n - m_k\right)^T W_k \left(x_n - m_k\right)\right\}
• This is similar to the responsibilities of maximum likelihood EM,
  \gamma(z_k) \equiv p(z_k = 1 \mid x) = \frac{p(z_k = 1)\, p(x \mid z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1)\, p(x \mid z_j = 1)} = \frac{\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)}
  which can be written in the form
  r_{nk} \propto \pi_k \, |\Lambda_k|^{1/2} \exp\!\left\{-\frac{1}{2}\left(x_n - \mu_k\right)^T \Lambda_k \left(x_n - \mu_k\right)\right\}
• Here the precision Λk is used instead of the covariance Σk to highlight the similarity
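Putting the pieces together, a sketch of the variational E-step that forms ln ρnk from the three expectations and normalizes it into rnk (illustrative, assuming NumPy and the helpers above):

```python
import numpy as np

def vb_e_step(X, ln_pi_tilde, ln_lambda_tilde, E_maha):
    """ln rho_nk = E[ln pi_k] + 1/2 E[ln|Lambda_k|] - D/2 ln(2 pi)
                   - 1/2 E[(x_n - mu_k)^T Lambda_k (x_n - mu_k)],
    then r_nk = rho_nk / sum_j rho_nj via log-sum-exp normalization."""
    D = X.shape[1]
    log_rho = (ln_pi_tilde + 0.5 * ln_lambda_tilde
               - 0.5 * D * np.log(2 * np.pi) - 0.5 * E_maha)
    log_rho -= log_rho.max(axis=1, keepdims=True)
    r = np.exp(log_rho)
    return r / r.sum(axis=1, keepdims=True)
```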
Summary of Optimization
• Optimization of the variational posterior distribution involves cycling between two stages, analogous to the E and M steps of maximum likelihood EM
• Variational E-step:
  – Use the current distribution over the model parameters to evaluate the moments and hence evaluate E[znk] = rnk
• Variational M-step:
  – Keep the responsibilities fixed and use them to recompute the variational distribution over the parameters using
    q^*(\pi) = \mathrm{Dir}(\pi \mid \alpha)
    and
    q^*(\mu_k, \Lambda_k) = \mathcal{N}\!\left(\mu_k \mid m_k, (\beta_k \Lambda_k)^{-1}\right) \mathcal{W}(\Lambda_k \mid W_k, \nu_k)
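An illustrative driver loop that cycles these two stages, reusing the helper functions sketched on the earlier slides; the initialization choices (m0 = data mean, W0 = I, ν0 = D, random initial responsibilities) are assumptions made for this example, not prescriptions from the slides:

```python
import numpy as np

def variational_gmm(X, K, alpha0=1e-3, beta0=1.0, nu0=None, n_iter=100, tol=1e-6):
    N, D = X.shape
    rng = np.random.default_rng(0)
    m0, W0 = X.mean(axis=0), np.eye(D)
    nu0 = D if nu0 is None else nu0
    r = rng.dirichlet(np.ones(K), size=N)            # random initial responsibilities
    for _ in range(n_iter):
        Nk, xbar, S = vb_statistics(X, r)            # M-step, part 1: data statistics
        alpha, beta, m, W, nu = vb_m_step(Nk, xbar, S, alpha0, beta0, m0, W0, nu0)
        ln_pi, ln_lam = expected_log_pi_and_lambda(alpha, W, nu)
        r_new = vb_e_step(X, ln_pi, ln_lam,          # E-step: recompute r_nk
                          expected_mahalanobis(X, beta, m, W, nu))
        if np.abs(r_new - r).max() < tol:
            r = r_new
            break
        r = r_new
    # alpha / alpha.sum() gives the expected mixing coefficients after convergence
    return r, alpha, beta, m, W, nu
```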
Variational Bayesian GMM
[Figure: Old Faithful data set fitted with K=6 components; after convergence only two components remain, and the density of red ink inside each ellipse shows the mean value of its mixing coefficient]
Similarity of Variational Bayes and EM
• There is a close similarity between the variational solution for the Bayesian mixture of Gaussians and the EM algorithm for maximum likelihood
• In the limit as N → ∞, the Bayesian treatment converges to the maximum likelihood EM solution
• The variational algorithm is computationally more expensive, but the singularity problems of maximum likelihood are eliminated
Variational Lower Bound
• We can straightforwardly evaluate the lower bound L(q) for this model
• Recall
  \ln p(X) = \mathcal{L}(q) + \mathrm{KL}(q \parallel p)
  where
  \mathcal{L}(q) = \int q(Z) \ln\left\{\frac{p(X,Z)}{q(Z)}\right\} dZ
  and
  \mathrm{KL}(q \parallel p) = -\int q(Z) \ln\left\{\frac{p(Z \mid X)}{q(Z)}\right\} dZ
• The lower bound is used to monitor the re-estimation and to test for convergence
Predictive Density
• In using a Bayesian GMM we are interested in the predictive density for a new value x̂ of the observed variable
• Introducing the corresponding latent variable ẑ, we can show that
  p(\hat{x} \mid X) = \frac{1}{\hat{\alpha}} \sum_{k=1}^{K} \alpha_k \, \mathrm{St}\!\left(\hat{x} \mid m_k, L_k, \nu_k + 1 - D\right)
  where the kth component has mean mk and precision
  L_k = \frac{(\nu_k + 1 - D)\,\beta_k}{1 + \beta_k} \, W_k
• This mixture of Student's t distributions becomes a mixture of Gaussians as N → ∞
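A sketch of evaluating this predictive density with SciPy's multivariate Student's t (available in SciPy ≥ 1.6); note that SciPy parameterizes the t by a scale matrix, i.e., the inverse of the precision Lk used above:

```python
import numpy as np
from scipy.stats import multivariate_t   # SciPy >= 1.6

def predictive_density(x_new, alpha, beta, m, W, nu):
    """p(x_hat | X) = (1/alpha_hat) sum_k alpha_k St(x_hat | m_k, L_k, nu_k + 1 - D)."""
    D = m.shape[1]
    alpha_hat = alpha.sum()
    density = 0.0
    for k in range(len(alpha)):
        df = nu[k] + 1 - D
        L = (df * beta[k] / (1 + beta[k])) * W[k]    # precision of the kth Student component
        # SciPy's multivariate_t takes a scale matrix, the inverse of the precision
        density += (alpha[k] / alpha_hat) * multivariate_t.pdf(
            x_new, loc=m[k], shape=np.linalg.inv(L), df=df)
    return density
```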
Determining no. of components
• Plot of the variational lower bound L versus the number of components K
• There is a distinct peak at K=2
• For each value of K the model is trained from 100 different random starts
  – Results are shown as '+' marks