Lecture 12: Gaussian Mixture Models
Shuai Li
John Hopcroft Center, Shanghai Jiao Tong University
https://shuaili8.github.io
https://shuaili8.github.io/Teaching/VE445/index.html
Outline
• Generative models
• GMM
• EM
Generative Models (review)
Discriminative / Generative Models
• Discriminative models
  • Modeling the dependence of unobserved variables on observed ones
  • Also called conditional models
  • Deterministic
  • Probabilistic
• Generative models
  • Modeling the joint probability distribution of data
  • Given some hidden parameters or variables
  • Then do the conditional inference
Discriminative Models
• Discriminative models
  • Modeling the dependence of unobserved variables on observed ones
  • Also called conditional models
  • Deterministic: e.g. linear regression
  • Probabilistic: e.g. logistic regression
• Directly model the dependence for label prediction
• Easy to define the dependence on specific features and models
• Practically yield higher prediction performance
• E.g. linear regression, logistic regression, k-nearest neighbors, SVMs, (multi-layer) perceptrons, decision trees, random forests
Generative Models
• Generative models
  • Modeling the joint probability distribution of data
  • Given some hidden parameters or variables
  • Then do the conditional inference
• Recover the data distribution [essence of data science]
• Benefit from modeling hidden variables
• E.g. naive Bayes, hidden Markov models, Gaussian mixture models, Markov random fields, latent Dirichlet allocation
Discriminative Models vs Generative Models
• In general:
  • A discriminative model models the decision boundary between the classes
  • A generative model explicitly models the actual distribution of each class
• Example: our training set is a bag of fruits in which only apples and oranges are labeled (imagine a post-it note stuck to each fruit)
  • A generative model will model various attributes of the fruits, such as color, weight, shape, etc.
  • A discriminative model might model color alone, should that suffice to distinguish apples from oranges
Gaussian Mixture Models
Gaussian Mixture Models
• GMM is a clustering algorithm
• Differences with K-means:
  • K-means outputs the label of a sample
  • GMM outputs the probability that a sample belongs to a certain class
• GMM can also be used to generate new samples!
K-means vs GMM
Gaussian distribution
• Very common in probability theory and important in statistics
• Often used in the natural and social sciences to represent real-valued random variables whose distributions are not known
• Useful because of the central limit theorem
  • Averages of samples independently drawn from the same distribution converge in distribution to the normal with the true mean and variance; that is, they become normally distributed when the number of observations is sufficiently large
• Physical quantities that are expected to be the sum of many independent processes often have distributions that are nearly normal
• The probability density of the Gaussian distribution is
$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
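As a quick sanity check of this formula, a minimal numpy sketch (the sample points are illustrative) that evaluates the density:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """f(x | mu, sigma^2) = exp(-(x - mu)^2 / (2 sigma^2)) / sqrt(2 pi sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

xs = np.linspace(-4.0, 4.0, 9)
print(gaussian_pdf(xs, mu=0.0, sigma2=1.0))  # density peaks at x = mu
```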
High-dimensional Gaussian distribution
• The probability density of the Gaussian distribution on $x = (x_1, \ldots, x_d)^\top$ is
$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{\exp\!\left(-\frac{1}{2}(x-\mu)^\top \Sigma^{-1}(x-\mu)\right)}{\sqrt{(2\pi)^d\,|\Sigma|}}$$
  • where $\mu$ is the mean vector
  • $\Sigma$ is the symmetric covariance matrix (positive semi-definite)
• Compare with the one-dimensional density $f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
• E.g. the Gaussian distribution with … (figure omitted)
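A small sketch (numpy and scipy assumed; the parameter values are illustrative) that evaluates this density directly and checks it against scipy.stats.multivariate_normal:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mvn_pdf(x, mu, Sigma):
    """Density of N(mu, Sigma) at a d-dimensional point x."""
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)          # (x-mu)^T Sigma^{-1} (x-mu)
    norm_const = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
x = np.array([0.5, 0.5])
print(mvn_pdf(x, mu, Sigma))
print(multivariate_normal.pdf(x, mean=mu, cov=Sigma))   # should match
```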
Mixture of Gaussian
• The probability density of a mixture of K Gaussians is
$$p(x) = \sum_{j=1}^{K} w_j \cdot \mathcal{N}(x \mid \mu_j, \Sigma_j)$$
where $w_j$ is the prior probability of the $j$-th Gaussian
• Example (figure)
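A direct transcription of this density into code, assuming scipy's multivariate_normal and illustrative parameter values:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_pdf(x, ws, mus, Sigmas):
    """Density of a K-component Gaussian mixture at point x."""
    return sum(w * multivariate_normal.pdf(x, mean=mu, cov=S)
               for w, mu, S in zip(ws, mus, Sigmas))

ws = [0.3, 0.7]                                  # prior weights, must sum to 1
mus = [np.zeros(2), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.array([[1.0, 0.5], [0.5, 2.0]])]
print(gmm_pdf(np.array([1.0, 1.0]), ws, mus, Sigmas))
```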
Examples
(Figure: observed samples drawn from a Gaussian mixture)
Data generation
• Let the parameter set be $\theta = \{w_j, \mu_j, \Sigma_j : j\}$; then the probability density of the Gaussian mixture can be written as
$$p(x \mid \theta) = \sum_{j=1}^{K} w_j \cdot \mathcal{N}(x \mid \mu_j, \Sigma_j)$$
• This is equivalent to generating data points in two steps (see the sketch below):
  • Select which component $j$ the data point belongs to, according to the categorical (or multinoulli) distribution over $w_1, \ldots, w_K$
  • Generate the data point according to the probability of the $j$-th component
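A minimal sampler following these two steps (numpy only; the parameter values are illustrative assumptions, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gmm(n, ws, mus, Sigmas):
    """Draw n points from a GMM: (1) pick a component, (2) sample from it."""
    K = len(ws)
    zs = rng.choice(K, size=n, p=ws)       # step 1: categorical over w_1..w_K
    xs = np.array([rng.multivariate_normal(mus[z], Sigmas[z]) for z in zs])
    return xs, zs

ws = [0.3, 0.7]
mus = [np.zeros(2), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.array([[1.0, 0.5], [0.5, 2.0]])]
X, z = sample_gmm(500, ws, mus, Sigmas)
print(X.shape, np.bincount(z) / len(z))    # component frequencies approx. ws
```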
Learning task
• Given a dataset $X = \{x^{(1)}, x^{(2)}, \cdots, x^{(N)}\}$, train the GMM model
• Find the best $\theta$ that maximizes the probability $\mathbb{P}(X \mid \theta)$
• This is maximum likelihood estimation (MLE)
Introduce latent variable
• For data points $x^{(i)}$, $i = 1, \ldots, N$, write the probability as
$$\mathbb{P}(x^{(i)} \mid \theta) = \sum_{j=1}^{K} w_j \cdot \mathcal{N}(x^{(i)} \mid \mu_j, \Sigma_j)$$
where $\sum_{j=1}^{K} w_j = 1$
• Introduce a latent variable:
  • $z^{(i)}$ is the Gaussian cluster ID indicating which Gaussian $x^{(i)}$ comes from
  • $\mathbb{P}(z^{(i)} = j) = w_j$ ("prior")
  • $\mathbb{P}(x^{(i)} \mid z^{(i)} = j; \theta) = \mathcal{N}(x^{(i)} \mid \mu_j, \Sigma_j)$
  • $\mathbb{P}(x^{(i)} \mid \theta) = \sum_{j=1}^{K} \mathbb{P}(z^{(i)} = j) \cdot \mathbb{P}(x^{(i)} \mid z^{(i)} = j; \theta)$
Maximum likelihood
• Let $l(\theta) = \sum_{i=1}^{N} \log \mathbb{P}(x^{(i)}; \theta) = \sum_{i=1}^{N} \log \sum_{j} \mathbb{P}(x^{(i)}, z^{(i)} = j; \theta)$ be the log-likelihood
• We want to solve
$$\arg\max_\theta l(\theta) = \arg\max_\theta \sum_{i=1}^{N} \log \sum_{j=1}^{K} \mathbb{P}(z^{(i)} = j) \cdot \mathbb{P}(x^{(i)} \mid z^{(i)} = j; \theta) = \arg\max_\theta \sum_{i=1}^{N} \log \sum_{j=1}^{K} w_j \cdot \mathcal{N}(x^{(i)} \mid \mu_j, \Sigma_j)$$
• There is no closed-form solution from setting
$$\frac{\partial\, \mathbb{P}(X \mid \theta)}{\partial w} = 0, \qquad \frac{\partial\, \mathbb{P}(X \mid \theta)}{\partial \mu} = 0, \qquad \frac{\partial\, \mathbb{P}(X \mid \theta)}{\partial \Sigma} = 0$$
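As a concrete reference point, the log-likelihood itself is easy to evaluate even though it is hard to maximize in closed form; a sketch with the inner sum over components done in log space via logsumexp for numerical stability (scipy assumed):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, ws, mus, Sigmas):
    """l(theta) = sum_i log sum_j w_j N(x_i | mu_j, Sigma_j), for X of shape (N, d)."""
    # log_probs[i, j] = log w_j + log N(x_i | mu_j, Sigma_j)
    log_probs = np.column_stack([
        np.log(w) + multivariate_normal.logpdf(X, mean=mu, cov=S)
        for w, mu, S in zip(ws, mus, Sigmas)
    ])
    return logsumexp(log_probs, axis=1).sum()
```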
Likelihood maximization
• If we knew $z^{(i)}$ for all $i$, the problem would become
$$\arg\max_\theta l(\theta) = \arg\max_\theta \sum_{i=1}^{N} \log \mathbb{P}(x^{(i)}, z^{(i)} \mid \theta) = \arg\max_\theta \sum_{i=1}^{N} \left[ \log \mathbb{P}(x^{(i)} \mid \theta, z^{(i)}) + \log \mathbb{P}(z^{(i)} \mid \theta) \right] = \arg\max_\theta \sum_{i=1}^{N} \left[ \log \mathcal{N}(x^{(i)} \mid \mu_{z^{(i)}}, \Sigma_{z^{(i)}}) + \log w_{z^{(i)}} \right]$$
• The solution is an average over each cluster (see the sketch below):
$$w_j = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{z^{(i)} = j\}, \qquad \mu_j = \frac{\sum_{i=1}^{N} \mathbf{1}\{z^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{N} \mathbf{1}\{z^{(i)} = j\}}, \qquad \Sigma_j = \frac{\sum_{i=1}^{N} \mathbf{1}\{z^{(i)} = j\}\, (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^\top}{\sum_{i=1}^{N} \mathbf{1}\{z^{(i)} = j\}}$$
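If the labels were observed, the MLE reduces to per-cluster counting and averaging, as in this numpy sketch (the helper name and the assumption that z is given are ours; empty clusters are ignored for brevity):

```python
import numpy as np

def mle_given_labels(X, z, K):
    """Closed-form MLE of (w, mu, Sigma) when the component labels z are known."""
    N, d = X.shape
    ws, mus, Sigmas = [], [], []
    for j in range(K):
        Xj = X[z == j]                          # points assigned to cluster j
        ws.append(len(Xj) / N)                  # w_j = fraction of points in j
        mus.append(Xj.mean(axis=0))             # mu_j = cluster mean
        diff = Xj - mus[-1]
        Sigmas.append(diff.T @ diff / len(Xj))  # Sigma_j = cluster covariance
    return ws, mus, Sigmas
```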
Likelihood maximization (cont.)
• Given the parameters $\theta = \{w_j, \mu_j, \Sigma_j : j\}$, the posterior distribution of each latent variable $z^{(i)}$ can be inferred:
$$\mathbb{P}(z^{(i)} = j \mid x^{(i)}; \theta) = \frac{\mathbb{P}(x^{(i)}, z^{(i)} = j \mid \theta)}{\mathbb{P}(x^{(i)} \mid \theta)} = \frac{\mathbb{P}(x^{(i)} \mid z^{(i)} = j; \mu_j, \Sigma_j)\, \mathbb{P}(z^{(i)} = j \mid w)}{\sum_{j'=1}^{K} \mathbb{P}(x^{(i)} \mid z^{(i)} = j'; \mu_{j'}, \Sigma_{j'})\, \mathbb{P}(z^{(i)} = j' \mid w)}$$
• In shorthand,
$$w_j^{(i)} = \mathbb{P}(z^{(i)} = j \mid x^{(i)}; \theta) \propto \mathbb{P}(x^{(i)} \mid z^{(i)} = j; \mu_j, \Sigma_j)\, \mathbb{P}(z^{(i)} = j \mid w)$$
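These posteriors (the "responsibilities") are exactly what the E-step computes; a sketch that evaluates them for all points at once (scipy assumed, same parameter shapes as the sketches above):

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, ws, mus, Sigmas):
    """w_j^(i) = P(z_i = j | x_i; theta) for all i, j; returns an (N, K) matrix."""
    # Unnormalized posterior: prior w_j times likelihood N(x_i | mu_j, Sigma_j)
    unnorm = np.column_stack([
        w * multivariate_normal.pdf(X, mean=mu, cov=S)
        for w, mu, S in zip(ws, mus, Sigmas)
    ])
    return unnorm / unnorm.sum(axis=1, keepdims=True)   # normalize over j
```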
Likelihood maximization (cont.)
• For every possible value of the $z^{(i)}$'s, take the expectation with respect to the probability of $z^{(i)}$
• With known labels, the solution is equivalent to
$$w_j = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\{z^{(i)} = j\}, \qquad \sum_{i=1}^{N} \mathbf{1}\{z^{(i)} = j\}\,\mu_j = \sum_{i=1}^{N} \mathbf{1}\{z^{(i)} = j\}\,x^{(i)}, \qquad \sum_{i=1}^{N} \mathbf{1}\{z^{(i)} = j\}\,\Sigma_j = \sum_{i=1}^{N} \mathbf{1}\{z^{(i)} = j\}\,(x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^\top$$
• Taking expectations on both sides replaces each indicator $\mathbf{1}\{z^{(i)} = j\}$ with the probability $\mathbb{P}(z^{(i)} = j)$:
$$w_j = \frac{1}{N}\sum_{i=1}^{N} \mathbb{P}(z^{(i)} = j), \qquad \sum_{i=1}^{N} \mathbb{P}(z^{(i)} = j)\,\mu_j = \sum_{i=1}^{N} \mathbb{P}(z^{(i)} = j)\,x^{(i)}, \qquad \sum_{i=1}^{N} \mathbb{P}(z^{(i)} = j)\,\Sigma_j = \sum_{i=1}^{N} \mathbb{P}(z^{(i)} = j)\,(x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^\top$$
• Writing $w_j^{(i)} = \mathbb{P}(z^{(i)} = j \mid x^{(i)}; \theta)$, the solution becomes
$$w_j = \frac{1}{N}\sum_{i=1}^{N} w_j^{(i)}, \qquad \mu_j = \frac{\sum_{i=1}^{N} w_j^{(i)}\,x^{(i)}}{\sum_{i=1}^{N} w_j^{(i)}}, \qquad \Sigma_j = \frac{\sum_{i=1}^{N} w_j^{(i)}\,(x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^\top}{\sum_{i=1}^{N} w_j^{(i)}}$$
Likelihood maximization (cont.)
• If we know the distribution of the $z^{(i)}$'s, it is equivalent to maximize
$$l(\mathbb{P}_{z^{(i)}}, \theta) = \sum_{i=1}^{N} \sum_{j=1}^{K} \mathbb{P}(z^{(i)} = j) \log \frac{\mathbb{P}(x^{(i)}, z^{(i)} = j \mid \theta)}{\mathbb{P}(z^{(i)} = j)}$$
which is a lower bound of $l(\theta)$ (by Jensen's inequality; see the convergence proof below)
• Compare with the earlier objective when the values of the $z^{(i)}$ are known:
$$\arg\max_\theta \sum_{i=1}^{N} \log \mathbb{P}(x^{(i)}, z^{(i)} \mid \theta)$$
• Note that it is hard to directly maximize
$$l(\theta) = \sum_{i=1}^{N} \log \mathbb{P}(x^{(i)} \mid \theta) = \sum_{i=1}^{N} \log \sum_{j=1}^{K} \mathbb{P}(z^{(i)} = j \mid \theta)\, \mathbb{P}(x^{(i)} \mid z^{(i)} = j; \theta)$$
Expectation maximization methods
• E-step:
  • Infer the posterior distribution of the latent variables given the model parameters
• M-step:
  • Tune the parameters to maximize the data likelihood given the latent variable distribution
• EM methods iteratively execute the E-step and M-step until convergence
EM for GMM
• Repeat until convergence: {
  (E-step) For each $i, j$, set
$$w_j^{(i)} = \mathbb{P}(z^{(i)} = j \mid x^{(i)}; w, \mu, \Sigma)$$
  (M-step) Update the parameters
$$w_j = \frac{1}{N} \sum_{i=1}^{N} w_j^{(i)}, \qquad \mu_j = \frac{\sum_{i=1}^{N} w_j^{(i)}\, x^{(i)}}{\sum_{i=1}^{N} w_j^{(i)}}, \qquad \Sigma_j = \frac{\sum_{i=1}^{N} w_j^{(i)}\, (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^\top}{\sum_{i=1}^{N} w_j^{(i)}}$$
}
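Putting the two steps together gives a compact EM loop for GMMs. This is a minimal sketch, not the lecture's reference implementation: it assumes numpy, reuses the responsibilities helper from above, initializes the means from random data points (one simple choice), and adds a small ridge to the covariances for numerical stability:

```python
import numpy as np

def em_gmm(X, K, n_iters=100, seed=0):
    """Fit a K-component GMM to X (shape N x d) with EM; returns (ws, mus, Sigmas)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Initialization: uniform weights, random points as means, data covariance
    ws = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, K, replace=False)]
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iters):
        # E-step: posterior responsibilities w_j^(i), shape (N, K)
        R = responsibilities(X, ws, mus, Sigmas)
        # M-step: weighted versions of the closed-form MLE
        Nj = R.sum(axis=0)                        # effective cluster sizes
        ws = Nj / N
        mus = (R.T @ X) / Nj[:, None]
        for j in range(K):
            diff = X - mus[j]
            Sigmas[j] = (R[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(d)
    return ws, mus, Sigmas
```

On data from sample_gmm above, the recovered parameters should approach the generating ones, up to permutation of the components.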
Example
• Hidden variable: for each point, which Gaussian generated it?
$$p(x \mid \theta) = \sum_{j=1}^{K} w_j \cdot \mathcal{N}(x \mid \mu_j, \Sigma_j)$$
Example (cont.)
• E-step: for each point, estimate the probability that each Gaussian component generated it
Example (cont.)
• M-Step: modify the parameters according to the hidden variable to maximize the likelihood of the data
Example 2
Remaining issues
• Initialization:
  • GMM is a kind of EM algorithm, which is very sensitive to initial conditions
• Number of Gaussians:
  • Use some information-theoretic criterion to obtain the optimal K
  • Minimal description length (MDL)
Minimal description length (MDL)
• All statistical learning is about finding regularities in data, and the best hypothesis to describe the regularities in data is also the one that is able to compress the data most
• In an MDL framework, a good model is one that allows an efficient (i.e. short) encoding of the data, but whose own description is also efficient
• For the GMM:
  • $X$ is the vector of data elements
  • $p(x)$ is the probability given by the GMM
  • $P$ is the number of free parameters needed to describe the GMM: $P = K\left(d + d(d+1)/2\right) + (K - 1)$
• The MDL cost has two parts: the (negative) log-likelihood, i.e. the cost of coding the data, plus the cost of coding the GMM itself (see the sketch below)
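The slides name the two cost terms but not the exact formula; one common instantiation scores a fitted GMM as the negative log-likelihood plus (P/2) log N for the parameters. The sketch below assumes that form (it is not stated in the lecture) and reuses gmm_log_likelihood and em_gmm from above:

```python
import numpy as np

def mdl_score(X, ws, mus, Sigmas):
    """MDL = data cost (negative log-likelihood) + model cost.
    The (P/2) log N model-cost term is one standard choice, assumed here."""
    N, d = X.shape
    K = len(ws)
    P = K * (d + d * (d + 1) // 2) + (K - 1)   # free parameters of the GMM
    return -gmm_log_likelihood(X, ws, mus, Sigmas) + 0.5 * P * np.log(N)

# Pick the K that minimizes the score, e.g.:
# best_K = min(range(1, 10), key=lambda K: mdl_score(X, *em_gmm(X, K)))
```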
Example
(Figures: GMM fits with K = 2 and K = 16; 400 new data points generated from the 16-component GMM)
Remaining issues (cont.)
• Simplification of the covariance matrices
Application in computer vision
• Image segmentation
Application in computer vision (cont.)
• Object tracking: knowing the moving object's distribution in the first frame, we can localize the object in the following frames by tracking its distribution
Expectation and Maximization
Background: Latent variable models
• Some of the variables in the model are not observed
• Examples: mixture model, Hidden Markov Models, LDA, etc
Background: Marginal likelihood
• Joint model $\mathbb{P}(x, z \mid \theta)$, where $\theta$ is the model parameter
• With $z$ unobserved, we marginalize out $z$ and use the marginal log-likelihood for learning
$$\mathbb{P}(x \mid \theta) = \sum_{z} \mathbb{P}(x, z \mid \theta)$$
• Example: mixture model
$$\mathbb{P}(x \mid \theta) = \sum_{k} \mathbb{P}(x \mid z = k, \theta_k)\, \mathbb{P}(z = k \mid \theta_k) = \sum_{k} \pi_k\, \mathbb{P}(x \mid z = k, \theta_k)$$
where the $\pi_k$ are the mixing proportions
Examples
• Mixture of Bernoulli
• Mixture of Gaussians
• Hidden Markov Model
Learning the hidden variable 𝑍
• If all $z$ are observed, the likelihood factorizes and learning is relatively easy
• If $z$ is not observed, we have to handle a sum inside the log
  • Idea 1: ignore this problem; simply take derivatives and follow the gradient
  • Idea 2: use the current $\theta$ to estimate $z$, fill the values in, and do fully-observed learning
• Is there a better way?
Example
• Let the events be "grades in a class"
• Assume we want to estimate $\mu$ from data. In a class, suppose
  • $a$ students get A, $b$ students get B, $c$ students get C, and $d$ students get D
• What is the maximum likelihood estimate of $\mu$ given $a, b, c, d$?
Example (cont.)
• For the MLE of $\mu$, set the derivative of the log-likelihood with respect to $\mu$ to zero
• Solve for $\mu$
• This gives the maximum likelihood estimate of $\mu$ for the observed class counts
Example (cont.)
• What if the observed information is partial: we observe only $h$, the total number of students who got A or B, together with $c$ and $d$ (the individual counts $a$ and $b$ are hidden)?
• What is the maximum likelihood estimator of $\mu$ now?
Example (cont.)
• What is the MLE of $\mu$ now?
• We can answer this question circularly:
  • Expectation: if we knew the value of $\mu$, we could compute the expected values of $a$ and $b$, since the ratio $a : b$ should be the same as the ratio $\frac{1}{2} : \mu$
  • Maximization: if we knew the expected values of $a$ and $b$, we could compute the maximum likelihood value of $\mu$
Example – Algorithm
• We begin with a guess for $\mu$, then iterate between EXPECTATION and MAXIMIZATION to improve our estimates of $\mu$ and $a, b$
• Define:
  • $\mu^{(t)}$: the estimate of $\mu$ on the $t$-th iteration
  • $b^{(t)}$: the estimate of $b$ on the $t$-th iteration
Example – Algorithm (cont.)
• The iteration will converge
• But it is only guaranteed to converge to a local optimum
General EM methods
• Repeat until convergence: {
  (E-step) For each data point $i$, set
$$q_i(j) = \mathbb{P}(z^{(i)} = j \mid x^{(i)}; \theta)$$
  (M-step) Update the parameters
$$\theta \leftarrow \arg\max_\theta l(q_i, \theta) = \arg\max_\theta \sum_{i=1}^{N} \sum_{j} q_i(j)\, \log \frac{\mathbb{P}(x^{(i)}, z^{(i)} = j; \theta)}{q_i(j)}$$
}
Convergence of EM
• Denote by $\theta^{(t)}$ and $\theta^{(t+1)}$ the parameters of two successive iterations of EM; we want to prove that
$$l(\theta^{(t)}) \le l(\theta^{(t+1)})$$
where
$$l(\theta) = \sum_{i=1}^{N} \log \mathbb{P}(x^{(i)}; \theta) = \sum_{i=1}^{N} \log \sum_{j} \mathbb{P}(x^{(i)}, z^{(i)} = j; \theta)$$
• This shows EM always monotonically improves the log-likelihood, which ensures EM will at least converge to a local optimum
Proof
• Starting from $\theta^{(t)}$, choose the posterior of the latent variable
$$q_i^{(t)}(j) = p(z^{(i)} = j \mid x^{(i)}; \theta^{(t)})$$
• By Jensen's inequality on the (concave) log function, for any distribution $q$,
$$l(q, \theta^{(t)}) = \sum_{i=1}^{N} \sum_{j} q_i(j)\, \log \frac{p(x^{(i)}, z^{(i)} = j; \theta^{(t)})}{q_i(j)} \le \sum_{i=1}^{N} \log \sum_{j} q_i(j) \cdot \frac{p(x^{(i)}, z^{(i)} = j; \theta^{(t)})}{q_i(j)} = \sum_{i=1}^{N} \log \sum_{j} p(x^{(i)}, z^{(i)} = j; \theta^{(t)}) = l(\theta^{(t)})$$
• When $q_i(j) = q_i^{(t)}(j) = p(z^{(i)} = j \mid x^{(i)}; \theta^{(t)})$, the above inequality holds with equality, so
$$l(\theta^{(t)}) = l(q_i^{(t)}, \theta^{(t)}) \ge l(q, \theta^{(t)})$$
Proof (cont.)
• The parameter $\theta^{(t+1)}$ is then obtained by maximizing over $\theta$
$$l(q_i^{(t)}, \theta) = \sum_{i=1}^{N} \sum_{j} q_i^{(t)}(j)\, \log \frac{p(x^{(i)}, z^{(i)} = j; \theta)}{q_i^{(t)}(j)}$$
• Thus
$$l(\theta^{(t+1)}) \ge l(q_i^{(t)}, \theta^{(t+1)}) \ge l(q_i^{(t)}, \theta^{(t)}) = l(\theta^{(t)})$$
• Since the likelihood is at most 1, EM will converge (to a local optimum)
Example
• Following the previous example, suppose $h = 20$, $c = 10$, $d = 10$, and $\mu^{(0)} = 0$
• Then the iterates $\mu^{(t)}, b^{(t)}$ converge (table of iterates omitted)
• Convergence is generally linear: the error decreases by a constant factor each time step
Advantages
• EM is often applied to NP-hard problems
• Widely used
• Often effective
K-means and EM
• After initializing the positions of the centers, K-means iteratively executes the following two operations (see the sketch below):
  • Assign the labels of the points based on their distances to the centers
    • A specific posterior distribution of the latent variable
    • E-step
  • Update the positions of the centers based on the labeling results
    • Given the labels, optimize the model parameters
    • M-step
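Under this correspondence, K-means is EM with a hard (one-hot) posterior in the E-step and a means-only M-step; a minimal sketch under those assumptions (numpy only):

```python
import numpy as np

def kmeans(X, K, n_iters=50, seed=0):
    """K-means as 'hard EM': one-hot responsibilities, mean-only M-step."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]
    for _ in range(n_iters):
        # E-step (hard): assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        z = dists.argmin(axis=1)
        # M-step: each center becomes the mean of its assigned points
        for j in range(K):
            if np.any(z == j):
                centers[j] = X[z == j].mean(axis=0)
    return centers, z
```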