Markov Chain Monte Carlo (MCMC)

Jin Young Choi

Outline

Monte Carlo: Sample from a distribution to estimate the distribution

Markov Chain Monte Carlo (MCMC)

- Applied to Clustering, Unsupervised Learning, Bayesian Inference

Importance Sampling

Metropolis-Hastings Algorithm

Gibbs Sampling

Markov Blanket in Sampling for Bayesian Network

Example: Estimation of Gaussian Mixture Model

$p(x \mid D) = \sum_{z,\theta} p(x, z, \theta \mid D) = \;?, \quad p(z \mid x, \theta) = \;?, \quad p(\theta \mid x, z) = \;?$

Markov Chain Monte Carlo (MCMC)

Monte Carlo: Sample from a distribution

- to estimate the distribution, for GMM estimation and clustering (labeling, unsupervised learning)

- to compute max, mean

Markov Chain Monte Carlo: sampling using "local" information

- Generic "problem-solving technique"

- decision/inference/optimization/learning problems

- generic, but not necessarily very efficient


Monte Carlo Integration

General problem: evaluating

$\mathbb{E}_P[h(X)] = \int h(x)\,P(x)\,dx$ can be difficult (assuming $\int h(x)\,P(x)\,dx < \infty$).

If we can draw samples $x^{(s)} \sim P(x)$, then we can estimate

$\mathbb{E}_P[h(X)] \approx \bar{h}_N = \frac{1}{N}\sum_{s=1}^{N} h(x^{(s)}).$

Monte Carlo integration is great if you can sample from the target distribution

• But what if you can't sample from the target?

• Importance sampling: use a simpler proposal distribution instead


Importance Sampling

Idea of importance sampling:

Draw the sample from a proposal distribution $Q(\cdot)$ and re-weight the integral using importance weights so that the correct distribution is targeted:

๐”ผ๐‘ƒ โ„Ž ๐‘‹ = โˆซโ„Ž ๐‘ฅ ๐‘ƒ ๐‘ฅ

๐‘„ ๐‘ฅ๐‘„ ๐‘ฅ ๐‘‘๐‘ฅ = ๐”ผ๐‘„

โ„Ž ๐‘‹ ๐‘ƒ ๐‘‹

๐‘„ ๐‘‹.

Hence, given an i.i.d. sample $x^{(s)}$ from $Q$, our estimator becomes

๐ธ๐‘„โ„Ž ๐‘‹ ๐‘ƒ ๐‘‹

๐‘„ ๐‘‹=1

๐‘

๐‘ =1

๐‘โ„Ž ๐‘ฅ ๐‘  ๐‘ƒ ๐‘ฅ ๐‘ 

๐‘„ ๐‘ฅ ๐‘ 


Limitations of Monte Carlo

Direct (unconditional) sampling

• Hard to get rare events in high-dimensional spaces (→ Gibbs sampling)

Importance sampling

• Does not work well if the proposal $Q(x)$ is very different from the target $P(x)$

• Yet constructing a $Q(x)$ similar to $P(x)$ can be difficult

Markov Chain

• Intuition: instead of a fixed proposal $Q(x)$, what if we could use an adaptive proposal?

• $X_{t+1}$ depends only on $X_t$, not on $X_0, X_1, \ldots, X_{t-1}$ (→ Markov chain)


Markov Chains: Notation & Terminology

Countable (finite) state space $\Omega$ (e.g., $\mathbb{N}$)

Sequence of random variables $X_t$ on $\Omega$ for $t = 0, 1, 2, \ldots$

Definition: $X_t$ is a Markov chain if

$P(X_{t+1} = y \mid X_t = x_t, \ldots, X_0 = x_0) = P(X_{t+1} = y \mid X_t = x_t)$

Notation: $P(X_{t+1} = i \mid X_t = j) = p_{ji}$

- Random walks

Example.

$p_{AA} = P(X_{t+1} = A \mid X_t = A) = 0.6$
$p_{AE} = P(X_{t+1} = E \mid X_t = A) = 0.4$
$p_{EA} = P(X_{t+1} = A \mid X_t = E) = 0.7$
$p_{EE} = P(X_{t+1} = E \mid X_t = E) = 0.3$


Markov Chains: Notation & Terminology

Let ๐‘ท = ๐‘๐‘–๐‘— - transition probability matrix

- dimension ฮฉ ร— ฮฉ

Let ๐œ‹๐‘ก ๐‘— = ๐‘ƒ ๐‘‹๐‘ก = ๐‘—

- ๐œ‹0 : initial probability distribution

Then ๐œ‹๐‘ก ๐‘— = ฯƒ๐‘– ๐œ‹๐‘กโˆ’1 ๐‘– ๐‘๐‘–๐‘— = ๐œ‹๐‘กโˆ’1๐‘ท ๐‘— = ๐œ‹0๐‘ท๐‘ก ๐‘—

๐œ‹๐‘ก = ๐œ‹๐‘กโˆ’1๐‘ท = ๐œ‹๐‘กโˆ’2๐‘ท2 =โˆ™โˆ™โˆ™= ๐œ‹0๐‘ท

๐‘ก


Markov Chains: Fundamental Properties

Theorem:

- If the limit $\lim_{t\to\infty} \pi_t = \pi$ exists and $\Omega$ is finite, then

$(\pi\boldsymbol{P})(j) = \pi(j)$ and $\sum_j \pi(j) = 1$,

and such a $\pi$ is the unique solution of $\pi\boldsymbol{P} = \pi$ ($\pi$ is called a stationary distribution)

- No matter where we start, after some time, we will be in any state $j$ with probability $\approx \pi(j)$
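For the two-state example, the sketch below (ours) iterates $\pi_t = \pi_{t-1}\boldsymbol{P}$ from an arbitrary $\pi_0$ and checks the fixed-point condition $\pi\boldsymbol{P} = \pi$:

```python
import numpy as np

P = np.array([[0.6, 0.4],
              [0.7, 0.3]])

pi = np.array([1.0, 0.0])        # arbitrary initial distribution pi_0
for _ in range(50):              # pi_t = pi_{t-1} P converges quickly here
    pi = pi @ P

print(pi)                        # ~ [0.636, 0.364], i.e., (7/11, 4/11)
print(np.allclose(pi @ P, pi))   # stationarity check: pi P = pi -> True
```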


Markov Chain Monte Carlo

MCMC algorithms feature adaptive proposals

- Instead of $Q(x')$, they use $Q(x' \mid x)$, where $x'$ is the new state being sampled and $x$ is the previous sample

- As $x$ changes, $Q(x' \mid x)$ can also change (as a function of $x'$)

- The acceptance probability is set to

$A(x' \mid x) = \min\!\left(1,\; \frac{P(x')/Q(x' \mid x)}{P(x)/Q(x \mid x')}\right)$

- No matter where we start, after some time, we will be in any state $j$ with probability $\approx \pi(j)$

[Figure: two-state transition diagrams with probabilities $p_{11}, p_{12}, p_{21}, p_{22}$]

Note: $Q(x' \mid x) = Q(x \mid x')$ for a Gaussian proposal centered on the current sample. Why?

Metropolis-Hastings

Draws a sample $x'$ from $Q(x' \mid x)$, where $x$ is the previous sample

The new sample $x'$ is accepted or rejected with some probability $A(x' \mid x)$

• This acceptance probability is

$A(x' \mid x) = \min\!\left(1,\; \frac{P(x')/Q(x' \mid x)}{P(x)/Q(x \mid x')}\right)$

โ€ข ๐ด ๐‘ฅโ€ฒ|๐‘ฅ is like a ratio of importance sampling weights

โ€ข๐‘ƒ ๐‘ฅโ€ฒ

๐‘„ ๐‘ฅโ€ฒ ๐‘ฅis the importance weight for ๐‘ฅโ€ฒ,

๐‘ƒ ๐‘ฅ

๐‘„ ๐‘ฅ|๐‘ฅโ€ฒis the importance weight for ๐‘ฅ

โ€ข We divide the importance weight for ๐‘ฅโ€ฒ by that of ๐‘ฅ

โ€ข Notice that we only need to compute ๐‘ƒ ๐‘ฅโ€ฒ /๐‘ƒ ๐‘ฅ rather than ๐‘ƒ ๐‘ฅโ€ฒ or ๐‘ƒ ๐‘ฅ separately

โ€ข ๐ด ๐‘ฅโ€ฒ ๐‘ฅ ensures that, after sufficiently many draws, our samples will come from the true distribution ๐‘ƒ(๐‘ฅ)


๐”ผ๐‘ƒ โ„Ž ๐‘‹ = โˆซโ„Ž ๐‘ฅ ๐‘ƒ ๐‘ฅ

๐‘„ ๐‘ฅ๐‘„ ๐‘ฅ ๐‘‘๐‘ฅ = ๐”ผ๐‘„

โ„Ž ๐‘‹ ๐‘ƒ ๐‘‹

๐‘„ ๐‘‹

๐‘„ ๐‘ฅโ€ฒ|๐‘ฅ = ๐‘„ ๐‘ฅโ€ฒ|๐‘ฅ for Gaussian Why?

The MH Algorithm

Initialize the starting state $x^{(0)}$ and set $t = 0$

Burn-in: while samples have "not converged"

• $x = x^{(t)}$

• $t = t + 1$

• Sample $x^* \sim Q(x^* \mid x)$ // draw from proposal

• Sample $u \sim \mathrm{Uniform}(0, 1)$ // draw acceptance threshold

• If $u < A(x^* \mid x) = \min\!\left(1,\; \dfrac{P(x^*)\,Q(x \mid x^*)}{P(x)\,Q(x^* \mid x)}\right)$, then $x^{(t)} = x^*$ // transition

• Else $x^{(t)} = x$ // stay in current state

• Repeat until convergence
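A compact Python sketch of this loop (our own illustration; the bimodal target and helper names are assumptions, not from the slides). The proposal is a symmetric Gaussian random walk, so $Q(x \mid x^*)/Q(x^* \mid x) = 1$ and only the ratio of target values remains:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_unnorm(x):
    # Bimodal target (our choice): unnormalized mixture of N(-3, 1) and N(3, 1).
    return np.exp(-0.5 * (x + 3) ** 2) + np.exp(-0.5 * (x - 3) ** 2)

def metropolis_hastings(n_samples, step=2.0, x0=0.0):
    x = x0
    samples = []
    for _ in range(n_samples):
        x_star = rng.normal(x, step)   # x* ~ Q(x*|x), a Gaussian centered on x
        # Symmetric Q cancels in A; the normalizing constant of P cancels too.
        A = min(1.0, p_unnorm(x_star) / p_unnorm(x))
        if rng.uniform() < A:          # accept with probability A(x*|x)
            x = x_star
        samples.append(x)              # on rejection, repeat the current x
    return np.array(samples)

samples = metropolis_hastings(50_000)
print(samples[10_000:].mean())         # ~ 0 for this symmetric bimodal target
```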


The MH Algorithm

Example:

• Let $Q(x' \mid x)$ be a Gaussian centered on $x$

• We're trying to sample from a bimodal distribution $P(x)$

$A(x' \mid x) = \min\!\left(1,\; \frac{P(x')/Q(x' \mid x)}{P(x)/Q(x \mid x')}\right)$

[Figure: a sequence of frames stepping through successive proposals, acceptances, and rejections on the bimodal target]


Gibbs Sampling

Gibbs Sampling is an MCMC algorithm that samples each random variable of a graphical model, one at a time

• GS is a special case of the MH algorithm

Consider a factored state space

• $x \in \Omega$ is a vector $x = (x_1, \ldots, x_m)$

• Notation: $x_{-i} = (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_m)$


Gibbs Sampling

The GS algorithm:

1. Suppose the graphical model contains variables $x_1, \ldots, x_n$
2. Initialize starting values for $x_1, \ldots, x_n$
3. Do until convergence:
   1. Pick a component $i \in \{1, \ldots, n\}$
   2. Sample a value $z \sim P(x_i \mid x_{-i})$ and update $x_i \leftarrow z$

When we update $x_i$, we immediately use its new value for sampling other variables $x_j$

Using the full conditional $P(x_i \mid x_{-i})$ as the proposal makes the MH acceptance probability equal to 1.


๐ด ๐‘ฅโ€ฒ|๐‘ฅ = min 1,๐‘ƒ ๐‘ฅโ€ฒ /๐‘„ ๐‘ฅโ€ฒ|๐‘ฅ

๐‘ƒ ๐‘ฅ /๐‘„ ๐‘ฅ|๐‘ฅโ€ฒ

Markov Blankets

The conditional $P(x_i \mid x_{-i})$ can be obtained using the Markov blanket

• Let $\mathrm{MB}(x_i)$ be the Markov blanket of $x_i$; then

$P(x_i \mid x_{-i}) = P(x_i \mid \mathrm{MB}(x_i))$

For a Bayesian network, the Markov blanket of $x_i$ is the set containing its parents, children, and co-parents


Gibbs Sampling: An Example

Consider the GMM:

• The data $x$ (positions) are drawn from two Gaussian distributions

• We do NOT know the class $y$ of each datum, nor the parameters of the Gaussians

• Initialize the class of each datum randomly at $t = 0$


[Figure: data drawn from a Gaussian with mean $(3, -3)$ and variance 3, and a Gaussian with mean $(1, 2)$ and variance 2]

Gibbs Sampling: An Example

Sampling ๐‘ƒ ๐‘ฆ๐‘– ๐‘ฅโˆ’๐‘– , ๐‘ฆโˆ’๐‘–) at ๐‘ก = 1, we compute:

๐‘ƒ ๐‘ฆ๐‘– = 0 |๐‘ฅโˆ’๐‘– , ๐‘ฆโˆ’๐‘– โˆ ๐’ฉ ๐‘ฅ๐‘–|๐œ‡๐‘ฅโˆ’๐‘–,0, ๐œŽ๐‘ฅโˆ’๐‘–,0๐‘ƒ ๐‘ฆ๐‘– = 1 | ๐‘ฅโˆ’๐‘– , ๐‘ฆโˆ’๐‘– โˆ ๐’ฉ ๐‘ฅ๐‘–|๐œ‡๐‘ฅโˆ’๐‘–,1, ๐œŽ๐‘ฅโˆ’๐‘–,1

where๐œ‡๐‘ฅโˆ’๐‘–,๐พ = ๐‘€๐ธ๐ด๐‘ ๐‘‹๐‘–๐พ , ๐œŽ๐‘ฅโˆ’๐‘–,๐พ = ๐‘‰๐ด๐‘… ๐‘‹๐‘–๐พ๐‘‹๐‘–๐พ = ๐‘ฅ๐‘— | ๐‘ฅ๐‘— โˆˆ ๐‘ฅโˆ’๐‘– , ๐‘ฆ๐‘— = ๐พ

And update ๐‘ฆ๐‘– with ๐‘ƒ ๐‘ฆ๐‘– |๐‘ฅโˆ’๐‘– , ๐‘ฆโˆ’๐‘– and repeat for all data
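A simplified 1-D sketch of this scheme (the slides use 2-D data; the cluster locations and all names below are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data: two Gaussian clusters with unknown labels y.
x = np.concatenate([rng.normal(-3.0, 1.0, 100), rng.normal(3.0, 1.0, 100)])
n = len(x)
y = rng.integers(0, 2, n)                    # t = 0: random initial classes

def normal_pdf(v, mu, sigma):
    return np.exp(-0.5 * ((v - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

for t in range(50):                          # Gibbs sweeps
    for i in range(n):                       # resample each label given the rest
        probs = np.empty(2)
        for K in (0, 1):
            others = x[(np.arange(n) != i) & (y == K)]   # the set X_i^K
            if others.size < 2:              # guard against an emptied class
                probs[K] = 1e-12
                continue
            mu, sigma = others.mean(), others.std() + 1e-6
            probs[K] = normal_pdf(x[i], mu, sigma)       # ~ N(x_i | mu_K, sigma_K)
        y[i] = rng.choice(2, p=probs / probs.sum())      # update y_i immediately

# Within-cluster label agreement (near 1 when clusters are well separated).
print((y[:100] == y[0]).mean(), (y[100:] == y[100]).mean())
```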


[Figure: class assignments (0/1) updated as $i$ iterates within the same sweep $t$]

Gibbs Sampling: An Example

Now ๐‘ก = 2, and we repeat the procedure to sample new class of each data

And similarly for ๐‘ก = 3, 4, โ€ฆ



Gibbs Sampling: An Example

Data ๐‘–โ€™s class can be chosen with tendency of ๐‘ฆ๐‘–โ€ข The classes of the data can be oscillated after the sufficient sequences

โ€ข We can assume the class of datum as more frequently selected class

In the simulation, the final class is correct with the probability of 94.9% at ๐‘ก =100


Interim Summary

Markov Chain Monte Carlo methods use adaptive proposals $Q(x' \mid x)$ to sample from the true distribution $P(x)$

Metropolis-Hastings allows you to specify any proposal $Q(x' \mid x)$

• But choosing a good $Q(x' \mid x)$ requires care

Gibbs sampling sets the proposal $Q(x_i' \mid x_{-i})$ to the conditional distribution $P(x_i' \mid x_{-i})$

• Acceptance rate is always 1

• But remember that high acceptance usually entails slow exploration

• In fact, there are better MCMC algorithms for certain models
