TRANSCRIPT
lecture 14: bayesian inference and monte carlo methods
STAT 545: Intro. to Computational Statistics
Vinayak Rao, Purdue University
November 1, 2018
Bayesian inference
Given a set of observations X, MLE maximizes the likelihood:

θ_MLE = argmax_θ p(X|θ)

What if we believe θ is close to 0, is sparse, or is smooth?

Encode this with a ‘prior’ probability p(θ).

• Represents a priori beliefs about θ

Given observations, we can calculate the ‘posterior’:

p(θ|X) = p(X|θ)p(θ) / p(X)

We can calculate the maximum a posteriori (MAP) solution:

θ_MAP = argmax_θ p(θ|X)

A point estimate discards information about uncertainty in θ.
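As a concrete sketch of the MLE/MAP contrast (my own Python example, not from the slides; the MAP formula below is the standard closed form for a Gaussian mean with known unit variance under an assumed N(0, s0²) prior):

```python
import random

random.seed(0)
# Data from N(theta_true, 1); we place a N(0, s0^2) prior on theta.
theta_true = 2.0
x = [random.gauss(theta_true, 1.0) for _ in range(50)]

theta_mle = sum(x) / len(x)                   # argmax of the likelihood: sample mean
s0_sq = 1.0                                   # assumed prior variance
theta_map = sum(x) / (len(x) + 1.0 / s0_sq)   # argmax of the posterior (closed form)
# The prior at 0 shrinks theta_map toward 0 relative to theta_mle,
# but neither point estimate says anything about posterior uncertainty.
```

Note how the MAP estimate encodes the prior belief "θ is close to 0" as shrinkage, while discarding the rest of the posterior.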
Bayesian inference
Bayesian inference works with the entire distribution p(θ|X).

• Represents a posteriori beliefs about θ

Allows us to maintain and propagate uncertainty.

E.g. consider the likelihood p(X|θ) = N(X|θ, 1)

• What is a good prior over θ?
• What is a convenient prior over θ?
Bayesian inference
The posterior distribution p(θ|X) ∝ p(X|θ)p(θ) summarizes all new information about θ provided by the data.

In practice, these distributions are unwieldy, and we need approximations.

An exception: ‘conjugate priors’ for exponential-family distributions.
Conjugate exponential family priors
Let observations come from an exponential family:

p(x|θ) = (1/Z(θ)) h(x) exp(θ⊤ϕ(x))
       = h(x) exp(θ⊤ϕ(x) − ζ(θ)), with ζ(θ) = log Z(θ)

Place a prior over θ:

p(θ|a, b) ∝ η(θ) exp(θ⊤a − ζ(θ)b)

Given a set of observations X = {x1, . . . , xN},

p(θ|X) ∝ (∏_{i=1}^N h(xi) exp(θ⊤ϕ(xi) − ζ(θ))) η(θ) exp(θ⊤a − ζ(θ)b)
       ∝ η(θ) exp(θ⊤(a + ∑_{i=1}^N ϕ(xi)) − ζ(θ)(b + N))
Conjugate priors (contd.)
Prior over θ: an exp. fam. distribution with parameters (a, b).

Posterior: the same family, with parameters (a + ∑_{i=1}^N ϕ(xi), b + N).

A rare instance where an analytical expression for the posterior exists.

In most cases a simple prior quickly leads to a complicated posterior, requiring Monte Carlo methods.

Note the conjugate prior is an entire family of distributions.

• The actual distribution is chosen by setting the parameters (a, b) (a has the same dimension as ϕ; b is a scalar)
• These might be set by e.g. talking to a domain expert.
Conjugate priors: Beta-Bernoulli example
Let xi ∈ {0, 1} indicate if a new drug works for subject i.
The unknown probability of success is π: x ∼ Bern(π).
p(x|π) = π^{1(x=1)} (1 − π)^{1(x=0)}
       = exp(1(x=1) log π + (1 − 1(x=1)) log(1 − π))
       = (1 − π) exp(1(x=1) log(π/(1 − π)))
       = (1/(1 + exp(θ))) exp(ϕ(x)θ)

This is an exponential family distribution, with θ = log(π/(1 − π)), ϕ(x) = 1(x = 1), h(x) = 1, and Z(θ) = 1 + exp(θ). Defining ζ(θ) = log Z(θ) as in the previous slide,

p(x|θ) = exp(ϕ(x)θ − ζ(θ))
Conjugate priors: Beta-Bernoulli example
When θ = log(π/(1 − π)) is unknown, a Bayesian places a prior on it.

As before, define an exp. fam. prior with parameters a:

p(θ|a) ∝ exp(a1θ + a2ζ(θ))

Then given data X = (x1, . . . , xN),

p(θ|a, X) ∝ p(θ, X|a) ∝ exp((a1 + ∑_{i=1}^N 1(xi = 1))θ + (a2 − N)ζ(θ))

Thus, the posterior is in the same family as the prior, but with updated parameters (a1 + ∑_{i=1}^N 1(xi = 1), a2 − N).
Conjugate priors: Beta-Bernoulli example
Looking at the prior more carefully, we see:

p(θ|a) ∝ exp(a1θ + a2ζ(θ))
       ∝ exp(a1 log(π/(1 − π)) + a2 log(1 − π))
       ∝ π^{a1}(1 − π)^{a2−a1}
       = π^{b1−1}(1 − π)^{b2−1}

This is just the Beta(b1, b2) distribution, and you can check that the posterior is Beta(b1 + ∑_{i=1}^N 1(xi = 1), b2 + ∑_{i=1}^N 1(xi = 0)).
b1 and b2 are sometimes called pseudo-observations, and capture our prior beliefs: before seeing any x’s, our prior is as if we saw b1 successes and b2 failures. After seeing data, we factor actual observations into the pseudo-observations.
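The Beta-Bernoulli update is just count bookkeeping, which a short Python sketch (hypothetical numbers, not from the slides) makes explicit:

```python
# Prior pseudo-counts: as if we had already seen b1 successes and b2 failures.
b1, b2 = 2.0, 2.0
x = [1, 0, 1, 1, 0, 1, 1, 1]   # observed binary outcomes

post_b1 = b1 + sum(1 for xi in x if xi == 1)   # add observed successes
post_b2 = b2 + sum(1 for xi in x if xi == 0)   # add observed failures
post_mean = post_b1 / (post_b1 + post_b2)      # posterior mean of pi under Beta(post_b1, post_b2)
```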
monte carlo methods
![Page 28: Lecture 14: Bayesian inference and Monte Carlo methods ... · lecture14:bayesianinferenceand montecarlomethods STAT545:Intro.toComputationalStatistics VinayakRao PurdueUniversity](https://reader033.vdocuments.net/reader033/viewer/2022053001/5f055bed7e708231d412928b/html5/thumbnails/28.jpg)
Monte Carlo integration
What about the situation when the posterior p(θ|X) is no longer simple or available in closed form?

What information about p(θ|X) do we really need?

Typically, expectations of different functions g:

E_{θ|X}[g] = ∫ g(θ) p(θ|X) dθ

What is g to calculate 1) the mean, 2) the variance, 3) p(θ > 10|X)?
Monte Carlo integration
Let us forget the posterior distribution p(θ|X), and consider some general probability distribution p(x). We want

µ := E_p[g] = ∫_X g(x) p(x) dx

Sampling approximation: rather than visit all points in X, calculate a summation over a finite set.

Monte Carlo approximation:

• Obtain points by sampling from p(x): xi ∼ p
• Approximate integration with summation:

µ ≈ µ̂ := (1/N) ∑_{i=1}^N g(xi)
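The two steps can be sketched in a few lines; here is a Python version (the course's code is in R) estimating E[X²] = 1 for X ∼ N(0, 1):

```python
import random

random.seed(1)
N = 100_000
samples = [random.gauss(0.0, 1.0) for _ in range(N)]   # x_i ~ p = N(0, 1)
mu_hat = sum(x * x for x in samples) / N               # (1/N) sum_i g(x_i), with g(x) = x^2
# True value: E[X^2] = Var(X) = 1
```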
Monte Carlo integration
µ̂ = (1/N) ∑_{i=1}^N g(xi)

If xi ∼ p,

E_p[µ̂] = (1/N) ∑_{i=1}^N E_p[g] = µ (unbiased estimate)

Var_p[µ̂] = (1/N) Var_p[g], so Error = StdDev ∝ N^{−1/2}

(1/N) ∑_{i=1}^N g(xi) → E_p[g] = µ as N → ∞ (consistent estimate, by the law of large numbers)
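The N^{−1/2} error rate can be checked empirically. A Python sketch (my own toy target, E[U²] = 1/3 for U ∼ Unif(0, 1)) that estimates the estimator's standard deviation at two sample sizes:

```python
import random
import statistics

random.seed(2)

def mc_estimate(n):
    # One Monte Carlo estimate of E[U^2], U ~ Unif(0, 1); true value is 1/3.
    return sum(random.random() ** 2 for _ in range(n)) / n

reps = 200
sd_small = statistics.stdev([mc_estimate(100) for _ in range(reps)])
sd_large = statistics.stdev([mc_estimate(10_000) for _ in range(reps)])
# 100x more samples should shrink the error by roughly sqrt(100) = 10x.
ratio = sd_small / sd_large
```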
Monte Carlo Sampling
Is this a good idea?
• In low dimensions, it is worth considering numerical methods like quadrature. In high dimensions, these quickly become infeasible. Simpson’s rule in d dimensions, with N grid points:

error ∝ N^{−4/d}

Monte Carlo integration:

error ∝ N^{−1/2}

Independent of dimensionality!

• If unbiasedness is important to you.
• Very simple.
• Very modular: easily incorporated into more complex models (Gibbs sampling).
Monte Carlo Sampling (contd.)
An aside: Monte Carlo should be your method of last resort!

Don’t hesitate to use numerical integration.

• Numerical integration can be much faster and more accurate.

Contrast

> integrate(function(x) x * exp(-x), lower = 0, upper = Inf)

with

> mean(rexp(1000))
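The same contrast in Python (stdlib only; the hand-rolled Simpson rule and the tail-truncation point 40 are my own choices): a deterministic quadrature versus a 1000-sample Monte Carlo estimate of ∫₀^∞ x e^{−x} dx = E[X] = 1 for X ∼ Exponential(1):

```python
import math
import random

def simpson(f, a, b, n=4000):
    # Composite Simpson's rule on [a, b]; n must be even.
    h = (b - a) / n
    s = f(a) + f(b) + sum((4 if i % 2 else 2) * f(a + i * h) for i in range(1, n))
    return s * h / 3

# Quadrature: truncate the integral's tail at 40 (x * e^{-40} is negligible).
quad = simpson(lambda x: x * math.exp(-x), 0.0, 40.0)

# Monte Carlo: the integral is the mean of Exponential(1) draws.
random.seed(3)
mc = sum(random.expovariate(1.0) for _ in range(1000)) / 1000
# quad is accurate to many digits; mc is only within ~1/sqrt(1000) of the truth.
```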
Generating random variables
• The simplest useful probability distribution: Unif(0, 1).
• In theory, it can be used to generate any other RV.
• Is it easy to generate uniform RVs on a deterministic computer? No!
• Instead: pseudorandom numbers.
• Map a seed to a ‘random-looking’ sequence.
• Downside: http://boallen.com/random-numbers.html
• Upside: can use seeds for reproducibility or debugging:
  > set.seed(1)
• Be careful with batch/parallel processing.
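Seeding works the same way in Python as set.seed does in R: reusing a seed replays the exact sequence, which is what makes stochastic code reproducible and debuggable:

```python
import random

random.seed(1)                                 # fix the seed
first = [random.random() for _ in range(3)]

random.seed(1)                                 # re-seed: replays the same sequence
second = [random.random() for _ in range(3)]
# first == second, so a buggy stochastic run can be reproduced exactly.
```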
15/29
![Page 41: Lecture 14: Bayesian inference and Monte Carlo methods ... · lecture14:bayesianinferenceand montecarlomethods STAT545:Intro.toComputationalStatistics VinayakRao PurdueUniversity](https://reader033.vdocuments.net/reader033/viewer/2022053001/5f055bed7e708231d412928b/html5/thumbnails/41.jpg)
Generating random variables
• The simplest useful probability distribution Unif(0, 1).• In theory, can be used to generate any other RV.• Easy to generate uniform RVs on a deterministic computer?
No!
• Instead: pseudorandom numbers.• Map a seed to a ‘random-looking’ sequence.• Downside: http://boallen.com/random-numbers.html• Upside: Can use seeds for reproducibility or debugging
> set.seed(1)• Careful with batch/parallel processing.
15/29
![Page 42: Lecture 14: Bayesian inference and Monte Carlo methods ... · lecture14:bayesianinferenceand montecarlomethods STAT545:Intro.toComputationalStatistics VinayakRao PurdueUniversity](https://reader033.vdocuments.net/reader033/viewer/2022053001/5f055bed7e708231d412928b/html5/thumbnails/42.jpg)
Generating random variables
• The simplest useful probability distribution Unif(0, 1).• In theory, can be used to generate any other RV.• Easy to generate uniform RVs on a deterministic computer?
No!
• Instead: pseudorandom numbers.
• Map a seed to a ‘random-looking’ sequence.• Downside: http://boallen.com/random-numbers.html• Upside: Can use seeds for reproducibility or debugging
> set.seed(1)• Careful with batch/parallel processing.
15/29
![Page 43: Lecture 14: Bayesian inference and Monte Carlo methods ... · lecture14:bayesianinferenceand montecarlomethods STAT545:Intro.toComputationalStatistics VinayakRao PurdueUniversity](https://reader033.vdocuments.net/reader033/viewer/2022053001/5f055bed7e708231d412928b/html5/thumbnails/43.jpg)
Generating random variables
• The simplest useful probability distribution Unif(0, 1).• In theory, can be used to generate any other RV.• Easy to generate uniform RVs on a deterministic computer?
No!
• Instead: pseudorandom numbers.• Map a seed to a ‘random-looking’ sequence.
• Downside: http://boallen.com/random-numbers.html• Upside: Can use seeds for reproducibility or debugging
> set.seed(1)• Careful with batch/parallel processing.
15/29
Generating random variables

• The simplest useful probability distribution: Unif(0, 1).
• In theory, it can be used to generate any other RV.
• Is it easy to generate uniform RVs on a deterministic computer? No!
• Instead: pseudorandom numbers.
• Map a seed to a 'random-looking' sequence.
• Downside: http://boallen.com/random-numbers.html
• Upside: seeds can be used for reproducibility or debugging:
  > set.seed(1)
• Be careful with batch/parallel processing.

15/29
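The slides use R's set.seed; as a hedged illustration of the same two points (reproducibility, and the batch/parallel caveat), here is a minimal Python/NumPy sketch. The use of numpy.random.default_rng and SeedSequence.spawn is my choice, not from the slides:

```python
import numpy as np

# Reproducibility: the same seed maps to the same 'random-looking' sequence,
# so a and b are identical arrays.
a = np.random.default_rng(1).uniform(size=3)
b = np.random.default_rng(1).uniform(size=3)

# Batch/parallel caution: reusing one seed across workers gives every worker
# the SAME stream. Spawn independent child streams from one root seed instead.
children = np.random.SeedSequence(1).spawn(4)
rngs = [np.random.default_rng(s) for s in children]
```

Each element of rngs is an independent generator suitable for one worker, while the whole run remains reproducible from the single root seed.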
Generating random variables

R has many built-in random number generators:
rnorm, rgamma, rbinom, rexp, rpois, etc.

What if we want samples from some other distribution?

16/29
Generating random variables

Inverse transform sampling

Let X have pdf p(x) and cdf F(x) = P(X ≤ x) = ∫_{−∞}^{x} p(u)du.

Let X ∼ p(·) and U = F(X). Then U ∼ Unif(0, 1).

Equivalently, sample U ∼ Unif(0, 1) and let X = F⁻¹(U).
Then X ∼ p(·) (see Wikipedia for a proof).

E.g. −log(U) is Exponential(1).
Unfortunately, F⁻¹ is usually hard to compute.

17/29
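A minimal Python sketch of the Exponential(1) case above: F(x) = 1 − exp(−x), so F⁻¹(u) = −log(1 − u), and since 1 − U is itself Unif(0, 1) we may use −log(U) directly (sample size and names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Inverse transform for Exponential(1): F(x) = 1 - exp(-x),
# so F^{-1}(u) = -log(1 - u); equivalently x = -log(u), since 1-U ~ Unif(0,1) too.
u = rng.uniform(size=100_000)
x = -np.log(u)

print(x.mean())  # close to 1, the mean of Exponential(1)
```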
Rejection sampling

Let p(x) = f(x)/Z.
Probability of a sample in [x0, x0 + ∆x] = p(x0)∆x.

If we sample points uniformly below the curve Mf(x):
Probability of a sample in [x0, x0 + ∆x] = Mf(x0)∆x / ∫_X Mf(x)dx = p(x0)∆x.

How do we do this (without sampling from p)?

If Mf(x) ≤ Nq(x) ∀x, for a constant N and a distribution q(·):
• Sample points uniformly under Nq(x)
  (sample x0 ∼ q(·) and assign it a uniform height in [0, Nq(x0)]).
• Keep only the points under Mf(x).

Equivalent algorithm (convince yourself):
• Propose x∗ ∼ q(·).
• Accept with probability Mf(x∗)/Nq(x∗).

We need a bound on f(x); a loose bound leads to lots of rejections.
Probability of acceptance = MZ/N.

18/29
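A sketch of the propose/accept version for a density of the form p(x) ∝ f(x) = exp(−x²/2)|sin x| (names and sample size are mine). Since f(x) ≤ exp(−x²/2) = √(2π)·q(x) with q = N(0, 1), proposing from q gives acceptance probability f(x∗)/(√(2π)q(x∗)) = |sin x∗|:

```python
import numpy as np

rng = np.random.default_rng(0)

def rejection_sample(n):
    """Sample from p(x) ∝ f(x) = exp(-x²/2)|sin x| by rejection.

    Envelope: f(x) <= exp(-x²/2) = sqrt(2π) q(x) with q = N(0,1),
    so accept a proposal x* ~ q with probability
    f(x*) / (sqrt(2π) q(x*)) = |sin x*|.
    """
    out = []
    while len(out) < n:
        x = rng.standard_normal()
        if rng.uniform() < abs(np.sin(x)):
            out.append(x)
    return np.array(out)

samples = rejection_sample(10_000)
```

The accepted points are exact draws from p; no normalization constant was needed, which anticipates the next slide.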
Intractable normalization constants

A probability density takes the form p(x) = f(x)/Z.

• Z = ∫_X f(x)dx is the normalization constant.
• It ensures the probability integrates to 1.

Often Z is difficult to calculate (an intractable integral over f(x)).
Consequently, evaluating p(x) is hard.
However, rejection sampling doesn't need Z or p(x).

19/29
Rejection sampling (contd.)

Example 1:
p(x) ∝ exp(−x²/2)|sin(x)|

Example 2 (truncated normal):
p(x) ∝ exp(−x²/2)1{x > c}

What is M for each case? What can we say about efficiency?

20/29
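For Example 2, the simplest envelope is q = N(0, 1) itself: accept iff x > c. The acceptance probability is then P(X > c), which collapses as c grows, illustrating the efficiency question. A rough sketch (function names and sample sizes are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def trunc_normal_rejection(c, n):
    """Sample N(0,1) truncated to x > c by rejecting proposals below c."""
    out, tries = [], 0
    while len(out) < n:
        x = rng.standard_normal()
        tries += 1
        if x > c:
            out.append(x)
    return np.array(out), n / tries  # samples, empirical acceptance rate

_, rate0 = trunc_normal_rejection(0.0, 2000)   # ≈ 0.5
_, rate2 = trunc_normal_rejection(2.0, 2000)   # ≈ 0.023, since P(X > 2) ≈ 0.023
```

For large c (say c = 5), this naive scheme becomes hopeless and a better proposal (e.g. a shifted exponential) is needed.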
Importance Sampling

Rather than accept/reject, assign weights to samples. Observe:

E_p[g] = ∫ g(x)p(x)dx = ∫ g(x)(p(x)/q(x))q(x)dx = E_q[g(x)p(x)/q(x)]

Use a Monte Carlo approximation to the latter expectation:

• Draw proposals x_s ∼ q(·) and calculate weights w(x_s) = p(x_s)/q(x_s).

∫ g(x)p(x)dx ≈ (1/N) ∑_{s=1}^N w(x_s)g(x_s)

Since w(x) = p(x)/q(x) = f(x)/(Z q(x)):

• We don't need a bounding envelope.
• We do need the normalization constant Z (but see later).

21/29
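A small sketch of the weighted estimator. The specifics are my choices, not from the slides: target p = N(0, 1), test function g(x) = x² (so E_p[g] = 1), proposal q normal with mean 0 and standard deviation 2:

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, sigma=1.0):
    return np.exp(-0.5 * (x / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Estimate E_p[g] for p = N(0,1) and g(x) = x², drawing from a wider normal.
N = 100_000
x = rng.normal(0.0, 2.0, size=N)
w = normal_pdf(x) / normal_pdf(x, sigma=2.0)   # importance weights p(x)/q(x)
estimate = np.mean(w * x**2)                   # ≈ 1, the variance of N(0,1)
```

No envelope constant was needed, but note the weights required the normalized p, i.e. knowledge of Z.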
Importance sampling vs simple Monte Carlo

[Figure, two panels: simple Monte Carlo/MCMC (left) uses a sampling approximation; importance sampling (right) weights the samples.]

22/29
Importance Sampling (contd)

When does this make sense?
Sometimes it's easier to simulate from q(x) than from p(x).
Sometimes it's better to simulate from q(x) than from p(x)!

To reduce variance, e.g. in rare-event simulation.

Let X ∼ N(0, 1).
• What is P(X > 5)?

23/29
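A sketch of this rare-event case. The proposal choice is mine: shift the proposal to N(5, 1) so roughly half the draws land in the region of interest. The true value is 1 − Φ(5) ≈ 2.87 × 10⁻⁷:

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mu=0.0):
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2.0 * np.pi)

N = 200_000

# Naive Monte Carlo: a standard-normal draw virtually never exceeds 5,
# so the estimate is typically exactly 0.
naive = np.mean(rng.standard_normal(N) > 5)

# Importance sampling from q = N(5, 1): about half the draws exceed 5,
# and each carries weight p(x)/q(x).
x = rng.normal(5.0, 1.0, size=N)
w = normal_pdf(x) / normal_pdf(x, mu=5.0)
is_est = np.mean(w * (x > 5))   # ≈ 2.87e-7
```

The same budget of draws that gives the naive estimator essentially no information pins the rare-event probability to a few percent relative error.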
Importance sampling:

Let X = (x1, . . . , x100) be a hundred dice.
What is p(∑ xi ≥ 550)?

Rejection sampling (from p(x)) leads to high rejection.

A better choice might be to bias the dice, e.g.
q(xi = v) ∝ v (for v ∈ {1, . . . , 6}).

24/29
Importance sampling:

Define S_X = ∑_i xi.

p(S ≥ 550) = ∑_{y ∈ all configs of 100 dice} δ(∑ y ≥ 550) p(y)
           = ∑_{y ∈ all configs of 100 dice} (p(y)/q(y)) δ(∑ y ≥ 550) q(y)

For a proposal X∗ ∼ q,

w(X∗) = p(X∗)/q(X∗) = (1/6)^100 / ∏_i q(x∗_i)

Use the approximation p(S ≥ 550) ≈ (1/N) ∑_{j=1}^N w(X^j) δ(∑_i x^j_i ≥ 550)

25/29
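A sketch of the mechanics (vectorization details are mine). One caveat, which is my observation rather than the slides': even q(v) ∝ v only shifts the mean of S to about 433, still well below 550, so a stronger tilt of the proposal may be needed in practice; the code shows the weight computation, done in log space for stability:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 100_000
faces = np.arange(1, 7)
q = faces / faces.sum()                     # q(v) ∝ v, i.e. v/21

X = rng.choice(faces, size=(N, 100), p=q)   # N proposals of 100 biased dice
S = X.sum(axis=1)                           # mean of S under q ≈ 433.3

# w(X) = (1/6)^100 / ∏_i q(x_i), computed in log space to avoid underflow.
log_w = 100 * np.log(1.0 / 6.0) - np.log(q[X - 1]).sum(axis=1)
w = np.exp(log_w)

estimate = np.mean(w * (S >= 550))
```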
Importance sampling (contd)

What is the variance of the estimate?

Var[μ_imp] = E[μ_imp²] − μ²

= E[((1/N) ∑_{i=1}^N w_i g(x_i))²] − μ²

= E_q[(p(x)g(x)/q(x))²] − μ²          (taking N = 1 for simplicity)

= ∫_X q(x) (p(x)g(x)/q(x))² dx − μ²

≥ (∫_X q(x) (p(x)g(x)/q(x)) dx)² − μ²          (Jensen's inequality)

= μ² − μ² = 0 (!)

26/29
Importance sampling (contd)

We achieve this lower bound when q(x) ∝ p(x)g(x).
A slightly useless result, because

q(x) = p(x)g(x) / ∫_X p(x)g(x)dx

requires solving the integral we care about.

27/29
Importance sampling (contd)

We want a small variance in the weights w(x_i).
It is easy to check that E_q[w(x)] = 1, and

Var_q[w(x)] = E_q[w(x)²] − E_q[w(x)]²
            = ∫_X (p(x)/q(x))² q(x)dx − 1 = ∫_X (p(x)²/q(x)) dx − 1

This can be unbounded, e.g. for p = N(0, 2) and q = N(0, 1).

A popular diagnostic statistic: the effective sample size (ESS).

ESS = (∑_{i=1}^N w(x_i))² / ∑_{i=1}^N w(x_i)²

Small ESS → large variability in the w's → bad estimate.
Large ESS promises you nothing!

28/29
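ESS is easy to compute from the weights. A sketch contrasting a wide proposal (stable weights) with the pathological narrow-proposal case p = N(0, 2), q = N(0, 1) from above, interpreting the second parameter as the standard deviation (that reading, and the sample size, are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, sigma=1.0):
    return np.exp(-0.5 * (x / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def ess(w):
    """Effective sample size: (Σ w_i)² / Σ w_i²."""
    return w.sum() ** 2 / (w ** 2).sum()

N = 10_000

# Wide proposal (q broader than p): weights are bounded, ESS is a large
# fraction of N.
x = rng.normal(0.0, 2.0, size=N)
w_good = normal_pdf(x) / normal_pdf(x, sigma=2.0)

# Narrow proposal (p wider than q): p²/q is not integrable, a few huge
# weights dominate and the ESS drops.
y = rng.normal(0.0, 1.0, size=N)
w_bad = normal_pdf(y, sigma=2.0) / normal_pdf(y)
```

Comparing ess(w_good) and ess(w_bad) makes the diagnostic concrete, while the slide's warning stands: a large ESS is necessary evidence of health, not sufficient.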
Importance sampling when Z is unknown

Importance weights are w(x) = p(x)/q(x), where p(x) = f(x)/Z.

How can we estimate Z = ∫ f(x)dx?

Z = ∫ f(x)dx = ∫ (f(x)/q(x)) q(x)dx

Reuse the samples from the proposal distribution q(x), with unnormalized
weights w̃(x) = f(x)/q(x):

Ẑ = (1/N) ∑_{i=1}^N f(x_i)/q(x_i) = (1/N) ∑_{i=1}^N w̃(x_i)

Use Ẑ to approximate the importance sampling weights w(x_i):

w(x_i) = p(x_i)/q(x_i) = f(x_i)/(Z q(x_i)) ≈ (1/Ẑ) w̃(x_i) =: ŵ(x_i)

Use ŵ(x) instead of w(x) in the Monte Carlo approximation.
The result is biased for finite N, but consistent as N → ∞.

29/29
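A sketch of the self-normalized estimator with an artificially "unknown" Z (my setup: f(x) = exp(−x²/2), so p = N(0, 1) and the true Z = √(2π) ≈ 2.507; proposal normal with sd 2; test function g(x) = x² with E_p[g] = 1):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Unnormalized target: p = N(0,1) with Z = sqrt(2π) treated as unknown.
    return np.exp(-0.5 * x ** 2)

def q_pdf(x):
    # Proposal density: normal with mean 0, standard deviation 2.
    return np.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * np.sqrt(2.0 * np.pi))

N = 100_000
x = rng.normal(0.0, 2.0, size=N)
w_tilde = f(x) / q_pdf(x)            # unnormalized weights w̃ = f/q

Z_hat = w_tilde.mean()               # ≈ sqrt(2π) ≈ 2.507
estimate = np.sum(w_tilde * x ** 2) / np.sum(w_tilde)   # self-normalized, ≈ 1
```

Dividing by the sum of the weights makes Ẑ cancel, so only f (never Z) is evaluated; this is the biased-but-consistent estimator described above.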