
  • Introduction to Monte Carlo Methods

    Prof. Nicholas Zabaras

    Center for Informatics and Computational Science
    University of Notre Dame, Notre Dame, Indiana, USA

    Email: [email protected]
    URL: https://www.zabaras.com/ | https://cics.nd.edu/

    October 4, 2018

  • Contents

    Review of the Bayesian inference framework; introducing Monte Carlo
    simulation; reviewing the Central Limit Theorem and the Law of Large
    Numbers; indicator functions; error approximations

    Examples showing convergence of the MC simulator; generalization of
    the MC estimator in high dimensions

    Sample representation of the Monte Carlo estimator

    Deterministic vs. MC integration; using MC for computing integrals,
    expectations and Bayes factors

    Following closely:

    C. Robert and G. Casella, Monte Carlo Statistical Methods (Ch. 1, 2, 3.1 & 3.2) (google books, slides, video)

    J. S. Liu, Monte Carlo Strategies in Scientific Computing (Chapters 1 & 2)

    J.-M. Marin and C. P. Robert, Bayesian Core (Chapter 2)

    A. Doucet, Statistical Computing & Monte Carlo Methods (course notes, 2007)

    http://books.google.com/books?id=HfhGAxn5GugC&dq=Monte+Carlo+statistical+methods,+google+books&printsec=frontcover&source=bn&hl=en&ei=JbxiTP6KDIK88gaK7bnsCQ&sa=X&oi=book_result&ct=result&resnum=4&ved=0CCQQ6AEwAw
    http://www.ceremade.dauphine.fr/~xian/Madrid.pdf
    http://videolectures.net/christian_robert/
    http://books.google.com/books?id=R8E-yHaKCGUC&printsec=frontcover&dq=monte+carlo+strategies+in+scientific+computing&hl=en&ei=afBiTOWwNoT7lwfi1-nMCw&sa=X&oi=book_result&ct=result&resnum=1&ved=0CDUQ6AEwAA
    http://www.springerlink.com/content/x38261/
    http://people.cs.ubc.ca/~arnaud/stat535/slides7.pdf

  • The Bayesian Model

    Let us revisit our Bayesian model:

    $$\pi(\theta \mid x) = \frac{\pi(\theta)\, f(x \mid \theta)}{\int \pi(\theta)\, f(x \mid \theta)\, d\theta}$$

    In most problems of interest there is no analytical closed form for
    the posterior (using conjugate priors is one exception).

    Also note that the denominator in the Bayes' equation above implies
    the calculation of the following high-dimensional integral:

    $$\int \pi(\theta)\, f(x \mid \theta)\, d\theta$$

  • The Bayesian Model

    Consider the posterior model

    $$\pi(\theta \mid x) = \frac{\pi(\theta)\, f(x \mid \theta)}{\int \pi(\theta)\, f(x \mid \theta)\, d\theta}$$

    Typical point estimates based on the posterior include the following:

    $$\mathbb{E}[\theta \mid x] = \int \theta\, \pi(\theta \mid x)\, d\theta, \qquad \operatorname{Var}[\theta \mid x] = \int \left(\theta - \mathbb{E}[\theta \mid x]\right)^2 \pi(\theta \mid x)\, d\theta$$

    We are often interested in the mode of the posterior:

    $$\theta_{MAP} = \arg\max_{\theta}\, \pi(\theta \mid x)$$

    If $\theta = (\theta_1, \theta_2)$ and $\theta_2$ is a nuisance parameter, marginal
    distributions are also often needed:

    $$\pi(\theta_1 \mid x) = \int \pi(\theta_1, \theta_2 \mid x)\, d\theta_2$$

  • The Bayesian Model

    Consider the posterior model

    $$\pi(\theta \mid x) = \frac{\pi(\theta)\, f(x \mid \theta)}{\int \pi(\theta)\, f(x \mid \theta)\, d\theta}$$

    The predictive distribution of $Y$, $Y \sim g(y \mid x)$, and its mean $\mathbb{E}[Y \mid x]$ are:

    $$g(y \mid x) = \int f(y \mid \theta, x)\, \pi(\theta \mid x)\, d\theta, \qquad \mathbb{E}[Y \mid x] = \iint y\, f(y \mid \theta, x)\, \pi(\theta \mid x)\, d\theta\, dy$$

    Similarly, for model selection, we have seen that the posterior takes
    the form:

    $$\pi(k, \theta_k \mid x) = \frac{p(k)\, \pi_k(\theta_k)\, f(x \mid k, \theta_k)}{\sum_{k} p(k) \int \pi_k(\theta_k)\, f(x \mid k, \theta_k)\, d\theta_k}$$

  • Monte Carlo Simulation

    All of the calculations reviewed earlier require computing high-
    dimensional integrals.

    Monte Carlo simulation provides the means for effective calculation of
    these integrals and for resolving many more issues.

    Some historical (early) references on Monte Carlo methods:

    N. Metropolis and S. Ulam, The Monte Carlo Method, J. Amer. Statist. Assoc., Vol. 44, pp. 335-341 (1949)

    N. Metropolis, The Beginning of the Monte Carlo Method, Los Alamos Science Special Issue (1987)

    http://www.amstat.org/misc/TheMonteCarloMethod.pdf
    https://www.dropbox.com/s/5iwwnfdj0nak2s9/TheMonteCarloMethod.pdf?dl=0
    http://library.lanl.gov/cgi-bin/getfile?00326866.pdf
    https://www.dropbox.com/s/qc5qv64krrlg57f/TheBeginningOfTheMCMethod.pdf?dl=0

  • Introducing Monte Carlo Simulation

    Consider a 2 × 2 square S as shown in the figure below.

    Let D be the inscribed circle of radius 1 in the square S.

    [Figure: the circle D of radius 1 inscribed in the 2 × 2 square S]

  • Introducing Monte Carlo Simulation

    Let us consider dropping darts uniformly on the square S. This means
    that the probability of the dart falling in a subdomain $A$ of S is
    proportional to the area of $A$.

    Let $D = (x, y)$ define a random variable on S that represents the
    location of the drop of the dart. We have:

    $$P(D \in A) = \frac{\int_A dx\,dy}{\int_S dx\,dy}$$

    Assume $N$ independent drops of the dart on the square S, i.e.
    $D_1, D_2, \ldots, D_N$.

  • Introducing Monte Carlo Simulation

    A common-sense estimate of the above probability

    $$P(D \in A) = \frac{\int_A dx\,dy}{\int_S dx\,dy}$$

    would be

    $$P(D \in A) \simeq \frac{\text{times the dart fell in } A}{N}$$

    where $N$ is the total number of dart falls.

    Can we give a statistical justification of this result?

  • Indicator Functions

    Let us introduce the indicator function for the event $A$, i.e. let

    $$\chi_A(x, y) = \begin{cases} 1, & \text{if the drop point of the dart } d = (x, y) \in A \\ 0, & \text{otherwise} \end{cases}$$

    Let us compute the probability for $D \in A$:

    $$P(D \in A) = \frac{\int_S \chi_A(x, y)\, dx\,dy}{\int_S dx\,dy} = \frac{\int_S \chi_A(x, y)\, dx\,dy}{4} = \frac{\int_A dx\,dy}{4}$$

    The above result follows immediately by noticing that

    $$\int_S \chi_A(x, y)\, dx\,dy = \int_A \chi_A(x, y)\, dx\,dy + \int_{S \setminus A} \chi_A(x, y)\, dx\,dy = \int_A 1\, dx\,dy + \int_{S \setminus A} 0\, dx\,dy = \int_A dx\,dy$$

  • Indicator Functions

    The probability density associated to $P$ is 1/4 - this is the density of
    the uniform distribution on $S$, denoted $\mathcal{U}_S$: $D \sim \mathcal{U}_S$.

    Let us introduce the random variable $V(D) := \chi_A(D) := \chi_A(X, Y)$, where
    $X, Y$ are the random variables representing the Cartesian coordinates of
    a uniformly distributed point $d = (x, y)$ on S, where a dart falls.

    With this notation,

    $$P(d \in A) = \int_S \chi_A(x, y)\, \frac{1}{4}\, dx\,dy = \mathbb{E}_{\mathcal{U}_S}(V)$$

  • Strong Law of Large Numbers

    Let $X_i$ for $i = 1, 2, \ldots, N$ be independent and identically distributed
    (i.i.d.) random variables with mean $\mathbb{E}(X_i) = \mu$ and variance
    $V(X_i) = \sigma^2 < \infty$. The strong LLN states the following for the sample
    mean $\bar{X}_N$:

    $$\lim_{N \to \infty} \bar{X}_N = \mu \quad \text{almost surely}$$

  • Weak Law of Large Numbers

    Let $X_i$ for $i = 1, 2, \ldots, N$ be independent and identically distributed
    (i.i.d.) random variables with mean $\mathbb{E}(X_i) = \mu$ and variance
    $V(X_i) = \sigma^2 < \infty$. The weak LLN states that the sample mean

    $$\bar{X}_N = \frac{1}{N} \sum_{i=1}^{N} X_i$$

    is a random variable that converges to the true mean, $\bar{X}_N \to \mu$, as
    $N \to \infty$. More formally:

    $$\lim_{N \to \infty} \Pr\left(\left|\bar{X}_N - \mu\right| \ge \varepsilon\right) = 0 \quad \forall\, \varepsilon > 0$$

    Note that

    $$\mathbb{E}\left[\bar{X}_N\right] = \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}[X_i] = \mu \qquad \text{and} \qquad \operatorname{var}\left[\bar{X}_N\right] = \frac{1}{N^2} \sum_{i=1}^{N} \operatorname{var}[X_i] = \frac{\sigma^2}{N}$$

  • The Central Limit Theorem

    Let $X_i$ for $i = 1, 2, \ldots, N$ be independent and identically distributed
    (i.i.d.) random variables, each with expectation $\mu$ and variance $\sigma^2$.
    Then, for the sample mean $\bar{X}_N = \frac{1}{N} \sum_{i=1}^{N} X_i$:

    $$\mathbb{E}\left[\bar{X}_N\right] = \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}(X_i) = \frac{1}{N}\,(N \mu) = \mu$$

    $$\operatorname{Var}\left[\bar{X}_N\right] = \operatorname{Var}\left(\frac{1}{N} \sum_{i=1}^{N} X_i\right) = \frac{1}{N^2} \sum_{i=1}^{N} \operatorname{Var}(X_i) = \frac{1}{N^2}\,(N \sigma^2) = \frac{\sigma^2}{N}$$

    Under some weak conditions (such as finite variance), the sample mean
    has a limiting normal distribution:

    $$\lim_{N \to \infty} \frac{\bar{X}_N - \mu}{\sqrt{\sigma^2 / N}} \sim \mathcal{N}(0, 1) \quad \text{or, as } N \to \infty, \quad \bar{X}_N \sim \mathcal{N}\left(\mu, \frac{\sigma^2}{N}\right)$$

  • Back to Indicator Functions

    Let $X_i$ for $i = 1, 2, \ldots, N$ be an indicator function for an event $F$, i.e. let

    $$X_i = \begin{cases} 1, & \text{if the outcome of experiment } i \text{ is } F \\ 0, & \text{otherwise} \end{cases}$$

    Then, as a result of the law of large numbers,

    $$\bar{X}_N = \frac{\sum_{i=1}^{N} X_i}{N} \to \Pr(F) \quad \text{as } N \to \infty$$

    Note that herein we used that

    $$\mathbb{E}[X_i] = 1 \cdot \Pr(F) + 0 \cdot \Pr\left(F^c\right) = \Pr(F)$$

  • Returning to Our Dart Drop Experiment

    Let us introduce the random variables $V_i := V(D_i) := \chi_A(D_i)$, $i = 1, 2, \ldots, N$,
    associated to the drops $D_i$, $i = 1, 2, \ldots, N$, and consider the empirical
    average of the i.i.d. $V_i$:

    $$S_N := \frac{\sum_{i=1}^{N} V_i}{N} = \frac{\text{number of drops that fell in } A}{N}$$

    As $N \to \infty$, the law of large numbers (since $D_i \sim \mathcal{U}_S$) yields

    $$\lim_{N \to \infty} S_N = \mathbb{E}_{\mathcal{U}_S}(V) \quad \text{almost surely},$$

    where we already proved that

    $$\Pr(D \in A) = \mathbb{E}_{\mathcal{U}_S}(V)$$

    When $N \to \infty$, this justifies our intuitive result introduced earlier.

  • Mean Square Error

    Since $S_N$ is an unbiased estimator of

    $$P(d \in D) = \int_D \frac{1}{4}\, dx\,dy = \frac{\pi}{4},$$

    $S_N$ is a random variable such that

    $$S_N = \underbrace{\frac{\pi}{4}}_{\mathbb{E}(S_N)} + \underbrace{\left(S_N - \frac{\pi}{4}\right)}_{\text{Error term}}$$

    To characterize the precision of the estimator, we can use the following:

    $$\operatorname{Var}(S_N) = \mathbb{E}\left(S_N - \frac{\pi}{4}\right)^2 = \frac{1}{N^2} \sum_{i=1}^{N} \operatorname{Var}(V_i) = \frac{\operatorname{Var}(V)}{N} \quad (\text{recall the } V_i\text{'s are i.i.d.})$$

    For our problem,

    $$\operatorname{Var}(V) = \mathbb{E}\left(V^2\right) - \left(\mathbb{E}(V)\right)^2 = P(D \in D)\left(1 - P(D \in D)\right) = \frac{\pi}{4}\left(1 - \frac{\pi}{4}\right) \approx 0.1685.$$

    Thus the mean square error between $S_N$ and $\Pr(D \in D)$ decreases as $\frac{1}{N}$.
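
    An empirical check of the $1/N$ decay of $\operatorname{Var}(S_N)$ (my own Python
    sketch; the deck links a C++ simulator on the following slides):

    import numpy as np

    rng = np.random.default_rng(2)

    def realizations_of_S_N(N, reps):
        """reps independent realizations of the dart estimator S_N."""
        x, y = rng.uniform(-1.0, 1.0, size=(2, reps, N))
        return np.mean(x**2 + y**2 <= 1.0, axis=-1)

    for N in (100, 1_000, 10_000):
        emp = realizations_of_S_N(N, reps=500).var()
        print(N, emp, 0.1685 / N)               # empirical Var(S_N) vs Var(V)/N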

  • Properties of the Estimator

    Applying the central limit theorem (since $\operatorname{Var}(V) < \infty$):

    $$\sqrt{N}\left(S_N - \frac{\pi}{4}\right) \xrightarrow{d} \mathcal{N}\left(0, \operatorname{Var}(V)\right)$$

    The probability of the error being larger than $2\sqrt{\operatorname{var}(V)/N}$ is given as:

    $$\Pr\left(\left|S_N - \frac{\pi}{4}\right| \ge 2\sqrt{\frac{\operatorname{var}(V)}{N}}\right) \simeq 2\left(1 - \Phi(2)\right) \approx 0.0456$$

    where $\Phi(x)$ is the CDF of $\mathcal{N}(0, 1)$.

    One can similarly prove that for any integer $N \ge 1$ and $\varepsilon > 0$,

    $$\Pr\left(\left|S_N - \frac{\pi}{4}\right| \ge \varepsilon \sqrt{\frac{\operatorname{var}(V)}{N}}\right) \simeq \operatorname{erfc}\left(\frac{\varepsilon}{\sqrt{2}}\right) \le C\, e^{-\varepsilon^2/2}$$

    where $\Phi(x) = \frac{1}{2}\left(1 + \operatorname{erf}\left(\frac{x}{\sqrt{2}}\right)\right)$ and $\operatorname{erfc}(x) \simeq \frac{e^{-x^2}}{x \sqrt{\pi}}$ for large $x$ (see here).

    http://en.wikipedia.org/wiki/Standard_normal_table
    http://mathworld.wolfram.com/Erf.html

  • Coefficient of Variation

    The coefficient of variation for our Monte Carlo estimator is given as:

    $$COV(S_N) = \frac{\sqrt{\operatorname{var}(S_N)}}{\mathbb{E}(S_N)} = \frac{\sqrt{\frac{\pi}{4}\left(1 - \frac{\pi}{4}\right)/N}}{\pi/4} \approx \frac{0.5233}{\sqrt{N}}$$

    For different COVs, the number of samples needed is given as:

    COV = 10%  ->  N = 28
    COV = 5%   ->  N = 110
    COV = 1%   ->  N = 2739

    With all of the above error estimates, we conclude that the
    approximation error varies as $O\left(\frac{1}{\sqrt{N}}\right)$.
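
    The table follows from solving $COV(S_N) = c$ for $N$; a tiny sketch (my
    addition; small differences in the last entry come from rounding of the
    constant):

    import numpy as np

    p = np.pi / 4                               # E(S_N)
    var_V = p * (1 - p)                         # Var(V) for the indicator V

    for c in (0.10, 0.05, 0.01):
        N = int(np.ceil(var_V / (p * c) ** 2))  # from sqrt(var_V / N) / p = c
        print(f"COV = {c:.0%} -> N = {N}")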

  • Properties of the Estimator

    An alternative non-asymptotic error estimate* using a Bernstein-type
    inequality is:

    $$P\left(\left|S_N - \frac{\pi}{4}\right| \ge \sqrt{\frac{\log(2/\delta)}{2N}}\right) \le \delta \quad \text{for } \delta \in (0, 1],\ N \ge 1$$

    Using this result, we can show that:

    $$\Pr\left(\left|S_N - \frac{\pi}{4}\right| \ge \sqrt{\frac{\log(40)}{2N}}\right) \le 0.05$$

    Alternatively, for any $N \ge 1$ and $\varepsilon > 0$, using the Eq. above we can write:

    $$P\left(\left|S_N - \frac{\pi}{4}\right| \ge \varepsilon\right) \le 2\, e^{-2 N \varepsilon^2}$$

    These expressions for the approximation error show that it is inversely
    proportional to $\sqrt{N}$.

    * From Statistical Computing & Monte Carlo Methods, A. Doucet.

    http://en.wikipedia.org/wiki/Bernstein_inequalities_(probability_theory)#cite_ref-3
    http://people.cs.ubc.ca/~arnaud/stat535/slides7.pdf

  • Convergence of the Simulator

    [Figure: convergence of $4 S_N$ as a function of $N$ (one realization); plot file Meanerror_plt]

    See here for a C++ implementation:
    https://www.dropbox.com/s/8xnqtvao8ua4a8u/Dart_experiment_simulator.rar?dl=0

  • Convergence of the Simulator

    [Figure: convergence of $4 S_N$ as a function of $N$ (one hundred realizations); plot file Meanerror.plt]

    See here for a C++ implementation:
    https://www.dropbox.com/s/8xnqtvao8ua4a8u/Dart_experiment_simulator.rar?dl=0

  • Convergence of the Simulator

    [Figure: square root of the empirical mean square error of $4 S_N$ across 100
    realizations as a function of $N$, with $\sqrt{\operatorname{Var}(V)/N}$ shown dotted; plot file Var_error.plt]

    See here for a C++ implementation:
    https://www.dropbox.com/s/8xnqtvao8ua4a8u/Dart_experiment_simulator.rar?dl=0

  • Convergence of the Estimator

    [Figure: Monte Carlo estimator of the area of a unit circle: (a) circles: MC
    mean averaged over the S = 100 runs; (b) theoretical error bars using the
    estimator discussed earlier; (c) average of the MC numerical error bars
    (from the S = 100 MC runs for each N). A MatLab implementation is given here.]

    $$\text{Mean area estimator at each } N \text{ (circles)} = \frac{1}{S} \sum_{i=1}^{S} \left(\text{MC area estimator of circle, run } i\right)$$

    $$\text{Sampled area error bars (colored area)} = \frac{1}{S} \sum_{i=1}^{S} \left(\text{numerical area error bars, run } i\right)$$

    $$\text{MC estimator of the area } A = \pi r^2 \text{ of the circle (run } i) = \underbrace{4}_{\text{area of square}} \cdot \underbrace{\frac{\#\text{ of samples falling in circle}}{N}}_{\text{our } S_N \text{ MC estimator}} = 4 S_N$$

    $$\text{Numerical area error bars (run } i) = \underbrace{4}_{\text{area of square}} \sqrt{\frac{1}{N}} \sqrt{\frac{1}{N - 1} \sum_{j=1}^{N} \left(\chi_D\left(\theta^{(j)}\right) - S_N\right)^2}$$

    $$\text{Theoretical area error bars (solid lines)} = \underbrace{4 \sqrt{\frac{\frac{\pi}{4}\left(1 - \frac{\pi}{4}\right)}{N}}}_{\text{exact std of our } 4 S_N \text{ MC estimator}}$$

    https://www.dropbox.com/s/6vl1rfq457k78xi/AreaCircle.rar?dl=0

  • Generalization of the Dart Drop Experiment

    Consider now the case where $\theta \in \mathbb{R}^n$, $n \ge 1$. We assume a hypercube
    $S_\theta^n$ and the inscribed hyperball $D_\theta^n$ in $\mathbb{R}^n$.

    We can build the same estimator as in 2D. Our indicator function now
    becomes $\chi_{D_\theta^n}$.

    All of our earlier derivations are still applicable. The rate of
    convergence of the estimator in the mean square sense is independent
    of $n$ and equal to $1/\sqrt{N}$.

    This is not the case using a deterministic method on a grid of regularly
    spaced points, where the convergence rate is typically of the form
    $1/N^{r/n}$, where $r$ is related to the smoothness of the contours of $D_\theta^n$.

    Monte Carlo methods are thus attractive when $n$ is large.

  • Generalization of the Dart Drop Experiment

    Assume you are interested in computing the volume of the hypersphere
    of radius $R = 1$ in $n$ dimensions:

    $$vol\left(D_\theta^n\right) = \frac{\pi^{n/2}}{\Gamma\left(\frac{n}{2} + 1\right)} \to 0 \quad \text{as } n \to \infty$$

    We want to do so using samples from the hypercube $[-1, 1]^n$ of volume $2^n$.

    Using $N$ samples, the variance of our MC estimator is

    $$\frac{\operatorname{Var}(X)}{N} = \frac{p_n\left(1 - p_n\right)}{N} \simeq \frac{p_n}{N} \ \text{ as } n \to \infty, \qquad \text{where } X \sim \mathcal{B}ernoulli\left(p_n\right), \quad p_n = \frac{vol\left(D_\theta^n\right)}{2^n}$$

    http://en.wikipedia.org/wiki/N-sphere
    http://en.wikipedia.org/wiki/Bernoulli_distribution

  • Monte Carlo and the Curse of Dimensionality

    The coefficient of variation is then:

    $$COV = \frac{\sqrt{\operatorname{Var}(X)/N}}{p_n} \simeq \frac{1}{\sqrt{N\, p_n}} \to \infty \quad \text{as } n \to \infty$$

    To obtain a reasonable relative error, we would need $N \simeq 100\, p_n^{-1}$.
    For this choice: $COV \simeq 0.1$.

    For $n = 20$:

    $$N \simeq 100\, p_{20}^{-1} = 100 \cdot \frac{2^{20}}{vol\left(D_\theta^{20}\right)} = 100 \cdot \frac{2^{20}\, 10!}{\pi^{10}} \approx 4.06 \times 10^{9}$$

    Clearly a plain Monte Carlo approach is not as helpful in 20 and higher
    dimensions! You are trying, in a huge volume of the hypercube, to hit
    the infinitesimally small hypersphere!
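
    The numbers above are easy to reproduce (a short sketch, my addition):

    import math

    # p_n: fraction of the hypercube [-1,1]^n occupied by the unit n-ball;
    # N ~ 100 / p_n gives COV ~ 0.1 for the hit-or-miss estimator.
    for n in (2, 5, 10, 20):
        vol_ball = math.pi ** (n / 2) / math.gamma(n / 2 + 1)
        p_n = vol_ball / 2**n
        print(f"n = {n:2d}: p_n = {p_n:.3e}, N = {100 / p_n:.2e}")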

  • Generalization of the Dart Drop Experiment

    To obtain a fixed precision in the variance, a number of samples that
    is exponential in the dimension is needed!

    $$\text{Consider } X \in \mathbb{R}^n \text{ with } \operatorname{Var}(X) = C\, \gamma^n,\ \gamma > 1. \quad \text{Then } \operatorname{Var}\left(\bar{X}_N\right) = \frac{C\, \gamma^n}{N}, \text{ so fixed precision requires } N \propto \gamma^n.$$

    So using a plain Monte Carlo method still suffers from the curse of
    dimensionality.

    MC is not appropriate for rare-events modeling.

  • Generalization

    Assume i.i.d. samples $\theta^{(i)} \sim \pi$ $(i = 1, 2, \ldots, N)$.

    Now consider any set $A$ and assume that we are interested in
    $\pi(A) = P(\theta \in A)$ for $\theta \sim \pi$.

    We naturally choose the following estimator:

    $$\widehat{\pi}(A) \simeq \frac{\text{number of samples in } A}{\text{total number of samples}} = \frac{1}{N} \sum_{i=1}^{N} \chi_A\left(\theta^{(i)}\right)$$

    which by the law of large numbers is a consistent estimator of $\pi(A)$, since

    $$\lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} \chi_A\left(\theta^{(i)}\right) = \mathbb{E}\left(\chi_A(\theta)\right) = \pi(A)$$

  • Generalization

    Now we generalize the idea to tackle the generic problem of estimating

    $$\mathbb{E}_\pi\left(f(\theta)\right) = \int f(\theta)\, \pi(\theta)\, d\theta$$

    where $f : \mathbb{R}^n \to \mathbb{R}^{n_f}$ and $\pi$ is a probability distribution on $\mathbb{R}^n$.

    We assume that $\mathbb{E}_\pi\left(\left|f(\theta)\right|\right) < \infty$ and that $\mathbb{E}_\pi\left(f(\theta)\right)$ cannot be
    calculated analytically.

  • Generalization

    To evaluate $\mathbb{E}_\pi\left(f(\theta)\right)$, we consider the unbiased estimator

    $$S_N(f) = \frac{1}{N} \sum_{i=1}^{N} f\left(\theta^{(i)}\right)$$

    From the law of large numbers, $S_N(f)$ will converge, and

    $$\lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} f\left(\theta^{(i)}\right) = \mathbb{E}_\pi\left(f(\theta)\right) \quad (a.s.)$$

    A good measure of the approximation is the variance of $S_N(f)$:

    $$\operatorname{Var}\left(S_N(f)\right) = \operatorname{Var}\left(\frac{1}{N} \sum_{i=1}^{N} f\left(\theta^{(i)}\right)\right) = \frac{\operatorname{Var}_\pi f(\theta)}{N}$$

    From the central limit theorem (if $\operatorname{Var}_\pi f(\theta) < \infty$), we have:

    $$\sqrt{N}\left(S_N(f) - \mathbb{E}_\pi f\right) \xrightarrow{d} \mathcal{N}\left(0, \operatorname{Var}_\pi f(\theta)\right), \quad \text{or} \quad S_N(f) \xrightarrow{d} \mathcal{N}\left(\mathbb{E}_\pi f, \frac{\operatorname{Var}_\pi f(\theta)}{N}\right)$$

  • Generalization

    The rate of convergence is independent of the dimension of $\theta$.

    Integration in complex domains is now not a problem.

    The method is easy to implement and rather general. You will need:

    to be able to evaluate $f(\theta)$,

    to be able to produce samples distributed according to $\pi(\theta)$.

    Some historical references on Monte Carlo methods:

    N. Metropolis and S. Ulam, The Monte Carlo Method, J. Amer. Statist. Assoc., Vol. 44, pp. 335-341 (1949)

    N. Metropolis, The Beginning of the Monte Carlo Method, Los Alamos Science Special Issue (1987)

    http://www.amstat.org/misc/TheMonteCarloMethod.pdf
    https://www.dropbox.com/s/5iwwnfdj0nak2s9/TheMonteCarloMethod.pdf?dl=0
    http://library.lanl.gov/cgi-bin/getfile?00326866.pdf
    https://www.dropbox.com/s/qc5qv64krrlg57f/TheBeginningOfTheMCMethod.pdf?dl=0

  • Sample Representation of the MC Estimator

    Let us introduce the Dirac delta function $\delta_{\theta_0}(\theta)$ for $\theta_0 \in \mathbb{R}^n$, defined
    for any $f : \mathbb{R}^n \to \mathbb{R}^{n_f}$ as follows:

    $$\int f(\theta)\, \delta_{\theta_0}(\theta)\, d\theta = f(\theta_0)$$

    Note that this implies in particular that, for a set $A$:

    $$\int_A \delta_{\theta_0}(\theta)\, d\theta = \chi_A(\theta_0)$$

    For $\theta^{(i)} \sim \pi$, $i = 1, \ldots, N$, we can introduce the following mixture of
    Dirac delta functions:

    $$\widehat{\pi}_N(\theta) := \frac{1}{N} \sum_{i=1}^{N} \delta_{\theta^{(i)}}(\theta)$$

    which is the empirical measure, and consider, for any $A$:

    $$\widehat{\pi}_N(A) \triangleq \int_A \widehat{\pi}_N(\theta)\, d\theta = \sum_{i=1}^{N} \int_A \frac{1}{N}\, \delta_{\theta^{(i)}}(\theta)\, d\theta = \frac{1}{N} \sum_{i=1}^{N} \chi_A\left(\theta^{(i)}\right) = S_N(A) \quad (\#\text{ of samples in } A \text{ over } N)$$

  • Sample Representation of $\pi$

    The concentration of points $\theta^{(i)}$ in a given region of the space
    represents $\pi$.

    This is in contrast with parametric statistics, where one starts with
    samples and then introduces a distribution with an algebraic
    representation of the underlying population.

    Note that each sample has a weight of $1/N$, but it is also possible to
    consider weighted sample representations of $\pi$.

  • Sample Representation of $\pi$

    [Figure: sample representation of a Gaussian distribution]

    See here for a MatLab implementation (Software/GaussianSampleInterpretation.m):
    https://www.dropbox.com/s/tak5ny5evo3jqeh/GaussianSampleInterpretation.m?dl=0

  • Deterministic Vs. Monte Carlo Integration

  • Sample Representation of the MC Estimator

    Now consider the problem of estimating $\mathbb{E}_\pi(f)$. We simply replace $\pi(\theta)$
    with its sample representation $\widehat{\pi}_N(\theta)$ and obtain:

    $$\mathbb{E}_\pi(f) \simeq \int_\Theta f(\theta)\, \widehat{\pi}_N(\theta)\, d\theta = \sum_{i=1}^{N} \int_\Theta \frac{1}{N}\, f(\theta)\, \delta_{\theta^{(i)}}(\theta)\, d\theta = \frac{1}{N} \sum_{i=1}^{N} f\left(\theta^{(i)}\right)$$

    This is precisely $S_N(f)$, the Monte Carlo estimator suggested earlier.

    Clearly, based on $\widehat{\pi}_N(\theta)$, we can easily estimate $\mathbb{E}_\pi(f)$ for any $f$.

    For example, the variance of $f$ is approximated as:

    $$\operatorname{Var}_\pi(f) = \mathbb{E}_\pi\left(f^2\right) - \mathbb{E}_\pi^2(f) \simeq \frac{1}{N} \sum_{i=1}^{N} f^2\left(\theta^{(i)}\right) - \left(\frac{1}{N} \sum_{i=1}^{N} f\left(\theta^{(i)}\right)\right)^2$$
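
    A small sketch of these plug-in estimates (my own; $\pi$ is taken as a
    standard normal purely for concreteness, for which $\mathbb{E}[\cos\theta] = e^{-1/2}$):

    import numpy as np

    rng = np.random.default_rng(3)
    theta = rng.standard_normal(100_000)    # theta^(i) ~ pi = N(0,1)

    f = np.cos(theta)                       # f evaluated at the samples
    S_N = f.mean()                          # MC estimator of E_pi(f)
    var_f = np.mean(f**2) - S_N**2          # plug-in estimate of Var_pi(f)

    print(S_N, np.exp(-0.5))                # estimate vs. exact E[cos(theta)]
    print(var_f)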

  • From the Algebraic to the Sample Representation

    Similarly, if $\theta = (\theta_1, \theta_2)$, we have:

    $$\widehat{\pi}_N(\theta_1, \theta_2) = \frac{1}{N} \sum_{i=1}^{N} \delta_{\left(\theta_1^{(i)}, \theta_2^{(i)}\right)}(\theta_1, \theta_2)$$

    The marginal distribution is given as:

    $$\widehat{\pi}_N(\theta_1) = \frac{1}{N} \sum_{i=1}^{N} \delta_{\theta_1^{(i)}}(\theta_1)$$

    If we want to estimate $\theta_{MAP} = \arg\max_\theta \pi(\theta)$ and $\pi(\theta)$ is only known up
    to a normalizing constant, then a reasonable estimate is the following:

    $$\widehat{\theta}_{MAP} = \arg\max_{\theta^{(i)}} \pi\left(\theta^{(i)}\right)$$

  • Sampling from an Arbitrary Distribution

    We can now see that if we can sample easily from an arbitrary
    distribution, then we can easily compute any quantities of interest.

    But how do we sample from an arbitrary distribution?

    We will discuss MCMC and other methods in follow-up lectures.

  • Monte Carlo Integration

    We can use the same estimation for performing integration. Consider
    computing the following:

    $$I = \int_D f(x)\, dx$$

    We can write this integration as $I = |D|\, \mathbb{E}\left[f(x)\right]$, where the mean is
    with respect to the uniform distribution in $D$:

    $$\pi(x) = \begin{cases} \frac{1}{|D|}, & \text{if } x \in D \\ 0, & \text{otherwise} \end{cases}$$

    To compute the integral $I$, we thus need to draw samples from this
    distribution. Our estimator will then be as follows:

    $$\widehat{I} = \frac{|D|}{n} \sum_{i=1}^{n} f(x_i)$$

    The estimator converges as $O\left(\frac{1}{\sqrt{N}}\right)$.
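
    For instance (an illustrative sketch; the domain, integrand, and exact
    answer are my example choices), with $D = [0, 2]^2$ and $f(x, y) = xy$, so
    that $I = 4$:

    import numpy as np

    rng = np.random.default_rng(4)

    n = 1_000_000
    x, y = rng.uniform(0.0, 2.0, size=(2, n))  # uniform samples in D = [0,2]^2
    I_hat = 4.0 * np.mean(x * y)               # |D| * (1/n) sum_i f(x_i), |D| = 4
    print(I_hat)                               # ~ 4.0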

  • Monte Carlo Integration

    We can extend this to high ($d$) dimensions:

    $$I = \int_D f(\boldsymbol{x})\, d\boldsymbol{x}$$

    The estimator converges as $O\left(\frac{1}{\sqrt{N}}\right)$ regardless of the dimensionality $d$.

    For a Riemann integration of this, the rate of convergence is better,
    $O\left(\frac{1}{N}\right)$, where a grid of equally spaced points is used (e.g. for
    $D = [0, 1]$, $\Delta x = 1/N$).

    However, in let us say ten dimensions, $D = [0, 1]^{10}$, you will need $O\left(N^{10}\right)$
    grid points to achieve the same rate of convergence as in 1D!

    Generally, the rate of convergence of the Riemann approximation of the
    above integral is $O\left(1/N^{r/d}\right)$, where $N$ is the total number of grid points
    and $r$ depends on the smoothness of the domain $D$.

    J. S. Liu, MC Strategies in Scientific Computing (Chapter 2, Introduction)
    http://books.google.com/books?id=R8E-yHaKCGUC&printsec=frontcover&dq=monte+carlo+strategies+in+scientific+computing&hl=en&ei=afBiTOWwNoT7lwfi1-nMCw&sa=X&oi=book_result&ct=result&resnum=1&ved=0CDUQ6AEwAA

  • Monte Carlo for Integration

    [Figure: $\pi(x)$ evaluated on a regular grid (Riemann integration) vs.
    represented by samples (MC integration)]

    In computing $I = \int_D f(x)\, dx$ with $D = [0, 1]$ using Riemann integration, one
    e.g. constructs a grid with $\Delta x = 1/N$, computes $f(x_i) = f(i \Delta x)$, and, with
    approximation error $O\left(\frac{1}{N}\right)$, evaluates:

    $$I_N = \sum_{i=1}^{N} f(x_i)\, \Delta x = \frac{1}{N} \sum_{i=1}^{N} f(x_i)$$

    In $d$ dimensions, we need $N^d$ evaluations to keep an error $O\left(\frac{1}{N}\right)$.

    Clearly the MC estimator is slower in convergence, $O\left(\frac{1}{\sqrt{N}}\right)$, but it
    converges independently of the dimension $d$. It requires no grid of
    points. But you need to be able to sample from $p(x)$.

  • Monte Carlo Integration

    Consider calculating, for any function $f(x)$, the following integral:

    $$\mu(f) = \int_0^1 f(x)\, dx$$

    We can write this integral as:

    $$\mu(f) = \int_{-\infty}^{+\infty} f(x)\, \mathbb{I}_{[0,1]}(x)\, dx = \mathbb{E}\left[f(X)\right]$$

    Here $X \sim \mathcal{U}[0, 1]$.

    The Monte Carlo estimator of the integral is now:

    $$\widehat{\mu}_N(f) = \frac{1}{N} \sum_{i=1}^{N} f(X_i), \qquad X_i \overset{i.i.d.}{\sim} \mathcal{U}[0, 1]$$

  • Monte Carlo Integration

    [Figure: empirical average vs. theoretical mean as a function of the number
    of samples for $f(x) = x$ ($\mu(f) = 0.5$), $f(x) = x^2$ ($\mu(f) = 0.333$), and
    $f(x) = \cos(\pi x)$ ($\mu(f) = 0$)]

    MatLab implementation:
    https://www.dropbox.com/s/kp98e697plbkkia/MonteCarloEx1.m?dl=0

  • Monte Carlo Integration: Variance

    From the earlier discussed properties of the MC estimator:

    $$\mathbb{E}\left[\widehat{\mu}_N\right] = \mu, \qquad \operatorname{Var}\left[\widehat{\mu}_N\right] = \frac{1}{N} \operatorname{Var}\left[f(X)\right] = \frac{1}{N}\, \sigma^2(f), \qquad X \sim \mathcal{U}[0, 1]$$

    where:

    $$\sigma^2(f) = \operatorname{Var}\left[f(X)\right] = \int_{[0,1]} \left(f(x) - \mu(f)\right)^2 dx$$

    When the variance $\sigma^2(f) = \operatorname{Var}\left[f(X)\right]$ is unknown, we can use its MC
    estimator instead:

    $$\widehat{\sigma}_N^2(f) = \frac{1}{N - 1} \sum_{i=1}^{N} \left(f(X_i) - \widehat{\mu}_N(f)\right)^2, \qquad \widehat{\mu}_N(f) = \frac{1}{N} \sum_{j=1}^{N} f(X_j), \qquad X_i \overset{i.i.d.}{\sim} \mathcal{U}[0, 1]$$

  • Monte Carlo Integration: Empirical Variance

    $$\widehat{\sigma}_N^2(f) = \frac{1}{N - 1} \sum_{i=1}^{N} \left(f(X_i) - \widehat{\mu}_N(f)\right)^2, \qquad X_i \overset{i.i.d.}{\sim} \mathcal{U}[0, 1]$$

    [Figure: empirical vs. theoretical variance as a function of the number of
    samples for $f(x) = x$ ($\operatorname{Var} f(X) = 1/12$), $f(x) = x^2$ ($\operatorname{Var} f(X) = 4/45$),
    and $f(x) = \cos(\pi x)$ ($\operatorname{Var} f(X) = 1/2$)]

    MatLab implementation:
    https://www.dropbox.com/s/kp98e697plbkkia/MonteCarloEx1.m?dl=0

  • Optimal Number of MC Samples

    We can compute the optimal number of MC samples for a desired accuracy
    of the estimator:

    $$\Pr\left(\left|\widehat{\mu}_N(f) - \mu(f)\right| \le \varepsilon\right) = 1 - \alpha \;\Rightarrow\; \Pr\left(\frac{\left|\widehat{\mu}_N(f) - \mu(f)\right|}{\sigma/\sqrt{N}} \le \frac{\varepsilon \sqrt{N}}{\sigma}\right) = 1 - \alpha$$

    From the asymptotic properties

    $$\sqrt{N}\left(\widehat{\mu}_N - \mu\right) \xrightarrow[N \to \infty]{D} \mathcal{N}\left(0, \sigma^2(f)\right)$$

    we can estimate:

    $$\frac{\varepsilon \sqrt{N}}{\sigma} = X_\alpha \;\Rightarrow\; N = \frac{X_\alpha^2\, \sigma^2(f)}{\varepsilon^2}$$

    where:

    $$X_\alpha = \Phi^{-1}\left(1 - \frac{\alpha}{2}\right) \quad (\text{inverse CDF of } \mathcal{N}(0, 1)), \qquad \alpha \in [0, 1]$$

    Approximating the variance $\sigma^2(f)$ by its MC estimate, the number of
    needed samples $N$ should satisfy:

    $$\widehat{\sigma}_N^2(f) \le \frac{N \varepsilon^2}{X_\alpha^2}$$

  • Optimal Number of MC Samples

    The condition $\widehat{\sigma}_N^2(f) \le \frac{N \varepsilon^2}{X_\alpha^2}$ can be satisfied iteratively:

    1. Start with $n_1$ MC samples $X_1, \ldots, X_{n_1} \sim \mathcal{U}[0, 1]$.

    2. If $\widehat{\sigma}_{n_1}^2(f) \le \frac{n_1 \varepsilon^2}{X_\alpha^2}$, then stop; otherwise

    3. Evaluate $k_1 = \left\lfloor \frac{X_\alpha^2\, \widehat{\sigma}_{n_1}^2(f)}{\varepsilon^2} - n_1 \right\rfloor$ and generate $k_1$ additional samples
       $X_{n_1 + 1}, \ldots, X_{n_1 + k_1} \sim \mathcal{U}[0, 1]$, where $\lfloor x \rfloor$ indicates the integer part of $x$.
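
    A direct transcription of this loop (my sketch; the choices of $f$, $\varepsilon$,
    and $\alpha$ are illustrative):

    import numpy as np
    from statistics import NormalDist

    rng = np.random.default_rng(6)

    def samples_for_accuracy(f, eps=1e-3, alpha=0.05, n1=1_000):
        """Grow the sample until sigma_hat_N^2(f) <= N * eps^2 / X_alpha^2."""
        X_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # inverse CDF of N(0,1)
        X = rng.uniform(0.0, 1.0, n1)
        while f(X).var(ddof=1) > len(X) * eps**2 / X_alpha**2:
            k = int(X_alpha**2 * f(X).var(ddof=1) / eps**2 - len(X))
            X = np.concatenate([X, rng.uniform(0.0, 1.0, max(k, 1))])
        return len(X), f(X).mean()

    print(samples_for_accuracy(lambda x: x**2))   # (N, estimate of 1/3)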

  • Computing Expectations

    Expectations with respect to any distribution $\pi$ also involve high-
    dimensional integration:

    $$I = \int_D f(\boldsymbol{x})\, \pi(\boldsymbol{x})\, d\boldsymbol{x}$$

    If we can draw samples $x_i \sim \pi(\boldsymbol{x})$, then we can use the estimator

    $$\widehat{I} = \frac{1}{N} \sum_{i=1}^{N} f(x_i), \qquad \text{with} \quad \text{Coeff. of Variat.}(\widehat{I}) = \frac{\sqrt{\operatorname{Var}_{\pi(x)} f(x)\,/\,N}}{\mathbb{E}_{\pi(x)} f(x)}$$

  • Bayes Factor Approximation

    For the CMBdata, consider 2 subsamples $x_1, \ldots, x_n$ and $y_1, \ldots, y_n$.

    We consider that both come from normal distributions,
    $x_1, \ldots, x_n \sim \mathcal{N}\left(\mu_x, \sigma^2\right)$ and $y_1, \ldots, y_n \sim \mathcal{N}\left(\mu_y, \sigma^2\right)$.

    We want to decide if both means are the same, i.e. test $H_0 : \mu_x = \mu_y$.

    We assume that the variance (error) $\sigma^2$ is the same for both models
    and use a prior $\pi\left(\mu, \sigma^2\right) \propto 1/\sigma^2$.

    The Bayes factor can then be computed as follows ($\ell_n$ denotes the
    likelihood of the data $\mathcal{D}$):

    $$B_{10}^\pi = \frac{\int \pi\left(\mu_x, \mu_y, \sigma^2\right)\, \ell_n\left(\mu_x, \mu_y, \sigma^2 \mid \mathcal{D}\right)\, d\mu_x\, d\mu_y\, d\sigma^2}{\int \pi\left(\mu, \sigma^2\right)\, \ell_n\left(\mu, \sigma^2 \mid \mathcal{D}\right)\, d\mu\, d\sigma^2}$$

    Note that the Bayes factor does not depend on the normalizing constant
    of the prior on $\left(\mu, \sigma^2\right)$, and thus the use of an improper prior is fine.

    J.-M. Marin and C. P. Robert, Bayesian Core, Chapter 2
    https://www.dropbox.com/s/7w9peyrlpwcdv4m/CMBdata?dl=0
    http://www.springerlink.com/content/x38261/

  • Bayes Factor Approximation

    We introduce a new parametrization by writing:

    $$\mu_x = \mu + \xi, \qquad \mu_y = \mu - \xi,$$

    with the prior $\xi \sim \mathcal{N}(0, 1)$.

    This allows the same prior to be used under both $H_0$ and the
    alternative $H_1 : \mu_x \ne \mu_y$.

    We choose the improper prior $\pi\left(\mu, \sigma^2\right) \propto 1/\sigma^2$ and $\xi \sim \mathcal{N}(0, 1)$.

    The Bayes factor can now be re-written as:

    $$B_{10}^\pi = \frac{\int \left(\sigma^2\right)^{-n/2} e^{-\frac{n\left[(\bar{x} - \mu - \xi)^2 + s_x^2\right]}{2\sigma^2}} \left(\sigma^2\right)^{-n/2} e^{-\frac{n\left[(\bar{y} - \mu + \xi)^2 + s_y^2\right]}{2\sigma^2}}\, \frac{e^{-\xi^2/2}}{\sqrt{2\pi}}\, \sigma^{-2}\, d\mu\, d\xi\, d\sigma^2}{\int \left(\sigma^2\right)^{-n/2} e^{-\frac{n\left[(\bar{x} - \mu)^2 + s_x^2\right]}{2\sigma^2}} \left(\sigma^2\right)^{-n/2} e^{-\frac{n\left[(\bar{y} - \mu)^2 + s_y^2\right]}{2\sigma^2}}\, \sigma^{-2}\, d\mu\, d\sigma^2}$$

  • Bayes Factor Approximation

    To simplify the Bayes factor, let us define:

    $$s_x^2 = \frac{1}{n} \sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2, \qquad s_y^2 = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2, \qquad S^2 = s_x^2 + s_y^2$$

    We can then write:

    $$B_{10}^\pi = \frac{\int \left(\sigma^2\right)^{-n-1} e^{-\frac{n}{2\sigma^2}\left[(\bar{x} - \mu - \xi)^2 + (\bar{y} - \mu + \xi)^2 + S^2\right]}\, \frac{e^{-\xi^2/2}}{\sqrt{2\pi}}\, d\mu\, d\xi\, d\sigma^2}{\int \left(\sigma^2\right)^{-n-1} e^{-\frac{n}{2\sigma^2}\left[(\bar{x} - \mu)^2 + (\bar{y} - \mu)^2 + S^2\right]}\, d\mu\, d\sigma^2}$$

    Performing the integrals in $\sigma^2$:

    $$B_{10}^\pi = \frac{\int \left[(\bar{x} - \mu - \xi)^2 + (\bar{y} - \mu + \xi)^2 + S^2\right]^{-n}\, \frac{e^{-\xi^2/2}}{\sqrt{2\pi}}\, d\mu\, d\xi}{\int \left[(\bar{x} - \mu)^2 + (\bar{y} - \mu)^2 + S^2\right]^{-n}\, d\mu}$$

    Let us now integrate in $\mu$. Start with the denominator. Note that:

    $$(\bar{x} - \mu)^2 + (\bar{y} - \mu)^2 = 2\left(\mu - \frac{\bar{x} + \bar{y}}{2}\right)^2 + \frac{(\bar{x} - \bar{y})^2}{2}$$

  • Bayes Factor Approximation

    So the denominator takes the form:

    $$\int \left[2\left(\mu - \frac{\bar{x} + \bar{y}}{2}\right)^2 + \frac{(\bar{x} - \bar{y})^2}{2} + S^2\right]^{-n}\, d\mu$$

    Integrating out $\mu$ (here we use the normalizing constant of the
    t-distribution with $2n - 1$ degrees of freedom, which gives a factor
    common to the denominator and numerator of $B_{10}^\pi$), we can simplify:

    $$\int \left[2\left(\mu - \frac{\bar{x} + \bar{y}}{2}\right)^2 + \frac{(\bar{x} - \bar{y})^2}{2} + S^2\right]^{-n}\, d\mu \;\propto\; \left[\frac{(\bar{x} - \bar{y})^2}{2} + S^2\right]^{-n + 1/2} \;\propto\; \left[(\bar{x} - \bar{y})^2 + 2S^2\right]^{-n + 1/2}$$

    http://en.wikipedia.org/wiki/Student's_t-distribution

  • Bayes Factor Approximation

    Thus, for the normal case

    $$x_1, \ldots, x_n \sim \mathcal{N}\left(\mu + \xi, \sigma^2\right), \qquad y_1, \ldots, y_n \sim \mathcal{N}\left(\mu - \xi, \sigma^2\right), \qquad H_0 : \xi = 0,$$

    under the prior $\pi\left(\mu, \sigma^2\right) \propto 1/\sigma^2$ and $\xi \sim \mathcal{N}(0, 1)$:

    $$B_{01}^\pi = \frac{\left[(\bar{x} - \bar{y})^2 + 2S^2\right]^{-n + 1/2}}{\int \left[(2\xi + \bar{x} - \bar{y})^2 + 2S^2\right]^{-n + 1/2}\, \frac{e^{-\xi^2/2}}{\sqrt{2\pi}}\, d\xi}$$

    For CMBdata, we simulate $\xi_1, \ldots, \xi_{1000} \sim \mathcal{N}(0, 1)$ and approximate $B_{01}^\pi$
    with (for one simulation that we ran)

    $$\widehat{B}_{01}^\pi = \frac{\left[(\bar{x} - \bar{y})^2 + 2S^2\right]^{-n + 1/2}}{\frac{1}{1000} \sum_{i=1}^{1000} \left[(2\xi_i + \bar{x} - \bar{y})^2 + 2S^2\right]^{-n + 1/2}} = 43.3309$$

    when $\bar{x} = 0.0888$, $\bar{y} = 0.1078$, $S^2 = 0.00875$, $n = 100$. Thus $H_0$ is much
    more likely with the data available.

    https://www.dropbox.com/s/7w9peyrlpwcdv4m/CMBdata?dl=0
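
    A sketch of this approximation (my own Python transcription; the summary
    statistics are the ones quoted above):

    import numpy as np

    rng = np.random.default_rng(7)
    xbar, ybar, S2, n = 0.0888, 0.1078, 0.00875, 100   # CMBdata summaries from the slide

    xi = rng.standard_normal(1000)                     # xi_i ~ N(0,1)
    num = ((xbar - ybar) ** 2 + 2 * S2) ** (-n + 0.5)
    den = np.mean(((2 * xi + xbar - ybar) ** 2 + 2 * S2) ** (-n + 0.5))
    print(num / den)   # the slide reports 43.3309 for one run; run-to-run variability is large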

  • Precision Evaluation in Bayes Factor

    For an estimator $\bar{h}_N = \frac{1}{N} \sum_{j=1}^{N} h(x_j)$ of $I = \int h(\boldsymbol{x})\, \pi(\boldsymbol{x})\, d\boldsymbol{x}$, with
    $x_i \sim \pi(\boldsymbol{x})$, the variance can be computed (in terms of the empirical
    mean) as

    $$v_N = \frac{1}{N}\, \frac{1}{N - 1} \sum_{j=1}^{N} \left(h(x_j) - \bar{h}_N\right)^2$$

    and, for $N$ large,

    $$\frac{\bar{h}_N - \mathbb{E}_f\left[h(X)\right]}{\sqrt{v_N}} \sim \mathcal{N}(0, 1)$$

    We can thus find the variability in the Bayes factor estimation.

    We construct a convergence test and confidence bounds on the
    approximation of $B_{01}^\pi$.

    [Figure: histogram of 1000 realizations of the approximation of $B_{01}^\pi$,
    based on 1000 simulations each]

    MatLab implementation:
    https://www.dropbox.com/s/noh9vipbkl9pfrk/PrecisionEvaluation.m?dl=0

  • Example (Cauchy-Normal)

    For estimating a normal mean, a robust prior is a Cauchy prior:

    $$x \sim \mathcal{N}(\theta, 1), \qquad \theta \sim \mathcal{C}(0, 1)$$

    Under squared error loss, the posterior mean is given as:

    $$\delta^\pi(x) = \frac{\displaystyle\int_{-\infty}^{\infty} \frac{\theta}{1 + \theta^2}\, e^{-(x - \theta)^2/2}\, d\theta}{\displaystyle\int_{-\infty}^{\infty} \frac{1}{1 + \theta^2}\, e^{-(x - \theta)^2/2}\, d\theta}$$

    The form of $\delta^\pi$ suggests simulating i.i.d. variables $\theta_1, \cdots, \theta_m \sim \mathcal{N}(x, 1)$
    and calculating

    $$\widehat{\delta}_m^\pi(x) = \frac{\sum_{i=1}^{m} \frac{\theta_i}{1 + \theta_i^2}}{\sum_{i=1}^{m} \frac{1}{1 + \theta_i^2}}$$

    The LLN implies $\widehat{\delta}_m^\pi(x) \to \delta^\pi(x)$ as $m \to \infty$.

    [Figure: convergence of $\widehat{\delta}_m^\pi(x)$ as a function of $m$]

    See here for a MatLab implementation:
    https://www.dropbox.com/s/xxfqlloltlqchek/CauchyPrior.m?dl=0
    https://en.wikipedia.org/wiki/Cauchy_distribution
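
    A compact Python version of this estimator (my sketch of the linked
    MatLab code):

    import numpy as np

    rng = np.random.default_rng(8)

    def delta_m(x, m=100_000):
        """MC estimate of the posterior mean: theta_i ~ N(x,1), Cauchy-prior weights."""
        theta = rng.normal(x, 1.0, m)
        w = 1.0 / (1.0 + theta**2)          # Cauchy prior density up to a constant
        return np.sum(w * theta) / np.sum(w)

    print(delta_m(10.0))   # shrinks x = 10 slightly toward the prior median 0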