Tutorial on quasi-Monte Carlo methods

Josef Dick, School of Mathematics and Statistics, UNSW, Sydney, Australia



Comparison: MCMC, MC, QMC

Roughly speaking:

• Markov chain Monte Carlo and quasi-Monte Carlo are for different types of problems;

• If you have a problem where Monte Carlo does not work, then chances are quasi-Monte Carlo will not work either;

• If Monte Carlo works, but you want a faster method ⇒ try (randomized) quasi-Monte Carlo (some tweaking might be necessary);

• Quasi-Monte Carlo is an "experimental design" approach to Monte Carlo simulation.

In this talk we discuss how quasi-Monte Carlo can be faster than Monte Carlo under certain assumptions.

Outline

DISCREPANCY, REPRODUCING KERNEL HILBERT SPACES AND WORST-CASE ERROR

QUASI-MONTE CARLO POINT SETS

RANDOMIZATIONS

WEIGHTED FUNCTION SPACES AND TRACTABILITY

The task is to approximate an integral

$$I_s(f) = \int_{[0,1]^s} f(\mathbf{z}) \,\mathrm{d}\mathbf{z}$$

for some integrand $f$ by a quadrature rule

$$Q_{N,s}(f) = \frac{1}{N} \sum_{n=0}^{N-1} f(\mathbf{x}_n)$$

at sample points $\mathbf{x}_0, \ldots, \mathbf{x}_{N-1} \in [0,1]^s$.

$$\int_{[0,1]^s} f(\mathbf{x}) \,\mathrm{d}\mathbf{x} \approx \frac{1}{N} \sum_{n=0}^{N-1} f(\mathbf{x}_n)$$

In other words:

Area under curve = volume of cube × average function value.

• If $\mathbf{x}_0, \ldots, \mathbf{x}_{N-1} \in [0,1]^s$ are chosen randomly ⇒ Monte Carlo

• If $\mathbf{x}_0, \ldots, \mathbf{x}_{N-1} \in [0,1]^s$ are chosen deterministically ⇒ quasi-Monte Carlo
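The Monte Carlo side of this comparison can be sketched in a few lines; the test integrand `f` and its exact integral are assumptions for illustration, not taken from the slides:

```python
import math
import random

def mc_estimate(f, s, n, rng):
    """Plain Monte Carlo: average f over n uniform random points in [0,1]^s."""
    return sum(f([rng.random() for _ in range(s)]) for _ in range(n)) / n

# Illustrative integrand: f(x) = prod_i 2*x_i has exact integral 1 over [0,1]^s.
def f(x):
    return math.prod(2.0 * xi for xi in x)

rng = random.Random(42)
estimate = mc_estimate(f, s=3, n=20000, rng=rng)
print(estimate)  # close to 1, with error of order n**(-1/2)
```

Replacing the random points by a deterministic low-discrepancy set, as constructed later in the talk, turns this into a quasi-Monte Carlo rule.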

Smooth integrands

• Integrand $f : [0,1] \to \mathbb{R}$, say continuously differentiable.
• We want to study the integration error

$$\int_0^1 f(x) \,\mathrm{d}x - \frac{1}{N} \sum_{n=0}^{N-1} f(x_n).$$

• Representation:

$$f(x) = f(1) - \int_x^1 f'(t) \,\mathrm{d}t = f(1) - \int_0^1 1_{[0,t]}(x) f'(t) \,\mathrm{d}t,$$

where

$$1_{[0,t]}(x) = \begin{cases} 1 & \text{if } x \in [0,t], \\ 0 & \text{otherwise.} \end{cases}$$

Integration error

• Substitute the representation into both the integral and the quadrature sum:

$$\int_0^1 f(x) \,\mathrm{d}x = \int_0^1 \left( f(1) - \int_0^1 1_{[0,t]}(x) f'(t) \,\mathrm{d}t \right) \mathrm{d}x = f(1) - \int_0^1 \int_0^1 1_{[0,t]}(x) f'(t) \,\mathrm{d}x \,\mathrm{d}t,$$

$$\frac{1}{N} \sum_{n=0}^{N-1} f(x_n) = \frac{1}{N} \sum_{n=0}^{N-1} \left( f(1) - \int_0^1 1_{[0,t]}(x_n) f'(t) \,\mathrm{d}t \right) = f(1) - \int_0^1 \frac{1}{N} \sum_{n=0}^{N-1} 1_{[0,t]}(x_n) f'(t) \,\mathrm{d}t.$$

• Integration error:

$$\int_0^1 f(x) \,\mathrm{d}x - \frac{1}{N} \sum_{n=0}^{N-1} f(x_n) = \int_0^1 \left( \frac{1}{N} \sum_{n=0}^{N-1} 1_{[0,t]}(x_n) - \int_0^1 1_{[0,t]}(x) \,\mathrm{d}x \right) f'(t) \,\mathrm{d}t.$$

Local discrepancy

Since $\int_0^1 1_{[0,t]}(x) \,\mathrm{d}x = t$, the integration error becomes

$$\int_0^1 f(x) \,\mathrm{d}x - \frac{1}{N} \sum_{n=0}^{N-1} f(x_n) = \int_0^1 \left( \frac{1}{N} \sum_{n=0}^{N-1} 1_{[0,t]}(x_n) - t \right) f'(t) \,\mathrm{d}t.$$

Let $P = \{x_0, \ldots, x_{N-1}\} \subset [0,1]$. Define the local discrepancy by

$$\Delta_P(t) = \frac{1}{N} \sum_{n=0}^{N-1} 1_{[0,t]}(x_n) - t, \qquad t \in [0,1].$$

Then

$$\int_0^1 f(x) \,\mathrm{d}x - \frac{1}{N} \sum_{n=0}^{N-1} f(x_n) = \int_0^1 \Delta_P(t) f'(t) \,\mathrm{d}t.$$
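The local discrepancy is easy to evaluate for a concrete one-dimensional point set; the midpoint example below is an assumption for illustration. A minimal sketch:

```python
def local_discrepancy(points, t):
    """Delta_P(t) = (1/N) * #{n : x_n in [0, t]} - t for a 1-D point set."""
    return sum(1 for x in points if x <= t) / len(points) - t

# Midpoints of N equal subintervals: |Delta_P(t)| never exceeds 1/(2N).
N = 8
pts = [(2 * k + 1) / (2 * N) for k in range(N)]
worst = max(abs(local_discrepancy(pts, k / 1000)) for k in range(1001))
print(worst)  # at most 1/(2N) = 0.0625
```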

Koksma-Hlawka inequality

By Hölder's inequality,

$$\left| \int_0^1 f(x) \,\mathrm{d}x - \frac{1}{N} \sum_{n=0}^{N-1} f(x_n) \right| = \left| \int_0^1 \Delta_P(t) f'(t) \,\mathrm{d}t \right| \le \left( \int_0^1 |\Delta_P(t)|^p \,\mathrm{d}t \right)^{1/p} \left( \int_0^1 |f'(t)|^q \,\mathrm{d}t \right)^{1/q} = \|\Delta_P\|_{L_p} \|f'\|_{L_q}, \qquad \frac{1}{p} + \frac{1}{q} = 1,$$

where $\|g\|_{L_p} = \left( \int |g|^p \right)^{1/p}$.

Interpretation of discrepancy

Let $P = \{x_0, \ldots, x_{N-1}\} \subset [0,1]$. Recall the definition of the local discrepancy:

$$\Delta_P(t) = \frac{1}{N} \sum_{n=0}^{N-1} 1_{[0,t]}(x_n) - t, \qquad t \in [0,1].$$

The local discrepancy measures the difference between the uniform distribution and the empirical distribution of the quadrature points $P = \{x_0, \ldots, x_{N-1}\}$.

This is the Kolmogorov-Smirnov statistic for the difference between the empirical distribution of $\{x_0, \ldots, x_{N-1}\}$ and the uniform distribution.

Function space

Representation:

$$f(x) = f(1) - \int_x^1 f'(t) \,\mathrm{d}t.$$

Define the inner product

$$\langle f, g \rangle = f(1)g(1) + \int_0^1 f'(t) g'(t) \,\mathrm{d}t$$

and the norm

$$\|f\| = \sqrt{ |f(1)|^2 + \int_0^1 |f'(t)|^2 \,\mathrm{d}t }.$$

Function space:

$$\mathcal{H} = \left\{ f : [0,1] \to \mathbb{R} \;:\; f \text{ absolutely continuous and } \|f\| < \infty \right\}.$$

Worst-case error

The worst-case error is defined by

$$e(\mathcal{H}, P) = \sup_{f \in \mathcal{H}, \|f\| \le 1} \left| \int_0^1 f(x) \,\mathrm{d}x - \frac{1}{N} \sum_{n=0}^{N-1} f(x_n) \right|.$$

Reproducing kernel Hilbert space

Recall:

$$f(y) = f(1) \cdot 1 + \int_0^1 f'(x) \left( -1_{[0,x]}(y) \right) \mathrm{d}x$$

and

$$\langle f, g \rangle = f(1)g(1) + \int_0^1 f'(x) g'(x) \,\mathrm{d}x.$$

Goal: find functions $g_y \in \mathcal{H}$, one for each $y \in [0,1]$, such that

$$\langle f, g_y \rangle = f(y) \quad \text{for all } f \in \mathcal{H}.$$

Conclusions:
• $g_y(1) = 1$ for all $y \in [0,1]$;
•

$$g_y'(x) = -1_{[0,x]}(y) = \begin{cases} -1 & \text{if } y \le x, \\ 0 & \text{otherwise}; \end{cases}$$

• make $g_y$ continuous so that $g_y \in \mathcal{H}$.

Reproducing kernel Hilbert space

It follows that

$$g_y(x) = 1 + \min\{1 - x, 1 - y\}.$$

The function

$$K(x, y) := g_y(x) = 1 + \min\{1 - x, 1 - y\}, \qquad x, y \in [0,1],$$

is called the reproducing kernel.

The space $\mathcal{H}$ is called a reproducing kernel Hilbert space (with reproducing kernel $K$).

Numerical integration in reproducing kernel Hilbert spaces

Using the reproducing property $f(z) = \langle f, K(\cdot, z) \rangle$:

$$\int_0^1 f(z) \,\mathrm{d}z = \int_0^1 \langle f, K(\cdot, z) \rangle \,\mathrm{d}z = \left\langle f, \int_0^1 K(\cdot, z) \,\mathrm{d}z \right\rangle,$$

$$\frac{1}{N} \sum_{n=0}^{N-1} f(x_n) = \frac{1}{N} \sum_{n=0}^{N-1} \langle f, K(\cdot, x_n) \rangle = \left\langle f, \frac{1}{N} \sum_{n=0}^{N-1} K(\cdot, x_n) \right\rangle.$$

Integration error:

$$\int_0^1 f(z) \,\mathrm{d}z - \frac{1}{N} \sum_{n=0}^{N-1} f(x_n) = \left\langle f, \int_0^1 K(\cdot, z) \,\mathrm{d}z - \frac{1}{N} \sum_{n=0}^{N-1} K(\cdot, x_n) \right\rangle = \langle f, h \rangle,$$

where

$$h(x) = \int_0^1 K(x, z) \,\mathrm{d}z - \frac{1}{N} \sum_{n=0}^{N-1} K(x, x_n).$$

Worst-case error in reproducing kernel Hilbert spaces

Thus

$$e(\mathcal{H}, P) = \sup_{f \in \mathcal{H}, \|f\| \le 1} \left| \int_0^1 f(z) \,\mathrm{d}z - \frac{1}{N} \sum_{n=0}^{N-1} f(x_n) \right| = \sup_{f \in \mathcal{H}, \|f\| = 1} |\langle f, h \rangle| = \sup_{f \in \mathcal{H}, f \ne 0} \left| \left\langle \frac{f}{\|f\|}, h \right\rangle \right| = \frac{\langle h, h \rangle}{\|h\|} = \|h\|,$$

since the supremum is attained at $f = h / \|h\| \in \mathcal{H}$.

Worst-case error in reproducing kernel Hilbert spaces

$$e^2(\mathcal{H}, P) = \int_0^1 \int_0^1 K(x, y) \,\mathrm{d}x \,\mathrm{d}y - \frac{2}{N} \sum_{n=0}^{N-1} \int_0^1 K(x, x_n) \,\mathrm{d}x + \frac{1}{N^2} \sum_{n,m=0}^{N-1} K(x_n, x_m).$$
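For the kernel $K(x,y) = 1 + \min\{1-x, 1-y\}$ this formula is fully explicit. In the sketch below, the closed forms $\int_0^1\int_0^1 K \,\mathrm{d}x\,\mathrm{d}y = 4/3$ and $\int_0^1 K(x,y) \,\mathrm{d}x = 3/2 - y^2/2$ are hand-derived for this example, and the two test point sets are assumptions for illustration:

```python
def wce_squared(points):
    """Squared worst-case error e^2(H, P) for K(x, y) = 1 + min(1-x, 1-y).
    Uses hand-derived closed forms:
      int int K dx dy = 4/3,   int K(x, y) dx = 3/2 - y^2/2."""
    n = len(points)
    term1 = 4.0 / 3.0
    term2 = (2.0 / n) * sum(1.5 - y * y / 2.0 for y in points)
    term3 = sum(1.0 + min(1.0 - x, 1.0 - y) for x in points for y in points) / n**2
    return term1 - term2 + term3

mid = [(2 * k + 1) / 10 for k in range(5)]   # evenly spread points
bad = [0.1] * 5                              # all points clustered at 0.1
print(wce_squared(mid), wce_squared(bad))
```

As expected, the evenly spread points give a much smaller worst-case error than the clustered set of the same size.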

Numerical integration in higher dimensions

• Tensor product space $\mathcal{H}_s = \mathcal{H} \otimes \cdots \otimes \mathcal{H}$.
• Reproducing kernel

$$K(\mathbf{x}, \mathbf{y}) = \prod_{i=1}^{s} \left[ 1 + \min\{1 - x_i, 1 - y_i\} \right],$$

where $\mathbf{x} = (x_1, \ldots, x_s), \mathbf{y} = (y_1, \ldots, y_s) \in [0,1]^s$.
• Functions $f \in \mathcal{H}_s$ have mixed partial derivatives of order up to 1 in each variable,

$$\frac{\partial^{|u|} f}{\partial \mathbf{x}_u} \in L_2([0,1]^s),$$

where $u \subseteq \{1, \ldots, s\}$, $\mathbf{x}_u = (x_i)_{i \in u}$, $|u|$ denotes the cardinality of $u$, and where $\frac{\partial^{|u|} f}{\partial \mathbf{x}_u}(\mathbf{x}_u, \mathbf{1}) = 0$ for all $u \subset \{1, \ldots, s\}$.

Worst-case error

Again

$$e^2(\mathcal{H}_s, P) = \int_{[0,1]^s} \int_{[0,1]^s} K(\mathbf{x}, \mathbf{y}) \,\mathrm{d}\mathbf{x} \,\mathrm{d}\mathbf{y} - \frac{2}{N} \sum_{n=0}^{N-1} \int_{[0,1]^s} K(\mathbf{x}, \mathbf{x}_n) \,\mathrm{d}\mathbf{x} + \frac{1}{N^2} \sum_{n,m=0}^{N-1} K(\mathbf{x}_n, \mathbf{x}_m),$$

and the integration error satisfies

$$\int_{[0,1]^s} f(\mathbf{x}) \,\mathrm{d}\mathbf{x} - \frac{1}{N} \sum_{n=0}^{N-1} f(\mathbf{x}_n) = \int_{[0,1]^s} \Delta_P(\mathbf{x}) \frac{\partial^s f}{\partial \mathbf{x}}(\mathbf{x}) \,\mathrm{d}\mathbf{x},$$

where

$$\Delta_P(\mathbf{x}) = \frac{1}{N} \sum_{n=0}^{N-1} 1_{[\mathbf{0}, \mathbf{x}]}(\mathbf{x}_n) - \prod_{i=1}^{s} x_i.$$

Discrepancy in higher dimensions

Point set $P = \{\mathbf{x}_0, \ldots, \mathbf{x}_{N-1}\} \subset [0,1]^s$, $\mathbf{t} = (t_1, \ldots, t_s) \in [0,1]^s$.

• Local discrepancy:

$$\Delta_P(\mathbf{t}) = \frac{1}{N} \sum_{n=0}^{N-1} 1_{[\mathbf{0}, \mathbf{t}]}(\mathbf{x}_n) - \prod_{i=1}^{s} t_i,$$

where $[\mathbf{0}, \mathbf{t}] = \prod_{i=1}^{s} [0, t_i]$.

Koksma-Hlawka inequality

Let $f : [0,1]^s \to \mathbb{R}$ with

$$\|f\| = \left( \int_{[0,1]^s} \left| \frac{\partial^s}{\partial \mathbf{x}} f(\mathbf{x}) \right|^q \,\mathrm{d}\mathbf{x} \right)^{1/q},$$

and where $\frac{\partial^{|u|} f}{\partial \mathbf{x}_u}(\mathbf{x}_u, \mathbf{1}) = 0$ for all $u \subset \{1, \ldots, s\}$. Then

$$\left| \int_{[0,1]^s} f(\mathbf{x}) \,\mathrm{d}\mathbf{x} - \frac{1}{N} \sum_{n=0}^{N-1} f(\mathbf{x}_n) \right| \le \|f\| \, \|\Delta_P\|_{L_p},$$

where $\frac{1}{p} + \frac{1}{q} = 1$.

⇒ Construct points $P = \{\mathbf{x}_0, \ldots, \mathbf{x}_{N-1}\} \subset [0,1]^s$ with small discrepancy

$$L_p(P) = \|\Delta_P\|_{L_p}.$$

It is often useful to consider different reproducing kernels, yielding different worst-case errors and discrepancies. We will see further examples later.

Constructions of low-discrepancy sequences

• Lattice rules (Bilyk, Brauchart, Cools, D., Hellekalek, Hickernell, Hlawka, Joe, Keller, Korobov, Kritzer, Kuo, Larcher, L'Ecuyer, Lemieux, Leobacher, Niederreiter, Nuyens, Pillichshammer, Sinescu, Sloan, Temlyakov, Wang, Wozniakowski, ...)

• Digital nets and sequences (Baldeaux, Bierbrauer, Brauchart, Chen, D., Edel, Faure, Hellekalek, Hofer, Keller, Kritzer, Kuo, Larcher, Leobacher, Niederreiter, Owen, Özbudak, Pillichshammer, Pirsic, Schmid, Skriganov, Sobol', Wang, Xing, Yue, ...)

• Hammersley-Halton sequences (Atanassov, De Clerck, Faure, Halton, Hammersley, Kritzer, Larcher, Lemieux, Pillichshammer, Pirsic, White, ...)

• Kronecker sequences (Beck, Hellekalek, Larcher, Niederreiter, Schoissengeier, ...)

Lattice rules

Let $N \in \mathbb{N}$ and let

$$\mathbf{g} = (g_1, \ldots, g_s) \in \{1, \ldots, N-1\}^s.$$

Set

$$\mathbf{x}_n = \left\{ \frac{n \mathbf{g}}{N} \right\}, \qquad n = 0, \ldots, N-1,$$

where $\{z\} = z - \lfloor z \rfloor$ for $z \in \mathbb{R}_0^+$, applied componentwise.
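This construction is a one-liner in practice; the sketch below uses the Fibonacci parameters $N = 55$, $\mathbf{g} = (1, 34)$ shown on the next slide:

```python
def lattice_points(N, g):
    """Rank-1 lattice rule: x_n = frac(n * g / N), n = 0, ..., N-1."""
    return [tuple((n * gi / N) % 1.0 for gi in g) for n in range(N)]

pts = lattice_points(55, (1, 34))  # Fibonacci lattice: N = 55, g = (1, 34)
print(pts[:3])
```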

Fibonacci lattice rules

Lattice rule with N = 55 points and generating vector g = (1,34).

Fibonacci lattice rules

Lattice rule with N = 89 points and generating vector g = (1,55).

In comparison: Random point set

Random set of 64 points generated by Matlab using the Mersenne Twister.

Lattice rules

How to find a generating vector $\mathbf{g}$?

Reproducing kernel:

$$K(\mathbf{x}, \mathbf{y}) = \prod_{i=1}^{s} \left( 1 + 2\pi B_\alpha(\{x_i - y_i\}) \right),$$

where $\{x_i - y_i\} = (x_i - y_i) - \lfloor x_i - y_i \rfloor$ is the fractional part of $x_i - y_i$.

This gives a reproducing kernel Hilbert space of Fourier series:

$$f(\mathbf{x}) = \sum_{\mathbf{h} \in \mathbb{Z}^s} \hat{f}(\mathbf{h}) \, e^{2\pi i \mathbf{h} \cdot \mathbf{x}}.$$

Worst-case error:

$$e_\alpha^2(\mathbf{g}) = -1 + \frac{1}{N} \sum_{n=0}^{N-1} \prod_{i=1}^{s} \left( 1 + 2\pi B_\alpha\left( \left\{ \frac{n g_i}{N} \right\} \right) \right),$$

where $B_\alpha$ is the Bernoulli polynomial of order $\alpha$; for instance, $B_2(x) = x^2 - x + 1/6$.

Component-by-component construction (Korobov, Sloan-Reztsov, Sloan-Kuo-Joe, Nuyens-Cools)

• Set $g_1 = 1$.
• For $d = 2, \ldots, s$, assume that we have found $g_1, \ldots, g_{d-1}$. Then find $g_d \in \{1, \ldots, N-1\}$ which minimizes $e_\alpha^2(g_1, \ldots, g_{d-1}, g)$ as a function of $g$, i.e.

$$g_d = \operatorname*{argmin}_{g \in \{1, \ldots, N-1\}} e_\alpha^2(g_1, \ldots, g_{d-1}, g).$$

Using the fast Fourier transform (Nuyens, Cools), a good generating vector $\mathbf{g} \in \{1, \ldots, N-1\}^s$ can be found in $O(sN \log N)$ operations.
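A naive sketch of this greedy search, using the $\alpha = 2$ worst-case error formula from the previous slide (with the kernel weight $2\pi B_2$ as printed there); the fast Nuyens-Cools algorithm reorganizes the same minimization with FFTs:

```python
import math

def bernoulli2(x):
    """B_2(x) = x^2 - x + 1/6."""
    return x * x - x + 1.0 / 6.0

def e2_alpha2(N, g):
    """Squared worst-case error e_2^2(g) of the rank-1 lattice rule."""
    total = 0.0
    for n in range(N):
        prod = 1.0
        for gi in g:
            prod *= 1.0 + 2.0 * math.pi * bernoulli2((n * gi / N) % 1.0)
        total += prod
    return -1.0 + total / N

def cbc(N, s):
    """Component-by-component: fix g_1 = 1, then greedily pick each g_d.
    Naive cost O(s * N^2) kernel evaluations vs O(s * N * log N) for Nuyens-Cools."""
    g = [1]
    for _ in range(2, s + 1):
        g.append(min(range(1, N), key=lambda cand: e2_alpha2(N, g + [cand])))
    return tuple(g)

g = cbc(89, 2)
print(g, e2_alpha2(89, list(g)))
```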

Hammersley-Halton sequence

Radical inverse function in base $b$:

• Let $n \in \mathbb{N}_0$ have base-$b$ expansion

$$n = n_0 + n_1 b + \cdots + n_{a-1} b^{a-1}.$$

• Set

$$\varphi_b(n) = \frac{n_0}{b} + \frac{n_1}{b^2} + \cdots + \frac{n_{a-1}}{b^a} \in [0,1].$$

Halton sequence:

• Let $p_1 = 2, p_2 = 3, p_3 = 5, p_4 = 7, \ldots$ be the prime numbers in increasing order.
• Define quadrature points $\mathbf{x}_0, \mathbf{x}_1, \ldots$ by

$$\mathbf{x}_n = (\varphi_{p_1}(n), \varphi_{p_2}(n), \ldots, \varphi_{p_s}(n)), \qquad n = 0, 1, 2, \ldots.$$
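The radical inverse simply mirrors the base-$b$ digits of $n$ about the radix point, which is a short loop in code (function names here are my own):

```python
def radical_inverse(n, b):
    """phi_b(n): mirror the base-b digits of n about the radix point."""
    inv, denom = 0.0, 1.0
    while n > 0:
        denom *= b
        inv += (n % b) / denom
        n //= b
    return inv

def halton(n_points, primes=(2, 3, 5)):
    """First n_points terms of the Halton sequence in the given prime bases."""
    return [tuple(radical_inverse(n, b) for b in primes) for n in range(n_points)]

print(halton(4, primes=(2, 3)))
```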

Hammersley-Halton sequence

Hammersley-Halton point set with 64 points.

Digital nets

• Choose a prime number $b$ and the finite field $\mathbb{Z}_b = \{0, 1, \ldots, b-1\}$ of order $b$.
• Choose matrices $C_1, \ldots, C_s \in \mathbb{Z}_b^{m \times m}$.
• Let $n = n_0 + n_1 b + \cdots + n_{m-1} b^{m-1}$ and set $\vec{n} = (n_0, \ldots, n_{m-1})^\top \in \mathbb{Z}_b^m$.
• Let

$$\vec{y}_{n,i} = C_i \vec{n} \quad \text{for } 1 \le i \le s, \; 0 \le n < b^m.$$

• For $\vec{y}_{n,i} = (y_{n,i,1}, \ldots, y_{n,i,m})^\top \in \mathbb{Z}_b^m$ let

$$x_{n,i} = \frac{y_{n,i,1}}{b} + \cdots + \frac{y_{n,i,m}}{b^m}.$$

• Set $\mathbf{x}_n = (x_{n,1}, \ldots, x_{n,s})$ for $0 \le n < b^m$.
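A sketch of these steps over $\mathbb{Z}_2$; the two generator matrices below (identity and anti-diagonal identity) are an illustrative choice of mine, and with them the construction reproduces the two-dimensional Hammersley point set:

```python
def digital_net(matrices, b, m):
    """All b^m points of the digital net with the given generator matrices
    (each matrix is an m x m list of rows over Z_b)."""
    points = []
    for n in range(b ** m):
        digits = [(n // b ** k) % b for k in range(m)]  # vec(n) = (n_0, ..., n_{m-1})
        point = []
        for C in matrices:
            y = [sum(C[r][k] * digits[k] for k in range(m)) % b for r in range(m)]
            point.append(sum(y[r] / b ** (r + 1) for r in range(m)))
        points.append(tuple(point))
    return points

m = 4
identity = [[int(r == k) for k in range(m)] for r in range(m)]
flip = [[int(r + k == m - 1) for k in range(m)] for r in range(m)]
pts = digital_net([identity, flip], b=2, m=m)
print(pts[1], pts[2])
```

The identity matrix yields the van der Corput sequence $\varphi_2(n)$ in the first coordinate, while the flipped identity yields $n / 2^m$ in the second.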

$(t,m,s)$-net property

Let $m, s \ge 1$ and $b \ge 2$ be integers. A point set $P = \{\mathbf{x}_0, \ldots, \mathbf{x}_{b^m-1}\}$ is called a $(t,m,s)$-net in base $b$ if, for all integers $d_1, \ldots, d_s \ge 0$ with

$$d_1 + \cdots + d_s = m - t,$$

the number of points in each elementary interval

$$\prod_{i=1}^{s} \left[ \frac{a_i}{b^{d_i}}, \frac{a_i + 1}{b^{d_i}} \right), \qquad 0 \le a_i < b^{d_i},$$

is $b^t$.

If $P = \{\mathbf{x}_0, \ldots, \mathbf{x}_{b^m-1}\} \subset [0,1]^s$ is a $(t,m,s)$-net, then the local discrepancy function satisfies

$$\Delta_P\left( \frac{a_1}{b^{d_1}}, \ldots, \frac{a_s}{b^{d_s}} \right) = 0$$

for all $0 \le a_i < b^{d_i}$, $d_i \ge 0$, $d_1 + \cdots + d_s = m - t$.
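This counting condition can be verified by brute force for small parameters. The checker below and its test case, the two-dimensional Hammersley set $(\varphi_2(n), n/2^m)$, which is a known $(0, m, 2)$-net in base 2, are a sketch:

```python
from itertools import product

def digit_budgets(total, dims):
    """All tuples (d_1, ..., d_dims) of nonnegative integers summing to total."""
    if dims == 1:
        yield (total,)
        return
    for d in range(total + 1):
        for rest in digit_budgets(total - d, dims - 1):
            yield (d,) + rest

def is_tms_net(points, b, m, s, t):
    """Check that every elementary interval of volume b^(t-m) holds b^t points."""
    for d in digit_budgets(m - t, s):
        for a in product(*(range(b ** di) for di in d)):
            count = sum(
                all(a[i] <= p[i] * b ** d[i] < a[i] + 1 for i in range(s))
                for p in points
            )
            if count != b ** t:
                return False
    return True

def phi2(n):  # radical inverse in base 2
    inv, denom = 0.0, 1.0
    while n:
        denom *= 2
        inv += (n % 2) / denom
        n //= 2
    return inv

m = 4
hammersley = [(phi2(n), n / 2 ** m) for n in range(2 ** m)]
diagonal = [(n / 2 ** m, n / 2 ** m) for n in range(2 ** m)]
print(is_tms_net(hammersley, 2, m, 2, 0), is_tms_net(diagonal, 2, m, 2, 0))
```

The diagonal point set fails, e.g., for the interval $[0, 1/4) \times [0, 1/4)$, which contains four of its points instead of one.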

Sobol point set

The first 64 points of a Sobol sequence, which form a (0,6,2)-net in base 2.

Niederreiter-Xing point set

The first 64 points of a Niederreiter-Xing sequence.

Niederreiter-Xing point set

The first 1024 points of a Niederreiter-Xing sequence.

Kronecker sequence

Let $1, \alpha_1, \ldots, \alpha_s \in \mathbb{R}$ be linearly independent over $\mathbb{Q}$. Let

$$\mathbf{x}_n = (\{n\alpha_1\}, \ldots, \{n\alpha_s\}), \qquad n = 0, 1, 2, \ldots.$$

For instance, one can choose $\alpha_i = \sqrt{p_i}$, where $p_i$ is the $i$-th prime number.

Kronecker sequence

The first 64 points of a Kronecker sequence.

Discrepancy bounds

• It is known that for all the constructions above one has

$$L_p(P) \le C_s \frac{(\log N)^{c(s)}}{N} \quad \text{for all } N \ge 1, \; 1 \le p \le \infty,$$

where $C_s > 0$ and $c(s) \approx s$.

• Lower bound: for every point set $P$ consisting of $N$ points we have

$$L_p(P) \ge C_s' \frac{(\log N)^{(s-1)/2}}{N} \quad \text{for all } N \ge 1, \; 1 < p \le \infty.$$

In comparison, for a random point set:

$$\mathbb{E}(L_2(P)) = O\left( \sqrt{\frac{\log \log N}{N}} \right).$$

For $1 < p < \infty$, the exact order of convergence is known to be

$$\frac{(\log N)^{(s-1)/2}}{N}.$$

Explicit constructions of such point sets were given by Chen & Skriganov for $p = 2$ and by Skriganov for $1 < p < \infty$.

Great open problem: what is the correct order of convergence of

$$\min_{\substack{P \subset [0,1]^s \\ |P| = N}} L_\infty(P) \, ?$$

Randomized quasi-Monte Carlo

Introduce a random element into a deterministic point set.

This has several advantages:
• it yields an unbiased estimator;
• there is a statistical error estimate;
• the random-case error has a better rate of convergence than the worst-case error;
• it performs similarly to Monte Carlo for $L_2$ functions but has a better rate of convergence for smooth functions.

Randomly shifted lattice rules

Lattice rules can be randomized using a random shift:

• Lattice point set:

$$\left\{ \frac{n\mathbf{g}}{N} \right\}, \qquad n = 0, 1, \ldots, N-1.$$

• Shifted lattice rule: choose $\boldsymbol{\sigma} \in [0,1]^s$ uniformly distributed; the shifted lattice point set is then given by

$$\left\{ \frac{n\mathbf{g}}{N} + \boldsymbol{\sigma} \right\}, \qquad n = 0, 1, \ldots, N-1.$$

Lattice point set

Shifted lattice point set
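A sketch of the randomized estimator built from shifted lattice points; the test integrand $f(x,y) = 4xy$ (exact integral 1) and the number of shifts are assumptions for illustration:

```python
import random

def shifted_lattice(N, g, shift):
    """Shifted rank-1 lattice: x_n = frac(n * g / N + shift)."""
    return [tuple((n * gi / N + si) % 1.0 for gi, si in zip(g, shift))
            for n in range(N)]

def rqmc_estimate(f, N, g, n_shifts, rng):
    """Average the lattice-rule estimate over independent uniform shifts.
    Each shift gives an unbiased estimate; their spread gives an error estimate."""
    vals = []
    for _ in range(n_shifts):
        shift = tuple(rng.random() for _ in g)
        vals.append(sum(f(*p) for p in shifted_lattice(N, g, shift)) / N)
    return sum(vals) / n_shifts, vals

f = lambda x, y: 4.0 * x * y  # exact integral over [0,1]^2 is 1
mean, vals = rqmc_estimate(f, N=55, g=(1, 34), n_shifts=10, rng=random.Random(7))
print(mean)
```

The sample standard deviation of `vals` divided by the square root of the number of shifts gives the statistical error estimate mentioned above.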

Owen’s scrambling

• Let $\mathbb{Z}_b = \{0, 1, \ldots, b-1\}$ and let

$$x = \frac{x_1}{b} + \frac{x_2}{b^2} + \cdots, \qquad x_1, x_2, \ldots \in \mathbb{Z}_b.$$

• Randomly choose permutations $\sigma, \sigma_{x_1}, \sigma_{x_1,x_2}, \ldots : \mathbb{Z}_b \to \mathbb{Z}_b$.
• Let

$$y_1 = \sigma(x_1), \quad y_2 = \sigma_{x_1}(x_2), \quad y_3 = \sigma_{x_1,x_2}(x_3), \quad \ldots$$

• Set

$$y = \frac{y_1}{b} + \frac{y_2}{b^2} + \cdots.$$
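A sketch of this nested scrambling for a single coordinate: permutations are drawn lazily and keyed by the digit prefix, and the `depth` truncation is an assumption of this sketch (exact Owen scrambling permutes all digits):

```python
import random

def owen_scramble(x, b=2, depth=30, rng=None, perms=None):
    """Nested scrambling: the permutation applied to digit k is chosen by the
    digits x_1, ..., x_{k-1} preceding it. The perms dict caches permutations so
    the same randomization can be applied consistently to a whole point set."""
    rng = rng if rng is not None else random.Random()
    perms = perms if perms is not None else {}
    out_digits = []
    prefix = ()
    for _ in range(depth):
        x *= b
        d = int(x)          # next base-b digit of x
        x -= d
        if prefix not in perms:
            p = list(range(b))
            rng.shuffle(p)
            perms[prefix] = p
        out_digits.append(perms[prefix][d])
        prefix += (d,)
    return sum(dk / b ** (k + 1) for k, dk in enumerate(out_digits))

perms, rng = {}, random.Random(3)
y1 = owen_scramble(0.3, perms=perms, rng=rng)
y2 = owen_scramble(0.3, perms=perms, rng=rng)  # same randomization, same output
print(y1, y1 == y2)
```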

Scrambled Sobol point set

The first 1024 points of a Sobol sequence (left) and a scrambled Sobol sequence (right).

Scrambled net variance

Let $\mathbf{y}_0, \ldots, \mathbf{y}_{b^m-1}$ be the scrambled net points, $N = b^m$, and

$$e(R) = \int_{[0,1]^s} f(\mathbf{x}) \,\mathrm{d}\mathbf{x} - \frac{1}{N} \sum_{n=0}^{b^m-1} f(\mathbf{y}_n).$$

Then

$$\mathbb{E}(e(R)) = 0$$

and

$$\operatorname{Var}(e(R)) = O\left( \frac{(\log N)^s}{N^{2\alpha + 1}} \right)$$

for integrands with smoothness $0 \le \alpha \le 1$.

Dependence on the dimension

Although the convergence rate is better for quasi-Monte Carlo, there is a stronger dependence on the dimension:

• Monte Carlo: error $= O(N^{-1/2})$;

• Quasi-Monte Carlo: error $= O\left( \dfrac{(\log N)^{s-1}}{N} \right)$.

Notice that $g(N) := N^{-1} (\log N)^{s-1}$ is an increasing function for $N \le e^{s-1}$.

So if $s$ is large, say $s \approx 30$, then $N^{-1} (\log N)^{29}$ increases for $N \le 10^{12}$.

Intractability

For the integration error in $\mathcal{H}$ with reproducing kernel

$$K(\mathbf{x}, \mathbf{y}) = \prod_{i=1}^{s} \min\{1 - x_i, 1 - y_i\}$$

it is known that

$$e^2(\mathcal{H}, P) \ge \left( 1 - N \left( \frac{8}{9} \right)^s \right) e^2(\mathcal{H}, \emptyset).$$

⇒ The error can only decrease once

$$N > \text{constant} \cdot \left( \frac{9}{8} \right)^s.$$

Weighted function spaces (Sloan and Wozniakowski)

Study weighted function spaces: introduce weights $\gamma_1, \gamma_2, \ldots > 0$ and define

$$K(\mathbf{x}, \mathbf{y}) = \prod_{i=1}^{s} \left( 1 + \gamma_i \min\{1 - x_i, 1 - y_i\} \right).$$

Then if

$$\sum_{i=1}^{\infty} \gamma_i < \infty$$

we have, for suitably chosen point sets $P$,

$$e(\mathcal{H}_{s,\gamma}, P) \le C N^{-\delta},$$

where $C > 0$ is independent of the dimension $s$ and $\delta < 1$.

Tractability

• Minimal error over all methods using $N$ points:

$$e_N^*(\mathcal{H}) = \inf_{P : |P| = N} e(\mathcal{H}, P);$$

• Inverse of the error:

$$N^*(s, \varepsilon) = \min\{ N \in \mathbb{N} : e_N^*(\mathcal{H}) \le \varepsilon \, e_0^*(\mathcal{H}) \}$$

for $\varepsilon > 0$;
• Strong tractability: there exist constants $C, \beta > 0$, independent of $s$, such that

$$N^*(s, \varepsilon) \le C \varepsilon^{-\beta}.$$

Sloan and Wozniakowski: for the function space considered above, we get strong tractability if

$$\sum_{i=1}^{\infty} \gamma_i < \infty.$$

Tractability of the star-discrepancy

Recall the local discrepancy function

$$\Delta_P(\mathbf{t}) = \frac{1}{N} \sum_{n=0}^{N-1} 1_{[\mathbf{0}, \mathbf{t})}(\mathbf{x}_n) - t_1 \cdots t_s, \qquad \text{where } P = \{\mathbf{x}_0, \ldots, \mathbf{x}_{N-1}\}.$$

The star-discrepancy is given by

$$D_N^*(P) = \sup_{\mathbf{t} \in [0,1]^s} |\Delta_P(\mathbf{t})|.$$
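In one dimension this supremum can be evaluated exactly; the sketch below uses Niederreiter's closed formula for the 1-D star-discrepancy of a sorted point set, with two illustrative point sets of my choosing:

```python
def star_discrepancy_1d(points):
    """Exact 1-D star-discrepancy via Niederreiter's formula:
    D*_N = 1/(2N) + max_i |x_(i) - (2i - 1)/(2N)| over the sorted points."""
    xs = sorted(points)
    n = len(xs)
    return 1.0 / (2 * n) + max(
        abs(x - (2 * i + 1) / (2.0 * n)) for i, x in enumerate(xs)
    )

mid = [(2 * k + 1) / 10 for k in range(5)]  # midpoints: optimal, D* = 1/(2N)
clustered = [0.0] * 5                        # degenerate set: D* = 1
print(star_discrepancy_1d(mid), star_discrepancy_1d(clustered))
```

In higher dimensions no such closed formula is known, and computing $D_N^*(P)$ exactly is itself a hard problem.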

Result by Heinrich, Novak, Wasilkowski and Wozniakowski:

$$D^*(s, N) = \inf_{P} D_N^*(P) \le C \sqrt{\frac{s}{N}} \quad \text{for all } s, N \in \mathbb{N},$$

where $D^*(s, N)$ denotes the minimal star-discrepancy. This implies that

$$N^*(s, \varepsilon) \le C s \varepsilon^{-2}.$$

Extensions, open problems and future research

• Quasi-Monte Carlo methods which achieve convergence rates of order

$$N^{-\alpha} (\log N)^{\alpha s}, \qquad \alpha > 1,$$

for sufficiently smooth integrands;
• construction of point sets whose star-discrepancy is tractable;
• completely uniformly distributed point sets to speed up convergence in Markov chain Monte Carlo algorithms;
• infinite-dimensional integration: integrands with infinitely many variables;
• connections to codes, orthogonal arrays and experimental designs;
• quasi-Monte Carlo for function approximation;
• choosing the weights in applications;
• quasi-Monte Carlo on the sphere.

Book

Thank you! Special thanks to

• Dirk Nuyens for creating many of the pictures in this talk;
• Friedrich Pillichshammer for letting me use a picture of his daughters;
• Fred Hickernell, Peter Kritzer, Art Owen, and Friedrich Pillichshammer for helpful comments on the slides;
• all colleagues who worked on quasi-Monte Carlo methods over the years.