
Page 1

Monte Carlo sampling techniques for solving stochastic and large scale deterministic optimization problems.

Alexander Shapiro

Georgia Institute of Technology
School of Industrial and Systems Engineering

Computing with Uncertainty: Mathematical Modeling, Numerical Approximation and Large Scale Optimization of Complex Systems with Uncertainty

Page 2

Stochastic programming problem:

\[
\min_{x \in X} \bigl\{ f(x) := \mathbb{E}[F(x,\xi)] \bigr\}, \tag{1}
\]

where ξ ∈ Ξ ⊂ R^d is a random vector. The feasible set X can be finite (combinatorial stochastic programs), or X ⊂ R^n.

Even a crude discretization of the distribution of the random vector ξ typically results in an exponential growth of the number of scenarios with increase of the number d of random variables. For example, if the components of the random vector ξ are independently distributed and the distribution of each component is discretized by r points, then the total number of scenarios is r^d. That is, although the input data grows linearly with increase of the dimension d, the number of scenarios grows exponentially.

Page 3

Monte Carlo sampling approach
Generate a random sample ξ^1, ..., ξ^N of N realizations of the random vector ξ by using Monte Carlo sampling techniques. Then the expected value function f(x) := E[F(x,ξ)] can be approximated by the sample average function

\[
f_N(x) := \frac{1}{N} \sum_{j=1}^N F(x, \xi^j).
\]

Consequently the true (expected value) problem is approximated by the so-called Sample Average Approximation (SAA) problem:

\[
\min_{x \in X} f_N(x). \tag{2}
\]
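To make the recipe concrete, here is a minimal Python sketch (not part of the original slides): the newsvendor cost F(x,ξ) = cx − p min(x,ξ), the exponential demand distribution, and all constants are illustrative assumptions. We draw one sample and minimize the sample average over a grid.

```python
# Hedged SAA sketch for a toy newsvendor problem: F(x, xi) = c*x - p*min(x, xi).
# Distribution, constants, and grid are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(0)
c, p = 1.0, 3.0                                  # unit cost and selling price
N = 10_000                                       # sample size
xi = rng.exponential(scale=100.0, size=N)        # sample xi^1, ..., xi^N

def f_N(x):
    """Sample average approximation of f(x) = E[F(x, xi)]."""
    return np.mean(c * x - p * np.minimum(x, xi))

grid = np.linspace(0.0, 500.0, 1001)
values = np.array([f_N(x) for x in grid])
x_N = grid[values.argmin()]
# For exponential demand the true minimizer is 100*ln(p/c) ~ 109.9, so the
# SAA solution should land nearby for large N.
print(f"SAA solution x_N = {x_N:.1f}, SAA optimal value v_N = {values.min():.2f}")
```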

Page 4

Large scale deterministic problems

Consider a (deterministic) problem of the form

\[
\min_{x \in X} \sum_{i \in I} F_i(x), \tag{3}
\]

where F_i : R^n → R, i ∈ I, with I being a finite index set of very large cardinality |I|. By assigning equal probability 1/|I| to each element of the set I, we can view problem (3), normalized by the factor 1/|I|, as a stochastic programming problem. By generating a random sample I_N, of cardinality N << |I|, from I, we can construct the corresponding SAA problem

\[
\min_{x \in X} \Bigl\{ f_N(x) := \frac{1}{N} \sum_{i \in I_N} F_i(x) \Bigr\}. \tag{4}
\]

Page 5

Example of Support Vector Machines

\[
\min_{w,b} \sum_{i \in I} \bigl[\, 1 - y_i\bigl( w^T \Phi(x_i) + b \bigr) \bigr]_+ + \mu \|w\|^2. \tag{5}
\]

Here (x_i, y_i)_{i∈I} is the data set, with x_i ∈ R^n and y_i ∈ {−1, 1} the class labels, Φ(·) is a feature mapping and μ > 0 is a constant. Equivalently,

\[
\begin{array}{ll}
\displaystyle \min_{w,b,\eta \ge 0} & \|w\|^2 + \dfrac{1}{\mu} \sum_{i \in I} \eta_i \\[6pt]
\text{s.t.} & y_i\bigl( w^T \Phi(x_i) + b \bigr) \ge 1 - \eta_i, \; i \in I.
\end{array} \tag{6}
\]
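To connect (5) with the subsampling idea of problem (4), here is a small sketch (synthetic data, a linear feature map Φ(x) = x, and all constants are assumptions): the 1/|I|-normalized hinge objective is compared with its estimate on a random subsample I_N.

```python
# Sketch: evaluate the (1/|I|-normalized) SVM objective (5) on a random
# subsample I_N instead of the full index set I. Data are synthetic.
import numpy as np

rng = np.random.default_rng(1)
I_card, n, mu = 100_000, 5, 0.1
X = rng.normal(size=(I_card, n))
y = np.sign(X @ rng.normal(size=n) + 0.1 * rng.normal(size=I_card))

def objective(w, b, idx):
    margins = 1.0 - y[idx] * (X[idx] @ w + b)
    return np.mean(np.maximum(margins, 0.0)) + mu * (w @ w)

w, b = rng.normal(size=n), 0.0
full = objective(w, b, np.arange(I_card))
sub = objective(w, b, rng.choice(I_card, size=1_000, replace=False))
print(f"full objective {full:.4f} vs subsample estimate {sub:.4f}")
```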

Page 6

A (naive) justification of the SAA method is that, for a given x ∈ X, by the Law of Large Numbers (LLN), f_N(x) converges to f(x) w.p.1 as N tends to infinity. It is possible to show that, under mild regularity conditions, this convergence is uniform on any compact subset of X (uniform LLN). It follows that the optimal value v_N and an optimal solution x_N of the SAA problem (2) converge w.p.1 to their counterparts of the true problem.
Central Limit Theorem type results. Notoriously slow convergence of order O_p(N^{−1/2}). By the CLT, for a given x ∈ X,

\[
N^{1/2}\bigl[ f_N(x) - f(x) \bigr] \xrightarrow{D} N\bigl( 0, \sigma^2(x) \bigr),
\]
where σ²(x) := Var[F(x,ξ)] and →_D denotes convergence in distribution.
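The O_p(N^{−1/2}) rate is easy to see numerically; a quick sketch (the quadratic integrand and the Gaussian ξ below are assumptions chosen only so that σ²(x) is finite):

```python
# Numeric check that the spread of f_N(x) around f(x) shrinks like N^{-1/2}.
import numpy as np

rng = np.random.default_rng(2)
F = lambda x, xi: (x - xi) ** 2          # toy integrand with finite variance
x = 0.5
for N in (100, 10_000, 1_000_000):
    reps = [np.mean(F(x, rng.normal(size=N))) for _ in range(100)]
    print(f"N = {N:>9}: std of f_N(x) = {np.std(reps):.5f}  (~ sigma(x)/sqrt(N))")
```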

Page 7

The idea of common random number generation.

Suppose that we want to compare the values of the objective function at two points x_1, x_2 ∈ X. In that case we are interested in the difference f(x_1) − f(x_2) rather than in the individual values f(x_1) and f(x_2). If we use sample average estimates f_N(x_1) and f_N(x_2) based on independent samples, both of size N, then f_N(x_1) and f_N(x_2) are uncorrelated and

\[
\mathrm{Var}\bigl[ f_N(x_1) - f_N(x_2) \bigr] = \mathrm{Var}\bigl[ f_N(x_1) \bigr] + \mathrm{Var}\bigl[ f_N(x_2) \bigr]. \tag{7}
\]

On the other hand, if we use the same sample for the estimators f_N(x_1) and f_N(x_2), then

\[
\mathrm{Var}\bigl[ f_N(x_1) - f_N(x_2) \bigr] = \mathrm{Var}\bigl[ f_N(x_1) \bigr] + \mathrm{Var}\bigl[ f_N(x_2) \bigr] - 2\,\mathrm{Cov}\bigl( f_N(x_1), f_N(x_2) \bigr). \tag{8}
\]
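A small simulation (the toy F and Gaussian ξ are assumptions) makes the point: for nearby x_1, x_2 the covariance term in (8) cancels most of the variance.

```python
# Common random numbers: estimate f(x1) - f(x2) with independent samples
# versus one shared sample, and compare the variances in (7) and (8).
import numpy as np

rng = np.random.default_rng(3)
F = lambda x, xi: (x - xi) ** 2
x1, x2, N, reps = 1.0, 1.1, 1_000, 2_000

ind, crn = [], []
for _ in range(reps):
    xi_a = rng.normal(size=N)
    xi_b = rng.normal(size=N)
    ind.append(np.mean(F(x1, xi_a)) - np.mean(F(x2, xi_b)))  # independent samples
    crn.append(np.mean(F(x1, xi_a)) - np.mean(F(x2, xi_a)))  # common sample
print(f"Var (independent) = {np.var(ind):.2e},  Var (common) = {np.var(crn):.2e}")
```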

Page 8

Let X be a compact subset of R^n and consider the space C(X) of continuous functions φ : X → R. Assume that:
(A1) For some point x̄ ∈ X the expectation E[F(x̄,ξ)²] is finite.
(A2) There exists a measurable function C : Ξ → R_+ such that E[C(ξ)²] is finite and
\[
\bigl| F(x,\xi) - F(x',\xi) \bigr| \le C(\xi)\, \|x - x'\|, \tag{9}
\]
for all x, x′ ∈ X and a.e. ξ ∈ Ξ.
We can view Y_N := f_N as a random element of C(X). Denote by v_0 the optimal value and by S_0 the set of optimal solutions of the true problem.

Page 9

Theorem

Suppose that the set X is compact and assumptions (A1) and (A2) hold. Then
\[
v_N = \min_{x \in S_0} f_N(x) + o_p(N^{-1/2}),
\qquad
N^{1/2}\bigl[ v_N - v_0 \bigr] \xrightarrow{D} \inf_{x \in S_0} Y(x),
\]
where Y is the Gaussian random element of C(X) arising as the weak limit of N^{1/2}(f_N − f). In particular, if the optimal set (of the true problem) S_0 = {x_0} is a singleton, then
\[
N^{1/2}\bigl[ v_N - v_0 \bigr] \xrightarrow{D} N\bigl( 0, \sigma^2(x_0) \bigr).
\]

This result suggests that the optimal value of the SAA problem converges at a rate of O_p(N^{−1/2}). In particular, if S_0 = {x_0}, then v_N converges to v_0 at the same rate as f_N(x_0) converges to f(x_0).

Page 10

Validation analysis
How can one evaluate the quality of a given (feasible) solution x̄ ∈ X?

The SAA approach: a statistical test based on estimation of f(x̄) − v_0, where v_0 is the optimal value of the true problem (Mak, Morton and Wood, 1999).
(i) Estimate f(x̄) by the sample average f_{N′}(x̄), using a sample of large size N′.
(ii) Solve the SAA problem M times using M independent samples, each of size N. Let v_N^{(1)}, ..., v_N^{(M)} be the optimal values of the corresponding SAA problems. Estimate E[v_N] by the average M^{−1} Σ_{j=1}^M v_N^{(j)}.
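A hedged sketch of this procedure on a toy newsvendor problem (the model, the grid solver, and all constants are illustrative assumptions, not from the slides):

```python
# Mak-Morton-Wood style gap estimate for a candidate solution x_bar:
# upper bound from a large-sample estimate of f(x_bar), lower bound from
# the average of M independent SAA optimal values.
import numpy as np

rng = np.random.default_rng(4)
c, p = 1.0, 3.0
grid = np.linspace(0.0, 500.0, 501)

def saa_opt_value(xi):
    """Optimal value v_N of the SAA problem for one demand sample xi."""
    return min(np.mean(c * x - p * np.minimum(x, xi)) for x in grid)

x_bar = 100.0                                         # candidate to validate
xi_big = rng.exponential(100.0, size=200_000)         # step (i), size N'
f_upper = np.mean(c * x_bar - p * np.minimum(x_bar, xi_big))
M, N = 20, 2_000                                      # step (ii)
v_lower = np.mean([saa_opt_value(rng.exponential(100.0, size=N))
                   for _ in range(M)])
print(f"estimated gap f(x_bar) - v_0 <~ {f_upper - v_lower:.3f}")
```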

Page 11

Note that
\[
\mathbb{E}\Bigl[ f_{N'}(\bar x) - M^{-1} \sum_{j=1}^M v_N^{(j)} \Bigr]
= \bigl( f(\bar x) - v_0 \bigr) + \bigl( v_0 - \mathbb{E}[v_N] \bigr),
\]
and that v_0 − E[v_N] ≥ 0. The bias v_0 − E[v_N] is positive and (under mild regularity conditions)
\[
\lim_{N \to \infty} N^{1/2}\bigl( v_0 - \mathbb{E}[v_N] \bigr) = \mathbb{E}\Bigl[ \max_{x \in S_0} Y(x) \Bigr],
\]
where S_0 is the set of optimal solutions of the true problem and (Y(x_1), ..., Y(x_k)) has a multivariate normal distribution with zero mean vector and covariance matrix given by the covariance matrix of the random vector (F(x_1,ξ), ..., F(x_k,ξ)). For ill-conditioned problems this bias is of order O(N^{−1/2}) and can be large if the ε-optimal solution set S^ε is large for some small ε ≥ 0.

Page 12

Sample size estimates (by Large Deviations type bounds)
Consider an iid sequence Y_1, ..., Y_N of replications of a real valued random variable Y, and let Z_N := N^{−1} Σ_{i=1}^N Y_i be the corresponding sample average. Then for any real numbers a and t > 0 we have that Prob(Z_N ≥ a) = Prob(e^{tZ_N} ≥ e^{ta}), and hence, by Markov's inequality,
\[
\mathrm{Prob}(Z_N \ge a) \le e^{-ta}\, \mathbb{E}\bigl[ e^{t Z_N} \bigr] = e^{-ta} \bigl[ M(t/N) \bigr]^N,
\]
where M(t) := E[e^{tY}] is the moment generating function of Y. Suppose that Y has finite mean μ := E[Y] and let a ≥ μ. By taking the logarithm of both sides of the above inequality, changing variables t′ = t/N and minimizing over t′ > 0, we obtain
\[
\frac{1}{N} \log \bigl[ \mathrm{Prob}(Z_N \ge a) \bigr] \le -I(a), \tag{10}
\]
where I(z) := sup_{t∈R} {tz − Λ(t)} is the conjugate of the logarithmic moment generating function Λ(t) := log M(t).
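As a quick numerical sanity check of (10), here is a sketch under the assumption that Y is Gaussian, for which the rate function is I(a) = (a − μ)²/(2σ²):

```python
# Empirical tail probability of the sample average versus the Chernoff
# bound exp(-N * I(a)) for Gaussian Y (rate function (a-mu)^2/(2 sigma^2)).
import numpy as np

rng = np.random.default_rng(5)
mu, sigma, a, N, reps = 0.0, 1.0, 0.5, 50, 200_000
Z_N = rng.normal(mu, sigma, size=(reps, N)).mean(axis=1)
I_a = (a - mu) ** 2 / (2 * sigma ** 2)
print(f"empirical P(Z_N >= a) = {np.mean(Z_N >= a):.1e}, "
      f"bound exp(-N*I(a)) = {np.exp(-N * I_a):.1e}")
```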

Page 13

Suppose that |X| < ∞, i.e., the set X is finite, and: (i) for every x ∈ X the expected value f(x) = E[F(x,ξ)] is finite, (ii) there are constants σ > 0 and a ∈ (0,+∞] such that
\[
M_x(t) \le \exp\{ \sigma^2 t^2 / 2 \}, \quad \forall t \in [-a, a], \; \forall x \in X,
\]
where M_x(t) is the moment generating function of the random variable F(u(x),ξ) − F(x,ξ) − E[F(u(x),ξ) − F(x,ξ)] and u(x) is a point of the optimal set S_0. Choose ε > 0, δ ≥ 0 and α ∈ (0,1) such that 0 < ε − δ ≤ aσ². Then for the sample size
\[
N \ge \frac{2\sigma^2}{(\varepsilon - \delta)^2} \log\Bigl( \frac{|X|}{\alpha} \Bigr)
\]
we are guaranteed, with probability at least 1 − α, that any δ-optimal solution of the SAA problem is an ε-optimal solution of the true problem, i.e., Prob(S_N^δ ⊂ S^ε) ≥ 1 − α.
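Plugging illustrative numbers into this bound (all values assumed) shows how weakly the required N depends on the cardinality of X:

```python
# The sample size grows only logarithmically in |X| and 1/alpha.
import math

sigma, eps, delta, alpha = 10.0, 1.0, 0.0, 0.01
for card_X in (10**3, 10**6, 10**12):
    N = 2 * sigma**2 / (eps - delta)**2 * math.log(card_X / alpha)
    print(f"|X| = 10^{round(math.log10(card_X))}: N >= {math.ceil(N)}")
```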

Page 14

Let X = {x_1, x_2} with f(x_2) − f(x_1) > ε > 0, and suppose that the random variable F(x_2,ξ) − F(x_1,ξ) has normal distribution with mean μ = f(x_2) − f(x_1) and variance σ². By solving the corresponding SAA problem we make the correct decision (that x_1 is the minimizer) if f_N(x_2) − f_N(x_1) > 0. The probability of this event is Φ(μ√N/σ). Therefore we need a sample size N > z_α² σ²/ε² in order for our decision to be correct with probability at least 1 − α.
In order to solve the corresponding optimization problem we need to test H_0 : μ ≤ 0 versus H_a : μ > 0. Assuming that σ² is known, by the Neyman-Pearson Lemma the uniformly most powerful test is: "reject H_0 if f_N(x_2) − f_N(x_1) is bigger than a specified critical value".

Page 15

Now let X ⊂ R^n be a set of finite diameter D := sup_{x′,x∈X} ‖x′ − x‖. Suppose that: (i) for every x ∈ X the expected value f(x) = E[F(x,ξ)] is finite, (ii) there is a constant σ > 0 such that
\[
M_{x',x}(t) \le \exp\{ \sigma^2 t^2 / 2 \}, \quad \forall t \in \mathbb{R}, \; \forall x', x \in X,
\]
where M_{x′,x}(t) is the moment generating function of the random variable F(x′,ξ) − F(x,ξ) − E[F(x′,ξ) − F(x,ξ)], (iii) there exists κ : Ξ → R_+ such that its moment generating function is finite valued in a neighborhood of zero and
\[
\bigl| F(x',\xi) - F(x,\xi) \bigr| \le \kappa(\xi)\, \|x' - x\|, \quad \forall \xi \in \Xi, \; \forall x', x \in X.
\]
Choose ε > 0, δ ∈ [0, ε) and α ∈ (0,1). Then for the sample size¹
\[
N \ge \frac{O(1)\sigma^2}{(\varepsilon - \delta)^2} \Bigl[ n \log\Bigl( \frac{O(1) D L}{\varepsilon - \delta} \Bigr) + \log\Bigl( \frac{2}{\alpha} \Bigr) \Bigr],
\]
we are guaranteed that Prob(S_N^δ ⊂ S^ε) ≥ 1 − α.

¹ O(1) denotes a generic constant independent of the data.

Page 16

In particular, if κ(ξ) ≡ L, then the estimate takes the form
\[
N \ge O(1) \Bigl( \frac{L D}{\varepsilon - \delta} \Bigr)^2 \Bigl[ n \log\Bigl( \frac{O(1) D L}{\varepsilon - \delta} \Bigr) + \log\Bigl( \frac{1}{\alpha} \Bigr) \Bigr].
\]
Suppose further that for some c > 0, γ ≥ 1 and ε̄ > ε the following growth condition holds:
\[
f(x) \ge v_0 + c\,[\mathrm{dist}(x, S_0)]^\gamma, \quad \forall x \in S^{\bar\varepsilon},
\]
and that the problem is convex. Then, for δ ∈ [0, ε/2], we have the following estimate of the required sample size:
\[
N \ge \Bigl( \frac{O(1) L \bar D}{c^{1/\gamma}\, \varepsilon^{(\gamma-1)/\gamma}} \Bigr)^2 \Bigl[ n \log\Bigl( \frac{O(1) \bar D L}{\varepsilon} \Bigr) + \log\Bigl( \frac{1}{\alpha} \Bigr) \Bigr],
\]
where D̄ is the diameter of S^{ε̄}. In particular, if S_0 = {x_0} is a singleton and γ = 1, we have the estimate (independent of ε):
\[
N \ge O(1)\, c^{-2} L^2 \bigl[ n \log( O(1) c^{-1} L ) + \log( \alpha^{-1} ) \bigr].
\]

Page 17

Stochastic Approximation (SA) approach

An alternative approach goes back to Robbins and Monro (1951) and is known as the Stochastic Approximation (SA) method. Assume that the true problem is convex, i.e., the set X ⊂ R^n is convex (and closed and bounded) and the function F(·,ξ) : X → R is convex for all ξ ∈ Ξ.
Also assume existence of the following stochastic oracle: given x ∈ X and a random realization ξ ∈ Ξ, the oracle returns the quantity F(x,ξ) and a stochastic subgradient, that is, a vector G(x,ξ) such that g(x) := E[G(x,ξ)] is well defined and is a subgradient of f(·) at x, i.e., g(x) ∈ ∂f(x). For example, one can take G(x,ξ) ∈ ∂_x F(x,ξ).

Page 18

Classical SA algorithm
\[
x_{j+1} = \Pi_X\bigl( x_j - \gamma_j G(x_j, \xi^j) \bigr),
\]
where γ_j = θ/j, θ > 0, and Π_X(x) = arg min_{z∈X} ‖x − z‖_2 is the orthogonal (Euclidean) projection onto X. Theoretical bound (assuming f(·) is strongly convex and differentiable):
\[
\mathbb{E}\bigl[ f(x_N) - v_0 \bigr] = O(N^{-1}),
\]
for an optimal choice of the constant θ (here v_0 is the optimal value of the true problem).
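A minimal sketch of the iteration (the quadratic objective, box X, and noise model are assumptions for illustration):

```python
# Classical SA: projected stochastic gradient steps x_{j+1} = Pi_X(x_j - (theta/j) G).
import numpy as np

rng = np.random.default_rng(6)
proj = lambda x: np.clip(x, -1.0, 1.0)         # Pi_X for X = [-1, 1]^n
n, theta, N = 5, 1.0, 10_000
x = np.ones(n)
for j in range(1, N + 1):
    G = x + rng.normal(scale=0.1, size=n)      # noisy gradient of f(x) = ||x||^2/2
    x = proj(x - (theta / j) * G)
print(f"f(x_N) = {0.5 * (x @ x):.2e}   (v_0 = 0; roughly O(1/N) here)")
```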

Page 19

Randomization approach to large scale deterministic problems

Suppose that we need to solve a deterministic problem involving calculations of products Ax, where A is an m × n matrix and x ∈ R^n. Let p_i = α^{−1} sign(x_i) x_i, where α = Σ_{i=1}^n |x_i|. We can view p as a probability vector. Let us choose at random a number i ∈ {1, ..., n} with probability p_i. Then E[α sign(x_i) e_i] = x, where e_i is the i-th coordinate vector, and hence α sign(x_i) A e_i is an unbiased estimate of Ax. Together with an SA type algorithm, this can be applied, for example, to solving the large scale bilinear matrix game problem:
\[
\min_{x \in \Delta_n} \max_{y \in \Delta_m} y^T A x, \tag{11}
\]
where Δ_n = { x ∈ R^n : Σ_{i=1}^n x_i = 1, x ≥ 0 }.
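A sketch of the estimator (random data; averaging K single-column estimates is an illustrative choice, not part of the slides):

```python
# Unbiased randomized estimate of Ax: draw i with prob p_i = |x_i|/||x||_1
# and return alpha * sign(x_i) * A e_i, i.e. one scaled column of A.
import numpy as np

rng = np.random.default_rng(7)
m, n, K = 200, 1_000, 20_000
A = rng.normal(size=(m, n))
x = rng.normal(size=n)
alpha = np.abs(x).sum()
p = np.abs(x) / alpha

idx = rng.choice(n, size=K, p=p)                 # sampled column indices
est = alpha * (A[:, idx] * np.sign(x[idx])).mean(axis=1)
err = np.linalg.norm(est - A @ x) / np.linalg.norm(A @ x)
print(f"relative error with K = {K} columns: {err:.3f} (decays like 1/sqrt(K))")
```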

Page 20

The classical SA algorithm is very sensitive to the choice of θ and does not work well in practice. As a simple example, consider f(x) = ½cx², with c = 0.2 and X = [−1,1] ⊂ R, and assume that there is no noise, i.e., G(x,ξ) ≡ ∇f(x). Suppose, further, that we take θ = 1 (i.e., γ_j = 1/j), which would be the optimal choice for c = 1. Then the iteration process becomes
\[
x_{j+1} = x_j - f'(x_j)/j = \Bigl( 1 - \frac{1}{5j} \Bigr) x_j,
\]
and hence, starting with x_1 = 1,
\[
x_j = \prod_{s=1}^{j-1} \Bigl( 1 - \frac{1}{5s} \Bigr)
= \exp\Bigl\{ -\sum_{s=1}^{j-1} \ln\Bigl( 1 + \frac{1}{5s-1} \Bigr) \Bigr\}
> \exp\Bigl\{ -\sum_{s=1}^{j-1} \frac{1}{5s-1} \Bigr\}
> \exp\Bigl\{ -\Bigl( 0.25 + \int_1^{j-1} \frac{dt}{5t-1} \Bigr) \Bigr\}
> \exp\Bigl\{ -0.25 + 0.2 \ln 1.25 - \tfrac{1}{5} \ln j \Bigr\}
> 0.8\, j^{-1/5}.
\]
That is, the convergence is extremely slow. For example, for j = 10^9 the error of the iterated solution is greater than 0.015. On the other hand, for the optimal stepsize factor θ = 1/c = 5, the optimal solution x̄ = 0 is found in one iteration.
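The example is easy to reproduce (a quick sketch; the iteration count is arbitrary):

```python
# Deterministic iteration x_{j+1} = (1 - theta*c/j) x_j for f(x) = 0.1 x^2:
# theta = 1 crawls at rate ~ j^{-1/5}, theta = 1/c = 5 finishes in one step.
c = 0.2
x_slow, x_fast = 1.0, 1.0
for j in range(1, 1_000_000):
    x_slow -= (1.0 / j) * c * x_slow
    x_fast -= (5.0 / j) * c * x_fast
print(f"after 1e6 iterations: theta=1 -> {x_slow:.4f}, theta=5 -> {x_fast:.1e}")
```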

Page 21

Robust SA approach (with averaging) (B. Polyak, 1990): consider
\[
\bar x_j = \sum_{t=1}^j \nu_t x_t, \quad \text{where } \nu_t = \frac{\gamma_t}{\sum_{\tau=1}^j \gamma_\tau}.
\]
Let D_X = max_{x∈X} ‖x − x_1‖_2, and assume that
\[
\mathbb{E}\bigl[ \|G(x,\xi)\|_2^2 \bigr] \le M^2, \quad \forall x \in X,
\]
for some constant M > 0. Then
\[
\mathbb{E}\bigl[ f(\bar x_j) - v_0 \bigr] \le \frac{D_X^2 + M^2 \sum_{t=1}^j \gamma_t^2}{2 \sum_{t=1}^j \gamma_t}.
\]
For γ_t = θD_X/(M√t), after N iterations we have
\[
\mathbb{E}\bigl[ f(\bar x_N) - v_0 \bigr] \le \frac{\max\{\theta, \theta^{-1}\}\, M \bigl( D_X^2 + \log N \bigr)}{2 D_X \sqrt{N}}.
\]

Page 22

Constant step size variant: fix in advance the sample size (number of iterations) N and the step size γ_j ≡ γ, j = 1, ..., N, with x̄_N = (1/N) Σ_{j=1}^N x_j. Theoretical bound:
\[
\mathbb{E}\bigl[ f(\bar x_N) - v_0 \bigr] \le \frac{D_X^2}{2\gamma N} + \frac{\gamma M^2}{2}.
\]
For the optimal (up to a factor θ) γ = θD_X/(M√N) we have
\[
\mathbb{E}\bigl[ f(\bar x_N) - v_0 \bigr] \le \frac{D_X M}{2\theta\sqrt{N}} + \frac{\theta D_X M}{2\sqrt{N}} \le \frac{\kappa D_X M}{\sqrt{N}},
\]
where κ = max{θ, θ^{−1}}. By Markov inequality it follows that
\[
\mathrm{Prob}\bigl\{ f(\bar x_N) - v_0 > \varepsilon \bigr\} \le \frac{\kappa D_X M}{\varepsilon \sqrt{N}}.
\]
Hence the sample size
\[
N \ge \frac{\kappa^2 D_X^2 M^2}{\varepsilon^2 \alpha^2}
\]
guarantees that Prob{f(x̄_N) − v_0 > ε} ≤ α.

Page 23

Mirror Descent SA method (Nemirovski)

Let ‖·‖ be a norm on R^n and let ω : X → R be continuously differentiable and strongly convex on X with respect to ‖·‖, i.e., for x, x′ ∈ X:
\[
\omega(x') \ge \omega(x) + (x' - x)^T \nabla \omega(x) + \tfrac{1}{2} c \|x' - x\|^2.
\]
Prox mapping P_x : R^n → X:
\[
P_x(y) = \arg\min_{z \in X} \bigl\{ \omega(z) + (y - \nabla\omega(x))^T z \bigr\}.
\]
For ω(x) = ½xᵀx we have that P_x(y) = Π_X(x − y). Set
\[
x_{j+1} = P_{x_j}\bigl( \gamma_j G(x_j, \xi^j) \bigr), \qquad
\bar x_j = \sum_{t=1}^j \nu_t x_t, \quad \text{where } \nu_t = \frac{\gamma_t}{\sum_{\tau=1}^j \gamma_\tau}.
\]
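For instance (an illustrative sketch, not from the slides), with X the probability simplex and the entropy function ω(x) = Σ_i x_i log x_i, which is strongly convex with respect to ‖·‖_1, the prox mapping has the closed form P_x(y)_i ∝ x_i e^{−y_i}, an exponentiated-gradient update:

```python
# Entropy prox mapping on the simplex: P_x(y) is proportional to x * exp(-y).
import numpy as np

def prox_entropy(x, y):
    z = x * np.exp(-(y - y.max()))     # max-shift for numerical stability
    return z / z.sum()

# One mirror descent SA step x_{j+1} = P_{x_j}(gamma_j * G(x_j, xi^j)):
rng = np.random.default_rng(8)
x = np.full(10, 0.1)                   # start at the uniform distribution
G = rng.normal(size=10)                # placeholder stochastic subgradient
x = prox_entropy(x, 0.5 * G)
print(x.round(3), "sum =", x.sum())
```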

Page 24

Then
\[
\mathbb{E}\bigl[ f(\bar x_j) - v_0 \bigr] \le \frac{D_{\omega,X}^2 + (2c)^{-1} M_*^2 \sum_{t=1}^j \gamma_t^2}{2 \sum_{t=1}^j \gamma_t},
\]
where M_* is a positive constant such that
\[
\mathbb{E}\bigl[ \|G(x,\xi)\|_*^2 \bigr] \le M_*^2, \quad \forall x \in X,
\]
‖x‖_* = sup_{‖z‖≤1} xᵀz is the dual norm of the norm ‖·‖, and
\[
D_{\omega,X} = \Bigl[ \max_{z \in X} \omega(z) - \min_{x \in X} \omega(x) \Bigr]^{1/2}.
\]
For constant step size γ_j = γ, j = 1, ..., N, with the optimal (up to a factor θ > 0) stepsize γ = θD_{ω,X}√(2c)/(M_*√N), we have
\[
\mathbb{E}\bigl[ f(\bar x_N) - v_0 \bigr] \le \frac{\max\{\theta, \theta^{-1}\} \sqrt{2}\, D_{\omega,X} M_*}{\sqrt{cN}}.
\]

Page 25

Bounds by the Mirror Descent SA method.

For the iterates x_{j+1} = P_{x_j}(γ_j G(x_j, ξ^j)), consider
\[
f^N(x) := \sum_{j=1}^N \nu_j \bigl[ f(x_j) + g(x_j)^T (x - x_j) \bigr],
\]
where f(x) = E[F(x,ξ)], g(x) = E[G(x,ξ)] and ν_j := γ_j / (Σ_{j=1}^N γ_j). Since g(x) ∈ ∂f(x), it follows that
\[
f_*^N := \min_{x \in X} f^N(x) \le v_0.
\]
Also v_0 ≤ f(x̄_N) and, by convexity of f,
\[
f(\bar x_N) \le f^{*N} := \sum_{j=1}^N \nu_j f(x_j).
\]

Page 26

That is, for any realization of the random sample ξ^1, ..., ξ^N,
\[
f_*^N \le v_0 \le f^{*N}.
\]
Computational (observable) counterparts of these bounds:
\[
\underline{f}^N := \min_{x \in X} \sum_{j=1}^N \nu_j \bigl[ F(x_j, \xi^j) + G(x_j, \xi^j)^T (x - x_j) \bigr],
\qquad
\overline{f}^N := \sum_{j=1}^N \nu_j F(x_j, \xi^j).
\]
We have that E[f^{*N}] = E[\(\overline{f}^N\)], and
\[
\mathbb{E}\bigl[ \underline{f}^N \bigr] \le v_0 \le \mathbb{E}\bigl[ \overline{f}^N \bigr].
\]

Page 27

Multistage stochastic programming

\[
\min_{x_1 \in X_1} f_1(x_1) + \mathbb{E}\Bigl[ \min_{x_2 \in X_2(x_1,\xi_2)} f_2(x_2, \xi_2) + \cdots + \mathbb{E}\bigl[ \min_{x_T \in X_T(x_{T-1},\xi_T)} f_T(x_T, \xi_T) \bigr] \Bigr], \tag{12}
\]
where ξ_1, ..., ξ_T is a random process (ξ_1 is deterministic), x_t ∈ R^{n_t} are decision variables, f_t : R^{n_t} × R^{d_t} → R are objective functions and X_t : R^{n_{t−1}} × R^{d_t} ⇉ R^{n_t} are measurable closed valued multifunctions.
For example, in the linear case f_t(x_t, ξ_t) := c_t^T x_t,
\[
X_t(x_{t-1}, \xi_t) := \{ x_t : B_t x_{t-1} + A_t x_t = b_t, \; x_t \ge 0 \},
\]
ξ_t = (c_t, B_t, A_t, b_t), t = 2, ..., T, is considered as a random process, and ξ_1 = (c_1, A_1, b_1) is supposed to be known.

Page 28

Hence in the linear case the nested formulation can be written as
\[
\min_{\substack{A_1 x_1 = b_1 \\ x_1 \ge 0}} c_1^T x_1 + \mathbb{E}\Biggl[ \min_{\substack{B_2 x_1 + A_2 x_2 = b_2 \\ x_2 \ge 0}} c_2^T x_2 + \cdots + \mathbb{E}\Bigl[ \min_{\substack{B_T x_{T-1} + A_T x_T = b_T \\ x_T \ge 0}} c_T^T x_T \Bigr] \Biggr]. \tag{13}
\]
If the number of realizations (scenarios) of the process ξ_t is finite, then problem (13) can be written as one large linear programming problem. There are several possible formulations of the above multistage program.

Page 29

Conditional sampling.
Let ξ_2^i, i = 1, ..., N_1, be an iid random sample of ξ_2. Conditional on ξ_2 = ξ_2^i, a random sample ξ_3^{ij}, j = 1, ..., N_2, is generated, and so on. The obtained scenario tree is considered as a sample approximation of the true problem. Note that the total number of scenarios is N = Π_{t=1}^{T−1} N_t and each scenario in the generated tree is considered with the same probability 1/N. Note also that in the case of stagewise independence of the corresponding random process, we have two possible strategies. We can generate a different (independent) sample ξ_3^{ij}, j = 1, ..., N_2, for every generated node ξ_2^i, or we can use the same sample ξ_3^j, j = 1, ..., N_2, for every ξ_2^i. In the second case we preserve the stagewise independence condition for the generated scenario tree.
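A tiny sketch of conditional sampling under stagewise independence, using the second strategy (the same per-stage sample at every node; the stage sizes are arbitrary assumptions):

```python
# The scenario tree has N = N_1 * N_2 * ... * N_{T-1} equally likely scenarios.
import numpy as np
from itertools import product

rng = np.random.default_rng(9)
stage_sizes = [3, 4, 2]                           # N_1, N_2, N_3 for T = 4
stage_samples = [rng.normal(size=k) for k in stage_sizes]
scenarios = list(product(*stage_samples))         # one path (xi_2,...,xi_T) each
print(len(scenarios), "scenarios, each with probability", 1 / len(scenarios))
```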

Page 30

For T = 3, under certain regularity conditions, for ε > 0 and α ∈ (0,1), and sample sizes N_1 and N_2 satisfying
\[
O(1)\Biggl[ \Bigl( \frac{D_1 L_1}{\varepsilon} \Bigr)^{n_1} \exp\Bigl\{ -\frac{O(1) N_1 \varepsilon^2}{\sigma_1^2} \Bigr\} + \Bigl( \frac{D_2 L_2}{\varepsilon} \Bigr)^{n_2} \exp\Bigl\{ -\frac{O(1) N_2 \varepsilon^2}{\sigma_2^2} \Bigr\} \Biggr] \le \alpha,
\]
we have that any first-stage ε/2-optimal solution of the SAA problem is an ε-optimal first-stage solution of the true problem with probability at least 1 − α. (Here D_1, D_2, L_1, L_2, σ_1, σ_2 are certain analogues of the similar constants in the sample size estimate for two-stage problems.)

Page 31

In particular, suppose that N_1 = N_2 and take L := max{L_1, L_2}, D := max{D_1, D_2}, σ² := max{σ_1², σ_2²} and n := max{n_1, n_2}. Then the required sample size N_1 = N_2 satisfies
\[
N_1 \ge \frac{O(1)\sigma^2}{\varepsilon^2} \Bigl[ n \log\Bigl( \frac{O(1) D L}{\varepsilon} \Bigr) + \log\Bigl( \frac{1}{\alpha} \Bigr) \Bigr],
\]
with total number of scenarios N = N_1².

That is, the total number of scenarios needed to solve a T-stage stochastic program with a reasonable accuracy by the SAA method grows exponentially with increase of the number of stages T. Another way of putting this is that the number of scenarios needed to solve a T-stage problem would grow as O(ε^{−2(T−1)}) with decrease of the error level ε > 0.

Page 32

This indicates that, from the point of view of the number of scenarios, the complexity of multistage programming problems grows exponentially with the number of stages. Furthermore, even if the SAA problem can be solved, its solution does not define a policy for the true problem; only the computed first-stage solution may be of use.

Page 33

If multistage stochastic programming problems cannot be solved to optimality, one may think about approximations. There are several possible approaches to trying to solve multistage stochastic programs approximately. One approach is to reduce the dynamic setting to a static case. Suppose that we can identify a parametric family of policies x̄_t(ξ_[t], θ_t), t = 1, ..., T, depending on a finite number of parameters θ_t ∈ Θ_t ⊂ R^{q_t}, and such that these policies are feasible for all parameter values. That is, for all θ_t ∈ Θ_t, t = 1, ..., T, it holds that x̄_1(θ_1) ∈ X_1 and x̄_t(ξ_[t], θ_t) ∈ X_t(x̄_{t−1}(ξ_[t−1], θ_{t−1}), ξ_t), t = 2, ..., T, w.p.1.
Consider the following stochastic program:
\[
\begin{array}{ll}
\displaystyle \min_{\theta_1, \ldots, \theta_T} & f_1\bigl( \bar x_1(\theta_1) \bigr) + \mathbb{E}\Bigl[ \textstyle\sum_{t=2}^T f_t\bigl( \bar x_t(\xi_{[t]}, \theta_t), \xi_t \bigr) \Bigr] \\[4pt]
\text{s.t.} & \theta_t \in \Theta_t, \; t = 1, \ldots, T.
\end{array} \tag{14}
\]

Page 34

An alternative approach to solving multistage programs is to approximate the dynamic programming equations. Suppose that the stagewise independence condition holds and that we have a procedure for generating cutting (supporting) planes for the cost-to-go functions Q_{t,N}(·) of an SAA problem. By taking the maximum of these cutting planes we can construct piecewise linear convex functions Q_t(x_{t−1}) approximating the SAA cost-to-go functions from below, i.e., Q_{t,N}(·) ≥ Q_t(·), t = 2, ..., T. These functions Q_t(x_{t−1}) and a feasible first stage solution x̄_1 define the following policy:
\[
\bar x_t \in \arg\min \bigl\{ c_t^T x_t + Q_{t+1}(x_t) : A_t x_t \le b_t - B_t \bar x_{t-1} \bigr\}, \quad t = 2, \ldots, T.
\]
Stochastic Dual Dynamic Programming (SDDP) algorithm (Pereira and Pinto, 1991).

Page 35

Some references

Nemirovski, A., Juditsky, A., Lan, G. and Shapiro, A., Robust stochastic approximation approach to stochastic programming, SIAM Journal on Optimization, 19, 1574-1609 (2009).

Shapiro, A., Dentcheva, D. and Ruszczyński, A., Lectures on Stochastic Programming: Modeling and Theory, SIAM, Philadelphia, 2009.

Lan, G., Nemirovski, A. and Shapiro, A., Validation Analysis ofMirror Descent Stochastic Approximation Method,Mathematical Programming, to appear.