Source: cermics.enpc.fr/~delara/exposes/slides_dynamic_programming.pdf
Dynamic Programming
Michel De Lara
Cermics, École des Ponts ParisTech, France
January 2, 2020
Outline of the presentation
Dynamic Programming Without State
Dynamic Programming With State
Dynamic Programming With State and White Noise
Dynamic Programming With State and White Noise (Complements)
Basic data

Let $(\Omega, \mathcal{A}, \mathbb{P})$ be a probability space.
Let $T \in \mathbb{N}^*$ be the horizon.
For stages $t = 0, \ldots, T$, let $\mathbb{U}_t$, $\mathbb{W}_t$ be measurable sets, equipped with $\sigma$-fields $\mathcal{U}_t$, $\mathcal{W}_t$.
History space

For $t = 0, \ldots, T$, we define the history space $\mathbb{H}_t$
$$\mathbb{H}_t = \mathbb{W}_0 \times \prod_{s=0}^{t-1} (\mathbb{U}_s \times \mathbb{W}_{s+1})$$
equipped with the history field $\mathcal{H}_t$
$$\mathcal{H}_t = \mathcal{W}_0 \otimes \bigotimes_{s=0}^{t-1} (\mathcal{U}_s \otimes \mathcal{W}_{s+1})$$
A generic element $h_t \in \mathbb{H}_t$ is called a history
$$h_t = (w_0, u_0, w_1, u_1, w_2, \ldots, u_{t-2}, w_{t-1}, u_{t-1}, w_t)$$
$$[h_t] = \bigl[(w_0, (u_s, w_{s+1})_{s=0,\ldots,t-1})\bigr] = (w_0, \ldots, w_t) = w_{[0:t]}$$
$$h_{s:t} = (u_r, w_{r+1})_{r=s-1,\ldots,t-1} = (u_{s-1}, w_s, \ldots, u_{t-1}, w_t)$$
$$[h_{s:t}] = \bigl[(u_r, w_{r+1})_{r=s-1,\ldots,t-1}\bigr] = (w_s, \ldots, w_t) = w_{[s:t]}$$
Noise process and filtration

For $t = 0, \ldots, T$, let $W_t$ be a random variable taking values in $\mathbb{W}_t$
$$W_t : \Omega \to \mathbb{W}_t$$
We put $W_{[0:t]} = (W_0, \ldots, W_t)$.
We introduce the filtration $\mathbb{A} = (\mathcal{A}_t)_{t=0,\ldots,T}$ defined by
$$\mathcal{A}_t = \sigma(W_0, \ldots, W_t) \,, \quad t = 0, \ldots, T$$
Let $L^0_{\mathbb{A}}(\Omega, \prod_{s=0}^{T-1} \mathbb{U}_s)$ be the space of $\mathbb{A}$-adapted processes $(U_0, \ldots, U_{T-1})$ with values in $\prod_{s=0}^{T-1} \mathbb{U}_s$, that is, such that
$$\sigma(U_0) \subset \mathcal{A}_0, \ldots, \sigma(U_{T-1}) \subset \mathcal{A}_{T-1}$$
Multistage stochastic optimization problem

Let $\varphi : \mathbb{H}_T \to \mathbb{R}$ be a bounded function, measurable with respect to the field $\mathcal{H}_T$.
We consider the multistage stochastic optimization problem
$$\min_{(U_0,\ldots,U_{T-1}) \in L^0_{\mathbb{A}}(\Omega, \prod_{s=0}^{T-1} \mathbb{U}_s)} \mathbb{E}\bigl[\varphi(W_0, U_0, W_1, \ldots, U_{T-1}, W_T)\bigr]$$
Bellman operators

For $t = 0, \ldots, T$, we define $L^\infty(\mathbb{H}_t, \mathcal{H}_t)$, the space of bounded measurable real-valued functions over $\mathbb{H}_t$.
We suppose that there exists a regular conditional distribution of the random variable $W_{t+1}$ knowing the random process $W_{[0:t]}$, denoted by
$$\mathbb{P}^{W_{[0:t]}}_{W_{t+1}}(w_{[0:t]}, dw_{t+1}) = \mathbb{P}^{W_{[0:t]}}_{W_{t+1}}([h_t], dw_{t+1})$$
We define the Bellman operators $\mathcal{B}_{t+1} : L^\infty(\mathbb{H}_{t+1}, \mathcal{H}_{t+1}) \to L^\infty(\mathbb{H}_t, \mathcal{H}_t)$ by
$$(\mathcal{B}_{t+1}\varphi)(h_t) = \inf_{u_t \in \mathbb{U}_t} \int_{\mathbb{W}_{t+1}} \varphi\bigl((h_t, u_t, w_{t+1})\bigr)\, \mathbb{P}^{W_{[0:t]}}_{W_{t+1}}([h_t], dw_{t+1})$$
$$\forall \varphi \in L^\infty(\mathbb{H}_{t+1}, \mathcal{H}_{t+1}) \,, \quad \forall h_t \in \mathbb{H}_t$$
Value functions and Bellman equation

We define inductively the value functions, or Bellman functions,
$$V_t : \mathbb{H}_t \to \mathbb{R} \,, \quad t = 0, \ldots, T$$
by
$$V_T = \varphi \,, \quad V_t = \mathcal{B}_{t+1} V_{t+1} \,, \quad t = 0, \ldots, T-1 \,,$$
that is, solutions of the Bellman equation
$$V_t(h_t) = \inf_{u_t \in \mathbb{U}_t} \int_{\mathbb{W}_{t+1}} V_{t+1}(h_t, u_t, w_{t+1})\, \mathbb{P}^{W_{[0:t]}}_{W_{t+1}}([h_t], dw_{t+1})$$
Measurable selection

We suppose that, for all $t = 0, \ldots, T-1$, there exists a measurable selection
$$\psi_t : (\mathbb{H}_t, \mathcal{H}_t) \to (\mathbb{U}_t, \mathcal{U}_t)$$
such that
$$\psi_t(h_t) \in \arg\min_{u_t \in \mathbb{U}_t} \int_{\mathbb{W}_{t+1}} V_{t+1}(h_t, u_t, w_{t+1})\, \mathbb{P}^{W_{[0:t]}}_{W_{t+1}}([h_t], dw_{t+1})$$
Proposition. A solution to the multistage stochastic optimization problem
$$\min_{(U_0,\ldots,U_{T-1}) \in L^0_{\mathbb{A}}(\Omega, \prod_{s=0}^{T-1} \mathbb{U}_s)} \mathbb{E}\bigl[\varphi(W_0, U_0, W_1, \ldots, U_{T-1}, W_T)\bigr]$$
is the sequence $U^\star_0, \ldots, U^\star_{T-1}$ of random variables defined inductively by
$$U^\star_t = \psi_t(H^\star_t) \,, \quad t = 0, \ldots, T-1$$
where
$$H^\star_0 = W_0 \,, \quad H^\star_{t+1} = (H^\star_t, U^\star_t, W_{t+1}) \,, \quad t = 0, \ldots, T-1$$
The minimum is
$$\mathbb{E}\bigl[V_0(W_0)\bigr] = \min_{(U_0,\ldots,U_{T-1}) \in L^0_{\mathbb{A}}(\Omega, \prod_{s=0}^{T-1} \mathbb{U}_s)} \mathbb{E}\bigl[\varphi(W_0, U_0, W_1, \ldots, U_{T-1}, W_T)\bigr]$$
Extension

Constraints of the form
$$H_0 = W_0 \,, \quad H_{t+1} = (H_t, U_t, W_{t+1})$$
and
$$(H_t, U_t) \in C_t \subset \mathbb{H}_t \times \mathbb{U}_t \,, \quad \mathbb{P}\text{-a.s.} \,, \quad t = 0, \ldots, T-1$$
Outline of the presentation
Dynamic Programming Without State
Dynamic Programming With State
Dynamic Programming With State and White Noise
Dynamic Programming With State and White Noise (Complements)
State reduction and dynamics

For $t = 0, \ldots, T$, suppose that there exist
a state space $\mathbb{X}_t$, a measurable set equipped with $\sigma$-field $\mathcal{X}_t$,
reduction mappings $\theta_t : \mathbb{H}_t \to \mathbb{X}_t$,
dynamics $f_t : \mathbb{X}_t \times \mathbb{U}_t \times \mathbb{W}_{t+1} \to \mathbb{X}_{t+1}$,
such that
$$\theta_{t+1}(h_t, u_t, w_{t+1}) = f_t\bigl(\theta_t(h_t), u_t, w_{t+1}\bigr) \,, \quad t = 0, \ldots, T-1$$
Cost only depends on the final state

Suppose that there exists $\tilde{\varphi} : \mathbb{X}_T \to \mathbb{R}$
such that the cost $\varphi$ can be factored as
$$\varphi = \tilde{\varphi} \circ \theta_T$$
Markovian assumption

Let $\Delta(\mathbb{W}_{t+1})$ denote the set of probabilities on $(\mathbb{W}_{t+1}, \mathcal{W}_{t+1})$.
Suppose that, for all $t = 0, \ldots, T-1$, there exists
$$\mu_t : \mathbb{X}_t \to \Delta(\mathbb{W}_{t+1})$$
such that
$$\mathbb{P}^{W_{[0:t]}}_{W_{t+1}}\bigl([h_t], dw_{t+1}\bigr) = \mu_t\bigl(\theta_t(h_t), dw_{t+1}\bigr)$$
Bellman equation

We define inductively
$$V_T(x_T) = \tilde{\varphi}(x_T) \,, \quad \forall x_T \in \mathbb{X}_T$$
$$V_t(x_t) = \inf_{u_t \in \mathbb{U}_t} \int_{\mathbb{W}_{t+1}} V_{t+1}\bigl(f_t(x_t, u_t, w_{t+1})\bigr)\, \mu_t(x_t, dw_{t+1})$$
$$\forall x_t \in \mathbb{X}_t \,, \quad t = 0, \ldots, T-1$$
We suppose that there exists a measurable selection
$$\psi_t : (\mathbb{X}_t, \mathcal{X}_t) \to (\mathbb{U}_t, \mathcal{U}_t) \,, \quad t = 0, \ldots, T-1$$
such that
$$\psi_t(x_t) \in \arg\min_{u_t \in \mathbb{U}_t} \int_{\mathbb{W}_{t+1}} V_{t+1}\bigl(f_t(x_t, u_t, w_{t+1})\bigr)\, \mu_t(x_t, dw_{t+1})$$
Proposition. A solution to the multistage stochastic optimization problem
$$\min_{U_0,\ldots,U_{T-1}} \mathbb{E}\bigl[\varphi(W_0, U_0, W_1, \ldots, U_{T-1}, W_T)\bigr]$$
$$\sigma(U_0) \subset \mathcal{A}_0, \ldots, \sigma(U_{T-1}) \subset \mathcal{A}_{T-1}$$
is the sequence $U^\star_0, \ldots, U^\star_{T-1}$ of random variables defined inductively by
$$U^\star_t = \psi_t(X^\star_t) \,, \quad t = 0, \ldots, T-1$$
where
$$X^\star_0 = W_0 \,, \quad X^\star_{t+1} = f_t(X^\star_t, U^\star_t, W_{t+1}) \,, \quad t = 0, \ldots, T-1$$
The minimum is
$$\mathbb{E}\bigl[V_0(X^\star_0)\bigr] = \min_{(U_0,\ldots,U_{T-1}) \in L^0_{\mathbb{A}}(\Omega, \prod_{s=0}^{T-1} \mathbb{U}_s)} \mathbb{E}\bigl[\varphi(W_0, U_0, W_1, \ldots, U_{T-1}, W_T)\bigr]$$
Extension

Constraints of the form
$$(X_t, U_t) \in C_t \subset \mathbb{X}_t \times \mathbb{U}_t \,, \quad \mathbb{P}\text{-a.s.} \,, \quad t = 0, \ldots, T-1$$
Outline of the presentation
Dynamic Programming Without State
Dynamic Programming With State
Dynamic Programming With State and White Noise
Dynamic Programming With State and White Noise (Complements)
Basic data

For $t = 0, \ldots, T$, let $\mathbb{U}$, $\mathbb{W}$, $\mathbb{X}$ be measurable sets, equipped with $\sigma$-fields $\mathcal{U}$, $\mathcal{W}$, $\mathcal{X}$, together with
dynamics $f_t : \mathbb{X} \times \mathbb{U} \times \mathbb{W} \to \mathbb{X}$,
instantaneous costs $L_t : \mathbb{X} \times \mathbb{U} \times \mathbb{W} \to \mathbb{R}$,
a final cost $K : \mathbb{X} \to \mathbb{R}$,
constraints $B_t : \mathbb{X} \rightrightarrows \mathbb{U}$.
Bellman equation

We consider a stochastic process $(W_1, \ldots, W_T)$, with values in $\mathbb{W}$, which is a white noise, that is, $W_1, \ldots, W_T$ are independent random variables.
We consider a random variable $X_0$, with values in $\mathbb{X}$, independent of the stochastic process $(W_1, \ldots, W_T)$.
We define inductively the Bellman functions
$$V_T(x) = K(x) \,, \quad \forall x \in \mathbb{X}$$
$$V_t(x) = \inf_{u \in B_t(x)} \mathbb{E}_{W_{t+1}}\bigl[L_t(x, u, W_{t+1}) + V_{t+1}\bigl(f_t(x, u, W_{t+1})\bigr)\bigr]$$
$$\forall x \in \mathbb{X} \,, \quad t = 0, \ldots, T-1$$
We suppose that there exists a measurable selection
$$\psi_t : (\mathbb{X}, \mathcal{X}) \to (\mathbb{U}, \mathcal{U}) \,, \quad t = 0, \ldots, T-1$$
such that
$$\psi_t(x) \in \arg\min_{u \in B_t(x)} \mathbb{E}_{W_{t+1}}\bigl[L_t(x, u, W_{t+1}) + V_{t+1}\bigl(f_t(x, u, W_{t+1})\bigr)\bigr]$$
Proposition. A solution to the multistage stochastic optimization problem
$$\min_{U_0,\ldots,U_{T-1}} \mathbb{E}\Bigl[\sum_{t=0}^{T-1} L_t(X_t, U_t, W_{t+1}) + K(X_T)\Bigr]$$
$$\sigma(U_0) \subset \mathcal{A}_0, \ldots, \sigma(U_{T-1}) \subset \mathcal{A}_{T-1}$$
$$X_{t+1} = f_t(X_t, U_t, W_{t+1}) \,, \quad t = 0, \ldots, T-1$$
is the sequence $U^\star_0, \ldots, U^\star_{T-1}$ of random variables defined inductively by
$$U^\star_t = \psi_t(X^\star_t) \,, \quad t = 0, \ldots, T-1$$
where
$$X^\star_0 = X_0 \,, \quad X^\star_{t+1} = f_t(X^\star_t, U^\star_t, W_{t+1}) \,, \quad t = 0, \ldots, T-1$$
The minimum is
$$\mathbb{E}\bigl[V_0(X_0)\bigr] = \min_{U_0,\ldots,U_{T-1}} \mathbb{E}\Bigl[\sum_{t=0}^{T-1} L_t(X_t, U_t, W_{t+1}) + K(X_T)\Bigr]$$
Outline of the presentation
Dynamic Programming Without State
Dynamic Programming With State
Dynamic Programming With State and White Noise
Dynamic Programming With State and White Noise (Complements)
    The payoff-to-go and Bellman’s Principle of Optimality
    Bellman equation and the curse of dimensionality
    Complements: hazard-decision, linear-quadratic, linear-convex
The shortest path on a graph illustrates Bellman’s Principle of Optimality

[Figure: route map Los Angeles → Chicago → Boston]

For an auto travel analogy, suppose that the fastest route from Los Angeles to Boston passes through Chicago. The principle of optimality translates to the obvious fact that the Chicago to Boston portion of the route is also the fastest route for a trip that starts from Chicago and ends in Boston. (Dimitri P. Bertsekas)
Bellman’s Principle of Optimality

Richard Ernest Bellman (August 26, 1920 – March 19, 1984)

An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision. (Richard Bellman)
We make the assumption that the primitive random variables are independent.

The set $\mathbb{S} = \mathbb{W}^{T+1-t_0} = \mathbb{R}^q \times \cdots \times \mathbb{R}^q$ of scenarios is equipped with a $\sigma$-field $\mathcal{F} = \bigotimes_{t=t_0}^{T} \mathcal{B}(\mathbb{R}^q)$ and a probability $\mathbb{P}$, supposed to be of product form
$$\mathbb{P} = \mu_{t_0} \otimes \cdots \otimes \mu_T$$
Therefore, the primitive random variables
$$W_{t_0}, W_{t_0+1}, \ldots, W_{T-1}, W_T$$
are independent under $\mathbb{P}$, with marginal distributions $\mu_{t_0}, \ldots, \mu_T$.
The notation $\mathbb{E}$ refers to the mathematical expectation over $\mathbb{S}$, either under the probability $\mathbb{P}$, as in $\mathbb{E}$, $\mathbb{E}_W$, $\mathbb{E}_{\mathbb{P}}$, or under the marginal distributions $\mu_{t_0}, \ldots, \mu_T$, as in $\mathbb{E}_{W_{t+1}}$, $\mathbb{E}_{\mu_{t+1}}$.
What is state and what is noise?

Delineating what is state and what is noise is a modelling issue.
When the uncertainties are not independent, a solution is to enlarge the state.
If the water inflows follow an auto-regressive model, we have
$$\underbrace{S_{t+1}}_{\text{future stock}} = \min\Bigl\{s^\sharp,\; \underbrace{S_t}_{\text{stock}} - \underbrace{Q_t}_{\text{water release}} + \underbrace{A_t}_{\text{water inflows}}\Bigr\}$$
$$\underbrace{A_{t+1}}_{\text{future water inflows}} = \alpha \underbrace{A_t}_{\text{water inflows}} + \underbrace{W_{t+1}}_{\text{noise}}$$
where we suppose that $W_{t_0}, W_{t_0+1}, \ldots, W_{T-1}, W_T$ form a sequence of independent random variables.
The couple $x_t = (s_t, a_t)$ is a sufficient summary of past controls and uncertainties to do forecasting: knowing the state $x_t = (s_t, a_t)$ at time $t$ is sufficient to forecast $x_{t+1}$, given the control $q_t$ and the uncertainty $W_{t+1}$, as in the sketch below.
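As a minimal illustration, here is one step of the enlarged-state dynamics in code; the values of `alpha` and `s_cap` and all names are our assumptions, not data from the slides:

```python
# Sketch: enlarging the state to (stock, inflow) so that the noise
# driving the AR(1) inflow model is white with respect to the new state.

def augmented_dynamics(state, q, w, alpha=0.7, s_cap=100.0):
    """One step of x_{t+1} = f_t(x_t, q_t, W_{t+1}) with x_t = (s_t, a_t)."""
    s, a = state
    s_next = min(s_cap, s - q + a)   # S_{t+1} = min{s#, S_t - Q_t + A_t}
    a_next = alpha * a + w           # A_{t+1} = alpha * A_t + W_{t+1}
    return (s_next, a_next)
```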
What is a state?

From Bellman’s autobiography, Eye of the Hurricane:
"Conversely, once it was realized that the concept of policy was fundamental in control theory, the mathematicization of the basic engineering concept of 'feedback control', then the emphasis upon a state variable formulation became natural."

A state in optimal stochastic control problems is a sufficient statistic for the uncertainties and past controls (P. Whittle, Optimization over Time: Dynamic Programming and Stochastic Control).

Quoting Whittle, suppose there is a variable $x_t$ which summarizes past history in that, given $t$ and the value of $x_t$, one can calculate the optimal $u_t$ and also $x_{t+1}$ without knowledge of the history $(\omega, u_0, \ldots, u_{t-1})$, for all $t$, where $\omega$ represents all uncertainties. Such a variable is termed sufficient.

While a history takes values in an increasing space as $t$ increases, a sufficient variable taking values in a space independent of $t$ is called a state variable.
Outline of the presentation
Dynamic Programming Without State
Dynamic Programming With State
Dynamic Programming With State and White Noise
Dynamic Programming With State and White Noise (Complements)
    The payoff-to-go and Bellman’s Principle of Optimality
    Bellman equation and the curse of dimensionality
    Complements: hazard-decision, linear-quadratic, linear-convex
The payoff-to-go / value function / Bellman function

Assume that the primitive random variables $W_{t_0}, W_{t_0+1}, \ldots, W_{T-1}, W_T$ are independent under the probability $\mathbb{P}$.
The payoff-to-go from state $x$ at time $t$ is
$$V_t(x) = \min_{\mathrm{Pol} \in \mathcal{U}^{\mathrm{ad}}} \mathbb{E}\Bigl[\sum_{s=t}^{T-1} L_s(X_s, U_s, W_{s+1}) + K(X_T)\Bigr]$$
where $X_t = x$ and, for $s = t, \ldots, T-1$, $X_{s+1} = f_s(X_s, U_s, W_{s+1})$ and $U_s = \mathrm{Pol}_s(X_s)$.
The function $V_t$ is called the value function, or the Bellman function.
The original problem is $V_{t_0}(x_0)$.
The stochastic dynamic programming equation, or Bellman equation, is a backward equation satisfied by the value function.

Stochastic dynamic programming equation. If the primitive random variables $W_{t_0}, W_{t_0+1}, \ldots, W_{T-1}, W_T$ are independent under the probability $\mathbb{P}$, the value function $V_t(x)$ satisfies the following backward induction, where $t$ runs from $T-1$ down to $t_0$:
$$V_T(x) = K(x)$$
$$V_t(x) = \min_{u \in B_t(x)} \mathbb{E}_{W_{t+1}}\bigl[L_t(x, u, W_{t+1}) + V_{t+1}\bigl(f_t(x, u, W_{t+1})\bigr)\bigr]$$
Algorithm for the Bellman functions

initialization: $V_T(x) = K(x)$ ;
for $t = T-1, T-2, \ldots, t_0$ do
    forall $x \in \mathbb{X}$ do
        forall $u \in B_t(x)$ do
            forall $w \in \mathbb{W}_{t+1}$ do
                $l_t(x, u, w) = L_t(x, u, w) + V_{t+1}\bigl(f_t(x, u, w)\bigr)$
            compute $\sum_{w \in \mathbb{W}_{t+1}} P_w\, l_t(x, u, w)$
        $V_t(x) = \min_{u \in B_t(x)} \sum_{w \in \mathbb{W}_{t+1}} P_w\, l_t(x, u, w)$ ;
        $B^\star_t(x) = \arg\min_{u \in B_t(x)} \sum_{w \in \mathbb{W}_{t+1}} P_w\, l_t(x, u, w)$
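A minimal Python sketch of this backward loop on finite sets follows; the toy horizon, grids, costs and transition are illustrative assumptions, not data from the slides:

```python
# Backward Bellman recursion on finite sets (decision-hazard pattern).
T = 3
X = range(5)                              # state grid
U = range(3)                              # control set
W = [(-1, 0.3), (0, 0.4), (1, 0.3)]       # noise values with probabilities P_w

def dynamics(t, x, u, w):                 # f_t: clipped linear motion (toy)
    return max(0, min(4, x + u - 1 + w))

def cost(t, x, u, w):                     # instantaneous cost L_t (toy)
    return u + abs(x - 2)

def final_cost(x):                        # final cost K (toy)
    return (x - 2) ** 2

def admissible(t, x):                     # constraint set B_t(x) (toy)
    return U

V = [dict() for _ in range(T + 1)]        # Bellman functions V_t
policy = [dict() for _ in range(T)]       # measurable selections psi_t

V[T] = {x: final_cost(x) for x in X}      # V_T = K
for t in reversed(range(T)):              # t = T-1, ..., 0
    for x in X:
        best_u, best_val = None, float("inf")
        for u in admissible(t, x):
            # expectation over the white noise W_{t+1}
            val = sum(p * (cost(t, x, u, w) + V[t + 1][dynamics(t, x, u, w)])
                      for (w, p) in W)
            if val < best_val:
                best_u, best_val = u, val
        V[t][x] = best_val                # Bellman equation
        policy[t][x] = best_u             # psi_t(x)
```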
Sketch of the proof in the deterministic case

$$V_t(x) = \min_{u \in B_t(x)} \Bigl( \underbrace{L_t(x, u)}_{\text{instantaneous gain}} + \overbrace{V_{t+1}\bigl(\underbrace{f_t(x, u)}_{\text{future state}}\bigr)}^{\text{optimal payoff}} \Bigr)$$

[Figure: route map Los Angeles → Chicago → Boston]

A decision $u$ at time $t$ in state $x$ provides
an instantaneous gain $L_t(x, u)$
and a future payoff for attaining the new state $f_t(x, u)$.
The Bellman equation provides an optimal policy

Proposition. For any time $t$ and state $x$, assume the existence of the policy
$$\mathrm{Pol}^\star_t(x) \in \arg\min_{u \in B_t(x)} \mathbb{E}_{W_{t+1}}\bigl[L_t(x, u, W_{t+1}) + V_{t+1}\bigl(f_t(x, u, W_{t+1})\bigr)\bigr]$$
If $\mathrm{Pol}^\star_t : x \mapsto \mathrm{Pol}^\star_t(x)$ is measurable, then
$\mathrm{Pol}^\star$ is an optimal policy;
for any initial state $x_0$, the optimal expected payoff is given by
$$V_{t_0}(x_0) = \min_{\mathrm{Pol} \in \mathcal{U}^{\mathrm{ad}}} V^{\mathrm{Pol}}_{\mathrm{expect}}(t_0, x_0) = V^{\mathrm{Pol}^\star}_{\mathrm{expect}}(t_0, x_0)$$
A bit of history (and fun)
“Where did the name, dynamic programming, come from?”
The 1950s were not good years for mathematical research. We had a very interesting gentleman in Washington named Wilson. He was Secretary of Defense, and he actually had a pathological fear and hatred of the word, research. I’m not using the term lightly; I’m using it precisely. His face would suffuse, he would turn red, and he would get violent if people used the term, research, in his presence. You can imagine how he felt, then, about the term, mathematical.
“Where did the name, dynamic programming, come from?”
What title, what name, could I choose? In the first place I was interested in planning, in decision making, in thinking. But planning, is not a good word for various reasons. I decided therefore to use the word, programming.
“Where did the name, dynamic programming, come from?”
I wanted to get across the idea that this was dynamic, this was multistage, this was time-varying. I thought, let’s kill two birds with one stone. Let’s take a word that has an absolutely precise meaning, namely dynamic, in the classical physical sense. It also has a very interesting property as an adjective, and that is it’s impossible to use the word, dynamic, in a pejorative sense. Try thinking of some combination that will possibly give it a pejorative meaning. It’s impossible. Thus, I thought dynamic programming was a good name.
Navigating between “backward off-line” and “forward on-line”

Optimal trajectories are calculated forward on-line:
1. Initial state $x^\star_{t_0} = x_0$
2. Plug the state $x^\star_{t_0}$ into the “machine” $\mathrm{Pol}^\star_{t_0}$ → initial decision $u^\star_{t_0} = \mathrm{Pol}^\star_{t_0}(x^\star_{t_0})$
3. Run the dynamics → second state $x^\star_{t_0+1} = f_{t_0}(x^\star_{t_0}, u^\star_{t_0}, w_{t_0+1})$
4. Second decision $u^\star_{t_0+1} = \mathrm{Pol}^\star_{t_0+1}(x^\star_{t_0+1})$
5. And so on: $x^\star_{t_0+2} = f_{t_0+1}(x^\star_{t_0+1}, u^\star_{t_0+1}, w_{t_0+2})$
6. ...
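This forward pass can be sketched as below, reusing the `policy` and `dynamics` from the toy backward-recursion example earlier (names are ours):

```python
import random

def rollout(policy, x0, T, dynamics, sample_noise):
    """Forward on-line simulation of a policy computed backward off-line."""
    x, trajectory = x0, [x0]
    for t in range(T):
        u = policy[t][x]           # plug the state into Pol*_t
        w = sample_noise()         # the noise w_{t+1} is then revealed
        x = dynamics(t, x, u, w)   # run the dynamics to the next state
        trajectory.append(x)
    return trajectory

# e.g. with the toy data above:
# rollout(policy, 2, T, dynamics,
#         lambda: random.choices([-1, 0, 1], [0.3, 0.4, 0.3])[0])
```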
“Life is lived forward but understood backward” (Søren Kierkegaard)

D. P. Bertsekas introduces his book Dynamic Programming and Optimal Control with a citation by Søren Kierkegaard:
“Livet skal forstås baglæns, men leves forlæns”
(Life is to be understood backwards, but it is lived forwards)

The value function and the optimal policies are computed backward and off-line by means of the Bellman equation,
whereas the optimal trajectories are computed forward and on-line.
The curse of dimensionality :-(

The curse of dimensionality is illustrated by the random access memory capacity of a computer: one, two, three, infinity (Gamow).

On a computer, RAM: 8 GBytes $= 8 \times 1024^3 = 2^{33}$ bytes; a double-precision real: 8 bytes $= 2^3$ bytes $\Longrightarrow$ $2^{30} \approx 10^9$ double-precision reals can be handled in RAM.
If a state of dimension 4 is approximated by a grid with 100 levels per component, we need to manipulate $100^4 = 10^8$ reals and
do a time loop,
do a control loop (after discretization),
compute an expectation.
The wall of dimension can be pushed beyond 3 if additional properties are exploited (linearity, convexity).
Outline of the presentation
Dynamic Programming Without State
Dynamic Programming With State
Dynamic Programming With State and White Noise
Dynamic Programming With State and White Noise (Complements)
    The payoff-to-go and Bellman’s Principle of Optimality
    Bellman equation and the curse of dimensionality
    Complements: hazard-decision, linear-quadratic, linear-convex
Bellman equation and optimal policies in the hazard-decision information pattern

The uncertainty is observed before making the decision.

initialization: $V_T(x) = K(x)$ ;
for $t = T-1, T-2, \ldots, t_0$ do
    forall $x \in \mathbb{X}$ do
        forall $w \in \mathbb{W}_{t+1}$ do
            forall $u \in B_t(x)$ do
                $l_t(x, u, w) = L_t(x, u, w) + V_{t+1}\bigl(f_t(x, u, w)\bigr)$
            compute $\min_{u \in B_t(x)} l_t(x, u, w)$ ;
            $B^\star_t(x, w) = \arg\min_{u \in B_t(x)} l_t(x, u, w)$
        $V_t(x) = \sum_{w \in \mathbb{W}_{t+1}} P_w \min_{u \in B_t(x)} l_t(x, u, w)$
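In code, the only change from the decision-hazard sketch above is that the minimization over the control moves inside the expectation over the noise; again a sketch with the same illustrative toy data:

```python
# Hazard-decision: the decision may depend on the observed noise w,
# so the min over u sits inside the sum over w (same toy data as above).

V = [dict() for _ in range(T + 1)]
policy_hd = [dict() for _ in range(T)]    # B*_t(x, w): one decision per (x, w)

V[T] = {x: final_cost(x) for x in X}
for t in reversed(range(T)):
    for x in X:
        expectation = 0.0
        for (w, p) in W:
            # best response once w is known
            values = {u: cost(t, x, u, w) + V[t + 1][dynamics(t, x, u, w)]
                      for u in admissible(t, x)}
            u_best = min(values, key=values.get)
            policy_hd[t][(x, w)] = u_best
            expectation += p * values[u_best]
        V[t][x] = expectation
```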
In the linear-quadratic-Gaussian case, optimal policies are linear

When the costs are quadratic
$$K(x) = x' S_T x \,, \quad L_t(x, u, w) = x' S_t x + w' R_t w + u' Q_t u$$
and the dynamics are linear
$$f_t(x, u, w) = F_t x + G_t u + H_t w$$
and the primitive random variables $W_{t_0}, W_{t_0+1}, \ldots, W_{T-1}, W_T$ are Gaussian and independent under the probability $\mathbb{P}$,
then the value functions $x \mapsto V_t(x)$ are quadratic, and the optimal policies are linear:
$$u_t = K_t x_t$$
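The slides state the result; as a complement, here is a minimal numpy sketch of the standard backward Riccati recursion that produces the quadratic value functions $V_t(x) = x' P_t x + c_t$ and the linear gains, under the assumption of zero-mean noise (the function and its conventions are ours):

```python
import numpy as np

def lq_riccati(F, G, S, Q):
    """Backward Riccati recursion for the discrete-time LQ problem.

    F[t], G[t]: dynamics matrices; S[t], Q[t]: quadratic cost matrices,
    with S of length T+1 (S[T] is the final cost matrix S_T).
    Returns the value-function matrices P[t] and gains K[t], u_t = K[t] @ x_t.
    Assumes zero-mean noise, so H_t and R_t only add constants to V_t
    and do not affect the gains.
    """
    T = len(F)
    P = [None] * (T + 1)
    K = [None] * T
    P[T] = S[T]
    for t in reversed(range(T)):
        GPF = G[t].T @ P[t + 1] @ F[t]
        # K_t = -(Q_t + G' P_{t+1} G)^{-1} G' P_{t+1} F
        K[t] = -np.linalg.solve(Q[t] + G[t].T @ P[t + 1] @ G[t], GPF)
        # P_t = S_t + F' P_{t+1} F - F' P_{t+1} G (Q_t + G' P_{t+1} G)^{-1} G' P_{t+1} F
        P[t] = S[t] + F[t].T @ P[t + 1] @ F[t] + GPF.T @ K[t]
    return P, K
```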
How optimal decisions can be computed on-line

If we are able to store the value functions $x \mapsto V_t(x)$, we do not need to compute the optimal policy $\mathrm{Pol}^\star$ in advance and store it.
Indeed, when we are at state $x$ at time $t$ in real time, we can just compute the optimal decision $u^\star_t$ “on the fly” by
$$u^\star_t \in \arg\min_{u \in B_t(x)} \mathbb{E}_{W_{t+1}}\bigl[L_t(x, u, W_{t+1}) + V_{t+1}\bigl(f_t(x, u, W_{t+1})\bigr)\bigr]$$
In addition to sparing storage, this method makes it possible to incorporate in the above program any new information available at time $t$ (on the distribution of the noise $W_{t+1}$, for instance).
So, the question is: how can we store the value functions?
The effort can be concentrated on computing the value functions
on a grid, by discretizing the Bellman equation;
by estimating basis coefficients, when it is known that the value function is quadratic;
by estimating upper affine approximations of the value function, when it is known that the value function is concave;
by estimating lower approximations of the value function, when restricting the search to a subclass of policies (open-loop in OLFC).
In the linear-convex case, value functions are convex

Here, we aim at minimizing the expected cumulated costs
$$\mathbb{E}\Bigl[\sum_{t=t_0}^{T-1} \underbrace{L_t(X_t, U_t, W_{t+1})}_{\text{instantaneous cost}} + \underbrace{K(X_T)}_{\text{final cost}}\Bigr]$$
The value functions $x \mapsto V_t(x)$ are convex whenever
$(x, u) \mapsto L_t(x, u, w)$ is jointly convex in state and control,
$x \mapsto K(x)$ is convex,
the primitive random variables $W_{t_0}, W_{t_0+1}, \ldots, W_{T-1}, W_T$ are independent under the probability $\mathbb{P}$,
and the dynamics are affine:
$$f_t(x, u, w) = F_t x + G_t u + H_t w$$

The minimum over one variable of a jointly convex function is convex in the other variable

A lemma in convex analysis. Let $f : \mathbb{Y} \times \mathbb{Z} \to \mathbb{R}$ be convex, and let $C \subset \mathbb{Y} \times \mathbb{Z}$ be a convex set. Then
$$g(y) = \min_{z \in \mathbb{Z},\, (y, z) \in C} f(y, z)$$
is a convex function.
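A short proof sketch of the lemma, assuming the minima are attained: take $y_1, y_2 \in \mathbb{Y}$ with minimizers $z_1, z_2$ such that $(y_i, z_i) \in C$; since $C$ is convex, the convex combination is admissible, and convexity of $f$ gives, for all $\lambda \in [0, 1]$,
$$g\bigl(\lambda y_1 + (1-\lambda) y_2\bigr) \le f\bigl(\lambda (y_1, z_1) + (1-\lambda)(y_2, z_2)\bigr) \le \lambda f(y_1, z_1) + (1-\lambda) f(y_2, z_2) = \lambda g(y_1) + (1-\lambda) g(y_2)$$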
The Bellman equation produces convex value functions

The dynamic programming equation associated with the problem of minimizing the expected costs is
$$V_T(x) = \underbrace{K(x)}_{\text{final cost}}$$
$$V_t(x) = \min_{u \in B_t(x)} \mathbb{E}_{W_{t+1}}\Bigl[\underbrace{L_t(x, u, W_{t+1})}_{\text{instantaneous cost}} + V_{t+1}\bigl(\underbrace{F_t x + G_t u + H_t W_{t+1}}_{\text{future state}}\bigr)\Bigr]$$
It can be shown by induction that $x \mapsto V_t(x)$ is convex.
The gradient $\nabla V_{t+1}(x^\star_{t+1})$ at $x^\star_{t+1}$ defines a hyperplane, hence a lower affine approximation of the value function, calculated by duality.
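Such affine minorants (“cuts”) can be accumulated into a polyhedral lower approximation $\max_k \{v_k + \langle g_k, x - x_k \rangle\}$ of a convex value function; the sketch below is illustrative (the class and names are ours), and in an SDDP-type method the cuts would be produced by the backward passes:

```python
import numpy as np

class CutApproximation:
    """Polyhedral lower approximation of a convex value function,
    stored as a finite family of affine cuts V(x) >= v_k + <g_k, x - x_k>."""

    def __init__(self):
        self.cuts = []    # list of (value v_k, gradient g_k, point x_k)

    def add_cut(self, value, gradient, point):
        self.cuts.append((value, np.asarray(gradient), np.asarray(point)))

    def __call__(self, x):
        x = np.asarray(x)
        # the approximation is the max of the affine cuts at x
        return max(v + g @ (x - xk) for (v, g, xk) in self.cuts)
```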
When spilling decisions are made after knowing the water inflows, we obtain a linear dynamical model
$$\underbrace{s_{t+1}}_{\text{future volume}} = \underbrace{s_t}_{\text{volume}} - \underbrace{q_t}_{\text{turbined}} - \underbrace{r_t}_{\text{spilled}} + \underbrace{a_t}_{\text{inflow volume}}$$
$s_t$: volume (stock) of water at the beginning of period $[t, t+1[$;
$a_t$: inflow water volume (rain, etc.) during $[t, t+1[$;
$q_t$: turbined outflow volume, decided at the beginning of period $[t, t+1[$ (hazard follows decision), supposed to depend on the stock $s_t$;
$r_t$: spilled volume, decided at the end of period $[t, t+1[$ (hazard precedes decision), supposed to depend on the stock $s_t$ and on the inflow water $a_t$;
with the constraints $0 \le q_t \le \min\{s_t, q^\sharp\}$ and $0 \le s_t - q_t + a_t - r_t \le s^\sharp$.
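One stage of this model can be sketched as follows; the spilling rule (spill just enough to respect the capacity $s^\sharp$) is one admissible hazard-decision choice, and the bounds `q_cap`, `s_cap` are illustrative assumptions:

```python
# One stage of the dam: q_t is decided before the inflow a_t is known
# (hazard follows decision), r_t after (hazard precedes decision).

def dam_stage(s, q, a, q_cap=40.0, s_cap=100.0):
    q = min(max(q, 0.0), s, q_cap)     # enforce 0 <= q_t <= min{s_t, q#}
    r = max(0.0, s - q + a - s_cap)    # spill the excess above s#
    s_next = s - q - r + a             # s_{t+1} = s_t - q_t - r_t + a_t
    return s_next                      # satisfies 0 <= s_{t+1} <= s#
```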
Stochastic Dual Dynamic Programming (SDDP)

The property that value functions are convex extends to the following cases:
multiple stocks interconnected by linear dynamics
$$S^i_{t+1} = S^i_t + A^i_t + Q^{i-1}_t - Q^i_t - R^i_t$$
water inflows following an auto-regressive model
$$A^i_t = \sum_{k=1,\ldots,K^i} \alpha_k A^i_{t-k} + W_t$$
where the random variables $W_{t_0}, W_{t_0+1}, \ldots, W_{T-1}, W_T$ are independent.
Summary

Bellman’s Principle of Optimality breaks an intertemporal optimization problem into a sequence of interconnected static optimization problems.
The payoff-to-go / value function / Bellman function is the solution of a backward dynamic programming equation, or Bellman equation.
The Bellman equation provides an optimal policy, a concept of solution adapted to the uncertain case.
In numerical practice, the curse of dimensionality makes dynamic programming impractical for a state of dimension more than four or five.
However, special cases like the linear-quadratic or the linear-convex ones do not (totally) suffer from the curse of dimensionality.