
Page 1

Dynamic Programming

Michel De Lara

Cermics, École des Ponts ParisTech, France

École des Ponts ParisTech

January 2, 2020

Page 2

Outline of the presentation

Dynamic Programming Without State

Dynamic Programming With State

Dynamic Programming With State and White Noise

Dynamic Programming With State and White Noise (Complements)

Page 3

Outline of the presentation

Dynamic Programming Without State

Dynamic Programming With State

Dynamic Programming With State and White Noise

Dynamic Programming With State and White Noise (Complements)

Page 4

Basic data

Let $(\Omega, \mathcal{A}_\infty, \mathbb{P})$ be a probability space

Let $T \in \mathbb{N}^*$ be the horizon

For stages $t = 0, \ldots, T$, let $\mathbb{U}_t$, $\mathbb{W}_t$ be measurable sets, equipped with $\sigma$-fields $\mathcal{U}_t$, $\mathcal{W}_t$

Page 5

History space

For $t = 0, \ldots, T$, we define

the history space $\mathbb{H}_t$
$$\mathbb{H}_t = \mathbb{W}_0 \times \prod_{s=0}^{t-1} (\mathbb{U}_s \times \mathbb{W}_{s+1})$$

equipped with the history field $\mathcal{H}_t$
$$\mathcal{H}_t = \mathcal{W}_0 \otimes \bigotimes_{s=0}^{t-1} (\mathcal{U}_s \otimes \mathcal{W}_{s+1})$$

A generic element $h_t \in \mathbb{H}_t$ is called a history
$$h_t = (w_0, u_0, w_1, u_1, w_2, \ldots, u_{t-2}, w_{t-1}, u_{t-1}, w_t)$$
$$[h_t] = \bigl[(w_0, (u_s, w_{s+1})_{s=0,\ldots,t-1})\bigr] = (w_0, \ldots, w_t) = w_{[0:t]}$$
$$h_{s:t} = (u_r, w_{r+1})_{r=s-1,\ldots,t-1} = (u_{s-1}, w_s, \ldots, u_{t-1}, w_t)$$
$$[h_{s:t}] = \bigl[(u_r, w_{r+1})_{r=s-1,\ldots,t-1}\bigr] = (w_s, \ldots, w_t) = w_{[s:t]}$$

Page 6

Noise process and filtration

For $t = 0, \ldots, T$, let $W_t$ be a random variable taking values in $\mathbb{W}_t$
$$W_t : \Omega \to \mathbb{W}_t$$

We put $W_{[0:t]} = (W_0, \ldots, W_t)$

We introduce the filtration $\mathcal{A} = (\mathcal{A}_t)_{t=0,\ldots,T}$ defined by
$$\mathcal{A}_t = \sigma(W_0, \ldots, W_t) \,, \quad t = 0, \ldots, T$$

Let $L^0_{\mathcal{A}}\bigl(\Omega, \prod_{s=0}^{T-1} \mathbb{U}_s\bigr)$ be the space of $\mathcal{A}$-adapted processes $(U_0, \ldots, U_{T-1})$ with values in $\prod_{s=0}^{T-1} \mathbb{U}_s$, that is, such that
$$\sigma(U_0) \subset \mathcal{A}_0 \,, \ \ldots \,, \ \sigma(U_{T-1}) \subset \mathcal{A}_{T-1}$$

Page 7

Multistage stochastic optimization problem

Let $\varphi : \mathbb{H}_T \to \mathbb{R}$ be a bounded function, measurable with respect to the field $\mathcal{H}_T$

We consider the multistage stochastic optimization problem
$$\min_{(U_0,\ldots,U_{T-1}) \in L^0_{\mathcal{A}}(\Omega, \prod_{s=0}^{T-1} \mathbb{U}_s)} \ \mathbb{E}\bigl[\varphi(W_0, U_0, W_1, \ldots, U_{T-1}, W_T)\bigr]$$

Page 8

Bellman operators

For $t = 0, \ldots, T$, we define $L^\infty(\mathbb{H}_t, \mathcal{H}_t)$, the space of bounded measurable real-valued functions over $\mathbb{H}_t$

We suppose that there exists a regular conditional distribution of the random variable $W_{t+1}$ knowing the random process $W_{[0:t]}$, denoted by
$$\mathbb{P}^{W_{[0:t]}}_{W_{t+1}}(w_{[0:t]}, \mathrm{d}w_{t+1}) = \mathbb{P}^{W_{[0:t]}}_{W_{t+1}}([h_t], \mathrm{d}w_{t+1})$$

We define the Bellman operators $\mathcal{B}_{t+1} : L^\infty(\mathbb{H}_{t+1}, \mathcal{H}_{t+1}) \to L^\infty(\mathbb{H}_t, \mathcal{H}_t)$ by
$$(\mathcal{B}_{t+1}\varphi)(h_t) = \inf_{u_t \in \mathbb{U}_t} \int_{\mathbb{W}_{t+1}} \varphi(h_t, u_t, w_{t+1}) \, \mathbb{P}^{W_{[0:t]}}_{W_{t+1}}([h_t], \mathrm{d}w_{t+1}) \,, \quad \forall \varphi \in L^\infty(\mathbb{H}_{t+1}, \mathcal{H}_{t+1}) \,, \ \forall h_t \in \mathbb{H}_t$$

Page 9

Value functions and Bellman equation

We define inductively value functions, or Bellman functions,

$$V_t : \mathbb{H}_t \to \mathbb{R} \,, \quad t = 0, \ldots, T$$
by
$$V_T = \varphi \,, \qquad V_t = \mathcal{B}_{t+1} V_{t+1} \,, \quad t = 0, \ldots, T-1 \,,$$
that is, solution of the Bellman equation
$$V_t(h_t) = \inf_{u_t \in \mathbb{U}_t} \int_{\mathbb{W}_{t+1}} V_{t+1}(h_t, u_t, w_{t+1}) \, \mathbb{P}^{W_{[0:t]}}_{W_{t+1}}([h_t], \mathrm{d}w_{t+1})$$
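For finite noise and control sets, this backward recursion can be carried out literally on histories. Below is a minimal Python sketch under illustrative assumptions (two-point noise and control sets, independent uniform noises, a toy cost `phi`; none of these data come from the slides): value functions are stored as dictionaries indexed by history tuples.

```python
import itertools

T = 2
W = [0, 1]                          # noise values (assumed the same at every stage)
U = [0, 1]                          # control values
p = {w: 1 / len(W) for w in W}      # marginal law of each W_t (uniform, assumed)

def phi(h):
    # toy terminal cost on the full history h = (w0, u0, w1, u1, w2)
    return abs(sum(h))

def histories(t):
    # enumerate all histories h_t = (w0, u0, w1, ..., u_{t-1}, w_t)
    blocks = [W] + [U, W] * t
    return itertools.product(*blocks)

# Backward induction over histories: V_T = phi, then V_t = B_{t+1} V_{t+1}
V = {T: {h: phi(h) for h in histories(T)}}
policy = {}
for t in reversed(range(T)):
    V[t], policy[t] = {}, {}
    for h in histories(t):
        best_u, best_val = None, float("inf")
        for u in U:
            val = sum(p[w] * V[t + 1][h + (u, w)] for w in W)
            if val < best_val:
                best_u, best_val = u, val
        V[t][h], policy[t][h] = best_val, best_u

# Optimal value of the multistage problem: E[V_0(W_0)]
print(sum(p[w0] * V[0][(w0,)] for w0 in W))
```

The number of histories grows exponentially with $t$, which is precisely what the state reduction of the next section is meant to avoid.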

Page 10

Measurable selection

We suppose that, for all $t = 0, \ldots, T-1$, there exists a measurable selection
$$\psi_t : (\mathbb{H}_t, \mathcal{H}_t) \to (\mathbb{U}_t, \mathcal{U}_t)$$
such that
$$\psi_t(h_t) \in \operatorname*{arg\,min}_{u_t \in \mathbb{U}_t} \int_{\mathbb{W}_{t+1}} V_{t+1}(h_t, u_t, w_{t+1}) \, \mathbb{P}^{W_{[0:t]}}_{W_{t+1}}([h_t], \mathrm{d}w_{t+1})$$

Page 11

Proposition. A solution to the multistage stochastic optimization problem
$$\min_{(U_0,\ldots,U_{T-1}) \in L^0_{\mathcal{A}}(\Omega, \prod_{s=0}^{T-1} \mathbb{U}_s)} \ \mathbb{E}\bigl[\varphi(W_0, U_0, W_1, \ldots, U_{T-1}, W_T)\bigr]$$
is the sequence $U^\star_0, \ldots, U^\star_{T-1}$ of random variables defined inductively by
$$U^\star_t = \psi_t(H^\star_t) \,, \quad t = 0, \ldots, T-1$$
where
$$H^\star_0 = W_0 \,, \qquad H^\star_{t+1} = (H^\star_t, U^\star_t, W_{t+1}) \,, \quad t = 0, \ldots, T-1$$

The minimum is
$$\mathbb{E}\bigl[V_0(W_0)\bigr] = \min_{(U_0,\ldots,U_{T-1}) \in L^0_{\mathcal{A}}(\Omega, \prod_{s=0}^{T-1} \mathbb{U}_s)} \ \mathbb{E}\bigl[\varphi(W_0, U_0, W_1, \ldots, U_{T-1}, W_T)\bigr]$$

Page 12

Extension

Constraints of the form
$$H_0 = W_0 \,, \qquad H_{t+1} = (H_t, U_t, W_{t+1})$$
and
$$(H_t, U_t) \in C_t \subset \mathbb{H}_t \times \mathbb{U}_t \,, \quad \mathbb{P}\text{-a.s.} \,, \quad t = 0, \ldots, T-1$$

Page 13

Outline of the presentation

Dynamic Programming Without State

Dynamic Programming With State

Dynamic Programming With State and White Noise

Dynamic Programming With State and White Noise (Complements)

Page 14

State reduction and dynamics

For $t = 0, \ldots, T$, suppose that there exists

a state space $\mathbb{X}_t$, a measurable set equipped with a $\sigma$-field $\mathcal{X}_t$

reduction mappings $\theta_t : \mathbb{H}_t \to \mathbb{X}_t$

dynamics $f_t : \mathbb{X}_t \times \mathbb{U}_t \times \mathbb{W}_{t+1} \to \mathbb{X}_{t+1}$

such that
$$\theta_{t+1}(h_t, u_t, w_{t+1}) = f_t\bigl(\theta_t(h_t), u_t, w_{t+1}\bigr) \,, \quad t = 0, \ldots, T-1$$
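As a toy illustration of this compatibility condition (an assumption of this note, not taken from the slides), take the running sum of the history as the state, with dynamics $f_t(x, u, w) = x + u + w$:

```python
# Toy state reduction: theta_t(h_t) is the running sum of the history,
# and the dynamics add the new control and noise to it.
def theta(h):
    return sum(h)                 # state space X_t = R for every t

def f(x, u, w):
    return x + u + w              # time-invariant dynamics, for simplicity

h_t = (1.0, 2.0, 3.0)             # a history (w0, u0, w1)
u_t, w_next = 4.0, 5.0
# compatibility: theta_{t+1}(h_t, u_t, w_{t+1}) = f_t(theta_t(h_t), u_t, w_{t+1})
assert theta(h_t + (u_t, w_next)) == f(theta(h_t), u_t, w_next)
```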

Page 15

Cost only depends on final state

Suppose that there exists $\tilde{\varphi} : \mathbb{X}_T \to \mathbb{R}$ such that the cost $\varphi$ can be factored as
$$\varphi = \tilde{\varphi} \circ \theta_T$$

Page 16

Markovian assumption

Let $\Delta(\mathbb{W}_{t+1})$ denote the set of probabilities on $(\mathbb{W}_{t+1}, \mathcal{W}_{t+1})$

Suppose that, for all $t = 0, \ldots, T-1$, there exists
$$\mu_t : \mathbb{X}_t \to \Delta(\mathbb{W}_{t+1})$$
such that
$$\mathbb{P}^{W_{[0:t]}}_{W_{t+1}}\bigl([h_t], \mathrm{d}w_{t+1}\bigr) = \mu_t\bigl(\theta_t(h_t), \mathrm{d}w_{t+1}\bigr)$$

Page 17

Bellman equation

We define inductively

$$V_T(x_T) = \tilde{\varphi}(x_T) \,, \quad \forall x_T \in \mathbb{X}_T$$
$$V_t(x_t) = \inf_{u_t \in \mathbb{U}_t} \int_{\mathbb{W}_{t+1}} V_{t+1}\bigl(f_t(x_t, u_t, w_{t+1})\bigr) \, \mu_t(x_t, \mathrm{d}w_{t+1}) \,, \quad \forall x_t \in \mathbb{X}_t \,, \ t = 0, \ldots, T-1$$

We suppose that there exists a measurable selection
$$\psi_t : (\mathbb{X}_t, \mathcal{X}_t) \to (\mathbb{U}_t, \mathcal{U}_t) \,, \quad t = 0, \ldots, T-1$$
such that
$$\psi_t(x_t) \in \operatorname*{arg\,min}_{u_t \in \mathbb{U}_t} \int_{\mathbb{W}_{t+1}} V_{t+1}\bigl(f_t(x_t, u_t, w_{t+1})\bigr) \, \mu_t(x_t, \mathrm{d}w_{t+1})$$

Page 18

Proposition. A solution to the multistage stochastic optimization problem
$$\min_{U_0,\ldots,U_{T-1}} \ \mathbb{E}\bigl[\varphi(W_0, U_0, W_1, \ldots, U_{T-1}, W_T)\bigr]$$
$$\sigma(U_0) \subset \mathcal{A}_0 \,, \ \ldots \,, \ \sigma(U_{T-1}) \subset \mathcal{A}_{T-1}$$
is the sequence $U^\star_0, \ldots, U^\star_{T-1}$ of random variables defined inductively by
$$U^\star_t = \psi_t(X^\star_t) \,, \quad t = 0, \ldots, T-1$$
where
$$X^\star_0 = W_0 \,, \qquad X^\star_{t+1} = f_t(X^\star_t, U^\star_t, W_{t+1}) \,, \quad t = 0, \ldots, T-1$$

The minimum is
$$\mathbb{E}\bigl[V_0(X^\star_0)\bigr] = \min_{(U_0,\ldots,U_{T-1}) \in L^0_{\mathcal{A}}(\Omega, \prod_{s=0}^{T-1} \mathbb{U}_s)} \ \mathbb{E}\bigl[\varphi(W_0, U_0, W_1, \ldots, U_{T-1}, W_T)\bigr]$$

Page 19

Extension

Constraints of the form

$$(X_t, U_t) \in C_t \subset \mathbb{X}_t \times \mathbb{U}_t \,, \quad \mathbb{P}\text{-a.s.} \,, \quad t = 0, \ldots, T-1$$

Page 20

Outline of the presentation

Dynamic Programming Without State

Dynamic Programming With State

Dynamic Programming With State and White Noise

Dynamic Programming With State and White Noise (Complements)

Page 21

Basic data

For $t = 0, \ldots, T$,

let $\mathbb{U}$, $\mathbb{W}$, $\mathbb{X}$ be measurable sets, equipped with $\sigma$-fields $\mathcal{U}$, $\mathcal{W}$, $\mathcal{X}$

dynamics $f_t : \mathbb{X} \times \mathbb{U} \times \mathbb{W} \to \mathbb{X}$

instantaneous costs $L_t : \mathbb{X} \times \mathbb{U} \times \mathbb{W} \to \mathbb{R}$

final cost $K : \mathbb{X} \to \mathbb{R}$

constraints $B_t : \mathbb{X} \rightrightarrows \mathbb{U}$

Page 22

Bellman equation

We consider a stochastic process $(W_1, \ldots, W_T)$, with values in $\mathbb{W}$, which is a white noise, that is, $W_1, \ldots, W_T$ are independent random variables

We consider a random variable $X_0$, with values in $\mathbb{X}$, independent of the stochastic process $(W_1, \ldots, W_T)$

We define inductively the Bellman functions
$$V_T(x) = K(x) \,, \quad \forall x \in \mathbb{X}$$
$$V_t(x) = \inf_{u \in B_t(x)} \mathbb{E}_{W_{t+1}}\Bigl[L_t(x, u, W_{t+1}) + V_{t+1}\bigl(f_t(x, u, W_{t+1})\bigr)\Bigr] \,, \quad \forall x \in \mathbb{X} \,, \ t = 0, \ldots, T-1$$

We suppose that there exists a measurable selection $\psi_t : (\mathbb{X}, \mathcal{X}) \to (\mathbb{U}, \mathcal{U})$, $t = 0, \ldots, T-1$, such that
$$\psi_t(x) \in \operatorname*{arg\,min}_{u \in B_t(x)} \mathbb{E}_{W_{t+1}}\Bigl[L_t(x, u, W_{t+1}) + V_{t+1}\bigl(f_t(x, u, W_{t+1})\bigr)\Bigr]$$

Page 23

Proposition. A solution to the multistage stochastic optimization problem
$$\min_{U_0,\ldots,U_{T-1}} \ \mathbb{E}\Bigl[\sum_{t=0}^{T-1} L_t(X_t, U_t, W_{t+1}) + K(X_T)\Bigr]$$
$$\sigma(U_0) \subset \mathcal{A}_0 \,, \ \ldots \,, \ \sigma(U_{T-1}) \subset \mathcal{A}_{T-1}$$
$$X_{t+1} = f_t(X_t, U_t, W_{t+1}) \,, \quad t = 0, \ldots, T-1$$
is the sequence $U^\star_0, \ldots, U^\star_{T-1}$ of random variables defined inductively by
$$U^\star_t = \psi_t(X^\star_t) \,, \quad t = 0, \ldots, T-1$$
where
$$X^\star_0 = X_0 \,, \qquad X^\star_{t+1} = f_t(X^\star_t, U^\star_t, W_{t+1}) \,, \quad t = 0, \ldots, T-1$$

The minimum is
$$\mathbb{E}\bigl[V_0(X_0)\bigr] = \min_{U_0,\ldots,U_{T-1}} \ \mathbb{E}\Bigl[\sum_{t=0}^{T-1} L_t(X_t, U_t, W_{t+1}) + K(X_T)\Bigr]$$

Page 24

Outline of the presentation

Dynamic Programming Without State

Dynamic Programming With State

Dynamic Programming With State and White Noise

Dynamic Programming With State and White Noise (Complements)

Page 25

Outline of the presentation

Dynamic Programming Without State

Dynamic Programming With State

Dynamic Programming With State and White Noise

Dynamic Programming With State and White Noise (Complements)
The payoff-to-go and Bellman's Principle of Optimality
Bellman equation and the curse of dimensionality
Complements: hazard-decision, linear-quadratic, linear-convex

Page 26

The shortest path on a graph illustrates

Bellman’s Principle of Optimality

[Figure: road map with nodes Los Angeles, Chicago, Boston]

For an auto travel analogy, suppose that the fastest route from Los Angeles to Boston passes through Chicago. The principle of optimality translates to the obvious fact that the Chicago to Boston portion of the route is also the fastest route for a trip that starts from Chicago and ends in Boston. (Dimitri P. Bertsekas)

Page 27

Bellman’s Principle of Optimality

Richard Ernest Bellman (August 26, 1920 – March 19, 1984)

An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision (Richard Bellman)

Page 28

We make the assumption that the primitive random variables are independent

The set $\mathbb{S} = \mathbb{W}^{T+1-t_0} = \mathbb{R}^q \times \cdots \times \mathbb{R}^q$ of scenarios is equipped with a $\sigma$-field $\mathcal{F} = \bigotimes_{t=t_0}^{T} \mathcal{B}(\mathbb{R}^q)$ and a probability $\mathbb{P}$, supposed to be of product form
$$\mathbb{P} = \mu_{t_0} \otimes \cdots \otimes \mu_T$$

Therefore, the primitive random variables
$$W_{t_0}, W_{t_0+1}, \ldots, W_{T-1}, W_T$$
are independent under $\mathbb{P}$, with marginal distributions $\mu_{t_0}, \ldots, \mu_T$

The notation $\mathbb{E}$ refers to the mathematical expectation over $\mathbb{S}$, either under the probability $\mathbb{P}$, as in $\mathbb{E}$, $\mathbb{E}_W$, $\mathbb{E}_{\mathbb{P}}$, or under the marginal distributions $\mu_{t_0}, \ldots, \mu_T$, as in $\mathbb{E}$, $\mathbb{E}_{W_{t+1}}$, $\mathbb{E}_{\mu_{t+1}}$

Page 29

What is state and what is noise?

Page 30

Delineating what is state and what is noise

is a modelling issue

When the uncertainties are not independent, a solution is to enlarge the state

If the water inflows follow an auto-regressive model, we have
$$\underbrace{S_{t+1}}_{\text{future stock}} = \min\Bigl(s^\sharp, \ \underbrace{S_t}_{\text{stock}} - \underbrace{Q_t}_{\text{water release}} + \underbrace{A_t}_{\text{water inflows}}\Bigr)$$
$$\underbrace{A_{t+1}}_{\text{future water inflows}} = \alpha \underbrace{A_t}_{\text{water inflows}} + \underbrace{W_{t+1}}_{\text{noise}}$$
where we suppose that $W_{t_0}, W_{t_0+1}, \ldots, W_{T-1}, W_T$ form a sequence of independent random variables

The couple $x_t = (s_t, a_t)$ is a sufficient summary of past controls and uncertainties to do forecasting: knowing the state $x_t = (s_t, a_t)$ at time $t$ is sufficient to forecast $x_{t+1}$, given the control $q_t$ and the uncertainty $W_{t+1}$
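A minimal Python sketch of this augmented state (the bound `s_sharp` and the coefficient `alpha` are illustrative parameters, not values from the slides):

```python
# Augmented-state dynamics for the dam with AR(1) inflows:
# the state is the pair (stock, current inflow), the control is the release q,
# and w is the white noise driving the inflow model.
def f(x, q, w, alpha=0.9, s_sharp=100.0):
    s, a = x
    s_next = min(s_sharp, s - q + a)    # future stock
    a_next = alpha * a + w              # future inflow
    return (s_next, a_next)
```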

Page 31

What is a state?

Bellman autobiography, Eye of the Hurricane

Conversely, once it was realized that the concept of policy was fundamental in control theory, the mathematicization of the basic engineering concept of 'feedback control,' then the emphasis upon a state variable formulation became natural.

A state in optimal stochastic control problems is a sufficient statistic for the uncertainties and past controls (P. Whittle, Optimization over Time: Dynamic Programming and Stochastic Control)

Quoting Whittle, suppose there is a variable $x_t$ which summarizes past history in that, given $t$ and the value of $x_t$, one can calculate the optimal $u_t$ and also $x_{t+1}$ without knowledge of the history $(\omega, u_0, \ldots, u_{t-1})$, for all $t$, where $\omega$ represents all uncertainties. Such a variable is termed sufficient

While history takes values in an increasing space as $t$ increases, a sufficient variable taking values in a space independent of $t$ is called a state variable

Page 32

Outline of the presentation

Dynamic Programming Without State

Dynamic Programming With State

Dynamic Programming With State and White Noise

Dynamic Programming With State and White Noise (Complements)
The payoff-to-go and Bellman's Principle of Optimality
Bellman equation and the curse of dimensionality
Complements: hazard-decision, linear-quadratic, linear-convex

Page 33

The payoff-to-go / value function / Bellman function

Payoff-to-go / value function / Bellman function
Assume that the primitive random variables $W_{t_0}, W_{t_0+1}, \ldots, W_{T-1}, W_T$ are independent under the probability $\mathbb{P}$

The payoff-to-go from state $x$ at time $t$ is
$$V_t(x) = \min_{\mathrm{Pol} \in \mathcal{U}^{\mathrm{ad}}} \ \mathbb{E}\Bigl[\sum_{s=t}^{T-1} L_s(X_s, U_s, W_{s+1}) + K(X_T)\Bigr]$$
where $X_t = x$ and, for $s = t, \ldots, T-1$, $X_{s+1} = f_s(X_s, U_s, W_{s+1})$ and $U_s = \mathrm{Pol}_s(X_s)$

The function $V_t$ is called the value function, or the Bellman function

The original problem is $V_{t_0}(x_0)$

Page 34

The stochastic dynamic programming equation, or Bellman equation, is a backward equation satisfied by the value function

Stochastic dynamic programming equation
If the primitive random variables $W_{t_0}, W_{t_0+1}, \ldots, W_{T-1}, W_T$ are independent under the probability $\mathbb{P}$, the value function $V_t(x)$ satisfies the following backward induction, where $t$ runs from $T-1$ down to $t_0$:
$$V_T(x) = K(x)$$
$$V_t(x) = \min_{u \in B_t(x)} \mathbb{E}_{W_{t+1}}\Bigl[L_t(x, u, W_{t+1}) + V_{t+1}\bigl(f_t(x, u, W_{t+1})\bigr)\Bigr]$$

Page 35

Algorithm for the Bellman functions

initialization $V_T(x) = \sum_{w \in \mathbb{W}_T} P_w\, K(x)$;
for $t = T-1, T-2, \ldots, t_0$ do
    forall $x \in \mathbb{X}$ do
        forall $u \in B_t(x)$ do
            forall $w \in \mathbb{W}_{t+1}$ do
                $l_t(x, u, w) = L_t(x, u, w) + V_{t+1}\bigl(f_t(x, u, w)\bigr)$
            compute $\sum_{w \in \mathbb{W}_{t+1}} P_w\, l_t(x, u, w)$
        $V_t(x) = \min_{u \in B_t(x)} \sum_{w \in \mathbb{W}_{t+1}} P_w\, l_t(x, u, w)$;
        $B^\star_t(x) = \operatorname*{arg\,min}_{u \in B_t(x)} \sum_{w \in \mathbb{W}_{t+1}} P_w\, l_t(x, u, w)$
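A runnable Python transcription of this backward pass, under illustrative assumptions (a scalar stock in {0, ..., 5}, three controls, a two-point noise, and cost and dynamics functions chosen only for the example):

```python
T = 3
X = range(6)                                   # discretized state grid
U = range(3)                                   # control grid
noise = {0: 0.5, 1: 0.5}                       # P(W_{t+1} = w), assumed stationary

def B(t, x):                                   # admissible controls B_t(x)
    return [u for u in U if u <= x]

def f(t, x, u, w):                             # dynamics, kept inside the grid
    return min(max(x - u + w, 0), max(X))

def L(t, x, u, w):                             # instantaneous cost
    return (x - u) ** 2

def K(x):                                      # final cost
    return x

V = [dict() for _ in range(T + 1)]
policy = [dict() for _ in range(T)]
V[T] = {x: K(x) for x in X}                    # initialization V_T = K
for t in reversed(range(T)):                   # t = T-1, ..., 0
    for x in X:
        costs = {u: sum(p * (L(t, x, u, w) + V[t + 1][f(t, x, u, w)])
                        for w, p in noise.items())
                 for u in B(t, x)}
        policy[t][x] = min(costs, key=costs.get)
        V[t][x] = costs[policy[t][x]]

print(V[0])        # Bellman function at initial time
print(policy[0])   # optimal decision for each initial state
```

The tables `V[t]` and `policy[t]` play the roles of the Bellman functions $V_t$ and of a selection from $B^\star_t$, computed backward and offline.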

Page 36

Sketch of the proof in the deterministic case

$$V_t(x) = \min_{u \in B_t(x)} \Bigl(\underbrace{L_t(x, u)}_{\text{instantaneous gain}} + \overbrace{V_{t+1}\bigl(\underbrace{f_t(x, u)}_{\text{future state}}\bigr)}^{\text{optimal payoff}}\Bigr)$$

[Figure: road map with nodes Los Angeles, Chicago, Boston]

A decision $u$ at time $t$ in state $x$ provides

an instantaneous gain $L_t(x, u)$

and a future payoff for attaining the new state $f_t(x, u)$

Page 37

The Bellman equation provides an optimal policy

Proposition. For any time $t$ and state $x$, assume the existence of the policy
$$\mathrm{Pol}^\star_t(x) \in \operatorname*{arg\,min}_{u \in B_t(x)} \mathbb{E}_{W_{t+1}}\Bigl[L_t(x, u, W_{t+1}) + V_{t+1}\bigl(f_t(x, u, W_{t+1})\bigr)\Bigr]$$
If $\mathrm{Pol}^\star_t : x \mapsto \mathrm{Pol}^\star_t(x)$ is measurable, then

$\mathrm{Pol}^\star$ is an optimal policy

for any initial state $x_0$, the optimal expected payoff is given by
$$V_{t_0}(x_0) = \min_{\mathrm{Pol} \in \mathcal{U}^{\mathrm{ad}}} V^{\mathrm{Pol}}_{\mathrm{expect}}(t_0, x_0) = V^{\mathrm{Pol}^\star}_{\mathrm{expect}}(t_0, x_0)$$

Page 38

A bit of history (and fun)

Page 39

“Where did the name, dynamic programming, come from?”

The 1950s were not good years for mathematical research. We had a very interesting gentleman in Washington named Wilson. He was Secretary of Defense, and he actually had a pathological fear and hatred of the word, research. I'm not using the term lightly; I'm using it precisely. His face would suffuse, he would turn red, and he would get violent if people used the term, research, in his presence. You can imagine how he felt, then, about the term, mathematical.

Page 40

“Where did the name, dynamic programming, come from?”

What title, what name, could I choose? In the first place I was interested in planning, in decision making, in thinking. But planning, is not a good word for various reasons. I decided therefore to use the word, programming.

Page 41

“Where did the name, dynamic programming, come from?”

I wanted to get across the idea that this was dynamic, this was multistage, this was time-varying. I thought, let's kill two birds with one stone. Let's take a word that has an absolutely precise meaning, namely dynamic, in the classical physical sense. It also has a very interesting property as an adjective, and that is it's impossible to use the word, dynamic, in a pejorative sense. Try thinking of some combination that will possibly give it a pejorative meaning. It's impossible. Thus, I thought dynamic programming was a good name.

Page 42

Navigating between “backward off-line” and “forward on-line”

Page 43

Optimal trajectories are calculated forward on-line

1. Initial state $x^\star_{t_0} = x_0$

2. Plug the state $x^\star_{t_0}$ into the "machine" $\mathrm{Pol}^\star_{t_0}$ → initial decision $u^\star_{t_0} = \mathrm{Pol}^\star_{t_0}(x^\star_{t_0})$

3. Run the dynamics → second state $x^\star_{t_0+1} = f_{t_0}(x^\star_{t_0}, u^\star_{t_0}, w_{t_0+1})$

4. Second decision $u^\star_{t_0+1} = \mathrm{Pol}^\star_{t_0+1}(x^\star_{t_0+1})$

5. And so on: $x^\star_{t_0+2} = f_{t_0+1}(x^\star_{t_0+1}, u^\star_{t_0+1}, w_{t_0+2})$

6. ...
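A small Python sketch of this forward pass; the tables `policy`, the dynamics `f` and the noise law are assumed to come from a backward pass such as the one sketched after the algorithm slide above.

```python
import random

# Forward, on-line computation of a state trajectory from Bellman policies.
def simulate(x0, T, policy, f, noise):
    x, trajectory = x0, [x0]
    for t in range(T):
        u = policy[t][x]                          # decision given by the policy
        w = random.choices(list(noise), weights=list(noise.values()))[0]
        x = f(t, x, u, w)                         # run the dynamics forward
        trajectory.append(x)
    return trajectory

# e.g. simulate(x0=4, T=3, policy=policy, f=f, noise=noise) with the tables above
```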

Page 44

“Life is lived forward but understood backward”

(Søren Kierkegaard)

D. P. Bertsekas introduces his book Dynamic Programming and Optimal Control with a citation by Søren Kierkegaard

"Livet skal forstås baglæns, men leves forlæns"

Life is to be understood backwards, but it is lived forwards

The value function and the optimal policies are computed backward and offline by means of the Bellman equation,

whereas the optimal trajectories are computed forward and online

Page 45

The curse of dimensionality :-(

Page 46

The curse of dimensionality is illustrated by the random access memory capacity on a computer: one, two, three, infinity (Gamow)

On a computer:
RAM: 8 GBytes $= 8 \times (1\,024)^3 = 2^{33}$ bytes
a double-precision real: 8 bytes $= 2^3$ bytes
$\Longrightarrow$ $2^{30} \approx 10^9$ double-precision reals can be handled in RAM

If a state of dimension 4 is approximated by a grid with 100 levels per component, we need to manipulate $100^4 = 10^8$ reals and
do a time loop
do a control loop (after discretization)
compute an expectation

The wall of dimension can be pushed beyond 3 if additional properties are exploited (linearity, convexity)
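As a rough memory count (simple arithmetic with the 8-byte reals above):
$$100^4 \times 8 \text{ bytes} = 8 \times 10^8 \text{ bytes} \approx 0.8 \text{ GB per value function} \,, \qquad 100^5 \times 8 \text{ bytes} = 8 \times 10^{10} \text{ bytes} \approx 80 \text{ GB} \,,$$
so a state of dimension 4 is already at the edge of what an 8 GB machine can hold, and dimension 5 does not fit.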

Page 47

Outline of the presentation

Dynamic Programming Without State

Dynamic Programming With State

Dynamic Programming With State and White Noise

Dynamic Programming With State and White Noise (Complements)
The payoff-to-go and Bellman's Principle of Optimality
Bellman equation and the curse of dimensionality
Complements: hazard-decision, linear-quadratic, linear-convex

Page 48

Bellman equation and optimal policies

in the hazard-decision information pattern

The uncertainty is observed before making the decision

initialization $V_T(x) = K(x)$;
for $t = T-1, T-2, \ldots, t_0$ do
    forall $x \in \mathbb{X}$ do
        forall $w \in \mathbb{W}_{t+1}$ do
            forall $u \in B_t(x)$ do
                $l_t(x, u, w) = L_t(x, u, w) + V_{t+1}\bigl(f_t(x, u, w)\bigr)$
            compute $\min_{u \in B_t(x)} l_t(x, u, w)$;
            $B^\star_t(x, w) = \operatorname*{arg\,min}_{u \in B_t(x)} l_t(x, u, w)$
        $V_t(x) = \sum_{w \in \mathbb{W}_{t+1}} P_w \min_{u \in B_t(x)} l_t(x, u, w)$
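For comparison with the decision-hazard sketch above, here is one hazard-decision Bellman update in Python (same illustrative finite-set setting, all ingredients passed in as arguments): the decision is indexed by the pair (state, observed noise), and the minimum sits inside the expectation.

```python
# One hazard-decision Bellman update: V_next maps states to V_{t+1}(x),
# noise maps noise values to probabilities, B, f, L are as in the sketch above.
def hazard_decision_update(t, V_next, X, noise, B, f, L):
    V_t, policy_t = {}, {}
    for x in X:
        total = 0.0
        for w, p in noise.items():
            costs = {u: L(t, x, u, w) + V_next[f(t, x, u, w)] for u in B(t, x)}
            u_star = min(costs, key=costs.get)
            policy_t[(x, w)] = u_star          # decision depends on (state, noise)
            total += p * costs[u_star]
        V_t[x] = total
    return V_t, policy_t
```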

Page 49

In the linear-quadratic-Gaussian case,

optimal policies are linear

When utilities are quadratic
$$K(x) = x' S_T x \,, \qquad L_t(x, u, w) = x' S_t x + w' R_t w + u' Q_t u$$
and the dynamics are linear
$$f_t(x, u, w) = F_t x + G_t u + H_t w$$
and the primitive random variables $W_{t_0}, W_{t_0+1}, \ldots, W_{T-1}, W_T$ are Gaussian and independent under the probability $\mathbb{P}$,

then the value functions $x \mapsto V_t(x)$ are quadratic, and optimal policies are linear
$$u_t = K_t x_t$$
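The gains $K_t$ come from a backward Riccati recursion. Below is a Python/NumPy sketch under illustrative assumptions (scalar state and control, zero-mean noise, no cross terms); the noise costs $w' R_t w$ and the term $H_t w$ only shift the constant part of the quadratic value function, so they do not appear in the recursion for the quadratic part $P_t$ or the gain $K_t$.

```python
import numpy as np

# V_t(x) = x' P_t x + constant, optimal policy u_t = K[t] @ x_t (zero-mean noise).
T = 3
F = [np.array([[1.0]])] * T          # dynamics  x_{t+1} = F x + G u + H w
G = [np.array([[1.0]])] * T
S = [np.array([[1.0]])] * (T + 1)    # state costs x' S_t x (S[T] is the final cost matrix)
Q = [np.array([[0.5]])] * T          # control costs u' Q_t u

P = [None] * (T + 1)
K = [None] * T
P[T] = S[T]
for t in reversed(range(T)):
    GtP = G[t].T @ P[t + 1]
    K[t] = -np.linalg.solve(Q[t] + GtP @ G[t], GtP @ F[t])   # linear feedback gain
    P[t] = S[t] + F[t].T @ P[t + 1] @ (F[t] + G[t] @ K[t])   # quadratic part of V_t

print(K[0], P[0])
```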

Page 50

How optimal decisions can be computed on-line

If we are able to store the value functions $x \mapsto V_t(x)$, we do not need to compute the optimal policy $\mathrm{Pol}^\star$ in advance and store it

Indeed, when we are at state $x$ at time $t$ in real time, we can just compute the optimal decision $u^\star_t$ "on the fly" by
$$u^\star_t \in \operatorname*{arg\,min}_{u \in B_t(x)} \mathbb{E}_{W_{t+1}}\Bigl[L_t(x, u, W_{t+1}) + V_{t+1}\bigl(f_t(x, u, W_{t+1})\bigr)\Bigr]$$

In addition to sparing storage, this method makes it possible to incorporate in the above program any new information available at time $t$ (on the distribution of the noise $W_{t+1}$, for instance)

Page 51

So, the question is:

how can we store the value functions?

The effort can be concentrated on computing the value functions

on a grid, by discretizing the Bellman equation

by estimating basis coefficients, when it is known that the value function is quadratic

by estimating upper affine approximations of the value function, when it is known that the value function is concave

by estimating lower approximations of the value function, when restricting the search to a subclass of policies (open-loop in OLFC)

Page 52

In the linear-convex case, value functions are convex

Here, we aim at minimizing expected cumulated costs
$$\mathbb{E}\Bigl[\sum_{t=t_0}^{T-1} \underbrace{L_t(X_t, U_t, W_{t+1})}_{\text{instantaneous cost}} + \underbrace{K(X_T)}_{\text{final cost}}\Bigr]$$

The value functions $x \mapsto V_t(x)$ are convex whenever

$(x, u) \mapsto L_t(x, u, w)$ is jointly convex in state and control

$x \mapsto K(x)$ is convex

the primitive random variables $W_{t_0}, W_{t_0+1}, \ldots, W_{T-1}, W_T$ are independent under the probability $\mathbb{P}$

the dynamics are affine
$$f_t(x, u, w) = F_t x + G_t u + H_t w$$

Page 53

The minimum over one variable of a jointly convex function

is convex in the other variable

A lemma in convex analysis
Let $f : \mathbb{Y} \times \mathbb{Z} \to \mathbb{R}$ be convex, and let $C \subset \mathbb{Y} \times \mathbb{Z}$ be a convex set. Then
$$g(y) = \min_{z \in \mathbb{Z},\ (y, z) \in C} f(y, z)$$
is a convex function
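A short proof sketch (the standard argument, added here for completeness): for $y_1, y_2 \in \mathbb{Y}$, $\lambda \in [0, 1]$ and $\varepsilon > 0$, pick $z_1, z_2$ with $(y_i, z_i) \in C$ and $f(y_i, z_i) \le g(y_i) + \varepsilon$. By convexity of $C$, the pair $\bigl(\lambda y_1 + (1-\lambda) y_2,\ \lambda z_1 + (1-\lambda) z_2\bigr)$ belongs to $C$, so
$$g\bigl(\lambda y_1 + (1-\lambda) y_2\bigr) \le f\bigl(\lambda y_1 + (1-\lambda) y_2,\ \lambda z_1 + (1-\lambda) z_2\bigr) \le \lambda f(y_1, z_1) + (1-\lambda) f(y_2, z_2) \le \lambda g(y_1) + (1-\lambda) g(y_2) + \varepsilon \,,$$
and letting $\varepsilon \to 0$ gives the convexity of $g$.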

Page 54

The Bellman equation produces convex value functions

The dynamic programming equation associated with the problem of minimizing the expected costs is
$$V_T(x) = \underbrace{K(x)}_{\text{final cost}}$$
$$V_t(x) = \min_{u \in B_t(x)} \mathbb{E}_{W_{t+1}}\Bigl[\underbrace{L_t(x, u, W_{t+1})}_{\text{instantaneous cost}} + V_{t+1}\bigl(\underbrace{F_t x + G_t u + H_t W_{t+1}}_{\text{future state}}\bigr)\Bigr]$$

It can be shown by induction that $x \mapsto V_t(x)$ is convex

The gradient $\nabla V_{t+1}(x^\star_{t+1})$ at $x^\star_{t+1}$ defines a hyperplane and a lower affine approximation of the value function, calculated by duality

Page 55

When spilling decisions are made after knowing the water

inflows, we obtain a linear dynamical model

$$\underbrace{s_{t+1}}_{\text{future volume}} = \underbrace{s_t}_{\text{volume}} - \underbrace{q_t}_{\text{turbined}} - \underbrace{r_t}_{\text{spilled}} + \underbrace{a_t}_{\text{inflow volume}}$$

$s_t$: volume (stock) of water at the beginning of period $[t, t+1[$

$a_t$: inflow water volume (rain, etc.) during $[t, t+1[$

$q_t$: turbined outflow volume, decided at the beginning of period $[t, t+1[$ (hazard follows decision), supposed to depend on the stock $s_t$

$r_t$: spilled volume, decided at the end of period $[t, t+1[$ (hazard precedes decision), supposed to depend on the stock $s_t$ and on the inflow water $a_t$

$$0 \le q_t \le \min(s_t, q^\sharp) \quad \text{and} \quad 0 \le s_t - q_t + a_t - r_t \le s^\sharp$$
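A small Python sketch of one step of this dam model, with one simple admissible spilling rule (spill just enough excess water); the bounds `q_sharp` and `s_sharp` are illustrative parameters, not values from the slides.

```python
# One period of the dam: q is decided before the inflow a is known,
# the spill r is decided after, here chosen as the smallest admissible spill.
def dam_step(s, q, a, q_sharp=50.0, s_sharp=100.0):
    q = min(max(q, 0.0), min(s, q_sharp))    # enforce 0 <= q_t <= min(s_t, q_sharp)
    r = max(s - q + a - s_sharp, 0.0)        # spill only the volume above s_sharp
    return s - q + a - r                     # future volume s_{t+1} in [0, s_sharp]
```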

Page 56

Stochastic Dual Dynamic Programming (SDDP)

The property that value functions are convex extends to the following cases

Multiple stocks interconnected by linear dynamics
$$S^i_{t+1} = S^i_t + A^i_t + Q^{i-1}_t - Q^i_t - R^i_t$$

Water inflows following an auto-regressive model
$$A^i_t = \sum_{k=1,\ldots,K^i} \alpha_k A^i_{t-k} + W_t$$
where the random variables $W_{t_0}, W_{t_0+1}, \ldots, W_{T-1}, W_T$ are independent

Page 57

Summary

Bellman's Principle of Optimality breaks an intertemporal optimization problem into a sequence of interconnected static optimization problems

The payoff-to-go / value function / Bellman function is the solution of a backward dynamic programming equation, or Bellman equation

The Bellman equation provides an optimal policy, a concept of solution adapted to the uncertain case

In numerical practice, the curse of dimensionality forbids using dynamic programming for a state of dimension more than four or five

However, special cases like the linear-quadratic or the linear-convex ones do not (totally) suffer from the curse of dimensionality