Source: cermics.enpc.fr/~delara/exposes/slides_dynamic_programming.pdf
Dynamic Programming
Michel De Lara
Cermics, École des Ponts ParisTech, France
January 2, 2020
Outline of the presentation
Dynamic Programming Without State
Dynamic Programming With State
Dynamic Programming With State and White Noise
Dynamic Programming With State and White Noise (Complements)
Basic data

Let $(\Omega, \mathcal{A}, \mathbb{P})$ be a probability space.
Let $T \in \mathbb{N}^*$ be the horizon.
For stages $t = 0, \ldots, T$, let $\mathbb{U}_t$, $\mathbb{W}_t$ be measurable sets, equipped with $\sigma$-fields $\mathcal{U}_t$, $\mathcal{W}_t$.
History space

For $t = 0, \ldots, T$, we define the history space $\mathbb{H}_t$
$$\mathbb{H}_t = \mathbb{W}_0 \times \prod_{s=0}^{t-1} (\mathbb{U}_s \times \mathbb{W}_{s+1})$$
equipped with the history field $\mathcal{H}_t$
$$\mathcal{H}_t = \mathcal{W}_0 \otimes \bigotimes_{s=0}^{t-1} (\mathcal{U}_s \otimes \mathcal{W}_{s+1})$$
A generic element $h_t \in \mathbb{H}_t$ is called a history
$$h_t = (w_0, u_0, w_1, u_1, w_2, \ldots, u_{t-2}, w_{t-1}, u_{t-1}, w_t)$$
$$[h_t] = \bigl[(w_0, (u_s, w_{s+1})_{s=0,\ldots,t-1})\bigr] = (w_0, \ldots, w_t) = w_{[0:t]}$$
$$h_{s:t} = (u_r, w_{r+1})_{r=s-1,\ldots,t-1} = (u_{s-1}, w_s, \ldots, u_{t-1}, w_t)$$
$$[h_{s:t}] = \bigl[(u_r, w_{r+1})_{r=s-1,\ldots,t-1}\bigr] = (w_s, \ldots, w_t) = w_{[s:t]}$$
Noise process and filtration

For $t = 0, \ldots, T$, let $W_t$ be a random variable taking values in $\mathbb{W}_t$
$$W_t : \Omega \to \mathbb{W}_t$$
We put $W_{[0:t]} = (W_0, \ldots, W_t)$.
We introduce the filtration $\mathbb{A} = (\mathcal{A}_t)_{t=0,\ldots,T}$ defined by
$$\mathcal{A}_t = \sigma(W_0, \ldots, W_t) \,, \quad t = 0, \ldots, T$$
Let $L^0_{\mathbb{A}}(\Omega, \prod_{s=0}^{T-1} \mathbb{U}_s)$ be the space of $\mathbb{A}$-adapted processes $(U_0, \ldots, U_{T-1})$ with values in $\prod_{s=0}^{T-1} \mathbb{U}_s$, that is, such that
$$\sigma(U_0) \subset \mathcal{A}_0, \ldots, \sigma(U_{T-1}) \subset \mathcal{A}_{T-1}$$
Multistage stochastic optimization problem

Let $\varphi : \mathbb{H}_T \to \mathbb{R}$ be a bounded function, measurable with respect to the field $\mathcal{H}_T$.
We consider the multistage stochastic optimization problem
$$\min_{(U_0,\ldots,U_{T-1}) \in L^0_{\mathbb{A}}(\Omega, \prod_{s=0}^{T-1} \mathbb{U}_s)} \mathbb{E}\bigl[\varphi(W_0, U_0, W_1, \ldots, U_{T-1}, W_T)\bigr]$$
Bellman operators

For $t = 0, \ldots, T$, we define $L^\infty(\mathbb{H}_t, \mathcal{H}_t)$, the space of bounded measurable real-valued functions over $\mathbb{H}_t$.
We suppose that there exists a regular conditional distribution of the random variable $W_{t+1}$ knowing the random process $W_{[0:t]}$, denoted by
$$\mathbb{P}^{W_{[0:t]}}_{W_{t+1}}(w_{[0:t]}, dw_{t+1}) = \mathbb{P}^{W_{[0:t]}}_{W_{t+1}}([h_t], dw_{t+1})$$
We define the Bellman operators $\mathcal{B}_{t+1} : L^\infty(\mathbb{H}_{t+1}, \mathcal{H}_{t+1}) \to L^\infty(\mathbb{H}_t, \mathcal{H}_t)$ by
$$(\mathcal{B}_{t+1}\varphi)(h_t) = \inf_{u_t \in \mathbb{U}_t} \int_{\mathbb{W}_{t+1}} \varphi\bigl((h_t, u_t, w_{t+1})\bigr)\, \mathbb{P}^{W_{[0:t]}}_{W_{t+1}}([h_t], dw_{t+1})$$
$$\forall \varphi \in L^\infty(\mathbb{H}_{t+1}, \mathcal{H}_{t+1}) \,, \quad \forall h_t \in \mathbb{H}_t$$
Value functions and Bellman equation

We define inductively the value functions, or Bellman functions,
$$V_t : \mathbb{H}_t \to \mathbb{R} \,, \quad t = 0, \ldots, T$$
by
$$V_T = \varphi \,, \quad V_t = \mathcal{B}_{t+1} V_{t+1} \,, \quad t = 0, \ldots, T-1 \,,$$
that is, solutions of the Bellman equation
$$V_t(h_t) = \inf_{u_t \in \mathbb{U}_t} \int_{\mathbb{W}_{t+1}} V_{t+1}(h_t, u_t, w_{t+1})\, \mathbb{P}^{W_{[0:t]}}_{W_{t+1}}([h_t], dw_{t+1})$$
Measurable selection

We suppose that, for all $t = 0, \ldots, T-1$, there exists a measurable selection
$$\psi_t : (\mathbb{H}_t, \mathcal{H}_t) \to (\mathbb{U}_t, \mathcal{U}_t)$$
such that
$$\psi_t(h_t) \in \arg\min_{u_t \in \mathbb{U}_t} \int_{\mathbb{W}_{t+1}} V_{t+1}(h_t, u_t, w_{t+1})\, \mathbb{P}^{W_{[0:t]}}_{W_{t+1}}([h_t], dw_{t+1})$$
Proposition. A solution to the multistage stochastic optimization problem
$$\min_{(U_0,\ldots,U_{T-1}) \in L^0_{\mathbb{A}}(\Omega, \prod_{s=0}^{T-1} \mathbb{U}_s)} \mathbb{E}\bigl[\varphi(W_0, U_0, W_1, \ldots, U_{T-1}, W_T)\bigr]$$
is the sequence $U^\star_0, \ldots, U^\star_{T-1}$ of random variables defined inductively by
$$U^\star_t = \psi_t(H^\star_t) \,, \quad t = 0, \ldots, T-1$$
where
$$H^\star_0 = W_0 \,, \quad H^\star_{t+1} = (H^\star_t, U^\star_t, W_{t+1}) \,, \quad t = 0, \ldots, T-1$$
The minimum is
$$\mathbb{E}\bigl[V_0(W_0)\bigr] = \min_{(U_0,\ldots,U_{T-1}) \in L^0_{\mathbb{A}}(\Omega, \prod_{s=0}^{T-1} \mathbb{U}_s)} \mathbb{E}\bigl[\varphi(W_0, U_0, W_1, \ldots, U_{T-1}, W_T)\bigr]$$
Extension

Constraints of the form
$$H_0 = W_0 \,, \quad H_{t+1} = (H_t, U_t, W_{t+1})$$
and
$$(H_t, U_t) \in C_t \subset \mathbb{H}_t \times \mathbb{U}_t \,, \quad \mathbb{P}\text{-a.s.} \,, \quad t = 0, \ldots, T-1$$
Outline of the presentation
Dynamic Programming Without State
Dynamic Programming With State
Dynamic Programming With State and White Noise
Dynamic Programming With State and White Noise (Complements)
State reduction and dynamics

For $t = 0, \ldots, T$, suppose that there exist
a state space $\mathbb{X}_t$, a measurable set equipped with $\sigma$-field $\mathcal{X}_t$,
reduction mappings $\theta_t : \mathbb{H}_t \to \mathbb{X}_t$,
dynamics $f_t : \mathbb{X}_t \times \mathbb{U}_t \times \mathbb{W}_{t+1} \to \mathbb{X}_{t+1}$,
such that
$$\theta_{t+1}(h_t, u_t, w_{t+1}) = f_t\bigl(\theta_t(h_t), u_t, w_{t+1}\bigr) \,, \quad t = 0, \ldots, T-1$$
Cost only depends on the final state

Suppose that there exists $\tilde{\varphi} : \mathbb{X}_T \to \mathbb{R}$
such that the cost $\varphi$ can be factored as
$$\varphi = \tilde{\varphi} \circ \theta_T$$
Markovian assumption

Let $\Delta(\mathbb{W}_{t+1})$ denote the set of probabilities on $(\mathbb{W}_{t+1}, \mathcal{W}_{t+1})$.
Suppose that, for all $t = 0, \ldots, T-1$, there exists
$$\mu_t : \mathbb{X}_t \to \Delta(\mathbb{W}_{t+1})$$
such that
$$\mathbb{P}^{W_{[0:t]}}_{W_{t+1}}\bigl([h_t], dw_{t+1}\bigr) = \mu_t\bigl(\theta_t(h_t), dw_{t+1}\bigr)$$
Bellman equation

We define inductively
$$V_T(x_T) = \tilde{\varphi}(x_T) \,, \quad \forall x_T \in \mathbb{X}_T$$
$$V_t(x_t) = \inf_{u_t \in \mathbb{U}_t} \int_{\mathbb{W}_{t+1}} V_{t+1}\bigl(f_t(x_t, u_t, w_{t+1})\bigr)\, \mu_t(x_t, dw_{t+1})$$
$$\forall x_t \in \mathbb{X}_t \,, \quad t = 0, \ldots, T-1$$
We suppose that there exists a measurable selection
$$\psi_t : (\mathbb{X}_t, \mathcal{X}_t) \to (\mathbb{U}_t, \mathcal{U}_t) \,, \quad t = 0, \ldots, T-1$$
such that
$$\psi_t(x_t) \in \arg\min_{u_t \in \mathbb{U}_t} \int_{\mathbb{W}_{t+1}} V_{t+1}\bigl(f_t(x_t, u_t, w_{t+1})\bigr)\, \mu_t(x_t, dw_{t+1})$$
Proposition. A solution to the multistage stochastic optimization problem
$$\min_{U_0,\ldots,U_{T-1}} \mathbb{E}\bigl[\varphi(W_0, U_0, W_1, \ldots, U_{T-1}, W_T)\bigr]$$
$$\sigma(U_0) \subset \mathcal{A}_0, \ldots, \sigma(U_{T-1}) \subset \mathcal{A}_{T-1}$$
is the sequence $U^\star_0, \ldots, U^\star_{T-1}$ of random variables defined inductively by
$$U^\star_t = \psi_t(X^\star_t) \,, \quad t = 0, \ldots, T-1$$
where
$$X^\star_0 = W_0 \,, \quad X^\star_{t+1} = f_t(X^\star_t, U^\star_t, W_{t+1}) \,, \quad t = 0, \ldots, T-1$$
The minimum is
$$\mathbb{E}\bigl[V_0(X^\star_0)\bigr] = \min_{(U_0,\ldots,U_{T-1}) \in L^0_{\mathbb{A}}(\Omega, \prod_{s=0}^{T-1} \mathbb{U}_s)} \mathbb{E}\bigl[\varphi(W_0, U_0, W_1, \ldots, U_{T-1}, W_T)\bigr]$$
Extension

Constraints of the form
$$(X_t, U_t) \in C_t \subset \mathbb{X}_t \times \mathbb{U}_t \,, \quad \mathbb{P}\text{-a.s.} \,, \quad t = 0, \ldots, T-1$$
Outline of the presentation
Dynamic Programming Without State
Dynamic Programming With State
Dynamic Programming With State and White Noise
Dynamic Programming With State and White Noise (Complements)
Basic data

For $t = 0, \ldots, T$, let $\mathbb{U}$, $\mathbb{W}$, $\mathbb{X}$ be measurable sets, equipped with $\sigma$-fields $\mathcal{U}$, $\mathcal{W}$, $\mathcal{X}$, together with
dynamics $f_t : \mathbb{X} \times \mathbb{U} \times \mathbb{W} \to \mathbb{X}$,
instantaneous costs $L_t : \mathbb{X} \times \mathbb{U} \times \mathbb{W} \to \mathbb{R}$,
a final cost $K : \mathbb{X} \to \mathbb{R}$,
constraints $B_t : \mathbb{X} \rightrightarrows \mathbb{U}$.
Bellman equation

We consider a stochastic process $(W_1, \ldots, W_T)$, with values in $\mathbb{W}$, which is a white noise, that is, $W_1, \ldots, W_T$ are independent random variables.
We consider a random variable $X_0$, with values in $\mathbb{X}$, independent of the stochastic process $(W_1, \ldots, W_T)$.
We define inductively the Bellman functions
$$V_T(x) = K(x) \,, \quad \forall x \in \mathbb{X}$$
$$V_t(x) = \inf_{u \in B_t(x)} \mathbb{E}_{W_{t+1}}\bigl[L_t(x, u, W_{t+1}) + V_{t+1}\bigl(f_t(x, u, W_{t+1})\bigr)\bigr]$$
$$\forall x \in \mathbb{X} \,, \quad t = 0, \ldots, T-1$$
We suppose that there exists a measurable selection
$$\psi_t : (\mathbb{X}, \mathcal{X}) \to (\mathbb{U}, \mathcal{U}) \,, \quad t = 0, \ldots, T-1$$
such that
$$\psi_t(x) \in \arg\min_{u \in B_t(x)} \mathbb{E}_{W_{t+1}}\bigl[L_t(x, u, W_{t+1}) + V_{t+1}\bigl(f_t(x, u, W_{t+1})\bigr)\bigr]$$
Proposition. A solution to the multistage stochastic optimization problem
$$\min_{U_0,\ldots,U_{T-1}} \mathbb{E}\Bigl[\sum_{t=0}^{T-1} L_t(X_t, U_t, W_{t+1}) + K(X_T)\Bigr]$$
$$\sigma(U_0) \subset \mathcal{A}_0, \ldots, \sigma(U_{T-1}) \subset \mathcal{A}_{T-1}$$
$$X_{t+1} = f_t(X_t, U_t, W_{t+1}) \,, \quad t = 0, \ldots, T-1$$
is the sequence $U^\star_0, \ldots, U^\star_{T-1}$ of random variables defined inductively by
$$U^\star_t = \psi_t(X^\star_t) \,, \quad t = 0, \ldots, T-1$$
where
$$X^\star_0 = X_0 \,, \quad X^\star_{t+1} = f_t(X^\star_t, U^\star_t, W_{t+1}) \,, \quad t = 0, \ldots, T-1$$
The minimum is
$$\mathbb{E}\bigl[V_0(X_0)\bigr] = \min_{U_0,\ldots,U_{T-1}} \mathbb{E}\Bigl[\sum_{t=0}^{T-1} L_t(X_t, U_t, W_{t+1}) + K(X_T)\Bigr]$$
Outline of the presentation
Dynamic Programming Without State
Dynamic Programming With State
Dynamic Programming With State and White Noise
Dynamic Programming With State and White Noise (Complements)
    The payoff-to-go and Bellman’s Principle of Optimality
    Bellman equation and the curse of dimensionality
    Complements: hazard-decision, linear-quadratic, linear-convex
The shortest path on a graph illustrates Bellman’s Principle of Optimality

[Figure: route map Los Angeles → Chicago → Boston]

For an auto travel analogy, suppose that the fastest route from Los Angeles to Boston passes through Chicago. The principle of optimality translates to the obvious fact that the Chicago to Boston portion of the route is also the fastest route for a trip that starts from Chicago and ends in Boston. (Dimitri P. Bertsekas)
Bellman’s Principle of Optimality

Richard Ernest Bellman (August 26, 1920 – March 19, 1984)

An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision. (Richard Bellman)
We make the assumption that the primitive random variables are independent.

The set $\mathbb{S} = \mathbb{W}^{T+1-t_0} = \mathbb{R}^q \times \cdots \times \mathbb{R}^q$ of scenarios is equipped with a $\sigma$-field $\mathcal{F} = \bigotimes_{t=t_0}^{T} \mathcal{B}(\mathbb{R}^q)$ and a probability $\mathbb{P}$, supposed to be of product form
$$\mathbb{P} = \mu_{t_0} \otimes \cdots \otimes \mu_T$$
Therefore, the primitive random variables
$$W_{t_0}, W_{t_0+1}, \ldots, W_{T-1}, W_T$$
are independent under $\mathbb{P}$, with marginal distributions $\mu_{t_0}, \ldots, \mu_T$.
The notation $\mathbb{E}$ refers to the mathematical expectation over $\mathbb{S}$, either under the probability $\mathbb{P}$, as in $\mathbb{E}$, $\mathbb{E}_W$, $\mathbb{E}_{\mathbb{P}}$, or under the marginal distributions $\mu_{t_0}, \ldots, \mu_T$, as in $\mathbb{E}_{W_{t+1}}$, $\mathbb{E}_{\mu_{t+1}}$.
What is state and what is noise?

Delineating what is state and what is noise is a modelling issue.
When the uncertainties are not independent, a solution is to enlarge the state.
If the water inflows follow an auto-regressive model, we have
$$\underbrace{S_{t+1}}_{\text{future stock}} = \min\Bigl\{s^\sharp,\; \underbrace{S_t}_{\text{stock}} - \underbrace{Q_t}_{\text{water release}} + \underbrace{A_t}_{\text{water inflows}}\Bigr\}$$
$$\underbrace{A_{t+1}}_{\text{future water inflows}} = \alpha \underbrace{A_t}_{\text{water inflows}} + \underbrace{W_{t+1}}_{\text{noise}}$$
where we suppose that $W_{t_0}, W_{t_0+1}, \ldots, W_{T-1}, W_T$ form a sequence of independent random variables.
The couple $x_t = (s_t, a_t)$ is a sufficient summary of past controls and uncertainties to do forecasting: knowing the state $x_t = (s_t, a_t)$ at time $t$ is sufficient to forecast $x_{t+1}$, given the control $q_t$ and the uncertainty $W_{t+1}$, as in the sketch below.
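As a minimal illustration, here is one step of the enlarged-state dynamics in code; the values of `alpha` and `s_cap` and all names are our assumptions, not data from the slides:

```python
# Sketch: enlarging the state to (stock, inflow) so that the noise
# driving the AR(1) inflow model is white with respect to the new state.

def augmented_dynamics(state, q, w, alpha=0.7, s_cap=100.0):
    """One step of x_{t+1} = f_t(x_t, q_t, W_{t+1}) with x_t = (s_t, a_t)."""
    s, a = state
    s_next = min(s_cap, s - q + a)   # S_{t+1} = min{s#, S_t - Q_t + A_t}
    a_next = alpha * a + w           # A_{t+1} = alpha * A_t + W_{t+1}
    return (s_next, a_next)
```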
What is a state?

From Bellman’s autobiography, Eye of the Hurricane:
"Conversely, once it was realized that the concept of policy was fundamental in control theory, the mathematicization of the basic engineering concept of 'feedback control', then the emphasis upon a state variable formulation became natural."

A state in optimal stochastic control problems is a sufficient statistic for the uncertainties and past controls (P. Whittle, Optimization over Time: Dynamic Programming and Stochastic Control).

Quoting Whittle, suppose there is a variable $x_t$ which summarizes past history in that, given $t$ and the value of $x_t$, one can calculate the optimal $u_t$ and also $x_{t+1}$ without knowledge of the history $(\omega, u_0, \ldots, u_{t-1})$, for all $t$, where $\omega$ represents all uncertainties. Such a variable is termed sufficient.

While a history takes values in an increasing space as $t$ increases, a sufficient variable taking values in a space independent of $t$ is called a state variable.
Outline of the presentation
Dynamic Programming Without State
Dynamic Programming With State
Dynamic Programming With State and White Noise
Dynamic Programming With State and White Noise (Complements)
    The payoff-to-go and Bellman’s Principle of Optimality
    Bellman equation and the curse of dimensionality
    Complements: hazard-decision, linear-quadratic, linear-convex
The payoff-to-go / value function / Bellman function

Assume that the primitive random variables $W_{t_0}, W_{t_0+1}, \ldots, W_{T-1}, W_T$ are independent under the probability $\mathbb{P}$.
The payoff-to-go from state $x$ at time $t$ is
$$V_t(x) = \min_{\mathrm{Pol} \in \mathcal{U}^{\mathrm{ad}}} \mathbb{E}\Bigl[\sum_{s=t}^{T-1} L_s(X_s, U_s, W_{s+1}) + K(X_T)\Bigr]$$
where $X_t = x$ and, for $s = t, \ldots, T-1$, $X_{s+1} = f_s(X_s, U_s, W_{s+1})$ and $U_s = \mathrm{Pol}_s(X_s)$.
The function $V_t$ is called the value function, or the Bellman function.
The original problem is $V_{t_0}(x_0)$.
The stochastic dynamic programming equation, or Bellman equation, is a backward equation satisfied by the value function.

Stochastic dynamic programming equation. If the primitive random variables $W_{t_0}, W_{t_0+1}, \ldots, W_{T-1}, W_T$ are independent under the probability $\mathbb{P}$, the value function $V_t(x)$ satisfies the following backward induction, where $t$ runs from $T-1$ down to $t_0$:
$$V_T(x) = K(x)$$
$$V_t(x) = \min_{u \in B_t(x)} \mathbb{E}_{W_{t+1}}\bigl[L_t(x, u, W_{t+1}) + V_{t+1}\bigl(f_t(x, u, W_{t+1})\bigr)\bigr]$$
Algorithm for the Bellman functions

initialization: $V_T(x) = K(x)$ ;
for $t = T-1, T-2, \ldots, t_0$ do
    forall $x \in \mathbb{X}$ do
        forall $u \in B_t(x)$ do
            forall $w \in \mathbb{W}_{t+1}$ do
                $l_t(x, u, w) = L_t(x, u, w) + V_{t+1}\bigl(f_t(x, u, w)\bigr)$
            compute $\sum_{w \in \mathbb{W}_{t+1}} P_w\, l_t(x, u, w)$
        $V_t(x) = \min_{u \in B_t(x)} \sum_{w \in \mathbb{W}_{t+1}} P_w\, l_t(x, u, w)$ ;
        $B^\star_t(x) = \arg\min_{u \in B_t(x)} \sum_{w \in \mathbb{W}_{t+1}} P_w\, l_t(x, u, w)$
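A minimal Python sketch of this backward loop on finite sets follows; the toy horizon, grids, costs and transition are illustrative assumptions, not data from the slides:

```python
# Backward Bellman recursion on finite sets (decision-hazard pattern).
T = 3
X = range(5)                              # state grid
U = range(3)                              # control set
W = [(-1, 0.3), (0, 0.4), (1, 0.3)]       # noise values with probabilities P_w

def dynamics(t, x, u, w):                 # f_t: clipped linear motion (toy)
    return max(0, min(4, x + u - 1 + w))

def cost(t, x, u, w):                     # instantaneous cost L_t (toy)
    return u + abs(x - 2)

def final_cost(x):                        # final cost K (toy)
    return (x - 2) ** 2

def admissible(t, x):                     # constraint set B_t(x) (toy)
    return U

V = [dict() for _ in range(T + 1)]        # Bellman functions V_t
policy = [dict() for _ in range(T)]       # measurable selections psi_t

V[T] = {x: final_cost(x) for x in X}      # V_T = K
for t in reversed(range(T)):              # t = T-1, ..., 0
    for x in X:
        best_u, best_val = None, float("inf")
        for u in admissible(t, x):
            # expectation over the white noise W_{t+1}
            val = sum(p * (cost(t, x, u, w) + V[t + 1][dynamics(t, x, u, w)])
                      for (w, p) in W)
            if val < best_val:
                best_u, best_val = u, val
        V[t][x] = best_val                # Bellman equation
        policy[t][x] = best_u             # psi_t(x)
```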
Sketch of the proof in the deterministic case

$$V_t(x) = \min_{u \in B_t(x)} \Bigl( \underbrace{L_t(x, u)}_{\text{instantaneous gain}} + \overbrace{V_{t+1}\bigl(\underbrace{f_t(x, u)}_{\text{future state}}\bigr)}^{\text{optimal payoff}} \Bigr)$$

[Figure: route map Los Angeles → Chicago → Boston]

A decision $u$ at time $t$ in state $x$ provides
an instantaneous gain $L_t(x, u)$
and a future payoff for attaining the new state $f_t(x, u)$.
The Bellman equation provides an optimal policy

Proposition. For any time $t$ and state $x$, assume the existence of the policy
$$\mathrm{Pol}^\star_t(x) \in \arg\min_{u \in B_t(x)} \mathbb{E}_{W_{t+1}}\bigl[L_t(x, u, W_{t+1}) + V_{t+1}\bigl(f_t(x, u, W_{t+1})\bigr)\bigr]$$
If $\mathrm{Pol}^\star_t : x \mapsto \mathrm{Pol}^\star_t(x)$ is measurable, then
$\mathrm{Pol}^\star$ is an optimal policy;
for any initial state $x_0$, the optimal expected payoff is given by
$$V_{t_0}(x_0) = \min_{\mathrm{Pol} \in \mathcal{U}^{\mathrm{ad}}} V^{\mathrm{Pol}}_{\mathrm{expect}}(t_0, x_0) = V^{\mathrm{Pol}^\star}_{\mathrm{expect}}(t_0, x_0)$$
A bit of history (and fun)
“Where did the name, dynamic programming, come from?”
The 1950s were not good years for mathematical research. We had a very interesting gentleman in Washington named Wilson. He was Secretary of Defense, and he actually had a pathological fear and hatred of the word, research. I’m not using the term lightly; I’m using it precisely. His face would suffuse, he would turn red, and he would get violent if people used the term, research, in his presence. You can imagine how he felt, then, about the term, mathematical.
“Where did the name, dynamic programming, come from?”
What title, what name, could I choose? In the first place I was interested in planning, in decision making, in thinking. But planning, is not a good word for various reasons. I decided therefore to use the word, programming.
“Where did the name, dynamic programming, come from?”
I wanted to get across the idea that this was dynamic, this was multistage, this was time-varying. I thought, let’s kill two birds with one stone. Let’s take a word that has an absolutely precise meaning, namely dynamic, in the classical physical sense. It also has a very interesting property as an adjective, and that is it’s impossible to use the word, dynamic, in a pejorative sense. Try thinking of some combination that will possibly give it a pejorative meaning. It’s impossible. Thus, I thought dynamic programming was a good name.
Navigating between “backward off-line” and “forward on-line”

Optimal trajectories are calculated forward on-line:
1. Initial state $x^\star_{t_0} = x_0$
2. Plug the state $x^\star_{t_0}$ into the “machine” $\mathrm{Pol}^\star_{t_0}$ → initial decision $u^\star_{t_0} = \mathrm{Pol}^\star_{t_0}(x^\star_{t_0})$
3. Run the dynamics → second state $x^\star_{t_0+1} = f_{t_0}(x^\star_{t_0}, u^\star_{t_0}, w_{t_0+1})$
4. Second decision $u^\star_{t_0+1} = \mathrm{Pol}^\star_{t_0+1}(x^\star_{t_0+1})$
5. And so on: $x^\star_{t_0+2} = f_{t_0+1}(x^\star_{t_0+1}, u^\star_{t_0+1}, w_{t_0+2})$
6. ...
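This forward pass can be sketched as below, reusing the `policy` and `dynamics` from the toy backward-recursion example earlier (names are ours):

```python
import random

def rollout(policy, x0, T, dynamics, sample_noise):
    """Forward on-line simulation of a policy computed backward off-line."""
    x, trajectory = x0, [x0]
    for t in range(T):
        u = policy[t][x]           # plug the state into Pol*_t
        w = sample_noise()         # the noise w_{t+1} is then revealed
        x = dynamics(t, x, u, w)   # run the dynamics to the next state
        trajectory.append(x)
    return trajectory

# e.g. with the toy data above:
# rollout(policy, 2, T, dynamics,
#         lambda: random.choices([-1, 0, 1], [0.3, 0.4, 0.3])[0])
```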
“Life is lived forward but understood backward” (Søren Kierkegaard)

D. P. Bertsekas introduces his book Dynamic Programming and Optimal Control with a citation by Søren Kierkegaard:
“Livet skal forstås baglæns, men leves forlæns”
(Life is to be understood backwards, but it is lived forwards)

The value function and the optimal policies are computed backward and off-line by means of the Bellman equation,
whereas the optimal trajectories are computed forward and on-line.
The curse of dimensionality :-(

The curse of dimensionality is illustrated by the random access memory capacity of a computer: one, two, three, infinity (Gamow).

On a computer, RAM: 8 GBytes $= 8 \times 1024^3 = 2^{33}$ bytes; a double-precision real: 8 bytes $= 2^3$ bytes $\Longrightarrow$ $2^{30} \approx 10^9$ double-precision reals can be handled in RAM.
If a state of dimension 4 is approximated by a grid with 100 levels per component, we need to manipulate $100^4 = 10^8$ reals and
do a time loop,
do a control loop (after discretization),
compute an expectation.
The wall of dimension can be pushed beyond 3 if additional properties are exploited (linearity, convexity).
Outline of the presentation
Dynamic Programming Without State
Dynamic Programming With State
Dynamic Programming With State and White Noise
Dynamic Programming With State and White Noise (Complements)
    The payoff-to-go and Bellman’s Principle of Optimality
    Bellman equation and the curse of dimensionality
    Complements: hazard-decision, linear-quadratic, linear-convex
Bellman equation and optimal policies in the hazard-decision information pattern

The uncertainty is observed before making the decision.

initialization: $V_T(x) = K(x)$ ;
for $t = T-1, T-2, \ldots, t_0$ do
    forall $x \in \mathbb{X}$ do
        forall $w \in \mathbb{W}_{t+1}$ do
            forall $u \in B_t(x)$ do
                $l_t(x, u, w) = L_t(x, u, w) + V_{t+1}\bigl(f_t(x, u, w)\bigr)$
            compute $\min_{u \in B_t(x)} l_t(x, u, w)$ ;
            $B^\star_t(x, w) = \arg\min_{u \in B_t(x)} l_t(x, u, w)$
        $V_t(x) = \sum_{w \in \mathbb{W}_{t+1}} P_w \min_{u \in B_t(x)} l_t(x, u, w)$
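In code, the only change from the decision-hazard sketch above is that the minimization over the control moves inside the expectation over the noise; again a sketch with the same illustrative toy data:

```python
# Hazard-decision: the decision may depend on the observed noise w,
# so the min over u sits inside the sum over w (same toy data as above).

V = [dict() for _ in range(T + 1)]
policy_hd = [dict() for _ in range(T)]    # B*_t(x, w): one decision per (x, w)

V[T] = {x: final_cost(x) for x in X}
for t in reversed(range(T)):
    for x in X:
        expectation = 0.0
        for (w, p) in W:
            # best response once w is known
            values = {u: cost(t, x, u, w) + V[t + 1][dynamics(t, x, u, w)]
                      for u in admissible(t, x)}
            u_best = min(values, key=values.get)
            policy_hd[t][(x, w)] = u_best
            expectation += p * values[u_best]
        V[t][x] = expectation
```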
In the linear-quadratic-Gaussian case, optimal policies are linear

When the costs are quadratic
$$K(x) = x' S_T x \,, \quad L_t(x, u, w) = x' S_t x + w' R_t w + u' Q_t u$$
and the dynamics are linear
$$f_t(x, u, w) = F_t x + G_t u + H_t w$$
and the primitive random variables $W_{t_0}, W_{t_0+1}, \ldots, W_{T-1}, W_T$ are Gaussian and independent under the probability $\mathbb{P}$,
then the value functions $x \mapsto V_t(x)$ are quadratic, and the optimal policies are linear:
$$u_t = K_t x_t$$
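The slides state the result; as a complement, here is a minimal numpy sketch of the standard backward Riccati recursion that produces the quadratic value functions $V_t(x) = x' P_t x + c_t$ and the linear gains, under the assumption of zero-mean noise (the function and its conventions are ours):

```python
import numpy as np

def lq_riccati(F, G, S, Q):
    """Backward Riccati recursion for the discrete-time LQ problem.

    F[t], G[t]: dynamics matrices; S[t], Q[t]: quadratic cost matrices,
    with S of length T+1 (S[T] is the final cost matrix S_T).
    Returns the value-function matrices P[t] and gains K[t], u_t = K[t] @ x_t.
    Assumes zero-mean noise, so H_t and R_t only add constants to V_t
    and do not affect the gains.
    """
    T = len(F)
    P = [None] * (T + 1)
    K = [None] * T
    P[T] = S[T]
    for t in reversed(range(T)):
        GPF = G[t].T @ P[t + 1] @ F[t]
        # K_t = -(Q_t + G' P_{t+1} G)^{-1} G' P_{t+1} F
        K[t] = -np.linalg.solve(Q[t] + G[t].T @ P[t + 1] @ G[t], GPF)
        # P_t = S_t + F' P_{t+1} F - F' P_{t+1} G (Q_t + G' P_{t+1} G)^{-1} G' P_{t+1} F
        P[t] = S[t] + F[t].T @ P[t + 1] @ F[t] + GPF.T @ K[t]
    return P, K
```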
How optimal decisions can be computed on-line

If we are able to store the value functions $x \mapsto V_t(x)$, we do not need to compute the optimal policy $\mathrm{Pol}^\star$ in advance and store it.
Indeed, when we are at state $x$ at time $t$ in real time, we can just compute the optimal decision $u^\star_t$ “on the fly” by
$$u^\star_t \in \arg\min_{u \in B_t(x)} \mathbb{E}_{W_{t+1}}\bigl[L_t(x, u, W_{t+1}) + V_{t+1}\bigl(f_t(x, u, W_{t+1})\bigr)\bigr]$$
In addition to sparing storage, this method makes it possible to incorporate in the above program any new information available at time $t$ (on the distribution of the noise $W_{t+1}$, for instance).
So, the question is: how can we store the value functions?
The effort can be concentrated on computing the value functions
on a grid, by discretizing the Bellman equation;
by estimating basis coefficients, when it is known that the value function is quadratic;
by estimating upper affine approximations of the value function, when it is known that the value function is concave;
by estimating lower approximations of the value function, when restricting the search to a subclass of policies (open-loop in OLFC).
In the linear-convex case, value functions are convex

Here, we aim at minimizing the expected cumulated costs
$$\mathbb{E}\Bigl[\sum_{t=t_0}^{T-1} \underbrace{L_t(X_t, U_t, W_{t+1})}_{\text{instantaneous cost}} + \underbrace{K(X_T)}_{\text{final cost}}\Bigr]$$
The value functions $x \mapsto V_t(x)$ are convex whenever
$(x, u) \mapsto L_t(x, u, w)$ is jointly convex in state and control,
$x \mapsto K(x)$ is convex,
the primitive random variables $W_{t_0}, W_{t_0+1}, \ldots, W_{T-1}, W_T$ are independent under the probability $\mathbb{P}$,
and the dynamics are affine:
$$f_t(x, u, w) = F_t x + G_t u + H_t w$$

The minimum over one variable of a jointly convex function is convex in the other variable

A lemma in convex analysis. Let $f : \mathbb{Y} \times \mathbb{Z} \to \mathbb{R}$ be convex, and let $C \subset \mathbb{Y} \times \mathbb{Z}$ be a convex set. Then
$$g(y) = \min_{z \in \mathbb{Z},\, (y, z) \in C} f(y, z)$$
is a convex function.
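A short proof sketch of the lemma, assuming the minima are attained: take $y_1, y_2 \in \mathbb{Y}$ with minimizers $z_1, z_2$ such that $(y_i, z_i) \in C$; since $C$ is convex, the convex combination is admissible, and convexity of $f$ gives, for all $\lambda \in [0, 1]$,
$$g\bigl(\lambda y_1 + (1-\lambda) y_2\bigr) \le f\bigl(\lambda (y_1, z_1) + (1-\lambda)(y_2, z_2)\bigr) \le \lambda f(y_1, z_1) + (1-\lambda) f(y_2, z_2) = \lambda g(y_1) + (1-\lambda) g(y_2)$$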
The Bellman equation produces convex value functions

The dynamic programming equation associated with the problem of minimizing the expected costs is
$$V_T(x) = \underbrace{K(x)}_{\text{final cost}}$$
$$V_t(x) = \min_{u \in B_t(x)} \mathbb{E}_{W_{t+1}}\Bigl[\underbrace{L_t(x, u, W_{t+1})}_{\text{instantaneous cost}} + V_{t+1}\bigl(\underbrace{F_t x + G_t u + H_t W_{t+1}}_{\text{future state}}\bigr)\Bigr]$$
It can be shown by induction that $x \mapsto V_t(x)$ is convex.
The gradient $\nabla V_{t+1}(x^\star_{t+1})$ at $x^\star_{t+1}$ defines a hyperplane, hence a lower affine approximation of the value function, calculated by duality.
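Such affine minorants (“cuts”) can be accumulated into a polyhedral lower approximation $\max_k \{v_k + \langle g_k, x - x_k \rangle\}$ of a convex value function; the sketch below is illustrative (the class and names are ours), and in an SDDP-type method the cuts would be produced by the backward passes:

```python
import numpy as np

class CutApproximation:
    """Polyhedral lower approximation of a convex value function,
    stored as a finite family of affine cuts V(x) >= v_k + <g_k, x - x_k>."""

    def __init__(self):
        self.cuts = []    # list of (value v_k, gradient g_k, point x_k)

    def add_cut(self, value, gradient, point):
        self.cuts.append((value, np.asarray(gradient), np.asarray(point)))

    def __call__(self, x):
        x = np.asarray(x)
        # the approximation is the max of the affine cuts at x
        return max(v + g @ (x - xk) for (v, g, xk) in self.cuts)
```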
When spilling decisions are made after knowing the water inflows, we obtain a linear dynamical model
$$\underbrace{s_{t+1}}_{\text{future volume}} = \underbrace{s_t}_{\text{volume}} - \underbrace{q_t}_{\text{turbined}} - \underbrace{r_t}_{\text{spilled}} + \underbrace{a_t}_{\text{inflow volume}}$$
$s_t$: volume (stock) of water at the beginning of period $[t, t+1[$;
$a_t$: inflow water volume (rain, etc.) during $[t, t+1[$;
$q_t$: turbined outflow volume, decided at the beginning of period $[t, t+1[$ (hazard follows decision), supposed to depend on the stock $s_t$;
$r_t$: spilled volume, decided at the end of period $[t, t+1[$ (hazard precedes decision), supposed to depend on the stock $s_t$ and on the inflow water $a_t$;
with the constraints $0 \le q_t \le \min\{s_t, q^\sharp\}$ and $0 \le s_t - q_t + a_t - r_t \le s^\sharp$.
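One stage of this model can be sketched as follows; the spilling rule (spill just enough to respect the capacity $s^\sharp$) is one admissible hazard-decision choice, and the bounds `q_cap`, `s_cap` are illustrative assumptions:

```python
# One stage of the dam: q_t is decided before the inflow a_t is known
# (hazard follows decision), r_t after (hazard precedes decision).

def dam_stage(s, q, a, q_cap=40.0, s_cap=100.0):
    q = min(max(q, 0.0), s, q_cap)     # enforce 0 <= q_t <= min{s_t, q#}
    r = max(0.0, s - q + a - s_cap)    # spill the excess above s#
    s_next = s - q - r + a             # s_{t+1} = s_t - q_t - r_t + a_t
    return s_next                      # satisfies 0 <= s_{t+1} <= s#
```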
Stochastic Dual Dynamic Programming (SDDP)

The property that value functions are convex extends to the following cases:
multiple stocks interconnected by linear dynamics
$$S^i_{t+1} = S^i_t + A^i_t + Q^{i-1}_t - Q^i_t - R^i_t$$
water inflows following an auto-regressive model
$$A^i_t = \sum_{k=1,\ldots,K^i} \alpha_k A^i_{t-k} + W_t$$
where the random variables $W_{t_0}, W_{t_0+1}, \ldots, W_{T-1}, W_T$ are independent.
Summary

Bellman’s Principle of Optimality breaks an intertemporal optimization problem into a sequence of interconnected static optimization problems.
The payoff-to-go / value function / Bellman function is the solution of a backward dynamic programming equation, or Bellman equation.
The Bellman equation provides an optimal policy, a concept of solution adapted to the uncertain case.
In numerical practice, the curse of dimensionality makes dynamic programming impractical for a state of dimension more than four or five.
However, special cases like the linear-quadratic or the linear-convex ones do not (totally) suffer from the curse of dimensionality.