a short presentation of dynamic programming - cermics – centre...
TRANSCRIPT
MODE
,
Deterministic dynamic programmingStochastic dynamic programming
A short presentation of
dynamic programming
Michel De Lara
cermics, Ecole nationale des ponts et chaussees, ParisTech
7 June 2006
EDF course, May-June 2006
Outline of the presentation
1 Deterministic dynamic programming
2 Stochastic dynamic programming
Problem statementDynamic programming equation
State equation
\[ x(t+1) = F(t, x(t), u(t)), \quad t \in \{0, \dots, T\}, \quad \text{with } x(0) = x_0 \]
where
$x(t) \in X = \mathbb{R}^n$ represents the system's state vector at time t;
$x_0 \in X$ is the initial condition;
$u(t) \in U = \mathbb{R}^p$ represents the decision or control vector;
$F : \mathbb{N} \times X \times U \to X$ is the so-called dynamics function, representing the system's evolution;
the horizon $T \in \mathbb{N}$ or $T = +\infty$ stands for the term.
Constraints
the state constraints are respected at any time
x(t) ∈ D(t) ⊂ X ;
the control constraints are respected at any time
u(t) ∈ B(t, x(t)) ⊂ U ;
the final state achieves a fixed target C ⊂ X
x(T ) ∈ C = D(T ) .
Criterion
The trajectory space is the product space $X^{T+1} \times U^{T}$ (to be understood as $X^{\mathbb{N}} \times U^{\mathbb{N}}$ in the infinite horizon case $T = +\infty$).
A generic element, a state and control trajectory, is denoted by
\[ (x(\cdot), u(\cdot)) = (x(0), \dots, x(T), u(0), \dots, u(T-1)) \]
(to be understood as $(x(\cdot), u(\cdot)) = ((x(t))_{t \in \mathbb{N}}, (u(t))_{t \in \mathbb{N}})$ in the infinite horizon case $T = +\infty$).
A criterion I is a function
\[ I : X^{T+1} \times U^{T} \to \mathbb{R} \]
which assigns a real number to a state and control trajectory.
Additive criterion (finite horizon)
It is the most usual criterion, defined in the finite horizon case by the sum
\[ I(x(\cdot), u(\cdot)) = \sum_{t=0}^{T-1} L(t, x(t), u(t)) + M(T, x(T)). \]
Function L is referred to as the system's instantaneous utility (or gain, profit, benefit, payoff, etc.) or instantaneous cost (or loss, disutility, etc.), according to the situation, while function M is known as the final utility or the final cost.
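As an illustration of the state equation and of the additive criterion, here is a minimal Python sketch. The stock model (`F`, `L`, `M`, the initial stock and the control sequence) is a hypothetical toy example chosen for this note, not something taken from the course.

```python
def simulate(F, x0, controls):
    """Return the states x(0), ..., x(T) generated by x(t+1) = F(t, x(t), u(t))."""
    states = [x0]
    for t, u in enumerate(controls):
        states.append(F(t, states[-1], u))
    return states


def additive_criterion(L, M, states, controls):
    """Sum of L(t, x(t), u(t)) for t = 0..T-1, plus the final term M(T, x(T))."""
    T = len(controls)
    running = sum(L(t, states[t], controls[t]) for t in range(T))
    return running + M(T, states[T])


# Hypothetical toy model (assumption): a stock x depleted by a harvest u.
F = lambda t, x, u: x - u      # dynamics
L = lambda t, x, u: u          # instantaneous utility: the harvest
M = lambda T, x: 0.5 * x       # final utility: half of the remaining stock

controls = [1.0, 2.0, 0.5]     # u(0), u(1), u(2), so T = 3
states = simulate(F, 10.0, controls)
value = additive_criterion(L, M, states, controls)
print(states, value)
```

Note that a trajectory with T controls carries T + 1 states, which matches the trajectory space $X^{T+1} \times U^{T}$.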
Additive criterion (infinite horizon)
In the infinite horizon case, we consider
\[ I(x(\cdot), u(\cdot)) = \sum_{t=0}^{+\infty} L(t, x(t), u(t)). \]
In economics, the usual present value (PV) approach corresponds to the time-separable case with a discounted criterion of the form
\[ I(x(\cdot), u(\cdot)) = \sum_{t=0}^{+\infty} \rho^{t} L(x(t), u(t)) \]
where $\rho$ stands for a discount factor ($0 \le \rho \le 1$).
Quadratic case
The quadratic case corresponds to the situation where L and M are quadratic, in the sense that $L(t, x, u) = x' R(t) x + u' Q(t) u$ and $M(T, x) = x' R(T) x$, where $R(t)$ and $Q(t)$ are positive matrices, giving
\[ I(x(\cdot), u(\cdot)) = \sum_{t=0}^{T-1} \left[ x(t)' R(t) x(t) + u(t)' Q(t) u(t) \right] + x(T)' R(T) x(T). \]
The Maximin
The Rawlsian or maximin form in the finite horizon is
\[ I(x(\cdot), u(\cdot)) = \min\left( \min_{t=0,\dots,T-1} L(t, x(t), u(t)),\; M(T, x(T)) \right). \]
In the infinite horizon, we obtain
\[ I(x(\cdot), u(\cdot)) = \min_{t=0,\dots,+\infty} L(t, x(t), u(t)). \]
Maximal intertemporal utility
We focus on the maximization problem in additive and separable form in finite horizon
\[ I^\star = \sup_{(x(\cdot), u(\cdot)) \in \mathcal{T}^{ad}(0, x_0)} \left( \sum_{t=0}^{T-1} L(t, x(t), u(t)) + M(T, x(T)) \right), \]
where the set of admissible trajectories $\mathcal{T}^{ad}(0, x_0)$ is defined as follows.
Admissible trajectories
Definition
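Let $\mathcal{T}^{ad}(t, x) \subset X^{T+1} \times U^{T}$ be defined by
\[ (x(\cdot), u(\cdot)) \in \mathcal{T}^{ad}(t, x) \iff
\begin{cases}
x(t) = x, \\
x(s+1) = F(s, x(s), u(s)), \\
u(s) \in B(s, x(s)), \\
x(s) \in D(s),
\end{cases}
\quad \forall s \ge t. \]
$\mathcal{T}^{ad}(t, x)$ is the set of trajectories which visit x at time t while respecting both the constraints and the dynamics after time t.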
Viability kernel
Definition
The viability kernel at time $s \in \{0, \dots, T\}$ is defined by:
\[ \mathrm{Viab}(s) := \left\{ x \in D(s) \;\middle|\;
\begin{array}{l}
\text{there exist decisions } u(\cdot) \text{ and states } x(\cdot) \text{ starting from } x \text{ at time } s, \\
\text{satisfying, for any time } t \in \{s, \dots, T\}, \\
\text{the dynamics } x(t+1) = F(t, x(t), u(t)) \\
\text{and the constraints } u(t) \in B(t, x(t)),\ x(t) \in D(t)
\end{array}
\right\}. \]
Notice that the viability kernel at horizon T is the target:
\[ \mathrm{Viab}(T) = D(T) = C. \]
Dynamic programming equation for viability kernels
Proposition
The viability kernel Viab(t) satisfies the backward induction
\[ \mathrm{Viab}(t) = \{ x \in D(t) \mid \exists u \in B(t, x),\ F(t, x, u) \in \mathrm{Viab}(t+1) \}, \]
\[ \mathrm{Viab}(T) = D(T). \]
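When the state space is finite, the backward induction for viability kernels can be carried out by direct enumeration. The following Python sketch assumes a finite set of states and a finite set of decisions; the toy model (an integer stock that must be drawn down by 1 or 2 units per period while staying at or above 1) is a hypothetical example, not part of the course.

```python
def viability_kernels(T, states, D, B, F):
    """Return {t: Viab(t)} computed by backward induction over a finite state set."""
    viab = {T: {x for x in states if D(T, x)}}      # Viab(T) = D(T)
    for t in range(T - 1, -1, -1):
        viab[t] = {
            x for x in states
            if D(t, x) and any(F(t, x, u) in viab[t + 1] for u in B(t, x))
        }
    return viab


# Hypothetical toy model (assumption).
T = 3
states = range(0, 6)
D = lambda t, x: x >= 1        # state constraint, also the target at time T
B = lambda t, x: (1, 2)        # admissible decisions: draw 1 or 2 units
F = lambda t, x, u: x - u      # dynamics

viab = viability_kernels(T, states, D, B, F)
for t in range(T + 1):
    print(t, sorted(viab[t]))
```

In this toy case the kernels shrink as one moves backward in time: the earlier the date, the larger the stock needed to survive all remaining periods.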
Viable controls
For every point x inside the corridor Viab(t), there exists a control which yields a solution x(t + 1) belonging to Viab(t + 1) and, consequently, to D(t + 1).
Definition
Viable controls are
\[ B^{viab}(t, x) := \{ u \in B(t, x) \mid F(t, x, u) \in \mathrm{Viab}(t+1) \}. \]
Value function
The value function V(t, x) at time t and for state x represents the optimal value of the criterion over T − t periods, given that the state of the system x(t) at time t is x. In particular, $V(0, x_0) = I^\star$.
Definition
\[ V(T, x) := \begin{cases} M(T, x) & \text{if } x \in D(T), \\ -\infty & \text{otherwise,} \end{cases} \]
and, for t = 0, ..., T − 1,
\[ V(t, x) := \sup_{(x(\cdot), u(\cdot)) \in \mathcal{T}^{ad}(t, x)} \left( \sum_{s=t}^{T-1} L(s, x(s), u(s)) + M(T, x(T)) \right). \]
We also set $V(t, x) = -\infty$ whenever no feasibility occurs, i.e. $\mathcal{T}^{ad}(t, x) = \emptyset$ or, equivalently, $x \notin \mathrm{Viab}(t)$.
Dynamic programming equation in finite horizon
Proposition
Assume no state constraint, namely D(t) = X. The value function is the solution of the following backward dynamic programming equation (or Bellman equation), for t = T − 1, ..., 0:
\[ V(T, x) = M(T, x), \]
\[ V(t, x) = \sup_{u \in B(t, x)} \Big( L(t, x, u) + V\big(t+1, F(t, x, u)\big) \Big). \]
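The Bellman equation translates almost line by line into code when states and controls are finite. Below is a minimal Python sketch under that assumption; the function name `backward_induction` and the toy problem (sharing an integer stock of 4 units over time, with square-root utility) are hypothetical choices made for illustration.

```python
import math


def backward_induction(T, states, B, F, L, M):
    """Return V as a list of dicts, with V[t][x] the value at time t and state x."""
    V = [dict() for _ in range(T + 1)]
    V[T] = {x: M(T, x) for x in states}             # V(T, x) = M(T, x)
    for t in range(T - 1, -1, -1):                  # t = T-1, ..., 0
        for x in states:
            V[t][x] = max(L(t, x, u) + V[t + 1][F(t, x, u)] for u in B(t, x))
    return V


# Hypothetical toy problem (assumption): consume u out of an integer stock x.
T = 3
states = range(0, 5)
B = lambda t, x: range(0, x + 1)     # cannot consume more than the stock
F = lambda t, x, u: x - u            # remaining stock
L = lambda t, x, u: math.sqrt(u)     # concave instantaneous utility
M = lambda T, x: math.sqrt(x)        # final utility of the remaining stock

V = backward_induction(T, states, B, F, L, M)
print(V[0][4])
```

With a concave utility, the optimum spreads the stock evenly: one unit at each of the three decision dates and one unit left at the end.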
Thus, to evaluate the value V(t, ·) at each time, we start from the final value V(T, ·) = M(T, ·) and then compute V(T − 1, ·), and so on by backward induction.
Notice that the essence of dynamic programming is to replace one optimization problem over the trajectory space $X^{T+1} \times U^{T}$ by a sequence of T optimization problems over the primitive space U.
Dynamic programming equation in finite horizon with viability constraints
Proposition
\[ V(T, x) = M(T, x), \quad \forall x \in \mathrm{Viab}(T), \]
\[ V(t, x) = \sup_{u \in B^{viab}(t, x)} \Big( L(t, x, u) + V\big(t+1, F(t, x, u)\big) \Big), \quad \forall x \in \mathrm{Viab}(t). \]
Optimal feedback
Definition
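An optimal feedback is any mapping $\upsilon^\star : \{0, \dots, T-1\} \times X \to U$ such that any trajectory $(x^\star(\cdot), u^\star(\cdot))$ generated by
\[ x^\star(0) = x_0, \quad x^\star(t+1) = F(t, x^\star(t), u^\star(t)), \quad u^\star(t) = \upsilon^\star(t, x^\star(t)), \]
for t = 0, ..., T − 1, for any initial condition $x_0 \in D(0)$, belongs to $\mathcal{T}^{ad}(0, x_0)$ and is an optimal feasible trajectory, that is
\[ \max_{(x(\cdot), u(\cdot)) \in \mathcal{T}^{ad}(0, x_0)} I(x(\cdot), u(\cdot)) = I(x^\star(\cdot), u^\star(\cdot)). \]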
Note that $\upsilon$ (Greek letter upsilon) denotes a mapping from $\{0, \dots, T-1\} \times X$ to U, while u denotes a variable ($u \in U$).
Proposition
For any time t and state $x \in \mathrm{Viab}(t)$, assume the existence of the following feedback decision
\[ \upsilon^\star(t, x) \in \arg\max_{u \in B^{viab}(t, x)} \Big( L(t, x, u) + V\big(t+1, F(t, x, u)\big) \Big). \]
Then $\upsilon^\star$ is an optimal feasible feedback.
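The proposition can be checked numerically on a small example: compute the arg max feedback by backward induction, run it in closed loop, and compare with a brute-force search over all control sequences. The Python sketch below assumes finite states and controls and no state constraint (so that viable controls coincide with B(t, x)); the payoff functions are hypothetical.

```python
import itertools

# Hypothetical toy problem (assumption): harvest u from an integer stock x.
T = 3
STATES = range(0, 5)
CONTROLS = (0, 1, 2)

def B(t, x): return [u for u in CONTROLS if u <= x]
def F(t, x, u): return x - u
def L(t, x, u): return u * (3 - u)    # payoff 0, 2, 2 for u = 0, 1, 2
def M(T, x): return x                 # final utility: remaining stock

# Backward induction, storing the arg max as a feedback (t, x) -> u.
V = {(T, x): M(T, x) for x in STATES}
policy = {}
for t in range(T - 1, -1, -1):
    for x in STATES:
        best_u = max(B(t, x), key=lambda u: L(t, x, u) + V[(t + 1, F(t, x, u))])
        policy[(t, x)] = best_u
        V[(t, x)] = L(t, x, best_u) + V[(t + 1, F(t, x, best_u))]

def closed_loop_value(x0):
    """Criterion along the trajectory generated by the feedback from x0."""
    x, total = x0, 0
    for t in range(T):
        u = policy[(t, x)]
        total += L(t, x, u)
        x = F(t, x, u)
    return total + M(T, x)

def brute_force_value(x0):
    """Best criterion over all admissible control sequences, by enumeration."""
    best = float("-inf")
    for us in itertools.product(CONTROLS, repeat=T):
        x, total, admissible = x0, 0, True
        for t, u in enumerate(us):
            if u not in B(t, x):
                admissible = False
                break
            total += L(t, x, u)
            x = F(t, x, u)
        if admissible:
            best = max(best, total + M(T, x))
    return best

print(V[(0, 4)], closed_loop_value(4), brute_force_value(4))
```

The brute-force search visits every control sequence, whereas the backward induction only solves one small maximization per date and state; both give the same optimal value here.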
Proof
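Recall that
\[ (x(\cdot), u(\cdot)) \in \mathcal{T}^{ad}(t, x) \iff
\begin{cases}
x(t) = x, \\
x(s+1) = F(s, x(s), u(s)), \\
u(s) \in B(s, x(s)), \\
x(s) \in D(s),
\end{cases}
\quad \forall s \ge t. \]
For any $x \in \mathrm{Viab}(t)$, the admissible set $\mathcal{T}^{ad}(t, x)$ is not empty and we have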
\[
\begin{aligned}
V(t, x) &= \sup_{(x(\cdot), u(\cdot)) \in \mathcal{T}^{ad}(t, x)} \left( \sum_{s=t}^{T-1} L(s, x(s), u(s)) + M(T, x(T)) \right) \\
&= \sup_{u(t) \in B(t, x)} \left( \sup_{\substack{u(t+1), \dots, u(T-1) \\ x(t+1) = F(t, x, u(t)) \\ x(s+1) = F(s, x(s), u(s)) \\ x(s) \in D(s),\ u(s) \in B(s, x(s)) \\ s \ge t+1}} \left[ L(t, x(t), u(t)) + \sum_{s=t+1}^{T-1} L(s, x(s), u(s)) + M(T, x(T)) \right] \right) \\
&= \sup_{u \in B^{viab}(t, x)} \left( L(t, x, u) + \sup_{(x(\cdot), u(\cdot)) \in \mathcal{T}^{ad}(t+1, F(t, x, u))} \left[ \sum_{s=t+1}^{T-1} L(s, x(s), u(s)) + M(T, x(T)) \right] \right) \\
&= \sup_{u \in B^{viab}(t, x)} \Big( L(t, x, u) + V\big(t+1, F(t, x, u)\big) \Big).
\end{aligned}
\]
Extension to Whittle criterion
Let us say that a criterion I is in Whittle form whenever it is given by the following backward induction
\[ I(x(\cdot), u(\cdot)) = C(0), \]
\[ C(t) = \psi\big(t, x(t), u(t), C(t+1)\big), \quad t = 0, \dots, T-1, \]
\[ C(T) = M(T, x(T)), \]
where $\psi$ is either strictly increasing or continuously increasing in its last argument. This form is adapted to maximin dynamic programming, and includes the additive case, for which $\psi(t, x, u, C) = L(t, x, u) + C$.
The value function is then defined by
\[ V(T, x) := M(T, x), \]
\[ V(t, x) := \sup_{u \in B(t, x)} \psi\big(t, x, u, V(t+1, F(t, x, u))\big). \]
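To see how a single recursion covers both the additive and the maximin criteria, here is a small Python sketch in which the aggregator ψ is passed as a parameter. It assumes finite states and controls; the function name `whittle_value` and the toy stock model are hypothetical. The maximin case is obtained with ψ(t, x, u, C) = min(L(t, x, u), C), which is continuous and nondecreasing in C.

```python
def whittle_value(T, states, B, F, M, psi):
    """Backward induction V(t, x) = max over u of psi(t, x, u, V(t+1, F(t, x, u)))."""
    V = {(T, x): M(T, x) for x in states}
    for t in range(T - 1, -1, -1):
        for x in states:
            V[(t, x)] = max(psi(t, x, u, V[(t + 1, F(t, x, u))]) for u in B(t, x))
    return V


# Hypothetical toy model (assumption): share an integer stock over time.
T = 2
states = range(0, 7)
B = lambda t, x: range(0, x + 1)
F = lambda t, x, u: x - u
L = lambda t, x, u: u          # instantaneous utility: consumption
M = lambda T, x: x             # final utility: remaining stock

psi_additive = lambda t, x, u, C: L(t, x, u) + C
psi_maximin = lambda t, x, u, C: min(L(t, x, u), C)

V_add = whittle_value(T, states, B, F, M, psi_additive)
V_min = whittle_value(T, states, B, F, M, psi_maximin)
print(V_add[(0, 6)], V_min[(0, 6)])
```

In this toy case the additive value equals the initial stock whatever the sharing rule, while the maximin value rewards an even split between the two decision dates and the final date.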
Dynamic programming equation in infinite horizon
\[ x(t+1) = F(x(t), u(t)), \quad t \in \mathbb{N} \]
\[ I(x(\cdot), u(\cdot)) = \sum_{t=0}^{+\infty} \rho^{t} L(x(t), u(t)) \]
Proposition
\[ V(x) = \sup_{u \in B(x)} \Big( L(x, u) + \rho V\big(F(x, u)\big) \Big). \]
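The stationary equation above characterizes V as a fixed point. One common way to approximate it numerically, which the slides do not discuss and which is named here as an added technique, is fixed-point (value) iteration. The Python sketch below assumes a finite state space, a bounded L and a discount factor ρ < 1; the toy model is hypothetical.

```python
def value_iteration(states, B, F, L, rho, tol=1e-10, max_iter=10_000):
    """Iterate V <- max over u of (L(x, u) + rho * V(F(x, u))) until changes fall below tol."""
    V = {x: 0.0 for x in states}
    for _ in range(max_iter):
        V_new = {x: max(L(x, u) + rho * V[F(x, u)] for u in B(x)) for x in states}
        if max(abs(V_new[x] - V[x]) for x in states) < tol:
            return V_new
        V = V_new
    raise RuntimeError("value iteration did not converge")


# Hypothetical toy model (assumption): move on {0, 1, 2, 3}, reward = current state.
states = range(0, 4)
B = lambda x: [u for u in (-1, 0, 1) if 0 <= x + u <= 3]
F = lambda x, u: x + u
L = lambda x, u: float(x)
rho = 0.5

V = value_iteration(states, B, F, L, rho)
print(V)
```

Here the best behaviour is to climb to state 3 and stay there, which gives V(3) = 3 / (1 − ρ) = 6 and, going down, V(2) = 5, V(1) = 3.5 and V(0) = 1.75.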
Problem statementStochastic dynamic programming equation
STOCHASTIC DYNAMIC PROGRAMMING
State equation with random inputs
The uncertain dynamic model in discrete time is described by a state equation,
\[ x(t+1) = F(t, x(t), u(t), w(t)), \quad t = 0, \dots, T, \quad \text{with } x(0) = x_0 \]
where
the horizon $T \in \mathbb{N}$ or $T = +\infty$ stands for the term;
$x(t) \in X = \mathbb{R}^n$ represents the system's state vector at time t;
$x_0 \in X$ is the initial condition;
$u(t) \in U = \mathbb{R}^p$ represents the decision or control vector;
$w(t)$ stands for the uncertain variable (or disturbance, or noise), taking its values in a set $W = \mathbb{R}^q$;
$F : \mathbb{N} \times X \times U \times W \to X$ is the so-called dynamics function, representing the system's evolution.
Constraints and viability
As in the deterministic case, we may require state and decision constraints to be satisfied. However, since state trajectories are no longer unique, the following requirements depend upon the scenarios $w(\cdot) = (w(0), w(1), \dots, w(T-1)) \in W^{T}$, in a way that we shall specify later. The assertions below are thus to be taken in a loose sense at this stage.
The state constraints are respected at any time t
x(t) ∈ D(t) ⊂ X .
The control constraints are respected at any time t
u(t) ∈ B(t, x(t)) ⊂ U .
The final state achieves a fixed target C ⊂ X
x(T ) ∈ C .
Admissible feedbacks
Solutions are no longer trajectories, as in the deterministic case, but feedbacks.
Definition
\[ \Gamma = \{ \gamma : \mathbb{N} \times X \to U \} \]
\[ \Gamma^{ad} = \{ \gamma \in \Gamma \mid \gamma(t, x) \in B(t, x),\ \forall (t, x) \in \{0, \dots, T-1\} \times X \}. \]
Solution maps
For $\gamma \in \Gamma$, let $F^{\gamma}$ denote the mapping $F^{\gamma} : \mathbb{N} \times X \times W \to X$ defined by
\[ F^{\gamma}(t, x, w) := F(t, x, \gamma(t, x), w). \]
Definition
The state map and control map are defined, for any time $t_0 \in \{0, \dots, T\}$, by $x_F[t_0, x_0, \gamma, w(\cdot)](t) = x(t)$ and $u_F[t_0, x_0, \gamma, w(\cdot)](t) = u(t) = \gamma(t, x(t))$ respectively, where $x(\cdot)$ satisfies the dynamics
\[ x(t+1) = F^{\gamma}(t, x(t), w(t)), \quad t = t_0, \dots, T \]
and the initial condition $x(t_0) = x_0$.
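The state and control maps amount to running the feedback in closed loop along one scenario. The following Python sketch illustrates this; the function name, the order-up-to inventory feedback and the demand scenarios are hypothetical choices, not taken from the course.

```python
def state_and_control_maps(F, gamma, t0, x0, scenario):
    """Closed-loop run from x(t0) = x0; scenario[k] is read as w(t0 + k).

    Returns the lists [x(t0), x(t0+1), ...] and [u(t0), u(t0+1), ...].
    """
    xs, us = [x0], []
    for k, w in enumerate(scenario):
        t = t0 + k
        u = gamma(t, xs[-1])              # u(t) = gamma(t, x(t))
        us.append(u)
        xs.append(F(t, xs[-1], u, w))     # x(t+1) = F(t, x(t), u(t), w(t))
    return xs, us


# Hypothetical toy model (assumption): inventory x, order u, demand w.
F = lambda t, x, u, w: x + u - w
gamma = lambda t, x: max(0, 5 - x)        # order up to level 5

xs, us = state_and_control_maps(F, gamma, 0, 3, [2, 4, 1])
xs_alt, _ = state_and_control_maps(F, gamma, 0, 3, [2, 4, 9])
print(xs, us)
```

Changing only the last disturbance, as in `xs_alt`, leaves every state before the final one unchanged: the state at a given date only depends on earlier disturbances.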
Causality
It should be noticed that, with straightforward notations,
\[ x_F[t_0, x_0, \gamma, w(\cdot)](t_0) = x_0, \]
\[ x_F[t_0, x_0, \gamma, w(\cdot)](t) = x_F[t_0, x_0, \gamma, (w(t_0), \dots, w(t-1))](t) \quad \text{for } t \ge t_0 + 1, \]
thus expressing a causality property: the state at time t depends only upon the disturbances $w(t_0), \dots, w(t-1)$ that occurred between $t_0$ and $t - 1$.
Criteria to optimize
The criterion I now depends upon the scenarios: this raises the question of how to turn this family of values (one per scenario) into a single one to be optimized.
Additive criterion (finite time)
The additive and separable form in finite horizon is
\[ I(x(\cdot), u(\cdot), w(\cdot)) = \sum_{t=0}^{T-1} L(t, x(t), u(t), w(t)) + M(T, x(T)) \]
in which
$L : \mathbb{N} \times X \times U \times W \to \mathbb{R}$ specifies the instantaneous cost (or loss, disutility, etc., according to the situation) when the criterion I is minimized, and the instantaneous utility (or gain, profit, benefit, payoff, etc.) when the criterion I is maximized;
$M : \mathbb{N} \times X \to \mathbb{R}$ represents the final cost when the criterion I is minimized, and the final utility otherwise.
Additive criterion (infinite time)
The additive and separable form in the infinite horizon is
\[ I(x(\cdot), u(\cdot), w(\cdot)) = \sum_{t=0}^{+\infty} L(t, x(t), u(t), w(t)). \]
Multiplicative form
The multiplicative form is
\[ I(x(\cdot), u(\cdot), w(\cdot)) = \prod_{t=0}^{T-1} L(t, x(t), u(t), w(t)) \times M(T, x(T)). \]
Probabilistic assumptions
Probabilistic assumptions on the uncertainty $w(\cdot)$ may be added, providing a stochastic nature to the problem.
Mathematically speaking, $w(\cdot) = (w(0), w(1), \dots, w(T-1))$ is a sequence of random variables defined over a measurable space $(\Omega, \mathcal{F})$ equipped with a probability $\mathbb{P}$. When $T = +\infty$, one rather speaks of a stochastic process.
The notation $\mathbb{E}$ refers to the mathematical expectation under probability $\mathbb{P}$. Recall that a random variable is a measurable function on $(\Omega, \mathcal{F})$.
Measurability assumptions
To be able to take mathematical expectations, we are led to make measurability assumptions. The sets X and U are now assumed to be equipped with σ-fields $\mathcal{X}$ and $\mathcal{U}$ respectively, the dynamics is supposed to be measurable and, by a feedback, we now implicitly mean a measurable feedback. From now on,
\[ \Gamma := \{ \gamma : \mathbb{N} \times X \to U \text{ measurable} \}. \]
Once a feedback $\gamma$ is picked in $\Gamma$, all the variables $x(\cdot)$, $u(\cdot)$ and $w(\cdot)$ become random variables defined over $(\Omega, \mathcal{F}, \mathbb{P})$, by means of the relations
\[ x(t) = x_F[0, x_0, \gamma, w(\cdot)](t) \quad \text{and} \quad u(t) = \gamma(t, x(t)). \]
Thus, any quantity depending upon states, controls and disturbances is now a random variable and hence, when bounded or nonnegative, admits an integral with respect to $\mathbb{P}$.
The i.i.d. case
Following a common hypothesis, we shall, for the sake of simplicity, assume that the random variables $w(\cdot)$ are independent and identically distributed (i.i.d.) under $\mathbb{P}$.
In such a probabilistic context, we use the notation
$\mathbb{E}[a(w)]$ for the expected value of any integrable random variable $a : W \to \mathbb{R}$
and $\mathbb{E}[A(w(\cdot))]$ for any random variable $A : W^{T} \to \mathbb{R}$.
The discrete i.i.d. case
For discrete probability laws, this means that
\[ \mathbb{E}[a(w)] = \sum_{w \in W} \mu(w)\, a(w) \]
with $\mu$ the common discrete law on W of the random variables $w(t)$ (thus, we can choose $\Omega = W^{T}$ and $\mathbb{P}$ the product of T copies of $\mu$), and
\[ \mathbb{E}[A(w(\cdot))] = \sum_{w_0 \in W} \cdots \sum_{w_{T-1} \in W} A(w_0, \dots, w_{T-1})\, \mu(w_0) \cdots \mu(w_{T-1}). \]
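For a small finite W and a short horizon, the nested sums above can be evaluated by enumerating all scenarios. The Python sketch below does so; the function name `expectation` and the two-point law `mu` are hypothetical. The enumeration visits |W|^T scenarios, so it is only meant for tiny examples.

```python
import itertools
import math


def expectation(A, mu, T):
    """E[A(w(0), ..., w(T-1))] for i.i.d. draws with common discrete law mu.

    mu is a dict mapping each value of W to its probability.
    """
    total = 0.0
    for scenario in itertools.product(mu, repeat=T):
        prob = math.prod(mu[w] for w in scenario)   # mu(w_0) * ... * mu(w_{T-1})
        total += A(scenario) * prob
    return total


# Hypothetical law (assumption): w(t) equals 0 or 1 with probability 1/2 each.
mu = {0: 0.5, 1: 0.5}
T = 3

mean_sum = expectation(lambda w: sum(w), mu, T)      # expected total disturbance
total_mass = expectation(lambda w: 1.0, mu, T)       # probabilities sum to one
print(mean_sum, total_mass)
```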
The continuous i.i.d. case
For continuous probability laws on $W = \mathbb{R}^q$, this gives
\[ \mathbb{E}[a(w)] = \int_{W} a(w) f(w)\, dw \]
with f the common density on W of the random variables $w(t)$ (thus, we can choose $\Omega = W^{T}$ and $\mathbb{P}$ the product of T copies of $f(w)\,dw$), and
\[ \mathbb{E}[A(w(\cdot))] = \int_{W} \cdots \int_{W} A(w_0, \dots, w_{T-1})\, f(w_0) \cdots f(w_{T-1})\, dw_0 \cdots dw_{T-1}. \]
Minimal mean cost
Definition
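For any admissible feedback strategy $\gamma \in \Gamma^{ad}$ and initial condition $x_0 \in X$, let us consider the expected criterion or mean cost
\[ I(x_0, \gamma) := \mathbb{E}\Big[ I\big( x_F[0, x_0, \gamma, w(\cdot)](\cdot),\ u_F[0, x_0, \gamma, w(\cdot)](\cdot),\ w(\cdot) \big) \Big]. \]
The stochastic optimization problem corresponds to
\[ I^\star(x_0) = \inf_{\gamma \in \Gamma^{ad}} I(x_0, \gamma) = \inf_{\gamma \in \Gamma^{ad}} \mathbb{E}\Big[ I\big( x_F[0, x_0, \gamma, w(\cdot)](\cdot),\ u_F[0, x_0, \gamma, w(\cdot)](\cdot),\ w(\cdot) \big) \Big]. \]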
Optimal feedback
Definition
Any $\gamma^\star \in \Gamma^{ad}$ such that
\[ I^\star(x_0) = \min_{\gamma \in \Gamma^{ad}} I(x_0, \gamma) = I(x_0, \gamma^\star) \]
is an optimal feedback.
Stochastic dynamic programming equation in finite horizon
Definition
In the absence of state constraints (D(t) = X for t = 0, ..., T), the value function V(t, x) is defined by the following backward induction:
\[ V(T, x) := M(T, x), \]
\[ V(t, x) := \inf_{u \in B(t, x)} \mathbb{E}\Big[ L(t, x, u, w(t)) + V\big(t+1, F(t, x, u, w(t))\big) \Big]. \]
Contrary to the deterministic case, the value function is here defined by a backward induction relation; one can then prove that it coincides with the optimal expected cost.
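With a discrete i.i.d. law, the expectation in the backward induction is a finite weighted sum, so the recursion can be computed exactly on a finite state space. The Python sketch below assumes finite states, controls and disturbances; the small inventory model (order cost 1 per unit, shortage penalty 4 per unit of unmet demand) is hypothetical.

```python
def stochastic_backward_induction(T, states, B, F, L, M, mu):
    """Return (V, policy) with V[(t, x)] the value and policy[(t, x)] an arg min."""
    V = {(T, x): M(T, x) for x in states}
    policy = {}
    for t in range(T - 1, -1, -1):
        for x in states:
            def expected_cost(u):
                # E[ L(t, x, u, w) + V(t+1, F(t, x, u, w)) ] under the law mu
                return sum(p * (L(t, x, u, w) + V[(t + 1, F(t, x, u, w))])
                           for w, p in mu.items())
            best_u = min(B(t, x), key=expected_cost)
            policy[(t, x)] = best_u
            V[(t, x)] = expected_cost(best_u)
    return V, policy


# Hypothetical inventory model (assumption): stock x in {0, 1, 2}, demand w in {0, 1}.
T = 2
states = range(0, 3)
mu = {0: 0.5, 1: 0.5}
B = lambda t, x: [u for u in (0, 1) if x + u <= 2]
F = lambda t, x, u, w: max(x + u - w, 0)
L = lambda t, x, u, w: 1.0 * u + 4.0 * max(w - (x + u), 0)
M = lambda T, x: 0.0

V, policy = stochastic_backward_induction(T, states, B, F, L, M, mu)
print(V[(0, 0)], policy[(0, 0)], policy[(0, 1)])
```

In this toy case it pays to order one unit when the stock is empty (cost 1 against an expected shortage penalty of 2) and to order nothing when one unit is already in stock.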
Optimal feedback
Assume no state constraint, namely D(t) = X. For any time t and state x, assume the existence of the following feedback decision
\[ \gamma^\star(t, x) \in \arg\min_{u \in B(t, x)} \mathbb{E}\Big[ L(t, x, u, w(t)) + V\big(t+1, F(t, x, u, w(t))\big) \Big]. \]
Then $\gamma^\star : (t, x) \mapsto \gamma^\star(t, x)$ is an optimal strategy and, for any $x_0 \in X$, the optimal expected cost is given by
\[ V(0, x_0) = I^\star(x_0) = I(x_0, \gamma^\star). \]
Extension to Whittle criterion
Let us say that a criterion I is in strong Whittle form whenever it is given by the following backward induction
\[ I(x(\cdot), u(\cdot), w(\cdot)) = C(0), \]
\[ C(t) = g(t, x(t), u(t), w(t)) + \beta(t, x(t), u(t), w(t))\, C(t+1), \quad t = 0, \dots, T-1, \]
\[ C(T) = M(T, x(T)), \]
where $\beta(t, x(t), u(t), w(t)) > 0$. Equivalently,
\[ I(x(\cdot), u(\cdot), w(\cdot)) = \sum_{t=0}^{T} \beta_0 \beta_1 \cdots \beta_{t-1}\, g_t, \]
\[ \beta_t = \beta(t, x(t), u(t), w(t)) > 0, \quad t = 0, \dots, T-1, \]
\[ g_t = g(t, x(t), u(t), w(t)), \quad t = 0, \dots, T-1, \]
\[ g_T = M(T, x(T)). \]
This form happens to be adapted to stochastic dynamic programming, and to include both the additive and multiplicative cases, with respectively $g(t, x, u, w) = L(t, x, u, w)$, $\beta(t, x, u, w) = 1$ and $g(t, x, u, w) = 0$, $\beta(t, x, u, w) = L(t, x, u, w)$.
Value function
In the absence of state constraints (D(t) = X for t = 0, ..., T), the value function V(t, x) is defined by the following backward induction:
\[ V(T, x) := M(T, x), \]
\[ V(t, x) := \inf_{u \in B(t, x)} \mathbb{E}\Big[ g(t, x, u, w(t)) + \beta(t, x, u, w(t))\, V\big(t+1, F(t, x, u, w(t))\big) \Big]. \]
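The same recursion handles the additive and multiplicative criteria once g and β are passed as parameters. The Python sketch below assumes finite states, controls and disturbances; the function name `strong_whittle_value` and the two-state toy model are hypothetical.

```python
def strong_whittle_value(T, states, B, F, g, beta, M, mu):
    """V(t, x) = min over u of E[ g + beta * V(t+1, F) ], with V(T, x) = M(T, x)."""
    V = {(T, x): M(T, x) for x in states}
    for t in range(T - 1, -1, -1):
        for x in states:
            V[(t, x)] = min(
                sum(p * (g(t, x, u, w) + beta(t, x, u, w) * V[(t + 1, F(t, x, u, w))])
                    for w, p in mu.items())
                for u in B(t, x)
            )
    return V


# Hypothetical toy model (assumption): the control directly selects the next state.
T = 2
states = (0, 1)
mu = {0: 0.5, 1: 0.5}
B = lambda t, x: (0, 1)
F = lambda t, x, u, w: u
L = lambda t, x, u, w: 1.0 + u + w       # positive instantaneous cost
M = lambda T, x: 1.0 + x                 # positive final cost

zero = lambda t, x, u, w: 0.0
one = lambda t, x, u, w: 1.0

V_add = strong_whittle_value(T, states, B, F, L, one, M, mu)     # g = L, beta = 1
V_mul = strong_whittle_value(T, states, B, F, zero, L, M, mu)    # g = 0, beta = L
print(V_add[(0, 0)], V_mul[(0, 0)])
```

In both cases the minimizing decision is u = 0 at every date, giving an expected additive cost of 1.5 + 1.5 + 1 = 4 and an expected multiplicative cost of 1.5 × 1.5 × 1 = 2.25.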