
A short presentation of dynamic programming

Michel De Lara

CERMICS, École nationale des ponts et chaussées, ParisTech

June 7, 2006

EDF course, May-June 2006



Outline of the presentation

1 Deterministic dynamic programming

2 Stochastic dynamic programming



State equation

x(t + 1) = F(t, x(t), u(t)), t ∈ {0, ..., T − 1}, with x(0) = x0

where

x(t) ∈ X = R^n represents the system’s state vector at time t;

x0 ∈ X is the initial condition;

u(t) ∈ U = R^p represents the decision or control vector at time t;

F : N × X × U → X is the so-called dynamics function, representing the system’s evolution;

the horizon T ∈ N or T = +∞ is the term of the problem.





Constraints

the state constraints are respected at any time

x(t) ∈ D(t) ⊂ X ;

the control constraints are respected at any time

u(t) ∈ B(t, x(t)) ⊂ U ;

the final state achieves a fixed target C ⊂ X

x(T ) ∈ C = D(T ) .





Criterion

The trajectory space is the product space X^{T+1} × U^T [1].

A generic element, a state and control trajectory, is denoted by [2]

(x(·), u(·)) = (x(0), ..., x(T), u(0), ..., u(T − 1)).

A criterion I is a function

I : X^{T+1} × U^T → R

which assigns a real number to a state and control trajectory.

[1] To be understood as X^N × U^N in the infinite horizon case (T = +∞).

[2] To be understood as (x(·), u(·)) = ((x(t))_{t∈N}, (u(t))_{t∈N}) in the infinite horizon case (T = +∞).




Additive criterion (finite horizon)

It is the most usual criterion, defined in the finite horizon case by the sum

\[
I(x(\cdot), u(\cdot)) = \sum_{t=0}^{T-1} L(t, x(t), u(t)) + M(T, x(T)).
\]

Function L is referred to as the system’s instantaneous utility (or gain, profit, benefit, payoff, etc.) or instantaneous cost (or loss, disutility, etc.), according to the situation, while function M is known as the final utility or the final cost.
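As an illustration only, the additive criterion can be evaluated along a trajectory generated by the state equation. The sketch below assumes scalar states and controls; the stock-and-harvest example and the helper names `simulate` and `additive_criterion` are hypothetical and do not come from the slides.

```python
def simulate(F, x0, controls):
    """Roll x(t+1) = F(t, x(t), u(t)) forward from x(0) = x0."""
    states = [x0]
    for t, u in enumerate(controls):
        states.append(F(t, states[-1], u))
    return states


def additive_criterion(L, M, states, controls):
    """Sum of L(t, x(t), u(t)) for t = 0..T-1, plus the final term M(T, x(T))."""
    T = len(controls)
    running = sum(L(t, states[t], controls[t]) for t in range(T))
    return running + M(T, states[T])


# Hypothetical example: a stock x depleted by a harvest u.
F = lambda t, x, u: x - u          # dynamics
L = lambda t, x, u: u              # instantaneous utility = harvest
M = lambda T, x: 0.5 * x           # final utility of the remaining stock

controls = [1, 2, 0]               # u(0), u(1), u(2), so T = 3
states = simulate(F, 10, controls) # x(0), ..., x(3)
value = additive_criterion(L, M, states, controls)
```

Here the trajectory has T + 1 states and T controls, matching the trajectory space X^{T+1} × U^T.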





Additive criterion (infinite horizon)

In the infinite horizon case, we consider

\[
I(x(\cdot), u(\cdot)) = \sum_{t=0}^{+\infty} L(t, x(t), u(t)).
\]

In economics, the usual present value (PV) approach corresponds to the time separable case with a discounted criterion of the form

\[
I(x(\cdot), u(\cdot)) = \sum_{t=0}^{+\infty} \rho^{t} L(x(t), u(t))
\]

where ρ stands for a discount factor (0 ≤ ρ ≤ 1).





Quadratic case

The quadratic case corresponds to the situation where L and M are quadratic, in the sense that L(t, x, u) = x′R(t)x + u′Q(t)u and M(T, x) = x′R(T)x, where R(t) and Q(t) are positive matrices, giving

\[
I(x(\cdot), u(\cdot)) = \sum_{t=0}^{T-1} \big[ x(t)' R(t) x(t) + u(t)' Q(t) u(t) \big] + x(T)' R(T) x(T).
\]





The Maximin

The Rawlsian or maximin form in the finite horizon is

\[
I(x(\cdot), u(\cdot)) = \min\Big( \min_{t=0,\ldots,T-1} L(t, x(t), u(t)),\ M(T, x(T)) \Big).
\]

In the infinite horizon, we obtain

\[
I(x(\cdot), u(\cdot)) = \inf_{t=0,\ldots,+\infty} L(t, x(t), u(t)).
\]





Maximal intertemporal utility

We focus on the maximization problem in additive and separable form in finite horizon

\[
I^\star = \sup_{(x(\cdot),u(\cdot)) \in \mathcal{T}^{\mathrm{ad}}(0,x_0)} \Big( \sum_{t=0}^{T-1} L(t, x(t), u(t)) + M(T, x(T)) \Big),
\]

where the set of admissible trajectories T^ad(0, x0) is defined as follows.





Admissible trajectories

Definition

Let T^ad(t, x) ⊂ X^{T+1} × U^T be defined by

\[
(x(\cdot), u(\cdot)) \in \mathcal{T}^{\mathrm{ad}}(t,x) \iff
\begin{cases}
x(t) = x, \\
x(s+1) = F(s, x(s), u(s)), \quad u(s) \in B(s, x(s)), \quad x(s) \in D(s), \quad \forall s \ge t.
\end{cases}
\]

T^ad(t, x) is the set of trajectories which visit x at time t while respecting both the constraints and the dynamics after time t.





Viability kernel

Definition

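The viability kernel at time s ∈ {0, ..., T} is defined by:

\[
\mathrm{Viab}(s) := \left\{ x \in D(s) \;\middle|\;
\begin{array}{l}
\text{there exist decisions } u(\cdot) \text{ and states } x(\cdot) \text{ starting from } x \text{ at time } s, \\
\text{satisfying, for any time } t \in \{s, \ldots, T-1\}, \\
\text{the dynamics } x(t+1) = F(t, x(t), u(t)) \\
\text{and the control constraints } u(t) \in B(t, x(t)), \\
\text{and, for any time } t \in \{s, \ldots, T\}, \text{ the state constraints } x(t) \in D(t)
\end{array}
\right\}.
\]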

Notice that the viability kernel at horizon T is the target:

Viab(T ) = D(T ) = C .





Dynamic programming equation for viability kernels

Proposition

The viability kernel Viab(t) satisfies the backward induction

Viab(t) = {x ∈ D(t) | ∃u ∈ B(t, x), F(t, x, u) ∈ Viab(t + 1)}, t = T − 1, ..., 0,

Viab(T) = D(T).
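When the state and control sets are finite, the backward induction for viability kernels can be run directly with set operations. The following is a minimal sketch under that finiteness assumption; the reservoir example and the function name `viability_kernels` are hypothetical illustrations, not part of the original presentation.

```python
def viability_kernels(T, D, B, F):
    """Return [Viab(0), ..., Viab(T)] computed by backward induction.

    D(t) is the set of allowed states, B(t, x) an iterable of allowed
    controls, F(t, x, u) the dynamics.
    """
    viab = [None] * (T + 1)
    viab[T] = set(D(T))  # Viab(T) = D(T)
    for t in range(T - 1, -1, -1):
        viab[t] = {
            x for x in D(t)
            if any(F(t, x, u) in viab[t + 1] for u in B(t, x))
        }
    return viab


# Hypothetical example: a reservoir level x drops by 2 per period and is
# refilled by u in {0, 1}; the level must stay within 1..5 and end at 3 or more.
T = 3
D = lambda t: set(range(3, 6)) if t == T else set(range(1, 6))
B = lambda t, x: (0, 1)
F = lambda t, x, u: x - 2 + u

kernels = viability_kernels(T, D, B, F)
```

In this toy case the kernels shrink as one moves backward in time and Viab(0) turns out to be empty: no admissible initial level allows the target to be reached.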





Viable controls

For every point x inside the corridor Viab(t), there exists a control which yields a solution x(t + 1) belonging to Viab(t + 1) and, consequently, to D(t + 1).

Definition

Viable controls are

B^viab(t, x) := {u ∈ B(t, x) | F(t, x, u) ∈ Viab(t + 1)}.





Value function

The value function V(t, x) at time t and for state x represents the optimal value of the criterion over T − t periods, given that the state of the system x(t) at time t is x. In particular, V(0, x0) = I^⋆.

Definition

\[
V(T, x) :=
\begin{cases}
M(T, x) & \text{if } x \in D(T), \\
-\infty & \text{otherwise,}
\end{cases}
\]

and, for t = 0, ..., T − 1,

\[
V(t, x) := \sup_{(x(\cdot),u(\cdot)) \in \mathcal{T}^{\mathrm{ad}}(t,x)} \Big( \sum_{s=t}^{T-1} L(s, x(s), u(s)) + M(T, x(T)) \Big).
\]

We also set V(t, x) = −∞ whenever no feasibility occurs, i.e. T^ad(t, x) = ∅ or, equivalently, x ∉ Viab(t).




Dynamic programming equation in finite horizon

Proposition

Assume no state constraint, namely D(t) = X. The value function is the solution of the following backward dynamic programming equation (or Bellman equation), for t = T − 1, ..., 0:

\[
\begin{aligned}
V(T, x) &= M(T, x), \\
V(t, x) &= \sup_{u \in B(t,x)} \Big( L(t, x, u) + V\big(t+1, F(t, x, u)\big) \Big).
\end{aligned}
\]





Thus, to evaluate the value V(t, ·) at each time, we start from the final value V(T, ·) = M(T, ·), then compute V(T − 1, ·), and so on by backward induction.

Notice that the essence of dynamic programming is to replace one optimization problem over the trajectory space X^{T+1} × U^T by a sequence of T optimization problems over the primitive space U.
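The backward induction just described can be sketched in a few lines when the state space is finite and the dynamics maps it into itself (both are assumptions of this sketch, as is the absence of state constraints). The harvesting example and the name `backward_induction` are hypothetical.

```python
import math


def backward_induction(T, X, B, F, L, M):
    """Return V with V[t][x] = optimal value from state x at time t (maximization)."""
    V = [dict() for _ in range(T + 1)]
    V[T] = {x: M(T, x) for x in X}  # V(T, x) = M(T, x)
    for t in range(T - 1, -1, -1):
        for x in X:
            # One optimization over the control set B(t, x) per (t, x).
            V[t][x] = max(L(t, x, u) + V[t + 1][F(t, x, u)] for u in B(t, x))
    return V


# Hypothetical example: a non-renewable stock x in {0, ..., 4}, harvest u <= x,
# concave utility sqrt(u), no final utility, horizon T = 2.
X = range(5)
B = lambda t, x: range(x + 1)
F = lambda t, x, u: x - u
L = lambda t, x, u: math.sqrt(u)
M = lambda T, x: 0.0

V = backward_induction(2, X, B, F, L, M)
```

With a concave utility, the computed V[0][4] corresponds to splitting the harvest evenly over the two periods, that is 2·sqrt(2).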





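Dynamic programming equation in finite horizon with viability constraints

Proposition

\[
\begin{aligned}
V(T, x) &= M(T, x), && \forall x \in \mathrm{Viab}(T), \\
V(t, x) &= \sup_{u \in B^{\mathrm{viab}}(t,x)} \Big( L(t, x, u) + V\big(t+1, F(t, x, u)\big) \Big), && \forall x \in \mathrm{Viab}(t).
\end{aligned}
\]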





Optimal feedback

Definition

An optimal feedback is any mapping υ^⋆ : {0, ..., T − 1} × X → U such that any trajectory (x^⋆(·), u^⋆(·)) generated by

x^⋆(0) = x0, x^⋆(t + 1) = F(t, x^⋆(t), u^⋆(t)), u^⋆(t) = υ^⋆(t, x^⋆(t)),

for t = 0, ..., T − 1, for any initial condition x0 ∈ D(0), belongs to T^ad(0, x0) and is an optimal feasible trajectory, that is

\[
\max_{(x(\cdot),u(\cdot)) \in \mathcal{T}^{\mathrm{ad}}(0,x_0)} I(x(\cdot), u(\cdot)) = I(x^\star(\cdot), u^\star(\cdot)).
\]





Note that υ (Greek letter upsilon) denotes a mapping from {0, ..., T − 1} × X to U, while u denotes a variable (u ∈ U).

Proposition

For any time t and state x ∈ Viab(t), assume the existence of the following feedback decision

\[
\upsilon^\star(t, x) \in \arg\max_{u \in B^{\mathrm{viab}}(t,x)} \Big( L(t, x, u) + V\big(t+1, F(t, x, u)\big) \Big).
\]

Then υ^⋆ is an optimal feasible feedback.
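To make the proposition concrete, the sketch below stores an argmax at each (t, x) during the backward induction and then replays the resulting feedback forward in time. It assumes finite state and control sets and no state constraints (so that B^viab = B); the selling example with time-varying prices and the names `solve` and `closed_loop` are hypothetical.

```python
def solve(T, X, B, F, L, M):
    """Backward induction returning the value function V and an argmax feedback."""
    V = [dict() for _ in range(T + 1)]
    feedback = [dict() for _ in range(T)]
    V[T] = {x: M(T, x) for x in X}
    for t in range(T - 1, -1, -1):
        for x in X:
            best_u = max(B(t, x), key=lambda u: L(t, x, u) + V[t + 1][F(t, x, u)])
            feedback[t][x] = best_u
            V[t][x] = L(t, x, best_u) + V[t + 1][F(t, x, best_u)]
    return V, feedback


def closed_loop(T, x0, F, feedback):
    """Generate x*(t+1) = F(t, x*(t), u*(t)) with u*(t) = feedback[t][x*(t)]."""
    xs, us = [x0], []
    for t in range(T):
        us.append(feedback[t][xs[-1]])
        xs.append(F(t, xs[-1], us[-1]))
    return xs, us


# Hypothetical example: sell a stock of 3 units over T = 3 periods at prices 1, 3, 2.
prices = [1, 3, 2]
X = range(4)
B = lambda t, x: range(x + 1)
F = lambda t, x, u: x - u
L = lambda t, x, u: prices[t] * u
M = lambda T, x: 0

V, feedback = solve(3, X, B, F, L, M)
xs, us = closed_loop(3, 3, F, feedback)
```

The feedback waits for the highest price and sells everything at t = 1, and the utility accumulated along the generated trajectory equals V[0][3].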





Proof

Recall that

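\[
(x(\cdot), u(\cdot)) \in \mathcal{T}^{\mathrm{ad}}(t,x) \iff
\begin{cases}
x(t) = x, \\
x(s+1) = F(s, x(s), u(s)), \quad u(s) \in B(s, x(s)), \quad x(s) \in D(s), \quad \forall s \ge t.
\end{cases}
\]

For any x ∈ Viab(t), the admissible set T^ad(t, x) is not empty and we have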





\[
\begin{aligned}
V(t, x) &= \sup_{(x(\cdot),u(\cdot)) \in \mathcal{T}^{\mathrm{ad}}(t,x)} \Big( \sum_{s=t}^{T-1} L(s, x(s), u(s)) + M(T, x(T)) \Big) \\
&= \sup_{u(t) \in B(t,x)} \Bigg( \sup_{\substack{u(t+1), \ldots, u(T-1) \\ x(t+1) = F(t, x, u(t)) \\ x(s+1) = F(s, x(s), u(s)) \\ x(s) \in D(s) \\ u(s) \in B(s, x(s)) \\ s \ge t+1}} \Big[ L(t, x(t), u(t)) + \sum_{s=t+1}^{T-1} L(s, x(s), u(s)) + M(T, x(T)) \Big] \Bigg) \\
&= \sup_{u \in B^{\mathrm{viab}}(t,x)} \Bigg( L(t, x, u) + \sup_{(x(\cdot),u(\cdot)) \in \mathcal{T}^{\mathrm{ad}}(t+1, F(t,x,u))} \Big[ \sum_{s=t+1}^{T-1} L(s, x(s), u(s)) + M(T, x(T)) \Big] \Bigg) \\
&= \sup_{u \in B^{\mathrm{viab}}(t,x)} \Big( L(t, x, u) + V\big(t+1, F(t, x, u)\big) \Big).
\end{aligned}
\]





Extension to Whittle criterion

Let us call a criterion I in Whittle form whenever it is given by the following backward induction

I(x(·), u(·)) = C(0),

C(t) = ψ(t, x(t), u(t), C(t + 1)), t = 0, ..., T − 1,

C(T) = M(T, x(T)),

where ψ is either strictly increasing or continuously increasing in its last argument. This form is adapted to maximin dynamic programming, and includes the additive case, for which ψ(t, x, u, C) = L(t, x, u) + C.

The value function is then defined by the backward induction

V(T, x) := M(T, x),

V(t, x) := sup_{u ∈ B(t,x)} ψ(t, x, u, V(t + 1, F(t, x, u))).
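As a hedged illustration, the same backward induction can be coded once with ψ passed as a parameter. Taking ψ(t, x, u, C) = min(L(t, x, u), C) then gives the maximin criterion, and ψ = L + C the additive one. The finite resource-sharing example, the choice M = +∞ for the maximin run (so that the final term never binds) and the name `whittle_backward_induction` are assumptions of this sketch.

```python
def whittle_backward_induction(T, X, B, F, psi, M):
    """V(T, x) = M(T, x);  V(t, x) = max over u of psi(t, x, u, V(t+1, F(t, x, u)))."""
    V = [dict() for _ in range(T + 1)]
    V[T] = {x: M(T, x) for x in X}
    for t in range(T - 1, -1, -1):
        for x in X:
            V[t][x] = max(psi(t, x, u, V[t + 1][F(t, x, u)]) for u in B(t, x))
    return V


# Hypothetical example: share a resource of 6 units over T = 3 periods.
X = range(7)
B = lambda t, x: range(x + 1)
F = lambda t, x, u: x - u
L = lambda t, x, u: u  # instantaneous utility = consumption

# Maximin: psi = min(L, C), with a final term that never binds.
psi_maximin = lambda t, x, u, c: min(L(t, x, u), c)
V_maximin = whittle_backward_induction(3, X, B, F, psi_maximin, lambda T, x: float("inf"))

# Additive: psi = L + C, with zero final utility.
psi_additive = lambda t, x, u, c: L(t, x, u) + c
V_additive = whittle_backward_induction(3, X, B, F, psi_additive, lambda T, x: 0)
```

In this example the maximin value from 6 units is 2 (equal sharing over three periods), while the additive value is 6.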





Dynamic programming equation in infinite horizon

x(t + 1) = F (x(t), u(t)) , t ∈ N

\[
I(x(\cdot), u(\cdot)) = \sum_{t=0}^{+\infty} \rho^{t} L(x(t), u(t))
\]

Proposition

\[
V(x) = \sup_{u \in B(x)} \Big( L(x, u) + \rho V\big(F(x, u)\big) \Big).
\]
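One common way to approximate a solution of this fixed-point equation is value iteration, which is not described in the slides and is used here only as an illustration. The sketch assumes a finite state space, a bounded L and a discount factor ρ < 1, so that repeated application of the right-hand side converges; the two-state example and the name `value_iteration` are hypothetical.

```python
def value_iteration(X, B, F, L, rho, tol=1e-10, max_iter=10_000):
    """Iterate V <- max over u of (L(x, u) + rho * V(F(x, u))) until the update is below tol."""
    V = {x: 0.0 for x in X}
    for _ in range(max_iter):
        V_new = {x: max(L(x, u) + rho * V[F(x, u)] for u in B(x)) for x in X}
        if max(abs(V_new[x] - V[x]) for x in X) < tol:
            return V_new
        V = V_new
    return V


# Hypothetical example: two states; state 1 pays 1 per period, state 0 pays nothing.
# Control u = 0 stays in place, u = 1 switches state.
X = (0, 1)
B = lambda x: (0, 1)
F = lambda x, u: x if u == 0 else 1 - x
L = lambda x, u: 1.0 if x == 1 else 0.0

V = value_iteration(X, B, F, L, rho=0.5)
```

For ρ = 0.5 the exact solution is V(1) = 1/(1 − ρ) = 2 and V(0) = ρ V(1) = 1, which the iteration approaches.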



STOCHASTIC DYNAMIC PROGRAMMING





State equation with random inputs

The uncertain dynamic model in discrete time is described by a state equation,

x(t + 1) = F(t, x(t), u(t), w(t)), t = 0, ..., T − 1, with x(0) = x0

where

the horizon T ∈ N or T = +∞ is the term of the problem;

x(t) ∈ X = R^n represents the system’s state vector at time t;

x0 ∈ X is the initial condition;

u(t) ∈ U = R^p represents the decision or control vector at time t;

w(t) stands for the uncertain variable (or disturbance, or noise), taking its values in a set W = R^q;

F : N × X × U × W → X is the so-called dynamics function, representing the system’s evolution.





Constraints and viability

As in the deterministic case, we may require state and decision constraints to be satisfied. However, since state trajectories are no longer unique, the following requirements depend upon the scenarios w(·) = (w(0), w(1), ..., w(T − 1)) ∈ W^T in a way that we shall specify later. The assertions below are thus to be taken in a loose sense at this stage.

The state constraints are respected at any time t

x(t) ∈ D(t) ⊂ X .

The control constraints are respected at any time t

u(t) ∈ B(t, x(t)) ⊂ U .

The final state achieves a fixed target C ⊂ X

x(T ) ∈ C .





Admissible feedbacks

Solutions are no longer trajectories, as in the deterministic case, but feedbacks.

Definition

Γ = {γ : N × X → U},

Γ^ad = {γ ∈ Γ | γ(t, x) ∈ B(t, x), ∀(t, x) ∈ {0, ..., T − 1} × X}.





Solution maps

For γ ∈ Γ, let F^γ denote the mapping F^γ : N × X × W → X defined by F^γ(t, x, w) := F(t, x, γ(t, x), w).

Definition

The state map and control map are defined, for any time t0 ∈ {0, ..., T}, by x_F[t0, x0, γ, w(·)](t) = x(t) and u_F[t0, x0, γ, w(·)](t) = u(t) = γ(t, x(t)) respectively, where x(·) satisfies the dynamics

x(t + 1) = F^γ(t, x(t), w(t)), t = t0, ..., T − 1,

and the initial condition x(t0) = x0.





Causality

It should be noticed that, with straightforward notations,

x_F[t0, x0, γ, w(·)](t0) = x0,

x_F[t0, x0, γ, w(·)](t) = x_F[t0, x0, γ, (w(t0), ..., w(t − 1))](t) for t ≥ t0 + 1,

thus expressing a causality property: the state at time t depends only upon the disturbances w(t0), ..., w(t − 1) that occurred before time t.
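The state and control maps, and the causality property, can be illustrated by simulating the closed-loop dynamics along a given scenario. This is a minimal sketch with scalar states; the inventory dynamics, the order-up-to feedback and the name `closed_loop_maps` are hypothetical choices made for the example.

```python
def closed_loop_maps(F, gamma, t0, x0, scenario):
    """Return ([x(t0), ..., x(T)], [u(t0), ..., u(T-1)]) under feedback gamma.

    scenario lists the disturbances w(t0), ..., w(T-1).
    """
    xs, us = [x0], []
    for k, w in enumerate(scenario):
        t = t0 + k
        u = gamma(t, xs[-1])            # u(t) = gamma(t, x(t))
        us.append(u)
        xs.append(F(t, xs[-1], u, w))   # x(t+1) = F(t, x(t), u(t), w(t))
    return xs, us


# Hypothetical inventory example: order up to level 3, demand w removes stock.
F = lambda t, x, u, w: max(x + u - w, 0)
gamma = lambda t, x: max(3 - x, 0)

xs_a, us_a = closed_loop_maps(F, gamma, 0, 1, [2, 0, 4])
xs_b, us_b = closed_loop_maps(F, gamma, 0, 1, [2, 0, 1])
```

The two scenarios differ only in w(2), so the states up to time 2 and the controls up to time 2 coincide; only x(3) changes.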





Criteria to optimize

The criterion I now depends upon the scenarios: this raises the question of how to turn this family of values (one per scenario) into a single one to be optimized.





Additive criterion (finite time)

The additive and separable form in finite horizon is

\[
I(x(\cdot), u(\cdot), w(\cdot)) = \sum_{t=0}^{T-1} L(t, x(t), u(t), w(t)) + M(T, x(T))
\]

in which

L : N × X × U × W → R specifies the instantaneous cost (or loss, disutility, etc., according to the situation) when the criterion I is minimized, and the instantaneous utility (or gain, profit, benefit, payoff, etc.) when the criterion I is maximized;

M : N × X → R represents the final cost when the criterion I is minimized, and the final utility otherwise.





Additive criterion (infinite time)

The additive and separable form in the infinite horizon is

\[
I(x(\cdot), u(\cdot), w(\cdot)) = \sum_{t=0}^{+\infty} L(t, x(t), u(t), w(t)).
\]





Multiplicative form

The multiplicative form is

\[
I(x(\cdot), u(\cdot), w(\cdot)) = \prod_{t=0}^{T-1} L(t, x(t), u(t), w(t)) \times M(T, x(T)).
\]





Probabilistic assumptions

Probabilistic assumptions on the uncertainty w(·) may be added, providing a stochastic nature to the problem.

Mathematically speaking, w(·) = (w(0), w(1), ..., w(T − 1)) is a sequence of random variables defined over a measurable space (Ω, F) equipped with a probability P. When T = +∞, one rather speaks of a stochastic process.

The notation E refers to the mathematical expectation under probability P. Recall that a random variable is a measurable function on (Ω, F).





Measurability assumptions

To be able to perform mathematical expectations, we are led to consider measurability assumptions. The sets X and U are now assumed to be equipped with σ-fields 𝒳 and 𝒰 respectively, the dynamics is supposed to be measurable and, by a feedback, we now implicitly mean a measurable feedback. From now on,

Γ := {γ : N × X → U | γ measurable}.

Once a feedback γ is picked in Γ, all the variables x(·), u(·) and w(·) become random variables defined over (Ω, F, P), by means of the relations

x(t) = x_F[0, x0, γ, w(·)](t) and u(t) = γ(t, x(t)).

Thus, any quantity depending upon states, controls and disturbances is now a random variable; hence, when bounded or nonnegative, it admits an integral with respect to P.





The i.i.d. case

Following a common hypothesis, we shall, for the sake of simplicity, assume that the random variables w(·) are independent and identically distributed (i.i.d.) under P.

In such a probabilistic context, we use the notation E[a(w)] for the expected value of any integrable random variable a : W → R, and E[A(w(·))] for any random variable A : W^T → R.





The discrete i.i.d. case

For discrete probability laws, this means that

\[
\mathbb{E}[a(w)] = \sum_{w \in W} \mu(w)\, a(w)
\]

with µ the common discrete law on W of the random variables w(t) [3], and

\[
\mathbb{E}[A(w(\cdot))] = \sum_{w_0 \in W} \cdots \sum_{w_{T-1} \in W} A(w_0, \ldots, w_{T-1})\, \mu(w_0) \cdots \mu(w_{T-1}).
\]

[3] Thus, we can choose Ω = W^T and P the product of T copies of µ.
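For a small finite set W, the nested sum above can be evaluated by enumerating all scenarios. The sketch below represents the law µ as a dictionary from values to probabilities; the function name `expectation` and the coin-flip example are hypothetical.

```python
from itertools import product
from math import prod


def expectation(A, mu, T):
    """E[A(w(0), ..., w(T-1))] for i.i.d. w(t) with discrete law mu (value -> probability)."""
    return sum(
        A(*ws) * prod(mu[w] for w in ws)   # A(w_0..w_{T-1}) * mu(w_0) ... mu(w_{T-1})
        for ws in product(mu, repeat=T)    # all scenarios in W^T
    )


# Hypothetical example: T = 3 fair coin flips, A counts the number of ones.
mu = {0: 0.5, 1: 0.5}
mean_count = expectation(lambda *ws: sum(ws), mu, 3)
```

The enumeration visits |W|^T scenarios, so it is only practical for short horizons; this is one motivation for the backward induction presented later.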





The continuous i.i.d. case

For continuous probability laws on W = R^q, this gives

\[
\mathbb{E}[a(w)] = \int_{W} a(w) f(w)\, dw
\]

with f the common density on W of the random variables w(t) [4], and

\[
\mathbb{E}[A(w(\cdot))] = \int_{W} \cdots \int_{W} A(w_0, \ldots, w_{T-1})\, f(w_0) \cdots f(w_{T-1})\, dw_0 \cdots dw_{T-1}.
\]

[4] Thus, we can choose Ω = W^T and P the product of T copies of f(w)dw.




Minimal mean cost

Definition

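For any admissible feedback strategy γ ∈ Γ^ad and initial condition x0 ∈ X, let us consider the expected criterion or mean cost

\[
I(x_0, \gamma) := \mathbb{E}\Big[ I\big( x_F[0, x_0, \gamma, w(\cdot)](\cdot),\ u_F[0, x_0, \gamma, w(\cdot)](\cdot),\ w(\cdot) \big) \Big].
\]

The stochastic optimization problem corresponds to

\[
I^\star(x_0) = \inf_{\gamma \in \Gamma^{\mathrm{ad}}} I(x_0, \gamma)
= \inf_{\gamma \in \Gamma^{\mathrm{ad}}} \mathbb{E}\Big[ I\big( x_F[0, x_0, \gamma, w(\cdot)](\cdot),\ u_F[0, x_0, \gamma, w(\cdot)](\cdot),\ w(\cdot) \big) \Big].
\]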





Optimal feedback

Definition

Any γ^⋆ ∈ Γ^ad such that

\[
I^\star(x_0) = \min_{\gamma \in \Gamma^{\mathrm{ad}}} I(x_0, \gamma) = I(x_0, \gamma^\star)
\]

is an optimal feedback.





Stochastic dynamic programming equation in finite horizon

Definition

In the absence of state constraints (D(t) = X for t = 0, ..., T), the value function V(t, x) is defined by the following backward induction:

\[
\begin{aligned}
V(T, x) &:= M(T, x), \\
V(t, x) &:= \inf_{u \in B(t,x)} \mathbb{E}\Big[ L(t, x, u, w(t)) + V\big(t+1, F(t, x, u, w(t))\big) \Big].
\end{aligned}
\]

Contrary to the deterministic case, the value function is here defined by a backward induction relation, and one can then prove that it coincides with the optimal expected cost.





Optimal feedback

Assume no state constraint, namely D(t) = X. For any time t and state x, assume the existence of the following feedback decision

\[
\gamma^\star(t, x) \in \arg\min_{u \in B(t,x)} \mathbb{E}\Big[ L(t, x, u, w(t)) + V\big(t+1, F(t, x, u, w(t))\big) \Big].
\]

Then γ^⋆ : (t, x) ↦ γ^⋆(t, x) is an optimal strategy and, for any x0 ∈ X, the optimal expected cost is given by

V(0, x0) = I^⋆(x0) = I(x0, γ^⋆).
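The stochastic backward induction and the argmin feedback can be sketched as follows for finite state, control and disturbance sets with i.i.d. disturbances of discrete law µ. These finiteness assumptions, the inventory example and the name `stochastic_dp` are hypothetical and serve only as an illustration.

```python
def stochastic_dp(T, X, B, F, L, M, mu):
    """Return (V, gamma): value function and argmin feedback for a minimization problem.

    mu maps each disturbance value w to its probability.
    """
    V = [dict() for _ in range(T + 1)]
    gamma = [dict() for _ in range(T)]
    V[T] = {x: M(T, x) for x in X}
    for t in range(T - 1, -1, -1):
        for x in X:
            def expected_cost(u):
                # E[ L(t, x, u, w) + V(t+1, F(t, x, u, w)) ] under the law mu
                return sum(p * (L(t, x, u, w) + V[t + 1][F(t, x, u, w)])
                           for w, p in mu.items())
            u_best = min(B(t, x), key=expected_cost)
            gamma[t][x] = u_best
            V[t][x] = expected_cost(u_best)
    return V, gamma


# Hypothetical inventory example: stock x in {0..3}, order u with x + u <= 3,
# demand w in {0, 1} with equal probability, unit ordering cost 1,
# shortage penalty 4 per unit of unmet demand, no final cost, T = 2.
X = range(4)
B = lambda t, x: range(4 - x)
F = lambda t, x, u, w: max(x + u - w, 0)
L = lambda t, x, u, w: u + 4 * max(w - x - u, 0)
M = lambda T, x: 0.0
mu = {0: 0.5, 1: 0.5}

V, gamma = stochastic_dp(2, X, B, F, L, M, mu)
```

In this example the feedback orders one unit when the stock is empty and nothing otherwise, with an optimal expected cost V[0][0] of 1.5.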





Extension to Whittle criterion

Let us call a criterion I in strong Whittle form whenever it is given by the following backward induction

I(x(·), u(·), w(·)) = C(0),

C(t) = g(t, x(t), u(t), w(t)) + β(t, x(t), u(t), w(t)) C(t + 1), t = 0, ..., T − 1,

C(T) = M(T, x(T)),

where β(t, x(t), u(t), w(t)) > 0. Equivalently,





\[
I(x(\cdot), u(\cdot), w(\cdot)) = \sum_{t=0}^{T} \beta_0 \beta_1 \cdots \beta_{t-1}\, g_t
\]

with

β_t = β(t, x(t), u(t), w(t)) > 0, t = 0, ..., T − 1,

g_t = g(t, x(t), u(t), w(t)), t = 0, ..., T − 1,

g_T = M(T, x(T)).

This form happens to be adapted to stochastic dynamic programming, and to include both the additive and multiplicative cases, with respectively g(t, x, u, w) = L(t, x, u, w), β(t, x, u, w) = 1 and g(t, x, u, w) = 0, β(t, x, u, w) = L(t, x, u, w).





Value function

In the absence of state constraints (D(t) = X for t = 0, ..., T), the value function V(t, x) is defined by the following backward induction:

\[
\begin{aligned}
V(T, x) &:= M(T, x), \\
V(t, x) &:= \inf_{u \in B(t,x)} \mathbb{E}\Big[ g(t, x, u, w(t)) + \beta(t, x, u, w(t))\, V\big(t+1, F(t, x, u, w(t))\big) \Big].
\end{aligned}
\]
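As a last illustration, the backward induction in strong Whittle form can be coded with g and β passed as parameters. The sketch assumes finite sets and i.i.d. disturbances of discrete law µ. The multiplicative example (g = 0, β = L > 0) on a single-state system and the name `whittle_stochastic_dp` are hypothetical.

```python
def whittle_stochastic_dp(T, X, B, F, g, beta, M, mu):
    """V(T, x) = M(T, x);  V(t, x) = min over u of E[ g + beta * V(t+1, F(t, x, u, w)) ]."""
    V = [dict() for _ in range(T + 1)]
    V[T] = {x: M(T, x) for x in X}
    for t in range(T - 1, -1, -1):
        for x in X:
            V[t][x] = min(
                sum(p * (g(t, x, u, w) + beta(t, x, u, w) * V[t + 1][F(t, x, u, w)])
                    for w, p in mu.items())
                for u in B(t, x)
            )
    return V


# Hypothetical multiplicative example: one state, control u in {1, 2},
# disturbance w in {1, 3} with equal probability, L(t, x, u, w) = u + w > 0.
X = [0]
B = lambda t, x: (1, 2)
F = lambda t, x, u, w: 0
g = lambda t, x, u, w: 0.0
beta = lambda t, x, u, w: u + w
M = lambda T, x: 1.0
mu = {1: 0.5, 3: 0.5}

V = whittle_stochastic_dp(2, X, B, F, g, beta, M, mu)
```

Choosing u = 1 at both dates gives an expected factor of 3 per period, so the minimal expected product over T = 2 periods is 9, which is the value returned in V[0][0].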

