
LECTURE SLIDES ON DYNAMIC PROGRAMMING

BASED ON LECTURES GIVEN AT THE

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

CAMBRIDGE, MASS

FALL 2008

DIMITRI P. BERTSEKAS

These lecture slides are based on the book: “Dynamic Programming and Optimal Control: 3rd Edition,” Vols. 1 and 2, Athena Scientific, 2007, by Dimitri P. Bertsekas; see

http://www.athenasc.com/dpbook.html

Last Updated: December 2008

The slides may be freely reproduced and distributed.

6.231 DYNAMIC PROGRAMMING

LECTURE 1

LECTURE OUTLINE

• Problem Formulation

• Examples

• The Basic Problem

• Significance of Feedback

DP AS AN OPTIMIZATION METHODOLOGY

• Generic optimization problem:

min_{u ∈ U} g(u)

where u is the optimization/decision variable, g(u) is the cost function, and U is the constraint set

• Categories of problems:

− Discrete (U is finite) or continuous

− Linear (g is linear and U is polyhedral) or nonlinear

− Stochastic or deterministic: In stochastic problems the cost involves a stochastic parameter w, which is averaged, i.e., it has the form

g(u) = E_w{ G(u, w) }

where w is a random parameter.

• DP can deal with complex stochastic problems where information about w becomes available in stages, and the decisions are also made in stages and make use of this information.

BASIC STRUCTURE OF STOCHASTIC DP

• Discrete-time system

xk+1 = fk(xk, uk, wk), k = 0, 1, . . . , N − 1

− k: Discrete time

− xk: State; summarizes past information that is relevant for future optimization

− uk: Control; decision to be selected at time k from a given set

− wk: Random parameter (also called disturbance or noise depending on the context)

− N: Horizon or number of times control is applied

• Cost function that is additive over time

E{ gN(xN) + Σ_{k=0}^{N−1} gk(xk, uk, wk) }

INVENTORY CONTROL EXAMPLE

[Figure: Inventory system. Stock xk at period k, stock uk ordered at period k, demand wk at period k; stock at period k + 1: xk+1 = xk + uk − wk. Cost of period k: c uk + r(xk + uk − wk).]

• Discrete-time system

xk+1 = fk(xk, uk, wk) = xk + uk − wk

• Cost function that is additive over time

E{ gN(xN) + Σ_{k=0}^{N−1} gk(xk, uk, wk) } = E{ Σ_{k=0}^{N−1} ( c uk + r(xk + uk − wk) ) }

• Optimization over policies: Rules/functions uk = µk(xk) that map states to controls (a simulation sketch of this setup is given below)
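The setup above can be made concrete with a small simulation. The sketch below is not from the slides: it assumes an illustrative base-stock rule µk(x) = max(0, S − x), illustrative cost parameters c, p, h, and a uniform demand distribution, and estimates the expected additive cost of that fixed policy by Monte Carlo.

```python
import random

# Hedged sketch: simulate x_{k+1} = x_k + u_k - w_k under the (assumed)
# base-stock policy mu_k(x) = max(0, S - x) and estimate the expected cost.

def per_stage_cost(x_next, u, c=1.0, p=3.0, h=1.0):
    # c*u: purchase cost; r(x) = p*max(0,-x) + h*max(0,x): shortage/holding cost
    return c * u + p * max(0.0, -x_next) + h * max(0.0, x_next)

def estimate_cost(x0, S, N, num_runs=10000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_runs):
        x, cost = x0, 0.0
        for _ in range(N):
            u = max(0, S - x)            # policy u_k = mu_k(x_k)
            w = rng.randint(0, 4)        # assumed demand distribution
            x = x + u - w                # system equation
            cost += per_stage_cost(x, u)
        total += cost
    return total / num_runs

if __name__ == "__main__":
    for S in (0, 2, 4, 6):
        print(f"order-up-to level S = {S}: estimated cost = {estimate_cost(0, S, N=10):.2f}")
```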

ADDITIONAL ASSUMPTIONS

• The set of values that the control uk can take depends at most on xk and not on prior x or u

• Probability distribution of wk does not depend on past values wk−1, . . . , w0, but may depend on xk and uk

− Otherwise past values of w or x would be useful for future optimization

• Sequence of events envisioned in period k:

− xk occurs according to xk = fk−1(xk−1, uk−1, wk−1)

− uk is selected with knowledge of xk, i.e., uk ∈ Uk(xk)

− wk is random and generated according to a distribution Pwk(· | xk, uk)

DETERMINISTIC FINITE-STATE PROBLEMS

• Scheduling example: Find optimal sequence of operations A, B, C, D

• A must precede B, and C must precede D

• Given startup costs SA and SC, and setup transition cost Cmn from operation m to operation n

[Figure: State transition graph for the scheduling example. From the initial state, arcs labeled with the startup costs SA and SC lead to states A and C; subsequent arcs labeled with the transition costs Cmn lead through the partial schedules AB, AC, CA, CD, ABC, ACB, ACD, CAB, CAD, CDA to the complete schedules.]

STOCHASTIC FINITE-STATE PROBLEMS

• Example: Find two-game chess match strategy

• Timid play draws with prob. pd > 0 and loses with prob. 1 − pd. Bold play wins with prob. pw < 1/2 and loses with prob. 1 − pw

[Figure: Transition diagrams for the 1st and 2nd game under timid play and bold play. From score 0-0, timid play leads to 0.5-0.5 with prob. pd and to 0-1 with prob. 1 − pd, while bold play leads to 1-0 with prob. pw and to 0-1 with prob. 1 − pw. In the 2nd game, the analogous transitions lead from the scores 1-0, 0.5-0.5, 0-1 to the final scores 2-0, 1.5-0.5, 1-1, 0.5-1.5, 0-2.]

BASIC PROBLEM

• System xk+1 = fk(xk, uk, wk), k = 0, . . . , N−1

• Control constraints uk ∈ Uk(xk)

• Probability distribution Pk(· | xk, uk) of wk

• Policies π = {µ0, . . . , µN−1}, where µk maps states xk into controls uk = µk(xk) and is such that µk(xk) ∈ Uk(xk) for all xk

• Expected cost of π starting at x0 is

Jπ(x0) = E{ gN(xN) + Σ_{k=0}^{N−1} gk(xk, µk(xk), wk) }

• Optimal cost function

J∗(x0) = min_π Jπ(x0)

• Optimal policy π∗ satisfies

Jπ∗(x0) = J∗(x0)

When produced by DP, π∗ is independent of x0.

SIGNIFICANCE OF FEEDBACK

• Open-loop versus closed-loop policies

[Figure: Closed-loop control: the policy µk maps the state xk into the control uk = µk(xk), which is applied to the system xk+1 = fk(xk, uk, wk) driven by the disturbance wk.]

• In deterministic problems open loop is as good as closed loop

• Chess match example; value of information

[Figure: A closed-loop strategy for the chess match: play bold in the 1st game (win with prob. pw, lose with prob. 1 − pw); if the 1st game was won (score 1-0), play timid in the 2nd game (reach 1.5-0.5 with prob. pd, 1-1 with prob. 1 − pd); if it was lost (score 0-1), play bold in the 2nd game (reach 1-1 with prob. pw, 0-2 with prob. 1 − pw).]

VARIANTS OF DP PROBLEMS

• Continuous-time problems

• Imperfect state information problems

• Infinite horizon problems

• Suboptimal control

LECTURE BREAKDOWN

• Finite Horizon Problems (Vol. 1, Ch. 1-6)

− Ch. 1: The DP algorithm (2 lectures)

− Ch. 2: Deterministic finite-state problems (2 lectures)

− Ch. 3: Deterministic continuous-time problems (1 lecture)

− Ch. 4: Stochastic DP problems (2 lectures)

− Ch. 5: Imperfect state information problems (2 lectures)

− Ch. 6: Suboptimal control (3 lectures)

• Infinite Horizon Problems - Simple (Vol. 1, Ch. 7, 3 lectures)

• Infinite Horizon Problems - Advanced (Vol. 2)

− Ch. 1: Discounted problems - Computational methods (2 lectures)

− Ch. 2: Stochastic shortest path problems (1 lecture)

− Ch. 6: Approximate DP (6 lectures)

A NOTE ON THESE SLIDES

• These slides are a teaching aid, not a text

• Don’t expect a rigorous mathematical development or precise mathematical statements

• Figures are meant to convey and enhance ideas, not to express them precisely

• Omitted proofs and a much fuller discussion can be found in the text, which these slides follow

6.231 DYNAMIC PROGRAMMING

LECTURE 2

LECTURE OUTLINE

• The basic problem

• Principle of optimality

• DP example: Deterministic problem

• DP example: Stochastic problem

• The general DP algorithm

• State augmentation

BASIC PROBLEM

• System xk+1 = fk(xk, uk, wk), k = 0, . . . , N−1

• Control constraints uk ∈ Uk(xk)

• Probability distribution Pk(· | xk, uk) of wk

• Policies π = {µ0, . . . , µN−1}, where µk maps states xk into controls uk = µk(xk) and is such that µk(xk) ∈ Uk(xk) for all xk

• Expected cost of π starting at x0 is

Jπ(x0) = E{ gN(xN) + Σ_{k=0}^{N−1} gk(xk, µk(xk), wk) }

• Optimal cost function

J∗(x0) = min_π Jπ(x0)

• Optimal policy π∗ is one that satisfies

Jπ∗(x0) = J∗(x0)

PRINCIPLE OF OPTIMALITY

• Let π∗ = {µ∗0, µ∗1, . . . , µ∗N−1} be an optimal policy

• Consider the “tail subproblem” whereby we are at xi at time i and wish to minimize the “cost-to-go” from time i to time N

E{ gN(xN) + Σ_{k=i}^{N−1} gk(xk, µk(xk), wk) }

and the “tail policy” {µ∗i, µ∗i+1, . . . , µ∗N−1}

[Figure: Time axis from 0 to N; the tail subproblem starts at state xi at time i.]

• Principle of optimality: The tail policy is optimal for the tail subproblem (optimization of the future does not depend on what we did in the past)

• DP first solves ALL tail subproblems of final stage

• At the generic step, it solves ALL tail subproblems of a given time length, using the solution of the tail subproblems of shorter time length

DETERMINISTIC SCHEDULING EXAMPLE

• Find optimal sequence of operations A, B, C, D (A must precede B and C must precede D)

[Figure: State transition graph for the scheduling example, with the arc lengths (startup and transition costs) marked on the arcs from the initial state through the partial schedules A, C, AB, AC, CA, CD, ABC, ACB, ACD, CAB, CAD, CDA.]

• Start from the last tail subproblem and go backwards

• At each state-time pair, we record the optimal cost-to-go and the optimal decision

STOCHASTIC INVENTORY EXAMPLE

[Figure: Inventory system. Stock xk, stock ordered uk, demand wk; xk+1 = xk + uk − wk. Cost of period k: c uk + r(xk + uk − wk).]

• Tail subproblems of length 1:

JN−1(xN−1) = min_{uN−1 ≥ 0} E_{wN−1}{ c uN−1 + r(xN−1 + uN−1 − wN−1) }

• Tail subproblems of length N − k:

Jk(xk) = min_{uk ≥ 0} E_{wk}{ c uk + r(xk + uk − wk) + Jk+1(xk + uk − wk) }

• J0(x0) is opt. cost of initial state x0

DP ALGORITHM

• Start with

JN (xN ) = gN (xN ),

and go backwards using

Jk(xk) = min_{uk ∈ Uk(xk)} E_{wk}{ gk(xk, uk, wk) + Jk+1( fk(xk, uk, wk) ) }, k = 0, 1, . . . , N − 1.

• Then J0(x0), generated at the last step, is equal to the optimal cost J∗(x0). Also, the policy

π∗ = {µ∗0, . . . , µ∗N−1},

where µ∗k(xk) minimizes in the right side above for each xk and k, is optimal

• Justification: Proof by induction that Jk(xk) is equal to J∗k(xk), defined as the optimal cost of the tail subproblem that starts at time k at state xk

• Note:

− ALL the tail subproblems are solved (in addition to the original problem)

− Intensive computational requirements (a tabular sketch of the algorithm is given below)
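A minimal tabular sketch of the algorithm above for a finite state space; the problem data in the usage example (state grid, demand distribution, costs) are illustrative assumptions, not taken from the slides.

```python
# Backward DP over a finite state space: J_N(x) = g_N(x), then
# J_k(x) = min_u E_w{ g_k(x,u,w) + J_{k+1}(f_k(x,u,w)) }.

def dp_algorithm(states, controls, f, g, gN, w_dist, N):
    """Returns cost-to-go tables J[k][x] and an optimal policy mu[k][x]."""
    J = [dict() for _ in range(N + 1)]
    mu = [dict() for _ in range(N)]
    for x in states:
        J[N][x] = gN(x)
    for k in range(N - 1, -1, -1):                 # proceed backwards
        for x in states:
            best_u, best_val = None, float("inf")
            for u in controls(x):
                val = sum(p * (g(k, x, u, w) + J[k + 1][f(k, x, u, w)])
                          for w, p in w_dist(k, x, u))
                if val < best_val:
                    best_u, best_val = u, val
            J[k][x], mu[k][x] = best_val, best_u
    return J, mu

if __name__ == "__main__":
    # Tiny inventory instance (assumed data): stock clipped to [-3, 4].
    states = list(range(-3, 5))
    controls = lambda x: [u for u in range(0, 5) if x + u <= 4]
    f = lambda k, x, u, w: max(-3, min(4, x + u - w))
    g = lambda k, x, u, w: u + 3 * max(0, -(x + u - w)) + max(0, x + u - w)
    w_dist = lambda k, x, u: [(0, 1 / 3), (1, 1 / 3), (2, 1 / 3)]
    J, mu = dp_algorithm(states, controls, f, g, lambda x: 0.0, w_dist, N=5)
    print("J0(0) =", round(J[0][0], 2), " mu0(0) =", mu[0][0])
```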

PROOF OF THE INDUCTION STEP

• Let πk = {µk, µk+1, . . . , µN−1} denote a tail policy from time k onward

• Assume that Jk+1(xk+1) = J∗k+1(xk+1). Then

J∗k(xk) = min_{(µk, πk+1)} E_{wk,...,wN−1}{ gk(xk, µk(xk), wk) + gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) }

= min_{µk} E_{wk}{ gk(xk, µk(xk), wk) + min_{πk+1} [ E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) } ] }

= min_{µk} E_{wk}{ gk(xk, µk(xk), wk) + J∗k+1( fk(xk, µk(xk), wk) ) }

= min_{µk} E_{wk}{ gk(xk, µk(xk), wk) + Jk+1( fk(xk, µk(xk), wk) ) }

= min_{uk ∈ Uk(xk)} E_{wk}{ gk(xk, uk, wk) + Jk+1( fk(xk, uk, wk) ) }

= Jk(xk)

LINEAR-QUADRATIC ANALYTICAL EXAMPLE

[Figure: Material with initial temperature x0 passes through Oven 1 (temperature u0), emerging with temperature x1, and then through Oven 2 (temperature u1), emerging with final temperature x2.]

• System

xk+1 = (1 − a)xk + auk, k = 0, 1,

where a is a given scalar from the interval (0, 1)

• Cost

r(x2 − T)² + u0² + u1²

where r is a given positive scalar

• DP Algorithm:

J2(x2) = r(x2 − T)²

J1(x1) = min_{u1} [ u1² + r( (1 − a)x1 + a u1 − T )² ]

J0(x0) = min_{u0} [ u0² + J1( (1 − a)x0 + a u0 ) ]

(a numerical sketch of this recursion is given below)
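A minimal numerical check of the recursion, under assumed values of a, r, T and an illustrative grid of oven temperatures (none of the numbers are from the slides); J1 and J0 are evaluated by direct minimization over the grid.

```python
import numpy as np

a, r, T = 0.7, 10.0, 100.0
u_grid = np.linspace(0.0, 200.0, 2001)        # candidate oven temperatures

def J2(x2):
    return r * (x2 - T) ** 2                  # J2(x2) = r (x2 - T)^2

def J1(x1):
    # J1(x1) = min_u1 [ u1^2 + r((1 - a) x1 + a u1 - T)^2 ]
    return np.min(u_grid ** 2 + J2((1 - a) * x1 + a * u_grid))

def J0(x0):
    # J0(x0) = min_u0 [ u0^2 + J1((1 - a) x0 + a u0) ]
    x1_vals = (1 - a) * x0 + a * u_grid
    return np.min(u_grid ** 2 + np.array([J1(x1) for x1 in x1_vals]))

print("J0(20) ≈", round(float(J0(20.0)), 2))
```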

STATE AUGMENTATION

• When assumptions of the basic problem are violated (e.g., disturbances are correlated, cost is nonadditive, etc.), reformulate/augment the state

• Example: Time lags

xk+1 = fk(xk, xk−1, uk, wk)

• Introduce the additional state variable yk = xk−1. The new system takes the form

( xk+1, yk+1 ) = ( fk(xk, yk, uk, wk), xk )

View x̃k = (xk, yk) as the new state.

• DP algorithm for the reformulated problem:

Jk(xk, xk−1) = min_{uk ∈ Uk(xk)} E_{wk}{ gk(xk, uk, wk) + Jk+1( fk(xk, xk−1, uk, wk), xk ) }

6.231 DYNAMIC PROGRAMMING

LECTURE 3

LECTURE OUTLINE

• Deterministic finite-state DP problems

• Backward shortest path algorithm

• Forward shortest path algorithm

• Shortest path examples

• Alternative shortest path algorithms

DETERMINISTIC FINITE-STATE PROBLEM

[Figure: A deterministic finite-state problem viewed as a graph: the nodes are arranged in stages 0, 1, 2, . . . , N − 1, N, starting from the initial state s; terminal arcs with cost equal to the terminal cost lead to an artificial terminal node t.]

• States <==> Nodes

• Controls <==> Arcs

• Control sequences (open-loop) <==> paths from initial state to terminal states

• a^k_ij: Cost of transition from state i ∈ Sk to state j ∈ Sk+1 at time k (view it as “length” of the arc)

• a^N_it: Terminal cost of state i ∈ SN

• Cost of control sequence <==> Cost of the corresponding path (view it as “length” of the path)

BACKWARD AND FORWARD DP ALGORITHMS

• DP algorithm:

JN(i) = a^N_it, i ∈ SN,

Jk(i) = min_{j ∈ Sk+1} [ a^k_ij + Jk+1(j) ], i ∈ Sk, k = 0, . . . , N − 1

The optimal cost is J0(s) and is equal to the length of the shortest path from s to t

• Observation: An optimal path s → t is also an optimal path t → s in a “reverse” shortest path problem where the direction of each arc is reversed and its length is left unchanged

• Forward DP algorithm (= backward DP algorithm for the reverse problem):

J̃N(j) = a^0_sj, j ∈ S1,

J̃k(j) = min_{i ∈ SN−k} [ a^{N−k}_ij + J̃k+1(i) ], j ∈ SN−k+1

The optimal cost is J̃0(t) = min_{i ∈ SN} [ a^N_it + J̃1(i) ]

• View J̃k(j) as the optimal cost-to-arrive to state j from the initial state s (a sketch of the backward algorithm on a staged graph is given below)
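A minimal sketch of the backward algorithm on a staged graph (the arc costs in the usage example are assumptions): arc_cost[k][i][j] plays the role of a^k_ij and terminal_cost[i] of a^N_it.

```python
def backward_dp(arc_cost, terminal_cost):
    """arc_cost[k][i][j]: cost of arc i -> j at stage k; returns J_0 and the policy."""
    N = len(arc_cost)
    J = dict(terminal_cost)                   # J_N(i) = a^N_it
    policy = []
    for k in range(N - 1, -1, -1):            # J_k(i) = min_j [ a^k_ij + J_{k+1}(j) ]
        Jk, muk = {}, {}
        for i, succ in arc_cost[k].items():
            j_best = min(succ, key=lambda j: succ[j] + J[j])
            Jk[i], muk[i] = succ[j_best] + J[j_best], j_best
        J, policy = Jk, [muk] + policy
    return J, policy

if __name__ == "__main__":
    # Two-stage example: s -> {a, b} -> t; shortest path is s -> b -> t with length 5.
    arc_cost = [{"s": {"a": 1.0, "b": 4.0}},
                {"a": {"t": 5.0}, "b": {"t": 1.0}}]
    J0, policy = backward_dp(arc_cost, {"t": 0.0})
    print(J0["s"], policy[0]["s"])            # 5.0 b
```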

A NOTE ON FORWARD DP ALGORITHMS

• There is no forward DP algorithm for stochastic problems

• Mathematically, for stochastic problems, we cannot restrict ourselves to open-loop sequences, so the shortest path viewpoint fails

• Conceptually, in the presence of uncertainty, the concept of “optimal cost-to-arrive” at a state xk does not make sense. For example, it may be impossible to guarantee (with prob. 1) that any given state can be reached

• By contrast, even in stochastic problems, the concept of “optimal cost-to-go” from any state xk makes clear sense

GENERIC SHORTEST PATH PROBLEMS

• {1, 2, . . . , N, t}: nodes of a graph (t: the destination)

• aij: cost of moving from node i to node j

• Find a shortest (minimum cost) path from each node i to node t

• Assumption: All cycles have nonnegative length. Then an optimal path need not take more than N moves

• We formulate the problem as one where we require exactly N moves but allow degenerate moves from a node i to itself with cost aii = 0

Jk(i) = optimal cost of getting from i to t in N−k moves

J0(i): Cost of the optimal path from i to t.

• DP algorithm:

Jk(i) = min_{j=1,...,N} [ aij + Jk+1(j) ], k = 0, 1, . . . , N − 2,

with JN−1(i) = ait, i = 1, 2, . . . , N
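A minimal sketch of this recursion under assumed data: a dense cost matrix a[i][j] with aii = 0 for the degenerate self-moves, and a separate vector of arc costs into t.

```python
import math

def shortest_paths_to_t(a, a_to_t):
    """a[i][j]: cost i -> j among nodes 0..N-1; a_to_t[i]: cost of arc i -> t."""
    N = len(a)
    J = list(a_to_t)                                  # J_{N-1}(i) = a_it
    for _ in range(N - 1):                            # k = N-2, ..., 0
        J = [min(a[i][j] + J[j] for j in range(N)) for i in range(N)]
    return J                                          # J_0(i): shortest length i -> t

if __name__ == "__main__":
    INF = math.inf
    a = [[0, 2, INF],
         [INF, 0, 1],
         [INF, INF, 0]]
    print(shortest_paths_to_t(a, [INF, 4, 1]))        # [4, 2, 1]: e.g. 0 -> 1 -> 2 -> t
```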

EXAMPLE

[Figure: (a) A graph with nodes 1, . . . , 4 and destination 5, with the arc lengths shown; (b) the table of cost-to-go values Jk(i) produced by the DP recursion over stages k = 0, 1, 2, 3, 4.]

JN−1(i) = ait, i = 1, 2, . . . , N,

Jk(i) = min_{j=1,...,N} [ aij + Jk+1(j) ], k = 0, 1, . . . , N − 2.

ESTIMATION / HIDDEN MARKOV MODELS

• Markov chain with transition probabilities pij

• State transitions are hidden from view

• For each transition, we get an (independent) observation

• r(z; i, j): Prob. the observation takes value z when the state transition is from i to j

• Trajectory estimation problem: Given the observation sequence ZN = {z1, z2, . . . , zN}, what is the “most likely” state transition sequence XN = {x0, x1, . . . , xN} [one that maximizes p(XN | ZN) over all XN = {x0, x1, . . . , xN}].

[Figure: Trellis of the state sequence s, x0, x1, x2, . . . , xN−1, xN, t used to view trajectory estimation as a shortest path problem.]

VITERBI ALGORITHM

• We have

p(XN | ZN) = p(XN, ZN) / p(ZN)

where p(XN, ZN) and p(ZN) are the unconditional probabilities of occurrence of (XN, ZN) and ZN

• Maximizing p(XN | ZN) is equivalent to maximizing ln( p(XN, ZN) )

• We have

p(XN, ZN) = πx0 ∏_{k=1}^{N} p_{xk−1 xk} r(zk; xk−1, xk)

so the problem is equivalent to

minimize − ln(πx0) − Σ_{k=1}^{N} ln( p_{xk−1 xk} r(zk; xk−1, xk) )

over all possible sequences {x0, x1, . . . , xN}.

• This is a shortest path problem.
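A minimal Viterbi sketch along these lines, where the most likely trajectory is found by minimizing the sum of arc lengths −ln( p_{xk−1 xk} r(zk; xk−1, xk) ); the transition matrix, initial distribution, and observation model in the example are illustrative assumptions.

```python
import math

def safe_neglog(p):
    return -math.log(p) if p > 0 else math.inf

def viterbi(pi, P, r, Z):
    """pi[i]: initial probs; P[i][j]: transition probs; r(z, i, j): observation prob."""
    n = len(pi)
    D = [safe_neglog(pi[i]) for i in range(n)]        # shortest length of a path ending at x_0 = i
    parents = []
    for z in Z:                                        # one stage per observation z_k
        newD, par = [], []
        for j in range(n):
            cand = [D[i] + safe_neglog(P[i][j] * r(z, i, j)) for i in range(n)]
            best_i = min(range(n), key=lambda i: cand[i])
            par.append(best_i)
            newD.append(cand[best_i])
        D, parents = newD, parents + [par]
    x = [min(range(n), key=lambda i: D[i])]            # most likely final state x_N
    for par in reversed(parents):                      # backtrack x_{N-1}, ..., x_0
        x.insert(0, par[x[0]])
    return x

if __name__ == "__main__":
    pi = [0.6, 0.4]
    P = [[0.7, 0.3], [0.4, 0.6]]
    r = lambda z, i, j: [[0.9, 0.1], [0.2, 0.8]][j][z]  # assumed: depends only on j
    print(viterbi(pi, P, r, [0, 0, 1, 1]))              # [0, 0, 0, 1, 1]
```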

GENERAL SHORTEST PATH ALGORITHMS

• There are many non-DP shortest path algorithms. They can all be used to solve deterministic finite-state problems

• They may be preferable to DP if they avoid calculating the optimal cost-to-go of EVERY state

• This is essential for problems with HUGE state spaces. Such problems arise for example in combinatorial optimization

[Figure: Shortest path formulation of a 4-city traveling salesman problem. The origin node s = A is connected to the partial tours AB, AC, AD, these to ABC, ABD, ACB, ACD, ADB, ADC, and these to the complete tours ABCD, ABDC, ACBD, ACDB, ADBC, ADCB, which are connected to an artificial terminal node t; the arc lengths (1, 3, 4, 5, 15, 20, . . .) are the intercity travel costs.]

LABEL CORRECTING METHODS

• Given: Origin s, destination t, lengths aij ≥ 0.

• Idea is to progressively discover shorter paths from the origin s to every other node i

• Notation:

− di (label of i): Length of the shortest path found (initially ds = 0, di = ∞ for i ≠ s)

− UPPER: The label dt of the destination

− OPEN list: Contains nodes that are currently active in the sense that they are candidates for further examination (initially OPEN = {s})

Label Correcting Algorithm

Step 1 (Node Removal): Remove a node i from OPEN and for each child j of i, do step 2

Step 2 (Node Insertion Test): If di + aij < min{dj, UPPER}, set dj = di + aij and set i to be the parent of j. In addition, if j ≠ t, place j in OPEN if it is not already in OPEN, while if j = t, set UPPER to the new value di + ait of dt

Step 3 (Termination Test): If OPEN is empty, terminate; else go to step 1
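A minimal implementation sketch of these three steps (the toy graph in the usage example is an assumption). OPEN is kept as a plain list; popping from the end gives a depth-first flavor.

```python
import math

def label_correcting(graph, s, t):
    """graph[i]: dict of children j with arc lengths a_ij >= 0."""
    d = {i: math.inf for i in graph}
    d[s], parent, UPPER = 0.0, {}, math.inf
    OPEN = [s]
    while OPEN:
        i = OPEN.pop()                                  # Step 1: node removal
        for j, a_ij in graph[i].items():
            if d[i] + a_ij < min(d[j], UPPER):          # Step 2: insertion test
                d[j] = d[i] + a_ij
                parent[j] = i
                if j == t:
                    UPPER = d[j]                        # improved s -> t path
                elif j not in OPEN:
                    OPEN.append(j)
    return UPPER, parent                                # Step 3: OPEN empty

if __name__ == "__main__":
    graph = {"s": {"a": 1, "b": 4}, "a": {"b": 1, "t": 6},
             "b": {"t": 1}, "t": {}}
    print(label_correcting(graph, "s", "t")[0])          # 3.0 (s -> a -> b -> t)
```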

VISUALIZATION/EXPLANATION

• Given: Origin s, destination t, lengths aij ≥ 0

• di (label of i): Length of the shortest path found thus far (initially ds = 0, di = ∞ for i ≠ s). The label di is implicitly associated with an s → i path

• UPPER: The label dt of the destination

• OPEN list: Contains “active” nodes (initially OPEN = {s})

[Figure: Flowchart of the iteration. Remove node i from OPEN; for each child j, test whether di + aij < dj (is the path s → i → j better than the current path s → j?) and whether di + aij < UPPER (does the path s → i → j have a chance to be part of a shorter s → t path?); if both hold, set dj = di + aij and insert j into OPEN.]

EXAMPLE

[Figure: The traveling salesman example graph again (origin node s = A, artificial terminal node t, same arc lengths), with the nodes numbered 1 to 10 for reference in the table below.]

Iter. No.   Node Exiting OPEN   OPEN after Iteration   UPPER
0           -                   1                      ∞
1           1                   2, 7, 10               ∞
2           2                   3, 5, 7, 10            ∞
3           3                   4, 5, 7, 10            ∞
4           4                   5, 7, 10               43
5           5                   6, 7, 10               43
6           6                   7, 10                  13
7           7                   8, 10                  13
8           8                   9, 10                  13
9           9                   10                     13
10          10                  Empty                  13

• Note that some nodes never entered OPEN

6.231 DYNAMIC PROGRAMMING

LECTURE 4

LECTURE OUTLINE

• Label correcting methods for shortest paths

• Variants of label correcting methods

• Branch-and-bound as a shortest path algorithm

LABEL CORRECTING METHODS

• Origin s, destination t, lengths aij that are ≥ 0

• di (label of i): Length of the shortest path found thus far (initially di = ∞ except ds = 0). The label di is implicitly associated with an s → i path

• UPPER: Label dt of the destination

• OPEN list: Contains “active” nodes (initially OPEN = {s})

[Figure: Flowchart of the iteration, as in Lecture 3: remove node i from OPEN; for each child j, if di + aij < dj and di + aij < UPPER, set dj = di + aij and insert j into OPEN.]

VALIDITY OF LABEL CORRECTING METHODS

Proposition: If there exists at least one path from the origin to the destination, the label correcting algorithm terminates with UPPER equal to the shortest distance from the origin to the destination

Proof: (1) Each time a node j enters OPEN, its label is decreased and becomes equal to the length of some path from s to j

(2) The number of possible distinct path lengths is finite, so the number of times a node can enter OPEN is finite, and the algorithm terminates

(3) Let (s, j1, j2, . . . , jk, t) be a shortest path and let d∗ be the shortest distance. If UPPER > d∗ at termination, UPPER will also be larger than the length of all the paths (s, j1, . . . , jm), m = 1, . . . , k, throughout the algorithm. Hence, node jk will never enter the OPEN list with djk equal to the shortest distance from s to jk. Similarly node jk−1 will never enter the OPEN list with djk−1 equal to the shortest distance from s to jk−1. Continue to j1 to get a contradiction

MAKING THE METHOD EFFICIENT

• Reduce the value of UPPER as quickly as possible

− Try to discover “good” s → t paths early in the course of the algorithm

• Keep the number of reentries into OPEN low

− Try to remove from OPEN nodes with small label first.

− Heuristic rationale: if di is small, then dj when set to di + aij will be accordingly small, so reentrance of j in the OPEN list is less likely

• Reduce the overhead for selecting the node to be removed from OPEN

• These objectives are often in conflict. They give rise to a large variety of distinct implementations

• Good practical strategies try to strike a compromise between low overhead and small label node selection

NODE SELECTION METHODS

• Depth-first search: Remove from the top of OPEN and insert at the top of OPEN.

− Has low memory storage properties (OPEN is not too long). Reduces UPPER quickly.

[Figure: Depth-first search order on a tree rooted at the origin node s with destination node t; the nodes are numbered 1 to 13 in the order in which they are removed from OPEN.]

• Best-first search (Dijkstra): Remove from OPEN a node with minimum value of label.

− Interesting property: Each node will be inserted in OPEN at most once.

− Nodes enter OPEN at minimum distance

− Many implementations/approximations

ADVANCED INITIALIZATION

• Instead of starting from di = ∞ for all i ≠ s, start with

di = length of some path from s to i (or di = ∞)

OPEN = {i ≠ t | di < ∞}

• Motivation: Get a small starting value of UPPER.

• No node with shortest distance ≥ initial value of UPPER will enter OPEN

• Good practical idea:

− Run a heuristic (or use common sense) to get a “good” starting path P from s to t

− Use as UPPER the length of P, and as di the path distances of all nodes i along P

• Very useful also in reoptimization, where we solve the same problem with slightly different data

VARIANTS OF LABEL CORRECTING METHODS

• If a lower bound hj of the true shortest distance from j to t is known, use the test

di + aij + hj < UPPER

for entry into OPEN, instead of

di + aij < UPPER

The label correcting method with lower bounds as above is often referred to as the A∗ method.

• If an upper bound mj of the true shortest distance from j to t is known, then if dj + mj < UPPER, reduce UPPER to dj + mj.

• Important use: Branch-and-bound algorithm for discrete optimization can be viewed as an implementation of this last variant.

BRANCH-AND-BOUND METHOD

• Problem: Minimize f(x) over a finite set of feasible solutions X.

• Idea of branch-and-bound: Partition the feasible set into smaller subsets, and then calculate certain bounds on the attainable cost within some of the subsets to eliminate from further consideration other subsets.

Bounding Principle

Given two subsets Y1 ⊂ X and Y2 ⊂ X, suppose that we have bounds

f̲1 ≤ min_{x ∈ Y1} f(x),   f̄2 ≥ min_{x ∈ Y2} f(x).

Then, if f̄2 ≤ f̲1, the solutions in Y1 may be disregarded since their cost cannot be smaller than the cost of the best solution in Y2.

• The B+B algorithm can be viewed as a label correcting algorithm, where lower bounds define the arc costs, and upper bounds are used to strengthen the test for admission to OPEN.

SHORTEST PATH IMPLEMENTATION

• Acyclic graph/partition of X into subsets (typically a tree). The leaves consist of single solutions.

• Upper and lower bounds f̄Y and f̲Y for the minimum cost over each subset Y can be calculated.

• The lower bound of a leaf {x} is f(x)

• Each arc (Y, Z) has length f̲Z − f̲Y

• Shortest distance from X to Y = f̲Y − f̲X

• Distance from origin X to a leaf {x} is f(x) − f̲X

• Shortest path from X to the set of leaves gives the optimal cost and optimal solution

• UPPER is the smallest f(x) − f̲X out of leaf nodes {x} examined so far

[Figure: Partition tree for X = {1, 2, 3, 4, 5}: the root {1, 2, 3, 4, 5} is split into {1, 2, 3} and {4, 5}; {1, 2, 3} into {1, 2} and {3}; {1, 2} into {1} and {2}; and {4, 5} into {4} and {5}.]

BRANCH-AND-BOUND ALGORITHM

Step 1: Remove a node Y from OPEN. For each child Yj of Y, do the following:

− Entry Test: If f̲Yj < UPPER, place Yj in OPEN.

− Update UPPER: If f̄Yj < UPPER, set UPPER = f̄Yj, and if Yj consists of a single solution, mark that as being the best solution found so far

Step 2: (Termination Test) If OPEN is empty, terminate; the best solution found so far is optimal. Else go to Step 1

• It is neither practical nor necessary to generate a priori the acyclic graph (generate it as you go)

• Keys to branch-and-bound:

− Generate as sharp as possible upper and lower bounds at each node

− Have a good partitioning and node selection strategy

• Method involves a lot of art, may be prohibitively time-consuming ... but guaranteed to find an optimal solution

6.231 DYNAMIC PROGRAMMING

LECTURE 5

LECTURE OUTLINE

• Deterministic continuous-time optimal control

• Examples

• Connection with the calculus of variations

• The Hamilton-Jacobi-Bellman equation as a continuous-time limit of the DP algorithm

• The Hamilton-Jacobi-Bellman equation as a sufficient condition

• Examples

PROBLEM FORMULATION

• Continuous-time dynamic system:

ẋ(t) = f( x(t), u(t) ), 0 ≤ t ≤ T, x(0): given,

where

− x(t) ∈ ℝⁿ: state vector at time t

− u(t) ∈ U ⊂ ℝᵐ: control vector at time t

− U: control constraint set

− T: terminal time

• Admissible control trajectories {u(t) | t ∈ [0, T]}: piecewise continuous functions {u(t) | t ∈ [0, T]} with u(t) ∈ U for all t ∈ [0, T]; they uniquely determine {x(t) | t ∈ [0, T]}

• Problem: Find an admissible control trajectory {u(t) | t ∈ [0, T]} and corresponding state trajectory {x(t) | t ∈ [0, T]} that minimizes the cost

h( x(T) ) + ∫₀ᵀ g( x(t), u(t) ) dt

• f, h, g are assumed continuously differentiable

EXAMPLE I

• Motion control: A unit mass moves on a line under the influence of a force u

• x(t) = ( x1(t), x2(t) ): position and velocity of the mass at time t

• Problem: From a given ( x1(0), x2(0) ), bring the mass “near” a given final position-velocity pair (x̄1, x̄2) at time T in the sense:

minimize |x1(T) − x̄1|² + |x2(T) − x̄2|²

subject to the control constraint

|u(t)| ≤ 1, for all t ∈ [0, T]

• The problem fits the framework with

ẋ1(t) = x2(t),   ẋ2(t) = u(t),

h( x(T) ) = |x1(T) − x̄1|² + |x2(T) − x̄2|²,

g( x(t), u(t) ) = 0, for all t ∈ [0, T]

EXAMPLE II

• A producer with production rate x(t) at time t may allocate a portion u(t) of his/her production rate to reinvestment and 1 − u(t) to production of a storable good. Thus x(t) evolves according to

ẋ(t) = γ u(t) x(t),

where γ > 0 is a given constant

• The producer wants to maximize the total amount of product stored

∫₀ᵀ ( 1 − u(t) ) x(t) dt

subject to

0 ≤ u(t) ≤ 1, for all t ∈ [0, T]

• The initial production rate x(0) is a given positive number

EXAMPLE III (CALCULUS OF VARIATIONS)

[Figure: A curve x(t), 0 ≤ t ≤ T, from a given point (0, α) to a given line {t = T}, with slope ẋ(t) = u(t) and length ∫₀ᵀ √( 1 + (u(t))² ) dt.]

• Find a curve from a given point to a given line that has minimum length

• The problem is

minimize ∫₀ᵀ √( 1 + (ẋ(t))² ) dt

subject to x(0) = α

• Reformulation as an optimal control problem:

minimize ∫₀ᵀ √( 1 + (u(t))² ) dt

subject to ẋ(t) = u(t), x(0) = α

HAMILTON-JACOBI-BELLMAN EQUATION I

• We discretize [0, T] at times 0, δ, 2δ, . . . , Nδ, where δ = T/N, and we let

xk = x(kδ), uk = u(kδ), k = 0, 1, . . . , N

• We also discretize the system and cost:

xk+1 = xk + f(xk, uk)·δ,   h(xN) + Σ_{k=0}^{N−1} g(xk, uk)·δ

• We write the DP algorithm for the discretized problem

J∗(Nδ, x) = h(x),

J∗(kδ, x) = min_{u ∈ U} [ g(x, u)·δ + J∗( (k+1)·δ, x + f(x, u)·δ ) ].

• Assume J∗ is differentiable and Taylor-expand:

J∗(kδ, x) = min_{u ∈ U} [ g(x, u)·δ + J∗(kδ, x) + ∇t J∗(kδ, x)·δ + ∇x J∗(kδ, x)′ f(x, u)·δ + o(δ) ]

• Cancel J∗(kδ, x), divide by δ, and take the limit

HAMILTON-JACOBI-BELLMAN EQUATION II

• Let J∗(t, x) be the optimal cost-to-go of the continuous problem. Assuming the limit is valid

lim_{k→∞, δ→0, kδ=t} J∗(kδ, x) = J∗(t, x), for all t, x,

we obtain for all t, x,

0 = min_{u ∈ U} [ g(x, u) + ∇t J∗(t, x) + ∇x J∗(t, x)′ f(x, u) ]

with the boundary condition J∗(T, x) = h(x)

• This is the Hamilton-Jacobi-Bellman (HJB) equation – a partial differential equation, which is satisfied for all time-state pairs (t, x) by the cost-to-go function J∗(t, x) (assuming J∗ is differentiable and the preceding informal limiting procedure is valid)

• Hard to tell a priori if J∗(t, x) is differentiable

• So we use the HJB Eq. as a verification tool; if we can solve it for a differentiable J∗(t, x), then:

− J∗ is the optimal cost-to-go function

− The control µ∗(t, x) that minimizes in the RHS for each (t, x) defines an optimal control

VERIFICATION/SUFFICIENCY THEOREM

• Suppose V(t, x) is a solution to the HJB equation; that is, V is continuously differentiable in t and x, and is such that for all t, x,

0 = min_{u ∈ U} [ g(x, u) + ∇t V(t, x) + ∇x V(t, x)′ f(x, u) ],

V(T, x) = h(x), for all x

• Suppose also that µ∗(t, x) attains the minimum above for all t and x

• Let {x∗(t) | t ∈ [0, T]} and u∗(t) = µ∗(t, x∗(t)), t ∈ [0, T], be the corresponding state and control trajectories

• Then

V(t, x) = J∗(t, x), for all t, x,

and {u∗(t) | t ∈ [0, T]} is optimal

PROOF

Let {(û(t), x̂(t)) | t ∈ [0, T]} be any admissible control-state trajectory. We have for all t ∈ [0, T]

0 ≤ g( x̂(t), û(t) ) + ∇t V( t, x̂(t) ) + ∇x V( t, x̂(t) )′ f( x̂(t), û(t) ).

Using the system equation dx̂(t)/dt = f( x̂(t), û(t) ), the RHS of the above is equal to

g( x̂(t), û(t) ) + (d/dt) V( t, x̂(t) )

Integrating this expression over t ∈ [0, T],

0 ≤ ∫₀ᵀ g( x̂(t), û(t) ) dt + V( T, x̂(T) ) − V( 0, x̂(0) ).

Using V(T, x) = h(x) and x̂(0) = x(0), we have

V( 0, x(0) ) ≤ h( x̂(T) ) + ∫₀ᵀ g( x̂(t), û(t) ) dt.

If we use u∗(t) and x∗(t) in place of û(t) and x̂(t), the inequalities become equalities, and

V( 0, x(0) ) = h( x∗(T) ) + ∫₀ᵀ g( x∗(t), u∗(t) ) dt

EXAMPLE OF THE HJB EQUATION

Consider the scalar system ẋ(t) = u(t), with |u(t)| ≤ 1 and cost (1/2)( x(T) )². The HJB equation is

0 = min_{|u| ≤ 1} [ ∇t V(t, x) + ∇x V(t, x) u ], for all t, x,

with the terminal condition V(T, x) = (1/2)x²

• Evident candidate for optimality: µ∗(t, x) = −sgn(x). Corresponding cost-to-go

J∗(t, x) = (1/2)( max{0, |x| − (T − t)} )².

• We verify that J∗ solves the HJB Eq., and that u = −sgn(x) attains the min in the RHS. Indeed,

∇t J∗(t, x) = max{0, |x| − (T − t)},

∇x J∗(t, x) = sgn(x) · max{0, |x| − (T − t)}.

Substituting, the HJB Eq. becomes

0 = min_{|u| ≤ 1} [ 1 + sgn(x) · u ] max{0, |x| − (T − t)}

LINEAR QUADRATIC PROBLEM

Consider the n-dimensional linear system

ẋ(t) = Ax(t) + Bu(t),

and the quadratic cost

x(T)′ QT x(T) + ∫₀ᵀ ( x(t)′Q x(t) + u(t)′R u(t) ) dt

The HJB equation is

0 = min_{u ∈ ℝᵐ} [ x′Qx + u′Ru + ∇t V(t, x) + ∇x V(t, x)′ (Ax + Bu) ],

with the terminal condition V(T, x) = x′ QT x. We try a solution of the form

V(t, x) = x′ K(t) x,   K(t): n × n symmetric,

and show that V(t, x) solves the HJB equation if

K̇(t) = −K(t)A − A′K(t) + K(t)BR⁻¹B′K(t) − Q

with the terminal condition K(T) = QT

6.231 DYNAMIC PROGRAMMING

LECTURE 6

LECTURE OUTLINE

• Examples of stochastic DP problems

• Linear-quadratic problems

• Inventory control

LINEAR-QUADRATIC PROBLEMS

• System: xk+1 = Akxk + Bkuk + wk

• Quadratic cost

E_{wk, k=0,1,...,N−1}{ x′N QN xN + Σ_{k=0}^{N−1} ( x′k Qk xk + u′k Rk uk ) }

where Qk ≥ 0 and Rk > 0 (in the positive (semi)definite sense).

• wk are independent and zero mean

• DP algorithm:

JN(xN) = x′N QN xN,

Jk(xk) = min_{uk} E{ x′k Qk xk + u′k Rk uk + Jk+1(Ak xk + Bk uk + wk) }

• Key facts:

− Jk(xk) is quadratic

− Optimal policy {µ∗0, . . . , µ∗N−1} is linear: µ∗k(xk) = Lk xk

− Similar treatment of a number of variants

DERIVATION

• By induction verify that

µ∗k(xk) = Lk xk,   Jk(xk) = x′k Kk xk + constant,

where Lk are matrices given by

Lk = −(B′k Kk+1 Bk + Rk)⁻¹ B′k Kk+1 Ak,

and where Kk are symmetric positive semidefinite matrices given by

KN = QN,

Kk = A′k( Kk+1 − Kk+1 Bk (B′k Kk+1 Bk + Rk)⁻¹ B′k Kk+1 )Ak + Qk.

• This is called the discrete-time Riccati equation.

• Just like DP, it starts at the terminal time N and proceeds backwards (a numerical sketch follows below).

• Certainty equivalence holds (the optimal policy is the same as when wk is replaced by its expected value E{wk} = 0).
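A minimal numerical sketch of this recursion; the matrices A, B, Q, R, QN and the horizon below are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def riccati_recursion(A, B, Q, R, QN, N):
    """Run the discrete-time Riccati recursion backwards and collect the gains L_k."""
    K = QN
    gains = []
    for _ in range(N):
        # L_k = -(B'K_{k+1}B + R)^{-1} B'K_{k+1}A
        L = -np.linalg.solve(B.T @ K @ B + R, B.T @ K @ A)
        # K_k = A'(K_{k+1} - K_{k+1}B(B'K_{k+1}B + R)^{-1}B'K_{k+1})A + Q
        K = A.T @ (K - K @ B @ np.linalg.solve(B.T @ K @ B + R, B.T @ K)) @ A + Q
        gains.insert(0, L)           # gains[k] = L_k, k = 0, ..., N-1
    return K, gains                  # K is K_0

if __name__ == "__main__":
    A = np.array([[1.0, 1.0], [0.0, 1.0]])   # assumed double-integrator dynamics
    B = np.array([[0.0], [1.0]])
    Q, R, QN = np.eye(2), np.array([[1.0]]), np.eye(2)
    K0, gains = riccati_recursion(A, B, Q, R, QN, N=20)
    print("L_0 =", np.round(gains[0], 3))
```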

ASYMPTOTIC BEHAVIOR OF RICCATI EQUATION

• Assume a time-independent system and cost per stage, and some technical assumptions: controllability of (A, B) and observability of (A, C) where Q = C′C

• The Riccati equation converges: lim_{k→−∞} Kk = K, where K is pos. definite, and is the unique (within the class of pos. semidefinite matrices) solution of the algebraic Riccati equation

K = A′( K − KB(B′KB + R)⁻¹B′K )A + Q

• The corresponding steady-state controller µ∗(x) = Lx, where

L = −(B′KB + R)⁻¹B′KA,

is stable in the sense that the matrix (A + BL) of the closed-loop system

xk+1 = (A + BL)xk + wk

satisfies lim_{k→∞}(A + BL)^k = 0.

GRAPHICAL PROOF FOR SCALAR SYSTEMS

[Figure: Graph of F(P) against P together with the 45° line: F(0) = Q, F(P) → A²R/B² + Q as P → ∞, and there is a vertical asymptote at P = −R/B²; the iterates Pk, Pk+1 converge to the positive fixed point P∗.]

• Riccati equation (with Pk = KN−k):

Pk+1 = A²( Pk − B²Pk² / (B²Pk + R) ) + Q,

or Pk+1 = F(Pk), where

F(P) = A²RP / (B²P + R) + Q.

• Note the two steady-state solutions, satisfying P = F(P), of which only one is positive.

RANDOM SYSTEM MATRICES

• Suppose that {A0, B0}, . . . , {AN−1, BN−1} are not known but rather are independent random matrices that are also independent of the wk

• DP algorithm is

JN(xN) = x′N QN xN,

Jk(xk) = min_{uk} E_{wk, Ak, Bk}{ x′k Qk xk + u′k Rk uk + Jk+1(Ak xk + Bk uk + wk) }

• Optimal policy µ∗k(xk) = Lk xk, where

Lk = −( Rk + E{B′k Kk+1 Bk} )⁻¹ E{B′k Kk+1 Ak},

and where the matrices Kk are given by

KN = QN,

Kk = E{A′k Kk+1 Ak} − E{A′k Kk+1 Bk}( Rk + E{B′k Kk+1 Bk} )⁻¹ E{B′k Kk+1 Ak} + Qk

PROPERTIES

• Certainty equivalence may not hold

• Riccati equation may not converge to a steady-state

[Figure: Graph of F(P) against P for the random-matrices case (vertical asymptote at P = −R/E{B²}); F grows without bound, so the Riccati iteration Pk+1 = F(Pk) may fail to converge.]

• We have Pk+1 = F(Pk), where

F(P) = E{A²}RP / ( E{B²}P + R ) + Q + TP² / ( E{B²}P + R ),

T = E{A²}E{B²} − (E{A})²(E{B})²

INVENTORY CONTROL

• xk: stock, uk: inventory purchased, wk: demand

xk+1 = xk + uk − wk, k = 0, 1, . . . , N − 1

• Minimize

E{ Σ_{k=0}^{N−1} ( c uk + r(xk + uk − wk) ) }

where, for some p > 0 and h > 0,

r(x) = p max(0, −x) + h max(0, x)

• DP algorithm:

JN(xN) = 0,

Jk(xk) = min_{uk ≥ 0} [ c uk + H(xk + uk) + E{ Jk+1(xk + uk − wk) } ],

where H(x + u) = E{ r(x + u − w) }.

OPTIMAL POLICY

• DP algorithm can be written as

JN(xN) = 0,

Jk(xk) = min_{uk ≥ 0} Gk(xk + uk) − c xk,

where

Gk(y) = cy + H(y) + E{ Jk+1(y − w) }.

• If Gk is convex and lim_{|x|→∞} Gk(x) → ∞, we have

µ∗k(xk) = Sk − xk if xk < Sk,   µ∗k(xk) = 0 if xk ≥ Sk,

where Sk minimizes Gk(y).

• This is shown, assuming that c < p, by showing that Jk is convex for all k, and

lim_{|x|→∞} Jk(x) → ∞

(a numerical sketch of the base-stock levels Sk is given below)
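A minimal numerical sketch of this structure on a finite grid of stock levels; the cost parameters, demand distribution, grid, and the clamping of the state at the grid boundary are illustrative assumptions.

```python
# Compute G_k(y) = c y + H(y) + E{ J_{k+1}(y - w) } on a grid and read off
# S_k = argmin G_k; then J_k(x) = min_{y >= x} G_k(y) - c x.

c, p, h, N = 1.0, 3.0, 1.0, 10
demand = [(0, 0.25), (1, 0.5), (2, 0.25)]            # assumed (w, probability)
Y = list(range(-10, 11))                             # grid of post-order stock levels y = x + u

def r(x):
    return p * max(0, -x) + h * max(0, x)

def H(y):
    return sum(prob * r(y - w) for w, prob in demand)

J_next = {x: 0.0 for x in range(-15, 16)}            # J_N = 0 (on a slightly larger grid)
for k in range(N - 1, -1, -1):
    G = {y: c * y + H(y) + sum(prob * J_next[y - w] for w, prob in demand) for y in Y}
    S_k = min(G, key=G.get)
    J = {x: min(G[y] for y in Y if y >= x) - c * x for x in Y}
    print(f"k = {k}: base-stock level S_k = {S_k}")
    # extend J to the larger grid by clamping (a boundary approximation)
    J_next = {x: J[min(max(x, -10), 10)] for x in range(-15, 16)}
```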

JUSTIFICATION

• Graphical inductive proof that Jk is convex.

[Figure: Graphs of −cy, H(y), and cy + H(y) against y; the minimum of cy + H(y) is attained at SN−1, and the resulting cost-to-go JN−1(xN−1) is convex (linear with slope −c for xN−1 < SN−1).]

6.231 DYNAMIC PROGRAMMING

LECTURE 7

LECTURE OUTLINE

• Stopping problems

• Scheduling problems

• Other applications

PURE STOPPING PROBLEMS

• Two possible controls:

− Stop (incur a one-time stopping cost, and move to a cost-free and absorbing stop state)

− Continue [using xk+1 = fk(xk, wk) and incurring the cost-per-stage]

• Each policy consists of a partition of the set of states xk into two regions:

− Stop region, where we stop

− Continue region, where we continue

[Figure: Partition of the state space into a stop region and a continue region; stopping moves the system to the absorbing stop state.]

EXAMPLE: ASSET SELLING

• A person has an asset, and at k = 0, 1, . . . , N − 1 receives a random offer wk

• May accept wk and invest the money at a fixed rate of interest r, or reject wk and wait for wk+1. Must accept the last offer wN−1

• DP algorithm (xk: current offer, T : stop state):

JN(xN) = xN if xN ≠ T,   JN(xN) = 0 if xN = T,

Jk(xk) = max[ (1 + r)^{N−k} xk, E{ Jk+1(wk) } ] if xk ≠ T,   Jk(xk) = 0 if xk = T.

• Optimal policy:

accept the offer xk if xk > αk,

reject the offer xk if xk < αk,

where

αk = E{ Jk+1(wk) } / (1 + r)^{N−k}.

FURTHER ANALYSIS

[Figure: The thresholds α1, α2, . . . , αN−1 plotted against k; offers above the threshold curve are accepted and offers below it are rejected.]

• Can show that αk ≥ αk+1 for all k

• Proof: Let Vk(xk) = Jk(xk)/(1 + r)^{N−k} for xk ≠ T. Then the DP algorithm is VN(xN) = xN and

Vk(xk) = max[ xk, (1 + r)⁻¹ E_w{ Vk+1(w) } ].

We have αk = E_w{ Vk+1(w) } / (1 + r), so it is enough to show that Vk(x) ≥ Vk+1(x) for all x and k. Start with VN−1(x) ≥ VN(x) and use the monotonicity property of DP.

• We can also show that αk → ā as k → −∞. This suggests that for an infinite horizon the optimal policy is stationary (a numerical sketch of the thresholds is given below).
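A minimal numerical sketch of the thresholds αk = E{Jk+1(wk)}/(1 + r)^{N−k}, computed through the normalized values Vk; the offer distribution and the interest rate are illustrative assumptions.

```python
# V_k(x) = max(x, alpha_k) with alpha_k = E_w{V_{k+1}(w)} / (1 + r).

offers = [(60, 0.3), (100, 0.4), (140, 0.3)]       # assumed (offer value, probability)
r, N = 0.05, 10

def expected(V):
    return sum(prob * V(w) for w, prob in offers)

V_next = lambda x: x                                # V_N(x) = x
alphas = []
for k in range(N - 1, -1, -1):
    alpha_k = expected(V_next) / (1 + r)            # accept offer x_k iff x_k > alpha_k
    alphas.insert(0, alpha_k)
    V_next = lambda x, a=alpha_k: max(x, a)

print([round(a, 1) for a in alphas])                # thresholds are decreasing in k
```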

GENERAL STOPPING PROBLEMS

• At time k, we may stop at cost t(xk) or choose a control uk ∈ U(xk) and continue

JN(xN) = t(xN),

Jk(xk) = min[ t(xk), min_{uk ∈ U(xk)} E{ g(xk, uk, wk) + Jk+1( f(xk, uk, wk) ) } ]

• Optimal to stop at time k for states x in the set

Tk = { x | t(x) ≤ min_{u ∈ U(x)} E{ g(x, u, w) + Jk+1( f(x, u, w) ) } }

• Since JN−1(x) ≤ JN(x), we have Jk(x) ≤ Jk+1(x) for all k, so

T0 ⊂ · · · ⊂ Tk ⊂ Tk+1 ⊂ · · · ⊂ TN−1.

• The interesting case is when all the Tk are equal (to TN−1, the set where it is better to stop than to go one step and stop). This can be shown to be true if

f(x, u, w) ∈ TN−1, for all x ∈ TN−1, u ∈ U(x), w.

SCHEDULING PROBLEMS

• Set of tasks to perform; the ordering is subject to optimal choice.

• Costs depend on the order

• There may be stochastic uncertainty, and precedence and resource availability constraints

• Some of the hardest combinatorial problems are of this type (e.g., traveling salesman, vehicle routing, etc.)

• Some special problems admit a simple quasi-analytical solution method

− Optimal policy has an “index form”, i.e., each task has an easily calculable “index”, and it is optimal to select the task that has the maximum value of index (multi-armed bandit problems - to be discussed later)

− Some problems can be solved by an “interchange argument” (start with some schedule, interchange two adjacent tasks, and see what happens)

EXAMPLE: THE QUIZ PROBLEM

• Given a list of N questions. If question i is answered correctly (with probability pi), we receive reward Ri; if not, the quiz terminates. Choose the order of questions to maximize expected reward.

• Let i and j be the kth and (k + 1)st questions in an optimally ordered list

L = (i0, . . . , ik−1, i, j, ik+2, . . . , iN−1)

E{reward of L} = E{ reward of {i0, . . . , ik−1} } + pi0 · · · pik−1 (pi Ri + pi pj Rj) + pi0 · · · pik−1 pi pj E{ reward of {ik+2, . . . , iN−1} }

Consider the list with i and j interchanged

L′ = (i0, . . . , ik−1, j, i, ik+2, . . . , iN−1)

Since L is optimal, E{reward of L} ≥ E{reward of L′}, so it follows that pi Ri + pi pj Rj ≥ pj Rj + pj pi Ri, or

pi Ri / (1 − pi) ≥ pj Rj / (1 − pj).

Thus it is optimal to answer the questions in decreasing order of the index pi Ri / (1 − pi) (see the sketch below).
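A minimal sketch of the resulting index rule: sort the questions in decreasing pi Ri / (1 − pi) and evaluate the expected reward of the resulting order; the (pi, Ri) values are illustrative assumptions.

```python
def expected_reward(order, p, R):
    total, prob_alive = 0.0, 1.0
    for i in order:
        total += prob_alive * p[i] * R[i]   # reward R_i collected if still in the quiz and correct
        prob_alive *= p[i]                  # the quiz continues only on a correct answer
    return total

p = [0.9, 0.5, 0.7]
R = [1.0, 10.0, 4.0]
order = sorted(range(len(p)), key=lambda i: p[i] * R[i] / (1 - p[i]), reverse=True)
print(order, round(expected_reward(order, p, R), 3))
```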

MINIMAX CONTROL

• Consider the basic problem with the difference that the disturbance wk, instead of being random, is just known to belong to a given set Wk(xk, uk).

• Find a policy π that minimizes the cost

Jπ(x0) = max_{wk ∈ Wk(xk, µk(xk)), k=0,1,...,N−1} [ gN(xN) + Σ_{k=0}^{N−1} gk(xk, µk(xk), wk) ]

• The DP algorithm takes the form

JN(xN) = gN(xN),

Jk(xk) = min_{uk ∈ U(xk)} max_{wk ∈ Wk(xk, uk)} [ gk(xk, uk, wk) + Jk+1( fk(xk, uk, wk) ) ]

(Exercise 1.5 in the text, solution posted on the www).

UNKNOWN-BUT-BOUNDED CONTROL

• For each k, keep the state xk of the controlled system

xk+1 = fk( xk, µk(xk), wk )

inside a given set Xk, the target set at time k.

• This is a minimax control problem, where the cost at stage k is

gk(xk) = 0 if xk ∈ Xk,   gk(xk) = 1 if xk ∉ Xk.

• We must reach at time k the set

X̄k = { xk | Jk(xk) = 0 }

in order to be able to maintain the state within the subsequent target sets.

• Start with X̄N = XN, and for k = 0, 1, . . . , N − 1,

X̄k = { xk ∈ Xk | there exists uk ∈ Uk(xk) such that fk(xk, uk, wk) ∈ X̄k+1, for all wk ∈ Wk(xk, uk) }

6.231 DYNAMIC PROGRAMMING

LECTURE 8

LECTURE OUTLINE

• Problems with imperfect state info

• Reduction to the perfect state info case

• Linear quadratic problems

• Separation of estimation and control

BASIC PROBLEM WITH IMPERFECT STATE INFO

• Same as the basic problem of Chapter 1 with one difference: the controller, instead of knowing xk, receives at each time k an observation of the form

z0 = h0(x0, v0), zk = hk(xk, uk−1, vk), k ≥ 1

• The observation zk belongs to some space Zk.

• The random observation disturbance vk is characterized by a probability distribution

Pvk(· | xk, . . . , x0, uk−1, . . . , u0, wk−1, . . . , w0, vk−1, . . . , v0)

• The initial state x0 is also random and characterized by a probability distribution Px0.

• The probability distribution Pwk(· | xk, uk) of wk is given, and it may depend explicitly on xk and uk but not on w0, . . . , wk−1, v0, . . . , vk−1.

• The control uk is constrained to a given subset Uk (this subset does not depend on xk, which is not assumed known).

INFORMATION VECTOR AND POLICIES

• Denote by Ik the information vector, i.e., the information available at time k:

Ik = (z0, z1, . . . , zk, u0, u1, . . . , uk−1), k ≥ 1,   I0 = z0

• We consider policies π = {µ0, µ1, . . . , µN−1}, where each function µk maps the information vector Ik into a control uk and

µk(Ik) ∈ Uk, for all Ik, k ≥ 0

• We want to find a policy π that minimizes

Jπ = E_{x0, wk, vk, k=0,...,N−1}{ gN(xN) + Σ_{k=0}^{N−1} gk(xk, µk(Ik), wk) }

subject to the equations

xk+1 = fk( xk, µk(Ik), wk ), k ≥ 0,

z0 = h0(x0, v0),   zk = hk( xk, µk−1(Ik−1), vk ), k ≥ 1

REFORMULATION AS PERFECT INFO PROBLEM

• We have

Ik+1 = (Ik, zk+1, uk), k = 0, 1, . . . , N − 2,   I0 = z0

View this as a dynamic system with state Ik, control uk, and random disturbance zk+1

• We have

P(zk+1 | Ik, uk) = P(zk+1 | Ik, uk, z0, z1, . . . , zk),

since z0, z1, . . . , zk are part of the information vector Ik. Thus the probability distribution of zk+1 depends explicitly only on the state Ik and control uk and not on the prior “disturbances” zk, . . . , z0

• Write

E{ gk(xk, uk, wk) } = E{ E_{xk, wk}{ gk(xk, uk, wk) | Ik, uk } }

so the cost per stage of the new system is

g̃k(Ik, uk) = E_{xk, wk}{ gk(xk, uk, wk) | Ik, uk }

DP ALGORITHM

• Writing the DP algorithm for the (reformulated) perfect state info problem and doing the algebra:

Jk(Ik) = min_{uk ∈ Uk} [ E_{xk, wk, zk+1}{ gk(xk, uk, wk) + Jk+1(Ik, zk+1, uk) | Ik, uk } ]

for k = 0, 1, . . . , N − 2, and for k = N − 1,

JN−1(IN−1) = min_{uN−1 ∈ UN−1} [ E_{xN−1, wN−1}{ gN( fN−1(xN−1, uN−1, wN−1) ) + gN−1(xN−1, uN−1, wN−1) | IN−1, uN−1 } ]

• The optimal cost J∗ is given by

J∗ = E_{z0}{ J0(z0) }

LINEAR-QUADRATIC PROBLEMS

• System: xk+1 = Akxk + Bkuk + wk

• Quadratic cost

E_{wk, k=0,1,...,N−1}{ x′N QN xN + Σ_{k=0}^{N−1} ( x′k Qk xk + u′k Rk uk ) }

where Qk ≥ 0 and Rk > 0

• Observations

zk = Ckxk + vk, k = 0, 1, . . . , N − 1

• w0, . . . , wN−1, v0, . . . , vN−1 indep. zero mean

• Key fact to show:

− Optimal policy {µ∗0, . . . , µ∗N−1} is of the form:

µ∗k(Ik) = Lk E{xk | Ik}

Lk: same as for the perfect state info case

− Estimation problem and control problem can be solved separately

DP ALGORITHM I

• Last stage N − 1 (suppressing the index N − 1):

JN−1(IN−1) = min_{uN−1} [ E_{xN−1, wN−1}{ x′N−1 Q xN−1 + u′N−1 R uN−1 + (A xN−1 + B uN−1 + wN−1)′ Q (A xN−1 + B uN−1 + wN−1) | IN−1, uN−1 } ]

• Since E{wN−1 | IN−1} = E{wN−1} = 0, the minimization involves

min_{uN−1} [ u′N−1 (B′QB + R) uN−1 + 2 E{xN−1 | IN−1}′ A′QB uN−1 ]

The minimization yields the optimal µ∗N−1:

u∗N−1 = µ∗N−1(IN−1) = LN−1 E{xN−1 | IN−1}

where

LN−1 = −(B′QB + R)⁻¹ B′QA

DP ALGORITHM II

• Substituting in the DP algorithm

JN−1(IN−1) = E_{xN−1}{ x′N−1 KN−1 xN−1 | IN−1 }
+ E_{xN−1}{ ( xN−1 − E{xN−1 | IN−1} )′ PN−1 ( xN−1 − E{xN−1 | IN−1} ) | IN−1 }
+ E_{wN−1}{ w′N−1 QN wN−1 },

where the matrices KN−1 and PN−1 are given by

PN−1 = A′N−1 QN BN−1 (RN−1 + B′N−1 QN BN−1)⁻¹ B′N−1 QN AN−1,

KN−1 = A′N−1 QN AN−1 − PN−1 + QN−1

• Note the structure of JN−1: in addition to the quadratic and constant terms, it involves a quadratic in the estimation error

xN−1 − E{xN−1 | IN−1}

DP ALGORITHM III

• DP equation for period N − 2:

JN−2(IN−2) = min_{uN−2} [ E_{xN−2, wN−2, zN−1}{ x′N−2 Q xN−2 + u′N−2 R uN−2 + JN−1(IN−1) | IN−2, uN−2 } ]

= E{ x′N−2 Q xN−2 | IN−2 }
+ min_{uN−2} [ u′N−2 R uN−2 + E{ x′N−1 KN−1 xN−1 | IN−2, uN−2 } ]
+ E{ ( xN−1 − E{xN−1 | IN−1} )′ PN−1 ( xN−1 − E{xN−1 | IN−1} ) | IN−2, uN−2 }
+ E_{wN−1}{ w′N−1 QN wN−1 }

• Key point: We have excluded the next to last term from the minimization with respect to uN−2

• This term turns out to be independent of uN−2

QUALITY OF ESTIMATION LEMMA

• For every k, there is a function Mk such that we have

xk − E{xk | Ik} = Mk(x0, w0, . . . , wk−1, v0, . . . , vk),

independently of the policy being used

• The following simplified version of the lemma conveys the main idea

• Simplified Lemma: Let r, u, z be random variables such that r and u are independent, and let x = r + u. Then

x − E{x | z, u} = r − E{r | z}

• Proof: We have

x − E{x | z, u} = r + u − E{r + u | z, u}
= r + u − E{r | z, u} − u
= r − E{r | z, u}
= r − E{r | z}

APPLYING THE QUALITY OF EST. LEMMA

• Using the lemma,

xN−1 − E{xN−1 | IN−1} = ξN−1,

where

ξN−1: function of x0, w0, . . . , wN−2, v0, . . . , vN−1

• Since ξN−1 is independent of uN−2, the conditional expectation of ξ′N−1 PN−1 ξN−1 satisfies

E{ ξ′N−1 PN−1 ξN−1 | IN−2, uN−2 } = E{ ξ′N−1 PN−1 ξN−1 | IN−2 }

and is independent of uN−2.

• So minimization in the DP algorithm yields

u∗N−2 = µ∗N−2(IN−2) = LN−2 E{xN−2 | IN−2}

FINAL RESULT

• Continuing similarly (using also the quality of estimation lemma)

µ∗k(Ik) = Lk E{xk | Ik},

where Lk is the same as for perfect state info:

Lk = −(Rk + B′k Kk+1 Bk)⁻¹ B′k Kk+1 Ak,

with Kk generated from KN = QN, using

Kk = A′k Kk+1 Ak − Pk + Qk,

Pk = A′k Kk+1 Bk (Rk + B′k Kk+1 Bk)⁻¹ B′k Kk+1 Ak

[Figure: Block diagram of the optimal controller: the system xk+1 = Ak xk + Bk uk + wk produces measurements zk = Ck xk + vk; an estimator uses zk and the delayed control uk−1 to form E{xk | Ik}, which is multiplied by the gain Lk to produce uk.]

SEPARATION INTERPRETATION

• The optimal controller can be decomposed into

(a) An estimator, which uses the data to generate the conditional expectation E{xk | Ik}.

(b) An actuator, which multiplies E{xk | Ik} by the gain matrix Lk and applies the control input uk = Lk E{xk | Ik}.

• Generically, the estimate x̂ of a random vector x given some information (random vector) I, which minimizes the mean squared error

E_x{ ‖x − x̂‖² | I } = E{ ‖x‖² | I } − 2 E{x | I}′ x̂ + ‖x̂‖²

is x̂ = E{x | I} (set to zero the derivative with respect to x̂ of the above quadratic form).

• The estimator portion of the optimal controller is optimal for the problem of estimating the state xk assuming the control is not subject to choice.

• The actuator portion is optimal for the control problem assuming perfect state information.

STEADY STATE/IMPLEMENTATION ASPECTS

• As N → ∞, the solution of the Riccati equation converges to a steady state and Lk → L.

• If x0, wk, and vk are Gaussian, E{xk | Ik} is a linear function of Ik and is generated by a nice recursive algorithm, the Kalman filter.

• The Kalman filter also involves a Riccati equation, so for N → ∞, and a stationary system, it also has a steady-state structure.

• Thus, for Gaussian uncertainty, the solution is nice and possesses a steady state.

• For non-Gaussian uncertainty, computing E{xk | Ik} may be very difficult, so a suboptimal solution is typically used.

• Most common suboptimal controller: Replace E{xk | Ik} by the estimate produced by the Kalman filter (act as if x0, wk, and vk are Gaussian).

• It can be shown that this controller is optimal within the class of controllers that are linear functions of Ik.

6.231 DYNAMIC PROGRAMMING

LECTURE 9

LECTURE OUTLINE

• DP for imperfect state info

• Sufficient statistics

• Conditional state distribution as a sufficient statistic

• Finite-state systems

• Examples

REVIEW: PROBLEM WITH IMPERFECT STATE INFO

• Instead of knowing xk, we receive observations

z0 = h0(x0, v0), zk = hk(xk, uk−1, vk), k ≥ 1

• Ik: information vector available at time k:

I0 = z0, Ik = (z0, z1, . . . , zk, u0, u1, . . . , uk−1), k ≥ 1

• Optimization over policies π = {µ0, µ1, . . . , µN−1}, where µk(Ik) ∈ Uk, for all Ik and k.

• Find a policy π that minimizes

Jπ = E_{x0, wk, vk, k=0,...,N−1}{ gN(xN) + Σ_{k=0}^{N−1} gk(xk, µk(Ik), wk) }

subject to the equations

xk+1 = fk( xk, µk(Ik), wk ), k ≥ 0,

z0 = h0(x0, v0),   zk = hk( xk, µk−1(Ik−1), vk ), k ≥ 1

DP ALGORITHM

• DP algorithm:

Jk(Ik) = min_{uk ∈ Uk} [ E_{xk, wk, zk+1}{ gk(xk, uk, wk) + Jk+1(Ik, zk+1, uk) | Ik, uk } ]

for k = 0, 1, . . . , N − 2, and for k = N − 1,

JN−1(IN−1) = min_{uN−1 ∈ UN−1} [ E_{xN−1, wN−1}{ gN( fN−1(xN−1, uN−1, wN−1) ) + gN−1(xN−1, uN−1, wN−1) | IN−1, uN−1 } ]

• The optimal cost J∗ is given by

J∗ = E_{z0}{ J0(z0) }.

SUFFICIENT STATISTICS

• Suppose that we can find a function Sk(Ik) such that the right-hand side of the DP algorithm can be written in terms of some function Hk as

min_{uk ∈ Uk} Hk( Sk(Ik), uk ).

• Such a function Sk is called a sufficient statistic.

• An optimal policy obtained by the preceding minimization can be written as

µ∗k(Ik) = µ̄k( Sk(Ik) ),

where µ̄k is an appropriate function.

• Example of a sufficient statistic: Sk(Ik) = Ik

• Another important sufficient statistic

Sk(Ik) = Pxk|Ik

DP ALGORITHM IN TERMS OF Pxk|Ik

• It turns out that Pxk|Ik is generated recursively by a dynamic system (estimator) of the form

Pxk+1|Ik+1 = Φk( Pxk|Ik, uk, zk+1 )

for a suitable function Φk

• DP algorithm can be written as

Jk(Pxk|Ik) = min_{uk ∈ Uk} [ E_{xk, wk, zk+1}{ gk(xk, uk, wk) + Jk+1( Φk(Pxk|Ik, uk, zk+1) ) | Ik, uk } ]

[Figure: Block diagram: the system xk+1 = fk(xk, uk, wk) and the measurement zk = hk(xk, uk−1, vk) feed an estimator that produces Pxk|Ik, which the actuator µk maps into the control uk.]

EXAMPLE: A SEARCH PROBLEM

• At each period, decide to search or not search a site that may contain a treasure.

• If we search and a treasure is present, we find it with prob. β and remove it from the site.

• Treasure’s worth: V. Cost of search: C

• States: treasure present & treasure not present

• Each search can be viewed as an observation of the state

• Denote

pk : prob. of treasure present at the start of time k

with p0 given.

• pk evolves at time k according to the equation

pk+1 = pk if not search,

pk+1 = 0 if search and find treasure,

pk+1 = pk(1 − β) / ( pk(1 − β) + 1 − pk ) if search and no treasure.

SEARCH PROBLEM (CONTINUED)

• DP algorithm

Jk(pk) = max[ 0, −C + pk βV + (1 − pk β) Jk+1( pk(1 − β) / ( pk(1 − β) + 1 − pk ) ) ],

with JN(pN) = 0.

• It can be shown by induction that the functions Jk satisfy

Jk(pk) = 0, for all pk ≤ C/(βV)

• Furthermore, it is optimal to search at period k if and only if

pk βV ≥ C

(expected reward from the next search ≥ the cost of the search; see the sketch below)
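A minimal simulation sketch of this policy (the values of β, V, C, p0, and the horizon are illustrative assumptions): search while pk βV ≥ C and update the belief pk after each unsuccessful search.

```python
import random

beta, V, C = 0.4, 10.0, 1.0
threshold = C / (beta * V)                     # search iff p_k >= C / (beta V)

def simulate(p0, N, treasure_present, seed=0):
    rng, p, reward = random.Random(seed), p0, 0.0
    for k in range(N):
        if p < threshold:                      # not worth searching any more
            break
        reward -= C                            # pay the search cost
        if treasure_present and rng.random() < beta:
            return reward + V                  # found the treasure and removed it
        # unsuccessful search: Bayes update of the belief
        p = p * (1 - beta) / (p * (1 - beta) + 1 - p)
    return reward

print(round(simulate(p0=0.6, N=20, treasure_present=True), 2))
```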

FINITE-STATE SYSTEMS

• Suppose the system is a finite-state Markov chain, with states 1, . . . , n.

• Then the conditional probability distribution Pxk|Ik is the vector

( P(xk = 1 | Ik), . . . , P(xk = n | Ik) )

• The DP algorithm can be executed over the n-dimensional simplex (the state space is not expanding with increasing k)

• When the control and observation spaces are also finite sets, it turns out that the cost-to-go functions Jk in the DP algorithm are piecewise linear and concave (Exercise 5.7).

• This is conceptually important and also (moderately) useful in practice.

INSTRUCTION EXAMPLE

• Teaching a student some item. Possible states are L: item learned, or L̄: item not learned.

• Possible decisions: T: terminate the instruction, or T̄: continue the instruction for one period and then conduct a test that indicates whether the student has learned the item.

• The test has two possible outcomes: R: student gives a correct answer, or R̄: student gives an incorrect answer.

• Probabilistic structure

[Figure: Transition and observation probabilities: if the item is not learned (L̄), it becomes learned during the period with probability t; state L gives test result R with probability 1, while state L̄ gives R with probability r and R̄ with probability 1 − r.]

• Cost of instruction is I per period

• Cost of terminating instruction: 0 if the student has learned the item, and C > 0 if not.

INSTRUCTION EXAMPLE II

• Let pk: prob. the student has learned the item given the test results so far,

pk = P(xk = L | Ik) = P(xk = L | z0, z1, . . . , zk).

• Using Bayes’ rule we can obtain

pk+1 = Φ(pk, zk+1)

= ( 1 − (1 − t)(1 − pk) ) / ( 1 − (1 − t)(1 − r)(1 − pk) ) if zk+1 = R,

= 0 if zk+1 = R̄.

• DP algorithm:

Jk(pk) = min[ (1 − pk)C, I + E_{zk+1}{ Jk+1( Φ(pk, zk+1) ) } ],

starting with

JN−1(pN−1) = min[ (1 − pN−1)C, I + (1 − t)(1 − pN−1)C ].

INSTRUCTION EXAMPLE III

• Write the DP algorithm as

Jk(pk) = min[ (1 − pk)C, I + Ak(pk) ],

where

Ak(pk) = P(zk+1 = R | Ik) Jk+1( Φ(pk, R) ) + P(zk+1 = R̄ | Ik) Jk+1( Φ(pk, R̄) )

• Can show by induction that the Ak(p) are piecewise linear, concave, monotonically decreasing, with

Ak−1(p) ≤ Ak(p) ≤ Ak+1(p), for all p ∈ [0, 1].

[Figure: The termination cost (1 − p)C and the functions I + AN−1(p), I + AN−2(p), I + AN−3(p) plotted against p ∈ [0, 1]; the crossing points αN−1, αN−2, αN−3 determine when to terminate, and the point p = 1 − I/C (where (1 − p)C = I) is marked on the p axis.]

6.231 DYNAMIC PROGRAMMING

LECTURE 10

LECTURE OUTLINE

• Suboptimal control

• Certainty equivalent control

• Limited lookahead policies

• Performance bounds

• Problem approximation approach

• Heuristic cost-to-go approximation

PRACTICAL DIFFICULTIES OF DP

• The curse of modeling

• The curse of dimensionality

− Exponential growth of the computational and storage requirements as the number of state variables and control variables increases

− Quick explosion of the number of states in combinatorial problems

− Intractability of imperfect state information problems

• There may be real-time solution constraints

− A family of problems may be addressed. The data of the problem to be solved is given with little advance notice

− The problem data may change as the system is controlled – need for on-line replanning

CERTAINTY EQUIVALENT CONTROL (CEC)

• Replace the stochastic problem with a deterministic problem

• At each time k, the uncertain quantities are fixed at some “typical” values

• Implementation for an imperfect info problem. At each time k:

(1) Compute a state estimate x̄k(Ik) given the current information vector Ik.

(2) Fix the wi, i ≥ k, at some w̄i(xi, ui). Solve the deterministic problem:

minimize gN(xN) + Σ_{i=k}^{N−1} gi( xi, ui, w̄i(xi, ui) )

subject to xk = x̄k(Ik) and for i ≥ k,

ui ∈ Ui,   xi+1 = fi( xi, ui, w̄i(xi, ui) ).

(3) Use as control the first element in the optimal control sequence found.

ALTERNATIVE IMPLEMENTATION

• Let {µd0(x0), . . . , µdN−1(xN−1)} be an optimal controller obtained from the DP algorithm for the deterministic problem

minimize gN(xN) + Σ_{k=0}^{N−1} gk( xk, µk(xk), w̄k(xk, uk) )

subject to xk+1 = fk( xk, µk(xk), w̄k(xk, uk) ),   µk(xk) ∈ Uk

The CEC applies at time k the control input

µ̄k(Ik) = µdk( x̄k(Ik) )

[Figure: Block diagram of the CEC: the measurement zk = hk(xk, uk−1, vk) feeds an estimator producing x̄k(Ik); the actuator applies uk = µdk( x̄k(Ik) ) to the system xk+1 = fk(xk, uk, wk).]

CEC WITH HEURISTICS

• Solve the “deterministic equivalent” problem using a heuristic/suboptimal policy

• Improved version of this idea: At time k, minimize the stage k cost plus the heuristic cost of the remaining stages, i.e., apply at time k a control ūk that minimizes over uk ∈ Uk(xk)

gk( xk, uk, w̄k(xk, uk) ) + Hk+1( fk( xk, uk, w̄k(xk, uk) ) )

where Hk+1 is the cost-to-go function corresponding to the heuristic.

• This is an example of an important suboptimal control idea:

Minimize at each stage k the sum of approximations to the current stage cost and the optimal cost-to-go.

• This is a central idea in several other suboptimal control schemes, such as limited lookahead, and rollout algorithms.

• Hk+1(xk+1) may be computed off-line or on-line.

PARTIALLY STOCHASTIC CEC

• Instead of fixing all future disturbances to their typical values, fix only some, and treat the rest as stochastic.

• Important special case: Treat an imperfect state information problem as one of perfect state information, using an estimate x̄k(Ik) of xk as if it were exact.

• Multiaccess Communication Example: Consider controlling the slotted Aloha system (discussed in Ch. 5) by optimally choosing the probability of transmission of waiting packets. This is a hard problem of imperfect state info, whose perfect state info version is easy.

• Natural partially stochastic CEC:

µk(Ik) = min[ 1, 1 / x̄k(Ik) ],

where x̄k(Ik) is an estimate of the current packet backlog based on the entire past channel history of successes, idles, and collisions (which is Ik).

LIMITED LOOKAHEAD POLICIES

• One-step lookahead (1SL) policy: At each k and state xk, use the control µ̄k(xk) that attains the minimum in

min_{uk ∈ Uk(xk)} E{ gk(xk, uk, wk) + J̃k+1( fk(xk, uk, wk) ) },

where

− J̃N = gN.

− J̃k+1: approximation to the true cost-to-go Jk+1

• Two-step lookahead policy: At each k and xk, use the control µk(xk) attaining the minimum above, where the function J̃k+1 is obtained using a 1SL approximation (solve a 2-step DP problem).

• If J̃k+1 is readily available and the minimization above is not too hard, the 1SL policy is implementable on-line.

• Sometimes one also replaces Uk(xk) above with a subset of “most promising controls” Ūk(xk).

• As the length of the lookahead increases, the required computation quickly explodes.

PERFORMANCE BOUNDS FOR 1SL

• Let J̄k(xk) be the cost-to-go from (xk, k) of the 1SL policy, based on the functions J̃k.

• Assume that for all (xk, k), we have

Ĵk(xk) ≤ J̃k(xk),   (*)

where ĴN = gN and for all k,

Ĵk(xk) = min_{uk ∈ Uk(xk)} E{ gk(xk, uk, wk) + J̃k+1( fk(xk, uk, wk) ) },

[so Ĵk(xk) is computed along with µ̄k(xk)]. Then

J̄k(xk) ≤ Ĵk(xk), for all (xk, k).

• Important application: When J̃k is the cost-to-go of some heuristic policy (then the 1SL policy is called the rollout policy).

• The bound can be extended to the case where there is a δk in the RHS of (*). Then

J̄k(xk) ≤ J̃k(xk) + δk + · · · + δN−1

COMPUTATIONAL ASPECTS

• Sometimes nonlinear programming can be used to calculate the 1SL or the multistep version [particularly when Uk(xk) is not a discrete set]. Connection with stochastic programming methods.

• The choice of the approximating functions J̃k is critical; they are calculated in a variety of ways.

• Some approaches:

(a) Problem approximation: Approximate the optimal cost-to-go with some cost derived from a related but simpler problem

(b) Heuristic cost-to-go approximation: Approximate the optimal cost-to-go with a function of a suitable parametric form, whose parameters are tuned by some heuristic or systematic scheme (Neuro-Dynamic Programming)

(c) Rollout approach: Approximate the optimal cost-to-go with the cost of some suboptimal policy, which is calculated either analytically or by simulation

PROBLEM APPROXIMATION

• Many (problem-dependent) possibilities

− Replace uncertain quantities by nominal values, or simplify the calculation of expected values by limited simulation

− Simplify difficult constraints or dynamics

• Example of enforced decomposition: Route m vehicles that move over a graph. Each node has a “value.” The first vehicle that passes through the node collects its value. Maximize the total collected value, subject to initial and final time constraints (plus time windows and other constraints).

• Usually the 1-vehicle version of the problem is much simpler. This motivates an approximation obtained by solving single vehicle problems.

• 1SL scheme: At time k and state xk (position of vehicles and “collected value nodes”), consider all possible kth moves by the vehicles, and at the resulting states we approximate the optimal value-to-go with the value collected by optimizing the vehicle routes one-at-a-time

HEURISTIC COST-TO-GO APPROXIMATION

• Use a cost-to-go approximation from a parametric class J(x, r), where x is the current state and r = (r1, . . . , rm) is a vector of “tunable” scalars (weights).

• By adjusting the weights, one can change the “shape” of the approximation J so that it is reasonably close to the true optimal cost-to-go function.

• Two key issues:
− The choice of the parametric class J(x, r) (the approximation architecture).
− The method for tuning the weights (“training” the architecture).

• Successful application strongly depends on how these issues are handled, and on insight about the problem.

• Sometimes a simulator is used, particularly when there is no mathematical model of the system.

APPROXIMATION ARCHITECTURES

• Divided into linear and nonlinear [i.e., linear or nonlinear dependence of J(x, r) on r].

• Linear architectures are easier to train, but nonlinear ones (e.g., neural networks) are richer.

• Architectures based on feature extraction:

(Figure: State x → Feature Extraction Mapping → Feature Vector y → Cost Approximator with Parameter Vector r → Cost Approximation J(y, r).)

• Ideally, the features will encode much of the nonlinearity that is inherent in the cost-to-go being approximated, and the approximation may be quite accurate without a complicated architecture.

• Sometimes the state space is partitioned, and “local” features are introduced for each subset of the partition (they are 0 outside the subset).

• With a well-chosen feature vector y(x), we can use a linear architecture

J(x, r) = J(y(x), r) = ∑_i ri yi(x)
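• For illustration, a minimal sketch of such a linear feature-based architecture in Python; the feature map y(x), the sample points, and the least-squares tuning of r are hypothetical choices made only for the example.

import numpy as np

# Linear cost approximation J(x, r) = sum_i r_i * y_i(x) with a hypothetical feature map.
def features(x):
    # Example feature vector y(x); in practice the features encode problem knowledge.
    return np.array([1.0, x, x * x])

def fit_weights(xs, targets):
    # Least-squares fit of r to sample cost values (e.g., simulated costs of a policy).
    Y = np.array([features(x) for x in xs])
    r, *_ = np.linalg.lstsq(Y, np.array(targets), rcond=None)
    return r

def J_approx(x, r):
    return float(features(x) @ r)

# Usage: fit to noisy samples of a quadratic cost and evaluate the approximation.
xs = np.linspace(-1.0, 1.0, 21)
targets = 2.0 + 3.0 * xs ** 2 + 0.01 * np.random.randn(xs.size)
r = fit_weights(xs, targets)
print(J_approx(0.5, r))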

COMPUTER CHESS

• Programs use a feature-based position evaluator that assigns a score to each move/position

(Figure: Position Evaluator: Feature Extraction → Weighting of Features → Score. Features: material balance, mobility, safety, etc.)

• Most often the weighting of features is linear, but multistep lookahead is involved.

• Most often the training is done by trial and error.

• Additional features:
− Depth-first search
− Variable depth search when dynamic positions are involved
− Alpha-beta pruning

6.231 DYNAMIC PROGRAMMING

LECTURE 11

LECTURE OUTLINE

• Rollout algorithms

• Cost improvement property

• Discrete deterministic problems

• Sequential consistency and greedy algorithms

• Sequential improvement

ROLLOUT ALGORITHMS

• One-step lookahead policy: At each k and state xk, use the control µk(xk) that attains the minimum in

min_{uk∈Uk(xk)} E{ gk(xk, uk, wk) + J̃k+1(fk(xk, uk, wk)) },

where
− J̃N = gN.
− J̃k+1: approximation to the true cost-to-go Jk+1

• Rollout algorithm: When J̃k is the cost-to-go of some heuristic policy (called the base policy)

• Cost improvement property (to be shown): Therollout algorithm achieves no worse (and usuallymuch better) cost than the base heuristic startingfrom the same state.

• Main difficulty: Calculating Jk(xk) may becomputationally intensive if the cost-to-go of thebase policy cannot be analytically calculated.

− May involve Monte Carlo simulation if theproblem is stochastic.

− Things improve in the deterministic case.

EXAMPLE: THE QUIZ PROBLEM

• A person is given N questions; answering correctly question i has probability pi, reward vi. The quiz terminates at the first incorrect answer.

• Problem: Choose the ordering of questions so as to maximize the total expected reward.

• Assuming no other constraints, it is optimal to use the index policy: Answer questions in decreasing order of pivi/(1 − pi).
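• For illustration, a minimal Python sketch of the index policy and the expected reward of a given ordering; the data p, v are hypothetical.

# Index policy and expected reward for the quiz problem (illustrative sketch; p_i < 1).
def index_order(p, v):
    # Answer questions in decreasing order of p_i v_i / (1 - p_i).
    n = len(p)
    return sorted(range(n), key=lambda i: p[i] * v[i] / (1.0 - p[i]), reverse=True)

def expected_reward(order, p, v):
    # Reward v_i is collected only if all earlier questions (and i itself) are answered correctly.
    total, prob_alive = 0.0, 1.0
    for i in order:
        prob_alive *= p[i]
        total += prob_alive * v[i]
    return total

p = [0.9, 0.5, 0.7]
v = [1.0, 5.0, 2.0]
order = index_order(p, v)
print(order, expected_reward(order, p, v))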

• With minor changes in the problem, the index policy need not be optimal. Examples:

− A limit (< N) on the maximum number of questions that can be answered.

− Time windows, sequence-dependent rewards, precedence constraints.

• Rollout with the index policy as base policy: Convenient because at a given state (subset of questions already answered), the index policy and its expected reward can be easily calculated.

• Very effective for solving the quiz problem and important generalizations in scheduling (see Bertsekas and Castanon, J. of Heuristics, Vol. 5, 1999).

COST IMPROVEMENT PROPERTY

• Let

Jk(xk): Cost-to-go of the rollout policy

Hk(xk): Cost-to-go of the base policy

• We claim that Jk(xk) ≤ Hk(xk) for all xk, k

• Proof by induction: We have JN(xN) = HN(xN) for all xN. Assume that

Jk+1(xk+1) ≤ Hk+1(xk+1), ∀ xk+1.

Then, for all xk, denoting by µk the rollout policy and by νk the base policy,

Jk(xk) = E{ gk(xk, µk(xk), wk) + Jk+1(fk(xk, µk(xk), wk)) }
      ≤ E{ gk(xk, µk(xk), wk) + Hk+1(fk(xk, µk(xk), wk)) }
      ≤ E{ gk(xk, νk(xk), wk) + Hk+1(fk(xk, νk(xk), wk)) }
      = Hk(xk)

− Induction hypothesis ==> 1st inequality
− Min selection of µk(xk) ==> 2nd inequality
− Definition of Hk, νk ==> last equality

EXAMPLE: THE BREAKTHROUGH PROBLEM


• Given a binary tree with N stages.

• Each arc is either free or is blocked (crossed out in the figure).

• Problem: Find a free path from the root to the leaves (such as the one shown with thick lines).

• Base heuristic (greedy): Follow the right branch if free; else follow the left branch if free.

• For large N and a given probability of a free branch: the rollout algorithm requires O(N) times more computation, but has O(N) times larger probability of finding a free path than the greedy algorithm.

DISCRETE DETERMINISTIC PROBLEMS

• Any discrete optimization problem (with a finite number of choices/feasible solutions) can be represented as a sequential decision process by using a tree.

• The leaves of the tree correspond to the feasible solutions.

• The problem can be solved by DP, starting from the leaves and going back towards the root.

• Example: Traveling salesman problem. Find a minimum cost tour that goes exactly once through each of N cities.

(Figure: decision tree for the traveling salesman problem with four cities A, B, C, D — origin node A; partial tours AB, AC, AD; then ABC, ABD, ACB, ACD, ADB, ADC; complete tours ABCD, ABDC, ACBD, ACDB, ADBC, ADCB.)

A CLASS OF GENERAL DISCRETE PROBLEMS

• Generic problem:
− Given a graph with directed arcs
− A special node s called the origin
− A set of terminal nodes, called destinations, and a cost g(i) for each destination i
− Find a min cost path starting at the origin and ending at one of the destination nodes

• Base heuristic: For any nondestination node i, it constructs a path (i, i1, . . . , im, ī) starting at i and ending at one of the destination nodes ī. We call ī the projection of i, and we denote H(i) = g(ī).

• Rollout algorithm: Start at the origin; choosethe successor node with least cost projection

(Figure: rollout path s, i1, . . . , im−1, im; the neighbors j1, j2, j3, j4 of im and their projections p(j1), p(j2), p(j3), p(j4).)

EXAMPLE: ONE-DIMENSIONAL WALK

• A person takes either a unit step to the left or a unit step to the right. Minimize the cost g(i) of the point i where he will end up after N steps.

(Figure: cost g(i) over the points i = −N, . . . , N reachable after N steps; the trajectory space is the triangle with corners (0, 0), (N, −N), (N, N).)

• Base heuristic: Always go to the right. Rollout finds the rightmost local minimum.

• Base heuristic: Compare always going to the right and always going to the left; choose the best of the two. Rollout finds a global minimum.

SEQUENTIAL CONSISTENCY

• The base heuristic is sequentially consistent if all nodes of its path have the same projection, i.e., for every node i, whenever it generates the path (i, i1, . . . , im, ī) starting at i, it also generates the path (i1, . . . , im, ī) starting at i1.

• A prime example of a sequentially consistent heuristic is a greedy algorithm. It uses an estimate F(i) of the optimal cost starting from i.

• At the typical step, given a path (i, i1, . . . , im), where im is not a destination, the algorithm adds to the path a node im+1 such that

im+1 = arg min_{j∈N(im)} F(j)

• Prop.: If the base heuristic is sequentially consistent, the cost of the rollout algorithm is no more than the cost of the base heuristic. In particular, if (s, i1, . . . , im) is the rollout path, we have

H(s) ≥ H(i1) ≥ · · · ≥ H(im−1) ≥ H(im)

where H(i) = cost of the heuristic starting at i.

• Proof: Rollout deviates from the greedy path only when it discovers an improved path.

SEQUENTIAL IMPROVEMENT

• We say that the base heuristic is sequentially improving if for every non-destination node i, we have

H(i) ≥ min_{j is a neighbor of i} H(j)

• If the base heuristic is sequentially improving, the cost of the rollout algorithm is no more than the cost of the base heuristic, starting from any node.

• Fortified rollout algorithm:
− A simple variant of the rollout algorithm, where we keep the best path found so far through the application of the base heuristic.
− If the rollout path deviates from the best path found, then follow the best path.
− Can be shown to be a rollout algorithm with a sequentially improving base heuristic for a slightly modified variant of the original problem.
− Has the cost improvement property.

6.231 DYNAMIC PROGRAMMING

LECTURE 12

LECTURE OUTLINE

• More on rollout algorithms - Stochastic prob-lems

• Simulation-based methods for rollout

• Approximations of rollout algorithms

• Rolling horizon approximations

• Discretization of continuous time

• Discretization of continuous space

• Other suboptimal approaches

ROLLOUT ALGORITHMS - STOCHASTIC PROBLEMS

• Rollout policy: At each k and state xk, use the control µk(xk) that attains the minimum in

min_{uk∈Uk(xk)} Qk(xk, uk),

where

Qk(xk, uk) = E{ gk(xk, uk, wk) + Hk+1(fk(xk, uk, wk)) }

and Hk+1(xk+1) is the cost-to-go of the heuristic.

• Qk(xk, uk) is called the Q-factor of (xk, uk), and for a stochastic problem, its computation may involve Monte Carlo simulation.

• Potential difficulty: To minimize the Q-factor over uk, we must form Q-factor differences Qk(xk, u) − Qk(xk, ū). This differencing often amplifies the simulation error in the calculation of the Q-factors.

• Potential remedy: Compare any two controls u and ū by simulating the Q-factor difference Qk(xk, u) − Qk(xk, ū) directly. This may effect variance reduction of the simulation-induced error.
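• For illustration, a minimal Python sketch of this variance-reduction idea: estimate the Q-factor difference of two controls using the same sampled disturbances for both (common random numbers). The problem data f, g, the heuristic cost H_next, and the disturbance sampler are hypothetical placeholders.

import random

# Compare two controls u1, u2 at state x by estimating their Q-factor difference directly,
# reusing each sampled disturbance for both controls (common random numbers).
def q_factor_difference(x, u1, u2, sample_w, f, g, H_next, n_samples=1000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        w = sample_w(rng)                     # same disturbance used for both controls
        q1 = g(x, u1, w) + H_next(f(x, u1, w))
        q2 = g(x, u2, w) + H_next(f(x, u2, w))
        total += q1 - q2
    return total / n_samples                  # positive estimate suggests u2 is preferable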

Q-FACTOR APPROXIMATION

• Here, instead of simulating the Q-factors, we approximate the costs-to-go Hk+1(xk+1).

• Certainty equivalence approach: Given xk, fix the future disturbances at “typical” values w̄k+1, . . . , w̄N−1 and approximate the Q-factors with

Q̃k(xk, uk) = E{ gk(xk, uk, wk) + H̃k+1(fk(xk, uk, wk)) },

where H̃k+1(fk(xk, uk, wk)) is the cost of the heuristic with the disturbances fixed at the typical values.

• This is an approximation of Hk+1(fk(xk, uk, wk)) by using a “single sample simulation.”

• Variant of the certainty equivalence approach: Approximate Hk+1(fk(xk, uk, wk)) by simulation using a small number of “representative samples” (scenarios).

• Alternative: Calculate (exact or approximate) values for the cost-to-go of the base policy at a limited set of state-time pairs, and then approximate Hk+1 using an approximation architecture and a “training algorithm” or “least-squares fit.”

ROLLING HORIZON APPROACH

• This is an l-step lookahead policy where thecost-to-go approximation is just 0.

• Alternatively, the cost-to-go approximation isthe terminal cost function gN .

• A short rolling horizon saves computation.

• “Paradox”: It is not true that a longer rollinghorizon always improves performance.

• Example: At the initial state, there are twocontrols available (1 and 2). At every other state,there is only one control.

(Figure: from the current state, controls 1 and 2 are available; the branch through control 1 is the optimal trajectory but shows high cost within the first l stages, while the branch through control 2 shows low cost over the first l stages and high cost afterwards, so the l-stage rolling horizon prefers the wrong control.)

ROLLING HORIZON COMBINED WITH ROLLOUT

• We can use a rolling horizon approximation incalculating the cost-to-go of the base heuristic.

• Because the heuristic is suboptimal, the ratio-nale for a long rolling horizon becomes weaker.

• Example: N-stage stopping problem where the stopping cost is 0, the continuation cost is either −ε or 1, where 0 < ε << 1, and the first state with continuation cost equal to 1 is state m. Then the optimal policy is to stop at state m, and the optimal cost is −mε.

(Figure: chain of states 0, 1, 2, . . . , m, . . . , N with continuation costs −ε, −ε, . . . , 1, . . . , and a stopped state reachable from each state.)

• Consider the heuristic that continues at every state, and the rollout policy that is based on this heuristic, with a rolling horizon of l ≤ m steps.

• It will continue up to the first m − l + 1 stages, thus compiling a cost of −(m − l + 1)ε. The rollout performance improves as l becomes shorter!

• Limited vision may work to our advantage!

DISCRETIZATION

• If the state space and/or control space is con-tinuous/infinite, it must be replaced by a finitediscretization.

• Need for consistency, i.e., as the discretizationbecomes finer, the cost-to-go functions of the dis-cretized problem converge to those of the contin-uous problem.

• Pitfalls with discretizing continuous time.

• The control constraint set changes a lot as wepass to the discrete-time approximation.

• Continuous-Time Shortest Path Pitfall:

ẋ1(t) = u1(t),   ẋ2(t) = u2(t),

with control constraint ui(t) ∈ {−1, 1} and cost ∫_0^T g(x(t)) dt. Compare with the naive discretization

x1(t + ∆t) = x1(t) + ∆t u1(t),   x2(t + ∆t) = x2(t) + ∆t u2(t)

with ui(t) ∈ {−1, 1}.

• “Convexification effect” of continuous time.

SPACE DISCRETIZATION I

• Given a discrete-time system with state space S, consider a finite subset S̄; for example S̄ could be a finite grid within a continuous state space S.

• Difficulty: f(x, u, w) ∉ S̄ for x ∈ S̄.

• We define an approximation to the original problem, with state space S̄, as follows:

• Express each x ∈ S as a convex combination of states in S̄, i.e.,

x = ∑_{xi∈S̄} γi(x) xi,   where γi(x) ≥ 0, ∑_i γi(x) = 1

• Define a “reduced” dynamic system with state space S̄, whereby from each xi ∈ S̄ we move to x = f(xi, u, w) according to the system equation of the original problem, and then move to xj ∈ S̄ with probabilities γj(x).

• Define similarly the corresponding cost per stage of the transitions of the reduced system.

SPACE DISCRETIZATION II

• Let Jk(xi) be the optimal cost-to-go of the “reduced” problem from each state xi ∈ S̄ and time k onward.

• Approximate the optimal cost-to-go of any x ∈ S for the original problem by

Jk(x) = ∑_{xi∈S̄} γi(x) Jk(xi),

and use one-step lookahead based on Jk (see the sketch below).

• The choice of the coefficients γi(x) is in principle arbitrary, but it should aim at consistency, i.e., as the number of states in S̄ increases, Jk(x) should converge to the optimal cost-to-go of the original problem.

• Interesting observation: While the original problem may be deterministic, the reduced problem is always stochastic.

• Generalization: The set S̄ may be any finite set (not necessarily a subset of S) as long as the coefficients γi(x) admit a meaningful interpretation that quantifies the degree of association of x with xi.
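• For illustration, a minimal Python sketch for a one-dimensional grid: x is expressed as a convex combination of its two neighboring grid points, and the resulting weights γi(x) are used to interpolate the grid costs-to-go. The grid and the grid cost values are hypothetical.

import numpy as np

def interpolation_weights(x, grid):
    """Return (indices, weights) expressing x as a convex combination of grid points."""
    if x <= grid[0]:
        return [0], [1.0]
    if x >= grid[-1]:
        return [len(grid) - 1], [1.0]
    j = int(np.searchsorted(grid, x, side="right"))   # grid[j-1] <= x < grid[j]
    lam = (x - grid[j - 1]) / (grid[j] - grid[j - 1])
    return [j - 1, j], [1.0 - lam, lam]

def interpolate_cost(x, grid, J_grid):
    # Cost-to-go approximation J(x) = sum_i gamma_i(x) J(x_i).
    idx, w = interpolation_weights(x, grid)
    return sum(wi * J_grid[i] for i, wi in zip(idx, w))

grid = np.linspace(0.0, 1.0, 11)
J_grid = grid ** 2                                    # hypothetical costs-to-go on the grid
print(interpolate_cost(0.37, grid, J_grid))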

OTHER SUBOPTIMAL CONTROL APPROACHES

• Minimize the DP equation error: Approximate the optimal cost-to-go functions Jk(xk) with functions J̃k(xk, rk), where rk is a vector of unknown parameters, chosen to minimize some form of error in the DP equations.

• Direct approximation of control policies: For a subset of states xi, i = 1, . . . , m, find

µ̂k(xi) = arg min_{uk∈Uk(xi)} E{ g(xi, uk, wk) + J̃k+1(fk(xi, uk, wk), rk+1) }.

Then find µ̃k(xk, sk), where sk is a vector of parameters obtained by solving the problem

min_s ∑_{i=1}^m ‖µ̂k(xi) − µ̃k(xi, s)‖².

• Approximation in policy space: Do not bother with cost-to-go approximations. Parametrize the policies as µ̃k(xk, sk), and minimize the cost function of the problem over the parameters sk.

6.231 DYNAMIC PROGRAMMING

LECTURE 13

LECTURE OUTLINE

• Infinite horizon problems

• Stochastic shortest path problems

• Bellman’s equation

• Dynamic programming – value iteration

• Examples

TYPES OF INFINITE HORIZON PROBLEMS

• Same as the basic problem, but:
− The number of stages is infinite.
− The system is stationary.

• Total cost problems: Minimize

Jπ(x0) = lim_{N→∞} E_{wk, k=0,1,...} { ∑_{k=0}^{N−1} α^k g(xk, µk(xk), wk) }

− Stochastic shortest path problems (α = 1, finite-state system with a termination state)

− Discounted problems (α < 1, bounded cost per stage)

− Discounted and undiscounted problems with unbounded cost per stage

• Average cost problems

lim_{N→∞} (1/N) E_{wk, k=0,1,...} { ∑_{k=0}^{N−1} g(xk, µk(xk), wk) }

PREVIEW OF INFINITE HORIZON RESULTS

• Key issue: The relation between the infinite and finite horizon optimal cost-to-go functions.

• Illustration: Let α = 1 and JN(x) denote the optimal cost of the N-stage problem, generated after N DP iterations, starting from J0(x) ≡ 0:

Jk+1(x) = min_{u∈U(x)} E_w{ g(x, u, w) + Jk(f(x, u, w)) },  ∀ x

• Typical results for total cost problems:

J∗(x) = lim_{N→∞} JN(x),  ∀ x

J∗(x) = min_{u∈U(x)} E_w{ g(x, u, w) + J∗(f(x, u, w)) },  ∀ x

(Bellman’s Equation). If µ(x) minimizes in Bellman’s Eq., the policy {µ, µ, . . .} is optimal.

• Bellman’s Eq. always holds. The other results are true for SSP (and bounded/discounted; unusual exceptions for other problems).

STOCHASTIC SHORTEST PATH PROBLEMS

• Assume a finite-state system: States 1, . . . , n and a special cost-free termination state t

− Transition probabilities pij(u)
− Control constraints u ∈ U(i)
− Cost of policy π = {µ0, µ1, . . .} is

Jπ(i) = lim_{N→∞} E{ ∑_{k=0}^{N−1} g(xk, µk(xk)) | x0 = i }

− Optimal policy if Jπ(i) = J∗(i) for all i.
− Special notation: For stationary policies π = {µ, µ, . . .}, we use Jµ(i) in place of Jπ(i).

• Assumption (Termination inevitable): There exists an integer m such that for every policy and initial state, there is positive probability that the termination state will be reached after no more than m stages; for all π, we have

ρπ = max_{i=1,...,n} P{xm ≠ t | x0 = i, π} < 1

FINITENESS OF POLICY COST-TO-GO FUNCTIONS

• Let

ρ = max_π ρπ.

Note that ρπ depends only on the first m components of the policy π, so that ρ < 1.

• For any π and any initial state i

P{x2m ≠ t | x0 = i, π} = P{x2m ≠ t | xm ≠ t, x0 = i, π} × P{xm ≠ t | x0 = i, π} ≤ ρ^2

and similarly

P{xkm ≠ t | x0 = i, π} ≤ ρ^k,  i = 1, . . . , n

• So E{cost between times km and (k + 1)m − 1} ≤ m ρ^k max_{i=1,...,n, u∈U(i)} |g(i, u)|

and

|Jπ(i)| ≤ ∑_{k=0}^∞ m ρ^k max_{i=1,...,n, u∈U(i)} |g(i, u)| = (m/(1 − ρ)) max_{i=1,...,n, u∈U(i)} |g(i, u)|

MAIN RESULT

• Given any initial conditions J0(1), . . . , J0(n), the sequence Jk(i) generated by the DP iteration

Jk+1(i) = min_{u∈U(i)} [ g(i, u) + ∑_{j=1}^n pij(u) Jk(j) ],  ∀ i

converges to the optimal cost J∗(i) for each i (see the sketch below).

• Bellman’s equation has J∗(i) as its unique solution:

J∗(i) = min_{u∈U(i)} [ g(i, u) + ∑_{j=1}^n pij(u) J∗(j) ],  ∀ i

• A stationary policy µ is optimal if and only if for every state i, µ(i) attains the minimum in Bellman’s equation.

• Key proof idea: The “tail” of the cost series,

∑_{k=mK}^∞ E{ g(xk, µk(xk)) }

vanishes as K increases to ∞.
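• For illustration, a minimal value iteration sketch for a small SSP; the data arrays g[i][u] and P[i][u][j] (probabilities among nontermination states, with the remaining mass going to the cost-free termination state) are hypothetical.

import numpy as np

def ssp_value_iteration(g, P, n_iter=1000, tol=1e-10):
    """g[i][u]: stage cost; P[i][u][j]: transition probability to nontermination state j."""
    n = len(g)
    J = np.zeros(n)
    for _ in range(n_iter):
        J_new = np.array([min(g[i][u] + np.dot(P[i][u], J) for u in range(len(g[i])))
                          for i in range(n)])
        if np.max(np.abs(J_new - J)) < tol:
            return J_new
        J = J_new
    return J

# Example: two states, one control each, each terminating with probability 0.5.
g = [[1.0], [2.0]]
P = [[[0.0, 0.5]], [[0.5, 0.0]]]
print(ssp_value_iteration(g, P))    # approx. [8/3, 10/3]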

OUTLINE OF PROOF THAT JN → J∗

• Assume for simplicity that J0(i) = 0 for all i, and for any K ≥ 1, write the cost of any policy π as

Jπ(x0) = ∑_{k=0}^{mK−1} E{ g(xk, µk(xk)) } + ∑_{k=mK}^∞ E{ g(xk, µk(xk)) }
       ≤ ∑_{k=0}^{mK−1} E{ g(xk, µk(xk)) } + ∑_{k=K}^∞ ρ^k m max_{i,u} |g(i, u)|

Take the minimum of both sides over π to obtain

J∗(x0) ≤ JmK(x0) + (ρ^K/(1 − ρ)) m max_{i,u} |g(i, u)|.

Similarly, we have

JmK(x0) − (ρ^K/(1 − ρ)) m max_{i,u} |g(i, u)| ≤ J∗(x0).

It follows that lim_{K→∞} JmK(x0) = J∗(x0).

• It can be seen that JmK(x0) and JmK+k(x0) converge to the same limit for k = 1, . . . , m − 1, so JN(x0) → J∗(x0)

EXAMPLE I

• Minimizing the E{Time to Termination}: Let

g(i, u) = 1, ∀ i = 1, . . . , n, u ∈ U(i)

• Under our assumptions, the costs J∗(i) uniquely solve Bellman’s equation, which has the form

J∗(i) = min_{u∈U(i)} [ 1 + ∑_{j=1}^n pij(u) J∗(j) ],  i = 1, . . . , n

• In the special case where there is only one control at each state, J∗(i) is the mean first passage time from i to t. These times, denoted mi, are the unique solution of the equations

mi = 1 + ∑_{j=1}^n pij mj,  i = 1, . . . , n.
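• For illustration, a minimal sketch that solves this linear system directly; the transition matrix is hypothetical.

import numpy as np

def mean_first_passage_times(P):
    """P[i][j]: transition probabilities among nontermination states 1..n (row sums < 1;
    the missing mass is the probability of moving to the termination state t).
    Solves m = 1 + P m, i.e., (I - P) m = 1."""
    P = np.asarray(P, dtype=float)
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - P, np.ones(n))

# Example: from the single state, terminate w.p. 0.5 or stay w.p. 0.5  ->  m = 2.
print(mean_first_passage_times([[0.5]]))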

EXAMPLE II

• A spider and a fly move along a straight line.

• The fly moves one unit to the left with probability p, one unit to the right with probability p, and stays where it is with probability 1 − 2p.

• The spider moves one unit towards the fly if its distance from the fly is more than one unit.

• If the spider is one unit away from the fly, it will either move one unit towards the fly or stay where it is.

• If the spider and the fly land in the same position, the spider captures the fly.

• The spider’s objective is to capture the fly in minimum expected time.

• This is an SSP w/ state = the distance between spider and fly (i = 1, . . . , n, and t = 0 is the termination state).

• There is control choice only at state 1.

EXAMPLE II (CONTINUED)

• For M = move and M̄ = don’t move:

p11(M) = 2p,  p10(M) = 1 − 2p,

p12(M̄) = p,  p11(M̄) = 1 − 2p,  p10(M̄) = p,

pii = p,  pi(i−1) = 1 − 2p,  pi(i−2) = p,  i ≥ 2,

with all other transition probabilities being 0.

• Bellman’s equation:

J∗(i) = 1 + pJ∗(i) + (1 − 2p)J∗(i − 1) + pJ∗(i − 2),  i ≥ 2,

J∗(1) = 1 + min[ 2pJ∗(1),  pJ∗(2) + (1 − 2p)J∗(1) ],

with J∗(0) = 0. Substituting J∗(2) into the equation for J∗(1),

J∗(1) = 1 + min[ 2pJ∗(1),  p/(1 − p) + (1 − 2p)J∗(1)/(1 − p) ].

• Work from here to find that when one unit away from the fly it is optimal not to move if and only if p ≥ 1/3.
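• For illustration, a numerical check of this threshold via value iteration on the SSP above; the number of states n and the iteration count are arbitrary choices for the sketch.

def spider_fly_values(p, n=15, n_iter=5000):
    """Value iteration for the spider-and-fly SSP; returns whether moving is better at state 1."""
    J = [0.0] * (n + 1)                       # J[0] = 0 is the termination state
    for _ in range(n_iter):
        new = J[:]
        # State 1: choose between move (M) and don't move.
        q_move = 1 + 2 * p * J[1]
        q_stay = 1 + p * J[2] + (1 - 2 * p) * J[1]
        new[1] = min(q_move, q_stay)
        # States i >= 2: the spider always moves toward the fly.
        for i in range(2, n + 1):
            new[i] = 1 + p * J[i] + (1 - 2 * p) * J[i - 1] + p * J[i - 2]
        J = new
    return (1 + 2 * p * J[1]) <= (1 + p * J[2] + (1 - 2 * p) * J[1])

for p in (0.2, 0.3, 0.34, 0.4):
    print(p, "move" if spider_fly_values(p) else "stay")   # move for p < 1/3, stay otherwise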

6.231 DYNAMIC PROGRAMMING

LECTURE 14

LECTURE OUTLINE

• Review of stochastic shortest path problems

• Computational methods− Value iteration− Policy iteration− Linear programming

• Discounted problems as special case of SSP

STOCHASTIC SHORTEST PATH PROBLEMS

• Assume a finite-state system: States 1, . . . , n and a special cost-free termination state t

− Transition probabilities pij(u)
− Control constraints u ∈ U(i)
− Cost of policy π = {µ0, µ1, . . .} is

Jπ(i) = lim_{N→∞} E{ ∑_{k=0}^{N−1} g(xk, µk(xk)) | x0 = i }

− Optimal policy if Jπ(i) = J∗(i) for all i.
− Special notation: For stationary policies π = {µ, µ, . . .}, we use Jµ(i) in place of Jπ(i).

• Assumption (Termination inevitable): There exists an integer m such that for every policy and initial state, there is positive probability that the termination state will be reached after no more than m stages; for all π, we have

ρπ = max_{i=1,...,n} P{xm ≠ t | x0 = i, π} < 1

MAIN RESULT

• Given any initial conditions J0(1), . . . , J0(n), the sequence Jk(i) generated by value iteration

Jk+1(i) = min_{u∈U(i)} [ g(i, u) + ∑_{j=1}^n pij(u) Jk(j) ],  ∀ i

converges to the optimal cost J∗(i) for each i.

• Bellman’s equation has J∗(i) as its unique solution:

J∗(i) = min_{u∈U(i)} [ g(i, u) + ∑_{j=1}^n pij(u) J∗(j) ],  ∀ i

• A stationary policy µ is optimal if and only if for every state i, µ(i) attains the minimum in Bellman’s equation.

• Key proof idea: The “tail” of the cost series,

∑_{k=mK}^∞ E{ g(xk, µk(xk)) }

vanishes as K increases to ∞.

BELLMAN’S EQUATION FOR A SINGLE POLICY

• Consider a stationary policy µ

• Jµ(i), i = 1, . . . , n, are the unique solution of the linear system of n equations

Jµ(i) = g(i, µ(i)) + ∑_{j=1}^n pij(µ(i)) Jµ(j),  ∀ i = 1, . . . , n

• Proof: This is just Bellman’s equation for a modified/restricted problem where there is only one policy, the stationary policy µ, i.e., the control constraint set at state i is U(i) = {µ(i)}

• The equation provides a way to compute Jµ(i), i = 1, . . . , n, but the computation is substantial for large n [O(n^3)]

• For large n, value iteration may be preferable. (Typical case of a large linear system of equations, where an iterative method may be better than a direct solution method.)

POLICY ITERATION

• It generates a sequence µ1, µ2, . . . of stationary policies, starting with any stationary policy µ0.

• At the typical iteration, given µk, we perform a policy evaluation step, which computes Jµk(i) as the solution of the (linear) system of equations

J(i) = g(i, µk(i)) + ∑_{j=1}^n pij(µk(i)) J(j),  i = 1, . . . , n,

in the n unknowns J(1), . . . , J(n). We then perform a policy improvement step (see the sketch below), which computes a new policy µk+1 as

µk+1(i) = arg min_{u∈U(i)} [ g(i, u) + ∑_{j=1}^n pij(u) Jµk(j) ],  ∀ i

• The algorithm stops when Jµk(i) = Jµk+1(i) for all i

• Note the connection with the rollout algorithm, which is just a single policy iteration
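• For illustration, a minimal policy iteration sketch for a finite SSP; the data arrays g[i][u] and P[i][u][j] are hypothetical, evaluation is an exact linear solve, and the starting policy is assumed proper.

import numpy as np

def policy_iteration(g, P):
    """g[i][u]: stage cost; P[i][u][j]: transition probabilities among the nontermination
    states (remaining probability mass goes to the termination state)."""
    n = len(g)
    mu = [0] * n                                       # start with an arbitrary (proper) policy
    while True:
        # Policy evaluation: solve J = g_mu + P_mu J.
        P_mu = np.array([P[i][mu[i]] for i in range(n)], dtype=float)
        g_mu = np.array([g[i][mu[i]] for i in range(n)], dtype=float)
        J = np.linalg.solve(np.eye(n) - P_mu, g_mu)
        # Policy improvement.
        mu_new = [int(np.argmin([g[i][u] + np.dot(P[i][u], J) for u in range(len(g[i]))]))
                  for i in range(n)]
        if mu_new == mu:
            return mu, J
        mu = mu_new

# Example: in each of two states, control 0 terminates at cost 3, control 1 moves to the
# other state at cost 1; the optimal policy terminates immediately.
g = [[3.0, 1.0], [3.0, 1.0]]
P = [[[0.0, 0.0], [0.0, 1.0]], [[0.0, 0.0], [1.0, 0.0]]]
print(policy_iteration(g, P))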

JUSTIFICATION OF POLICY ITERATION

• We can show that Jµk+1(i) ≤ Jµk(i) for all i, k

• Fix k and consider the sequence generated by

JN+1(i) = g(i, µk+1(i)) + ∑_{j=1}^n pij(µk+1(i)) JN(j)

where J0(i) = Jµk(i). We have

J0(i) = g(i, µk(i)) + ∑_{j=1}^n pij(µk(i)) J0(j)
      ≥ g(i, µk+1(i)) + ∑_{j=1}^n pij(µk+1(i)) J0(j) = J1(i)

Using the monotonicity property of DP,

J0(i) ≥ J1(i) ≥ · · · ≥ JN(i) ≥ JN+1(i) ≥ · · · ,  ∀ i

Since JN(i) → Jµk+1(i) as N → ∞, we obtain Jµk(i) = J0(i) ≥ Jµk+1(i) for all i. Also, if Jµk(i) = Jµk+1(i) for all i, then Jµk solves Bellman’s equation and is therefore equal to J∗

• A policy cannot be repeated, and there are finitely many stationary policies, so the algorithm terminates with an optimal policy

LINEAR PROGRAMMING

• We claim that J∗ is the “largest” J that satisfies the constraint

J(i) ≤ g(i, u) + ∑_{j=1}^n pij(u) J(j),    (1)

for all i = 1, . . . , n and u ∈ U(i).

• Proof: Use value iteration to generate a sequence of vectors Jk = (Jk(1), . . . , Jk(n)), starting with a J0 such that

J0(i) ≤ min_{u∈U(i)} [ g(i, u) + ∑_{j=1}^n pij(u) J0(j) ],  ∀ i

Then, Jk(i) ≤ Jk+1(i) for all k and i (monotonicity property of DP) and Jk → J∗, so that J0(i) ≤ J∗(i) for all i.

• So J∗ = (J∗(1), . . . , J∗(n)) is the solution of the linear program of maximizing ∑_{i=1}^n J(i) subject to the constraint (1).
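• For illustration, a minimal sketch of this linear program, assuming SciPy’s linprog is available; maximizing ∑_i J(i) is written as minimizing −∑_i J(i), and the problem data are hypothetical.

import numpy as np
from scipy.optimize import linprog

def ssp_lp(g, P):
    """Maximize sum_i J(i) subject to J(i) <= g(i,u) + sum_j P[i][u][j] J(j) for all (i, u)."""
    n = len(g)
    A_ub, b_ub = [], []
    for i in range(n):
        for u in range(len(g[i])):
            row = -np.asarray(P[i][u], dtype=float)    # -sum_j p_ij(u) J(j)
            row[i] += 1.0                               # + J(i)
            A_ub.append(row)
            b_ub.append(g[i][u])
    res = linprog(c=-np.ones(n), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n)
    return res.x

g = [[3.0, 1.0], [3.0, 1.0]]
P = [[[0.0, 0.0], [0.0, 1.0]], [[0.0, 0.0], [1.0, 0.0]]]
print(ssp_lp(g, P))    # close to the optimal costs J*(1), J*(2)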

LINEAR PROGRAMMING (CONTINUED)

(Figure: two-state example in the (J(1), J(2)) plane showing the lines J(1) = g(1, u) + p11(u)J(1) + p12(u)J(2) and J(2) = g(2, u) + p21(u)J(1) + p22(u)J(2) for the controls u1, u2; the feasible region of constraint (1) lies below these lines and J∗ = (J∗(1), J∗(2)) is its extreme point.)

• Drawback: For large n the dimension of this program is very large. Furthermore, the number of constraints is equal to the number of state-control pairs.

DISCOUNTED PROBLEMS

• Assume a discount factor α < 1.

• Conversion to an SSP problem.

(Figure: an α-discounted problem with transition probabilities pij(u) is converted to an SSP by scaling all transition probabilities by α and adding a transition from every state to an artificial termination state t with probability 1 − α.)

• Value iteration converges to J∗ for all initial J0:

Jk+1(i) = min_{u∈U(i)} [ g(i, u) + α ∑_{j=1}^n pij(u) Jk(j) ],  ∀ i

• J∗ is the unique solution of Bellman’s equation:

J∗(i) = min_{u∈U(i)} [ g(i, u) + α ∑_{j=1}^n pij(u) J∗(j) ],  ∀ i

DISCOUNTED PROBLEMS (CONTINUED)

• Policy iteration converges finitely to an optimal policy, and linear programming works.

• Example: Asset selling over an infinite horizon. If accepted, the offer xk of period k is invested at a rate of interest r.

• By depreciating the sale amount to period 0 dollars, we view (1 + r)^{−k} xk as the reward for selling the asset in period k at a price xk, where r > 0 is the rate of interest. So the discount factor is α = 1/(1 + r).

• J∗ is the unique solution of Bellman’s equation

J∗(x) = max[ x,  E{J∗(w)}/(1 + r) ].

• An optimal policy is to sell if and only if the current offer xk is greater than or equal to the threshold ᾱ, where

ᾱ = E{J∗(w)}/(1 + r).

6.231 DYNAMIC PROGRAMMING

LECTURE 15

LECTURE OUTLINE

• Average cost per stage problems

• Connection with stochastic shortest path prob-lems

• Bellman’s equation

• Value iteration

• Policy iteration

AVERAGE COST PER STAGE PROBLEM

• Stationary system with finite number of statesand controls

• Minimize over policies π = {µ0, µ1, . . .}

Jπ(x0) = lim_{N→∞} (1/N) E_{wk, k=0,1,...} { ∑_{k=0}^{N−1} g(xk, µk(xk), wk) }

• Important characteristics (not shared by other types of infinite horizon problems)

− For any fixed K, the cost incurred up to time K does not matter (only the state we are at at time K matters)

− If all states “communicate,” the optimal cost is independent of the initial state [if we can go from i to j in finite expected time, we must have J∗(i) ≤ J∗(j)]. So J∗(i) ≡ λ∗ for all i.

− Because “communication” issues are so important, the methodology relies heavily on Markov chain theory.

CONNECTION WITH SSP

• Assumption: State n is such that for some integer m > 0, and for all initial states and all policies, n is visited with positive probability at least once within the first m stages.

• Divide the sequence of generated states into cycles marked by successive visits to n.

• Each of the cycles can be viewed as a state trajectory of a corresponding stochastic shortest path problem with n as the termination state.

(Figure: the original chain with special state n, and the corresponding SSP in which transitions into n are redirected to an artificial termination state t, while all other transition probabilities pij(u), pin(u), pni(u), pnj(u), pjn(u), pnn(u) are unchanged.)

• Let the cost at i of the SSP be g(i, u) − λ∗

• We will show that

Av. Cost Probl. ≡ A Min Cost Cycle Probl. ≡ SSP Probl.

CONNECTION WITH SSP (CONTINUED)

• Consider a minimum cycle cost problem: Find a stationary policy µ that minimizes the expected cost per transition within a cycle

Cnn(µ)/Nnn(µ),

where for a fixed µ,

Cnn(µ): E{cost from n up to the first return to n}

Nnn(µ): E{time from n up to the first return to n}

• Intuitively, optimal cycle cost = λ∗, so

Cnn(µ) − Nnn(µ)λ∗ ≥ 0,

with equality if µ is optimal.

• Thus, the optimal µ must minimize over µ the expression Cnn(µ) − Nnn(µ)λ∗, which is the expected cost of µ starting from n in the SSP with stage costs g(i, u) − λ∗. Also: Optimal SSP cost = 0.

BELLMAN’S EQUATION

• Let h∗(i) be the optimal cost of this SSP problem when starting at the nontermination states i = 1, . . . , n. Then h∗(1), . . . , h∗(n) uniquely solve the corresponding Bellman’s equation

h∗(i) = min_{u∈U(i)} [ g(i, u) − λ∗ + ∑_{j=1}^{n−1} pij(u) h∗(j) ],  ∀ i

• If µ∗ is an optimal stationary policy for the SSP problem, we have

h∗(n) = Cnn(µ∗) − Nnn(µ∗)λ∗ = 0

• Combining these equations, we have

λ∗ + h∗(i) = min_{u∈U(i)} [ g(i, u) + ∑_{j=1}^n pij(u) h∗(j) ],  ∀ i

• If µ∗(i) attains the min for each i, µ∗ is optimal.

MORE ON THE CONNECTION WITH SSP

• Interpretation of h∗(i) as a relative or differential cost: It is the minimum of

E{cost to reach n from i for the first time} − E{cost if the stage cost were λ∗ and not g(i, u)}

• We don’t know λ∗, so we can’t solve the average cost problem as an SSP problem. But similar value and policy iteration algorithms are possible.

• Example: A manufacturer at each time:
− Receives an order with prob. p and no order with prob. 1 − p.
− May process all unfilled orders at cost K > 0, or process no order at all. The cost per unfilled order at each time is c > 0.
− The maximum number of orders that can remain unfilled is n.
− Find a processing policy that minimizes the total expected cost per stage.

EXAMPLE (CONTINUED)

• State = number of unfilled orders. State 0 is the special state for the SSP formulation.

• Bellman’s equation: For states i = 0, 1, . . . , n − 1

λ∗ + h∗(i) = min[ K + (1 − p)h∗(0) + ph∗(1),  ci + (1 − p)h∗(i) + ph∗(i + 1) ],

and for state n

λ∗ + h∗(n) = K + (1 − p)h∗(0) + ph∗(1)

• Optimal policy: Process the i unfilled orders if

K + (1 − p)h∗(0) + ph∗(1) ≤ ci + (1 − p)h∗(i) + ph∗(i + 1).

• Intuitively, h∗(i) is monotonically nondecreasing with i (interpret h∗(i) as the optimal cost-to-go for the associated SSP problem). So a threshold policy is optimal: process the orders if their number exceeds some threshold integer m∗.

VALUE ITERATION

• Natural value iteration method: Generate optimal k-stage costs by the DP algorithm starting with any J0:

Jk+1(i) = min_{u∈U(i)} [ g(i, u) + ∑_{j=1}^n pij(u) Jk(j) ],  ∀ i

• Result: lim_{k→∞} Jk(i)/k = λ∗ for all i.

• Proof outline: Let J∗k be the sequence generated in this way from the initial condition J∗0 = h∗. Then, by induction,

J∗k(i) = kλ∗ + h∗(i),  ∀ i, ∀ k.

On the other hand,

|Jk(i) − J∗k(i)| ≤ max_{j=1,...,n} |J0(j) − h∗(j)|,  ∀ i

since Jk(i) and J∗k(i) are optimal costs for two k-stage problems that differ only in the terminal cost functions, which are J0 and h∗.

RELATIVE VALUE ITERATION

• The value iteration method just described has two drawbacks:

− Since typically some components of Jk diverge to ∞ or −∞, calculating lim_{k→∞} Jk(i)/k is numerically cumbersome.

− The method will not compute a corresponding differential cost vector h∗.

• We can bypass both difficulties by subtracting a constant from all components of the vector Jk, so that the difference, call it hk, remains bounded.

• Relative value iteration algorithm (see the sketch below): Pick any state s, and iterate according to

hk+1(i) = min_{u∈U(i)} [ g(i, u) + ∑_{j=1}^n pij(u) hk(j) ] − min_{u∈U(s)} [ g(s, u) + ∑_{j=1}^n psj(u) hk(j) ],  ∀ i

• Then we can show hk → h∗ (under an extraassumption).
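• For illustration, a minimal Python sketch of relative value iteration on a small average cost problem; the data arrays g[i][u] and P[i][u][j] and the reference state s are hypothetical.

import numpy as np

def relative_value_iteration(g, P, s=0, n_iter=2000):
    """Returns (estimate of the optimal average cost per stage, differential costs h)."""
    n = len(g)
    h = np.zeros(n)
    lam = 0.0
    for _ in range(n_iter):
        Th = np.array([min(g[i][u] + np.dot(P[i][u], h) for u in range(len(g[i])))
                       for i in range(n)])
        lam = Th[s] - h[s]          # estimate of lambda*
        h = Th - Th[s]              # subtract the value at the reference state
    return lam, h

# Example: two states; control 0 stays put (cost 2), control 1 switches states (cost 1).
# The optimal average cost per stage is 1 (keep switching).
g = [[2.0, 1.0], [2.0, 1.0]]
P = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
print(relative_value_iteration(g, P))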

POLICY ITERATION

• At the typical iteration, we have a stationary µk.

• Policy evaluation: Compute λk and hk(i) of µk, using the n + 1 equations hk(n) = 0 and

λk + hk(i) = g(i, µk(i)) + ∑_{j=1}^n pij(µk(i)) hk(j),  ∀ i

• Policy improvement: Find for all i

µk+1(i) = arg min_{u∈U(i)} [ g(i, u) + ∑_{j=1}^n pij(u) hk(j) ]

• If λk+1 = λk and hk+1(i) = hk(i) for all i, stop; otherwise, repeat with µk+1 replacing µk.

• Result: For each k, we either have λk+1 < λk or

λk+1 = λk,  hk+1(i) ≤ hk(i),  i = 1, . . . , n.

The algorithm terminates with an optimal policy.

The algorithm terminates with an optimal policy.

6.231 DYNAMIC PROGRAMMING

LECTURE 16

LECTURE OUTLINE

• Control of continuous-time Markov chains –Semi-Markov problems

• Problem formulation – Equivalence to discrete-time problems

• Discounted problems

• Average cost problems

CONTINUOUS-TIME MARKOV CHAINS

• Stationary system with finite number of statesand controls

• State transitions occur at discrete times

• Control applied at these discrete times and staysconstant between transitions

• Time between transitions is random

• Cost accumulates in continuous time (may alsobe incurred at the time of transition)

• Example: Admission control in a system withrestricted capacity (e.g., a communication link)

− Customer arrivals: a Poisson process− Customers entering the system, depart after

exponentially distributed time− Upon arrival we must decide whether to ad-

mit or to block a customer− There is a cost for blocking a customer− For each customer that is in the system, there

is a customer-dependent reward per unit time− Minimize time-discounted or average cost

PROBLEM FORMULATION

• x(t) and u(t): State and control at time t

• tk: Time of kth transition (t0 = 0)

• xk = x(tk); x(t) = xk for tk ≤ t < tk+1.

• uk = u(tk); u(t) = uk for tk ≤ t < tk+1.

• No transition probabilities; instead transition distributions (which quantify the uncertainty about both the transition time and the next state)

Qij(τ, u) = P{tk+1 − tk ≤ τ, xk+1 = j | xk = i, uk = u}

• Two important formulas:

(1) Transition probabilities are specified by

pij(u) = P{xk+1 = j | xk = i, uk = u} = lim_{τ→∞} Qij(τ, u)

(2) The Cumulative Distribution Function (CDF) of τ given i, j, u is (assuming pij(u) > 0)

P{tk+1 − tk ≤ τ | xk = i, xk+1 = j, uk = u} = Qij(τ, u)/pij(u)

Thus, Qij(τ, u) can be viewed as a “scaled CDF”

EXPONENTIAL TRANSITION DISTRIBUTIONS

• Important example of transition distributions:

Qij(τ, u) = pij(u)(1 − e^{−νi(u)τ}),

where pij(u) are transition probabilities, and νi(u) is called the transition rate at state i.

• Interpretation: If the system is in state i and control u is applied
− the next state will be j with probability pij(u)
− the time between the transition to state i and the transition to the next state j is exponentially distributed with parameter νi(u) (independently of j):

P{transition time interval > τ | i, u} = e^{−νi(u)τ}

• The exponential distribution is memoryless. This implies that for a given policy, the system is a continuous-time Markov chain (the future depends on the past through the present).

• Without the memoryless property, the Markovproperty holds only at the times of transition.

COST STRUCTURES

• There is a cost g(i, u) per unit time, i.e.,

g(i, u)dt = the cost incurred in time dt

• There may be an extra “instantaneous” cost ĝ(i, u) at the time of a transition (let’s ignore this for the moment)

• Total discounted cost of π = {µ0, µ1, . . .} starting from state i (with discount factor β > 0)

lim_{N→∞} E{ ∑_{k=0}^{N−1} ∫_{tk}^{tk+1} e^{−βt} g(xk, µk(xk)) dt | x0 = i }

• Average cost per unit time

lim_{N→∞} (1/E{tN}) E{ ∑_{k=0}^{N−1} ∫_{tk}^{tk+1} g(xk, µk(xk)) dt | x0 = i }

• We will see that both problems have equivalentdiscrete-time versions.

A NOTE ON NOTATION

• The scaled CDF Qij(τ, u) can be used to model discrete, continuous, and mixed distributions for the transition time τ.

• Generally, expected values of functions of τ can be written as integrals involving dQij(τ, u). For example, the conditional expected value of τ given i, j, and u is written as

E{τ | i, j, u} = ∫_0^∞ τ dQij(τ, u) / pij(u)

• If Qij(τ, u) is continuous with respect to τ, its derivative

qij(τ, u) = dQij(τ, u)/dτ

can be viewed as a “scaled” density function. Expected values of functions of τ can then be written in terms of qij(τ, u). For example

E{τ | i, j, u} = ∫_0^∞ τ qij(τ, u) dτ / pij(u)

• If Qij(τ, u) is discontinuous and “staircase-like,” expected values can be written as summations.

DISCOUNTED PROBLEMS – COST CALCULATION

• For a policy π = {µ0, µ1, . . .}, write

Jπ(i) = E{1st transition cost} + E{ e^{−βτ} Jπ1(j) | i, µ0(i) }

where Jπ1(j) is the cost-to-go of the policy π1 = {µ1, µ2, . . .}

• We calculate the two costs in the RHS. The E{1st transition cost}, if u is applied at state i, is

G(i, u) = E_j{ E_τ{1st transition cost | j} }
        = ∑_{j=1}^n pij(u) ∫_0^∞ ( ∫_0^τ e^{−βt} g(i, u) dt ) dQij(τ, u)/pij(u)
        = ∑_{j=1}^n ∫_0^∞ ((1 − e^{−βτ})/β) g(i, u) dQij(τ, u)

• Thus the E{1st transition cost} is

G(i, µ0(i)) = g(i, µ0(i)) ∑_{j=1}^n ∫_0^∞ ((1 − e^{−βτ})/β) dQij(τ, µ0(i))

COST CALCULATION (CONTINUED)

• Also the expected (discounted) cost from thenext state j is

E{e−βτJπ1(j) | i, µ0(i)

}= Ej

{E{e−βτ | i, µ0(i), j}Jπ1(j) | i, µ0(i)

}=

n∑j=1

pij(u)(∫ ∞

0

e−βτdQij(τ, u)

pij(u)

)Jπ1(j)

=n∑

j=1

mij

(µ(i)

)Jπ1(j)

where mij(u) is given by

mij(u) =

∫ ∞

0

e−βτdQij(τ, u)

(<

∫ ∞

0

dQij(τ, u) = pij(u)

)

and can be viewed as the “effective discount fac-tor” [the analog of αpij(u) in the discrete-timecase].

• So Jπ(i) can be written as

Jπ(i) = G(i, µ0(i)

)+

n∑j=1

mij

(µ0(i)

)Jπ1(j)

EQUIVALENCE TO AN SSP

• Similar to the discrete-time case, introduce a stochastic shortest path problem with an artificial termination state t

• Under control u, from state i the system moves to state j with probability mij(u) and to the termination state t with probability 1 − ∑_{j=1}^n mij(u)

• Bellman’s equation: For i = 1, . . . , n,

J∗(i) = min_{u∈U(i)} [ G(i, u) + ∑_{j=1}^n mij(u) J∗(j) ]

• Analogs of value iteration, policy iteration, and linear programming.

• If in addition to the cost per unit time g, there is an extra (instantaneous) one-stage cost ĝ(i, u), Bellman’s equation becomes

J∗(i) = min_{u∈U(i)} [ ĝ(i, u) + G(i, u) + ∑_{j=1}^n mij(u) J∗(j) ]

MANUFACTURER’S EXAMPLE REVISITED

• A manufacturer receives orders with interarrival times uniformly distributed in [0, τmax].

• He may process all unfilled orders at cost K > 0, or process none. The cost per unit time of an unfilled order is c. The maximum number of unfilled orders is n.

• The nonzero transition distributions are

Qi1(τ, Fill) = Qi(i+1)(τ, Not Fill) = min[ 1, τ/τmax ]

• The one-stage expected cost G is

G(i, Fill) = 0,  G(i, Not Fill) = γ c i,

where

γ = ∑_{j=1}^n ∫_0^∞ ((1 − e^{−βτ})/β) dQij(τ, u) = ∫_0^{τmax} ((1 − e^{−βτ})/(βτmax)) dτ

• There is an “instantaneous” cost

ĝ(i, Fill) = K,  ĝ(i, Not Fill) = 0

MANUFACTURER’S EXAMPLE CONTINUED

• The “effective discount factors” mij(u) in Bellman’s equation are

mi1(Fill) = mi(i+1)(Not Fill) = α,

where

α = ∫_0^∞ e^{−βτ} dQij(τ, u) = ∫_0^{τmax} (e^{−βτ}/τmax) dτ = (1 − e^{−βτmax})/(βτmax)

• Bellman’s equation has the form

J∗(i) = min[ K + αJ∗(1),  γci + αJ∗(i + 1) ],  i = 1, 2, . . .

• As in the discrete-time case, we can conclude that there exists an optimal threshold i∗:

fill the orders <==> their number i exceeds i∗

AVERAGE COST

• Minimize

lim_{N→∞} (1/E{tN}) E{ ∫_0^{tN} g(x(t), u(t)) dt }

assuming there is a special state that is “recurrent under all policies”

• Total expected cost of a transition

G(i, u) = g(i, u) τ̄i(u),  where τ̄i(u): expected transition time.

• We now apply the SSP argument used for the discrete-time case. Divide the trajectory into cycles marked by successive visits to n. The cost at (i, u) is G(i, u) − λ∗ τ̄i(u), where λ∗ is the optimal expected cost per unit time. Each cycle is viewed as a state trajectory of a corresponding SSP problem with the termination state being essentially n.

• So Bellman’s equation for the average cost problem:

h∗(i) = min_{u∈U(i)} [ G(i, u) − λ∗ τ̄i(u) + ∑_{j=1}^n pij(u) h∗(j) ]

AVERAGE COST MANUFACTURER’S EXAMPLE

• The expected transition times are

τ̄i(Fill) = τ̄i(Not Fill) = τmax/2,

the expected transition cost is

G(i, Fill) = 0,  G(i, Not Fill) = c i τmax/2,

and there is also the “instantaneous” cost

ĝ(i, Fill) = K,  ĝ(i, Not Fill) = 0

• Bellman’s equation:

h∗(i) = min[ K − λ∗τmax/2 + h∗(1),  c i τmax/2 − λ∗τmax/2 + h∗(i + 1) ]

• Again it can be shown that a threshold policy is optimal.

6.231 DYNAMIC PROGRAMMING

LECTURE 17

LECTURE OUTLINE

• We start a four-lecture sequence on advancedinfinite horizon DP

• We allow infinite state space, so the stochasticshortest path framework cannot be used any more

• Results are rigorous assuming a countable dis-turbance space

− This includes deterministic problems witharbitrary state space, and countable stateMarkov chains

− Otherwise the mathematics of measure the-ory make analysis difficult, although the fi-nal results are essentially the same as forcountable disturbance space

• The discounted problem is the proper startingpoint for this analysis

• The central mathematical structure is that theDP mapping is a contraction mapping (instead ofexistence of a termination state)

DISCOUNTED PROBLEMS W/ BOUNDED COST

• Stationary system with arbitrary state space

xk+1 = f(xk, uk, wk), k = 0, 1, . . .

• Cost of a policy π = {µ0, µ1, . . .}

Jπ(x0) = lim_{N→∞} E_{wk, k=0,1,...} { ∑_{k=0}^{N−1} α^k g(xk, µk(xk), wk) }

with α < 1, and for some M, we have |g(x, u, w)| ≤ M for all (x, u, w)

• Shorthand notation for DP mappings (they operate on functions of the state to produce other such functions)

(TJ)(x) = min_{u∈U(x)} E_w{ g(x, u, w) + αJ(f(x, u, w)) },  ∀ x

TJ is the optimal cost function for the one-stage problem with stage cost g and terminal cost αJ.

• For any stationary policy µ

(TµJ)(x) = E_w{ g(x, µ(x), w) + αJ(f(x, µ(x), w)) },  ∀ x

“SHORTHAND” THEORY – A SUMMARY

• Cost function expressions [with J0(x) ≡ 0]

Jπ(x) = lim_{k→∞} (Tµ0 Tµ1 · · · Tµk J0)(x),   Jµ(x) = lim_{k→∞} (Tµ^k J0)(x)

• Bellman’s equation: J∗ = TJ∗,  Jµ = TµJµ

• Optimality condition:

µ: optimal <==> TµJ∗ = TJ∗

• Value iteration: For any (bounded) J and all x,

J∗(x) = lim_{k→∞} (T^k J)(x)

• Policy iteration: Given µk,
− Policy evaluation: Find Jµk by solving Jµk = Tµk Jµk
− Policy improvement: Find µk+1 such that Tµk+1 Jµk = TJµk
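• For illustration, a minimal Python sketch of the mappings T and Tµ and of value/policy iteration written in this shorthand, for a finite-state discounted problem with hypothetical data arrays g[i][u] and P[i][u][j].

import numpy as np

def T(J, g, P, alpha):
    # (TJ)(i) = min_u [ g(i,u) + alpha * sum_j p_ij(u) J(j) ]
    return np.array([min(g[i][u] + alpha * np.dot(P[i][u], J) for u in range(len(g[i])))
                     for i in range(len(g))])

def T_mu(J, mu, g, P, alpha):
    # (T_mu J)(i) = g(i,mu(i)) + alpha * sum_j p_ij(mu(i)) J(j)
    return np.array([g[i][mu[i]] + alpha * np.dot(P[i][mu[i]], J) for i in range(len(g))])

def value_iteration(g, P, alpha, n_iter=500):
    J = np.zeros(len(g))
    for _ in range(n_iter):
        J = T(J, g, P, alpha)                 # J* = lim_k T^k J
    return J

def policy_iteration(g, P, alpha):
    n = len(g)
    mu = [0] * n
    while True:
        # Policy evaluation: solve J_mu = T_mu J_mu (a linear system).
        P_mu = np.array([P[i][mu[i]] for i in range(n)], dtype=float)
        g_mu = np.array([g[i][mu[i]] for i in range(n)], dtype=float)
        J_mu = np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)
        # Policy improvement: find mu' with T_mu' J_mu = T J_mu.
        mu_new = [int(np.argmin([g[i][u] + alpha * np.dot(P[i][u], J_mu)
                                 for u in range(len(g[i]))])) for i in range(n)]
        if mu_new == mu:
            return mu, J_mu
        mu = mu_new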

TWO KEY PROPERTIES

• Monotonicity property: For any functions J and J′ such that J(x) ≤ J′(x) for all x, and any µ

(TJ)(x) ≤ (TJ′)(x),  ∀ x,

(TµJ)(x) ≤ (TµJ′)(x),  ∀ x.

• Additivity property: For any J, any scalar r, and any µ

(T(J + re))(x) = (TJ)(x) + αr,  ∀ x,

(Tµ(J + re))(x) = (TµJ)(x) + αr,  ∀ x,

where e is the unit function [e(x) ≡ 1].

CONVERGENCE OF VALUE ITERATION

• If J0 ≡ 0,

J∗(x) = lim_{N→∞} (T^N J0)(x), for all x

Proof: For any initial state x0 and policy π = {µ0, µ1, . . .},

Jπ(x0) = E{ ∑_{k=0}^∞ α^k g(xk, µk(xk), wk) }
       = E{ ∑_{k=0}^{N−1} α^k g(xk, µk(xk), wk) } + E{ ∑_{k=N}^∞ α^k g(xk, µk(xk), wk) }

The tail portion satisfies

| E{ ∑_{k=N}^∞ α^k g(xk, µk(xk), wk) } | ≤ α^N M/(1 − α),

where M ≥ |g(x, u, w)|. Take the min over π of both sides. Q.E.D.

BELLMAN’S EQUATION

• The optimal cost function J∗ satisfies Bellman’s Eq., i.e., J∗ = T(J∗).

Proof: For all x and N,

J∗(x) − α^N M/(1 − α) ≤ (T^N J0)(x) ≤ J∗(x) + α^N M/(1 − α),

where J0(x) ≡ 0 and M ≥ |g(x, u, w)|. Applying T to this relation, and using Monotonicity and Additivity,

(TJ∗)(x) − α^{N+1} M/(1 − α) ≤ (T^{N+1} J0)(x) ≤ (TJ∗)(x) + α^{N+1} M/(1 − α)

Taking the limit as N → ∞ and using the fact

lim_{N→∞} (T^{N+1} J0)(x) = J∗(x)

we obtain J∗ = TJ∗. Q.E.D.

THE CONTRACTION PROPERTY

• Contraction property: For any bounded functions J and J′, and any µ,

max_x |(TJ)(x) − (TJ′)(x)| ≤ α max_x |J(x) − J′(x)|,

max_x |(TµJ)(x) − (TµJ′)(x)| ≤ α max_x |J(x) − J′(x)|.

Proof: Denote c = max_{x∈S} |J(x) − J′(x)|. Then

J(x) − c ≤ J′(x) ≤ J(x) + c,  ∀ x

Apply T to both sides, and use the Monotonicity and Additivity properties:

(TJ)(x) − αc ≤ (TJ′)(x) ≤ (TJ)(x) + αc,  ∀ x

Hence

|(TJ)(x) − (TJ′)(x)| ≤ αc,  ∀ x.

Q.E.D.

IMPLICATIONS OF CONTRACTION PROPERTY

• Bellman’s equation J = TJ has a unique solution, namely J∗, and for any bounded J, we have

lim_{k→∞} (T^k J)(x) = J∗(x),  ∀ x

Proof: Use

max_x |(T^k J)(x) − J∗(x)| ≤ max_x |(T^k J)(x) − (T^k J∗)(x)| ≤ α^k max_x |J(x) − J∗(x)|

• Convergence rate: For all k,

max_x |(T^k J)(x) − J∗(x)| ≤ α^k max_x |J(x) − J∗(x)|

• Also, for each stationary µ, Jµ is the unique solution of J = TµJ and

lim_{k→∞} (Tµ^k J)(x) = Jµ(x),  ∀ x,

for any bounded J.

NEC. AND SUFFICIENT OPT. CONDITION

• A stationary policy µ is optimal if and only if µ(x) attains the minimum in Bellman’s equation for each x; i.e.,

TJ∗ = TµJ∗.

Proof: If TJ∗ = TµJ∗, then using Bellman’s equation (J∗ = TJ∗), we have

J∗ = TµJ∗,

so by uniqueness of the fixed point of Tµ, we obtain J∗ = Jµ; i.e., µ is optimal.

• Conversely, if the stationary policy µ is optimal, we have J∗ = Jµ, so

J∗ = TµJ∗.

Combining this with Bellman’s equation (J∗ = TJ∗), we obtain TJ∗ = TµJ∗. Q.E.D.

COMPUTATIONAL METHODS

• Value iteration and variants
− Gauss-Seidel version
− Approximate value iteration

• Policy iteration and variants
− Combination with value iteration
− Modified policy iteration
− Asynchronous policy iteration

• Linear programming

maximize ∑_{i=1}^n J(i)
subject to J(i) ≤ g(i, u) + α ∑_{j=1}^n pij(u) J(j),  ∀ (i, u)

• Approximate linear programming: use in place of J(i) a low-dimensional basis function representation

J̃(i, r) = ∑_{k=1}^m rk wk(i)

and a low-dimensional LP (with many constraints)

6.231 DYNAMIC PROGRAMMING

LECTURE 18

LECTURE OUTLINE

• One-step lookahead and rollout for discountedproblems

• Approximate policy iteration: Infinite statespace

• Contraction mappings in DP

• Discounted problems: Countable state spacewith unbounded costs

ONE-STEP LOOKAHEAD POLICIES

• At state i use the control µ(i) that attains the minimum in

(TJ̃)(i) = min_{u∈U(i)} [ g(i, u) + α ∑_{j=1}^n pij(u) J̃(j) ],

where J̃ is some approximation to J∗.

• Assume that

TJ̃ ≤ J̃ + δe,

for some scalar δ, where e is the unit vector. Then

Jµ ≤ TJ̃ + (αδ/(1 − α)) e ≤ J̃ + (δ/(1 − α)) e.

• Assume that

J∗ − εe ≤ J̃ ≤ J∗ + εe,

for some scalar ε. Then

Jµ ≤ J∗ + (2αε/(1 − α)) e.

APPLICATION TO ROLLOUT POLICIES

• Let µ1, . . . , µM be stationary policies, and let

J̃(i) = min{ Jµ1(i), . . . , JµM(i) },  ∀ i.

• Then, for all i and m = 1, . . . , M, we have

(TJ̃)(i) = min_{u∈U(i)} [ g(i, u) + α ∑_{j=1}^n pij(u) J̃(j) ]
        ≤ min_{u∈U(i)} [ g(i, u) + α ∑_{j=1}^n pij(u) Jµm(j) ]
        ≤ Jµm(i)

• Taking the minimum over m,

(TJ̃)(i) ≤ J̃(i),  ∀ i.

• Using the preceding slide’s result with δ = 0,

Jµ(i) ≤ J̃(i) = min{ Jµ1(i), . . . , JµM(i) },  ∀ i,

i.e., the rollout policy µ improves over each µm.

APPROXIMATE POLICY ITERATION

• Suppose that the policy evaluation is approximate, according to

max_x |Jk(x) − Jµk(x)| ≤ δ,  k = 0, 1, . . .

and the policy improvement is approximate, according to

max_x |(Tµk+1 Jk)(x) − (TJk)(x)| ≤ ε,  k = 0, 1, . . .

where δ and ε are some positive scalars.

• Error Bound: The sequence {µk} generated by approximate policy iteration satisfies

lim sup_{k→∞} max_{x∈S} ( Jµk(x) − J∗(x) ) ≤ (ε + 2αδ)/(1 − α)^2

• Typical practical behavior: The method makes steady progress up to a point and then the iterates Jµk oscillate within a neighborhood of J∗.

CONTRACTION MAPPINGS

• Given a real vector space Y with a norm ‖ · ‖ (i.e., ‖y‖ ≥ 0 for all y ∈ Y, ‖y‖ = 0 if and only if y = 0, and ‖y + z‖ ≤ ‖y‖ + ‖z‖ for all y, z ∈ Y)

• A function F: Y → Y is said to be a contraction mapping if for some ρ ∈ (0, 1), we have

‖F(y) − F(z)‖ ≤ ρ‖y − z‖, for all y, z ∈ Y.

ρ is called the modulus of contraction of F.

• For m > 1, we say that F is an m-stage contraction if F^m is a contraction.

• Important example: Let S be a set (e.g., the state space in DP) and v: S → ℜ be a positive-valued function. Let B(S) be the set of all functions J: S → ℜ such that J(s)/v(s) is bounded over s.

• We define a norm on B(S), called the weighted sup-norm, by

‖J‖ = max_{s∈S} |J(s)|/v(s).

• Important special case: the discounted problem mappings T and Tµ [for v(s) ≡ 1, ρ = α].

CONTRACTION MAPPING FIXED-POINT TH.

• Contraction Mapping Fixed-Point Theorem: If F: B(S) → B(S) is a contraction with modulus ρ ∈ (0, 1), then there exists a unique J∗ ∈ B(S) such that

J∗ = FJ∗.

Furthermore, if J is any function in B(S), then {F^k J} converges to J∗ and we have

‖F^k J − J∗‖ ≤ ρ^k ‖J − J∗‖,  k = 1, 2, . . . .

• A similar result holds if F is an m-stage contraction mapping.

• This is a special case of a general result for contraction mappings F: Y → Y over normed vector spaces Y that are complete: every sequence {yk} that is Cauchy (satisfies ‖ym − yn‖ → 0 as m, n → ∞) converges.

• The space B(S) is complete (see the text for a proof).

A DP-LIKE CONTRACTION MAPPING I

• Let S = {1, 2, . . .}, and let F: B(S) → B(S) be a linear mapping of the form

(FJ)(i) = b(i) + ∑_{j∈S} a(i, j) J(j),  ∀ i

where b(i) and a(i, j) are some scalars. Then F is a contraction with modulus ρ if

∑_{j∈S} |a(i, j)| v(j) / v(i) ≤ ρ,  ∀ i

• Let F: B(S) → B(S) be a mapping of the form

(FJ)(i) = min_{µ∈M} (FµJ)(i),  ∀ i

where M is a parameter set, and for each µ ∈ M, Fµ is a contraction mapping from B(S) to B(S) with modulus ρ. Then F is a contraction mapping with modulus ρ.

A DP-LIKE CONTRACTION MAPPING II

• Let S = {1, 2, . . .}, let M be a parameter set,and for each µ ∈ M , let

(FµJ)(i) = b(i, µ) +∑j∈S

a(i, j, µ)J(j), ∀ i

• We have FµJ ∈ B(S) for all J ∈ B(S) providedbµ ∈ B(S) and Vµ ∈ B(S), where

bµ ={b(1, µ), b(2, µ), . . .

}, Vµ =

{V (1, µ), V (2, µ), . . .

},

V (i, µ) =∑j∈S

∣∣a(i, j, µ)∣∣ v(j), ∀ i

• Consider the mapping F

(FJ)(i) = minµ∈M

(FµJ)(i), ∀ i

We have FJ ∈ B(S) for all J ∈ B(S), providedb ∈ B(S) and V ∈ B(S), where

b ={b(1), b(2), . . .

}, V =

{V (1), V (2), . . .

},

with b(i) = maxµ∈M b(i, µ) and V (i) = maxµ∈M V (i, µ).

DISCOUNTED DP - UNBOUNDED COST I

• State space S = {1, 2, . . .}, transition probabilities pij(u), cost g(i, u).

• Weighted sup-norm

‖J‖ = max_{i∈S} |J(i)|/vi

on B(S): sequences {J(i)} such that ‖J‖ < ∞.

• Assumptions:

(a) G = { G(1), G(2), . . . } ∈ B(S), where

G(i) = max_{u∈U(i)} |g(i, u)|,  ∀ i

(b) V = { V(1), V(2), . . . } ∈ B(S), where

V(i) = max_{u∈U(i)} ∑_{j∈S} pij(u) vj,  ∀ i

(c) There exists an integer m ≥ 1 and a scalar ρ ∈ (0, 1) such that for every policy π,

α^m ∑_{j∈S} P(xm = j | x0 = i, π) vj / vi ≤ ρ,  ∀ i

DISCOUNTED DP - UNBOUNDED COST II

• Example: Let vi = i for all i = 1, 2, . . .

• Assumption (a) is satisfied if the maximum ex-pected absolute cost per stage at state i grows nofaster than linearly with i.

• Assumption (b) states that the maximum ex-pected next state following state i,

maxu∈U(i)

E{j | i, u},

also grows no faster than linearly with i.

• Assumption (c) is satisfied if

αm∑j∈S

P (xm = j | x0 = i, π) j ≤ ρ i, ∀ i

It requires that for all π, the expected value of thestate obtained m stages after reaching state i is nomore than α−mρ i.

• If there is bounded upward expected change ofthe state starting at i, there exists m sufficientlylarge so that Assumption (c) is satisfied.

DISCOUNTED DP - UNBOUNDED COST III

• Consider the DP mappings Tµ and T,

(TµJ)(i) = g(i, µ(i)) + α ∑_{j∈S} pij(µ(i)) J(j),  ∀ i,

(TJ)(i) = min_{u∈U(i)} [ g(i, u) + α ∑_{j∈S} pij(u) J(j) ],  ∀ i

• Proposition: Under the earlier assumptions, T and Tµ map B(S) into B(S), and are m-stage contraction mappings with modulus ρ.

• The m-stage contraction properties can be used to essentially replicate the analysis for the case of bounded cost, and to show the standard results:

− The value iteration method Jk+1 = TJk converges to the unique solution J∗ of Bellman’s equation J = TJ.

− The unique solution J∗ of Bellman’s equation is the optimal cost function.

− A stationary policy µ is optimal if and only if TµJ∗ = TJ∗.

6.231 DYNAMIC PROGRAMMING

LECTURE 19

LECTURE OUTLINE

• Undiscounted problems

• Stochastic shortest path problems (SSP)

• Proper and improper policies

• Analysis and computational methods for SSP

• Pathologies of SSP

UNDISCOUNTED PROBLEMS

• System: xk+1 = f(xk, uk, wk)

• Cost of a policy π = {µ0, µ1, . . .}

Jπ(x0) = lim_{N→∞} E_{wk, k=0,1,...} { ∑_{k=0}^{N−1} g(xk, µk(xk), wk) }

• Shorthand notation for DP mappings

(TJ)(x) = min_{u∈U(x)} E_w{ g(x, u, w) + J(f(x, u, w)) },  ∀ x

• For any stationary policy µ

(TµJ)(x) = E_w{ g(x, µ(x), w) + J(f(x, µ(x), w)) },  ∀ x

• Neither T nor Tµ is a contraction in general, but their monotonicity is helpful.

• SSP problems provide a “soft boundary” between the easy finite-state discounted problems and the hard undiscounted problems.

− They share features of both.
− Some of the nice theory is recovered because of the termination state.

SSP THEORY SUMMARY I

• As earlier, we have a cost-free termination state t, a finite number of states 1, . . . , n, and a finite number of controls, but we will make weaker assumptions.

• Mappings T and Tµ (modified to account for the termination state t):

(TJ)(i) = min_{u∈U(i)} [ g(i, u) + ∑_{j=1}^n pij(u) J(j) ],  i = 1, . . . , n,

(TµJ)(i) = g(i, µ(i)) + ∑_{j=1}^n pij(µ(i)) J(j),  i = 1, . . . , n.

• Definition: A stationary policy µ is called proper if, under µ, from every state i there is a positive probability path that leads to t.

• Important fact: If µ is proper, Tµ is a contraction with respect to some weighted max norm:

max_i (1/vi)|(TµJ)(i) − (TµJ′)(i)| ≤ ρµ max_i (1/vi)|J(i) − J′(i)|

• T is similarly a contraction if all µ are proper (the case discussed in the text, Ch. 7, Vol. I).

SSP THEORY SUMMARY II

• The theory can be pushed one step further. Assume that:

(a) There exists at least one proper policy

(b) For each improper µ, Jµ(i) = ∞ for some i

• Then T is not necessarily a contraction, but:
− J∗ is the unique solution of Bellman’s equation
− µ∗ is optimal if and only if Tµ∗J∗ = TJ∗
− lim_{k→∞} (T^k J)(i) = J∗(i) for all i
− Policy iteration terminates with an optimal policy, if started with a proper policy

• Example: Deterministic shortest path problem with a single destination t.
− States <=> nodes; Controls <=> arcs
− Termination state <=> the destination
− Assumption (a) <=> every node is connected to the destination
− Assumption (b) <=> all cycle costs > 0

SSP ANALYSIS I

• For a proper policy µ, Jµ is the unique fixed point of Tµ, and Tµ^k J → Jµ for all J (holds by the theory of Vol. I, Section 7.2)

• A stationary µ satisfying J ≥ TµJ for some J must be proper − true because

J ≥ Tµ^k J = Pµ^k J + ∑_{m=0}^{k−1} Pµ^m gµ

and some component of the term on the right blows up if µ is improper (by our assumptions).

• Consequence: T can have at most one fixed point.

Proof: If J and J′ are two solutions, select µ and µ′ such that J = TJ = TµJ and J′ = TJ′ = Tµ′J′. By the preceding assertion, µ and µ′ must be proper, and J = Jµ and J′ = Jµ′. Also

J = T^k J ≤ Tµ′^k J → Jµ′ = J′

Similarly, J′ ≤ J, so J = J′.

SSP ANALYSIS II

• We now show that T has a fixed point, and also that policy iteration converges.

• Generate a sequence {µk} by policy iteration starting from a proper policy µ0.

• µ1 is proper and Jµ0 ≥ Jµ1 since

Jµ0 = Tµ0 Jµ0 ≥ TJµ0 = Tµ1 Jµ0 ≥ Tµ1^k Jµ0 ≥ Jµ1

• Thus {Jµk} is nonincreasing, some policy µ will be repeated, with Jµ = TJµ. So Jµ is a fixed point of T.

• Next show T^k J → Jµ for all J, i.e., value iteration converges to the same limit as policy iteration. (Sketch: True if J = Jµ; argue using the properness of µ to show that the terminal cost difference J − Jµ does not matter.)

• To show Jµ = J∗, note that for any π = {µ0, µ1, . . .}

Tµ0 · · · Tµk−1 J0 ≥ T^k J0,

where J0 ≡ 0. Take lim sup as k → ∞ to obtain Jπ ≥ Jµ, so µ is optimal and Jµ = J∗.

SSP ANALYSIS III

• If all policies are proper (the assumption of Section 7.1, Vol. I), Tµ and T are contractions with respect to a weighted sup norm.

Proof: Consider a new SSP problem where the transition probabilities are the same as in the original, but the transition costs are all equal to −1. Let J be the corresponding optimal cost vector. For all µ,

J(i) = −1 + min_{u∈U(i)} ∑_{j=1}^n pij(u) J(j) ≤ −1 + ∑_{j=1}^n pij(µ(i)) J(j)

For vi = −J(i), we have vi ≥ 1, and for all µ,

∑_{j=1}^n pij(µ(i)) vj ≤ vi − 1 ≤ ρ vi,  i = 1, . . . , n,

where

ρ = max_{i=1,...,n} (vi − 1)/vi < 1.

This implies contraction of Tµ and T by the results of the preceding lecture.

PATHOLOGIES I: DETERM. SHORTEST PATHS

• If there is a cycle with cost = 0, Bellman’s equa-tion has an infinite number of solutions. Example:

0

0

11 2 t

• We have J∗(1) = J∗(2) = 1.

• Bellman’s equation is

J(1) = J(2), J(2) = min[J(1), 1].

• It has J∗ as a solution.

• Set of solutions of Bellman's equation:

{ J | J(1) = J(2) ≤ 1 }.

PATHOLOGIES II: DETERM. SHORTEST PATHS

• If there is a cycle with cost < 0, Bellman's equation has no solution [among functions J with −∞ < J(i) < ∞ for all i]. Example:

[Figure: nodes 1, 2, and destination t; arc 1 → 2 with cost 0, arc 2 → 1 with cost −1, and arc 2 → t with cost 1]

• We have J∗(1) = J∗(2) = −∞.

• Bellman’s equation is

J(1) = J(2), J(2) = min[−1 + J(1), 1].

• There is no solution [among functions J with −∞ < J(i) < ∞ for all i].

• Bellman's equation has as solution J∗(1) = J∗(2) = −∞ [within the larger class of functions J(·) that can take the value −∞ for some (or all) states]. This situation can be generalized (see Chapter 3 of Vol. 2 of the text).

PATHOLOGIES III: THE BLACKMAILER

• Two states, state 1 and the termination state t.

• At state 1, choose a control u ∈ (0, 1] (the blackmail amount demanded) at a cost −u, and move to t with probability u², or stay in 1 with probability 1 − u².

• Every stationary policy is proper, but the control set is not finite.

• For any stationary µ with µ(1) = u, we have

Jµ(1) = −u + (1 − u²) Jµ(1)

from which Jµ(1) = −1/u

• Thus J∗(1) = −∞, and there is no optimal stationary policy.

• It turns out that a nonstationary policy is optimal: demand µk(1) = γ/(k + 1) at time k, with γ ∈ (0, 1/2). (The blackmailer requests diminishing amounts over time, which add to ∞; the probability of the victim's refusal diminishes at a much faster rate.)

6.231 DYNAMIC PROGRAMMING

LECTURE 20

LECTURE OUTLINE

• We begin a 6-lecture series on approximate DP for large/intractable problems.

• We will mainly follow Chapter 6, Vol. 2 of the text (with supplemental refs). Note: An updated/expanded version of Chapter 6, Vol. 2 is posted on the internet.

• In this lecture we classify/overview the main approaches:

− Rollout/Simulation-based single policy iteration (we will not discuss this further)

− Approximation in value space (approximate policy iteration, Q-Learning, Bellman error approach, approximate LP)

− Approximation in policy space (policy parametrization, gradient methods)

− Problem approximation (simplification - aggregation - limited lookahead) - we will briefly discuss this today

APPROXIMATION IN VALUE SPACE

• We will mainly adopt an n-state discounted model (the easiest case - but think of HUGE n).

• Extensions to SSP and average cost are possible (but more quirky). We will discuss them later.

• Main idea: Approximate J∗ or Jµ with an approximation architecture

J∗(i) ≈ J(i, r) or Jµ(i) ≈ J(i, r)

• Principal example: Subspace approximation

J(i, r) = φ(i)′r = Σ_{k=1}^s φk(i) rk

where φ1, . . . , φs are basis functions spanning an s-dimensional subspace of ℜ^n

• Key issue: How to optimize r with low-dimensional (s-dimensional) operations only

• Other than manual/trial-and-error approaches (e.g., as in computer chess), the only other approaches are simulation-based. They are collectively known as "neuro-dynamic programming" or "reinforcement learning"
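
For concreteness, a small Python sketch of such a subspace approximation; the particular basis functions used here (a constant, a normalized state index, and its square) are made-up examples for illustration, not features suggested by the text.

```python
import numpy as np

# Hypothetical setup: n = 1000 states, s = 3 hand-crafted basis functions.
n, s = 1000, 3
states = np.arange(n)
Phi = np.column_stack([np.ones(n), states / n, (states / n) ** 2])  # n x s matrix of basis values

r = np.array([1.0, -0.5, 2.0])   # s-dimensional parameter vector
J_approx = Phi @ r               # J(i, r) = phi(i)'r, computed for every state i at once
```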

APPROX. IN VALUE SPACE - APPROACHES

• Policy evaluation/Policy improvement

− Uses simulation algorithms to approximate the cost Jµ of the current policy µ

• Approximation of the optimal cost function J∗

− Q-Learning: Use a simulation algorithm to approximate the optimal costs J∗(i) or the Q-factors

Q∗(i, u) = g(i, u) + α Σ_{j=1}^n pij(u) J∗(j)

− Bellman error approach: Find r to

min_r E_i{ ( J(i, r) − (TJ)(i, r) )² }

where E_i{·} is taken with respect to some distribution

− Approximate LP (discussed earlier - supplemented with clever schemes to overcome the large number of constraints issue)

POLICY EVALUATE/POLICY IMPROVE

• An example

[Block diagram: a System Simulator generates cost samples; a Decision Generator supplies the decision µ(i) at the current state i; a Cost-to-Go Approximator supplies the values J(j, r) used by the decision generator; a Least-Squares Optimization block fits r to the cost samples.]

• The "least squares optimization" may be replaced by a different algorithm

POLICY EVALUATE/POLICY IMPROVE I

• Approximate the cost of the current policy by using a simulation method.

− Direct policy evaluation - Cost samples generated by simulation, and optimization by least squares

− Indirect policy evaluation - solving the projected equation Φr = ΠTµ(Φr), where Π is projection with respect to a suitable weighted Euclidean norm

[Figure, two panels over the subspace S spanned by the basis functions. Direct method: projection ΠJµ of the cost vector Jµ onto S. Indirect method: solving the projected form Φr = ΠTµ(Φr) of Bellman's equation.]

• Batch and incremental methods

• Regular and optimistic policy iteration

POLICY EVALUATE/POLICY IMPROVE II

• Projected equation methods are preferred and have rich theory

• TD(λ): Stochastic iterative algorithm for solving Φr = ΠTµ(Φr)

• LSPE(λ): A simulation-based form of projected value iteration

Φrk+1 = ΠTµ(Φrk) + simulation noise

[Figure, two panels over the subspace S spanned by the basis functions. Projected Value Iteration (PVI): Φrk+1 is the projection on S of the value iterate T(Φrk) = g + αPΦrk. Least Squares Policy Evaluation (LSPE): the same step, up to a simulation error.]

• LSTD(λ): Solves a simulation-based approximation Φr = ΠTµ(Φr) of the projected equation, using a linear system solver (e.g., Gaussian elimination/Matlab)

APPROXIMATION IN POLICY SPACE

• We parameterize the set of policies by a vector r = (r1, . . . , rs) and we optimize the cost over r.

• In a special case of this approach, the parameterization of the policies is indirect, through an approximate cost function.

− A cost approximation architecture parameterized by r defines a policy dependent on r via the minimization in Bellman's equation.

• Discounted problem example:

− Denote by gi(r), i = 1, . . . , n, the one-stage expected cost starting at state i, and by pij(r) the transition probabilities.

− Each value of r defines a stationary policy, with cost starting at state i denoted by Ji(r).

− Use a gradient (or other) method to minimize over r

J(r) = Σ_{i=1}^n q(i) Ji(r),

where (q(1), . . . , q(n)) is some probability distribution over the states.

PROBLEM APPROXIMATION - AGGREGATION

• Another major idea in ADP is to approximate the cost-to-go function of the problem with the cost-to-go function of a simpler problem. The simplification is often ad-hoc/problem dependent.

• Aggregation is a (semi-)systematic approach for problem approximation. Main elements:

− Introduce a few "aggregate" states, viewed as the states of an "aggregate" system

− Define transition probabilities and costs of the aggregate system, by associating multiple states of the original system with each aggregate state

− Solve (exactly or approximately) the "aggregate" problem by any kind of value or policy iteration method (including simulation-based methods, such as Q-learning)

− Use the optimal cost of the aggregate problem to approximate the optimal cost of the original problem

• Example (Hard Aggregation): We are given a partition of the state space into subsets of states, and each subset is viewed as an aggregate state (each state belongs to one and only one subset).

AGGREGATION/DISAGGREGATION PROBS

• The aggregate system transition probabilities are defined via two (somewhat arbitrary) choices:

• For each original system state i and aggregate state m, the aggregation probability aim

− This may be roughly interpreted as the "degree of membership of i in the aggregate state m."

− In the hard aggregation example, aim = 1 if state i belongs to aggregate state/subset m.

• For each aggregate state m and original system state i, the disaggregation probability dmi

− This may be roughly interpreted as the "degree to which i is representative of m."

− In the hard aggregation example (assuming all states that belong to aggregate state/subset m are "equally representative") dmi = 1/|m| for each state i that belongs to aggregate state/subset m, where |m| is the cardinality (number of states) of m.

AGGREGATION EXAMPLES

• Hard aggregation (each original system state is associated with one aggregate state):

[Figure: original system states i, j with transition probabilities pij(u), aggregate states m, n, and the corresponding aggregation and disaggregation probabilities.]

• Soft aggregation (each original system state is associated with multiple aggregate states):

[Figure: as above, but each original state is assigned to several aggregate states with fractional aggregation probabilities.]

• Coarse grid (each aggregate state is an original system state):

[Figure: as above, with each aggregate state coinciding with one of the original system states.]

AGGREGATE TRANSITION PROBABILITIES

• Let the aggregation and disaggregation probabilities, aim and dmi, and the original transition probabilities pij(u) be given.

• The transition probability from aggregate state m to aggregate state n under u is

qmn(u) = Σ_i Σ_j dmi pij(u) ajn

and the transition cost is similarly defined.

• This corresponds to a probabilistic process that can be simulated as follows:

− From aggregate state m, generate original state i according to dmi.

− Generate a transition from i to j according to pij(u), with cost g(i, u, j).

− From original state j, generate aggregate state n according to ajn.

• After solving for the optimal costs J(m) of the aggregate problem, the costs of the original problem are approximated by

J(i) = Σ_m aim J(m)
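
The aggregate quantities above are straightforward to compute when the problem is small enough to enumerate. The following NumPy sketch assumes (hypothetically) that the original transition probabilities, transition costs, and the aggregation/disaggregation probabilities are available as dense arrays.

```python
import numpy as np

def aggregate_problem(P, g, D, A):
    """P[u, i, j]: original p_ij(u);  g[u, i, j]: costs g(i, u, j);
    D[m, i]: disaggregation d_mi;  A[j, n]: aggregation a_jn.
    Returns the aggregate transition probabilities q_mn(u) and expected costs."""
    Q = np.einsum('mi,uij,jn->umn', D, P, A)   # q_mn(u) = sum_{i,j} d_mi p_ij(u) a_jn
    G = np.einsum('mi,uij,uij->um', D, P, g)   # expected one-stage cost from aggregate state m under u
    return Q, G

def approximate_original_costs(A, J_agg):
    """J(i) = sum_m a_im J(m), with J_agg the optimal aggregate costs."""
    return A @ J_agg
```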

6.231 DYNAMIC PROGRAMMING

LECTURE 21

LECTURE OUTLINE

• Discounted problems - Approximate policy evaluation/policy improvement

• Direct approach - Least squares

• Batch and incremental gradient methods

• Implementation using TD

• Optimistic policy iteration

• Exploration issues

THEORETICAL BASIS

• If policies are approximately evaluated using an approximation architecture:

max_i |J(i, rk) − Jµk(i)| ≤ δ,   k = 0, 1, . . .

• If policy improvement is also approximate,

max_i |(Tµk+1 J)(i, rk) − (TJ)(i, rk)| ≤ ε,   k = 0, 1, . . .

• Error Bound: The sequence {µk} generated by approximate policy iteration satisfies

lim sup_{k→∞} max_i ( Jµk(i) − J∗(i) ) ≤ (ε + 2αδ) / (1 − α)²

• Typical practical behavior: The method makes steady progress up to a point and then the iterates Jµk oscillate within a neighborhood of J∗.

SIMULATION-BASED POLICY EVALUATION

• Suppose we can implement in a simulator the improved policy µ, and want to calculate Jµ by simulation.

• Generate by simulation sample costs. Then:

Jµ(i) ≈ (1/Mi) Σ_{m=1}^{Mi} c(i, m)

c(i, m): mth (noisy) sample cost starting from state i

• Approximating well each Jµ(i) is impractical for a large state space. Instead, a "compact representation" Jµ(i, r) is used, where r is a tunable parameter vector.

• Direct approach: Calculate an optimal value r∗ of r by a least squares fit

r∗ = arg min_r Σ_{i=1}^n Σ_{m=1}^{Mi} |c(i, m) − Jµ(i, r)|²

• Note that this is much easier when the architecture is linear - but this is not a requirement.
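
A minimal sketch of this direct least squares fit for a linear architecture J(i, r) = φ(i)′r; the containers `features` and `cost_samples` are hypothetical inputs holding the feature vectors and the simulated cost samples c(i, m).

```python
import numpy as np

def direct_fit(features, cost_samples):
    """Least squares fit: minimize sum_i sum_m |c(i, m) - phi(i)'r|^2.

    features[i]     : feature vector phi(i) (length s)
    cost_samples[i] : list of sample costs c(i, m) from state i
    """
    rows, targets = [], []
    for i, samples in cost_samples.items():
        for c in samples:
            rows.append(features[i])
            targets.append(c)
    Phi = np.array(rows)
    c_vec = np.array(targets)
    r_star, *_ = np.linalg.lstsq(Phi, c_vec, rcond=None)
    return r_star
```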

SIMULATION-BASED DIRECT APPROACH

[Block diagram: System Simulator, Decision Generator (supplies the decision µ(i) at state i), Cost-to-Go Approximator (supplies the values J(j, r)), and Least-Squares Optimization, connected in a loop as described below.]

• Simulator: Given a state-control pair (i, u), generates the next state j using the system's transition probabilities under the policy µ currently evaluated

• Decision generator: Generates the control µ(i) of the evaluated policy at the current state i

• Cost-to-go approximator: J(j, r) used by the decision generator and corresponding to the preceding policy (already evaluated in the preceding iteration)

• Least squares optimizer: Uses cost samples c(i, m) produced by the simulator and solves a least squares problem to approximate Jµ(·, r)

BATCH GRADIENT METHOD I

• Focus on a batch: an N-transition portion (i0, . . . , iN) of a simulated trajectory

• We view the numbers

Σ_{t=k}^{N−1} α^{t−k} g(it, µ(it), it+1),   k = 0, . . . , N − 1,

as cost samples, one per initial state i0, . . . , iN−1

• Least squares problem

min_r (1/2) Σ_{k=0}^{N−1} ( J(ik, r) − Σ_{t=k}^{N−1} α^{t−k} g(it, µ(it), it+1) )²

• Gradient iteration

r := r − γ Σ_{k=0}^{N−1} ∇J(ik, r) ( J(ik, r) − Σ_{t=k}^{N−1} α^{t−k} g(it, µ(it), it+1) )
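
For the linear case J(i, r) = φ(i)′r (so that ∇J(i, r) = φ(i)), one batch gradient step can be sketched as follows; the batch format and the function `phi` are assumptions made for the example.

```python
import numpy as np

def batch_gradient_step(r, batch, phi, alpha, gamma):
    """One gradient iteration over an N-transition batch.

    batch : list of (i_t, g_t) with g_t = g(i_t, mu(i_t), i_{t+1}), t = 0..N-1
    phi   : maps a state to its feature vector; alpha: discount; gamma: stepsize
    """
    N = len(batch)
    grad = np.zeros_like(r, dtype=float)
    for k in range(N):
        i_k, _ = batch[k]
        # discounted cost sample from time k to the end of the batch
        sample = sum(alpha ** (t - k) * batch[t][1] for t in range(k, N))
        grad += phi(i_k) * (phi(i_k) @ r - sample)
    return r - gamma * grad
```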

BATCH GRADIENT METHOD II

• Important tradeoff:

− In order to reduce simulation error and to obtain cost samples for a representatively large subset of states, we must use a large N

− To keep the work per gradient iteration small, we must use a small N

• To address the issue of the size of N, small batches may be used and changed after one or more iterations.

• Then the method becomes susceptible to simulation noise - it requires a diminishing stepsize for convergence.

• This slows down the convergence (which can be very slow for a gradient method even without noise).

• Theoretical convergence is guaranteed (with a diminishing stepsize) under reasonable conditions, but in practice this is not much of a guarantee.

INCREMENTAL GRADIENT METHOD I

• Again focus on an N-transition portion (i0, . . . , iN) of a simulated trajectory.

• The batch gradient method processes the N transitions all at once, and updates r using the gradient iteration.

• The incremental method updates r a total of N times, once after each transition.

• After each transition (ik, ik+1) it uses only the portion of the gradient affected by that transition:

− Evaluate the (single-term) gradient ∇J(ik, r) at the current value of r (call it rk).

− Sum all the terms that involve the transition (ik, ik+1), and update rk by making a correction along their sum:

rk+1 = rk − γ ( ∇J(ik, rk) J(ik, rk) − ( Σ_{t=0}^k α^{k−t} ∇J(it, rt) ) g(ik, µ(ik), ik+1) )

INCREMENTAL GRADIENT METHOD II

• After N transitions, all the component gradient terms of the batch iteration are accumulated.

• BIG difference:

− In the incremental method, r is changed while processing the batch - the (single-term) gradient ∇J(it, r) is evaluated at the most recent value of r [after the transition (it, it+1)].

− In the batch version these gradients are evaluated at the value of r prevailing at the beginning of the batch.

• Because r is updated at intermediate transitions within a batch (rather than at the end of the batch), the location of the end of the batch becomes less relevant.

• Can have very long batches - can have a single very long simulated trajectory and a single batch.

• The incremental version can be implemented more flexibly, and converges much faster in practice.

• Interesting convergence analysis (beyond our scope - see Bertsekas and Tsitsiklis, NDP book, also the paper in SIAM J. on Optimization, 2000)

TEMPORAL DIFFERENCES - TD(1)

• A mathematically equivalent implementation of the incremental method.

• It uses the temporal difference (TD for short)

dk = g(ik, µ(ik), ik+1) + αJ(ik+1, r) − J(ik, r),   k ≤ N − 2,

dN−1 = g(iN−1, µ(iN−1), iN) − J(iN−1, r)

• Following the transition (ik, ik+1), set

rk+1 = rk + γk dk Σ_{t=0}^k α^{k−t} ∇J(it, rt)

• This algorithm is known as TD(1). In the important linear case J(i, r) = φ(i)′r, it becomes

rk+1 = rk + γk dk Σ_{t=0}^k α^{k−t} φ(it)

• A variant of TD(1) is TD(λ), λ ∈ [0, 1]. It sets

rk+1 = rk + γk dk Σ_{t=0}^k (αλ)^{k−t} φ(it)
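
In the linear case the sum Σ_{t=0}^k (αλ)^{k−t} φ(it) can be carried along as an eligibility trace, which gives the familiar incremental form of TD(λ). A hedged sketch over one batch, assuming the trajectory, costs, and stepsizes are supplied by the caller:

```python
import numpy as np

def td_lambda_batch(r, trajectory, costs, phi, alpha, lam, stepsizes):
    """TD(lambda) over an N-transition batch with linear architecture J(i, r) = phi(i)'r.

    trajectory : states i_0, ..., i_N;  costs[k] = g(i_k, mu(i_k), i_{k+1})
    stepsizes  : gamma_k, k = 0..N-1
    """
    z = np.zeros_like(r, dtype=float)      # eligibility trace sum_t (alpha*lam)^(k-t) phi(i_t)
    N = len(costs)
    for k in range(N):
        i_k, i_next = trajectory[k], trajectory[k + 1]
        z = (alpha * lam) * z + phi(i_k)
        # temporal difference; the alpha*J term is dropped at the last transition,
        # following the batch convention d_{N-1} = g - J(i_{N-1}, r)
        d_k = costs[k] - phi(i_k) @ r
        if k < N - 1:
            d_k += alpha * (phi(i_next) @ r)
        r = r + stepsizes[k] * d_k * z
    return r
```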

OPTIMISTIC POLICY ITERATION

• We have assumed so far that the least squares optimization must be solved completely for r.

• An alternative, known as optimistic policy iteration, is to solve this problem approximately and replace the current policy µ with the improved policy µ̄ after only a few simulation samples.

• An extreme possibility is to replace µ with µ̄ at the end of each state transition: After state transition (ik, ik+1), set

rk+1 = rk + γk dk Σ_{t=0}^k (αλ)^{k−t} ∇J(it, rt),

and simulate the next transition (ik+1, ik+2) using µ̄(ik+1), the control of the new policy.

• For λ = 0, we obtain (the popular) optimistic TD(0), which has the simple form

rk+1 = rk + γk dk ∇J(ik, rk)

• Optimistic policy iteration can exhibit fascinating and counterintuitive behavior (see the NDP book by Bertsekas and Tsitsiklis, Section 6.4.2).

THE ISSUE OF EXPLORATION

• To evaluate a policy µ, we need to generate cost samples using that policy - this biases the simulation by underrepresenting states that are unlikely to occur under µ.

• As a result, the cost-to-go estimates of these underrepresented states may be highly inaccurate.

• This seriously impacts the improved policy µ̄.

• This is known as inadequate exploration - a particularly acute difficulty when the randomness embodied in the transition probabilities is "relatively small" (e.g., a deterministic system).

• One possibility to guarantee adequate exploration: Frequently restart the simulation and ensure that the initial states employed form a rich and representative subset.

• Another possibility: Occasionally generate transitions that use a randomly selected control rather than the one dictated by the policy µ.

• Other methods, to be discussed later, use two Markov chains (one is the chain of the policy and is used to generate the transition sequence, the other is used to generate the state sequence).

APPROXIMATING Q-FACTORS

• The approach described so far for policy evaluation requires calculating expected values for all controls u ∈ U(i) (and knowledge of pij(u)).

• Model-free alternative: Approximate Q-factors

Q(i, u, r) ≈ Σ_{j=1}^n pij(u) ( g(i, u, j) + αJµ(j) )

and use for policy improvement the minimization

µ̄(i) = arg min_{u∈U(i)} Q(i, u, r)

• r is an adjustable parameter vector and Q(i, u, r) is a parametric architecture, such as

Q(i, u, r) = Σ_{k=1}^m rk φk(i, u)

• Can use any method for constructing cost approximations, e.g., TD(λ).

• Use the Markov chain with states (i, u) - pij(µ(i)) is the transition prob. to (j, µ(i)), 0 to other (j, u′).

• Major concern: Acutely diminished exploration.

6.231 DYNAMIC PROGRAMMING

LECTURE 22

LECTURE OUTLINE

• Discounted problems - Approximate policy evaluation/policy improvement

• Indirect approach - The projected equation

• Contraction properties - Error bounds

• PVI (Projected Value Iteration)

• LSPE (Least Squares Policy Evaluation)

• Tetris - A case study

POLICY EVALUATION/POLICY IMPROVEMENT

[Figure: the approximate policy iteration loop. Guess an initial policy; evaluate the approximate cost Jµ(r) = Φr of the current policy using simulation (approximate policy evaluation); generate an "improved" policy µ̄ (policy improvement); repeat.]

• Linear cost function approximation

J(r) = Φr

where Φ is a full rank n × s matrix with the basis functions as columns, and with ith row denoted φ(i)′.

• Policy "improvement"

µ̄(i) = arg min_{u∈U(i)} Σ_{j=1}^n pij(u) ( g(i, u, j) + αφ(j)′r )

• Indirect methods find Φr by solving a projected equation.

WEIGHTED EUCLIDEAN PROJECTIONS

• Consider a weighted Euclidean norm

‖J‖v = sqrt( Σ_{i=1}^n vi (J(i))² ),

where v is a vector of positive weights v1, . . . , vn.

• Let Π denote the projection operation onto

S = {Φr | r ∈ ℜ^s}

with respect to this norm, i.e., for any J ∈ ℜ^n,

ΠJ = ΦrJ

where

rJ = arg min_{r∈ℜ^s} ‖J − Φr‖v

• Π and rJ can be written explicitly:

Π = Φ(Φ′V Φ)^{−1}Φ′V,   rJ = (Φ′V Φ)^{−1}Φ′V J,

where V is the diagonal matrix with vi, i = 1, . . . , n, along the diagonal.
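
A direct NumPy sketch of these formulas (forming the diagonal matrix V explicitly, which is fine for moderate n):

```python
import numpy as np

def weighted_projection(Phi, v, J):
    """Return (Pi J, r_J) for the weighted Euclidean norm with weights v:
    r_J = (Phi' V Phi)^{-1} Phi' V J and Pi J = Phi r_J, with V = diag(v)."""
    V = np.diag(v)
    r_J = np.linalg.solve(Phi.T @ V @ Phi, Phi.T @ V @ J)
    return Phi @ r_J, r_J
```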

THE PROJECTED BELLMAN EQUATION

• For a fixed policy µ to be evaluated, consider the corresponding mapping T:

(TJ)(i) = Σ_{j=1}^n pij ( g(i, j) + αJ(j) ),   i = 1, . . . , n,

or more compactly,

TJ = g + αPJ

• The solution Jµ of Bellman's equation J = TJ is approximated by the solution of

Φr = ΠT(Φr)

[Figure: the indirect method solves the projected form Φr = ΠT(Φr) of Bellman's equation, obtained by projecting T(Φr) onto the subspace S spanned by the basis functions.]

KEY QUESTIONS AND RESULTS

• Does the projected equation have a solution?

• Under what conditions is the mapping ΠT a contraction, so that ΠT has a unique fixed point?

• Assuming ΠT has a unique fixed point Φr∗, how close is Φr∗ to Jµ?

• Assumption: P has a single recurrent class and no transient states, i.e., it has steady-state probabilities that are positive:

ξj = lim_{N→∞} (1/N) Σ_{k=1}^N P(ik = j | i0 = i) > 0,   j = 1, . . . , n

• Proposition: ΠT is a contraction of modulus α with respect to the weighted Euclidean norm ‖ · ‖ξ, where ξ = (ξ1, . . . , ξn) is the steady-state probability vector. The unique fixed point Φr∗ of ΠT satisfies

‖Jµ − Φr∗‖ξ ≤ (1/√(1 − α²)) ‖Jµ − ΠJµ‖ξ

ANALYSIS

• Important property of the projection Π on S with weighted Euclidean norm ‖ · ‖v. For all J ∈ ℜ^n, J̄ ∈ S, the Pythagorean Theorem holds:

‖J − J̄‖v² = ‖J − ΠJ‖v² + ‖ΠJ − J̄‖v²

• Proof: Geometrically, (J − ΠJ) and (ΠJ − J̄) are orthogonal in the scaled geometry of the norm ‖ · ‖v, where two vectors x, y ∈ ℜ^n are orthogonal if Σ_{i=1}^n vi xi yi = 0. Expand the quadratic in the RHS below:

‖J − J̄‖v² = ‖(J − ΠJ) + (ΠJ − J̄)‖v²

• The Pythagorean Theorem implies that the projection is nonexpansive, i.e.,

‖ΠJ − ΠJ̄‖v ≤ ‖J − J̄‖v,   for all J, J̄ ∈ ℜ^n.

To see this, note that

‖Π(J − J̄)‖v² ≤ ‖Π(J − J̄)‖v² + ‖(I − Π)(J − J̄)‖v² = ‖J − J̄‖v²

PROOF OF CONTRACTION PROPERTY

• Lemma: We have

‖Pz‖ξ ≤ ‖z‖ξ,   z ∈ ℜ^n

• Proof of lemma: Let pij be the components of P. For all z ∈ ℜ^n, we have

‖Pz‖ξ² = Σ_{i=1}^n ξi ( Σ_{j=1}^n pij zj )² ≤ Σ_{i=1}^n ξi Σ_{j=1}^n pij zj² = Σ_{j=1}^n ( Σ_{i=1}^n ξi pij ) zj² = Σ_{j=1}^n ξj zj² = ‖z‖ξ²,

where the inequality follows from the convexity of the quadratic function, and the next to last equality follows from the defining property Σ_{i=1}^n ξi pij = ξj of the steady-state probabilities.

• Using the lemma, the nonexpansiveness of Π, and the definition TJ = g + αPJ, we have

‖ΠTJ − ΠTJ̄‖ξ ≤ ‖TJ − TJ̄‖ξ = α‖P(J − J̄)‖ξ ≤ α‖J − J̄‖ξ

for all J, J̄ ∈ ℜ^n. Hence ΠT is a contraction of modulus α.

PROOF OF ERROR BOUND

• Let Φr∗ be the fixed point of ΠT. We have

‖Jµ − Φr∗‖ξ ≤ (1/√(1 − α²)) ‖Jµ − ΠJµ‖ξ.

Proof: We have

‖Jµ − Φr∗‖ξ² = ‖Jµ − ΠJµ‖ξ² + ‖ΠJµ − Φr∗‖ξ²
             = ‖Jµ − ΠJµ‖ξ² + ‖ΠTJµ − ΠT(Φr∗)‖ξ²
             ≤ ‖Jµ − ΠJµ‖ξ² + α²‖Jµ − Φr∗‖ξ²,

where the first equality uses the Pythagorean Theorem, the second equality holds because Jµ is the fixed point of T and Φr∗ is the fixed point of ΠT, and the inequality uses the contraction property of ΠT. From this relation, the result follows.

• Note: The factor 1/√(1 − α²) in the RHS can be replaced by a factor that is smaller and computable. See H. Yu and D. P. Bertsekas, "New Error Bounds for Approximations from Projected Linear Equations," Report LIDS-P-2797, MIT, July 2008.

PROJECTED VALUE ITERATION (PVI)

• Given the contraction property of ΠT, we may consider the PVI method

Φrk+1 = ΠT (Φrk)

[Figure: one step of PVI over the subspace S spanned by the basis functions: the value iterate T(Φrk) = g + αPΦrk is projected onto S to give Φrk+1.]

• Question: Can we implement PVI using simulation, without the need for n-dimensional linear algebra calculations?

• LSPE (Least Squares Policy Evaluation) is a simulation-based implementation of PVI.

LSPE - SIMULATION-BASED PVI

• PVI, i.e., Φrk+1 = ΠT(Φrk), can be written as

rk+1 = arg min_{r∈ℜ^s} ‖Φr − T(Φrk)‖ξ²,

from which, by setting the gradient to 0,

( Σ_{i=1}^n ξi φ(i)φ(i)′ ) rk+1 = Σ_{i=1}^n ξi φ(i) Σ_{j=1}^n pij ( g(i, j) + αφ(j)′rk )

• For LSPE we generate an infinite trajectory (i0, i1, . . .) and update rk after transition (ik, ik+1):

( Σ_{t=0}^k φ(it)φ(it)′ ) rk+1 = Σ_{t=0}^k φ(it) ( g(it, it+1) + αφ(it+1)′rk )

• LSPE can equivalently be written as

( Σ_{i=1}^n ξi,k φ(i)φ(i)′ ) rk+1 = Σ_{i=1}^n ξi,k φ(i) Σ_{j=1}^n pij,k ( g(i, j) + αφ(j)′rk )

where ξi,k, pij,k are the empirical frequencies of state i and transition (i, j), based on (i0, . . . , ik+1).
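
A naive sketch of the LSPE update above; it recomputes the sums from the stored transitions after each step (an actual implementation would maintain them incrementally), and the data format is an assumption of the example.

```python
import numpy as np

def lspe_update(r, transitions, phi, alpha):
    """Solve ( sum_t phi(i_t)phi(i_t)' ) r_new = sum_t phi(i_t)( g_t + alpha phi(i_{t+1})'r ).

    transitions : list of (i_t, i_{t+1}, g(i_t, i_{t+1})) for t = 0..k
    """
    s = len(r)
    B = np.zeros((s, s))
    c = np.zeros(s)
    for i_t, i_next, cost in transitions:
        f = phi(i_t)
        B += np.outer(f, f)
        c += f * (cost + alpha * (phi(i_next) @ r))
    return np.linalg.solve(B, c)
```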

LSPE INTERPRETATION

• LSPE can be written as PVI with simulation error:

Φrk+1 = ΠT(Φrk) + ek

where ek diminishes to 0 as the empirical frequencies ξi,k and pij,k approach ξ and pij.

[Figure, two panels over the subspace S spanned by the basis functions: Projected Value Iteration (PVI) projects the value iterate T(Φrk) = g + αPΦrk onto S; Least Squares Policy Evaluation (LSPE) does the same up to a simulation error.]

• The convergence proof is simple: Use the law of large numbers.

• Optimistic LSPE: Changes policy prior to convergence - behavior can be very complicated.

EXAMPLE: TETRIS I

• The state consists of the board position i, and the shape of the current falling block (astronomically large number of states).

• It can be shown that all policies are proper!!

• Use a linear approximation architecture with feature extraction

J(i, r) = Σ_{m=1}^s φm(i) rm,

where r = (r1, . . . , rs) is the parameter vector and φm(i) is the value of the mth feature associated with i.

EXAMPLE: TETRIS II

• Approximate policy iteration was implemented with the following features:

− The height of each column of the wall
− The difference of heights of adjacent columns
− The maximum height over all wall columns
− The number of "holes" on the wall
− The number 1 (provides a constant offset)

• Playing data was collected for a fixed value of the parameter vector r (and the corresponding policy); the policy was approximately evaluated by choosing r to match the playing data in some least-squares sense.

• LSPE (its SSP version) was used for approximate policy evaluation.

• Both regular and optimistic versions were used.

• See: Bertsekas and Ioffe, "Temporal Differences-Based Policy Iteration and Applications in Neuro-Dynamic Programming," LIDS Report, 1996. Also the NDP book.

6.231 DYNAMIC PROGRAMMING

LECTURE 23

LECTURE OUTLINE

• Review of indirect policy evaluation methods

• Multistep methods, LSPE(λ)

• LSTD(λ)

• Q-learning

• Q-learning with linear function approximation

• Q-learning for optimal stopping problems

REVIEW: PROJECTED BELLMAN EQUATION

• For a fixed policy µ to be evaluated, consider the corresponding mapping T:

(TJ)(i) = Σ_{j=1}^n pij ( g(i, j) + αJ(j) ),   i = 1, . . . , n,

or more compactly,

TJ = g + αPJ

• The solution Jµ of Bellman's equation J = TJ is approximated by the solution of

Φr = ΠT(Φr)

[Figure: the indirect method solves the projected form Φr = ΠT(Φr) of Bellman's equation, obtained by projecting T(Φr) onto the subspace S spanned by the basis functions.]

PVI/LSPE

• Key Result: ΠT is a contraction of modulus α with respect to the weighted Euclidean norm ‖ · ‖ξ, where ξ = (ξ1, . . . , ξn) is the steady-state probability vector. The unique fixed point Φr∗ of ΠT satisfies

‖Jµ − Φr∗‖ξ ≤ (1/√(1 − α²)) ‖Jµ − ΠJµ‖ξ

• Projected Value Iteration (PVI): Φrk+1 = ΠT(Φrk), which can be written as

rk+1 = arg min_{r∈ℜ^s} ‖Φr − T(Φrk)‖ξ²

or equivalently

rk+1 = arg min_{r∈ℜ^s} Σ_{i=1}^n ξi ( φ(i)′r − Σ_{j=1}^n pij ( g(i, j) + αφ(j)′rk ) )²

• LSPE (simulation-based approximation): We generate an infinite trajectory (i0, i1, . . .) and update rk after transition (ik, ik+1):

rk+1 = arg min_{r∈ℜ^s} Σ_{t=0}^k ( φ(it)′r − g(it, it+1) − αφ(it+1)′rk )²

JUSTIFICATION OF PVI/LSPE CONNECTION

• By writing the necessary optimality conditions for the least squares minimization, PVI can be written as

( Σ_{i=1}^n ξi φ(i)φ(i)′ ) rk+1 = Σ_{i=1}^n ξi φ(i) Σ_{j=1}^n pij ( g(i, j) + αφ(j)′rk )

• Similarly, by writing the necessary optimality conditions for the least squares minimization, LSPE can be written as

( Σ_{t=0}^k φ(it)φ(it)′ ) rk+1 = Σ_{t=0}^k φ(it) ( g(it, it+1) + αφ(it+1)′rk )

• So LSPE is just PVI with the two expected values approximated by simulation-based averages.

• Convergence follows by the law of large numbers.

• The bottleneck in the rate of convergence is the law of large numbers/simulation error (PVI is a contraction with modulus α, and converges fast relative to simulation).

LEAST SQUARES TEMP. DIFFERENCES (LSTD)

• Taking the limit in PVI, we see that the projected equation, Φr∗ = ΠT(Φr∗), can be written as Ar∗ + b = 0, where

A = Σ_{i=1}^n ξi φ(i) ( α Σ_{j=1}^n pij φ(j) − φ(i) )′

b = Σ_{i=1}^n ξi φ(i) Σ_{j=1}^n pij g(i, j)

• A, b are expected values that can be approximated by simulation: Ak ≈ A, bk ≈ b, where

Ak = (1/(k + 1)) Σ_{t=0}^k φ(it) ( αφ(it+1) − φ(it) )′

bk = (1/(k + 1)) Σ_{t=0}^k φ(it) g(it, it+1)

• LSTD method: Approximates r∗ as

r∗ ≈ r̂k = −Ak^{−1} bk

• Conceptually very simple ... but less suitable for optimistic policy iteration (hard to transfer info from one policy evaluation to the next).

• It can be shown that the convergence rate is the same for LSPE/LSTD: for large k, the LSTD estimate r̂k and the LSPE iterate rk satisfy ‖r̂k − rk‖ << ‖rk − r∗‖.
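
A compact sketch of LSTD for the discounted case; the trajectory and cost containers are assumptions of the example.

```python
import numpy as np

def lstd(trajectory, costs, phi, alpha):
    """Form A_k, b_k from one trajectory and return r_k = -A_k^{-1} b_k.

    trajectory : states i_0, ..., i_{k+1};  costs[t] = g(i_t, i_{t+1})
    """
    s = len(phi(trajectory[0]))
    A = np.zeros((s, s))
    b = np.zeros(s)
    for t, cost in enumerate(costs):
        f, f_next = phi(trajectory[t]), phi(trajectory[t + 1])
        A += np.outer(f, alpha * f_next - f)
        b += f * cost
    # the 1/(k+1) averaging factors cancel in the solve
    return -np.linalg.solve(A, b)
```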

MULTISTEP METHODS

• Introduce a multistep version of Bellman's equation, J = T^(λ)J, where for λ ∈ [0, 1),

T^(λ) = (1 − λ) Σ_{t=0}^∞ λ^t T^{t+1}

• Note that T^t is a contraction with modulus α^t, with respect to the weighted Euclidean norm ‖·‖ξ, where ξ is the steady-state probability vector of the Markov chain.

• From this it follows that T^(λ) is a contraction with modulus

α_λ = (1 − λ) Σ_{t=0}^∞ α^{t+1} λ^t = α(1 − λ) / (1 − αλ)

• T^t and T^(λ) have the same fixed point Jµ and

‖Jµ − Φr∗_λ‖ξ ≤ (1/√(1 − α_λ²)) ‖Jµ − ΠJµ‖ξ

where Φr∗_λ is the fixed point of ΠT^(λ).

• The fixed point Φr∗_λ depends on λ.

• Note that α_λ ↓ 0 as λ ↑ 1, so the error bound improves as λ ↑ 1.

PVI(λ)

Φrk+1 = ΠT^(λ)(Φrk) = Π( (1 − λ) Σ_{t=0}^∞ λ^t T^{t+1}(Φrk) )

or

rk+1 = arg min_{r∈ℜ^s} ‖Φr − T^(λ)(Φrk)‖ξ²

• Using algebra and the relation

(T^{t+1}J)(i) = E{ α^{t+1} J(it+1) + Σ_{k=0}^t α^k g(ik, ik+1) | i0 = i }

we can write PVI(λ) as

rk+1 = arg min_{r∈ℜ^s} Σ_{i=1}^n ξi ( φ(i)′r − φ(i)′rk − Σ_{t=0}^∞ (αλ)^t E{ dk(it, it+1) | i0 = i } )²

where

dk(it, it+1) = g(it, it+1) + αφ(it+1)′rk − φ(it)′rk

are the so-called temporal differences (TD) - they are the errors in satisfying Bellman's equation.

LSPE(λ)

• Replacing the expected values defining PVI(λ) by simulation-based estimates, we obtain LSPE(λ).

• It has the form

rk+1 = arg min_{r∈ℜ^s} Σ_{t=0}^k ( φ(it)′r − φ(it)′rk − Σ_{m=t}^k (αλ)^{m−t} dk(im, im+1) )²

where (i0, i1, . . .) is an infinitely long trajectory generated by simulation.

• Can be implemented with convenient incremental update formulas (see the text).

• Note the λ-tradeoff:

− As λ ↑ 1, the accuracy of the solution Φr∗_λ improves - the error bound on ‖Jµ − Φr∗_λ‖ξ improves.

− As λ ↑ 1, the "simulation noise" in the LSPE(λ) iteration (2nd summation term) increases, so longer simulation trajectories are needed for LSPE(λ) to approximate PVI(λ) well.

Q-LEARNING I

• Q-learning has two motivations:

− Dealing with multiple policies simultaneously

− Using a model-free approach [no need to know pij(u) explicitly, only to simulate them]

• The Q-factors are defined by

Q∗(i, u) = Σ_{j=1}^n pij(u) ( g(i, u, j) + αJ∗(j) ),   ∀ (i, u)

• In view of J∗ = TJ∗, we have J∗(i) = min_{u∈U(i)} Q∗(i, u), so the Q-factors solve the equation

Q∗(i, u) = Σ_{j=1}^n pij(u) ( g(i, u, j) + α min_{u′∈U(j)} Q∗(j, u′) ),   ∀ (i, u)

• Q∗(i, u) can be shown to be the unique solution of this equation. Reason: This is Bellman's equation for a system whose states are the original states 1, . . . , n, together with all the pairs (i, u).

• Value iteration:

Q(i, u) := Σ_{j=1}^n pij(u) ( g(i, u, j) + α min_{u′∈U(j)} Q(j, u′) ),   ∀ (i, u)

Q-LEARNING II

• Use any probabilistic mechanism to select the sequence of pairs (ik, uk) [all pairs (i, u) are chosen infinitely often], and for each k, select jk according to pikj(uk).

• At each k, the Q-learning algorithm updates Q(ik, uk) according to

Q(ik, uk) := (1 − γk(ik, uk)) Q(ik, uk) + γk(ik, uk) ( g(ik, uk, jk) + α min_{u′∈U(jk)} Q(jk, u′) )

• The stepsize γk(ik, uk) must converge to 0 at a proper rate (e.g., like 1/k).

• Important mathematical point: In the Q-factor version of Bellman's equation the order of expectation and minimization is reversed relative to the ordinary cost version of Bellman's equation:

J∗(i) = min_{u∈U(i)} Σ_{j=1}^n pij(u) ( g(i, u, j) + αJ∗(j) )

• Q-learning can be shown to converge to the true/exact Q-factors (a sophisticated proof).

• Major drawback: The large number of pairs (i, u) - no function approximation is used.
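
A tabular sketch of the update; the pair-selection mechanism (uniform here) and the simulator interface `sample_transition(i, u) -> (j, cost)` are assumptions of the example, not prescriptions from the text.

```python
import numpy as np

def q_learning(sample_transition, n_states, n_controls, alpha, n_iters, seed=0):
    """Tabular Q-learning with stepsize 1/(number of visits) per pair (i, u)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_controls))
    visits = np.zeros((n_states, n_controls))
    for _ in range(n_iters):
        i = rng.integers(n_states)          # any mechanism visiting all pairs infinitely often
        u = rng.integers(n_controls)
        j, cost = sample_transition(i, u)   # simulate j ~ p_ij(u) and observe g(i, u, j)
        visits[i, u] += 1
        gamma = 1.0 / visits[i, u]
        Q[i, u] += gamma * (cost + alpha * Q[j].min() - Q[i, u])
    return Q
```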

Q-FACTOR APPROXIMATIONS

• Introduce basis function approximation for Q-factors:

Q(i, u, r) = φ(i, u)′r

• We cannot use LSPE/LSTD because the Q-factor Bellman equation involves minimization/multiple controls.

• An optimistic version of LSPE(0) is possible:

• Generate an infinitely long sequence {(ik, uk) | k = 0, 1, . . .}.

• At iteration k, given rk and state/control (ik, uk):

(1) Simulate the next transition (ik, ik+1) using the transition probabilities pikj(uk).

(2) Generate the control uk+1 from the minimization

uk+1 = arg min_{u∈U(ik+1)} Q(ik+1, u, rk)

(3) Update the parameter vector via

rk+1 = arg min_{r∈ℜ^s} Σ_{t=0}^k ( φ(it, ut)′r − g(it, ut, it+1) − αφ(it+1, ut+1)′rk )²

Q-LEARNING FOR OPTIMAL STOPPING

• Not much is known about the convergence of optimistic LSPE(0).

• The major difficulty is that the projected Bellman equation for Q-factors may not be a contraction, and may have multiple solutions or no solution.

• There is one important case, optimal stopping, where this difficulty does not occur.

• Consider a Markov chain with states {1, . . . , n} and transition probabilities pij. We assume that the states form a single recurrent class, with steady-state distribution vector ξ = (ξ1, . . . , ξn).

• At the current state i, we have two options:

− Stop and incur a cost c(i), or

− Continue and incur a cost g(i, j), where j is the next state.

• Q-factor for the continue action:

Q(i) = Σ_{j=1}^n pij ( g(i, j) + α min{ c(j), Q(j) } ) ≜ (FQ)(i)

• Major fact: F is a contraction of modulus α with respect to the norm ‖ · ‖ξ.

LSPE FOR OPTIMAL STOPPING

• Introduce the Q-factor approximation

Q(i, r) = φ(i)′r

• PVI for Q-factors:

Φrk+1 = ΠF(Φrk)

• LSPE:

rk+1 = ( Σ_{t=0}^k φ(it)φ(it)′ )^{−1} Σ_{t=0}^k φ(it) ( g(it, it+1) + α min{ c(it+1), φ(it+1)′rk } )

• Simpler version: Replace the term φ(it+1)′rk by φ(it+1)′rt. The algorithm still converges to the unique fixed point of ΠF (see H. Yu and D. P. Bertsekas, "A Least Squares Q-Learning Algorithm for Optimal Stopping Problems").
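
A rough sketch of the iteration above, applied repeatedly to one recorded trajectory rather than updated after every transition; the data layout is an assumption of the example.

```python
import numpy as np

def lspe_stopping(trajectory, costs, stop_costs, phi, alpha, r0, n_sweeps=20):
    """Iterate r <- ( sum_t phi phi' )^{-1} sum_t phi ( g + alpha min{c, phi' r} ).

    trajectory : states i_0, ..., i_{k+1};  costs[t] = g(i_t, i_{t+1})
    stop_costs[t] = c(i_{t+1}), the stopping cost at the next state
    """
    Phi = np.array([phi(i) for i in trajectory[:-1]])
    Phi_next = np.array([phi(i) for i in trajectory[1:]])
    g = np.asarray(costs, dtype=float)
    c = np.asarray(stop_costs, dtype=float)
    B = Phi.T @ Phi
    r = np.asarray(r0, dtype=float)
    for _ in range(n_sweeps):
        targets = g + alpha * np.minimum(c, Phi_next @ r)
        r = np.linalg.solve(B, Phi.T @ targets)
    return r
```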

6.231 DYNAMIC PROGRAMMING

LECTURE 24

LECTURE OUTLINE

• More on projected equation methods/policy evaluation

• Stochastic shortest path problems

• Average cost problems

• Generalization - Two Markov Chain methods

• LSTD-like methods - Use to enhance exploration

REVIEW: PROJECTED BELLMAN EQUATION

• For a fixed policy µ to be evaluated, the solution of Bellman's equation J = TJ is approximated by the solution of

Φr = ΠT(Φr)

whose solution is in turn obtained using a simulation-based method such as LSPE(λ), LSTD(λ), or TD(λ).

[Figure: the indirect method solves the projected form Φr = ΠT(Φr) of Bellman's equation, obtained by projecting T(Φr) onto the subspace S spanned by the basis functions.]

• These ideas apply to other (linear) Bellman equations, e.g., for SSP and average cost.

• Key Issue: Construct a framework where ΠT [or at least ΠT^(λ)] is a contraction.

STOCHASTIC SHORTEST PATHS

• Introduce the approximation subspace

S = {Φr | r ∈ ℜ^s}

and, for a given proper policy, Bellman's equation and its projected version

J = TJ = g + PJ,   Φr = ΠT(Φr)

Also its λ-version

Φr = ΠT^(λ)(Φr),   T^(λ) = (1 − λ) Σ_{t=0}^∞ λ^t T^{t+1}

• Question: What should be the norm of projection?

• Speculation based on the discounted case: It should be a weighted Euclidean norm with weight vector ξ = (ξ1, . . . , ξn), where ξi should be some type of long-term occupancy probability of state i (which can be generated by simulation).

• But what does "long-term occupancy probability of a state" mean in the SSP context?

• How do we generate infinite length trajectories given that termination occurs with prob. 1?

SIMULATION TRAJECTORIES FOR SSP

• We envision simulation of trajectories up to termination, followed by restart at state i with some fixed probabilities q0(i) > 0.

• Then the "long-term occupancy probability" of a state i is proportional to

q(i) = Σ_{t=0}^∞ qt(i),   i = 1, . . . , n,

where

qt(i) = P(it = i),   i = 1, . . . , n,  t = 0, 1, . . .

• We use the projection norm

‖J‖q = sqrt( Σ_{i=1}^n q(i) (J(i))² )

[Note that 0 < q(i) < ∞, but q is not a prob. distribution.]

• We can show that ΠT^(λ) is a contraction with respect to ‖ · ‖q (see the next slide).

CONTRACTION PROPERTY FOR SSP

• We have q = Σ_{t=0}^∞ qt, so

q′P = Σ_{t=0}^∞ qt′P = Σ_{t=1}^∞ qt′ = q′ − q0′

or

Σ_{i=1}^n q(i) pij = q(j) − q0(j),   ∀ j

• To verify that ΠT is a contraction, we show that there exists β < 1 such that ‖Pz‖q² ≤ β‖z‖q² for all z ∈ ℜ^n.

• For all z ∈ ℜ^n, we have

‖Pz‖q² = Σ_{i=1}^n q(i) ( Σ_{j=1}^n pij zj )² ≤ Σ_{i=1}^n q(i) Σ_{j=1}^n pij zj²
       = Σ_{j=1}^n zj² Σ_{i=1}^n q(i) pij = Σ_{j=1}^n ( q(j) − q0(j) ) zj²
       = ‖z‖q² − ‖z‖q0² ≤ β‖z‖q²

where

β = 1 − min_j q0(j)/q(j)

PVI(λ) AND LSPE(λ) FOR SSP

• We consider PVI(λ): Φrk+1 = ΠT^(λ)(Φrk), which can be written as

rk+1 = arg min_{r∈ℜ^s} Σ_{i=1}^n q(i) ( φ(i)′r − φ(i)′rk − Σ_{t=0}^∞ λ^t E{ dk(it, it+1) | i0 = i } )²

where dk(it, it+1) are the TDs.

• The LSPE(λ) algorithm is a simulation-based approximation. Let (i0,l, i1,l, . . . , iNl,l) be the lth trajectory (with iNl,l = 0), and let rk be the parameter vector after k trajectories. We set

rk+1 = arg min_r Σ_{l=1}^{k+1} Σ_{t=0}^{Nl−1} ( φ(it,l)′r − φ(it,l)′rk − Σ_{m=t}^{Nl−1} λ^{m−t} dk(im,l, im+1,l) )²

where

dk(im,l, im+1,l) = g(im,l, im+1,l) + φ(im+1,l)′rk − φ(im,l)′rk

• Can also update rk at every transition.

AVERAGE COST PROBLEMS

• Consider a single policy to be evaluated, with a single recurrent class, no transient states, and steady-state probability vector ξ = (ξ1, . . . , ξn).

• The average cost, denoted by η, is independent of the initial state:

η = lim_{N→∞} (1/N) E{ Σ_{k=0}^{N−1} g(xk, xk+1) | x0 = i },   ∀ i

• Bellman's equation is J = FJ with

FJ = g − ηe + PJ

where e is the unit vector e = (1, . . . , 1).

• The projected equation and its λ-version are

Φr = ΠF(Φr),   Φr = ΠF^(λ)(Φr)

• A problem here is that F is not a contraction with respect to any norm (since e = Pe).

• However, ΠF^(λ) turns out to be a contraction with respect to ‖ · ‖ξ assuming that e does not belong to S and λ > 0 [the case λ = 0 is exceptional, but can be handled - see the text].

LSPE(λ) FOR AVERAGE COST

• We generate an infinitely long trajectory (i0, i1, . . .).

• We estimate the average cost η separately: Following each transition (ik, ik+1), we set

ηk = (1/(k + 1)) Σ_{t=0}^k g(it, it+1)

• Also following (ik, ik+1), we update rk by

rk+1 = arg min_{r∈ℜ^s} Σ_{t=0}^k ( φ(it)′r − φ(it)′rk − Σ_{m=t}^k λ^{m−t} dk(m) )²

where dk(m) are the TDs

dk(m) = g(im, im+1) − ηm + φ(im+1)′rk − φ(im)′rk

• Note that the TDs include the estimate ηm. Since ηm converges to η, for large m it can be viewed as a constant and lumped into the one-stage cost.

GENERALIZATION/UNIFICATION

• Consider approximate solution of x = T(x), where

T(x) = Ax + b,   A is n × n,  b ∈ ℜ^n,

by solving the projected equation y = ΠT(y), where Π is projection on a subspace of basis functions (with respect to some Euclidean norm).

• We will generalize from DP to the case where A is arbitrary, subject only to

I − ΠA : invertible

• Benefits of the generalization:

− Unification/higher perspective for TD methods in approximate DP

− An extension to a broad new area of applications, where a DP perspective may be helpful

• Challenge: Dealing with less structure

− Lack of contraction

− Absence of a Markov chain

LSTD-LIKE METHOD

• Let Π be projection with respect to

‖x‖ξ = sqrt( Σ_{i=1}^n ξi xi² )

where ξ ∈ ℜ^n is a probability distribution with positive components.

• If r∗ is the solution of the projected equation, we have Φr∗ = Π(AΦr∗ + b) or

r∗ = arg min_{r∈ℜ^s} Σ_{i=1}^n ξi ( φ(i)′r − Σ_{j=1}^n aij φ(j)′r∗ − bi )²

where φ(i)′ denotes the ith row of the matrix Φ.

• Optimality condition/equivalent form:

Σ_{i=1}^n ξi φ(i) ( φ(i) − Σ_{j=1}^n aij φ(j) )′ r∗ = Σ_{i=1}^n ξi φ(i) bi

• The two expected values are approximated by simulation.

SIMULATION MECHANISM

[Figure: row sampling generates a state sequence i0, i1, . . . , ik, ik+1, . . . according to ξ; column sampling generates the paired sequence (i0, j0), (i1, j1), . . . according to P.]

• Row sampling: Generate a sequence {i0, i1, . . .} according to ξ, i.e., the relative frequency of each row i is ξi

• Column sampling: Generate {(i0, j0), (i1, j1), . . .} according to some transition probability matrix P with

pij > 0 if aij ≠ 0,

i.e., for each i, the relative frequency of (i, j) is pij

• Row sampling may be done using a Markov chain with transition matrix Q (unrelated to P)

• Row sampling may also be done without a Markov chain - just sample rows according to some known distribution ξ (e.g., a uniform)

ROW AND COLUMN SAMPLING

[Figure: row sampling of the state sequence i0, i1, . . . according to ξ (may use a Markov chain Q), and column sampling of the transitions (ik, jk) according to a Markov chain P ∼ |A|.]

• Row sampling ∼ State Sequence Generation in DP. Affects:

− The projection norm.

− Whether ΠA is a contraction.

• Column sampling ∼ Transition Sequence Generation in DP.

− Can be totally unrelated to row sampling. Affects the sampling/simulation error.

− "Matching" P with |A| is beneficial (has an effect like in importance sampling).

• Independent row and column sampling allows exploration at will! Resolves the exploration problem that is critical in approximate policy iteration.

LSTD-LIKE METHOD

• Optimality condition/equivalent form of the projected equation:

Σ_{i=1}^n ξi φ(i) ( φ(i) − Σ_{j=1}^n aij φ(j) )′ r∗ = Σ_{i=1}^n ξi φ(i) bi

• The two expected values are approximated by row and column sampling (batch 0 → t).

• We solve the linear equation

Σ_{k=0}^t φ(ik) ( φ(ik) − ( a_{ik jk} / p_{ik jk} ) φ(jk) )′ rt = Σ_{k=0}^t φ(ik) bik

• We have rt → r∗, regardless of ΠA being a contraction (by the law of large numbers; see the next slide).

• An LSPE-like method is also possible, but it requires that ΠA is a contraction.

• Under the assumption Σ_{j=1}^n |aij| ≤ 1 for all i, there are conditions that guarantee contraction of ΠA; see the paper by Bertsekas and Yu, "Projected Equation Methods for Approximate Solution of Large Linear Systems," 2008, or the expanded version of Chapter 6, Vol. 2.
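
A sketch of this LSTD-like solver for the general linear system x = Ax + b; the callables a, b, p, and phi are hypothetical interfaces to the problem data and the column-sampling distribution.

```python
import numpy as np

def lstd_like(row_samples, col_samples, a, b, p, phi):
    """Solve sum_k phi(i_k)( phi(i_k) - (a(i_k,j_k)/p(i_k,j_k)) phi(j_k) )' r = sum_k phi(i_k) b(i_k).

    row_samples : i_0, ..., i_t sampled with relative frequencies ~ xi
    col_samples : j_0, ..., j_t with j_k sampled from row i_k of P
    """
    s = len(phi(row_samples[0]))
    M = np.zeros((s, s))
    d = np.zeros(s)
    for i_k, j_k in zip(row_samples, col_samples):
        f_i = phi(i_k)
        M += np.outer(f_i, f_i - (a(i_k, j_k) / p(i_k, j_k)) * phi(j_k))
        d += f_i * b(i_k)
    return np.linalg.solve(M, d)
```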

JUSTIFICATION W/ LAW OF LARGE NUMBERS

• We will match terms in the exact optimality condition and the simulation-based version.

• Let ξi^t be the relative frequency of i in row sampling up to time t.

• We have

(1/(t + 1)) Σ_{k=0}^t φ(ik)φ(ik)′ = Σ_{i=1}^n ξi^t φ(i)φ(i)′ ≈ Σ_{i=1}^n ξi φ(i)φ(i)′

(1/(t + 1)) Σ_{k=0}^t φ(ik) bik = Σ_{i=1}^n ξi^t φ(i) bi ≈ Σ_{i=1}^n ξi φ(i) bi

• Let pij^t be the relative frequency of (i, j) in column sampling up to time t. Then

(1/(t + 1)) Σ_{k=0}^t ( a_{ik jk} / p_{ik jk} ) φ(ik)φ(jk)′ = Σ_{i=1}^n ξi^t Σ_{j=1}^n pij^t (aij / pij) φ(i)φ(j)′ ≈ Σ_{i=1}^n ξi Σ_{j=1}^n aij φ(i)φ(j)′

6.231 DYNAMIC PROGRAMMING

LECTURE 25

LECTURE OUTLINE

• Additional topics in ADP

• Nonlinear versions of the projected equation

• Extension of Q-learning for optimal stopping

• Basis function adaptation

• Gradient-based approximation in policy space

NONLINEAR EXTENSIONS OF PROJECTED EQ.

• If the mapping T is nonlinear (as for example in the case of multiple policies) the projected equation Φr = ΠT(Φr) is also nonlinear.

• Any solution r∗ satisfies

r∗ ∈ arg min_{r∈ℜ^s} ‖Φr − T(Φr∗)‖²

or equivalently

Φ′( Φr∗ − T(Φr∗) ) = 0

This is a nonlinear equation, which may have one or many solutions, or no solution at all.

• If ΠT is a contraction, then there is a unique solution that can be obtained (in principle) by the fixed point iteration

Φrk+1 = ΠT(Φrk)

• We have seen a nonlinear special case of projected value iteration/LSPE where ΠT is a contraction, namely optimal stopping.

• This case can be generalized.

LSPE FOR OPTIMAL STOPPING EXTENDED

• Consider a system of the form

x = T(x) = Af(x) + b,

where f : ℜ^n → ℜ^n is a mapping with scalar components of the form f(x) = (f1(x1), . . . , fn(xn)).

• Assume that each fi : ℜ → ℜ is nonexpansive:

|fi(xi) − fi(x̄i)| ≤ |xi − x̄i|,   ∀ i, xi, x̄i ∈ ℜ

This guarantees that T is a contraction with respect to any weighted Euclidean norm ‖·‖ξ whenever A is a contraction with respect to that norm.

• Algorithms similar to LSPE [approximating Φrk+1 = ΠT(Φrk)] are then possible.

• Special case: In the optimal stopping problem of Section 6.4, x is the Q-factor corresponding to the continuation action, α ∈ (0, 1) is a discount factor, fi(xi) = min{ci, xi}, and A = αP, where P is the transition matrix for continuing.

• If Σ_{j=1}^n pij < 1 for some state i, and 0 ≤ P ≤ Q, where Q is an irreducible transition matrix, then Π((1 − γ)I + γT) is a contraction with respect to ‖ · ‖ξ for all γ ∈ (0, 1), even with α = 1.

BASIS FUNCTION ADAPTATION I

• An important issue in ADP is how to select basis functions.

• A possible approach is to introduce basis functions that are parametrized by a vector θ, and optimize over θ, i.e., solve the problem

min_{θ∈Θ} F( J(θ) )

where J(θ) is the solution of the projected equation.

• One example is

F( J(θ) ) = ‖ J(θ) − T( J(θ) ) ‖²

• Another example is

F( J(θ) ) = Σ_{i∈I} |J(i) − J(θ)(i)|²,

where I is a subset of states, and J(i), i ∈ I, are the costs of the policy at these states calculated directly by simulation.

BASIS FUNCTION ADAPTATION II

• Some algorithm may be used to minimize F(J(θ)) over θ.

• A challenge here is that the algorithm should use low-dimensional calculations.

• One possibility is to use a form of random search method; see the paper by Menache, Mannor, and Shimkin (Annals of Oper. Res., Vol. 134, 2005).

• Another possibility is to use a gradient method. For this it is necessary to estimate the partial derivatives of J(θ) with respect to the components of θ.

• It turns out that by differentiating the projected equation, these partial derivatives can be calculated using low-dimensional operations. See the paper by Menache, Mannor, and Shimkin, and a recent paper by Yu and Bertsekas (2008).

APPROXIMATION IN POLICY SPACE I

• Consider an average cost problem, where the problem data are parametrized by a vector r, i.e., a cost vector g(r) and a transition probability matrix P(r). Let η(r) be the (scalar) average cost per stage, satisfying Bellman's equation

η(r)e + h(r) = g(r) + P(r)h(r)

where h(r) is the corresponding differential cost vector.

• Consider minimizing η(r) over r (here the data dependence on control is encoded in the parametrization). We can try to solve the problem by nonlinear programming/gradient descent methods.

• Important fact: If ∆η is the change in η due to a small change ∆r from a given r, we have

∆η = ξ′(∆g + ∆Ph),

where ξ is the steady-state probability distribution/vector corresponding to P(r), and all the quantities above are evaluated at r:

∆η = η(r + ∆r) − η(r),

∆g = g(r + ∆r) − g(r),   ∆P = P(r + ∆r) − P(r)

APPROXIMATION IN POLICY SPACE II

• Proof of the gradient formula: We have, by "differentiating" Bellman's equation,

∆η(r)·e + ∆h(r) = ∆g(r) + ∆P(r)h(r) + P(r)∆h(r)

By left-multiplying with ξ′,

ξ′∆η(r)·e + ξ′∆h(r) = ξ′( ∆g(r) + ∆P(r)h(r) ) + ξ′P(r)∆h(r)

Since ξ′∆η(r)·e = ∆η(r) and ξ′ = ξ′P(r), this equation simplifies to

∆η = ξ′(∆g + ∆Ph)

• Since we don't know ξ, we cannot implement a gradient-like method for minimizing η(r). An alternative is to use "sampled gradients", i.e., generate a simulation trajectory (i0, i1, . . .), and change r once in a while, in the direction of a simulation-based estimate of ξ′(∆g + ∆Ph).

• There is much recent research on this subject; see, e.g., the work of Marbach and Tsitsiklis, and Konda and Tsitsiklis, and the refs given there.