

Chapter 12. Dynamic Programming

Neural Networks and Learning Machines (Haykin)

Lecture Notes on Self-learning Neural Algorithms

Byoung-Tak Zhang
School of Computer Science and Engineering

Seoul National University

Version: 20171011/20191017


Contents
12.1  Introduction ......................................... 3
12.2  Markov Decision Process .............................. 5
12.3  Bellman's Optimality Criterion ....................... 8
12.4  Policy Iteration ..................................... 11
12.5  Value Iteration ...................................... 13
12.6  Approximate DP: Direct Methods ....................... 17
12.7  Temporal Difference Learning ......................... 18
12.8  Q-Learning ........................................... 21
12.9  Approximate DP: Indirect Methods ..................... 24
12.10 Least Squares Policy Evaluation ...................... 26
12.11 Approximate Policy Iteration ......................... 30
Summary and Discussion ..................................... 33

(c) 2017 Biointelligence Lab, SNU 2


12.1 Introduction (1/2)

(c) 2017 Biointelligence Lab, SNU 3

Two paradigms of learning
1. Learning with a teacher: supervised learning
2. Learning without a teacher: unsupervised learning / reinforcement learning / semi-supervised learning

Reinforcement learning
1. Behavioral learning (action, sequential decision-making)
2. Interaction between an agent and its environment
3. Achieving a specific goal under uncertainty

Two approaches to reinforcement learning
1. Classical approach: punishment and reward (classical conditioning), highly skilled behavior
2. Modern approach: dynamic programming, planning


12.1 Introduction (2/2)

(c) 2017 Biointelligence Lab, SNU 4

Dynamic programming (DP)
• A technique that deals with situations where decisions are made in stages, with the outcome of each decision being predictable to some extent before the next decision is made.
• Decisions cannot be made in isolation: the desire for a low cost at present must be balanced against the undesirability of a high cost in the future.
• Credit or blame must be assigned to each one of a set of interacting decisions (the credit-assignment problem).
• Decision making by an agent that operates in a stochastic environment.
• How can an agent or decision maker improve its long-term performance in a stochastic environment when attaining this improvement may require sacrificing short-term performance?
• Markov decision process: provides the right balance between
  - a realistic description of the given problem (practical), and
  - the power of the analytic and computational methods that can be applied to the problem (theoretical).


12.2 Markov Decision Process (1/3)

(c) 2017 Biointelligence Lab, SNU 5

Markov decision process (MDP):
• a finite set of discrete states
• a finite set of possible actions
• cost (or reward)
• discrete time
The state of the environment is a summary of the entire past experience of an agent gained from its interaction with the environment, such that the information necessary for the agent to predict the future behavior of the environment is contained in that summary.

Figure 12.1 An agent interacting with its environment.

MDP = the sequence of states $\{X_n,\ n = 0, 1, 2, \ldots\}$, i.e., a Markov chain with transition probabilities $p_{ij}(\mu(i))$ for actions $\mu(i)$.


12.2 Markov Decision Process (2/3)

(c) 2017 Biointelligence Lab, SNU 6

$i, j \in \mathcal{X}$: states
$A_i = \{a_{ik}\}$: actions admissible in state $i$
$\pi = \{\mu_0, \mu_1, \ldots\}$: policy (a mapping from states $\mathcal{X}$ to actions $A$), with $\mu_n(i) \in A_i$ for all states $i$
  Nonstationary policy: $\pi = \{\mu_0, \mu_1, \ldots\}$
  Stationary policy: $\pi = \{\mu, \mu, \ldots\}$
$p_{ij}(a)$: transition probability, $p_{ij}(a) = P(X_{n+1} = j \mid X_n = i, A_n = a)$
  1. $p_{ij}(a) \ge 0$ for all $i$ and $j$
  2. $\sum_j p_{ij}(a) = 1$ for all $i$
$g(i, a_{ik}, j)$: cost function
$\gamma$: discount factor; $\gamma^n g(i, a_{ik}, j)$: discounted cost

Figure 12.2 Illustration of two possible transitions: the transition from state $i$ to state $j$ is probabilistic, but the transition from state $i$ to action $a_{ik}$ is deterministic.
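To make the ingredients above concrete, here is a minimal Python sketch of a finite MDP with explicit transition probabilities $p_{ij}(a)$, transition costs $g(i, a, j)$, and discount factor $\gamma$. The class and field names (FiniteMDP, states, actions, p, g, gamma) are illustrative, not from the textbook, and the model is assumed small enough to enumerate explicitly.

from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class FiniteMDP:
    # Finite set of discrete states X = {0, 1, ..., N-1}
    states: List[int]
    # Admissible actions A_i for each state i
    actions: Dict[int, List[int]]
    # Transition probabilities: p[(i, a, j)] = P(X_{n+1} = j | X_n = i, A_n = a)
    p: Dict[Tuple[int, int, int], float]
    # Transition costs: g[(i, a, j)] = cost of moving from i to j under action a
    g: Dict[Tuple[int, int, int], float]
    # Discount factor, 0 <= gamma < 1
    gamma: float

    def successors(self, i: int, a: int) -> Dict[int, float]:
        """Return {j: p_ij(a)} for the successor states j; the probabilities sum to 1."""
        return {j: self.p[(i, a, j)] for j in self.states if (i, a, j) in self.p}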


12.2 Markov Decision Process (3/3)

(c) 2017 Biointelligence Lab, SNU 7

Dynamic programming (DP) problem
  - Finite-horizon problem
  - Infinite-horizon problem

Cost-to-go function (total expected cost):
$J^\pi(i) = E\left[\sum_{n=0}^{\infty} \gamma^n g(X_n, \mu_n(X_n), X_{n+1}) \,\Big|\, X_0 = i\right]$
where $g(X_n, \mu_n(X_n), X_{n+1})$ is the observed cost.

Optimal value: $J^*(i) = \min_\pi J^\pi(i)$   (for an optimal stationary policy $\mu$: $J^\mu(i) = J^*(i)$)

Basic problem in DP: given a stationary MDP, find a stationary policy $\pi = \{\mu, \mu, \ldots\}$ that minimizes the cost-to-go function $J^\mu(i)$ for all initial states $i$.

Notation:
  Cost function $J(i)$ ↔ Value function $V(s)$
  Cost $g(\cdot)$ ↔ Reward $r(\cdot)$


12.3 Bellman’s Optimality Criterion (1/3)

(c) 2017 Biointelligence Lab, SNU 8

Principle of optimality
An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy starting from the state resulting from the first decision.

Consider a finite-horizon problem for which the cost-to-go function is
$J_0(X_0) = E\left[g_K(X_K) + \sum_{n=0}^{K-1} g_n(X_n, \mu_n(X_n), X_{n+1})\right]$

Suppose we wish to minimize the cost-to-go function
$J_n(X_n) = E\left[g_K(X_K) + \sum_{k=n}^{K-1} g_k(X_k, \mu_k(X_k), X_{k+1})\right]$

Then the truncated policy $\{\mu_n^*, \mu_{n+1}^*, \ldots, \mu_{K-1}^*\}$ is optimal for the subproblem.


12.3 Bellman’s Optimality Criterion (2/3)

(c) 2017 Biointelligence Lab, SNU 9

Dynamic programming algorithm
For every initial state $X_0$, the optimal cost $J^*(X_0)$ of the basic finite-horizon problem is equal to $J_0(X_0)$, where the function $J_0$ is obtained from the last step of the algorithm
$J_n(X_n) = \min_{\mu_n} E_{X_{n+1}}\left[g_n(X_n, \mu_n(X_n), X_{n+1}) + J_{n+1}(X_{n+1})\right]$       (12.13)
which runs backward in time, with
$J_K(X_K) = g_K(X_K)$
Furthermore, if $\mu_n^*$ minimizes the right-hand side of Eq. (12.13) for each $X_n$ and $n$, then the policy $\pi^* = \{\mu_0^*, \mu_1^*, \ldots, \mu_{K-1}^*\}$ is optimal.
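A minimal sketch of the backward recursion in Eq. (12.13), assuming the finite-horizon model is given as stage costs and transition probabilities stored in plain dictionaries; the container layout and names (p, g, g_K) are illustrative, not from the textbook.

def finite_horizon_dp(states, actions, p, g, g_K, K):
    """Backward recursion J_n(x) = min_a E[ g_n(x, a, X_{n+1}) + J_{n+1}(X_{n+1}) ].
    p[(n, x, a)] is a dict {x_next: probability}; g[(n, x, a, x_next)] is the stage cost;
    g_K[x] is the terminal cost. Returns the value tables J_n and the optimal policy mu_n."""
    J = {K: {x: g_K[x] for x in states}}          # terminal condition J_K(x) = g_K(x)
    mu = {}
    for n in range(K - 1, -1, -1):                # run backward in time
        J[n], mu[n] = {}, {}
        for x in states:
            best_cost, best_a = float("inf"), None
            for a in actions[x]:
                # expected stage cost plus cost-to-go of the successor state
                cost = sum(prob * (g[(n, x, a, x2)] + J[n + 1][x2])
                           for x2, prob in p[(n, x, a)].items())
                if cost < best_cost:
                    best_cost, best_a = cost, a
            J[n][x], mu[n][x] = best_cost, best_a
    return J, mu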


12.3 Bellman’s Optimality Criterion (3/3)

(c) 2017 Biointelligence Lab, SNU 10

Bellman's optimality equation
$J^*(i) = \min_{\mu}\left(c(i, \mu(i)) + \gamma\sum_{j=1}^{N} p_{ij}(\mu)\,J^*(j)\right)$   for $i = 1, 2, \ldots, N$

Immediate expected cost
$c(i, \mu(i)) = E_{X_1}\!\left[g(i, \mu(i), X_1)\right] = \sum_{j=1}^{N} p_{ij}\,g(i, \mu(i), j)$

Two methods for computing an optimal policy
  - Policy iteration
  - Value iteration


12.4 Policy Iteration (1/2)

Figure 12.3 Policy iteration algorithm.

(c) 2017 Biointelligence Lab, SNU 11

Q-factor for policy $\mu$:
$Q^\mu(i, a) = c(i, a) + \gamma\sum_{j=1}^{N} p_{ij}(a)\,J^\mu(j)$

1. Policy evaluation step: the cost-to-go function for the current policy and the corresponding Q-factors are computed for all states and actions.
2. Policy improvement step: the current policy is updated so as to be greedy with respect to the cost-to-go function computed in step 1.


12.4 Policy Iteration (2/2)

(c) 2017 Biointelligence Lab, SNU 12

Policy evaluation (solve for $J^{\mu_n}$):
$J^{\mu_n}(i) = c(i, \mu_n(i)) + \gamma\sum_{j=1}^{N} p_{ij}(\mu_n(i))\,J^{\mu_n}(j)$,   $i = 1, 2, \ldots, N$

Q-factors of the current policy:
$Q^{\mu_n}(i, a) = c(i, a) + \gamma\sum_{j=1}^{N} p_{ij}(a)\,J^{\mu_n}(j)$,   $a \in A_i$ and $i = 1, 2, \ldots, N$

Policy improvement (greedy update):
$\mu_{n+1}(i) = \arg\min_{a \in A_i} Q^{\mu_n}(i, a)$,   $i = 1, 2, \ldots, N$
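A minimal sketch of the two steps above for a finite MDP, assuming the model is available as the immediate expected costs c(i, a), per-action transition matrices P[a], admissible action sets, and discount factor gamma; these container names are illustrative. Policy evaluation is done exactly by solving the linear system.

import numpy as np

def policy_evaluation(c, P, mu, gamma):
    """Solve J = c_mu + gamma * P_mu J for the current policy mu.
    c[i][a]: immediate expected cost; P[a]: N x N transition matrix for action a."""
    N = len(mu)
    c_mu = np.array([c[i][mu[i]] for i in range(N)])
    P_mu = np.array([P[mu[i]][i] for i in range(N)])
    return np.linalg.solve(np.eye(N) - gamma * P_mu, c_mu)

def policy_improvement(c, P, J, gamma, actions):
    """Greedy update: mu_{n+1}(i) = argmin_a [ c(i,a) + gamma * sum_j p_ij(a) J(j) ]."""
    mu_new = []
    for i in range(len(J)):
        q = {a: c[i][a] + gamma * np.dot(P[a][i], J) for a in actions[i]}
        mu_new.append(min(q, key=q.get))
    return mu_new

def policy_iteration(c, P, actions, gamma, mu0):
    mu = list(mu0)
    while True:
        J = policy_evaluation(c, P, mu, gamma)
        mu_new = policy_improvement(c, P, J, gamma, actions)
        if mu_new == mu:          # policy is stable: Bellman's optimality equation is satisfied
            return mu, J
        mu = mu_new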


12.5 Value Iteration (1/4)

(c) 2017 Biointelligence Lab, SNU 13

Value iteration update (starting from an arbitrary initial $J_0$):
$J_{n+1}(i) = \min_{a \in A_i}\left(c(i, a) + \gamma\sum_{j=1}^{N} p_{ij}(a)\,J_n(j)\right)$,   $i = 1, 2, \ldots, N$
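A minimal sketch of the synchronous value-iteration sweep above, under the same illustrative model containers (c, P, actions, gamma) used in the policy-iteration sketch.

import numpy as np

def value_iteration(c, P, actions, gamma, tol=1e-8):
    """Iterate J_{n+1}(i) = min_a [ c(i,a) + gamma * sum_j p_ij(a) J_n(j) ] until convergence."""
    N = len(actions)
    J = np.zeros(N)                                   # arbitrary initialization J_0
    while True:
        J_new = np.array([min(c[i][a] + gamma * np.dot(P[a][i], J) for a in actions[i])
                          for i in range(N)])
        if np.max(np.abs(J_new - J)) < tol:
            break
        J = J_new
    # recover a greedy (optimal) stationary policy from the converged cost-to-go function
    mu = [min(actions[i], key=lambda a: c[i][a] + gamma * np.dot(P[a][i], J)) for i in range(N)]
    return J, mu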


12.5 Value Iteration (2/4)

Figure 12.4 Illustrative backup diagrams for (a) policy iteration and (b) value iteration.

(c) 2017 Biointelligence Lab, SNU 14


12.5 Value Iteration (3/4)

Figure 12.5 Flow graph for stagecoach problem.

(c) 2017 Biointelligence Lab, SNU 15


12.5 Value Iteration (4/4)

Figure 12.6 Steps involved in calculating the 𝑄-factors for the stagecoach problem. The routes (printed in blue) from 𝐴 to 𝐽 are the optimal ones.

(c) 2017 Biointelligence Lab, SNU 16


12.6 Approximate Dynamic Programming: Direct Methods

(c) 2017 Biointelligence Lab, SNU 17

• Dynamic programming (DP) requires an explicit model of the environment, i.e., the transition probabilities.
• Approximate DP: we may use Monte Carlo simulation to explicitly estimate (i.e., approximate) the transition probabilities.
  1. Direct methods
  2. Indirect methods: approximate policy evaluation, approximate cost-to-go function
• Direct methods for approximate DP
  1. Policy iteration (policy-evaluation step): temporal-difference (TD) learning
  2. Value iteration: Q-learning
• Reinforcement learning as the direct approximation of DP


12.7 Temporal-Difference Learning (1/3)

(c) 2017 Biointelligence Lab, SNU 18

Sequence of states: $\{i_n\}_{n=0}^{N}$

Cost-to-go function for the Bellman equation (fixed policy $\mu$):
$J^\mu(i_n) = E\left[g(i_n, i_{n+1}) + J^\mu(i_{n+1})\right]$,   $n = 0, 1, \ldots, N-1$

Applying the Robbins–Monro stochastic approximation
$r^+ = (1 - \eta)\,r + \eta\,g(r, v)$
we have
$J^+(i_n) = (1 - \eta)\,J(i_n) + \eta\left[g(i_n, i_{n+1}) + J(i_{n+1})\right]$
$\qquad\;\;\, = J(i_n) + \eta\left[g(i_n, i_{n+1}) + J(i_{n+1}) - J(i_n)\right]$

Temporal difference (TD):
$d_n = g(i_n, i_{n+1}) + J(i_{n+1}) - J(i_n)$,   $n = 0, 1, \ldots, N-1$

TD learning:
$J^+(i_n) = J(i_n) + \eta\,d_n$
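A minimal sketch of the TD update $J^+(i_n) = J(i_n) + \eta\,d_n$ applied along one simulated trajectory, assuming the trajectory was generated under the fixed policy being evaluated and is given as a list of (state, observed cost, next state) triples; this input format and the function name are illustrative.

def td0_policy_evaluation(transitions, num_states, eta):
    """transitions: list of (i_n, g(i_n, i_{n+1}), i_{n+1}) observed under the policy mu.
    Returns a table J approximating the cost-to-go J^mu (undiscounted, as on the slide)."""
    J = [0.0] * num_states
    for i, g, j in transitions:
        d = g + J[j] - J[i]          # temporal difference d_n
        J[i] += eta * d              # TD update
    return J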


12.7 Temporal-Difference Learning (2/3)

(c) 2017 Biointelligence Lab, SNU 19

Monte Carlo simulation algorithm
$J^\mu(i_n) = E\left[\sum_{k=0}^{N-n-1} g(i_{n+k}, i_{n+k+1})\right]$,   $n = 0, 1, \ldots, N-1$

Robbins–Monro stochastic approximation
$J^+(i_n) = J(i_n) + \eta_n\left(\sum_{k=0}^{N-n-1} g(i_{n+k}, i_{n+k+1}) - J(i_n)\right)$
$\qquad\;\;\, = J(i_n) + \eta_n\big[\,g(i_n, i_{n+1}) + J(i_{n+1}) - J(i_n) + g(i_{n+1}, i_{n+2}) + J(i_{n+2}) - J(i_{n+1}) + \cdots + g(i_{N-1}, i_N) + J(i_N) - J(i_{N-1})\,\big]$
(with the convention $J(i_N) = 0$ for the terminal state, so the bracketed sum telescopes to the total trajectory cost minus $J(i_n)$)

Using the temporal difference,
$J^+(i_n) = J(i_n) + \eta_n\sum_{k=0}^{N-n-1} d_{n+k}$

To justify that this is an iterative implementation of Monte Carlo simulation:

Total cost of the trajectory $\{i_n, i_{n+1}, \ldots, i_N\}$:
$c(i_n) = \sum_{k=0}^{N-n-1} g(i_{n+k}, i_{n+k+1})$,   $n = 0, \ldots, N-1$

Sample-averaged cost-to-go after visiting $i_n$ in $T$ simulations:
$J(i_n) = \frac{1}{T}\sum_{t=1}^{T} c_t(i_n)$

Ensemble-averaged cost-to-go:
$J^\mu(i_n) = E[c(i_n)]$   for all $n$

Iterative formula:
$J^+(i_n) = J(i_n) + \eta_n\left(c(i_n) - J(i_n)\right)$


12.7 Temporal-Difference Learning (3/3)

(c) 2017 Biointelligence Lab, SNU 20

TD methods
• Online prediction methods that learn how to compute their estimates, in part, on the basis of other estimates.
• Bootstrapping methods.
• Do not require a model of the environment.

TD($\lambda$)
$J^+(i_n) = J(i_n) + \eta\sum_{k=n}^{N-1} \lambda^{k-n}\,d_k$,   where $d_k = g(i_k, i_{k+1}) + J(i_{k+1}) - J(i_k)$

$\lambda = 0$:   $J^+(i_n) = J(i_n) + \eta\,d_n$   (one-step TD)
$\lambda = 1$:   $J^+(i_n) = J(i_n) + \eta\sum_{k=0}^{N-n-1} d_{n+k}$   (Monte Carlo)
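A minimal sketch of an offline TD($\lambda$) pass over one recorded trajectory, directly implementing $J^+(i_n) = J(i_n) + \eta\sum_{k\ge n}\lambda^{k-n} d_k$ with the temporal differences held fixed over the episode; the input format and names are illustrative.

def td_lambda_offline(transitions, J, eta, lam):
    """Offline TD(lambda) update over one trajectory.
    transitions: list of (i_k, g(i_k, i_{k+1}), i_{k+1}) for k = 0, ..., N-1.
    J: current cost-to-go table (list indexed by state); an updated copy is returned."""
    J = list(J)
    # temporal differences d_k, all computed with the values held at the start of the episode
    d = [g + J[j] - J[i] for (i, g, j) in transitions]
    N = len(transitions)
    for n in range(N):
        i_n = transitions[n][0]
        # J+(i_n) = J(i_n) + eta * sum_{k >= n} lambda^(k-n) d_k
        J[i_n] += eta * sum((lam ** (k - n)) * d[k] for k in range(n, N))
    return J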


12.8 Q-Learning (1/3)

(c) 2017 Biointelligence Lab, SNU 21

Is there any online procedure for learning an optimal control policy through experience that is gained solely on the basis of observing samples $s_n = (i_n, a_n, j_n, g_n)$, where $g_n = g(i_n, a_n, j_n)$?

Two-step version of Bellman's optimality equation
$Q^*(i, a) = \sum_{j=1}^{N} p_{ij}(a)\left(g(i, a, j) + \gamma \min_{b \in A_j} Q^*(j, b)\right)$   for all $(i, a)$

The value iteration algorithm (on Q-factors)
$Q^+(i, a) = \sum_{j=1}^{N} p_{ij}(a)\left(g(i, a, j) + \gamma \min_{b \in A_j} Q(j, b)\right)$   for all $(i, a)$

Small-step-size version
$Q^+(i, a) = (1 - \eta)\,Q(i, a) + \eta\sum_{j=1}^{N} p_{ij}(a)\left(g(i, a, j) + \gamma \min_{b \in A_j} Q(j, b)\right)$   for all $(i, a)$

Stochastic version based on a single sample
$Q_{n+1}(i, a) = (1 - \eta_n(i, a))\,Q_n(i, a) + \eta_n(i, a)\left[g(i, a, j) + \gamma J_n(j)\right]$   for $(i, a) = (i_n, a_n)$,
$J_n(j) = \min_{b \in A_j} Q_n(j, b)$,
$Q_{n+1}(i, a) = Q_n(i, a)$   for all $(i, a) \ne (i_n, a_n)$

Q-learning algorithm
$Q_{n+1}(i, a) = Q_n(i, a) + \eta_n(i, a)\left[g(i, a, j) + \gamma \min_{b \in A_j} Q_n(j, b) - Q_n(i, a)\right]$
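A minimal sketch of the Q-learning update for one observed sample $(i, a, j, g)$, assuming Q is stored as a dictionary of dictionaries initialized over all admissible state–action pairs; the per-pair visit counter and the step size $\eta = \alpha/(\beta + n)$ follow the time-varying choice discussed on the next slide, and all names are illustrative.

def q_learning_update(Q, counts, sample, gamma, alpha=1.0, beta=0.0):
    """One Q-learning step on sample (i, a, j, g):
    Q_{n+1}(i,a) = Q_n(i,a) + eta_n(i,a) * [ g + gamma * min_b Q_n(j,b) - Q_n(i,a) ].
    Q: dict state -> dict action -> value; counts: dict (i, a) -> number of visits so far."""
    i, a, j, g = sample
    counts[(i, a)] = counts.get((i, a), 0) + 1
    eta = alpha / (beta + counts[(i, a)])            # time-varying step size eta_n(i, a)
    target = g + gamma * min(Q[j].values())          # g + gamma * min_b Q_n(j, b)
    Q[i][a] += eta * (target - Q[i][a])
    return Q, counts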


12.8 Q-Learning (2/3)

(c) 2017 Biointelligence Lab, SNU 22

Convergence theorem
Suppose that the learning-rate parameter $\eta_n(i, a)$ satisfies the conditions
$\sum_{n=0}^{\infty} \eta_n(i, a) = \infty$   and   $\sum_{n=0}^{\infty} \eta_n^2(i, a) < \infty$   for all $(i, a)$.
Then the sequence of Q-factors $\{Q_n(i, a)\}$ generated by the Q-learning algorithm converges with probability 1 to the optimal value $Q^*(i, a)$ for all state–action pairs $(i, a)$ as the number of iterations $n$ approaches infinity, provided that all state–action pairs are visited infinitely often.

Time-varying learning-rate parameter (one possible choice):
$\eta_n = \dfrac{\alpha}{\beta + n}$,   $n = 1, 2, \ldots$

The Q-learning algorithm may be viewed in one of two equivalent ways:
• a Robbins–Monro stochastic approximation algorithm, or
• a combination of value iteration and Monte Carlo simulation.

Compromise between two conflicting objectives in reinforcement learning
• Exploration: ensures that all admissible state–action pairs are explored often enough to satisfy the Q-learning convergence theorem.
• Exploitation: seeks to minimize the cost-to-go function by following a greedy policy.
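A minimal sketch of one common way to strike this compromise, $\varepsilon$-greedy action selection (a standard rule, not one prescribed by the slide): with probability $\varepsilon$ pick a random admissible action (exploration), otherwise pick the greedy, cost-minimizing action (exploitation).

import random

def epsilon_greedy(Q, i, admissible_actions, epsilon):
    """Select an action in state i: explore with probability epsilon, otherwise be greedy w.r.t. Q (costs)."""
    if random.random() < epsilon:
        return random.choice(admissible_actions)            # exploration
    return min(admissible_actions, key=lambda a: Q[i][a])   # exploitation (minimum Q-factor)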


12.8 Q-Learning (3/3)

(c) 2017 Biointelligence Lab, SNU 23

Figure 12.7 The time slots pertaining to the auxiliary and original control processes.

Mixed nonstationary policy
Switches between an auxiliary Markov process and the original Markov process controlled by a stationary greedy policy determined by Q-learning:
$n_k = m_{k-1} + L$,   $k = 1, 2, \ldots$, with $m_0 = 1$
$m_k = n_k + kL$,   $k = 1, 2, \ldots$


12.9 Approximate DP: Indirect Methods (1/2)

(c) 2017 Biointelligence Lab, SNU 24

Indirect approach to approximate DP:
• Rather than explicitly estimate the transition probabilities and associated transition costs, use Monte Carlo simulation to generate one or more system trajectories, so as to approximate the cost-to-go function of a given policy, or even the optimal cost-to-go function, and then optimize the approximation in some statistical sense.
• Thus, having abandoned the notion of optimality, we may capture the goal of the indirect approach to approximate dynamic programming in the following simple statement: do as well as possible, and not more.
• Performance optimality is traded off for computational optimality. This strategy is precisely what the human brain does on a daily basis: given a difficult decision-making problem, the brain provides a suboptimal solution that is the "best" in terms of reliability and available resource allocation.

Goal of approximate DP
Find a function $\tilde{J}(i, \mathbf{w})$ that approximates the optimal cost-to-go function $J^*(i)$ for state $i$, such that the cost difference $J^*(i) - \tilde{J}(i, \mathbf{w})$ is minimized according to some statistical criterion.

Two basic questions
Question 1: How do we choose the approximation function $\tilde{J}(i, \mathbf{w})$ in the first place?
Question 2: Having chosen an appropriate approximation function $\tilde{J}(i, \mathbf{w})$, how do we adapt the weight vector $\mathbf{w}$ so as to provide the "best fit" to Bellman's equation of optimality?


12.9 Approximate DP: Indirect Methods (2/2)

(c) 2017 Biointelligence Lab, SNU 25

Figure 12.8 Architectural layout of the linear approach to approximate dynamic programming.

Approximate DP
1. Linear approach
   $\tilde{J}(i, \mathbf{w}) = \sum_j \varphi_{ij} w_j = \boldsymbol{\phi}_i^T\mathbf{w}$   for all $i$
2. Nonlinear approach
   - Recurrent multilayer perceptrons (deep architectures)
   - Supervised training of a recurrent multilayer perceptron by a nonlinear sequential-state estimation algorithm that is derivative free.
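A minimal sketch of the linear approach $\tilde{J}(i, \mathbf{w}) = \boldsymbol{\phi}^T(i)\mathbf{w}$; the polynomial feature map below is an arbitrary illustrative choice, not one prescribed by the text.

import numpy as np

def features(i, num_states):
    """Illustrative feature vector phi(i) for state i: constant, linear, and quadratic basis functions."""
    x = i / max(num_states - 1, 1)
    return np.array([1.0, x, x * x])

def approx_cost_to_go(i, w, num_states):
    """Linear approximation J~(i, w) = phi(i)^T w."""
    return float(features(i, num_states) @ w)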


12.10 Least-Squares Policy Evaluation (1/4)

(c) 2017 Biointelligence Lab, SNU 26

Linear approximation of the cost-to-go function:
$J^\mu(i) = E\left[\sum_{n=0}^{\infty}\gamma^n g(i_n, i_{n+1}) \,\Big|\, i_0 = i\right] \approx \tilde{J}(i, \mathbf{w}) = \boldsymbol{\phi}^T(i)\,\mathbf{w}$

Idea: perform value iteration within a lower-dimensional subspace spanned by a set of basis functions (features),
$\mathcal{S} = \{\boldsymbol{\Phi}\mathbf{w} \mid \mathbf{w} \in \mathbb{R}^s\}$,
where $\boldsymbol{\Phi}$ is the $N \times s$ feature matrix whose rows are the feature vectors $\boldsymbol{\phi}^T(1), \boldsymbol{\phi}^T(2), \ldots, \boldsymbol{\phi}^T(N)$:
$\boldsymbol{\Phi}\mathbf{w}_{n+1} = \Pi T(\boldsymbol{\Phi}\mathbf{w}_n) + (\text{simulation noise})$

Assumptions
1. The Markov chain has positive steady-state probabilities; that is,
$\lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^{n} P(i_k = j \mid i_0 = i) = \pi_j > 0$   for all $i$.
The implication of this assumption is that the Markov chain has a single recurrent class with no transient states.
2. The rank of the feature matrix $\boldsymbol{\Phi}$ is $s$.
The implication of this second assumption is that the columns of $\boldsymbol{\Phi}$, and therefore the basis functions they represent, are linearly independent.


12.10 Least-Squares Policy Evaluation (2/4)

(c) 2017 Biointelligence Lab, SNU 27

Policy evaluation in vector form:
$J(i) = \sum_{j=1}^{N} p_{ij}\left(g(i, j) + \gamma J(j)\right)$,   $i = 1, 2, \ldots, N$

$T\mathbf{J} = \mathbf{g} + \gamma\mathbf{P}\mathbf{J}$

with
$\mathbf{g} = \left[\sum_j p_{1j}\,g(1, j),\ \sum_j p_{2j}\,g(2, j),\ \ldots,\ \sum_j p_{Nj}\,g(N, j)\right]^T$
$\mathbf{P} = \begin{bmatrix} p_{11} & \cdots & p_{1N} \\ \vdots & \ddots & \vdots \\ p_{N1} & \cdots & p_{NN} \end{bmatrix}$,   $\mathbf{J} = \left[J(1),\ J(2),\ \ldots,\ J(N)\right]^T \approx \boldsymbol{\Phi}\mathbf{w}$

Value iteration: $\mathbf{J}_{n+1} = T\mathbf{J}_n$

Projected value iteration: $\boldsymbol{\Phi}\mathbf{w}_{n+1} = \Pi T(\boldsymbol{\Phi}\mathbf{w}_n)$,   $n = 0, 1, 2, \ldots$,
where $\Pi$ denotes projection onto the subspace $\mathcal{S}$.

Figure 12.9 Projected value iteration (PVI) method.

Projected value iteration (PVI) for policy evaluation: at iteration $n$, the current iterate $\boldsymbol{\Phi}\mathbf{w}_n$ is operated on by the mapping $T$, and the new vector $T(\boldsymbol{\Phi}\mathbf{w}_n)$ is projected onto the subspace $\mathcal{S}$, thereby yielding the updated iterate $\boldsymbol{\Phi}\mathbf{w}_{n+1}$.


12.10 Least-Squares Policy Evaluation (3/4)

(c) 2017 Biointelligence Lab, SNU 28

From projected value iteration to least-squares policy evaluation (LSPE)

Least-squares minimization for the projection $\Pi$, i.e., for $\boldsymbol{\Phi}\mathbf{w}_{n+1} = \Pi T(\boldsymbol{\Phi}\mathbf{w}_n)$:
$\mathbf{w}_{n+1} = \arg\min_{\mathbf{w}} \left\| \boldsymbol{\Phi}\mathbf{w} - T(\boldsymbol{\Phi}\mathbf{w}_n) \right\|_\pi^2$

Least-squares version of the PVI algorithm:
$\mathbf{w}_{n+1} = \arg\min_{\mathbf{w}} \sum_{i=1}^{N} \pi_i\left( \boldsymbol{\phi}^T(i)\mathbf{w} - \sum_{j=1}^{N} p_{ij}\left(g(i, j) + \gamma\,\boldsymbol{\phi}^T(j)\mathbf{w}_n\right) \right)^2$

Use Monte Carlo simulation: generate a trajectory $(i_0, i_1, i_2, \ldots)$ and update $\mathbf{w}_n$ after each transition $(i_n, i_{n+1})$:
$\mathbf{w}_{n+1} = \arg\min_{\mathbf{w}} \sum_{k=1}^{n}\left( \boldsymbol{\phi}^T(i_k)\mathbf{w} - g(i_k, i_{k+1}) - \gamma\,\boldsymbol{\phi}^T(i_{k+1})\mathbf{w}_n \right)^2$
(least-squares policy evaluation, or LSPE, algorithm)

LSPE converges to the fixed point of PVI:
$\boldsymbol{\Phi}\mathbf{w}^* = \Pi T(\boldsymbol{\Phi}\mathbf{w}^*)$

Figure 12.10 Illustration of least-squares policy evaluation (LSPE) as a stochastic version of projected value iteration (PVI).
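A minimal sketch of one simulation-based LSPE iteration as given above, solving the least-squares problem in closed form via the normal equations. Here phi is any feature map returning a NumPy vector (for example the one sketched in Section 12.9); the function and argument names are illustrative.

import numpy as np

def lspe_step(phi, trajectory, costs, w_n, gamma):
    """One LSPE iteration:
    w_{n+1} = argmin_w sum_k ( phi(i_k)^T w - g(i_k, i_{k+1}) - gamma * phi(i_{k+1})^T w_n )^2.
    trajectory: visited states [i_0, ..., i_n]; costs[k] = g(i_k, i_{k+1})."""
    A = np.zeros((len(w_n), len(w_n)))
    b = np.zeros(len(w_n))
    for k in range(len(trajectory) - 1):
        f_k = phi(trajectory[k])
        target = costs[k] + gamma * phi(trajectory[k + 1]) @ w_n     # bootstrapped target
        A += np.outer(f_k, f_k)
        b += f_k * target
    return np.linalg.lstsq(A, b, rcond=None)[0]      # least-squares solution for w_{n+1}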


12.10 Least-Squares Policy Evaluation (4/4)

(c) 2017 Biointelligence Lab, SNU 29

At iteration $n+1$ of the LSPE($\lambda$) algorithm, the updated weight vector $\mathbf{w}_{n+1}$ is computed as the particular value of the weight vector $\mathbf{w}$ that minimizes the least-squares difference between the following two quantities:
• the inner product $\boldsymbol{\phi}^T(i_k)\mathbf{w}$, approximating the cost function $J(i_k)$;
• its temporal-difference counterpart
$\boldsymbol{\phi}^T(i_k)\mathbf{w}_n + \sum_{m=k}^{n} (\gamma\lambda)^{m-k}\,d_n(i_m, i_{m+1})$,
which is extracted from a single simulated trajectory for $k = 0, 1, \ldots, n$.

LSPE($\lambda$):
$d_n(i_k, i_{k+1}) = g(i_k, i_{k+1}) + \gamma\,\boldsymbol{\phi}^T(i_{k+1})\mathbf{w}_n - \boldsymbol{\phi}^T(i_k)\mathbf{w}_n$
$\mathbf{w}_{n+1} = \arg\min_{\mathbf{w}} \sum_{k=1}^{n}\left( \boldsymbol{\phi}^T(i_k)\mathbf{w} - \boldsymbol{\phi}^T(i_k)\mathbf{w}_n - \sum_{m=k}^{n} (\gamma\lambda)^{m-k}\,d_n(i_m, i_{m+1}) \right)^2$


12.11 Approximate Policy Iteration (1/3)

(c) 2017 Biointelligence Lab, SNU 30

1. Approximate policy evaluation step. Given the current policy $\mu$, we compute a cost-to-go function $\tilde{J}^\mu(i, \mathbf{w})$ approximating the actual cost-to-go function $J^\mu(i)$ for all states $i$. The vector $\mathbf{w}$ is the weight vector of the neural network used to perform the approximation.
2. Policy improvement step. Using the approximate cost-to-go function $\tilde{J}^\mu(i, \mathbf{w})$, we generate an improved policy. This new policy is designed to be greedy with respect to $\tilde{J}^\mu(i, \mathbf{w})$ for all $i$.

Figure 12.11 Approximate policy iteration algorithm.


12.11 Approximate Policy Iteration (2/3)

(c) 2017 Biointelligence Lab, SNU 31

Least-squares cost function for training the approximator, over the $M(i)$ simulated cost samples $k(i, m)$ for each state $i$:
$\varepsilon(\mathbf{w}) = \sum_{i \in \mathcal{X}} \sum_{m=1}^{M(i)} \left( k(i, m) - \tilde{J}^\mu(i, \mathbf{w}) \right)^2$

Approximate Q-factor and greedy policy improvement:
$Q(i, a, \mathbf{w}) = \sum_{j \in \mathcal{X}} p_{ij}(a)\left( g(i, a, j) + \gamma\,\tilde{J}^\mu(j, \mathbf{w}) \right)$
$\mu(i) = \arg\min_{a \in A_i} Q(i, a, \mathbf{w})$

Figure 12.12 Block diagram of the approximate policy iteration algorithm.


12.11 Approximate Policy Iteration (3/3)

(c) 2017 Biointelligence Lab, SNU 32


Summary and Discussion

(c) 2017 Biointelligence Lab, SNU 33

• Approximate Dynamic Programming: Direct Methods
  - Temporal-difference (TD) learning
  - Q-learning
• Approximate Dynamic Programming: Indirect Methods
  - Linear structural approach, which involves
    - feature extraction of the state i
    - least-squares minimization (e.g., LSPE)
  - Nonlinear structural approach, which relies on universal approximators (e.g., neural networks)
• Partial Observability (POMDP)
• Relationship between Dynamic Programming and the Viterbi Algorithm