

Chapter 12. Dynamic Programming

Neural Networks and Learning Machines (Haykin)

Lecture Notes on Self-learning Neural Algorithms

Byoung-Tak Zhang
School of Computer Science and Engineering

Seoul National University

Version: 20171011/20191017


Contents
12.1  Introduction ......................................... 3
12.2  Markov Decision Process .............................. 5
12.3  Bellman's Optimality Criterion ....................... 8
12.4  Policy Iteration ..................................... 11
12.5  Value Iteration ...................................... 13
12.6  Approximate DP: Direct Methods ....................... 17
12.7  Temporal Difference Learning ......................... 18
12.8  Q-Learning ........................................... 21
12.9  Approximate DP: Indirect Methods ..................... 24
12.10 Least Squares Policy Evaluation ...................... 26
12.11 Approximate Policy Iteration ......................... 30
Summary and Discussion ..................................... 33

(c) 2017 Biointelligence Lab, SNU 2


12.1 Introduction (1/2)

(c) 2017 Biointelligence Lab, SNU 3

Two paradigms of learning
1. Learning with a teacher: supervised learning
2. Learning without a teacher: unsupervised learning / reinforcement learning / semi-supervised learning

Reinforcement learning
1. Behavioral learning (action, sequential decision-making)
2. Interaction between an agent and its environment
3. Achieving a specific goal under uncertainty

Two approaches to reinforcement learning
1. Classical approach: punishment and reward (classical conditioning), highly skilled behavior
2. Modern approach: dynamic programming, planning


12.1 Introduction (2/2)

(c) 2017 Biointelligence Lab, SNU 4

Dynamic programming (DP)
• A technique that deals with situations where decisions are made in stages, with the outcome of each decision being predictable to some extent before the next decision is made.
• Decisions cannot be made in isolation: the desire for a low cost at present must be balanced against the undesirability of a high cost in the future.
• Credit or blame must be assigned to each one of a set of interacting decisions (the credit-assignment problem).
• Decision making by an agent that operates in a stochastic environment.
• How can an agent or decision maker improve its long-term performance in a stochastic environment when attaining this improvement may require sacrificing short-term performance?
• Markov decision process: provides the right balance between
  - a realistic description of the given problem (practical), and
  - the power of the analytic and computational methods that can be applied to the problem (theoretical).


12.2 Markov Decision Process (1/3)

(c) 2017 Biointelligence Lab, SNU 5

Markov decision process (MDP):
• a finite set of discrete states
• a finite set of possible actions
• cost (or reward)
• discrete time
The state of the environment is a summary of the entire past experience of an agent gained from its interaction with the environment, such that the information necessary for the agent to predict the future behavior of the environment is contained in that summary.

Figure 12.1 An agent interacting with its environment.

MDP = the sequence of states $\{X_n,\ n = 0, 1, 2, \ldots\}$, i.e., a Markov chain with transition probabilities $p_{ij}(\mu(i))$ for actions $\mu(i)$.


12.2 Markov Decision Process (2/3)

(c) 2017 Biointelligence Lab, SNU 6

$i, j \in \mathcal{X}$: states
$A_i = \{a_{ik}\}$: actions admissible in state $i$
$\pi = \{\mu_0, \mu_1, \ldots\}$: policy (a mapping from states $\mathcal{X}$ to actions $A$), with $\mu_n(i) \in A_i$ for all states $i$
  Nonstationary policy: $\pi = \{\mu_0, \mu_1, \ldots\}$
  Stationary policy: $\pi = \{\mu, \mu, \ldots\}$
$p_{ij}(a)$: transition probability, $p_{ij}(a) = P(X_{n+1} = j \mid X_n = i, A_n = a)$
  1. $p_{ij}(a) \ge 0$ for all $i$ and $j$
  2. $\sum_j p_{ij}(a) = 1$ for all $i$
$g(i, a_{ik}, j)$: cost function
$\gamma$: discount factor; $\gamma^n g(i, a_{ik}, j)$: discounted cost

Figure 12.2 Illustration of two possible transitions: the transition from state $i$ to state $j$ is probabilistic, but the transition from state $i$ to action $a_{ik}$ is deterministic.
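To make the ingredients above concrete, here is a minimal Python sketch of a finite MDP with explicit transition probabilities $p_{ij}(a)$, transition costs $g(i, a, j)$, and discount factor $\gamma$. The class and field names (FiniteMDP, states, actions, p, g, gamma) are illustrative, not from the textbook, and the model is assumed small enough to enumerate explicitly.

from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class FiniteMDP:
    # Finite set of discrete states X = {0, 1, ..., N-1}
    states: List[int]
    # Admissible actions A_i for each state i
    actions: Dict[int, List[int]]
    # Transition probabilities: p[(i, a, j)] = P(X_{n+1} = j | X_n = i, A_n = a)
    p: Dict[Tuple[int, int, int], float]
    # Transition costs: g[(i, a, j)] = cost of moving from i to j under action a
    g: Dict[Tuple[int, int, int], float]
    # Discount factor, 0 <= gamma < 1
    gamma: float

    def successors(self, i: int, a: int) -> Dict[int, float]:
        """Return {j: p_ij(a)} for the successor states j; the probabilities sum to 1."""
        return {j: self.p[(i, a, j)] for j in self.states if (i, a, j) in self.p}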


12.2 Markov Decision Process (3/3)

(c) 2017 Biointelligence Lab, SNU 7

Dynamic programming (DP) problem
  - Finite-horizon problem
  - Infinite-horizon problem

Cost-to-go function (total expected cost):
$J^\pi(i) = E\left[\sum_{n=0}^{\infty} \gamma^n g(X_n, \mu_n(X_n), X_{n+1}) \,\Big|\, X_0 = i\right]$
where $g(X_n, \mu_n(X_n), X_{n+1})$ is the observed cost.

Optimal value: $J^*(i) = \min_\pi J^\pi(i)$   (for an optimal stationary policy $\mu$: $J^\mu(i) = J^*(i)$)

Basic problem in DP: given a stationary MDP, find a stationary policy $\pi = \{\mu, \mu, \ldots\}$ that minimizes the cost-to-go function $J^\mu(i)$ for all initial states $i$.

Notation:
  Cost function $J(i)$ ↔ Value function $V(s)$
  Cost $g(\cdot)$ ↔ Reward $r(\cdot)$


12.3 Bellman’s Optimality Criterion (1/3)

(c) 2017 Biointelligence Lab, SNU 8

Principle of optimality
An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy starting from the state resulting from the first decision.

Consider a finite-horizon problem for which the cost-to-go function is
$J_0(X_0) = E\left[g_K(X_K) + \sum_{n=0}^{K-1} g_n(X_n, \mu_n(X_n), X_{n+1})\right]$

Suppose we wish to minimize the cost-to-go function
$J_n(X_n) = E\left[g_K(X_K) + \sum_{k=n}^{K-1} g_k(X_k, \mu_k(X_k), X_{k+1})\right]$

Then the truncated policy $\{\mu_n^*, \mu_{n+1}^*, \ldots, \mu_{K-1}^*\}$ is optimal for the subproblem.


12.3 Bellman’s Optimality Criterion (2/3)

(c) 2017 Biointelligence Lab, SNU 9

Dynamic programming algorithm
For every initial state $X_0$, the optimal cost $J^*(X_0)$ of the basic finite-horizon problem is equal to $J_0(X_0)$, where the function $J_0$ is obtained from the last step of the algorithm
$J_n(X_n) = \min_{\mu_n} E_{X_{n+1}}\left[g_n(X_n, \mu_n(X_n), X_{n+1}) + J_{n+1}(X_{n+1})\right]$       (12.13)
which runs backward in time, with
$J_K(X_K) = g_K(X_K)$
Furthermore, if $\mu_n^*$ minimizes the right-hand side of Eq. (12.13) for each $X_n$ and $n$, then the policy $\pi^* = \{\mu_0^*, \mu_1^*, \ldots, \mu_{K-1}^*\}$ is optimal.
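A minimal sketch of the backward recursion in Eq. (12.13), assuming the finite-horizon model is given as stage costs and transition probabilities stored in plain dictionaries; the container layout and names (p, g, g_K) are illustrative, not from the textbook.

def finite_horizon_dp(states, actions, p, g, g_K, K):
    """Backward recursion J_n(x) = min_a E[ g_n(x, a, X_{n+1}) + J_{n+1}(X_{n+1}) ].
    p[(n, x, a)] is a dict {x_next: probability}; g[(n, x, a, x_next)] is the stage cost;
    g_K[x] is the terminal cost. Returns the value tables J_n and the optimal policy mu_n."""
    J = {K: {x: g_K[x] for x in states}}          # terminal condition J_K(x) = g_K(x)
    mu = {}
    for n in range(K - 1, -1, -1):                # run backward in time
        J[n], mu[n] = {}, {}
        for x in states:
            best_cost, best_a = float("inf"), None
            for a in actions[x]:
                # expected stage cost plus cost-to-go of the successor state
                cost = sum(prob * (g[(n, x, a, x2)] + J[n + 1][x2])
                           for x2, prob in p[(n, x, a)].items())
                if cost < best_cost:
                    best_cost, best_a = cost, a
            J[n][x], mu[n][x] = best_cost, best_a
    return J, mu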


12.3 Bellman’s Optimality Criterion (3/3)

(c) 2017 Biointelligence Lab, SNU 10

Bellman's optimality equation
$J^*(i) = \min_{\mu}\left(c(i, \mu(i)) + \gamma\sum_{j=1}^{N} p_{ij}(\mu)\,J^*(j)\right)$   for $i = 1, 2, \ldots, N$

Immediate expected cost
$c(i, \mu(i)) = E_{X_1}\!\left[g(i, \mu(i), X_1)\right] = \sum_{j=1}^{N} p_{ij}\,g(i, \mu(i), j)$

Two methods for computing an optimal policy
  - Policy iteration
  - Value iteration


12.4 Policy Iteration (1/2)

Figure 12.3 Policy iteration algorithm.

(c) 2017 Biointelligence Lab, SNU 11

Q-factor for policy $\mu$:
$Q^\mu(i, a) = c(i, a) + \gamma\sum_{j=1}^{N} p_{ij}(a)\,J^\mu(j)$

1. Policy evaluation step: the cost-to-go function for the current policy and the corresponding Q-factors are computed for all states and actions.
2. Policy improvement step: the current policy is updated so as to be greedy with respect to the cost-to-go function computed in step 1.


12.4 Policy Iteration (2/2)

(c) 2017 Biointelligence Lab, SNU 12

Policy evaluation (solve for $J^{\mu_n}$):
$J^{\mu_n}(i) = c(i, \mu_n(i)) + \gamma\sum_{j=1}^{N} p_{ij}(\mu_n(i))\,J^{\mu_n}(j)$,   $i = 1, 2, \ldots, N$

Q-factors of the current policy:
$Q^{\mu_n}(i, a) = c(i, a) + \gamma\sum_{j=1}^{N} p_{ij}(a)\,J^{\mu_n}(j)$,   $a \in A_i$ and $i = 1, 2, \ldots, N$

Policy improvement (greedy update):
$\mu_{n+1}(i) = \arg\min_{a \in A_i} Q^{\mu_n}(i, a)$,   $i = 1, 2, \ldots, N$
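A minimal sketch of the two steps above for a finite MDP, assuming the model is available as the immediate expected costs c(i, a), per-action transition matrices P[a], admissible action sets, and discount factor gamma; these container names are illustrative. Policy evaluation is done exactly by solving the linear system.

import numpy as np

def policy_evaluation(c, P, mu, gamma):
    """Solve J = c_mu + gamma * P_mu J for the current policy mu.
    c[i][a]: immediate expected cost; P[a]: N x N transition matrix for action a."""
    N = len(mu)
    c_mu = np.array([c[i][mu[i]] for i in range(N)])
    P_mu = np.array([P[mu[i]][i] for i in range(N)])
    return np.linalg.solve(np.eye(N) - gamma * P_mu, c_mu)

def policy_improvement(c, P, J, gamma, actions):
    """Greedy update: mu_{n+1}(i) = argmin_a [ c(i,a) + gamma * sum_j p_ij(a) J(j) ]."""
    mu_new = []
    for i in range(len(J)):
        q = {a: c[i][a] + gamma * np.dot(P[a][i], J) for a in actions[i]}
        mu_new.append(min(q, key=q.get))
    return mu_new

def policy_iteration(c, P, actions, gamma, mu0):
    mu = list(mu0)
    while True:
        J = policy_evaluation(c, P, mu, gamma)
        mu_new = policy_improvement(c, P, J, gamma, actions)
        if mu_new == mu:          # policy is stable: Bellman's optimality equation is satisfied
            return mu, J
        mu = mu_new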


12.5 Value Iteration (1/4)

(c) 2017 Biointelligence Lab, SNU 13

Value iteration update (starting from an arbitrary initial $J_0$):
$J_{n+1}(i) = \min_{a \in A_i}\left(c(i, a) + \gamma\sum_{j=1}^{N} p_{ij}(a)\,J_n(j)\right)$,   $i = 1, 2, \ldots, N$
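A minimal sketch of the synchronous value-iteration sweep above, under the same illustrative model containers (c, P, actions, gamma) used in the policy-iteration sketch.

import numpy as np

def value_iteration(c, P, actions, gamma, tol=1e-8):
    """Iterate J_{n+1}(i) = min_a [ c(i,a) + gamma * sum_j p_ij(a) J_n(j) ] until convergence."""
    N = len(actions)
    J = np.zeros(N)                                   # arbitrary initialization J_0
    while True:
        J_new = np.array([min(c[i][a] + gamma * np.dot(P[a][i], J) for a in actions[i])
                          for i in range(N)])
        if np.max(np.abs(J_new - J)) < tol:
            break
        J = J_new
    # recover a greedy (optimal) stationary policy from the converged cost-to-go function
    mu = [min(actions[i], key=lambda a: c[i][a] + gamma * np.dot(P[a][i], J)) for i in range(N)]
    return J, mu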


12.5 Value Iteration (2/4)

Figure 12.4 Illustrative backup diagrams for (a) policy iteration and (b) value iteration.

(c) 2017 Biointelligence Lab, SNU 14


12.5 Value Iteration (3/4)

Figure 12.5 Flow graph for stagecoach problem.

(c) 2017 Biointelligence Lab, SNU 15


12.5 Value Iteration (4/4)

Figure 12.6 Steps involved in calculating the 𝑄-factors for the stagecoach problem. The routes (printed in blue) from 𝐴 to 𝐽 are the optimal ones.

(c) 2017 Biointelligence Lab, SNU 16


12.6 Approximate Dynamic Programming: Direct Methods

(c) 2017 Biointelligence Lab, SNU 17

• Dynamic programming (DP) requires an explicit model of the environment, i.e., the transition probabilities.
• Approximate DP: we may use Monte Carlo simulation to explicitly estimate (i.e., approximate) the transition probabilities.
  1. Direct methods
  2. Indirect methods: approximate policy evaluation, approximate cost-to-go function
• Direct methods for approximate DP
  1. Policy iteration (policy-evaluation step): temporal-difference (TD) learning
  2. Value iteration: Q-learning
• Reinforcement learning as the direct approximation of DP


12.7 Temporal-Difference Learning (1/3)

(c) 2017 Biointelligence Lab, SNU 18

Sequence of states: $\{i_n\}_{n=0}^{N}$

Cost-to-go function for the Bellman equation (fixed policy $\mu$):
$J^\mu(i_n) = E\left[g(i_n, i_{n+1}) + J^\mu(i_{n+1})\right]$,   $n = 0, 1, \ldots, N-1$

Applying the Robbins–Monro stochastic approximation
$r^+ = (1 - \eta)\,r + \eta\,g(r, v)$
we have
$J^+(i_n) = (1 - \eta)\,J(i_n) + \eta\left[g(i_n, i_{n+1}) + J(i_{n+1})\right]$
$\qquad\;\;\, = J(i_n) + \eta\left[g(i_n, i_{n+1}) + J(i_{n+1}) - J(i_n)\right]$

Temporal difference (TD):
$d_n = g(i_n, i_{n+1}) + J(i_{n+1}) - J(i_n)$,   $n = 0, 1, \ldots, N-1$

TD learning:
$J^+(i_n) = J(i_n) + \eta\,d_n$
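A minimal sketch of the TD update $J^+(i_n) = J(i_n) + \eta\,d_n$ applied along one simulated trajectory, assuming the trajectory was generated under the fixed policy being evaluated and is given as a list of (state, observed cost, next state) triples; this input format and the function name are illustrative.

def td0_policy_evaluation(transitions, num_states, eta):
    """transitions: list of (i_n, g(i_n, i_{n+1}), i_{n+1}) observed under the policy mu.
    Returns a table J approximating the cost-to-go J^mu (undiscounted, as on the slide)."""
    J = [0.0] * num_states
    for i, g, j in transitions:
        d = g + J[j] - J[i]          # temporal difference d_n
        J[i] += eta * d              # TD update
    return J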


12.7 Temporal-Difference Learning (2/3)

(c) 2017 Biointelligence Lab, SNU 19

Monte Carlo simulation algorithm
$J^\mu(i_n) = E\left[\sum_{k=0}^{N-n-1} g(i_{n+k}, i_{n+k+1})\right]$,   $n = 0, 1, \ldots, N-1$

Robbins–Monro stochastic approximation
$J^+(i_n) = J(i_n) + \eta_n\left(\sum_{k=0}^{N-n-1} g(i_{n+k}, i_{n+k+1}) - J(i_n)\right)$
$\qquad\;\;\, = J(i_n) + \eta_n\big[\,g(i_n, i_{n+1}) + J(i_{n+1}) - J(i_n) + g(i_{n+1}, i_{n+2}) + J(i_{n+2}) - J(i_{n+1}) + \cdots + g(i_{N-1}, i_N) + J(i_N) - J(i_{N-1})\,\big]$
(with the convention $J(i_N) = 0$ for the terminal state, so the bracketed sum telescopes to the total trajectory cost minus $J(i_n)$)

Using the temporal difference,
$J^+(i_n) = J(i_n) + \eta_n\sum_{k=0}^{N-n-1} d_{n+k}$

To justify that this is an iterative implementation of Monte Carlo simulation:

Total cost of the trajectory $\{i_n, i_{n+1}, \ldots, i_N\}$:
$c(i_n) = \sum_{k=0}^{N-n-1} g(i_{n+k}, i_{n+k+1})$,   $n = 0, \ldots, N-1$

Sample-averaged cost-to-go after visiting $i_n$ in $T$ simulations:
$J(i_n) = \frac{1}{T}\sum_{t=1}^{T} c_t(i_n)$

Ensemble-averaged cost-to-go:
$J^\mu(i_n) = E[c(i_n)]$   for all $n$

Iterative formula:
$J^+(i_n) = J(i_n) + \eta_n\left(c(i_n) - J(i_n)\right)$


12.7 Temporal-Difference Learning (3/3)

(c) 2017 Biointelligence Lab, SNU 20

TD methods
• Online prediction methods that learn how to compute their estimates, in part, on the basis of other estimates.
• Bootstrapping methods.
• Do not require a model of the environment.

TD($\lambda$)
$J^+(i_n) = J(i_n) + \eta\sum_{k=n}^{N-1} \lambda^{k-n}\,d_k$,   where $d_k = g(i_k, i_{k+1}) + J(i_{k+1}) - J(i_k)$

$\lambda = 0$:   $J^+(i_n) = J(i_n) + \eta\,d_n$   (one-step TD)
$\lambda = 1$:   $J^+(i_n) = J(i_n) + \eta\sum_{k=0}^{N-n-1} d_{n+k}$   (Monte Carlo)
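A minimal sketch of an offline TD($\lambda$) pass over one recorded trajectory, directly implementing $J^+(i_n) = J(i_n) + \eta\sum_{k\ge n}\lambda^{k-n} d_k$ with the temporal differences held fixed over the episode; the input format and names are illustrative.

def td_lambda_offline(transitions, J, eta, lam):
    """Offline TD(lambda) update over one trajectory.
    transitions: list of (i_k, g(i_k, i_{k+1}), i_{k+1}) for k = 0, ..., N-1.
    J: current cost-to-go table (list indexed by state); an updated copy is returned."""
    J = list(J)
    # temporal differences d_k, all computed with the values held at the start of the episode
    d = [g + J[j] - J[i] for (i, g, j) in transitions]
    N = len(transitions)
    for n in range(N):
        i_n = transitions[n][0]
        # J+(i_n) = J(i_n) + eta * sum_{k >= n} lambda^(k-n) d_k
        J[i_n] += eta * sum((lam ** (k - n)) * d[k] for k in range(n, N))
    return J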


12.8 Q-Learning (1/3)

(c) 2017 Biointelligence Lab, SNU 21

Is there any online procedure for learning an optimal control policy through experience that is gained solely on the basis of observing samples $s_n = (i_n, a_n, j_n, g_n)$, where $g_n = g(i_n, a_n, j_n)$?

Two-step version of Bellman's optimality equation
$Q^*(i, a) = \sum_{j=1}^{N} p_{ij}(a)\left(g(i, a, j) + \gamma \min_{b \in A_j} Q^*(j, b)\right)$   for all $(i, a)$

The value iteration algorithm (on Q-factors)
$Q^+(i, a) = \sum_{j=1}^{N} p_{ij}(a)\left(g(i, a, j) + \gamma \min_{b \in A_j} Q(j, b)\right)$   for all $(i, a)$

Small-step-size version
$Q^+(i, a) = (1 - \eta)\,Q(i, a) + \eta\sum_{j=1}^{N} p_{ij}(a)\left(g(i, a, j) + \gamma \min_{b \in A_j} Q(j, b)\right)$   for all $(i, a)$

Stochastic version based on a single sample
$Q_{n+1}(i, a) = (1 - \eta_n(i, a))\,Q_n(i, a) + \eta_n(i, a)\left[g(i, a, j) + \gamma J_n(j)\right]$   for $(i, a) = (i_n, a_n)$,
$J_n(j) = \min_{b \in A_j} Q_n(j, b)$,
$Q_{n+1}(i, a) = Q_n(i, a)$   for all $(i, a) \ne (i_n, a_n)$

Q-learning algorithm
$Q_{n+1}(i, a) = Q_n(i, a) + \eta_n(i, a)\left[g(i, a, j) + \gamma \min_{b \in A_j} Q_n(j, b) - Q_n(i, a)\right]$
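A minimal sketch of the Q-learning update for one observed sample $(i, a, j, g)$, assuming Q is stored as a dictionary of dictionaries initialized over all admissible state–action pairs; the per-pair visit counter and the step size $\eta = \alpha/(\beta + n)$ follow the time-varying choice discussed on the next slide, and all names are illustrative.

def q_learning_update(Q, counts, sample, gamma, alpha=1.0, beta=0.0):
    """One Q-learning step on sample (i, a, j, g):
    Q_{n+1}(i,a) = Q_n(i,a) + eta_n(i,a) * [ g + gamma * min_b Q_n(j,b) - Q_n(i,a) ].
    Q: dict state -> dict action -> value; counts: dict (i, a) -> number of visits so far."""
    i, a, j, g = sample
    counts[(i, a)] = counts.get((i, a), 0) + 1
    eta = alpha / (beta + counts[(i, a)])            # time-varying step size eta_n(i, a)
    target = g + gamma * min(Q[j].values())          # g + gamma * min_b Q_n(j, b)
    Q[i][a] += eta * (target - Q[i][a])
    return Q, counts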


12.8 Q-Learning (2/3)

(c) 2017 Biointelligence Lab, SNU 22

Convergence theorem
Suppose that the learning-rate parameter $\eta_n(i, a)$ satisfies the conditions
$\sum_{n=0}^{\infty} \eta_n(i, a) = \infty$   and   $\sum_{n=0}^{\infty} \eta_n^2(i, a) < \infty$   for all $(i, a)$.
Then the sequence of Q-factors $\{Q_n(i, a)\}$ generated by the Q-learning algorithm converges with probability 1 to the optimal value $Q^*(i, a)$ for all state–action pairs $(i, a)$ as the number of iterations $n$ approaches infinity, provided that all state–action pairs are visited infinitely often.

Time-varying learning-rate parameter (one possible choice):
$\eta_n = \dfrac{\alpha}{\beta + n}$,   $n = 1, 2, \ldots$

The Q-learning algorithm may be viewed in one of two equivalent ways:
• a Robbins–Monro stochastic approximation algorithm, or
• a combination of value iteration and Monte Carlo simulation.

Compromise between two conflicting objectives in reinforcement learning
• Exploration: ensures that all admissible state–action pairs are explored often enough to satisfy the Q-learning convergence theorem.
• Exploitation: seeks to minimize the cost-to-go function by following a greedy policy.
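A minimal sketch of one common way to strike this compromise, $\varepsilon$-greedy action selection (a standard rule, not one prescribed by the slide): with probability $\varepsilon$ pick a random admissible action (exploration), otherwise pick the greedy, cost-minimizing action (exploitation).

import random

def epsilon_greedy(Q, i, admissible_actions, epsilon):
    """Select an action in state i: explore with probability epsilon, otherwise be greedy w.r.t. Q (costs)."""
    if random.random() < epsilon:
        return random.choice(admissible_actions)            # exploration
    return min(admissible_actions, key=lambda a: Q[i][a])   # exploitation (minimum Q-factor)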


12.8 Q-Learning (3/3)

(c) 2017 Biointelligence Lab, SNU 23

Figure 12.7 The time slots pertaining to the auxiliary and original control processes.

Mixed nonstationary policy
Switches between an auxiliary Markov process and the original Markov process controlled by a stationary greedy policy determined by Q-learning:
$n_k = m_{k-1} + L$,   $k = 1, 2, \ldots$, with $m_0 = 1$
$m_k = n_k + kL$,   $k = 1, 2, \ldots$


12.9 Approximate DP: Indirect Methods (1/2)

(c) 2017 Biointelligence Lab, SNU 24

Indirect approach to approximate DP:
• Rather than explicitly estimate the transition probabilities and associated transition costs, use Monte Carlo simulation to generate one or more system trajectories, so as to approximate the cost-to-go function of a given policy, or even the optimal cost-to-go function, and then optimize the approximation in some statistical sense.
• Thus, having abandoned the notion of optimality, we may capture the goal of the indirect approach to approximate dynamic programming in the following simple statement: do as well as possible, and not more.
• Performance optimality is traded off for computational optimality. This strategy is precisely what the human brain does on a daily basis: given a difficult decision-making problem, the brain provides a suboptimal solution that is the "best" in terms of reliability and available resource allocation.

Goal of approximate DP
Find a function $\tilde{J}(i, \mathbf{w})$ that approximates the optimal cost-to-go function $J^*(i)$ for state $i$, such that the cost difference $J^*(i) - \tilde{J}(i, \mathbf{w})$ is minimized according to some statistical criterion.

Two basic questions
Question 1: How do we choose the approximation function $\tilde{J}(i, \mathbf{w})$ in the first place?
Question 2: Having chosen an appropriate approximation function $\tilde{J}(i, \mathbf{w})$, how do we adapt the weight vector $\mathbf{w}$ so as to provide the "best fit" to Bellman's equation of optimality?


12.9 Approximate DP: Indirect Methods (2/2)

(c) 2017 Biointelligence Lab, SNU 25

Figure 12.8 Architectural layout of the linear approach to approximate dynamic programming.

Approximate DP
1. Linear approach
   $\tilde{J}(i, \mathbf{w}) = \sum_j \varphi_{ij} w_j = \boldsymbol{\phi}_i^T\mathbf{w}$   for all $i$
2. Nonlinear approach
   - Recurrent multilayer perceptrons (deep architectures)
   - Supervised training of a recurrent multilayer perceptron by a nonlinear sequential-state estimation algorithm that is derivative free.
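A minimal sketch of the linear approach $\tilde{J}(i, \mathbf{w}) = \boldsymbol{\phi}^T(i)\mathbf{w}$; the polynomial feature map below is an arbitrary illustrative choice, not one prescribed by the text.

import numpy as np

def features(i, num_states):
    """Illustrative feature vector phi(i) for state i: constant, linear, and quadratic basis functions."""
    x = i / max(num_states - 1, 1)
    return np.array([1.0, x, x * x])

def approx_cost_to_go(i, w, num_states):
    """Linear approximation J~(i, w) = phi(i)^T w."""
    return float(features(i, num_states) @ w)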


12.10 Least-Squares Policy Evaluation (1/4)

(c) 2017 Biointelligence Lab, SNU 26

Linear approximation of the cost-to-go function:
$J^\mu(i) = E\left[\sum_{n=0}^{\infty}\gamma^n g(i_n, i_{n+1}) \,\Big|\, i_0 = i\right] \approx \tilde{J}(i, \mathbf{w}) = \boldsymbol{\phi}^T(i)\,\mathbf{w}$

Idea: perform value iteration within a lower-dimensional subspace spanned by a set of basis functions (features),
$\mathcal{S} = \{\boldsymbol{\Phi}\mathbf{w} \mid \mathbf{w} \in \mathbb{R}^s\}$,
where $\boldsymbol{\Phi}$ is the $N \times s$ feature matrix whose rows are the feature vectors $\boldsymbol{\phi}^T(1), \boldsymbol{\phi}^T(2), \ldots, \boldsymbol{\phi}^T(N)$:
$\boldsymbol{\Phi}\mathbf{w}_{n+1} = \Pi T(\boldsymbol{\Phi}\mathbf{w}_n) + (\text{simulation noise})$

Assumptions
1. The Markov chain has positive steady-state probabilities; that is,
$\lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^{n} P(i_k = j \mid i_0 = i) = \pi_j > 0$   for all $i$.
The implication of this assumption is that the Markov chain has a single recurrent class with no transient states.
2. The rank of the feature matrix $\boldsymbol{\Phi}$ is $s$.
The implication of this second assumption is that the columns of $\boldsymbol{\Phi}$, and therefore the basis functions they represent, are linearly independent.


12.10 Least-Squares Policy Evaluation (2/4)

(c) 2017 Biointelligence Lab, SNU 27

Policy evaluation in vector form:
$J(i) = \sum_{j=1}^{N} p_{ij}\left(g(i, j) + \gamma J(j)\right)$,   $i = 1, 2, \ldots, N$

$T\mathbf{J} = \mathbf{g} + \gamma\mathbf{P}\mathbf{J}$

with
$\mathbf{g} = \left[\sum_j p_{1j}\,g(1, j),\ \sum_j p_{2j}\,g(2, j),\ \ldots,\ \sum_j p_{Nj}\,g(N, j)\right]^T$
$\mathbf{P} = \begin{bmatrix} p_{11} & \cdots & p_{1N} \\ \vdots & \ddots & \vdots \\ p_{N1} & \cdots & p_{NN} \end{bmatrix}$,   $\mathbf{J} = \left[J(1),\ J(2),\ \ldots,\ J(N)\right]^T \approx \boldsymbol{\Phi}\mathbf{w}$

Value iteration: $\mathbf{J}_{n+1} = T\mathbf{J}_n$

Projected value iteration: $\boldsymbol{\Phi}\mathbf{w}_{n+1} = \Pi T(\boldsymbol{\Phi}\mathbf{w}_n)$,   $n = 0, 1, 2, \ldots$,
where $\Pi$ denotes projection onto the subspace $\mathcal{S}$.

Figure 12.9 Projected value iteration (PVI) method.

Projected value iteration (PVI) for policy evaluation: at iteration $n$, the current iterate $\boldsymbol{\Phi}\mathbf{w}_n$ is operated on by the mapping $T$, and the new vector $T(\boldsymbol{\Phi}\mathbf{w}_n)$ is projected onto the subspace $\mathcal{S}$, thereby yielding the updated iterate $\boldsymbol{\Phi}\mathbf{w}_{n+1}$.


12.10 Least-Squares Policy Evaluation (3/4)

(c) 2017 Biointelligence Lab, SNU 28

From projected value iteration to least-squares policy evaluation (LSPE)

Least-squares minimization for the projection $\Pi$, i.e., for $\boldsymbol{\Phi}\mathbf{w}_{n+1} = \Pi T(\boldsymbol{\Phi}\mathbf{w}_n)$:
$\mathbf{w}_{n+1} = \arg\min_{\mathbf{w}} \left\| \boldsymbol{\Phi}\mathbf{w} - T(\boldsymbol{\Phi}\mathbf{w}_n) \right\|_\pi^2$

Least-squares version of the PVI algorithm:
$\mathbf{w}_{n+1} = \arg\min_{\mathbf{w}} \sum_{i=1}^{N} \pi_i\left( \boldsymbol{\phi}^T(i)\mathbf{w} - \sum_{j=1}^{N} p_{ij}\left(g(i, j) + \gamma\,\boldsymbol{\phi}^T(j)\mathbf{w}_n\right) \right)^2$

Use Monte Carlo simulation: generate a trajectory $(i_0, i_1, i_2, \ldots)$ and update $\mathbf{w}_n$ after each transition $(i_n, i_{n+1})$:
$\mathbf{w}_{n+1} = \arg\min_{\mathbf{w}} \sum_{k=1}^{n}\left( \boldsymbol{\phi}^T(i_k)\mathbf{w} - g(i_k, i_{k+1}) - \gamma\,\boldsymbol{\phi}^T(i_{k+1})\mathbf{w}_n \right)^2$
(least-squares policy evaluation, or LSPE, algorithm)

LSPE converges to the fixed point of PVI:
$\boldsymbol{\Phi}\mathbf{w}^* = \Pi T(\boldsymbol{\Phi}\mathbf{w}^*)$

Figure 12.10 Illustration of least-squares policy evaluation (LSPE) as a stochastic version of projected value iteration (PVI).
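A minimal sketch of one simulation-based LSPE iteration as given above, solving the least-squares problem in closed form via the normal equations. Here phi is any feature map returning a NumPy vector (for example the one sketched in Section 12.9); the function and argument names are illustrative.

import numpy as np

def lspe_step(phi, trajectory, costs, w_n, gamma):
    """One LSPE iteration:
    w_{n+1} = argmin_w sum_k ( phi(i_k)^T w - g(i_k, i_{k+1}) - gamma * phi(i_{k+1})^T w_n )^2.
    trajectory: visited states [i_0, ..., i_n]; costs[k] = g(i_k, i_{k+1})."""
    A = np.zeros((len(w_n), len(w_n)))
    b = np.zeros(len(w_n))
    for k in range(len(trajectory) - 1):
        f_k = phi(trajectory[k])
        target = costs[k] + gamma * phi(trajectory[k + 1]) @ w_n     # bootstrapped target
        A += np.outer(f_k, f_k)
        b += f_k * target
    return np.linalg.lstsq(A, b, rcond=None)[0]      # least-squares solution for w_{n+1}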


12.10 Least-Squares Policy Evaluation (4/4)

(c) 2017 Biointelligence Lab, SNU 29

At iteration $n+1$ of the LSPE($\lambda$) algorithm, the updated weight vector $\mathbf{w}_{n+1}$ is computed as the particular value of the weight vector $\mathbf{w}$ that minimizes the least-squares difference between the following two quantities:
• the inner product $\boldsymbol{\phi}^T(i_k)\mathbf{w}$, approximating the cost function $J(i_k)$;
• its temporal-difference counterpart
$\boldsymbol{\phi}^T(i_k)\mathbf{w}_n + \sum_{m=k}^{n} (\gamma\lambda)^{m-k}\,d_n(i_m, i_{m+1})$,
which is extracted from a single simulated trajectory for $k = 0, 1, \ldots, n$.

LSPE($\lambda$):
$d_n(i_k, i_{k+1}) = g(i_k, i_{k+1}) + \gamma\,\boldsymbol{\phi}^T(i_{k+1})\mathbf{w}_n - \boldsymbol{\phi}^T(i_k)\mathbf{w}_n$
$\mathbf{w}_{n+1} = \arg\min_{\mathbf{w}} \sum_{k=1}^{n}\left( \boldsymbol{\phi}^T(i_k)\mathbf{w} - \boldsymbol{\phi}^T(i_k)\mathbf{w}_n - \sum_{m=k}^{n} (\gamma\lambda)^{m-k}\,d_n(i_m, i_{m+1}) \right)^2$


12.11 Approximate Policy Iteration (1/3)

(c) 2017 Biointelligence Lab, SNU 30

1. Approximate policy evaluation step. Given the current policy $\mu$, we compute a cost-to-go function $\tilde{J}^\mu(i, \mathbf{w})$ approximating the actual cost-to-go function $J^\mu(i)$ for all states $i$. The vector $\mathbf{w}$ is the weight vector of the neural network used to perform the approximation.
2. Policy improvement step. Using the approximate cost-to-go function $\tilde{J}^\mu(i, \mathbf{w})$, we generate an improved policy. This new policy is designed to be greedy with respect to $\tilde{J}^\mu(i, \mathbf{w})$ for all $i$.

Figure 12.11 Approximate policy iteration algorithm.


12.11 Approximate Policy Iteration (2/3)

(c) 2017 Biointelligence Lab, SNU 31

Least-squares cost function for training the approximator, over the $M(i)$ simulated cost samples $k(i, m)$ for each state $i$:
$\varepsilon(\mathbf{w}) = \sum_{i \in \mathcal{X}} \sum_{m=1}^{M(i)} \left( k(i, m) - \tilde{J}^\mu(i, \mathbf{w}) \right)^2$

Approximate Q-factor and greedy policy improvement:
$Q(i, a, \mathbf{w}) = \sum_{j \in \mathcal{X}} p_{ij}(a)\left( g(i, a, j) + \gamma\,\tilde{J}^\mu(j, \mathbf{w}) \right)$
$\mu(i) = \arg\min_{a \in A_i} Q(i, a, \mathbf{w})$

Figure 12.12 Block diagram of the approximate policy iteration algorithm.


12.11 Approximate Policy Iteration (3/3)

(c) 2017 Biointelligence Lab, SNU 32


Summary and Discussion

(c) 2017 Biointelligence Lab, SNU 33

• Approximate Dynamic Programming: Direct Methods
  - Temporal-difference (TD) learning
  - Q-learning
• Approximate Dynamic Programming: Indirect Methods
  - Linear structural approach, which involves
    - feature extraction of the state i
    - least-squares minimization (e.g., LSPE)
  - Nonlinear structural approach, which relies on universal approximators (e.g., neural networks)
• Partial Observability (POMDP)
• Relationship between Dynamic Programming and the Viterbi Algorithm