An Introduction to COMPUTATIONAL REINFORCEMENT LEARNING
Andrew G. Barto
Department of Computer Science
University of Massachusetts – Amherst
UPF Lecture 2
Autonomous Learning Laboratory – Department of Computer Science
A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998
The Overall Plan
❐ Lecture 1:
  – What is Computational Reinforcement Learning?
  – Learning from evaluative feedback
  – Markov decision processes
❐ Lecture 2:
  – Dynamic Programming
  – Simple Monte Carlo methods
  – Temporal Difference methods
  – A unified perspective
  – Connections to neuroscience
❐ Lecture 3:
  – Function approximation
  – Model-based methods
  – Abstraction and hierarchy
  – Intrinsically motivated RL
Lecture 2, Part 1: Dynamic Programming
Objectives of this part:
❐ Overview of a collection of classical solution methods for MDPs known as Dynamic Programming (DP)
❐ Show how DP can be used to compute value functions, and hence, optimal policies
❐ Discuss efficiency and utility of DP
Policy Evaluation
Policy Evaluation: for a given policy π, compute the state-value function $V^\pi$
Recall:
State-value function for policy π:
$$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\Big\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s\Big\}$$
Bellman equation for $V^\pi$:
$$V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} \big[R^a_{ss'} + \gamma V^\pi(s')\big]$$
This is a system of $|S|$ simultaneous linear equations.
Iterative Methods
$$V_0 \to V_1 \to \cdots \to V_k \to V_{k+1} \to \cdots \to V^\pi$$
Each arrow is a “sweep”.
A sweep consists of applying a backup operation to each state.
A full policy evaluation backup:
$$V_{k+1}(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} \big[R^a_{ss'} + \gamma V_k(s')\big]$$
A Small Gridworld
❐ An undiscounted episodic task
❐ Nonterminal states: 1, 2, ..., 14
❐ One terminal state (shown twice as shaded squares)
❐ Actions that would take agent off the grid leave state unchanged
❐ Reward is –1 until the terminal state is reached
Iterative Policy Eval for the Small Gridworld
π = random (uniform) action choices
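The sweep-based evaluation of the random policy on this gridworld can be sketched in a few lines of Python. This is a minimal sketch, not the lecture's own code: it assumes a 4x4 layout with cells numbered 0 to 15 row by row, with cells 0 and 15 both standing for the terminal state, and the names `step`, `evaluate_random_policy` and the threshold `theta` are illustrative.

```python
# Iterative policy evaluation for the small gridworld under the uniform random policy.
# Assumption: a 4x4 grid, cells numbered 0..15 row by row; cells 0 and 15 are terminal.

GAMMA = 1.0                                    # undiscounted episodic task
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
TERMINAL = {0, 15}


def step(state, action):
    """Deterministic model: return (next_state, reward) for one move."""
    row, col = divmod(state, 4)
    new_row, new_col = row + action[0], col + action[1]
    if not (0 <= new_row < 4 and 0 <= new_col < 4):
        new_row, new_col = row, col            # off-grid moves leave the state unchanged
    return new_row * 4 + new_col, -1.0


def evaluate_random_policy(theta=1e-6):
    """Repeat full sweeps until the largest change in a sweep falls below theta."""
    values = [0.0] * 16
    while True:
        new_values = values[:]                 # synchronous sweep: V_{k+1} is built from V_k
        for s in range(16):
            if s in TERMINAL:
                continue
            new_values[s] = sum(
                0.25 * (r + GAMMA * values[s2])
                for s2, r in (step(s, a) for a in ACTIONS)
            )
        delta = max(abs(n - o) for n, o in zip(new_values, values))
        values = new_values
        if delta < theta:
            return values


V = evaluate_random_policy()
for row in range(4):
    print(" ".join(f"{V[row * 4 + col]:7.1f}" for col in range(4)))
```

Each pass of the outer loop is one sweep; the state next to a terminal corner should settle near –14 under the random policy.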
Policy Improvement
Suppose we have computed $V^\pi$ for a deterministic policy π.
For a given state s, would it be better to do an action $a \neq \pi(s)$?
The value of doing a in state s is:
$$Q^\pi(s,a) = E_\pi\{r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s, a_t = a\} = \sum_{s'} P^a_{ss'} \big[R^a_{ss'} + \gamma V^\pi(s')\big]$$
It is better to switch to action a for state s if and only if
$$Q^\pi(s,a) > V^\pi(s)$$
Policy Improvement Cont.
Do this for all states to get a new policy $\pi'$ that is greedy with respect to $V^\pi$:
$$\pi'(s) = \arg\max_a Q^\pi(s,a) = \arg\max_a \sum_{s'} P^a_{ss'} \big[R^a_{ss'} + \gamma V^\pi(s')\big]$$
Then $V^{\pi'} \geq V^\pi$
Policy Improvement Cont.
What if $V^{\pi'} = V^\pi$?
i.e., for all $s \in S$, $V^{\pi'}(s) = \max_a \sum_{s'} P^a_{ss'} \big[R^a_{ss'} + \gamma V^\pi(s')\big]$?
But this is the Bellman Optimality Equation.
So $V^{\pi'} = V^*$ and both π and $\pi'$ are optimal policies.
Policy Iteration
$$\pi_0 \to V^{\pi_0} \to \pi_1 \to V^{\pi_1} \to \cdots \to \pi^* \to V^* \to \pi^*$$
Arrows from a policy to its value function are policy evaluation; arrows from a value function to the next policy are policy improvement (“greedification”).
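The alternation of evaluation and greedification can be sketched as follows. This is a minimal sketch under stated assumptions: the two-state `MODEL` is a hypothetical MDP invented for illustration (it is not from the lecture), and the function names and the discount `GAMMA = 0.9` are likewise illustrative.

```python
GAMMA = 0.9

# Hypothetical model: MODEL[s][a] is a list of (probability, next_state, reward).
MODEL = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}


def q_value(values, s, a):
    """One-step lookahead: sum over s' of P [R + gamma V(s')]."""
    return sum(p * (r + GAMMA * values[s2]) for p, s2, r in MODEL[s][a])


def evaluate(policy, theta=1e-10):
    """Iterative policy evaluation (in-place sweeps) for a deterministic policy."""
    values = {s: 0.0 for s in MODEL}
    while True:
        delta = 0.0
        for s in MODEL:
            new_value = q_value(values, s, policy[s])
            delta = max(delta, abs(new_value - values[s]))
            values[s] = new_value
        if delta < theta:
            return values


def improve(values):
    """Greedification: pick an action maximizing the one-step lookahead in every state."""
    return {s: max(MODEL[s], key=lambda a: q_value(values, s, a)) for s in MODEL}


def policy_iteration():
    policy = {s: "stay" for s in MODEL}        # arbitrary initial policy
    while True:
        values = evaluate(policy)
        new_policy = improve(values)
        if new_policy == policy:               # stable policy: greedy w.r.t. its own values
            return policy, values
        policy = new_policy


policy, values = policy_iteration()
print(policy, values)
```

In this toy model the loop stops after one improvement step, once the greedy policy no longer changes.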
Value Iteration
Recall the full policy evaluation backup:
$$V_{k+1}(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} \big[R^a_{ss'} + \gamma V_k(s')\big]$$
Here is the full value iteration backup:
$$V_{k+1}(s) \leftarrow \max_a \sum_{s'} P^a_{ss'} \big[R^a_{ss'} + \gamma V_k(s')\big]$$
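As an illustration of the backup with a max, here is a minimal sketch of value iteration on a small gridworld of the kind used earlier in this part. The 4x4 layout, the cell numbering (0 to 15, with 0 and 15 terminal) and the in-place update order are assumptions made for the example.

```python
# Value iteration on a 4x4 gridworld (assumed layout: cells 0..15, terminals 0 and 15,
# reward -1 per move, off-grid moves leave the state unchanged).

GAMMA = 1.0
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]
TERMINAL = {0, 15}


def successors(state):
    """Yield (next_state, reward) for each of the four actions."""
    row, col = divmod(state, 4)
    for d_row, d_col in ACTIONS:
        new_row, new_col = row + d_row, col + d_col
        if 0 <= new_row < 4 and 0 <= new_col < 4:
            yield new_row * 4 + new_col, -1.0
        else:
            yield state, -1.0


def value_iteration(theta=1e-9):
    values = [0.0] * 16
    while True:
        delta = 0.0
        for s in range(16):
            if s in TERMINAL:
                continue
            # Full value iteration backup: max over actions instead of an average.
            best = max(r + GAMMA * values[s2] for s2, r in successors(s))
            delta = max(delta, abs(best - values[s]))
            values[s] = best                   # in-place update within the sweep
        if delta < theta:
            return values


V_STAR = value_iteration()
for row in range(4):
    print(" ".join(f"{V_STAR[row * 4 + col]:5.1f}" for col in range(4)))
```

With a reward of –1 per move, each optimal value is minus the number of moves to the nearest terminal corner.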
Asynchronous DP
❐ All the DP methods described so far require exhaustive sweeps of the entire state set.
❐ Asynchronous DP does not use sweeps. Instead it works like this: Repeat until convergence criterion is met:
  – Pick a state at random and apply the appropriate backup
❐ Still need lots of computation, but does not get locked into hopelessly long sweeps
❐ Can you select states to back up intelligently? YES: an agent’s experience can act as a guide.
Efficiency of DP
❐ Finding an optimal policy takes time polynomial in the number of states…
❐ BUT, the number of states is often astronomical, e.g., often growing exponentially with the number of state variables (what Bellman called “the curse of dimensionality”).
❐ In practice, classical DP can be applied to problems with a few million states.
❐ Asynchronous DP can be applied to larger problems, and is appropriate for parallel computation.
❐ It is surprisingly easy to come up with MDPs for which DP methods are not practical.
Summary
❐ Policy evaluation: backups without a max
❐ Policy improvement: form a greedy policy, if only locally
❐ Policy iteration: alternate the above two processes
❐ Value iteration: backups with a max
❐ Full backups (to be contrasted later with sample backups)
❐ Asynchronous DP: a way to avoid exhaustive sweeps
❐ Bootstrapping: updating estimates based on other estimates
Lecture 2, Part 2: Simple Monte Carlo Methods
❐ Simple Monte Carlo methods learn from complete sample returns
  – Only defined for episodic tasks
❐ Simple Monte Carlo methods learn directly from experience
  – On-line: No model necessary
  – Simulated: No need for a full model
(First-visit) Monte Carlo policy evaluation
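A minimal sketch of first-visit Monte Carlo policy evaluation is given below. The environment is an assumption chosen for illustration: a five-state random walk (states 1 to 5, start in state 3, terminals 0 and 6, reward +1 only on reaching state 6), and the function names are hypothetical.

```python
import random


def generate_episode(rng):
    """Random walk over states 1..5 starting at 3; returns [(state, reward), ...]."""
    state, episode = 3, []
    while state not in (0, 6):
        next_state = state + rng.choice((-1, 1))
        reward = 1.0 if next_state == 6 else 0.0
        episode.append((state, reward))
        state = next_state
    return episode


def first_visit_mc(num_episodes, gamma=1.0, seed=0):
    rng = random.Random(seed)
    returns_sum = {s: 0.0 for s in range(1, 6)}
    returns_count = {s: 0 for s in range(1, 6)}
    for _ in range(num_episodes):
        episode = generate_episode(rng)
        # Scan backwards to get the return following every time step.
        g, returns = 0.0, []
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns.append((state, g))
        returns.reverse()
        seen = set()
        for state, g in returns:
            if state not in seen:              # only the first visit in each episode counts
                seen.add(state)
                returns_sum[state] += g
                returns_count[state] += 1
    # V(s) is the average of the first-visit returns observed for s.
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum if returns_count[s]}


V_MC = first_visit_mc(5000)
print({s: round(v, 3) for s, v in V_MC.items()})
```

No model of the walk is used for learning: the estimate for each state is just an average of complete sample returns.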
Backup diagram for Simple Monte Carlo
❐ Entire episode included
❐ Only one choice at each state (unlike DP)
❐ MC does not bootstrap
❐ Time required to estimate one state does not depend on the total number of states
Monte Carlo Estimation of Action Values (Q)
❐ Monte Carlo is most useful when a model is not available
❐ $Q^\pi(s,a)$: average return starting from state s and action a following π
❐ Also converges asymptotically if every state-action pair is visited infinitely often
We are really interested in estimates of V* and Q*, i.e., Monte Carlo Control
Learning about π while following $\pi'$
Summary
❐ MC has several advantages over DP:
  – Can learn directly from interaction with environment
  – No need for full models
  – No need to learn about ALL states
  – Less harm by Markovian violations (later in book)
❐ MC methods provide an alternate policy evaluation process
❐ One issue to watch for: maintaining sufficient exploration
  – exploring starts, soft policies
❐ No bootstrapping (as opposed to DP)
❐ Estimating values for one policy while behaving according to another policy: importance sampling
Lecture 2, Part 3: Temporal Difference Learning
Objectives of this part:
❐ Introduce Temporal Difference (TD) learning
❐ Focus first on policy evaluation, or prediction, methods
❐ Then extend to control methods
TD Prediction
Policy Evaluation (the prediction problem): for a given policy π, compute the state-value function $V^\pi$
Recall: Simple (every-visit) Monte Carlo method:
$$V(s_t) \leftarrow V(s_t) + \alpha \big[R_t - V(s_t)\big]$$
target: the actual return after time t
The simplest TD method, TD(0):
$$V(s_t) \leftarrow V(s_t) + \alpha \big[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\big]$$
target: an estimate of the return
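The difference between the two targets shows up on a single episode. The sketch below applies both update rules to one hand-written episode; the episode, the state names, the initial values of 0.5 and the step size are all assumptions chosen for illustration.

```python
ALPHA, GAMMA = 0.1, 1.0


def td0_episode(values, transitions):
    """TD(0): update after every transition (state, reward, next_state)."""
    for state, reward, next_state in transitions:
        # Terminal states are absent from `values` and count as 0.
        target = reward + GAMMA * values.get(next_state, 0.0)
        values[state] += ALPHA * (target - values[state])
    return values


def constant_alpha_mc_episode(values, transitions):
    """Every-visit constant-alpha MC: wait for the end, then move V(s_t) toward R_t."""
    g, returns = 0.0, []
    for state, reward, _ in reversed(transitions):
        g = reward + GAMMA * g
        returns.append((state, g))
    for state, g in reversed(returns):
        values[state] += ALPHA * (g - values[state])
    return values


# Hypothetical episode: C -> D -> E -> termination, with reward 1 on the last step.
episode = [("C", 0.0, "D"), ("D", 0.0, "E"), ("E", 1.0, "END")]
start = {s: 0.5 for s in "ABCDE"}

td_values = td0_episode(dict(start), episode)
mc_values = constant_alpha_mc_episode(dict(start), episode)
print("TD(0):", td_values)
print("MC:   ", mc_values)
```

TD(0) changes only V(E) here, because the first two transitions have zero TD error, while MC moves all three visited states toward the observed return of 1.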
Simple Monte Carlo (constant-α MC)
$$V(s_t) \leftarrow V(s_t) + \alpha \big[R_t - V(s_t)\big]$$
where $R_t$ is the actual return following state $s_t$.
[Backup diagram: root node $s_t$; terminal nodes marked T]
cf. Dynamic Programming
$$V(s_t) \leftarrow E_\pi\{r_{t+1} + \gamma V(s_{t+1})\}$$
[Backup diagram: nodes $s_t$, $r_{t+1}$, $s_{t+1}$; terminal nodes marked T]
Simplest TD Method
$$V(s_t) \leftarrow V(s_t) + \alpha \big[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\big]$$
[Backup diagram: nodes $s_t$, $r_{t+1}$, $s_{t+1}$; terminal nodes marked T]
TD Bootstraps and Samples
❐ Bootstrapping: update involves an estimate
  – MC does not bootstrap
  – DP bootstraps
  – TD bootstraps
❐ Sampling: update does not involve an expected value
  – MC samples
  – DP does not sample
  – TD samples
Advantages of TD Learning
❐ TD methods do not require a model of the environment, only experience
❐ TD, but not MC, methods can be fully incremental
  – You can learn before knowing the final outcome
    – Less memory
    – Less peak computation
  – You can learn without the final outcome
    – From incomplete sequences
❐ Both MC and TD converge (under certain assumptions), but which is faster/better?
Random Walk Example
Values learned by TD(0) after various numbers of episodes
equiprobable transitions
α = 0.1
TD and MC on the Random Walk
Data averaged over 100 sequences of episodes
You are the Predictor
Suppose you observe the following 8 episodes:
A, 0, B, 0
B, 1
B, 1
B, 1
B, 1
B, 1
B, 1
B, 0
V(A)?
V(B)?
You are the Predictor
❐ The prediction that best matches the training data is V(A)=0
  – This minimizes the mean-square-error on the training set
  – This is what a Monte Carlo method gets
❐ If we consider the sequentiality of the problem, then we would set V(A)=.75
  – This is correct for the maximum likelihood estimate of a Markov model generating the data
  – i.e., if we do a best fit Markov model, and assume it is exactly correct, and then compute what it predicts (how?)
  – This is called the certainty-equivalence estimate
  – This is what TD(0) gets
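Both answers can be computed directly from the eight episodes. The sketch below is one possible way to do it, not the lecture's own procedure: each episode is encoded as a list of (state, reward) pairs, returns are undiscounted, and the function names are illustrative. The certainty-equivalence estimate is obtained by fitting transition frequencies and mean rewards, then solving the resulting model by repeated backups.

```python
from collections import defaultdict

# The 8 observed episodes, as (state, reward received on leaving that state) pairs.
EPISODES = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]


def batch_monte_carlo(episodes):
    """Average the observed (undiscounted) return following every visit to each state."""
    totals, counts = defaultdict(float), defaultdict(int)
    for episode in episodes:
        rewards = [r for _, r in episode]
        for t, (state, _) in enumerate(episode):
            totals[state] += sum(rewards[t:])
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}


def certainty_equivalence(episodes, sweeps=100):
    """Fit the maximum-likelihood Markov model, then compute the values it predicts."""
    reward_sum, visits = defaultdict(float), defaultdict(int)
    next_counts = defaultdict(lambda: defaultdict(int))
    for episode in episodes:
        for t, (state, reward) in enumerate(episode):
            visits[state] += 1
            reward_sum[state] += reward
            if t + 1 < len(episode):           # otherwise the episode terminated here
                next_counts[state][episode[t + 1][0]] += 1
    values = {s: 0.0 for s in visits}
    for _ in range(sweeps):
        for s in visits:
            expected_next = sum(
                n / visits[s] * values[s2] for s2, n in next_counts[s].items()
            )
            values[s] = reward_sum[s] / visits[s] + expected_next
    return values


mc = batch_monte_carlo(EPISODES)
ce = certainty_equivalence(EPISODES)
print("Monte Carlo:          ", mc)
print("Certainty equivalence:", ce)
```

Both methods agree that V(B) = 0.75; they differ only on V(A), where the fitted model passes the value of B back to A.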
Learning An Action-Value Function
Estimate $Q^\pi$ for the current behavior policy π.
After every transition from a nonterminal state $s_t$, do this:
$$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha \big[r_{t+1} + \gamma Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)\big]$$
If $s_{t+1}$ is terminal, then $Q(s_{t+1},a_{t+1}) = 0$.
Sarsa: On-Policy TD Control
Turn this into a control method by always updating the policy to be greedy with respect to the current estimate:
[Diagram: the quintuple s, a, r, s′, a′ used by each update]
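A minimal sketch of Sarsa with an ε-greedy policy follows. The environment is a hypothetical five-cell corridor invented for the example (start in cell 0, goal in cell 4, reward –1 per move), and the step size, ε and episode count are illustrative choices rather than values from the lecture.

```python
import random

N_STATES, GOAL = 5, 4
ACTIONS = (-1, +1)                             # move left, move right


def env_step(state, action):
    """Hypothetical corridor: walls at both ends, reward -1 per move."""
    next_state = min(max(state + action, 0), GOAL)
    return next_state, -1.0, next_state == GOAL


def epsilon_greedy(q, state, epsilon, rng):
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q[(state, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if q[(state, a)] == best])   # random tie-break


def sarsa(episodes=2000, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        action = epsilon_greedy(q, state, epsilon, rng)
        done = False
        while not done:
            next_state, reward, done = env_step(state, action)
            next_action = epsilon_greedy(q, next_state, epsilon, rng)
            # On-policy target: uses the action actually selected in the next state.
            target = reward + (0.0 if done else gamma * q[(next_state, next_action)])
            q[(state, action)] += alpha * (target - q[(state, action)])
            state, action = next_state, next_action
    return q


Q_SARSA = sarsa()
print({s: max(ACTIONS, key=lambda a: Q_SARSA[(s, a)]) for s in range(GOAL)})
```

Because the target uses the next action the policy really takes, the learned values reflect the cost of the occasional exploratory move.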
Q-Learning: Off-Policy TD Control
One-step Q-learning:
$$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha \big[r_{t+1} + \gamma \max_a Q(s_{t+1},a) - Q(s_t,a_t)\big]$$
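The one-step Q-learning update can be sketched as below. The five-cell corridor (start in cell 0, goal in cell 4, reward –1 per move) is a hypothetical environment chosen for the example, and the hyperparameters are illustrative.

```python
import random

N_STATES, GOAL = 5, 4
ACTIONS = (-1, +1)                             # move left, move right


def env_step(state, action):
    """Hypothetical deterministic corridor with reward -1 per move."""
    next_state = min(max(state + action, 0), GOAL)
    return next_state, -1.0, next_state == GOAL


def q_learning(episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Behavior policy: epsilon-greedy with random tie-breaking.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                best = max(q[(state, a)] for a in ACTIONS)
                action = rng.choice([a for a in ACTIONS if q[(state, a)] == best])
            next_state, reward, done = env_step(state, action)
            # Off-policy target: max over next actions, whatever is executed next.
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q


Q_STAR = q_learning()
print({s: round(Q_STAR[(s, 1)], 3) for s in range(GOAL)})
```

In this deterministic corridor the value of moving right from cell s should approach minus the remaining distance to the goal, regardless of the exploratory moves taken while learning.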
Cliffwalking
ε−greedy, ε = 0.1
Actor-Critic Architecture
[Diagram components: Environment, Primary Critic, Adaptive Critic, Actor. Signals: Situations or States, Actions, Primary Rewards, Effective Rewards (involves values)]
Actor-Critic Methods
❐ Explicit representation of policy as well as value function
❐ Minimal computation to select actions
❐ Can learn an explicit stochastic policy
❐ Can put constraints on policies
❐ Appealing as psychological and neural models
Actor-Critic Details
TD-error is used to evaluate actions:
$$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$$
If actions are determined by preferences, p(s, a), as follows:
$$\pi_t(s,a) = \Pr\{a_t = a \mid s_t = s\} = \frac{e^{p(s,a)}}{\sum_b e^{p(s,b)}},$$
then you can update the preferences like this:
$$p(s_t,a_t) \leftarrow p(s_t,a_t) + \beta \delta_t$$
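One actor-critic step can be sketched as follows. The softmax policy and the preference update follow the formulas above; the critic's TD(0) update of V with step size `alpha` is an assumption added to make the example complete, and the state names, action names and numbers are hypothetical.

```python
import math


def softmax_policy(preferences, state, actions):
    """pi(s, a) = exp(p(s, a)) / sum_b exp(p(s, b))."""
    max_pref = max(preferences[(state, a)] for a in actions)   # shift for numerical stability
    weights = {a: math.exp(preferences[(state, a)] - max_pref) for a in actions}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def actor_critic_update(values, preferences, s, a, r, s_next, alpha, beta, gamma):
    """Process one transition: compute the TD error, then update critic and actor."""
    delta = r + gamma * values[s_next] - values[s]
    values[s] += alpha * delta                 # critic (assumed TD(0) update of V)
    preferences[(s, a)] += beta * delta        # actor: p(s_t, a_t) <- p(s_t, a_t) + beta * delta
    return delta


actions = ("left", "right")
values = {"s0": 0.0, "s1": 0.0}
preferences = {("s0", a): 0.0 for a in actions}

before = softmax_policy(preferences, "s0", actions)
delta = actor_critic_update(values, preferences, "s0", "right", 1.0, "s1",
                            alpha=0.1, beta=0.5, gamma=0.9)
after = softmax_policy(preferences, "s0", actions)
print(delta, before, after)
```

A positive TD error raises the preference for the action just taken, so its selection probability goes up; a negative one would lower it.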
Afterstates
❐ Usually, a state-value function evaluates states in which the agent can take an action.
❐ But sometimes it is useful to evaluate states after agent has acted, as in tic-tac-toe.
❐ Why is this useful?
❐ What is this in general?
Summary
❐ TD prediction
❐ Introduced one-step tabular model-free TD methods
❐ Extend prediction to control by employing some form of GPI (generalized policy iteration)
  – On-policy control: Sarsa
  – Off-policy control: Q-learning
❐ These methods bootstrap and sample, combining aspects of DP and MC methods
Lecture 2, Part 4: Unified Perspective
n-step TD Prediction
❐ Idea: Look farther into the future when you do TD backup (1, 2, 3, …, n steps)
Mathematics of n-step TD Prediction
❐ Monte Carlo:
$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T$$
❐ TD: Use V to estimate remaining return
$$R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$$
❐ n-step TD:
  – 2 step return:
$$R_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 V_t(s_{t+2})$$
  – n-step return:
$$R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(s_{t+n})$$
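The n-step return is easy to compute from a recorded episode. In the sketch below the indexing convention is an assumption: `rewards[k]` holds $r_{k+1}$ and `values[k]` holds $V(s_k)$ for k = 0..T, with the terminal value set to 0. The example episode is made up for illustration.

```python
def n_step_return(rewards, values, t, n, gamma):
    """R_t^(n): n discounted rewards, then bootstrap from V(s_{t+n}) if it is nonterminal."""
    T = len(rewards)
    steps = min(n, T - t)                      # the episode may end before n steps
    g = sum(gamma ** k * rewards[t + k] for k in range(steps))
    if t + steps < T:                          # no bootstrap term once the episode has ended
        g += gamma ** steps * values[t + steps]
    return g


# Hypothetical 3-step episode.
rewards = [1.0, 2.0, 3.0]                      # r_1, r_2, r_3
values = [0.5, 1.0, 1.5, 0.0]                  # V(s_0), V(s_1), V(s_2), V(terminal) = 0
gamma = 0.5

for n in (1, 2, 3, 10):
    print(n, n_step_return(rewards, values, 0, n, gamma))
```

For n = 1 this is the TD(0) target, and once n reaches the remaining episode length it coincides with the Monte Carlo return.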
Learning with n-step Backups
❐ Backup (on-line or off-line):
$$\Delta V_t(s_t) = \alpha \big[R_t^{(n)} - V_t(s_t)\big]$$
❐ Error reduction property of n-step returns
$$\max_s \big| E_\pi\{R_t^{(n)} \mid s_t = s\} - V^\pi(s) \big| \leq \gamma^n \max_s \big| V(s) - V^\pi(s) \big|$$
  – Left side: maximum error using n-step return
  – Right side: maximum error using V
❐ Using this, you can show that n-step methods converge
Random Walk Examples
❐ How does 2-step TD work here?
❐ How about 3-step TD?
A Larger Example
❐ Task: 19 state random walk
❐ Do you think there is an optimal n (for everything)?
Averaging n-step Returns
❐ n-step methods were introduced to help with TD(λ) understanding
❐ Idea: backup an average of several returns
  – e.g. backup half of 2-step and half of 4-step
$$R_t^{avg} = \frac{1}{2} R_t^{(2)} + \frac{1}{2} R_t^{(4)}$$
❐ Called a complex backup
  – Draw each component
  – Label with the weights for that component
[Backup diagram labeled “One backup”]
Forward View of TD(λ)
❐ TD(λ) is a method for averaging all n-step backups
  – weight by $\lambda^{n-1}$ (time since visitation)
  – λ-return:
$$R_t^\lambda = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$$
❐ Backup using λ-return:
$$\Delta V_t(s_t) = \alpha \big[R_t^\lambda - V_t(s_t)\big]$$
λ-return Weighting Function
Relation to TD(0) and MC
❐ λ-return can be rewritten as:
$$R_t^\lambda = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} R_t^{(n)} + \lambda^{T-t-1} R_t$$
  – First term: until termination
  – Second term: after termination
❐ If λ = 1, you get MC:
$$R_t^\lambda = (1-1) \sum_{n=1}^{T-t-1} 1^{n-1} R_t^{(n)} + 1^{T-t-1} R_t = R_t$$
❐ If λ = 0, you get TD(0):
$$R_t^\lambda = (1-0) \sum_{n=1}^{T-t-1} 0^{n-1} R_t^{(n)} + 0^{T-t-1} R_t = R_t^{(1)}$$
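The episodic form of the λ-return, and its two limiting cases, can be checked numerically. In this sketch the indexing convention is an assumption (`rewards[k]` holds $r_{k+1}$, `values[k]` holds $V(s_k)$) and the three-step episode is made up for illustration.

```python
def lambda_return(rewards, values, t, lam, gamma):
    """Episodic lambda-return: weighted n-step returns until termination,
    plus the remaining weight lam**(T-t-1) on the complete return R_t."""
    T = len(rewards)
    horizon = T - t                            # steps from time t to termination
    discounted_rewards = 0.0
    total = 0.0
    for n in range(1, horizon + 1):
        discounted_rewards += gamma ** (n - 1) * rewards[t + n - 1]
        if n < horizon:
            n_step = discounted_rewards + gamma ** n * values[t + n]
            total += (1 - lam) * lam ** (n - 1) * n_step
        else:
            total += lam ** (horizon - 1) * discounted_rewards   # complete return R_t
    return total


# Hypothetical 3-step episode.
rewards = [1.0, 2.0, 3.0]                      # r_1, r_2, r_3
values = [0.5, 1.0, 1.5]                       # V(s_0), V(s_1), V(s_2)
gamma = 0.5

print(lambda_return(rewards, values, 0, 0.0, gamma))   # one-step TD target
print(lambda_return(rewards, values, 0, 0.5, gamma))   # a mixture of all returns
print(lambda_return(rewards, values, 0, 1.0, gamma))   # Monte Carlo return
```

The weights (1 − λ)λ^(n−1) on the n-step returns and λ^(T−t−1) on the complete return always sum to 1, which is why the λ-return is a proper average.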
Forward View of TD(λ)
❐ Look forward from each state to determine update from future states and rewards:
λ-return on the Random Walk
❐ Same 19 state random walk as before
❐ Why do you think intermediate values of λ are best?
Backward View of TD(λ)
❐ The forward view was for theory
❐ The backward view is for mechanism
❐ New variable called eligibility trace: $e_t(s) \in \Re^+$
  – On each step, decay all traces by γλ and increment the trace for the current state by 1
  – Accumulating trace:
$$e_t(s) = \begin{cases} \gamma\lambda e_{t-1}(s) & \text{if } s \neq s_t \\ \gamma\lambda e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases}$$
Backward View
$$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$$
❐ Shout $\delta_t$ backwards over time
❐ The strength of your voice decreases with temporal distance by γλ
On-line Tabular TD(λ)
Initialize V(s) arbitrarily and e(s) = 0, for all s ∈ S
Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
        a ← action given by π for s
        Take action a, observe reward, r, and next state s′
        δ ← r + γV(s′) − V(s)
        e(s) ← e(s) + 1
        For all s:
            V(s) ← V(s) + αδe(s)
            e(s) ← γλe(s)
        s ← s′
    Until s is terminal
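The inner loop of the algorithm can be sketched in Python for a single recorded episode. This is a minimal sketch under stated assumptions: the policy is folded into a pre-recorded list of transitions, traces are reset at the start of the episode, and the episode, state names and parameter values are made up for illustration.

```python
def td_lambda_episode(values, transitions, alpha, gamma, lam):
    """On-line tabular TD(lambda) with accumulating traces, applied to one episode
    given as (state, reward, next_state) transitions."""
    traces = {s: 0.0 for s in values}          # assumption: traces start at 0 each episode
    for state, reward, next_state in transitions:
        # Terminal states are absent from `values` and count as 0.
        delta = reward + gamma * values.get(next_state, 0.0) - values[state]
        traces[state] += 1.0                   # accumulating trace for the current state
        for s in values:
            values[s] += alpha * delta * traces[s]
            traces[s] *= gamma * lam           # decay every trace by gamma * lambda
    return values


# Hypothetical episode: C -> D -> E -> termination, with reward 1 on the last step.
episode = [("C", 0.0, "D"), ("D", 0.0, "E"), ("E", 1.0, "END")]
start = {s: 0.5 for s in "ABCDE"}

for lam in (0.0, 0.5, 1.0):
    result = td_lambda_episode(dict(start), episode, alpha=0.1, gamma=1.0, lam=lam)
    print(lam, {s: round(v, 4) for s, v in result.items()})
```

With λ = 0 only the last state before termination is updated, with λ = 1 every visited state receives the full update, and intermediate λ spreads the final TD error backwards with geometrically decreasing strength.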
Relation of Backwards View to MC & TD(0)
❐ Using update rule:
$$\Delta V_t(s) = \alpha \delta_t e_t(s)$$
❐ As before, if you set λ to 0, you get TD(0)
❐ If you set λ to 1, you get MC but in a better way
  – Can apply TD(1) to continuing tasks
  – Works incrementally and on-line (instead of waiting until the end of the episode)
Forward View = Backward View
❐ The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating
❐ The book shows:
$$\sum_{t=0}^{T-1} \Delta V_t^{TD}(s) = \sum_{t=0}^{T-1} \Delta V_t^{\lambda}(s_t) I_{ss_t}$$
  – Left side: backward updates
  – Right side: forward updates
  – Both sides reduce to the same expression (algebra shown in book):
$$\sum_{t=0}^{T-1} \Delta V_t^{TD}(s) = \sum_{t=0}^{T-1} \alpha I_{ss_t} \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t} \delta_k$$
$$\sum_{t=0}^{T-1} \Delta V_t^{\lambda}(s_t) I_{ss_t} = \sum_{t=0}^{T-1} \alpha I_{ss_t} \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t} \delta_k$$
❐ On-line updating with small α is similar
Control: Sarsa(λ)
❐ Save eligibility for state-action pairs instead of just states
$$e_t(s,a) = \begin{cases} \gamma\lambda e_{t-1}(s,a) + 1 & \text{if } s = s_t \text{ and } a = a_t \\ \gamma\lambda e_{t-1}(s,a) & \text{otherwise} \end{cases}$$
$$Q_{t+1}(s,a) = Q_t(s,a) + \alpha \delta_t e_t(s,a)$$
$$\delta_t = r_{t+1} + \gamma Q_t(s_{t+1},a_{t+1}) - Q_t(s_t,a_t)$$
Sarsa(λ) Algorithm
Initialize Q(s,a) arbitrarily and e(s,a) = 0, for all s, a
Repeat (for each episode):
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s′
        Choose a′ from s′ using policy derived from Q (e.g. ε-greedy)
        δ ← r + γQ(s′,a′) − Q(s,a)
        e(s,a) ← e(s,a) + 1
        For all s, a:
            Q(s,a) ← Q(s,a) + αδe(s,a)
            e(s,a) ← γλe(s,a)
        s ← s′; a ← a′
    Until s is terminal
Sarsa(λ) Gridworld Example
❐ With one trial, the agent has much more information about how to get to the goal (not necessarily the best way)
❐ Can considerably accelerate learning
Conclusions
❐ Eligibility traces provide an efficient, incremental way to combine MC and TD
  – Includes advantages of MC (can deal with lack of Markov property)
  – Includes advantages of TD (using TD error, bootstrapping)
❐ Can significantly speed learning
❐ Does have a cost in computation
TD-error: $\delta_t = r_t + V_t - V_{t-1}$
[Figure: time courses of r, V, and δ in three panels: early in learning, learning complete, and r omitted. Annotation: regular predictors of z over this interval]
Dopamine Neurons and TD Error
W. Schultz et al. Universite de Fribourg
Dopamine Modulated Synaptic Plasticity
Basal Ganglia as Adaptive Critic Architecture
Houk, Adams, & Barto, 1995