
An Introduction to COMPUTATIONAL REINFORCEMENT LEARNING

Andrew G. Barto, Department of Computer Science

University of Massachusetts – Amherst

UPF Lecture 2

Autonomous Learning Laboratory – Department of Computer Science

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 2

The Overall Plan

❐ Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback; Markov decision processes

❐ Lecture 2: Dynamic Programming; Simple Monte Carlo methods; Temporal Difference methods; A unified perspective; Connections to neuroscience

❐ Lecture 3: Function approximation; Model-based methods; Abstraction and hierarchy; Intrinsically motivated RL

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 3

The Overall Plan

❐ Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback; Markov decision processes

❐ Lecture 2: Dynamic Programming; Simple Monte Carlo methods; Temporal Difference methods; A unified perspective; Connections to neuroscience

❐ Lecture 3: Function approximation; Model-based methods; Dimensions of Reinforcement Learning

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 4

Lecture 2, Part 1: Dynamic Programming

Objectives of this part:

❐ Overview of a collection of classical solution methods for MDPs known as Dynamic Programming (DP)

❐ Show how DP can be used to compute value functions, and hence, optimal policies

❐ Discuss efficiency and utility of DP

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 5

Policy Evaluation

Policy Evaluation: for a given policy π, compute the state-value function V^π.

Recall the state-value function for policy π:

$$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\Big\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \;\Big|\; s_t = s\Big\}$$

and the Bellman equation for V^π:

$$V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma V^\pi(s')\big]$$

a system of |S| simultaneous linear equations.
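
Because the Bellman equation for a fixed policy is linear in V^π, a small tabular MDP can be evaluated by solving the linear system directly. The sketch below is illustrative only; the array layout, function name, and toy MDP are my assumptions, not from the slides:

```python
import numpy as np

def evaluate_policy_exact(P, R, pi, gamma):
    """Solve (I - gamma * P_pi) V = r_pi for V^pi.

    P[a, s, s'] : transition probabilities P^a_{ss'}
    R[a, s, s'] : expected rewards R^a_{ss'}
    pi[s, a]    : probability of taking action a in state s under the policy
    """
    n_states = P.shape[1]
    P_pi = np.einsum('sa,ast->st', pi, P)          # state-to-state transitions under pi
    r_pi = np.einsum('sa,ast,ast->s', pi, P, R)    # expected one-step reward under pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Tiny made-up 2-state, 2-action MDP, just to exercise the function.
P = np.array([[[0.9, 0.1], [0.0, 1.0]],
              [[0.2, 0.8], [0.0, 1.0]]])
R = np.zeros((2, 2, 2)); R[:, :, 1] = 1.0          # reward 1 for landing in state 1
pi = np.full((2, 2), 0.5)                          # uniform random policy
print(evaluate_policy_exact(P, R, pi, gamma=0.9))
```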

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 6

Iterative Methods

Compute a sequence of successive approximations

$$V_0 \to V_1 \to \cdots \to V_k \to V_{k+1} \to \cdots \to V^\pi$$

using the full policy evaluation backup:

$$V_{k+1}(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma V_k(s')\big]$$

A "sweep" consists of applying this backup operation to each state.
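
As a rough sketch of what a sweep-based implementation might look like, using the same assumed array conventions as above (not the slides' own code):

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma, theta=1e-6):
    """Iterative policy evaluation: repeat full sweeps until the value function
    changes by less than theta everywhere."""
    V = np.zeros(P.shape[1])
    while True:
        # one sweep: expected one-step reward plus discounted value of the successor
        V_new = np.einsum('sa,ast,ast->s', pi, P, R) + \
                gamma * np.einsum('sa,ast,t->s', pi, P, V)
        delta = np.max(np.abs(V_new - V))
        V = V_new
        if delta < theta:
            return V
```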

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 8

A Small Gridworld

❐ An undiscounted episodic task
❐ Nonterminal states: 1, 2, …, 14
❐ One terminal state (shown twice as shaded squares)
❐ Actions that would take the agent off the grid leave the state unchanged
❐ Reward is –1 until the terminal state is reached

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 9

Iterative Policy Eval for the Small Gridworld

π = random (uniform) action choices
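
To make the example concrete, here is a hedged sketch that builds this gridworld and evaluates the random policy with the policy_evaluation function sketched above; the state and action encoding is my own. It should reproduce, up to the stopping tolerance, the converged values shown on the slide (0 at the terminal corners, down to about −22 in the farthest cells):

```python
import numpy as np

# 4x4 gridworld: cells 0..15; cells 0 and 15 play the role of the shared terminal state.
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]          # up, down, left, right
terminal = {0, 15}
P = np.zeros((4, 16, 16))
R = np.full((4, 16, 16), -1.0)                       # every step costs -1
for s in range(16):
    row, col = divmod(s, 4)
    for a, (dr, dc) in enumerate(moves):
        if s in terminal:
            P[a, s, s], R[a, s, s] = 1.0, 0.0        # terminal: absorb with no reward
            continue
        r2, c2 = row + dr, col + dc
        s2 = s if not (0 <= r2 < 4 and 0 <= c2 < 4) else r2 * 4 + c2
        P[a, s, s2] = 1.0                            # off-grid moves leave the state unchanged

pi = np.full((16, 4), 0.25)                          # equiprobable random policy
V = policy_evaluation(P, R, pi, gamma=1.0)           # sketch from the previous slide
print(V.reshape(4, 4).round(1))
```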

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 10

Policy Improvement

Suppose we have computed V^π for a deterministic policy π.

For a given state s, would it be better to do an action a ≠ π(s)?

The value of doing a in state s is:

$$Q^\pi(s,a) = E_\pi\big\{r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s, a_t = a\big\} = \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma V^\pi(s')\big]$$

It is better to switch to action a for state s if and only if

$$Q^\pi(s,a) > V^\pi(s)$$

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 11

Policy Improvement Cont.

Do this for all states to get a new policy π′ that is greedy with respect to V^π:

$$\pi'(s) = \arg\max_a Q^\pi(s,a) = \arg\max_a \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma V^\pi(s')\big]$$

Then V^{π′} ≥ V^π.

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 12
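
A minimal sketch of this greedification step, with the same assumed array conventions as the earlier sketches (illustrative, not the slides' code):

```python
import numpy as np

def greedy_policy(P, R, V, gamma):
    """Return a deterministic policy that is greedy with respect to V."""
    Q = np.einsum('ast,ast->as', P, R) + gamma * np.einsum('ast,t->as', P, V)  # Q[a, s]
    best = Q.argmax(axis=0)                          # best action in each state
    pi_new = np.zeros((Q.shape[1], Q.shape[0]))
    pi_new[np.arange(Q.shape[1]), best] = 1.0
    return pi_new
```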

Policy Improvement Cont.

What if V^{π′} = V^π? That is, for all s ∈ S,

$$V^{\pi'}(s) = \max_a \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma V^\pi(s')\big]\;?$$

But this is the Bellman optimality equation. So V^{π′} = V* and both π and π′ are optimal policies.

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 13

Policy Iteration

$$\pi_0 \to V^{\pi_0} \to \pi_1 \to V^{\pi_1} \to \cdots \to \pi^* \to V^* \to \pi^*$$

Each step alternates policy evaluation (π_i → V^{π_i}) with policy improvement, or "greedification" (V^{π_i} → π_{i+1}).
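
A sketch of the loop, reusing the policy_evaluation and greedy_policy sketches from the earlier slides (illustrative assumptions throughout):

```python
import numpy as np

def policy_iteration(P, R, gamma, theta=1e-6):
    """Alternate evaluation and greedy improvement until the policy stops changing."""
    n_actions, n_states, _ = P.shape
    pi = np.full((n_states, n_actions), 1.0 / n_actions)   # start from the random policy
    while True:
        V = policy_evaluation(P, R, pi, gamma, theta)      # policy evaluation
        pi_new = greedy_policy(P, R, V, gamma)             # policy improvement
        if np.array_equal(pi_new, pi):
            return pi, V
        pi = pi_new
```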

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 15

Value Iteration

Recall the full policy evaluation backup:

$$V_{k+1}(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma V_k(s')\big]$$

Here is the full value iteration backup:

$$V_{k+1}(s) \leftarrow \max_a \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma V_k(s')\big]$$
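
A corresponding sketch of value iteration under the same assumed conventions; the greedy policy is read off the final Q values:

```python
import numpy as np

def value_iteration(P, R, gamma, theta=1e-6):
    """Repeated max-backups; returns the value estimate and a greedy action per state."""
    V = np.zeros(P.shape[1])
    while True:
        Q = np.einsum('ast,ast->as', P, R) + gamma * np.einsum('ast,t->as', P, V)
        V_new = Q.max(axis=0)                        # the max over actions is the backup
        if np.max(np.abs(V_new - V)) < theta:
            return V_new, Q.argmax(axis=0)
        V = V_new
```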

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 17

Asynchronous DP

❐ All the DP methods described so far require exhaustive sweeps of the entire state set.

❐ Asynchronous DP does not use sweeps. Instead it works like this: repeat until a convergence criterion is met:

– Pick a state at random and apply the appropriate backup

❐ Still needs lots of computation, but does not get locked into hopelessly long sweeps.

❐ Can you select states to back up intelligently? YES: an agent's experience can act as a guide.

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 19

Efficiency of DP

❐ Finding an optimal policy is polynomial in the number of states…

❐ BUT, the number of states is often astronomical, e.g., often growing exponentially with the number of state variables (what Bellman called "the curse of dimensionality").

❐ In practice, classical DP can be applied to problems with a few million states.

❐ Asynchronous DP can be applied to larger problems, and is appropriate for parallel computation.

❐ It is surprisingly easy to come up with MDPs for which DP methods are not practical.

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 20

Summary

❐ Policy evaluation: backups without a max
❐ Policy improvement: form a greedy policy, if only locally
❐ Policy iteration: alternate the above two processes
❐ Value iteration: backups with a max
❐ Full backups (to be contrasted later with sample backups)
❐ Asynchronous DP: a way to avoid exhaustive sweeps
❐ Bootstrapping: updating estimates based on other estimates

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 21

The Overall Plan

❐ Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback; Markov decision processes

❐ Lecture 2: Dynamic Programming; Simple Monte Carlo methods; Temporal Difference methods; A unified perspective; Connections to neuroscience

❐ Lecture 3: Function approximation; Model-based methods; Abstraction and hierarchy; Intrinsically motivated RL

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 22

Lecture 2, Part 2: Simple Monte Carlo Methods

❐ Simple Monte Carlo methods learn from complete sample returns
   Only defined for episodic tasks

❐ Simple Monte Carlo methods learn directly from experience
   On-line: No model necessary
   Simulated: No need for a full model

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 24

(First-visit) Monte Carlo policy evaluation
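
The algorithm box on this slide did not survive the transcript. The standard first-visit procedure averages, for each state, the returns that follow the first visit to that state in each episode; the sketch below assumes a simple episode encoding of my own choosing:

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """Estimate V^pi by averaging, for each state, the returns following its
    first visit in each episode.

    Each episode is a list of (state, reward) pairs, where reward is the
    reward received on leaving that state.
    """
    returns = defaultdict(list)
    for episode in episodes:
        # return following each time step, computed backwards
        G, returns_from = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            returns_from[t] = G
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:                 # first visit only
                seen.add(s)
                returns[s].append(returns_from[t])
    return {s: sum(g) / len(g) for s, g in returns.items()}
```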

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 27

Backup diagram for Simple Monte Carlo

❐ Entire episode included
❐ Only one choice at each state (unlike DP)
❐ MC does not bootstrap
❐ Time required to estimate one state does not depend on the total number of states

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 30

Monte Carlo Estimation of Action Values (Q)

❐ Monte Carlo is most useful when a model is not available
❐ Qπ(s,a): average return starting from state s and action a, then following π
❐ Also converges asymptotically if every state-action pair is visited infinitely often

We are really interested in estimates of V* and Q*, i.e., Monte Carlo Control.

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 38

Learning about π while following π′

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 41

Summary

❐ MC has several advantages over DP:
   Can learn directly from interaction with the environment
   No need for full models
   No need to learn about ALL states
   Less harm by Markovian violations (later in book)

❐ MC methods provide an alternate policy evaluation process

❐ One issue to watch for: maintaining sufficient exploration (exploring starts, soft policies)

❐ No bootstrapping (as opposed to DP)

❐ Estimating values for one policy while behaving according to another policy: importance sampling

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 42

The Overall Plan

❐ Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback; Markov decision processes

❐ Lecture 2: Simple Monte Carlo methods; Dynamic Programming; Temporal Difference methods; A unified perspective; Connections to neuroscience

❐ Lecture 3: Function approximation; Model-based methods; Abstraction and hierarchy; Intrinsically motivated RL

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 43

Lecture 2, Part 3: Temporal Difference Learning

Objectives of this part:

❐ Introduce Temporal Difference (TD) learning
❐ Focus first on policy evaluation, or prediction, methods
❐ Then extend to control methods

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 44

TD Prediction

Recall: Policy evaluation (the prediction problem): for a given policy π, compute the state-value function V^π.

Simple (every-visit) Monte Carlo method:

$$V(s_t) \leftarrow V(s_t) + \alpha\big[R_t - V(s_t)\big]$$

where the target R_t is the actual return after time t.

The simplest TD method, TD(0):

$$V(s_t) \leftarrow V(s_t) + \alpha\big[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\big]$$

where the target r_{t+1} + γV(s_{t+1}) is an estimate of the return.
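
To show the contrast as code, a minimal sketch of the two backups (the function names and the tabular V container are my assumptions):

```python
def constant_alpha_mc_update(V, s, G, alpha=0.1):
    """Every-visit constant-alpha MC: move V(s) toward the actual return G."""
    V[s] += alpha * (G - V[s])

def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0, terminal=False):
    """TD(0): move V(s) toward the bootstrapped target r + gamma * V(s')."""
    target = r + (0.0 if terminal else gamma * V[s_next])
    V[s] += alpha * (target - V[s])
```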

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 45

Simple Monte Carlo

Constant-α MC:

$$V(s_t) \leftarrow V(s_t) + \alpha\big[R_t - V(s_t)\big]$$

where R_t is the actual return following state s_t.

(Backup diagram: a single sampled trajectory from s_t all the way to a terminal state T; the whole return is backed up.)

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 46

cf. Dynamic Programming

$$V(s_t) \leftarrow E_\pi\big\{r_{t+1} + \gamma V(s_{t+1})\big\}$$

(Backup diagram: a full one-step backup from s_t over all actions and possible successor states s_{t+1}, using the model to take the expectation.)

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 47

Simplest TD Method

$$V(s_t) \leftarrow V(s_t) + \alpha\big[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\big]$$

(Backup diagram: a single sampled one-step transition from s_t to s_{t+1} with reward r_{t+1}.)

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 48

TD Bootstraps and Samples

❐ Bootstrapping: update involves an estimate
   MC does not bootstrap
   DP bootstraps
   TD bootstraps

❐ Sampling: update does not involve an expected value
   MC samples
   DP does not sample
   TD samples

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 51

Advantages of TD Learning

❐ TD methods do not require a model of the environment, only experience

❐ TD, but not MC, methods can be fully incremental
   You can learn before knowing the final outcome
   – Less memory
   – Less peak computation
   You can learn without the final outcome
   – From incomplete sequences

❐ Both MC and TD converge (under certain assumptions), but which is faster/better?

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 52

Random Walk Example

Values learned by TD(0) after various numbers of episodes (random walk with equiprobable transitions, α = 0.1).

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 53

TD and MC on the Random Walk

Data averaged over 100 sequences of episodes

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 56

You are the Predictor

Suppose you observe the following 8 episodes:

A, 0, B, 0
B, 1
B, 1
B, 1
B, 1
B, 1
B, 1
B, 0

V(A)?

V(B)?

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 57

You are the Predictor

V(A)?

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 58

You are the Predictor

❐ The prediction that best matches the training data is V(A) = 0
   This minimizes the mean-squared error on the training set
   This is what a Monte Carlo method gets

❐ If we consider the sequentiality of the problem, then we would set V(A) = 0.75
   This is correct for the maximum-likelihood estimate of a Markov model generating the data
   i.e., if we fit a best-fit Markov model, assume it is exactly correct, and then compute what it predicts (how?)
   This is called the certainty-equivalence estimate
   This is what TD(0) gets
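
A quick worked check of both answers (the episode encoding here is an assumption of this sketch):

```python
# The 8 episodes above: each is a list of (state, reward-on-leaving-state) pairs.
episodes = [[('A', 0), ('B', 0)]] + [[('B', 1)]] * 6 + [[('B', 0)]]

# Monte Carlo: average the observed (undiscounted) returns from each state.
returns = {'A': [], 'B': []}
for ep in episodes:
    G = 0
    for s, r in reversed(ep):
        G += r
        returns[s].append(G)
print({s: sum(g) / len(g) for s, g in returns.items()})   # V(B) = 0.75, V(A) = 0.0

# Certainty equivalence: in the fitted Markov model, A always goes to B with reward 0,
# so V(A) = 0 + V(B) = 0.75, which is what TD(0) converges to.
```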

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 59

Learning An Action-Value Function

Estimate Q^π for the current behavior policy π.

After every transition from a nonterminal state s_t, do this:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\big[r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\big]$$

If s_{t+1} is terminal, then Q(s_{t+1}, a_{t+1}) = 0.

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 60

Sarsa: On-Policy TD Control

Turn this into a control method by always updating the policy to be greedy with respect to the current estimate.

(The update uses the quintuple of events s, a, r, s′, a′, which gives the method its name, Sarsa.)
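
A hedged sketch of one Sarsa episode with an ε-greedy policy; the environment interface (env_reset, env_step returning (r, s_next, done)) and the defaultdict Q-table are assumptions of mine:

```python
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """Random action with probability eps, otherwise greedy with respect to Q."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_episode(env_reset, env_step, Q, actions, alpha=0.5, gamma=1.0, eps=0.1):
    """One episode of Sarsa; Q is a defaultdict(float) keyed by (state, action)."""
    s = env_reset()
    a = epsilon_greedy(Q, s, actions, eps)
    done = False
    while not done:
        r, s_next, done = env_step(s, a)
        a_next = None if done else epsilon_greedy(Q, s_next, actions, eps)
        target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s, a = s_next, a_next
```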

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 63

Q-Learning: Off-Policy TD Control

One-step Q-learning:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\big[r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)\big]$$
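
For comparison with the Sarsa sketch, the corresponding off-policy backup (same assumed Q-table convention):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=1.0, terminal=False):
    """One-step Q-learning: bootstrap from the best action in s_next, regardless of
    which action the behavior policy actually takes next (off-policy)."""
    best_next = 0.0 if terminal else max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```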

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 64

Cliffwalking

ε−greedy, ε = 0.1

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 65

Actor-Critic Architecture

(Block diagram: the Actor receives Situations or States from the Environment and sends it Actions; the Primary Critic provides Primary Rewards; the Adaptive Critic turns these into Effective Rewards, which involve values, used to train the Actor.)

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 66

Actor-Critic Methods

❐ Explicit representation of policy as well as value function
❐ Minimal computation to select actions
❐ Can learn an explicit stochastic policy
❐ Can put constraints on policies
❐ Appealing as psychological and neural models

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 67

Actor-Critic Details

The TD error is used to evaluate actions:

$$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$$

If actions are determined by preferences, p(s, a), as follows:

$$\pi_t(s, a) = \Pr\{a_t = a \mid s_t = s\} = \frac{e^{p(s,a)}}{\sum_b e^{p(s,b)}},$$

then you can update the preferences like this:

$$p(s_t, a_t) \leftarrow p(s_t, a_t) + \beta\, \delta_t$$
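
A minimal sketch of these two updates; the step sizes, the dictionary containers, and the softmax helper are my assumptions:

```python
import math
import random
from collections import defaultdict

V = defaultdict(float)          # critic: state values
p = defaultdict(float)          # actor: action preferences p(s, a)

def softmax_action(s, actions):
    """Sample an action from the Gibbs/softmax distribution over preferences."""
    weights = [math.exp(p[(s, a)]) for a in actions]
    return random.choices(actions, weights=weights)[0]

def actor_critic_update(s, a, r, s_next, alpha=0.1, beta=0.1, gamma=1.0, terminal=False):
    """The critic's TD error drives both the value update and the preference update."""
    delta = r + (0.0 if terminal else gamma * V[s_next]) - V[s]
    V[s] += alpha * delta        # critic
    p[(s, a)] += beta * delta    # actor: reinforce a in s when delta > 0
```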

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 71

Afterstates

❐ Usually, a state-value function evaluates states in which the agent can take an action.

❐ But sometimes it is useful to evaluate states after the agent has acted, as in tic-tac-toe.

❐ Why is this useful?

❐ What is this in general?

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 72

Summary

❐ TD prediction
❐ Introduced one-step tabular model-free TD methods
❐ Extend prediction to control by employing some form of GPI
   On-policy control: Sarsa
   Off-policy control: Q-learning

❐ These methods bootstrap and sample, combining aspects of DP and MC methods

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 73

The Overall Plan

❐ Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback; Markov decision processes

❐ Lecture 2: Dynamic Programming; Simple Monte Carlo methods; Temporal Difference methods; A unified perspective; Connections to neuroscience

❐ Lecture 3: Function approximation; Model-based methods; Abstraction and hierarchy; Intrinsically motivated RL

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 74

Lecture 2, Part 4: Unified Perspective

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 75

n-step TD Prediction

❐ Idea: Look farther into the future when you do a TD backup (1, 2, 3, …, n steps)

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 76

Mathematics of n-step TD Prediction

❐ Monte Carlo:

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T$$

❐ TD: Use V to estimate remaining return; 1-step return:

$$R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$$

❐ n-step TD: 2-step return:

$$R_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 V_t(s_{t+2})$$

n-step return:

$$R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(s_{t+n})$$
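
A small sketch of the n-step return as a function; the episode encoding and names are assumptions, and the value estimate is only used if the episode has not ended within n steps:

```python
def n_step_return(rewards, values, t, n, gamma=1.0):
    """R_t^(n): n discounted rewards plus gamma^n times V(s_{t+n}).

    rewards[k] is r_{k+1} (the reward received on leaving step k) and
    values[k] is V(s_k); if the episode ends first, this is the full return.
    """
    T = len(rewards)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, min(t + n, T)))
    if t + n < T:
        G += gamma ** n * values[t + n]        # bootstrap from V(s_{t+n})
    return G
```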

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 77

Learning with n-step Backups

❐ Backup (on-line or off-line):

$$\Delta V_t(s_t) = \alpha\big[R_t^{(n)} - V_t(s_t)\big]$$

❐ Error reduction property of n-step returns:

$$\max_s \Big| E_\pi\{R_t^{(n)} \mid s_t = s\} - V^\pi(s) \Big| \;\le\; \gamma^n \max_s \big| V(s) - V^\pi(s) \big|$$

(the maximum error using the n-step return is at most γⁿ times the maximum error using V)

❐ Using this, you can show that n-step methods converge

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 78

Random Walk Examples

❐ How does 2-step TD work here?
❐ How about 3-step TD?

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 79

A Larger Example

❐ Task: 19-state random walk

❐ Do you think there is an optimal n (for everything)?

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 80

Averaging n-step Returns

❐ n-step methods were introduced to help with understanding TD(λ)

❐ Idea: backup an average of several returns, e.g. backup half of the 2-step and half of the 4-step return:

$$R_t^{avg} = \tfrac{1}{2} R_t^{(2)} + \tfrac{1}{2} R_t^{(4)}$$

❐ Called a complex backup
   Draw each component
   Label with the weights for that component
   (the whole thing still counts as one backup)

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 81

Forward View of TD(λ)

❐ TD(λ) is a method for averaging all n-step backups
   Weight by λⁿ⁻¹ (time since visitation)
   λ-return:

$$R_t^\lambda = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$$

❐ Backup using λ-return:

$$\Delta V_t(s_t) = \alpha\big[R_t^\lambda - V_t(s_t)\big]$$
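
A sketch of the forward-view λ-return for an episodic task, using the n_step_return sketch above and the finite-episode form given on the next slide (all names are assumptions):

```python
def lambda_return(rewards, values, t, lam, gamma=1.0):
    """Weight the n-step returns by (1 - lam) * lam^(n-1); the leftover weight
    lam^(T-t-1) goes to the complete Monte Carlo return."""
    T = len(rewards)
    G = sum((1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
            for n in range(1, T - t))
    G += lam ** (T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)  # MC tail
    return G
```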

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 82

λ-return Weighting Function

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 83

Relation to TD(0) and MC

❐ For an episode that ends at time T, the λ-return can be rewritten as:

$$R_t^\lambda = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} R_t^{(n)} \;+\; \lambda^{T-t-1} R_t$$

(the sum weights the n-step returns until termination; the last term is the weight given to the complete return after termination)

❐ If λ = 1, you get MC:

$$R_t^\lambda = (1-1) \sum_{n=1}^{T-t-1} 1^{n-1} R_t^{(n)} + 1^{T-t-1} R_t = R_t$$

❐ If λ = 0, you get TD(0):

$$R_t^\lambda = (1-0) \sum_{n=1}^{T-t-1} 0^{n-1} R_t^{(n)} + 0^{T-t-1} R_t = R_t^{(1)}$$

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 84

Forward View of TD(λ)

❐ Look forward from each state to determine the update from future states and rewards:

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 85

λ-return on the Random Walk

❐ Same 19-state random walk as before
❐ Why do you think intermediate values of λ are best?

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 86

Backward View of TD(λ)

❐ The forward view was for theory
❐ The backward view is for mechanism

❐ New variable called the eligibility trace: $e_t(s) \in \mathbb{R}^+$
   On each step, decay all traces by γλ and increment the trace for the current state by 1
   Accumulating trace:

$$e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) & \text{if } s \ne s_t \\ \gamma\lambda\, e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases}$$

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 87

Backward View

❐ Shout δt backwards over time
❐ The strength of your voice decreases with temporal distance by γλ

$$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$$

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 88

On-line Tabular TD(λ)

Initialize V(s) arbitrarily and e(s) = 0, for all s ∈ S
Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
        a ← action given by π for s
        Take action a, observe reward r and next state s′
        δ ← r + γV(s′) − V(s)
        e(s) ← e(s) + 1
        For all s:
            V(s) ← V(s) + αδe(s)
            e(s) ← γλe(s)
        s ← s′
    until s is terminal
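
The same procedure as a Python sketch; the environment interface and containers are assumptions carried over from the Sarsa sketch:

```python
from collections import defaultdict

def td_lambda_episode(env_reset, env_step, policy, V, alpha=0.1, gamma=1.0, lam=0.9):
    """On-line tabular TD(lambda) with accumulating traces, for one episode.

    env_step(s, a) -> (r, s_next, done); V is a defaultdict(float), updated in place.
    """
    e = defaultdict(float)                    # eligibility traces
    s = env_reset()
    done = False
    while not done:
        a = policy(s)
        r, s_next, done = env_step(s, a)
        delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
        e[s] += 1.0                           # accumulating trace for the current state
        for x in list(e):                     # update every state with a nonzero trace
            V[x] += alpha * delta * e[x]
            e[x] *= gamma * lam
        s = s_next
```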

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 89

Relation of Backwards View to MC & TD(0)

❐ Using the update rule:

$$\Delta V_t(s) = \alpha\, \delta_t\, e_t(s)$$

❐ As before, if you set λ to 0, you get TD(0)

❐ If you set λ to 1, you get MC but in a better way
   Can apply TD(1) to continuing tasks
   Works incrementally and on-line (instead of waiting to the end of the episode)

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 90

Forward View = Backward View

❐ The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating

❐ The book shows that both views produce the same total update (algebra shown in the book):

Backward updates:

$$\sum_{t=0}^{T-1} \Delta V_t^{TD}(s) = \sum_{t=0}^{T-1} \alpha\, I_{s s_t} \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t} \delta_k$$

Forward updates:

$$\sum_{t=0}^{T-1} \Delta V_t^{\lambda}(s_t)\, I_{s s_t} = \sum_{t=0}^{T-1} \alpha\, I_{s s_t} \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t} \delta_k$$

so that

$$\sum_{t=0}^{T-1} \Delta V_t^{TD}(s) = \sum_{t=0}^{T-1} \Delta V_t^{\lambda}(s_t)\, I_{s s_t}$$

❐ On-line updating with small α is similar

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 92

Control: Sarsa(λ)

❐ Save eligibility traces for state-action pairs instead of just states:

$$e_t(s,a) = \begin{cases} \gamma\lambda\, e_{t-1}(s,a) + 1 & \text{if } s = s_t \text{ and } a = a_t \\ \gamma\lambda\, e_{t-1}(s,a) & \text{otherwise} \end{cases}$$

$$Q_{t+1}(s,a) = Q_t(s,a) + \alpha\, \delta_t\, e_t(s,a)$$

$$\delta_t = r_{t+1} + \gamma\, Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)$$

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 93

Sarsa(λ) Algorithm

Initialize Q(s, a) arbitrarily and e(s, a) = 0, for all s, a
Repeat (for each episode):
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s′
        Choose a′ from s′ using policy derived from Q (e.g. ε-greedy)
        δ ← r + γQ(s′, a′) − Q(s, a)
        e(s, a) ← e(s, a) + 1
        For all s, a:
            Q(s, a) ← Q(s, a) + αδe(s, a)
            e(s, a) ← γλe(s, a)
        s ← s′; a ← a′
    until s is terminal

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 94

Sarsa(λ) Gridworld Example

❐ With one trial, the agent has much more information about how to get to the goal (not necessarily the best way)

❐ Can considerably accelerate learning

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 109

Conclusions

❐ Eligibility traces provide an efficient, incremental way to combine MC and TD
   Includes advantages of MC (can deal with lack of Markov property)
   Includes advantages of TD (using TD error, bootstrapping)

❐ Can significantly speed learning

❐ Does have a cost in computation

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 110

The Overall Plan

❐ Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback; Markov decision processes

❐ Lecture 2: Dynamic Programming; Simple Monte Carlo methods; Temporal Difference methods; A unified perspective; Connections to neuroscience

❐ Lecture 3: Function approximation; Model-based methods; Abstraction and hierarchy; Intrinsically motivated RL

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 111

TD error: $\delta_t = r_t + V_t - V_{t-1}$

(Figure: time courses of the prediction V and the TD error δ, early in learning and after learning is complete, and the response of δ when the reward r is omitted; the cues are regular predictors of the outcome z over this interval.)

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 112

Dopamine Neurons and TD Error

W. Schultz et al. Universite de Fribourg

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 113

Dopamine Modulated Synaptic Plasticity

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 114

Basal Ganglia as Adaptive Critic Architecture

Houk, Adams, & Barto, 1995

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 116

The Overall Plan

❐ Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback; Markov decision processes

❐ Lecture 2: Dynamic Programming; Simple Monte Carlo methods; Temporal Difference methods; A unified perspective; Connections to neuroscience

❐ Lecture 3: Function approximation; Model-based methods; Abstraction and hierarchy; Intrinsically motivated RL
