
An Introduction to COMPUTATIONAL REINFORCEMENT LEARNING

Andrew G. Barto, Department of Computer Science

University of Massachusetts – Amherst

UPF Lecture 2

Autonomous Learning Laboratory – Department of Computer Science

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 2

The Overall Plan

❐ Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback; Markov decision processes

❐ Lecture 2: Dynamic Programming; Simple Monte Carlo methods; Temporal Difference methods; A unified perspective; Connections to neuroscience

❐ Lecture 3: Function approximation; Model-based methods; Abstraction and hierarchy; Intrinsically motivated RL

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 3

The Overall Plan

❐ Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback; Markov decision processes

❐ Lecture 2: Dynamic Programming; Simple Monte Carlo methods; Temporal Difference methods; A unified perspective; Connections to neuroscience

❐ Lecture 3: Function approximation; Model-based methods; Dimensions of Reinforcement Learning

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 4

Lecture 2, Part 1: Dynamic Programming

Objectives of this part:

❐ Overview of a collection of classical solution methods for MDPs known as Dynamic Programming (DP)

❐ Show how DP can be used to compute value functions, and hence, optimal policies

❐ Discuss efficiency and utility of DP

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 5

Policy Evaluation

Policy Evaluation: for a given policy π, compute the state-value function V^π.

Recall the state-value function for policy π:

$$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\Big\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \;\Big|\; s_t = s\Big\}$$

and the Bellman equation for V^π:

$$V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma V^\pi(s')\big]$$

a system of |S| simultaneous linear equations.
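
Because the Bellman equation for a fixed policy is linear in V^π, a small tabular MDP can be evaluated by solving the linear system directly. The sketch below is illustrative only; the array layout, function name, and toy MDP are my assumptions, not from the slides:

```python
import numpy as np

def evaluate_policy_exact(P, R, pi, gamma):
    """Solve (I - gamma * P_pi) V = r_pi for V^pi.

    P[a, s, s'] : transition probabilities P^a_{ss'}
    R[a, s, s'] : expected rewards R^a_{ss'}
    pi[s, a]    : probability of taking action a in state s under the policy
    """
    n_states = P.shape[1]
    P_pi = np.einsum('sa,ast->st', pi, P)          # state-to-state transitions under pi
    r_pi = np.einsum('sa,ast,ast->s', pi, P, R)    # expected one-step reward under pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Tiny made-up 2-state, 2-action MDP, just to exercise the function.
P = np.array([[[0.9, 0.1], [0.0, 1.0]],
              [[0.2, 0.8], [0.0, 1.0]]])
R = np.zeros((2, 2, 2)); R[:, :, 1] = 1.0          # reward 1 for landing in state 1
pi = np.full((2, 2), 0.5)                          # uniform random policy
print(evaluate_policy_exact(P, R, pi, gamma=0.9))
```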

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 6

Iterative Methods

Compute a sequence of successive approximations

$$V_0 \to V_1 \to \cdots \to V_k \to V_{k+1} \to \cdots \to V^\pi$$

using the full policy evaluation backup:

$$V_{k+1}(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma V_k(s')\big]$$

A "sweep" consists of applying this backup operation to each state.
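
As a rough sketch of what a sweep-based implementation might look like, using the same assumed array conventions as above (not the slides' own code):

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma, theta=1e-6):
    """Iterative policy evaluation: repeat full sweeps until the value function
    changes by less than theta everywhere."""
    V = np.zeros(P.shape[1])
    while True:
        # one sweep: expected one-step reward plus discounted value of the successor
        V_new = np.einsum('sa,ast,ast->s', pi, P, R) + \
                gamma * np.einsum('sa,ast,t->s', pi, P, V)
        delta = np.max(np.abs(V_new - V))
        V = V_new
        if delta < theta:
            return V
```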

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 8

A Small Gridworld

❐ An undiscounted episodic task
❐ Nonterminal states: 1, 2, …, 14
❐ One terminal state (shown twice as shaded squares)
❐ Actions that would take the agent off the grid leave the state unchanged
❐ Reward is –1 until the terminal state is reached

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 9

Iterative Policy Eval for the Small Gridworld

π = random (uniform) action choices
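
To make the example concrete, here is a hedged sketch that builds this gridworld and evaluates the random policy with the policy_evaluation function sketched above; the state and action encoding is my own. It should reproduce, up to the stopping tolerance, the converged values shown on the slide (0 at the terminal corners, down to about −22 in the farthest cells):

```python
import numpy as np

# 4x4 gridworld: cells 0..15; cells 0 and 15 play the role of the shared terminal state.
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]          # up, down, left, right
terminal = {0, 15}
P = np.zeros((4, 16, 16))
R = np.full((4, 16, 16), -1.0)                       # every step costs -1
for s in range(16):
    row, col = divmod(s, 4)
    for a, (dr, dc) in enumerate(moves):
        if s in terminal:
            P[a, s, s], R[a, s, s] = 1.0, 0.0        # terminal: absorb with no reward
            continue
        r2, c2 = row + dr, col + dc
        s2 = s if not (0 <= r2 < 4 and 0 <= c2 < 4) else r2 * 4 + c2
        P[a, s, s2] = 1.0                            # off-grid moves leave the state unchanged

pi = np.full((16, 4), 0.25)                          # equiprobable random policy
V = policy_evaluation(P, R, pi, gamma=1.0)           # sketch from the previous slide
print(V.reshape(4, 4).round(1))
```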

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 10

Policy Improvement

Suppose we have computed V^π for a deterministic policy π.

For a given state s, would it be better to do an action a ≠ π(s)?

The value of doing a in state s is:

$$Q^\pi(s,a) = E_\pi\big\{r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s, a_t = a\big\} = \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma V^\pi(s')\big]$$

It is better to switch to action a for state s if and only if

$$Q^\pi(s,a) > V^\pi(s)$$

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 11

Policy Improvement Cont.

Do this for all states to get a new policy π′ that is greedy with respect to V^π:

$$\pi'(s) = \arg\max_a Q^\pi(s,a) = \arg\max_a \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma V^\pi(s')\big]$$

Then V^{π′} ≥ V^π.

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 12
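
A minimal sketch of this greedification step, with the same assumed array conventions as the earlier sketches (illustrative, not the slides' code):

```python
import numpy as np

def greedy_policy(P, R, V, gamma):
    """Return a deterministic policy that is greedy with respect to V."""
    Q = np.einsum('ast,ast->as', P, R) + gamma * np.einsum('ast,t->as', P, V)  # Q[a, s]
    best = Q.argmax(axis=0)                          # best action in each state
    pi_new = np.zeros((Q.shape[1], Q.shape[0]))
    pi_new[np.arange(Q.shape[1]), best] = 1.0
    return pi_new
```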

Policy Improvement Cont.

What if V^{π′} = V^π? That is, for all s ∈ S,

$$V^{\pi'}(s) = \max_a \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma V^\pi(s')\big]\;?$$

But this is the Bellman optimality equation. So V^{π′} = V* and both π and π′ are optimal policies.

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 13

Policy Iteration

$$\pi_0 \to V^{\pi_0} \to \pi_1 \to V^{\pi_1} \to \cdots \to \pi^* \to V^* \to \pi^*$$

Each step alternates policy evaluation (π_i → V^{π_i}) with policy improvement, or "greedification" (V^{π_i} → π_{i+1}).
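
A sketch of the loop, reusing the policy_evaluation and greedy_policy sketches from the earlier slides (illustrative assumptions throughout):

```python
import numpy as np

def policy_iteration(P, R, gamma, theta=1e-6):
    """Alternate evaluation and greedy improvement until the policy stops changing."""
    n_actions, n_states, _ = P.shape
    pi = np.full((n_states, n_actions), 1.0 / n_actions)   # start from the random policy
    while True:
        V = policy_evaluation(P, R, pi, gamma, theta)      # policy evaluation
        pi_new = greedy_policy(P, R, V, gamma)             # policy improvement
        if np.array_equal(pi_new, pi):
            return pi, V
        pi = pi_new
```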

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 15

Value Iteration

Recall the full policy evaluation backup:

$$V_{k+1}(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma V_k(s')\big]$$

Here is the full value iteration backup:

$$V_{k+1}(s) \leftarrow \max_a \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma V_k(s')\big]$$
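
A corresponding sketch of value iteration under the same assumed conventions; the greedy policy is read off the final Q values:

```python
import numpy as np

def value_iteration(P, R, gamma, theta=1e-6):
    """Repeated max-backups; returns the value estimate and a greedy action per state."""
    V = np.zeros(P.shape[1])
    while True:
        Q = np.einsum('ast,ast->as', P, R) + gamma * np.einsum('ast,t->as', P, V)
        V_new = Q.max(axis=0)                        # the max over actions is the backup
        if np.max(np.abs(V_new - V)) < theta:
            return V_new, Q.argmax(axis=0)
        V = V_new
```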

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 17

Asynchronous DP

❐ All the DP methods described so far require exhaustive sweeps of the entire state set.

❐ Asynchronous DP does not use sweeps. Instead it works like this: repeat until a convergence criterion is met:

– Pick a state at random and apply the appropriate backup

❐ Still needs lots of computation, but does not get locked into hopelessly long sweeps.

❐ Can you select states to back up intelligently? YES: an agent's experience can act as a guide.

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 19

Efficiency of DP

❐ Finding an optimal policy is polynomial in the number of states…

❐ BUT, the number of states is often astronomical, e.g., often growing exponentially with the number of state variables (what Bellman called "the curse of dimensionality").

❐ In practice, classical DP can be applied to problems with a few million states.

❐ Asynchronous DP can be applied to larger problems, and is appropriate for parallel computation.

❐ It is surprisingly easy to come up with MDPs for which DP methods are not practical.

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 20

Summary

❐ Policy evaluation: backups without a max
❐ Policy improvement: form a greedy policy, if only locally
❐ Policy iteration: alternate the above two processes
❐ Value iteration: backups with a max
❐ Full backups (to be contrasted later with sample backups)
❐ Asynchronous DP: a way to avoid exhaustive sweeps
❐ Bootstrapping: updating estimates based on other estimates

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 21

The Overall Plan

❐ Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback; Markov decision processes

❐ Lecture 2: Dynamic Programming; Simple Monte Carlo methods; Temporal Difference methods; A unified perspective; Connections to neuroscience

❐ Lecture 3: Function approximation; Model-based methods; Abstraction and hierarchy; Intrinsically motivated RL

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 22

Lecture 2, Part 2: Simple Monte Carlo Methods

❐ Simple Monte Carlo methods learn from complete sample returns
   Only defined for episodic tasks

❐ Simple Monte Carlo methods learn directly from experience
   On-line: No model necessary
   Simulated: No need for a full model

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 24

(First-visit) Monte Carlo policy evaluation
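
The algorithm box on this slide did not survive the transcript. The standard first-visit procedure averages, for each state, the returns that follow the first visit to that state in each episode; the sketch below assumes a simple episode encoding of my own choosing:

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """Estimate V^pi by averaging, for each state, the returns following its
    first visit in each episode.

    Each episode is a list of (state, reward) pairs, where reward is the
    reward received on leaving that state.
    """
    returns = defaultdict(list)
    for episode in episodes:
        # return following each time step, computed backwards
        G, returns_from = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            returns_from[t] = G
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:                 # first visit only
                seen.add(s)
                returns[s].append(returns_from[t])
    return {s: sum(g) / len(g) for s, g in returns.items()}
```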

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 27

Backup diagram for Simple Monte Carlo

❐ Entire episode included
❐ Only one choice at each state (unlike DP)
❐ MC does not bootstrap
❐ Time required to estimate one state does not depend on the total number of states

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 30

Monte Carlo Estimation of Action Values (Q)

❐ Monte Carlo is most useful when a model is not available
❐ Qπ(s,a): average return starting from state s and action a, then following π
❐ Also converges asymptotically if every state-action pair is visited infinitely often

We are really interested in estimates of V* and Q*, i.e., Monte Carlo Control.

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 38

Learning about π while following π′

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 41

Summary

❐ MC has several advantages over DP:
   Can learn directly from interaction with the environment
   No need for full models
   No need to learn about ALL states
   Less harm by Markovian violations (later in book)

❐ MC methods provide an alternate policy evaluation process

❐ One issue to watch for: maintaining sufficient exploration (exploring starts, soft policies)

❐ No bootstrapping (as opposed to DP)

❐ Estimating values for one policy while behaving according to another policy: importance sampling

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 42

The Overall Plan

❐ Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback; Markov decision processes

❐ Lecture 2: Simple Monte Carlo methods; Dynamic Programming; Temporal Difference methods; A unified perspective; Connections to neuroscience

❐ Lecture 3: Function approximation; Model-based methods; Abstraction and hierarchy; Intrinsically motivated RL

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 43

Lecture 2, Part 3: Temporal Difference Learning

Objectives of this part:

❐ Introduce Temporal Difference (TD) learning
❐ Focus first on policy evaluation, or prediction, methods
❐ Then extend to control methods

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 44

TD Prediction

Recall: Policy evaluation (the prediction problem): for a given policy π, compute the state-value function V^π.

Simple (every-visit) Monte Carlo method:

$$V(s_t) \leftarrow V(s_t) + \alpha\big[R_t - V(s_t)\big]$$

where the target R_t is the actual return after time t.

The simplest TD method, TD(0):

$$V(s_t) \leftarrow V(s_t) + \alpha\big[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\big]$$

where the target r_{t+1} + γV(s_{t+1}) is an estimate of the return.
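
To show the contrast as code, a minimal sketch of the two backups (the function names and the tabular V container are my assumptions):

```python
def constant_alpha_mc_update(V, s, G, alpha=0.1):
    """Every-visit constant-alpha MC: move V(s) toward the actual return G."""
    V[s] += alpha * (G - V[s])

def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0, terminal=False):
    """TD(0): move V(s) toward the bootstrapped target r + gamma * V(s')."""
    target = r + (0.0 if terminal else gamma * V[s_next])
    V[s] += alpha * (target - V[s])
```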

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 45

Simple Monte Carlo

Constant-α MC:

$$V(s_t) \leftarrow V(s_t) + \alpha\big[R_t - V(s_t)\big]$$

where R_t is the actual return following state s_t.

(Backup diagram: a single sampled trajectory from s_t all the way to a terminal state T; the whole return is backed up.)

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 46

cf. Dynamic Programming

$$V(s_t) \leftarrow E_\pi\big\{r_{t+1} + \gamma V(s_{t+1})\big\}$$

(Backup diagram: a full one-step backup from s_t over all actions and possible successor states s_{t+1}, using the model to take the expectation.)

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 47

Simplest TD Method

$$V(s_t) \leftarrow V(s_t) + \alpha\big[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\big]$$

(Backup diagram: a single sampled one-step transition from s_t to s_{t+1} with reward r_{t+1}.)

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 48

TD Bootstraps and Samples

❐ Bootstrapping: update involves an estimate
   MC does not bootstrap
   DP bootstraps
   TD bootstraps

❐ Sampling: update does not involve an expected value
   MC samples
   DP does not sample
   TD samples

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 51

Advantages of TD Learning

❐ TD methods do not require a model of the environment, only experience

❐ TD, but not MC, methods can be fully incremental
   You can learn before knowing the final outcome
   – Less memory
   – Less peak computation
   You can learn without the final outcome
   – From incomplete sequences

❐ Both MC and TD converge (under certain assumptions), but which is faster/better?

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 52

Random Walk Example

Values learned by TD(0) after various numbers of episodes (random walk with equiprobable transitions, α = 0.1).

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 53

TD and MC on the Random Walk

Data averaged over 100 sequences of episodes

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 56

You are the Predictor

Suppose you observe the following 8 episodes:

A, 0, B, 0
B, 1
B, 1
B, 1
B, 1
B, 1
B, 1
B, 0

V(A)?

V(B)?

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 57

You are the Predictor

V(A)?

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 58

You are the Predictor

❐ The prediction that best matches the training data is V(A) = 0
   This minimizes the mean-squared error on the training set
   This is what a Monte Carlo method gets

❐ If we consider the sequentiality of the problem, then we would set V(A) = 0.75
   This is correct for the maximum-likelihood estimate of a Markov model generating the data
   i.e., if we fit a best-fit Markov model, assume it is exactly correct, and then compute what it predicts (how?)
   This is called the certainty-equivalence estimate
   This is what TD(0) gets
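
A quick worked check of both answers (the episode encoding here is an assumption of this sketch):

```python
# The 8 episodes above: each is a list of (state, reward-on-leaving-state) pairs.
episodes = [[('A', 0), ('B', 0)]] + [[('B', 1)]] * 6 + [[('B', 0)]]

# Monte Carlo: average the observed (undiscounted) returns from each state.
returns = {'A': [], 'B': []}
for ep in episodes:
    G = 0
    for s, r in reversed(ep):
        G += r
        returns[s].append(G)
print({s: sum(g) / len(g) for s, g in returns.items()})   # V(B) = 0.75, V(A) = 0.0

# Certainty equivalence: in the fitted Markov model, A always goes to B with reward 0,
# so V(A) = 0 + V(B) = 0.75, which is what TD(0) converges to.
```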

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 59

Learning An Action-Value Function

Estimate Q^π for the current behavior policy π.

After every transition from a nonterminal state s_t, do this:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\big[r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\big]$$

If s_{t+1} is terminal, then Q(s_{t+1}, a_{t+1}) = 0.

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 60

Sarsa: On-Policy TD Control

Turn this into a control method by always updating the policy to be greedy with respect to the current estimate.

(The update uses the quintuple of events s, a, r, s′, a′, which gives the method its name, Sarsa.)
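
A hedged sketch of one Sarsa episode with an ε-greedy policy; the environment interface (env_reset, env_step returning (r, s_next, done)) and the defaultdict Q-table are assumptions of mine:

```python
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """Random action with probability eps, otherwise greedy with respect to Q."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_episode(env_reset, env_step, Q, actions, alpha=0.5, gamma=1.0, eps=0.1):
    """One episode of Sarsa; Q is a defaultdict(float) keyed by (state, action)."""
    s = env_reset()
    a = epsilon_greedy(Q, s, actions, eps)
    done = False
    while not done:
        r, s_next, done = env_step(s, a)
        a_next = None if done else epsilon_greedy(Q, s_next, actions, eps)
        target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s, a = s_next, a_next
```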

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 63

Q-Learning: Off-Policy TD Control

One-step Q-learning:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\big[r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)\big]$$
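
For comparison with the Sarsa sketch, the corresponding off-policy backup (same assumed Q-table convention):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=1.0, terminal=False):
    """One-step Q-learning: bootstrap from the best action in s_next, regardless of
    which action the behavior policy actually takes next (off-policy)."""
    best_next = 0.0 if terminal else max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```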

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 64

Cliffwalking

ε−greedy, ε = 0.1

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 65

Actor-Critic Architecture

(Block diagram: the Actor receives Situations or States from the Environment and sends it Actions; the Primary Critic provides Primary Rewards; the Adaptive Critic turns these into Effective Rewards, which involve values, used to train the Actor.)

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 66

Actor-Critic Methods

❐ Explicit representation of policy as well as value function
❐ Minimal computation to select actions
❐ Can learn an explicit stochastic policy
❐ Can put constraints on policies
❐ Appealing as psychological and neural models

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 67

Actor-Critic Details

The TD error is used to evaluate actions:

$$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$$

If actions are determined by preferences, p(s, a), as follows:

$$\pi_t(s, a) = \Pr\{a_t = a \mid s_t = s\} = \frac{e^{p(s,a)}}{\sum_b e^{p(s,b)}},$$

then you can update the preferences like this:

$$p(s_t, a_t) \leftarrow p(s_t, a_t) + \beta\, \delta_t$$
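
A minimal sketch of these two updates; the step sizes, the dictionary containers, and the softmax helper are my assumptions:

```python
import math
import random
from collections import defaultdict

V = defaultdict(float)          # critic: state values
p = defaultdict(float)          # actor: action preferences p(s, a)

def softmax_action(s, actions):
    """Sample an action from the Gibbs/softmax distribution over preferences."""
    weights = [math.exp(p[(s, a)]) for a in actions]
    return random.choices(actions, weights=weights)[0]

def actor_critic_update(s, a, r, s_next, alpha=0.1, beta=0.1, gamma=1.0, terminal=False):
    """The critic's TD error drives both the value update and the preference update."""
    delta = r + (0.0 if terminal else gamma * V[s_next]) - V[s]
    V[s] += alpha * delta        # critic
    p[(s, a)] += beta * delta    # actor: reinforce a in s when delta > 0
```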

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 71

Afterstates

❐ Usually, a state-value function evaluates states in which the agent can take an action.

❐ But sometimes it is useful to evaluate states after the agent has acted, as in tic-tac-toe.

❐ Why is this useful?

❐ What is this in general?

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 72

Summary

❐ TD prediction
❐ Introduced one-step tabular model-free TD methods
❐ Extend prediction to control by employing some form of GPI
   On-policy control: Sarsa
   Off-policy control: Q-learning

❐ These methods bootstrap and sample, combining aspects of DP and MC methods

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 73

The Overall Plan

❐ Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback; Markov decision processes

❐ Lecture 2: Dynamic Programming; Simple Monte Carlo methods; Temporal Difference methods; A unified perspective; Connections to neuroscience

❐ Lecture 3: Function approximation; Model-based methods; Abstraction and hierarchy; Intrinsically motivated RL

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 74

Lecture 2, Part 4: Unified Perspective

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 75

n-step TD Prediction

❐ Idea: Look farther into the future when you do a TD backup (1, 2, 3, …, n steps)

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 76

Mathematics of n-step TD Prediction

❐ Monte Carlo:

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T$$

❐ TD: Use V to estimate remaining return; 1-step return:

$$R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$$

❐ n-step TD: 2-step return:

$$R_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 V_t(s_{t+2})$$

n-step return:

$$R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(s_{t+n})$$
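
A small sketch of the n-step return as a function; the episode encoding and names are assumptions, and the value estimate is only used if the episode has not ended within n steps:

```python
def n_step_return(rewards, values, t, n, gamma=1.0):
    """R_t^(n): n discounted rewards plus gamma^n times V(s_{t+n}).

    rewards[k] is r_{k+1} (the reward received on leaving step k) and
    values[k] is V(s_k); if the episode ends first, this is the full return.
    """
    T = len(rewards)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, min(t + n, T)))
    if t + n < T:
        G += gamma ** n * values[t + n]        # bootstrap from V(s_{t+n})
    return G
```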

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 77

Learning with n-step Backups

❐ Backup (on-line or off-line):

$$\Delta V_t(s_t) = \alpha\big[R_t^{(n)} - V_t(s_t)\big]$$

❐ Error reduction property of n-step returns:

$$\max_s \Big| E_\pi\{R_t^{(n)} \mid s_t = s\} - V^\pi(s) \Big| \;\le\; \gamma^n \max_s \big| V(s) - V^\pi(s) \big|$$

(the maximum error using the n-step return is at most γⁿ times the maximum error using V)

❐ Using this, you can show that n-step methods converge

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 78

Random Walk Examples

❐ How does 2-step TD work here?
❐ How about 3-step TD?

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 79

A Larger Example

❐ Task: 19-state random walk

❐ Do you think there is an optimal n (for everything)?

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 80

Averaging n-step Returns

❐ n-step methods were introduced to help with understanding TD(λ)

❐ Idea: backup an average of several returns, e.g. backup half of the 2-step and half of the 4-step return:

$$R_t^{avg} = \tfrac{1}{2} R_t^{(2)} + \tfrac{1}{2} R_t^{(4)}$$

❐ Called a complex backup
   Draw each component
   Label with the weights for that component
   (the whole thing still counts as one backup)

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 81

Forward View of TD(λ)

❐ TD(λ) is a method for averaging all n-step backups
   Weight by λⁿ⁻¹ (time since visitation)
   λ-return:

$$R_t^\lambda = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$$

❐ Backup using λ-return:

$$\Delta V_t(s_t) = \alpha\big[R_t^\lambda - V_t(s_t)\big]$$
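
A sketch of the forward-view λ-return for an episodic task, using the n_step_return sketch above and the finite-episode form given on the next slide (all names are assumptions):

```python
def lambda_return(rewards, values, t, lam, gamma=1.0):
    """Weight the n-step returns by (1 - lam) * lam^(n-1); the leftover weight
    lam^(T-t-1) goes to the complete Monte Carlo return."""
    T = len(rewards)
    G = sum((1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
            for n in range(1, T - t))
    G += lam ** (T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)  # MC tail
    return G
```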

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 82

λ-return Weighting Function

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 83

Relation to TD(0) and MC

❐ For an episode that ends at time T, the λ-return can be rewritten as:

$$R_t^\lambda = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} R_t^{(n)} \;+\; \lambda^{T-t-1} R_t$$

(the sum weights the n-step returns until termination; the last term is the weight given to the complete return after termination)

❐ If λ = 1, you get MC:

$$R_t^\lambda = (1-1) \sum_{n=1}^{T-t-1} 1^{n-1} R_t^{(n)} + 1^{T-t-1} R_t = R_t$$

❐ If λ = 0, you get TD(0):

$$R_t^\lambda = (1-0) \sum_{n=1}^{T-t-1} 0^{n-1} R_t^{(n)} + 0^{T-t-1} R_t = R_t^{(1)}$$

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 84

Forward View of TD(λ)

❐ Look forward from each state to determine the update from future states and rewards:

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 85

λ-return on the Random Walk

❐ Same 19-state random walk as before
❐ Why do you think intermediate values of λ are best?

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 86

Backward View of TD(λ)

❐ The forward view was for theory
❐ The backward view is for mechanism

❐ New variable called the eligibility trace: $e_t(s) \in \mathbb{R}^+$
   On each step, decay all traces by γλ and increment the trace for the current state by 1
   Accumulating trace:

$$e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) & \text{if } s \ne s_t \\ \gamma\lambda\, e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases}$$

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 87

Backward View

❐ Shout δt backwards over time
❐ The strength of your voice decreases with temporal distance by γλ

$$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$$

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 88

On-line Tabular TD(λ)

Initialize V(s) arbitrarily and e(s) = 0, for all s ∈ S
Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
        a ← action given by π for s
        Take action a, observe reward r and next state s′
        δ ← r + γV(s′) − V(s)
        e(s) ← e(s) + 1
        For all s:
            V(s) ← V(s) + αδe(s)
            e(s) ← γλe(s)
        s ← s′
    until s is terminal
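
The same procedure as a Python sketch; the environment interface and containers are assumptions carried over from the Sarsa sketch:

```python
from collections import defaultdict

def td_lambda_episode(env_reset, env_step, policy, V, alpha=0.1, gamma=1.0, lam=0.9):
    """On-line tabular TD(lambda) with accumulating traces, for one episode.

    env_step(s, a) -> (r, s_next, done); V is a defaultdict(float), updated in place.
    """
    e = defaultdict(float)                    # eligibility traces
    s = env_reset()
    done = False
    while not done:
        a = policy(s)
        r, s_next, done = env_step(s, a)
        delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
        e[s] += 1.0                           # accumulating trace for the current state
        for x in list(e):                     # update every state with a nonzero trace
            V[x] += alpha * delta * e[x]
            e[x] *= gamma * lam
        s = s_next
```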

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 89

Relation of Backwards View to MC & TD(0)

❐ Using the update rule:

$$\Delta V_t(s) = \alpha\, \delta_t\, e_t(s)$$

❐ As before, if you set λ to 0, you get TD(0)

❐ If you set λ to 1, you get MC but in a better way
   Can apply TD(1) to continuing tasks
   Works incrementally and on-line (instead of waiting to the end of the episode)

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 90

Forward View = Backward View

❐ The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating

❐ The book shows that both views produce the same total update (algebra shown in the book):

Backward updates:

$$\sum_{t=0}^{T-1} \Delta V_t^{TD}(s) = \sum_{t=0}^{T-1} \alpha\, I_{s s_t} \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t} \delta_k$$

Forward updates:

$$\sum_{t=0}^{T-1} \Delta V_t^{\lambda}(s_t)\, I_{s s_t} = \sum_{t=0}^{T-1} \alpha\, I_{s s_t} \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t} \delta_k$$

so that

$$\sum_{t=0}^{T-1} \Delta V_t^{TD}(s) = \sum_{t=0}^{T-1} \Delta V_t^{\lambda}(s_t)\, I_{s s_t}$$

❐ On-line updating with small α is similar

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 92

Control: Sarsa(λ)

❐ Save eligibility traces for state-action pairs instead of just states:

$$e_t(s,a) = \begin{cases} \gamma\lambda\, e_{t-1}(s,a) + 1 & \text{if } s = s_t \text{ and } a = a_t \\ \gamma\lambda\, e_{t-1}(s,a) & \text{otherwise} \end{cases}$$

$$Q_{t+1}(s,a) = Q_t(s,a) + \alpha\, \delta_t\, e_t(s,a)$$

$$\delta_t = r_{t+1} + \gamma\, Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)$$

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 93

Sarsa(λ) Algorithm

Initialize Q(s, a) arbitrarily and e(s, a) = 0, for all s, a
Repeat (for each episode):
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s′
        Choose a′ from s′ using policy derived from Q (e.g. ε-greedy)
        δ ← r + γQ(s′, a′) − Q(s, a)
        e(s, a) ← e(s, a) + 1
        For all s, a:
            Q(s, a) ← Q(s, a) + αδe(s, a)
            e(s, a) ← γλe(s, a)
        s ← s′; a ← a′
    until s is terminal

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 94

Sarsa(λ) Gridworld Example

❐ With one trial, the agent has much more information about how to get to the goal (not necessarily the best way)

❐ Can considerably accelerate learning

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 109

Conclusions

❐ Eligibility traces provide an efficient, incremental way to combine MC and TD
   Includes advantages of MC (can deal with lack of Markov property)
   Includes advantages of TD (using TD error, bootstrapping)

❐ Can significantly speed learning

❐ Does have a cost in computation

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 110

The Overall Plan

❐ Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback; Markov decision processes

❐ Lecture 2: Dynamic Programming; Simple Monte Carlo methods; Temporal Difference methods; A unified perspective; Connections to neuroscience

❐ Lecture 3: Function approximation; Model-based methods; Abstraction and hierarchy; Intrinsically motivated RL

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 111

TD error: $\delta_t = r_t + V_t - V_{t-1}$

(Figure: time courses of the prediction V and the TD error δ, early in learning and after learning is complete, and the response of δ when the reward r is omitted; the cues are regular predictors of the outcome z over this interval.)

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 112

Dopamine Neurons and TD Error

W. Schultz et al. Universite de Fribourg

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 113

Dopamine Modulated Synaptic Plasticity

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 114

Basal Ganglia as Adaptive Critic Architecture

Houk, Adams, & Barto, 1995

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 116

The Overall Plan

❐ Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback; Markov decision processes

❐ Lecture 2: Dynamic Programming; Simple Monte Carlo methods; Temporal Difference methods; A unified perspective; Connections to neuroscience

❐ Lecture 3: Function approximation; Model-based methods; Abstraction and hierarchy; Intrinsically motivated RL
