Eligibility traces: The “atomic breadcrumbs” approach to RL


Page 1: Eligibility traces: The “atomic breadcrumbs” approach to RL

Eligibility traces: The “atomic breadcrumbs” approach to RL

Page 2: Eligibility traces: The “atomic breadcrumbs” approach to RL

Administrivia

•P2M1 grades out this morning

•Let me know if you didn’t get a gradesheet

•Reminder: Q3 on Nov 10

•Threads & synchronization

•P2M3 due today

•Happy (very nearly) Halloween!

Page 3: Eligibility traces: The “atomic breadcrumbs” approach to RL

Timeline

•Last time:

•Miscellany

•Design problem from midterm (Q5)

•Factory methods design pattern

•Polymorphism vs. explicit tests: case study

•More on Q learning

•Today:

•Eligibility traces & SARSA(λ)

•Design exercise

Page 4: Eligibility traces: The “atomic breadcrumbs” approach to RL

Q learning in math...

•Q learning rule says: move the current Q estimate a fraction α of the way toward the one-step target (the reward plus the discounted best Q value at the next state):

•Q(s,a) ← Q(s,a) + α(r + γ max_a' Q(s',a') - Q(s,a))
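A quick worked example of one update (illustrative numbers, not taken from the maze experiment later in the deck): suppose α=0.5, γ=0.9, the current estimate is Q(s,a)=1.0, the step earns r=0, and the best action at s' has Q value 2.0. The target is r + γ·max_a' Q(s',a') = 0 + 0.9·2.0 = 1.8, the error is 1.8 - 1.0 = 0.8, and the new estimate is Q(s,a) = 1.0 + 0.5·0.8 = 1.4.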

Page 5: Eligibility traces: The “atomic breadcrumbs” approach to RL

Q learning in code...

public class MyAgent implements Agent {
  public void updateModel(SARSTuple s) {
    State2d start = s.getInitState();
    State2d end = s.getNextState();
    Action act = s.getAction();
    double r = s.getReward();
    double Qnow = _policy.get(start).get(act);
    double Qnext = _policy.get(end).findMaxQ();   // max over actions at the next state
    double Qrevised = Qnow + getAlpha() * (r + getGamma() * Qnext - Qnow);
    _policy.get(start).put(act, Qrevised);        // store the updated estimate
  }
}
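For a version that runs outside the course framework, here is a self-contained sketch of the same update rule on a tiny 6-state chain. Everything in it (the class name, the chain world, the ε-greedy choice, the constants) is an illustrative assumption rather than the project's Agent/State2d API; it just shows the "back up a chunk of r and Q" behavior discussed on the next slide.

// Self-contained toy sketch (assumed setup, not the course framework):
// tabular Q-learning on a 6-state chain where only the move into the
// rightmost state pays r = +5.
import java.util.Random;

public class ChainQLearningDemo {
  public static void main(String[] args) {
    final int N = 6;                  // states 0..5; state 5 is the goal
    final int LEFT = 0, RIGHT = 1;
    double alpha = 0.65, gamma = 0.9, epsilon = 0.1;
    double[][] Q = new double[N][2];  // Q[state][action], starts at 0
    Random rng = new Random(0);

    for (int episode = 0; episode < 200; episode++) {
      int s = 0;
      while (s != N - 1) {
        // epsilon-greedy action selection (random when tied)
        int a = (rng.nextDouble() < epsilon || Q[s][LEFT] == Q[s][RIGHT])
                ? rng.nextInt(2)
                : (Q[s][RIGHT] > Q[s][LEFT] ? RIGHT : LEFT);
        int sPrime = (a == RIGHT) ? Math.min(s + 1, N - 1) : Math.max(s - 1, 0);
        double r = (sPrime == N - 1) ? 5.0 : 0.0;
        // Q-learning update: back up a chunk of r and max-Q to (s, a)
        double qNext = Math.max(Q[sPrime][LEFT], Q[sPrime][RIGHT]);
        Q[s][a] += alpha * (r + gamma * qNext - Q[s][a]);
        s = sPrime;
      }
    }
    // Q(s, RIGHT) should fall off roughly by a factor of gamma per step
    // away from the goal: ~5.0, 4.5, 4.05, ...
    for (int s = 0; s < N - 1; s++) {
      System.out.printf("Q(%d, RIGHT) = %.2f%n", s, Q[s][RIGHT]);
    }
  }
}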

Page 6: Eligibility traces: The “atomic breadcrumbs” approach to RL

Why does it work?

•Won’t give a full explanation here

•Basic intuition: each step of experience “backs up” reward from goal state toward beginning

[Diagram: a chain of states leading to the goal state, with rewards r=0, r=0, r=+5 along the way; at each transition a chunk of r and Q is “backed up” to the previous state’s Q value.]

Page 7: Eligibility traces: The “atomic breadcrumbs” approach to RL

Q-learning in action

15x15 maze world; R(goal)=1; R(other)=0
γ=0.9, α=0.65

Page 8: Eligibility traces: The “atomic breadcrumbs” approach to RL

Q-learning in action: Initial policy

Page 9: Eligibility traces: The “atomic breadcrumbs” approach to RL

Q-learning in action: After 20 trials

Page 10: Eligibility traces: The “atomic breadcrumbs” approach to RL

Q-learning in action: After 30 trials

Page 11: Eligibility traces: The “atomic breadcrumbs” approach to RL

Q-learning in action: After 100 trials

Page 12: Eligibility traces: The “atomic breadcrumbs” approach to RL

Q-learning in action: After 150 trials

Page 13: Eligibility traces: The “atomic breadcrumbs” approach to RL

Q-learning in action: After 200 trials

Page 14: Eligibility traces: The “atomic breadcrumbs” approach to RL

Q-learning in action: After 250 trials

Page 15: Eligibility traces: The “atomic breadcrumbs” approach to RL

Q-learning in action: After 300 trials

Page 16: Eligibility traces: The “atomic breadcrumbs” approach to RL

Q-learning in action: After 350 trials

Page 17: Eligibility traces: The “atomic breadcrumbs” approach to RL

Q-learning in action: After 400 trials

Page 18: Eligibility traces: The “atomic breadcrumbs” approach to RL

Well, it looks good anyway

•But are we sure it’s actually learning?

•How to measure whether it’s actually getting any better at the task? (Finding the goal state)

Page 19: Eligibility traces: The “atomic breadcrumbs” approach to RL

Well, it looks good anyway

•But are we sure it’s actually learning?

•How to measure whether it’s actually getting any better at the task? (Finding the goal state)

•Every 10 episodes, “freeze” policy (turn off learning)

•Measure avg time to goal from a number of starting states

•Average over a number of test trials to iron out noise

•Plot learning curve: # episodes of learning vs. avg performance (a sketch of this evaluation loop follows below)
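A rough sketch of that evaluation loop in code. The setLearning() and runTrial() hooks, and the Agent/Environment signatures as used here, are assumed names for whatever the framework actually provides, so treat this as a sketch rather than the real rig:

// Hedged sketch of the freeze-and-measure step; setLearning(), runTrial(),
// and these Agent/Environment signatures are assumptions, not course API.
double evaluatePolicy(Agent agent, Environment env, State2d[] starts, int testTrials) {
  agent.setLearning(false);                        // "freeze" the policy
  long totalSteps = 0;
  for (int t = 0; t < testTrials; t++) {
    for (State2d start : starts) {
      totalSteps += runTrial(agent, env, start);   // steps to reach the goal
    }
  }
  agent.setLearning(true);                         // resume learning afterwards
  return (double) totalSteps / (testTrials * starts.length);
}
// Call this every 10 learning episodes; each returned average becomes one
// point on the learning curve (#episodes of learning vs. avg steps-to-goal).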

Page 20: Eligibility traces: The “atomic breadcrumbs” approach to RL

Learning performance

Page 21: Eligibility traces: The “atomic breadcrumbs” approach to RL

Notes on learning perf.

•After 400 learning episodes, still hasn’t asymptoted

•Note: that’s ~700,000 steps of experience!!!

•Q learning is really, really slow!!!

•Same holds for many RL methods (sadly)

Page 22: Eligibility traces: The “atomic breadcrumbs” approach to RL

That’s so inefficient!

•Big problem with Q-learning:

•Each step of experience only “backs up” information by one step in the world

•Takes a really, really long time to back up info all the way to the start state

•How can we do better?

•Want to propagate info further on each step

•Ideas?

Page 23: Eligibility traces: The “atomic breadcrumbs” approach to RL

Eligibility traces

•Key idea: keep extra information around

•Which states visited this trial

•How long ago they were visited

•Extra bookkeeping info called “eligibility trace”

•Written e(s,a)

•On each step: update all state/action pairs, in proportion to their eligibility

•Efficiency note: really only need to update the (s,a) pairs where e(s,a) ≠ 0 (one way to store this is sketched below)
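One way that bookkeeping might look in code. This is a minimal sketch: it assumes the SAPair class from the later code slide implements equals()/hashCode(), and the class name EligibilityTable and the pruning threshold are illustrative choices, not the course’s design:

// Assumed representation of the eligibility trace: a map from state/action
// pairs to trace values, dropping entries that decay to (near) zero so only
// "eligible" pairs are ever touched on an update.
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class EligibilityTable {
  private final Map<SAPair, Double> _elig = new HashMap<>();
  private static final double MIN_TRACE = 1e-4;   // prune threshold (assumption)

  public double get(SAPair p) { return _elig.getOrDefault(p, 0.0); }

  public void increment(SAPair p) { _elig.put(p, get(p) + 1.0); }

  // Decay every stored trace by lambda*gamma; remove negligible entries so
  // the per-step update loop stays proportional to the length of the trail.
  public void decayAll(double lambda, double gamma) {
    Iterator<Map.Entry<SAPair, Double>> it = _elig.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<SAPair, Double> e = it.next();
      double decayed = e.getValue() * lambda * gamma;
      if (decayed < MIN_TRACE) it.remove(); else e.setValue(decayed);
    }
  }

  public Iterable<SAPair> eligiblePairs() { return _elig.keySet(); }
}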

Page 24: Eligibility traces: The “atomic breadcrumbs” approach to RL

Radioactive breadcrumbs

•At every step ((s,a,r,s’,a’) tuple):

•Increment e(s,a) for current (s,a) pair by 1

•For every s’’,a’’ pair in S×A:

•Update Q(s’’,a’’) in proportion to e(s’’,a’’)

•Decay e(s’’,a’’) by factor of λγ

•e(s’’,a’’) *= lambda*gamma (a small numeric example of this decay follows below)

•Leslie Kaelbling calls this the “radioactive breadcrumbs” form of RL
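To make the decay concrete (illustrative numbers): with λ=0.9 and γ=0.9, the pair visited on the current step has e=1.0, one step later it is 0.81, two steps later about 0.66, and five steps later about 0.35, so each pair keeps absorbing a share of later TD errors that shrinks geometrically with how long ago it was visited.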

Page 25: Eligibility traces: The “atomic breadcrumbs” approach to RL

The SARSA(λ) code

public class SARSAlAgent implements Agent {
  public void updateModel(SARSTuple s) {
    State2d start = s.getInitState();
    State2d end = s.getNextState();
    Action act = s.getAction();
    double r = s.getReward();
    Action nextAct = pickAction(end);                    // on-policy: choose a'
    double Qnow = _policy.get(start, act);
    double Qnext = _policy.get(end, nextAct);
    double delta = r + _gamma * Qnext - Qnow;            // TD error
    setElig(start, act, getElig(start, act) + 1.0);      // bump e(s,a)
    for (SAPair p : getEligiblePairs()) {
      double currQ = _policy.get(p.getS(), p.getA());
      _policy.set(p.getS(), p.getA(),
          currQ + getElig(p.getS(), p.getA()) * _alpha * delta);
      setElig(p.getS(), p.getA(),
          getElig(p.getS(), p.getA()) * _gamma * _lambda);   // decay trace
    }
  }
}
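The code above leans on pickAction() for the on-policy choice of a’, which the slide doesn’t show. An ε-greedy version might look roughly like this; the _policy.getActions() accessor, the _epsilon and _rng fields, and the method body are assumptions about the surrounding class, not the actual course code:

// Hypothetical epsilon-greedy pickAction(); _policy.getActions(), _epsilon,
// and _rng are assumed members, not part of the code shown on the slide.
private Action pickAction(State2d s) {
  List<Action> actions = _policy.getActions(s);          // assumed accessor
  if (_rng.nextDouble() < _epsilon) {                    // explore
    return actions.get(_rng.nextInt(actions.size()));
  }
  Action best = actions.get(0);                          // exploit: argmax Q
  for (Action a : actions) {
    if (_policy.get(s, a) > _policy.get(s, best)) best = a;
  }
  return best;
}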

Page 26: Eligibility traces: The “atomic breadcrumbs” approach to RL

The SARSA(λ) picture

[Figure: the agent starts here and attaches eligibility to each state/action pair it visits along the way.]

Page 27: Eligibility traces: The “atomic breadcrumbs” approach to RL

The SARSA(λ) picture

Page 28: Eligibility traces: The “atomic breadcrumbs” approach to RL

The SARSA(λ) picture

Page 29: Eligibility traces: The “atomic breadcrumbs” approach to RL

The SARSA(λ) picture

Page 30: Eligibility traces: The “atomic breadcrumbs” approach to RL

The SARSA(λ) picture

Page 31: Eligibility traces: The “atomic breadcrumbs” approach to RL

The SARSA(λ) picture

Page 32: Eligibility traces: The “atomic breadcrumbs” approach to RL

Design Exercise: Experimental Rig

Page 33: Eligibility traces: The “atomic breadcrumbs” approach to RL

Design exercise

•For M4/Rollout, need to be able to:
•Train agent for many trials/steps per trial
•Generate learning curves for agent’s learning
•Run some trials w/ learning turned on
•Freeze learning
•Run some trials w/ learning turned off
•Average steps-to-goal over those trials
•Save average as one point in curve

•Design: objects/methods to support this learning framework (one possible decomposition is sketched after this list)

•Support: diff learning algs, diff environments, diff params, variable # of trials/steps, etc.
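One possible decomposition, offered only as a starting point and not as the intended answer to the exercise; the interface and method names below are invented for illustration:

// Illustrative sketch of one decomposition. The runner sees only these
// interfaces, so learning algorithms, environments, and parameters can all
// be swapped without touching the evaluation code. Names are invented.
interface LearningAgent {
  Action pickAction(State2d s);
  void updateModel(SARSTuple t);
  void setLearning(boolean on);          // freeze / unfreeze
}

interface Environment {
  State2d reset(State2d start);          // begin a trial at a given state
  SARSTuple step(Action a);              // apply an action, return the experience
  boolean atGoal();
}

interface ExperimentRunner {
  void train(LearningAgent agent, Environment env, int episodes, int maxStepsPerEpisode);
  double evaluate(LearningAgent agent, Environment env,
                  State2d[] starts, int testTrials);     // avg steps-to-goal
  java.util.List<double[]> learningCurve(LearningAgent agent, Environment env,
                                         int totalEpisodes, int evalEvery);
}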