Eligibility traces: The “atomic breadcrumbs” approach to RL


Page 1: Eligibility traces: The “atomic breadcrumbs” approach to RL

Eligibility traces: The “atomic breadcrumbs” approach to RL

Page 2: Eligibility traces: The “atomic breadcrumbs” approach to RL

Administrivia

•P2M1 grades out this morning

•Let me know if you didn’t get a gradesheet

•Reminder: Q3 on Nov 10

•Threads & synchronization

•P2M3 due today

•Happy (very nearly) Halloween!

Page 3: Eligibility traces: The “atomic breadcrumbs” approach to RL

Timeline

•Last time:

•Miscellany

•Design problem from midterm (Q5)

•Factory methods design pattern

•Polymorphism vs. explicit tests: case study

•More on Q learning

•Today:

•Eligibility traces & SARSA(λ)

•Design exercise

Page 4: Eligibility traces: The “atomic breadcrumbs” approach to RL

Q learning in math...

•Q learning rule says: move the current Q estimate a fraction α of the way toward the one-step target (the reward plus the discounted best Q value at the next state):

•Q(s,a) ← Q(s,a) + α(r + γ max_a' Q(s',a') - Q(s,a))
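A quick worked example of one update (illustrative numbers, not taken from the maze experiment later in the deck): suppose α=0.5, γ=0.9, the current estimate is Q(s,a)=1.0, the step earns r=0, and the best action at s' has Q value 2.0. The target is r + γ·max_a' Q(s',a') = 0 + 0.9·2.0 = 1.8, the error is 1.8 - 1.0 = 0.8, and the new estimate is Q(s,a) = 1.0 + 0.5·0.8 = 1.4.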

Page 5: Eligibility traces: The “atomic breadcrumbs” approach to RL

Q learning in code...

public class MyAgent implements Agent {
  public void updateModel(SARSTuple s) {
    State2d start = s.getInitState();
    State2d end = s.getNextState();
    Action act = s.getAction();
    double r = s.getReward();
    double Qnow = _policy.get(start).get(act);
    double Qnext = _policy.get(end).findMaxQ();   // max over actions at the next state
    double Qrevised = Qnow + getAlpha() * (r + getGamma() * Qnext - Qnow);
    _policy.get(start).put(act, Qrevised);        // store the updated estimate
  }
}
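For a version that runs outside the course framework, here is a self-contained sketch of the same update rule on a tiny 6-state chain. Everything in it (the class name, the chain world, the ε-greedy choice, the constants) is an illustrative assumption rather than the project's Agent/State2d API; it just shows the "back up a chunk of r and Q" behavior discussed on the next slide.

// Self-contained toy sketch (assumed setup, not the course framework):
// tabular Q-learning on a 6-state chain where only the move into the
// rightmost state pays r = +5.
import java.util.Random;

public class ChainQLearningDemo {
  public static void main(String[] args) {
    final int N = 6;                  // states 0..5; state 5 is the goal
    final int LEFT = 0, RIGHT = 1;
    double alpha = 0.65, gamma = 0.9, epsilon = 0.1;
    double[][] Q = new double[N][2];  // Q[state][action], starts at 0
    Random rng = new Random(0);

    for (int episode = 0; episode < 200; episode++) {
      int s = 0;
      while (s != N - 1) {
        // epsilon-greedy action selection (random when tied)
        int a = (rng.nextDouble() < epsilon || Q[s][LEFT] == Q[s][RIGHT])
                ? rng.nextInt(2)
                : (Q[s][RIGHT] > Q[s][LEFT] ? RIGHT : LEFT);
        int sPrime = (a == RIGHT) ? Math.min(s + 1, N - 1) : Math.max(s - 1, 0);
        double r = (sPrime == N - 1) ? 5.0 : 0.0;
        // Q-learning update: back up a chunk of r and max-Q to (s, a)
        double qNext = Math.max(Q[sPrime][LEFT], Q[sPrime][RIGHT]);
        Q[s][a] += alpha * (r + gamma * qNext - Q[s][a]);
        s = sPrime;
      }
    }
    // Q(s, RIGHT) should fall off roughly by a factor of gamma per step
    // away from the goal: ~5.0, 4.5, 4.05, ...
    for (int s = 0; s < N - 1; s++) {
      System.out.printf("Q(%d, RIGHT) = %.2f%n", s, Q[s][RIGHT]);
    }
  }
}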

Page 6: Eligibility traces: The “atomic breadcrumbs” approach to RL

Why does it work?

•Won’t give a full explanation here

•Basic intuition: each step of experience “backs up” reward from goal state toward beginning

[Diagram: a chain of states leading to the goal state, with rewards r=0, r=0, r=+5 along the way; at each transition a chunk of r and Q is “backed up” to the previous state’s Q value.]

Page 7: Eligibility traces: The “atomic breadcrumbs” approach to RL

Q-learning in action

15x15 maze world; R(goal)=1; R(other)=0
γ=0.9, α=0.65

Page 8: Eligibility traces: The “atomic breadcrumbs” approach to RL

Q-learning in action: Initial policy

Page 9: Eligibility traces: The “atomic breadcrumbs” approach to RL

Q-learning in action: After 20 trials

Page 10: Eligibility traces: The “atomic breadcrumbs” approach to RL

Q-learning in action: After 30 trials

Page 11: Eligibility traces: The “atomic breadcrumbs” approach to RL

Q-learning in action: After 100 trials

Page 12: Eligibility traces: The “atomic breadcrumbs” approach to RL

Q-learning in action: After 150 trials

Page 13: Eligibility traces: The “atomic breadcrumbs” approach to RL

Q-learning in action: After 200 trials

Page 14: Eligibility traces: The “atomic breadcrumbs” approach to RL

Q-learning in action: After 250 trials

Page 15: Eligibility traces: The “atomic breadcrumbs” approach to RL

Q-learning in action: After 300 trials

Page 16: Eligibility traces: The “atomic breadcrumbs” approach to RL

Q-learning in action: After 350 trials

Page 17: Eligibility traces: The “atomic breadcrumbs” approach to RL

Q-learning in action: After 400 trials

Page 18: Eligibility traces: The “atomic breadcrumbs” approach to RL

Well, it looks good anyway

•But are we sure it’s actually learning?

•How to measure whether it’s actually getting any better at the task? (Finding the goal state)

Page 19: Eligibility traces: The “atomic breadcrumbs” approach to RL

Well, it looks good anyway

•But are we sure it’s actually learning?

•How to measure whether it’s actually getting any better at the task? (Finding the goal state)

•Every 10 episodes, “freeze” policy (turn off learning)

•Measure avg time to goal from a number of starting states

•Average over a number of test trials to iron out noise

•Plot learning curve: # episodes of learning vs. avg performance (a sketch of this evaluation loop follows below)
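A rough sketch of that evaluation loop in code. The setLearning() and runTrial() hooks, and the Agent/Environment signatures as used here, are assumed names for whatever the framework actually provides, so treat this as a sketch rather than the real rig:

// Hedged sketch of the freeze-and-measure step; setLearning(), runTrial(),
// and these Agent/Environment signatures are assumptions, not course API.
double evaluatePolicy(Agent agent, Environment env, State2d[] starts, int testTrials) {
  agent.setLearning(false);                        // "freeze" the policy
  long totalSteps = 0;
  for (int t = 0; t < testTrials; t++) {
    for (State2d start : starts) {
      totalSteps += runTrial(agent, env, start);   // steps to reach the goal
    }
  }
  agent.setLearning(true);                         // resume learning afterwards
  return (double) totalSteps / (testTrials * starts.length);
}
// Call this every 10 learning episodes; each returned average becomes one
// point on the learning curve (#episodes of learning vs. avg steps-to-goal).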

Page 20: Eligibility traces: The “atomic breadcrumbs” approach to RL

Learning performance

Page 21: Eligibility traces: The “atomic breadcrumbs” approach to RL

Notes on learning perf.

•After 400 learning episodes, still hasn’t asymptoted

•Note: that’s ~700,000 steps of experience!!!

•Q learning is really, really slow!!!

•Same holds for many RL methods (sadly)

Page 22: Eligibility traces: The “atomic breadcrumbs” approach to RL

That’s so inefficient!

•Big problem with Q-learning:

•Each step of experience only “backs up” information by one step in the world

•Takes a really, really long time to back up info all the way to the start state

•How can we do better?

•Want to propagate info further on each step

•Ideas?

Page 23: Eligibility traces: The “atomic breadcrumbs” approach to RL

Eligibility traces

•Key idea: keep extra information around

•Which states visited this trial

•How long ago they were visited

•Extra bookkeeping info called “eligibility trace”

•Written e(s,a)

•On each step: update all state/action pairs, in proportion to their eligibility

•Efficiency note: really only need to update the (s,a) pairs where e(s,a) ≠ 0 (one way to store this is sketched below)
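One way that bookkeeping might look in code. This is a minimal sketch: it assumes the SAPair class from the later code slide implements equals()/hashCode(), and the class name EligibilityTable and the pruning threshold are illustrative choices, not the course’s design:

// Assumed representation of the eligibility trace: a map from state/action
// pairs to trace values, dropping entries that decay to (near) zero so only
// "eligible" pairs are ever touched on an update.
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class EligibilityTable {
  private final Map<SAPair, Double> _elig = new HashMap<>();
  private static final double MIN_TRACE = 1e-4;   // prune threshold (assumption)

  public double get(SAPair p) { return _elig.getOrDefault(p, 0.0); }

  public void increment(SAPair p) { _elig.put(p, get(p) + 1.0); }

  // Decay every stored trace by lambda*gamma; remove negligible entries so
  // the per-step update loop stays proportional to the length of the trail.
  public void decayAll(double lambda, double gamma) {
    Iterator<Map.Entry<SAPair, Double>> it = _elig.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<SAPair, Double> e = it.next();
      double decayed = e.getValue() * lambda * gamma;
      if (decayed < MIN_TRACE) it.remove(); else e.setValue(decayed);
    }
  }

  public Iterable<SAPair> eligiblePairs() { return _elig.keySet(); }
}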

Page 24: Eligibility traces: The “atomic breadcrumbs” approach to RL

Radioactive breadcrumbs

•At every step ((s,a,r,s’,a’) tuple):

•Increment e(s,a) for current (s,a) pair by 1

•For every s’’,a’’ pair in S×A:

•Update Q(s’’,a’’) in proportion to e(s’’,a’’)

•Decay e(s’’,a’’) by factor of λγ

•e(s’’,a’’) *= lambda*gamma (a small numeric example of this decay follows below)

•Leslie Kaelbling calls this the “radioactive breadcrumbs” form of RL
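To make the decay concrete (illustrative numbers): with λ=0.9 and γ=0.9, the pair visited on the current step has e=1.0, one step later it is 0.81, two steps later about 0.66, and five steps later about 0.35, so each pair keeps absorbing a share of later TD errors that shrinks geometrically with how long ago it was visited.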

Page 25: Eligibility traces: The “atomic breadcrumbs” approach to RL

The SARSA(λ) code

public class SARSAlAgent implements Agent {
  public void updateModel(SARSTuple s) {
    State2d start = s.getInitState();
    State2d end = s.getNextState();
    Action act = s.getAction();
    double r = s.getReward();
    Action nextAct = pickAction(end);                    // on-policy: choose a'
    double Qnow = _policy.get(start, act);
    double Qnext = _policy.get(end, nextAct);
    double delta = r + _gamma * Qnext - Qnow;            // TD error
    setElig(start, act, getElig(start, act) + 1.0);      // bump e(s,a)
    for (SAPair p : getEligiblePairs()) {
      double currQ = _policy.get(p.getS(), p.getA());
      _policy.set(p.getS(), p.getA(),
          currQ + getElig(p.getS(), p.getA()) * _alpha * delta);
      setElig(p.getS(), p.getA(),
          getElig(p.getS(), p.getA()) * _gamma * _lambda);   // decay trace
    }
  }
}
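The code above leans on pickAction() for the on-policy choice of a’, which the slide doesn’t show. An ε-greedy version might look roughly like this; the _policy.getActions() accessor, the _epsilon and _rng fields, and the method body are assumptions about the surrounding class, not the actual course code:

// Hypothetical epsilon-greedy pickAction(); _policy.getActions(), _epsilon,
// and _rng are assumed members, not part of the code shown on the slide.
private Action pickAction(State2d s) {
  List<Action> actions = _policy.getActions(s);          // assumed accessor
  if (_rng.nextDouble() < _epsilon) {                    // explore
    return actions.get(_rng.nextInt(actions.size()));
  }
  Action best = actions.get(0);                          // exploit: argmax Q
  for (Action a : actions) {
    if (_policy.get(s, a) > _policy.get(s, best)) best = a;
  }
  return best;
}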

Page 26: Eligibility traces: The “atomic breadcrumbs” approach to RL

The SARSA(λ) picture

[Figure: the agent starts here and attaches eligibility to each state/action pair it visits along the way.]

Page 27: Eligibility traces: The “atomic breadcrumbs” approach to RL

The SARSA(λ) picture

Page 28: Eligibility traces: The “atomic breadcrumbs” approach to RL

The SARSA(λ) picture

Page 29: Eligibility traces: The “atomic breadcrumbs” approach to RL

The SARSA(λ) picture

Page 30: Eligibility traces: The “atomic breadcrumbs” approach to RL

The SARSA(λ) picture

Page 31: Eligibility traces: The “atomic breadcrumbs” approach to RL

The SARSA(λ) picture

Page 32: Eligibility traces: The “atomic breadcrumbs” approach to RL

Design Exercise: Experimental Rig

Page 33: Eligibility traces: The “atomic breadcrumbs” approach to RL

Design exercise

•For M4/Rollout, need to be able to:
•Train agent for many trials/steps per trial
•Generate learning curves for agent’s learning
•Run some trials w/ learning turned on
•Freeze learning
•Run some trials w/ learning turned off
•Average steps-to-goal over those trials
•Save average as one point in curve

•Design: objects/methods to support this learning framework (one possible decomposition is sketched after this list)

•Support: diff learning algs, diff environments, diff params, variable # of trials/steps, etc.
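One possible decomposition, offered only as a starting point and not as the intended answer to the exercise; the interface and method names below are invented for illustration:

// Illustrative sketch of one decomposition. The runner sees only these
// interfaces, so learning algorithms, environments, and parameters can all
// be swapped without touching the evaluation code. Names are invented.
interface LearningAgent {
  Action pickAction(State2d s);
  void updateModel(SARSTuple t);
  void setLearning(boolean on);          // freeze / unfreeze
}

interface Environment {
  State2d reset(State2d start);          // begin a trial at a given state
  SARSTuple step(Action a);              // apply an action, return the experience
  boolean atGoal();
}

interface ExperimentRunner {
  void train(LearningAgent agent, Environment env, int episodes, int maxStepsPerEpisode);
  double evaluate(LearningAgent agent, Environment env,
                  State2d[] starts, int testTrials);     // avg steps-to-goal
  java.util.List<double[]> learningCurve(LearningAgent agent, Environment env,
                                         int totalEpisodes, int evalEvery);
}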