
CSE 573: Artificial Intelligence
Reinforcement Learning
Dan Weld / University of Washington
[Many slides taken from Dan Klein and Pieter Abbeel / CS188 Intro to AI at UC Berkeley; materials available at http://ai.berkeley.edu.]


Page 1

CSE 573: Artificial Intelligence
Reinforcement Learning

Dan Weld / University of Washington
[Many slides taken from Dan Klein and Pieter Abbeel / CS188 Intro to AI at UC Berkeley; materials available at http://ai.berkeley.edu.]

Page 2

Logistics

§ PS3 due today
§ PS4 due in one week (Thurs 2/16)
§ Research paper comments due on Tues

§ Paper itself will be on Web calendar after class


Page 3

Reinforcement Learning

Page 4

Reinforcement Learning

§ Basic idea:
§ Receive feedback in the form of rewards
§ Agent's utility is defined by the reward function
§ Must (learn to) act so as to maximize expected rewards
§ All learning is based on observed samples of outcomes!

[Diagram: the Agent sends Actions a to the Environment; the Environment returns State s and Reward r.]
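This loop is the whole RL interface; a minimal sketch in Python (the Env interface with reset()/step() is an assumption for illustration, not something defined in the slides):

# One episode of agent-environment interaction.
# `env` is any object with reset() -> s and step(a) -> (s', r, done);
# `policy` maps a state s to an action a.
def run_episode(env, policy):
    samples = []                      # observed (s, a, r, s') transitions
    s = env.reset()
    done = False
    while not done:
        a = policy(s)                 # agent chooses an action
        s2, r, done = env.step(a)     # environment returns reward and next state
        samples.append((s, a, r, s2))
        s = s2
    return samples

Everything below consumes exactly these sampled (s, a, r, s') tuples.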

Page 5

Example: Animal Learning

§ RL studied experimentally for more than 60 years in psychology

§ Example: foraging

§ Rewards: food, pain, hunger, drugs, etc.
§ Mechanisms and sophistication debated

§ Bees learn near-optimal foraging plan in field of artificial flowers with controlled nectar supplies

§ Bees have a direct neural connection from nectar intake measurement to motor planning area

Page 6

Example: Backgammon

§ Reward only for win / loss in terminal states, zero otherwise

§ TD-Gammon learns a function approximation to V(s) using a neural network

§ Combined with depth 3 search, one of the top 3 players in the world

§ You could imagine training Pacman this way…

§ … but it’s tricky! (It’s also PS 4)

Page 7

Example: Learning to Walk

Initial [Video: AIBO WALK, initial] [Kohl and Stone, ICRA 2004]

Page 8

Example: Learning to Walk

Finished [Video: AIBO WALK, finished] [Kohl and Stone, ICRA 2004]

Page 9

Example: Sidewinding

[Andrew Ng] [Video: SNAKE, climbStep + sidewinding]

Page 10


“Few driving tasks are as intimidating as parallel parking….”

https://www.youtube.com/watch?v=pB_iFY2jIdI

Page 11

Parallel Parking

“Few driving tasks are as intimidating as parallel parking….”


https://www.youtube.com/watch?v=pB_iFY2jIdI

Page 12

Other Applications

§ Go playing
§ Robotic control

§ helicopter maneuvering, autonomous vehicles
§ Mars rover - path planning, oversubscription planning
§ elevator planning

§ Game playing - backgammon, tetris, checkers
§ Neuroscience
§ Computational Finance, Sequential Auctions
§ Assisting elderly in simple tasks
§ Spoken dialog management
§ Communication Networks - switching, routing, flow control
§ War planning, evacuation planning

Page 13

Reinforcement Learning

§ Still assume a Markov decision process (MDP):
§ A set of states s ∈ S
§ A set of actions (per state) A
§ A model T(s,a,s')
§ A reward function R(s,a,s') & discount γ

§ Still looking for a policy π(s)

§ New twist: don't know T or R
§ I.e., we don't know which states are good or what the actions do
§ Must actually try actions and states out to learn


Page 14

Offline (MDPs) vs. Online (RL)

[Diagram: a spectrum from Offline Solution (Planning), through Monte Carlo Planning (with a Simulator), to Online Learning (RL).]

Diff: 1) dying ok; 2) (re)set button

Page 15

Four Key Ideas for RL

§ Credit-Assignment Problem
§ What was the real cause of reward?

§ Exploration-exploitation tradeoff

§ Model-based vs model-free learning
§ What function is being learned?

§ Approximating the Value Function
§ Smaller → easier to learn & better generalization

Page 16

Credit Assignment Problem


Page 17


Exploration-Exploitation tradeoff

§ You have visited part of the state space and found a reward of 100
§ is this the best you can hope for???

§ Exploitation: should I stick with what I know and find a good policy w.r.t. this knowledge?
§ at risk of missing out on a better reward somewhere

§ Exploration: should I look for states w/ more reward?
§ at risk of wasting time & getting some negative reward

Page 18

Model-Based Learning

Page 19

Model-Based Learning

§ Model-Based Idea:
§ Learn an approximate model based on experiences
§ Solve for values as if the learned model were correct

§ Step 1: Learn empirical MDP model
§ Explore (e.g., move randomly)
§ Count outcomes s' for each s, a
§ Normalize to give an estimate of T̂(s,a,s')
§ Discover each R̂(s,a,s') when we experience (s,a,s')

§ Step 2: Solve the learned MDP
§ For example, use value iteration, as before

Page 20

Example: Model-Based Learning

Random π

Assume: γ = 1

[Gridworld figure: states B, C, D in a row, with A above C and E below C.]

Observed Episodes (Training):

Episode 1: B, east, C, -1 / C, east, D, -1 / D, exit, x, +10
Episode 2: B, east, C, -1 / C, east, D, -1 / D, exit, x, +10
Episode 3: E, north, C, -1 / C, east, D, -1 / D, exit, x, +10
Episode 4: E, north, C, -1 / C, east, A, -1 / A, exit, x, -10

Learned Model:

T(s,a,s'):
T(B, east, C) = 1.00
T(C, east, D) = 0.75
T(C, east, A) = 0.25

R(s,a,s'):
R(B, east, C) = -1
R(C, east, D) = -1
R(D, exit, x) = +10
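Step 1 is just counting and normalizing; a sketch on the four episodes above (plain Python; the tuple encoding of transitions is mine, not from the slides):

from collections import Counter, defaultdict

# The four training episodes, as (s, a, s', r) transitions.
episodes = [
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10)],
]

counts = defaultdict(Counter)   # (s, a) -> Counter of observed s'
R = {}                          # (s, a, s') -> observed reward
for episode in episodes:
    for s, a, s2, r in episode:
        counts[(s, a)][s2] += 1
        R[(s, a, s2)] = r       # rewards here are deterministic

# Normalize counts into the estimated transition model T.
T = {(s, a): {s2: n / sum(c.values()) for s2, n in c.items()}
     for (s, a), c in counts.items()}

print(T[("C", "east")])   # {'D': 0.75, 'A': 0.25}, matching the learned model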

Page 21

Convergence

§ If policy explores “enough” – doesn't starve any state
§ Then T & R converge

§ So VI, PI, LAO*, etc. will find the optimal policy
§ Using Bellman equations

§ When can the agent start exploiting??
§ (We'll answer this question later)


Page 22


Two main reinforcement learning approaches

§ Model-based approaches:
§ explore environment & learn model, T = P(s'|s,a) and R(s,a), (almost) everywhere
§ use model to plan policy, MDP-style
§ approach leads to strongest theoretical results
§ often works well when state-space is manageable

§ Model-free approach:
§ don't learn a model of T & R; instead, learn Q-function (or policy) directly
§ weaker theoretical results
§ often works better when state space is large

Page 23


Two main reinforcement learning approaches

§ Model-based approaches: Learn T + R
   |S|²|A| + |S||A| parameters (40,400)

§ Model-free approach: Learn Q
   |S||A| parameters (400)

(The counts shown assume |S| = 100, |A| = 4.)

Page 24

Model-Free Learning

Page 25

Nothing is Free in Life!

§ What exactly is Free???
§ No model of T
§ No model of R

§ (Instead, just model Q)


Page 26

Reminder: Q-Value Iteration

[Diagram: backup from Q_{k+1}(s,a) over outcomes (s,a,s'), where V_k(s') = max_{a'} Q_k(s',a').]

§ For all s, a:
§ Initialize Q_0(s,a) = 0    (no time steps left means an expected reward of zero)
§ K = 0
§ Repeat (do Bellman backups):
   For every (s,a) pair:
   Q_{k+1}(s,a) = Σ_{s'} T(s,a,s') [ R(s,a,s') + γ max_{a'} Q_k(s',a') ]
   K += 1
§ Until convergence (i.e., Q-values don't change much)

This is easy…. We can sample this (the expectation over s').
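Step 2 of model-based learning can reuse this backup directly; a compact sketch of Q-value iteration over the dict-based T and R built earlier (that representation is an assumption, not slide code):

# T: dict (s, a) -> {s': prob};  R: dict (s, a, s') -> reward.
def q_value_iteration(T, R, gamma=1.0, n_iters=100):
    Q = {sa: 0.0 for sa in T}                       # Q0(s,a) = 0
    for _ in range(n_iters):
        newQ = {}
        for (s, a), dist in T.items():
            total = 0.0
            for s2, p in dist.items():
                # V_k(s') = max over a' of Q_k(s', a'); 0 if s' has no actions
                v = max((Q[(s2, a2)] for (s3, a2) in Q if s3 == s2),
                        default=0.0)
                total += p * (R[(s, a, s2)] + gamma * v)
            newQ[(s, a)] = total                    # Bellman backup
        Q = newQ
    return Q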

Page 27

Puzzle: Q-Learning

[Diagram and backup as on the previous slide: Q_{k+1}(s,a) = Σ_{s'} T(s,a,s') [ R(s,a,s') + γ max_{a'} Q_k(s',a') ].]

§ For all s, a:
§ Initialize Q_0(s,a) = 0    (no time steps left means an expected reward of zero)
§ K = 0
§ Repeat (do Bellman backups):
   For every (s,a) pair: apply the backup above
   K += 1
§ Until convergence (i.e., Q-values don't change much)

Q: How can we compute this without R, T?!?
A: Compute averages using sampled outcomes.

Page 28

Simple Example: Expected Age

Goal: Compute expected age of CSE students

Known P(A):
E[A] = Σ_a P(a) · a

Without P(A), instead collect samples [a1, a2, … aN]:

Unknown P(A): “Model Based”
P̂(a) = num(a) / N
E[A] ≈ Σ_a P̂(a) · a
Why does this work? Because eventually you learn the right model.

Unknown P(A): “Model Free”
E[A] ≈ (1/N) Σ_i a_i
Why does this work? Because samples appear with the right frequencies.
Note: never know P(age=22)

Page 29

Anytime Model-Free Expected Age

Goal: Compute expected age of CSE students

Unknown P(A): “Model Free”

Without P(A), instead collect samples [a1, a2, … aN]

Incremental running average:
Let A = 0
Loop for i = 1 to ∞:
    a_i ← ask “what is your age?”
    A ← ((i-1)/i) · A + (1/i) · a_i

Fixed learning rate α (exponential moving average):
Let A = 0
Loop for i = 1 to ∞:
    a_i ← ask “what is your age?”
    A ← (1-α) · A + α · a_i
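Both loops translate directly into code; a sketch comparing the exact running mean with the fixed-α version (the sample ages are made up):

# Exact running mean vs. exponential moving average (fixed learning rate).
def running_averages(ages, alpha=0.1):
    exact, ema = 0.0, 0.0
    for i, a in enumerate(ages, start=1):
        exact = (i - 1) / i * exact + (1 / i) * a   # true mean of samples so far
        ema = (1 - alpha) * ema + alpha * a         # weighs recent samples more
    return exact, ema

print(running_averages([22, 25, 21, 30, 24]))   # exact -> 24.4

The fixed-α average never converges exactly (and is biased by the A = 0 start), but it tracks nonstationary quantities, which is what Q-learning will need.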

Page 30

Sampling Q-Values

§ Big idea: learn from every experience!

§ Follow exploration policy a ← π(s)
§ Update Q(s,a) each time we experience a transition (s, a, s', r)
§ Likely outcomes s' will contribute updates more often

§ Update towards running average:

[Diagram: from state s, follow π(s), observe reward r and land in s'.]

Get a sample of Q(s,a): sample = R(s,a,s') + γ max_{a'} Q(s',a')

Update to Q(s,a):
Q(s,a) ← (1-α) Q(s,a) + α · sample

Same update:
Q(s,a) ← Q(s,a) + α (sample − Q(s,a))

Rearranging:
Q(s,a) ← Q(s,a) + α (difference)
where difference = (R(s,a,s') + γ max_{a'} Q(s',a')) − Q(s,a)

Page 31

Q-Learning

§ For all s, a:
§ Initialize Q(s,a) = 0

§ Repeat forever:
   Where are you? s.
   Choose some action a.
   Execute it in the real world: (s, a, r, s')
   Do update:
      difference ← [R(s,a,s') + γ max_{a'} Q(s',a')] − Q(s,a)
      Q(s,a) ← Q(s,a) + α (difference)
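Putting the loop and the update together, a minimal tabular Q-learning sketch (reusing the hypothetical env from the interaction-loop sketch; the ε-greedy choice stands in for “choose some action” and is discussed on the coming slides):

import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.5, gamma=1.0, eps=0.1):
    Q = defaultdict(float)                        # Q(s,a) = 0 for all s, a
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < eps:             # explore: random action
                a = random.choice(actions(s))
            else:                                 # exploit: greedy action
                a = max(actions(s), key=lambda act: Q[(s, act)])
            s2, r, done = env.step(a)             # execute in the real world
            best_next = 0.0 if done else max(Q[(s2, a2)] for a2 in actions(s2))
            difference = (r + gamma * best_next) - Q[(s, a)]
            Q[(s, a)] += alpha * difference       # the update from this slide
            s = s2
    return Q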

Page 32

Example

Assume: γ = 1, α = 1/2

Observed Transition: B, east, C, -2

[Gridworld figure: states B, C, D in a row, with A above C and E below C. All Q-values are 0, except Q(D, exit) = 8.]

In state B. What should you do?
Suppose (for now) we follow a random exploration policy
→ “Go east”

Page 33

Example

Assume: γ = 1, α = 1/2
Observed Transition: B, east, C, -2

[Gridworld figure: all Q-values 0 except Q(D, exit) = 8; the observed transition updates the Q(B, east) cell from 0 to ?]

Q(B, east) ← ½ · 0 + ½ · (-2 + γ max_{a'} Q(C, a')) = ½ · 0 + ½ · (-2 + 0) = -1

Page 34

Example

Assume: γ = 1, α = 1/2
Observed Transitions: B, east, C, -2 then C, east, D, -2

[Gridworld figure: after the first update Q(B, east) = -1, and Q(D, exit) = 8; the second transition updates the Q(C, east) cell from 0 to ?]

Q(C, east) ← ½ · 0 + ½ · (-2 + γ max_{a'} Q(D, a')) = ½ · 0 + ½ · (-2 + 8) = 3

Page 35

Example

Assume: γ = 1, α = 1/2
Observed Transitions: B, east, C, -2 then C, east, D, -2

[Gridworld figure, final state of the example: Q(B, east) = -1, Q(C, east) = 3, Q(D, exit) = 8; all other Q-values remain 0.]
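The two updates above are just the single-step Q-learning rule applied twice; a quick arithmetic check (values taken from the slides):

alpha, gamma = 0.5, 1.0
q_b_east = (1 - alpha) * 0 + alpha * (-2 + gamma * 0)   # max_a' Q(C, a') = 0
q_c_east = (1 - alpha) * 0 + alpha * (-2 + gamma * 8)   # max_a' Q(D, a') = 8
print(q_b_east, q_c_east)   # -1.0 3.0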

Page 36

Q-Learning Properties

§ Q-learning converges to the optimal Q-function (and hence learns the optimal policy)
§ even if you're acting suboptimally!
§ This is called off-policy learning

§ Caveats:
§ You have to explore enough
§ You have to eventually shrink the learning rate, α
§ … but not decrease it too quickly

§ And… if you want to act optimally
§ You have to switch from explore to exploit

[Demo: Q-learning – auto – cliff grid (L11D1)]

Page 37

Video of Demo Q-Learning Auto Cliff Grid

Page 38

Q-Learning

§ For all s, a:
§ Initialize Q(s,a) = 0

§ Repeat forever:
   Where are you? s.
   Choose some action a.
   Execute it in the real world: (s, a, r, s')
   Do update:

Page 39

Exploration vs. Exploitation

Page 40

Questions

§ How to explore?
§ Uniform exploration
§ Epsilon Greedy
§ With (small) probability ε, act randomly
§ With (large) probability 1-ε, act on current policy
§ Exploration Functions (such as UCB)
§ Thompson Sampling

§ When to exploit?

§ How to even think about this tradeoff?

Page 41

Questions

§ How to explore?
§ Random Exploration
§ Uniform exploration
§ Epsilon Greedy
§ With (small) probability ε, act randomly
§ With (large) probability 1-ε, act on current policy
§ Exploration Functions (such as UCB)
§ Thompson Sampling

§ When to exploit?

§ How to even think about this tradeoff?
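Epsilon-greedy is one line of logic; a minimal sketch (Q and actions(s) as in the tabular Q-learning sketch earlier):

import random

def epsilon_greedy(Q, actions, s, eps=0.1):
    if random.random() < eps:                        # small probability: act randomly
        return random.choice(actions(s))
    return max(actions(s), key=lambda a: Q[(s, a)])  # otherwise: act on current policy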

Page 42

Exploration Functions

§ When to explore?
§ Random actions: explore a fixed amount
§ Better idea: explore areas whose badness is not (yet) established, eventually stop exploring

§ Exploration function
§ Takes a value estimate u and a visit count n, and returns an optimistic utility, e.g. f(u,n) = u + k/n

§ Note: this propagates the “bonus” back to states that lead to unknown states as well!

Regular Q-Update:  Q(s,a) ← (1-α) Q(s,a) + α [R(s,a,s') + γ max_{a'} Q(s',a')]
Modified Q-Update: Q(s,a) ← (1-α) Q(s,a) + α [R(s,a,s') + γ max_{a'} f(Q(s',a'), N(s',a'))]
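A sketch of the modified update (the constant k and the visit-count bookkeeping are illustrative assumptions; Q is a defaultdict(float) as in the earlier sketch):

from collections import defaultdict

K = 2.0                    # exploration constant in f(u, n) = u + K/n
N = defaultdict(int)       # visit counts N(s, a)

def explore_f(u, n):
    return u + K / n if n > 0 else float("inf")   # unvisited pairs look best

def modified_update(Q, actions, s, a, r, s2, alpha=0.5, gamma=1.0):
    N[(s, a)] += 1
    # Back up optimistic utilities instead of raw Q-values.
    best = max(explore_f(Q[(s2, a2)], N[(s2, a2)]) for a2 in actions(s2))
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best)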

Page 43

Video of Demo Crawler Bot

More demos at: http://inst.eecs.berkeley.edu/~ee128/fa11/videos.html

Page 44

Approximate Q-Learning

Page 45

Generalizing Across States

§ Basic Q-Learning keeps a table of all q-values

§ In realistic situations, we cannot possibly learn about every single state!
§ Too many states to visit them all in training
§ Too many states to hold the q-tables in memory

§ Instead, we want to generalize:
§ Learn about some small number of training states from experience
§ Generalize that experience to new, similar situations
§ This is a fundamental idea in machine learning, and we'll see it over and over again

[demo – RL pacman]

Page 46

Example: Pacman

Let's say we discover through experience that this state is bad:

In naïve q-learning, we know nothing about this state:

Page 47

Example: Pacman

Let's say we discover through experience that this state is bad:

Or even this one!

Page 48

Feature-Based Representations

Solution: describe a state using a vector of features (aka “properties”)

§ Features = functions from states to R (often 0/1) capturing important properties of the state

§ Example features:
§ Distance to closest ghost or dot
§ Number of ghosts
§ 1 / (dist to dot)²
§ Is Pacman in a tunnel? (0/1)
§ …… etc.
§ Is it the exact state on this slide?

§ Can also describe a q-state (s, a) with features (e.g. action moves closer to food)
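What might such features look like in code? A hypothetical Pacman-style extractor (every name and state method here is an illustrative assumption, not the course's actual API):

def features(s, a):
    """Map a q-state (s, a) to a small named feature vector."""
    s2 = s.successor(a)    # hypothetical: state reached by taking a in s
    return {
        "bias": 1.0,
        "closeness-to-dot": 1.0 / (1.0 + s2.closest_dot_distance()),
        "num-ghosts-1-step-away": float(s2.ghosts_one_step_away()),
        "eats-dot": 1.0 if s2.on_dot() else 0.0,
    }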

Page 49

Linear Combination of Features

§ Using a feature representation, we can write a q function (or value function) for any state using a few weights:

V(s) = w1·f1(s) + w2·f2(s) + … + wn·fn(s)
Q(s,a) = w1·f1(s,a) + w2·f2(s,a) + … + wn·fn(s,a)

§ Advantage: our experience is summed up in a few powerful numbers

§ Disadvantage: states sharing features may actually have very different values!

Page 50

Approximate Q-Learning

§ Q-learning with linear Q-functions:

Exact Q's:        Q(s,a) ← Q(s,a) + α (difference)
Approximate Q's:  For all i do: w_i ← w_i + α (difference) f_i(s,a)
where difference = [R(s,a,s') + γ max_{a'} Q(s',a')] − Q(s,a)

§ Intuitive interpretation:
§ Adjust weights of active features
§ E.g., if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state's features

§ Formal justification: in a few slides!
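A minimal sketch of this weight update, reusing the hypothetical features(s, a) extractor above (actions(s) lists legal actions; all names are assumptions):

def q_value(weights, features, s, a):
    # Q(s,a) = sum_i w_i * f_i(s,a)
    return sum(weights.get(name, 0.0) * f for name, f in features(s, a).items())

def approx_q_update(weights, features, actions, s, a, r, s2,
                    alpha=0.05, gamma=0.9):
    best_next = max(q_value(weights, features, s2, a2) for a2 in actions(s2))
    difference = (r + gamma * best_next) - q_value(weights, features, s, a)
    for name, f in features(s, a).items():   # adjust only active features
        weights[name] = weights.get(name, 0.0) + alpha * difference * f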

Page 51

Q-Learning

§ For all s, a:
§ Initialize Q(s,a) = 0

§ Repeat forever:
   Where are you? s.
   Choose some action a.
   Execute it in the real world: (s, a, r, s')
   Do update:
      difference ← [R(s,a,s') + γ max_{a'} Q(s',a')] − Q(s,a)
      Q(s,a) ← Q(s,a) + α (difference)

Page 52

§ For all i:
§ Initialize w_i = 0

§ Repeat forever:
   Where are you? s.
   Choose some action a.
   Execute it in the real world: (s, a, r, s')
   Do update:
      difference ← [R(s,a,s') + γ max_{a'} Q(s',a')] − Q(s,a)
      For all i: w_i ← w_i + α (difference) f_i(s,a)