CSE 573: Artificial Intelligence
courses.cs.washington.edu/courses/cse573/17wi/slides/11-rl.pdf
TRANSCRIPT
CSE 573: Artificial Intelligence
Reinforcement Learning
Dan Weld / University of Washington
[Many slides taken from Dan Klein and Pieter Abbeel / CS188 Intro to AI at UC Berkeley – materials available at http://ai.berkeley.edu.]
Logistics
§ PS3 due today
§ PS4 due in one week (Thurs 2/16)
§ Research paper comments due on Tues
§ Paper itself will be on Web calendar after class
Reinforcement Learning
§ Basic idea:
§ Receive feedback in the form of rewards
§ Agent's utility is defined by the reward function
§ Must (learn to) act so as to maximize expected rewards
§ All learning is based on observed samples of outcomes!
[Diagram: agent-environment loop; the Agent sends Actions: a, the Environment returns State: s and Reward: r]
Example: Animal Learning
§ RL studied experimentally for more than 60 years in psychology
§ Example: foraging
§ Rewards: food, pain, hunger, drugs, etc.
§ Mechanisms and sophistication debated
§ Bees learn near-optimal foraging plan in field of artificial flowers with controlled nectar supplies
§ Bees have a direct neural connection from nectar intake measurement to motor planning area
Example: Backgammon
§ Reward only for win / loss in terminal states, zero otherwise
§ TD-Gammon learns a function approximation to V(s) using a neural network
§ Combined with depth 3 search, one of the top 3 players in the world
§ You could imagine training Pacman this way…
§ … but it’s tricky! (It’s also PS 4)
Example: Learning to Walk
Initial [Video: AIBO WALK – initial] [Kohl and Stone, ICRA 2004]
Example: Learning to Walk
Finished [Video: AIBO WALK – finished] [Kohl and Stone, ICRA 2004]
Example: Sidewinding
[Andrew Ng] [Video: SNAKE – climbStep + sidewinding]
Parallel Parking
"Few driving tasks are as intimidating as parallel parking…."
https://www.youtube.com/watch?v=pB_iFY2jIdI
Other Applications
§ Go playing
§ Robotic control
§ helicopter maneuvering, autonomous vehicles
§ Mars rover - path planning, oversubscription planning
§ elevator planning
§ Game playing - backgammon, tetris, checkers
§ Neuroscience
§ Computational Finance, Sequential Auctions
§ Assisting elderly in simple tasks
§ Spoken dialog management
§ Communication Networks – switching, routing, flow control
§ War planning, evacuation planning
Reinforcement Learning
§ Still assume a Markov decision process (MDP):
§ A set of states s ∈ S
§ A set of actions (per state) A
§ A model T(s,a,s')
§ A reward function R(s,a,s') & discount γ
§ Still looking for a policy π(s)
§ New twist: don't know T or R
§ I.e. we don't know which states are good or what the actions do
§ Must actually try actions and states out to learn
Offline (MDPs) vs. Online (RL)
Offline Solution (Planning)
Online Learning (RL)
Monte Carlo Planning
Simulator
Diff: 1) dying is ok; 2) there is a (re)set button
Four Key Ideas for RL
§ Credit-Assignment Problem
§ What was the real cause of reward?
§ Exploration-exploitation tradeoff
§ Model-based vs model-free learning
§ What function is being learned?
§ Approximating the Value Function
§ Smaller → easier to learn & better generalization
Credit Assignment Problem
Exploration-Exploitation Tradeoff
§ You have visited part of the state space and found a reward of 100
§ is this the best you can hope for???
§ Exploitation: should I stick with what I know and find a good policy w.r.t. this knowledge?
§ at risk of missing out on a better reward somewhere
§ Exploration: should I look for states w/ more reward?
§ at risk of wasting time & getting some negative reward
Model-Based Learning
§ Model-Based Idea:
§ Learn an approximate model based on experiences
§ Solve for values as if the learned model were correct
§ Step 1: Learn empirical MDP model
§ Explore (e.g., move randomly)
§ Count outcomes s' for each s, a
§ Normalize to give an estimate of T̂(s,a,s')
§ Discover each R̂(s,a,s') when we experience (s,a,s')
§ Step 2: Solve the learned MDP
§ For example, use value iteration, as before
Example: Model-Based Learning
Random π. Assume: γ = 1
[Grid: states A, B, C, D, E]
Observed Episodes (Training):
Episode 1: B, east, C, -1; C, east, D, -1; D, exit, x, +10
Episode 2: B, east, C, -1; C, east, D, -1; D, exit, x, +10
Episode 3: E, north, C, -1; C, east, D, -1; D, exit, x, +10
Episode 4: E, north, C, -1; C, east, A, -1; A, exit, x, -10
Learned Model:
T(s,a,s'): T(B,east,C) = 1.00; T(C,east,D) = 0.75; T(C,east,A) = 0.25; …
R(s,a,s'): R(B,east,C) = -1; R(C,east,D) = -1; R(D,exit,x) = +10; …
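As a concrete check, here is a minimal Python sketch of Step 1 (count outcomes, then normalize) run on the four episodes above; the data structures are illustrative, not from the slides. It reproduces T(C,east,D) = 0.75 and T(C,east,A) = 0.25.

```python
# Learn an empirical MDP model from observed (s, a, s', r) transitions.
from collections import Counter, defaultdict

episodes = [
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10)],
]

counts = defaultdict(Counter)   # (s, a) -> Counter of observed next states s'
rewards = {}                    # (s, a, s') -> observed reward

for episode in episodes:
    for s, a, s2, r in episode:
        counts[(s, a)][s2] += 1
        rewards[(s, a, s2)] = r

# Normalize counts to give an estimate of T(s, a, s')
T_hat = {
    (s, a, s2): n / sum(c.values())
    for (s, a), c in counts.items()
    for s2, n in c.items()
}

print(T_hat[("C", "east", "D")])  # 0.75, matching the learned model above
print(T_hat[("C", "east", "A")])  # 0.25
```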
Convergence
§ If policy explores "enough" (doesn't starve any state)
§ Then T & R converge
§ So, VI, PI, LAO* etc. will find optimal policy
§ Using Bellman Equations
§ When can agent start exploiting?? (We'll answer this question later)
Two main reinforcement learning approaches
§ Model-based approaches:
§ explore environment & learn model, T = P(s'|s,a) and R(s,a), (almost) everywhere
§ use model to plan policy, MDP-style
§ approach leads to strongest theoretical results
§ often works well when state-space is manageable
§ Model-free approach:
§ don't learn a model of T & R; instead, learn Q-function (or policy) directly
§ weaker theoretical results
§ often works better when state space is large
Two main reinforcement learning approaches
§ Model-based approaches: Learn T + R
|S|²|A| + |S||A| parameters (40,400)
§ Model-free approach: Learn Q
|S||A| parameters (400)
Model-Free Learning
Nothing is Free in Life!
§ What exactly is Free???
§ No model of T
§ No model of R
§ (Instead, just model Q)
Reminder: Q-Value Iteration
[Backup diagram: state s, action a; Qk+1(s,a) at the (s,a) node; chance nodes (s,a,s') lead to s', where Vk(s') = Maxa' Qk(s',a')]
§ For all s, a: Initialize Q0(s,a) = 0 (no time steps left means an expected reward of zero)
§ k = 0
§ Repeat (do Bellman backups):
For every (s,a) pair: Qk+1(s,a) = Σs' T(s,a,s') [ R(s,a,s') + γ Maxa' Qk(s',a') ]
k += 1
§ Until convergence, i.e., Q values don't change much
This is easy…. We can sample this (the expectation over s').
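A minimal sketch of the backup loop above, assuming T is a dict mapping (s, a) to a distribution over next states and R maps (s, a, s') to a reward (hypothetical encodings, matching the learned-model example earlier):

```python
def q_value_iteration(states, actions, T, R, gamma=1.0, tol=1e-6):
    # Q0(s,a) = 0: no time steps left means an expected reward of zero
    Q = {(s, a): 0.0 for s in states for a in actions(s)}
    while True:
        Q_new = {}
        for (s, a) in Q:
            # Qk+1(s,a) = sum over s' of T(s,a,s') [R(s,a,s') + gamma * max_a' Qk(s',a')]
            Q_new[(s, a)] = sum(
                p * (R[(s, a, s2)]
                     + gamma * max((Q[(s2, a2)] for a2 in actions(s2)), default=0.0))
                for s2, p in T[(s, a)].items())
        # until convergence, i.e., Q-values don't change much
        if all(abs(Q_new[sa] - Q[sa]) < tol for sa in Q):
            return Q_new
        Q = Q_new
```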
Puzzle: Q-Learning
[Backup diagram: state s, action a; Qk+1(s,a) at the (s,a) node; chance nodes (s,a,s') lead to s', where Vk(s') = Maxa' Qk(s',a')]
§ For all s, a: Initialize Q0(s,a) = 0 (no time steps left means an expected reward of zero)
§ k = 0
§ Repeat (do Bellman backups):
For every (s,a) pair: Qk+1(s,a) = Σs' T(s,a,s') [ R(s,a,s') + γ Maxa' Qk(s',a') ]
k += 1
§ Until convergence, i.e., Q values don't change much
Q: How can we compute without R, T?!?
A: Compute averages using sampled outcomes
Simple Example: Expected Age
Goal: Compute expected age of CSE students
Known P(A): E[A] = Σa P(a) · a
Unknown P(A), "Model Based": without P(A), instead collect samples [a1, a2, … aN]; estimate P̂(a) = num(a)/N; then E[A] ≈ Σa P̂(a) · a
Why does this work? Because eventually you learn the right model.
Unknown P(A), "Model Free": E[A] ≈ (1/N) Σi ai
Why does this work? Because samples appear with the right frequencies.
Note: never know P(age=22)
Anytime Model-Free Expected Age
Goal: Compute expected age of CSE students
Unknown P(A): "Model Free"
Without P(A), instead collect samples [a1, a2, … aN]
Exact running average:
Let A = 0
Loop for i = 1 to ∞:
  ai ← ask "what is your age?"
  A ← (i-1)/i · A + (1/i) · ai
Exponential running average:
Let A = 0
Loop for i = 1 to ∞:
  ai ← ask "what is your age?"
  A ← (1-α) · A + α · ai
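A minimal Python sketch of the two loops above: with α = 1/i the estimate is always the exact mean of the samples seen so far, while a fixed α weighs recent samples more (useful if the environment changes).

```python
def exact_running_average(samples):
    A = 0.0
    for i, a_i in enumerate(samples, start=1):
        A = (i - 1) / i * A + (1 / i) * a_i   # A is the exact mean so far
    return A

def exponential_running_average(samples, alpha=0.1):
    A = 0.0
    for a_i in samples:
        A = (1 - alpha) * A + alpha * a_i     # recent samples weigh more
    return A

ages = [22, 25, 22, 27, 30]                   # hypothetical sampled ages
print(exact_running_average(ages))            # 25.2, the exact mean
print(exponential_running_average(ages))      # biased toward recent samples
```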
Sampling Q-Values
§ Big idea: learn from every experience!
§ Follow exploration policy a ← π(s)
§ Update Q(s,a) each time we experience a transition (s,a,s',r)
§ Likely outcomes s' will contribute updates more often
§ Update towards running average:
[Diagram: s, then action π(s) and reward r, then s']
Get a sample of Q(s,a): sample = R(s,a,s') + γ Maxa' Q(s',a')
Update to Q(s,a): Q(s,a) ← (1-α) Q(s,a) + (α) sample
Same update: Q(s,a) ← Q(s,a) + α (sample - Q(s,a))
Rearranging: Q(s,a) ← Q(s,a) + α (difference), where difference = (R(s,a,s') + γ Maxa' Q(s',a')) - Q(s,a)
Q-Learning
§ For all s, a: Initialize Q(s,a) = 0
§ Repeat Forever:
  Where are you? s.
  Choose some action a
  Execute it in real world: (s, a, r, s')
  Do update:
    difference ← [R(s,a,s') + γ Maxa' Q(s',a')] - Q(s,a)
    Q(s,a) ← Q(s,a) + α (difference)
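A minimal tabular sketch of this loop; `env` is a hypothetical environment with reset()/step(a) and is not from the slides. With α = ½, γ = 1 and all Q-values 0, a single (B, east, C, -2) transition drives Q(B, east) to -1, matching the worked example that follows.

```python
import random
from collections import defaultdict

def q_learning(env, actions, alpha=0.5, gamma=1.0, episodes=1000):
    Q = defaultdict(float)                      # Q(s,a), initialized to 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = random.choice(actions(s))       # random exploration policy, for now
            s2, r, done = env.step(a)           # execute it in the real world
            # difference = [R + gamma * max_a' Q(s',a')] - Q(s,a)
            target = r + gamma * max((Q[(s2, a2)] for a2 in actions(s2)),
                                     default=0.0)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```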
Example. Assume: γ = 1, α = ½
Observed Transition: B, east, C, -2
[Grid: states A, B, C, D, E with all Q-values 0, except an 8 at D]
In state B. What should you do?
Suppose (for now) we follow a random exploration policy
→ "Go east"
Example. Assume: γ = 1, α = ½
Observed Transition: B, east, C, -2
[Grids before and after the update: Q(B, east) goes from 0 to ?]
Q(B, east) ← (1 - ½) · 0 + ½ · (-2 + 1 · 0) = -1
Example. Assume: γ = 1, α = ½
Observed Transitions: B, east, C, -2 then C, east, D, -2
[Grids: Q(B, east) is now -1; the new transition updates Q(C, east) from 0 to ?]
Q(C, east) ← (1 - ½) · 0 + ½ · (-2 + 1 · 8) = 3
Example. Assume: γ = 1, α = ½
Observed Transitions: B, east, C, -2 then C, east, D, -2
[Final grid: Q(B, east) = -1, Q(C, east) = 3, the 8 at D unchanged; all other Q-values 0]
Q-Learning Properties
§ Q-learning converges to optimal Q function (and hence learns optimal policy)
§ even if you're acting suboptimally!
§ This is called off-policy learning
§ Caveats:
§ You have to explore enough
§ You have to eventually shrink the learning rate, α
§ … but not decrease it too quickly (one standard schedule is sketched below)
§ And… if you want to act optimally
§ You have to switch from explore to exploit
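For context, "shrink α but not too quickly" is the standard stochastic-approximation condition Σt αt = ∞ with Σt αt² < ∞. A minimal sketch of one common schedule satisfying it, using a per-(s,a) visit count (the helper name is hypothetical):

```python
from collections import defaultdict

visit_count = defaultdict(int)

def learning_rate(s, a):
    # alpha = 1/N(s,a) gives the sequence 1, 1/2, 1/3, ...:
    # sum(alpha) diverges but sum(alpha^2) converges.
    visit_count[(s, a)] += 1
    return 1.0 / visit_count[(s, a)]
```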
[Demo: Q-learning – auto – cliff grid (L11D1)]
Video of Demo Q-Learning Auto Cliff Grid
Q-Learning
§ For all s, a: Initialize Q(s,a) = 0
§ Repeat Forever:
  Where are you? s.
  Choose some action a
  Execute it in real world: (s, a, r, s')
  Do update:
    difference ← [R(s,a,s') + γ Maxa' Q(s',a')] - Q(s,a)
    Q(s,a) ← Q(s,a) + α (difference)
Exploration vs. Exploitation
Questions
§ How to explore?
§ Random Exploration
§ Uniform exploration
§ Epsilon Greedy (sketched below)
§ With (small) probability ε, act randomly
§ With (large) probability 1-ε, act on current policy
§ Exploration Functions (such as UCB)
§ Thompson Sampling
§ When to exploit?
§ How to even think about this tradeoff?
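A minimal epsilon-greedy action selector for the tradeoff above; names and signatures are illustrative.

```python
import random
from collections import defaultdict

Q = defaultdict(float)  # Q(s, a) table, defaulting to 0

def epsilon_greedy(s, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)              # with (small) probability epsilon: act randomly
    return max(actions, key=lambda a: Q[(s, a)])   # with (large) probability 1 - epsilon: act on current policy
```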
Exploration Functions
§ When to explore?
§ Random actions: explore a fixed amount
§ Better idea: explore areas whose badness is not (yet) established, eventually stop exploring
§ Exploration function
§ Takes a value estimate u and a visit count n, and returns an optimistic utility, e.g. f(u,n) = u + k/n
§ Note: this propagates the "bonus" back to states that lead to unknown states as well!
Regular Q-Update: Q(s,a) ← (1-α) Q(s,a) + α [R(s,a,s') + γ Maxa' Q(s',a')]
Modified Q-Update: Q(s,a) ← (1-α) Q(s,a) + α [R(s,a,s') + γ Maxa' f(Q(s',a'), N(s',a'))]
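A minimal sketch of the modified update, assuming the variant f(u, n) = u + k/(n + 1) so that unvisited pairs (n = 0) get a finite optimism bonus; the constant k and all names here are illustrative, not from the slides.

```python
from collections import defaultdict

Q = defaultdict(float)   # value estimates
N = defaultdict(int)     # visit counts

def f(u, n, k=2.0):
    # Optimistic utility: value estimate plus an exploration bonus
    # that fades as the visit count n grows.
    return u + k / (n + 1)

def modified_q_update(s, a, r, s2, actions, alpha=0.5, gamma=1.0):
    N[(s, a)] += 1
    target = r + gamma * max(f(Q[(s2, a2)], N[(s2, a2)]) for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```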
Video of Demo Crawler Bot
More demos at: http://inst.eecs.berkeley.edu/~ee128/fa11/videos.html
Approximate Q-Learning
Generalizing Across States
§ Basic Q-Learning keeps a table of all q-values
§ In realistic situations, we cannot possibly learn about every single state!
§ Too many states to visit them all in training
§ Too many states to hold the q-tables in memory
§ Instead, we want to generalize:
§ Learn about some small number of training states from experience
§ Generalize that experience to new, similar situations
§ This is a fundamental idea in machine learning, and we'll see it over and over again
[demo – RL pacman]
Example: Pacman
Let's say we discover through experience that this state is bad:
In naïve q-learning, we know nothing about this state:
Example: Pacman
Let's say we discover through experience that this state is bad:
Or even this one!
Feature-Based Representations
Solution: describe a state using a vector of features (aka "properties")
§ Features = functions from states to R (often 0/1) capturing important properties of the state
§ Example features:
§ Distance to closest ghost or dot
§ Number of ghosts
§ 1 / (dist to dot)²
§ Is Pacman in a tunnel? (0/1)
§ …… etc.
§ Is it the exact state on this slide?
§ Can also describe a q-state (s, a) with features (e.g. action moves closer to food)
Linear Combination of Features
§ Using a feature representation, we can write a q function (or value function) for any state using a few weights:
Q(s,a) = w1 f1(s,a) + w2 f2(s,a) + … + wn fn(s,a)
§ Advantage: our experience is summed up in a few powerful numbers
§ Disadvantage: states sharing features may actually have very different values!
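A minimal sketch of the weighted sum above; the feature functions are hypothetical stand-ins for the Pacman-style features on the previous slide.

```python
def q_value(weights, features, s, a):
    # Q(s,a) = w1*f1(s,a) + w2*f2(s,a) + ... + wn*fn(s,a)
    return sum(w * f(s, a) for w, f in zip(weights, features))

# Hypothetical features; a real agent would compute these from the game state.
features = [
    lambda s, a: 1.0,                            # bias feature, always 1
    lambda s, a: 1.0 / (s["dist_to_dot"] ** 2),  # 1 / (dist to dot)^2
]
weights = [0.0, 0.0]
print(q_value(weights, features, {"dist_to_dot": 3}, "east"))  # 0.0 until trained
```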
Approximate Q-Learning
§ Q-learning with linear Q-functions:
difference = [R(s,a,s') + γ Maxa' Q(s',a')] - Q(s,a)
Exact Q's: Q(s,a) ← Q(s,a) + α (difference)
Approximate Q's: For all i do: wi ← wi + α (difference) fi(s,a)
§ Intuitive interpretation:
§ Adjust weights of active features
§ E.g., if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state's features
§ Formal justification: in a few slides!
Q-Learning
§ For all s, a: Initialize Q(s,a) = 0
§ Repeat Forever:
  Where are you? s.
  Choose some action a
  Execute it in real world: (s, a, r, s')
  Do update:
    difference ← [R(s,a,s') + γ Maxa' Q(s',a')] - Q(s,a)
    Q(s,a) ← Q(s,a) + α (difference)
Approximate Q-Learning
§ For all i: Initialize wi = 0
§ Repeat Forever:
  Where are you? s.
  Choose some action a
  Execute it in real world: (s, a, r, s')
  Do update:
    difference ← [R(s,a,s') + γ Maxa' Q(s',a')] - Q(s,a)
    For all i: wi ← wi + α (difference) fi(s,a)
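A minimal sketch of the approximate update in the second loop above, in the same style as the linear-features sketch earlier; names and signatures are illustrative.

```python
def approximate_q_update(weights, features, s, a, r, s2, actions,
                         alpha=0.5, gamma=1.0):
    def Q(state, action):
        # Linear Q-function: sum of weighted feature activations.
        return sum(w * f(state, action) for w, f in zip(weights, features))

    # difference = [R(s,a,s') + gamma * max_a' Q(s',a')] - Q(s,a)
    difference = (r + gamma * max(Q(s2, a2) for a2 in actions)) - Q(s, a)

    # For all i: w_i <- w_i + alpha * difference * f_i(s,a)
    # (adjusts the weights of active features)
    for i, f in enumerate(features):
        weights[i] += alpha * difference * f(s, a)
```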