CSE 573: Artificial Intelligence
courses.cs.washington.edu/courses/cse573/17wi/slides/11-rl.pdf
TRANSCRIPT
CSE 573: Artificial Intelligence
Reinforcement Learning
Dan Weld / University of Washington
[Many slides taken from Dan Klein and Pieter Abbeel / CS188 Intro to AI at UC Berkeley – materials available at http://ai.berkeley.edu.]
Logistics
§ PS3 due today
§ PS4 due in one week (Thurs 2/16)
§ Research paper comments due on Tues
§ Paper itself will be on Web calendar after class
Reinforcement Learning
§ Basic idea:
§ Receive feedback in the form of rewards
§ Agent's utility is defined by the reward function
§ Must (learn to) act so as to maximize expected rewards
§ All learning is based on observed samples of outcomes!
[Diagram: agent-environment loop; the Agent sends Actions: a, the Environment returns State: s and Reward: r]
Example: Animal Learning
§ RL studied experimentally for more than 60 years in psychology
§ Example: foraging
§ Rewards: food, pain, hunger, drugs, etc.
§ Mechanisms and sophistication debated
§ Bees learn near-optimal foraging plan in field of artificial flowers with controlled nectar supplies
§ Bees have a direct neural connection from nectar intake measurement to motor planning area
Example: Backgammon
§ Reward only for win / loss in terminal states, zero otherwise
§ TD-Gammon learns a function approximation to V(s) using a neural network
§ Combined with depth 3 search, one of the top 3 players in the world
§ You could imagine training Pacman this way…
§ … but it’s tricky! (It’s also PS 4)
Example: Learning to Walk
Initial [Video: AIBO WALK – initial] [Kohl and Stone, ICRA 2004]
Example: Learning to Walk
Finished [Video: AIBO WALK – finished] [Kohl and Stone, ICRA 2004]
Example: Sidewinding
[Andrew Ng] [Video: SNAKE – climbStep + sidewinding]
Parallel Parking
"Few driving tasks are as intimidating as parallel parking…."
https://www.youtube.com/watch?v=pB_iFY2jIdI
Other Applications
§ Go playing
§ Robotic control
§ helicopter maneuvering, autonomous vehicles
§ Mars rover - path planning, oversubscription planning
§ elevator planning
§ Game playing - backgammon, tetris, checkers
§ Neuroscience
§ Computational Finance, Sequential Auctions
§ Assisting elderly in simple tasks
§ Spoken dialog management
§ Communication Networks – switching, routing, flow control
§ War planning, evacuation planning
Reinforcement Learning
§ Still assume a Markov decision process (MDP):
§ A set of states s ∈ S
§ A set of actions (per state) A
§ A model T(s,a,s')
§ A reward function R(s,a,s') & discount γ
§ Still looking for a policy π(s)
§ New twist: don't know T or R
§ I.e. we don't know which states are good or what the actions do
§ Must actually try actions and states out to learn
Offline (MDPs) vs. Online (RL)
Offline Solution (Planning)
Online Learning (RL)
Monte Carlo Planning
Simulator
Diff: 1) dying is ok; 2) there is a (re)set button
Four Key Ideas for RL
§ Credit-Assignment Problem
§ What was the real cause of reward?
§ Exploration-exploitation tradeoff
§ Model-based vs model-free learning
§ What function is being learned?
§ Approximating the Value Function
§ Smaller → easier to learn & better generalization
Credit Assignment Problem
Exploration-Exploitation Tradeoff
§ You have visited part of the state space and found a reward of 100
§ is this the best you can hope for???
§ Exploitation: should I stick with what I know and find a good policy w.r.t. this knowledge?
§ at risk of missing out on a better reward somewhere
§ Exploration: should I look for states w/ more reward?
§ at risk of wasting time & getting some negative reward
Model-Based Learning
§ Model-Based Idea:
§ Learn an approximate model based on experiences
§ Solve for values as if the learned model were correct
§ Step 1: Learn empirical MDP model
§ Explore (e.g., move randomly)
§ Count outcomes s' for each s, a
§ Normalize to give an estimate of T̂(s,a,s')
§ Discover each R̂(s,a,s') when we experience (s,a,s')
§ Step 2: Solve the learned MDP
§ For example, use value iteration, as before
Example: Model-Based Learning
Random π. Assume: γ = 1
[Grid: states A, B, C, D, E]
Observed Episodes (Training):
Episode 1: B, east, C, -1; C, east, D, -1; D, exit, x, +10
Episode 2: B, east, C, -1; C, east, D, -1; D, exit, x, +10
Episode 3: E, north, C, -1; C, east, D, -1; D, exit, x, +10
Episode 4: E, north, C, -1; C, east, A, -1; A, exit, x, -10
Learned Model:
T(s,a,s'): T(B,east,C) = 1.00; T(C,east,D) = 0.75; T(C,east,A) = 0.25; …
R(s,a,s'): R(B,east,C) = -1; R(C,east,D) = -1; R(D,exit,x) = +10; …
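As a concrete check, here is a minimal Python sketch of Step 1 (count outcomes, then normalize) run on the four episodes above; the data structures are illustrative, not from the slides. It reproduces T(C,east,D) = 0.75 and T(C,east,A) = 0.25.

```python
# Learn an empirical MDP model from observed (s, a, s', r) transitions.
from collections import Counter, defaultdict

episodes = [
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10)],
]

counts = defaultdict(Counter)   # (s, a) -> Counter of observed next states s'
rewards = {}                    # (s, a, s') -> observed reward

for episode in episodes:
    for s, a, s2, r in episode:
        counts[(s, a)][s2] += 1
        rewards[(s, a, s2)] = r

# Normalize counts to give an estimate of T(s, a, s')
T_hat = {
    (s, a, s2): n / sum(c.values())
    for (s, a), c in counts.items()
    for s2, n in c.items()
}

print(T_hat[("C", "east", "D")])  # 0.75, matching the learned model above
print(T_hat[("C", "east", "A")])  # 0.25
```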
Convergence
§ If policy explores "enough" (doesn't starve any state)
§ Then T & R converge
§ So, VI, PI, LAO* etc. will find optimal policy
§ Using Bellman Equations
§ When can agent start exploiting?? (We'll answer this question later)
Two main reinforcement learning approaches
§ Model-based approaches:
§ explore environment & learn model, T = P(s'|s,a) and R(s,a), (almost) everywhere
§ use model to plan policy, MDP-style
§ approach leads to strongest theoretical results
§ often works well when state-space is manageable
§ Model-free approach:
§ don't learn a model of T & R; instead, learn Q-function (or policy) directly
§ weaker theoretical results
§ often works better when state space is large
Two main reinforcement learning approaches
§ Model-based approaches: Learn T + R
|S|²|A| + |S||A| parameters (40,400)
§ Model-free approach: Learn Q
|S||A| parameters (400)
Model-Free Learning
Nothing is Free in Life!
§ What exactly is Free???
§ No model of T
§ No model of R
§ (Instead, just model Q)
Reminder: Q-Value Iteration
[Backup diagram: state s, action a; Qk+1(s,a) at the (s,a) node; chance nodes (s,a,s') lead to s', where Vk(s') = Maxa' Qk(s',a')]
§ For all s, a: Initialize Q0(s,a) = 0 (no time steps left means an expected reward of zero)
§ k = 0
§ Repeat (do Bellman backups):
For every (s,a) pair: Qk+1(s,a) = Σs' T(s,a,s') [ R(s,a,s') + γ Maxa' Qk(s',a') ]
k += 1
§ Until convergence, i.e., Q values don't change much
This is easy…. We can sample this (the expectation over s').
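A minimal sketch of the backup loop above, assuming T is a dict mapping (s, a) to a distribution over next states and R maps (s, a, s') to a reward (hypothetical encodings, matching the learned-model example earlier):

```python
def q_value_iteration(states, actions, T, R, gamma=1.0, tol=1e-6):
    # Q0(s,a) = 0: no time steps left means an expected reward of zero
    Q = {(s, a): 0.0 for s in states for a in actions(s)}
    while True:
        Q_new = {}
        for (s, a) in Q:
            # Qk+1(s,a) = sum over s' of T(s,a,s') [R(s,a,s') + gamma * max_a' Qk(s',a')]
            Q_new[(s, a)] = sum(
                p * (R[(s, a, s2)]
                     + gamma * max((Q[(s2, a2)] for a2 in actions(s2)), default=0.0))
                for s2, p in T[(s, a)].items())
        # until convergence, i.e., Q-values don't change much
        if all(abs(Q_new[sa] - Q[sa]) < tol for sa in Q):
            return Q_new
        Q = Q_new
```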
Puzzle: Q-Learning
[Backup diagram: state s, action a; Qk+1(s,a) at the (s,a) node; chance nodes (s,a,s') lead to s', where Vk(s') = Maxa' Qk(s',a')]
§ For all s, a: Initialize Q0(s,a) = 0 (no time steps left means an expected reward of zero)
§ k = 0
§ Repeat (do Bellman backups):
For every (s,a) pair: Qk+1(s,a) = Σs' T(s,a,s') [ R(s,a,s') + γ Maxa' Qk(s',a') ]
k += 1
§ Until convergence, i.e., Q values don't change much
Q: How can we compute without R, T?!?
A: Compute averages using sampled outcomes
Simple Example: Expected Age
Goal: Compute expected age of CSE students
Known P(A): E[A] = Σa P(a) · a
Unknown P(A), "Model Based": without P(A), instead collect samples [a1, a2, … aN]; estimate P̂(a) = num(a)/N; then E[A] ≈ Σa P̂(a) · a
Why does this work? Because eventually you learn the right model.
Unknown P(A), "Model Free": E[A] ≈ (1/N) Σi ai
Why does this work? Because samples appear with the right frequencies.
Note: never know P(age=22)
Anytime Model-Free Expected Age
Goal: Compute expected age of CSE students
Unknown P(A): "Model Free"
Without P(A), instead collect samples [a1, a2, … aN]
Exact running average:
Let A = 0
Loop for i = 1 to ∞:
  ai ← ask "what is your age?"
  A ← (i-1)/i · A + (1/i) · ai
Exponential running average:
Let A = 0
Loop for i = 1 to ∞:
  ai ← ask "what is your age?"
  A ← (1-α) · A + α · ai
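A minimal Python sketch of the two loops above: with α = 1/i the estimate is always the exact mean of the samples seen so far, while a fixed α weighs recent samples more (useful if the environment changes).

```python
def exact_running_average(samples):
    A = 0.0
    for i, a_i in enumerate(samples, start=1):
        A = (i - 1) / i * A + (1 / i) * a_i   # A is the exact mean so far
    return A

def exponential_running_average(samples, alpha=0.1):
    A = 0.0
    for a_i in samples:
        A = (1 - alpha) * A + alpha * a_i     # recent samples weigh more
    return A

ages = [22, 25, 22, 27, 30]                   # hypothetical sampled ages
print(exact_running_average(ages))            # 25.2, the exact mean
print(exponential_running_average(ages))      # biased toward recent samples
```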
Sampling Q-Values
§ Big idea: learn from every experience!
§ Follow exploration policy a ← π(s)
§ Update Q(s,a) each time we experience a transition (s,a,s',r)
§ Likely outcomes s' will contribute updates more often
§ Update towards running average:
[Diagram: s, then action π(s) and reward r, then s']
Get a sample of Q(s,a): sample = R(s,a,s') + γ Maxa' Q(s',a')
Update to Q(s,a): Q(s,a) ← (1-α) Q(s,a) + (α) sample
Same update: Q(s,a) ← Q(s,a) + α (sample - Q(s,a))
Rearranging: Q(s,a) ← Q(s,a) + α (difference), where difference = (R(s,a,s') + γ Maxa' Q(s',a')) - Q(s,a)
Q-Learning
§ For all s, a: Initialize Q(s,a) = 0
§ Repeat Forever:
  Where are you? s.
  Choose some action a
  Execute it in real world: (s, a, r, s')
  Do update:
    difference ← [R(s,a,s') + γ Maxa' Q(s',a')] - Q(s,a)
    Q(s,a) ← Q(s,a) + α (difference)
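A minimal tabular sketch of this loop; `env` is a hypothetical environment with reset()/step(a) and is not from the slides. With α = ½, γ = 1 and all Q-values 0, a single (B, east, C, -2) transition drives Q(B, east) to -1, matching the worked example that follows.

```python
import random
from collections import defaultdict

def q_learning(env, actions, alpha=0.5, gamma=1.0, episodes=1000):
    Q = defaultdict(float)                      # Q(s,a), initialized to 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = random.choice(actions(s))       # random exploration policy, for now
            s2, r, done = env.step(a)           # execute it in the real world
            # difference = [R + gamma * max_a' Q(s',a')] - Q(s,a)
            target = r + gamma * max((Q[(s2, a2)] for a2 in actions(s2)),
                                     default=0.0)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```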
Example. Assume: γ = 1, α = ½
Observed Transition: B, east, C, -2
[Grid: states A, B, C, D, E with all Q-values 0, except an 8 at D]
In state B. What should you do?
Suppose (for now) we follow a random exploration policy
→ "Go east"
Example. Assume: γ = 1, α = ½
Observed Transition: B, east, C, -2
[Grids before and after the update: Q(B, east) goes from 0 to ?]
Q(B, east) ← (1 - ½) · 0 + ½ · (-2 + 1 · 0) = -1
Example. Assume: γ = 1, α = ½
Observed Transitions: B, east, C, -2 then C, east, D, -2
[Grids: Q(B, east) is now -1; the new transition updates Q(C, east) from 0 to ?]
Q(C, east) ← (1 - ½) · 0 + ½ · (-2 + 1 · 8) = 3
Example. Assume: γ = 1, α = ½
Observed Transitions: B, east, C, -2 then C, east, D, -2
[Final grid: Q(B, east) = -1, Q(C, east) = 3, the 8 at D unchanged; all other Q-values 0]
Q-Learning Properties
§ Q-learning converges to optimal Q function (and hence learns optimal policy)
§ even if you're acting suboptimally!
§ This is called off-policy learning
§ Caveats:
§ You have to explore enough
§ You have to eventually shrink the learning rate, α
§ … but not decrease it too quickly (one standard schedule is sketched below)
§ And… if you want to act optimally
§ You have to switch from explore to exploit
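For context, "shrink α but not too quickly" is the standard stochastic-approximation condition Σt αt = ∞ with Σt αt² < ∞. A minimal sketch of one common schedule satisfying it, using a per-(s,a) visit count (the helper name is hypothetical):

```python
from collections import defaultdict

visit_count = defaultdict(int)

def learning_rate(s, a):
    # alpha = 1/N(s,a) gives the sequence 1, 1/2, 1/3, ...:
    # sum(alpha) diverges but sum(alpha^2) converges.
    visit_count[(s, a)] += 1
    return 1.0 / visit_count[(s, a)]
```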
[Demo: Q-learning – auto – cliff grid (L11D1)]
Video of Demo Q-Learning Auto Cliff Grid
Q-Learning
§ For all s, a: Initialize Q(s,a) = 0
§ Repeat Forever:
  Where are you? s.
  Choose some action a
  Execute it in real world: (s, a, r, s')
  Do update:
    difference ← [R(s,a,s') + γ Maxa' Q(s',a')] - Q(s,a)
    Q(s,a) ← Q(s,a) + α (difference)
Exploration vs. Exploitation
Questions
§ How to explore?
§ Random Exploration
§ Uniform exploration
§ Epsilon Greedy (sketched below)
§ With (small) probability ε, act randomly
§ With (large) probability 1-ε, act on current policy
§ Exploration Functions (such as UCB)
§ Thompson Sampling
§ When to exploit?
§ How to even think about this tradeoff?
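A minimal epsilon-greedy action selector for the tradeoff above; names and signatures are illustrative.

```python
import random
from collections import defaultdict

Q = defaultdict(float)  # Q(s, a) table, defaulting to 0

def epsilon_greedy(s, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)              # with (small) probability epsilon: act randomly
    return max(actions, key=lambda a: Q[(s, a)])   # with (large) probability 1 - epsilon: act on current policy
```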
Exploration Functions
§ When to explore?
§ Random actions: explore a fixed amount
§ Better idea: explore areas whose badness is not (yet) established, eventually stop exploring
§ Exploration function
§ Takes a value estimate u and a visit count n, and returns an optimistic utility, e.g. f(u,n) = u + k/n
§ Note: this propagates the "bonus" back to states that lead to unknown states as well!
Regular Q-Update: Q(s,a) ← (1-α) Q(s,a) + α [R(s,a,s') + γ Maxa' Q(s',a')]
Modified Q-Update: Q(s,a) ← (1-α) Q(s,a) + α [R(s,a,s') + γ Maxa' f(Q(s',a'), N(s',a'))]
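A minimal sketch of the modified update, assuming the variant f(u, n) = u + k/(n + 1) so that unvisited pairs (n = 0) get a finite optimism bonus; the constant k and all names here are illustrative, not from the slides.

```python
from collections import defaultdict

Q = defaultdict(float)   # value estimates
N = defaultdict(int)     # visit counts

def f(u, n, k=2.0):
    # Optimistic utility: value estimate plus an exploration bonus
    # that fades as the visit count n grows.
    return u + k / (n + 1)

def modified_q_update(s, a, r, s2, actions, alpha=0.5, gamma=1.0):
    N[(s, a)] += 1
    target = r + gamma * max(f(Q[(s2, a2)], N[(s2, a2)]) for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```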
Video of Demo Crawler Bot
More demos at: http://inst.eecs.berkeley.edu/~ee128/fa11/videos.html
Approximate Q-Learning
Generalizing Across States
§ Basic Q-Learning keeps a table of all q-values
§ In realistic situations, we cannot possibly learn about every single state!
§ Too many states to visit them all in training
§ Too many states to hold the q-tables in memory
§ Instead, we want to generalize:
§ Learn about some small number of training states from experience
§ Generalize that experience to new, similar situations
§ This is a fundamental idea in machine learning, and we'll see it over and over again
[demo – RL pacman]
Example: Pacman
Let's say we discover through experience that this state is bad:
In naïve q-learning, we know nothing about this state:
Example: Pacman
Let's say we discover through experience that this state is bad:
Or even this one!
Feature-Based Representations
Solution: describe a state using a vector of features (aka "properties")
§ Features = functions from states to R (often 0/1) capturing important properties of the state
§ Example features:
§ Distance to closest ghost or dot
§ Number of ghosts
§ 1 / (dist to dot)²
§ Is Pacman in a tunnel? (0/1)
§ …… etc.
§ Is it the exact state on this slide?
§ Can also describe a q-state (s, a) with features (e.g. action moves closer to food)
Linear Combination of Features
§ Using a feature representation, we can write a q function (or value function) for any state using a few weights:
Q(s,a) = w1 f1(s,a) + w2 f2(s,a) + … + wn fn(s,a)
§ Advantage: our experience is summed up in a few powerful numbers
§ Disadvantage: states sharing features may actually have very different values!
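A minimal sketch of the weighted sum above; the feature functions are hypothetical stand-ins for the Pacman-style features on the previous slide.

```python
def q_value(weights, features, s, a):
    # Q(s,a) = w1*f1(s,a) + w2*f2(s,a) + ... + wn*fn(s,a)
    return sum(w * f(s, a) for w, f in zip(weights, features))

# Hypothetical features; a real agent would compute these from the game state.
features = [
    lambda s, a: 1.0,                            # bias feature, always 1
    lambda s, a: 1.0 / (s["dist_to_dot"] ** 2),  # 1 / (dist to dot)^2
]
weights = [0.0, 0.0]
print(q_value(weights, features, {"dist_to_dot": 3}, "east"))  # 0.0 until trained
```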
Approximate Q-Learning
§ Q-learning with linear Q-functions:
difference = [R(s,a,s') + γ Maxa' Q(s',a')] - Q(s,a)
Exact Q's: Q(s,a) ← Q(s,a) + α (difference)
Approximate Q's: For all i do: wi ← wi + α (difference) fi(s,a)
§ Intuitive interpretation:
§ Adjust weights of active features
§ E.g., if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state's features
§ Formal justification: in a few slides!
Q-Learning
§ For all s, a: Initialize Q(s,a) = 0
§ Repeat Forever:
  Where are you? s.
  Choose some action a
  Execute it in real world: (s, a, r, s')
  Do update:
    difference ← [R(s,a,s') + γ Maxa' Q(s',a')] - Q(s,a)
    Q(s,a) ← Q(s,a) + α (difference)
Approximate Q-Learning
§ For all i: Initialize wi = 0
§ Repeat Forever:
  Where are you? s.
  Choose some action a
  Execute it in real world: (s, a, r, s')
  Do update:
    difference ← [R(s,a,s') + γ Maxa' Q(s',a')] - Q(s,a)
    For all i: wi ← wi + α (difference) fi(s,a)
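A minimal sketch of the approximate update in the second loop above, in the same style as the linear-features sketch earlier; names and signatures are illustrative.

```python
def approximate_q_update(weights, features, s, a, r, s2, actions,
                         alpha=0.5, gamma=1.0):
    def Q(state, action):
        # Linear Q-function: sum of weighted feature activations.
        return sum(w * f(state, action) for w, f in zip(weights, features))

    # difference = [R(s,a,s') + gamma * max_a' Q(s',a')] - Q(s,a)
    difference = (r + gamma * max(Q(s2, a2) for a2 in actions)) - Q(s, a)

    # For all i: w_i <- w_i + alpha * difference * f_i(s,a)
    # (adjusts the weights of active features)
    for i, f in enumerate(features):
        weights[i] += alpha * difference * f(s, a)
```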