Reinforcement Learning (slide transcript: Rhodes College, kirlinp/courses/ai/s17/lessons/rl/rl-slides.pdf)

Environments

• Fully-observable vs. partially-observable
• Single agent vs. multiple agents
• Deterministic vs. stochastic
• Episodic vs. sequential
• Static or dynamic
• Discrete or continuous

What is reinforcement learning?

• Three machine learning paradigms:
  – Supervised learning
  – Unsupervised learning (overlaps with data mining)
  – Reinforcement learning
• In reinforcement learning, the agent receives incremental pieces of feedback, called rewards, that it uses to judge whether it is acting correctly or not.

Examples of real-life RL

• Learning to play chess.
• Animals (or toddlers) learning to walk.
• Driving to school or work in the morning.
• Key idea: Most RL tasks are episodic, meaning they repeat many times.
  – So unlike in other AI problems where you have one shot to get it right, in RL it's OK to take time to try different things to see what's best.

n-armed bandit problem

• You have n slot machines.
• When you play a slot machine, it provides you a reward (negative or positive) according to some fixed probability distribution.
• Each machine may have a different probability distribution, and you don't know the distributions ahead of time.
• You want to maximize the amount of reward (money) you get.
• In what order, and how many times, do you play the machines?

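One common strategy for this problem can be sketched in code: track each machine's average payout and mostly play the best one so far, while occasionally exploring. The payoff distributions below are invented for illustration (and hidden from the agent), since the slides don't specify any.

```python
import random

# Epsilon-greedy play for a 3-armed bandit (a sketch; the machines'
# payoff means are made up and unknown to the agent).
random.seed(0)

def play(machine):
    means = [0.2, 1.0, 0.5]                 # hidden from the agent
    return random.gauss(means[machine], 0.5)

n = 3
totals, counts = [0.0] * n, [0] * n
for t in range(5000):
    if t < n:                               # play each machine once first
        m = t
    elif random.random() < 0.1:             # explore: pick a random machine
        m = random.randrange(n)
    else:                                   # exploit: best average so far
        m = max(range(n), key=lambda i: totals[i] / counts[i])
    r = play(m)
    totals[m] += r
    counts[m] += 1
```

With these settings the agent ends up playing machine 1, the best payer, for the large majority of its turns.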
RL problems

• Every RL problem is structured similarly.
• We have an environment, which consists of a set of states, and actions that can be taken in various states.
  – The environment is often stochastic (there is an element of chance).
• Our RL agent wishes to learn a policy, π, a function that maps states to actions.
  – π(s) tells you what action to take in state s.

What is the goal in RL?

• In other AI problems, the "goal" is to get to a certain state. Not in RL!
• An RL environment gives feedback every time the agent takes an action. This is called a reward.
  – Rewards are usually numbers.
  – Goal: The agent wants to maximize the amount of reward it gets over time.
  – Critical point: Rewards are given by the environment, not the agent.

Mathematics of rewards

• Assume our rewards are r_0, r_1, r_2, …
• What expression represents our total rewards?
• How do we maximize this? Is this a good idea?
• Use discounting: at each time step, the reward is discounted by a factor of γ (called the discount rate).
• Future rewards from time t = r_t + γ r_{t+1} + γ^2 r_{t+2} + ⋯ = Σ_{k=0}^{∞} γ^k r_{t+k}

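The discounted sum above can be checked numerically for a finite reward sequence (a minimal sketch; the reward values are invented):

```python
# Discounted return: sum of gamma^k * r_{t+k} over a finite reward
# sequence, following the formula above. Rewards are made up.
gamma = 0.9
rewards = [0, 0, -5, 10]
ret = sum(gamma**k * r for k, r in enumerate(rewards))
# 0 + 0 + 0.81*(-5) + 0.729*10 = 3.24
```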
Markov Decision Processes

• An MDP has a set of states, S, and a set of actions, A(s), for every state s in S.
• An MDP encodes the probability of transitioning from state s to state s' on action a: P(s' | s, a).
• RL also requires a reward function, usually denoted by R(s, a, s') = reward for being in state s, taking action a, and arriving in state s'.
• An MDP is a Markov chain that allows for outside actions to influence the transitions.

• Grass gives a reward of 0.
• Monster gives a reward of -5.
• Pot of gold gives a reward of +10 (and ends the game).
• Two actions are always available:
  – Action A: 50% chance of moving right 1 square, 50% chance of staying where you are.
  – Action B: 50% chance of moving right 2 squares, 50% chance of moving left 1 square.
  – Any movement that would take you off the board moves you as far in that direction as possible, or keeps you where you are.

Value functions

• Almost all RL algorithms are based around computing, estimating, or learning value functions.
• A value function represents the expected future reward from either a state, or a state-action pair.
  – Vπ(s): If we are in state s, and follow policy π, what is the total future reward we will see, on average?
  – Qπ(s, a): If we are in state s, and take action a, then follow policy π, what is the total future reward we will see, on average?

Optimal policies

• Given an MDP, there is always a "best" policy, called π*.
• The point of RL is to discover this policy by employing various algorithms.
  – Some algorithms can use sub-optimal policies to discover π*.
• We denote the value functions corresponding to the optimal policy by V*(s) and Q*(s, a).

Bellman equations

• The V*(s) and Q*(s, a) functions always satisfy certain recursive relationships for any MDP.
• These relationships, in the form of equations, are called the Bellman equations.

Recursive relationship of V* and Q*:

V*(s) = max_a Q*(s, a)

Q*(s, a) = Σ_{s'} P(s' | s, a) [R(s, a, s') + γ V*(s')]

The expected future reward from a state s equals the expected future reward obtained by choosing the best action from that state.

The expected future reward obtained by taking an action from a state is the weighted average of the expected future rewards from the possible new states.

Bellman equations

• No closed-form solution in general.
• Instead, most RL algorithms use these equations in various ways to estimate V* or Q*. An optimal policy can be derived from either V* or Q*.

V*(s) = max_a Σ_{s'} P(s' | s, a) [R(s, a, s') + γ V*(s')]

Q*(s, a) = Σ_{s'} P(s' | s, a) [R(s, a, s') + γ max_{a'} Q*(s', a')]

RL algorithms

• A main categorization of RL algorithms is whether or not they require a full model of the environment.
• In other words, do we know P(s' | s, a) and R(s, a, s') for all combinations of s, a, s'?
  – If we have this information (uncommon in the real world), we can estimate V* or Q* directly with very good accuracy.
  – If we don't have this information, we can estimate V* or Q* from experience or simulations.

Value iteration

• Value iteration is an algorithm that computes an optimal policy, given a full model of the environment.
• The algorithm is derived directly from the Bellman equations (usually for V*, but it can use Q* as well).

Value iteration

• Two steps:
• Estimate V(s) for every state.
  – For each state:
    • Simulate taking every possible action from that state and examine the probabilities of transitioning into every possible successor state. Weight the rewards you would receive by the probabilities that you receive them.
    • Find the action that gave you the most reward, and remember how much reward it was.
• Compute the optimal policy by doing the first step again, but this time remember the actions that give you the most reward, not the reward itself.

Value iteration

• Value iteration maintains a table of V values, one for each state. Each value V[s] eventually converges to the true value V*(s).

• Grass gives a reward of 0.
• Monster gives a reward of -5.
• Pot of gold gives a reward of +10 (and ends the game).
• Two actions are always available:
  – Action A: 50% chance of moving right 1 square, 50% chance of staying where you are.
  – Action B: 50% chance of moving right 2 squares, 50% chance of moving left 1 square.
  – Any movement that would take you off the board moves you as far in that direction as possible, or keeps you where you are.
• γ (gamma) = 0.9

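The example above can be run as code. The board image itself isn't in this transcript, but a four-square board of [grass, grass, monster, pot of gold] reproduces the V values the next slides report, so the sketch below assumes that layout.

```python
# Value iteration on the slides' grid world (board layout reconstructed
# as [grass, grass, monster, gold]; the gold square is terminal).
GAMMA = 0.9
REWARD = [0, 0, -5, 10]      # reward for ARRIVING in each square
TERMINAL = 3                 # pot of gold ends the game

def transitions(s, a):
    """List of (probability, next_square) pairs for action a in square s."""
    clamp = lambda x: max(0, min(3, x))      # can't move off the board
    if a == 'A':                             # 50% right 1, 50% stay
        return [(0.5, clamp(s + 1)), (0.5, s)]
    return [(0.5, clamp(s + 2)), (0.5, clamp(s - 1))]   # action B

V = [0.0] * 4
for _ in range(200):         # sweep until the values stop changing
    V = [0.0 if s == TERMINAL else
         max(sum(p * (REWARD[nxt] + GAMMA * V[nxt])
                 for p, nxt in transitions(s, a))
             for a in 'AB')
         for s in range(4)]
```

The values settle at roughly 6.47, 7.91, 8.56, and 0, matching the slide that follows.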
V[s] values converge to: 6.47, 7.91, 8.56, 0

How do we use these to compute π(s)?

Computing an optimal policy from V[s]

• Last step of the value iteration algorithm:

π(s) = argmax_a Σ_{s'} P(s' | s, a) [R(s, a, s') + γ V[s']]

• In other words, run one last time through the value iteration equation for each state, and pick the action a for each state s that maximizes the expected reward.

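As a sketch, the argmax above can be applied to the converged V values from the grid example (again assuming the reconstructed board layout of grass, grass, monster, gold):

```python
# Greedy policy extraction from converged V values for the grid example
# (board layout reconstructed as [grass, grass, monster, gold]).
GAMMA = 0.9
REWARD = [0, 0, -5, 10]      # reward for arriving in each square
V = [6.47, 7.91, 8.56, 0.0]  # converged values from the slide

def transitions(s, a):
    clamp = lambda x: max(0, min(3, x))
    if a == 'A':                             # 50% right 1, 50% stay
        return [(0.5, clamp(s + 1)), (0.5, s)]
    return [(0.5, clamp(s + 2)), (0.5, clamp(s - 1))]   # action B

def expected(s, a):
    return sum(p * (REWARD[nxt] + GAMMA * V[nxt])
               for p, nxt in transitions(s, a))

policy = [max('AB', key=lambda a: expected(s, a)) for s in range(3)]
# square 3 (the gold) is terminal, so no action is needed there
```

This recovers the policy A, B, B shown on the next slide.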
V[s] values converge to: 6.47, 7.91, 8.56, 0

Optimal policy: A, B, B, – (no action needed in the terminal square)

Review

• Value iteration requires a perfect model of the environment.
  – You need to know P(s' | s, a) and R(s, a, s') ahead of time for all combinations of s, a, and s'.
  – Optimal V or Q values are computed directly from the environment using the Bellman equations.
• Often impossible or impractical.

Simple Blackjack

• Costs $5 to play.
• Infinite deck of shuffled cards, labeled 1, 2, 3.
• You start with no cards. At every turn, you can either "hit" (take a card) or "stay" (end the game). Your goal is to get to a sum of 6 without going over; if you go over, you lose the game.
• You make all your decisions first, then the dealer plays the same game.
• If your sum is higher than the dealer's, you win $10 (your original $5 back, plus another $5). If lower, you lose (your original $5). If the same, draw (get your $5 back).

Simple Blackjack

• To set this up as an MDP, we need to remove the 2nd player (the dealer) from the MDP.
• Usually at casinos, dealers have simple rules they have to follow anyway about when to hit and when to stay.
• Is it ever optimal to "stay" from S0–S3?
• Assume that on average, if we "stay" from:
  – S4, we win $3 (net −$2).
  – S5, we win $6 (net $1).
  – S6, we win $7 (net $2).
• Do you even want to play this game?

Simple Blackjack

• What should gamma be?
• Assume we have finished one round of value iteration.
• Complete the second round of value iteration for S1–S6.

Learning from experience

• What if we don't know the exact model of the environment, but we are allowed to sample from it?
  – That is, we are allowed to "practice" the MDP as much as we want.
  – This echoes real-life experience.
• One way to do this is temporal difference learning.

Temporal difference learning

• We want to compute V(s) or Q(s, a).
• TD learning uses the idea of taking lots of samples of V or Q (from the MDP) and averaging them to get a good estimate.
• Let's see how TD learning works.

Example: Time to drive home

• Suppose for ten days I record how long it takes me to drive home after work.
• On the eleventh day, what should I predict my travel time home to be?

Example: Time to drive home

• Basic TD equation: V(s) = V(s) + α(reward − V(s))
• But what if our reward comes in pieces, not all at once?
  – total reward = one-step reward + rest of reward
  – total reward = r_t + γ V(s')
• V(s) = V(s) + α[r_t + γ V(s') − V(s)]

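The basic TD equation is just an incremental average: each new observation nudges the estimate toward it by a fraction α. A minimal sketch with made-up drive times:

```python
# Incremental-average form of the basic TD update: each new observation
# pulls the estimate toward it by a step of alpha. Times are invented.
alpha = 0.1
estimate = 30.0                        # initial guess: a 30-minute drive
for observed in [32, 35, 28, 40, 31]:  # hypothetical recorded drive times
    estimate += alpha * (observed - estimate)
# the estimate drifts toward the observed times (about 31.3 here)
```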
Q-learning

• Q-learning is a temporal difference learning algorithm that learns optimal values for Q (instead of V, as value iteration did).
• The algorithm works in episodes, where the agent "practices" (aka samples) the MDP to learn which actions obtain the most rewards.
• Like value iteration, the table of Q values eventually converges to Q* (under certain conditions).

• Notice the Q[s, a] update equation is very similar to the driving time update equation.
  – (The extra γ max_{a'} Q[s', a'] piece is to handle future rewards.)
  – alpha (0 < α ≤ 1) is called the learning rate; it controls how fast the algorithm learns. In stochastic environments, alpha is usually small, such as 0.1.

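A single step of that update can be sketched on a dict-backed Q table (the state and action names below are placeholders):

```python
# One Q-learning update step: Q[s,a] += alpha * (r + gamma*max_a' Q[s',a'] - Q[s,a]).
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)

Q = {}
q_update(Q, 's1', 'A', -5, 's2', ['A', 'B'])
# Q[('s1','A')] moves from 0 toward the sample: 0 + 0.1*(-5 + 0 - 0) = -0.5
```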
• Note: The "choose action" step does not mean you choose the best action according to your table of Q values.
• You must balance exploration and exploitation; like in the real world, the algorithm learns best when you "practice" the best policy often, but sometimes explore other actions that may be better in the long run.

• Often the "choose action" step uses a policy that mostly exploits but sometimes explores.
• One common idea (the epsilon-greedy policy):
  – With probability 1 − ε, pick the best action (the "a" that maximizes Q[s, a]).
  – With probability ε, pick a random action.
• It is also common to start with a large ε and decrease it over time while learning.

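Epsilon-greedy selection over a Q table can be sketched as follows (the states, actions, and Q values are placeholders):

```python
import random

# Epsilon-greedy: exploit the best-looking action with probability
# 1 - epsilon, explore a uniformly random action otherwise.
random.seed(0)

def choose_action(Q, s, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)               # explore
    return max(actions, key=lambda a: Q[(s, a)])    # exploit

Q = {('S0', 'hit'): 1.2, ('S0', 'stay'): -0.5}
picks = [choose_action(Q, 'S0', ['hit', 'stay']) for _ in range(1000)]
# 'hit' is chosen roughly 95% of the time: 90% exploit plus half of the explores
```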
• What makes Q-learning so amazing is that the Q-values still converge to the optimal Q* values even though the algorithm itself is not following the optimal policy!

Q-learning with Blackjack

• Update formula:

Q[s, a] ← Q[s, a] + α [r + γ max_{a'} Q[s', a'] − Q[s, a]]

• Sample episodes (states and actions):
  S0 → Hit → S3 → Stay → End
  S0 → Hit → S3 → Hit → S6 → Stay → End
  S0 → Hit → S3 → Hit → S5 → Stay → End

2-Player Q-learning

Normal update equation:

Q[s, a] ← Q[s, a] + α [r + γ max_{a'} Q[s', a'] − Q[s, a]]

Normally we always maximize our rewards. Consider 2-player Q-learning with player A maximizing and player B minimizing (as in minimax).

Why does this break the update equation?

2-Player Q-learning

Player A's update equation:

Q[s, a] ← Q[s, a] + α [r + γ max_{a'} Q[s', a'] − Q[s, a]]

Player B's update equation:

Q[s, a] ← Q[s, a] + α [r + γ min_{a'} Q[s', a'] − Q[s, a]]

Player A's optimal policy output: π(s) = argmax_a Q[s, a]

Player B's optimal policy output: π(s) = argmin_a Q[s, a]