Reinforcement Learning (slide transcript: Rhodes College, kirlinp/courses/ai/s17/lessons/rl/rl-slides.pdf)

Environments

• Fully-observable vs. partially-observable
• Single agent vs. multiple agents
• Deterministic vs. stochastic
• Episodic vs. sequential
• Static or dynamic
• Discrete or continuous

What is reinforcement learning?

• Three machine learning paradigms:
  – Supervised learning
  – Unsupervised learning (overlaps with data mining)
  – Reinforcement learning
• In reinforcement learning, the agent receives incremental pieces of feedback, called rewards, that it uses to judge whether it is acting correctly or not.

Examples of real-life RL

• Learning to play chess.
• Animals (or toddlers) learning to walk.
• Driving to school or work in the morning.
• Key idea: Most RL tasks are episodic, meaning they repeat many times.
  – So unlike in other AI problems where you have one shot to get it right, in RL it's OK to take time to try different things to see what's best.

n-armed bandit problem

• You have n slot machines.
• When you play a slot machine, it provides you a reward (negative or positive) according to some fixed probability distribution.
• Each machine may have a different probability distribution, and you don't know the distributions ahead of time.
• You want to maximize the amount of reward (money) you get.
• In what order, and how many times, do you play the machines?

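One common strategy for this problem can be sketched in code: track each machine's average payout and mostly play the best one so far, while occasionally exploring. The payoff distributions below are invented for illustration (and hidden from the agent), since the slides don't specify any.

```python
import random

# Epsilon-greedy play for a 3-armed bandit (a sketch; the machines'
# payoff means are made up and unknown to the agent).
random.seed(0)

def play(machine):
    means = [0.2, 1.0, 0.5]                 # hidden from the agent
    return random.gauss(means[machine], 0.5)

n = 3
totals, counts = [0.0] * n, [0] * n
for t in range(5000):
    if t < n:                               # play each machine once first
        m = t
    elif random.random() < 0.1:             # explore: pick a random machine
        m = random.randrange(n)
    else:                                   # exploit: best average so far
        m = max(range(n), key=lambda i: totals[i] / counts[i])
    r = play(m)
    totals[m] += r
    counts[m] += 1
```

With these settings the agent ends up playing machine 1, the best payer, for the large majority of its turns.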
RL problems

• Every RL problem is structured similarly.
• We have an environment, which consists of a set of states, and actions that can be taken in various states.
  – The environment is often stochastic (there is an element of chance).
• Our RL agent wishes to learn a policy, π, a function that maps states to actions.
  – π(s) tells you what action to take in state s.

What is the goal in RL?

• In other AI problems, the "goal" is to get to a certain state. Not in RL!
• An RL environment gives feedback every time the agent takes an action. This is called a reward.
  – Rewards are usually numbers.
  – Goal: The agent wants to maximize the amount of reward it gets over time.
  – Critical point: Rewards are given by the environment, not the agent.

Mathematics of rewards

• Assume our rewards are r_0, r_1, r_2, …
• What expression represents our total rewards?
• How do we maximize this? Is this a good idea?
• Use discounting: at each time step, the reward is discounted by a factor of γ (called the discount rate).
• Future rewards from time t = r_t + γ r_{t+1} + γ^2 r_{t+2} + ⋯ = Σ_{k=0}^{∞} γ^k r_{t+k}

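The discounted sum above can be checked numerically for a finite reward sequence (a minimal sketch; the reward values are invented):

```python
# Discounted return: sum of gamma^k * r_{t+k} over a finite reward
# sequence, following the formula above. Rewards are made up.
gamma = 0.9
rewards = [0, 0, -5, 10]
ret = sum(gamma**k * r for k, r in enumerate(rewards))
# 0 + 0 + 0.81*(-5) + 0.729*10 = 3.24
```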
Markov Decision Processes

• An MDP has a set of states, S, and a set of actions, A(s), for every state s in S.
• An MDP encodes the probability of transitioning from state s to state s' on action a: P(s' | s, a).
• RL also requires a reward function, usually denoted by R(s, a, s') = reward for being in state s, taking action a, and arriving in state s'.
• An MDP is a Markov chain that allows for outside actions to influence the transitions.

• Grass gives a reward of 0.
• Monster gives a reward of -5.
• Pot of gold gives a reward of +10 (and ends the game).
• Two actions are always available:
  – Action A: 50% chance of moving right 1 square, 50% chance of staying where you are.
  – Action B: 50% chance of moving right 2 squares, 50% chance of moving left 1 square.
  – Any movement that would take you off the board moves you as far in that direction as possible, or keeps you where you are.

Value functions

• Almost all RL algorithms are based around computing, estimating, or learning value functions.
• A value function represents the expected future reward from either a state, or a state-action pair.
  – Vπ(s): If we are in state s, and follow policy π, what is the total future reward we will see, on average?
  – Qπ(s, a): If we are in state s, and take action a, then follow policy π, what is the total future reward we will see, on average?

Optimal policies

• Given an MDP, there is always a "best" policy, called π*.
• The point of RL is to discover this policy by employing various algorithms.
  – Some algorithms can use sub-optimal policies to discover π*.
• We denote the value functions corresponding to the optimal policy by V*(s) and Q*(s, a).

Bellman equations

• The V*(s) and Q*(s, a) functions always satisfy certain recursive relationships for any MDP.
• These relationships, in the form of equations, are called the Bellman equations.

Recursive relationship of V* and Q*:

V*(s) = max_a Q*(s, a)

Q*(s, a) = Σ_{s'} P(s' | s, a) [R(s, a, s') + γ V*(s')]

The expected future reward from a state s equals the expected future reward obtained by choosing the best action from that state.

The expected future reward obtained by taking an action from a state is the weighted average of the expected future rewards from the possible new states.

Bellman equations

• No closed-form solution in general.
• Instead, most RL algorithms use these equations in various ways to estimate V* or Q*. An optimal policy can be derived from either V* or Q*.

V*(s) = max_a Σ_{s'} P(s' | s, a) [R(s, a, s') + γ V*(s')]

Q*(s, a) = Σ_{s'} P(s' | s, a) [R(s, a, s') + γ max_{a'} Q*(s', a')]

RL algorithms

• A main categorization of RL algorithms is whether or not they require a full model of the environment.
• In other words, do we know P(s' | s, a) and R(s, a, s') for all combinations of s, a, s'?
  – If we have this information (uncommon in the real world), we can estimate V* or Q* directly with very good accuracy.
  – If we don't have this information, we can estimate V* or Q* from experience or simulations.

Value iteration

• Value iteration is an algorithm that computes an optimal policy, given a full model of the environment.
• The algorithm is derived directly from the Bellman equations (usually for V*, but it can use Q* as well).

Value iteration

• Two steps:
• Estimate V(s) for every state.
  – For each state:
    • Simulate taking every possible action from that state and examine the probabilities of transitioning into every possible successor state. Weight the rewards you would receive by the probabilities that you receive them.
    • Find the action that gave you the most reward, and remember how much reward it was.
• Compute the optimal policy by doing the first step again, but this time remember the actions that give you the most reward, not the reward itself.

Value iteration

• Value iteration maintains a table of V values, one for each state. Each value V[s] eventually converges to the true value V*(s).

• Grass gives a reward of 0.
• Monster gives a reward of -5.
• Pot of gold gives a reward of +10 (and ends the game).
• Two actions are always available:
  – Action A: 50% chance of moving right 1 square, 50% chance of staying where you are.
  – Action B: 50% chance of moving right 2 squares, 50% chance of moving left 1 square.
  – Any movement that would take you off the board moves you as far in that direction as possible, or keeps you where you are.
• γ (gamma) = 0.9

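The example above can be run as code. The board image itself isn't in this transcript, but a four-square board of [grass, grass, monster, pot of gold] reproduces the V values the next slides report, so the sketch below assumes that layout.

```python
# Value iteration on the slides' grid world (board layout reconstructed
# as [grass, grass, monster, gold]; the gold square is terminal).
GAMMA = 0.9
REWARD = [0, 0, -5, 10]      # reward for ARRIVING in each square
TERMINAL = 3                 # pot of gold ends the game

def transitions(s, a):
    """List of (probability, next_square) pairs for action a in square s."""
    clamp = lambda x: max(0, min(3, x))      # can't move off the board
    if a == 'A':                             # 50% right 1, 50% stay
        return [(0.5, clamp(s + 1)), (0.5, s)]
    return [(0.5, clamp(s + 2)), (0.5, clamp(s - 1))]   # action B

V = [0.0] * 4
for _ in range(200):         # sweep until the values stop changing
    V = [0.0 if s == TERMINAL else
         max(sum(p * (REWARD[nxt] + GAMMA * V[nxt])
                 for p, nxt in transitions(s, a))
             for a in 'AB')
         for s in range(4)]
```

The values settle at roughly 6.47, 7.91, 8.56, and 0, matching the slide that follows.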
V[s] values converge to: 6.47, 7.91, 8.56, 0

How do we use these to compute π(s)?

Computing an optimal policy from V[s]

• Last step of the value iteration algorithm:

π(s) = argmax_a Σ_{s'} P(s' | s, a) [R(s, a, s') + γ V[s']]

• In other words, run one last time through the value iteration equation for each state, and pick the action a for each state s that maximizes the expected reward.

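As a sketch, the argmax above can be applied to the converged V values from the grid example (again assuming the reconstructed board layout of grass, grass, monster, gold):

```python
# Greedy policy extraction from converged V values for the grid example
# (board layout reconstructed as [grass, grass, monster, gold]).
GAMMA = 0.9
REWARD = [0, 0, -5, 10]      # reward for arriving in each square
V = [6.47, 7.91, 8.56, 0.0]  # converged values from the slide

def transitions(s, a):
    clamp = lambda x: max(0, min(3, x))
    if a == 'A':                             # 50% right 1, 50% stay
        return [(0.5, clamp(s + 1)), (0.5, s)]
    return [(0.5, clamp(s + 2)), (0.5, clamp(s - 1))]   # action B

def expected(s, a):
    return sum(p * (REWARD[nxt] + GAMMA * V[nxt])
               for p, nxt in transitions(s, a))

policy = [max('AB', key=lambda a: expected(s, a)) for s in range(3)]
# square 3 (the gold) is terminal, so no action is needed there
```

This recovers the policy A, B, B shown on the next slide.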
V[s] values converge to: 6.47, 7.91, 8.56, 0

Optimal policy: A, B, B, – (no action needed in the terminal square)

Review

• Value iteration requires a perfect model of the environment.
  – You need to know P(s' | s, a) and R(s, a, s') ahead of time for all combinations of s, a, and s'.
  – Optimal V or Q values are computed directly from the environment using the Bellman equations.
• Often impossible or impractical.

Simple Blackjack

• Costs $5 to play.
• Infinite deck of shuffled cards, labeled 1, 2, 3.
• You start with no cards. At every turn, you can either "hit" (take a card) or "stay" (end the game). Your goal is to get to a sum of 6 without going over; if you go over, you lose the game.
• You make all your decisions first, then the dealer plays the same game.
• If your sum is higher than the dealer's, you win $10 (your original $5 back, plus another $5). If lower, you lose (your original $5). If the same, draw (get your $5 back).

Simple Blackjack

• To set this up as an MDP, we need to remove the 2nd player (the dealer) from the MDP.
• Usually at casinos, dealers have simple rules they have to follow anyway about when to hit and when to stay.
• Is it ever optimal to "stay" from S0–S3?
• Assume that on average, if we "stay" from:
  – S4, we win $3 (net −$2).
  – S5, we win $6 (net $1).
  – S6, we win $7 (net $2).
• Do you even want to play this game?

Simple Blackjack

• What should gamma be?
• Assume we have finished one round of value iteration.
• Complete the second round of value iteration for S1–S6.

Learning from experience

• What if we don't know the exact model of the environment, but we are allowed to sample from it?
  – That is, we are allowed to "practice" the MDP as much as we want.
  – This echoes real-life experience.
• One way to do this is temporal difference learning.

Temporal difference learning

• We want to compute V(s) or Q(s, a).
• TD learning uses the idea of taking lots of samples of V or Q (from the MDP) and averaging them to get a good estimate.
• Let's see how TD learning works.

Example: Time to drive home

• Suppose for ten days I record how long it takes me to drive home after work.
• On the eleventh day, what should I predict my travel time home to be?

Example: Time to drive home

• Basic TD equation: V(s) = V(s) + α(reward − V(s))
• But what if our reward comes in pieces, not all at once?
  – total reward = one-step reward + rest of reward
  – total reward = r_t + γ V(s')
• V(s) = V(s) + α[r_t + γ V(s') − V(s)]

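The basic TD equation is just an incremental average: each new observation nudges the estimate toward it by a fraction α. A minimal sketch with made-up drive times:

```python
# Incremental-average form of the basic TD update: each new observation
# pulls the estimate toward it by a step of alpha. Times are invented.
alpha = 0.1
estimate = 30.0                        # initial guess: a 30-minute drive
for observed in [32, 35, 28, 40, 31]:  # hypothetical recorded drive times
    estimate += alpha * (observed - estimate)
# the estimate drifts toward the observed times (about 31.3 here)
```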
Q-learning

• Q-learning is a temporal difference learning algorithm that learns optimal values for Q (instead of V, as value iteration did).
• The algorithm works in episodes, where the agent "practices" (aka samples) the MDP to learn which actions obtain the most rewards.
• Like value iteration, the table of Q values eventually converges to Q* (under certain conditions).

• Notice the Q[s, a] update equation is very similar to the driving time update equation.
  – (The extra γ max_{a'} Q[s', a'] piece is to handle future rewards.)
  – alpha (0 < α ≤ 1) is called the learning rate; it controls how fast the algorithm learns. In stochastic environments, alpha is usually small, such as 0.1.

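A single step of that update can be sketched on a dict-backed Q table (the state and action names below are placeholders):

```python
# One Q-learning update step: Q[s,a] += alpha * (r + gamma*max_a' Q[s',a'] - Q[s,a]).
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)

Q = {}
q_update(Q, 's1', 'A', -5, 's2', ['A', 'B'])
# Q[('s1','A')] moves from 0 toward the sample: 0 + 0.1*(-5 + 0 - 0) = -0.5
```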
• Note: The "choose action" step does not mean you choose the best action according to your table of Q values.
• You must balance exploration and exploitation; like in the real world, the algorithm learns best when you "practice" the best policy often, but sometimes explore other actions that may be better in the long run.

• Often the "choose action" step uses a policy that mostly exploits but sometimes explores.
• One common idea (the epsilon-greedy policy):
  – With probability 1 − ε, pick the best action (the "a" that maximizes Q[s, a]).
  – With probability ε, pick a random action.
• It is also common to start with a large ε and decrease it over time while learning.

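Epsilon-greedy selection over a Q table can be sketched as follows (the states, actions, and Q values are placeholders):

```python
import random

# Epsilon-greedy: exploit the best-looking action with probability
# 1 - epsilon, explore a uniformly random action otherwise.
random.seed(0)

def choose_action(Q, s, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)               # explore
    return max(actions, key=lambda a: Q[(s, a)])    # exploit

Q = {('S0', 'hit'): 1.2, ('S0', 'stay'): -0.5}
picks = [choose_action(Q, 'S0', ['hit', 'stay']) for _ in range(1000)]
# 'hit' is chosen roughly 95% of the time: 90% exploit plus half of the explores
```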
• What makes Q-learning so amazing is that the Q-values still converge to the optimal Q* values even though the algorithm itself is not following the optimal policy!

Q-learning with Blackjack

• Update formula:

Q[s, a] ← Q[s, a] + α [r + γ max_{a'} Q[s', a'] − Q[s, a]]

• Sample episodes (states and actions):
  S0 → Hit → S3 → Stay → End
  S0 → Hit → S3 → Hit → S6 → Stay → End
  S0 → Hit → S3 → Hit → S5 → Stay → End

2-Player Q-learning

Normal update equation:

Q[s, a] ← Q[s, a] + α [r + γ max_{a'} Q[s', a'] − Q[s, a]]

Normally we always maximize our rewards. Consider 2-player Q-learning with player A maximizing and player B minimizing (as in minimax).

Why does this break the update equation?

2-Player Q-learning

Player A's update equation:

Q[s, a] ← Q[s, a] + α [r + γ max_{a'} Q[s', a'] − Q[s, a]]

Player B's update equation:

Q[s, a] ← Q[s, a] + α [r + γ min_{a'} Q[s', a'] − Q[s, a]]

Player A's optimal policy output: π(s) = argmax_a Q[s, a]

Player B's optimal policy output: π(s) = argmin_a Q[s, a]