Trust Region Policy Optimization (TRPO) - SJSU, 2018-04-17
Trust Region Policy Optimization (TRPO)
Value Iteration
• Value iteration repeatedly applies the Bellman optimality backup until the value function converges:
    V(s) ← max_a Σ_{s′} P(s′|s,a) [ R(s,a,s′) + γ V(s′) ]
• This is similar to what Q-Learning does; the main difference is that we might not know the actual expected reward, and instead explore the world and use discounted rewards to model our value function.
• Model-based vs. model-free: value iteration needs the transition model, while Q-Learning learns from sampled experience.
• Once we have Q(s,a), we can find the optimal policy π* using (a minimal sketch follows):
    π*(s) = argmax_a Q(s,a)
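As a concrete illustration, here is a minimal value-iteration sketch; the two-state MDP (the arrays P and R) is made up purely for the example:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP, purely for illustration.
# P[s, a, s'] = transition probability, R[s, a] = expected reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.6, 0.4], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

V = np.zeros(2)
while True:
    Q = R + gamma * P @ V            # Q[s, a] = R(s,a) + gamma * E[V(s')]
    V_new = Q.max(axis=1)            # Bellman optimality backup
    done = np.abs(V_new - V).max() < 1e-8
    V = V_new
    if done:
        break

pi_star = Q.argmax(axis=1)           # greedy policy: pi*(s) = argmax_a Q(s, a)
print(V, pi_star)
```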
Policy Iteration
• We can directly optimize in the policy space, which is smaller than the Q-function space.
Preliminaries
• The following identity expresses the expected return η(π̃) of another policy π̃ in terms of the advantage over π, accumulated over timesteps:
    η(π̃) = η(π) + Σ_s ρ_π̃(s) Σ_a π̃(a|s) A_π(s,a)
• where A_π is the advantage function:
    A_π(s,a) = Q_π(s,a) − V_π(s)
• and ρ_π̃ is the discounted visitation frequency of states under policy π̃ (a tabular sketch follows):
    ρ_π̃(s) = P(s₀ = s) + γ P(s₁ = s) + γ² P(s₂ = s) + …
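To make A_π and ρ_π concrete, here is a small tabular sketch; the function name and array layout are illustrative assumptions, not code from the talk:

```python
import numpy as np

def advantage_and_visitation(P, R, pi, gamma, mu):
    """Tabular A_pi(s,a) and discounted visitation rho_pi(s).
    P[s,a,s'], R[s,a], pi[s,a] = pi(a|s), mu[s] = start-state distribution."""
    S = P.shape[0]
    P_pi = np.einsum('sa,sap->sp', pi, P)       # state transitions under pi
    r_pi = (pi * R).sum(axis=1)                 # expected per-state reward under pi
    # V_pi solves the Bellman equation: V = r_pi + gamma * P_pi V
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    Q = R + gamma * P @ V                       # Q_pi(s,a)
    A = Q - V[:, None]                          # A_pi(s,a) = Q_pi(s,a) - V_pi(s)
    # rho_pi = mu^T (I - gamma P_pi)^-1, i.e. sum_t gamma^t P(s_t = s)
    rho = np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu)
    return A, rho
```

With these two quantities, both sides of the identity above can be checked numerically on a small MDP.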
Preliminaries
• To remove the complexity due to ρ_π̃ (which depends on the new policy we are trying to find), the following local approximation is introduced:
    L_π(π̃) = η(π) + Σ_s ρ_π(s) Σ_a π̃(a|s) A_π(s,a)
• If we have a parameterized policy π_θ, where π_θ(a|s) is a differentiable function of the parameter vector θ, then L_π matches η to first order, i.e.,
    L_{π_θ₀}(π_θ₀) = η(π_θ₀),   ∇_θ L_{π_θ₀}(π_θ)|_{θ=θ₀} = ∇_θ η(π_θ)|_{θ=θ₀}
• This implies that a sufficiently small step θ₀ → θ̃ that improves L_{π_θ₀} will also improve η, but it does not give us any guidance on how big of a step to take.
Preliminaries
• To address this issue, Kakade & Langford (2002) proposed conservative policy iteration:
    π_new(a|s) = (1 − α) π_old(a|s) + α π′(a|s)
  where π′ = argmax_{π′} L_{π_old}(π′).
• They derived the following lower bound (see the sketch after this slide):
    η(π_new) ≥ L_{π_old}(π_new) − (2εγ / (1 − γ)²) α²,   with ε = max_s |E_{a∼π′(·|s)}[A_π(s,a)]|
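The bound suggests picking the mixture coefficient α by maximizing the quadratic lower bound in α. A minimal numpy sketch of one such step; the closed-form α comes from that maximization, and all names here are illustrative:

```python
import numpy as np

def cpi_mixture_step(pi_old, pi_prime, A, rho, gamma):
    """One conservative policy iteration step (sketch).
    pi_old, pi_prime: [S, A] action distributions; A: advantages of pi_old;
    rho: discounted state visitation under pi_old."""
    # epsilon = max_s |E_{a ~ pi'}[A_pi(s,a)]|
    eps = np.abs((pi_prime * A).sum(axis=1)).max()
    # The bound is  L(alpha) - 2*eps*gamma/(1-gamma)^2 * alpha^2,  and L is linear
    # in alpha (since E_{a~pi_old}[A] = 0), so the maximizing alpha is closed-form.
    slope = (rho[:, None] * (pi_prime - pi_old) * A).sum()
    alpha = np.clip(slope * (1 - gamma) ** 2 / (4 * eps * gamma), 0.0, 1.0)
    return (1 - alpha) * pi_old + alpha * pi_prime   # pi_new = (1-a) pi_old + a pi'
```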
Preliminaries
• Computationally, this α-coupling means that if we randomly choose a seed for our random number generator, and then we sample from each of π and π_new after setting that seed, the results will agree for at least a fraction 1 − α of seeds (see the toy check below).
• Thus α can be considered a measure of disagreement between π and π_new.
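A toy check of this coupling view, using explicit inverse-CDF sampling so both policies consume the same uniform draw (a sketch, not the paper's construction):

```python
import numpy as np

def sample_inv_cdf(p, u):
    """Inverse-CDF sampling: the same u yields coupled samples across distributions."""
    return int(np.searchsorted(np.cumsum(p), u))

p_old = np.array([0.5, 0.3, 0.2])
p_new = np.array([0.45, 0.35, 0.2])
alpha = 0.5 * np.abs(p_old - p_new).sum()     # total variation distance = 0.05

rng = np.random.default_rng(0)
us = rng.random(10_000)                       # shared random draws ("seeds")
agree = np.mean([sample_inv_cdf(p_old, u) == sample_inv_cdf(p_new, u) for u in us])
print(f"agreement {agree:.3f} >= 1 - alpha = {1 - alpha:.3f}")
```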
Theorem 1
• The previous result was applicable to mixture policies only. Schulman et al. showed that it can be extended to general stochastic policies by using a distance measure called total variation divergence between π and π̃:
    D_TV^max(π, π̃) = max_s D_TV( π(·|s) ‖ π̃(·|s) )
  where, for discrete probability distributions p, q (implemented in the snippet below):
    D_TV(p ‖ q) = (1/2) Σ_i |p_i − q_i|
• Let α = D_TV^max(π_old, π_new).
• They proved that for ε = max_{s,a} |A_π(s,a)|, the following result holds:
    η(π_new) ≥ L_{π_old}(π_new) − (4εγ / (1 − γ)²) α²
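A direct translation of the definition (helper names are illustrative):

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance: (1/2) * sum_i |p_i - q_i|."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

def tv_max(pi_a, pi_b):
    """D_TV^max over states; pi_a, pi_b are [S, A] arrays of pi(a|s)."""
    return 0.5 * np.abs(pi_a - pi_b).sum(axis=1).max()
```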
Theorem 1
• Note the following relation between total variation and Kullback–Leibler divergence (checked numerically below):
    D_TV(p ‖ q)² ≤ D_KL(p ‖ q)
• Thus the bounding condition becomes:
    η(π̃) ≥ L_π(π̃) − C · D_KL^max(π, π̃),   where C = 4εγ / (1 − γ)²
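A quick numeric sanity check of this Pinsker-type relation on random distributions:

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions (assumes q > 0 where p > 0)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / q[mask])).sum())

rng = np.random.default_rng(0)
for _ in range(5):
    p = rng.dirichlet(np.ones(4))
    q = rng.dirichlet(np.ones(4))
    tv = 0.5 * np.abs(p - q).sum()
    assert tv ** 2 <= kl_divergence(p, q)   # D_TV(p||q)^2 <= D_KL(p||q)
```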
Algorithm 1
• Policy iteration with guaranteed monotonic improvement: initialize π₀; at each iteration i, compute all advantage values A_{π_i}(s,a), then solve
    π_{i+1} = argmax_π [ L_{π_i}(π) − C · D_KL^max(π_i, π) ]
• Because the penalized objective minorizes η, this is a minorization-maximization (MM) procedure, so it guarantees η(π₀) ≤ η(π₁) ≤ η(π₂) ≤ …
Trust Region Policy Optimization
• For parameterized policies π_θ with parameter vector θ, we are guaranteed to improve the true objective η by performing the following maximization:
    maximize_θ [ L_{θ_old}(θ) − C · D_KL^max(θ_old, θ) ]
• However, using the penalty coefficient C like above results in very small step sizes (see the quick calculation below). One way to take larger steps in a robust way is to use a constraint on the KL divergence between the new policy and the old policy, i.e., a trust region constraint:
    maximize_θ L_{θ_old}(θ)   subject to   D_KL^max(θ_old, θ) ≤ δ
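To see why the penalty forces tiny steps, note how large C = 4εγ / (1 − γ)² becomes for a typical discount factor (illustrative values):

```python
# With gamma = 0.99 and even a modest advantage bound eps = 1.0, the theoretical
# penalty coefficient is enormous, so any KL movement is heavily penalized.
eps, gamma = 1.0, 0.99
C = 4 * eps * gamma / (1 - gamma) ** 2
print(C)   # 39600.0
```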
Trust Region Policy Optimization
• The constraint bounds the KL divergence at every point in state space, which is not practical. We can use the following heuristic approximation, the average KL divergence:
    D̄_KL^ρ(θ₁, θ₂) := E_{s∼ρ} [ D_KL( π_{θ₁}(·|s) ‖ π_{θ₂}(·|s) ) ]
• Thus, the optimization problem becomes:
    maximize_θ Σ_s ρ_{θ_old}(s) Σ_a π_θ(a|s) A_{θ_old}(s,a)   subject to   D̄_KL^{ρ_{θ_old}}(θ_old, θ) ≤ δ
Trust Region Policy Optimization
• In terms of expectations, the previous equation can be written as:
    maximize_θ E_{s∼ρ_{θ_old}, a∼q} [ (π_θ(a|s) / q(a|s)) · Q_{θ_old}(s,a) ]   subject to   E_{s∼ρ_{θ_old}} [ D_KL( π_{θ_old}(·|s) ‖ π_θ(·|s) ) ] ≤ δ
  where q denotes the sampling distribution.
• This sampling distribution can be calculated in two ways (a sample-based sketch follows):
    a) Single Path Method
    b) Vine Method
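A minimal sample-based sketch of the Eq. (14) estimates, assuming log-probabilities at the sampled state-action pairs are available (function and argument names are assumptions):

```python
import numpy as np

def surrogate_and_kl(logp_new, logp_old, q_values):
    """Monte Carlo estimates of the Eq. (14) objective and constraint (sketch).
    logp_new, logp_old: log pi_theta(a|s) and log q(a|s) at sampled (s, a) pairs.
    In the single-path method, q is simply the old policy pi_theta_old."""
    ratio = np.exp(logp_new - logp_old)      # importance weight pi_theta(a|s) / q(a|s)
    surrogate = np.mean(ratio * q_values)    # estimate of E[(pi/q) * Q_old]
    kl = np.mean(logp_old - logp_new)        # sample estimate of E[KL(pi_old || pi_theta)]
    return surrogate, kl
```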
Final Algorithm
• Step 1: Use the single-path or vine procedure to collect a set of state-action pairs, along with Monte Carlo estimates of their Q-values.
• Step 2: By averaging over samples, construct the estimated objective and constraint in Equation (14).
• Step 3: Approximately solve this constrained optimization problem to update the policy's parameter vector θ (a sketch of this step follows).
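In practice, Step 3 is handled with conjugate gradient plus a backtracking line search. Below is a hedged sketch of that machinery, assuming the caller supplies surrogate, kl, grad, and a Fisher-vector product fvp; all names are illustrative and this simplifies the paper's actual procedure:

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Solve H x = g using only Hessian/Fisher-vector products fvp(v) = H v."""
    x = np.zeros_like(g)
    r, p = g.copy(), g.copy()
    r_dot = r @ r
    for _ in range(iters):
        Hp = fvp(p)
        alpha = r_dot / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        new_r_dot = r @ r
        if new_r_dot < tol:
            break
        p = r + (new_r_dot / r_dot) * p
        r_dot = new_r_dot
    return x

def trpo_step(theta, surrogate, kl, grad, fvp, delta=0.01, backtrack=10):
    """One approximate TRPO update (sketch).
    surrogate(theta), kl(theta): sample-based objective/constraint estimates;
    grad: gradient of the surrogate at theta; fvp(v): KL-Hessian-vector product."""
    s = conjugate_gradient(fvp, grad)             # search direction s ~ H^-1 g
    step = np.sqrt(2 * delta / (s @ fvp(s))) * s  # largest step with 0.5 s^T H s = delta
    old_obj = surrogate(theta)
    for i in range(backtrack):                    # backtracking line search
        theta_new = theta + 0.5 ** i * step
        if surrogate(theta_new) > old_obj and kl(theta_new) <= delta:
            return theta_new
    return theta                                  # no improving step found; keep old params
```

The quadratic step size follows from the second-order expansion of the KL constraint, and the line search enforces both improvement and the trust region on the actual sample estimates.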