Trust Region Policy Optimization (TRPO) - SJSU, 2018-04-17
Trust Region Policy Optimization (TRPO)
Value Iteration
• Value iteration repeatedly applies the Bellman optimality backup until the value function converges:
    V(s) ← max_a Σ_{s′} P(s′|s,a) [ R(s,a,s′) + γ V(s′) ]
• This is similar to what Q-Learning does; the main difference is that we might not know the actual expected reward, and instead explore the world and use discounted rewards to model our value function.
• Model-based vs. model-free: value iteration needs the transition model, while Q-Learning learns from sampled experience.
• Once we have Q(s,a), we can find the optimal policy π* using (a minimal sketch follows):
    π*(s) = argmax_a Q(s,a)
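As a concrete illustration, here is a minimal value-iteration sketch; the two-state MDP (the arrays P and R) is made up purely for the example:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP, purely for illustration.
# P[s, a, s'] = transition probability, R[s, a] = expected reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.6, 0.4], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

V = np.zeros(2)
while True:
    Q = R + gamma * P @ V            # Q[s, a] = R(s,a) + gamma * E[V(s')]
    V_new = Q.max(axis=1)            # Bellman optimality backup
    done = np.abs(V_new - V).max() < 1e-8
    V = V_new
    if done:
        break

pi_star = Q.argmax(axis=1)           # greedy policy: pi*(s) = argmax_a Q(s, a)
print(V, pi_star)
```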
Policy Iteration
• We can directly optimize in the policy space, which is smaller than the Q-function space.
Preliminaries
• The following identity expresses the expected return η(π̃) of another policy π̃ in terms of the advantage over π, accumulated over timesteps:
    η(π̃) = η(π) + Σ_s ρ_π̃(s) Σ_a π̃(a|s) A_π(s,a)
• where A_π is the advantage function:
    A_π(s,a) = Q_π(s,a) − V_π(s)
• and ρ_π̃ is the discounted visitation frequency of states under policy π̃ (a tabular sketch follows):
    ρ_π̃(s) = P(s₀ = s) + γ P(s₁ = s) + γ² P(s₂ = s) + …
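To make A_π and ρ_π concrete, here is a small tabular sketch; the function name and array layout are illustrative assumptions, not code from the talk:

```python
import numpy as np

def advantage_and_visitation(P, R, pi, gamma, mu):
    """Tabular A_pi(s,a) and discounted visitation rho_pi(s).
    P[s,a,s'], R[s,a], pi[s,a] = pi(a|s), mu[s] = start-state distribution."""
    S = P.shape[0]
    P_pi = np.einsum('sa,sap->sp', pi, P)       # state transitions under pi
    r_pi = (pi * R).sum(axis=1)                 # expected per-state reward under pi
    # V_pi solves the Bellman equation: V = r_pi + gamma * P_pi V
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    Q = R + gamma * P @ V                       # Q_pi(s,a)
    A = Q - V[:, None]                          # A_pi(s,a) = Q_pi(s,a) - V_pi(s)
    # rho_pi = mu^T (I - gamma P_pi)^-1, i.e. sum_t gamma^t P(s_t = s)
    rho = np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu)
    return A, rho
```

With these two quantities, both sides of the identity above can be checked numerically on a small MDP.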
Preliminaries
• To remove the complexity due to ρ_π̃ (which depends on the new policy we are trying to find), the following local approximation is introduced:
    L_π(π̃) = η(π) + Σ_s ρ_π(s) Σ_a π̃(a|s) A_π(s,a)
• If we have a parameterized policy π_θ, where π_θ(a|s) is a differentiable function of the parameter vector θ, then L_π matches η to first order, i.e.,
    L_{π_θ₀}(π_θ₀) = η(π_θ₀),   ∇_θ L_{π_θ₀}(π_θ)|_{θ=θ₀} = ∇_θ η(π_θ)|_{θ=θ₀}
• This implies that a sufficiently small step θ₀ → θ̃ that improves L_{π_θ₀} will also improve η, but it does not give us any guidance on how big of a step to take.
Preliminaries
• To address this issue, Kakade & Langford (2002) proposed conservative policy iteration:
    π_new(a|s) = (1 − α) π_old(a|s) + α π′(a|s)
  where π′ = argmax_{π′} L_{π_old}(π′).
• They derived the following lower bound (see the sketch after this slide):
    η(π_new) ≥ L_{π_old}(π_new) − (2εγ / (1 − γ)²) α²,   with ε = max_s |E_{a∼π′(·|s)}[A_π(s,a)]|
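The bound suggests picking the mixture coefficient α by maximizing the quadratic lower bound in α. A minimal numpy sketch of one such step; the closed-form α comes from that maximization, and all names here are illustrative:

```python
import numpy as np

def cpi_mixture_step(pi_old, pi_prime, A, rho, gamma):
    """One conservative policy iteration step (sketch).
    pi_old, pi_prime: [S, A] action distributions; A: advantages of pi_old;
    rho: discounted state visitation under pi_old."""
    # epsilon = max_s |E_{a ~ pi'}[A_pi(s,a)]|
    eps = np.abs((pi_prime * A).sum(axis=1)).max()
    # The bound is  L(alpha) - 2*eps*gamma/(1-gamma)^2 * alpha^2,  and L is linear
    # in alpha (since E_{a~pi_old}[A] = 0), so the maximizing alpha is closed-form.
    slope = (rho[:, None] * (pi_prime - pi_old) * A).sum()
    alpha = np.clip(slope * (1 - gamma) ** 2 / (4 * eps * gamma), 0.0, 1.0)
    return (1 - alpha) * pi_old + alpha * pi_prime   # pi_new = (1-a) pi_old + a pi'
```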
Preliminaries
• Computationally, this α-coupling means that if we randomly choose a seed for our random number generator, and then we sample from each of π and π_new after setting that seed, the results will agree for at least a fraction 1 − α of seeds (see the toy check below).
• Thus α can be considered a measure of disagreement between π and π_new.
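A toy check of this coupling view, using explicit inverse-CDF sampling so both policies consume the same uniform draw (a sketch, not the paper's construction):

```python
import numpy as np

def sample_inv_cdf(p, u):
    """Inverse-CDF sampling: the same u yields coupled samples across distributions."""
    return int(np.searchsorted(np.cumsum(p), u))

p_old = np.array([0.5, 0.3, 0.2])
p_new = np.array([0.45, 0.35, 0.2])
alpha = 0.5 * np.abs(p_old - p_new).sum()     # total variation distance = 0.05

rng = np.random.default_rng(0)
us = rng.random(10_000)                       # shared random draws ("seeds")
agree = np.mean([sample_inv_cdf(p_old, u) == sample_inv_cdf(p_new, u) for u in us])
print(f"agreement {agree:.3f} >= 1 - alpha = {1 - alpha:.3f}")
```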
Theorem 1
• The previous result was applicable to mixture policies only. Schulman et al. showed that it can be extended to general stochastic policies by using a distance measure called total variation divergence between π and π̃:
    D_TV^max(π, π̃) = max_s D_TV( π(·|s) ‖ π̃(·|s) )
  where, for discrete probability distributions p, q (implemented in the snippet below):
    D_TV(p ‖ q) = (1/2) Σ_i |p_i − q_i|
• Let α = D_TV^max(π_old, π_new).
• They proved that for ε = max_{s,a} |A_π(s,a)|, the following result holds:
    η(π_new) ≥ L_{π_old}(π_new) − (4εγ / (1 − γ)²) α²
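A direct translation of the definition (helper names are illustrative):

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance: (1/2) * sum_i |p_i - q_i|."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

def tv_max(pi_a, pi_b):
    """D_TV^max over states; pi_a, pi_b are [S, A] arrays of pi(a|s)."""
    return 0.5 * np.abs(pi_a - pi_b).sum(axis=1).max()
```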
Theorem 1
• Note the following relation between total variation and Kullback–Leibler divergence (checked numerically below):
    D_TV(p ‖ q)² ≤ D_KL(p ‖ q)
• Thus the bounding condition becomes:
    η(π̃) ≥ L_π(π̃) − C · D_KL^max(π, π̃),   where C = 4εγ / (1 − γ)²
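A quick numeric sanity check of this Pinsker-type relation on random distributions:

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions (assumes q > 0 where p > 0)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / q[mask])).sum())

rng = np.random.default_rng(0)
for _ in range(5):
    p = rng.dirichlet(np.ones(4))
    q = rng.dirichlet(np.ones(4))
    tv = 0.5 * np.abs(p - q).sum()
    assert tv ** 2 <= kl_divergence(p, q)   # D_TV(p||q)^2 <= D_KL(p||q)
```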
Algorithm 1
• Policy iteration with guaranteed monotonic improvement: initialize π₀; at each iteration i, compute all advantage values A_{π_i}(s,a), then solve
    π_{i+1} = argmax_π [ L_{π_i}(π) − C · D_KL^max(π_i, π) ]
• Because the penalized objective minorizes η, this is a minorization-maximization (MM) procedure, so it guarantees η(π₀) ≤ η(π₁) ≤ η(π₂) ≤ …
Trust Region Policy Optimization
• For parameterized policies π_θ with parameter vector θ, we are guaranteed to improve the true objective η by performing the following maximization:
    maximize_θ [ L_{θ_old}(θ) − C · D_KL^max(θ_old, θ) ]
• However, using the penalty coefficient C like above results in very small step sizes (see the quick calculation below). One way to take larger steps in a robust way is to use a constraint on the KL divergence between the new policy and the old policy, i.e., a trust region constraint:
    maximize_θ L_{θ_old}(θ)   subject to   D_KL^max(θ_old, θ) ≤ δ
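To see why the penalty forces tiny steps, note how large C = 4εγ / (1 − γ)² becomes for a typical discount factor (illustrative values):

```python
# With gamma = 0.99 and even a modest advantage bound eps = 1.0, the theoretical
# penalty coefficient is enormous, so any KL movement is heavily penalized.
eps, gamma = 1.0, 0.99
C = 4 * eps * gamma / (1 - gamma) ** 2
print(C)   # 39600.0
```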
Trust Region Policy Optimization
• The constraint bounds the KL divergence at every point in state space, which is not practical. We can use the following heuristic approximation, the average KL divergence:
    D̄_KL^ρ(θ₁, θ₂) := E_{s∼ρ} [ D_KL( π_{θ₁}(·|s) ‖ π_{θ₂}(·|s) ) ]
• Thus, the optimization problem becomes:
    maximize_θ Σ_s ρ_{θ_old}(s) Σ_a π_θ(a|s) A_{θ_old}(s,a)   subject to   D̄_KL^{ρ_{θ_old}}(θ_old, θ) ≤ δ
Trust Region Policy Optimization
• In terms of expectations, the previous equation can be written as:
    maximize_θ E_{s∼ρ_{θ_old}, a∼q} [ (π_θ(a|s) / q(a|s)) · Q_{θ_old}(s,a) ]   subject to   E_{s∼ρ_{θ_old}} [ D_KL( π_{θ_old}(·|s) ‖ π_θ(·|s) ) ] ≤ δ
  where q denotes the sampling distribution.
• This sampling distribution can be calculated in two ways (a sample-based sketch follows):
    a) Single Path Method
    b) Vine Method
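A minimal sample-based sketch of the Eq. (14) estimates, assuming log-probabilities at the sampled state-action pairs are available (function and argument names are assumptions):

```python
import numpy as np

def surrogate_and_kl(logp_new, logp_old, q_values):
    """Monte Carlo estimates of the Eq. (14) objective and constraint (sketch).
    logp_new, logp_old: log pi_theta(a|s) and log q(a|s) at sampled (s, a) pairs.
    In the single-path method, q is simply the old policy pi_theta_old."""
    ratio = np.exp(logp_new - logp_old)      # importance weight pi_theta(a|s) / q(a|s)
    surrogate = np.mean(ratio * q_values)    # estimate of E[(pi/q) * Q_old]
    kl = np.mean(logp_old - logp_new)        # sample estimate of E[KL(pi_old || pi_theta)]
    return surrogate, kl
```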
Final Algorithm
• Step 1: Use the single-path or vine procedure to collect a set of state-action pairs, along with Monte Carlo estimates of their Q-values.
• Step 2: By averaging over samples, construct the estimated objective and constraint in Equation (14).
• Step 3: Approximately solve this constrained optimization problem to update the policy's parameter vector θ (a sketch of this step follows).
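In practice, Step 3 is handled with conjugate gradient plus a backtracking line search. Below is a hedged sketch of that machinery, assuming the caller supplies surrogate, kl, grad, and a Fisher-vector product fvp; all names are illustrative and this simplifies the paper's actual procedure:

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Solve H x = g using only Hessian/Fisher-vector products fvp(v) = H v."""
    x = np.zeros_like(g)
    r, p = g.copy(), g.copy()
    r_dot = r @ r
    for _ in range(iters):
        Hp = fvp(p)
        alpha = r_dot / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        new_r_dot = r @ r
        if new_r_dot < tol:
            break
        p = r + (new_r_dot / r_dot) * p
        r_dot = new_r_dot
    return x

def trpo_step(theta, surrogate, kl, grad, fvp, delta=0.01, backtrack=10):
    """One approximate TRPO update (sketch).
    surrogate(theta), kl(theta): sample-based objective/constraint estimates;
    grad: gradient of the surrogate at theta; fvp(v): KL-Hessian-vector product."""
    s = conjugate_gradient(fvp, grad)             # search direction s ~ H^-1 g
    step = np.sqrt(2 * delta / (s @ fvp(s))) * s  # largest step with 0.5 s^T H s = delta
    old_obj = surrogate(theta)
    for i in range(backtrack):                    # backtracking line search
        theta_new = theta + 0.5 ** i * step
        if surrogate(theta_new) > old_obj and kl(theta_new) <= delta:
            return theta_new
    return theta                                  # no improving step found; keep old params
```

The quadratic step size follows from the second-order expansion of the KL constraint, and the line search enforces both improvement and the trust region on the actual sample estimates.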