STATISTICAL REINFORCEMENT LEARNING
Modern Machine Learning Approaches

Chapman & Hall/CRC
Machine Learning & Pattern Recognition Series

SERIES EDITORS
Ralf Herbrich (Amazon Development Center, Berlin, Germany)
Thore Graepel (Microsoft Research Ltd., Cambridge, UK)

AIMS AND SCOPE
This series reflects the latest advances and applications in machine learning and pattern recognition through the publication of a broad range of reference works, textbooks, and handbooks. The inclusion of concrete examples, applications, and methods is highly encouraged. The scope of the series includes, but is not limited to, titles in the areas of machine learning, pattern recognition, computational intelligence, robotics, computational/statistical learning theory, natural language processing, computer vision, game AI, game theory, neural networks, computational neuroscience, and other relevant topics, such as machine learning applied to bioinformatics or cognitive science, which might be proposed by potential contributors.
PUBLISHED TITLES

BAYESIAN PROGRAMMING
Pierre Bessière, Emmanuel Mazer, Juan-Manuel Ahuactzin, and Kamel Mekhnacha

UTILITY-BASED LEARNING FROM DATA
Craig Friedman and Sven Sandow

HANDBOOK OF NATURAL LANGUAGE PROCESSING, SECOND EDITION
Nitin Indurkhya and Fred J. Damerau

COST-SENSITIVE MACHINE LEARNING
Balaji Krishnapuram, Shipeng Yu, and Bharat Rao

COMPUTATIONAL TRUST MODELS AND MACHINE LEARNING
Xin Liu, Anwitaman Datta, and Ee-Peng Lim

MULTILINEAR SUBSPACE LEARNING: DIMENSIONALITY REDUCTION OF MULTIDIMENSIONAL DATA
Haiping Lu, Konstantinos N. Plataniotis, and Anastasios N. Venetsanopoulos

MACHINE LEARNING: An Algorithmic Perspective, Second Edition
Stephen Marsland

SPARSE MODELING: THEORY, ALGORITHMS, AND APPLICATIONS
Irina Rish and Genady Ya. Grabarnik

A FIRST COURSE IN MACHINE LEARNING
Simon Rogers and Mark Girolami

STATISTICAL REINFORCEMENT LEARNING: MODERN MACHINE LEARNING APPROACHES
Masashi Sugiyama

MULTI-LABEL DIMENSIONALITY REDUCTION
Liang Sun, Shuiwang Ji, and Jieping Ye

REGULARIZATION, OPTIMIZATION, KERNELS, AND SUPPORT VECTOR MACHINES
Johan A. K. Suykens, Marco Signoretto, and Andreas Argyriou

ENSEMBLE METHODS: FOUNDATIONS AND ALGORITHMS
Zhi-Hua Zhou
Chapman & Hall/CRC
Machine Learning & Pattern Recognition Series

STATISTICAL REINFORCEMENT LEARNING
Modern Machine Learning Approaches

Masashi Sugiyama
University of Tokyo
Tokyo, Japan

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2015 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works
Version Date: 20150128
International Standard Book Number-13: 978-1-4398-5690-1 (eBook - PDF)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged, please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com
and the CRC Press Web site at http://www.crcpress.com
Contents

Foreword
Preface
Author

I  Introduction

1  Introduction to Reinforcement Learning
   1.1  Reinforcement Learning
   1.2  Mathematical Formulation
   1.3  Structure of the Book
        1.3.1  Model-Free Policy Iteration
        1.3.2  Model-Free Policy Search
        1.3.3  Model-Based Reinforcement Learning

II  Model-Free Policy Iteration

2  Policy Iteration with Value Function Approximation
   2.1  Value Functions
        2.1.1  State Value Functions
        2.1.2  State-Action Value Functions
   2.2  Least-Squares Policy Iteration
        2.2.1  Immediate-Reward Regression
        2.2.2  Algorithm
        2.2.3  Regularization
        2.2.4  Model Selection
   2.3  Remarks

3  Basis Design for Value Function Approximation
   3.1  Gaussian Kernels on Graphs
        3.1.1  MDP-Induced Graph
        3.1.2  Ordinary Gaussian Kernels
        3.1.3  Geodesic Gaussian Kernels
        3.1.4  Extension to Continuous State Spaces
   3.2  Illustration
        3.2.1  Setup
        3.2.2  Geodesic Gaussian Kernels
        3.2.3  Ordinary Gaussian Kernels
        3.2.4  Graph-Laplacian Eigenbases
        3.2.5  Diffusion Wavelets
   3.3  Numerical Examples
        3.3.1  Robot-Arm Control
        3.3.2  Robot-Agent Navigation
   3.4  Remarks

4  Sample Reuse in Policy Iteration
   4.1  Formulation
   4.2  Off-Policy Value Function Approximation
        4.2.1  Episodic Importance Weighting
        4.2.2  Per-Decision Importance Weighting
        4.2.3  Adaptive Per-Decision Importance Weighting
        4.2.4  Illustration
   4.3  Automatic Selection of Flattening Parameter
        4.3.1  Importance-Weighted Cross-Validation
        4.3.2  Illustration
   4.4  Sample-Reuse Policy Iteration
        4.4.1  Algorithm
        4.4.2  Illustration
   4.5  Numerical Examples
        4.5.1  Inverted Pendulum
        4.5.2  Mountain Car
   4.6  Remarks

5  Active Learning in Policy Iteration
   5.1  Efficient Exploration with Active Learning
        5.1.1  Problem Setup
        5.1.2  Decomposition of Generalization Error
        5.1.3  Estimation of Generalization Error
        5.1.4  Designing Sampling Policies
        5.1.5  Illustration
   5.2  Active Policy Iteration
        5.2.1  Sample-Reuse Policy Iteration with Active Learning
        5.2.2  Illustration
   5.3  Numerical Examples
   5.4  Remarks

6  Robust Policy Iteration
   6.1  Robustness and Reliability in Policy Iteration
        6.1.1  Robustness
        6.1.2  Reliability
   6.2  Least Absolute Policy Iteration
        6.2.1  Algorithm
        6.2.2  Illustration
        6.2.3  Properties
   6.3  Numerical Examples
   6.4  Possible Extensions
        6.4.1  Huber Loss
        6.4.2  Pinball Loss
        6.4.3  Deadzone-Linear Loss
        6.4.4  Chebyshev Approximation
        6.4.5  Conditional Value-at-Risk
   6.5  Remarks

III  Model-Free Policy Search

7  Direct Policy Search by Gradient Ascent
   7.1  Formulation
   7.2  Gradient Approach
        7.2.1  Gradient Ascent
        7.2.2  Baseline Subtraction for Variance Reduction
        7.2.3  Variance Analysis of Gradient Estimators
   7.3  Natural Gradient Approach
        7.3.1  Natural Gradient Ascent
        7.3.2  Illustration
   7.4  Application in Computer Graphics: Artist Agent
        7.4.1  Sumie Painting
        7.4.2  Design of States, Actions, and Immediate Rewards
        7.4.3  Experimental Results
   7.5  Remarks

8  Direct Policy Search by Expectation-Maximization
   8.1  Expectation-Maximization Approach
   8.2  Sample Reuse
        8.2.1  Episodic Importance Weighting
        8.2.2  Per-Decision Importance Weight
        8.2.3  Adaptive Per-Decision Importance Weighting
        8.2.4  Automatic Selection of Flattening Parameter
        8.2.5  Reward-Weighted Regression with Sample Reuse
   8.3  Numerical Examples
   8.4  Remarks

9  Policy-Prior Search
   9.1  Formulation
   9.2  Policy Gradients with Parameter-Based Exploration
        9.2.1  Policy-Prior Gradient Ascent
        9.2.2  Baseline Subtraction for Variance Reduction
        9.2.3  Variance Analysis of Gradient Estimators
        9.2.4  Numerical Examples
   9.3  Sample Reuse in Policy-Prior Search
        9.3.1  Importance Weighting
        9.3.2  Variance Reduction by Baseline Subtraction
        9.3.3  Numerical Examples
   9.4  Remarks

IV  Model-Based Reinforcement Learning

10  Transition Model Estimation
    10.1  Conditional Density Estimation
          10.1.1  Regression-Based Approach
          10.1.2  ε-Neighbor Kernel Density Estimation
          10.1.3  Least-Squares Conditional Density Estimation
    10.2  Model-Based Reinforcement Learning
    10.3  Numerical Examples
          10.3.1  Continuous Chain Walk
          10.3.2  Humanoid Robot Control
    10.4  Remarks

11  Dimensionality Reduction for Transition Model Estimation
    11.1  Sufficient Dimensionality Reduction
    11.2  Squared-Loss Conditional Entropy
          11.2.1  Conditional Independence
          11.2.2  Dimensionality Reduction with SCE
          11.2.3  Relation to Squared-Loss Mutual Information
    11.3  Numerical Examples
          11.3.1  Artificial and Benchmark Datasets
          11.3.2  Humanoid Robot
    11.4  Remarks

References
Index
Foreword

How can agents learn from experience without an omniscient teacher explicitly telling them what to do? Reinforcement learning is the area within machine learning that investigates how an agent can learn an optimal behavior by correlating generic reward signals with its past actions. The discipline draws upon and connects key ideas from behavioral psychology, economics, control theory, operations research, and other disparate fields to model the learning process. In reinforcement learning, the environment is typically modeled as a Markov decision process that provides immediate reward and state information to the agent. However, the agent does not have access to the transition structure of the environment and needs to learn how to choose appropriate actions to maximize its overall reward over time.

This book by Prof. Masashi Sugiyama covers the range of reinforcement learning algorithms from a fresh, modern perspective. With a focus on the statistical properties of estimating parameters for reinforcement learning, the book relates a number of different approaches across the gamut of learning scenarios. The algorithms are divided into model-free approaches that do not explicitly model the dynamics of the environment, and model-based approaches that construct descriptive process models for the environment. Within each of these categories, there are policy iteration algorithms which estimate value functions, and policy search algorithms which directly manipulate policy parameters.

For each of these different reinforcement learning scenarios, the book meticulously lays out the associated optimization problems. A careful analysis is given for each of these cases, with an emphasis on understanding the statistical properties of the resulting estimators and learned parameters. Each chapter contains illustrative examples of applications of these algorithms, with quantitative comparisons between the different techniques. These examples are drawn from a variety of practical problems, including robot motion control and Asian brush painting.

In summary, the book provides a thought-provoking statistical treatment of reinforcement learning algorithms, reflecting the author's work and sustained research in this area. It is a contemporary and welcome addition to the rapidly growing machine learning literature. Both beginner students and experienced researchers will find it to be an important source for understanding the latest reinforcement learning techniques.

Daniel D. Lee
GRASP Laboratory
School of Engineering and Applied Science
University of Pennsylvania, Philadelphia, PA, USA
Preface

In the coming big data era, statistics and machine learning are becoming indispensable tools for data mining. Depending on the type of data analysis, machine learning methods are categorized into three groups:

• Supervised learning: Given input-output paired data, the objective of supervised learning is to analyze the input-output relation behind the data. Typical tasks of supervised learning include regression (predicting the real value), classification (predicting the category), and ranking (predicting the order). Supervised learning is the most common data analysis and has been extensively studied in the statistics community for a long time. A recent trend of supervised learning research in the machine learning community is to utilize side information in addition to the input-output paired data to further improve the prediction accuracy. For example, semi-supervised learning utilizes additional input-only data, transfer learning borrows data from other similar learning tasks, and multi-task learning solves multiple related learning tasks simultaneously.

• Unsupervised learning: Given input-only data, the objective of unsupervised learning is to find something useful in the data. Due to this ambiguous definition, unsupervised learning research tends to be more ad hoc than supervised learning. Nevertheless, unsupervised learning is regarded as one of the most important tools in data mining because of its automatic and inexpensive nature. Typical tasks of unsupervised learning include clustering (grouping the data based on their similarity), density estimation (estimating the probability distribution behind the data), anomaly detection (removing outliers from the data), data visualization (reducing the dimensionality of the data to 1-3 dimensions), and blind source separation (extracting the original source signals from their mixtures). Also, unsupervised learning methods are sometimes used as data pre-processing tools in supervised learning.

• Reinforcement learning: Supervised learning is a sound approach, but collecting input-output paired data is often too expensive. Unsupervised learning is inexpensive to perform, but it tends to be ad hoc. Reinforcement learning is placed between supervised learning and unsupervised learning: no explicit supervision (output data) is provided, but we still want to learn the input-output relation behind the data. Instead of output data, reinforcement learning utilizes rewards, which evaluate the validity of predicted outputs. Giving implicit supervision such as rewards is usually much easier and less costly than giving explicit supervision, and therefore reinforcement learning can be a vital approach in modern data analysis. Various supervised and unsupervised learning techniques are also utilized in the framework of reinforcement learning.

This book is devoted to introducing fundamental concepts and practical algorithms of statistical reinforcement learning from the modern machine learning viewpoint. Various illustrative examples, mainly in robotics, are also provided to help understand the intuition and usefulness of reinforcement learning techniques. Target readers are graduate-level students in computer science and applied statistics as well as researchers and engineers in related fields. Basic knowledge of probability and statistics, linear algebra, and elementary calculus is assumed.

Machine learning is a rapidly developing area of science, and the author hopes that this book helps the reader grasp various exciting topics in reinforcement learning and stimulates readers' interest in machine learning. Please visit our website at: http://www.ms.k.u-tokyo.ac.jp.

Masashi Sugiyama
University of Tokyo, Japan
Author

Masashi Sugiyama was born in Osaka, Japan, in 1974. He received Bachelor, Master, and Doctor of Engineering degrees in Computer Science from the Tokyo Institute of Technology, Japan, in 1997, 1999, and 2001, respectively. In 2001, he was appointed Assistant Professor in the same institute, and he was promoted to Associate Professor in 2003. He moved to the University of Tokyo as Professor in 2014.

He received an Alexander von Humboldt Foundation Research Fellowship and researched at Fraunhofer Institute, Berlin, Germany, from 2003 to 2004. In 2006, he received a European Commission Program Erasmus Mundus Scholarship and researched at the University of Edinburgh, Scotland. He received the Faculty Award from IBM in 2007 for his contribution to machine learning under non-stationarity, the Nagao Special Researcher Award from the Information Processing Society of Japan in 2011, and the Young Scientists' Prize from the Commendation for Science and Technology by the Minister of Education, Culture, Sports, Science and Technology for his contribution to the density-ratio paradigm of machine learning.

His research interests include theories and algorithms of machine learning and data mining, and a wide range of applications such as signal processing, image processing, and robot control. He published Density Ratio Estimation in Machine Learning (Cambridge University Press, 2012) and Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation (MIT Press, 2012).

The author thanks his collaborators, Hirotaka Hachiya, Sethu Vijayakumar, Jan Peters, Jun Morimoto, Zhao Tingting, Ning Xie, Voot Tangkaratt, Tetsuro Morimura, and Norikazu Sugimoto, for exciting and creative discussions. He acknowledges support from MEXT KAKENHI 17700142, 18300057, 20680007, 23120004, 23300069, 25700022, and 26280054, the Okawa Foundation, EU Erasmus Mundus Fellowship, AOARD, SCAT, the JST PRESTO program, and the FIRST program.
Part I

Introduction
Chapter 1
Introduction to Reinforcement Learning

Reinforcement learning is aimed at controlling a computer agent so that a target task is achieved in an unknown environment.

In this chapter, we first give an informal overview of reinforcement learning in Section 1.1. Then we provide a more formal formulation of reinforcement learning in Section 1.2. Finally, the book is summarized in Section 1.3.

1.1 Reinforcement Learning
A schematic of reinforcement learning is given in Figure 1.1. In an unknown environment (e.g., in a maze), a computer agent (e.g., a robot) takes an action (e.g., to walk) based on its own control policy. Then its state is updated (e.g., by moving forward) and evaluation of that action is given as a "reward" (e.g., praise, neutral, or scolding). Through such interaction with the environment, the agent is trained to achieve a certain task (e.g., getting out of the maze) without explicit guidance. A crucial advantage of reinforcement learning is its non-greedy nature. That is, the agent is trained not to improve performance in the short term (e.g., greedily approaching an exit of the maze), but to optimize the long-term achievement (e.g., successfully getting out of the maze).

FIGURE 1.1: Reinforcement learning.

A reinforcement learning problem contains various technical components such as states, actions, transitions, rewards, policies, and values. Before going into mathematical details (which will be provided in Section 1.2), we intuitively explain these concepts through illustrative reinforcement learning problems here.

Let us consider a maze problem (Figure 1.2), where a robot agent is located in a maze and we want to guide him to the goal without explicit supervision about which direction to go. States are positions in the maze which the robot agent can visit. In the example illustrated in Figure 1.3, there are 21 states in the maze. Actions are possible directions along which the robot agent can move. In the example illustrated in Figure 1.4, there are 4 actions which correspond to movement toward the north, south, east, and west directions.
States and actions are fundamental elements that define a reinforcement learning problem. Transitions specify how states are connected to each other through actions (Figure 1.5). Thus, knowing the transitions intuitively means knowing the map of the maze. Rewards specify the incomes/costs that the robot agent receives when making a transition from one state to another by a certain action. In the case of the maze example, the robot agent receives a positive reward when it reaches the goal. More specifically, a positive reward is provided when making a transition from state 12 to state 17 by action "east" or from state 18 to state 17 by action "north" (Figure 1.6). Thus, knowing the rewards intuitively means knowing the location of the goal state. To emphasize the fact that a reward is given to the robot agent right after taking an action and making a transition to the next state, it is also referred to as an immediate reward.

Under the above setup, the goal of reinforcement learning is to find the policy for controlling the robot agent that allows it to receive the maximum amount of rewards in the long run. Here, a policy specifies an action the robot agent takes at each state (Figure 1.7). Through a policy, a series of states and actions that the robot agent takes from a start state to an end state is specified. Such a series is called a trajectory (see Figure 1.7 again). The sum of immediate rewards along a trajectory is called the return. In practice, rewards that can be obtained in the distant future are often discounted because receiving rewards earlier is regarded as more preferable. In the maze task, such a discounting strategy urges the robot agent to reach the goal as quickly as possible.
To find the optimal policy efficiently, it is useful to view the return as a function of the initial state. This is called the (state-)value. The values can be efficiently obtained via dynamic programming, which is a general method for solving a complex optimization problem by breaking it down into simpler subproblems recursively. With the hope that many subproblems are actually the same, dynamic programming solves such overlapped subproblems only once and reuses the solutions to reduce the computation costs.

FIGURE 1.2: A maze problem. We want to guide the robot agent to the goal.

FIGURE 1.3: States are visitable positions in the maze.

FIGURE 1.4: Actions are possible movements of the robot agent.

FIGURE 1.5: Transitions specify connections between states via actions. Thus, knowing the transitions means knowing the map of the maze.

FIGURE 1.6: A positive reward is given when the robot agent reaches the goal. Thus, the reward specifies the goal location.

FIGURE 1.7: A policy specifies an action the robot agent takes at each state. Thus, a policy also specifies a trajectory, which is a series of states and actions that the robot agent takes from a start state to an end state.

FIGURE 1.8: Values of each state when reward +1 is given at the goal state and the reward is discounted at the rate of 0.9 according to the number of steps.

In the maze problem, the value of a state can be computed from the values of neighboring states. For example, let us compute the value of state 7 (see
Figure 1.5 again). From state 7, the robot agent can reach state 2, state 6, and state 8 by a single step. If the robot agent knows the values of these neighboring states, the best action the robot agent should take is to visit the neighboring state with the largest value, because this allows the robot agent to earn the largest amount of rewards in the long run. However, the values of neighboring states are unknown in practice and thus they should also be computed.

Now, we need to solve 3 subproblems of computing the values of state 2, state 6, and state 8. Then, in the same way, these subproblems are further decomposed as follows:

• The problem of computing the value of state 2 is decomposed into 3 subproblems of computing the values of state 1, state 3, and state 7.

• The problem of computing the value of state 6 is decomposed into 2 subproblems of computing the values of state 1 and state 7.

• The problem of computing the value of state 8 is decomposed into 3 subproblems of computing the values of state 3, state 7, and state 9.

Thus, by removing overlaps, the original problem of computing the value of state 7 has been decomposed into 6 unique subproblems: computing the values of state 1, state 2, state 3, state 6, state 8, and state 9.

If we further continue this problem decomposition, we encounter the problem of computing the value of state 17, where the robot agent can receive reward +1. Then the values of state 12 and state 18 can be explicitly computed. Indeed, if a discounting factor (a multiplicative penalty for delayed rewards) is 0.9, the values of state 12 and state 18 are (0.9)^1 = 0.9. Then we can further know that the values of state 13 and state 19 are (0.9)^2 = 0.81. By repeating this procedure, we can compute the values of all states (as illustrated in Figure 1.8). Based on these values, we can know the optimal action the robot agent should take, i.e., an action that leads the robot agent to the neighboring state with the largest value.
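The backward computation sketched above is easy to carry out programmatically. The following minimal Python sketch (not from the book; the small graph below is a hypothetical stand-in, not the 21-state maze of the figures) computes V(s) = 0.9^(number of steps to the goal) by breadth-first search, reproducing the kind of values shown in Figure 1.8.

```python
from collections import deque

# A small hypothetical maze graph: state -> neighboring states (deterministic moves).
# This is NOT the 21-state maze from the figures; it only illustrates the computation.
adjacency = {
    "A": ["B"], "B": ["A", "C"], "C": ["B", "D", "G"],
    "D": ["C", "E"], "E": ["D"], "G": ["C"],   # "G" is the goal state
}
goal, gamma = "G", 0.9

# Breadth-first search gives the number of steps from each state to the goal;
# the reward +1 at the goal is discounted once per step, as in Figure 1.8.
steps = {goal: 0}
queue = deque([goal])
while queue:
    s = queue.popleft()
    for t in adjacency[s]:
        if t not in steps:
            steps[t] = steps[s] + 1
            queue.append(t)

values = {s: gamma ** d for s, d in steps.items()}
print(values)  # a state adjacent to the goal gets 0.9, two steps away 0.81, ...
```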
Note that, in real-world reinforcement learning tasks, transitions are often not deterministic but stochastic because of some external disturbance; in the case of the above maze example, the floor may be slippery and thus the robot agent cannot move as perfectly as it desires. Also, stochastic policies, in which the mapping from a state to an action is not deterministic, are often employed in many reinforcement learning formulations. In these cases, the formulation becomes slightly more complicated, but essentially the same idea can still be used for solving the problem.
To further highlight the notable advantage of reinforcement learning that not the immediate rewards but the long-term accumulation of rewards is maximized, let us consider a mountain-car problem (Figure 1.9). There are two mountains and a car is located in a valley between the mountains. The goal is to guide the car to the top of the right-hand hill. However, the engine of the car is not powerful enough to directly run up the right-hand hill and reach the goal. The optimal policy in this problem is to first climb the left-hand hill and then go down the slope to the right with full acceleration to get to the goal (Figure 1.10).

Suppose we define the immediate reward such that moving the car to the right gives a positive reward +1 and moving the car to the left gives a negative reward −1. Then, a greedy solution that maximizes the immediate reward moves the car to the right, which does not allow the car to get to the goal due to lack of engine power. On the other hand, reinforcement learning seeks a solution that maximizes the return, i.e., the discounted sum of immediate rewards that the agent can collect over the entire trajectory. This means that the reinforcement learning solution will first move the car to the left even though negative rewards are given for a while, to receive more positive rewards in the future. Thus, the notion of "prior investment" can be naturally incorporated in the reinforcement learning framework.

FIGURE 1.9: A mountain-car problem. We want to guide the car to the goal. However, the engine of the car is not powerful enough to directly run up the right-hand hill.

FIGURE 1.10: The optimal policy to reach the goal is to first climb the left-hand hill and then head for the right-hand hill with full acceleration.
1.2 Mathematical Formulation

In this section, the reinforcement learning problem is mathematically formulated as the problem of controlling a computer agent under a Markov decision process.

We consider the problem of controlling a computer agent under a discrete-time Markov decision process (MDP). That is, at each discrete time step t, the agent observes a state s_t ∈ S, selects an action a_t ∈ A, makes a transition to s_{t+1} ∈ S, and receives an immediate reward r_t = r(s_t, a_t, s_{t+1}) ∈ R.
S and A are called the state space and the action space, respectively. r(s,a,s′) is called the immediate reward function.

The initial position of the agent, s_1, is drawn from the initial probability distribution. If the state space S is discrete, the initial probability distribution is specified by the probability mass function P(s) such that

0 ≤ P(s) ≤ 1, ∀s ∈ S,   Σ_{s∈S} P(s) = 1.

If the state space S is continuous, the initial probability distribution is specified by the probability density function p(s) such that

p(s) ≥ 0, ∀s ∈ S,   ∫_{s∈S} p(s) ds = 1.

Because the probability mass function P(s) can be expressed as a probability density function p(s) by using the Dirac delta function¹ δ(s) as

p(s) = Σ_{s′∈S} δ(s′ − s) P(s′),

we focus only on the continuous state space below.

The dynamics of the environment, which represent the transition probability from state s to state s′ when action a is taken, are characterized by the transition probability distribution with conditional probability density p(s′|s,a):

p(s′|s,a) ≥ 0, ∀s, s′ ∈ S, ∀a ∈ A,   ∫_{s′∈S} p(s′|s,a) ds′ = 1, ∀s ∈ S, ∀a ∈ A.
The agent's decision is determined by a policy π. When we consider a deterministic policy, where the action to take at each state is uniquely determined, we regard the policy as a function of states:

π(s) ∈ A, ∀s ∈ S.

Action a can be either discrete or continuous. On the other hand, when developing more sophisticated reinforcement learning algorithms, it is often more convenient to consider a stochastic policy, where an action to take at a state is probabilistically determined. Mathematically, a stochastic policy is a conditional probability density of taking action a at state s:

π(a|s) ≥ 0, ∀s ∈ S, ∀a ∈ A,   ∫_{a∈A} π(a|s) da = 1, ∀s ∈ S.

By introducing stochasticity in action selection, we can more actively explore the entire state space. Note that when action a is discrete, the stochastic policy is expressed using Dirac's delta function, as in the case of the state densities.

A sequence of states and actions obtained by the procedure described in Figure 1.11 is called a trajectory.

¹ The Dirac delta function δ(·) allows us to obtain the value of a function f at a point τ via the convolution with f:
∫_{−∞}^{∞} f(s) δ(s − τ) ds = f(τ).
Dirac's delta function δ(·) can be expressed as the Gaussian density with standard deviation σ → 0:
δ(a) = lim_{σ→0} (1/√(2πσ²)) exp(−a²/(2σ²)).
1. The initial state s_1 is chosen following the initial probability p(s).
2. For t = 1, …, T:
   (a) The action a_t is chosen following the policy π(a_t|s_t).
   (b) The next state s_{t+1} is determined according to the transition probability p(s_{t+1}|s_t, a_t).

FIGURE 1.11: Generation of a trajectory sample.

When the number of steps, T, is finite or infinite, the situation is called the finite horizon or infinite horizon, respectively. Below, we focus on the finite-horizon case because the trajectory length is always finite in practice. We denote a trajectory by h (which stands for a "history"):

h = [s_1, a_1, …, s_T, a_T, s_{T+1}].

The discounted sum of immediate rewards along the trajectory h is called the return:

R(h) = Σ_{t=1}^{T} γ^{t−1} r(s_t, a_t, s_{t+1}),

where γ ∈ [0, 1) is called the discount factor for future rewards. The goal of reinforcement learning is to learn the optimal policy π* that maximizes the expected return:

π* = argmax_π E_{p_π(h)}[R(h)],

where E_{p_π(h)} denotes the expectation over trajectory h drawn from p_π(h), and p_π(h) denotes the probability density of observing trajectory h under policy π:

p_π(h) = p(s_1) Π_{t=1}^{T} p(s_{t+1}|s_t, a_t) π(a_t|s_t).

"argmax" gives the maximizer of a function (Figure 1.12).
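The generative procedure of Figure 1.11 and the return R(h) are easy to express in code. The following is a hedged Python sketch for a tiny two-state MDP with made-up transition probabilities and a reward of +1 for arriving at state s1; it is only meant to make the notation concrete, not to reproduce any example from the book.

```python
import random

# A tiny hypothetical MDP used only to illustrate Figure 1.11 and the return R(h).
states, actions = ["s0", "s1"], ["left", "right"]
p_init = {"s0": 0.8, "s1": 0.2}                      # initial probability p(s)
p_trans = {                                          # p(s'|s,a), made-up numbers
    ("s0", "left"):  {"s0": 0.9, "s1": 0.1},
    ("s0", "right"): {"s0": 0.2, "s1": 0.8},
    ("s1", "left"):  {"s0": 0.7, "s1": 0.3},
    ("s1", "right"): {"s0": 0.1, "s1": 0.9},
}

def reward(s, a, s_next):                            # r(s, a, s'): +1 for reaching s1
    return 1.0 if s_next == "s1" else 0.0

def policy(s):                                       # a (uniform) stochastic policy pi(a|s)
    return random.choice(actions)

def sample(dist):                                    # draw one item from a discrete distribution
    r, acc = random.random(), 0.0
    for x, p in dist.items():
        acc += p
        if r <= acc:
            return x
    return x

def rollout(T=10):
    """Generate a trajectory h = [s1, a1, ..., sT, aT, s_{T+1}] as in Figure 1.11."""
    h, s = [], sample(p_init)
    for _ in range(T):
        a = policy(s)
        s_next = sample(p_trans[(s, a)])
        h.append((s, a, s_next))
        s = s_next
    return h

def the_return(h, gamma=0.9):
    """Discounted sum of immediate rewards: R(h) = sum_t gamma^(t-1) r(s_t, a_t, s_{t+1})."""
    return sum(gamma ** t * reward(s, a, s_next) for t, (s, a, s_next) in enumerate(h))

# The expected return can be estimated by averaging R(h) over many sampled trajectories.
print(sum(the_return(rollout()) for _ in range(1000)) / 1000)
```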
For policy learning, various methods have been developed so far. These methods can be classified into model-based reinforcement learning and model-free reinforcement learning. The term "model" indicates a model of the transition probability p(s′|s,a). In the model-based reinforcement learning approach, the transition probability is learned in advance and the learned transition model is explicitly used for policy learning. On the other hand, in the model-free reinforcement learning approach, policies are learned without explicitly estimating the transition probability. If strong prior knowledge of the transition model is available, the model-based approach would be more favorable. On the other hand, learning the transition model without prior knowledge itself is a hard statistical estimation problem. Thus, if good prior knowledge of the transition model is not available, the model-free approach would be more promising.

FIGURE 1.12: "argmax" gives the maximizer of a function, while "max" gives the maximum value of a function.
1.3 Structure of the Book

In this section, we explain the structure of this book, which covers major reinforcement learning approaches.

1.3.1 Model-Free Policy Iteration
Policy iteration is a popular and well-studied approach to reinforcement learning. The key idea of policy iteration is to determine policies based on the value function.

Let us first introduce the state-action value function Q^π(s,a) ∈ R for policy π, which is defined as the expected return the agent will receive when taking action a at state s and following policy π thereafter:

Q^π(s,a) = E_{p_π(h)}[R(h) | s_1 = s, a_1 = a],

where "|s_1 = s, a_1 = a" means that the initial state s_1 and the first action a_1 are fixed at s_1 = s and a_1 = a, respectively. That is, the right-hand side of the above equation denotes the conditional expectation of R(h) given s_1 = s and a_1 = a.

Let Q*(s,a) be the optimal state-action value at state s for action a, defined as

Q*(s,a) = max_π Q^π(s,a).

Based on the optimal state-action value function, the optimal action the agent should take at state s is deterministically given as the maximizer of Q*(s,a) with respect to a. Thus, the optimal policy π*(a|s) is given by

π*(a|s) = δ(a − argmax_{a′} Q*(s,a′)),

where δ(·) denotes Dirac's delta function.

Because the optimal state-action value Q* is unknown in practice, the policy iteration algorithm alternately evaluates the value Q^π for the current policy π and updates the policy π based on the current value Q^π (Figure 1.13).

1. Initialize the policy π(a|s).
2. Repeat the following two steps until the policy π(a|s) converges.
   (a) Policy evaluation: Compute the state-action value function Q^π(s,a) for the current policy π(a|s).
   (b) Policy improvement: Update the policy as
       π(a|s) ← δ(a − argmax_{a′} Q^π(s,a′)).

FIGURE 1.13: Algorithm of policy iteration.
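The alternating loop of Figure 1.13 can be sketched in a few lines. The example below is a hedged illustration for a tiny finite MDP whose transition probabilities and expected rewards are assumed known (randomly generated stand-ins, not from the book); in the rest of the book the policy evaluation step is instead carried out from data.

```python
import numpy as np

# A hedged sketch of the policy iteration loop of Figure 1.13 for a tiny finite MDP.
n_s, n_a, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))     # P[s, a] = p(.|s, a), made-up model
R = rng.random((n_s, n_a))                           # expected immediate reward r(s, a)

policy = np.zeros(n_s, dtype=int)                    # deterministic policy: state -> action
for _ in range(100):
    # Policy evaluation: iterate the Bellman equation for Q^pi until convergence.
    Q = np.zeros((n_s, n_a))
    for _ in range(1000):
        V = Q[np.arange(n_s), policy]                # value of the policy's action at each s'
        Q_new = R + gamma * P @ V
        if np.max(np.abs(Q_new - Q)) < 1e-10:
            break
        Q = Q_new
    # Policy improvement: act greedily with respect to the current Q^pi.
    new_policy = Q.argmax(axis=1)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy

print("greedy policy:", policy)
```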
The performance of the above policy iteration algorithm depends on the quality of policy evaluation; i.e., how to learn the state-action value function from data is the key issue. Value function approximation corresponds to a regression problem in statistics and machine learning. Thus, various statistical machine learning techniques can be utilized for better value function approximation. Part II of this book addresses this issue, including least-squares estimation and model selection (Chapter 2), basis function design (Chapter 3), efficient sample reuse (Chapter 4), active learning (Chapter 5), and robust learning (Chapter 6).
1.3.2 Model-Free Policy Search

One of the potential weaknesses of policy iteration is that policies are learned via value functions. Thus, improving the quality of value function approximation does not necessarily contribute to improving the quality of the resulting policies. Furthermore, a small change in value functions can cause a big difference in policies, which is problematic in, e.g., robot control because such instability can damage the robot's physical system. Another weakness of policy iteration is that policy improvement, i.e., finding the maximizer of Q^π(s,a) with respect to a, is computationally expensive or difficult when the action space A is continuous.
Policy search, which directly learns policy functions without estimating value functions, can overcome the above limitations. The basic idea of policy search is to find the policy that maximizes the expected return:

π* = argmax_π E_{p_π(h)}[R(h)].

In policy search, how to find a good policy function in a vast function space is the key issue to be addressed. Part III of this book focuses on policy search and introduces gradient-based methods and the expectation-maximization method in Chapter 7 and Chapter 8, respectively. However, a potential weakness of these direct policy search methods is their instability due to the stochasticity of policies. To overcome the instability problem, an alternative approach called policy-prior search, which learns the policy-prior distribution for deterministic policies, is introduced in Chapter 9. Efficient sample reuse in policy-prior search is also discussed there.
1.3.3 Model-Based Reinforcement Learning

In the above model-free approaches, policies are learned without explicitly modeling the unknown environment (i.e., the transition probability of the agent in the environment, p(s′|s,a)). On the other hand, the model-based approach explicitly learns the environment in advance and uses the learned environment model for policy learning.

No additional sampling cost is necessary to generate artificial samples from the learned environment model. Thus, the model-based approach is particularly useful when data collection is expensive (e.g., robot control). However, accurately estimating the transition model from a limited amount of trajectory data in multi-dimensional continuous state and action spaces is highly challenging. Part IV of this book focuses on model-based reinforcement learning. In Chapter 10, a non-parametric transition model estimator that possesses the optimal convergence rate with high computational efficiency is introduced. However, even with the optimal convergence rate, estimating the transition model in high-dimensional state and action spaces is still challenging. In Chapter 11, a dimensionality reduction method that can be efficiently embedded into the transition model estimation procedure is introduced and its usefulness is demonstrated through experiments.
Part II

Model-Free Policy Iteration

In Part II, we introduce a reinforcement learning approach based on value functions called policy iteration.

The key issue in the policy iteration framework is how to accurately approximate the value function from a small number of data samples. In Chapter 2, a fundamental framework of value function approximation based on least squares is explained. In this least-squares formulation, how to design good basis functions is critical for better value function approximation. A practical basis design method based on manifold-based smoothing (Chapelle et al., 2006) is explained in Chapter 3.

In real-world reinforcement learning tasks, gathering data is often costly. In Chapter 4, we describe a method for efficiently reusing previously collected samples in the framework of covariate shift adaptation (Sugiyama & Kawanabe, 2012). In Chapter 5, we apply a statistical active learning technique (Sugiyama & Kawanabe, 2012) to optimizing data collection strategies for reducing the sampling cost.

Finally, in Chapter 6, an outlier-robust extension of the least-squares method based on robust regression (Huber, 1981) is introduced. Such a robust method is highly useful in handling noisy real-world data.
Chapter 2
Policy Iteration with Value Function Approximation

In this chapter, we introduce the framework of least-squares policy iteration. In Section 2.1, we first explain the framework of policy iteration, which iteratively executes the policy evaluation and policy improvement steps for finding better policies. Then, in Section 2.2, we show how value function approximation in the policy evaluation step can be formulated as a regression problem and introduce a least-squares algorithm called least-squares policy iteration (Lagoudakis & Parr, 2003). Finally, this chapter is concluded in Section 2.3.
2.1 Value Functions

A traditional way to learn the optimal policy is based on the value function. In this section, we introduce two types of value functions, the state value function and the state-action value function, and explain how they can be used for finding better policies.

2.1.1 State Value Functions
The state value function V^π(s) ∈ R for policy π measures the "value" of state s, which is defined as the expected return the agent will receive when following policy π from state s:

V^π(s) = E_{p_π(h)}[R(h) | s_1 = s],

where "|s_1 = s" means that the initial state s_1 is fixed at s_1 = s. That is, the right-hand side of the above equation denotes the conditional expectation of the return R(h) given s_1 = s.

By recursion, V^π(s) can be expressed as

V^π(s) = E_{p(s′|s,a)π(a|s)}[r(s,a,s′) + γV^π(s′)],

where E_{p(s′|s,a)π(a|s)} denotes the conditional expectation over a and s′ drawn from p(s′|s,a)π(a|s) given s. This recursive expression is called the Bellman equation for state values. V^π(s) may be obtained by repeating the following update from some initial estimate:

V^π(s) ← E_{p(s′|s,a)π(a|s)}[r(s,a,s′) + γV^π(s′)].

The optimal state value at state s, V*(s), is defined as the maximizer of the state value V^π(s) with respect to policy π:

V*(s) = max_π V^π(s).

Based on the optimal state value V*(s), the optimal policy π*, which is deterministic, can be obtained as

π*(a|s) = δ(a − a*(s)),

where δ(·) denotes Dirac's delta function and

a*(s) = argmax_{a∈A} E_{p(s′|s,a)}[r(s,a,s′) + γV*(s′)].

E_{p(s′|s,a)} denotes the conditional expectation over s′ drawn from p(s′|s,a) given s and a. This algorithm, which first computes the optimal value function and then obtains the optimal policy based on the optimal value function, is called value iteration.

A possible variation is to iteratively perform policy evaluation and improvement as

Policy evaluation:  V^π(s) ← E_{p(s′|s,a)π(a|s)}[r(s,a,s′) + γV^π(s′)].
Policy improvement:  π(a|s) ← δ(a − a^π(s)),

where

a^π(s) = argmax_{a∈A} E_{p(s′|s,a)}[r(s,a,s′) + γV^π(s′)].

These two steps may be iterated either for all states at once or in a state-by-state manner. This iterative algorithm is called policy iteration (based on state value functions).
2.1.2 State-Action Value Functions

In the above policy improvement step, the action to take is optimized based on the state value function V^π(s). A more direct way to handle this action optimization is to consider the state-action value function Q^π(s,a) for policy π:

Q^π(s,a) = E_{p_π(h)}[R(h) | s_1 = s, a_1 = a],

where "|s_1 = s, a_1 = a" means that the initial state s_1 and the first action a_1 are fixed at s_1 = s and a_1 = a, respectively. That is, the right-hand side of the above equation denotes the conditional expectation of the return R(h) given s_1 = s and a_1 = a.

Let r(s,a) be the expected immediate reward when action a is taken at state s:

r(s,a) = E_{p(s′|s,a)}[r(s,a,s′)].

Then, in the same way as V^π(s), Q^π(s,a) can be expressed by recursion as

Q^π(s,a) = r(s,a) + γ E_{π(a′|s′)p(s′|s,a)}[Q^π(s′,a′)],   (2.1)

where E_{π(a′|s′)p(s′|s,a)} denotes the conditional expectation over s′ and a′ drawn from π(a′|s′)p(s′|s,a) given s and a. This recursive expression is called the Bellman equation for state-action values.

Based on the Bellman equation, the optimal policy may be obtained by iterating the following two steps:

Policy evaluation:  Q^π(s,a) ← r(s,a) + γ E_{π(a′|s′)p(s′|s,a)}[Q^π(s′,a′)].
Policy improvement:  π(a|s) ← δ(a − argmax_{a′∈A} Q^π(s,a′)).

In practice, it is sometimes preferable to use an explorative policy. For example, Gibbs policy improvement is given by

π(a|s) ← exp(Q^π(s,a)/τ) / ∫_A exp(Q^π(s,a′)/τ) da′,

where τ > 0 determines the degree of exploration. When the action space A is discrete, ε-greedy policy improvement is also used:

π(a|s) ← 1 − ε + ε/|A|  if a = argmax_{a′∈A} Q^π(s,a′),
π(a|s) ← ε/|A|  otherwise,

where ε ∈ (0, 1] determines the randomness of the new policy.

The above policy improvement step based on Q^π(s,a) is essentially the same as the one based on V^π(s) explained in Section 2.1.1. However, the policy improvement step based on Q^π(s,a) does not contain the expectation operator and thus policy improvement can be carried out more directly. For this reason, we focus on the above formulation, called policy iteration based on state-action value functions.
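As a concrete illustration of these two explorative rules, the following hedged sketch (not from the book) turns a vector of Q-values for one state into Gibbs and ε-greedy action probabilities over a discrete action set.

```python
import numpy as np

# Gibbs and epsilon-greedy policy improvement for a discrete action set,
# given one row of Q-values Q(s, .) as a NumPy array.
def gibbs_policy(q_values, tau=1.0):
    """pi(a|s) proportional to exp(Q(s,a)/tau); tau controls the degree of exploration."""
    z = np.exp((q_values - q_values.max()) / tau)    # subtract the max for numerical stability
    return z / z.sum()

def epsilon_greedy_policy(q_values, epsilon=0.1):
    """Probability 1 - eps + eps/|A| on the greedy action and eps/|A| on the others."""
    n = len(q_values)
    probs = np.full(n, epsilon / n)
    probs[np.argmax(q_values)] += 1.0 - epsilon
    return probs

q = np.array([0.2, 1.0, 0.5])
print(gibbs_policy(q, tau=0.5), epsilon_greedy_policy(q, epsilon=0.2))
```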
2.2 Least-Squares Policy Iteration

As explained in the previous section, the optimal policy function may be learned via the state-action value function Q^π(s,a). However, learning the state-action value function from data is a challenging task for continuous states s and actions a.

Learning the state-action value function from data can actually be regarded as a regression problem in statistics and machine learning. In this section, we explain how the least-squares regression technique can be employed in value function approximation, which is called least-squares policy iteration (Lagoudakis & Parr, 2003).
2.2.1 Immediate-Reward Regression

Let us approximate the state-action value function Q^π(s,a) by the following linear-in-parameter model:

Σ_{b=1}^{B} θ_b φ_b(s,a),

where {φ_b(s,a)}_{b=1}^{B} are basis functions, B denotes the number of basis functions, and {θ_b}_{b=1}^{B} are parameters. Specific designs of basis functions will be discussed in Chapter 3. Below, we use the following vector representation for compactly expressing the parameters and basis functions:

θ⊤φ(s,a),

where ⊤ denotes the transpose and

θ = (θ_1, …, θ_B)⊤ ∈ R^B,
φ(s,a) = (φ_1(s,a), …, φ_B(s,a))⊤ ∈ R^B.

From the Bellman equation for state-action values (2.1), we can express the expected immediate reward r(s,a) as

r(s,a) = Q^π(s,a) − γ E_{π(a′|s′)p(s′|s,a)}[Q^π(s′,a′)].

By substituting the value function model θ⊤φ(s,a) into the above equation, the expected immediate reward r(s,a) may be approximated as

r(s,a) ≈ θ⊤φ(s,a) − γ E_{π(a′|s′)p(s′|s,a)}[θ⊤φ(s′,a′)].

Now let us define a new basis function vector ψ(s,a):

ψ(s,a) = φ(s,a) − γ E_{π(a′|s′)p(s′|s,a)}[φ(s′,a′)].
FIGURE 2.1: Linear approximation of the state-action value function Q^π(s,a) as linear regression of the expected immediate reward r(s,a).
Then the expected immediate reward r(s,a) may be approximated as

r(s,a) ≈ θ⊤ψ(s,a).

As explained above, the linear approximation problem of the state-action value function Q^π(s,a) can be reformulated as the linear regression problem of the expected immediate reward r(s,a) (see Figure 2.1). The key trick was to push the recursive nature of the state-action value function Q^π(s,a) into the composite basis function ψ(s,a).
2.2.2 Algorithm

Now we explain how the parameters θ are learned in the least-squares framework. That is, the model θ⊤ψ(s,a) is fitted to the expected immediate reward r(s,a) under the squared loss:

min_θ E_{p_π(h)}[ (1/T) Σ_{t=1}^{T} ( θ⊤ψ(s_t, a_t) − r(s_t, a_t) )² ],

where h denotes the history sample following the current policy π:

h = [s_1, a_1, …, s_T, a_T, s_{T+1}].

For history samples H = {h_1, …, h_N}, where

h_n = [s_{1,n}, a_{1,n}, …, s_{T,n}, a_{T,n}, s_{T+1,n}],

an empirical version of the above least-squares problem is given as

min_θ (1/N) Σ_{n=1}^{N} (1/T) Σ_{t=1}^{T} ( θ⊤ψ̂(s_{t,n}, a_{t,n}; H) − r(s_{t,n}, a_{t,n}, s_{t+1,n}) )².
FIGURE 2.2: Gradient descent.
Here, ψ̂(s,a;H) is an empirical estimator of ψ(s,a) given by

ψ̂(s,a;H) = φ(s,a) − (1/|H_(s,a)|) Σ_{s′∈H_(s,a)} E_{π(a′|s′)}[γ φ(s′,a′)],

where H_(s,a) denotes a subset of H that consists of all transition samples from state s by action a, |H_(s,a)| denotes the number of elements in the set H_(s,a), and Σ_{s′∈H_(s,a)} denotes the summation over all destination states s′ in the set H_(s,a).
Let Ψ̂ be the NT × B matrix and r be the NT-dimensional vector defined as

Ψ̂_{N(t−1)+n, b} = ψ̂_b(s_{t,n}, a_{t,n}),
r_{N(t−1)+n} = r(s_{t,n}, a_{t,n}, s_{t+1,n}).

Ψ̂ is sometimes called the design matrix. Then the above least-squares problem can be compactly expressed as

min_θ (1/(NT)) ||Ψ̂θ − r||²,

where ||·|| denotes the ℓ2-norm. Because this is a quadratic function with respect to θ, its global minimizer θ̂ can be analytically obtained by setting its derivative to zero as

θ̂ = (Ψ̂⊤Ψ̂)^{−1} Ψ̂⊤r.   (2.2)
If B is too large and computing the inverse of Ψ̂⊤Ψ̂ is intractable, we may use a gradient descent method. That is, starting from some initial estimate θ, the solution is updated until convergence, as follows (see Figure 2.2):

θ ← θ − ε(Ψ̂⊤Ψ̂θ − Ψ̂⊤r),

where Ψ̂⊤Ψ̂θ − Ψ̂⊤r corresponds to the gradient of the objective function ||Ψ̂θ − r||² and ε is a small positive constant representing the step size of gradient descent.
A notable variation of the above least-squares method is to compute the solution by

θ̃ = (Φ⊤Ψ̂)^{−1} Φ⊤r,

where Φ is the NT × B matrix defined as

Φ_{N(t−1)+n, b} = φ_b(s_{t,n}, a_{t,n}).

This variation is called the least-squares fixed-point approximation (Lagoudakis & Parr, 2003) and is shown to handle the estimation error included in the basis function ψ̂ in a sound way (Bradtke & Barto, 1996). However, for simplicity, we focus on Eq. (2.2) below.
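As a concrete illustration, the following hedged sketch solves the least-squares problem of Eq. (2.2) and also runs the gradient-descent variant. The design matrix and reward vector are random stand-ins rather than quantities computed from actual trajectories; in practice they would be filled row by row, e.g., with the psi_hat() helper sketched above.

```python
import numpy as np

# Stand-ins for the NT x B design matrix Psi_hat and the NT rewards r (not real data).
rng = np.random.default_rng(0)
Psi_hat = rng.standard_normal((200, 10))
r = rng.standard_normal(200)

# Analytic solution of Eq. (2.2): theta = (Psi^T Psi)^{-1} Psi^T r, computed here with a
# least-squares solver, which is numerically preferable to forming the inverse explicitly.
theta_hat, *_ = np.linalg.lstsq(Psi_hat, r, rcond=None)

# Gradient-descent variant for large B: theta <- theta - eps * (Psi^T Psi theta - Psi^T r).
theta = np.zeros(Psi_hat.shape[1])
eps = 1.0 / np.linalg.norm(Psi_hat, ord=2) ** 2      # a conservative step size
for _ in range(5000):
    grad = Psi_hat.T @ (Psi_hat @ theta - r)
    theta -= eps * grad

print(np.allclose(theta, theta_hat, atol=1e-3))      # both routes agree
```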
2.2.3 Regularization

Regression techniques in machine learning are generally formulated as the minimization of a goodness-of-fit term and a regularization term. In the above least-squares framework, the goodness-of-fit of our model is measured by the squared loss. In the following chapters, we discuss how other loss functions can be utilized in the policy iteration framework, e.g., sample reuse in Chapter 4 and outlier-robust learning in Chapter 6. Here we focus on the regularization term and introduce practically useful regularization techniques.

The ℓ2-regularizer is the most standard regularizer in statistics and machine learning; it is also called ridge regression (Hoerl & Kennard, 1970):

min_θ (1/(NT)) ||Ψ̂θ − r||² + λ||θ||²,

where λ ≥ 0 is the regularization parameter. The role of the ℓ2-regularizer ||θ||² is to penalize the growth of the parameter vector θ to avoid overfitting to noisy samples. A practical advantage of the use of the ℓ2-regularizer is that the minimizer θ̂ can still be obtained analytically:

θ̂ = (Ψ̂⊤Ψ̂ + λI_B)^{−1} Ψ̂⊤r,

where I_B denotes the B × B identity matrix. Because of the addition of λI_B, the matrix to be inverted above has a better numerical condition, and thus the solution tends to be more stable than the solution obtained by plain least squares without regularization.
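A minimal sketch of the ridge solution above, under the same stand-in design-matrix assumption as before; adding λI_B is a one-line change to the normal equations.

```python
import numpy as np

# Ridge solution: theta = (Psi^T Psi + lambda I_B)^{-1} Psi^T r.
def ridge_solution(Psi_hat, r, lam):
    B = Psi_hat.shape[1]
    return np.linalg.solve(Psi_hat.T @ Psi_hat + lam * np.eye(B), Psi_hat.T @ r)

rng = np.random.default_rng(0)
Psi_hat, r = rng.standard_normal((200, 10)), rng.standard_normal(200)   # stand-in data
print(ridge_solution(Psi_hat, r, lam=0.1))
```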
Note that the same solution as the above ℓ2-penalized least-squares problem can be obtained by solving the following ℓ2-constrained least-squares problem:

min_θ (1/(NT)) ||Ψ̂θ − r||²   subject to ||θ||² ≤ C,

where C is determined from λ. Note that the larger the value of λ is (i.e., the stronger the effect of regularization is), the smaller the value of C is (i.e., the smaller the feasible region is). The feasible region (i.e., the region where the constraint ||θ||² ≤ C is satisfied) is illustrated in Figure 2.3(a).

FIGURE 2.3: Feasible regions (i.e., regions where the constraint is satisfied) for (a) the ℓ2-constraint and (b) the ℓ1-constraint. The least-squares (LS) solution is the bottom of the elliptical hyperboloid, whereas the solution of constrained least squares (CLS) is located at the point where the hyperboloid touches the feasible region.
Another popular choice of regularization in statistics and machine learning is the ℓ1-regularizer, which is also called the least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996):

min_θ (1/(NT)) ||Ψ̂θ − r||² + λ||θ||₁,

where ||·||₁ denotes the ℓ1-norm defined as the absolute sum of the elements:

||θ||₁ = Σ_{b=1}^{B} |θ_b|.

In the same way as in the ℓ2-regularization case, the same solution as the above ℓ1-penalized least-squares problem can be obtained by solving the following constrained least-squares problem:

min_θ (1/(NT)) ||Ψ̂θ − r||²   subject to ||θ||₁ ≤ C,
where C is determined from λ. The feasible region is illustrated in Figure 2.3(b).

A notable property of ℓ1-regularization is that the solution tends to be sparse, i.e., many of the elements {θ_b}_{b=1}^{B} become exactly zero. The reason why the solution becomes sparse can be intuitively understood from Figure 2.3(b): the solution tends to be on one of the corners of the feasible region, where the solution is sparse. On the other hand, in the ℓ2-constraint case (see Figure 2.3(a) again), the solution is similar to the ℓ1-constraint case, but it is not generally on an axis and thus the solution is not sparse. Such a sparse solution has various computational advantages. For example, the solution for large-scale problems can be computed efficiently because all parameters do not have to be explicitly handled; see, e.g., Tomioka et al., 2011. Furthermore, the solutions for all different regularization parameters can be computed efficiently (Efron et al., 2004), and the output of the learned model can be computed efficiently.
2.2.4 Model Selection

In regression, tuning parameters are often included in the algorithm, such as basis parameters and the regularization parameter. Such tuning parameters can be objectively and systematically optimized based on cross-validation (Wahba, 1990) as follows (see Figure 2.4).

FIGURE 2.4: Cross validation. The data set is split into the 1st through Kth subsets; in each round one subset is used for validation and the remaining subsets are used for estimation.

First, the training dataset H is divided into K disjoint subsets of approximately the same size, {H_k}_{k=1}^{K}. Then the regression solution θ̂_k is obtained using H\H_k (i.e., all samples without H_k), and its squared error for the hold-out samples H_k is computed. This procedure is repeated for k = 1, …, K, and the model (such as the basis parameter and the regularization parameter) that minimizes the average error is chosen as the most suitable one.
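The following hedged sketch illustrates this procedure for choosing the regularization parameter λ. For simplicity it splits the rows of the design matrix directly, whereas in the policy iteration setting one would split the episode set H itself; the ridge_solution helper from the previous sketch is restated so the example is self-contained.

```python
import numpy as np

def ridge_solution(Psi_hat, r, lam):
    B = Psi_hat.shape[1]
    return np.linalg.solve(Psi_hat.T @ Psi_hat + lam * np.eye(B), Psi_hat.T @ r)

def cross_validate(Psi_hat, r, candidates, K=5):
    """K-fold cross-validation: fit on K-1 folds, score squared error on the held-out fold."""
    folds = np.array_split(np.random.permutation(len(r)), K)
    scores = {}
    for lam in candidates:
        errors = []
        for k in range(K):
            hold_out = folds[k]
            train = np.concatenate([folds[j] for j in range(K) if j != k])
            theta_k = ridge_solution(Psi_hat[train], r[train], lam)
            residual = Psi_hat[hold_out] @ theta_k - r[hold_out]
            errors.append(np.mean(residual ** 2))    # squared error on the hold-out samples
        scores[lam] = np.mean(errors)
    return min(scores, key=scores.get), scores

rng = np.random.default_rng(0)
Psi_hat, r = rng.standard_normal((200, 10)), rng.standard_normal(200)   # stand-in data
best_lam, _ = cross_validate(Psi_hat, r, candidates=[1e-3, 1e-2, 1e-1, 1.0])
print("selected lambda:", best_lam)
```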
One may think that the ordinary squared error can be directly used for model selection, instead of its cross-validation estimator. However, the ordinary squared error is heavily biased (or, in other words, over-fitted) since the same training samples are used twice for learning parameters and estimating the generalization error (i.e., the out-of-sample prediction error). On the other hand, the cross-validation estimator of the squared error is almost unbiased, where "almost" comes from the fact that the number of training samples is reduced due to data splitting in the cross-validation procedure.
In general, cross-validation is computationally expensive because the squared error needs to be estimated many times. For example, when performing 5-fold cross-validation for 10 model candidates, the learning procedure has to be repeated 5 × 10 = 50 times. However, this is often acceptable in practice because sensible model selection gives an accurate solution even with a small number of samples. Thus, in total, the computation time may not grow that much. Furthermore, cross-validation is suitable for parallel computing since error estimation for different models and different folds is independent of each other. For instance, when performing 5-fold cross-validation for 10 model candidates, the use of 50 computing units allows us to compute everything at once.
2.3 Remarks

Reinforcement learning via regression of state-action value functions is a highly powerful and flexible approach, because we can utilize various regression techniques developed in statistics and machine learning such as least squares, regularization, and cross-validation.

In the following chapters, we introduce more sophisticated regression techniques such as manifold-based smoothing (Chapelle et al., 2006) in Chapter 3, covariate shift adaptation (Sugiyama & Kawanabe, 2012) in Chapter 4, active learning (Sugiyama & Kawanabe, 2012) in Chapter 5, and robust regression (Huber, 1981) in Chapter 6.
Chapter 3
Basis Design for Value Function Approximation

Least-squares policy iteration explained in Chapter 2 works well, given appropriate basis functions for value function approximation. Because of its smoothness, the Gaussian kernel is a popular and useful choice as a basis function. However, it does not allow for discontinuity, which is conceivable in many reinforcement learning tasks. In this chapter, we introduce an alternative basis function based on geodesic Gaussian kernels (GGKs), which exploit the non-linear manifold structure induced by the Markov decision processes (MDPs). The details of GGK are explained in Section 3.1, and its relation to other basis function designs is discussed in Section 3.2. Then, experimental performance is numerically evaluated in Section 3.3, and this chapter is concluded in Section 3.4.
3.1 Gaussian Kernels on Graphs

In least-squares policy iteration, the choice of basis functions {φ_b(s,a)}_{b=1}^{B} is an open design issue (see Chapter 2). Traditionally, Gaussian kernels have been a popular choice (Lagoudakis & Parr, 2003; Engel et al., 2005), but they cannot approximate discontinuous functions well. To cope with this problem, more sophisticated methods of constructing suitable basis functions have been proposed which effectively make use of the graph structure induced by MDPs (Mahadevan, 2005). In this section, we introduce an alternative way of constructing basis functions by incorporating the graph structure of the state space.
3.1.1 MDP-Induced Graph

Let G be a graph induced by an MDP, where states S are nodes of the graph and the transitions with non-zero transition probabilities from one node to another are edges. The edges may have weights determined, e.g., based on the transition probabilities or the distance between nodes. The graph structure corresponding to an example grid world shown in Figure 3.1(a) is illustrated in Figure 3.1(c). In practice, such a graph structure (including the connection weights) is estimated from samples of a finite length. We assume that the graph G is connected. Typically, the graph is sparse in reinforcement learning tasks, i.e.,

ℓ ≪ n(n−1)/2,

where ℓ is the number of edges and n is the number of nodes.

FIGURE 3.1: An illustrative example of a reinforcement learning task of guiding an agent to a goal in the grid world. (a) Black areas are walls over which the agent cannot move, while the goal is represented in gray. Arrows on the grids represent one of the optimal policies. (b) Optimal state value function (in log-scale). (c) Graph induced by the MDP and a random policy.
3.1.2 Ordinary Gaussian Kernels

Ordinary Gaussian kernels (OGKs) on the Euclidean space are defined as

K(s, s′) = exp( −ED(s, s′)² / (2σ²) ),

where ED(s, s′) is the Euclidean distance between states s and s′; for example,

ED(s, s′) = ||x − x′||,

when the Cartesian positions of s and s′ in the state space are given by x and x′, respectively. σ² is the variance parameter of the Gaussian kernel.

The above Gaussian function is defined on the state space S, where s′ is treated as a center of the kernel. In order to employ the Gaussian kernel in least-squares policy iteration, it needs to be extended over the state-action space S × A. This is usually carried out by simply "copying" the Gaussian function over the action space (Lagoudakis & Parr, 2003; Mahadevan, 2005). More precisely, let the total number k of basis functions be mp, where m is the number of possible actions and p is the number of Gaussian centers. For the i-th action a^(i) (∈ A) (i = 1, 2, …, m) and for the j-th Gaussian center c^(j) (∈ S) (j = 1, 2, …, p), the (i + (j−1)m)-th basis function is defined as

φ_{i+(j−1)m}(s,a) = I(a = a^(i)) K(s, c^(j)),   (3.1)

where I(·) is the indicator function:

I(a = a^(i)) = 1 if a = a^(i), and 0 otherwise.
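A hedged sketch of Eq. (3.1) is given below: a state-space Gaussian kernel evaluated at each center is "copied" over a discrete action set via the indicator function. The grid coordinates and action names are made up for illustration.

```python
import numpy as np

# Ordinary Gaussian kernel basis of Eq. (3.1) on a Euclidean state space,
# copied over a discrete action set. States and centers are 2-D positions here.
def ogk_basis(x, a, centers, actions, sigma2=1.0):
    """Return the (m * p)-dimensional feature vector phi(s, a) of Eq. (3.1)."""
    x = np.asarray(x, dtype=float)
    features = []
    for c in centers:                                 # j = 1, ..., p Gaussian centers
        k = np.exp(-np.sum((x - np.asarray(c, float)) ** 2) / (2.0 * sigma2))
        for action in actions:                        # i = 1, ..., m actions
            features.append(k if action == a else 0.0)   # I(a = a^(i)) * K(s, c^(j))
    return np.array(features)

centers = [(1.0, 1.0), (3.0, 2.0), (5.0, 5.0)]        # hypothetical kernel centers
actions = ["north", "south", "east", "west"]
print(ogk_basis((2.0, 1.5), "east", centers, actions))
```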
3.1.3 Geodesic Gaussian Kernels

On graphs, a natural definition of the distance would be the shortest path. The Gaussian kernel based on the shortest path is given by

K(s, s′) = exp( −SP(s, s′)² / (2σ²) ),   (3.2)

where SP(s, s′) denotes the shortest path from state s to state s′. The shortest path on a graph can be interpreted as a discrete approximation to the geodesic distance on a non-linear manifold (Chung, 1997). For this reason, we call Eq. (3.2) a geodesic Gaussian kernel (GGK) (Sugiyama et al., 2008).

Shortest paths on graphs can be efficiently computed using the Dijkstra algorithm (Dijkstra, 1959). With its naive implementation, the computational complexity of computing the shortest paths from a single node to all other nodes is O(n²), where n is the number of nodes. If the Fibonacci heap is employed, the computational complexity can be reduced to O(n log n + ℓ) (Fredman & Tarjan, 1987), where ℓ is the number of edges. Since the graph in value function approximation problems is typically sparse (i.e., ℓ ≪ n²), using the Fibonacci heap provides significant computational gains. Furthermore, there exist various approximation algorithms which are computationally very efficient (see Goldberg & Harrelson, 2005, and references therein).
Analogously to OGKs, we need to extend GGKs to the state-action space to use them in least-squares policy iteration. A naive way is to just employ Eq. (3.1), but this can cause a shift in the Gaussian centers since the state usually changes when some action is taken. To incorporate this transition, the basis functions are defined as the expectation of the Gaussian functions after transition:

φ_{i+(j−1)m}(s,a) = I(a = a^(i)) Σ_{s′∈S} P(s′|s,a) K(s′, c^(j)).   (3.3)

This shifting scheme is shown to work very well when the transition is predominantly deterministic (Sugiyama et al., 2008).
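The following hedged sketch combines Eq. (3.2) and Eq. (3.3): shortest-path distances are computed with Dijkstra's algorithm (here via scipy.sparse.csgraph), and the basis is shifted through an assumed deterministic transition function next_state(s, a). The small chain graph is a made-up example, not the grid world of Figure 3.1.

```python
import numpy as np
from scipy.sparse.csgraph import dijkstra

# A tiny hypothetical state graph with unit edge weights (4 states on a chain).
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
SP = dijkstra(W, directed=False)            # all-pairs shortest-path distances

def ggk(s, center, sigma2=1.0):
    """Geodesic Gaussian kernel of Eq. (3.2): exp(-SP(s, s')^2 / (2 sigma^2))."""
    return np.exp(-SP[s, center] ** 2 / (2.0 * sigma2))

def ggk_basis(s, a, centers, actions, next_state, sigma2=1.0):
    """State-action basis of Eq. (3.3); with a deterministic transition the expectation
    over p(s'|s, a) reduces to evaluating the kernel at the single successor state."""
    s_next = next_state(s, a)
    return np.array([ggk(s_next, c, sigma2) if action == a else 0.0
                     for c in centers for action in actions])

# Example with a made-up "move right/left along the chain" deterministic transition.
next_state = lambda s, a: min(s + 1, 3) if a == "right" else max(s - 1, 0)
print(ggk_basis(1, "right", centers=[0, 3], actions=["left", "right"], next_state=next_state))
```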
3.1.4 Extension to Continuous State Spaces

So far, we focused on discrete state spaces. However, the concept of GGKs can be naturally extended to continuous state spaces, which is explained here. First, the continuous state space is discretized, which gives a graph as a discrete approximation to the non-linear manifold structure of the continuous state space. Based on the graph, GGKs can be constructed in the same way as in the discrete case. Finally, the discrete GGKs are interpolated, e.g., using a linear method, to give continuous GGKs.

Although this procedure discretizes the continuous state space, it must be noted that the discretization is only for the purpose of obtaining the graph as a discrete approximation of the continuous non-linear manifold; the resulting basis functions themselves are continuously interpolated and hence the state space is still treated as continuous, as opposed to conventional discretization procedures.
3.2 Illustration

In this section, the characteristics of GGKs are discussed in comparison to existing basis functions.

3.2.1 Setup
Let us consider a toy reinforcement learning task of guiding an agent to a goal in a deterministic grid world (see Figure 3.1(a)). The agent can take 4 actions: up, down, left, and right. Note that actions which make the agent collide with the wall are disallowed. A positive immediate reward +1 is given if the agent reaches a goal state; otherwise it receives no immediate reward. The discount factor is set at γ = 0.9.

In this task, a state s corresponds to a two-dimensional Cartesian grid position x of the agent. For illustration purposes, let us display the state value function,

V^π(s): S → R,

which is the expected long-term discounted sum of rewards the agent receives when the agent takes actions following policy π from state s. From the definition, it can be confirmed that V^π(s) is expressed in terms of Q^π(s,a) as

V^π(s) = Q^π(s, π(s)).

The optimal state value function V*(s) (in log-scale) is illustrated in Figure 3.1(b). An MDP-induced graph structure estimated from 20 series of random walk samples¹ of length 500 is illustrated in Figure 3.1(c). Here, the edge weights in the graph are set at 1 (which is equivalent to the Euclidean distance between two nodes).
3.2.2 Geodesic Gaussian Kernels

An example of GGKs for this graph is depicted in Figure 3.2(a), where the variance of the kernel is set at a large value (σ² = 30) for illustration purposes. The graph shows that GGKs have a nice smooth surface along the maze, but not across the partition between the two rooms. Since GGKs have "centers," they are extremely useful for adaptively choosing a subset of bases, e.g., using a uniform allocation strategy, a sample-dependent allocation strategy, or a maze-dependent allocation strategy of the centers. This is a practical advantage over non-ordered basis functions. Moreover, since GGKs are local by nature, the ill effects of local noise are constrained locally, which is another useful property in practice.

The approximated value functions obtained by 40 GGKs² are depicted in Figure 3.3(a), where one GGK center is put at the goal state and the remaining centers are chosen randomly. For GGKs, the kernel functions are extended over the action space using the shifting scheme (see Eq. (3.3)) since the transition is deterministic (see Section 3.1.3).

¹ More precisely, in each random walk, an initial state is chosen randomly. Then, an action is chosen randomly and a transition is made; this is repeated 500 times. This entire procedure is independently repeated 20 times to generate the training set.

² Note that the total number k of basis functions is 160 since each GGK is copied over the action space as per Eq. (3.3).
FIGURE 3.2: Examples of basis functions: (a) geodesic Gaussian kernels, (b) ordinary Gaussian kernels, (c) graph-Laplacian eigenbases, (d) diffusion wavelets.
FIGURE 3.3: Approximated value functions in log-scale: (a) geodesic Gaussian kernels (MSE = 1.03×10⁻²), (b) ordinary Gaussian kernels (MSE = 1.19×10⁻²), (c) graph-Laplacian eigenbases (MSE = 4.73×10⁻⁴), (d) diffusion wavelets (MSE = 5.00×10⁻⁴). The errors are computed with respect to the optimal value function illustrated in Figure 3.1(b).
The GGK-based method produces a nice smooth function along the maze while the discontinuity around the partition between the two rooms is sharply maintained (cf. Figure 3.1(b)). As a result, for this particular case, GGKs give the optimal policy (see Figure 3.4(a)).

As discussed in Section 3.1.3, the sparsity of the state transition matrix allows efficient and fast computation of the shortest paths on the graph. Therefore, least-squares policy iteration with GGK-based bases is still computationally attractive.

3.2.3 Ordinary Gaussian Kernels

OGKs share some of the preferable properties of GGKs described above. However, as illustrated in Figure 3.2(b), the tails of OGKs extend beyond the partition between the two rooms. Therefore, OGKs tend to undesirably smooth out the discontinuity of the value function around the barrier wall (see Figure 3.3(b)). This causes an error in the policy around the partition (see x = 10, y = 2, 3, …, 9 of Figure 3.4(b)).
FIGURE 3.4: Obtained policies: (a) geodesic Gaussian kernels, (b) ordinary Gaussian kernels, (c) graph-Laplacian eigenbases, (d) diffusion wavelets.
3.2.4 Graph-Laplacian Eigenbases

Mahadevan (2005) proposed employing the smoothest vectors on graphs as bases in value function approximation. According to spectral graph theory (Chung, 1997), such smooth bases are given by the minor eigenvectors of the graph-Laplacian matrix, which are called graph-Laplacian eigenbases (GLEs). GLEs may be regarded as a natural extension of Fourier bases to graphs.

Examples of GLEs are illustrated in Figure 3.2(c), showing that they have a Fourier-like structure on the graph. It should be noted that GLEs are rather global in nature, implying that noise in a local region can potentially degrade the global quality of approximation. An advantage of GLEs is that they have a natural ordering of the basis functions according to smoothness. This is practically very helpful in choosing a subset of basis functions. Figure 3.3(c) depicts the approximated value function in log-scale, where the top 40 smoothest GLEs out of 326 GLEs are used (note that the actual number of bases is 160 because of the duplication over the action space). It shows that GLEs globally give a very good approximation, although the small local fluctuation is significantly emphasized since the graph is in log-scale. Indeed, the mean squared error (MSE) between the approximated and optimal value functions described in the captions of Figure 3.3 shows that GLEs give a much smaller MSE than GGKs and OGKs. However, the obtained value function contains systematic local fluctuation and this results in an inappropriate policy (see Figure 3.4(c)).

MDP-induced graphs are typically sparse. In such cases, the resulting graph-Laplacian matrix is also sparse and GLEs can be obtained just by solving a sparse eigenvalue problem, which is computationally efficient. However, finding minor eigenvectors could be numerically unstable.
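For completeness, the minor eigenvectors can be obtained with a sparse eigensolver. The following sketch is illustrative only (it assumes the unnormalized Laplacian L = D − W, which is one common choice, not necessarily the exact variant used in the cited work).

import numpy as np
from scipy.sparse import csr_matrix, diags
from scipy.sparse.linalg import eigsh

def graph_laplacian_eigenbases(adjacency, num_bases):
    # adjacency: (n, n) symmetric weight matrix W.
    # Returns the num_bases smoothest eigenvectors of L = D - W as an (n, num_bases) matrix.
    W = csr_matrix(adjacency)
    D = diags(np.asarray(W.sum(axis=1)).ravel())
    L = D - W
    # which="SM" requests the smallest eigenvalues; it can be slow on large graphs.
    vals, vecs = eigsh(L, k=num_bases, which="SM")
    return vecs[:, np.argsort(vals)]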
3.2.5 Diffusion Wavelets

Coifman and Maggioni (2006) proposed diffusion wavelets (DWs), which are a natural extension of wavelets to graphs. The construction is based on a symmetrized random walk on a graph, which is diffused on the graph up to a desired level, resulting in a multi-resolution structure. A detailed construction algorithm for DWs and their mathematical properties are described in Coifman and Maggioni (2006).

When constructing DWs, the maximum nest level of the wavelets and the tolerance used in the construction algorithm need to be specified by users. Here, the maximum nest level is set at 10 and the tolerance is set at 10⁻¹⁰, as suggested by the authors. Examples of DWs are illustrated in Figure 3.2(d), showing a nice multi-resolution structure on the graph. DWs are over-complete bases, so one has to appropriately choose a subset of bases for better approximation. Figure 3.3(d) depicts the approximated value function obtained by DWs, where the most global 40 DWs were chosen from 1626 over-complete DWs (note that the actual number of bases is 160 because of the duplication over the action space). The choice of the subset of bases could possibly be enhanced using multiple heuristics. However, the current choice is reasonable since Figure 3.3(d) shows that DWs give a much smaller MSE than Gaussian kernels. Nevertheless, similarly to GLEs, the obtained value function contains many small fluctuations (see Figure 3.3(d)) and this results in an erroneous policy (see Figure 3.4(d)).

Thanks to the multi-resolution structure, the computation of diffusion wavelets can be carried out recursively. However, due to the over-completeness, it is still rather demanding in computation time. Furthermore, appropriately determining the tuning parameters as well as choosing an appropriate basis subset is not straightforward in practice.
3.3 Numerical Examples

As discussed in the previous section, GGKs bring a number of preferable properties for making value function approximation effective. In this section, the behavior of GGKs is illustrated numerically.
3.3.1 Robot-Arm Control

Here, a simulator of a two-joint robot arm (moving in a plane), illustrated in Figure 3.5(a), is employed. The task is to lead the end-effector ("hand") of the arm to an object while avoiding the obstacles. Possible actions are to increase or decrease the angle of each joint ("shoulder" and "elbow") by 5 degrees in the plane, simulating coarse stepper-motor joints. Thus, the state space S is the 2-dimensional discrete space consisting of two joint angles, as illustrated in Figure 3.5(b). The black area in the middle corresponds to the obstacle in the joint-angle state space. The action space A involves 4 actions: increase or decrease one of the joint angles. A positive immediate reward +1 is given when the robot's end-effector touches the object; otherwise the robot receives no immediate reward. Note that actions which make the arm collide with obstacles are disallowed. The discount factor is set at γ = 0.9. In this environment, the robot can change each joint angle exactly by 5 degrees, and therefore the environment is deterministic. However, because of the obstacles, it is difficult to explicitly compute an inverse kinematic model. Furthermore, the obstacles introduce discontinuity in value functions. Therefore, this robot-arm control task is an interesting testbed for investigating the behavior of GGKs.

Training samples from 50 series of 1000 random arm movements are collected, where the start state is chosen randomly in each trial. The graph induced by the above MDP consists of 1605 nodes, and uniform weights are assigned to the edges. Since there are 16 goal states in this environment (see Figure 3.5(b)), the first 16 Gaussian centers are put at the goals and the remaining centers are chosen randomly in the state space. For GGKs, kernel functions are extended over the action space using the shifting scheme (see Eq. (3.3)) since the transition is deterministic in this experiment.

Figure 3.6 illustrates the value functions approximated using GGKs and OGKs. The graphs show that GGKs give a nice smooth surface with the obstacle-induced discontinuity sharply preserved, while OGKs tend to smooth out the discontinuity. This makes a significant difference in avoiding the obstacle. From "A" to "B" in Figure 3.5(b), the GGK-based value function results in a trajectory that avoids the obstacle (see Figure 3.6(a)). On the other hand, the OGK-based value function yields a trajectory that tries to move the arm through the obstacle by following the gradient upward (see Figure 3.6(b)), causing the arm to get stuck behind the obstacle.
FIGURE 3.5: A two-joint robot arm: (a) a schematic, (b) state space. In this experiment, GGKs are put at all the goal states and the remaining kernels are distributed uniformly over the maze; the shifting scheme is used in GGKs.
Figure 3.7 summarizes the performance of GGKs and OGKs measured by the percentage of successful trials (i.e., trials in which the end-effector reaches the object) over 30 independent runs. More precisely, in each run, 50,000 training samples are collected using a different random seed, a policy is then computed by GGK- or OGK-based least-squares policy iteration, and finally the obtained policy is tested. The graph shows that GGKs remarkably outperform OGKs since the arm can successfully avoid the obstacle. The performance of OGKs does not go beyond 0.6 even when the number of kernels is increased. This is caused by the tail effect of OGKs: the OGK-based policy cannot lead the end-effector to the object if it starts from the bottom-left half of the state space.

When the number of kernels is increased, the performance of both GGKs and OGKs gets worse at around k = 20.
FIGURE 3.6: Approximated value functions with 10 kernels (the actual number of bases is 40 because of the duplication over the action space): (a) geodesic Gaussian kernels, (b) ordinary Gaussian kernels.

FIGURE 3.7: Fraction of successful trials as a function of the number of kernels (legend: GGK(5), GGK(9), OGK(5), OGK(9)).
This is caused by the kernel allocation strategy: the first 16 kernels are put at the goal states and the remaining kernel centers are chosen randomly. When k is less than or equal to 16, the approximated value function tends to have a unimodal profile since all kernels are put at the goal states. However, when k is larger than 16, this unimodality is broken and the surface of the approximated value function has slight fluctuations, causing an error in the policies and degrading performance at around k = 20. This performance degradation tends to recover as the number of kernels is further increased.

Motion examples of the robot arm trained with GGK and OGK are illustrated in Figure 3.8 and Figure 3.9, respectively.

Overall, the above results show that when GGKs are combined with the above-mentioned kernel-center allocation strategy, almost perfect policies can be obtained with a small number of kernels. Therefore, the GGK method is computationally highly advantageous.
3.3.2 Robot-Agent Navigation

The above simple robot-arm control simulation shows that GGKs are promising. Here, GGKs are applied to a more challenging task of mobile-robot navigation, which involves a high-dimensional and very large state space.

A Khepera robot, illustrated in Figure 3.10(a), is employed for the navigation task. The Khepera robot is equipped with 8 infrared sensors ("s1" to "s8" in the figure), each of which gives a measure of the distance from the surrounding obstacles. Each sensor produces a scalar value between 0 and 1023: the sensor outputs the maximum value 1023 if an obstacle is just in front of the sensor, and the value decreases as the obstacle gets farther away until it reaches the minimum value 0. Therefore, the state space S is 8-dimensional. The Khepera robot has two wheels and takes the following predefined actions: forward, left rotation, right rotation, and backward (i.e., the action space A contains 4 actions). The speed of the left and right wheels for each action is described in Figure 3.10(a) in brackets (the unit is pulses per 10 milliseconds). Note that the sensor values and the wheel speeds are highly stochastic due to cross-talk, sensor noise, slip, etc. Furthermore, perceptual aliasing occurs due to the limited range and resolution of the sensors. Therefore, the state transition is also highly stochastic. The discount factor is set at γ = 0.9.

The goal of the navigation task is to make the Khepera robot explore the environment as much as possible. To this end, a positive reward +1 is given when the Khepera robot moves forward and a negative reward −2 is given when the Khepera robot collides with an obstacle. No reward is given to the left rotation, right rotation, and backward actions. This reward design encourages the Khepera robot to go forward without hitting obstacles, through which extensive exploration in the environment could be achieved.

Training samples are collected from 200 series of 100 random movements in a fixed environment with several obstacles (see Figure 3.11(a)). Then, a graph is constructed from the gathered samples by discretizing the continuous state space using a self-organizing map (SOM) (Kohonen, 1995). A SOM consists of neurons located on a regular grid. Each neuron corresponds to a cluster, and neurons are connected to adjacent ones by a neighborhood relation. The SOM is similar to the k-means clustering algorithm, but it differs in that the topological structure of the entire map is taken into account. Thanks to this, the entire space tends to be covered by the SOM.
FIGURE 3.8: A motion example of the robot arm trained with GGK (from left to right and top to bottom).

FIGURE 3.9: A motion example of the robot arm trained with OGK (from left to right and top to bottom).

FIGURE 3.10: Khepera robot: (a) a schematic, (b) state space projected onto a 2-dimensional subspace for visualization. In this experiment, GGKs are distributed uniformly over the maze without the shifting scheme.
The number of nodes (states) in the graph is set at 696 (equivalent to a SOM map size of 24 × 29). This value is computed by the standard rule-of-thumb formula 5√n (Vesanto et al., 2000), where n is the number of samples. The connectivity of the graph is determined by the state transitions occurring in the samples. More specifically, if there is a state transition from one node to another in the samples, an edge is established between these two nodes and the edge weight is set according to the Euclidean distance between them.

Figure 3.10(b) illustrates an example of the obtained graph structure. For visualization purposes, the 8-dimensional state space is projected onto a 2-dimensional subspace spanned by

  (−1 −1 0 0 1 1 0 0) and (0 0 1 1 0 0 −1 −1).

FIGURE 3.11: Simulation environment: (a) training, (b) test.
Note that this projection is performed only for the purpose of visualization. All the computations are carried out using the entire 8-dimensional data. The i-th element in the above bases corresponds to the output of the i-th sensor (see Figure 3.10(a)). The projection onto this subspace roughly means that the horizontal axis corresponds to the distance to the left and right obstacles, while the vertical axis corresponds to the distance to the front and back obstacles. For clear visibility, only the edges whose weight is less than 250 are plotted. Representative local poses of the Khepera robot with respect to the obstacles are illustrated in Figure 3.10(b). This graph has a notable feature: the nodes around the region "B" in the figure are directly connected to the nodes at "A," but are very sparsely connected to the nodes at "C," "D," and "E." This implies that the geodesic distance from "B" to "C," "B" to "D," or "B" to "E" is typically larger than the Euclidean distance.

Since the transition from one state to another is highly stochastic in the current experiment, the GGK function is simply duplicated over the action space (see Eq. (3.1)). For obtaining continuous GGKs, the GGK functions need to be interpolated (see Section 3.1.4). A simple linear interpolation method may be employed in general, but the current experiment has a unique characteristic: at least one of the sensor values is always zero since the Khepera robot is never completely surrounded by obstacles. Therefore, samples are always on the surface of the 8-dimensional hypercube-shaped state space. On the other hand, the node centers determined by the SOM are not generally on the surface. This means that a sample is not generally included in the convex hull of its nearest nodes and the function value needs to be extrapolated. Here, the Euclidean distance between the sample and its nearest node is simply added when computing kernel values. More precisely, for a state s that is not generally located on a node center, the GGK-based basis function is defined as

  φ_{i+(j−1)m}(s, a) = I(a = a^(i)) exp( − (ED(s, s̃) + SP(s̃, c^(j)))² / (2σ²) ),

where s̃ is the node closest to s in the Euclidean distance.
Figure 3.12 illustrates an example of the actions selected at each node by the GGK-based and OGK-based policies. One hundred kernels are used and the width is set at 1000. The symbols ↑, ↓, ⊂, and ⊃ in the figure indicate the forward, backward, left rotation, and right rotation actions. This shows that there is a clear difference in the obtained policies at the state "C." The backward action is most likely to be taken by the OGK-based policy, while the left rotation and right rotation are most likely to be taken by the GGK-based policy. This causes a significant difference in the performance. To explain this, suppose that the Khepera robot is at the state "C," i.e., it faces a wall. The GGK-based policy guides the Khepera robot from "C" to "A" via "D" or "E" by taking the left and right rotation actions, and it can avoid the obstacle successfully. On the other hand, the OGK-based policy tries to plan a path from "C" to "A" via "B" by activating the backward action. As a result, the forward action is taken at "B." For this reason, the Khepera robot returns to "C" again and ends up moving back and forth between "C" and "B."

For the performance evaluation, a more complicated environment than the one used for gathering training samples (see Figure 3.11) is used. This means that how well the obtained policies generalize to an unknown environment is evaluated here. In this test environment, the Khepera robot runs from a fixed starting position (see Figure 3.11(b)) and takes 150 steps following the obtained policy, with the sum of rewards (+1 for the forward action) computed. If the Khepera robot collides with an obstacle before 150 steps, the evaluation is stopped. The mean test performance over 30 independent runs is depicted in Figure 3.13 as a function of the number of kernels. More precisely, in each run, a graph is constructed based on the training samples taken from the training environment and the specified number of kernels is put randomly on the graph. Then, a policy is learned by GGK- or OGK-based least-squares policy iteration using the training samples. Note that the actual number of bases is four times larger because of the extension of the basis functions over the action space. The test performance is measured 5 times for each policy and the average is output. Figure 3.13 shows that GGKs significantly outperform OGKs, demonstrating that GGKs are promising even in this challenging setting with a high-dimensional, large state space.

Figure 3.14 depicts the computation time of each method as a function of the number of kernels. This shows that the computation time monotonically increases as the number of kernels increases, and that the GGK-based and OGK-based methods have comparable computation time. However, given that the GGK-based method works much better than the OGK-based method with a smaller number of kernels (see Figure 3.13), the GGK-based method can be regarded as a computationally efficient alternative to the standard OGK-based method.

Finally, the trained Khepera robot is applied to map building.
FIGURE 3.12: Examples of obtained policies: (a) geodesic Gaussian kernels, (b) ordinary Gaussian kernels. The symbols ↑, ↓, ⊂, and ⊃ indicate the forward, backward, left rotation, and right rotation actions.
Starting from an initial position (indicated by a square in Figure 3.15), the Khepera robot takes an action 2000 times following the learned policy. Eighty kernels with Gaussian width σ = 1000 are used for value function approximation. The results of GGKs and OGKs are depicted in Figure 3.15. The graphs show that the GGK result gives a broader profile of the environment, while the OGK result only reveals a local area around the initial position.

Motion examples of the Khepera robot trained with GGK and OGK are illustrated in Figure 3.16 and Figure 3.17, respectively.
FIGURE 3.13: Average amount of exploration (averaged total rewards) as a function of the number of kernels (legend: GGK(200), GGK(1000), OGK(200), OGK(1000)).

FIGURE 3.14: Computation time [sec] as a function of the number of kernels (legend: GGK(1000), OGK(1000)).

FIGURE 3.15: Results of map building (cf. Figure 3.11(b)): (a) geodesic Gaussian kernels, (b) ordinary Gaussian kernels.
FIGURE 3.16: A motion example of the Khepera robot trained with GGK (from left to right and top to bottom).

FIGURE 3.17: A motion example of the Khepera robot trained with OGK (from left to right and top to bottom).

3.4 Remarks

The performance of least-squares policy iteration depends heavily on the choice of basis functions for value function approximation. In this chapter, the geodesic Gaussian kernel (GGK) was introduced and shown to possess several preferable properties such as smoothness along the graph and easy computability. It was also demonstrated that the policies obtained by GGKs are not very sensitive to the choice of the Gaussian kernel width, which is a useful property in practice. Also, the heuristic of putting Gaussian centers on goal states was shown to work well.

However, when the transition is highly stochastic (i.e., the transition probability has a wide support), the graph constructed from the transition samples could be noisy. When an erroneous transition results in a short-cut over obstacles, the graph-based approach may not work well since the topology of the state space changes significantly.
Chapter 4

Sample Reuse in Policy Iteration

Off-policy reinforcement learning is aimed at efficiently using data samples gathered from a policy that is different from the currently optimized policy. A common approach is to use importance sampling techniques to compensate for the bias caused by the difference between the data-sampling policy and the target policy. In this chapter, we explain how importance sampling can be utilized to efficiently reuse previously collected data samples in policy iteration. After formulating the problem of off-policy value function approximation in Section 4.1, representative off-policy value function approximation techniques including adaptive importance sampling are reviewed in Section 4.2. Then, in Section 4.3, how the adaptivity of importance sampling can be optimally controlled is explained. In Section 4.4, off-policy value function approximation techniques are integrated into the framework of least-squares policy iteration for efficient sample reuse. Experimental results are shown in Section 4.5, and finally this chapter is concluded in Section 4.6.
4.1 Formulation

As explained in Section 2.2, least-squares policy iteration models the state-action value function Q^π(s, a) by a linear architecture,

  θ⊤φ(s, a),

and learns the parameter θ so that the generalization error G is minimized:

  G(θ) = E_{p^π(h)} [ (1/T) Σ_{t=1}^T ( θ⊤ψ(s_t, a_t) − r(s_t, a_t) )² ].    (4.1)

Here, E_{p^π(h)} denotes the expectation over the history

  h = [s_1, a_1, …, s_T, a_T, s_{T+1}]

following the target policy π, and

  ψ(s, a) = φ(s, a) − γ E_{π(a′|s′) p(s′|s,a)} [ φ(s′, a′) ].
When history samples following the target policy π are available, the situation is called on-policy reinforcement learning. In this case, simply replacing the expectation contained in the generalization error G by sample averages gives a statistically consistent estimator (i.e., the estimated parameter converges to the optimal value as the number of samples goes to infinity).

Here, we consider the situation called off-policy reinforcement learning, where the sampling policy π̃ for collecting data samples is generally different from the target policy π. Let us denote the history samples following π̃ by

  H^π̃ = {h^π̃_1, …, h^π̃_N},

where each episodic sample h^π̃_n is given as

  h^π̃_n = [s^π̃_{1,n}, a^π̃_{1,n}, …, s^π̃_{T,n}, a^π̃_{T,n}, s^π̃_{T+1,n}].

Under the off-policy setup, naive learning by minimizing the sample-approximated generalization error Ĝ_NIW leads to an inconsistent estimator:

  Ĝ_NIW(θ) = (1/(NT)) Σ_{n=1}^N Σ_{t=1}^T ( θ⊤ψ̂(s^π̃_{t,n}, a^π̃_{t,n}; H^π̃) − r(s^π̃_{t,n}, a^π̃_{t,n}, s^π̃_{t+1,n}) )²,

where

  ψ̂(s, a; H) = φ(s, a) − (1/|H_{(s,a)}|) Σ_{s′∈H_{(s,a)}} E_{π(a′|s′)} [ γ φ(s′, a′) ].

H_{(s,a)} denotes the subset of H that consists of all transition samples from state s by action a, |H_{(s,a)}| denotes the number of elements in the set H_{(s,a)}, and Σ_{s′∈H_{(s,a)}} denotes the summation over all destination states s′ in the set H_{(s,a)}. NIW stands for "No Importance Weight," which will be explained later.

This inconsistency problem can be avoided by gathering new samples following the target policy π, i.e., when the current policy is updated, new samples are gathered following the updated policy and these new samples are used for policy evaluation. However, when the data-sampling cost is high, this is too expensive. It would be more cost efficient if previously gathered samples could be reused effectively.
4.2 Off-Policy Value Function Approximation

Importance sampling is a general technique for dealing with the off-policy situation. Suppose we have i.i.d. (independent and identically distributed) samples {x_n}_{n=1}^N from a strictly positive probability density function p̃(x). Using these samples, we would like to compute the expectation of a function g(x) over another probability density function p(x). A consistent approximation of the expectation is given by the importance-weighted average as

  (1/N) Σ_{n=1}^N g(x_n) p(x_n)/p̃(x_n)  →(N→∞)  E_{p̃(x)}[ g(x) p(x)/p̃(x) ]
    = ∫ g(x) (p(x)/p̃(x)) p̃(x) dx = ∫ g(x) p(x) dx = E_{p(x)}[ g(x) ].

However, applying the importance sampling technique in off-policy reinforcement learning is not straightforward since our training samples of state s and action a are not i.i.d. due to the sequential nature of Markov decision processes (MDPs). In this section, representative importance-weighting techniques for MDPs are reviewed.
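The importance-weighted average itself is straightforward to compute. The following small Python sketch (illustrative only) estimates an expectation under one Gaussian density using samples drawn from another.

import numpy as np

def importance_weighted_mean(x, g, p, p_tilde):
    # x: samples drawn from p_tilde; g: function of interest; p, p_tilde: density functions.
    # Returns (1/N) sum g(x_n) p(x_n)/p_tilde(x_n), a consistent estimator of E_p[g(x)].
    x = np.asarray(x, dtype=float)
    return np.mean(g(x) * p(x) / p_tilde(x))

# Example: estimate E[x^2] under N(0,1) using samples from N(1,1).
rng = np.random.default_rng(0)
samples = rng.normal(loc=1.0, scale=1.0, size=100000)
normal = lambda x, mu: np.exp(-(x - mu) ** 2 / 2) / np.sqrt(2 * np.pi)
est = importance_weighted_mean(samples, g=lambda x: x ** 2,
                               p=lambda x: normal(x, 0.0),
                               p_tilde=lambda x: normal(x, 1.0))
print(est)  # close to 1, the second moment of N(0,1)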
4.2.1 Episodic Importance Weighting

Based on the independence between episodes,

  p(h, h′) = p(h) p(h′) = p(s_1, a_1, …, s_T, a_T, s_{T+1}) p(s′_1, a′_1, …, s′_T, a′_T, s′_{T+1}),

the generalization error G can be rewritten as

  G(θ) = E_{p^π̃(h)} [ (1/T) Σ_{t=1}^T ( θ⊤ψ(s_t, a_t) − r(s_t, a_t) )² w_T ],

where w_T is the episodic importance weight (EIW):

  w_T = p^π(h) / p^π̃(h).

p^π(h) and p^π̃(h) are the probability densities of observing the episodic data h under policies π and π̃:

  p^π(h) = p(s_1) ∏_{t=1}^T π(a_t | s_t) p(s_{t+1} | s_t, a_t),
  p^π̃(h) = p(s_1) ∏_{t=1}^T π̃(a_t | s_t) p(s_{t+1} | s_t, a_t).

Note that the importance weights can be computed without explicitly knowing p(s_1) and p(s_{t+1} | s_t, a_t), since they cancel out:

  w_T = ∏_{t=1}^T π(a_t | s_t) / ∏_{t=1}^T π̃(a_t | s_t).
Using the training data H^π̃, we can construct a consistent estimator of G as

  Ĝ_EIW(θ) = (1/(NT)) Σ_{n=1}^N Σ_{t=1}^T ( θ⊤ψ̂(s^π̃_{t,n}, a^π̃_{t,n}; H^π̃) − r(s^π̃_{t,n}, a^π̃_{t,n}, s^π̃_{t+1,n}) )² ŵ_{T,n},    (4.2)

where

  ŵ_{T,n} = ∏_{t=1}^T π(a^π̃_{t,n} | s^π̃_{t,n}) / ∏_{t=1}^T π̃(a^π̃_{t,n} | s^π̃_{t,n}).
4.2.2 Per-Decision Importance Weighting

A crucial observation in EIW is that the error at the t-th step does not depend on the samples after the t-th step (Precup et al., 2000). Thus, the generalization error G can be rewritten as

  G(θ) = E_{p^π̃(h)} [ (1/T) Σ_{t=1}^T ( θ⊤ψ(s_t, a_t) − r(s_t, a_t) )² w_t ],

where w_t is the per-decision importance weight (PIW):

  w_t = [ p(s_1) ∏_{t′=1}^t π(a_{t′} | s_{t′}) p(s_{t′+1} | s_{t′}, a_{t′}) ] / [ p(s_1) ∏_{t′=1}^t π̃(a_{t′} | s_{t′}) p(s_{t′+1} | s_{t′}, a_{t′}) ]
      = ∏_{t′=1}^t π(a_{t′} | s_{t′}) / ∏_{t′=1}^t π̃(a_{t′} | s_{t′}).

Using the training data H^π̃, we can construct a consistent estimator as follows (cf. Eq. (4.2)):

  Ĝ_PIW(θ) = (1/(NT)) Σ_{n=1}^N Σ_{t=1}^T ( θ⊤ψ̂(s^π̃_{t,n}, a^π̃_{t,n}; H^π̃) − r(s^π̃_{t,n}, a^π̃_{t,n}, s^π̃_{t+1,n}) )² ŵ_{t,n},

where

  ŵ_{t,n} = ∏_{t′=1}^t π(a^π̃_{t′,n} | s^π̃_{t′,n}) / ∏_{t′=1}^t π̃(a^π̃_{t′,n} | s^π̃_{t′,n}).

ŵ_{t,n} only contains the relevant terms up to the t-th step, while ŵ_{T,n} includes all the terms until the end of the episode.
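Both kinds of weights are products of action-selection probability ratios along a trajectory; the episodic weight is simply the final entry of the per-decision weights. A minimal sketch (illustrative names; pi and pi_tilde are assumed to be callables returning π(a|s) and π̃(a|s)) follows.

import numpy as np

def per_decision_weights(states, actions, pi, pi_tilde):
    # states, actions: one episode of length T.
    # Returns the PIW sequence (w_1, ..., w_T); the EIW weight w_T is the last element.
    ratios = np.array([pi(a, s) / pi_tilde(a, s) for s, a in zip(states, actions)])
    return np.cumprod(ratios)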
4.2.3 Adaptive Per-Decision Importance Weighting

The PIW estimator is guaranteed to be consistent. However, neither the EIW nor the PIW estimator is efficient in the statistical sense (Shimodaira, 2000), i.e., they do not have the smallest admissible variance. For this reason, the PIW estimator can have large variance in finite-sample cases and therefore learning with PIW tends to be unstable in practice.

To improve the stability, it is important to control the trade-off between consistency and efficiency (or, similarly, bias and variance) based on the training data. Here, the flattening parameter ν (∈ [0, 1]) is introduced to control the trade-off by slightly "flattening" the importance weights (Shimodaira, 2000; Sugiyama et al., 2007):

  Ĝ_AIW(θ) = (1/(NT)) Σ_{n=1}^N Σ_{t=1}^T ( θ⊤ψ̂(s^π̃_{t,n}, a^π̃_{t,n}; H^π̃) − r(s^π̃_{t,n}, a^π̃_{t,n}, s^π̃_{t+1,n}) )² (ŵ_{t,n})^ν,

where AIW stands for the adaptive per-decision importance weight. When ν = 0, AIW is reduced to NIW, and therefore it has large bias but relatively small variance. On the other hand, when ν = 1, AIW is reduced to PIW; it then has small bias but relatively large variance. In practice, an intermediate value of ν will yield the best performance.
Let Ψ̂ be the NT × B matrix, Ŵ be the NT × NT diagonal matrix, and r be the NT-dimensional vector defined as

  Ψ̂_{N(t−1)+n, b} = ψ̂_b(s_{t,n}, a_{t,n}),
  Ŵ_{N(t−1)+n, N(t−1)+n} = ŵ_{t,n},
  r_{N(t−1)+n} = r(s_{t,n}, a_{t,n}, s_{t+1,n}).

Then Ĝ_AIW can be compactly expressed as

  Ĝ_AIW(θ) = (1/(NT)) (Ψ̂θ − r)⊤ Ŵ^ν (Ψ̂θ − r).

Because this is a convex quadratic function with respect to θ, its global minimizer θ̂_AIW can be obtained analytically by setting the derivative to zero:

  θ̂_AIW = (Ψ̂⊤ Ŵ^ν Ψ̂)⁻¹ Ψ̂⊤ Ŵ^ν r.

This means that the cost of computing θ̂_AIW is essentially the same as that of θ̂_NIW, which is given as follows (see Section 2.2.2):

  θ̂_NIW = (Ψ̂⊤ Ψ̂)⁻¹ Ψ̂⊤ r.
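In matrix form, the AIW solution is an ordinary weighted least-squares fit. The following sketch (illustrative; the small ridge term is an addition for numerical stability and is not part of the formula above) computes θ̂_AIW.

import numpy as np

def theta_aiw(Psi, r, w, nu, ridge=1e-8):
    # Psi: (NT, B) matrix of psi-hat features; r: (NT,) reward vector;
    # w: (NT,) per-decision importance weights; nu: flattening parameter in [0, 1].
    # Returns theta_AIW = (Psi^T W^nu Psi)^{-1} Psi^T W^nu r.
    w_nu = w ** nu                                        # diagonal of W^nu
    A = Psi.T @ (w_nu[:, None] * Psi) + ridge * np.eye(Psi.shape[1])
    b = Psi.T @ (w_nu * r)
    return np.linalg.solve(A, b)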
4.2.4 Illustration

Here, the influence of the flattening parameter ν on the estimator θ̂_AIW is illustrated using the chain-walk MDP shown in Figure 4.1.

FIGURE 4.1: Ten-state chain-walk MDP.

The MDP consists of 10 states,

  S = {s^(1), …, s^(10)},

and two actions,

  A = {a^(1), a^(2)} = {"L," "R"}.

The reward +1 is given when visiting s^(1) and s^(10). The transition probability p is indicated by the numbers attached to the arrows in the figure. For example, p(s^(2) | s^(1), a = "R") = 0.9 and p(s^(1) | s^(1), a = "R") = 0.1 mean that the agent can successfully move to the right node with probability 0.9 (indicated by solid arrows in the figure) and the action fails with probability 0.1 (indicated by dashed arrows in the figure). Six Gaussian kernels with standard deviation σ = 10 are used as basis functions, and the kernel centers are located at s^(1), s^(5), and s^(10). More specifically, the basis functions φ(s, a) = (φ_1(s, a), …, φ_6(s, a)) are defined as

  φ_{3(i−1)+j}(s, a) = I(a = a^(i)) exp( − (s − c_j)² / (2σ²) ),

for i = 1, 2 and j = 1, 2, 3, where

  c_1 = 1, c_2 = 5, c_3 = 10,

and

  I(x) = 1 if x is true, 0 if x is not true.

The experiments are repeated 50 times, where the sampling policy π̃(a|s) and the current policy π(a|s) are chosen randomly in each trial such that π̃ ≠ π. The discount factor is set at γ = 0.9. The model parameter θ̂_AIW is learned from the training samples H^π̃ and its generalization error is computed from the test samples H^π.
The left column of Figure 4.2 depicts the true generalization error G averaged over 50 trials as a function of the flattening parameter ν for N = 10, 30, and 50. Figure 4.2(a) shows that when the number of episodes is large (N = 50), the generalization error tends to decrease as the flattening parameter increases. This is a natural result due to the consistency of θ̂_AIW when ν = 1.
FIGURE 4.2: Left: true generalization error G averaged over 50 trials as a function of the flattening parameter ν in the 10-state chain-walk problem. Right: generalization error estimated by 5-fold importance-weighted cross-validation (IWCV), Ĝ_IWCV, averaged over 50 trials as a function of the flattening parameter ν. The number of steps is fixed at T = 10; the panels correspond to (a) N = 50, (b) N = 30, and (c) N = 10. The trend of G differs depending on the number N of episodic samples, and IWCV nicely captures the trend of the true generalization error G.
On the other hand, Figure 4.2(b) shows that when the number of episodes is not large (N = 30), ν = 1 performs rather poorly. This implies that the consistent estimator tends to be unstable when the number of episodes is not large enough; ν = 0.7 works best in this case. Figure 4.2(c) shows the results when the number of episodes is further reduced (N = 10). This illustrates that the consistent estimator with ν = 1 is even worse than the ordinary estimator (ν = 0) because the bias is dominated by the large variance. In this case, the best ν is even smaller and is achieved at ν = 0.4.

The above results show that AIW can outperform PIW, particularly when only a small number of training samples is available, provided that the flattening parameter ν is chosen appropriately.
4.3 Automatic Selection of the Flattening Parameter

In this section, the problem of selecting the flattening parameter in AIW is addressed.

4.3.1 Importance-Weighted Cross-Validation

Generally, the best ν tends to be large (small) when the number of training samples is large (small). However, this general trend is not sufficient to fine-tune the flattening parameter since the best value of ν depends on the training samples, the policies, the model of the value functions, etc. In this section, we discuss how model selection is performed to choose the best flattening parameter ν automatically from the training data and policies.

Ideally, the value of ν should be set so that the generalization error G is minimized, but the true G is not accessible in practice. To cope with this problem, we can use cross-validation (see Section 2.2.4) for estimating the generalization error G. However, in the off-policy scenario where the sampling policy π̃ and the target policy π are different, ordinary cross-validation gives a biased estimate of G. In the off-policy scenario, importance-weighted cross-validation (IWCV) (Sugiyama et al., 2007) is more useful, where the cross-validation estimate of the generalization error is obtained with importance weighting.

More specifically, let us divide the training dataset H^π̃ containing N episodes into K subsets {H^π̃_k}_{k=1}^K of approximately the same size. For simplicity, we assume that N is divisible by K. Let θ̂^k_AIW be the parameter learned from H^π̃ \ H^π̃_k (i.e., all samples except H^π̃_k).
FIGURE 4.3: True generalization error G averaged over 50 trials obtained by NIW (ν = 0), PIW (ν = 1), and AIW+IWCV (ν is chosen by IWCV) in the 10-state chain-walk MDP, as a function of the number of episodes.
Then, the generalization error is estimated with importance weighting as

  Ĝ_IWCV = (1/K) Σ_{k=1}^K Ĝ^k_IWCV,

where

  Ĝ^k_IWCV = (K/(NT)) Σ_{h∈H^π̃_k} Σ_{t=1}^T ( θ̂^{k⊤}_AIW ψ̂(s_t, a_t; H^π̃_k) − r(s_t, a_t, s_{t+1}) )² ŵ_t.

The generalization error estimate Ĝ_IWCV is computed for all candidate models (in the current setting, a candidate model corresponds to a different value of the flattening parameter ν) and the one that minimizes the estimated generalization error is chosen:

  ν̂_IWCV = argmin_ν Ĝ_IWCV.
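A compact sketch of this selection procedure is given below. It is illustrative only: the data layout (flat arrays indexed by episode) is an assumption made here, and theta_aiw refers to the sketch given in Section 4.2.3.

import numpy as np

def select_nu_by_iwcv(Psi, r, w, episode_ids, candidates, K=5):
    # Episode-wise K-fold IWCV for the flattening parameter.
    # Psi: (NT, B) features, r: (NT,) rewards, w: (NT,) per-decision weights,
    # episode_ids: (NT,) episode index of each sample, candidates: list of nu values.
    episodes = np.unique(episode_ids)
    folds = np.array_split(episodes, K)
    scores = []
    for nu in candidates:
        err = 0.0
        for fold in folds:
            test = np.isin(episode_ids, fold)
            train = ~test
            theta = theta_aiw(Psi[train], r[train], w[train], nu)
            residual = Psi[test] @ theta - r[test]
            err += np.mean(w[test] * residual ** 2)   # importance-weighted validation error
        scores.append(err / K)
    return candidates[int(np.argmin(scores))]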
4.3.2 Illustration

To illustrate how IWCV works, let us use the same numerical examples as in Section 4.2.4. The right column of Figure 4.2 depicts the generalization error estimated by 5-fold IWCV averaged over 50 trials as a function of the flattening parameter ν. The graphs show that IWCV nicely captures the trend of the true generalization error for all three cases.

Figure 4.3 describes, as a function of the number N of episodes, the average true generalization error obtained by NIW (AIW with ν = 0), PIW (AIW with ν = 1), and AIW+IWCV (ν ∈ {0.0, 0.1, …, 0.9, 1.0} is selected in each trial using 5-fold IWCV). This result shows that the performance improvement by NIW saturates when N ≥ 30, implying that the bias caused by NIW is not negligible. The performance of PIW is worse than NIW when N ≤ 20, which is caused by the large variance of PIW. On the other hand, AIW+IWCV consistently gives good performance for all N, illustrating the strong adaptation ability of AIW+IWCV.
4.4 Sample-Reuse Policy Iteration

In this section, AIW+IWCV is extended from single-step policy evaluation to full policy iteration. This method is called sample-reuse policy iteration (SRPI).

4.4.1 Algorithm

Let us denote the policy at the L-th iteration by π_L. In on-policy policy iteration, new data samples H^{π_L} are collected following the new policy π_L during the policy evaluation step. Thus, previously collected data samples H^{π_1}, …, H^{π_{L−1}} are not used:

  π_1 →[E: H^{π_1}] Q̂^{π_1} →[I] π_2 →[E: H^{π_2}] Q̂^{π_2} →[I] π_3 →[E: H^{π_3}] ⋯ →[I] π_L,

where "E: H" indicates the policy evaluation step using the data sample H and "I" indicates the policy improvement step. It would be more cost efficient if all previously collected data samples were reused in policy evaluation:

  π_1 →[E: H^{π_1}] Q̂^{π_1} →[I] π_2 →[E: H^{π_1}, H^{π_2}] Q̂^{π_2} →[I] π_3 →[E: H^{π_1}, H^{π_2}, H^{π_3}] ⋯ →[I] π_L.

Since the previous policies and the current policy are different in general, an off-policy scenario needs to be explicitly considered to reuse previously collected data samples. Here, we explain how AIW+IWCV can be used in this situation. For this purpose, the definition of Ĝ_AIW is extended so that multiple sampling policies π_1, …, π_L are taken into account:

  Ĝ^L_AIW = (1/(LNT)) Σ_{l=1}^L Σ_{n=1}^N Σ_{t=1}^T ( θ⊤ψ̂(s^{π_l}_{t,n}, a^{π_l}_{t,n}; {H^{π_l}}_{l=1}^L) − r(s^{π_l}_{t,n}, a^{π_l}_{t,n}, s^{π_l}_{t+1,n}) )²
      × ( ∏_{t′=1}^t π_L(a^{π_l}_{t′,n} | s^{π_l}_{t′,n}) / π_l(a^{π_l}_{t′,n} | s^{π_l}_{t′,n}) )^{ν_L},    (4.3)

where Ĝ^L_AIW is the generalization error estimated at the L-th policy evaluation using AIW. The flattening parameter ν_L is chosen based on IWCV before performing policy evaluation.
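The overall SRPI loop can be summarized by the following schematic sketch. It is not the book's code: collect_episodes, build_features, and gibbs_policy are task-dependent placeholders, while select_nu_by_iwcv and theta_aiw refer to the sketches in the previous sections.

def sample_reuse_policy_iteration(initial_policy, num_iterations, nu_candidates):
    # Schematic outer loop of SRPI (illustrative pseudocode-style sketch).
    policy, data = initial_policy, []
    for L in range(1, num_iterations + 1):
        data.append(collect_episodes(policy))                   # new samples H^{pi_L}
        Psi, r, w, episode_ids = build_features(data, policy)   # weights use pi_L / pi_l ratios, Eq. (4.3)
        nu = select_nu_by_iwcv(Psi, r, w, episode_ids, nu_candidates)
        theta = theta_aiw(Psi, r, w, nu)                         # policy evaluation with AIW
        policy = gibbs_policy(theta, tau=2 * L)                  # policy improvement, Eq. (4.4)
    return policy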
FIGURE 4.4: Performance of policies learned in three scenarios: ν = 0, ν = 1, and SRPI (ν is chosen by IWCV) in the 10-state chain-walk problem, for (a) N = 5 and (b) N = 10. The performance is measured by the average return computed from test samples over 30 trials. The agent collects the training sample H^{π_L} (N = 5 or 10 with T = 10) at every iteration and performs policy evaluation using all collected samples H^{π_1}, …, H^{π_L}. The total number of episodes means the number of training episodes (N × L) collected by the agent in policy iteration.
4.4.2 Illustration

Here, the behavior of SRPI is illustrated under the same experimental setup as in Section 4.3.2. Let us consider three scenarios: ν is fixed at 0, ν is fixed at 1, and ν is chosen by IWCV (i.e., SRPI). The agent collects samples H^{π_L} in each policy iteration following the current policy π_L and computes θ̂^L_AIW from all collected samples H^{π_1}, …, H^{π_L} using Eq. (4.3). Three Gaussian kernels are used as basis functions, where the kernel centers are randomly selected from the state space S in each trial. The initial policy π_1 is chosen randomly, and Gibbs policy improvement,

  π(a|s) ← exp(Q^π(s, a)/τ) / Σ_{a′∈A} exp(Q^π(s, a′)/τ),    (4.4)

is performed with τ = 2L.
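Gibbs policy improvement (4.4) is simply a softmax over the estimated action values; a minimal sketch (illustrative, with a max-shift added here for numerical stability) is:

import numpy as np

def gibbs_policy_probs(q_values, tau):
    # Eq. (4.4): pi(a|s) proportional to exp(Q(s,a)/tau).
    # q_values: vector of Q(s,a) over actions for a fixed state s.
    z = (np.asarray(q_values, dtype=float) - np.max(q_values)) / tau
    e = np.exp(z)
    return e / e.sum()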
Figure 4.4 depicts the average return over 30 trials when N = 5 and 10 with a fixed number of steps (T = 10). The graphs show that SRPI provides stable and fast learning of policies, while the performance improvement of policies learned with ν = 0 saturates in early iterations. The method with ν = 1 can improve policies well, but its progress tends to lag behind SRPI.

Figure 4.5 depicts the average value of the flattening parameter used in SRPI as a function of the total number of episodic samples. The graphs show that the value of the flattening parameter chosen by IWCV tends to rise in the beginning and go down later.
FIGURE 4.5: Flattening parameter values used by SRPI averaged over 30 trials as a function of the total number of episodic samples in the 10-state chain-walk problem, for (a) N = 5 and (b) N = 10.
At first sight, this does not agree with the general trend of preferring a low-variance estimator in early stages and a low-bias estimator later. However, this result is still consistent with the general trend: when the return increases rapidly (i.e., while the total number of episodic samples is up to 15 when N = 5 and up to 30 when N = 10 in Figure 4.5), the value of the flattening parameter increases (see Figure 4.4). After that, the return does not increase anymore (see Figure 4.4) since the policy iteration has already converged. Then, it is natural to prefer a small flattening parameter (Figure 4.5) since the sample selection bias becomes mild after convergence.

These results show that SRPI can effectively reuse previously collected samples by appropriately tuning the flattening parameter according to the condition of the data samples, the policies, etc.
4.5 Numerical Examples

In this section, the performance of SRPI is numerically investigated in more complex tasks.

4.5.1 Inverted Pendulum

First, we consider the task of the swing-up inverted pendulum illustrated in Figure 4.6, which consists of a pole hinged at the top of a cart. The goal of the task is to swing the pole up by moving the cart. There are three actions: applying positive force +50 (kg·m/s²) to the cart to move right, negative force −50 to move left, and zero force to just coast. That is, the action space A is discrete and described by

  A = {50, −50, 0} kg·m/s².

Note that the force itself is not strong enough to swing the pole up. Thus the cart needs to be moved back and forth several times to swing the pole up. The state space S is continuous and consists of the angle ϕ [rad] (∈ [0, 2π]) and the angular velocity ϕ̇ [rad/s] (∈ [−π, π]). Thus, a state s is described by the two-dimensional vector s = (ϕ, ϕ̇)⊤. The angle ϕ and angular velocity ϕ̇ are updated as follows:

  ϕ_{t+1} = ϕ_t + ϕ̇_{t+1} Δt,
  ϕ̇_{t+1} = ϕ̇_t + [ 9.8 sin(ϕ_t) − α w d (ϕ̇_t)² sin(2ϕ_t)/2 + α cos(ϕ_t) a_t ] / [ 4l/3 − α w d cos²(ϕ_t) ] · Δt,

where α = 1/(W + w) and a_t (∈ A) is the action chosen at time t. The reward function r(s, a, s′) is defined as

  r(s, a, s′) = cos(ϕ_{s′}),

where ϕ_{s′} denotes the angle ϕ of state s′. The problem parameters are set as follows: the mass of the cart W is 8 [kg], the mass of the pole w is 2 [kg], the length of the pole d is 0.5 [m], and the simulation time step Δt is 0.1 [s].

Forty-eight Gaussian kernels with standard deviation σ = π are used as basis functions, and the kernel centers are located over the following grid points:

  {0, 2π/3, 4π/3, 2π} × {−3π, −π, π, 3π}.

That is, the basis functions φ(s, a) = (φ_1(s, a), …, φ_48(s, a)) are set as

  φ_{16(i−1)+j}(s, a) = I(a = a^(i)) exp( − ‖s − c_j‖² / (2σ²) ),

for i = 1, 2, 3 and j = 1, …, 16, where

  c_1 = (0, −3π)⊤, c_2 = (0, −π)⊤, …, c_16 = (2π, 3π)⊤.
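The state-update equations above can be simulated with a simple Euler step. The following sketch is an illustration only: it assumes that l in the denominator refers to the pole length d given in the text, and the modulo-2π wrap of the angle is an assumption made here to keep ϕ in [0, 2π].

import numpy as np

def pendulum_step(phi, phi_dot, a, W=8.0, w=2.0, d=0.5, dt=0.1):
    # One Euler step of the swing-up dynamics; alpha = 1/(W + w), a in {+50, -50, 0}.
    alpha = 1.0 / (W + w)
    accel = (9.8 * np.sin(phi) - alpha * w * d * phi_dot ** 2 * np.sin(2 * phi) / 2
             + alpha * np.cos(phi) * a) / (4 * d / 3 - alpha * w * d * np.cos(phi) ** 2)
    phi_dot_next = phi_dot + accel * dt
    phi_next = (phi + phi_dot_next * dt) % (2 * np.pi)
    reward = np.cos(phi_next)             # r(s, a, s') = cos(phi of the next state)
    return phi_next, phi_dot_next, reward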
FIGURE 4.7: Results of SRPI in the inverted pendulum task. The agent collects the training sample H^{π_L} (N = 10 and T = 100) in each iteration and policy evaluation is performed using all collected samples H^{π_1}, …, H^{π_L}. (a) Performance of the policies learned with ν = 0, ν = 1, and SRPI, measured by the average return computed from test samples over 20 trials. The total number of episodes means the number of training episodes (N × L) collected by the agent in policy iteration. (b) Average flattening parameter values chosen by IWCV in SRPI over 20 trials.
The initial policy π_1(a|s) is chosen randomly, and the initial-state probability density p(s) is set to be uniform. The agent collects the data samples H^{π_L} (N = 10 and T = 100) at each policy iteration following the current policy π_L. The discount factor is set at γ = 0.95, and the policy is updated by Gibbs policy improvement (4.4) with τ = L.

Figure 4.7(a) describes the performance of the learned policies. The graph shows that SRPI nicely improves the performance throughout the entire policy iteration. On the other hand, the performance when the flattening parameter is fixed at ν = 0 or ν = 1 is not properly improved after the middle of the iterations. The average flattening parameter values depicted in Figure 4.7(b) show that the flattening parameter tends to increase quickly in the beginning and is then kept at medium values. Motion examples of the inverted pendulum by SRPI with ν chosen by IWCV and with ν = 1 are illustrated in Figure 4.8 and Figure 4.9, respectively.

These results indicate that the flattening parameter is well adjusted to reuse the previously collected samples effectively for policy evaluation, and thus SRPI can outperform the other methods.
FIGURE 4.8: Motion examples of the inverted pendulum by SRPI with ν chosen by IWCV (from left to right and top to bottom).

FIGURE 4.9: Motion examples of the inverted pendulum by SRPI with ν = 1 (from left to right and top to bottom).

4.5.2 Mountain Car

Next, we consider the mountain car task illustrated in Figure 4.10. The task consists of a car and two hills whose landscape is described by sin(3x).

FIGURE 4.10: Illustration of the mountain car task.
The top of the right hill is the goal to which we want to guide the car. There are three actions,

  {+0.2, −0.2, 0},

which are the values of the force applied to the car. Note that the force of the car is not strong enough to climb up the slope to reach the goal. The state space S is described by the horizontal position x [m] (∈ [−1.2, 0.5]) and the velocity ẋ [m/s] (∈ [−1.5, 1.5]): s = (x, ẋ)⊤. The position x and velocity ẋ are updated by

  x_{t+1} = x_t + ẋ_{t+1} Δt,
  ẋ_{t+1} = ẋ_t + ( −9.8 w cos(3x_t) + a_t/w − k ẋ_t ) Δt,

where a_t (∈ A) is the action chosen at time t. The reward function R(s, a, s′) is defined as

  R(s, a, s′) = 1 if x_{s′} ≥ 0.5, and −0.01 otherwise,

where x_{s′} denotes the horizontal position x of state s′. The problem parameters are set as follows: the mass of the car w is 0.2 [kg], the friction coefficient k is 0.3, and the simulation time step Δt is 0.1 [s].

The same experimental setup as in the swing-up inverted pendulum task of Section 4.5.1 is used, except that the number of Gaussian kernels is 36, the kernel standard deviation is set at σ = 1, and the kernel centers are allocated over the following grid points:

  {−1.2, 0.35, 0.5} × {−1.5, −0.5, 0.5, 1.5}.
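As with the pendulum, the mountain-car dynamics can be simulated with one Euler step per action; the sketch below is an illustration following the update equations and constants given in the text.

import numpy as np

def mountain_car_step(x, x_dot, a, w=0.2, k=0.3, dt=0.1):
    # One Euler step of the mountain-car dynamics; a is the applied force in {+0.2, -0.2, 0}.
    x_dot_next = x_dot + (-9.8 * w * np.cos(3 * x) + a / w - k * x_dot) * dt
    x_next = x + x_dot_next * dt
    reward = 1.0 if x_next >= 0.5 else -0.01   # R(s, a, s') from the text
    return x_next, x_dot_next, reward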
FIGURE 4.11: Results of sample-reuse policy iteration in the mountain-car task. The agent collects the training sample H^{π_L} (N = 10 and T = 100) at every iteration and policy evaluation is performed using all collected samples H^{π_1}, …, H^{π_L}. (a) Performance measured by the average return computed from test samples over 20 trials; the total number of episodes means the number of training episodes (N × L) collected by the agent in policy iteration. (b) Average flattening parameter values used by SRPI over 20 trials.
Figure 4.11(a) shows the performance of the learned policies measured by the average return computed from the test samples. The graph shows similar tendencies to the swing-up inverted pendulum task for SRPI and ν = 1, while the method with ν = 0 performs relatively well this time. This implies that the bias in the previously collected samples does not affect the estimation of the value functions that strongly, because the function approximator is better suited to represent the value function for this problem. The average flattening parameter values (cf. Figure 4.11(b)) show that the flattening parameter decreases soon after the initial increase, and then smaller values tend to be chosen. This indicates that SRPI tends to use low-variance estimators in this task. Motion examples by SRPI with ν chosen by IWCV are illustrated in Figure 4.12.

These results show that SRPI can perform stable and fast learning by effectively reusing previously collected data.
FIGURE 4.12: Motion examples of the mountain car by SRPI with ν chosen by IWCV (from left to right and top to bottom).

4.6 Remarks

Instability has been one of the critical limitations of importance-sampling techniques, which often makes off-policy methods impractical. To overcome this weakness, an adaptive importance-sampling technique was introduced for controlling the trade-off between consistency and stability in value function approximation. Furthermore, importance-weighted cross-validation was introduced for automatically choosing the trade-off parameter.

The range of application of importance sampling is not limited to policy iteration. We will explain how importance sampling can be utilized for sample reuse in the policy search frameworks in Chapter 8 and Chapter 9.
Chapter 5

Active Learning in Policy Iteration

In Chapter 4, we considered the off-policy situation where the data-collecting policy and the target policy are different. In the framework of sample-reuse policy iteration, new samples are always chosen following the target policy. However, a clever choice of sampling policies can actually further improve the performance. The topic of choosing sampling policies is called active learning in statistics and machine learning. In this chapter, we address the problem of choosing sampling policies in sample-reuse policy iteration. In Section 5.1, we explain how a statistical active learning method can be employed for optimizing the sampling policy in value function approximation. In Section 5.2, we introduce active policy iteration, which incorporates the active learning idea into the framework of sample-reuse policy iteration. The effectiveness of active policy iteration is numerically investigated in Section 5.3, and finally this chapter is concluded in Section 5.4.
5.1 Efficient Exploration with Active Learning

The accuracy of the estimated value functions depends on the training samples collected following the sampling policy π̃(a|s). In this section, we explain how a statistical active learning method (Sugiyama, 2006) can be employed for value function approximation.

5.1.1 Problem Setup

Let us consider a situation where collecting state-action trajectory samples is easy and cheap, but gathering immediate reward samples is hard and expensive. For example, consider a robot-arm control task of hitting a ball with a bat and driving the ball as far away as possible (see Figure 5.6). Let us adopt the carry of the ball as the immediate reward. In this setting, obtaining state-action trajectory samples of the robot arm is easy and relatively cheap since we just need to control the robot arm and record its state-action trajectories over time. However, explicitly computing the carry of the ball from the state-action samples is hard due to friction and elasticity of the links, air resistance, air currents, and so on. For this reason, in practice, we may have to put the robot in open space, let the robot really hit the ball, and measure the carry of the ball manually. Thus, gathering immediate reward samples is much more expensive than gathering the state-action trajectory samples. In such a situation, immediate reward samples are too expensive to be used for designing the sampling policy; only state-action trajectory samples may be used for designing sampling policies.

The goal of active learning in the current setup is to determine the sampling policy so that the expected generalization error is minimized. However, since the generalization error is not accessible in practice, it needs to be estimated from samples for performing active learning. A difficulty of estimating the generalization error in the context of active learning is that its estimation needs to be carried out only from state-action trajectory samples without using immediate reward samples. This means that standard generalization error estimation techniques such as cross-validation cannot be employed. Below, we explain how the generalization error can be estimated without the reward samples.
5.1.2 Decomposition of the Generalization Error

The information we are allowed to use for estimating the generalization error is a set of roll-out samples without immediate rewards:

  H^π̃ = {h^π̃_1, …, h^π̃_N},

where each episodic sample h^π̃_n is given as

  h^π̃_n = [s^π̃_{1,n}, a^π̃_{1,n}, …, s^π̃_{T,n}, a^π̃_{T,n}, s^π̃_{T+1,n}].

Let us define the deviation of an observed immediate reward r^π̃_{t,n} from its expectation r(s^π̃_{t,n}, a^π̃_{t,n}) as

  ǫ^π̃_{t,n} = r^π̃_{t,n} − r(s^π̃_{t,n}, a^π̃_{t,n}).

Note that ǫ^π̃_{t,n} can be regarded as additive noise in the context of least-squares function fitting. By definition, ǫ^π̃_{t,n} has mean zero and its variance generally depends on s^π̃_{t,n} and a^π̃_{t,n}, i.e., the noise is heteroscedastic (Bishop, 2006). However, since estimating the variance of ǫ^π̃_{t,n} without using reward samples is not generally possible, we ignore the dependence of the variance on s^π̃_{t,n} and a^π̃_{t,n}. Let us denote the input-independent common variance by σ².

We would like to estimate the generalization error,

  G(θ̂) = E_{p^π̃(h)} [ (1/T) Σ_{t=1}^T ( θ̂⊤ψ̂(s_t, a_t; H^π̃) − r(s_t, a_t) )² ],

from H^π̃. Its expectation over the "noise" can be decomposed as follows (Sugiyama, 2006):

  E_{ǫ^π̃} [ G(θ̂) ] = Bias + Variance + ModelError,

where E_{ǫ^π̃} denotes the expectation over the "noise" {ǫ^π̃_{t,n}}_{t=1,n=1}^{T,N}. "Bias," "Variance," and "ModelError" are the bias term, the variance term, and the model error term defined by

  Bias = E_{p^π̃(h)} [ (1/T) Σ_{t=1}^T { (E_{ǫ^π̃}[θ̂] − θ*)⊤ ψ̂(s_t, a_t; H^π̃) }² ],
  Variance = E_{p^π̃(h)} [ (1/T) Σ_{t=1}^T { (θ̂ − E_{ǫ^π̃}[θ̂])⊤ ψ̂(s_t, a_t; H^π̃) }² ],
  ModelError = E_{p^π̃(h)} [ (1/T) Σ_{t=1}^T ( θ*⊤ ψ̂(s_t, a_t; H^π̃) − r(s_t, a_t) )² ].

θ* denotes the optimal parameter in the model:

  θ* = argmin_θ E_{p^π̃(h)} [ (1/T) Σ_{t=1}^T ( θ⊤ψ(s_t, a_t) − r(s_t, a_t) )² ].

Note that, for a linear estimator θ̂ such that

  θ̂ = L̂ r,

where L̂ is some matrix and r is the NT-dimensional vector defined as

  r_{N(t−1)+n} = r(s_{t,n}, a_{t,n}, s_{t+1,n}),

the variance term can be expressed in a compact form as

  Variance = σ² tr(U L̂ L̂⊤),

where the matrix U is defined as

  U = E_{p^π̃(h)} [ (1/T) Σ_{t=1}^T ψ̂(s_t, a_t; H^π̃) ψ̂(s_t, a_t; H^π̃)⊤ ].    (5.1)
5.1.3
EstimationofGeneralizationError
Sinceweareinterestedinfindingaminimizerofthegeneralizationerror
withrespecttoe
π,themodelerror,whichisconstant,canbesafelyignoredin
generalizationerrorestimation.Ontheotherhand,thebiastermincludesthe
68
StatisticalReinforcementLearning
unknownoptimalparameterθ∗.Thus,itmaynotbepossibletoestimatethebiastermwithoutusingrewardsamples.Similarly,itmaynotbepossibleto
estimatethe“noise”varianceσ2includedinthevariancetermwithoutusing
rewardsamples.
Itisknownthatthebiastermissmallenoughtobeneglectedwhenthe
modelisapproximatelycorrect(Sugiyama,2006),i.e.,θ∗⊤b
ψ(s,a)approxi-
matelyagreeswiththetruefunctionr(s,a).Thenwehave
h
i
⊤EǫeπG(b
θ)−ModelError−Bias∝tr(UbLb
L),
(5.2)
whichdoesnotrequireimmediaterewardsamplesforitscomputation.Since
Epeπ(h)includedinUisnotaccessible(seeEq.(5.1)),Uisreplacedbyits
consistentestimatorb
U:
N
XT
X
b
1
U=
b
ψ(seπ
NT
t,n,ae
π
t,n;He
π)b
ψ(seπt,n,aeπt,n;Heπ)⊤b
wt,n.
n=1t=1
Consequently, the following generalization error estimator is obtained:

J = \mathrm{tr}( \hat{U} \hat{L} \hat{L}^\top ),

which can be computed only from H^{\tilde{\pi}} and thus can be employed in the active learning scenarios. If it is possible to gather H^{\tilde{\pi}} multiple times, the above J may be computed multiple times and their average used as a generalization error estimator.

Note that the values of the generalization error estimator J and the true generalization error G are not directly comparable since irrelevant additive and multiplicative constants are ignored (see Eq. (5.2)). However, this is no problem as long as the estimator J has a similar profile to the true error G as a function of sampling policy \tilde{\pi}, since the purpose of deriving a generalization error estimator in active learning is not to approximate the true generalization error itself, but to approximate the minimizer of the true generalization error with respect to sampling policy \tilde{\pi}.
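The quantities entering J are all computable from reward-free roll-outs. The following is a minimal sketch, not the book's reference implementation, of how J could be evaluated with NumPy; the function name and the ridge term 10^{-3} (taken from the experimental setup later in this section) are assumptions, and the rows of the basis matrix are assumed to be ordered consistently with the importance weights.

```python
import numpy as np

def reward_free_criterion(Psi, w):
    """Sketch of J = tr(U_hat L_hat L_hat^T) for one set of reward-free roll-outs.

    Psi : (NT, B) matrix whose rows are psi_hat(s, a; H) for all visited (s, a).
    w   : (NT,) importance weights toward the evaluation policy
          (all ones when the sampling policy equals the evaluation policy).
    """
    NT, B = Psi.shape
    # Weighted least-squares "hat" matrix L_hat = (Psi^T W Psi)^{-1} Psi^T W,
    # with a small ridge term for numerical stability (an assumption here).
    A = Psi.T @ (Psi * w[:, None]) + 1e-3 * np.eye(B)
    L = np.linalg.solve(A, Psi.T * w)
    # Consistent estimator U_hat: importance-weighted second-moment matrix.
    U = (Psi.T * w) @ Psi / NT
    return float(np.trace(U @ L @ L.T))
```

Since additive and multiplicative constants are irrelevant, only the relative ordering of such J values across candidate sampling policies matters.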
5.1.4    Designing Sampling Policies

Based on the generalization error estimator derived above, a sampling policy is designed as follows:

1. Prepare K candidates of the sampling policy: \{\tilde{\pi}_k\}_{k=1}^{K}.

2. Collect episodic samples without immediate rewards for each sampling-policy candidate: \{H^{\tilde{\pi}_k}\}_{k=1}^{K}.

3. Estimate U using all samples \{H^{\tilde{\pi}_k}\}_{k=1}^{K}:

\hat{U} = \frac{1}{KNT} \sum_{k=1}^{K} \sum_{n=1}^{N} \sum_{t=1}^{T} \hat{\psi}\big(s^{\tilde{\pi}_k}_{t,n}, a^{\tilde{\pi}_k}_{t,n}; \{H^{\tilde{\pi}_k}\}_{k=1}^{K}\big) \, \hat{\psi}\big(s^{\tilde{\pi}_k}_{t,n}, a^{\tilde{\pi}_k}_{t,n}; \{H^{\tilde{\pi}_k}\}_{k=1}^{K}\big)^\top \hat{w}^{\tilde{\pi}_k}_{t,n},
where \hat{w}^{\tilde{\pi}_k}_{t,n} denotes the importance weight for the k-th sampling policy \tilde{\pi}_k:

\hat{w}^{\tilde{\pi}_k}_{t,n} = \frac{ \prod_{t'=1}^{t} \pi(a^{\tilde{\pi}_k}_{t',n} \mid s^{\tilde{\pi}_k}_{t',n}) }{ \prod_{t'=1}^{t} \tilde{\pi}_k(a^{\tilde{\pi}_k}_{t',n} \mid s^{\tilde{\pi}_k}_{t',n}) }.
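Because the weight at step t is the product of per-step policy ratios up to t, it can be computed cumulatively along an episode. A minimal sketch, assuming the per-step action probabilities under both policies are already available:

```python
import numpy as np

def trajectory_importance_weights(pi_eval_probs, pi_sample_probs):
    """Per-step importance weights for one episode.

    pi_eval_probs   : (T,) probabilities pi(a_t | s_t) under the evaluation policy.
    pi_sample_probs : (T,) probabilities pi_k(a_t | s_t) under the sampling policy.
    Returns w with w[t] = prod_{t'<=t} pi_eval[t'] / prod_{t'<=t} pi_sample[t'].
    """
    ratios = pi_eval_probs / pi_sample_probs
    return np.cumprod(ratios)
```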
4. Estimate the generalization error for each k (a computational sketch of steps 3 and 4 is given after this list):

J_k = \mathrm{tr}\big( \hat{U} \hat{L}^{\tilde{\pi}_k} (\hat{L}^{\tilde{\pi}_k})^\top \big),

where \hat{L}^{\tilde{\pi}_k} is defined as

\hat{L}^{\tilde{\pi}_k} = \big( \hat{\Psi}^{\tilde{\pi}_k\top} \hat{W}^{\tilde{\pi}_k} \hat{\Psi}^{\tilde{\pi}_k} \big)^{-1} \hat{\Psi}^{\tilde{\pi}_k\top} \hat{W}^{\tilde{\pi}_k}.

\hat{\Psi}^{\tilde{\pi}_k} is the NT \times B matrix and \hat{W}^{\tilde{\pi}_k} is the NT \times NT diagonal matrix defined as

\hat{\Psi}^{\tilde{\pi}_k}_{N(t-1)+n, \, b} = \hat{\psi}_b(s^{\tilde{\pi}_k}_{t,n}, a^{\tilde{\pi}_k}_{t,n}),
\hat{W}^{\tilde{\pi}_k}_{N(t-1)+n, \, N(t-1)+n} = \hat{w}^{\tilde{\pi}_k}_{t,n}.

5. (If possible) repeat steps 2 to 4 several times and calculate the average for each k.

6. Determine the sampling policy as

\tilde{\pi}_{\mathrm{AL}} = \mathop{\mathrm{argmin}}_{k=1,\ldots,K} J_k.

7. Collect training samples with immediate rewards following \tilde{\pi}_{\mathrm{AL}}.

8. Learn the value function by least-squares policy iteration using the collected samples.
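The sketch below puts steps 3, 4, and 6 together. It is a simplified illustration, not the book's implementation: the container of per-candidate roll-out data and the ridge term are assumptions, and the basis matrices are assumed to be pre-computed from the pooled reward-free samples.

```python
import numpy as np

def select_sampling_policy(rollouts):
    """rollouts : list of (Psi_k, w_k) pairs, one per candidate sampling policy.
    Psi_k is the (NT, B) basis matrix of reward-free roll-outs collected under
    candidate k, and w_k the corresponding importance weights toward the
    evaluation policy.  Returns the index of the candidate minimizing J_k.
    """
    B = rollouts[0][0].shape[1]
    # Step 3: pooled estimator U_hat over all candidates' samples.
    U, total = np.zeros((B, B)), 0
    for Psi_k, w_k in rollouts:
        U += (Psi_k.T * w_k) @ Psi_k
        total += Psi_k.shape[0]
    U /= total
    # Step 4: J_k = tr(U_hat L_k L_k^T) for each candidate.
    scores = []
    for Psi_k, w_k in rollouts:
        A = Psi_k.T @ (Psi_k * w_k[:, None]) + 1e-3 * np.eye(B)  # ridge assumed
        L_k = np.linalg.solve(A, Psi_k.T * w_k)
        scores.append(np.trace(U @ L_k @ L_k.T))
    # Step 6: pick the candidate with the smallest estimated error.
    return int(np.argmin(scores))
```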
5.1.5    Illustration

Here, the behavior of the active learning method is illustrated on a toy 10-state chain-walk environment shown in Figure 5.1. The MDP consists of 10 states,

S = \{ s^{(i)} \}_{i=1}^{10} = \{ 1, 2, \ldots, 10 \},

and 2 actions,

A = \{ a^{(i)} \}_{i=1}^{2} = \{ \text{"L"}, \text{"R"} \}.
FIGURE 5.1: Ten-state chain walk. Filled and unfilled arrows indicate the transitions when taking action "R" and "L," and solid and dashed lines indicate the successful and failed transitions.
The immediate reward function is defined as

r(s, a, s') = f(s'),

where the profile of the function f(s') is illustrated in Figure 5.2.

The transition probability p(s'|s, a) is indicated by the numbers attached to the arrows in Figure 5.1. For example, p(s^{(2)} | s^{(1)}, a = \text{"R"}) = 0.8 and p(s^{(1)} | s^{(1)}, a = \text{"R"}) = 0.2. Thus, the agent can successfully move in the intended direction with probability 0.8 (indicated by solid-filled arrows in the figure) and the action fails with probability 0.2 (indicated by dashed-filled arrows in the figure). The discount factor \gamma is set at 0.9. The following 12 Gaussian basis functions \phi(s, a) are used:

\phi_{2(i-1)+j}(s, a) =
  I(a = a^{(j)}) \exp\left( -\frac{(s - c_i)^2}{2\tau^2} \right)    for i = 1, \ldots, 5 and j = 1, 2,
  I(a = a^{(j)})                                                     for i = 6 and j = 1, 2,

where c_1 = 1, c_2 = 3, c_3 = 5, c_4 = 7, c_5 = 9, and \tau = 1.5. I(a = a') denotes the indicator function:

I(a = a') = 1 if a = a', and 0 if a \neq a'.
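For concreteness, the 12 basis functions above can be written out as follows. This is a minimal sketch under the assumption that action "L" is indexed 0 and action "R" is indexed 1; the function name is hypothetical.

```python
import numpy as np

def chain_walk_basis(s, a):
    """The 12 basis functions used in the chain-walk illustration.
    s : state in {1, ..., 10};  a : action index in {0, 1} for ("L", "R").
    Returns a vector of length 12 ordered as phi_{2(i-1)+j}.
    """
    centers = np.array([1.0, 3.0, 5.0, 7.0, 9.0])
    tau = 1.5
    phi = np.zeros(12)
    for i, c in enumerate(centers):            # i = 0, ..., 4
        # Indicator I(a = a^(j)) picks out the entry of the taken action.
        phi[2 * i + a] = np.exp(-(s - c) ** 2 / (2 * tau ** 2))
    phi[10 + a] = 1.0                          # constant basis for each action
    return phi
```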
FIGURE 5.2: Profile of the function f(s').

Sampling policies and evaluation policies are constructed as follows. First,
a deterministic "base" policy \pi is prepared, for example, "LLLLLRRRRR," where the i-th letter denotes the action taken at s^{(i)}. Let \pi^{\epsilon} be the "\epsilon-greedy" version of the base policy \pi, i.e., the intended action can be successfully chosen with probability 1 - \epsilon/2 and the other action is chosen with probability \epsilon/2. Experiments are performed for three different evaluation policies:

\pi_1: "RRRRRRRRRR,"
\pi_2: "RRLLLLLRRR,"
\pi_3: "LLLLLRRRRR,"

with \epsilon = 0.1. For each evaluation policy \pi^{0.1}_i (i = 1, 2, 3), 10 candidates \{\tilde{\pi}^{(k)}_i\}_{k=1}^{10} of the sampling policy are prepared, where \tilde{\pi}^{(k)}_i = \pi^{k/10}_i. Note that \tilde{\pi}^{(1)}_i is equivalent to the evaluation policy \pi^{0.1}_i.
For each sampling policy, the active learning criterion J is computed 5 times and their average is taken. The numbers of episodes and steps are set at N = 10 and T = 10, respectively. The initial-state probability p(s) is set to be uniform. When the matrix inverse is computed, 10^{-3} is added to the diagonal elements to avoid degeneracy. This experiment is repeated 100 times with different random seeds, and the mean and standard deviation of the true generalization error and its estimate are evaluated.

The results are depicted in Figure 5.3 as functions of the index k of the sampling policies. The graphs show that the generalization error estimator overall captures the trend of the true generalization error well for all three cases.

Next, the value of the obtained generalization error G is evaluated when k is chosen so that J is minimized (active learning, AL), when the evaluation policy (k = 1) is used for sampling (passive learning, PL), and when k is chosen optimally so that the true generalization error is minimized (optimal, OPT). Figure 5.4 shows that the active learning method compares favorably with passive learning and performs well in reducing the generalization error.
5.2    Active Policy Iteration

In Section 5.1, the unknown generalization error was shown to be accurately estimated without using immediate reward samples in one-step policy evaluation. In this section, this one-step active learning idea is extended to the framework of sample-reuse policy iteration introduced in Chapter 4, which is called active policy iteration. Let us denote the evaluation policy at the l-th iteration by \pi_l.
FIGURE 5.3: The mean and standard deviation of the true generalization error G (left) and the estimated generalization error J (right) over 100 trials, plotted as functions of the sampling policy index k for (a) \pi^{0.1}_1, (b) \pi^{0.1}_2, and (c) \pi^{0.1}_3.
5.2.1    Sample-Reuse Policy Iteration with Active Learning

In the original sample-reuse policy iteration, new data samples H^{\pi_l} are collected following the new target policy \pi_l for the next policy evaluation step:

\pi_1 \xrightarrow{E:\, H^{\pi_1}} \hat{Q}^{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E:\, \{H^{\pi_1}, H^{\pi_2}\}} \hat{Q}^{\pi_2} \xrightarrow{I} \pi_3 \xrightarrow{E:\, \{H^{\pi_1}, H^{\pi_2}, H^{\pi_3}\}} \cdots \xrightarrow{I} \pi_{L+1},
FIGURE 5.4: The box-plots of the values of the obtained generalization error G over 100 trials when k is chosen so that J is minimized (active learning, AL), when the evaluation policy (k = 1) is used for sampling (passive learning, PL), and when k is chosen optimally so that the true generalization error is minimized (optimal, OPT), for (a) \pi^{0.1}_1, (b) \pi^{0.1}_2, and (c) \pi^{0.1}_3. The box-plot notation indicates the 5% quantile, 25% quantile, 50% quantile (i.e., median), 75% quantile, and 95% quantile from bottom to top.
where "E: H" indicates policy evaluation using the data sample H and "I" denotes policy improvement. On the other hand, in active policy iteration, the optimized sampling policy \tilde{\pi}_l is used at each iteration:

\pi_1 \xrightarrow{E:\, H^{\tilde{\pi}_1}} \hat{Q}^{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E:\, \{H^{\tilde{\pi}_1}, H^{\tilde{\pi}_2}\}} \hat{Q}^{\pi_2} \xrightarrow{I} \pi_3 \xrightarrow{E:\, \{H^{\tilde{\pi}_1}, H^{\tilde{\pi}_2}, H^{\tilde{\pi}_3}\}} \cdots \xrightarrow{I} \pi_{L+1}.

Note that, in active policy iteration, the previously collected samples are used not only for value function approximation, but also for active learning. Thus, active policy iteration makes full use of the samples.
5.2.2    Illustration

Here, the behavior of active policy iteration is illustrated using the same 10-state chain-walk problem as Section 5.1.5 (see Figure 5.1).
The initial evaluation policy \pi_1 is set as

\pi_1(a|s) = 0.15 \, p_u(a) + 0.85 \, I\left( a = \mathop{\mathrm{argmax}}_{a'} \hat{Q}_0(s, a') \right),

where p_u(a) denotes the probability mass function of the uniform distribution and

\hat{Q}_0(s, a) = \sum_{b=1}^{12} \phi_b(s, a).

Policies are updated in the l-th iteration using the \epsilon-greedy rule with \epsilon = 0.15/l. In the sampling-policy selection step of the l-th iteration, the following four sampling-policy candidates are prepared:

\tilde{\pi}^{(1)}_l = \pi^{0.15/l}_l, \quad \tilde{\pi}^{(2)}_l = \pi^{0.15/l + 0.15}_l, \quad \tilde{\pi}^{(3)}_l = \pi^{0.15/l + 0.5}_l, \quad \tilde{\pi}^{(4)}_l = \pi^{0.15/l + 0.85}_l,

where \pi_l denotes the policy obtained by greedy update using \hat{Q}^{\pi_{l-1}}.
The number of iterations to learn the policy is set at 7, the number of steps is set at T = 10, and the number N of episodes is different in each iteration and defined as \{N_1, \ldots, N_7\}, where N_l (l = 1, \ldots, 7) denotes the number of episodes collected in the l-th iteration. In this experiment, two types of scheduling are compared: \{5, 5, 3, 3, 3, 1, 1\} and \{3, 3, 3, 3, 3, 3, 3\}, which are referred to as the "decreasing N" strategy and the "fixed N" strategy, respectively. The J-value calculation is repeated 5 times for active learning. The performance of the finally obtained policy \pi_8 is measured by the return for test samples \{r^{\pi_8}_{t,n}\}_{t,n=1}^{T,N} (50 episodes with 50 steps collected following \pi_8):

Performance = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T} \gamma^{t-1} r^{\pi_8}_{t,n},

where the discount factor \gamma is set at 0.9.

The performance of passive learning (PL; the current policy is used as the sampling policy in each iteration) and active learning (AL; the best sampling policy is chosen from the policy candidates prepared in each iteration) is compared. The experiments are repeated 1000 times with different random seeds and the average performance of PL and AL is evaluated. The results are depicted in Figure 5.5, showing that AL works better than PL in both types of episode scheduling, with statistical significance by the t-test at the significance level 1% (Henkel, 1976) for the error values obtained after the 7th iteration. Furthermore, the "decreasing N" strategy outperforms the "fixed N" strategy for both PL and AL, showing the usefulness of the "decreasing N" strategy.
FIGURE 5.5: The mean performance over 1000 trials in the 10-state chain-walk experiment, plotted against the iteration number for AL and PL under the "decreasing N" and "fixed N" strategies. The dotted lines denote the performance of passive learning (PL) and the solid lines denote the performance of the proposed active learning (AL) method. The error bars are omitted for clear visibility. For both the "decreasing N" and "fixed N" strategies, the performance of AL after the 7th iteration is significantly better than that of PL according to the t-test at the significance level 1% applied to the error values at the 7th iteration.
5.3    Numerical Examples

In this section, the performance of active policy iteration is evaluated using a ball-batting robot illustrated in Figure 5.6, which consists of two links and two joints. The goal of the ball-batting task is to control the robot arm so that it drives the ball as far away as possible. The state space S is continuous and consists of angles \varphi_1 [rad] (\in [0, \pi/4]) and \varphi_2 [rad] (\in [-\pi/4, \pi/4]) and angular velocities \dot{\varphi}_1 [rad/s] and \dot{\varphi}_2 [rad/s]. Thus, a state s (\in S) is described by a 4-dimensional vector s = (\varphi_1, \dot{\varphi}_1, \varphi_2, \dot{\varphi}_2)^\top. The action space A is discrete and contains two elements:

A = \{ a^{(i)} \}_{i=1}^{2} = \{ (50, -35)^\top, (-50, 10)^\top \},

where the i-th element (i = 1, 2) of each vector corresponds to the torque [N·m] added to joint i.

FIGURE 5.6: A ball-batting robot.

The Open Dynamics Engine (http://ode.org/) is used for physical calculations, including the update of the angles and angular velocities, and collision detection between the robot arm, ball, and pin. The simulation time step is set at 7.5 [ms] and the next state is observed after 10 time steps. The action chosen in the current state is taken for 10 time steps. To make the experiments realistic, noise is added to actions: if action (f_1, f_2)^\top is taken, the actual torques applied to the joints are f_1 + \varepsilon_1 and f_2 + \varepsilon_2, where \varepsilon_1 and \varepsilon_2 are drawn independently from the Gaussian distribution with mean 0 and variance 3.

The immediate reward is defined as the carry of the ball. This reward is given only when the robot arm collides with the ball for the first time at state s' after taking action a at current state s. For value function approximation, the following 110 basis functions are used:

\phi_{2(i-1)+j}(s, a) =
  I(a = a^{(j)}) \exp\left( -\frac{\| s - c_i \|^2}{2\tau^2} \right)    for i = 1, \ldots, 54 and j = 1, 2,
  I(a = a^{(j)})                                                         for i = 55 and j = 1, 2,

where \tau is set at 3\pi/2 and the Gaussian centers c_i (i = 1, \ldots, 54) are located on the regular grid \{0, \pi/4\} \times \{-\pi, 0, \pi\} \times \{-\pi/4, 0, \pi/4\} \times \{-\pi, 0, \pi\}.

For L = 7 and T = 10, the "decreasing N" strategy and the "fixed N" strategy are compared. The "decreasing N" strategy is defined by \{10, 10, 7, 7, 7, 4, 4\} and the "fixed N" strategy is defined by \{7, 7, 7, 7, 7, 7, 7\}.
The initial state is always set at s = (\pi/4, 0, 0, 0)^\top, and J-calculations are repeated 5 times in the active learning method. The initial evaluation policy \pi_1 is set at the \epsilon-greedy policy defined as

\pi_1(a|s) = 0.15 \, p_u(a) + 0.85 \, I\left( a = \mathop{\mathrm{argmax}}_{a'} \hat{Q}_0(s, a') \right),
\hat{Q}_0(s, a) = \sum_{b=1}^{110} \phi_b(s, a).

Policies are updated in the l-th iteration using the \epsilon-greedy rule with \epsilon = 0.15/l. Sampling-policy candidates are prepared in the same way as in the chain-walk experiment in Section 5.2.2.
The discount factor \gamma is set at 1, and the performance of the learned policy \pi_8 is measured by the return for test samples \{r^{\pi_8}_{t,n}\}_{t,n=1}^{10,20} (20 episodes with 10 steps collected following \pi_8): \sum_{n=1}^{N} \sum_{t=1}^{T} r^{\pi_8}_{t,n}.

FIGURE 5.7: The mean performance over 500 trials in the ball-batting experiment, plotted against the iteration number for AL and PL under the "decreasing N" and "fixed N" strategies. The dotted lines denote the performance of passive learning (PL) and the solid lines denote the performance of the proposed active learning (AL) method. The error bars are omitted for clear visibility. For the "decreasing N" strategy, the performance of AL after the 7th iteration is significantly better than that of PL according to the t-test at the significance level 1% for the error values at the 7th iteration.
The experiment is repeated 500 times with different random seeds and the average performance of each learning method is evaluated. The results, depicted in Figure 5.7, show that active learning outperforms passive learning. For the "decreasing N" strategy, the performance difference is statistically significant by the t-test at the significance level 1% for the error values after the 7th iteration.

Motion examples of the ball-batting robot trained with active learning and passive learning are illustrated in Figure 5.8 and Figure 5.9, respectively.

5.4    Remarks

When we cannot afford to collect many training samples due to high sampling costs, it is crucial to choose the most informative samples for efficiently learning the value function. In this chapter, an active learning method for optimizing data sampling strategies was introduced in the framework of sample-reuse policy iteration, and the resulting active policy iteration was demonstrated to be promising.
FIGURE 5.8: A motion example of the ball-batting robot trained with active learning (from left to right and top to bottom).

FIGURE 5.9: A motion example of the ball-batting robot trained with passive learning (from left to right and top to bottom).
Chapter 6

Robust Policy Iteration

The framework of least-squares policy iteration (LSPI) introduced in Chapter 2 is useful thanks to its computational efficiency and analytical tractability. However, due to the squared loss, it tends to be sensitive to outliers in observed rewards. In this chapter, we introduce an alternative policy iteration method that employs the absolute loss for enhancing robustness and reliability. In Section 6.1, the robustness and reliability brought by the use of the absolute loss are discussed. In Section 6.2, the policy iteration framework with the absolute loss, called least-absolute policy iteration (LAPI), is introduced. In Section 6.3, the usefulness of LAPI is illustrated through experiments. Variations of LAPI are considered in Section 6.4, and finally this chapter is concluded in Section 6.5.
6.1    Robustness and Reliability in Policy Iteration

The basic idea of LSPI is to fit a linear model to immediate rewards under the squared loss, while the absolute loss is used in this chapter (see Figure 6.1). This is just a replacement of loss functions, but this modification highly enhances robustness and reliability.
6.1.1    Robustness

In many robotics applications, immediate rewards are obtained through measurement such as distance sensors or computer vision. Due to intrinsic measurement noise or recognition error, the obtained rewards often deviate from the true values. In particular, the rewards occasionally contain outliers, which are significantly different from regular values.

Residual minimization under the squared loss amounts to obtaining the mean of samples \{x_i\}_{i=1}^{m}:

\mathop{\mathrm{argmin}}_{c} \sum_{i=1}^{m} (x_i - c)^2 = \mathrm{mean}(\{x_i\}_{i=1}^{m}) = \frac{1}{m} \sum_{i=1}^{m} x_i.

FIGURE 6.1: The absolute and squared loss functions for reducing the temporal-difference error.

If one of the values is an outlier having a very large or small value, the mean would be strongly affected by this outlier. This means that all the values \{x_i\}_{i=1}^{m} are responsible for the mean, and therefore even a single outlier observation can significantly damage the learned result.

On the other hand, residual minimization under the absolute loss amounts to obtaining the median:

\mathop{\mathrm{argmin}}_{c} \sum_{i=1}^{2n+1} |x_i - c| = \mathrm{median}(\{x_i\}_{i=1}^{2n+1}) = x_{n+1},

where x_1 \leq x_2 \leq \cdots \leq x_{2n+1}. The median is influenced not by the magnitude of the values \{x_i\}_{i=1}^{2n+1} but only by their order. Thus, as long as the order is kept unchanged, the median is not affected by outliers. In fact, the median is known to be the most robust estimator in light of breakdown-point analysis (Huber, 1981; Rousseeuw & Leroy, 1987).

Therefore, the use of the absolute loss would remedy the problem of robustness in policy iteration.
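The contrast between the two estimators is easy to see numerically. The following tiny example, a sketch with made-up numbers rather than data from the book, shows how a single outlier moves the mean far away while barely changing the median:

```python
import numpy as np

# A single outlier shifts the mean arbitrarily far,
# while the median only depends on the ordering of the values.
x = np.array([1.0, 1.1, 0.9, 1.2, 1.0])
x_with_outlier = np.append(x, 1000.0)

print(np.mean(x), np.mean(x_with_outlier))      # ~1.04  vs  ~167.5
print(np.median(x), np.median(x_with_outlier))  #  1.0   vs  ~1.05
```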
6.1.2    Reliability

In practical robot-control tasks, we often want to attain a stable performance, rather than to achieve a "dream" performance with little chance of success. For example, in the acquisition of a humanoid gait, we may want the robot to walk forward in a stable manner with a high probability of success, rather than to rush very fast at a chance level.

On the other hand, we do not want to be too conservative when training robots. If we are overly concerned with unrealistic failure, no practically useful control policy can be obtained. For example, any robot can be broken in principle if it is activated for a long time. However, if we fear this fact too much, we may end up praising a control policy that does not move the robot at all, which is obviously nonsense.

Since the squared-loss solution is not robust against outliers, it is sensitive to rare events with either positive or negative very large immediate rewards. Consequently, the squared loss prefers an extraordinarily successful motion even if the success probability is very low. Similarly, it dislikes an unrealistic trouble even if such a terrible event may not happen in reality. On the other hand, the absolute-loss solution is not easily affected by such rare events due to its robustness. Therefore, the use of the absolute loss would produce a reliable control policy even in the presence of such extreme events.
6.2    Least Absolute Policy Iteration

In this section, a policy iteration method with the absolute loss is introduced.

6.2.1    Algorithm

Instead of the squared loss, a linear model is fitted to immediate rewards under the absolute loss as

\min_{\theta} \sum_{t=1}^{T} \left| \theta^\top \hat{\psi}(s_t, a_t) - r_t \right|.

This minimization problem looks cumbersome due to the absolute value operator, which is non-differentiable, but it can be reduced to the following linear program (Boyd & Vandenberghe, 2004):

\min_{\theta, \{b_t\}_{t=1}^{T}} \; \sum_{t=1}^{T} b_t
subject to  -b_t \leq \theta^\top \hat{\psi}(s_t, a_t) - r_t \leq b_t,  \quad t = 1, \ldots, T.

The number of constraints is T in the above linear program. When T is large, we may employ sophisticated optimization techniques such as column generation (Demiriz et al., 2002) for efficiently solving the linear programming problem. Alternatively, an approximate solution can be obtained by gradient descent or (quasi-)Newton methods if the absolute loss is approximated by a smooth loss (see, e.g., Section 6.4.1).

The policy iteration method based on the absolute loss is called least absolute policy iteration (LAPI).
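The linear program above can be handed directly to an off-the-shelf LP solver. Below is a minimal sketch using scipy.optimize.linprog; it is an illustration of the formulation rather than the book's implementation, and the function name and the choice of the HiGHS solver backend are assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def least_absolute_fit(Psi, r):
    """Fit theta minimizing sum_t |theta^T psi_t - r_t| via the LP above.

    Psi : (T, B) matrix whose t-th row is psi_hat(s_t, a_t).
    r   : (T,) vector of observed immediate rewards.
    """
    T, B = Psi.shape
    # Decision variables: [theta (B entries), b (T slack entries)].
    c = np.concatenate([np.zeros(B), np.ones(T)])       # minimize sum of b_t
    # theta^T psi_t - b_t <= r_t   and   -theta^T psi_t - b_t <= -r_t
    A_ub = np.block([[Psi, -np.eye(T)],
                     [-Psi, -np.eye(T)]])
    b_ub = np.concatenate([r, -r])
    bounds = [(None, None)] * B + [(0, None)] * T        # theta free, b_t >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:B]
```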
6.2.2    Illustration

For illustration purposes, let us consider the 4-state MDP problem described in Figure 6.2. The agent is initially located at state s^{(0)} and the actions the agent is allowed to take are moving to the left or right state. If the left-movement action is chosen, the agent always receives a small positive reward +0.1 at s^{(L)}. On the other hand, if the right-movement action is chosen, the agent receives a negative reward -1 with probability 0.9999 at s^{(R1)}, or it receives a very large positive reward +20,000 with probability 0.0001 at s^{(R2)}. The mean and median rewards for left movement are both +0.1, while the mean and median rewards for right movement are +1.0001 and -1, respectively.

FIGURE 6.2: Illustrative MDP problem.

If Q(s^{(0)}, "Left") and Q(s^{(0)}, "Right") are approximated by the least-squares method, it returns the mean rewards, i.e., +0.1 and +1.0001, respectively. Thus, the least-squares method prefers right movement, which is a "gambling" policy: the negative reward -1 is almost always obtained at s^{(R1)}, but it is possible to obtain the very high reward +20,000 with a very small probability at s^{(R2)}. On the other hand, if Q(s^{(0)}, "Left") and Q(s^{(0)}, "Right") are approximated by the least absolute method, it returns the median rewards, i.e., +0.1 and -1, respectively. Thus, the least absolute method prefers left movement, which is a stable policy: the agent can always receive the small positive reward +0.1 at s^{(L)}.

If all the rewards in Figure 6.2 are negated, the value functions are also negated and a different interpretation can be obtained: the least-squares method is afraid of the risk of receiving the very large negative reward -20,000 at s^{(R2)} with a very low probability, and consequently it ends up with a very conservative policy in which the agent always receives the negative reward -0.1 at s^{(L)}. On the other hand, the least absolute method tries to receive the positive reward +1 at s^{(R1)} without being afraid of visiting s^{(R2)} too much.

As illustrated above, the least absolute method tends to provide qualitatively different solutions from the least-squares method.
6.2.3    Properties

Here, properties of the least absolute method are investigated when the model \hat{Q}(s, a) is correctly specified, i.e., there exists a parameter \theta^* such that \hat{Q}(s, a) = Q(s, a) for all s and a.

Under the correct model assumption, when the number of samples T tends to infinity, the least absolute solution \hat{\theta} satisfies the following equation (Koenker, 2005):

\hat{\theta}^\top \psi(s, a) = M_{p(s'|s,a)}[ r(s, a, s') ]  for all s and a,        (6.1)

where M_{p(s'|s,a)} denotes the conditional median of s' over p(s'|s, a) given s and a. \psi(s, a) is defined by

\psi(s, a) = \phi(s, a) - \gamma E_{p(s'|s,a)} E_{\pi(a'|s')} [ \phi(s', a') ],

where E_{p(s'|s,a)} denotes the conditional expectation of s' over p(s'|s, a) given s and a, and E_{\pi(a'|s')} denotes the conditional expectation of a' over \pi(a'|s') given s'.

From Eq. (6.1), we can obtain the following Bellman-like recursive expression:

\hat{Q}(s, a) = M_{p(s'|s,a)}[ r(s, a, s') ] + \gamma E_{p(s'|s,a)} E_{\pi(a'|s')} [ \hat{Q}(s', a') ].        (6.2)

Note that in the case of the least-squares method, where

\hat{\theta}^\top \psi(s, a) = E_{p(s'|s,a)}[ r(s, a, s') ]

is satisfied in the limit under the correct model assumption, we have

\hat{Q}(s, a) = E_{p(s'|s,a)}[ r(s, a, s') ] + \gamma E_{p(s'|s,a)} E_{\pi(a'|s')} [ \hat{Q}(s', a') ].        (6.3)

This is the ordinary Bellman equation, and thus Eq. (6.2) could be regarded as an extension of the Bellman equation to the absolute loss.

From the ordinary Bellman equation (6.3), we can recover the original definition of the state-action value function Q(s, a):

Q^{\pi}(s, a) = E_{p^{\pi}(h)} \left[ \sum_{t=1}^{T} \gamma^{t-1} r(s_t, a_t, s_{t+1}) \,\middle|\, s_1 = s, a_1 = a \right],

where E_{p^{\pi}(h)} denotes the expectation over trajectory h = [s_1, a_1, \ldots, s_T, a_T, s_{T+1}] and "| s_1 = s, a_1 = a" means that the initial state s_1 and the first action a_1 are fixed at s_1 = s and a_1 = a, respectively. In contrast, from the absolute-loss Bellman equation (6.2), we have

Q'(s, a) = E_{p^{\pi}(h)} \left[ \sum_{t=1}^{T} \gamma^{t-1} M_{p(s_{t+1}|s_t,a_t)}[ r(s_t, a_t, s_{t+1}) ] \,\middle|\, s_1 = s, a_1 = a \right].
This is the value function that the least absolute method is trying to approximate, which is different from the ordinary value function. Since the discounted sum of median rewards, not the expected rewards, is maximized, the least absolute method is expected to be less sensitive to outliers than the least-squares method.

FIGURE 6.3: Illustration of the acrobot, consisting of a horizontal bar, a first link and first joint, a second link and second joint, and an end effector. The goal is to swing up the end effector by only controlling the second joint.
6.3    Numerical Examples

In this section, the behavior of LAPI is illustrated through experiments using the acrobot shown in Figure 6.3. The acrobot is an under-actuated system and consists of two links, two joints, and an end effector. The length of each link is 0.3 [m], and the diameter of each joint is 0.15 [m]. The diameter of the end effector is 0.10 [m], and the height of the horizontal bar is 1.2 [m]. The first joint connects the first link to the horizontal bar and is not controllable. The second joint connects the first link to the second link and is controllable. The end effector is attached to the tip of the second link. The control command (action) we can choose is to apply positive torque +50 [N·m], no torque 0 [N·m], or negative torque -50 [N·m] to the second joint. Note that the acrobot moves only within a plane orthogonal to the horizontal bar.

The goal is to acquire a control policy such that the end effector is swung up as high as possible. The state space consists of the angle \theta_i [rad] and angular velocity \dot{\theta}_i [rad/s] of the first and second joints (i = 1, 2). The immediate
reward is given according to the height y of the center of the end effector as

r(s, a, s') =
  10                                                        if y > 1.75,
  \exp\left( -\frac{(y - 1.85)^2}{2 (0.2)^2} \right)         if 1.5 < y \leq 1.75,
  0.001                                                      otherwise.

Note that 0.55 \leq y \leq 1.85 in the current setting.

Here, suppose that the length of the links is unknown. Thus, the height y cannot be directly computed from state information. The height of the end effector is supposed to be estimated from an image taken by a camera: the end effector is detected in the image and then its vertical coordinate is computed. Due to recognition error, the estimated height is highly noisy and could contain outliers.
In each policy iteration step, 20 episodic training samples of length 150 are gathered. The performance of the obtained policy is evaluated using 50 episodic test samples of length 300. Note that the test samples are not used for learning policies; they are used only for evaluating learned policies. The policies are updated in a soft-max manner:

\pi(a|s) \longleftarrow \frac{ \exp( Q(s, a)/\eta ) }{ \sum_{a' \in A} \exp( Q(s, a')/\eta ) },

where \eta = 10^{-l+1} with l being the iteration number. The discount factor is set at \gamma = 1, i.e., no discount. As basis functions for value function approximation, the Gaussian kernel with standard deviation \pi is used, where the Gaussian centers are located at

(\theta_1, \theta_2, \dot{\theta}_1, \dot{\theta}_2) \in \{-\pi, -\pi/2, 0, \pi/2, \pi\} \times \{-\pi, 0, \pi\} \times \{-\pi, 0, \pi\} \times \{-\pi, 0, \pi\}.

The above 135 (= 5 × 3 × 3 × 3) Gaussian kernels are defined for each of the three actions. Thus, 405 (= 135 × 3) kernels are used in total.

Let us consider two noise environments: one is the case where no noise is added to the rewards, and the other is the case where Laplacian noise with mean zero and standard deviation 2 is added to the rewards with probability 0.1. Note that the tail of the Laplacian density is heavier than that of the Gaussian density (see Figure 6.4), implying that a small number of outliers tend to be included in the Laplacian noise environment. An example of the noisy training samples is shown in Figure 6.5. For each noise environment, the experiment is repeated 50 times with different random seeds and the averages of the sum of rewards obtained by LAPI and LSPI are summarized in Figure 6.6. The best method in terms of the mean value and comparable methods according to the t-test (Henkel, 1976) at the significance level 5% are specified in the figure.

In the noiseless case (see Figure 6.6(a)), both LAPI and LSPI improve the performance over iterations in a comparable way. On the other hand, in the noisy case (see Figure 6.6(b)), the performance of LSPI is not improved much due to outliers, while LAPI still produces a good control policy.
FIGURE 6.4: Probability density functions of Gaussian and Laplacian distributions.

FIGURE 6.5: Example of training samples with Laplacian noise. The horizontal axis is the height of the end effector. The solid line denotes the noiseless immediate reward and the scattered points denote noisy training samples.
FIGURE 6.6: Average and standard deviation of the sum of rewards over 50 runs for the acrobot swing-up simulation, plotted against the iteration number for LAPI and LSPI: (a) no noise, (b) Laplacian noise. The best method in terms of the mean value and comparable methods according to the t-test at the significance level 5% are specified in the figure.
Figure 6.7 and Figure 6.8 depict motion examples of the acrobot learned by LSPI and LAPI, respectively, in the Laplacian-noise environment. When LSPI is used (Figure 6.7), the second joint is swung hard in order to lift the end effector. However, the end effector tends to stay below the horizontal bar, and therefore only a small amount of reward can be obtained by LSPI. This would be due to the existence of outliers. On the other hand, when LAPI is used (Figure 6.8), the end effector goes beyond the bar, and therefore a large amount of reward can be obtained even in the presence of outliers.
FIGURE 6.7: A motion example of the acrobot learned by LSPI in the Laplacian-noise environment (from left to right and top to bottom).

FIGURE 6.8: A motion example of the acrobot learned by LAPI in the Laplacian-noise environment (from left to right and top to bottom).
6.4    Possible Extensions

In this section, possible variations of LAPI are considered.

6.4.1    Huber Loss

Use of the Huber loss corresponds to making a compromise between the squared and absolute loss functions (Huber, 1981):

\mathop{\mathrm{argmin}}_{\theta} \sum_{t=1}^{T} \rho^{\mathrm{HB}}_{\kappa} \left( \theta^\top \hat{\psi}(s_t, a_t) - r_t \right),

where \kappa (\geq 0) is a threshold parameter and \rho^{\mathrm{HB}}_{\kappa} is the Huber loss defined as follows (see Figure 6.9):

\rho^{\mathrm{HB}}_{\kappa}(x) =
  \frac{1}{2} x^2                          if |x| \leq \kappa,
  \kappa |x| - \frac{1}{2} \kappa^2        if |x| > \kappa.

The Huber loss converges to the absolute loss as \kappa tends to zero, and it converges to the squared loss as \kappa tends to infinity.

The Huber loss function is rather intricate, but the solution can be obtained by solving the following convex quadratic program (Mangasarian & Musicant, 2000):

\min_{\theta, \{b_t\}_{t=1}^{T}, \{c_t\}_{t=1}^{T}} \; \frac{1}{2} \sum_{t=1}^{T} b_t^2 + \kappa \sum_{t=1}^{T} c_t
subject to  -c_t \leq \theta^\top \hat{\psi}(s_t, a_t) - r_t - b_t \leq c_t,  \quad t = 1, \ldots, T.
Another way to obtain the solution is to use a gradient descent method, where the parameter \theta is updated as follows until convergence:

\theta \leftarrow \theta - \varepsilon \sum_{t=1}^{T} \Delta\rho^{\mathrm{HB}}_{\kappa} \left( \theta^\top \hat{\psi}(s_t, a_t) - r_t \right) \hat{\psi}(s_t, a_t).

Here, \varepsilon (> 0) is the learning rate and \Delta\rho^{\mathrm{HB}}_{\kappa} is the derivative of \rho^{\mathrm{HB}}_{\kappa}, given by

\Delta\rho^{\mathrm{HB}}_{\kappa}(x) =
  x          if |x| \leq \kappa,
  \kappa     if x > \kappa,
  -\kappa    if x < -\kappa.

In practice, the following stochastic gradient method (Amari, 1967) would be more convenient. For a randomly chosen index t \in \{1, \ldots, T\} in each iteration, repeat the following update until convergence:

\theta \leftarrow \theta - \varepsilon \, \Delta\rho^{\mathrm{HB}}_{\kappa} \left( \theta^\top \hat{\psi}(s_t, a_t) - r_t \right) \hat{\psi}(s_t, a_t).

The plain/stochastic gradient methods also come in handy when approximating the least absolute solution, since the Huber loss function with small \kappa can be regarded as a smooth approximation to the absolute loss.

FIGURE 6.9: The Huber loss function (with \kappa = 1), the pinball loss function (with \tau = 0.3), and the deadzone-linear loss function (with \epsilon = 1).
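The stochastic update above is a one-liner once one notices that the derivative of the Huber loss is simply the residual clipped to [-\kappa, \kappa]. A minimal sketch (the function name and default step size are assumptions):

```python
import numpy as np

def huber_sgd_step(theta, psi_t, r_t, kappa=1.0, eps=0.01):
    """One stochastic-gradient update for the Huber-loss fit sketched above.

    theta : (B,) current parameter vector;  psi_t : (B,) basis vector for the
    randomly chosen sample t;  r_t : observed immediate reward.
    """
    residual = theta @ psi_t - r_t
    # Derivative of the Huber loss: the residual clipped to [-kappa, kappa].
    grad = np.clip(residual, -kappa, kappa) * psi_t
    return theta - eps * grad
```

Using a small kappa makes the same routine behave as a smooth approximation of the least absolute fit, as noted above.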
6.4.2    Pinball Loss

The absolute loss induces the median, which corresponds to the 50-percentile point. A similar discussion is also possible for an arbitrary percentile 100\tau (0 \leq \tau \leq 1) based on the pinball loss (Koenker, 2005):

\min_{\theta} \sum_{t=1}^{T} \rho^{\mathrm{PB}}_{\tau} \left( \theta^\top \hat{\psi}(s_t, a_t) - r_t \right),

where \rho^{\mathrm{PB}}_{\tau}(x) is the pinball loss defined by

\rho^{\mathrm{PB}}_{\tau}(x) =
  2\tau x            if x \geq 0,
  2(\tau - 1) x      if x < 0.

The profile of the pinball loss is depicted in Figure 6.9. When \tau = 0.5, the pinball loss is reduced to the absolute loss.

The solution can be obtained by solving the following linear program:

\min_{\theta, \{b_t\}_{t=1}^{T}} \; \sum_{t=1}^{T} b_t
subject to  \frac{b_t}{2(\tau - 1)} \leq \theta^\top \hat{\psi}(s_t, a_t) - r_t \leq \frac{b_t}{2\tau},  \quad t = 1, \ldots, T.
6.4.3    Deadzone-Linear Loss

Another variant of the absolute loss is the deadzone-linear loss (see Figure 6.9):

\min_{\theta} \sum_{t=1}^{T} \rho^{\mathrm{DL}}_{\epsilon} \left( \theta^\top \hat{\psi}(s_t, a_t) - r_t \right),

where \rho^{\mathrm{DL}}_{\epsilon}(x) is the deadzone-linear loss defined by

\rho^{\mathrm{DL}}_{\epsilon}(x) =
  0                   if |x| \leq \epsilon,
  |x| - \epsilon      if |x| > \epsilon.

That is, if the magnitude of the error is less than \epsilon, no error is assessed. This loss is also called the \epsilon-insensitive loss and is used in support vector regression (Vapnik, 1998).

When \epsilon = 0, the deadzone-linear loss is reduced to the absolute loss. Thus, the deadzone-linear loss and the absolute loss are related to each other. However, the effect of the deadzone-linear loss is completely opposite to the absolute loss when \epsilon > 0. The influence of "good" samples (with small error) is deemphasized in the deadzone-linear loss, while the absolute loss tends to suppress the influence of "bad" samples (with large error) compared with the squared loss.

The solution can be obtained by solving the following linear program (Boyd & Vandenberghe, 2004):

\min_{\theta, \{b_t\}_{t=1}^{T}} \; \sum_{t=1}^{T} b_t
subject to  -b_t - \epsilon \leq \theta^\top \hat{\psi}(s_t, a_t) - r_t \leq b_t + \epsilon,
            b_t \geq 0,  \quad t = 1, \ldots, T.
6.4.4    Chebyshev Approximation

The Chebyshev approximation minimizes the error for the "worst" sample:

\min_{\theta} \max_{t=1,\ldots,T} \left| \theta^\top \hat{\psi}(s_t, a_t) - r_t \right|.

This is also called the minimax approximation.

The solution can be obtained by solving the following linear program (Boyd & Vandenberghe, 2004):

\min_{\theta, b} \; b
subject to  -b \leq \theta^\top \hat{\psi}(s_t, a_t) - r_t \leq b,  \quad t = 1, \ldots, T.
FIGURE 6.10: The conditional value-at-risk (CVaR).
6.4.5    Conditional Value-At-Risk

In the area of finance, the conditional value-at-risk (CVaR) is a popular risk measure (Rockafellar & Uryasev, 2002). The CVaR corresponds to the mean of the error for a set of "bad" samples (see Figure 6.10).

More specifically, let us consider the distribution of the absolute error over all training samples \{(s_t, a_t, r_t)\}_{t=1}^{T}:

\Phi(\alpha | \theta) = P\left( (s_t, a_t, r_t) : \left| \theta^\top \hat{\psi}(s_t, a_t) - r_t \right| \leq \alpha \right).

For \beta \in [0, 1), let \alpha_{\beta}(\theta) be the 100\beta percentile of the absolute error distribution:

\alpha_{\beta}(\theta) = \min \{ \alpha \mid \Phi(\alpha | \theta) \geq \beta \}.

Thus, only the fraction (1 - \beta) of the absolute error \left| \theta^\top \hat{\psi}(s_t, a_t) - r_t \right| exceeds the threshold \alpha_{\beta}(\theta). \alpha_{\beta}(\theta) is also referred to as the value-at-risk (VaR).

Let us consider the \beta-tail distribution of the absolute error:

\Phi_{\beta}(\alpha | \theta) =
  0                                                      if \alpha < \alpha_{\beta}(\theta),
  \frac{\Phi(\alpha | \theta) - \beta}{1 - \beta}        if \alpha \geq \alpha_{\beta}(\theta).

Let \phi_{\beta}(\theta) be the mean of the \beta-tail distribution of the absolute temporal-difference (TD) error:

\phi_{\beta}(\theta) = E_{\Phi_{\beta}} \left[ \left| \theta^\top \hat{\psi}(s_t, a_t) - r_t \right| \right],

where E_{\Phi_{\beta}} denotes the expectation over the distribution \Phi_{\beta}. \phi_{\beta}(\theta) is called the CVaR. By definition, the CVaR of the absolute error is reduced to the mean absolute error if \beta = 0, and it converges to the worst absolute error as \beta tends to 1. Thus, the CVaR smoothly bridges the least absolute and Chebyshev approximation methods. CVaR is also referred to as the expected shortfall.
The CVaR minimization problem in the current context is formulated as

\min_{\theta} \; E_{\Phi_{\beta}} \left[ \left| \theta^\top \hat{\psi}(s_t, a_t) - r_t \right| \right].

This optimization problem looks complicated, but the solution \hat{\theta}_{\mathrm{CV}} can be obtained by solving the following linear program (Rockafellar & Uryasev, 2002):

\min_{\theta, \{b_t\}_{t=1}^{T}, \{c_t\}_{t=1}^{T}, \alpha} \; T(1 - \beta)\alpha + \sum_{t=1}^{T} c_t
subject to  -b_t \leq \theta^\top \hat{\psi}(s_t, a_t) - r_t \leq b_t,
            c_t \geq b_t - \alpha,
            c_t \geq 0,  \quad t = 1, \ldots, T.

Note that if the definition of the absolute error is slightly changed, the CVaR minimization method amounts to minimizing the deadzone-linear loss (Takeda, 2007).
6.5    Remarks

LSPI can be regarded as regression of immediate rewards under the squared loss. In this chapter, the absolute loss was used for regression, which contributes to enhancing robustness and reliability. The least absolute method is formulated as a linear program and can be solved efficiently by standard optimization software.

LSPI maximizes the state-action value function Q(s, a), which is the expectation of returns. Another way to address robustness and reliability is to maximize other quantities such as the median or a quantile of returns. Although Bellman-like simple recursive expressions are not available for quantiles of rewards, a Bellman-like recursive equation holds for the distribution of the discounted sum of rewards (Morimura et al., 2010a; Morimura et al., 2010b). Developing robust reinforcement learning algorithms along this line of research would be a promising future direction.
Part III

Model-Free Policy Search

In the policy iteration approach explained in Part II, the value function is first estimated and then the policy is determined based on the learned value function. Policy iteration was demonstrated to work well in many real-world applications, especially in problems with discrete states and actions (Tesauro, 1994; Williams & Young, 2007; Abe et al., 2010). Although policy iteration can also handle continuous states by function approximation (Lagoudakis & Parr, 2003), continuous actions are hard to deal with due to the difficulty of finding a maximizer of the value function with respect to actions. Moreover, since policies are indirectly determined via value function approximation, misspecification of value function models can lead to an inappropriate policy even in very simple problems (Weaver & Baxter, 1999; Baxter et al., 2001). Another limitation of policy iteration, especially in physical control tasks, is that control policies can vary drastically in each iteration. This causes severe instability in the physical system and thus is not favorable in practice.

Policy search is an alternative approach to reinforcement learning that can overcome the limitations of policy iteration (Williams, 1992; Dayan & Hinton, 1997; Kakade, 2002). In the policy search approach, policies are directly learned so that the return (i.e., the discounted sum of future rewards),

\sum_{t=1}^{T} \gamma^{t-1} r(s_t, a_t, s_{t+1}),

is maximized.

In Part III, we focus on the framework of policy search. First, direct policy search methods are introduced, which try to find the policy that achieves the maximum return via gradient ascent (Chapter 7) or expectation-maximization (Chapter 8). A potential weakness of the direct policy search approach is its instability due to the randomness of stochastic policies. To overcome the instability problem, an alternative approach called policy-prior search is introduced in Chapter 9.
Chapter 7

Direct Policy Search by Gradient Ascent

The direct policy search approach tries to find the policy that maximizes the expected return. In this chapter, we introduce gradient-based algorithms for direct policy search. After the problem formulation in Section 7.1, the gradient ascent algorithm is introduced in Section 7.2. Then, in Section 7.3, its extension using natural gradients is described. In Section 7.4, an application to computer graphics is shown. Finally, this chapter is concluded in Section 7.5.
7.1    Formulation

In this section, the problem of direct policy search is mathematically formulated.

Let us consider a Markov decision process specified by

(S, A, p(s'|s, a), p(s), r, \gamma),

where S is a set of continuous states, A is a set of continuous actions, p(s'|s, a) is the transition probability density from current state s to next state s' when action a is taken, p(s) is the probability density of initial states, r(s, a, s') is an immediate reward for the transition from s to s' by taking action a, and 0 < \gamma \leq 1 is the discount factor for future rewards.

Let \pi(a|s, \theta) be a stochastic policy parameterized by \theta, which represents the conditional probability density of taking action a in state s. Let h be a trajectory of length T:

h = [ s_1, a_1, \ldots, s_T, a_T, s_{T+1} ].

The return (i.e., the discounted sum of future rewards) along h is defined as

R(h) = \sum_{t=1}^{T} \gamma^{t-1} r(s_t, a_t, s_{t+1}),

and the expected return for policy parameter \theta is defined as

J(\theta) = E_{p(h|\theta)}[ R(h) ] = \int p(h|\theta) R(h) \, dh,
FIGURE 7.1: Gradient ascent for direct policy search.
where E_{p(h|\theta)} is the expectation over trajectory h drawn from p(h|\theta), and p(h|\theta) denotes the probability density of observing trajectory h under policy parameter \theta:

p(h|\theta) = p(s_1) \prod_{t=1}^{T} p(s_{t+1}|s_t, a_t) \, \pi(a_t|s_t, \theta).

The goal of direct policy search is to find the optimal policy parameter \theta^* that maximizes the expected return J(\theta):

\theta^* = \mathop{\mathrm{argmax}}_{\theta} J(\theta).

However, directly maximizing J(\theta) is hard since J(\theta) usually involves high non-linearity with respect to \theta. Below, a gradient-based algorithm is introduced to find a local maximizer of J(\theta). An alternative approach based on the expectation-maximization algorithm is provided in Chapter 8.
7.2    Gradient Approach

In this section, a gradient ascent method for direct policy search is introduced (Figure 7.1).

7.2.1    Gradient Ascent

The simplest approach to finding a local maximizer of the expected return is gradient ascent (Williams, 1992):

\theta \longleftarrow \theta + \varepsilon \nabla_{\theta} J(\theta),
where \varepsilon is a small positive constant and \nabla_{\theta} J(\theta) denotes the gradient of the expected return J(\theta) with respect to the policy parameter \theta. The gradient \nabla_{\theta} J(\theta) is given by

\nabla_{\theta} J(\theta) = \int \nabla_{\theta} p(h|\theta) R(h) \, dh
  = \int p(h|\theta) \nabla_{\theta} \log p(h|\theta) R(h) \, dh
  = \int p(h|\theta) \sum_{t=1}^{T} \nabla_{\theta} \log \pi(a_t|s_t, \theta) R(h) \, dh,

where the so-called "log trick" is used:

\nabla_{\theta} p(h|\theta) = p(h|\theta) \nabla_{\theta} \log p(h|\theta).

This expression means that the gradient \nabla_{\theta} J(\theta) is given as the expectation over p(h|\theta):

\nabla_{\theta} J(\theta) = E_{p(h|\theta)} \left[ \sum_{t=1}^{T} \nabla_{\theta} \log \pi(a_t|s_t, \theta) R(h) \right].

Since p(h|\theta) is unknown, the expectation is approximated by the empirical average as

\widehat{\nabla_{\theta} J}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi(a_{t,n}|s_{t,n}, \theta) R(h_n),

where

h_n = [ s_{1,n}, a_{1,n}, \ldots, s_{T,n}, a_{T,n}, s_{T+1,n} ]

is an independent sample from p(h|\theta). This algorithm is called REINFORCE (Williams, 1992), which is an acronym for "REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility."
A popular choice for the policy model \pi(a|s, \theta) is the Gaussian policy model, where the policy parameter \theta consists of the mean vector \mu and the standard deviation \sigma:

\pi(a|s, \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(a - \mu^\top \phi(s))^2}{2\sigma^2} \right).        (7.1)

Here, \phi(s) denotes the basis function. For this Gaussian policy model, the policy gradients are explicitly computed as

\nabla_{\mu} \log \pi(a|s, \mu, \sigma) = \frac{a - \mu^\top \phi(s)}{\sigma^2} \phi(s),

\nabla_{\sigma} \log \pi(a|s, \mu, \sigma) = \frac{(a - \mu^\top \phi(s))^2 - \sigma^2}{\sigma^3}.
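Putting the empirical-average estimator and the Gaussian policy gradients together gives a compact REINFORCE gradient routine. The following is a minimal sketch, not the book's implementation; the episode container, the feature map `phi`, and the discount value are assumptions.

```python
import numpy as np

def reinforce_gradient(episodes, mu, sigma, phi, gamma=0.99):
    """REINFORCE estimator for the Gaussian policy model (7.1).

    episodes : list of episodes, each a list of (s, a, r) transitions.
    mu       : (B,) mean parameter vector;  sigma : scalar standard deviation.
    phi      : callable mapping a state to its (B,) feature vector.
    Returns the estimated gradients (grad_mu, grad_sigma) of the expected return.
    """
    grad_mu = np.zeros_like(mu)
    grad_sigma = 0.0
    for episode in episodes:
        # Return R(h): discounted sum of rewards along the trajectory.
        R = sum(gamma ** t * r for t, (_, _, r) in enumerate(episode))
        for s, a, _ in episode:
            f = phi(s)
            delta = a - mu @ f
            grad_mu += (delta / sigma ** 2) * f * R
            grad_sigma += ((delta ** 2 - sigma ** 2) / sigma ** 3) * R
    N = len(episodes)
    return grad_mu / N, grad_sigma / N
```

A gradient-ascent step would then simply add a small multiple of these estimates to (mu, sigma).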
As shown above, the gradient ascent algorithm for direct policy search is very simple to implement. Furthermore, the property that policy parameters are gradually updated in the gradient ascent algorithm is preferable when reinforcement learning is applied to the control of a vulnerable physical system such as a humanoid robot, because a sudden policy change can damage the system. However, the variance of policy gradients tends to be large in practice (Peters & Schaal, 2006; Sehnke et al., 2010), which can result in slow and unstable convergence.
7.2.2    Baseline Subtraction for Variance Reduction

Baseline subtraction is a useful technique to reduce the variance of gradient estimators. Technically, baseline subtraction can be viewed as the method of control variates (Fishman, 1996), which is an effective approach to reducing the variance of Monte Carlo integral estimators.

The basic idea of baseline subtraction is that an unbiased estimator \hat{\eta} is still unbiased if a zero-mean random variable m multiplied by a constant \xi is subtracted:

\hat{\eta}_{\xi} = \hat{\eta} - \xi m.

The constant \xi, which is called a baseline, may be chosen so that the variance of \hat{\eta}_{\xi} is minimized. By baseline subtraction, a more stable estimator than the original \hat{\eta} can be obtained.

A policy gradient estimator with baseline \xi subtracted is given by

\widehat{\nabla_{\theta} J_{\xi}}(\theta) = \widehat{\nabla_{\theta} J}(\theta) - \xi \sum_{t=1}^{T} \nabla_{\theta} \log \pi(a_{t,n}|s_{t,n}, \theta)
  = \frac{1}{N} \sum_{n=1}^{N} \left( R(h_n) - \xi \right) \sum_{t=1}^{T} \nabla_{\theta} \log \pi(a_{t,n}|s_{t,n}, \theta),

where the expectation of \nabla_{\theta} \log \pi(a|s, \theta) is zero:

E[ \nabla_{\theta} \log \pi(a|s, \theta) ] = \int \pi(a|s, \theta) \nabla_{\theta} \log \pi(a|s, \theta) \, da
  = \int \nabla_{\theta} \pi(a|s, \theta) \, da
  = \nabla_{\theta} \int \pi(a|s, \theta) \, da = \nabla_{\theta} 1 = 0.

The optimal baseline is defined as the minimizer of the variance of the gradient estimator with respect to the baseline (Greensmith et al., 2004; Weaver & Tao, 2001):

\xi^* = \mathop{\mathrm{argmin}}_{\xi} \mathrm{Var}_{p(h|\theta)}\left[ \widehat{\nabla_{\theta} J_{\xi}}(\theta) \right],
where \mathrm{Var}_{p(h|\theta)} denotes the trace of the covariance matrix:

\mathrm{Var}_{p(h|\theta)}[\zeta] = \mathrm{tr}\left( E_{p(h|\theta)}\left[ (\zeta - E_{p(h|\theta)}[\zeta]) (\zeta - E_{p(h|\theta)}[\zeta])^\top \right] \right)
  = E_{p(h|\theta)}\left[ \| \zeta - E_{p(h|\theta)}[\zeta] \|^2 \right].

It was shown in Peters and Schaal (2006) that the optimal baseline \xi^* is given as

\xi^* = \frac{ E_{p(h|\theta)}\left[ R(h) \left\| \sum_{t=1}^{T} \nabla_{\theta} \log \pi(a_t|s_t, \theta) \right\|^2 \right] }{ E_{p(h|\theta)}\left[ \left\| \sum_{t=1}^{T} \nabla_{\theta} \log \pi(a_t|s_t, \theta) \right\|^2 \right] }.

In practice, the expectations are approximated by sample averages.
7.2.3    Variance Analysis of Gradient Estimators

Here, the variance of gradient estimators is theoretically investigated for the Gaussian policy model (7.1) with \phi(s) = s. See Zhao et al. (2012) for technical details.

In the theoretical analysis, subsets of the following assumptions are considered:

Assumption (A): r(s, a, s') \in [-\beta, \beta] for \beta > 0.

Assumption (B): r(s, a, s') \in [\alpha, \beta] for 0 < \alpha < \beta.

Assumption (C): For \delta > 0, there exist two series \{c_t\}_{t=1}^{T} and \{d_t\}_{t=1}^{T} such that \|s_t\| \geq c_t and \|s_t\| \leq d_t hold with probability at least 1 - \frac{\delta}{2N}, respectively, over the choice of sample paths.

Note that Assumption (B) is stronger than Assumption (A). Let

\zeta(T) = C_T \alpha^2 - D_T \beta^2 / (2\pi),

where

C_T = \sum_{t=1}^{T} c_t^2 \quad \text{and} \quad D_T = \sum_{t=1}^{T} d_t^2.

First, the variance of gradient estimators is analyzed.
First,thevarianceofgradientestimatorsisanalyzed.
Theorem7.1UnderAssumptions(A)and(C),thefollowingupperbound
holdswithprobabilityatleast1−δ/2:
h
i
D
Var
b
Tβ2(1−γT)2
p(h|θ)∇µJ(µ,σ)≤
.
Nσ2(1−γ)2
UnderAssumption(A),itholdsthat
h
i
2Tβ2(1−γT)2
Var
b
p(h|θ)∇σJ(µ,σ)≤
.
Nσ2(1−γ)2
The above upper bounds are monotone increasing with respect to the trajectory length T.

For the variance of \widehat{\nabla_{\mu} J}(\mu, \sigma), the following lower bound holds (its upper bound has not been derived yet):

Theorem 7.2  Under Assumptions (B) and (C), the following lower bound holds with probability at least 1 - \delta:

\mathrm{Var}_{p(h|\theta)}\left[ \widehat{\nabla_{\mu} J}(\mu, \sigma) \right] \geq \frac{ (1 - \gamma^T)^2 }{ N \sigma^2 (1 - \gamma)^2 } \, \zeta(T).

This lower bound is non-trivial if \zeta(T) > 0, which can be fulfilled, e.g., if \alpha and \beta satisfy

2\pi C_T \alpha^2 > D_T \beta^2.
Next, the contribution of the optimal baseline is investigated. It was shown (Greensmith et al., 2004; Weaver & Tao, 2001) that the excess variance for an arbitrary baseline \xi is given by

\mathrm{Var}_{p(h|\theta)}\left[ \widehat{\nabla_{\theta} J_{\xi}}(\theta) \right] - \mathrm{Var}_{p(h|\theta)}\left[ \widehat{\nabla_{\theta} J_{\xi^*}}(\theta) \right]
  = \frac{ (\xi - \xi^*)^2 }{ N } E_{p(h|\theta)}\left[ \left\| \sum_{t=1}^{T} \nabla_{\theta} \log \pi(a_t|s_t, \theta) \right\|^2 \right].

Based on this expression, the following theorem can be obtained.

Theorem 7.3  Under Assumptions (B) and (C), the following bounds hold with probability at least 1 - \delta:

\frac{ C_T \alpha^2 (1 - \gamma^T)^2 }{ N \sigma^2 (1 - \gamma)^2 } \leq \mathrm{Var}_{p(h|\theta)}\left[ \widehat{\nabla_{\mu} J}(\mu, \sigma) \right] - \mathrm{Var}_{p(h|\theta)}\left[ \widehat{\nabla_{\mu} J_{\xi^*}}(\mu, \sigma) \right] \leq \frac{ \beta^2 (1 - \gamma^T)^2 D_T }{ N \sigma^2 (1 - \gamma)^2 }.

This theorem shows that the lower bound of the excess variance is positive and monotone increasing with respect to the trajectory length T. This means that the variance is always reduced by optimal baseline subtraction and the amount of variance reduction is monotone increasing with respect to the trajectory length T. Note that the upper bound is also monotone increasing with respect to the trajectory length T.
Thistheoremshowsthatthelowerboundoftheexcessvarianceispositive
andmonotoneincreasingwithrespecttothetrajectorylengthT.Thismeans
thatthevarianceisalwaysreducedbyoptimalbaselinesubtractionandthe
amountofvariancereductionismonotoneincreasingwithrespecttothetra-
jectorylengthT.Notethattheupperboundisalsomonotoneincreasingwith
respecttothetrajectorylengthT.
Finally,thevarianceofgradientestimatorswiththeoptimalbaselineis
investigated:
Theorem7.4UnderAssumptions(B)and(C),itholdsthat
(1−γT)2
Var
b
p(h|θ)[∇µJξ∗(µ,σ)]≤(β2D
Nσ2(1−γ)2
T−α2CT),
wheretheinequalityholdswithprobabilityatleast1−δ.
FIGURE 7.2: Ordinary gradients (a) and natural gradients (b). Ordinary gradients treat all dimensions equally, while natural gradients take the Riemannian structure into account.
This theorem shows that the upper bound of the variance of the gradient estimators with the optimal baseline is still monotone increasing with respect to the trajectory length T. Thus, when the trajectory length T is large, the variance of the gradient estimators can still be large even with the optimal baseline.

In Chapter 9, another gradient approach will be introduced for overcoming this large-variance problem.
7.3    Natural Gradient Approach

The gradient-based policy parameter update used in the REINFORCE algorithm is performed under the Euclidean metric. In this section, we show another useful choice of the metric for gradient-based policy search.

7.3.1    Natural Gradient Ascent

Use of the Euclidean metric implies that all dimensions of the policy parameter vector \theta are treated equally (Figure 7.2(a)). However, since a policy parameter \theta specifies a conditional probability density \pi(a|s, \theta), use of the Euclidean metric in the parameter space does not necessarily mean that all dimensions are treated equally in the space of conditional probability densities. Thus, a small change in the policy parameter \theta can cause a big change in the conditional probability density \pi(a|s, \theta) (Kakade, 2002).

Figure 7.3 describes the Gaussian densities with means \mu = -5, 0, 5 and standard deviations \sigma = 1, 2. This shows that if the standard deviation is doubled, the difference in means should also be doubled to maintain the same overlapping level. Thus, it is "natural" to compute the distance between two Gaussian densities parameterized with (\mu, \sigma) and (\mu + \Delta\mu, \sigma) not by \Delta\mu, but by \Delta\mu / \sigma.

FIGURE 7.3: Gaussian densities with different means and standard deviations. If the standard deviation is doubled (from the solid lines to the dashed lines), the difference in means should also be doubled to maintain the same overlapping level.

Gradients that treat all dimensions equally in the space of probability densities are called natural gradients (Amari, 1998; Amari & Nagaoka, 2000). The ordinary gradient is defined as the steepest ascent direction under the Euclidean metric (Figure 7.2(a)):

\nabla_{\theta} J(\theta) = \mathop{\mathrm{argmax}}_{\Delta\theta} J(\theta + \Delta\theta) \quad \text{subject to} \quad \Delta\theta^\top \Delta\theta \leq \epsilon,

where \epsilon is a small positive number. On the other hand, the natural gradient is defined as the steepest ascent direction under the Riemannian metric (Figure 7.2(b)):

\widetilde{\nabla}_{\theta} J(\theta) = \mathop{\mathrm{argmax}}_{\Delta\theta} J(\theta + \Delta\theta) \quad \text{subject to} \quad \Delta\theta^\top R_{\theta} \Delta\theta \leq \epsilon,

where R_{\theta} is the Riemannian metric, which is a positive definite matrix. The solution of the above optimization problem is given by

\widetilde{\nabla}_{\theta} J(\theta) = R_{\theta}^{-1} \nabla_{\theta} J(\theta).

Thus, the ordinary gradient \nabla_{\theta} J(\theta) is modified by the inverse Riemannian metric R_{\theta}^{-1} in the natural gradient.

A standard distance metric in the space of probability densities is the Kullback–Leibler (KL) divergence (Kullback & Leibler, 1951). The KL divergence from density p to density q is defined as

\mathrm{KL}(p \| q) = \int p(\theta) \log \frac{p(\theta)}{q(\theta)} \, d\theta.
\mathrm{KL}(p \| q) is always non-negative and zero if and only if p = q. Thus, smaller \mathrm{KL}(p \| q) means that p and q are "closer." Note, however, that the KL divergence is not symmetric, i.e., \mathrm{KL}(p \| q) \neq \mathrm{KL}(q \| p) in general.

For small \Delta\theta, the KL divergence from p(h|\theta) to p(h|\theta + \Delta\theta) can be approximated by

\Delta\theta^\top F_{\theta} \Delta\theta,

where F_{\theta} is the Fisher information matrix:

F_{\theta} = E_{p(h|\theta)}\left[ \nabla_{\theta} \log p(h|\theta) \nabla_{\theta} \log p(h|\theta)^\top \right].

Thus, F_{\theta} is the Riemannian metric induced by the KL divergence.

Then the update rule of the policy parameter \theta based on the natural gradient is given by

\theta \longleftarrow \theta + \varepsilon \hat{F}_{\theta}^{-1} \widehat{\nabla_{\theta} J}(\theta),

where \varepsilon is a small positive constant and \hat{F}_{\theta} is a sample approximation of F_{\theta}:

\hat{F}_{\theta} = \frac{1}{N} \sum_{n=1}^{N} \nabla_{\theta} \log p(h_n|\theta) \nabla_{\theta} \log p(h_n|\theta)^\top.

Under mild regularity conditions, the Fisher information matrix F_{\theta} can be expressed as

F_{\theta} = -E_{p(h|\theta)}\left[ \nabla_{\theta}^2 \log p(h|\theta) \right],

where \nabla_{\theta}^2 \log p(h|\theta) denotes the Hessian matrix of \log p(h|\theta), i.e., its (b, b')-th element is given by \frac{\partial^2}{\partial\theta_b \partial\theta_{b'}} \log p(h|\theta). This means that the natural gradient takes the curvature into account, by which the convergence behavior at flat plateaus and steep ridges tends to be improved. On the other hand, a potential weakness of natural gradients is that the computation of the inverse Riemannian metric tends to be numerically unstable (Deisenroth et al., 2013).
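Note that \nabla_{\theta} \log p(h|\theta) equals the per-trajectory sum of policy score vectors, since the transition and initial-state terms do not depend on \theta. A natural-gradient step can therefore reuse the same score sums that the vanilla gradient estimator already computes. The following is a minimal sketch, with the small ridge term added for numerical stability as an assumption addressing the instability mentioned above:

```python
import numpy as np

def natural_gradient_step(theta, grad_J, score_sums, eps=0.1, ridge=1e-6):
    """One natural-gradient update using the sample Fisher matrix.

    theta      : (B,) current policy parameter.
    grad_J     : (B,) ordinary (vanilla) policy-gradient estimate.
    score_sums : (N, B) array of per-trajectory scores sum_t grad log pi(a_t|s_t).
    """
    N, B = score_sums.shape
    F = score_sums.T @ score_sums / N + ridge * np.eye(B)  # sample Fisher matrix
    # theta <- theta + eps * F^{-1} grad_J, solved without forming the inverse.
    return theta + eps * np.linalg.solve(F, grad_J)
```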
7.3.2    Illustration

Let us illustrate the difference between ordinary and natural gradients numerically.

Consider the one-dimensional real-valued state space S = R and the one-dimensional real-valued action space A = R. The transition dynamics is linear and deterministic as s' = s + a, and the reward function is quadratic as r = 0.5 s^2 - 0.05 a. The discount factor is set at \gamma = 0.95. The Gaussian policy model,

\pi(a|s, \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(a - \mu s)^2}{2\sigma^2} \right),

is employed, which contains the mean parameter \mu and the standard deviation parameter \sigma. The optimal policy parameters in this setup are given by (\mu^*, \sigma^*) \approx (-0.912, 0).
FIGURE 7.4: Numerical illustrations of ordinary gradients (a) and natural gradients (b), plotted in the (\mu, \sigma) policy-parameter plane.
Figure 7.4 shows a numerical comparison of ordinary and natural gradients for the Gaussian policy. The contour lines and the arrows indicate the expected return surface and the gradient directions, respectively. The graphs show that the ordinary gradients tend to strongly reduce the standard deviation parameter \sigma without really updating the mean parameter \mu. This means that the stochasticity of the policy is lost quickly and thus the agent becomes less exploratory. Consequently, once \sigma gets closer to zero, the solution is at a flat plateau along the direction of \mu and thus policy updates in \mu are very slow. On the other hand, the natural gradients reduce both the mean parameter \mu and the standard deviation parameter \sigma in a balanced way. As a result, convergence becomes much faster than with the ordinary gradient method.
7.4    Application in Computer Graphics: Artist Agent

Oriental ink painting, which is also called sumie, is one of the most distinctive painting styles and has attracted artists around the world. Major challenges in sumie simulation are to abstract complex scene information and reproduce smooth and natural brush strokes. Reinforcement learning is useful for automatically generating such smooth and natural strokes (Xie et al., 2013). In this section, the REINFORCE algorithm explained in Section 7.2 is applied to sumie agent training.
7.4.1    Sumie Painting

Among various techniques of non-photorealistic rendering (Gooch & Gooch, 2001), stroke-based painterly rendering synthesizes an image from a source image in a desired painting style by placing discrete strokes (Hertzmann, 2003). Such an algorithm simulates the common practice of human painters who create paintings with brush strokes.

Western painting styles such as water-color, pastel, and oil painting overlay strokes onto multiple layers, while oriental ink painting uses a few expressive strokes produced by soft brush tufts to convey significant information about a target scene. The appearance of a stroke in oriental ink painting is therefore determined by the shape of the object to paint, the path and posture of the brush, and the distribution of pigments in the brush.

Drawing smooth and natural strokes in arbitrary shapes is challenging since the optimal brush trajectory and the posture of a brush footprint are different for each shape. Existing methods can efficiently map brush texture by deformation onto a user-given trajectory line or the shape of a target stroke (Hertzmann, 1998; Guo & Kunii, 2003). However, the geometrical process of morphing the entire texture of a brush stroke into the target shape leads to undesirable effects such as unnatural foldings and creased appearances at corners or curves.
Here, a soft-tuft brush is treated as a reinforcement learning agent, and the REINFORCE algorithm is used to automatically draw artistic strokes. More specifically, given any closed contour that represents the shape of a desired single stroke without overlap, the agent moves the brush on the canvas to fill the given shape from a start point to an end point with stable poses along a smooth continuous movement trajectory (see Figure 7.5).

In oriental ink painting, there are several different brush styles that characterize the paintings. Below, two representative styles called the upright brush style and the oblique brush style are considered (see Figure 7.6). In the upright brush style, the tip of the brush should be located on the medial axis of the expected stroke shape, and the bottom of the brush should be tangent to both sides of the boundary. On the other hand, in the oblique brush style, the tip of the brush should touch one side of the boundary and the bottom of the brush should be tangent to the other side of the boundary. The choice between the upright brush style and the oblique brush style is exclusive, and a user is asked to choose one of the styles in advance.
7.4.2    Design of States, Actions, and Immediate Rewards

Here, the specific design of states, actions, and immediate rewards tailored to the sumie agent is described.
FIGURE 7.5: Illustration of the brush agent and its path. (a) Brush model: a stroke is generated by moving the brush with the following 3 actions: Action 1 is regulating the direction of the brush movement, Action 2 is pushing down/lifting up the brush, and Action 3 is rotating the brush handle. Only Action 1 is determined by reinforcement learning; Action 2 and Action 3 are determined based on Action 1. (b) Footprints: the top symbol illustrates the brush agent, which consists of a tip Q and a circle with center C and radius r; the others illustrate footprints of a real brush with different ink quantities. (c) Basic stroke styles: there are 6 basic stroke styles: full ink, dry ink, first-half hollow, hollow, middle hollow, and both-end hollow. Small footprints on the top of each stroke show the interpolation order.
7.4.2.1    States

The global measurement (i.e., the pose configuration of a footprint under the global Cartesian coordinate) and the local measurement (i.e., the pose and the locomotion information of the brush agent relative to the surrounding environment) are used as states. Here, only the local measurement is used to calculate a reward and a policy, by which the agent can learn a drawing policy that is generalizable to new shapes. Below, the local measurement is regarded as the state and the global measurement is dealt with only implicitly.
FIGURE 7.6: Upright brush style (left) and oblique brush style (right).
The local state-space design consists of two components: a current surrounding shape and an upcoming shape. More specifically, the state vector s consists of the following six features:

s = (\omega, \phi, d, \kappa_1, \kappa_2, l)^\top.

Each feature is defined as follows (see Figure 7.7):

• \omega \in (-\pi, \pi]: The angle of the velocity vector of the brush agent relative to the medial axis.

• \phi \in (-\pi, \pi]: The heading direction of the brush agent relative to the medial axis.

• d \in [-2, 2]: The ratio of the offset distance \delta from the center C of the brush agent to the nearest point P on the medial axis M over the radius r of the brush agent (|d| = \delta / r). d takes a positive/negative value when the center of the brush agent is on the left-/right-hand side of the medial axis:

  – d takes the value 0 when the center of the brush agent is on the medial axis.

  – d takes a value in [-1, 1] when the brush agent is inside the boundaries.

  – The value of d is in [-2, -1) or in (1, 2] when the brush agent goes over the boundary on one side.
FIGURE 7.7: Illustration of the design of states. Left: The brush agent consists of a tip Q and a circle with center C and radius r. Right: The ratio d of the offset distance \delta over the radius r. Footprint f_{t-1} is inside the drawing area, and the circle with center C_{t-1} and the tip Q_{t-1} touch the boundary on each side; in this case, \delta_{t-1} \leq r_{t-1} and d_{t-1} \in [0, 1]. On the other hand, f_t goes over the boundary, and then \delta_t > r_t and d_t > 1. Note that d is restricted to be in [-2, 2], and P is the nearest point on the medial axis M to C.
Note that the center of the agent is restricted to be within the shape. Therefore, the extreme values of d are \pm 2, attained when the center of the agent is on the boundary.

• \kappa_1, \kappa_2 \in (-1, 1): \kappa_1 provides the current surrounding information at the point P_t, whereas \kappa_2 provides the upcoming shape information at the point P_{t+1}:

\kappa_i = \frac{2}{\pi} \arctan \sqrt{0.05 / r'_i},

where r'_i is the radius of the curve. More specifically, the value takes 0/negative/positive when the shape is straight/left-curved/right-curved, and the larger its absolute value is, the tighter the curve is.

• l \in \{0, 1\}: A binary label that indicates whether the agent moves to a region covered by the previous footprints or not. l = 0 means that the agent moves to a region covered by the previous footprint. Otherwise, l = 1 means that it moves to an uncovered region.
7.4.2.2    Actions

To generate elegant brush strokes, the brush agent should move inside the given boundaries properly. Here, the following actions are considered to control the brush (see Figure 7.5(a)):

• Action 1: Movement of the brush on the canvas paper.

• Action 2: Scaling up/down of the footprint.
• Action 3: Rotation of the heading direction of the brush.

Since properly covering the whole desired region is the most important factor in terms of visual quality, the movement of the brush (Action 1) is regarded as the primary action. More specifically, Action 1 takes a value in (-\pi, \pi] that indicates the offset turning angle of the motion direction relative to the medial axis of an expected stroke shape. In practical applications, the agent should be able to deal with arbitrary strokes in various scales. To achieve stable performance in different scales, the velocity is adaptively changed as r/3, where r is the radius of the current footprint.

Action 1 is determined by the Gaussian policy function trained by the REINFORCE algorithm, and Action 2 and Action 3 are determined as follows:

• Oblique brush stroke style: The tip of the agent is set to touch one side of the boundary, and the bottom of the agent is set to be tangent to the other side of the boundary.

• Upright brush stroke style: The tip of the agent is chosen to travel along the medial axis of the shape.

If it is not possible to satisfy the above constraints by adjusting Action 2 and Action 3, the new footprint simply takes the same posture as the previous one.
7.4.2.3    Immediate Rewards

The immediate reward function measures the quality of the brush agent's movement after taking an action at each time step. The reward is designed to reflect the following two aspects:

• The distance between the center of the brush agent and the nearest point on the medial axis of the shape at the current time step: This detects whether the agent moves out of the region or travels backward from the correct direction.

• The change of the local configuration of the brush agent after executing an action: This detects whether the agent moves smoothly.

These two aspects are formalized by defining the reward function as follows:

r(s_t, a_t, s_{t+1}) =
  0                                                                                                    if f_t = f_{t+1} or l_{t+1} = 0,
  \frac{ 2 + |\kappa_1(t)| + |\kappa_2(t)| }{ E^{(t)}_{\mathrm{location}} + E^{(t)}_{\mathrm{posture}} }    otherwise,

where f_t and f_{t+1} are the footprints at time steps t and t+1, respectively. This reward design implies that the immediate reward is zero when the brush is blocked by a boundary (f_t = f_{t+1}) or the brush is going backward to a region
that has already been covered by previous footprints. \kappa_1(t) and \kappa_2(t) are the values of \kappa_1 and \kappa_2 at time step t. |\kappa_1(t)| + |\kappa_2(t)| adaptively increases the immediate reward depending on the curvatures \kappa_1(t) and \kappa_2(t) of the medial axis.

E^{(t)}_{\mathrm{location}} measures the quality of the location of the brush agent with respect to the medial axis, defined by

E^{(t)}_{\mathrm{location}} =
  \tau_1 |\omega_t| + \tau_2 (|d_t| + 5)    if d_t \in [-2, -1) \cup (1, 2],
  \tau_1 |\omega_t| + \tau_2 |d_t|           if d_t \in [-1, 1],

where d_t is the value of d at time step t. \tau_1 and \tau_2 are weight parameters, which are chosen depending on the brush style: \tau_1 = \tau_2 = 0.5 for the upright brush style, and \tau_1 = 0.1 and \tau_2 = 0.9 for the oblique brush style. Since d_t contains information about whether the agent goes over the boundary or not, as illustrated in Figure 7.7, the penalty +5 is added to E_{\mathrm{location}} when the agent goes over the boundary of the shape.

E^{(t)}_{\mathrm{posture}} measures the quality of the posture of the brush agent based on neighboring footprints, defined by

E^{(t)}_{\mathrm{posture}} = \Delta\omega_t / 3 + \Delta\phi_t / 3 + \Delta d_t / 3,

where \Delta\omega_t, \Delta\phi_t, and \Delta d_t are the changes in the angle \omega of the velocity vector, the heading direction \phi, and the ratio d of the offset distance, respectively. The notation \Delta x_t denotes the normalized squared change between x_{t-1} and x_t defined by

\Delta x_t =
  1                                                            if x_t = x_{t-1} = 0,
  \frac{ (x_t - x_{t-1})^2 }{ (|x_t| + |x_{t-1}|)^2 }           otherwise.
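The reward definition above can be transcribed almost directly into code. The following is a minimal sketch assuming the geometric quantities (the footprint-blocked and covered-region flags, and the normalized squared changes) have already been computed elsewhere; the function name and argument layout are hypothetical.

```python
def immediate_reward(blocked, covered, omega, d, kappa1, kappa2,
                     d_omega, d_phi, d_d, tau1=0.5, tau2=0.5):
    """Sketch of the sumie reward (upright-style weights tau1 = tau2 = 0.5).

    blocked : True if the footprint did not move (f_t == f_{t+1}).
    covered : True if the agent moved back onto an already-covered region (l = 0).
    d_omega, d_phi, d_d : normalized squared changes Delta_omega, Delta_phi, Delta_d.
    """
    if blocked or covered:
        return 0.0
    # E_location: penalty +5 is added when the agent goes over the boundary (|d| > 1).
    e_location = tau1 * abs(omega) + tau2 * (abs(d) + 5 if abs(d) > 1 else abs(d))
    # E_posture: average of the three normalized squared changes.
    e_posture = (d_omega + d_phi + d_d) / 3.0
    return (2 + abs(kappa1) + abs(kappa2)) / (e_location + e_posture)
```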
7.4.2.4 Training and Test Sessions

A naive way to train an agent is to use an entire stroke shape as a training sample. However, this has several drawbacks, e.g., collecting many training samples is costly and generalization to new shapes is hard. To overcome these limitations, the agent is trained based on partial shapes, not the entire shapes (Figure 7.8(a)). This allows us to generate various partial shapes from a single entire shape, which significantly increases the number and variation of training samples. Another merit is that the generalization ability to new shapes can be enhanced, because even when the entire profile of a new shape is quite different from that of the training data, the new shape may contain similar partial shapes. Figure 7.8(c) illustrates 8 examples of 80 digitized real single brush strokes that are commonly used in oriental ink painting. Boundaries are extracted as the shape information and are arranged in a queue for training (see Figure 7.8(b)).

In the training session, the initial position of the first episode is chosen to be the start point of the medial axis, and the direction to move is chosen to be toward the goal point, as illustrated in Figure 7.8(b). In the first episode, the initial footprint is set at the start point of the shape. Then, in the following episodes, the initial footprint is set at either the last footprint in the previous episode or the start point of the shape, depending on whether the agent moved well or was blocked by the boundary in the previous episode.

FIGURE 7.8: Policy training scheme. (a) Each entire shape is composed of one of the upper regions U_i, the common region Ω, and one of the lower regions L_j. (b) Boundaries are extracted as the shape information and are arranged in a queue for training. (c) Eight examples of 80 digitized real single brush strokes that are commonly used in oriental ink painting are illustrated.

After learning a drawing policy, the brush agent applies the learned policy to covering given boundaries with smooth strokes.
FIGURE 7.9: Average and standard deviation of returns obtained by the reinforcement learning (RL) method over 10 trials and the upper limit of the return value. (a) Upright brush style. (b) Oblique brush style.

The location of the agent is initialized at the start point of a new shape. The agent then sequentially selects actions based on the learned policy and makes transitions until it reaches the goal point.
7.4.3 Experimental Results

First, the performance of the reinforcement learning (RL) method is investigated. Policies are separately trained by the REINFORCE algorithm for the upright brush style and the oblique brush style using 80 single strokes as training data (see Figure 7.8(c)). The parameters of the initial policy are set at

\theta = (\mu^\top, \sigma)^\top = (0, 0, 0, 0, 0, 0, 2)^\top,

where the first six elements correspond to the Gaussian mean and the last element is the Gaussian standard deviation. The agent collects N = 300 episodic samples with trajectory length T = 32. The discount factor is set at γ = 0.99.

The average and standard deviations of the return for 300 training episodic samples over 10 trials are plotted in Figure 7.9. The graphs show that the average returns sharply increase in an early stage and approach the optimal values (i.e., receiving the maximum immediate reward, +1, for all steps).

Next, the performance of the RL method is compared with that of the dynamic programming (DP) method (Xie et al., 2011), which involves discretization of the continuous state space. In Figure 7.10, the experimental results obtained by DP with different numbers of footprint candidates in each step of the DP search are plotted together with the result obtained by RL. This shows that the execution time of the DP method increases significantly as the number of footprint candidates increases.
FIGURE 7.10: Average return and computation time for reinforcement learning (RL) and dynamic programming (DP) as functions of the number of footprint candidates. (a) Average return. (b) Computation time.
In the DP method, the best return value 26.27 is achieved when the number of footprint candidates is set at 180. Although this maximum value is comparable to the return obtained by the RL method (26.44), RL is about 50 times faster than the DP method. Figure 7.11 shows some exemplary strokes generated by RL (the top two rows) and DP (the bottom two rows). This shows that the agent trained by RL is able to draw nice strokes with stable poses after the 30th policy update iteration (see also Figure 7.9). On the other hand, as illustrated in Figure 7.11, the DP results for 5, 60, and 100 footprint candidates are unacceptably poor. Given that the DP method requires manual tuning of the number of footprint candidates at each step for each input shape, the RL method is demonstrated to be promising.

The RL method is further applied to more realistic shapes, illustrated in Figure 7.12. Although the shapes are not included in the training samples, the RL method can produce smooth and natural brush strokes for various unlearned shapes. More results are illustrated in Figure 7.13, showing that the RL method is promising in photo conversion into the sumi-e style.
7.5 Remarks

In this chapter, gradient-based algorithms for direct policy search are introduced. These gradient-based methods are suitable for controlling vulnerable physical systems such as humanoid robots, thanks to the nature of gradient methods that parameters are updated gradually. Furthermore, direct policy search can handle continuous actions in a straightforward way, which is an advantage over policy iteration, explained in Part II.
FIGURE 7.11: Examples of strokes generated by RL and DP. The top two rows show the RL results over policy update iterations (1st, 10th, 20th, 30th, and 40th iterations), while the bottom two rows show the DP results for different numbers of footprint candidates (5, 60, 100, 140, and 180 candidates). The line segment connects the center and the tip of a footprint, and the circle denotes the bottom circle of the footprint.
The gradient-based method was successfully applied to automatic sumi-e painting generation. Considering local measurements in state design was shown to be useful, which allowed a brush agent to learn a general drawing policy that is independent of a specific entire shape. Another important factor was to train the brush agent on partial shapes, not the entire shapes. This contributed highly to enhancing the generalization ability to new shapes, because even when a new shape is quite different from the training data as a whole, it often contains similar partial shapes. In this kind of real-world application, manually designing immediate reward functions is often time consuming and difficult. The use of inverse reinforcement learning (Abbeel & Ng, 2004) would be a promising approach for this purpose.
FIGURE 7.12: Results on new shapes. (a) Real photo. (b) User input boundaries. (c) Trajectories estimated by RL. (d) Rendering results.
In particular, in the context of sumi-e drawing, such data-driven design of reward functions will allow automatic learning of the style of a particular artist from his/her drawings.

A practical weakness of the gradient-based approach is that the step size of gradient ascent is often difficult to choose. In Chapter 8, a step-size-free method of direct policy search based on the expectation-maximization algorithm will be introduced. Another critical problem of direct policy search is that policy update is rather unstable due to the stochasticity of policies. Although variance reduction by baseline subtraction can mitigate this problem to some extent, the instability problem is still critical in practice. The natural gradient method could be an alternative, but computing the inverse Riemannian metric tends to be unstable. In Chapter 9, another gradient approach that can address the instability problem will be introduced.
FIGURE 7.13: Photo conversion into the sumi-e style.
Chapter 8
Direct Policy Search by Expectation-Maximization

Gradient-based direct policy search methods introduced in Chapter 7 are useful particularly in controlling continuous systems. However, appropriately choosing the step size of gradient ascent is often difficult in practice. In this chapter, we introduce another direct policy search method based on the expectation-maximization (EM) algorithm that does not contain the step size parameter. In Section 8.1, the main idea of the EM-based method is described, which is expected to converge faster because policies are more aggressively updated than in the gradient-based approach. In practice, however, direct policy search often requires a large number of samples to obtain a stable policy update estimator. To improve the stability when the sample size is small, reusing previously collected samples is a promising approach. In Section 8.2, the sample-reuse technique that has been successfully used to improve the performance of policy iteration (see Chapter 4) is applied to the EM-based method. Then its experimental performance is evaluated in Section 8.3 and this chapter is concluded in Section 8.4.
8.1 Expectation-Maximization Approach

The gradient-based optimization algorithms introduced in Section 7.2 gradually update policy parameters over iterations. Although this is advantageous when controlling a physical system, it requires many iterations until convergence. In this section, the expectation-maximization (EM) algorithm (Dempster et al., 1977) is used to cope with this problem.

The basic idea of EM-based policy search is to iteratively update the policy parameter θ by maximizing a lower bound of the expected return J(θ):

J(\theta) = \int p(h|\theta) R(h) \, dh.

To derive a lower bound of J(θ), Jensen's inequality (Bishop, 2006) is utilized:

\int q(h) f(g(h)) \, dh \ge f\left( \int q(h) g(h) \, dh \right),

where q is a probability density, f is a convex function, and g is a non-negative function. For f(t) = −log t, Jensen's inequality yields

\int q(h) \log g(h) \, dh \le \log \int q(h) g(h) \, dh.   (8.1)

Assume that the return R(h) is non-negative. Let θ̃ be the current policy parameter during the optimization procedure, and let q and g in Eq. (8.1) be set as

q(h) = \frac{p(h|\tilde\theta) R(h)}{J(\tilde\theta)} \quad \text{and} \quad g(h) = \frac{p(h|\theta)}{p(h|\tilde\theta)}.

Then the following lower bound holds for all θ:

\log \frac{J(\theta)}{J(\tilde\theta)} = \log \int \frac{p(h|\theta) R(h)}{J(\tilde\theta)} \, dh = \log \int \frac{p(h|\tilde\theta) R(h)}{J(\tilde\theta)} \, \frac{p(h|\theta)}{p(h|\tilde\theta)} \, dh \ge \int \frac{p(h|\tilde\theta) R(h)}{J(\tilde\theta)} \log \frac{p(h|\theta)}{p(h|\tilde\theta)} \, dh.

This yields

\log J(\theta) \ge \log \tilde{J}(\theta),

where

\log \tilde{J}(\theta) = \int \frac{R(h) p(h|\tilde\theta)}{J(\tilde\theta)} \log \frac{p(h|\theta)}{p(h|\tilde\theta)} \, dh + \log J(\tilde\theta).

In the EM approach, the parameter θ is iteratively updated by maximizing the lower bound J̃(θ):

\hat\theta = \mathop{\mathrm{argmax}}_{\theta} \tilde{J}(\theta).

Since log J̃(θ̃) = log J(θ̃), the lower bound J̃ touches the target function J at the current solution θ̃:

\tilde{J}(\tilde\theta) = J(\tilde\theta).

Thus, monotone non-decrease of the expected return is guaranteed:

J(\hat\theta) \ge J(\tilde\theta).

This update is iterated until convergence (see Figure 8.1).
FIGURE 8.1: Policy parameter update in the EM-based policy search. The policy parameter θ is updated iteratively by maximizing the lower bound J̃(θ), which touches the true expected return J(θ) at the current solution θ̃.

Let us employ the Gaussian policy model defined as

\pi(a|s,\theta) = \pi(a|s,\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(a - \mu^\top\phi(s))^2}{2\sigma^2} \right),

where θ = (μ⊤, σ)⊤ and φ(s) denotes the basis function.

The maximizer θ̂ = (μ̂⊤, σ̂)⊤ of the lower bound J̃(θ) can be analytically obtained as

\hat\mu = \left( \int p(h|\tilde\theta) R(h) \sum_{t=1}^T \phi(s_t)\phi(s_t)^\top \, dh \right)^{-1} \int p(h|\tilde\theta) R(h) \sum_{t=1}^T a_t \phi(s_t) \, dh
\approx \left( \sum_{n=1}^N R(h_n) \sum_{t=1}^T \phi(s_{t,n})\phi(s_{t,n})^\top \right)^{-1} \sum_{n=1}^N R(h_n) \sum_{t=1}^T a_{t,n} \phi(s_{t,n}),

\hat\sigma^2 = \left( \int p(h|\tilde\theta) R(h) \, dh \right)^{-1} \int p(h|\tilde\theta) R(h) \, \frac{1}{T} \sum_{t=1}^T (a_t - \hat\mu^\top\phi(s_t))^2 \, dh
\approx \left( \sum_{n=1}^N R(h_n) \right)^{-1} \sum_{n=1}^N R(h_n) \, \frac{1}{T} \sum_{t=1}^T (a_{t,n} - \hat\mu^\top\phi(s_{t,n}))^2,

where the expectation over h is approximated by the average over roll-out samples H = {h_n}_{n=1}^N from the current policy θ̃:

h_n = [s_{1,n}, a_{1,n}, \ldots, s_{T,n}, a_{T,n}].

Note that EM-based policy search for Gaussian models is called reward-weighted regression (RWR) (Peters & Schaal, 2007).
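As an illustration of the analytic update above, the following Python sketch implements one RWR step for the Gaussian policy. The array shapes and argument names are assumptions made for this sketch (phi is an (N, T, B) array of basis values, actions is (N, T), returns is (N,)); they are not taken from an existing implementation.

import numpy as np

def rwr_update(phi, actions, returns):
    # One reward-weighted regression update: each trajectory is weighted
    # by its (non-negative) return R(h_n).
    N, T, B = phi.shape
    A = np.zeros((B, B))
    b = np.zeros(B)
    for n in range(N):
        A += returns[n] * np.einsum('tb,tc->bc', phi[n], phi[n])
        b += returns[n] * (actions[n][:, None] * phi[n]).sum(axis=0)
    mu = np.linalg.solve(A, b)            # return-weighted least squares
    residuals = actions - phi @ mu        # (N, T) residuals a_t - mu^T phi(s_t)
    sigma2 = (returns * (residuals ** 2).mean(axis=1)).sum() / returns.sum()
    return mu, np.sqrt(sigma2)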
8.2 Sample Reuse

In practice, a large number of samples is needed to obtain a stable policy update estimator in the EM-based policy search. In this section, the sample-reuse technique is applied to the EM method to cope with the instability problem.
8.2.1 Episodic Importance Weighting

The original RWR method is an on-policy algorithm that uses data drawn from the current policy. On the other hand, the situation called off-policy reinforcement learning is considered here, where the sampling policy for collecting data samples is different from the target policy. More specifically, N trajectory samples are gathered following the policy πℓ in the ℓ-th policy update iteration:

H^{\pi_\ell} = \{ h^{\pi_\ell}_1, \ldots, h^{\pi_\ell}_N \},

where each trajectory sample h^{πℓ}_n is given as

h^{\pi_\ell}_n = [ s^{\pi_\ell}_{1,n}, a^{\pi_\ell}_{1,n}, \ldots, s^{\pi_\ell}_{T,n}, a^{\pi_\ell}_{T,n}, s^{\pi_\ell}_{T+1,n} ].

We want to utilize all these samples to improve the current policy.

Suppose that we are currently at the L-th policy update iteration. If the policies {πℓ}_{ℓ=1}^L remained unchanged over the RWR updates, just using the plain update rules provided in Section 8.1 would give a consistent estimator θ̂^NIW_{L+1} = (μ̂^{NIW⊤}_{L+1}, σ̂^NIW_{L+1})⊤, where

\hat\mu^{\mathrm{NIW}}_{L+1} = \left( \sum_{\ell=1}^L \sum_{n=1}^N R(h^{\pi_\ell}_n) \sum_{t=1}^T \phi(s^{\pi_\ell}_{t,n}) \phi(s^{\pi_\ell}_{t,n})^\top \right)^{-1} \left( \sum_{\ell=1}^L \sum_{n=1}^N R(h^{\pi_\ell}_n) \sum_{t=1}^T a^{\pi_\ell}_{t,n} \phi(s^{\pi_\ell}_{t,n}) \right),

(\hat\sigma^{\mathrm{NIW}}_{L+1})^2 = \left( \sum_{\ell=1}^L \sum_{n=1}^N R(h^{\pi_\ell}_n) \right)^{-1} \left( \sum_{\ell=1}^L \sum_{n=1}^N R(h^{\pi_\ell}_n) \, \frac{1}{T} \sum_{t=1}^T \left( a^{\pi_\ell}_{t,n} - \hat\mu^{\mathrm{NIW}\top}_{L+1} \phi(s^{\pi_\ell}_{t,n}) \right)^2 \right).

The superscript "NIW" stands for "no importance weight." However, since policies are updated in each RWR iteration, the data samples {H^{πℓ}}_{ℓ=1}^L collected over iterations generally follow different probability distributions induced by different policies. Therefore, naive use of the above update rules will result in an inconsistent estimator.
In the same way as the discussion in Chapter 4, importance sampling can be used to cope with this problem. The basic idea of importance sampling is to weight the samples drawn from a different distribution to match the target distribution. More specifically, from i.i.d. (independent and identically distributed) samples {h^{πℓ}_n}_{n=1}^N following p(h|θℓ), the expectation of a function g(h) over another probability density function p(h|θL) can be estimated in a consistent manner by the importance-weighted average:

\frac{1}{N} \sum_{n=1}^N g(h^{\pi_\ell}_n) \frac{p(h^{\pi_\ell}_n|\theta_L)}{p(h^{\pi_\ell}_n|\theta_\ell)} \xrightarrow{N\to\infty} \mathbb{E}_{p(h|\theta_\ell)}\!\left[ g(h) \frac{p(h|\theta_L)}{p(h|\theta_\ell)} \right] = \int g(h) \frac{p(h|\theta_L)}{p(h|\theta_\ell)} p(h|\theta_\ell) \, dh = \int g(h) p(h|\theta_L) \, dh = \mathbb{E}_{p(h|\theta_L)}[g(h)].

The ratio of the two densities, p(h|θL)/p(h|θℓ), is called the importance weight for trajectory h.

This importance sampling technique can be employed in RWR to obtain a consistent estimator θ̂^EIW_{L+1} = (μ̂^{EIW⊤}_{L+1}, σ̂^EIW_{L+1})⊤, where

\hat\mu^{\mathrm{EIW}}_{L+1} = \left( \sum_{\ell=1}^L \sum_{n=1}^N R(h^{\pi_\ell}_n) w^{(L,\ell)}(h^{\pi_\ell}_n) \sum_{t=1}^T \phi(s^{\pi_\ell}_{t,n}) \phi(s^{\pi_\ell}_{t,n})^\top \right)^{-1} \left( \sum_{\ell=1}^L \sum_{n=1}^N R(h^{\pi_\ell}_n) w^{(L,\ell)}(h^{\pi_\ell}_n) \sum_{t=1}^T a^{\pi_\ell}_{t,n} \phi(s^{\pi_\ell}_{t,n}) \right),

(\hat\sigma^{\mathrm{EIW}}_{L+1})^2 = \left( \sum_{\ell=1}^L \sum_{n=1}^N R(h^{\pi_\ell}_n) w^{(L,\ell)}(h^{\pi_\ell}_n) \right)^{-1} \left( \sum_{\ell=1}^L \sum_{n=1}^N R(h^{\pi_\ell}_n) w^{(L,\ell)}(h^{\pi_\ell}_n) \, \frac{1}{T} \sum_{t=1}^T \left( a^{\pi_\ell}_{t,n} - \hat\mu^{\mathrm{EIW}\top}_{L+1} \phi(s^{\pi_\ell}_{t,n}) \right)^2 \right).

Here, w^{(L,ℓ)}(h) denotes the importance weight defined by

w^{(L,\ell)}(h) = \frac{p(h|\theta_L)}{p(h|\theta_\ell)}.

The superscript "EIW" stands for "episodic importance weight."

p(h|θL) and p(h|θℓ) denote the probability densities of observing trajectory

h = [s_1, a_1, \ldots, s_T, a_T, s_{T+1}]

under policy parameters θL and θℓ, which can be explicitly written as

p(h|\theta_L) = p(s_1) \prod_{t=1}^T p(s_{t+1}|s_t, a_t) \, \pi(a_t|s_t, \theta_L),
p(h|\theta_\ell) = p(s_1) \prod_{t=1}^T p(s_{t+1}|s_t, a_t) \, \pi(a_t|s_t, \theta_\ell).

The two probability densities p(h|θL) and p(h|θℓ) both contain the unknown probability densities p(s_1) and {p(s_{t+1}|s_t, a_t)}_{t=1}^T. However, since these cancel out in the importance weight, it can be computed without the knowledge of p(s) and p(s'|s, a) as

w^{(L,\ell)}(h) = \frac{\prod_{t=1}^T \pi(a_t|s_t, \theta_L)}{\prod_{t=1}^T \pi(a_t|s_t, \theta_\ell)}.

Although the importance-weighted estimator θ̂^EIW_{L+1} is guaranteed to be consistent, it tends to have large variance (Shimodaira, 2000; Sugiyama & Kawanabe, 2012). Therefore, the importance-weighted estimator tends to be unstable when the number of episodes N is rather small.
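The cancellation of the transition densities means that the episodic importance weight only requires evaluating the policy densities along a trajectory. The following Python sketch illustrates this for the Gaussian policy; the function signature and the representation of a trajectory as a list of (s, a) pairs are assumptions for illustration.

import numpy as np

def gaussian_policy_logpdf(a, s, mu, sigma, phi):
    # Log density of the Gaussian policy N(a; mu^T phi(s), sigma^2).
    m = mu @ phi(s)
    return -0.5 * np.log(2 * np.pi * sigma ** 2) - (a - m) ** 2 / (2 * sigma ** 2)

def episodic_importance_weight(traj, theta_L, theta_ell, phi):
    # Episodic importance weight w^(L,l)(h): the transition densities cancel,
    # so only the policy densities along the trajectory are needed.
    mu_L, sig_L = theta_L
    mu_l, sig_l = theta_ell
    log_w = 0.0
    for s, a in traj:
        log_w += gaussian_policy_logpdf(a, s, mu_L, sig_L, phi)
        log_w -= gaussian_policy_logpdf(a, s, mu_l, sig_l, phi)
    return np.exp(log_w)   # accumulated in log space for numerical stability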
8.2.2 Per-Decision Importance Weighting

Since the reward at the t-th step does not depend on future state-action transitions after the t-th step, an episodic importance weight can be decomposed into stepwise importance weights (Precup et al., 2000). For instance, the expected return J(θL) can be expressed as

J(\theta_L) = \int R(h) p(h|\theta_L) \, dh = \int \sum_{t=1}^T \gamma^{t-1} r(s_t, a_t, s_{t+1}) \, w^{(L,\ell)}(h) \, p(h|\theta_\ell) \, dh = \int \sum_{t=1}^T \gamma^{t-1} r(s_t, a_t, s_{t+1}) \, w^{(L,\ell)}_t(h) \, p(h|\theta_\ell) \, dh,

where w^{(L,ℓ)}_t(h) is the t-step importance weight, called the per-decision importance weight (PIW), defined as

w^{(L,\ell)}_t(h) = \frac{\prod_{t'=1}^t \pi(a_{t'}|s_{t'}, \theta_L)}{\prod_{t'=1}^t \pi(a_{t'}|s_{t'}, \theta_\ell)}.

Here, the PIW idea is applied to RWR and a more stable algorithm is developed. A slight complication is that the policy update formulas given in Section 8.2.1 contain double sums over T steps, e.g.,

R(h) \sum_{t'=1}^T \phi(s_{t'})\phi(s_{t'})^\top = \sum_{t,t'=1}^T \gamma^{t-1} r(s_t, a_t, s_{t+1}) \, \phi(s_{t'})\phi(s_{t'})^\top.

In this case, the summand

\gamma^{t-1} r(s_t, a_t, s_{t+1}) \, \phi(s_{t'})\phi(s_{t'})^\top

does not depend on future state-action pairs after the max(t, t')-th step. Thus, the episodic importance weight for this summand can be simplified to the per-decision importance weight w^{(L,ℓ)}_{max(t,t')}. Consequently, the PIW-based policy update rules are given as

\hat\mu^{\mathrm{PIW}}_{L+1} = \left( \sum_{\ell=1}^L \sum_{n=1}^N \sum_{t,t'=1}^T \gamma^{t-1} r_{t,n} \, \phi(s^{\pi_\ell}_{t',n}) \phi(s^{\pi_\ell}_{t',n})^\top \, w^{(L,\ell)}_{\max(t,t')}(h^{\pi_\ell}_n) \right)^{-1} \left( \sum_{\ell=1}^L \sum_{n=1}^N \sum_{t,t'=1}^T \gamma^{t-1} r_{t,n} \, a^{\pi_\ell}_{t',n} \phi(s^{\pi_\ell}_{t',n}) \, w^{(L,\ell)}_{\max(t,t')}(h^{\pi_\ell}_n) \right),

(\hat\sigma^{\mathrm{PIW}}_{L+1})^2 = \left( \sum_{\ell=1}^L \sum_{n=1}^N \sum_{t=1}^T \gamma^{t-1} r_{t,n} \, w^{(L,\ell)}_t(h^{\pi_\ell}_n) \right)^{-1} \left( \sum_{\ell=1}^L \sum_{n=1}^N \frac{1}{T} \sum_{t,t'=1}^T \gamma^{t-1} r_{t,n} \left( a^{\pi_\ell}_{t',n} - \hat\mu^{\mathrm{PIW}\top}_{L+1} \phi(s^{\pi_\ell}_{t',n}) \right)^2 w^{(L,\ell)}_{\max(t,t')}(h^{\pi_\ell}_n) \right),

where

r_{t,n} = r(s_{t,n}, a_{t,n}, s_{t+1,n}).

This PIW estimator θ̂^PIW_{L+1} = (μ̂^{PIW⊤}_{L+1}, σ̂^PIW_{L+1})⊤ is consistent and potentially more stable than the plain EIW estimator θ̂^EIW_{L+1}.
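The per-decision weights are simply the cumulative products of stepwise policy-density ratios, so they can be computed in one pass over a trajectory. A minimal sketch follows, reusing gaussian_policy_logpdf from the previous sketch; as before, the trajectory representation is an illustrative assumption.

import numpy as np

def per_decision_weights(traj, theta_L, theta_ell, phi):
    # Per-decision importance weights w^(L,l)_t(h) for t = 1..T; w_t only
    # involves actions up to the t-th step, and weights[-1] equals the
    # episodic weight.
    mu_L, sig_L = theta_L
    mu_l, sig_l = theta_ell
    log_cum = 0.0
    weights = []
    for s, a in traj:
        log_cum += gaussian_policy_logpdf(a, s, mu_L, sig_L, phi)
        log_cum -= gaussian_policy_logpdf(a, s, mu_l, sig_l, phi)
        weights.append(np.exp(log_cum))
    return np.array(weights)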
8.2.3 Adaptive Per-Decision Importance Weighting

To more actively control the stability of the PIW estimator, the adaptive per-decision importance weight (AIW) is employed. More specifically, an importance weight w^{(L,ℓ)}_{max(t,t')}(h) is "flattened" by a flattening parameter ν ∈ [0, 1] as (w^{(L,ℓ)}_{max(t,t')}(h))^ν, i.e., the ν-th power of the per-decision importance weight. Then we have θ̂^AIW_{L+1} = (μ̂^{AIW⊤}_{L+1}, σ̂^AIW_{L+1})⊤, where

\hat\mu^{\mathrm{AIW}}_{L+1} = \left( \sum_{\ell=1}^L \sum_{n=1}^N \sum_{t,t'=1}^T \gamma^{t-1} r_{t,n} \, \phi(s^{\pi_\ell}_{t',n}) \phi(s^{\pi_\ell}_{t',n})^\top \left( w^{(L,\ell)}_{\max(t,t')}(h^{\pi_\ell}_n) \right)^{\nu} \right)^{-1} \left( \sum_{\ell=1}^L \sum_{n=1}^N \sum_{t,t'=1}^T \gamma^{t-1} r_{t,n} \, a^{\pi_\ell}_{t',n} \phi(s^{\pi_\ell}_{t',n}) \left( w^{(L,\ell)}_{\max(t,t')}(h^{\pi_\ell}_n) \right)^{\nu} \right),

(\hat\sigma^{\mathrm{AIW}}_{L+1})^2 = \left( \sum_{\ell=1}^L \sum_{n=1}^N \sum_{t=1}^T \gamma^{t-1} r_{t,n} \left( w^{(L,\ell)}_t(h^{\pi_\ell}_n) \right)^{\nu} \right)^{-1} \left( \sum_{\ell=1}^L \sum_{n=1}^N \frac{1}{T} \sum_{t,t'=1}^T \gamma^{t-1} r_{t,n} \left( a^{\pi_\ell}_{t',n} - \hat\mu^{\mathrm{AIW}\top}_{L+1} \phi(s^{\pi_\ell}_{t',n}) \right)^2 \left( w^{(L,\ell)}_{\max(t,t')}(h^{\pi_\ell}_n) \right)^{\nu} \right).

When ν = 0, AIW is reduced to NIW. Therefore, it is relatively stable, but not consistent. On the other hand, when ν = 1, AIW is reduced to PIW. Therefore, it is consistent, but rather unstable. In practice, an intermediate ν often produces a better estimator. Note that the value of the flattening parameter can be different in each iteration, i.e., ν may be replaced by νℓ. However, for simplicity, a single common value ν is considered here.
8.2.4 Automatic Selection of Flattening Parameter

The flattening parameter allows us to control the trade-off between consistency and stability. Here, we show how the value of the flattening parameter can be optimally chosen using data samples.

The goal of policy search is to find the optimal policy that maximizes the expected return J(θ). Therefore, the optimal flattening parameter value ν*_L at the L-th iteration is given by

\nu^*_L = \mathop{\mathrm{argmax}}_{\nu} J(\hat\theta^{\mathrm{AIW}}_{L+1}(\nu)).

Directly obtaining ν*_L requires the computation of the expected return J(θ̂^AIW_{L+1}(ν)) for each candidate of ν. To this end, data samples following π(a|s; θ̂^AIW_{L+1}(ν)) are needed for each ν, which is prohibitively expensive. To reuse samples generated by previous policies, a variation of cross-validation called importance-weighted cross-validation (IWCV) (Sugiyama et al., 2007) is employed.

The basic idea of IWCV is to split the training dataset H^{π_{1:L}} = {H^{πℓ}}_{ℓ=1}^L into an "estimation part" and a "validation part." Then the policy parameter θ̂^AIW_{L+1}(ν) is learned from the estimation part and its expected return J(θ̂^AIW_{L+1}(ν)) is approximated using the importance-weighted loss for the validation part. As pointed out in Section 8.2.1, importance weighting tends to be unstable when the number N of episodes is small. For this reason, per-decision importance weighting is used for cross-validation. Below, how IWCV is applied to the selection of the flattening parameter ν in the current context is explained in more detail.

Let us divide the training dataset H^{π_{1:L}} = {H^{πℓ}}_{ℓ=1}^L into K disjoint subsets {H^{π_{1:L}}_k}_{k=1}^K of the same size, where each H^{π_{1:L}}_k contains N/K episodic samples from every H^{πℓ}. For simplicity, we assume that N is divisible by K, i.e., N/K is an integer. K = 5 will be used in the experiments later.

Let θ̂^AIW_{L+1,k}(ν) be the policy parameter learned from {H^{π_{1:L}}_{k'}}_{k'≠k} (i.e., all data without H^{π_{1:L}}_k) by AIW estimation. The expected return of θ̂^AIW_{L+1,k}(ν) is estimated using the PIW estimator from H^{π_{1:L}}_k as

\hat{J}^{k}_{\mathrm{IWCV}}(\hat\theta^{\mathrm{AIW}}_{L+1,k}(\nu)) = \frac{1}{\eta} \sum_{h \in H^{\pi_{1:L}}_k} \sum_{t=1}^T \gamma^{t-1} r(s_t, a_t, s_{t+1}) \, w^{(L,\ell)}_t(h),

where η is a normalization constant. An ordinary choice is η = LN/K, but a more stable variant given by

\eta = \sum_{h \in H^{\pi_{1:L}}_k} w^{(L,\ell)}_t(h)

is often preferred in practice (Precup et al., 2000).

The above procedure is repeated for all k = 1, ..., K, and the average score,

\hat{J}_{\mathrm{IWCV}}(\hat\theta^{\mathrm{AIW}}_{L+1}(\nu)) = \frac{1}{K} \sum_{k=1}^K \hat{J}^{k}_{\mathrm{IWCV}}(\hat\theta^{\mathrm{AIW}}_{L+1,k}(\nu)),

is computed. This is the K-fold IWCV estimator of J(θ̂^AIW_{L+1}(ν)), which was shown to be almost unbiased (Sugiyama et al., 2007).

This K-fold IWCV score is computed for each candidate value of the flattening parameter ν, and the one that maximizes the IWCV score is chosen:

\hat\nu_{\mathrm{IWCV}} = \mathop{\mathrm{argmax}}_{\nu} \hat{J}_{\mathrm{IWCV}}(\hat\theta^{\mathrm{AIW}}_{L+1}(\nu)).

This IWCV scheme can also be used for choosing the basis functions φ(s) in the Gaussian policy model.

Note that when the importance weights w^{(L,ℓ)}_{max(t,t')} are all one (i.e., no importance weighting), the above IWCV procedure is reduced to the ordinary CV procedure. The use of IWCV is essential here since the target policy π(a|s, θ̂^AIW_{L+1}(ν)) is usually different from the previous policies used for collecting the data samples H^{π_{1:L}}. Therefore, the expected return estimated using ordinary CV, Ĵ_CV(θ̂^AIW_{L+1}(ν)), would be heavily biased.
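The overall selection procedure can be summarized in a short Python sketch. The helper names fit_aiw and piw_return are placeholders standing for "learn the AIW policy parameter on the training folds" and "evaluate the per-decision importance-weighted return on the held-out fold"; they are assumptions made for this sketch, not functions defined in the book.

import numpy as np

def kfold_iwcv_score(datasets, fit_aiw, piw_return, nu, K=5):
    # K-fold IWCV score for one candidate flattening parameter nu.
    folds = np.array_split(np.arange(len(datasets)), K)
    scores = []
    for k in range(K):
        val_idx = set(folds[k].tolist())
        train = [d for i, d in enumerate(datasets) if i not in val_idx]
        val = [d for i, d in enumerate(datasets) if i in val_idx]
        policy = fit_aiw(train, nu)              # learn on K-1 folds
        scores.append(piw_return(policy, val))   # evaluate on held-out fold
    return np.mean(scores)

# The flattening parameter is then the candidate maximizing the IWCV score:
# nu_hat = max(candidates, key=lambda nu: kfold_iwcv_score(data, fit_aiw, piw_return, nu))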
8.2.5 Reward-Weighted Regression with Sample Reuse

So far, we have introduced AIW to control the stability of the policy-parameter update and IWCV to automatically choose the flattening parameter based on the estimated expected return. The policy search algorithm that combines these two methods is called reward-weighted regression with sample reuse (RRR).

In each iteration (L = 1, 2, ...) of RRR, episodic data samples H^{π_L} are collected following the current policy π(a|s, θ^AIW_L), the flattening parameter ν is chosen so as to maximize the expected return Ĵ_IWCV(ν) estimated by IWCV using {H^{πℓ}}_{ℓ=1}^L, and then the policy parameter is updated to θ^AIW_{L+1} using {H^{πℓ}}_{ℓ=1}^L.
FIGURE 8.2: Ball balancing using a robot arm simulator. Two joints of the robot (the wrist and the elbow) are controlled to keep the ball in the middle of the tray.
8.3 Numerical Examples

The performance of RRR is experimentally evaluated on a ball-balancing task using a robot arm simulator (Schaal, 2009).

As illustrated in Figure 8.2, a 7-degree-of-freedom arm is mounted on the ceiling upside down, and is equipped with a circular tray of radius 0.24 [m] at the end effector. The goal is to control the joints of the robot so that the ball is brought to the middle of the tray. However, the difficulty is that the angle of the tray cannot be controlled directly, which is a typical restriction in real-world joint-motion planning based on feedback from the environment (e.g., the state of the ball).

To simplify the problem, only two joints are controlled here: the wrist angle α_roll and the elbow angle α_pitch. All the remaining joints are fixed. Control of the wrist and elbow angles would roughly correspond to changing the roll and pitch angles of the tray, but not directly.

Two separate control subsystems are designed here, each of which is in charge of controlling the roll or the pitch angle. Each subsystem has its own policy parameter θ, state space S, and action space A. The state space S is continuous and consists of (x, ẋ), where x [m] is the position of the ball on the tray along each axis and ẋ [m/s] is the velocity of the ball. The action space A is continuous and corresponds to the target angle a [rad] of the joint. The reward function is defined as

r(s, a, s') = \exp\left( -\frac{5(x')^2 + (\dot{x}')^2 + a^2}{2(0.24/2)^2} \right),

where the number 0.24 in the denominator comes from the radius of the tray. Below, how the control system is designed is explained in more detail.
FIGURE 8.3: The block diagram of the robot-arm control system for ball balancing. The control system has two feedback loops, i.e., joint-trajectory planning by RRR and trajectory tracking by a high-gain proportional-derivative (PD) controller.

As illustrated in Figure 8.3, the control system has two feedback loops for trajectory planning using an RRR controller and trajectory tracking using a high-gain proportional-derivative (PD) controller (Siciliano & Khatib, 2008).

The RRR controller outputs the target joint angle obtained by the current policy every 0.2 [s]. Nine Gaussian kernels are used as basis functions φ(s), with the kernel centers {c_b}_{b=1}^9 located over the state space at

(x, ẋ) ∈ {(−0.2, −0.4), (−0.2, 0), (−0.1, 0.4), (0, −0.4), (0, 0), (0, 0.4), (0.1, −0.4), (0.2, 0), (0.2, 0.4)}.

The Gaussian width is set at σ_basis = 0.1. Based on the discrete-time target angles obtained by RRR, the desired joint trajectory in the continuous time domain is linearly interpolated as

a_{t,u} = a_t + u \, \dot{a}_t,

where u is the time from the last output a_t of RRR at the t-th step. ȧ_t is the angular velocity computed by

\dot{a}_t = \frac{a_t - a_{t-1}}{0.2},

where a_0 is the initial angle of a joint. The angular velocity is assumed to be constant during the 0.2 [s] cycle of trajectory planning.

On the other hand, the PD controller converts desired joint trajectories into motor torques as

\tau_{t,u} = \mu_p * (a_{t,u} - \alpha_{t,u}) + \mu_d * (\dot{a}_t - \dot{\alpha}_{t,u}),

where τ is the 2-dimensional vector consisting of the torques applied to the wrist and elbow joints. a = (a_pitch, a_roll)⊤ and ȧ = (ȧ_pitch, ȧ_roll)⊤ are the 2-dimensional vectors consisting of the desired angles and velocities. α = (α_pitch, α_roll)⊤ and α̇ = (α̇_pitch, α̇_roll)⊤ are the 2-dimensional vectors consisting of the current joint angles and velocities. μ_p and μ_d are the 2-dimensional vectors consisting of the proportional and derivative gains. "*" denotes the element-wise product. Since the control cycle of the robot arm is 0.002 [s], the PD controller is applied 100 times (i.e., u = 0.002, 0.004, ..., 0.198, 0.2) in each RRR cycle.
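The two-rate structure (0.2 [s] planning by RRR, 0.002 [s] tracking by the PD controller) can be summarized in the following Python sketch. The function and argument names, including the read_joint_state callback and the unspecified gain vectors mu_p and mu_d, are illustrative assumptions rather than the actual implementation.

import numpy as np

def pd_torque(a_desired, a_dot_desired, alpha, alpha_dot, mu_p, mu_d):
    # tau = mu_p * (a - alpha) + mu_d * (a_dot - alpha_dot), element-wise
    # over the (pitch, roll) joints.
    return mu_p * (a_desired - alpha) + mu_d * (a_dot_desired - alpha_dot)

def track_one_rrr_cycle(a_prev, a_now, read_joint_state, mu_p, mu_d,
                        dt=0.002, cycle=0.2):
    # Linearly interpolate the RRR target angle over one 0.2 [s] cycle and
    # apply the PD controller every 0.002 [s] (100 PD steps per RRR step).
    a_dot = (a_now - a_prev) / cycle          # constant angular velocity
    torques = []
    steps = int(round(cycle / dt))
    for k in range(1, steps + 1):
        u = k * dt
        a_desired = a_now + u * a_dot         # a_{t,u} = a_t + u * a_dot_t
        alpha, alpha_dot = read_joint_state() # hypothetical sensor callback
        torques.append(pd_torque(a_desired, a_dot, alpha, alpha_dot, mu_p, mu_d))
    return torques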
Figure 8.4 depicts a desired trajectory of the wrist joint generated by a random policy and an actual trajectory obtained using the high-gain PD controller described above. The graphs show that the desired trajectory is followed by the robot arm reasonably well.

The policy parameter θ_L is learned through the RRR iterations. The initial policy parameters θ_1 = (μ_1⊤, σ_1)⊤ are set manually as

\mu_1 = (-0.5, -0.5, 0, -0.5, 0, 0, 0, 0, 0)^\top \quad \text{and} \quad \sigma_1 = 0.1,

so that a wide range of states and actions can be safely explored in the first iteration. The initial position of the ball is randomly selected as x ∈ [−0.05, 0.05]. The dataset collected in each iteration consists of 10 episodes with 20 steps. The duration of an episode is 4 [s] and the sampling cycle of RRR is 0.2 [s].

Three scenarios are considered here:

• NIW: Sample reuse with ν = 0.
• PIW: Sample reuse with ν = 1.
• RRR: Sample reuse with ν chosen by IWCV from {0, 0.25, 0.5, 0.75, 1} in each iteration.

The discount factor is set at γ = 0.99. Figure 8.5 depicts the averaged expected return over 10 trials as a function of the number of policy update iterations. The expected return in each trial is computed from 20 test episodic samples that have not been used for training. The graph shows that RRR nicely improves the performance over iterations. On the other hand, the performance for ν = 0 is saturated after the 3rd iteration, and the performance for ν = 1 is improved in the beginning but suddenly goes down at the 5th iteration. The result for ν = 1 indicates that a large change in policies causes severe instability in sample reuse.

Figure 8.6 and Figure 8.7 depict examples of trajectories of the wrist angle α_roll, the elbow angle α_pitch, the resulting ball movement x, and the reward r for policies obtained by NIW (ν = 0) and RRR (ν chosen by IWCV) after the 10th iteration. With the policy obtained by NIW, the ball goes through the middle of the tray, i.e., (x_roll, x_pitch) = (0, 0), and does not stop. On the other hand, the policy obtained by RRR successfully guides the ball to the middle of the tray along the roll axis, although the movement along the pitch axis looks similar to that by NIW. Motion examples by RRR with ν chosen by IWCV are illustrated in Figure 8.8.
FIGURE 8.4: An example of desired and actual trajectories of the wrist joint in the realistic ball-balancing task. (a) Trajectory in angles. (b) Trajectory in angular velocities. The target joint angle is determined by a random policy every 0.2 [s], and then a linearly interpolated angle and constant velocity are tracked using the proportional-derivative (PD) controller in the cycle of 0.002 [s].
FIGURE 8.5: The performance of learned policies when ν = 0 (NIW), ν = 1 (PIW), and ν is chosen by IWCV (RRR) in ball balancing using a simulated robot-arm system. The performance is measured by the return averaged over 10 trials. The symbol indicates that the method is the best or comparable to the best one in terms of the expected return by the t-test at the significance level 5%, performed at each iteration. The error bars indicate 1/10 of a standard deviation.
FIGURE 8.6: Typical examples of trajectories of the wrist angle α_roll, the elbow angle α_pitch, the resulting ball movement x, and the reward r for policies obtained by NIW (ν = 0) at the 10th iteration in the ball-balancing task.
FIGURE 8.7: Typical examples of trajectories of the wrist angle α_roll, the elbow angle α_pitch, the resulting ball movement x, and the reward r for policies obtained by RRR (ν is chosen by IWCV) at the 10th iteration in the ball-balancing task.
FIGURE 8.8: Motion examples of ball balancing by RRR (from left to right and top to bottom).
8.4 Remarks

A direct policy search algorithm based on expectation-maximization (EM) iteratively maximizes the lower bound of the expected return. The EM-based approach does not include the step size parameter, which is an advantage over the gradient-based approach introduced in Chapter 7. A sample-reuse variant of the EM-based method was also provided, which contributes to improving the stability of the algorithm in small-sample scenarios.

In practice, however, the EM-based approach is still rather unstable even if it is combined with the sample-reuse technique. In Chapter 9, another policy search approach will be introduced to further improve the stability of policy updates.
Chapter 9
Policy-Prior Search

The direct policy search methods explained in Chapter 7 and Chapter 8 are useful in solving problems with continuous actions such as robot control. However, they tend to suffer from instability of policy update. In this chapter, we introduce an alternative policy search method called policy-prior search, which is adopted in the PGPE (policy gradients with parameter-based exploration) method (Sehnke et al., 2010). The basic idea is to use deterministic policies to remove excessive randomness and to introduce useful stochasticity by considering a prior distribution over policy parameters.

After formulating the problem of policy-prior search in Section 9.1, a gradient-based algorithm is introduced in Section 9.2, including its improvement using baseline subtraction, theoretical analysis, and experimental evaluation. Then, in Section 9.3, a sample-reuse variant is described and its performance is theoretically analyzed and experimentally investigated using a humanoid robot. Finally, this chapter is concluded in Section 9.4.
9.1 Formulation

In this section, the policy search problem is formulated based on policy priors.

The basic idea is to use a deterministic policy and introduce stochasticity by drawing policy parameters from a prior distribution. More specifically, policy parameters are randomly determined following the prior distribution at the beginning of each trajectory, and thereafter action selection is deterministic (Figure 9.1). Note that transitions are generally stochastic, and thus trajectories are also stochastic even though the policy is deterministic. Thanks to this per-trajectory formulation, the variance of gradient estimators in policy-prior search does not increase with respect to the trajectory length, which allows us to overcome the critical drawback of direct policy search.

Policy-prior search uses a deterministic policy with typically a linear architecture:

\pi(a|s,\theta) = \delta(a = \theta^\top \phi(s)),

where δ(·) is the Dirac delta function and φ(s) is the basis function. The policy parameter θ is drawn from a prior distribution p(θ|ρ) with hyper-parameter ρ.

FIGURE 9.1: Illustration of (a) the stochastic policy and (b) the deterministic policy with a prior under deterministic transition. The number of possible trajectories is exponential with respect to the trajectory length when stochastic policies are used, while it does not grow when deterministic policies drawn from a prior distribution are used.

The expected return in policy-prior search is defined in terms of the expectations over both trajectory h and policy parameter θ as a function of the hyper-parameter ρ:

J(\rho) = \mathbb{E}_{p(h|\theta)p(\theta|\rho)}[R(h)] = \iint p(h|\theta) p(\theta|\rho) R(h) \, dh \, d\theta,

where E_{p(h|θ)p(θ|ρ)} denotes the expectation over trajectory h and policy parameter θ drawn from p(h|θ)p(θ|ρ). In policy-prior search, the hyper-parameter ρ is optimized so that the expected return J(ρ) is maximized. Thus, the optimal hyper-parameter ρ* is given by

\rho^* = \mathop{\mathrm{argmax}}_{\rho} J(\rho).
9.2 Policy Gradients with Parameter-Based Exploration

In this section, a gradient-based algorithm for policy-prior search is given.

9.2.1 Policy-Prior Gradient Ascent

Here, a gradient method is used to find a local maximizer of the expected return J with respect to the hyper-parameter ρ:

\rho \longleftarrow \rho + \varepsilon \nabla_\rho J(\rho),

where ε is a small positive constant and ∇_ρ J(ρ) is the derivative of J with respect to ρ:

\nabla_\rho J(\rho) = \iint p(h|\theta) \nabla_\rho p(\theta|\rho) R(h) \, dh \, d\theta = \iint p(h|\theta) p(\theta|\rho) \nabla_\rho \log p(\theta|\rho) R(h) \, dh \, d\theta = \mathbb{E}_{p(h|\theta)p(\theta|\rho)}[\nabla_\rho \log p(\theta|\rho) R(h)],

where the logarithmic derivative,

\nabla_\rho \log p(\theta|\rho) = \frac{\nabla_\rho p(\theta|\rho)}{p(\theta|\rho)},

was used in the derivation. The expectations over h and θ are approximated by the empirical averages:

\widehat{\nabla}_\rho J(\rho) = \frac{1}{N} \sum_{n=1}^N \nabla_\rho \log p(\theta_n|\rho) R(h_n),   (9.1)

where each trajectory sample h_n is drawn independently from p(h|θ_n) and the parameter θ_n is drawn from p(θ|ρ). Thus, in policy-prior search, samples are pairs of θ and h:

H = \{ (\theta_1, h_1), \ldots, (\theta_N, h_N) \}.

As the prior distribution for the policy parameter θ = (θ_1, ..., θ_B)⊤, where B is the dimensionality of the basis vector φ(s), the independent Gaussian distribution is a standard choice. For this Gaussian prior, the hyper-parameter ρ consists of prior means η = (η_1, ..., η_B)⊤ and prior standard deviations τ = (τ_1, ..., τ_B)⊤:

p(\theta|\eta,\tau) = \prod_{b=1}^B \frac{1}{\tau_b \sqrt{2\pi}} \exp\left( -\frac{(\theta_b - \eta_b)^2}{2\tau_b^2} \right).   (9.2)

Then the derivatives of the log-prior log p(θ|η, τ) with respect to η_b and τ_b are given as

\nabla_{\eta_b} \log p(\theta|\eta,\tau) = \frac{\theta_b - \eta_b}{\tau_b^2},
\nabla_{\tau_b} \log p(\theta|\eta,\tau) = \frac{(\theta_b - \eta_b)^2 - \tau_b^2}{\tau_b^3}.

By substituting these derivatives into Eq. (9.1), the policy-prior gradients with respect to η and τ can be approximated.
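Putting Eq. (9.1) together with the Gaussian-prior derivatives gives a very compact estimator, as in the following Python sketch. The array-based interface (thetas is an (N, B) array of sampled policy parameters, returns is (N,)) and the fixed step size are assumptions made for illustration.

import numpy as np

def pgpe_gradient(thetas, returns, eta, tau):
    # Empirical PGPE gradient of J with respect to the prior means eta and
    # standard deviations tau, via Eq. (9.1) and the log-prior derivatives.
    diff = thetas - eta                                            # (N, B)
    grad_eta = (returns[:, None] * diff / tau ** 2).mean(axis=0)
    grad_tau = (returns[:, None] * (diff ** 2 - tau ** 2) / tau ** 3).mean(axis=0)
    return grad_eta, grad_tau

def pgpe_update(eta, tau, thetas, returns, step=0.1):
    # One gradient-ascent step on the hyper-parameters (eta, tau).
    g_eta, g_tau = pgpe_gradient(thetas, returns, eta, tau)
    return eta + step * g_eta, tau + step * g_tau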
9.2.2 Baseline Subtraction for Variance Reduction

As explained in Section 7.2.2, subtraction of a baseline can reduce the variance of gradient estimators. Here, a baseline subtraction method for policy-prior search is described.

For a baseline ξ, a modified gradient estimator is given by

\widehat{\nabla}_\rho J_\xi(\rho) = \frac{1}{N} \sum_{n=1}^N (R(h_n) - \xi) \nabla_\rho \log p(\theta_n|\rho).

Let ξ* be the optimal baseline that minimizes the variance of the gradient:

\xi^* = \mathop{\mathrm{argmin}}_{\xi} \mathrm{Var}_{p(h|\theta)p(\theta|\rho)}[\widehat{\nabla}_\rho J_\xi(\rho)],

where Var_{p(h|θ)p(θ|ρ)} denotes the trace of the covariance matrix:

\mathrm{Var}_{p(h|\theta)p(\theta|\rho)}[\zeta] = \mathrm{tr}\left( \mathbb{E}_{p(h|\theta)p(\theta|\rho)}\left[ (\zeta - \mathbb{E}_{p(h|\theta)p(\theta|\rho)}[\zeta]) (\zeta - \mathbb{E}_{p(h|\theta)p(\theta|\rho)}[\zeta])^\top \right] \right) = \mathbb{E}_{p(h|\theta)p(\theta|\rho)}\left[ \| \zeta - \mathbb{E}_{p(h|\theta)p(\theta|\rho)}[\zeta] \|^2 \right].

It was shown in Zhao et al. (2012) that the optimal baseline for policy-prior search is given by

\xi^* = \frac{\mathbb{E}_{p(h|\theta)p(\theta|\rho)}\left[ R(h) \|\nabla_\rho \log p(\theta|\rho)\|^2 \right]}{\mathbb{E}_{p(\theta|\rho)}\left[ \|\nabla_\rho \log p(\theta|\rho)\|^2 \right]},

where E_{p(θ|ρ)} denotes the expectation over the policy parameter θ drawn from p(θ|ρ). In practice, the expectations are approximated by the sample averages.
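A sample-based estimate of the optimal baseline follows directly from this formula; the sketch below, with the same illustrative array interface as before, weights the returns by the squared norm of the log-prior gradient over both the η- and τ-components.

import numpy as np

def pgpe_optimal_baseline(thetas, returns, eta, tau):
    # Sample approximation of xi*: returns weighted by ||grad log p(theta|rho)||^2.
    diff = thetas - eta
    grad_eta = diff / tau ** 2
    grad_tau = (diff ** 2 - tau ** 2) / tau ** 3
    sq_norm = (grad_eta ** 2).sum(axis=1) + (grad_tau ** 2).sum(axis=1)  # (N,)
    return (returns * sq_norm).mean() / sq_norm.mean()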
9.2.3 Variance Analysis of Gradient Estimators

Here the variance of gradient estimators is theoretically investigated for the independent Gaussian prior (9.2) with φ(s) = s. See Zhao et al. (2012) for technical details.

Below, subsets of the following assumptions are considered (which are the same as the ones used in Section 7.2.3):

Assumption (A): r(s, a, s') ∈ [−β, β] for β > 0.
Assumption (B): r(s, a, s') ∈ [α, β] for 0 < α < β.
Assumption (C): For δ > 0, there exist two series {c_t}_{t=1}^T and {d_t}_{t=1}^T such that ‖s_t‖ ≥ c_t and ‖s_t‖ ≤ d_t hold with probability at least 1 − δ/(2N), respectively, over the choice of sample paths.

Note that Assumption (B) is stronger than Assumption (A).

Let

G = \sum_{b=1}^B \tau_b^{-2}.

First, the variance of gradient estimators in policy-prior search is analyzed:

Theorem 9.1 Under Assumption (A), the following upper bounds hold:

\mathrm{Var}_{p(h|\theta)p(\theta|\rho)}\left[ \widehat{\nabla}_\eta J(\eta,\tau) \right] \le \frac{\beta^2 (1-\gamma^T)^2 G}{N(1-\gamma)^2} \le \frac{\beta^2 G}{N(1-\gamma)^2},

\mathrm{Var}_{p(h|\theta)p(\theta|\rho)}\left[ \widehat{\nabla}_\tau J(\eta,\tau) \right] \le \frac{2\beta^2 (1-\gamma^T)^2 G}{N(1-\gamma)^2} \le \frac{2\beta^2 G}{N(1-\gamma)^2}.

The second upper bounds are independent of the trajectory length T, while the upper bounds for direct policy search (Theorem 7.1 in Section 7.2.3) are monotone increasing with respect to the trajectory length T. Thus, gradient estimation in policy-prior search is expected to be more reliable than that in direct policy search when the trajectory length T is large.

The following theorem more explicitly compares the variance of gradient estimators in direct policy search and policy-prior search:

Theorem 9.2 In addition to Assumptions (B) and (C), assume that

\zeta(T) = C_T \alpha^2 - D_T \beta^2/(2\pi)

is positive and monotone increasing with respect to T, where

C_T = \sum_{t=1}^T c_t^2 \quad \text{and} \quad D_T = \sum_{t=1}^T d_t^2.

If there exists T_0 such that ζ(T_0) ≥ β²Gσ², then it holds that

\mathrm{Var}_{p(h|\theta)p(\theta|\rho)}[\widehat{\nabla}_\mu J(\theta)] > \mathrm{Var}_{p(h|\theta)p(\theta|\rho)}[\widehat{\nabla}_\eta J(\eta,\tau)]

for all T > T_0, with probability at least 1 − δ.

The above theorem means that policy-prior search is more favorable than direct policy search in terms of the variance of gradient estimators of the mean, if the trajectory length T is large.

Next, the contribution of the optimal baseline to the variance of the gradient estimator with respect to the mean parameter η is investigated. It was shown in Zhao et al. (2012) that the excess variance for a baseline ξ is given by

\mathrm{Var}_{p(h|\theta)p(\theta|\rho)}[\widehat{\nabla}_\rho J_\xi(\rho)] - \mathrm{Var}_{p(h|\theta)p(\theta|\rho)}[\widehat{\nabla}_\rho J_{\xi^*}(\rho)] = \frac{(\xi - \xi^*)^2}{N} \, \mathbb{E}_{p(h|\theta)p(\theta|\rho)}\left[ \|\nabla_\rho \log p(\theta|\rho)\|^2 \right].

Based on this expression, the following theorem holds.

Theorem 9.3 If r(s, a, s') ≥ α > 0, the following lower bound holds:

\mathrm{Var}_{p(h|\theta)p(\theta|\rho)}[\widehat{\nabla}_\eta J(\eta,\tau)] - \mathrm{Var}_{p(h|\theta)p(\theta|\rho)}[\widehat{\nabla}_\eta J_{\xi^*}(\eta,\tau)] \ge \frac{\alpha^2 (1-\gamma^T)^2 G}{N(1-\gamma)^2}.

Under Assumption (A), the following upper bound holds:

\mathrm{Var}_{p(h|\theta)p(\theta|\rho)}[\widehat{\nabla}_\eta J(\eta,\tau)] - \mathrm{Var}_{p(h|\theta)p(\theta|\rho)}[\widehat{\nabla}_\eta J_{\xi^*}(\eta,\tau)] \le \frac{\beta^2 (1-\gamma^T)^2 G}{N(1-\gamma)^2}.

The above theorem shows that the lower bound of the excess variance is positive and monotone increasing with respect to the trajectory length T. This means that the variance is always reduced by subtracting the optimal baseline and the amount of variance reduction is monotone increasing with respect to the trajectory length T. Note that the upper bound is also monotone increasing with respect to the trajectory length T.

Finally, the variance of the gradient estimator with the optimal baseline is investigated:

Theorem 9.4 Under Assumptions (B) and (C), the following upper bound holds with probability at least 1 − δ:

\mathrm{Var}_{p(h|\theta)p(\theta|\rho)}[\widehat{\nabla}_\eta J_{\xi^*}(\eta,\tau)] \le \frac{(1-\gamma^T)^2}{N(1-\gamma)^2} (\beta^2 - \alpha^2) G \le \frac{(\beta^2 - \alpha^2) G}{N(1-\gamma)^2}.

The second upper bound is independent of the trajectory length T, while Theorem 7.4 in Section 7.2.3 showed that the upper bound of the variance of gradient estimators with the optimal baseline in direct policy search is monotone increasing with respect to the trajectory length T. Thus, when the trajectory length T is large, policy-prior search is more favorable than direct policy search in terms of the variance of the gradient estimator with respect to the mean even when optimal baseline subtraction is applied.
9.2.4 Numerical Examples

Here, the performance of the direct policy search and policy-prior search algorithms is experimentally compared.

9.2.4.1 Setup

Let the state space S be one-dimensional and continuous, and let the initial state be randomly chosen following the standard normal distribution. The action space A is also set to be one-dimensional and continuous. The transition dynamics of the environment is set at

s_{t+1} = s_t + a_t + \varepsilon,
TABLE 9.1: Variance and bias of estimated parameters.

(a) Trajectory length T = 10

Method          Variance (µ, η)   Variance (σ, τ)   Bias (µ, η)   Bias (σ, τ)
REINFORCE            13.257           26.917          -0.310        -1.510
REINFORCE-OB          0.091            0.120           0.067         0.129
PGPE                  0.971            1.686          -0.069         0.132
PGPE-OB               0.037            0.069          -0.016         0.051

(b) Trajectory length T = 50

Method          Variance (µ, η)   Variance (σ, τ)   Bias (µ, η)   Bias (σ, τ)
REINFORCE           188.386          278.310          -1.813        -5.175
REINFORCE-OB          0.545            0.900          -0.299        -0.201
PGPE                  1.657            3.372          -0.105        -0.329
PGPE-OB               0.085            0.182           0.048        -0.078
where ε ∼ N(0, 0.5²) is stochastic noise and N(μ, σ²) denotes the normal distribution with mean μ and variance σ². The immediate reward is defined as

r = \exp(-s^2/2 - a^2/2) + 1,

which is bounded as 1 < r ≤ 2. The length of the trajectory is set at T = 10 or 50, the discount factor is set at γ = 0.9, and the number of episodic samples is set at N = 100.

9.2.4.2 Variance and Bias

First, the variance and the bias of gradient estimators of the following methods are investigated:

• REINFORCE: REINFORCE (gradient-based direct policy search) without a baseline (Williams, 1992).
• REINFORCE-OB: REINFORCE with optimal baseline subtraction (Peters & Schaal, 2006).
• PGPE: PGPE (gradient-based policy-prior search) without a baseline (Sehnke et al., 2010).
• PGPE-OB: PGPE with optimal baseline subtraction (Zhao et al., 2012).

Table 9.1 summarizes the variance of gradient estimators over 100 runs, showing that the variance of REINFORCE is overall larger than that of PGPE. A notable difference between REINFORCE and PGPE is that the variance of REINFORCE significantly grows as the trajectory length T increases, whereas that of PGPE is not influenced that much by T. This agrees well with the theoretical analyses given in Section 7.2.3 and Section 9.2.3. Optimal baseline subtraction (REINFORCE-OB and PGPE-OB) is shown to contribute highly to reducing the variance, especially when the trajectory length T is large, which also agrees well with the theoretical analysis.

The bias of the gradient estimator of each method is also investigated. Here, gradients estimated with N = 1000 are regarded as true gradients, and the bias of gradient estimators is computed. The results are also included in Table 9.1, showing that introduction of baselines does not increase the bias; rather, it tends to reduce the bias.
9.2.4.3 Variance and Policy Hyper-Parameter Change through Entire Policy-Update Process

Next, the variance of gradient estimators is investigated when policy hyper-parameters are updated over iterations. If the deviation parameter σ takes a negative value during the policy-update process, it is set at 0.05. In this experiment, the variance is computed from 50 runs for T = 20 and N = 10, and policies are updated over 50 iterations. In order to evaluate the variance in a stable manner, the above experiments are repeated 20 times with random choice of the initial mean parameter μ from [−3.0, −0.1], and the average variance of gradient estimators with respect to the mean parameter μ over 20 trials is investigated. The results are plotted in Figure 9.2. Figure 9.2(a) compares the variance of REINFORCE with/without baselines, whereas Figure 9.2(b) compares the variance of PGPE with/without baselines. These graphs show that introduction of baselines contributes highly to the reduction of the variance over iterations.

Let us illustrate how parameters are updated by PGPE-OB over 50 iterations for N = 10 and T = 10. The initial mean parameter is set at η = −1.6, −0.8, or −0.1, and the initial deviation parameter is set at τ = 1. Figure 9.3 depicts the contour of the expected return and illustrates trajectories of parameter updates over iterations by PGPE-OB. In the graph, the maximum of the return surface is located at the middle bottom, and PGPE-OB leads the solutions to a maximum point rapidly.
9.2.4.4 Performance of Learned Policies

Finally, the return obtained by each method is evaluated. The trajectory length is fixed at T = 20, and the maximum number of policy-update iterations is set at 50. Average returns over 20 runs are investigated as functions of the number of episodic samples N. Figure 9.4(a) shows the results when the initial mean parameter μ is chosen randomly from [−1.6, −0.1], which tends to perform well. The graph shows that PGPE-OB performs the best, especially when N < 5; then REINFORCE-OB follows with a small margin.
FIGURE 9.2: Mean and standard error of the variance of gradient estimators with respect to the mean parameter through policy-update iterations. (a) REINFORCE and REINFORCE-OB. (b) PGPE and PGPE-OB.

FIGURE 9.3: Trajectories of policy-prior parameter updates by PGPE (contour of the expected return over the policy-prior mean η and the policy-prior standard deviation τ).
FIGURE 9.4: Average and standard error of returns over 20 runs as functions of the number of episodic samples N. (a) Good initial policy. (b) Poor initial policy.
The plain PGPE also works reasonably well, although it is slightly unstable due to larger variance. The plain REINFORCE is highly unstable, which is caused by the huge variance of gradient estimators (see Figure 9.2 again). Figure 9.4(b) describes the results when the initial mean parameter μ is chosen randomly from [−3.0, −0.1], which tends to result in poorer performance. In this setup, the difference among the compared methods is more significant than in the case with good initial policies, meaning that REINFORCE is sensitive to the choice of initial policies. Overall, the PGPE methods tend to outperform the REINFORCE methods, and among the PGPE methods, PGPE-OB works very well and converges quickly.
9.3 Sample Reuse in Policy-Prior Search

Although PGPE was shown to outperform REINFORCE, its behavior is still rather unstable if the number of data samples used for estimating the gradient is small. In this section, the sample-reuse idea is applied to PGPE. Technically, the original PGPE is categorized as an on-policy algorithm, where data drawn from the current target policy is used to estimate policy-prior gradients. On the other hand, off-policy algorithms are more flexible in the sense that a data-collecting policy and the current target policy can be different. Here, PGPE is extended to the off-policy scenario using the importance-weighting technique.

9.3.1 Importance Weighting

Let us consider an off-policy scenario where a data-collecting policy and the current target policy are different in general. In the context of PGPE, two hyper-parameters are considered: ρ as the target policy to learn and ρ' as a policy for data collection. Let us denote the data samples collected with hyper-parameter ρ' by H':

H' = \{ (\theta'_n, h'_n) \}_{n=1}^{N'} \overset{\mathrm{i.i.d.}}{\sim} p(h|\theta) p(\theta|\rho').

If the data H' is naively used to estimate policy-prior gradients by Eq. (9.1), we suffer an inconsistency problem:

\frac{1}{N'} \sum_{n=1}^{N'} \nabla_\rho \log p(\theta'_n|\rho) R(h'_n) \;\overset{N'\to\infty}{\nrightarrow}\; \nabla_\rho J(\rho),

where

\nabla_\rho J(\rho) = \iint p(h|\theta) p(\theta|\rho) \nabla_\rho \log p(\theta|\rho) R(h) \, dh \, d\theta

is the gradient of the expected return,

J(\rho) = \iint p(h|\theta) p(\theta|\rho) R(h) \, dh \, d\theta,

with respect to the policy hyper-parameter ρ. Below, this naive method is referred to as non-importance-weighted PGPE (NIW-PGPE).

This inconsistency problem can be systematically resolved by importance weighting:

\widehat{\nabla}_\rho J_{\mathrm{IW}}(\rho) = \frac{1}{N'} \sum_{n=1}^{N'} w(\theta'_n) \nabla_\rho \log p(\theta'_n|\rho) R(h'_n) \;\xrightarrow{N'\to\infty}\; \nabla_\rho J(\rho),

where w(θ) = p(θ|ρ)/p(θ|ρ') is the importance weight. This extended method is called importance-weighted PGPE (IW-PGPE).
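Because the importance weight only involves the two Gaussian priors over θ (the trajectory densities cancel as before), the IW-PGPE estimator is simple to compute. The following Python sketch, with the same illustrative array interface as the earlier PGPE sketch, shows the reweighted gradient with respect to (η, τ).

import numpy as np

def gaussian_prior_logpdf(theta, eta, tau):
    # Log density of the independent Gaussian prior p(theta | eta, tau).
    return np.sum(-0.5 * np.log(2 * np.pi * tau ** 2)
                  - (theta - eta) ** 2 / (2 * tau ** 2), axis=-1)

def iw_pgpe_gradient(thetas, returns, eta, tau, eta_data, tau_data):
    # Samples drawn under the data-collecting prior (eta_data, tau_data) are
    # reweighted by w(theta) = p(theta|rho) / p(theta|rho') before averaging.
    w = np.exp(gaussian_prior_logpdf(thetas, eta, tau)
               - gaussian_prior_logpdf(thetas, eta_data, tau_data))   # (N',)
    diff = thetas - eta
    grad_eta = ((w * returns)[:, None] * diff / tau ** 2).mean(axis=0)
    grad_tau = ((w * returns)[:, None] * (diff ** 2 - tau ** 2) / tau ** 3).mean(axis=0)
    return grad_eta, grad_tau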
Below, the variance of gradient estimators in IW-PGPE is theoretically analyzed. See Zhao et al. (2013) for technical details. As described in Section 9.2.1, the deterministic linear policy model is used here:

\pi(a|s,\theta) = \delta(a = \theta^\top \phi(s)),   (9.3)

where δ(·) is the Dirac delta function and φ(s) is the B-dimensional basis function. The policy parameter θ = (θ_1, ..., θ_B)⊤ is drawn from the independent Gaussian prior, where the policy hyper-parameter ρ consists of prior means η = (η_1, ..., η_B)⊤ and prior standard deviations τ = (τ_1, ..., τ_B)⊤:

p(\theta|\eta,\tau) = \prod_{b=1}^B \frac{1}{\tau_b \sqrt{2\pi}} \exp\left( -\frac{(\theta_b - \eta_b)^2}{2\tau_b^2} \right).   (9.4)

Let

G = \sum_{b=1}^B \tau_b^{-2},

and let Var_{p(h'|θ')p(θ'|ρ')} denote the trace of the covariance matrix:

\mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}[\zeta] = \mathrm{tr}\left( \mathbb{E}_{p(h'|\theta')p(\theta'|\rho')}\left[ (\zeta - \mathbb{E}[\zeta])(\zeta - \mathbb{E}[\zeta])^\top \right] \right) = \mathbb{E}_{p(h'|\theta')p(\theta'|\rho')}\left[ \|\zeta - \mathbb{E}_{p(h'|\theta')p(\theta'|\rho')}[\zeta]\|^2 \right],

where E_{p(h'|θ')p(θ'|ρ')} denotes the expectation over trajectory h' and policy parameter θ' drawn from p(h'|θ')p(θ'|ρ'). Then the following theorem holds:

Theorem 9.5 Assume that for all s, a, and s', there exists β > 0 such that r(s, a, s') ∈ [−β, β], and, for all θ, there exists 0 < w_max < ∞ such that 0 < w(θ) ≤ w_max. Then, the following upper bounds hold:

\mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\left[ \widehat{\nabla}_\eta J_{\mathrm{IW}}(\eta,\tau) \right] \le \frac{\beta^2 (1-\gamma^T)^2 G}{N'(1-\gamma)^2} \, w_{\max},

\mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\left[ \widehat{\nabla}_\tau J_{\mathrm{IW}}(\eta,\tau) \right] \le \frac{2\beta^2 (1-\gamma^T)^2 G}{N'(1-\gamma)^2} \, w_{\max}.

It is interesting to note that the upper bounds are the same as the ones for the plain PGPE (Theorem 9.1 in Section 9.2.3) except for the factor w_max. When w_max = 1, the bounds are reduced to those of the plain PGPE method. However, if the sampling distribution is significantly different from the target distribution, w_max can take a large value and thus IW-PGPE can produce a gradient estimator with large variance. Therefore, IW-PGPE may not be a reliable approach as it is.

Below, a variance reduction technique for IW-PGPE is introduced which leads to a practically useful algorithm.
9.3.2 Variance Reduction by Baseline Subtraction

Here, a baseline is introduced for IW-PGPE to reduce the variance of gradient estimators, in the same way as for the plain PGPE explained in Section 9.2.2.

A policy-prior gradient estimator with a baseline ξ ∈ R is defined as

\widehat{\nabla}_\rho J^{\xi}_{\mathrm{IW}}(\rho) = \frac{1}{N'} \sum_{n=1}^{N'} (R(h'_n) - \xi) \, w(\theta'_n) \nabla_\rho \log p(\theta'_n|\rho).

Here, the baseline ξ is determined so that the variance is minimized. Let ξ* be the optimal baseline for IW-PGPE that minimizes the variance:

\xi^* = \mathop{\mathrm{argmin}}_{\xi} \mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}[\widehat{\nabla}_\rho J^{\xi}_{\mathrm{IW}}(\rho)].

Then the optimal baseline for IW-PGPE is given as follows (Zhao et al., 2013):

\xi^* = \frac{\mathbb{E}_{p(h'|\theta')p(\theta'|\rho')}\left[ R(h') \, w^2(\theta') \|\nabla_\rho \log p(\theta'|\rho)\|^2 \right]}{\mathbb{E}_{p(\theta'|\rho')}\left[ w^2(\theta') \|\nabla_\rho \log p(\theta'|\rho)\|^2 \right]},

where E_{p(θ'|ρ')} denotes the expectation over the policy parameter θ' drawn from p(θ'|ρ'). In practice, the expectations are approximated by the sample averages. The excess variance for a baseline ξ is given as

\mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}[\widehat{\nabla}_\rho J^{\xi}_{\mathrm{IW}}(\rho)] - \mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}[\widehat{\nabla}_\rho J^{\xi^*}_{\mathrm{IW}}(\rho)] = \frac{(\xi - \xi^*)^2}{N'} \, \mathbb{E}_{p(\theta'|\rho')}\left[ w^2(\theta') \|\nabla_\rho \log p(\theta'|\rho)\|^2 \right].

Next, contributions of the optimal baseline to variance reduction in IW-PGPE are analyzed for the deterministic linear policy model (9.3) and the independent Gaussian prior (9.4). See Zhao et al. (2013) for technical details.

Theorem 9.6 Assume that for all s, a, and s', there exists α > 0 such that r(s, a, s') ≥ α, and, for all θ, there exists w_min > 0 such that w(θ) ≥ w_min. Then, the following lower bounds hold:

\mathrm{Var}[\widehat{\nabla}_\eta J_{\mathrm{IW}}(\eta,\tau)] - \mathrm{Var}[\widehat{\nabla}_\eta J^{\xi^*}_{\mathrm{IW}}(\eta,\tau)] \ge \frac{\alpha^2 (1-\gamma^T)^2 G}{N'(1-\gamma)^2} \, w_{\min},

\mathrm{Var}[\widehat{\nabla}_\tau J_{\mathrm{IW}}(\eta,\tau)] - \mathrm{Var}[\widehat{\nabla}_\tau J^{\xi^*}_{\mathrm{IW}}(\eta,\tau)] \ge \frac{2\alpha^2 (1-\gamma^T)^2 G}{N'(1-\gamma)^2} \, w_{\min}.

Assume that for all s, a, and s', there exists β > 0 such that r(s, a, s') ∈ [−β, β], and, for all θ, there exists 0 < w_max < ∞ such that 0 < w(θ) ≤ w_max. Then, the following upper bounds hold:

\mathrm{Var}[\widehat{\nabla}_\eta J_{\mathrm{IW}}(\eta,\tau)] - \mathrm{Var}[\widehat{\nabla}_\eta J^{\xi^*}_{\mathrm{IW}}(\eta,\tau)] \le \frac{\beta^2 (1-\gamma^T)^2 G}{N'(1-\gamma)^2} \, w_{\max},

\mathrm{Var}[\widehat{\nabla}_\tau J_{\mathrm{IW}}(\eta,\tau)] - \mathrm{Var}[\widehat{\nabla}_\tau J^{\xi^*}_{\mathrm{IW}}(\eta,\tau)] \le \frac{2\beta^2 (1-\gamma^T)^2 G}{N'(1-\gamma)^2} \, w_{\max}.

(All variances above are taken with respect to p(h'|θ')p(θ'|ρ').) This theorem shows that the bounds of the variance reduction in IW-PGPE brought by the optimal baseline depend on the bounds of the importance weight, w_min and w_max: the larger the upper bound w_max is, the more optimal baseline subtraction can reduce the variance.

From Theorem 9.5 and Theorem 9.6, the following corollary can be immediately obtained:

Corollary 9.7 Assume that for all s, a, and s', there exists 0 < α < β such that r(s, a, s') ∈ [α, β], and, for all θ, there exist 0 < w_min < w_max < ∞ such that w_min ≤ w(θ) ≤ w_max. Then, the following upper bounds hold:

\mathrm{Var}[\widehat{\nabla}_\eta J^{\xi^*}_{\mathrm{IW}}(\eta,\tau)] \le \frac{(1-\gamma^T)^2 G}{N'(1-\gamma)^2} (\beta^2 w_{\max} - \alpha^2 w_{\min}),

\mathrm{Var}[\widehat{\nabla}_\tau J^{\xi^*}_{\mathrm{IW}}(\eta,\tau)] \le \frac{2(1-\gamma^T)^2 G}{N'(1-\gamma)^2} (\beta^2 w_{\max} - \alpha^2 w_{\min}).

From Theorem 9.5 and this corollary, we can confirm that the upper bounds for the baseline-subtracted IW-PGPE are smaller than those for the plain IW-PGPE without baseline subtraction, because α² w_min > 0. In particular, if w_min is large, the upper bounds for the baseline-subtracted IW-PGPE can be much smaller than those for the plain IW-PGPE without baseline subtraction.
9.3.3 Numerical Examples

Here, we consider the control task of the humanoid robot CB-i (Cheng et al., 2007) shown in Figure 9.5(a). The goal is to lead the end effector of the right arm (the right hand) to a target object. First, its simulated upper-body model, illustrated in Figure 9.5(b), is used to investigate the performance of the IW-PGPE-OB method. Then the IW-PGPE-OB method is applied to the real robot.

FIGURE 9.5: Humanoid robot CB-i and its upper-body model. (a) CB-i. (b) Simulated upper-body model. The humanoid robot CB-i was developed by the JST-ICORP Computational Brain Project and ATR Computational Neuroscience Labs (Cheng et al., 2007).

9.3.3.1 Setup

The performance of the following 4 methods is compared:

• IW-REINFORCE-OB: Importance-weighted REINFORCE with the optimal baseline.
• NIW-PGPE-OB: Data-reuse PGPE-OB without importance weighting.
• PGPE-OB: Plain PGPE-OB without data reuse.
• IW-PGPE-OB: Importance-weighted PGPE with the optimal baseline.

The upper body of CB-i has 9 degrees of freedom: the shoulder pitch, shoulder roll, and elbow pitch of the right arm; the shoulder pitch, shoulder roll, and elbow pitch of the left arm; the waist yaw; the torso roll; and the torso pitch (Figure 9.5(b)). At each time step, the controller receives states from the system and sends out actions. The state space is 18-dimensional, which corresponds to the current angle and angular velocity of each joint. The action space is 9-dimensional, which corresponds to the target angle of each joint. Both states and actions are continuous.

Given the state and action in each time step, the physical control system calculates the torques at each joint by using a proportional-derivative (PD) controller as

\tau_i = K_{p_i}(a_i - s_i) - K_{d_i}\dot{s}_i,

where s_i, ṡ_i, and a_i denote the current angle, the current angular velocity, and the target angle of the i-th joint, respectively. K_{p_i} and K_{d_i} denote the position and velocity gains for the i-th joint, respectively. These parameters are set at

K_{p_i} = 200 \quad \text{and} \quad K_{d_i} = 10

for the elbow pitch joints, and

K_{p_i} = 2000 \quad \text{and} \quad K_{d_i} = 100

for the other joints.

The initial position of the robot is fixed at the standing-up-straight pose with the arms down. The immediate reward r_t at time step t is defined as

r_t = \exp(-10 d_t) - 0.0005 \min(c_t, 10000),

where d_t is the distance between the right hand of the robot and the target object, and c_t is the sum of control costs for each joint. The linear deterministic policy is used for the PGPE methods, and the Gaussian policy is used for IW-REINFORCE-OB. In both cases, the linear basis function φ(s) = s is used. For PGPE, the initial prior mean η is randomly chosen from the standard normal distribution, and the initial prior standard deviation τ is set at 1.

To evaluate the usefulness of data reuse methods with a small number of samples, the agent collects only N = 3 on-policy samples with trajectory length T = 100 at each iteration. All previous data samples are reused to estimate the gradients in the data reuse methods, while only on-policy samples are used to estimate the gradients in the plain PGPE-OB method. The discount factor is set at γ = 0.9.
9.3.3.2 Simulation with 2 Degrees of Freedom

First, the performance on the reaching task with only 2 degrees of freedom is investigated. The body of the robot is fixed and only the right shoulder pitch and right elbow pitch are used. Figure 9.6 depicts the averaged expected return over 10 trials as a function of the number of iterations. The expected return at each trial is computed from 50 newly drawn test episodic data that are not used for policy learning. The graph shows that IW-PGPE-OB nicely improves the performance over iterations with only a small number of on-policy samples. The plain PGPE-OB method can also improve the performance over iterations, but slowly. NIW-PGPE-OB is not as good as IW-PGPE-OB, especially at the later iterations, because of the inconsistency of the NIW estimator.

The distance from the right hand to the object and the control costs along the trajectory are also investigated for three policies: the initial policy, the policy obtained at the 20th iteration by IW-PGPE-OB, and the policy obtained at the 50th iteration by IW-PGPE-OB. Figure 9.7(a) plots the distance to the target object as a function of the time step.
obtainedatthe50thiterationdecreasesthedistancerapidlycomparedwith
Policy-PriorSearch
149
5
IW−PGPE−OB
NIW−PGPE−OB
PGPE−OB
4
IW−REINFORCE−OB
3
Return
2
1
0
10
20
30
40
50
Iteration
FIGURE9.6:Averageandstandarderrorofreturnsover10runsasfunctions
ofthenumberofiterationsforthereachingtaskwith2degreesoffreedom
(rightshoulderpitchandrightelbowpitch).
0.35
Initialpolicy
Policyatthe20thiteration
0.3
Policyatthe50thiteration
0.25
0.2
0.15
Distance
0.1
0.05
00
10
20
30
40
50
60
70
80
90
100
TImesteps
(a)Distance
120
Initialpolicy
110
Policyatthe20thiteration
Policyatthe50thiteration
100
90
80
70
Controlcosts60
50
40
300
10
20
30
40
50
60
70
80
90
100
Timesteps
(b)Controlcosts
FIGURE9.7:Distanceandcontrolcostsofarmreachingwith2degreesof
freedomusingthepolicylearnedbyIW-PGPE-OB.
150
StatisticalReinforcementLearning
FIGURE9.8:Typicalexampleofarmreachingwith2degreesoffreedom
usingthepolicyobtainedbyIW-PGPE-OBatthe50thiteration(fromleftto
rightandtoptobottom).
This shows that the policy obtained at the 50th iteration decreases the distance rapidly compared with the initial policy and the policy obtained at the 20th iteration, which means that the robot can reach the object quickly by using the learned policy.

Figure 9.7(b) plots the control cost as a function of the time step. This shows that the policy obtained at the 50th iteration decreases the control cost steadily until the reaching task is completed. This is because the robot mainly adjusts the shoulder pitch in the beginning, which consumes a larger amount of energy than the energy required for controlling the elbow pitch. Then, once the right hand gets closer to the target object, the robot starts adjusting the elbow pitch to reach the target object. The policy obtained at the 20th iteration actually consumes less control cost, but it cannot lead the arm to the target object.

Figure 9.8 illustrates a typical solution of the reaching task with 2 degrees of freedom by the policy obtained by IW-PGPE-OB at the 50th iteration. The images show that the right hand is successfully led to the target object within only 10 time steps.
9.3.3.3 Simulation with All 9 Degrees of Freedom

Finally, the same experiment is carried out using all 9 degrees of freedom. The position of the target object is more distant from the robot so that it cannot be reached by only using the right arm.
FIGURE 9.9: Average and standard error of returns over 10 runs as functions of the number of iterations for the reaching task with all 9 degrees of freedom.
Because all 9 joints are used, the dimensionality of the state space is much increased and this grows the values of importance weights exponentially. In order to mitigate the large values of importance weights, we decided not to reuse all previously collected samples, but only samples collected in the last 5 iterations. This allows us to keep the difference between the sampling distribution and the target distribution reasonably small, and thus the values of importance weights can be suppressed to some extent. Furthermore, following Wawrzynski (2009), we consider a version of IW-PGPE-OB, denoted as "truncated IW-PGPE-OB" below, where the importance weight is truncated as w = min(w, 2).

The results plotted in Figure 9.9 show that the performance of the truncated IW-PGPE-OB is the best. This implies that the truncation of importance weights is helpful when applying IW-PGPE-OB to high-dimensional problems.

Figure 9.10 illustrates a typical solution of the reaching task with all 9 degrees of freedom by the policy obtained by the truncated IW-PGPE-OB at the 400th iteration. The images show that the policy learned by the proposed method successfully leads the right hand to the target object, and the irrelevant parts are kept at the initial position for reducing the control costs.
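As a rough illustration of these two stabilization heuristics (a limited reuse window and weight truncation), they could be implemented as in the following sketch; the container names are assumptions, and the cap value 2 follows the text above.

def truncate_weights(raw_weights, cap=2.0):
    # Truncate importance weights as w = min(w, cap) to suppress their
    # exponential growth in high-dimensional state spaces.
    return [min(w, cap) for w in raw_weights]

def reuse_window(samples_per_iteration, window=5):
    # Reuse only the samples collected in the last `window` iterations.
    return samples_per_iteration[-window:]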
9.3.3.4 Real Robot Control

Finally, the IW-PGPE-OB method is applied to the real CB-i robot shown in Figure 9.11 (Sugimoto et al., 2014).

The experimental setting is essentially the same as the above simulation studies with 9 joints, but policies are updated only every 5 trials and samples taken from the last 10 trials are reused for stabilization purposes.

FIGURE 9.10: Typical example of arm reaching with all 9 degrees of freedom using the policy obtained by the truncated IW-PGPE-OB at the 400th iteration (from left to right and top to bottom).

FIGURE 9.11: Reaching task by the real CB-i robot (Sugimoto et al., 2014).
Figure 9.12 plots the obtained rewards cumulated over policy update iterations, showing that the rewards are steadily increased over iterations. Figure 9.13 exhibits the acquired reaching motion based on the policy obtained at the 120th iteration, showing that the end effector of the robot can successfully reach the target object.

FIGURE 9.12: Obtained reward cumulated over policy update iterations.
9.4 Remarks
When the trajectory length is large, direct policy search tends to produce gradient estimators with large variance, due to the randomness of stochastic policies. Policy-prior search can avoid this problem by using deterministic policies and introducing stochasticity through a prior distribution over policy parameters. Both theoretically and experimentally, advantages of policy-prior search over direct policy search were shown.
A sample reuse framework for policy-prior search was also introduced, which is highly useful in real-world reinforcement learning problems with high sampling costs. Following the same line as the sample reuse methods for policy iteration described in Chapter 4 and direct policy search introduced in Chapter 8, importance weighting plays an essential role in sample-reuse policy-prior search. When the dimensionality of the state-action space is high, however, importance weights tend to take extremely large values, which causes instability of the importance weighting methods. To mitigate this problem, truncation of the importance weights is useful in practice.
FIGURE 9.13: Typical example of arm reaching using the policy obtained by the IW-PGPE-OB method (from left to right and top to bottom).
Part IV
Model-Based Reinforcement Learning

The reinforcement learning methods explained in Part II and Part III are categorized into the model-free approach, meaning that policies are learned without explicitly modeling the unknown environment (i.e., the transition probability of the agent). On the other hand, in Part IV, we introduce an alternative approach called the model-based approach, which explicitly models the environment in advance and uses the learned environment model for policy learning.
In the model-based approach, no additional sampling cost is necessary to generate artificial samples from the learned environment model. Thus, the model-based approach is useful when data collection is expensive (e.g., robot control). However, accurately estimating the transition model from a limited amount of trajectory data in multi-dimensional continuous state and action spaces is highly challenging.
In Chapter 10, we introduce a non-parametric model estimator that possesses the optimal convergence rate with high computational efficiency, and demonstrate its usefulness through experiments. Then, in Chapter 11, we combine dimensionality reduction with model estimation to cope with the high dimensionality of state and action spaces.
Chapter 10
Transition Model Estimation

In this chapter, we introduce transition probability estimation methods for model-based reinforcement learning (Wang & Dietterich, 2003; Deisenroth & Rasmussen, 2011). Among the methods described in Section 10.1, a non-parametric transition model estimator called least-squares conditional density estimation (LSCDE) (Sugiyama et al., 2010) is shown to be the most promising approach (Tangkaratt et al., 2014a). Then, in Section 10.2, we describe how the transition model estimator can be utilized in model-based reinforcement learning. In Section 10.3, the experimental performance of a model-based policy-prior search method is evaluated. Finally, Section 10.4 concludes this chapter.
10.1 Conditional Density Estimation
In this section, the problem of approximating the transition probability p(s'|s,a) from independent transition samples {(s_m, a_m, s'_m)}_{m=1}^M is addressed.
10.1.1 Regression-Based Approach
In the regression-based approach, the problem of transition probability estimation is formulated as a function approximation problem of predicting the output s' given inputs s and a under Gaussian noise:
    s' = f(s, a) + ε,
where f is an unknown regression function to be learned, ε is an independent Gaussian noise vector with mean zero and covariance matrix σ²I, and I denotes the identity matrix.
Let us approximate f by the following linear-in-parameter model:
    f(s, a, Γ) = Γ^T φ(s, a),
where Γ is the B × dim(s) parameter matrix and φ(s, a) is the B-dimensional basis vector. A typical choice of the basis vector is the Gaussian kernel, which is defined for B = M as
    φ_b(s, a) = exp( −( ||s − s_b||² + (a − a_b)² ) / (2κ²) ),
where κ > 0 denotes the Gaussian kernel width. If B is too large, the number of basis functions may be reduced by only using a subset of samples as Gaussian centers. Different Gaussian widths for s and a may be used if necessary.
The parameter matrix Γ is learned so that the regularized squared error is minimized:
    Γ̂ = argmin_Γ [ Σ_{m=1}^M || f(s_m, a_m, Γ) − s'_m ||² + tr( Γ^T R Γ ) ],
where R is the B × B positive semi-definite matrix called the regularization matrix. The solution Γ̂ is given analytically as
    Γ̂ = (Φ^T Φ + R)^{-1} Φ^T (s'_1, …, s'_M)^T,
where Φ is the M × B design matrix defined as
    Φ_{m,b} = φ_b(s_m, a_m).
We can confirm that the predicted output vector ŝ' = f(s, a, Γ̂) actually follows the Gaussian distribution with mean
    (s'_1, …, s'_M) Φ (Φ^T Φ + R)^{-1} φ(s, a)
and covariance matrix δ̂² I, where
    δ̂² = σ² tr( (Φ^T Φ + R)^{-2} Φ^T Φ ).
The tuning parameters such as the Gaussian kernel width κ and the regularization matrix R can be determined either by cross-validation or by evidence maximization if the above method is regarded as Gaussian process regression in the Bayesian framework (Rasmussen & Williams, 2006).
This is the regression-based estimator of the transition probability density p(s'|s,a) for an arbitrary test input s and a. Thus, by the use of kernel regression models, the regression function f (which is the conditional mean of the output s') is approximated in a non-parametric way. However, the conditional distribution of the output itself is restricted to be Gaussian, which is highly restrictive in real-world reinforcement learning.
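For concreteness, a minimal NumPy sketch of this regression-based estimator could look as follows. The function names are illustrative, all samples are used as Gaussian centers (B = M), and the fixed values of κ and the ridge matrix R = λI are placeholders that would be chosen by cross-validation or evidence maximization as described above.

import numpy as np

def gaussian_basis(X, centers, kappa):
    # phi_b(x) = exp(-||x - c_b||^2 / (2 kappa^2)) for each row x of X
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * kappa ** 2))

def fit_transition_regression(S, A, S_next, kappa=1.0, lam=0.1):
    # Ridge solution: Gamma = (Phi^T Phi + R)^{-1} Phi^T (s'_1, ..., s'_M)^T
    SA = np.hstack([S, A])                      # M x (dim(s) + dim(a)) inputs
    Phi = gaussian_basis(SA, SA, kappa)         # M x B design matrix (B = M)
    R = lam * np.eye(Phi.shape[1])              # regularization matrix R
    Gamma = np.linalg.solve(Phi.T @ Phi + R, Phi.T @ S_next)
    return Gamma, SA

def predict_mean(s, a, Gamma, centers, kappa=1.0):
    # Conditional mean of s' under the learned Gaussian model
    phi = gaussian_basis(np.hstack([s, a])[None, :], centers, kappa)
    return (phi @ Gamma)[0]

Given arrays S, A, and S_next with one transition sample per row, fit_transition_regression returns the parameter matrix and the Gaussian centers, and predict_mean evaluates the estimated conditional mean of s' at a test input.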
10.1.2 ǫ-Neighbor Kernel Density Estimation
When the conditioning variables (s, a) are discrete, the conditional density p(s'|s,a) can be easily estimated by standard density estimators such as kernel density estimation (KDE), by only using samples s'_i such that (s_i, a_i) agrees with the target values (s, a). ǫ-neighbor KDE (ǫKDE) extends this idea to the continuous case by using samples s'_i such that (s_i, a_i) are close to the target values (s, a).
More specifically, ǫKDE with the Gaussian kernel is given by
    p̂(s'|s,a) = (1 / |I_{(s,a),ǫ}|) Σ_{i ∈ I_{(s,a),ǫ}} N(s'; s'_i, σ²I),
where I_{(s,a),ǫ} is the set of sample indices such that ||(s,a) − (s_i,a_i)|| ≤ ǫ, and N(s'; s'_i, σ²I) denotes the Gaussian density with mean s'_i and covariance matrix σ²I. The Gaussian width σ and the distance threshold ǫ may be chosen by cross-validation.
ǫKDE is a useful non-parametric density estimator that is easy to implement. However, it is unreliable in high-dimensional problems due to its distance-based construction.
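A small Python sketch of ǫKDE along these lines is given below; the values of ǫ and σ are placeholders that would be chosen by cross-validation.

import numpy as np

def eps_kde(s, a, S, A, S_next, eps=0.5, sigma=0.3):
    # Collect samples whose (s_i, a_i) lies within distance eps of the query (s, a)
    query = np.hstack([s, a])
    dist = np.linalg.norm(np.hstack([S, A]) - query, axis=1)
    idx = np.where(dist <= eps)[0]
    if idx.size == 0:
        return None                              # no neighbors: estimate undefined here
    centers = S_next[idx]
    d = S_next.shape[1]
    norm = (2.0 * np.pi * sigma ** 2) ** (d / 2.0)

    def density(s_prime):
        # Average of Gaussian densities N(s'; s'_i, sigma^2 I) over the neighbors
        d2 = ((s_prime - centers) ** 2).sum(axis=1)
        return np.exp(-d2 / (2.0 * sigma ** 2)).sum() / (idx.size * norm)

    return density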
10.1.3 Least-Squares Conditional Density Estimation
A non-parametric conditional density estimator called least-squares conditional density estimation (LSCDE) (Sugiyama et al., 2010) possesses various useful properties:
• It can directly handle multi-dimensional multi-modal inputs and outputs.
• It was proved to achieve the optimal convergence rate (Kanamori et al., 2012).
• It has high numerical stability (Kanamori et al., 2013).
• It is robust against outliers (Sugiyama et al., 2010).
• Its solution can be analytically and efficiently computed just by solving a system of linear equations (Kanamori et al., 2009).
• Generating samples from the learned transition model is straightforward.
Let us model the transition probability p(s'|s,a) by the following linear-in-parameter model:
    α^T φ(s, a, s'),                                                (10.1)
where α is the B-dimensional parameter vector and φ(s, a, s') is the B-dimensional basis function vector. A typical choice of the basis function is the Gaussian kernel, which is defined for B = M as
    φ_b(s, a, s') = exp( −( ||s − s_b||² + (a − a_b)² + ||s' − s'_b||² ) / (2κ²) ),
where κ > 0 denotes the Gaussian kernel width. If B is too large, the number of basis functions may be reduced by only using a subset of samples as Gaussian centers. Different Gaussian widths for s, a, and s' may be used if necessary.
The parameter α is learned so that the following squared error is minimized:
    J_0(α) = (1/2) ∫∫∫ ( α^T φ(s,a,s') − p(s'|s,a) )² p(s,a) ds da ds'
           = (1/2) ∫∫∫ ( α^T φ(s,a,s') )² p(s,a) ds da ds'
             − ∫∫∫ α^T φ(s,a,s') p(s,a,s') ds da ds' + C,
where the identity p(s'|s,a) = p(s,a,s') / p(s,a) is used in the second term and
    C = (1/2) ∫∫∫ p(s'|s,a) p(s,a,s') ds da ds'.
Because C is a constant independent of α, only the first two terms will be considered from here on:
    J(α) = J_0(α) − C = (1/2) α^T U α − α^T v,
where U is the B × B matrix and v is the B-dimensional vector defined as
    U = ∫∫ Φ(s,a) p(s,a) ds da,
    v = ∫∫∫ φ(s,a,s') p(s,a,s') ds da ds',
    Φ(s,a) = ∫ φ(s,a,s') φ(s,a,s')^T ds'.
Note that, for the Gaussian model (10.1), the (b,b')-th element of matrix Φ(s,a) can be computed analytically as
    Φ_{b,b'}(s,a) = (√π κ)^{dim(s')} exp( −||s'_b − s'_{b'}||² / (4κ²) )
                    × exp( −( ||s − s_b||² + ||s − s_{b'}||² + (a − a_b)² + (a − a_{b'})² ) / (2κ²) ).
Because U and v included in J(α) contain expectations over the unknown densities p(s,a) and p(s,a,s'), they are approximated by sample averages. Then we have
    Ĵ(α) = (1/2) α^T Û α − v̂^T α,
where
    Û = (1/M) Σ_{m=1}^M Φ(s_m, a_m)   and   v̂ = (1/M) Σ_{m=1}^M φ(s_m, a_m, s'_m).
By adding an ℓ2-regularizer to Ĵ(α) to avoid overfitting, the LSCDE optimization criterion is given as
    α̃ = argmin_{α ∈ R^M} [ Ĵ(α) + (λ/2) ||α||² ],
where λ ≥ 0 is the regularization parameter. The solution α̃ is given analytically as
    α̃ = (Û + λI)^{-1} v̂,
where I denotes the identity matrix. Because conditional probability densities are non-negative by definition, the solution α̃ is modified as
    α̂_b = max(0, α̃_b).
Finally, the solution is normalized in the test phase. More specifically, given a test input point (s, a), the final LSCDE solution is given as
    p̂(s'|s,a) = α̂^T φ(s,a,s') / ∫ α̂^T φ(s,a,s'') ds'',
where, for the Gaussian model (10.1), the denominator can be analytically computed as
    ∫ α̂^T φ(s,a,s'') ds'' = (√(2π) κ)^{dim(s')} Σ_{b=1}^B α̂_b exp( −( ||s − s_b||² + (a − a_b)² ) / (2κ²) ).
Model selection of the Gaussian width κ and the regularization parameter λ is possible by cross-validation (Sugiyama et al., 2010).
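The computations above reduce to simple matrix operations. The following NumPy sketch, with all samples used as Gaussian centers and illustrative fixed values of κ and λ, mirrors the analytic solution and the test-phase normalization; actions are treated as vectors, so (a − a_b)² is read as ||a − a_b||².

import numpy as np

def _sq_dists(X, Y):
    # Pairwise squared Euclidean distances between rows of X and Y
    return ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)

def lscde_fit(S, A, S_next, kappa=1.0, lam=0.01):
    # Analytic LSCDE solution: alpha = max(0, (U_hat + lam I)^{-1} v_hat)
    M, d_out = S_next.shape
    SA = np.hstack([S, A])
    D_in = _sq_dists(SA, SA)                     # ||(s_m,a_m) - (s_b,a_b)||^2
    D_out = _sq_dists(S_next, S_next)            # ||s'_m - s'_b||^2
    K_in = np.exp(-D_in / (2 * kappa ** 2))
    v = (K_in * np.exp(-D_out / (2 * kappa ** 2))).mean(axis=0)
    U = ((np.sqrt(np.pi) * kappa) ** d_out
         * np.exp(-D_out / (4 * kappa ** 2)) * (K_in.T @ K_in) / M)
    alpha = np.linalg.solve(U + lam * np.eye(M), v)
    return np.maximum(alpha, 0.0)

def lscde_density(s, a, s_prime, alpha, S, A, S_next, kappa=1.0):
    # Normalized conditional density p_hat(s'|s,a) at a test point
    d_out = S_next.shape[1]
    k_in = np.exp(-_sq_dists(np.hstack([s, a])[None, :], np.hstack([S, A]))[0]
                  / (2 * kappa ** 2))
    k_out = np.exp(-_sq_dists(s_prime[None, :], S_next)[0] / (2 * kappa ** 2))
    numer = np.dot(alpha, k_in * k_out)
    denom = (np.sqrt(2 * np.pi) * kappa) ** d_out * np.dot(alpha, k_in)
    return numer / denom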
10.2 Model-Based Reinforcement Learning
Model-based reinforcement learning is simply carried out as follows.
1. Collect transition samples {(s_m, a_m, s'_m)}_{m=1}^M.
2. Obtain a transition model estimate p̂(s'|s,a) from {(s_m, a_m, s'_m)}_{m=1}^M.
3. Run a model-free reinforcement learning method using trajectory samples artificially generated from the estimated transition model p̂(s'|s,a) and the current policy π(a|s,θ).
Model-based reinforcement learning is particularly advantageous when the sampling cost is limited. More specifically, in model-free methods, we need to fix the sampling schedule in advance; for example, whether many samples are gathered in the beginning or only a small batch of samples is collected for a longer period. However, optimizing the sampling schedule in advance is not possible without strong prior knowledge. Thus, we need to just blindly design the sampling schedule in practice, which can cause significant performance degradation. On the other hand, model-based methods do not suffer from this problem, because we can draw as many trajectory samples as we want from the learned transition model without additional sampling costs.
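Because the LSCDE model is a non-negative combination of Gaussian kernels, step 3 is straightforward to realize: p̂(s'|s,a) is a Gaussian mixture with means s'_b, covariance κ²I, and mixing weights proportional to α̂_b exp(−(||s − s_b||² + (a − a_b)²) / (2κ²)). A rough sketch of such a sampler, reusing the lscde_fit output from the sketch in Section 10.1.3, is shown below; the policy argument is assumed to be any function mapping a state to an action.

import numpy as np

def sample_next_state(s, a, alpha, S, A, S_next, kappa=1.0, rng=None):
    # Draw s' from the LSCDE Gaussian mixture conditioned on (s, a)
    rng = np.random.default_rng() if rng is None else rng
    d2 = ((np.hstack([S, A]) - np.hstack([s, a])) ** 2).sum(axis=1)
    w = alpha * np.exp(-d2 / (2 * kappa ** 2))
    w = np.ones_like(w) if w.sum() <= 0 else w   # fallback if all weights vanish
    b = rng.choice(len(w), p=w / w.sum())
    return S_next[b] + kappa * rng.standard_normal(S_next.shape[1])

def artificial_trajectory(s0, policy, alpha, S, A, S_next, T=10, kappa=1.0, rng=None):
    # Roll out one artificial trajectory from the learned model and the current policy
    rng = np.random.default_rng() if rng is None else rng
    s, traj = np.asarray(s0, dtype=float), []
    for _ in range(T):
        a = policy(s)
        s_next = sample_next_state(s, a, alpha, S, A, S_next, kappa, rng)
        traj.append((s, a, s_next))
        s = s_next
    return traj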
10.3 Numerical Examples
In this section, the experimental performance of the model-free and model-based versions of PGPE (policy gradients with parameter-based exploration) is evaluated:
M-PGPE(LSCDE): The model-based PGPE method with the transition model estimated by LSCDE.
M-PGPE(GP): The model-based PGPE method with the transition model estimated by Gaussian process (GP) regression.
IW-PGPE: The model-free PGPE method with sample reuse by importance weighting (the method introduced in Chapter 9).
10.3.1 Continuous Chain Walk
Let us first consider a simple continuous chain walk task, described in Figure 10.1.
10.3.1.1 Setup
Let s ∈ S = [0, 10], a ∈ A = [−5, 5], and
    r(s, a, s') = 1  (4 < s' < 6),   r(s, a, s') = 0  (otherwise).
That is, the agent receives positive reward +1 at the center of the state space. The trajectory length is set at T = 10 and the discount factor is set at γ = 0.99.

FIGURE 10.1: Illustration of continuous chain walk (states 0 to 10, with the reward region between 4 and 6).

The following linear-in-parameter policy model is used in both the M-PGPE and IW-PGPE methods:
    a = Σ_{i=1}^6 θ_i exp( −(s − c_i)² / 2 ),
where (c_1, …, c_6) = (0, 2, 4, 6, 8, 10). If an action determined by the above policy is out of the action space, it is pulled back to be confined in the domain.
As transition dynamics, the following two scenarios are considered:
Gaussian: The true transition dynamics is given by
    s_{t+1} = s_t + a_t + ε_t,
where ε_t is the Gaussian noise with mean 0 and standard deviation 0.3.
Bimodal: The true transition dynamics is given by
    s_{t+1} = s_t ± a_t + ε_t,
where ε_t is the Gaussian noise with mean 0 and standard deviation 0.3, and the sign of a_t is randomly chosen with probability 1/2.
If the next state is out of the state space, it is projected back to the domain. Below, the budget for data collection is assumed to be limited to N = 20 trajectory samples.
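For reference, the chain-walk environment and the policy model above can be simulated with a few lines of Python; the function names are illustrative, the clipping of actions and states to their domains follows the description above, and the bimodal flag switches between the two transition scenarios.

import numpy as np

CENTERS = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 10.0])   # (c_1, ..., c_6)

def chain_policy(s, theta):
    # a = sum_i theta_i exp(-(s - c_i)^2 / 2), pulled back into A = [-5, 5]
    a = float(np.dot(theta, np.exp(-(s - CENTERS) ** 2 / 2.0)))
    return min(max(a, -5.0), 5.0)

def chain_step(s, a, bimodal=False, rng=None):
    # One transition; returns (s', r) with reward +1 when 4 < s' < 6
    rng = np.random.default_rng() if rng is None else rng
    sign = rng.choice([-1.0, 1.0]) if bimodal else 1.0
    s_next = s + sign * a + rng.normal(0.0, 0.3)
    s_next = min(max(s_next, 0.0), 10.0)               # project back to S = [0, 10]
    r = 1.0 if 4.0 < s_next < 6.0 else 0.0
    return s_next, r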
10.3.1.2 Comparison of Model Estimators
When the transition model is learned in the M-PGPE methods, all N = 20 trajectory samples are gathered randomly in the beginning at once. More specifically, the initial state s_1 and the action a_1 are chosen from the uniform distributions over S and A, respectively. Then the next state s_2 and the immediate reward r_1 are obtained. After that, the action a_2 is chosen from the uniform distribution over A, and the next state s_3 and the immediate reward r_2 are obtained. This process is repeated until r_T is obtained, by which a trajectory sample is obtained. This data generation process is repeated N times to obtain N trajectory samples.
Figure 10.2 and Figure 10.3 illustrate the true transition dynamics and their estimates obtained by LSCDE and GP in the Gaussian and bimodal cases, respectively. Figure 10.2 shows that both LSCDE and GP can learn the entire profile of the true transition dynamics well in the Gaussian case. On the other hand, Figure 10.3 shows that LSCDE can still successfully capture the entire profile of the true transition dynamics even in the bimodal case, but GP fails to capture the bimodal structure.
Based on the estimated transition models, policies are learned by the M-PGPE method. More specifically, from the learned transition model, 1000 artificial trajectory samples are generated for gradient estimation and another 1000 artificial trajectory samples are used for baseline estimation. Then policies are updated based on these artificial trajectory samples. This policy update step is repeated 100 times. For evaluating the return of a learned policy, 100 additional test trajectory samples are used which are not employed for policy learning. Figure 10.4 and Figure 10.5 depict the averages and standard errors of returns over 100 runs for the Gaussian and bimodal cases, respectively. The results show that, in the Gaussian case, the GP-based method performs very well and LSCDE also exhibits reasonable performance. In the bimodal case, on the other hand, GP performs poorly and LSCDE gives much better results than GP. This illustrates the high flexibility of LSCDE.

FIGURE 10.2: Gaussian transition dynamics and its estimates by LSCDE and GP ((a) true transition; (b) transition estimated by LSCDE; (c) transition estimated by GP; each panel plots argmax_{s'} p(s'|s,a) over the state-action space).
FIGURE 10.3: Bimodal transition dynamics and its estimates by LSCDE and GP ((a) true transition; (b) transition estimated by LSCDE; (c) transition estimated by GP; each panel plots argmax_{s'} p(s'|s,a) over the state-action space).
FIGURE 10.4: Averages and standard errors of returns of the policies over 100 runs obtained by M-PGPE with LSCDE, M-PGPE with GP, and IW-PGPE for the Gaussian transition.

FIGURE 10.5: Averages and standard errors of returns of the policies over 100 runs obtained by M-PGPE with LSCDE, M-PGPE with GP, and IW-PGPE for the bimodal transition.
FIGURE 10.6: Averages and standard errors of returns obtained by IW-PGPE over 100 runs for the Gaussian transition with different sampling schedules (e.g., 5×4 means gathering k = 5 trajectory samples 4 times).

FIGURE 10.7: Averages and standard errors of returns obtained by IW-PGPE over 100 runs for the bimodal transition with different sampling schedules (e.g., 5×4 means gathering k = 5 trajectory samples 4 times).
10.3.1.3 Comparison of Model-Based and Model-Free Methods
Next, the performance of the model-based and model-free PGPE methods is compared.
Under the fixed-budget scenario, the schedule for collecting 20 trajectory samples needs to be determined for the IW-PGPE method. First, the influence of the choice of sampling schedules is illustrated. Figure 10.6 and Figure 10.7 show expected returns averaged over 100 runs under the sampling schedule in which a batch of k trajectory samples is gathered 20/k times, for different values of k. Here, the policy update is performed 100 times after observing each batch of k trajectory samples, because this performed better than the usual scheme of updating the policy only once. Figure 10.6 shows that the performance of IW-PGPE depends heavily on the sampling schedule, and gathering k = 20 trajectory samples at once is the best choice in the Gaussian case. Figure 10.7 shows that gathering k = 20 trajectory samples at once is also the best choice in the bimodal case.
Although the best sampling schedule is not accessible in practice, the optimal sampling schedule is used for evaluating the performance of IW-PGPE. Figure 10.4 and Figure 10.5 show the averages and standard errors of returns obtained by IW-PGPE over 100 runs as functions of the sampling steps. These graphs show that IW-PGPE can improve the policies only in the beginning, because all trajectory samples are gathered at once in the beginning. The performance of IW-PGPE could be further improved if more trajectory samples were gathered, but this is prohibited under the fixed-budget scenario. On the other hand, the returns of M-PGPE keep increasing over iterations, because artificial trajectory samples can be generated continually without additional sampling costs. This illustrates a potential advantage of model-based reinforcement learning (RL) methods.
10.3.2 Humanoid Robot Control
Finally, the performance of M-PGPE is evaluated on a practical control problem of a simulated upper-body model of the humanoid robot CB-i (Cheng et al., 2007), which was also used in Section 9.3.3; see Figure 9.5 for illustrations of CB-i and its simulator.
10.3.2.1 Setup
The simulator is based on the upper body of the CB-i humanoid robot, which has 9 joints: the shoulder pitch, shoulder roll, and elbow pitch of the right arm; the shoulder pitch, shoulder roll, and elbow pitch of the left arm; the waist yaw; the torso roll; and the torso pitch. The state vector is 18-dimensional and real-valued, corresponding to the current angle in degrees and the current angular velocity of each joint. The action vector is 9-dimensional and real-valued, corresponding to the target angle of each joint in degrees. The goal of the control problem is to lead the end effector of the right arm (the right hand) to the target object. A noisy control system is simulated by perturbing action vectors with independent bimodal Gaussian noise. More specifically, for each element of the action vector, Gaussian noise with mean 0 and standard deviation 3 is added with probability 0.6, and Gaussian noise with mean −5 and standard deviation 3 is added with probability 0.4.
The initial posture of the robot is fixed to standing up straight with arms down. The target object is located in front of and above the right hand, which is reachable by using the controllable joints. The reward function at each time step is defined as
    r_t = exp(−10 d_t) − 0.000005 min(c_t, 1,000,000),
where d_t is the distance between the right hand and the target object at time step t, and c_t is the sum of control costs for each joint. The deterministic policy model used in M-PGPE and IW-PGPE is defined as a = θ^T φ(s) with the basis function φ(s) = s. The trajectory length is set at T = 100 and the discount factor is set at γ = 0.9.
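In Python, the reward and the deterministic policy model above amount to the following short sketch; d_t, c_t, and the policy parameter matrix theta are supplied by the simulator and the learner, respectively.

import numpy as np

def reaching_reward(d_t, c_t):
    # r_t = exp(-10 d_t) - 0.000005 * min(c_t, 1,000,000)
    return np.exp(-10.0 * d_t) - 0.000005 * min(c_t, 1_000_000)

def linear_policy(s, theta):
    # Deterministic policy a = theta^T phi(s) with phi(s) = s
    return theta.T @ s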
10.3.2.2 Experiment with 2 Joints
First, we consider using only 2 joints among the 9 joints; i.e., only the right shoulder pitch and right elbow pitch are allowed to be controlled, while the other joints remain still at each time step (no control signal is sent to these joints). Therefore, the dimensionalities of the state vector s and the action vector a are 4 and 2, respectively.
We suppose that the budget for data collection is limited to N = 50 trajectory samples. For the M-PGPE methods, all trajectory samples are collected at first using uniformly random initial states and a uniformly random policy. More specifically, the initial state is chosen from the uniform distribution over S. At each time step, the action a_i of the i-th joint is drawn from the uniform distribution on [s_i − 5, s_i + 5], where s_i denotes the state of the i-th joint. In total, 5000 transition samples are collected for model estimation. Then, from the learned transition model, 1000 artificial trajectory samples are generated for gradient estimation and another 1000 artificial trajectory samples are generated for baseline estimation in each iteration. The sampling schedule of the IW-PGPE method is chosen to collect k = 5 trajectory samples 50/k times, which performs well, as shown in Figure 10.8. The average and standard error of the return obtained by each method over 10 runs are plotted in Figure 10.9, showing that M-PGPE(LSCDE) tends to outperform both M-PGPE(GP) and IW-PGPE.
Figure 10.10 illustrates an example of the reaching motion with 2 joints obtained by M-PGPE(LSCDE) at the 60th iteration. This shows that the learned policy successfully leads the right hand to the target object within only 13 steps in this noisy control system.
10.3.2.3 Experiment with 9 Joints
Finally, the performance of M-PGPE(LSCDE) and IW-PGPE is evaluated on the reaching task with all 9 joints.
The experimental setup is essentially the same as in the 2-joint case, but a budget of N = 1000 trajectory samples is given for this complex and high-dimensional task. The position of the target object is moved to the far left, which is not reachable by using only 2 joints. Thus, the robot is required to move other joints to reach the object with the right hand. Five thousand randomly chosen transition samples are used as Gaussian centers for M-PGPE(LSCDE). The sampling schedule for IW-PGPE is set at gathering 1000 trajectory samples at once, which is the best sampling schedule according to Figure 10.11. The averages and standard errors of returns obtained by M-PGPE(LSCDE) and IW-PGPE over 30 runs are plotted in Figure 10.12, showing that M-PGPE(LSCDE) tends to outperform IW-PGPE.
Figure 10.13 exhibits a typical reaching motion with 9 joints obtained by M-PGPE(LSCDE) at the 1000th iteration. This shows that the right hand is led to the distant object successfully within 14 steps.
FIGURE 10.8: Averages and standard errors of returns obtained by IW-PGPE over 10 runs for the 2-joint humanoid robot simulator for different sampling schedules (e.g., 5×10 means gathering k = 5 trajectory samples 10 times).
FIGURE 10.9: Averages and standard errors of obtained returns over 10 runs for the 2-joint humanoid robot simulator. All methods use 50 trajectory samples for policy learning. In M-PGPE(LSCDE) and M-PGPE(GP), all 50 trajectory samples are gathered in the beginning and the environment model is learned; then 2000 artificial trajectory samples are generated in each update iteration. In IW-PGPE, a batch of 5 trajectory samples is gathered for 10 iterations, which was shown to be the best sampling scheduling (see Figure 10.8). Note that the policy update is performed 100 times after observing each batch of trajectory samples, which we confirmed to perform well. The bottom horizontal axis is for the M-PGPE methods, while the top horizontal axis is for the IW-PGPE method.
FIGURE 10.10: Example of arm reaching with 2 joints using a policy obtained by M-PGPE(LSCDE) at the 60th iteration (from left to right and top to bottom).
FIGURE 10.11: Averages and standard errors of returns obtained by IW-PGPE over 30 runs for the 9-joint humanoid robot simulator for different sampling schedules (e.g., 100×10 means gathering k = 100 trajectory samples 10 times).
FIGURE 10.12: Averages and standard errors of obtained returns over 30 runs for the humanoid robot simulator with 9 joints. Both methods use 1000 trajectory samples for policy learning. In M-PGPE(LSCDE), all 1000 trajectory samples are gathered in the beginning and the environment model is learned; then 2000 artificial trajectory samples are generated in each update iteration. In IW-PGPE, a batch of 1000 trajectory samples is gathered at once, which was shown to be the best scheduling (see Figure 10.11). Note that the policy update is performed 100 times after observing each batch of trajectory samples. The bottom horizontal axis is for the M-PGPE method, while the top horizontal axis is for the IW-PGPE method.
FIGURE 10.13: Example of arm reaching with 9 joints using a policy obtained by M-PGPE(LSCDE) at the 1000th iteration (from left to right and top to bottom).
10.4 Remarks
Model-based reinforcement learning is a promising approach, provided that the transition model can be estimated accurately. However, estimating a high-dimensional conditional density is challenging. In this chapter, a non-parametric conditional density estimator called least-squares conditional density estimation (LSCDE) was introduced, and model-based PGPE with LSCDE was shown to work excellently in experiments.
Under a fixed sampling budget, the model-free approach requires us to design the sampling schedule appropriately in advance. However, this is practically very hard unless strong prior knowledge is available. On the other hand, model-based methods do not suffer from this problem, which is an excellent practical advantage over the model-free approach.
In robotics, the model-free approach seems to be preferred because accurately learning the transition dynamics of complex robots is challenging (Deisenroth et al., 2013). Furthermore, model-free methods can utilize prior knowledge in the form of policy demonstrations (Kober & Peters, 2011). On the other hand, the model-based approach is advantageous in that no interaction with the real robot is required once the transition model has been learned, and the learned transition model can be utilized for further simulation.
Actually, the choice between model-free and model-based methods is not only an ongoing research topic in machine learning, but also a much-debated issue in neuroscience. Therefore, further discussion is necessary to more deeply understand the pros and cons of the model-based and model-free approaches. Combining or switching between the model-free and model-based approaches would also be an interesting direction for further investigation.
Chapter 11
Dimensionality Reduction for Transition Model Estimation

Least-squares conditional density estimation (LSCDE), introduced in Chapter 10, is a practical transition model estimator. However, transition model estimation is still challenging when the dimensionality of the state and action spaces is high. In this chapter, a dimensionality reduction method is introduced for LSCDE which finds a low-dimensional expression of the original state and action vector that is relevant to predicting the next state. After mathematically formulating the problem of dimensionality reduction in Section 11.1, a detailed description of the dimensionality reduction algorithm based on the squared-loss conditional entropy is provided in Section 11.2. Then numerical examples are given in Section 11.3, and this chapter is concluded in Section 11.4.
11.1 Sufficient Dimensionality Reduction
Sufficient dimensionality reduction (Li, 1991; Cook & Ni, 2005) is a framework of dimensionality reduction in a supervised learning setting of analyzing an input-output relation; in our case, the input is the state-action pair (s, a) and the output is the next state s'. Sufficient dimensionality reduction is aimed at finding a low-dimensional expression z of the input (s, a) that contains "sufficient" information about the output s'.
Let z be a linear projection of the input (s, a). More specifically, using a matrix W such that W W^T = I, where I denotes the identity matrix, z is given by
    z = W (s; a),
i.e., W applied to the vector obtained by stacking s and a. The goal of sufficient dimensionality reduction is, from independent transition samples {(s_m, a_m, s'_m)}_{m=1}^M, to find W such that s' and (s, a) are conditionally independent given z. This conditional independence means that z contains all information about s' and is equivalently expressed as
    p(s'|s,a) = p(s'|z).                                            (11.1)
11.2 Squared-Loss Conditional Entropy
In this section, the dimensionality reduction method based on the squared-loss conditional entropy (SCE) is introduced.
11.2.1 Conditional Independence
SCE is defined and expressed as
    SCE(s'|z) = −(1/2) ∫∫ p(s'|z) p(s', z) dz ds'
              = −(1/2) ∫∫ ( p(s'|z) − 1 )² p(z) dz ds' − 1 + (1/2) ∫ ds'.
It was shown in Tangkaratt et al. (2015) that
    SCE(s'|z) ≥ SCE(s'|s,a),
and the equality holds if and only if Eq. (11.1) holds. Thus, sufficient dimensionality reduction can be performed by minimizing SCE(s'|z) with respect to W:
    W* = argmin_{W ∈ G} SCE(s'|z).
Here, G denotes the Grassmann manifold, which is the set of matrices W such that W W^T = I without redundancy in terms of the span.
Since SCE contains the unknown densities p(s'|z) and p(s', z), it cannot be directly computed. Here, let us employ the LSCDE method introduced in Chapter 10 to obtain an estimator p̂(s'|z) of the conditional density p(s'|z). Then, by replacing the expectation over p(s', z) with the sample average, SCE can be approximated as
    SCE_hat(s'|z) = −(1/(2M)) Σ_{m=1}^M p̂(s'_m | z_m) = −(1/2) α̃^T v̂,
where
    z_m = W (s_m; a_m)   and   v̂ = (1/M) Σ_{m=1}^M φ(z_m, s'_m).
φ(z, s') is the basis function vector used in LSCDE, given by
    φ_b(z, s') = exp( −( ||z − z_b||² + ||s' − s'_b||² ) / (2κ²) ),
where κ > 0 denotes the Gaussian kernel width. α̃ is the LSCDE solution given by
    α̃ = (Û + λI)^{-1} v̂,
where λ ≥ 0 is the regularization parameter and
    Û_{b,b'} = ( (√π κ)^{dim(s')} / M ) exp( −||s'_b − s'_{b'}||² / (4κ²) ) Σ_{m=1}^M exp( −( ||z_m − z_b||² + ||z_m − z_{b'}||² ) / (2κ²) ).
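Putting these pieces together, the SCE approximator can be evaluated for a given projection matrix W with a few matrix operations. The sketch below reuses the LSCDE computations of Chapter 10 with z = W(s; a) in place of (s, a); the fixed values of κ and λ are placeholders for values chosen by cross-validation.

import numpy as np

def sce_estimate(W, S, A, S_next, kappa=1.0, lam=0.01):
    # SCE_hat(s'|z) = -(1/2) alpha~^T v_hat with z_m = W (s_m; a_m)
    Z = np.hstack([S, A]) @ W.T                        # M x r projected inputs
    M, d_out = S_next.shape
    Dz = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    Dy = ((S_next[:, None, :] - S_next[None, :, :]) ** 2).sum(axis=2)
    Kz = np.exp(-Dz / (2 * kappa ** 2))
    v = (Kz * np.exp(-Dy / (2 * kappa ** 2))).mean(axis=0)
    U = ((np.sqrt(np.pi) * kappa) ** d_out
         * np.exp(-Dy / (4 * kappa ** 2)) * (Kz.T @ Kz) / M)
    alpha = np.linalg.solve(U + lam * np.eye(M), v)
    return -0.5 * float(np.dot(alpha, v))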
11.2.2 Dimensionality Reduction with SCE
With the above SCE estimator, a practical formulation for sufficient dimensionality reduction is given by
    Ŵ = argmax_{W ∈ G} S(W),   where   S(W) = α̃^T v̂.
The gradient of S(W) with respect to W_{ℓ,ℓ'} is given by
    ∂S/∂W_{ℓ,ℓ'} = −α̃^T (∂Û/∂W_{ℓ,ℓ'}) α̃ + 2 (∂v̂/∂W_{ℓ,ℓ'})^T α̃.
In the Euclidean space, the above gradient gives the steepest direction (see also Section 7.3.1). However, on the Grassmann manifold, the natural gradient (Amari, 1998) gives the steepest direction. The natural gradient at W is the projection of the ordinary gradient onto the tangent space of the Grassmann manifold. If the tangent space is equipped with the canonical metric ⟨W, W'⟩ = (1/2) tr(W^T W'), the natural gradient at W is given as follows (Edelman et al., 1998):
    (∂S/∂W) W_⊥^T W_⊥,
where W_⊥ is the matrix such that (W^T, W_⊥^T) is an orthogonal matrix.
The geodesic from W in the direction of the natural gradient over the Grassmann manifold can be expressed using t ∈ R as
    W_t = ( I  O ) exp( −t [ O, (∂S/∂W) W_⊥^T ; −W_⊥ (∂S/∂W)^T, O ] ) ( W ; W_⊥ ),
where "exp" for a matrix denotes the matrix exponential and O denotes the zero matrix. Then line search along the geodesic in the natural gradient direction is performed by finding the maximizer from {W_t | t ≥ 0} (Edelman et al., 1998).
Once W is updated by the natural gradient method, SCE is re-estimated for the new W and natural gradient ascent is performed again. This entire procedure is repeated until W converges, and the final solution is given by
    p̂(s'|z) = α̂^T φ(z, s') / ∫ α̂^T φ(z, s'') ds'',
where α̂_b = max(0, α̃_b), and the denominator can be analytically computed as
    ∫ α̂^T φ(z, s'') ds'' = (√(2π) κ)^{dim(s')} Σ_{b=1}^B α̂_b exp( −||z − z_b||² / (2κ²) ).
When SCE is re-estimated, performing cross-validation for LSCDE in every step is computationally expensive. In practice, cross-validation may be performed only once every several gradient updates. Furthermore, to find a better local optimal solution, this gradient ascent procedure may be executed multiple times with randomly chosen initial solutions, and the one achieving the largest objective value is chosen.
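As an illustration of the update, the geodesic step can be written with the matrix exponential as in the following sketch. This is only an assumption-laden sketch: W_⊥ is obtained here from the null space of W, the sign convention follows the expression above, and the step size t would be chosen by the line search over {W_t | t ≥ 0} maximizing S(W_t).

import numpy as np
from scipy.linalg import expm, null_space

def geodesic_step(W, grad_S, t):
    # One step along the Grassmann geodesic in the natural-gradient direction.
    # W: r x d with W W^T = I; grad_S: r x d Euclidean gradient dS/dW.
    r, d = W.shape
    W_perp = null_space(W).T                    # (d - r) x d, rows orthogonal to W
    G = grad_S @ W_perp.T                       # tangent-space coordinates of the gradient
    B = np.zeros((d, d))
    B[:r, r:] = G
    B[r:, :r] = -G.T
    stacked = np.vstack([W, W_perp])            # orthogonal d x d matrix (W^T, W_perp^T)^T
    return (expm(-t * B) @ stacked)[:r]         # W_t, the updated r x d projection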
11.2.3 Relation to Squared-Loss Mutual Information
The above dimensionality reduction method minimizes SCE:
    SCE(s'|z) = −(1/2) ∫∫ p(z, s')² / p(z) dz ds'.
On the other hand, the dimensionality reduction method proposed in Suzuki and Sugiyama (2013) maximizes the squared-loss mutual information (SMI):
    SMI(z, s') = (1/2) ∫∫ p(z, s')² / ( p(z) p(s') ) dz ds'.
Note that SMI can be approximated almost in the same way as SCE by the least-squares method (Suzuki & Sugiyama, 2013). The above equations show that the essential difference between SCE and SMI is whether p(s') is included in the denominator of the density ratio, and SCE is reduced to the negative SMI if p(s') is uniform. However, if p(s') is not uniform, the density-ratio function p(z, s') / ( p(z) p(s') ) included in SMI may fluctuate more than p(z, s') / p(z) included in SCE. Since a smoother function can generally be estimated more accurately from a small number of samples (Vapnik, 1998), SCE-based dimensionality reduction is expected to work better than SMI-based dimensionality reduction.
11.3 Numerical Examples
In this section, the experimental behavior of the SCE-based dimensionality reduction method is illustrated.
11.3.1 Artificial and Benchmark Datasets
The following dimensionality reduction schemes are compared:
• None: No dimensionality reduction is performed.
• SCE (Section 11.2): Dimensionality reduction is performed by minimizing the least-squares SCE approximator using natural gradients over the Grassmann manifold (Tangkaratt et al., 2015).
• SMI (Section 11.2.3): Dimensionality reduction is performed by maximizing the least-squares SMI approximator using natural gradients over the Grassmann manifold (Suzuki & Sugiyama, 2013).
• True: The "true" subspace is used (only for the artificial datasets).
After dimensionality reduction, the following conditional density estimators are run:
• LSCDE (Section 10.1.3): Least-squares conditional density estimation (Sugiyama et al., 2010).
• ǫKDE (Section 10.1.2): ǫ-neighbor kernel density estimation, where ǫ is chosen by least-squares cross-validation.
First, the behavior of SCE-LSCDE is compared with the plain LSCDE with no dimensionality reduction. The datasets have 5-dimensional input x = (x^(1), …, x^(5))^T and 1-dimensional output y. Among the 5 dimensions of x, only the first dimension x^(1) is relevant to predicting the output y, and the other 4 dimensions x^(2), …, x^(5) are just standard Gaussian noise. Figure 11.1 plots the first dimension of the input and the output of the samples in the datasets together with the conditional density estimation results. The graphs show that the plain LSCDE does not perform well due to the irrelevant noise dimensions in the input, while SCE-LSCDE gives much better estimates.
Next, artificial datasets with 5-dimensional input x = (x^(1), …, x^(5))^T and 1-dimensional output y are used. Each element of x follows the standard Gaussian distribution and y is given by
(a) y = x^(1) + (x^(1))² + (x^(1))³ + ε,
(b) y = (x^(1))² + (x^(2))² + ε,
where ε is the Gaussian noise with mean zero and standard deviation 1/4.

FIGURE 11.1: Examples of conditional density estimation by plain LSCDE and SCE-LSCDE ((a) bone mineral density; (b) Old Faithful geyser).
The top row of Figure 11.2 shows the dimensionality reduction error between the true W* and its estimate Ŵ for different sample sizes n, measured by
    Error_DR = || Ŵ^T Ŵ − W*^T W* ||_Frobenius,
where ||·||_Frobenius denotes the Frobenius norm. The SMI-based and SCE-based dimensionality reduction methods both perform similarly for dataset (a), while the SCE-based method clearly outperforms the SMI-based method for dataset (b). The histograms of {y_i}_{i=1}^400 plotted in the 2nd row of Figure 11.2 show that the profile of the histogram (which is a sample approximation of p(y)) in dataset (b) is much sharper than that in dataset (a). As explained in Section 11.2.3, the density-ratio function used in SMI contains p(y) in the denominator. Therefore, it would be highly non-smooth and thus hard to approximate. On the other hand, the density-ratio function used in SCE does not contain p(y). Therefore, it would be smoother than the one used in SMI and thus easier to approximate.
The 3rd and 4th rows of Figure 11.2 plot the conditional density estimation error between the true p(y|x) and its estimate p̂(y|x), evaluated by the squared loss (without a constant):
    Error_CDE = (1/(2n')) Σ_{i=1}^{n'} ∫ p̂(y | x̃_i)² dy − (1/n') Σ_{i=1}^{n'} p̂(ỹ_i | x̃_i),
where {(x̃_i, ỹ_i)}_{i=1}^{n'} is a set of test samples that have not been used for conditional density estimation. We set n' = 1000. The graphs show that LSCDE overall outperforms ǫKDE for both datasets. For dataset (a), SMI-LSCDE and SCE-LSCDE perform equally well, and are much better than plain LSCDE with no dimensionality reduction (LSCDE) and comparable to LSCDE with the true subspace (LSCDE*). For dataset (b), SCE-LSCDE outperforms SMI-LSCDE and LSCDE and is comparable to LSCDE*.

FIGURE 11.2: Top row: the mean and standard error of the dimensionality reduction error over 20 runs on the artificial datasets. 2nd row: histograms of the output {y_i}_{i=1}^400. 3rd and 4th rows: the mean and standard error of the conditional density estimation error over 20 runs.
Next, the UCI benchmark datasets (Bache & Lichman, 2013) are used for performance evaluation. n samples are selected randomly from each dataset for conditional density estimation, and the rest of the samples are used to measure the conditional density estimation error. Since the dimensionality of z is unknown for the benchmark datasets, it was determined by cross-validation. The results are summarized in Table 11.1, showing that SCE-LSCDE works well overall. Table 11.2 describes the dimensionalities selected by cross-validation, showing that both the SCE-based and SMI-based methods reduce the dimensionality significantly.
11.3.2 Humanoid Robot
Finally, SCE-LSCDE is applied to transition estimation of a humanoid robot. We use a simulator of the upper-body part of the humanoid robot CB-i (Cheng et al., 2007) (see Figure 9.5).
The robot has 9 controllable joints: the shoulder pitch, shoulder roll, and elbow pitch of the right arm; the shoulder pitch, shoulder roll, and elbow pitch of the left arm; and the waist yaw, torso roll, and torso pitch joints. The posture of the robot is described by an 18-dimensional real-valued state vector, which corresponds to the angle and angular velocity of each joint in radians and radians per second, respectively. The robot is controlled by sending an action command a to the system. The action command a is a 9-dimensional real-valued vector, which corresponds to the target angle of each joint. When the robot is currently at state s and receives action a, the physical control system of the simulator calculates the amount of torque to be applied to each joint (see Section 9.3.3 for details).
In the experiment, the action vector a is chosen randomly and a noisy control system is simulated by adding a bimodal Gaussian noise vector. More specifically, the action a_i of the i-th joint is first drawn from the uniform distribution on [s_i − 0.087, s_i + 0.087], where s_i denotes the state of the i-th joint. The drawn action is then contaminated by Gaussian noise with mean 0 and standard deviation 0.034 with probability 0.6, and Gaussian noise with mean −0.087 and standard deviation 0.034 with probability 0.4. By repeatedly controlling the robot M times, transition samples {(s_m, a_m, s'_m)}_{m=1}^M are obtained. Our goal is to learn the system dynamics as a state transition probability p(s'|s,a) from these samples.
The following three scenarios are considered: using only 2 joints (right shoulder pitch and right elbow pitch), only 4 joints (in addition, right shoulder roll and waist yaw), and all 9 joints. These setups correspond to 6-dimensional input and 4-dimensional output in the 2-joint case, 12-dimensional input and 8-dimensional output in the 4-joint case, and 27-dimensional input and 18-dimensional output in the 9-joint case.
Five hundred, 1000, and 1500 transition samples are generated for the 2-joint, 4-joint, and 9-joint cases, respectively. Then randomly chosen n = 100, 200, and 500 samples are used for conditional density estimation, and the rest are used for evaluating the test error. The results are summarized in Table 11.1, showing that SCE-LSCDE performs well for all three cases. Table 11.2 describes the dimensionalities selected by cross-validation. This shows that the dimensionalities are much reduced, implying that the transition of the humanoid robot is highly redundant.

TABLE 11.1: Mean and standard error of the conditional density estimation error over 10 runs for the benchmark and robot transition datasets, comparing SCE-based and SMI-based dimensionality reduction and no dimensionality reduction, each followed by LSCDE or ǫKDE (smaller is better; the best method and comparable methods according to the t-test at the 5% significance level are indicated).

TABLE 11.2: Mean and standard error of the chosen subspace dimensionality over 10 runs for benchmark and robot transition datasets.

                                 SCE-based                   SMI-based
  Dataset        (dx, dy)     LSCDE         ǫKDE          LSCDE         ǫKDE
  Housing        (13, 1)      3.9 (0.74)    2.0 (0.79)    2.0 (0.39)    1.3 (0.15)
  Auto MPG       (7, 1)       3.2 (0.66)    1.3 (0.15)    2.1 (0.67)    1.1 (0.10)
  Servo          (4, 1)       1.9 (0.35)    2.4 (0.40)    2.2 (0.33)    1.6 (0.31)
  Yacht          (6, 1)       1.0 (0.00)    1.0 (0.00)    1.0 (0.00)    1.0 (0.00)
  Physicochem    (9, 1)       6.5 (0.58)    1.9 (0.28)    6.6 (0.58)    2.6 (0.86)
  White Wine     (11, 1)      1.2 (0.13)    1.0 (0.00)    1.4 (0.31)    1.0 (0.00)
  Red Wine       (11, 1)      1.0 (0.00)    1.3 (0.15)    1.2 (0.20)    1.0 (0.00)
  Forest Fires   (12, 1)      1.2 (0.20)    4.9 (0.99)    1.4 (0.22)    6.8 (1.23)
  Concrete       (8, 1)       1.0 (0.00)    1.0 (0.00)    1.2 (0.13)    1.0 (0.00)
  Energy         (8, 2)       5.9 (0.10)    3.9 (0.80)    2.1 (0.10)    2.0 (0.30)
  Stock          (7, 2)       3.2 (0.83)    2.1 (0.59)    2.1 (0.60)    2.7 (0.67)
  2 Joints       (6, 4)       2.9 (0.31)    2.7 (0.21)    2.5 (0.31)    2.0 (0.00)
  4 Joints       (12, 8)      5.2 (0.68)    6.2 (0.63)    5.4 (0.67)    4.6 (0.43)
  9 Joints       (27, 18)     13.8 (1.28)   15.3 (0.94)   11.4 (0.75)   13.2 (1.02)
11.4 Remarks
Coping with the high dimensionality of the state and action spaces is one of the most important challenges in model-based reinforcement learning. In this chapter, a dimensionality reduction method for conditional density estimation was introduced. The key idea was to use the squared-loss conditional entropy (SCE) for dimensionality reduction, which can be estimated by least-squares conditional density estimation. This allowed us to perform dimensionality reduction and conditional density estimation simultaneously in an integrated manner. In contrast, dimensionality reduction based on squared-loss mutual information (SMI) yields a two-step procedure of first reducing the dimensionality and then estimating the conditional density. SCE-based dimensionality reduction was shown to outperform the SMI-based method, particularly when the output follows a skewed distribution.
References

Abbeel, P., & Ng, A. Y. (2004). Apprenticeship learning via inverse reinforcement learning. Proceedings of International Conference on Machine Learning (pp. 1–8).
Abe, N., Melville, P., Pendus, C., Reddy, C. K., Jensen, D. L., Thomas, V. P., Bennett, J. J., Anderson, G. F., Cooley, B. R., Kowalczyk, M., Domick, M., & Gardinier, T. (2010). Optimizing debt collections using constrained reinforcement learning. Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 75–84).
Amari, S. (1967). Theory of adaptive pattern classifiers. IEEE Transactions on Electronic Computers, EC-16, 299–307.
Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10, 251–276.
Amari, S., & Nagaoka, H. (2000). Methods of information geometry. Providence, RI, USA: Oxford University Press.
Bache, K., & Lichman, M. (2013). UCI machine learning repository. http://archive.ics.uci.edu/ml/
Baxter, J., Bartlett, P., & Weaver, L. (2001). Experiments with infinite-horizon, policy-gradient estimation. Journal of Artificial Intelligence Research, 15, 351–381.
Bishop, C. M. (2006). Pattern recognition and machine learning. New York, NY, USA: Springer.
Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge, UK: Cambridge University Press.
Bradtke, S. J., & Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22, 33–57.
Chapelle, O., Schölkopf, B., & Zien, A. (Eds.). (2006). Semi-supervised learning. Cambridge, MA, USA: MIT Press.
Cheng, G., Hyon, S., Morimoto, J., Ude, A., Joshua, G. H., Colvin, G., Scroggin, W., & Stephen, C. J. (2007). CB: A humanoid research platform for exploring neuroscience. Advanced Robotics, 21, 1097–1114.
Chung, F. R. K. (1997). Spectral graph theory. Providence, RI, USA: American Mathematical Society.
Coifman, R., & Maggioni, M. (2006). Diffusion wavelets. Applied and Computational Harmonic Analysis, 21, 53–94.
Cook, R. D., & Ni, L. (2005). Sufficient dimension reduction via inverse regression. Journal of the American Statistical Association, 100, 410–428.
Dayan, P., & Hinton, G. E. (1997). Using expectation-maximization for reinforcement learning. Neural Computation, 9, 271–278.
Deisenroth, M. P., Neumann, G., & Peters, J. (2013). A survey on policy search for robotics. Foundations and Trends in Robotics, 2, 1–142.
Deisenroth, M. P., & Rasmussen, C. E. (2011). PILCO: A model-based and data-efficient approach to policy search. Proceedings of International Conference on Machine Learning (pp. 465–473).
Demiriz, A., Bennett, K. P., & Shawe-Taylor, J. (2002). Linear programming boosting via column generation. Machine Learning, 46, 225–254.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.
Dijkstra, E. W. (1959). A note on two problems in connexion [sic] with graphs. Numerische Mathematik, 1, 269–271.
Edelman, A., Arias, T. A., & Smith, S. T. (1998). The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20, 303–353.
Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32, 407–499.
Engel, Y., Mannor, S., & Meir, R. (2005). Reinforcement learning with Gaussian processes. Proceedings of International Conference on Machine Learning (pp. 201–208).
Fishman, G. S. (1996). Monte Carlo: Concepts, algorithms, and applications. Berlin, Germany: Springer-Verlag.
Fredman, M. L., & Tarjan, R. E. (1987). Fibonacci heaps and their uses in improved network optimization algorithms. Journal of the ACM, 34, 569–615.
Goldberg, A. V., & Harrelson, C. (2005). Computing the shortest path: A* search meets graph theory. Proceedings of Annual ACM-SIAM Symposium on Discrete Algorithms (pp. 156–165).
Gooch, B., & Gooch, A. (2001). Non-photorealistic rendering. Natick, MA, USA: A. K. Peters Ltd.
Greensmith, E., Bartlett, P. L., & Baxter, J. (2004). Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5, 1471–1530.
Guo, Q., & Kunii, T. L. (2003). "Nijimi" rendering algorithm for creating quality black ink paintings. Proceedings of Computer Graphics International (pp. 152–159).
Henkel, R. E. (1976). Tests of significance. Beverly Hills, CA, USA: SAGE Publication.
Hertzmann, A. (1998). Painterly rendering with curved brush strokes of multiple sizes. Proceedings of Annual Conference on Computer Graphics and Interactive Techniques (pp. 453–460).
Hertzmann, A. (2003). A survey of stroke based rendering. IEEE Computer Graphics and Applications, 23, 70–81.
Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12, 55–67.
Huber, P. J. (1981). Robust statistics. New York, NY, USA: Wiley.
Kakade, S. (2002). A natural policy gradient. Advances in Neural Information Processing Systems 14 (pp. 1531–1538).
Kanamori, T., Hido, S., & Sugiyama, M. (2009). A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10, 1391–1445.
Kanamori, T., Suzuki, T., & Sugiyama, M. (2012). Statistical analysis of kernel-based least-squares density-ratio estimation. Machine Learning, 86, 335–367.
Kanamori, T., Suzuki, T., & Sugiyama, M. (2013). Computational complexity of kernel-based density-ratio estimation: A condition number analysis. Machine Learning, 90, 431–460.
Kober, J., & Peters, J. (2011). Policy search for motor primitives in robotics. Machine Learning, 84, 171–203.
Koenker, R. (2005). Quantile regression. Cambridge, MA, USA: Cambridge University Press.
Kohonen, T. (1995). Self-organizing maps. Berlin, Germany: Springer.
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22, 79–86.
Lagoudakis, M. G., & Parr, R. (2003). Least-squares policy iteration. Journal of Machine Learning Research, 4, 1107–1149.
Li, K. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86, 316–342.
Mahadevan, S. (2005). Proto-value functions: Developmental reinforcement learning. Proceedings of International Conference on Machine Learning (pp. 553–560).
Mangasarian, O. L., & Musicant, D. R. (2000). Robust linear and support vector regression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 950–955.
Morimura, T., Sugiyama, M., Kashima, H., Hachiya, H., & Tanaka, T. (2010a). Nonparametric return distribution approximation for reinforcement learning. Proceedings of International Conference on Machine Learning (pp. 799–806).
Morimura, T., Sugiyama, M., Kashima, H., Hachiya, H., & Tanaka, T. (2010b). Parametric return density estimation for reinforcement learning. Conference on Uncertainty in Artificial Intelligence (pp. 368–375).
Peters, J., & Schaal, S. (2006). Policy gradient methods for robotics. Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (pp. 2219–2225).
Peters, J., & Schaal, S. (2007). Reinforcement learning by reward-weighted regression for operational space control. Proceedings of International Conference on Machine Learning (pp. 745–750). Corvallis, Oregon, USA.
Precup, D., Sutton, R. S., & Singh, S. (2000). Eligibility traces for off-policy policy evaluation. Proceedings of International Conference on Machine Learning (pp. 759–766).
Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian processes for machine learning. Cambridge, MA, USA: MIT Press.
Rockafellar, R. T., & Uryasev, S. (2002). Conditional value-at-risk for general loss distributions. Journal of Banking & Finance, 26, 1443–1471.
Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression and outlier detection. New York, NY, USA: Wiley.
Schaal, S. (2009). The SL simulation and real-time control software package (Technical Report). Computer Science and Neuroscience, University of Southern California.
Sehnke, F., Osendorfer, C., Rückstiess, T., Graves, A., Peters, J., & Schmidhuber, J. (2010). Parameter-exploring policy gradients. Neural Networks, 23, 551–559.
Shimodaira, H. (2000). Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90, 227–244.
Siciliano, B., & Khatib, O. (Eds.). (2008). Springer handbook of robotics. Berlin, Germany: Springer-Verlag.
Sugimoto, N., Tangkaratt, V., Wensveen, T., Zhao, T., Sugiyama, M., & Morimoto, J. (2014). Efficient reuse of previous experiences in humanoid motor learning. Proceedings of IEEE-RAS International Conference on Humanoid Robots (pp. 554–559).
Sugiyama, M. (2006). Active learning in approximately linear regression based on conditional expectation of generalization error. Journal of Machine Learning Research, 7, 141–166.
Sugiyama, M., Hachiya, H., Towell, C., & Vijayakumar, S. (2008). Geodesic Gaussian kernels for value function approximation. Autonomous Robots, 25, 287–304.
Sugiyama, M., & Kawanabe, M. (2012). Machine learning in non-stationary environments: Introduction to covariate shift adaptation. Cambridge, MA, USA: MIT Press.
Sugiyama, M., Krauledat, M., & Müller, K.-R. (2007). Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8, 985–1005.
Sugiyama, M., Suzuki, T., & Kanamori, T. (2012). Density ratio matching under the Bregman divergence: A unified framework of density ratio estimation. Annals of the Institute of Statistical Mathematics, 64, 1009–1044.
Sugiyama, M., Takeuchi, I., Suzuki, T., Kanamori, T., Hachiya, H., & Okanohara, D. (2010). Least-squares conditional density estimation. IEICE Transactions on Information and Systems, E93-D, 583–594.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA, USA: MIT Press.
Suzuki, T., & Sugiyama, M. (2013). Sufficient dimension reduction via squared-loss mutual information estimation. Neural Computation, 25, 725–758.
Takeda, A. (2007). Support vector machine based on conditional value-at-risk minimization (Technical Report B-439). Department of Mathematical and Computing Sciences, Tokyo Institute of Technology.
Tangkaratt, V., Mori, S., Zhao, T., Morimoto, J., & Sugiyama, M. (2014). Model-based policy gradients with parameter-based exploration by least-squares conditional density estimation. Neural Networks, 57, 128–140.
Tangkaratt, V., Xie, N., & Sugiyama, M. (2015). Conditional density estimation with dimensionality reduction via squared-loss conditional entropy minimization. Neural Computation, 27, 228–254.
Tesauro, G. (1994). TD-gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6, 215–219.
Tibshirani, R. (1996). Regression shrinkage and subset selection with the lasso. Journal of the Royal Statistical Society, Series B, 58, 267–288.
Tomioka, R., Suzuki, T., & Sugiyama, M. (2011). Super-linear convergence of dual augmented Lagrangian algorithm for sparsity regularized estimation. Journal of Machine Learning Research, 12, 1537–1586.
Vapnik, V. N. (1998). Statistical learning theory. New York, NY, USA: Wiley.
Vesanto, J., Himberg, J., Alhoniemi, E., & Parhankangas, J. (2000). SOM toolbox for Matlab 5 (Technical Report A57). Helsinki University of Technology.
Wahba, G. (1990). Spline models for observational data. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics.
Wang, X., & Dietterich, T. G. (2003). Model-based policy gradient reinforcement learning. Proceedings of International Conference on Machine Learning (pp. 776–783).
Wawrzynski, P. (2009). Real-time reinforcement learning by sequential actor-critics and experience replay. Neural Networks, 22, 1484–1497.
Weaver, L., & Baxter, J. (1999). Reinforcement learning from state and temporal differences (Technical Report). Department of Computer Science, Australian National University.
Weaver, L., & Tao, N. (2001). The optimal reward baseline for gradient-based reinforcement learning. Proceedings of Conference on Uncertainty in Artificial Intelligence (pp. 538–545).
Williams, J. D., & Young, S. J. (2007). Partially observable Markov decision processes for spoken dialog systems. Computer Speech and Language, 21, 393–422.
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229–256.
Xie, N., Hachiya, H., & Sugiyama, M. (2013). Artist agent: A reinforcement learning approach to automatic stroke generation in oriental ink painting. IEICE Transactions on Information and Systems, E95-D, 1134–1144.
Xie, N., Laga, H., Saito, S., & Nakajima, M. (2011). Contour-driven Sumi-e rendering of real photos. Computers & Graphics, 35, 122–134.
Zhao, T., Hachiya, H., Niu, G., & Sugiyama, M. (2012). Analysis and improvement of policy gradient estimation. Neural Networks, 26, 118–129.
Zhao, T., Hachiya, H., Tangkaratt, V., Morimoto, J., & Sugiyama, M. (2013). Efficient sample reuse in policy gradients with parameter-based exploration. Neural Computation, 25, 1512–1547.